Databricks <> Graphistry Tutorial: Notebooks & Dashboards on IoT data
This tutorial visualizes a set of sensors by clustering them based on latitude/longitude and overlaying summary statistics.
We show how to load the interactive plots in both Databricks notebook and dashboard modes. The general flow should work in other PySpark environments as well.
Steps:
Install Graphistry
Prepare IoT data
Plot in a notebook
Plot in a dashboard
Plot as a shareable URL
Install & authenticate with Graphistry server
[ ]:
# Uncomment and run the first time, or
# have a Databricks admin install the graphistry Python library:
# https://docs.databricks.com/en/libraries/package-repositories.html#pypi-package
#%pip install graphistry
[ ]:
# Required to run after pip install to pick up new python package:
dbutils.library.restartPython()
[ ]:
import graphistry # if not yet available, install pygraphistry and/or restart Python kernel using the cells above
graphistry.__version__
Use Databricks secrets to retrieve Graphistry credentials and pass them to register
[ ]:
# As a best practice, use Databricks secrets to store the Graphistry personal key (access token)
# Create Databricks secrets: https://docs.databricks.com/en/security/secrets/index.html
# Create a Graphistry personal key: https://hub.graphistry.com/account/tokens
graphistry.register(
    api=3,
    personal_key_id=dbutils.secrets.get(scope="my-secret-scope", key="graphistry-personal_key_id"),
    personal_key_secret=dbutils.secrets.get(scope="my-secret-scope", key="graphistry-personal_key_secret"),
    protocol='https',
    server='hub.graphistry.com')
# Alternatively, use username and password:
# graphistry.register(api=3, username='...', password='...', protocol='https', server='hub.graphistry.com')
# For more options, see https://github.com/graphistry/pygraphistry#configure
Prepare IoT data
The sample data is provided by Databricks.
We create tables for different plots:
Raw table of device sensor reads
Summarized table, one row per device_id:
rounded latitude/longitude
min/max/avg of battery_level, c02_level, humidity, timestamp
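As a rough illustration of what the summarized table holds, the plain-Python sketch below rounds coordinates into location keys and computes min/max/avg per device. It assumes Spark's `round` uses HALF_UP rounding and mirrors that with `decimal`. The helper names `round_half_up`, `location_key`, and `summarize`, along with the sample readings, are hypothetical and not part of the tutorial's code.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, digits):
    """Round half away from zero (assumed to match Spark SQL's round)."""
    quantum = Decimal(1).scaleb(-digits)  # digits=0 -> 1, digits=-1 -> 1E+1
    return int(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))


def location_key(lat, lon, digits):
    """Join rounded latitude and longitude with '_', like concat_ws('_', ...)."""
    return f"{round_half_up(lat, digits)}_{round_half_up(lon, digits)}"


def summarize(readings, metric):
    """Group readings by device_id and compute min/max/avg of one metric."""
    by_device = defaultdict(list)
    for reading in readings:
        by_device[reading["device_id"]].append(reading[metric])
    return {
        device: {"min": min(vals), "max": max(vals), "avg": sum(vals) / len(vals)}
        for device, vals in by_device.items()
    }


# Made-up sample readings, for illustration only
readings = [
    {"device_id": 1, "latitude": 37.42, "longitude": -122.08, "humidity": 50},
    {"device_id": 1, "latitude": 37.42, "longitude": -122.08, "humidity": 70},
    {"device_id": 2, "latitude": 34.5, "longitude": 135.5, "humidity": 30},
]

key_rounded1 = location_key(37.42, -122.08, 0)   # nearest degree
key_rounded2 = location_key(37.42, -122.08, -1)  # nearest ten degrees
summary = summarize(readings, "humidity")
print(key_rounded1, key_rounded2, summary[1])
```

Rounding to the nearest ten degrees produces far fewer distinct keys than rounding to the nearest degree, which is why the two keys give a coarse and a fine level of geographic clustering.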
[ ]:
# Load the data from its source.
devices = (
    spark.read
    .format('json')
    .load('/databricks-datasets/iot/iot_devices.json')
)
# Show the results.
print('type: ', str(type(devices)))
display(devices.take(10))
[ ]:
from pyspark.sql import functions as F
# Note: importing round here shadows Python's built-in round in this notebook
from pyspark.sql.functions import concat_ws, col, round
devices_with_rounded_locations = (
    devices
    # Key at 1-degree resolution, e.g. latitude_longitude
    .withColumn(
        'location_rounded1',
        concat_ws(
            '_',
            round(col('latitude'), 0).cast('integer'),
            round(col('longitude'), 0).cast('integer')))
    # Key at 10-degree resolution
    .withColumn(
        'location_rounded2',
        concat_ws(
            '_',
            round(col('latitude'), -1).cast('integer'),
            round(col('longitude'), -1).cast('integer')))
)
cols = ['battery_level', 'c02_level', 'humidity', 'timestamp']
id_cols = ['cca2', 'cca3', 'cn', 'device_name', 'ip', 'location_rounded1', 'location_rounded2']
devices_summarized = (
    devices_with_rounded_locations.groupby('device_id').agg(
        *[F.min(c) for c in cols],
        *[F.max(c) for c in cols],
        *[F.avg(c) for c in cols],
        *[F.first(c) for c in id_cols]
    )
)
# [(from1, to1), ...]
renames = (
    [('device_id', 'device_id')]
    + [(f'first({c})', c) for c in id_cols]
    + [(f'min({c})', f'{c}_min') for c in cols]
    + [(f'max({c})', f'{c}_max') for c in cols]
    + [(f'avg({c})', f'{c}_avg') for c in cols]
)
# Select every aggregate column under its new, friendlier name
devices_summarized = devices_summarized.select(
    [F.col(old).alias(new) for old, new in renames]
)
display(devices_summarized.take(10))
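The rename step relies on aggregate columns being named after their expression, such as `min(humidity)`, as the tutorial's rename list assumes. The sketch below rebuilds the same (old, new) pairs in plain Python so the mapping can be inspected without a cluster. The `build_renames` helper and the shortened column lists are illustrative assumptions.

```python
def build_renames(id_cols, cols):
    """Return (old_name, new_name) pairs in the order used for the final select."""
    pairs = [('device_id', 'device_id')]
    pairs += [(f'first({c})', c) for c in id_cols]
    for fn in ('min', 'max', 'avg'):
        pairs += [(f'{fn}({c})', f'{c}_{fn}') for c in cols]
    return pairs


# Shortened column lists, for illustration only
cols = ['battery_level', 'humidity']
id_cols = ['cca3', 'device_name']

renames = build_renames(id_cols, cols)
for old, new in renames:
    print(f'{old:>22} -> {new}')
```

Keeping the pairs in one list means the output column order and the output column names are defined in a single place.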
Notebook plot
Simple: Graph connections between device_name and cca3 (country code)
Advanced: Graph multiple connections, like ip -> device_name and location_rounded1 -> ip
[ ]:
(
    graphistry
    .edges(devices.sample(fraction=0.1).toPandas(), 'device_name', 'cca3')
    .settings(url_params={'strongGravity': 'true'})
    .plot()
)
[ ]:
hg = graphistry.hypergraph(
    devices_with_rounded_locations.sample(fraction=0.1).toPandas(),
    ['ip', 'device_name', 'location_rounded1', 'location_rounded2', 'cca3'],
    direct=True,
    opts={
        # source column -> list of destination columns
        'EDGES': {
            'ip': ['device_name'],
            'location_rounded1': ['ip'],
            'location_rounded2': ['ip'],
            'cca3': ['location_rounded2']
        }
    })
g = hg['graph']
# strongGravity pulls nodes toward the center for a more compact layout
g = g.settings(url_params={'strongGravity': 'true'})
g.plot()
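To make the `EDGES` option more concrete, the sketch below approximates, in plain Python, how a direct hypergraph might turn each row into edges between the listed columns. It is an illustration only, not PyGraphistry's implementation: the `direct_edges` helper, the `column::value` node ID format, and the sample row are assumptions.

```python
def direct_edges(rows, edges_spec, sep='::'):
    """For each row, link the value in a source column to the values in its target columns."""
    edges = []
    for row in rows:
        for src_col, dst_cols in edges_spec.items():
            for dst_col in dst_cols:
                src_val, dst_val = row.get(src_col), row.get(dst_col)
                if src_val is None or dst_val is None:
                    continue  # skip rows with missing values
                edges.append((f'{src_col}{sep}{src_val}', f'{dst_col}{sep}{dst_val}'))
    return edges


# Same shape as the 'EDGES' option in the cell above
edges_spec = {
    'ip': ['device_name'],
    'location_rounded1': ['ip'],
    'location_rounded2': ['ip'],
    'cca3': ['location_rounded2'],
}

# Made-up row, for illustration only
rows = [{
    'ip': '10.0.0.1',
    'device_name': 'sensor-pad-1',
    'location_rounded1': '37_-122',
    'location_rounded2': '40_-120',
    'cca3': 'USA',
}]

edges = direct_edges(rows, edges_spec)
for src, dst in edges:
    print(src, '->', dst)
```

Each row contributes one edge per source/destination pair, so the resulting graph reads as a chain from country to coarse location to IP address to device.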
Dashboard plot
Make a graphistry object as usual...
... Then disable the splash screen and optionally set custom dimensions
The visualization will now load without needing to interact in the dashboard (view -> + New Dashboard)
[ ]:
(
    g
    .settings(url_params={'splashAfter': 'false'})  # extends existing setting
    .plot(override_html_style="""
        border: 1px #DDD dotted;
        width: 50em; height: 50em;
    """)
)
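Settings such as `strongGravity` and `splashAfter` are passed as URL parameters, and the style override sets the size and border of the embedded plot. The sketch below uses only the standard library to show how merged settings could become a query string inside an iframe tag. The `embed_html` helper, the `EXAMPLE_ID` dataset ID, and the exact URL path are hypothetical placeholders rather than PyGraphistry output, and the dict merge assumes settings accumulate as the "extends existing setting" comment suggests.

```python
from html import escape
from urllib.parse import urlencode


def embed_html(base_url, url_params, style):
    """Build an iframe tag whose src carries the settings as query parameters."""
    joiner = '&' if '?' in base_url else '?'
    src = f'{base_url}{joiner}{urlencode(url_params)}'
    flat_style = ' '.join(style.split())  # collapse newlines and indentation
    return f'<iframe src="{escape(src)}" style="{escape(flat_style)}"></iframe>'


existing = {'strongGravity': 'true'}            # set earlier in the notebook
merged = {**existing, 'splashAfter': 'false'}   # later settings extend earlier ones

snippet = embed_html(
    'https://hub.graphistry.com/graph/graph.html?dataset=EXAMPLE_ID',  # placeholder URL
    merged,
    """
    border: 1px #DDD dotted;
    width: 50em; height: 50em;
    """,
)
print(snippet)
```

With the splash screen disabled, the frame can render immediately, which matters in a dashboard where viewers may not click into each widget.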