Databricks <> Graphistry Tutorial: Notebooks & Dashboards on IoT data

This tutorial visualizes a set of sensors by clustering them based on latitude/longitude and overlaying summary statistics.

We show how to load the interactive plots in both Databricks notebook and dashboard modes. The general flow should also work in other PySpark environments.

Steps:

  • Install Graphistry

  • Prepare IoT data

  • Plot in a notebook

  • Plot in a dashboard

  • Plot as a shareable URL

Install & authenticate with the Graphistry server

[ ]:
# Uncomment and run the first time, or
#  have a Databricks admin install the graphistry Python library:
#  https://docs.databricks.com/en/libraries/package-repositories.html#pypi-package

#%pip install graphistry
[ ]:
# Required to run after pip install to pick up new python package:
dbutils.library.restartPython()
[ ]:
import graphistry  # if not yet available, install pygraphistry and/or restart Python kernel using the cells above
graphistry.__version__

Use Databricks secrets to retrieve Graphistry credentials and pass them to register

[ ]:

# As a best practice, use databricks secrets to store graphistry personal key (access token)
# create databricks secrets: https://docs.databricks.com/en/security/secrets/index.html
# create graphistry personal key: https://hub.graphistry.com/account/tokens

graphistry.register(api=3,
                    personal_key_id=dbutils.secrets.get(scope="my-secret-scope", key="graphistry-personal_key_id"),
                    personal_key_secret=dbutils.secrets.get(scope="my-secret-scope", key="graphistry-personal_key_secret"),
                    protocol='https',
                    server='hub.graphistry.com')

# Alternatively, use username and password:
# graphistry.register(api=3, username='...', password='...', protocol='https', server='hub.graphistry.com')

# For more options, see https://github.com/graphistry/pygraphistry#configure

Prepare IoT data

The sample data is provided by Databricks.

We create tables for different plots:

  • Raw table of device sensor reads

  • Summarized table:

    • rounded latitude/longitude

    • summarize min/max/avg for battery_level, c02_level, humidity, timestamp
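To make the rounded latitude/longitude keys concrete, here is a minimal pure-Python sketch of the idea, runnable without Spark. The helper names `round_half_up` and `location_key` and the sample coordinates are hypothetical and only for illustration. The sketch rounds half up, which is how Spark's `round` behaves, whereas Python's built-in `round` rounds half to even.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, scale):
    """Round half up to `scale` decimal places and return an int.

    Intended for scale <= 0 only: 0 rounds to whole units, -1 to tens.
    """
    quantum = Decimal(1).scaleb(-scale)  # scale=0 -> 1, scale=-1 -> 1E+1
    return int(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))


def location_key(latitude, longitude, scale):
    """Join the rounded coordinates with '_' to form a cluster key."""
    return f"{round_half_up(latitude, scale)}_{round_half_up(longitude, scale)}"


# Hypothetical coordinates, not taken from the dataset
print(location_key(37.77, -122.42, 0))   # 38_-122
print(location_key(37.77, -122.42, -1))  # 40_-120
```

Devices that share a key fall into the same cluster, so the coarser key (scale -1) groups devices across a wider area than the finer one (scale 0).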

[ ]:
# Load the data from its source.
devices = spark.read \
  .format('json') \
  .load('/databricks-datasets/iot/iot_devices.json')

# Show the results.
print('type: ', str(type(devices)))
display(devices.take(10))
[ ]:
from pyspark.sql import functions as F

# Build coarse location keys such as '38_-122' by rounding latitude/longitude.
# Using the F. prefix avoids shadowing Python's built-in round().
devices_with_rounded_locations = (
    devices
    # Rounded to the nearest whole degree
    .withColumn(
        'location_rounded1',
        F.concat_ws(
            '_',
            F.round(F.col('latitude'), 0).cast('integer'),
            F.round(F.col('longitude'), 0).cast('integer')))
    # Rounded to the nearest 10 degrees
    .withColumn(
        'location_rounded2',
        F.concat_ws(
            '_',
            F.round(F.col('latitude'), -1).cast('integer'),
            F.round(F.col('longitude'), -1).cast('integer')))
)

cols = ['battery_level', 'c02_level', 'humidity', 'timestamp']
id_cols = ['cca2', 'cca3', 'cn', 'device_name', 'ip', 'location_rounded1', 'location_rounded2']

# One row per device: min/max/avg of each metric, plus the first value of each identifying column
devices_summarized = (
    devices_with_rounded_locations.groupby('device_id').agg(
        *[F.min(c) for c in cols],
        *[F.max(c) for c in cols],
        *[F.avg(c) for c in cols],
        *[F.first(c) for c in id_cols]
    )
)

# Rename the aggregate columns: [(from1, to1), ...]
renames = (
    [('device_id', 'device_id')]
    + [(f'first({c})', c) for c in id_cols]
    + [(f'min({c})', f'{c}_min') for c in cols]
    + [(f'max({c})', f'{c}_max') for c in cols]
    + [(f'avg({c})', f'{c}_avg') for c in cols]
)
devices_summarized = devices_summarized.select(
    [F.col(old).alias(new) for old, new in renames]
)

display(devices_summarized.take(10))
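For readers who want to check the summarization logic without a Spark cluster, the following is a minimal pure-Python sketch of the same per-device min/max/avg grouping. The `readings` rows and the `summarize` helper are hypothetical and exist only for illustration; the output keys follow the `{column}_min` / `{column}_max` / `{column}_avg` naming used in the cell above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sensor readings, not taken from the dataset
readings = [
    {"device_id": 1, "battery_level": 8, "humidity": 51},
    {"device_id": 1, "battery_level": 6, "humidity": 55},
    {"device_id": 2, "battery_level": 3, "humidity": 70},
]
metrics = ["battery_level", "humidity"]


def summarize(readings, metrics):
    """Group readings by device_id and compute min/max/avg per metric."""
    grouped = defaultdict(list)
    for reading in readings:
        grouped[reading["device_id"]].append(reading)

    summary = {}
    for device_id, group in grouped.items():
        row = {"device_id": device_id}
        for metric in metrics:
            values = [reading[metric] for reading in group]
            row[f"{metric}_min"] = min(values)
            row[f"{metric}_max"] = max(values)
            row[f"{metric}_avg"] = mean(values)
        summary[device_id] = row
    return summary


summary = summarize(readings, metrics)
print(summary[1]["battery_level_min"], summary[1]["battery_level_max"])
```

The Spark version does the same work in a distributed fashion; this sketch is only meant to show what one summarized row contains.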

Notebook plot

  • Simple: Graph connections between device_name and cca3 (country code)

  • Advanced: Graph multiple connections, like ip -> device_name and location_rounded1 -> ip
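As a conceptual illustration of the "multiple connections" idea, the sketch below turns each row into one edge per source-to-destination column pair. It is not Graphistry's implementation: the `build_edges` helper, the sample rows, and the `column::value` node ID format are assumptions made for this example.

```python
# Hypothetical sample rows; the field names match the tutorial's columns
rows = [
    {"ip": "10.0.0.1", "device_name": "meter-a", "location_rounded1": "38_-122"},
    {"ip": "10.0.0.2", "device_name": "meter-b", "location_rounded1": "38_-122"},
]

# Each key is a source column; each value lists its destination columns
edge_spec = {"ip": ["device_name"], "location_rounded1": ["ip"]}


def build_edges(rows, edge_spec):
    """Emit one (source, destination) edge per row and column pair."""
    edges = []
    for row in rows:
        for src_col, dst_cols in edge_spec.items():
            for dst_col in dst_cols:
                src = f"{src_col}::{row[src_col]}"
                dst = f"{dst_col}::{row[dst_col]}"
                edges.append((src, dst))
    return edges


edges = build_edges(rows, edge_spec)
for src, dst in edges:
    print(src, "->", dst)
```

Because both rows share the same rounded location, the two ip nodes attach to a single location node, which is what produces the visual clustering.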

[ ]:
(
    graphistry
        .edges(devices.sample(fraction=0.1).toPandas(), 'device_name', 'cca3')
        .settings(url_params={'strongGravity': 'true'})
        .plot()
)
[ ]:
hg = graphistry.hypergraph(
    devices_with_rounded_locations.sample(fraction=0.1).toPandas(),
    ['ip', 'device_name', 'location_rounded1', 'location_rounded2', 'cca3'],
    direct=True,
    opts={
        'EDGES': {
            'ip': ['device_name'],
            'location_rounded1': ['ip'],
            'location_rounded2': ['ip'],
            'cca3': ['location_rounded2']
        }
    })
g = hg['graph']
g = g.settings(url_params={'strongGravity': 'true'})  # compacts the layout

g.plot()

Dashboard plot

  • Make a graphistry object as usual…

  • … Then disable the splash screen and optionally set custom dimensions

The visualization will now load in the dashboard (View -> + New Dashboard) without requiring any interaction.

[ ]:
(
    g
        .settings(url_params={'splashAfter': 'false'})  # extends existing setting
        .plot(override_html_style="""
            border: 1px #DDD dotted;
            width: 50em; height: 50em;
        """)
)

Plot as a Shareable URL

[ ]:
url = g.plot(render=False)
url
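Once you have a URL string, you may want to adjust its query parameters or embed it elsewhere. The sketch below uses only the Python standard library; the helpers `with_url_params` and `iframe_html` are hypothetical, and `example_url` is a placeholder rather than a real visualization link.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_url_params(url, extra):
    """Return `url` with the `extra` query parameters added or overridden."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update(extra)
    return urlunsplit(parts._replace(query=urlencode(query)))


def iframe_html(url, style):
    """Wrap a URL in an iframe tag, escaping attribute values."""
    return (
        f'<iframe src="{escape(url, quote=True)}" '
        f'style="{escape(style, quote=True)}"></iframe>'
    )


# Placeholder URL for illustration only
example_url = "https://hub.graphistry.com/graph/graph.html?dataset=EXAMPLE"

shared = with_url_params(example_url, {"splashAfter": "false"})
html = iframe_html(shared, "border: 1px #DDD dotted; width: 50em; height: 50em;")
print(shared)
print(html)
```

In a Databricks notebook, the resulting HTML string could be passed to `displayHTML`, though `plot()` already handles embedding for you in the cells above.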