Databricks <> Graphistry Tutorial: Notebooks & Dashboards on IoT data

This tutorial visualizes a set of sensors by clustering them by latitude/longitude and overlaying summary statistics.

We show how to load the interactive plots in both Databricks notebook and dashboard modes. The general flow should also work in other PySpark environments.

Steps:

  • Install Graphistry

  • Prepare IoT data

  • Plot in a notebook

  • Plot in a dashboard

  • Plot as a shareable URL

Install & connect#

[ ]:
# Run the first time you use this notebook
! pip install graphistry
#! pip install git+https://github.com/graphistry/pygraphistry.git@dev/databricks

# Can sometimes help:
#dbutils.library.restartPython()
Requirement already satisfied: graphistry in /local_disk0/.ephemeral_nfs/envs/pythonEnv-969db892-92cf-4b34-a5cf-61642fa76e77/lib/python3.9/site-packages (0.28.5)
Requirement already satisfied: numpy in /databricks/python3/lib/python3.9/site-packages (from graphistry) (1.20.3)
Requirement already satisfied: pandas>=0.17.0 in /databricks/python3/lib/python3.9/site-packages (from graphistry) (1.3.4)
Requirement already satisfied: packaging>=20.1 in /databricks/python3/lib/python3.9/site-packages (from graphistry) (21.0)
Requirement already satisfied: squarify in /local_disk0/.ephemeral_nfs/envs/pythonEnv-969db892-92cf-4b34-a5cf-61642fa76e77/lib/python3.9/site-packages (from graphistry) (0.4.3)
Requirement already satisfied: palettable>=3.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-969db892-92cf-4b34-a5cf-61642fa76e77/lib/python3.9/site-packages (from graphistry) (3.3.0)
Requirement already satisfied: typing-extensions in /databricks/python3/lib/python3.9/site-packages (from graphistry) (3.10.0.2)
Requirement already satisfied: pyarrow>=0.15.0 in /databricks/python3/lib/python3.9/site-packages (from graphistry) (7.0.0)
Requirement already satisfied: requests in /databricks/python3/lib/python3.9/site-packages (from graphistry) (2.26.0)
Requirement already satisfied: pyparsing>=2.0.2 in /databricks/python3/lib/python3.9/site-packages (from packaging>=20.1->graphistry) (3.0.4)
Requirement already satisfied: python-dateutil>=2.7.3 in /databricks/python3/lib/python3.9/site-packages (from pandas>=0.17.0->graphistry) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /databricks/python3/lib/python3.9/site-packages (from pandas>=0.17.0->graphistry) (2021.3)
Requirement already satisfied: six>=1.5 in /databricks/python3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas>=0.17.0->graphistry) (1.16.0)
Requirement already satisfied: idna<4,>=2.5 in /databricks/python3/lib/python3.9/site-packages (from requests->graphistry) (3.2)
Requirement already satisfied: charset-normalizer~=2.0.0 in /databricks/python3/lib/python3.9/site-packages (from requests->graphistry) (2.0.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /databricks/python3/lib/python3.9/site-packages (from requests->graphistry) (1.26.7)
Requirement already satisfied: certifi>=2017.4.17 in /databricks/python3/lib/python3.9/site-packages (from requests->graphistry) (2021.10.8)
WARNING: You are using pip version 21.2.4; however, version 22.3.1 is available.
You should consider upgrading via the '/local_disk0/.ephemeral_nfs/envs/pythonEnv-969db892-92cf-4b34-a5cf-61642fa76e77/bin/python -m pip install --upgrade pip' command.

[ ]:
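# Optional: uncomment to enable Arrow; we find this speeds up calls 10%+ on some datasets
# (Spark 3.0+ deprecates this key in favor of "spark.sql.execution.arrow.pyspark.enabled")
#spark.conf.set("spark.sql.execution.arrow.enabled", "true")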
[ ]:
import graphistry  # if not yet available, install and/or restart Python kernel using the above

# To specify Graphistry account & server, use:
# graphistry.register(api=3, username='...', password='...', protocol='https', server='hub.graphistry.com')
# For more options, see https://github.com/graphistry/pygraphistry#configure

graphistry.__version__
Out[12]: '0.28.5'
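
To keep credentials out of the notebook, the `register` arguments shown in the comment above can be assembled from environment variables. The sketch below is one possible approach, not a PyGraphistry feature: the helper `register_kwargs` and the variable names `MY_GRAPHISTRY_USER`, `MY_GRAPHISTRY_PASSWORD`, and `MY_GRAPHISTRY_SERVER` are hypothetical and chosen only for this example.

```python
import os

def register_kwargs(env=os.environ):
    """Build keyword arguments for graphistry.register() from environment variables.

    The variable names are hypothetical, chosen for this sketch.
    """
    required = ('MY_GRAPHISTRY_USER', 'MY_GRAPHISTRY_PASSWORD')
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f'missing environment variables: {missing}')
    return {
        'api': 3,
        'protocol': 'https',
        # Fall back to the hosted server named in the comment above
        'server': env.get('MY_GRAPHISTRY_SERVER', 'hub.graphistry.com'),
        'username': env['MY_GRAPHISTRY_USER'],
        'password': env['MY_GRAPHISTRY_PASSWORD'],
    }

# Usage (requires graphistry to be installed and the variables to be set):
# graphistry.register(**register_kwargs())
```

Failing early on a missing variable avoids sending an empty username or password to the server.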

Prepare IoT data#

Sample data provided by Databricks

We create tables for different plots:

  • Raw table of device sensor reads

  • Summarized table:

    • rounded latitude/longitude

    • min/max/avg summaries of battery_level, c02_level, humidity, and timestamp
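
The rounded coordinates act as location buckets at two granularities: nearest degree and nearest 10 degrees. The plain-Python sketch below mirrors that bucketing for a single reading so the resulting keys are easy to inspect. `round_half_up` and `location_key` are hypothetical helpers; the sketch assumes half-up rounding, which is what Spark SQL's `round` uses (Python's built-in `round` rounds halves to even instead).

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value, scale):
    """Round to `scale` decimal places (negative = tens, hundreds, ...) using half-up."""
    quantum = Decimal(1).scaleb(-scale)   # scale=0 -> 1, scale=-1 -> 1E+1
    return int(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))

def location_key(latitude, longitude, scale):
    """Join rounded latitude and longitude with '_' to form a bucket key."""
    return f'{round_half_up(latitude, scale)}_{round_half_up(longitude, scale)}'

# Coordinates of one device in the sample data (latitude 62.47, longitude 6.15)
print(location_key(62.47, 6.15, 0))    # 62_6
print(location_key(62.47, 6.15, -1))   # 60_10
```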

[ ]:
# Load the data from its source.
devices = spark.read \
  .format('json') \
  .load('/databricks-datasets/iot/iot_devices.json')

# Show the results.
print('type: ', str(type(devices)))
display(devices.take(10))
type:  <class 'pyspark.sql.dataframe.DataFrame'>

battery_level | c02_level | cca2 | cca3 | cn | device_id | device_name | humidity | ip | latitude | lcd | longitude | scale | temp | timestamp
8 | 868 | US | USA | United States | 1 | meter-gauge-1xbYRYcj | 51 | 68.161.225.1 | 38.0 | green | -97.0 | Celsius | 34 | 1458444054093
7 | 1473 | NO | NOR | Norway | 2 | sensor-pad-2n2Pea | 70 | 213.161.254.1 | 62.47 | red | 6.15 | Celsius | 11 | 1458444054119
2 | 1556 | IT | ITA | Italy | 3 | device-mac-36TWSKiT | 44 | 88.36.5.1 | 42.83 | red | 12.83 | Celsius | 19 | 1458444054120
6 | 1080 | US | USA | United States | 4 | sensor-pad-4mzWkz | 32 | 66.39.173.154 | 44.06 | yellow | -121.32 | Celsius | 28 | 1458444054121
4 | 931 | PH | PHL | Philippines | 5 | therm-stick-5gimpUrBB | 62 | 203.82.41.9 | 14.58 | green | 120.97 | Celsius | 25 | 1458444054122
3 | 1210 | US | USA | United States | 6 | sensor-pad-6al7RTAobR | 51 | 204.116.105.67 | 35.93 | yellow | -85.46 | Celsius | 27 | 1458444054122
3 | 1129 | CN | CHN | China | 7 | meter-gauge-7GeDoanM | 26 | 220.173.179.1 | 22.82 | yellow | 108.32 | Celsius | 18 | 1458444054123
0 | 1536 | JP | JPN | Japan | 8 | sensor-pad-8xUD6pzsQI | 35 | 210.173.177.1 | 35.69 | red | 139.69 | Celsius | 27 | 1458444054123
3 | 807 | JP | JPN | Japan | 9 | device-mac-9GcjZ2pw | 85 | 118.23.68.227 | 35.69 | green | 139.69 | Celsius | 13 | 1458444054124
7 | 1470 | US | USA | United States | 10 | sensor-pad-10BsywSYUF | 56 | 208.109.163.218 | 33.61 | red | -111.89 | Celsius | 26 | 1458444054125
[ ]:
from pyspark.sql import functions as F

# Bucket each reading by rounded coordinates:
# location_rounded1 rounds to the nearest degree, location_rounded2 to the nearest 10 degrees.
# Using the F. prefix avoids shadowing Python's built-in round().
devices_with_rounded_locations = (
    devices
    .withColumn(
        'location_rounded1',
        F.concat_ws(
            '_',
            F.round(F.col('latitude'), 0).cast('integer'),
            F.round(F.col('longitude'), 0).cast('integer')))
    .withColumn(
        'location_rounded2',
        F.concat_ws(
            '_',
            F.round(F.col('latitude'), -1).cast('integer'),
            F.round(F.col('longitude'), -1).cast('integer')))
)

cols = ['battery_level', 'c02_level', 'humidity', 'timestamp']
id_cols = ['cca2', 'cca3', 'cn', 'device_name', 'ip', 'location_rounded1', 'location_rounded2']

# One row per device: min/max/avg of each metric, plus the first value of each ID column
devices_summarized = (
    devices_with_rounded_locations.groupby('device_id').agg(
        *[F.min(c) for c in cols],
        *[F.max(c) for c in cols],
        *[F.avg(c) for c in cols],
        *[F.first(c) for c in id_cols]
    )
)

# Rename the aggregate columns, e.g. 'min(battery_level)' -> 'battery_level_min'
# [(from1, to1), ...]
renames = (
    [('device_id', 'device_id')]
    + [(f'first({c})', c) for c in id_cols]
    + [(f'min({c})', f'{c}_min') for c in cols]
    + [(f'max({c})', f'{c}_max') for c in cols]
    + [(f'avg({c})', f'{c}_avg') for c in cols]
)
devices_summarized = devices_summarized.select(
    [F.col(old).alias(new) for old, new in renames]
)

display(devices_summarized.take(10))
device_id | cca2 | cca3 | cn | device_name | ip | location_rounded1 | location_rounded2 | battery_level_min | c02_level_min | humidity_min | timestamp_min | battery_level_max | c02_level_max | humidity_max | timestamp_max | battery_level_avg | c02_level_avg | humidity_avg | timestamp_avg
1 | US | USA | United States | meter-gauge-1xbYRYcj | 68.161.225.1 | 38_-97 | 40_-100 | 8 | 868 | 51 | 1458444054093 | 8 | 868 | 51 | 1458444054093 | 8.0 | 868.0 | 51.0 | 1.458444054093E12
2 | NO | NOR | Norway | sensor-pad-2n2Pea | 213.161.254.1 | 62_6 | 60_10 | 7 | 1473 | 70 | 1458444054119 | 7 | 1473 | 70 | 1458444054119 | 7.0 | 1473.0 | 70.0 | 1.458444054119E12
3 | IT | ITA | Italy | device-mac-36TWSKiT | 88.36.5.1 | 43_13 | 40_10 | 2 | 1556 | 44 | 1458444054120 | 2 | 1556 | 44 | 1458444054120 | 2.0 | 1556.0 | 44.0 | 1.45844405412E12
4 | US | USA | United States | sensor-pad-4mzWkz | 66.39.173.154 | 44_-121 | 40_-120 | 6 | 1080 | 32 | 1458444054121 | 6 | 1080 | 32 | 1458444054121 | 6.0 | 1080.0 | 32.0 | 1.458444054121E12
5 | PH | PHL | Philippines | therm-stick-5gimpUrBB | 203.82.41.9 | 15_121 | 10_120 | 4 | 931 | 62 | 1458444054122 | 4 | 931 | 62 | 1458444054122 | 4.0 | 931.0 | 62.0 | 1.458444054122E12
6 | US | USA | United States | sensor-pad-6al7RTAobR | 204.116.105.67 | 36_-85 | 40_-90 | 3 | 1210 | 51 | 1458444054122 | 3 | 1210 | 51 | 1458444054122 | 3.0 | 1210.0 | 51.0 | 1.458444054122E12
7 | CN | CHN | China | meter-gauge-7GeDoanM | 220.173.179.1 | 23_108 | 20_110 | 3 | 1129 | 26 | 1458444054123 | 3 | 1129 | 26 | 1458444054123 | 3.0 | 1129.0 | 26.0 | 1.458444054123E12
8 | JP | JPN | Japan | sensor-pad-8xUD6pzsQI | 210.173.177.1 | 36_140 | 40_140 | 0 | 1536 | 35 | 1458444054123 | 0 | 1536 | 35 | 1458444054123 | 0.0 | 1536.0 | 35.0 | 1.458444054123E12
9 | JP | JPN | Japan | device-mac-9GcjZ2pw | 118.23.68.227 | 36_140 | 40_140 | 3 | 807 | 85 | 1458444054124 | 3 | 807 | 85 | 1458444054124 | 3.0 | 807.0 | 85.0 | 1.458444054124E12
10 | US | USA | United States | sensor-pad-10BsywSYUF | 208.109.163.218 | 34_-112 | 30_-110 | 7 | 1470 | 56 | 1458444054125 | 7 | 1470 | 56 | 1458444054125 | 7.0 | 1470.0 | 56.0 | 1.458444054125E12
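
The rename step depends on the names Spark gives aggregate columns by default, such as `min(battery_level)`. The plain-Python sketch below rebuilds the same (old, new) pairs without a Spark session, which can help when checking the expected output schema. `build_renames` is a hypothetical helper, and the default names are assumed to follow the `fn(column)` pattern used in the cell above.

```python
cols = ['battery_level', 'c02_level', 'humidity', 'timestamp']
id_cols = ['cca2', 'cca3', 'cn', 'device_name', 'ip',
           'location_rounded1', 'location_rounded2']

def build_renames(cols, id_cols):
    """Return (spark_default_name, friendly_name) pairs in output column order."""
    pairs = [('device_id', 'device_id')]
    pairs += [(f'first({c})', c) for c in id_cols]
    for fn in ('min', 'max', 'avg'):
        pairs += [(f'{fn}({c})', f'{c}_{fn}') for c in cols]
    return pairs

renames = build_renames(cols, id_cols)
print(len(renames))   # 20, one per column of the summarized table
print(renames[8])     # ('min(battery_level)', 'battery_level_min')
```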

Notebook plot#

  • Simple: Graph connections between device_name and cca3 (country code)

  • Advanced: Graph multiple connections, like ip -> device_name and location_rounded1 -> ip

[ ]:
(
    graphistry
        .edges(devices.sample(fraction=0.1), 'device_name', 'cca3')
        .settings(url_params={'strongGravity': 'true'})
        .plot()
)
[ ]:
hg = graphistry.hypergraph(
    devices_with_rounded_locations.sample(fraction=0.1).toPandas(),
    ['ip', 'device_name', 'location_rounded1', 'location_rounded2', 'cca3'],
    direct=True,
    opts={
        'EDGES': {
            'ip': ['device_name'],
            'location_rounded1': ['ip'],
            'location_rounded2': ['ip'],
            'cca3': ['location_rounded2']
        }
    })
g = hg['graph']
g = g.settings(url_params={'strongGravity': 'true'})  # compacts the layout

g.plot()
# links 79200
# events 19800
# attrib entities 41197
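
The link count printed above is consistent with direct mode emitting one edge per row for each source/destination pair in `EDGES`: 4 pairs × 19800 sampled rows = 79200 links. The sketch below illustrates that expansion for a single row in plain Python. The helper `row_to_edges` and the `column::value` node-ID format are assumptions made for illustration, not PyGraphistry's implementation.

```python
EDGES = {
    'ip': ['device_name'],
    'location_rounded1': ['ip'],
    'location_rounded2': ['ip'],
    'cca3': ['location_rounded2'],
}

def row_to_edges(row, edges=EDGES):
    """Expand one record into (source, destination) pairs, one per EDGES entry."""
    out = []
    for src_col, dst_cols in edges.items():
        for dst_col in dst_cols:
            out.append((f'{src_col}::{row[src_col]}', f'{dst_col}::{row[dst_col]}'))
    return out

# Values of the first device in the sample data
row = {'ip': '68.161.225.1', 'device_name': 'meter-gauge-1xbYRYcj',
       'location_rounded1': '38_-97', 'location_rounded2': '40_-100', 'cca3': 'USA'}
for edge in row_to_edges(row):
    print(edge)
print(len(row_to_edges(row)) * 19800)   # 79200
```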

Dashboard plot#

  • Make a graphistry object as usual…

  • … Then disable the splash screen and optionally set custom dimensions

The visualization now loads in the dashboard (View -> + New Dashboard) without requiring a click on the splash screen.

[ ]:
(
    g
        .settings(url_params={'splashAfter': 'false'})  # extends existing setting
        .plot(override_html_style="""
            border: 1px #DDD dotted;
            width: 50em; height: 50em;
        """)
)

Plot as a shareable URL#

[ ]:
url = g.plot(render=False)
url
Out[18]: 'https://hub.graphistry.com/graph/graph.html?dataset=187d97493ce54498b820f727877eda4b&type=arrow&viztoken=b3106e8a-cbe9-4802-8519-97e1d0d539c3&usertag=50d9aebe-pygraphistry-0.28.5&splashAfter=1669270570&info=true&strongGravity=true'
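
Because the result is an ordinary URL, its query string can be inspected with the standard library, for example to confirm which settings were applied before sharing it. A minimal sketch using the URL printed above:

```python
from urllib.parse import urlparse, parse_qs

url = ('https://hub.graphistry.com/graph/graph.html'
       '?dataset=187d97493ce54498b820f727877eda4b&type=arrow'
       '&viztoken=b3106e8a-cbe9-4802-8519-97e1d0d539c3'
       '&usertag=50d9aebe-pygraphistry-0.28.5'
       '&splashAfter=1669270570&info=true&strongGravity=true')

parts = urlparse(url)
# parse_qs returns a list per key; each key appears once here, so keep the first value
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.netloc)              # hub.graphistry.com
print(params['dataset'])         # 187d97493ce54498b820f727877eda4b
print(params['strongGravity'])   # true
```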