Databricks <> Graphistry Tutorial: Notebooks & Dashboards on IoT data#
This tutorial visualizes a set of sensors by clustering them based on lattitude/longitude and overlaying summary statistics
We show how to load the interactive plots both with Databricks notebook and dashboard modes. The general flow should work in other PySpark environments as well.
Steps:
Install Graphistry
Prepare IoT data
Plot in a notebook
Plot in a dashboard
Plot as a shareable URL
Install & connect#
[ ]:
# Uncomment and run first time
! pip install graphistry
#! pip install git+https://github.com/graphistry/pygraphistry.git@dev/databricks
# Can sometimes help:
#dbutils.library.restartPython()
Requirement already satisfied: graphistry in /local_disk0/.ephemeral_nfs/envs/pythonEnv-969db892-92cf-4b34-a5cf-61642fa76e77/lib/python3.9/site-packages (0.28.5)
Requirement already satisfied: numpy in /databricks/python3/lib/python3.9/site-packages (from graphistry) (1.20.3)
Requirement already satisfied: pandas>=0.17.0 in /databricks/python3/lib/python3.9/site-packages (from graphistry) (1.3.4)
Requirement already satisfied: packaging>=20.1 in /databricks/python3/lib/python3.9/site-packages (from graphistry) (21.0)
Requirement already satisfied: squarify in /local_disk0/.ephemeral_nfs/envs/pythonEnv-969db892-92cf-4b34-a5cf-61642fa76e77/lib/python3.9/site-packages (from graphistry) (0.4.3)
Requirement already satisfied: palettable>=3.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-969db892-92cf-4b34-a5cf-61642fa76e77/lib/python3.9/site-packages (from graphistry) (3.3.0)
Requirement already satisfied: typing-extensions in /databricks/python3/lib/python3.9/site-packages (from graphistry) (3.10.0.2)
Requirement already satisfied: pyarrow>=0.15.0 in /databricks/python3/lib/python3.9/site-packages (from graphistry) (7.0.0)
Requirement already satisfied: requests in /databricks/python3/lib/python3.9/site-packages (from graphistry) (2.26.0)
Requirement already satisfied: pyparsing>=2.0.2 in /databricks/python3/lib/python3.9/site-packages (from packaging>=20.1->graphistry) (3.0.4)
Requirement already satisfied: python-dateutil>=2.7.3 in /databricks/python3/lib/python3.9/site-packages (from pandas>=0.17.0->graphistry) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /databricks/python3/lib/python3.9/site-packages (from pandas>=0.17.0->graphistry) (2021.3)
Requirement already satisfied: six>=1.5 in /databricks/python3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas>=0.17.0->graphistry) (1.16.0)
Requirement already satisfied: idna<4,>=2.5 in /databricks/python3/lib/python3.9/site-packages (from requests->graphistry) (3.2)
Requirement already satisfied: charset-normalizer~=2.0.0 in /databricks/python3/lib/python3.9/site-packages (from requests->graphistry) (2.0.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /databricks/python3/lib/python3.9/site-packages (from requests->graphistry) (1.26.7)
Requirement already satisfied: certifi>=2017.4.17 in /databricks/python3/lib/python3.9/site-packages (from requests->graphistry) (2021.10.8)
WARNING: You are using pip version 21.2.4; however, version 22.3.1 is available.
You should consider upgrading via the '/local_disk0/.ephemeral_nfs/envs/pythonEnv-969db892-92cf-4b34-a5cf-61642fa76e77/bin/python -m pip install --upgrade pip' command.
[ ]:
#Optional: Uncomment - We find this speeds up calls 10%+ on some datasets
#spark.conf.set("spark.sql.execution.arrow.enabled", "true")
[ ]:
import graphistry # if not yet available, install and/or restart Python kernel using the above
# To specify Graphistry account & server, use:
# graphistry.register(api=3, username='...', password='...', protocol='https', server='hub.graphistry.com')
# For more options, see https://github.com/graphistry/pygraphistry#configure
graphistry.__version__
Out[12]: '0.28.5'
Prepare IoT data#
Sample data provided by Databricks
We create tables for different plots:
Raw table of device sensor reads
Summarized table:
rounded latitude/longitude
summarize min/max/avg for battery_level, c02_level, humidity, timestamp
[ ]:
# Load the data from its source.
devices = spark.read \
.format('json') \
.load('/databricks-datasets/iot/iot_devices.json')
# Show the results.
print('type: ', str(type(devices)))
display(devices.take(10))
type: <class 'pyspark.sql.dataframe.DataFrame'>
| battery_level | c02_level | cca2 | cca3 | cn | device_id | device_name | humidity | ip | latitude | lcd | longitude | scale | temp | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 868 | US | USA | United States | 1 | meter-gauge-1xbYRYcj | 51 | 68.161.225.1 | 38.0 | green | -97.0 | Celsius | 34 | 1458444054093 |
| 7 | 1473 | NO | NOR | Norway | 2 | sensor-pad-2n2Pea | 70 | 213.161.254.1 | 62.47 | red | 6.15 | Celsius | 11 | 1458444054119 |
| 2 | 1556 | IT | ITA | Italy | 3 | device-mac-36TWSKiT | 44 | 88.36.5.1 | 42.83 | red | 12.83 | Celsius | 19 | 1458444054120 |
| 6 | 1080 | US | USA | United States | 4 | sensor-pad-4mzWkz | 32 | 66.39.173.154 | 44.06 | yellow | -121.32 | Celsius | 28 | 1458444054121 |
| 4 | 931 | PH | PHL | Philippines | 5 | therm-stick-5gimpUrBB | 62 | 203.82.41.9 | 14.58 | green | 120.97 | Celsius | 25 | 1458444054122 |
| 3 | 1210 | US | USA | United States | 6 | sensor-pad-6al7RTAobR | 51 | 204.116.105.67 | 35.93 | yellow | -85.46 | Celsius | 27 | 1458444054122 |
| 3 | 1129 | CN | CHN | China | 7 | meter-gauge-7GeDoanM | 26 | 220.173.179.1 | 22.82 | yellow | 108.32 | Celsius | 18 | 1458444054123 |
| 0 | 1536 | JP | JPN | Japan | 8 | sensor-pad-8xUD6pzsQI | 35 | 210.173.177.1 | 35.69 | red | 139.69 | Celsius | 27 | 1458444054123 |
| 3 | 807 | JP | JPN | Japan | 9 | device-mac-9GcjZ2pw | 85 | 118.23.68.227 | 35.69 | green | 139.69 | Celsius | 13 | 1458444054124 |
| 7 | 1470 | US | USA | United States | 10 | sensor-pad-10BsywSYUF | 56 | 208.109.163.218 | 33.61 | red | -111.89 | Celsius | 26 | 1458444054125 |
[ ]:
from pyspark.sql import functions as F
from pyspark.sql.functions import concat_ws, col, round
devices_with_rounded_locations = (
devices
.withColumn(
'location_rounded1',
concat_ws(
'_',
round(col('latitude'), 0).cast('integer'),
round(col('longitude'), 0).cast('integer')))
.withColumn(
'location_rounded2',
concat_ws(
'_',
round(col('latitude'), -1).cast('integer'),
round(col('longitude'), -1).cast('integer')))
)
cols = ['battery_level', 'c02_level', 'humidity', 'timestamp']
id_cols = ['cca2', 'cca3', 'cn', 'device_name', 'ip', 'location_rounded1', 'location_rounded2']
devices_summarized = (
devices_with_rounded_locations.groupby('device_id').agg(
*[F.min(col) for col in cols],
*[F.max(col) for col in cols],
*[F.avg(col) for col in cols],
*[F.first(col) for col in id_cols]
)
)
# [(from1, to1), ...]
renames = (
[('device_id', 'device_id')]
+ [(f'first({col})', f'{col}') for col in id_cols]
+ [(f'min({col})', f'{col}_min') for col in cols]
+ [(f'max({col})', f'{col}_max') for col in cols]
+ [(f'avg({col})', f'{col}_avg') for col in cols]
)
devices_summarized = devices_summarized.select(list(
map(lambda old,new:F.col(old).alias(new),*zip(*renames))
))
display(devices_summarized.take(10))
| device_id | cca2 | cca3 | cn | device_name | ip | location_rounded1 | location_rounded2 | battery_level_min | c02_level_min | humidity_min | timestamp_min | battery_level_max | c02_level_max | humidity_max | timestamp_max | battery_level_avg | c02_level_avg | humidity_avg | timestamp_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | US | USA | United States | meter-gauge-1xbYRYcj | 68.161.225.1 | 38_-97 | 40_-100 | 8 | 868 | 51 | 1458444054093 | 8 | 868 | 51 | 1458444054093 | 8.0 | 868.0 | 51.0 | 1.458444054093E12 |
| 2 | NO | NOR | Norway | sensor-pad-2n2Pea | 213.161.254.1 | 62_6 | 60_10 | 7 | 1473 | 70 | 1458444054119 | 7 | 1473 | 70 | 1458444054119 | 7.0 | 1473.0 | 70.0 | 1.458444054119E12 |
| 3 | IT | ITA | Italy | device-mac-36TWSKiT | 88.36.5.1 | 43_13 | 40_10 | 2 | 1556 | 44 | 1458444054120 | 2 | 1556 | 44 | 1458444054120 | 2.0 | 1556.0 | 44.0 | 1.45844405412E12 |
| 4 | US | USA | United States | sensor-pad-4mzWkz | 66.39.173.154 | 44_-121 | 40_-120 | 6 | 1080 | 32 | 1458444054121 | 6 | 1080 | 32 | 1458444054121 | 6.0 | 1080.0 | 32.0 | 1.458444054121E12 |
| 5 | PH | PHL | Philippines | therm-stick-5gimpUrBB | 203.82.41.9 | 15_121 | 10_120 | 4 | 931 | 62 | 1458444054122 | 4 | 931 | 62 | 1458444054122 | 4.0 | 931.0 | 62.0 | 1.458444054122E12 |
| 6 | US | USA | United States | sensor-pad-6al7RTAobR | 204.116.105.67 | 36_-85 | 40_-90 | 3 | 1210 | 51 | 1458444054122 | 3 | 1210 | 51 | 1458444054122 | 3.0 | 1210.0 | 51.0 | 1.458444054122E12 |
| 7 | CN | CHN | China | meter-gauge-7GeDoanM | 220.173.179.1 | 23_108 | 20_110 | 3 | 1129 | 26 | 1458444054123 | 3 | 1129 | 26 | 1458444054123 | 3.0 | 1129.0 | 26.0 | 1.458444054123E12 |
| 8 | JP | JPN | Japan | sensor-pad-8xUD6pzsQI | 210.173.177.1 | 36_140 | 40_140 | 0 | 1536 | 35 | 1458444054123 | 0 | 1536 | 35 | 1458444054123 | 0.0 | 1536.0 | 35.0 | 1.458444054123E12 |
| 9 | JP | JPN | Japan | device-mac-9GcjZ2pw | 118.23.68.227 | 36_140 | 40_140 | 3 | 807 | 85 | 1458444054124 | 3 | 807 | 85 | 1458444054124 | 3.0 | 807.0 | 85.0 | 1.458444054124E12 |
| 10 | US | USA | United States | sensor-pad-10BsywSYUF | 208.109.163.218 | 34_-112 | 30_-110 | 7 | 1470 | 56 | 1458444054125 | 7 | 1470 | 56 | 1458444054125 | 7.0 | 1470.0 | 56.0 | 1.458444054125E12 |
Notebook plot#
Simple: Graph connections between
device_nameandcca3(country code)Advanced: Graph multiple connections, like
ip -> device_nameandlocaation_rounded1 -> ip
[ ]:
(
graphistry
.edges(devices.sample(fraction=0.1), 'device_name', 'cca3') \
.settings(url_params={'strongGravity': 'true'}) \
.plot()
)
[ ]:
hg = graphistry.hypergraph(
devices_with_rounded_locations.sample(fraction=0.1).toPandas(),
['ip', 'device_name', 'location_rounded1', 'location_rounded2', 'cca3'],
direct=True,
opts={
'EDGES': {
'ip': ['device_name'],
'location_rounded1': ['ip'],
'location_rounded2': ['ip'],
'cca3': ['location_rounded2']
}
})
g = hg['graph']
g = g.settings(url_params={'strongGravity': 'true'}) # this setting is great!
g.plot()
# links 79200
# events 19800
# attrib entities 41197
Dashboard plot#
Make a
graphistryobject as usual…… Then disable the splash screen and optionally set custom dimensions
The visualization will now load without needing to interact in the dashboard (view -> + New Dashboard)
[ ]:
(
g
.settings(url_params={'splashAfter': 'false'}) # extends existing setting
.plot(override_html_style="""
border: 1px #DDD dotted;
width: 50em; height: 50em;
""")
)