Graphistry Neptune Gremlin identity graph demo#
PyGraphistry helps connect to graph data sources, wrangle them with Python dataframe tools, and visualize them with Graphistry. It’s often used in notebooks, data apps, and dashboards.
This notebook uses PyGraphistry to quickly:
* Connect to Neptune
* Run Gremlin queries via built-in bindings over gremlin_python
* Convert to dataframes for data wrangling: CPU via Pandas and GPU via RAPIDS cuDF
* Visualize by automatically generating rich, interactive, and GPU-accelerated Graphistry graph visualization sessions
* Share and embed your beautiful results
For any API used below, run help(graphistry.the_method) for a quick view of its docs.
The demo uses AWS Neptune’s identity graph sample data from our joint graph-app-kit tutorial. If you have your own dataset, including non-identity data, the example queries should still work.
Setup#
Optional: Quicklaunch via graph-app-kit for Neptune:
* Neptune: It is tested on Neptune’s identity graph database sample kit, and you can swap in your own
* Graphistry: Use your own, get a free Hub account, or launch in AWS alongside Neptune’s VPC and public subnet
* Notebook: Use your own, or launch in AWS alongside Neptune’s VPC and public subnet
If you hit gremlinpython event runtime bugs, try this gist for solving them.
Install#
Already provided in graphistry envs
[1]:
# ! pip install -U gremlinpython graphistry
# ! pip install -U pandas
# see https://rapids.ai/ if trying GPU dataframes
Imports#
[2]:
! pip show gremlinpython graphistry | grep 'Name\|Version'
Name: gremlinpython
Version: 3.4.10
Name: graphistry
Version: 0.19.0+5.g5ce1d3fb0
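Where shelling out to pip and grep is not convenient (for example on Windows), the same version check can be sketched in pure Python. This is a minimal sketch using the standard library's importlib.metadata (Python 3.8+); the helper name installed_versions is illustrative and not part of PyGraphistry.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the two packages this notebook depends on
for name, version in installed_versions(['gremlinpython', 'graphistry']).items():
    print(f'{name}: {version or "not installed"}')
```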
[3]:
import graphistry
graphistry.__version__
[3]:
'0.19.0+5.g5ce1d3fb0'
Configure#
[4]:
# To specify Graphistry account & server, use:
# graphistry.register(api=3, username='...', password='...', protocol='https', server='hub.graphistry.com')
# For more options, see https://github.com/graphistry/pygraphistry#configure
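One way to keep credentials out of the notebook is to build the register() arguments from environment variables. The sketch below assumes hypothetical variable names (GRAPHISTRY_USERNAME, GRAPHISTRY_PASSWORD, GRAPHISTRY_PROTOCOL, GRAPHISTRY_SERVER) and a hypothetical helper graphistry_cfg_from_env; only the keyword names (api, username, password, protocol, server) come from the register() call shown above.

```python
import os


def graphistry_cfg_from_env(env=None):
    """Build graphistry.register() keyword arguments from environment variables.

    The environment variable names are illustrative assumptions, not a PyGraphistry convention.
    """
    env = os.environ if env is None else env
    return {
        'api': 3,
        'username': env['GRAPHISTRY_USERNAME'],  # required; raises KeyError if unset
        'password': env['GRAPHISTRY_PASSWORD'],  # required; raises KeyError if unset
        'protocol': env.get('GRAPHISTRY_PROTOCOL', 'https'),
        'server': env.get('GRAPHISTRY_SERVER', 'hub.graphistry.com'),
    }


# Usage in the notebook (needs the variables set and graphistry installed):
# GRAPHISTRY_CFG = graphistry_cfg_from_env()
# graphistry.register(**GRAPHISTRY_CFG)
```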
[29]:
NEPTUNE_READER_PROTOCOL='wss'
NEPTUNE_READER_HOST='neptunedbcluster-abc.cluster-ro-xyz.us-east-1.neptune.amazonaws.com'
NEPTUNE_READER_PORT='8182'
endpoint = f'{NEPTUNE_READER_PROTOCOL}://{NEPTUNE_READER_HOST}:{NEPTUNE_READER_PORT}/gremlin'
endpoint
[29]:
'wss://neptunedbcluster-abc.cluster-ro-xyz.us-east-1.neptune.amazonaws.com:8182/gremlin'
[6]:
#import logging
#logging.basicConfig(level=logging.DEBUG)
Connect#
[7]:
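# GRAPHISTRY_CFG is not defined in this notebook; it is assumed to be a dict of
# graphistry.register() keyword arguments, such as those shown under Configure
graphistry.register(**GRAPHISTRY_CFG)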
g = graphistry.neptune(endpoint=endpoint)
g._gremlin_client
[7]:
<gremlin_python.driver.client.Client at 0x7fdfc230e3d0>
Query & plot#
* PyGraphistry automatically converts Gremlin results into node/edge dataframes
* Edge queries typically only return node IDs; call fetch_nodes() to enrich your g._nodes dataframe
* PyGraphistry plots dataframes
[25]:
%%time
g2 = g.gremlin('g.E().limit(10000)')
CPU times: user 4.96 s, sys: 27.9 ms, total: 4.99 s
Wall time: 4.95 s
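When experimenting with different sample sizes or edge types, it can help to build the query string with a small helper instead of editing it by hand. The following is a minimal sketch: edge_query is a hypothetical helper, not part of PyGraphistry, and it only uses the standard Gremlin steps hasLabel() and limit().

```python
def edge_query(limit=10000, label=None):
    """Build a Gremlin edge query string, optionally filtered by edge label."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError('limit must be a positive integer')
    steps = ['g.E()']
    if label is not None:
        # Escape backslashes and single quotes so the label stays a valid string literal
        safe = label.replace('\\', '\\\\').replace("'", "\\'")
        steps.append(f"hasLabel('{safe}')")
    steps.append(f'limit({limit})')
    return '.'.join(steps)


print(edge_query())                            # same query as the cell above
print(edge_query(limit=500, label='visited'))  # only 'visited' edges

# Usage with the live connection from the Connect section:
# g2 = g.gremlin(edge_query(limit=500, label='visited'))
```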
[26]:
print('NODES:')
g2._nodes.info()
g2._nodes.sample(3)
NODES:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8106 entries, 0 to 8105
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 8106 non-null object
1 label 8106 non-null object
dtypes: object(2)
memory usage: 126.8+ KB
[26]:
| | id | label |
|---|---|---|
| 4102 | ed95a9a5be30e4c8/e212d4b4d4a865a/7e3e41e09dfe6... | website |
| 6496 | 6ea77fc3ea42bd5b/87be29bd5615083/d4392e74543e413 | website |
| 7540 | 4c980617e02858a4/7de2f069da3a3655/30591f4d8c71... | website |
[27]:
print('EDGES:')
print(g2._edges.info())
g2._edges.sample(3)
EDGES:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 10000 non-null object
1 label 10000 non-null object
2 src 10000 non-null object
3 dst 10000 non-null object
dtypes: object(4)
memory usage: 312.6+ KB
None
[27]:
| | id | label | src | dst |
|---|---|---|---|---|
| 2814 | f7803bf0ac187592421c0695792b698f43b596ce | visited | 556de63e26686d50/95263499b67bbda1?f300c39f4f33... | 48e740025e70e4e38dc87928cd45357c |
| 8081 | fe80cddfec97a7dd802cf93cf277da01d9b5fb65 | visited | 3ccec85ce35ea661?fa76e6024017220f | 23c31ea91be100fd224dff1499939851 |
| 2046 | 4e5290971de41c1e1bcb7433e53ffc6321e410cf | visited | 6ea77fc3ea42bd5b/9c280de73bf0fb32/bb555a4d63de... | 9e77c2a52fdf9f9b7416e85cabaf7c76 |
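Because the edges arrive as an ordinary Pandas dataframe, regular dataframe wrangling applies before plotting. As a minimal sketch, the snippet below counts incoming edges per destination node for one edge label. The tiny dataframe is made-up stand-in data that only reuses the column names shown above (id, label, src, dst), and top_destinations is a hypothetical helper.

```python
import pandas as pd

# Illustrative stand-in for g2._edges: made-up IDs, same column names as above
edges = pd.DataFrame({
    'id': ['e1', 'e2', 'e3', 'e4'],
    'label': ['visited', 'visited', 'visited', 'other'],
    'src': ['t1', 't2', 't3', 't1'],
    'dst': ['w1', 'w1', 'w2', 'p1'],
})


def top_destinations(edges_df, label, n=10):
    """Count incoming edges with the given label per destination node, largest first."""
    matching = edges_df[edges_df['label'] == label]
    return (matching.groupby('dst').size()
            .sort_values(ascending=False)
            .head(n)
            .rename('in_degree')
            .reset_index())


top = top_destinations(edges, 'visited')
print(top)
```

With the live data, the same call would be top_destinations(g2._edges, 'visited').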
[28]:
%%time
# Enrich nodes dataframe with any available server property data
g3 = g2.fetch_nodes()
print(g3._nodes.info())
g3._nodes.sample(3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8106 entries, 0 to 8105
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 8106 non-null object
1 label 8106 non-null object
dtypes: object(2)
memory usage: 126.8+ KB
None
CPU times: user 4.32 s, sys: 43.9 ms, total: 4.37 s
Wall time: 4.33 s
[28]:
| | id | label |
|---|---|---|
| 1242 | 4c980617e02858a4/7de2f069da3a3655/30591f4d8c71... | website |
| 3190 | 493a46bbfd2029ae/4a0cad2f071a71ce/f9ba18598922... | website |
| 6782 | 4c980617e02858a4/7de2f069da3a3655/30591f4d8c71... | website |
[19]:
%%time
g3.plot()
CPU times: user 59.8 ms, sys: 4 ms, total: 63.8 ms
Wall time: 1.68 s
[19]:
Customize your visuals & Embed#
Graphistry visualizes data with smart defaults: community-based coloring, degree-based sizing, force-directed layout, auto-zoom, and built-in visual analytics. However, it often helps to configure your visuals ahead of time.
Example:
* Enable legend on new column ‘type’
* Color nodes by node column ‘type’
* Pick icons based on node type
* Set background color to match notebook
* Use a tighter layout
See further examples at the PyGraphistry GitHub repo.
[24]:
%%time
g4 = (g3
    # Add node column 'type' based on gremlin-provided column 'label'
    # The legend auto-detects this column and appears
    .nodes(lambda g: g._nodes.assign(type=g._nodes['label']))
    .encode_point_color('type', categorical_mapping={
        'website': 'blue',
        'transientId': 'green'
    })
    .encode_point_icon('type', categorical_mapping={
        'website': 'link',
        'transientId': 'barcode'
    })
    .addStyle(bg={'color': '#eee'}, page={'title': 'My Graph'})
    # More: https://hub.graphistry.com/docs/api/1/rest/url/
    .settings(url_params={'play': 2000})
)
g4.plot()
CPU times: user 63.5 ms, sys: 3.88 ms, total: 67.3 ms
Wall time: 1.62 s
[24]:
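The mappings above are written by hand for two known labels. For your own dataset, where the label set may differ, the mapping can be generated instead. This is a minimal sketch: build_categorical_mapping is a hypothetical helper, and the resulting assignment depends on the order in which labels first appear.

```python
from itertools import cycle


def build_categorical_mapping(values, choices):
    """Assign each distinct value a choice, cycling when values outnumber choices."""
    if not choices:
        raise ValueError('choices must not be empty')
    mapping = {}
    pool = cycle(choices)
    for value in values:
        if value not in mapping:
            mapping[value] = next(pool)
    return mapping


# Stand-in for the node labels; with live data use g3._nodes['label']
labels = ['website', 'transientId', 'website']
colors = build_categorical_mapping(labels, ['blue', 'green'])
icons = build_categorical_mapping(labels, ['link', 'barcode'])
print(colors)
print(icons)

# Usage with the plotter from above:
# g3.encode_point_color('label', categorical_mapping=colors)
```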
Generate URL for other systems#
[23]:
%%time
url = g4.plot(render=False)
url
CPU times: user 64.8 ms, sys: 0 ns, total: 64.8 ms
Wall time: 1.67 s
[23]:
'https://hub.graphistry.com/graph/graph.html?dataset=7405d0ac396a47ea9ee84acab7b0b31d&type=arrow&viztoken=c5e68946-e922-487e-9484-ef8fc9e2c8f9&usertag=5bf3845f-pygraphistry-0.19.0+5.g5ce1d3fb0&splashAfter=1625879227&info=true&strongGravity=False&play=2000'
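To embed the generated URL in another web page or dashboard, one common approach is an iframe. The sketch below builds the HTML with the standard library only; iframe_html is a hypothetical helper, the dataset value is a placeholder, and whether viewers can open the session depends on your Graphistry sharing settings.

```python
from html import escape


def iframe_html(url, width='100%', height=600):
    """Return an <iframe> snippet for a visualization URL, escaping attribute values."""
    return (
        f'<iframe src="{escape(url, quote=True)}" '
        f'width="{escape(str(width), quote=True)}" height="{int(height)}" '
        'style="border: 1px solid #ddd;" allowfullscreen></iframe>'
    )


# Placeholder URL; in the notebook, pass the url value returned by g4.plot(render=False)
snippet = iframe_html('https://hub.graphistry.com/graph/graph.html?dataset=PLACEHOLDER&play=2000')
print(snippet)

# In a notebook, the snippet can be rendered with:
# from IPython.display import HTML
# HTML(snippet)
```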
Next steps#
* Go deeper with PyGraphistry: Examples for customization, GPU graph analytics, and more
* Explore gremlinpython
* Dashboarding with graph-app-kit / Streamlit’s Neptune integration
* Amazon Neptune’s launch announcement & tutorial
* Try a CSV upload on Hub or launch your own Graphistry server
* Additional Graphistry APIs: REST, React, JS, …