Graphistry Neptune Gremlin identity graph demo#

PyGraphistry helps connect to graph data sources, wrangle them with Python dataframe tools, and visualize them with Graphistry. It’s often used in notebooks, data apps, and dashboards.

This notebook uses PyGraphistry to quickly:

  • Connect to Neptune

  • Run Gremlin queries via built-in bindings over gremlin_python

  • Convert to dataframes for data wrangling: CPU via Pandas and GPU via RAPIDS cuDF

  • Visualize by automatically generating rich, interactive, and GPU-accelerated Graphistry graph visualization sessions

  • Share and embed your results

For any API used below, run help(graphistry.the_method) for a quick view of its docs.

The demo uses AWS Neptune’s identity graph sample data from our joint graph-app-kit tutorial. If you have your own dataset, including non-identity data, the example queries should still work.

Setup#

Optional: quicklaunch via graph-app-kit for Neptune:

  • Neptune: This notebook is tested on Neptune’s identity graph database sample kit, and you can swap in your own

  • Graphistry: Use your own, get a free Hub account, or launch in AWS alongside Neptune’s VPC and public subnet

  • Notebook: Use your own, or launch in AWS alongside Neptune’s VPC and public subnet

If you hit gremlinpython event runtime bugs, try this gist for solving them

Install#

Already provided in graphistry envs

[1]:
# ! pip install -U gremlinpython graphistry
# ! pip install -U pandas
# see https://rapids.ai/ if trying GPU dataframes

Imports#

[2]:
! pip show gremlinpython graphistry | grep 'Name\|Version'
Name: gremlinpython
Version: 3.4.10
Name: graphistry
Version: 0.19.0+5.g5ce1d3fb0
[3]:
import graphistry
graphistry.__version__
[3]:
'0.19.0+5.g5ce1d3fb0'

Configure#

[4]:
# To specify Graphistry account & server, fill in your own values:
GRAPHISTRY_CFG = dict(api=3, username='...', password='...', protocol='https', server='hub.graphistry.com')
# These are passed below as graphistry.register(**GRAPHISTRY_CFG)
# For more options, see https://github.com/graphistry/pygraphistry#configure
[29]:
NEPTUNE_READER_PROTOCOL='wss'
NEPTUNE_READER_HOST='neptunedbcluster-abc.cluster-ro-xyz.us-east-1.neptune.amazonaws.com'
NEPTUNE_READER_PORT='8182'

endpoint = f'{NEPTUNE_READER_PROTOCOL}://{NEPTUNE_READER_HOST}:{NEPTUNE_READER_PORT}/gremlin'
endpoint
[29]:
'wss://neptunedbcluster-abc.cluster-ro-xyz.us-east-1.neptune.amazonaws.com:8182/gremlin'
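The endpoint string above is assembled by hand, so a typo in the protocol or port only surfaces when the connection fails. As a minimal sketch, a small helper can build the same URL and sanity-check it first. The function name neptune_gremlin_endpoint is hypothetical and is not part of PyGraphistry or gremlin_python; it uses only the Python standard library.

```python
from urllib.parse import urlsplit


def neptune_gremlin_endpoint(host, port=8182, protocol="wss"):
    """Build a Gremlin websocket URL and sanity-check it (hypothetical helper)."""
    if protocol not in ("ws", "wss"):
        raise ValueError(f"expected ws or wss, got {protocol!r}")
    url = f"{protocol}://{host}:{int(port)}/gremlin"
    parts = urlsplit(url)
    # An empty host or a missing port means the URL cannot be used to connect
    if not parts.hostname or parts.port is None:
        raise ValueError(f"malformed endpoint: {url!r}")
    return url


# Same placeholder host as in the cell above
endpoint = neptune_gremlin_endpoint(
    "neptunedbcluster-abc.cluster-ro-xyz.us-east-1.neptune.amazonaws.com", 8182
)
print(endpoint)
```

The resulting string can be passed to graphistry.neptune(endpoint=endpoint) exactly as in the Connect step.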
[6]:
#import logging
#logging.basicConfig(level=logging.DEBUG)

Connect#

[7]:
graphistry.register(**GRAPHISTRY_CFG)

g = graphistry.neptune(endpoint=endpoint)

g._gremlin_client
[7]:
<gremlin_python.driver.client.Client at 0x7fdfc230e3d0>

Query & plot#

  • PyGraphistry automatically converts Gremlin results into node/edge dataframes

  • Edge queries typically only return node IDs; call fetch_nodes() to enrich your g._nodes dataframe

  • PyGraphistry plots dataframes
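To see why an edge query yields fewer node rows than edge rows, and why those rows carry little more than an ID, it can help to derive a node table from an edge table by hand. The sketch below uses a toy pandas dataframe with made-up IDs and the same src/dst column names as the edge output in this notebook. It illustrates the idea only and is not PyGraphistry's internal implementation.

```python
import pandas as pd

# Toy edge table with the same column names as g2._edges (IDs are made up)
edges = pd.DataFrame({
    "id": ["e1", "e2", "e3"],
    "label": ["visited", "visited", "visited"],
    "src": ["t1", "t2", "t1"],
    "dst": ["w1", "w1", "w2"],
})

# Every distinct edge endpoint becomes exactly one node row
node_ids = (
    pd.concat([edges["src"], edges["dst"]])
    .drop_duplicates()
    .reset_index(drop=True)
)
nodes = pd.DataFrame({"id": node_ids})

print(len(edges), "edges ->", len(nodes), "nodes")
```

Because endpoints are deduplicated, the node count depends on how many edges share a source or destination. Node properties are not part of an edge result, which is what fetch_nodes() is for.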

[25]:
%%time

g2 = g.gremlin('g.E().limit(10000)')

CPU times: user 4.96 s, sys: 27.9 ms, total: 4.99 s
Wall time: 4.95 s
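The limit in the query above is hard-coded into the Gremlin string. When experimenting with different sample sizes, one option is to build the string from a validated integer. The helper name edge_sample_query is hypothetical; only the query text g.E().limit(...) comes from this notebook.

```python
def edge_sample_query(limit=10000):
    """Return a Gremlin query string for the first `limit` edges (hypothetical helper)."""
    # Reject bools, non-integers, and non-positive values before formatting the query
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"g.E().limit({limit})"


query = edge_sample_query(10000)
print(query)

# With the live Neptune connection from the Connect step:
# g2 = g.gremlin(query)
```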
[26]:
print('NODES:')
g2._nodes.info()
g2._nodes.sample(3)
NODES:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8106 entries, 0 to 8105
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      8106 non-null   object
 1   label   8106 non-null   object
dtypes: object(2)
memory usage: 126.8+ KB
[26]:
      id                                                 label
4102  ed95a9a5be30e4c8/e212d4b4d4a865a/7e3e41e09dfe6...  website
6496  6ea77fc3ea42bd5b/87be29bd5615083/d4392e74543e413   website
7540  4c980617e02858a4/7de2f069da3a3655/30591f4d8c71...  website
[27]:
print('EDGES:')
print(g2._edges.info())

g2._edges.sample(3)
EDGES:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      10000 non-null  object
 1   label   10000 non-null  object
 2   src     10000 non-null  object
 3   dst     10000 non-null  object
dtypes: object(4)
memory usage: 312.6+ KB
None
[27]:
      id                                        label    src                                                dst
2814  f7803bf0ac187592421c0695792b698f43b596ce  visited  556de63e26686d50/95263499b67bbda1?f300c39f4f33...  48e740025e70e4e38dc87928cd45357c
8081  fe80cddfec97a7dd802cf93cf277da01d9b5fb65  visited  3ccec85ce35ea661?fa76e6024017220f                  23c31ea91be100fd224dff1499939851
2046  4e5290971de41c1e1bcb7433e53ffc6321e410cf  visited  6ea77fc3ea42bd5b/9c280de73bf0fb32/bb555a4d63de...  9e77c2a52fdf9f9b7416e85cabaf7c76
[28]:
%%time

# Enrich nodes dataframe with any available server property data

g3 = g2.fetch_nodes()

print(g3._nodes.info())

g3._nodes.sample(3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8106 entries, 0 to 8105
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      8106 non-null   object
 1   label   8106 non-null   object
dtypes: object(2)
memory usage: 126.8+ KB
None
CPU times: user 4.32 s, sys: 43.9 ms, total: 4.37 s
Wall time: 4.33 s
[28]:
      id                                                 label
1242  4c980617e02858a4/7de2f069da3a3655/30591f4d8c71...  website
3190  493a46bbfd2029ae/4a0cad2f071a71ce/f9ba18598922...  website
6782  4c980617e02858a4/7de2f069da3a3655/30591f4d8c71...  website
[19]:
%%time

g3.plot()
CPU times: user 59.8 ms, sys: 4 ms, total: 63.8 ms
Wall time: 1.68 s
[19]:

Customize your visuals & Embed#

Graphistry visualizes data with smart defaults: community-based coloring, degree-based sizing, force-directed layout, auto-zoom, and built-in visual analytics. However, it often helps to configure your visuals ahead of time.

Example:

  • Enable legend on new column ‘type’

  • Color nodes by node column ‘type’

  • Pick icons based on node type

  • Set background color to match notebook

  • Use a tighter layout

See further examples at the PyGraphistry GitHub repo.

[24]:
%%time

g4 = (g3

      # Add node column 'type' based on gremlin-provided column 'label'
      # The legend auto-detects this column and appears
      .nodes(lambda g: g._nodes.assign(type=g._nodes['label']))

      .encode_point_color('type', categorical_mapping={
          'website': 'blue',
          'transientId': 'green'
      })

      .encode_point_icon('type', categorical_mapping={
          'website': 'link',
          'transientId': 'barcode'
      })

      .addStyle(bg={'color': '#eee'}, page={'title': 'My Graph'})

      # More: https://hub.graphistry.com/docs/api/1/rest/url/
      .settings(url_params={'play': 2000})
)

g4.plot()
CPU times: user 63.5 ms, sys: 3.88 ms, total: 67.3 ms
Wall time: 1.62 s
[24]:

Generate URL for other systems#

[23]:
%%time

url = g4.plot(render=False)

url
CPU times: user 64.8 ms, sys: 0 ns, total: 64.8 ms
Wall time: 1.67 s
[23]:
'https://hub.graphistry.com/graph/graph.html?dataset=7405d0ac396a47ea9ee84acab7b0b31d&type=arrow&viztoken=c5e68946-e922-487e-9484-ef8fc9e2c8f9&usertag=5bf3845f-pygraphistry-0.19.0+5.g5ce1d3fb0&splashAfter=1625879227&info=true&strongGravity=False&play=2000'
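One common use of the returned URL is embedding the visualization in another web page. The sketch below wraps a URL in an iframe tag using only the Python standard library. The helper name graphistry_iframe, the styling, and the placeholder dataset ID are assumptions for illustration, not part of PyGraphistry.

```python
from html import escape


def graphistry_iframe(url, height=600):
    """Wrap a visualization URL in an <iframe> tag (hypothetical helper)."""
    # escape() turns & into &amp; so the URL is a valid HTML attribute value
    return (
        f'<iframe src="{escape(url, quote=True)}" '
        f'style="width: 100%; height: {int(height)}px; border: 0"></iframe>'
    )


# Placeholder URL with the same shape as the one above; the dataset ID is made up
example_url = "https://hub.graphistry.com/graph/graph.html?dataset=EXAMPLE&play=2000"
snippet = graphistry_iframe(example_url)
print(snippet)
```

In practice, the url value returned by g4.plot(render=False) would take the place of example_url. Whether viewers can open the embedded session depends on your Graphistry server's sharing and authentication settings.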

Next steps#

[ ]: