Graphistry Neptune Gremlin identity graph demo#

PyGraphistry helps connect to graph data sources, wrangle them with Python dataframe tools, and visualize them with Graphistry. It’s often used in notebooks, data apps, and dashboards.

This notebook uses PyGraphistry to quickly:

* Connect to Neptune
* Run Gremlin queries via built-in bindings over gremlin_python
* Convert to dataframes for data wrangling: CPU via Pandas and GPU via RAPIDS cuDF
* Visualize by automatically generating rich, interactive, & GPU-accelerated Graphistry graph visualization sessions
* Share & embed your beautiful results

For any API used below, run help(graphistry.the_method) for a quick view of its docs.

The demo uses AWS Neptune’s identity graph sample data from our joint graph-app-kit tutorial. If you have your own dataset, including non-identity data, the example queries should still work.

Setup#

Optional - Quicklaunch via graph-app-kit for Neptune:

* Neptune: It is tested on Neptune’s identity graph database sample kit, and you can swap in your own
* Graphistry: Use your own, get a free Hub account, or launch in AWS alongside Neptune’s VPC and public subnet
* Notebook: Use your own, or launch in AWS alongside Neptune’s VPC and public subnet

If you hit gremlinpython event runtime bugs, try this gist for solving them

Install#

Already provided in graphistry envs

[1]:

# ! pip install -U gremlinpython graphistry
# ! pip install -U pandas
# see https://rapids.ai/ if trying GPU dataframes

Imports#

[2]:

! pip show gremlinpython graphistry | grep 'Name\|Version'

Name: gremlinpython
Version: 3.4.10
Name: graphistry
Version: 0.19.0+5.g5ce1d3fb0

[3]:

import graphistry
graphistry.__version__

[3]:

'0.19.0+5.g5ce1d3fb0'

Configure#

[4]:

# To specify Graphistry account & server, use:
# graphistry.register(api=3, username='...', password='...', protocol='https', server='hub.graphistry.com')
# For more options: https://pygraphistry.readthedocs.io/en/latest/server/register.html
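The Connect cell later unpacks a `GRAPHISTRY_CFG` dict into `graphistry.register()`. One way to build that dict without hardcoding credentials is to read them from environment variables. The following is a minimal sketch: the helper `graphistry_cfg_from_env` and the `GRAPHISTRY_*` variable names are hypothetical and not part of PyGraphistry; only the keyword names (`api`, `username`, `password`, `protocol`, `server`) come from the commented `register()` call above.

```python
import os


def graphistry_cfg_from_env(env=None):
    """Build keyword arguments for graphistry.register() from environment variables.

    The GRAPHISTRY_* variable names are hypothetical, chosen for this sketch.
    """
    env = os.environ if env is None else env
    cfg = {
        'api': 3,
        'protocol': env.get('GRAPHISTRY_PROTOCOL', 'https'),
        'server': env.get('GRAPHISTRY_SERVER', 'hub.graphistry.com'),
    }
    # Only include credentials that are actually set
    for key, var in (('username', 'GRAPHISTRY_USERNAME'),
                     ('password', 'GRAPHISTRY_PASSWORD')):
        if env.get(var):
            cfg[key] = env[var]
    return cfg


GRAPHISTRY_CFG = graphistry_cfg_from_env()
# Later: graphistry.register(**GRAPHISTRY_CFG)
```

Keeping the configuration in a plain dict makes it easy to print the non-secret parts when debugging a failed login.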

[29]:

NEPTUNE_READER_PROTOCOL='wss'
NEPTUNE_READER_HOST='neptunedbcluster-abc.cluster-ro-xyz.us-east-1.neptune.amazonaws.com'
NEPTUNE_READER_PORT='8182'

endpoint = f'{NEPTUNE_READER_PROTOCOL}://{NEPTUNE_READER_HOST}:{NEPTUNE_READER_PORT}/gremlin'
endpoint

[29]:

'wss://neptunedbcluster-abc.cluster-ro-xyz.us-east-1.neptune.amazonaws.com:8182/gremlin'
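If the endpoint is assembled in more than one place, it may help to wrap the f-string in a small function that also catches common mistakes, such as pasting a full URL or a `host:port` pair into the host field. This is a sketch using only the standard library; `neptune_endpoint` is a hypothetical helper name, and the `/gremlin` path and default port 8182 are taken from the cell above.

```python
from urllib.parse import urlsplit


def neptune_endpoint(host, port=8182, protocol='wss'):
    """Assemble a Gremlin websocket endpoint URL and sanity-check it."""
    if protocol not in ('ws', 'wss'):
        raise ValueError(f'expected ws or wss, got {protocol!r}')
    port = int(port)  # accepts '8182' or 8182
    if not 0 < port < 65536:
        raise ValueError(f'port out of range: {port}')
    url = f'{protocol}://{host}:{port}/gremlin'
    # If the host contained a scheme, path, or port, it will not round-trip
    if urlsplit(url).hostname != host.lower():
        raise ValueError(f'host did not parse cleanly: {host!r}')
    return url


endpoint = neptune_endpoint(
    'neptunedbcluster-abc.cluster-ro-xyz.us-east-1.neptune.amazonaws.com',
    port='8182')
```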

[6]:

#import logging
#logging.basicConfig(level=logging.DEBUG)

Connect#

[7]:

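# GRAPHISTRY_CFG is not defined in this notebook: set it to a dict of
# graphistry.register() keyword arguments (see Configure above)
graphistry.register(**GRAPHISTRY_CFG)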

g = graphistry.neptune(endpoint=endpoint)

g._gremlin_client

[7]:

<gremlin_python.driver.client.Client at 0x7fdfc230e3d0>

Query & plot#

* PyGraphistry automatically converts gremlin results into node/edge dataframes.
* Edge queries typically only return node IDs; call fetch_nodes() to enrich your g._nodes dataframe.
* PyGraphistry plots dataframes.
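To see why an edge query yields only a thin node table, consider what edge records carry: an edge ID, a label, and the IDs of its two endpoints. The sketch below is a conceptual illustration in plain Python, not PyGraphistry's actual implementation; `nodes_from_edges` and the sample records are hypothetical. It derives a node list from the distinct `src` and `dst` values, which is also why the node table can have fewer rows than twice the edge count.

```python
def nodes_from_edges(edges):
    """Collect the distinct endpoint IDs referenced by a list of edge records."""
    seen = {}
    for edge in edges:
        for key in ('src', 'dst'):
            # dicts preserve insertion order, so nodes appear in first-seen order
            seen.setdefault(edge[key], {'id': edge[key]})
    return list(seen.values())


# Hypothetical edge records shaped like the src/dst columns used in this notebook
edges = [
    {'id': 'e1', 'label': 'visited', 'src': 'a', 'dst': 'x'},
    {'id': 'e2', 'label': 'visited', 'src': 'b', 'dst': 'x'},
]
nodes = nodes_from_edges(edges)
print(nodes)
```

Node records built this way hold only an ID; any further properties have to come from a follow-up lookup against the database, which is the role of fetch_nodes() here.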

[25]:

%%time

g2 = g.gremlin('g.E().limit(10000)')

CPU times: user 4.96 s, sys: 27.9 ms, total: 4.99 s
Wall time: 4.95 s

[26]:

print('NODES:')
g2._nodes.info()
g2._nodes.sample(3)

NODES:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8106 entries, 0 to 8105
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      8106 non-null   object
 1   label   8106 non-null   object
dtypes: object(2)
memory usage: 126.8+ KB

[26]:

	id	label
4102	ed95a9a5be30e4c8/e212d4b4d4a865a/7e3e41e09dfe6...	website
6496	6ea77fc3ea42bd5b/87be29bd5615083/d4392e74543e413	website
7540	4c980617e02858a4/7de2f069da3a3655/30591f4d8c71...	website

[27]:

print('EDGES:')
print(g2._edges.info())

g2._edges.sample(3)

EDGES:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      10000 non-null  object
 1   label   10000 non-null  object
 2   src     10000 non-null  object
 3   dst     10000 non-null  object
dtypes: object(4)
memory usage: 312.6+ KB
None

[27]:

	id	label	src	dst
2814	f7803bf0ac187592421c0695792b698f43b596ce	visited	556de63e26686d50/95263499b67bbda1?f300c39f4f33...	48e740025e70e4e38dc87928cd45357c
8081	fe80cddfec97a7dd802cf93cf277da01d9b5fb65	visited	3ccec85ce35ea661?fa76e6024017220f	23c31ea91be100fd224dff1499939851
2046	4e5290971de41c1e1bcb7433e53ffc6321e410cf	visited	6ea77fc3ea42bd5b/9c280de73bf0fb32/bb555a4d63de...	9e77c2a52fdf9f9b7416e85cabaf7c76

[28]:

%%time

# Enrich nodes dataframe with any available server property data

g3 = g2.fetch_nodes()

print(g3._nodes.info())

g3._nodes.sample(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8106 entries, 0 to 8105
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      8106 non-null   object
 1   label   8106 non-null   object
dtypes: object(2)
memory usage: 126.8+ KB
None
CPU times: user 4.32 s, sys: 43.9 ms, total: 4.37 s
Wall time: 4.33 s

[28]:

	id	label
1242	4c980617e02858a4/7de2f069da3a3655/30591f4d8c71...	website
3190	493a46bbfd2029ae/4a0cad2f071a71ce/f9ba18598922...	website
6782	4c980617e02858a4/7de2f069da3a3655/30591f4d8c71...	website

[19]:

%%time

g3.plot()

CPU times: user 59.8 ms, sys: 4 ms, total: 63.8 ms
Wall time: 1.68 s

[19]:

Customize your visuals & Embed#

Graphistry visualizes data with smart defaults: community-based coloring, degree-based sizing, force-directed layout, auto-zoom, and built-in visual analytics. However, it often helps to configure your visuals ahead of time.

Example:

* Enable legend on new column ‘type’
* Color nodes by node column ‘type’
* Pick icons based on node type
* Set background color to match notebook
* Use a tighter layout

See further examples at the PyGraphistry GitHub repo.

[24]:

%%time

g4 = (g3

      # Add node column 'type' based on gremlin-provided column 'label'
      # The legend auto-detects this column and appears
      .nodes(lambda g: g._nodes.assign(type=g._nodes['label']))

      .encode_point_color('type', categorical_mapping={
          'website': 'blue',
          'transientId': 'green'
      })

      .encode_point_icon('type', categorical_mapping={
          'website': 'link',
          'transientId': 'barcode'
      })

      .addStyle(bg={'color': '#eee'}, page={'title': 'My Graph'})

      # More: https://hub.graphistry.com/docs/api/1/rest/url/
      .settings(url_params={'play': 2000})
)

g4.plot()

CPU times: user 63.5 ms, sys: 3.88 ms, total: 67.3 ms
Wall time: 1.62 s

[24]:

Generate URL for other systems#

[23]:

%%time

url = g4.plot(render=False)

url

CPU times: user 64.8 ms, sys: 0 ns, total: 64.8 ms
Wall time: 1.67 s

[23]:

'https://hub.graphistry.com/graph/graph.html?dataset=7405d0ac396a47ea9ee84acab7b0b31d&type=arrow&viztoken=c5e68946-e922-487e-9484-ef8fc9e2c8f9&usertag=5bf3845f-pygraphistry-0.19.0+5.g5ce1d3fb0&splashAfter=1625879227&info=true&strongGravity=False&play=2000'
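Once you have the URL string, embedding it elsewhere is mostly a matter of placing it in an iframe. The sketch below uses only the standard library; `iframe_html` and `url_settings` are hypothetical helper names, and the example URL uses a placeholder dataset ID rather than a real session. It escapes the URL for use in an HTML attribute and shows how to read settings such as `play=2000` back out of the query string.

```python
from html import escape
from urllib.parse import parse_qs, urlsplit


def iframe_html(url, height=600):
    """Return an <iframe> snippet for a visualization URL (hypothetical helper)."""
    if urlsplit(url).scheme != 'https':
        raise ValueError('expected an https URL')
    # escape() makes '&' and quotes in the query string safe inside the attribute
    src = escape(url, quote=True)
    style = f'width:100%; height:{int(height)}px; border:0'
    return f'<iframe src="{src}" style="{style}"></iframe>'


def url_settings(url):
    """Read the query-string settings back out of a visualization URL."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}


# Placeholder URL: DATASET_ID stands in for the value returned by plot(render=False)
example_url = 'https://hub.graphistry.com/graph/graph.html?dataset=DATASET_ID&play=2000'
snippet = iframe_html(example_url, height=500)
settings = url_settings(example_url)
```

The resulting snippet can be pasted into any page or dashboard that allows iframes.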

Next steps#

* Go deeper with PyGraphistry: Examples for customization, GPU graph analytics, and more
* Explore gremlinpython
* Dashboarding with graph-app-kit / Streamlit’s Neptune integration
  * Amazon Neptune’s launch announcement & tutorial
* Try a CSV upload on Hub or launch your own Graphistry server
* Additional Graphistry APIs: REST, React, JS, …
