GFQL: Hop-and-chain - PyGraphistry Cypher-style graph pattern matching on dataframes#
PyGraphistry supports a rich subset of the popular Cypher graph query language, which you can run on dataframes without needing to install a database nor native libraries. It is natively integrated with dataframes and thus has a Python-native syntax rather than the traditional string syntax.
PyGraphistry graph pattern matching features key similarities with Cypher
Multi-hop searching
Predicates on node and edge attributes
Ability to identify matching nodes and edges
It is different in a few key ways
Pure PyData (Python/C++/Fortran): No need to install databases, Java, etc.,
pip install pygraphistry
is enoughIt is collection-oriented rather than path-oriented: All operations are guaranteed to translate to efficiently vectorized dataframe operations rather than asymptotically slower per-row path operations typical of traditional graph query engines
Advanced users can insert custom predicates as native Python dataframe code
Tutorial
Install & configure
Load & enrich a US congress twitter interaction dataset
Simple graph filtering:
g.hop()
andg.chain([...])
Multi-hop and paths-between-nodes pattern mining
Advanced filter predicates
Result labeling
1. Install & configure#
[ ]:
# ! pip install graphistry[igraph]
Imports#
[141]:
import pandas as pd
import graphistry
from graphistry import (
# graph operators
n, e_undirected, e_forward, e_reverse,
# attribute predicates
is_in, ge, startswith, contains, match as match_re
)
[ ]:
graphistry.register(api=3, username='...', password='...')
2. Load & enrich a US congress twitter interaction dataset#
Data#
Download
Turn json into a Pandas edges dataframe
Turn edges dataframe into a PyGraphistry graph
Enrich nodes and edges with some useful graph metrics
Visualize full graph to test
[9]:
# ! wget -q https://snap.stanford.edu/data/congress_network.zip
# ! unzip congress_network.zip
total 1.2M
drwxr-xr-x 1 root root 4.0K Dec 4 03:56 .
drwxr-xr-x 1 root root 4.0K Dec 4 03:33 ..
-rw-r--r-- 1 root root 150K May 9 2017 Attribute
-rw-r--r-- 1 root root 14K May 9 2017 Class_info
drwxr-xr-x 4 root root 4.0K Nov 30 14:24 .config
-rw-r--r-- 1 root root 190K Aug 5 05:26 congress_network.zip
-rw-r--r-- 1 root root 320K May 9 2017 edgelist
drwxr-xr-x 1 root root 4.0K Nov 30 14:27 sample_data
-rw-r--r-- 1 root root 16 May 9 2017 Statistics
-rw-r--r-- 1 root root 221K Dec 4 03:53 twitter.zip
-rw-r--r-- 1 root root 299K May 9 2017 vertex2aid
[40]:
import json
with open('congress_network/congress_network_data.json', 'r') as file:
data = json.load(file)
edges = []
for i, name in enumerate(data[0]['usernameList']):
for ii, j in enumerate(data[0]['outList'][i]):
edges.append({
'from': name,
'to': data[0]['usernameList'][j],
'weight': data[0]['outWeight'][i][ii]
})
edges_df = pd.DataFrame(edges)
print(edges_df.shape)
edges_df.sample(5)
(13289, 3)
[40]:
from | to | weight | |
---|---|---|---|
11112 | RepBobbyRush | janschakowsky | 0.034364 |
3836 | RepCori | Ilhan | 0.015936 |
5282 | RepTedDeutch | RepDWStweets | 0.003268 |
12352 | BennieGThompson | RepStricklandWA | 0.006849 |
9358 | RepCarolMiller | RepTroyNehls | 0.005291 |
Load dataframe as a PyGraphistry graph#
Turn into a graph and precompute some useful graph metrics
Recall that a g
object, underneath, is essentially just two dataframes, g._edges
and g._nodes
, and with many useful graph methods:
[77]:
# Shape
g = graphistry.edges(edges_df, 'from', 'to')
# Enrich & style
# Tip: Switch from compute_igraph to compute_cugraph when GPUs are available
g2 = (g
.materialize_nodes()
.nodes(lambda g: g._nodes.assign(title=g._nodes.id))
.edges(lambda g: g._edges.assign(weight2=g._edges.weight))
.bind(point_title='title')
.compute_igraph('community_infomap')
.compute_igraph('pagerank')
.get_degrees()
.encode_point_color(
'community_infomap',
as_categorical=True,
categorical_mapping={
0: '#32a9a2', # vibrant teal
1: '#ff6b6b', # soft coral
2: '#f9d342', # muted yellow
}
)
)
g2._nodes
WARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to FalseWARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDsWARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to FalseWARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDs
[77]:
id | title | community_infomap | pagerank | degree_in | degree_out | degree | |
---|---|---|---|---|---|---|---|
0 | SenatorBaldwin | SenatorBaldwin | 0 | 0.001422 | 26 | 20 | 46 |
1 | SenJohnBarrasso | SenJohnBarrasso | 0 | 0.001179 | 22 | 19 | 41 |
2 | SenatorBennet | SenatorBennet | 0 | 0.001995 | 33 | 22 | 55 |
3 | MarshaBlackburn | MarshaBlackburn | 0 | 0.001331 | 18 | 38 | 56 |
4 | SenBlumenthal | SenBlumenthal | 0 | 0.001672 | 30 | 35 | 65 |
... | ... | ... | ... | ... | ... | ... | ... |
470 | RepJoeWilson | RepJoeWilson | 1 | 0.001780 | 21 | 38 | 59 |
471 | RobWittman | RobWittman | 1 | 0.001017 | 13 | 19 | 32 |
472 | rep_stevewomack | rep_stevewomack | 1 | 0.002637 | 35 | 19 | 54 |
473 | RepJohnYarmuth | RepJohnYarmuth | 2 | 0.000555 | 5 | 20 | 25 |
474 | RepLeeZeldin | RepLeeZeldin | 1 | 0.000511 | 3 | 25 | 28 |
475 rows × 7 columns
[79]:
g2.plot()
[79]:
3. Simple filtering: g.hop()
& g.chain([...])
#
We can filter by nodes, edges, and combinations of them
The result is a graph where we can inspect the node and edge tables, or perform further graph operations, like visualization or further searches
Key concepts
There are 2 key methods: * g.hop(...)
: filter triples of source node, edge, destination node * g.chain([....])
: arbitrarily long sequence of node and edge predicates
They reuse column operations core to dataframe libraries, such as comparison operators on strings, numbers, and dates
Sample tasks
This section shows how to:
Find SenSchumer and his immediate community (infomap metric)
Look at his entire community
Find everyone with high edge weight from/to SenSchumer; 2 hops either direction
Find everyone in his community
[80]:
g2.chain([n({'title': 'SenSchumer'})])._nodes
[80]:
id | title | community_infomap | pagerank | degree_in | degree_out | degree | |
---|---|---|---|---|---|---|---|
0 | SenSchumer | SenSchumer | 2 | 0.001296 | 25 | 97 | 122 |
You can also pass chain()
a sequence of node and edge expressions
[81]:
g_immediate_community2 = g2.chain([n({'title': 'SenSchumer'}), e_undirected(), n({'community_infomap': 2})])
print(len(g_immediate_community2._nodes), 'senators', len(g_immediate_community2._edges), 'relns')
g_immediate_community2._edges[['from', 'to', 'weight2']].sort_values(by=['weight2']).head(10)
58 senators 69 relns
[81]:
from | to | weight2 | |
---|---|---|---|
22 | SenSchumer | JacksonLeeTX18 | 0.001546 |
46 | SenSchumer | RepSarbanes | 0.001546 |
23 | SenSchumer | RepJayapal | 0.001546 |
53 | SenSchumer | PeterWelch | 0.001546 |
25 | SenSchumer | RepDaveJoyce | 0.001546 |
26 | SenSchumer | RepRobinKelly | 0.001546 |
28 | SenSchumer | RepAndyKimNJ | 0.001546 |
29 | SenSchumer | RepBarbaraLee | 0.001546 |
50 | SenSchumer | RepPaulTonko | 0.001546 |
32 | SenSchumer | RepMeijer | 0.001546 |
[82]:
g_immediate_community2.plot()
[82]:
Often, we are just filtering on a src node / edge / dst node triple, so hop()
is a short-form for this. All the hop()
parameters can also be passed to edge expressions as well.
[83]:
g_community2 = g2.hop(source_node_match={'community_infomap': 2}, destination_node_match={'community_infomap': 2})
print(len(g_community2._nodes), 'senators', len(g_community2._edges), 'relns')
g_community2._edges.sort_values(by=['weight2']).head(10)
214 senators 4993 relns
[83]:
from | to | weight | weight2 | |
---|---|---|---|---|
378 | RepDonBeyer | RepSpeier | 0.000658 | 0.000658 |
354 | RepDonBeyer | repcleaver | 0.000658 | 0.000658 |
353 | RepDonBeyer | RepYvetteClarke | 0.000658 | 0.000658 |
352 | RepDonBeyer | RepCasten | 0.000658 | 0.000658 |
349 | RepDonBeyer | RepBeatty | 0.000658 | 0.000658 |
360 | RepDonBeyer | RepGaramendi | 0.000658 | 0.000658 |
361 | RepDonBeyer | RepChuyGarcia | 0.000658 | 0.000658 |
362 | RepDonBeyer | RepRaulGrijalva | 0.000658 | 0.000658 |
365 | RepDonBeyer | USRepKeating | 0.000658 | 0.000658 |
366 | RepDonBeyer | RepRickLarsen | 0.000658 | 0.000658 |
[86]:
g_community2.encode_point_color('pagerank', ['blue', 'yellow', 'red'], as_continuous=True).plot()
[86]:
4. Multi-hop and paths-between-nodes pattern mining#
Method chain([...])
can be used for looking more than one hop out, and even finding paths between nodes.
Ex: All people bridging SenSchumer and SpeakerPelosi
[94]:
g_shumer_pelosi_bridges = g2.chain([
n({'title': 'SenSchumer'}),
e_undirected(),
n(),
e_undirected(),
n({'title': 'SpeakerPelosi'})
])
print(len(g_shumer_pelosi_bridges._nodes), 'senators')
g_shumer_pelosi_bridges._edges.sort_values(by='weight').head(5)
66 senators
[94]:
from | to | weight | weight2 | |
---|---|---|---|---|
86 | RepJayapal | SpeakerPelosi | 0.000871 | 0.000871 |
47 | SenSchumer | RepMeijer | 0.001546 | 0.001546 |
23 | SenSchumer | RepBuddyCarter | 0.001546 | 0.001546 |
24 | SenSchumer | RepJudyChu | 0.001546 | 0.001546 |
26 | SenSchumer | repcleaver | 0.001546 | 0.001546 |
[92]:
g_shumer_pelosi_bridges.plot()
[92]:
5. Advanced filter predicates#
We can use a variety of predicates for filtering nodes and edges beyond attribute value equality.
Common tasks include comparing attributes using: * Set inclusion: is_in([...])
* Numeric comparisons: gt(...)
, lt(...)
, ge(...)
, le(...)
* String comparison: startswith(...)
, endswith(...)
, contains(...)
* Regular expression matching: matches(...)
* Duplicate checking: duplicated()
Graph where nodes are in the top 20 pagerank:
[134]:
top_20_pr = g2._nodes.pagerank.sort_values(ascending=False, ignore_index=True)[19]
top_20_pr
[134]:
0.005888600097034367
[128]:
g_high_pr = g2.chain([
n({'pagerank': ge(top_20_pr)}),
e_undirected(),
n({'pagerank': ge(top_20_pr)}),
])
len(g_high_pr._nodes)
[128]:
20
[129]:
g_high_pr.plot()
[129]:
Graph where the name includes Leader
[136]:
g_leaders = g2.hop(
source_node_match={'title': contains('Leader')},
destination_node_match = {'title': contains('Leader')}
)
print(len(g_leaders._nodes), 'leaders')
g_leaders.plot()
2 leaders
[136]:
Graph of leaders and senators
[139]:
g_leaders_and_senators = g2.hop(
source_node_match={'title': match_re(r'Sen|Leader')},
destination_node_match = {'title': match_re(r'Sen|Leader')}
)
print(len(g_leaders_and_senators._nodes), 'leaders and senators')
g_leaders_and_senators.plot()
67 leaders and senators
[139]:
6. Result labeling#
It can be useful to name node and edges within the path query for downstream reasoning:
[156]:
g_bridges2 = g2.chain([
n({'title': 'SenSchumer'}),
e_undirected(name='from_schumer'),
n(name='found_bridge'),
e_undirected(name='from_pelosi'),
n({'title': 'SpeakerPelosi'})
])
print(len(g_bridges2._nodes), 'senators in full graph')
named = g_bridges2._nodes[ g_bridges2._nodes.found_bridge ]
print(len(named), 'bridging senators')
edges = g_bridges2._edges
print(len(edges[edges.from_schumer]), 'relns from_schumer', len(edges[edges.from_pelosi]), 'relns from_pelosi')
g_bridges2.encode_point_color(
'found_bridge',
as_categorical=True,
categorical_mapping={
True: 'orange',
False: 'silver'
}
).plot()
66 senators in full graph
64 bridging senators
75 relns from_schumer 83 relns from_pelosi
[156]:
[ ]: