GFQL: Hop-and-chain - PyGraphistry Cypher-style graph pattern matching on dataframes#

PyGraphistry supports a rich subset of the popular Cypher graph query language, which you can run on dataframes without needing to install a database nor native libraries. It is natively integrated with dataframes and thus has a Python-native syntax rather than the traditional string syntax.

PyGraphistry graph pattern matching features key similarities with Cypher

  • Multi-hop searching

  • Predicates on node and edge attributes

  • Ability to identify matching nodes and edges

It is different in a few key ways

  • Pure PyData (Python/C++/Fortran): No need to install databases, Java, etc., pip install pygraphistry is enough

  • It is collection-oriented rather than path-oriented: All operations are guaranteed to translate to efficiently vectorized dataframe operations rather than asymptotically slower per-row path operations typical of traditional graph query engines

  • Advanced users can insert custom predicates as native Python dataframe code


Tutorial

  1. Install & configure

  2. Load & enrich a US congress twitter interaction dataset

  3. Simple graph filtering: g.hop() and g.chain([...])

  4. Multi-hop and paths-between-nodes pattern mining

  5. Advanced filter predicates

  6. Result labeling

1. Install & configure#

[ ]:
# ! pip install graphistry[igraph]

Imports#

[141]:
import pandas as pd

import graphistry

from graphistry import (

    # graph operators
    n, e_undirected, e_forward, e_reverse,

    # attribute predicates
    is_in, ge, startswith, contains, match as match_re
)
[ ]:
graphistry.register(api=3, username='...', password='...')

2. Load & enrich a US congress twitter interaction dataset#

Data#

  • Download

  • Turn json into a Pandas edges dataframe

  • Turn edges dataframe into a PyGraphistry graph

  • Enrich nodes and edges with some useful graph metrics

  • Visualize full graph to test

[9]:
# ! wget -q https://snap.stanford.edu/data/congress_network.zip
# ! unzip congress_network.zip

total 1.2M
drwxr-xr-x 1 root root 4.0K Dec  4 03:56 .
drwxr-xr-x 1 root root 4.0K Dec  4 03:33 ..
-rw-r--r-- 1 root root 150K May  9  2017 Attribute
-rw-r--r-- 1 root root  14K May  9  2017 Class_info
drwxr-xr-x 4 root root 4.0K Nov 30 14:24 .config
-rw-r--r-- 1 root root 190K Aug  5 05:26 congress_network.zip
-rw-r--r-- 1 root root 320K May  9  2017 edgelist
drwxr-xr-x 1 root root 4.0K Nov 30 14:27 sample_data
-rw-r--r-- 1 root root   16 May  9  2017 Statistics
-rw-r--r-- 1 root root 221K Dec  4 03:53 twitter.zip
-rw-r--r-- 1 root root 299K May  9  2017 vertex2aid
[40]:
import json

with open('congress_network/congress_network_data.json', 'r') as file:
    data = json.load(file)

edges = []
for i, name in enumerate(data[0]['usernameList']):
  for ii, j in enumerate(data[0]['outList'][i]):
    edges.append({
        'from': name,
        'to': data[0]['usernameList'][j],
        'weight': data[0]['outWeight'][i][ii]
    })
edges_df = pd.DataFrame(edges)

print(edges_df.shape)
edges_df.sample(5)
(13289, 3)
[40]:
from to weight
11112 RepBobbyRush janschakowsky 0.034364
3836 RepCori Ilhan 0.015936
5282 RepTedDeutch RepDWStweets 0.003268
12352 BennieGThompson RepStricklandWA 0.006849
9358 RepCarolMiller RepTroyNehls 0.005291

Load dataframe as a PyGraphistry graph#

Turn into a graph and precompute some useful graph metrics

Recall that a g object, underneath, is essentially just two dataframes, g._edges and g._nodes, and with many useful graph methods:

[77]:
# Shape
g = graphistry.edges(edges_df, 'from', 'to')

# Enrich & style
# Tip: Switch from compute_igraph to compute_cugraph when GPUs are available
g2 = (g
      .materialize_nodes()
      .nodes(lambda g: g._nodes.assign(title=g._nodes.id))
      .edges(lambda g: g._edges.assign(weight2=g._edges.weight))
      .bind(point_title='title')
      .compute_igraph('community_infomap')
      .compute_igraph('pagerank')
      .get_degrees()
      .encode_point_color(
          'community_infomap',
          as_categorical=True,
          categorical_mapping={
              0: '#32a9a2', # vibrant teal
              1: '#ff6b6b', # soft coral
              2: '#f9d342', # muted yellow
          }
      )
)

g2._nodes
WARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to FalseWARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDsWARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to FalseWARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDs
[77]:
id title community_infomap pagerank degree_in degree_out degree
0 SenatorBaldwin SenatorBaldwin 0 0.001422 26 20 46
1 SenJohnBarrasso SenJohnBarrasso 0 0.001179 22 19 41
2 SenatorBennet SenatorBennet 0 0.001995 33 22 55
3 MarshaBlackburn MarshaBlackburn 0 0.001331 18 38 56
4 SenBlumenthal SenBlumenthal 0 0.001672 30 35 65
... ... ... ... ... ... ... ...
470 RepJoeWilson RepJoeWilson 1 0.001780 21 38 59
471 RobWittman RobWittman 1 0.001017 13 19 32
472 rep_stevewomack rep_stevewomack 1 0.002637 35 19 54
473 RepJohnYarmuth RepJohnYarmuth 2 0.000555 5 20 25
474 RepLeeZeldin RepLeeZeldin 1 0.000511 3 25 28

475 rows × 7 columns

[79]:
g2.plot()
[79]:

3. Simple filtering: g.hop() & g.chain([...])#

We can filter by nodes, edges, and combinations of them

The result is a graph where we can inspect the node and edge tables, or perform further graph operations, like visualization or further searches

Key concepts

There are 2 key methods: * g.hop(...): filter triples of source node, edge, destination node * g.chain([....]): arbitrarily long sequence of node and edge predicates

They reuse column operations core to dataframe libraries, such as comparison operators on strings, numbers, and dates

Sample tasks

This section shows how to:

  • Find SenSchumer and his immediate community (infomap metric)

  • Look at his entire community

  • Find everyone with high edge weight from/to SenSchumer; 2 hops either direction

  • Find everyone in his community

[80]:
g2.chain([n({'title': 'SenSchumer'})])._nodes
[80]:
id title community_infomap pagerank degree_in degree_out degree
0 SenSchumer SenSchumer 2 0.001296 25 97 122

You can also pass chain() a sequence of node and edge expressions

[81]:
g_immediate_community2 = g2.chain([n({'title': 'SenSchumer'}), e_undirected(), n({'community_infomap': 2})])

print(len(g_immediate_community2._nodes), 'senators', len(g_immediate_community2._edges), 'relns')
g_immediate_community2._edges[['from', 'to', 'weight2']].sort_values(by=['weight2']).head(10)
58 senators 69 relns
[81]:
from to weight2
22 SenSchumer JacksonLeeTX18 0.001546
46 SenSchumer RepSarbanes 0.001546
23 SenSchumer RepJayapal 0.001546
53 SenSchumer PeterWelch 0.001546
25 SenSchumer RepDaveJoyce 0.001546
26 SenSchumer RepRobinKelly 0.001546
28 SenSchumer RepAndyKimNJ 0.001546
29 SenSchumer RepBarbaraLee 0.001546
50 SenSchumer RepPaulTonko 0.001546
32 SenSchumer RepMeijer 0.001546
[82]:
g_immediate_community2.plot()
[82]:

Often, we are just filtering on a src node / edge / dst node triple, so hop() is a short-form for this. All the hop() parameters can also be passed to edge expressions as well.

[83]:
g_community2 = g2.hop(source_node_match={'community_infomap': 2}, destination_node_match={'community_infomap': 2})

print(len(g_community2._nodes), 'senators', len(g_community2._edges), 'relns')
g_community2._edges.sort_values(by=['weight2']).head(10)
214 senators 4993 relns
[83]:
from to weight weight2
378 RepDonBeyer RepSpeier 0.000658 0.000658
354 RepDonBeyer repcleaver 0.000658 0.000658
353 RepDonBeyer RepYvetteClarke 0.000658 0.000658
352 RepDonBeyer RepCasten 0.000658 0.000658
349 RepDonBeyer RepBeatty 0.000658 0.000658
360 RepDonBeyer RepGaramendi 0.000658 0.000658
361 RepDonBeyer RepChuyGarcia 0.000658 0.000658
362 RepDonBeyer RepRaulGrijalva 0.000658 0.000658
365 RepDonBeyer USRepKeating 0.000658 0.000658
366 RepDonBeyer RepRickLarsen 0.000658 0.000658
[86]:
g_community2.encode_point_color('pagerank', ['blue', 'yellow', 'red'], as_continuous=True).plot()
[86]:

4. Multi-hop and paths-between-nodes pattern mining#

Method chain([...]) can be used for looking more than one hop out, and even finding paths between nodes.

Ex: All people bridging SenSchumer and SpeakerPelosi

[94]:
g_shumer_pelosi_bridges = g2.chain([
    n({'title': 'SenSchumer'}),
    e_undirected(),
    n(),
    e_undirected(),
    n({'title': 'SpeakerPelosi'})
])

print(len(g_shumer_pelosi_bridges._nodes), 'senators')
g_shumer_pelosi_bridges._edges.sort_values(by='weight').head(5)
66 senators
[94]:
from to weight weight2
86 RepJayapal SpeakerPelosi 0.000871 0.000871
47 SenSchumer RepMeijer 0.001546 0.001546
23 SenSchumer RepBuddyCarter 0.001546 0.001546
24 SenSchumer RepJudyChu 0.001546 0.001546
26 SenSchumer repcleaver 0.001546 0.001546
[92]:
g_shumer_pelosi_bridges.plot()
[92]:

5. Advanced filter predicates#

We can use a variety of predicates for filtering nodes and edges beyond attribute value equality.

Common tasks include comparing attributes using: * Set inclusion: is_in([...]) * Numeric comparisons: gt(...), lt(...), ge(...), le(...) * String comparison: startswith(...), endswith(...), contains(...) * Regular expression matching: matches(...) * Duplicate checking: duplicated()

Graph where nodes are in the top 20 pagerank:

[134]:
top_20_pr = g2._nodes.pagerank.sort_values(ascending=False, ignore_index=True)[19]
top_20_pr
[134]:
0.005888600097034367
[128]:
g_high_pr = g2.chain([
    n({'pagerank': ge(top_20_pr)}),
    e_undirected(),
    n({'pagerank': ge(top_20_pr)}),
])

len(g_high_pr._nodes)
[128]:
20
[129]:
g_high_pr.plot()
[129]:

Graph where the name includes Leader

[136]:
g_leaders = g2.hop(
    source_node_match={'title': contains('Leader')},
    destination_node_match = {'title': contains('Leader')}
)

print(len(g_leaders._nodes), 'leaders')

g_leaders.plot()
2 leaders
[136]:

Graph of leaders and senators

[139]:
g_leaders_and_senators = g2.hop(
    source_node_match={'title': match_re(r'Sen|Leader')},
    destination_node_match = {'title': match_re(r'Sen|Leader')}
)

print(len(g_leaders_and_senators._nodes), 'leaders and senators')

g_leaders_and_senators.plot()
67 leaders and senators
[139]:

6. Result labeling#

It can be useful to name node and edges within the path query for downstream reasoning:

[156]:
g_bridges2 = g2.chain([
    n({'title': 'SenSchumer'}),
    e_undirected(name='from_schumer'),
    n(name='found_bridge'),
    e_undirected(name='from_pelosi'),
    n({'title': 'SpeakerPelosi'})
])

print(len(g_bridges2._nodes), 'senators in full graph')

named = g_bridges2._nodes[ g_bridges2._nodes.found_bridge ]
print(len(named), 'bridging senators')
edges = g_bridges2._edges
print(len(edges[edges.from_schumer]), 'relns from_schumer', len(edges[edges.from_pelosi]), 'relns from_pelosi')

g_bridges2.encode_point_color(
    'found_bridge',
    as_categorical=True,
    categorical_mapping={
        True: 'orange',
        False: 'silver'
    }
).plot()
66 senators in full graph
64 bridging senators
75 relns from_schumer 83 relns from_pelosi
[156]:
[ ]: