Graph Pattern Mining with hop() and gfql()#
This tutorial demonstrates how to use PyGraphistry’s hop() and gfql() methods for graph pattern mining and traversal.
Key concepts:
- g.hop(): Filter by source node → edge → destination node patterns
- g.gfql(): Chain multiple node and edge filters for complex patterns
- Predicates: Use comparisons, string matching, and other filters
- Result labeling: Name intermediate results for analysis
We’ll explore these concepts using a US Congress Twitter interaction dataset.
1. Install & configure#
[ ]:
import pandas as pd
import graphistry
from graphistry.compute.predicates import is_in, gt, lt, ge, le, eq, ne
from graphistry.compute.predicates import contains, startswith, endswith
from graphistry import match as match_re  # Regex matching; is_in only tests set membership
from graphistry.compute.ast import n, e_forward, e_reverse, e_undirected, e
[ ]:
# Register after importing graphistry; fill in your own credentials
graphistry.register(api=3, username='...', password='...')
2. Load & enrich a US congress twitter interaction dataset#
[ ]:
# Load the US Congress Twitter interaction dataset
# This dataset contains Twitter interactions between members of the US Congress
edges_df = pd.read_csv('https://raw.githubusercontent.com/graphistry/pygraphistry/master/demos/data/twitter_congress_edges.csv')
print(f"Loaded {len(edges_df)} edges")
edges_df.head()
[77]:
# Shape
g = graphistry.edges(edges_df, 'from', 'to')
# Enrich & style
# Tip: Switch from compute_igraph to compute_cugraph when GPUs are available
g2 = (g
    .materialize_nodes()
    .nodes(lambda g: g._nodes.assign(title=g._nodes.id))
    .edges(lambda g: g._edges.assign(weight2=g._edges.weight))
    .bind(point_title='title')
    .compute_igraph('community_infomap')
    .compute_igraph('pagerank')
    .get_degrees()
    .encode_point_color(
        'community_infomap',
        as_categorical=True,
        categorical_mapping={
            0: '#32a9a2',  # vibrant teal
            1: '#ff6b6b',  # soft coral
            2: '#f9d342',  # muted yellow
        }
    )
)
g2._nodes
WARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to False
WARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDs
WARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to False
WARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDs
[77]:
| | id | title | community_infomap | pagerank | degree_in | degree_out | degree |
|---|---|---|---|---|---|---|---|
| 0 | SenatorBaldwin | SenatorBaldwin | 0 | 0.001422 | 26 | 20 | 46 |
| 1 | SenJohnBarrasso | SenJohnBarrasso | 0 | 0.001179 | 22 | 19 | 41 |
| 2 | SenatorBennet | SenatorBennet | 0 | 0.001995 | 33 | 22 | 55 |
| 3 | MarshaBlackburn | MarshaBlackburn | 0 | 0.001331 | 18 | 38 | 56 |
| 4 | SenBlumenthal | SenBlumenthal | 0 | 0.001672 | 30 | 35 | 65 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 470 | RepJoeWilson | RepJoeWilson | 1 | 0.001780 | 21 | 38 | 59 |
| 471 | RobWittman | RobWittman | 1 | 0.001017 | 13 | 19 | 32 |
| 472 | rep_stevewomack | rep_stevewomack | 1 | 0.002637 | 35 | 19 | 54 |
| 473 | RepJohnYarmuth | RepJohnYarmuth | 2 | 0.000555 | 5 | 20 | 25 |
| 474 | RepLeeZeldin | RepLeeZeldin | 1 | 0.000511 | 3 | 25 | 28 |
475 rows × 7 columns
[79]:
g2.plot()
[79]:
3. Simple filtering: g.hop() & g.gfql([...])#
We can filter by nodes, edges, and combinations of them.
The result is itself a graph: we can inspect its node and edge tables, visualize it, or run further searches on it.
Key concepts
There are 2 key methods:
* g.hop(...): filter triples of source node, edge, destination node
* g.gfql([...]): match an arbitrarily long sequence of node and edge predicates
Both reuse column operations core to dataframe libraries, such as comparison operators on strings, numbers, and dates.
Sample tasks
This section shows how to:
* Find SenSchumer and his immediate community (infomap metric)
* Look at his entire community
* Find everyone with high edge weight from/to SenSchumer; 2 hops either direction
* Find everyone in his community
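As a rough mental model (not a description of PyGraphistry's internals), a dictionary filter such as {'title': 'SenSchumer'} behaves like an equality mask over the node table. The sketch below uses a tiny invented node table, and `match_nodes` is a hypothetical helper written only for this illustration.

```python
import pandas as pd

# Toy node table; the rows are invented for illustration only.
nodes = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'title': ['SenSchumer', 'SpeakerPelosi', 'SenatorBaldwin'],
    'community_infomap': [2, 2, 0],
})

def match_nodes(df, filter_dict):
    """Hypothetical helper: keep rows whose columns equal every value in filter_dict."""
    mask = pd.Series(True, index=df.index)
    for col, value in filter_dict.items():
        mask &= (df[col] == value)
    return df[mask]

schumer = match_nodes(nodes, {'title': 'SenSchumer'})
community_2 = match_nodes(nodes, {'community_infomap': 2})
```

Several keys in one dictionary combine with a logical AND in this sketch, which is the usual reading of a multi-column filter.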
[ ]:
g2.gfql([n({'title': 'SenSchumer'})])._nodes
[ ]:
# First, find SenSchumer's immediate connections within infomap community 2
g_immediate_community2 = g2.gfql([n({'title': 'SenSchumer'}), e_undirected(), n({'community_infomap': 2})])
print(len(g_immediate_community2._nodes), 'senators', len(g_immediate_community2._edges), 'relns')
g_immediate_community2._edges[['from', 'to', 'weight2']].sort_values(by=['weight2']).head(10)
[ ]:
g_shumer_pelosi_bridges = g2.gfql([
    n({'title': 'SenSchumer'}),
    e_undirected(),
    n(),
    e_undirected(),
    n({'title': 'SpeakerPelosi'})
])
print(len(g_shumer_pelosi_bridges._nodes), 'senators')
g_shumer_pelosi_bridges._edges.sort_values(by='weight').head(5)
Often we only need to filter on a single source node / edge / destination node triple, and hop() is the short form for that case. Every hop() parameter can also be passed to the edge expressions used in gfql().
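To make the triple filter concrete, here is a minimal pandas sketch of the idea: keep an edge only when both of its endpoints pass their node filter. It is an approximation under simplifying assumptions (a single hop, and no handling of hop()'s other options such as direction or hop count), and the toy tables are invented.

```python
import pandas as pd

# Toy tables; ids and community values are invented.
nodes = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'community_infomap': [2, 2, 0, 2],
})
edges = pd.DataFrame({
    'from': ['a', 'a', 'c', 'd'],
    'to':   ['b', 'c', 'd', 'a'],
})

# Node ids satisfying the source and destination condition (community 2)
in_community = set(nodes.loc[nodes['community_infomap'] == 2, 'id'])

# Keep an edge only when both endpoints pass their node filter
kept_edges = edges[edges['from'].isin(in_community) & edges['to'].isin(in_community)]

# The resulting node table holds the endpoints of the surviving edges
kept_ids = set(kept_edges['from']) | set(kept_edges['to'])
kept_nodes = nodes[nodes['id'].isin(kept_ids)]
```

Here the edges a→c and c→d are dropped because node c is outside community 2.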
[83]:
g_community2 = g2.hop(source_node_match={'community_infomap': 2}, destination_node_match={'community_infomap': 2})
print(len(g_community2._nodes), 'senators', len(g_community2._edges), 'relns')
g_community2._edges.sort_values(by=['weight2']).head(10)
214 senators 4993 relns
[83]:
| | from | to | weight | weight2 |
|---|---|---|---|---|
| 378 | RepDonBeyer | RepSpeier | 0.000658 | 0.000658 |
| 354 | RepDonBeyer | repcleaver | 0.000658 | 0.000658 |
| 353 | RepDonBeyer | RepYvetteClarke | 0.000658 | 0.000658 |
| 352 | RepDonBeyer | RepCasten | 0.000658 | 0.000658 |
| 349 | RepDonBeyer | RepBeatty | 0.000658 | 0.000658 |
| 360 | RepDonBeyer | RepGaramendi | 0.000658 | 0.000658 |
| 361 | RepDonBeyer | RepChuyGarcia | 0.000658 | 0.000658 |
| 362 | RepDonBeyer | RepRaulGrijalva | 0.000658 | 0.000658 |
| 365 | RepDonBeyer | USRepKeating | 0.000658 | 0.000658 |
| 366 | RepDonBeyer | RepRickLarsen | 0.000658 | 0.000658 |
[86]:
g_community2.encode_point_color('pagerank', ['blue', 'yellow', 'red'], as_continuous=True).plot()
[86]:
4. Multi-hop and paths-between-nodes pattern mining#
The gfql([...]) method can look more than one hop out and even find paths between nodes.
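A two-hop "paths between nodes" query can be pictured as a neighbor-set intersection: a bridge is any node adjacent to both endpoints. The pure-Python sketch below illustrates that idea on an invented edge list, assuming undirected edges and no self-loops; `bridges_between` is a hypothetical helper, not a PyGraphistry function.

```python
from collections import defaultdict

# Toy interaction list; the names are placeholders, not dataset rows.
edges = [('A', 'x'), ('x', 'B'), ('A', 'y'), ('y', 'z'), ('w', 'B'), ('B', 'y')]

neighbors = defaultdict(set)
for src, dst in edges:
    # Record both directions so each edge is treated as undirected
    neighbors[src].add(dst)
    neighbors[dst].add(src)

def bridges_between(start, end):
    """Nodes m such that start - m - end is a path of exactly two undirected hops."""
    return neighbors[start] & neighbors[end]

found = bridges_between('A', 'B')
```

Nodes z and w are excluded because each touches only one side of the path.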
[92]:
g_shumer_pelosi_bridges.plot()
[92]:
5. Advanced filter predicates#
We can use a variety of predicates for filtering nodes and edges beyond attribute value equality.
Common tasks include comparing attributes using:
* Set inclusion: is_in([...])
* Numeric comparisons: gt(...), lt(...), ge(...), le(...)
* String comparison: startswith(...), endswith(...), contains(...)
* Regular expression matching: match(...)
* Duplicate checking: duplicated()
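Each predicate mirrors a familiar column operation. The pandas calls below are an assumption about the closest equivalents; they illustrate the intended semantics on invented data rather than documenting how the predicates are implemented.

```python
import pandas as pd

# Invented sample columns
titles = pd.Series(['SenSchumer', 'SpeakerPelosi', 'GOPLeader', 'SenSchumer'])
pagerank = pd.Series([0.02, 0.03, 0.01, 0.02])

set_inclusion = titles.isin(['SenSchumer', 'GOPLeader'])  # like is_in([...])
at_least = pagerank >= 0.02                               # like ge(0.02)
prefix = titles.str.startswith('Sen')                     # like startswith('Sen')
substring = titles.str.contains('Leader')                 # like contains('Leader')
regex_start = titles.str.match(r'Sen|Speaker')            # regex anchored at the string start
repeated = titles.duplicated()                            # like duplicated(): later repeats are True
```

Each result is a boolean mask with one entry per row, which is what a node or edge filter needs to decide which rows to keep.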
Graph where nodes are in the top 20 pagerank:
[134]:
top_20_pr = g2._nodes.pagerank.sort_values(ascending=False, ignore_index=True)[19]
top_20_pr
[134]:
0.005888600097034367
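The threshold trick above generalizes to any top-k cut: sort descending with a fresh 0..n-1 index, read position k-1, then keep everything at or above that value. A small sketch on invented scores shows one consequence worth knowing: ties at the threshold can yield more than k rows.

```python
import pandas as pd

scores = pd.Series([0.5, 0.1, 0.4, 0.4, 0.2, 0.3])  # invented values
k = 2

# Sort descending, reset the index, then read the k-th largest value
threshold = scores.sort_values(ascending=False, ignore_index=True)[k - 1]

# ">= threshold" keeps the top k plus any values tied with the k-th one
top = scores[scores >= threshold]
```

With k = 2 the threshold is 0.4, and three rows pass because two scores tie at 0.4.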
[ ]:
g_high_pr = g2.gfql([
    n({'pagerank': ge(top_20_pr)}),
    e_undirected(),
    n({'pagerank': ge(top_20_pr)}),
])
len(g_high_pr._nodes)
Graph where the name includes Leader
[136]:
g_leaders = g2.hop(
    source_node_match={'title': contains('Leader')},
    destination_node_match={'title': contains('Leader')}
)
print(len(g_leaders._nodes), 'leaders')
g_leaders.plot()
2 leaders
[136]:
Graph of leaders and senators
[139]:
g_leaders_and_senators = g2.hop(
    source_node_match={'title': match_re(r'Sen|Leader')},
    destination_node_match={'title': match_re(r'Sen|Leader')}
)
print(len(g_leaders_and_senators._nodes), 'leaders and senators')
g_leaders_and_senators.plot()
67 leaders and senators
[139]:
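The two queries in this section use different string predicates, and the difference matters. Assuming match_re is a regex predicate with pandas Series.str.match semantics (which anchor the pattern at the start of the string), a title is kept only if it begins with Sen or Leader, whereas contains('Leader') accepts the substring anywhere. The standard-library sketch below shows the same distinction with re.match and re.search on hypothetical titles.

```python
import re

# Hypothetical titles chosen to show the anchoring difference
titles = ['SenExample', 'GOPLeader', 'LeaderExample', 'RepExample']
pattern = re.compile(r'Sen|Leader')

anchored = [t for t in titles if pattern.match(t)]   # pattern must match at position 0
anywhere = [t for t in titles if pattern.search(t)]  # pattern may match at any position
```

To match a pattern anywhere in the string under anchored semantics, a common workaround is to prefix the pattern with `.*`.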
6. Result labeling#
It can be useful to name nodes and edges within the path query for downstream reasoning:
[ ]:
g_bridges2 = g2.gfql([
    n({'title': 'SenSchumer'}),
    e_undirected(name='from_schumer'),
    n(name='found_bridge'),
    e_undirected(name='from_pelosi'),
    n({'title': 'SpeakerPelosi'})
])
print(len(g_bridges2._nodes), 'senators in full graph')
named = g_bridges2._nodes[ g_bridges2._nodes.found_bridge ]
print(len(named), 'bridging senators')
edges = g_bridges2._edges
print(len(edges[edges.from_schumer]), 'relns from_schumer', len(edges[edges.from_pelosi]), 'relns from_pelosi')
g_bridges2.encode_point_color(
    'found_bridge',
    as_categorical=True,
    categorical_mapping={
        True: 'orange',
        False: 'silver'
    }
).plot()
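As the cell above suggests, each name shows up as a boolean column on the result tables, so downstream analysis is ordinary dataframe filtering. The sketch below imitates that shape with small invented tables (they are stand-ins, not output from the query) to show how the labels can be used.

```python
import pandas as pd

# Invented stand-ins for the labeled result tables
result_nodes = pd.DataFrame({
    'id': ['A', 'x', 'y', 'B'],
    'found_bridge': [False, True, True, False],
})
result_edges = pd.DataFrame({
    'from': ['A', 'A', 'x', 'y'],
    'to': ['x', 'y', 'B', 'B'],
    'from_schumer': [True, True, False, False],
    'from_pelosi': [False, False, True, True],
})

# Boolean label columns act directly as row masks
bridge_ids = result_nodes.loc[result_nodes['found_bridge'], 'id'].tolist()

# Summing a boolean column counts the rows that carry the label
counts = {
    'from_schumer': int(result_edges['from_schumer'].sum()),
    'from_pelosi': int(result_edges['from_pelosi'].sum()),
}
```

The same masks can feed an encoding such as the orange/silver coloring used above.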