Graph Pattern Mining with hop() and gfql()#

This tutorial demonstrates how to use PyGraphistry’s hop() and gfql() methods for graph pattern mining and traversal.

Key concepts:

  • g.hop(): Filter by source node → edge → destination node patterns

  • g.gfql(): Chain multiple node and edge filters for complex patterns

  • Predicates: Use comparisons, string matching, and other filters

  • Result labeling: Name intermediate results for analysis

We’ll explore these concepts using a US Congress Twitter interaction dataset.

1. Install & configure#

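[ ]:
import pandas as pd
import graphistry
from graphistry.compute.predicates import is_in, gt, lt, ge, le, eq, ne
from graphistry.compute.predicates import contains, startswith, endswith
# Regex matching: alias contains(), assumed here to treat its pattern as a regular
# expression by default (as pandas' str.contains does); is_in() only tests set membership
from graphistry.compute.predicates import contains as match_re
from graphistry.compute.ast import n, e_forward, e_reverse, e_undirected, e
[ ]:
# Replace the placeholders with your Graphistry credentials
graphistry.register(api=3, username='...', password='...')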

2. Load & enrich a US congress twitter interaction dataset#

This notebook uses an aggregated version of the Twitter-Congress dataset by Drew Conway: drewconway/Twitter-Congress

We collapse multiedges into a single weighted edge and store the result in demos/data/twitter_congress_edges_weighted.csv.gz for reproducible docs builds.
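As a rough illustration of the collapsing step, the sketch below counts repeated (from, to) pairs and turns the counts into weights. The sample rows and the normalization by total interaction count are assumptions made for illustration; the actual weighting used to build the dataset is not described in this tutorial.

```python
from collections import Counter

# Hypothetical raw interactions: one tuple per individual interaction
raw_edges = [
    ("SenSchumer", "SpeakerPelosi"),
    ("SenSchumer", "SpeakerPelosi"),
    ("SpeakerPelosi", "SenSchumer"),
    ("SenSchumer", "SenatorBaldwin"),
]

# Collapse multiedges: one entry per directed (from, to) pair
counts = Counter(raw_edges)
total = sum(counts.values())

# Assumed weighting: the share of all interactions carried by each pair
weighted_edges = [
    {"from": src, "to": dst, "weight": pair_count / total}
    for (src, dst), pair_count in counts.items()
]

for row in weighted_edges:
    print(row)
```

Each directed pair now appears once, so later filters on weight operate on one row per relationship.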

[ ]:
# Load the US Congress Twitter interaction dataset
# This dataset contains Twitter interactions between members of the US Congress
edges_df = pd.read_csv('../../data/twitter_congress_edges_weighted.csv.gz')
print(f"Loaded {len(edges_df)} edges")
edges_df.head()

[77]:
# Shape
g = graphistry.edges(edges_df, 'from', 'to')

# Enrich & style
# Tip: Switch from compute_igraph to compute_cugraph when GPUs are available
g2 = (g
      .materialize_nodes()
      .nodes(lambda g: g._nodes.assign(title=g._nodes.id))
      .edges(lambda g: g._edges.assign(weight2=g._edges.weight))
      .bind(point_title='title')
      .compute_igraph('community_infomap')
      .compute_igraph('pagerank')
      .get_degrees()
      .encode_point_color(
          'community_infomap',
          as_categorical=True,
          categorical_mapping={
              0: '#32a9a2', # vibrant teal
              1: '#ff6b6b', # soft coral
              2: '#f9d342', # muted yellow
          }
      )
)

g2._nodes
WARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to False
WARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDs
WARNING:root:edge index g._edge not set so using edge index as ID; set g._edge via g.edges(), or change merge_if_existing to False
WARNING:root:edge index g._edge __edge_index__ missing as attribute in ig; using ig edge order for IDs
[77]:
id title community_infomap pagerank degree_in degree_out degree
0 SenatorBaldwin SenatorBaldwin 0 0.001422 26 20 46
1 SenJohnBarrasso SenJohnBarrasso 0 0.001179 22 19 41
2 SenatorBennet SenatorBennet 0 0.001995 33 22 55
3 MarshaBlackburn MarshaBlackburn 0 0.001331 18 38 56
4 SenBlumenthal SenBlumenthal 0 0.001672 30 35 65
... ... ... ... ... ... ... ...
470 RepJoeWilson RepJoeWilson 1 0.001780 21 38 59
471 RobWittman RobWittman 1 0.001017 13 19 32
472 rep_stevewomack rep_stevewomack 1 0.002637 35 19 54
473 RepJohnYarmuth RepJohnYarmuth 2 0.000555 5 20 25
474 RepLeeZeldin RepLeeZeldin 1 0.000511 3 25 28

475 rows × 7 columns

[79]:
g2.plot()
[79]:

3. Simple filtering: g.hop() & g.gfql([...])#

We can filter by nodes, edges, and combinations of them.

The result is a graph where we can inspect the node and edge tables, or perform further graph operations, like visualization or further searches.

Key concepts

There are 2 key methods:

  • g.hop(...): filter triples of source node, edge, destination node

  • g.gfql([...]): arbitrarily long sequence of node and edge predicates

They reuse column operations core to dataframe libraries, such as comparison operators on strings, numbers, and dates
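To make the triple-filtering idea concrete, here is a minimal plain-Python sketch of what such a filter does conceptually. It is not PyGraphistry's implementation: hop_sketch, matches, and the toy nodes and edges are hypothetical, and the sketch ignores direction options and multi-hop traversal.

```python
def matches(row, conditions):
    """Return True when every key in conditions equals the row's value."""
    return all(row.get(key) == value for key, value in conditions.items())

# Toy graph: node attributes keyed by node ID, plus an edge list
nodes = {
    "a": {"community": 2},
    "b": {"community": 2},
    "c": {"community": 1},
}
edges = [
    {"from": "a", "to": "b", "weight": 0.5},
    {"from": "a", "to": "c", "weight": 0.1},
    {"from": "c", "to": "b", "weight": 0.3},
]

def hop_sketch(nodes, edges, source_match=None, edge_match=None, destination_match=None):
    """Keep edges whose source node, edge, and destination node all match."""
    kept_edges = [
        edge for edge in edges
        if matches(nodes[edge["from"]], source_match or {})
        and matches(edge, edge_match or {})
        and matches(nodes[edge["to"]], destination_match or {})
    ]
    # The resulting subgraph keeps only nodes touched by a surviving edge
    kept_ids = {edge["from"] for edge in kept_edges} | {edge["to"] for edge in kept_edges}
    kept_nodes = {node_id: attrs for node_id, attrs in nodes.items() if node_id in kept_ids}
    return kept_nodes, kept_edges

sub_nodes, sub_edges = hop_sketch(
    nodes, edges,
    source_match={"community": 2},
    destination_match={"community": 2},
)
print(sorted(sub_nodes), sub_edges)
```

Only the a → b edge survives here, because it is the only edge whose two endpoints are both in community 2.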

Sample tasks

This section shows how to:

  • Find SenSchumer and his immediate community (infomap metric)

  • Look at his entire community

  • Find everyone with high edge weight from/to SenSchumer; 2 hops either direction

  • Find everyone in his community

[ ]:
g2.gfql([n({'title': 'SenSchumer'})])._nodes

[ ]:
# First, find SenSchumer's immediate connections within community 2
g_immediate_community2 = g2.gfql([n({'title': 'SenSchumer'}), e_undirected(), n({'community_infomap': 2})])

print(len(g_immediate_community2._nodes), 'senators', len(g_immediate_community2._edges), 'relns')
g_immediate_community2._edges[['from', 'to', 'weight2']].sort_values(by=['weight2']).head(10)

Often, we are just filtering on a source node / edge / destination node triple, so hop() is a shorthand for this. All of the hop() parameters can also be passed to edge expressions.

[83]:
g_community2 = g2.hop(source_node_match={'community_infomap': 2}, destination_node_match={'community_infomap': 2})

print(len(g_community2._nodes), 'senators', len(g_community2._edges), 'relns')
g_community2._edges.sort_values(by=['weight2']).head(10)
214 senators 4993 relns
[83]:
from to weight weight2
378 RepDonBeyer RepSpeier 0.000658 0.000658
354 RepDonBeyer repcleaver 0.000658 0.000658
353 RepDonBeyer RepYvetteClarke 0.000658 0.000658
352 RepDonBeyer RepCasten 0.000658 0.000658
349 RepDonBeyer RepBeatty 0.000658 0.000658
360 RepDonBeyer RepGaramendi 0.000658 0.000658
361 RepDonBeyer RepChuyGarcia 0.000658 0.000658
362 RepDonBeyer RepRaulGrijalva 0.000658 0.000658
365 RepDonBeyer USRepKeating 0.000658 0.000658
366 RepDonBeyer RepRickLarsen 0.000658 0.000658
[86]:
g_community2.encode_point_color('pagerank', ['blue', 'yellow', 'red'], as_continuous=True).plot()
[86]:

4. Multi-hop and paths-between-nodes pattern mining#

The gfql([...]) method can look more than one hop out, and even find paths between nodes.

[ ]:
g_shumer_pelosi_bridges = g2.gfql([
    n({'title': 'SenSchumer'}),
    e_undirected(),
    n(),
    e_undirected(),
    n({'title': 'SpeakerPelosi'})
])

print(len(g_shumer_pelosi_bridges._nodes), 'senators')
g_shumer_pelosi_bridges._edges.sort_values(by='weight').head(5)
[92]:
g_shumer_pelosi_bridges.plot()
[92]:
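The two-hop query between SenSchumer and SpeakerPelosi amounts to finding accounts adjacent to both endpoints. A minimal plain-Python sketch of that idea follows; the intermediate accounts A through D and the helper two_hop_bridges are hypothetical, and edges are treated as undirected to mirror e_undirected().

```python
from collections import defaultdict

# Toy edge list; A-D are made-up intermediate accounts
edges = [
    ("SenSchumer", "A"),
    ("A", "SpeakerPelosi"),
    ("B", "SenSchumer"),
    ("SpeakerPelosi", "B"),
    ("SenSchumer", "C"),
    ("D", "SpeakerPelosi"),
]

# Undirected adjacency: record each edge in both directions
neighbors = defaultdict(set)
for src, dst in edges:
    neighbors[src].add(dst)
    neighbors[dst].add(src)

def two_hop_bridges(neighbors, start, end):
    """Nodes adjacent to both start and end, excluding the endpoints themselves."""
    shared = neighbors.get(start, set()) & neighbors.get(end, set())
    return shared - {start, end}

bridges = two_hop_bridges(neighbors, "SenSchumer", "SpeakerPelosi")
print(sorted(bridges))  # ['A', 'B']
```

C and D are dropped because each touches only one of the two endpoints.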

5. Advanced filter predicates#

We can use a variety of predicates for filtering nodes and edges beyond attribute value equality.

Common tasks include comparing attributes using:

  • Set inclusion: is_in([...])

  • Numeric comparisons: gt(...), lt(...), ge(...), le(...)

  • String comparison: startswith(...), endswith(...), contains(...)

  • Regular expression matching: matches(...)

  • Duplicate checking: duplicated()
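One way to think about these predicates is as small functions that are built once and then applied to every value in a column. The sketch below illustrates that idea in plain Python. The helper names mirror the tutorial's imports but are local stand-ins, not PyGraphistry's implementation, and the pagerank values are made up.

```python
import re

def ge(threshold):
    """Predicate: value is greater than or equal to threshold."""
    return lambda value: value >= threshold

def is_in(options):
    """Predicate: value is one of the given options."""
    allowed = set(options)
    return lambda value: value in allowed

def contains(pattern):
    """Predicate: regular expression pattern occurs anywhere in the value."""
    compiled = re.compile(pattern)
    return lambda value: compiled.search(value) is not None

# Toy node rows with made-up pagerank values
rows = [
    {"title": "SenSchumer", "pagerank": 0.006},
    {"title": "GOPLeader", "pagerank": 0.007},
    {"title": "RepDonBeyer", "pagerank": 0.001},
]

def filter_rows(rows, conditions):
    """Keep rows where every column's predicate returns True."""
    return [
        row for row in rows
        if all(predicate(row[column]) for column, predicate in conditions.items())
    ]

print(filter_rows(rows, {"title": contains("Sen|Leader")}))
print(filter_rows(rows, {"pagerank": ge(0.0065)}))
print(filter_rows(rows, {"title": is_in(["RepDonBeyer"])}))
```

Because each predicate is just a value, several of them can be combined in one conditions dictionary to filter on multiple columns at once.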

Graph where nodes are in the top 20 pagerank:

[134]:
top_20_pr = g2._nodes.pagerank.sort_values(ascending=False, ignore_index=True)[19]
top_20_pr
[134]:
0.005888600097034367
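The threshold above is the 20th-largest pagerank value: index 19 of the descending sort. Passing it to ge(...) keeps every node at or above that value, so ties at the cutoff can yield more than 20 nodes. A tiny plain-Python sketch with made-up scores and k = 2 shows the effect:

```python
# Made-up scores; two values tie at the cutoff
scores = [0.9, 0.5, 0.5, 0.3, 0.2, 0.1]
k = 2

# k-th largest value: index k - 1 of the descending sort
threshold = sorted(scores, reverse=True)[k - 1]

# Equivalent of a ge(threshold) filter: keep values at or above the cutoff
selected = [score for score in scores if score >= threshold]

print(threshold)  # 0.5
print(selected)   # [0.9, 0.5, 0.5]
```

Three values are selected even though k is 2, because both 0.5 entries meet the cutoff.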
[ ]:
g_high_pr = g2.gfql([
    n({'pagerank': ge(top_20_pr)}),
    e_undirected(),
    n({'pagerank': ge(top_20_pr)}),
])

len(g_high_pr._nodes)

Graph where the title contains Leader:

[136]:
g_leaders = g2.hop(
    source_node_match={'title': contains('Leader')},
    destination_node_match = {'title': contains('Leader')}
)

print(len(g_leaders._nodes), 'leaders')

g_leaders.plot()
2 leaders
[136]:

Graph of leaders and senators

[139]:
g_leaders_and_senators = g2.hop(
    source_node_match={'title': match_re(r'Sen|Leader')},
    destination_node_match = {'title': match_re(r'Sen|Leader')}
)

print(len(g_leaders_and_senators._nodes), 'leaders and senators')

g_leaders_and_senators.plot()
67 leaders and senators
[139]:

6. Result labeling#

It can be useful to name nodes and edges within the path query for downstream reasoning:

[6]:
g_bridges2 = g2.gfql([
    n({'title': 'SenSchumer'}),
    e_undirected(name='from_schumer'),
    n(name='found_bridge'),
    e_undirected(name='from_pelosi'),
    n({'title': 'SpeakerPelosi'})
])

print(len(g_bridges2._nodes), 'senators in full graph')

named = g_bridges2._nodes[ g_bridges2._nodes.found_bridge ]
print(len(named), 'bridging senators')
edges = g_bridges2._edges
print(len(edges[edges.from_schumer]), 'relns from_schumer', len(edges[edges.from_pelosi]), 'relns from_pelosi')

g_bridges2.encode_point_color(
    'found_bridge',
    as_categorical=True,
    categorical_mapping={
        True: 'orange',
        False: 'silver'
    }
).plot()
25 senators in full graph
23 bridging senators
23 relns from_schumer 32 relns from_pelosi
[6]:
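The filtering in the cell above treats each step name as a boolean column on the result tables. The following plain-Python sketch illustrates that labeling idea; step_matches and the node IDs A and B are hypothetical stand-ins for what a query would produce.

```python
# Node IDs in the result graph; A and B are made-up intermediate accounts
node_ids = ["SenSchumer", "A", "B", "SpeakerPelosi"]

# Hypothetical match result: which node IDs satisfied each named step
step_matches = {"found_bridge": {"A", "B"}}

# One boolean column per step name, True where the node matched that step
node_table = [
    {"id": node_id, **{name: node_id in ids for name, ids in step_matches.items()}}
    for node_id in node_ids
]

# Downstream reasoning: select rows by label, as a boolean mask would
named = [row["id"] for row in node_table if row["found_bridge"]]
print(named)  # ['A', 'B']
```

The endpoints stay in the table with the label set to False, which is why the full graph above has more nodes than the bridging subset.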

7. Pattern Reuse with Let Bindings#

The let operator allows you to define named graph patterns that can be referenced multiple times in your query. This is particularly useful for:

  • Creating reusable pattern components

  • Building complex patterns from simpler building blocks

  • Avoiding repetition in pattern definitions

Let’s explore how to use let bindings for finding triangles and other complex patterns.

[7]:
# Finding triangles using let bindings
# Define a reusable pattern for high-influence nodes (top 30% pagerank)
top_30_pr = g2._nodes.pagerank.quantile(0.7)

# Find triangles of high-influence members
g_triangles = g2.gfql([
    {
        'let': {
            # Define a pattern for high-influence nodes
            'influential': n({'pagerank': ge(top_30_pr)}),
            # Define a pattern for strong connections
            'strong_edge': e_undirected({'weight': ge(0.01)})
        }
    },
    # Use the defined patterns to find triangles
    {'pattern': 'influential', 'name': 'node_a'},
    {'pattern': 'strong_edge'},
    {'pattern': 'influential', 'name': 'node_b'},
    {'pattern': 'strong_edge'},
    {'pattern': 'influential', 'name': 'node_c'},
    {'pattern': 'strong_edge'},
    {'pattern': 'influential', 'name': 'node_a'}  # Close the triangle
])

print(f"Found {len(g_triangles._nodes)} nodes in triangles")
print(f"Found {len(g_triangles._edges)} edges in triangles")

# Visualize the triangles
g_triangles.encode_point_color('community_infomap', as_categorical=True).plot()
Found 108 nodes in triangles
Found 2772 edges in triangles
[7]:
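For intuition about what counts as a triangle, here is a brute-force plain-Python check on a toy graph. It assumes undirected edges, applies no pagerank or weight filters, and is independent of the query above; it simply tests every trio of nodes for the three required edges.

```python
from itertools import combinations

# Toy undirected edge list with one triangle (a, b, c) and a dangling edge
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

# Build an undirected adjacency map
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

# A trio is a triangle when every pair within it is connected
triangles = [
    trio
    for trio in combinations(sorted(neighbors), 3)
    if all(b in neighbors[a] for a, b in combinations(trio, 2))
]

print(triangles)  # [('a', 'b', 'c')]
```

This check is cubic in the number of nodes, so it is only suitable for small graphs or for spot-checking results.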

Finding Community Bridge Patterns with Let#

Let’s use let to define reusable patterns for finding members who bridge different communities:

[8]:
# Find members who bridge communities using let bindings
g_community_bridges = g2.gfql([
    {
        'let': {
            # Pattern for community 0 members
            'community_0': n({'community_infomap': 0}),
            # Pattern for community 1 members
            'community_1': n({'community_infomap': 1}),
            # Pattern for community 2 members
            'community_2': n({'community_infomap': 2}),
            # Pattern for any edge
            'any_edge': e_undirected()
        }
    },
    # Find paths from community 0 to community 1 through community 2
    {'pattern': 'community_0', 'name': 'start'},
    {'pattern': 'any_edge'},
    {'pattern': 'community_2', 'name': 'bridge'},
    {'pattern': 'any_edge'},
    {'pattern': 'community_1', 'name': 'end'}
])

print(f"Found {len(g_community_bridges._nodes)} nodes in bridging pattern")
bridges = g_community_bridges._nodes[g_community_bridges._nodes.bridge]
print(f"Community 2 members acting as bridges: {list(bridges.title.values)}")

# Visualize with bridge nodes highlighted
g_community_bridges.encode_point_color(
    'bridge',
    as_categorical=True,
    categorical_mapping={
        True: 'red',
        False: 'lightgray'
    }
).encode_point_size('bridge', categorical_mapping={True: 80, False: 40}).plot()
Found 7 nodes in bridging pattern
Community 2 members acting as bridges: ['RepBoswell']
[8]:

Complex Pattern Composition with Let#

Let’s create more sophisticated patterns by composing smaller patterns:

[9]:
# Find star patterns around influential nodes
# A star pattern is where one central node connects to multiple others

g_star_patterns = g2.gfql([
    {
        'let': {
            # Very influential nodes (top 10%)
            'very_influential': n({'pagerank': ge(g2._nodes.pagerank.quantile(0.9))}),
            # Moderately influential nodes (top 50%)
            'moderately_influential': n({'pagerank': ge(g2._nodes.pagerank.quantile(0.5))}),
            # Strong bidirectional connection
            'strong_connection': e_undirected({'weight': ge(0.02)})
        }
    },
    # Find star patterns: very influential center connected to multiple moderately influential nodes
    {'pattern': 'very_influential', 'name': 'center'},
    {'pattern': 'strong_connection'},
    {'pattern': 'moderately_influential', 'name': 'spoke1'},
    # Return to center
    e_undirected(),
    {'pattern': 'very_influential', 'name': 'center'},
    {'pattern': 'strong_connection'},
    {'pattern': 'moderately_influential', 'name': 'spoke2'},
    # Return to center again
    e_undirected(),
    {'pattern': 'very_influential', 'name': 'center'},
    {'pattern': 'strong_connection'},
    {'pattern': 'moderately_influential', 'name': 'spoke3'}
])

print(f"Found {len(g_star_patterns._nodes)} nodes in star patterns")
centers = g_star_patterns._nodes[g_star_patterns._nodes.center]
print(f"Central nodes: {list(centers.title.unique())[:5]}...")  # Show first 5

# Visualize with centers highlighted
g_star_patterns.encode_point_color(
    'center',
    as_categorical=True,
    categorical_mapping={
        True: 'gold',
        False: 'lightblue'
    }
).encode_point_size(
    'center',
    categorical_mapping={True: 100, False: 50}
).plot()
Found 177 nodes in star patterns
Central nodes: ['GOPLeader', 'RepBachmann', 'RepBlackburn', 'RepBoehner', 'RepChaffetz']...
[9]:
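A star can also be described directly as a center with several distinct strong neighbors. The plain-Python sketch below illustrates that definition on toy data; the 0.02 weight cutoff is taken from the query above, while MIN_SPOKES, the node names, and the omission of the pagerank filters are assumptions for illustration.

```python
from collections import defaultdict

# Toy weighted edges: (from, to, weight)
edges = [
    ("hub", "s1", 0.03),
    ("hub", "s2", 0.05),
    ("s3", "hub", 0.02),
    ("hub", "s4", 0.001),  # too weak to count as a spoke
    ("x", "y", 0.9),       # strong, but neither endpoint has 3 spokes
]
STRONG = 0.02    # same weight cutoff as the strong_connection pattern
MIN_SPOKES = 3   # assumed minimum number of distinct spokes

# Collect undirected neighbors reachable over strong edges only
strong_neighbors = defaultdict(set)
for src, dst, weight in edges:
    if weight >= STRONG:
        strong_neighbors[src].add(dst)
        strong_neighbors[dst].add(src)

# Centers are nodes with at least MIN_SPOKES distinct strong neighbors
stars = {
    center: sorted(spokes)
    for center, spokes in strong_neighbors.items()
    if len(spokes) >= MIN_SPOKES
}

print(stars)  # {'hub': ['s1', 's2', 's3']}
```

Using sets for the neighbors guarantees that the spokes are distinct nodes, which is the property that makes the shape a star.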

Benefits of Let Bindings#

The let operator provides several advantages:

  1. Reusability: Define a pattern once and use it multiple times

  2. Readability: Give meaningful names to complex patterns

  3. Maintainability: Change pattern definitions in one place

  4. Composability: Build complex patterns from simpler components

This makes it easier to explore and mine complex graph patterns in your data!