Tutorial: GFQL remote mode#

Running GFQL on a remote server helps in scenarios such as large workloads that benefit from GPU acceleration when no local GPU is available, data that already lives on a remote Graphistry server, and other team and production needs.

The following examples walk through several common scenarios:

  • Uploading data and running GFQL remotely on it

  • Binding to existing remote data and running GFQL remotely on it

  • Controlling how much data is returned and in what format

  • Controlling CPU vs GPU execution

See also the sibling tutorial on running arbitrary GPU Python remotely for even more powerful scenarios.
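As a quick orientation, the end-to-end flow condenses to a few calls. The sketch below uses only calls that the rest of the tutorial walks through step by step:

import pandas as pd
import graphistry
from graphistry import n, e_forward

# Authenticate against the Graphistry server (see Setup below)
graphistry.register(api=3, username='FILL_ME_IN', password='FILL_ME_IN',
                    protocol='https', server='hub.graphistry.com')

# Bind a small local edge table and upload it to the server
g = graphistry.edges(pd.DataFrame({'s': ['a', 'b'], 'd': ['b', 'c']}), 's', 'd').upload()

# Run the GFQL query remotely and fetch the resulting graph
result = g.chain_remote([n({'id': 'a'}), e_forward(hops=2), n()])
print(result._nodes)
print(result._edges)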

Setup#

Note: Ensure the GFQL endpoint is enabled for the API user

Imports#

[1]:
import pandas as pd
import graphistry
from graphistry import n, e_undirected, e_forward
graphistry.__version__
[1]:
'0+unknown'
[2]:
graphistry.register(api=3, username='FILL_ME_IN', password='FILL_ME_IN', protocol='https', server='hub.graphistry.com')
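If you prefer not to embed a username and password, recent PyGraphistry versions also accept a personal API key; a hedged sketch, assuming your account has one provisioned on the server:

# Alternative: register with a personal API key instead of a username/password
# (assumes a personal key created on the server; supported in recent PyGraphistry releases)
graphistry.register(
    api=3,
    personal_key_id='FILL_ME_IN',
    personal_key_secret='FILL_ME_IN',
    protocol='https',
    server='hub.graphistry.com'
)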

Data#

Create an edge table. For simplicity, we will leave the nodes table implicit:

[3]:
e_df = pd.DataFrame({
    's': ['a', 'b', 'c'],
    'd': ['b', 'c', 'd'],
    'v': ['x', 'y', 'z'],
    'u': [2, 4, 6]
})

g = graphistry.edges(e_df, 's', 'd')

Upload data#

Uploaded datasets have a nodes File, edges File, and combined graph Dataset. You can inspect these on your Plottable objects.

Remote-mode GFQL calls will automatically upload your graph if it has not already been sent. If a table was already uploaded earlier in the session, the PyGraphistry client detects this and reuses the File ID handle instead of re-uploading the data. However, in application code, we recommend uploading explicitly, which makes reuse easier and the code flow more predictable.
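For reference, the implicit path looks like the sketch below; the cells that follow use the recommended explicit .upload() flow instead.

# Sketch of the implicit path: calling chain_remote() on a graph that has not been
# uploaded yet triggers the upload automatically before running the query
implicit_result = g.chain_remote([n()])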

[4]:
%%time
g2 = g.upload()

{
    'dataset_id': g2._dataset_id,
    'nodes_file_id': g2._nodes_file_id,
    'edges_file_id': g2._edges_file_id
}
CPU times: user 84.2 ms, sys: 13.2 ms, total: 97.4 ms
Wall time: 1.47 s
[4]:
{'dataset_id': '3a479d960595447e9e4f1b83ace969ed',
 'nodes_file_id': None,
 'edges_file_id': 'cd5bf7c37f1b4ced85a4d23b6f841be6'}

The edge table does not need to be re-uploaded:

[5]:
%%time

# Much faster as g._edges is not re-uploaded, and instead g2._edges_file_id is reused
g2b = g.upload()

assert g2b._dataset_id != g2._dataset_id, "Each upload is a new Dataset object"
assert g2b._edges_file_id == g2._edges_file_id, "Dataframe files get automatically reused"

{
    'dataset_id': g2._dataset_id,
    'nodes_file_id': g2._nodes_file_id,
    'edges_file_id': g2._edges_file_id
}
CPU times: user 45 ms, sys: 1.61 ms, total: 46.6 ms
Wall time: 605 ms
[5]:
{'dataset_id': '3a479d960595447e9e4f1b83ace969ed',
 'nodes_file_id': None,
 'edges_file_id': 'cd5bf7c37f1b4ced85a4d23b6f841be6'}

Query remote data#

Regular chain() calls can be run in remote mode and return the resulting graph:

[6]:
two_hop_query = [
    n({'id': 'a'}),
    e_forward(hops=2),
    n()
]
[7]:
%%time

two_hop_g = g2.chain_remote(two_hop_query)
CPU times: user 37.9 ms, sys: 9.9 ms, total: 47.8 ms
Wall time: 613 ms
[8]:
two_hop_g._edges
[8]:
s d v u
0 a b x 2
1 b c y 4
[9]:
two_hop_g._nodes
[9]:
id
0 a
1 b
2 c
[10]:
assert len(two_hop_g._edges) == len(g.chain(two_hop_query)._edges), "Remote result should match local results"

Ensure GPU mode in remote execution#

Explicitly set the remote engine= configuration to "cudf" (GPU) or "pandas" (CPU), or leave it unconfigured to let the runtime decide:

[11]:
%%time
two_hop_g_gpu1 = g2.chain_remote(two_hop_query, engine='cudf')
CPU times: user 48.4 ms, sys: 0 ns, total: 48.4 ms
Wall time: 598 ms
[12]:
%%time
two_hop_g_cpu1 = g2.chain_remote(two_hop_query, engine='pandas')
CPU times: user 50 ms, sys: 744 µs, total: 50.8 ms
Wall time: 590 ms

You can move the results to a local GPU dataframe if one is available:

[13]:
try:
    two_hop_g_gpu1 = two_hop_g_gpu1.to_cudf()
    print(type(two_hop_g_gpu1._edges))
except Exception as e:
    print('Error moving to a local GPU, do you have a GPU and is cudf configured?')
    print(e)
<class 'cudf.core.dataframe.DataFrame'>

Fetch only subsets of the data#

You can fetch only subsets of the remote data:

Shape: Check result counts without downloading the graph#

Often what matters is whether or not a search had hits, and you would rather not pay the performance penalty of transferring all of them. In these cases, switch to chain_remote_shape():

[14]:
g2.chain_remote_shape(two_hop_query)
[14]:
kind rows cols
0 nodes 3 1
1 edges 2 4
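As a small follow-up sketch (assuming the result is a regular pandas dataframe, as the rendering above suggests), checking whether the query had any edge hits is just a lookup:

# The shape result reports row/column counts per table, so a hit test is a simple lookup
shape_df = g2.chain_remote_shape(two_hop_query)
edge_rows = shape_df[shape_df['kind'] == 'edges']['rows'].iloc[0]
has_hits = edge_rows > 0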

Return only nodes#

[15]:
%%time

two_hops_nodes = g2.chain_remote(two_hop_query, output_type="nodes")

assert two_hops_nodes._edges is None, "No edges returned"

two_hops_nodes._nodes
CPU times: user 51.8 ms, sys: 74 µs, total: 51.9 ms
Wall time: 637 ms
[15]:
id
0 a
1 b
2 c

Return only edges#

[16]:
%%time

two_hops_edges = g2.chain_remote(two_hop_query, output_type="edges")

assert two_hops_edges._nodes is None, "No nodes returned"

two_hops_edges._edges
CPU times: user 54.1 ms, sys: 3.58 ms, total: 57.6 ms
Wall time: 609 ms
[16]:
s d v u
0 a b x 2
1 b c y 4

Return subset of attributes#

Whether returning both nodes and edges or only one of them, you can also pick a subset of the columns to fetch back. For example, you may want only the IDs, as the full data may be prohibitively large or the relevant attributes may already be available locally.

[17]:
%%time

two_hops_IDs_g = g2.chain_remote(two_hop_query, node_col_subset=['id'], edge_col_subset=['s', 'd'])
CPU times: user 47.3 ms, sys: 7.85 ms, total: 55.1 ms
Wall time: 609 ms
[18]:
two_hops_IDs_g._nodes
[18]:
id
0 a
1 b
2 c
[19]:
assert 'v' not in two_hops_IDs_g._edges.columns, "Only columns in the subset are returned"

two_hops_IDs_g._edges
[19]:
s d
0 a b
1 b c
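Since the full edge attributes are already present in the local e_df, a minimal sketch of rehydrating them locally with a pandas merge on the source/destination columns:

# Join the ID-only remote edges back to the local edge table to recover the 'v' and 'u' columns
edges_full = two_hops_IDs_g._edges.merge(e_df, on=['s', 'd'], how='left')
edges_full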

Bind, use, and fetch existing remote data#

When a remote graph dataset ID is already known, bind to it and use it directly:

Locally bind to remote data#

[20]:
%%time

g3_bound = graphistry.bind(dataset_id=g2._dataset_id)

{
    'dataset_id': g3_bound._dataset_id,
    'has local nodes': g3_bound._nodes is not None,
    'has local edges': g3_bound._edges is not None
}
CPU times: user 125 µs, sys: 34 µs, total: 159 µs
Wall time: 161 µs
[20]:
{'dataset_id': '5990e1142056407ea3b13639521ffb56',
 'has local nodes': False,
 'has local edges': False}

Remotely query remote data#

Use chain_remote() and chain_remote_shape() as usual:

[21]:
g3_bound.chain_remote_shape(two_hop_query)
[21]:
kind rows cols
0 nodes 3 1
1 edges 2 4

Fetch remote data#

Use chain_remote() to fetch the nodes and edges tables. Note that the queries below take care to also fetch nodes that are not connected to any edges.

[22]:
%%time

remote_g_nodes = g3_bound.chain_remote([n()], output_type='nodes')
remote_g_edges = g3_bound.chain_remote([e_undirected()], output_type='edges')

g3_fetched_g = (graphistry
    .nodes(remote_g_nodes._nodes, 'id')
    .edges(remote_g_edges._edges,  's', 'd')
)
CPU times: user 116 ms, sys: 10.5 ms, total: 127 ms
Wall time: 1.33 s
[23]:
print('Node ID column:', g3_fetched_g._node)
g3_fetched_g._nodes
Node ID column: id
[23]:
id
0 a
1 b
2 c
3 d
[24]:
print('Edge src/dst columns:', g3_fetched_g._source, g3_fetched_g._destination)
g3_fetched_g._edges
Edge src/dst columns: s d
[24]:
s d v u
0 a b x 2
1 b c y 4
2 c d z 6