Tutorial: GFQL remote mode#

Running GFQL on a remote server helps in scenarios such as large workloads that benefit from GPU acceleration when no local GPU is available, data that already lives on a remote Graphistry server, and other team and production needs.

The following examples walk through several common scenarios:

  • Uploading data and running GFQL remotely on it

  • Binding to existing remote data and running GFQL remotely on it

  • Controlling how much data is returned and in what format

  • Controlling CPU vs GPU execution

See also the sibling tutorial on running arbitrary GPU Python remotely for even more powerful scenarios.
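As a quick orientation, the end-to-end flow condenses to a few calls. The sketch below uses only calls that the rest of the tutorial walks through step by step:

import pandas as pd
import graphistry
from graphistry import n, e_forward

# Authenticate against the Graphistry server (see Setup below)
graphistry.register(api=3, username='FILL_ME_IN', password='FILL_ME_IN',
                    protocol='https', server='hub.graphistry.com')

# Bind a small local edge table and upload it to the server
g = graphistry.edges(pd.DataFrame({'s': ['a', 'b'], 'd': ['b', 'c']}), 's', 'd').upload()

# Run the GFQL query remotely and fetch the resulting graph
result = g.chain_remote([n({'id': 'a'}), e_forward(hops=2), n()])
print(result._nodes)
print(result._edges)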

Setup#

Note: Ensure the GFQL endpoint is enabled for the API user

Imports#

[1]:
import pandas as pd
import graphistry
from graphistry import n, e_undirected, e_forward
graphistry.__version__
[1]:
'0+unknown'
[2]:
graphistry.register(api=3, username='FILL_ME_IN', password='FILL_ME_IN', protocol='https', server='hub.graphistry.com')
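If you prefer not to embed a username and password, recent PyGraphistry versions also accept a personal API key; a hedged sketch, assuming your account has one provisioned on the server:

# Alternative: register with a personal API key instead of a username/password
# (assumes a personal key created on the server; supported in recent PyGraphistry releases)
graphistry.register(
    api=3,
    personal_key_id='FILL_ME_IN',
    personal_key_secret='FILL_ME_IN',
    protocol='https',
    server='hub.graphistry.com'
)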

Data#

Create an edge table. For simplicity, we will leave the nodes table implicit:

[3]:
e_df = pd.DataFrame({
    's': ['a', 'b', 'c'],
    'd': ['b', 'c', 'd'],
    'v': ['x', 'y', 'z'],
    'u': [2, 4, 6]
})

g = graphistry.edges(e_df, 's', 'd')

Upload data#

Uploaded datasets have a nodes File, edges File, and combined graph Dataset. You can inspect these on your Plottable objects.

Remote-mode GFQL calls will automatically upload your graph if it has not already been sent. If a table was already uploaded earlier in the session, the PyGraphistry client detects this and reuses the File ID handle instead of re-uploading the data. However, in application code, we recommend uploading explicitly, which makes reuse easier and the code flow more predictable.
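For reference, the implicit path looks like the sketch below; the cells that follow use the recommended explicit .upload() flow instead.

# Sketch of the implicit path: calling chain_remote() on a graph that has not been
# uploaded yet triggers the upload automatically before running the query
implicit_result = g.chain_remote([n()])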

[4]:
%%time
g2 = g.upload()

{
    'dataset_id': g2._dataset_id,
    'nodes_file_id': g2._nodes_file_id,
    'edges_file_id': g2._edges_file_id
}
CPU times: user 84.2 ms, sys: 13.2 ms, total: 97.4 ms
Wall time: 1.47 s
[4]:
{'dataset_id': '3a479d960595447e9e4f1b83ace969ed',
 'nodes_file_id': None,
 'edges_file_id': 'cd5bf7c37f1b4ced85a4d23b6f841be6'}

The edge table does not need to be re-uploaded:

[5]:
%%time

# Much faster as g._edges is not re-uploaded, and instead g2._edges_file_id is reused
g2b = g.upload()

assert g2b._dataset_id != g2._dataset_id, "Each upload is a new Dataset object"
assert g2b._edges_file_id == g2._edges_file_id, "Dataframe files get automatically reused"

{
    'dataset_id': g2._dataset_id,
    'nodes_file_id': g2._nodes_file_id,
    'edges_file_id': g2._edges_file_id
}
CPU times: user 45 ms, sys: 1.61 ms, total: 46.6 ms
Wall time: 605 ms
[5]:
{'dataset_id': '3a479d960595447e9e4f1b83ace969ed',
 'nodes_file_id': None,
 'edges_file_id': 'cd5bf7c37f1b4ced85a4d23b6f841be6'}

Query remote data#

Regular chain() calls can be run in remote mode and return the resulting graph:

[6]:
two_hop_query = [
    n({'id': 'a'}),
    e_forward(hops=2),
    n()
]
[7]:
%%time

two_hop_g = g2.chain_remote(two_hop_query)
CPU times: user 37.9 ms, sys: 9.9 ms, total: 47.8 ms
Wall time: 613 ms
[8]:
two_hop_g._edges
[8]:
s d v u
0 a b x 2
1 b c y 4
[9]:
two_hop_g._nodes
[9]:
id
0 a
1 b
2 c
[10]:
assert len(two_hop_g._edges) == len(g.chain(two_hop_query)._edges), "Remote result should match local results"

Ensure GPU mode in remote execution#

Explicitly set the remote engine= configuration to "cudf" (GPU) or "pandas" (CPU), or leave it unconfigured to let the runtime decide:

[11]:
%%time
two_hop_g_gpu1 = g2.chain_remote(two_hop_query, engine='cudf')
CPU times: user 48.4 ms, sys: 0 ns, total: 48.4 ms
Wall time: 598 ms
[12]:
%%time
two_hop_g_cpu1 = g2.chain_remote(two_hop_query, engine='pandas')
CPU times: user 50 ms, sys: 744 µs, total: 50.8 ms
Wall time: 590 ms

You can move the results to a local GPU dataframe if one is available:

[13]:
try:
    two_hop_g_gpu1 = two_hop_g_gpu1.to_cudf()
    print(type(two_hop_g_gpu1._edges))
except Exception as e:
    print('Error moving to a local GPU, do you have a GPU and is cudf configured?')
    print(e)
<class 'cudf.core.dataframe.DataFrame'>

Fetch only subsets of the data#

You can fetch only subsets of the remote data:

Shape: Check result counts without downloading the graph#

Often what matters is whether or not a search had hits, and you would rather not pay the performance penalty of transferring all of them. In these cases, switch to chain_remote_shape():

[14]:
g2.chain_remote_shape(two_hop_query)
[14]:
kind rows cols
0 nodes 3 1
1 edges 2 4
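As a small follow-up sketch (assuming the result is a regular pandas dataframe, as the rendering above suggests), checking whether the query had any edge hits is just a lookup:

# The shape result reports row/column counts per table, so a hit test is a simple lookup
shape_df = g2.chain_remote_shape(two_hop_query)
edge_rows = shape_df[shape_df['kind'] == 'edges']['rows'].iloc[0]
has_hits = edge_rows > 0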

Return only nodes#

[15]:
%%time

two_hops_nodes = g2.chain_remote(two_hop_query, output_type="nodes")

assert two_hops_nodes._edges is None, "No edges returned"

two_hops_nodes._nodes
CPU times: user 51.8 ms, sys: 74 µs, total: 51.9 ms
Wall time: 637 ms
[15]:
id
0 a
1 b
2 c

Return only edges#

[16]:
%%time

two_hops_edges = g2.chain_remote(two_hop_query, output_type="edges")

assert two_hops_edges._nodes is None, "No nodes returned"

two_hops_edges._edges
CPU times: user 54.1 ms, sys: 3.58 ms, total: 57.6 ms
Wall time: 609 ms
[16]:
s d v u
0 a b x 2
1 b c y 4

Return subset of attributes#

Whether returning both nodes and edges or only one of them, you can also pick a subset of the columns to fetch back. For example, you may want only the IDs, as the full data may be prohibitively large or the relevant attributes may already be available locally.

[17]:
%%time

two_hops_IDs_g = g2.chain_remote(two_hop_query, node_col_subset=['id'], edge_col_subset=['s', 'd'])
CPU times: user 47.3 ms, sys: 7.85 ms, total: 55.1 ms
Wall time: 609 ms
[18]:
two_hops_IDs_g._nodes
[18]:
id
0 a
1 b
2 c
[19]:
assert 'v' not in two_hops_IDs_g._edges.columns, "Only columns in the subset are returned"

two_hops_IDs_g._edges
[19]:
s d
0 a b
1 b c
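Since the full edge attributes are already present in the local e_df, a minimal sketch of rehydrating them locally with a pandas merge on the source/destination columns:

# Join the ID-only remote edges back to the local edge table to recover the 'v' and 'u' columns
edges_full = two_hops_IDs_g._edges.merge(e_df, on=['s', 'd'], how='left')
edges_full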

Bind, use, and fetch existing remote data#

When a remote graph dataset ID is already known, bind to it and use it directly:

Locally bind to remote data#

[20]:
%%time

g3_bound = graphistry.bind(dataset_id=g2._dataset_id)

{
    'dataset_id': g3_bound._dataset_id,
    'has local nodes': g3_bound._nodes is not None,
    'has local edges': g3_bound._edges is not None
}
CPU times: user 125 µs, sys: 34 µs, total: 159 µs
Wall time: 161 µs
[20]:
{'dataset_id': '5990e1142056407ea3b13639521ffb56',
 'has local nodes': False,
 'has local edges': False}

Remotely query remote data#

Use chain_remote() and chain_remote_shape() as usual:

[21]:
g3_bound.chain_remote_shape(two_hop_query)
[21]:
kind rows cols
0 nodes 3 1
1 edges 2 4

Fetch remote data#

Use chain_remote() to fetch the nodes and edges tables. Note that the queries below take care to also fetch nodes that are not connected to any edges.

[22]:
%%time

remote_g_nodes = g3_bound.chain_remote([n()], output_type='nodes')
remote_g_edges = g3_bound.chain_remote([e_undirected()], output_type='edges')

g3_fetched_g = (graphistry
    .nodes(remote_g_nodes._nodes, 'id')
    .edges(remote_g_edges._edges,  's', 'd')
)
CPU times: user 116 ms, sys: 10.5 ms, total: 127 ms
Wall time: 1.33 s
[23]:
print('Node ID column:', g3_fetched_g._node)
g3_fetched_g._nodes
Node ID column: id
[23]:
id
0 a
1 b
2 c
3 d
[24]:
print('Edge src/dst columns:', g3_fetched_g._source, g3_fetched_g._destination)
g3_fetched_g._edges
Edge src/dst columns: s d
[24]:
s d v u
0 a b x 2
1 b c y 4
2 c d z 6