Overview of GFQL

New to GFQL, the open source dataframe-native graph query language? This article covers the gaps it fills, special features such as GPU acceleration, and where to go next.

Why GFQL?

GFQL addresses a critical gap in the data community by providing an in-process graph query language that operates at the compute tier. This means you can:

Search Graphs: Query and filter nodes and edges efficiently using a familiar syntax.
Avoid External Infrastructure: Run in-process, with no calls to external infrastructure and no extra databases.
Leverage Existing Workflows: Integrate with your current Python data science tools and libraries.
Achieve High Performance: Utilize GPU acceleration for massive speedups in graph processing.
Simplify Graph Analytics: Write expressive and concise graph queries in Python.

Key Features

Dataframe-Native Integration: Works directly with Pandas, Polars, cuDF, and Apache Arrow dataframes.
High Performance: Optimized for both CPU and GPU execution, capable of processing billions of edges.
Ease of Use: Install via pip and start querying without the need for external databases.
Seamless Visualization: Integrated with PyGraphistry for GPU-accelerated graph visualization.
Flexibility: Suitable for a wide range of applications, including cybersecurity, fraud detection, financial analysis, and more.
Architectural Freedom: Use GFQL with your dataframes on your local CPU/GPU, or offload to a remote GPU cluster.

Installation Guide

GFQL is built into pygraphistry:

pip install graphistry

Ensure you have pandas or cudf installed, depending on whether you want to run on CPU or GPU.

For more information, see Install.

Key GFQL Concepts

GFQL works on the same graphs as the rest of the PyGraphistry library. The operations run on top of the dataframe engine of your choice, with initial support for Pandas dataframes (CPU) and cuDF dataframes (GPU).

Nodes and Edges: Represented using dataframes, making integration with Pandas and cuDF seamless
Cypher strings: Write queries as Cypher strings — g.gfql("MATCH (n) WHERE n.score > 5 RETURN n")
Native chains: Or compose queries as Python objects — g.gfql([n({"score": gt(5)})])
Predicates: Apply conditions to filter nodes and edges based on their properties, reusing the optimized native operations of the underlying dataframe engine
Same-path constraints (WHERE): Relate attributes across steps in a chain using where
Row pipelines (`MATCH … RETURN` style): Move from graph pattern matches to tabular results with rows(), where_rows(), return_(), order_by(), group_by(), skip(), and limit()
Result kinds: Some stages keep you in graph state, while row-pipeline stages and row-returning local Cypher CALL queries move you into row state
GPU & CPU vectorization: GFQL automatically leverages GPU acceleration and in-memory columnar processing for massive speedups on your queries
Optional remote mode: Bind to remote data or upload it quickly as Arrow, and run your same Python and GFQL queries on remote GPU resources when available
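To make the predicate concept above more concrete, here is a minimal plain-Python sketch of how a filter dict such as {"score": gt(5)} can be read as one test per column. The gt and match_rows helpers below are hypothetical stand-ins written only for this illustration; they are not the graphistry implementations, which delegate to the vectorized operations of the underlying dataframe engine.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def gt(threshold: float) -> Callable[[Any], bool]:
    """Hypothetical stand-in for a 'greater than' predicate."""
    return lambda value: value is not None and value > threshold


def match_rows(rows: List[Row], filter_dict: Dict[str, Any]) -> List[Row]:
    """Keep rows where every key passes.

    Callable values are treated as predicates; any other value means equality.
    """
    def passes(row: Row) -> bool:
        for key, wanted in filter_dict.items():
            actual = row.get(key)
            if callable(wanted):
                if not wanted(actual):
                    return False
            elif actual != wanted:
                return False
        return True

    return [row for row in rows if passes(row)]


# Illustrative node table, one dict per node
nodes = [
    {"id": "a", "type": "person", "score": 7},
    {"id": "b", "type": "person", "score": 3},
    {"id": "c", "type": "company", "score": 9},
]

high_scoring_people = match_rows(nodes, {"type": "person", "score": gt(5)})
print([row["id"] for row in high_scoring_people])
```

A real engine would evaluate each predicate over a whole column at once rather than row by row; the row loop here only keeps the sketch dependency-free.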

Choosing Entry Points And Result Kinds

Use the entrypoint that matches where the query executes:

Local in-memory GFQL / Cypher-style execution: g.gfql([…]) or g.gfql("MATCH …") runs on the current Plottable in pandas/cuDF.
Remote GFQL execution: g.gfql_remote([…]) runs the same GFQL chains/DAGs remotely, which is useful for larger datasets and remote GPU execution. See GFQL Remote Mode.

Warning

graphistry.cypher("…") and g.cypher("…") are a separate remote database Cypher path (for example, Neo4j/Neptune integrations), not the GFQL execution surface described on this page. Do not treat them as interchangeable with g.gfql(…) or g.gfql_remote(…).

GFQL pipelines also have two practical result kinds:

Graph state: Traversable graph results with meaningful _nodes and _edges. Matchers, graph-preserving call(…) transforms, let() / ref() DAG stages, local Cypher CALL graphistry.*.write() queries, and local Cypher GRAPH { MATCH … } constructors stay in graph state.
Row state: Tabular results stored in _nodes, with _edges reduced to an empty placeholder frame. Row-pipeline steps like rows(), with_(), select(), return_(), group_by(), and row-returning local Cypher CALL … YIELD … RETURN … queries move into row state.
A bare local Cypher procedure call without .write() is also row-returning. For example, CALL graphistry.degree() materializes the default procedure output columns into _nodes and clears _edges.

If you need to enrich a graph and keep matching locally, use graph-preserving call() / let() composition or a bare local Cypher CALL graphistry.*.write(). The local Cypher compiler currently supports:

graphistry.degree.write()
graphistry.igraph.<alg>.write() and graphistry.cugraph.<alg>.write(), for algorithms exposed through compute_igraph() / compute_cugraph()
A curated NetworkX subset: graphistry.nx.pagerank.write(), graphistry.nx.betweenness_centrality.write(), graphistry.nx.degree_centrality.write(), graphistry.nx.closeness_centrality.write(), graphistry.nx.eigenvector_centrality.write(), graphistry.nx.katz_centrality.write(), graphistry.nx.connected_components.write(), graphistry.nx.strongly_connected_components.write(), graphistry.nx.core_number.write(), graphistry.nx.hits.write(), graphistry.nx.edge_betweenness_centrality.write(), and graphistry.nx.k_core.write()

Quick Examples

GFQL supports Cypher strings and native Python chains through the same g.gfql(...) entrypoint:

Find Nodes of a Certain Type

# Cypher string — returns a DataFrame of matching nodes
nodes_df = g.gfql("MATCH (n {type: 'person'}) RETURN n")._nodes

# Equivalent native chain
from graphistry import n
nodes_df = g.gfql([ n({"type": "person"}) ])._nodes

Extract a Subgraph

# Cypher string — GRAPH { } returns a subgraph with ._nodes and ._edges
g2 = g.gfql(
    "GRAPH { "
    "MATCH (a)-[e]->(b) "
    "WHERE e.interesting = true "
    "}"
)

# Equivalent native chain
from graphistry import n, e_forward
g2 = g.gfql([n(), e_forward({"interesting": True}, hops=2) ])
g2.plot()

Same-Path Constraints (WHERE)

Example: Match an account and its owner when both steps share an attribute.

from graphistry import n, e_forward, col, compare

g_filtered = g.gfql(
    [
        n({"type": "account"}, name="a"),
        e_forward(),
        n({"type": "user"}, name="c"),
    ],
    where=[compare(col("a", "owner_id"), "==", col("c", "owner_id"))],
)
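As a rough illustration of what a same-path constraint means, the plain-Python sketch below keeps an account-to-user edge only when both endpoints carry the same owner_id. The data and the same_path_matches helper are hypothetical and only model the semantics; GFQL evaluates such constraints on dataframes rather than with Python loops.

```python
from typing import Any, Dict, List, Tuple

# Illustrative node attributes keyed by node id
nodes: Dict[str, Dict[str, Any]] = {
    "acct1": {"type": "account", "owner_id": 1},
    "acct2": {"type": "account", "owner_id": 2},
    "u1": {"type": "user", "owner_id": 1},
    "u2": {"type": "user", "owner_id": 3},
}

# Directed edges as (source, destination) pairs
edges: List[Tuple[str, str]] = [
    ("acct1", "u1"),
    ("acct2", "u2"),
    ("acct1", "u2"),
]


def same_path_matches(
    nodes: Dict[str, Dict[str, Any]], edges: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Keep account -> user edges whose endpoints share the same owner_id."""
    kept = []
    for src, dst in edges:
        a, c = nodes[src], nodes[dst]
        # Per-step filters: the first step is an account, the last a user
        if a["type"] != "account" or c["type"] != "user":
            continue
        # Cross-step constraint: compare attributes of the two named steps
        if a["owner_id"] == c["owner_id"]:
            kept.append((src, dst))
    return kept


matches = same_path_matches(nodes, edges)
print(matches)
```

The point of the constraint is that it relates two different steps of the same path, which per-step filter dicts cannot express on their own.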

Row-Pipeline `RETURN` Example

Example: Match people, filter rows, project columns, then sort/limit.

from graphistry import n, e_forward, gt
from graphistry.compute import rows, where_rows, return_, order_by, limit

top_people = g.gfql([
    n({"type": "Person"}),
    e_forward({"type": "FOLLOWS"}),
    n({"type": "Person", "score": gt(0)}, name="p"),
    rows(table="nodes", source="p"),
    where_rows(expr="score >= 50"),
    return_(["id", "name", "score"]),
    order_by([("score", "desc"), ("name", "asc")]),
    limit(10),
])

top_people._nodes
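The stages of the row pipeline above (filter rows, project columns, sort, limit) can be modeled in plain Python as follows. This is a sketch of the general filter/project/order/limit pattern on a list of dicts, not of GFQL internals; run_row_pipeline and the sample data are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Illustrative rows, as if produced from the matched "p" nodes
people: List[Row] = [
    {"id": 1, "name": "Ann", "score": 90, "type": "Person"},
    {"id": 2, "name": "Bob", "score": 40, "type": "Person"},
    {"id": 3, "name": "Cy", "score": 90, "type": "Person"},
    {"id": 4, "name": "Di", "score": 75, "type": "Person"},
]


def run_row_pipeline(
    rows: List[Row], min_score: int, columns: List[str], top_k: int
) -> List[Row]:
    # Filter step: keep rows meeting the score threshold
    filtered = [row for row in rows if row["score"] >= min_score]
    # Projection step: keep only the requested columns
    projected = [{col: row[col] for col in columns} for row in filtered]
    # Ordering step: score descending, then name ascending
    ordered = sorted(projected, key=lambda row: (-row["score"], row["name"]))
    # Limit step: keep the first top_k rows
    return ordered[:top_k]


top = run_row_pipeline(people, min_score=50, columns=["id", "name", "score"], top_k=2)
print(top)
```

Note that the secondary sort key only matters for ties: Ann and Cy share a score, so the name ordering decides which comes first.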

Local Cypher `CALL … .write()` Example

Example: Enrich a graph locally, keep graph state, then run a later MATCH.

g_enriched = g.gfql("CALL graphistry.degree.write()")
assert not g_enriched._edges.empty
top_degree = g_enriched.gfql(
    "MATCH (n) "
    "WHERE n.degree >= 2 "
    "RETURN n.id AS id, n.degree AS degree "
    "ORDER BY degree DESC, id ASC "
    "LIMIT 10"
)

top_degree._nodes

Local Cypher row-returning `CALL` Example

Example: Omit .write() when you want procedure rows instead of an enriched graph.

degree_rows = g.gfql("CALL graphistry.degree()")
assert degree_rows._edges.empty
degree_rows._nodes

This row result uses nodeId as the row identifier, stores the projected procedure outputs in _nodes, and clears _edges. Use .write() when the next step needs graph topology.
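For intuition about the shape of such a row result, the sketch below computes a degree per node from an edge list and emits one row per node keyed by nodeId. It is a plain-Python model under assumptions: the degree column name and the degree_rows helper are illustrative and are not taken from the procedure's documented output.

```python
from collections import Counter
from typing import Dict, List, Tuple, Union

Edge = Tuple[str, str]


def degree_rows(edges: List[Edge]) -> List[Dict[str, Union[str, int]]]:
    """Return one row per node with its undirected degree; topology is not kept."""
    counts: Counter = Counter()
    for src, dst in edges:
        counts[src] += 1
        counts[dst] += 1
    # Rows are sorted by node id only to make the output deterministic
    return [{"nodeId": node, "degree": degree} for node, degree in sorted(counts.items())]


edges: List[Edge] = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
rows = degree_rows(edges)
print(rows)
```

The returned rows carry no edges, which mirrors why a later graph traversal needs the graph-preserving form instead.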

Example visualization (static):

Figure: GFQL 2-hop example, with 2-hop "interesting" edges rendered with plot_static().

Find Nodes 1-2 Hops Away and Label Each Hop

Example: Find nodes up to 2 hops away from node “a” and label each hop.

from graphistry import n, e_undirected

g_2_hops = g.gfql([
    n({g._node: "a"}),
    e_undirected(name="hop1"),
    e_undirected(name="hop2")
])
first_hop_edges = g_2_hops._edges[ g_2_hops._edges["hop1"] == True ]
print('Number of first-hop edges:', len(first_hop_edges))
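The idea of labeling edges by the hop that traversed them can be sketched in plain Python with a frontier expansion over undirected edges. This is a simplified model and an assumption about the semantics: label_hops is a hypothetical helper, and it ignores any pruning GFQL may apply to paths that do not complete the full chain.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def label_hops(edges: List[Edge], start: str, max_hops: int) -> Dict[int, Set[int]]:
    """Map each traversed edge index to the set of hop numbers that used it."""
    labels: Dict[int, Set[int]] = {}
    frontier = {start}
    for hop in range(1, max_hops + 1):
        next_frontier: Set[str] = set()
        for index, (u, v) in enumerate(edges):
            # Undirected: an edge can be crossed from either endpoint
            if u in frontier:
                labels.setdefault(index, set()).add(hop)
                next_frontier.add(v)
            if v in frontier:
                labels.setdefault(index, set()).add(hop)
                next_frontier.add(u)
        frontier = next_frontier
    return labels


edges: List[Edge] = [("a", "b"), ("b", "c"), ("c", "d")]
hop_labels = label_hops(edges, "a", max_hops=2)
first_hop_edges = [edges[i] for i, hops in sorted(hop_labels.items()) if 1 in hops]
print(hop_labels, first_hop_edges)
```

Because edges are undirected in this model, the a-b edge is labeled for both hops: the second hop can walk back across it.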

Filter by Date/Time

Example: Find recent transactions using temporal predicates.

from graphistry import n, e_forward
from graphistry.compute import gt, between
from datetime import datetime, date, time
import pandas as pd

# Find transactions after a specific date
recent = g.gfql([
    n(), e_forward(edge_match={"timestamp": gt(pd.Timestamp("2023-01-01"))}), n()
])

# Find transactions in a date range during business hours
business_hours_txns = g.gfql([
    n(), e_forward(edge_match={
        "date": between(date(2023, 6, 1), date(2023, 6, 30)),
        "time": between(time(9, 0), time(17, 0))
    }), n()
])

Query for Transaction Nodes Between Risky Nodes

Example: Find transaction nodes between two kinds of risky nodes.

from graphistry import n, e_forward, e_reverse

g_risky = g.gfql([
    n({"risk1": True}),
    e_forward(to_fixed_point=True),
    n({"type": "transaction"}, name="hit"),
    e_reverse(to_fixed_point=True),
    n({"risk2": True})
])
hits = g_risky._nodes[ g_risky._nodes["hit"] == True ]
print('Number of transaction hits:', len(hits))
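A fixed-point traversal keeps expanding until no new nodes appear. The plain-Python sketch below models the pattern in this example under that reading: find transactions reachable forward from a risk1 node that can also be reached, following edges forward, from a risk2 node. The reachable helper and the sample graph are hypothetical and only illustrate the semantics.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def reachable(starts: Set[str], edges: List[Edge], reverse: bool = False) -> Set[str]:
    """Nodes reachable from starts in one or more hops, expanded to a fixed point."""
    adjacency: Dict[str, Set[str]] = {}
    for src, dst in edges:
        if reverse:
            src, dst = dst, src
        adjacency.setdefault(src, set()).add(dst)

    seen: Set[str] = set()
    frontier = set(starts)
    while frontier:
        neighbors: Set[str] = set()
        for node in frontier:
            neighbors |= adjacency.get(node, set())
        # Only nodes not seen before continue the expansion, so cycles terminate
        frontier = neighbors - seen
        seen |= frontier
    return seen


edges: List[Edge] = [
    ("r1", "x"), ("x", "t1"),   # r1 reaches t1 in two hops
    ("r2", "t1"),               # r2 also reaches t1
    ("r1", "t2"),               # t2 has no risk2 ancestor
    ("r2", "y"), ("y", "t3"),   # t3 is not reachable from r1
]
risk1, risk2 = {"r1"}, {"r2"}
transactions = {"t1", "t2", "t3"}

hits = sorted(
    t
    for t in transactions & reachable(risk1, edges)
    if reachable({t}, edges, reverse=True) & risk2
)
print(hits)
```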

Filter by Multiple Node Types Using `is_in`

Example: Filter nodes and edges by multiple types.

from graphistry import n, e_forward, e_reverse, is_in

g_filtered = g.gfql([
    n({"type": is_in(["person", "company"])}),
    e_forward({"e_type": is_in(["owns", "reviews"])}, to_fixed_point=True),
    n({"type": is_in(["transaction", "account"])}, name="hit"),
    e_reverse(to_fixed_point=True),
    n({"risk2": True})
])
hits = g_filtered._nodes[ g_filtered._nodes["hit"] == True ]
print('Number of filtered hits:', len(hits))

DAG Patterns with Let Bindings

GFQL’s Let bindings enable you to compose complex graph analyses by defining named subgraphs and operations that can reference each other. Like variables in programming, Let bindings make it easy to manipulate multiple graphs and subgraphs within a single query, while maintaining all the benefits of GFQL like GPU acceleration.

Traditional Python approach (manual variable management):

from graphistry import n, e_forward, ge

# Traditional Python: Manually manage intermediate results
persons = g.gfql([n({'type': 'person'})])
adults = persons.gfql([n({'age': ge(18)})])
friends = adults.gfql([e_forward({'type': 'knows'})])
# Each step requires careful tracking of which graph to operate on

GFQL Let approach (declarative DAG with named bindings):

from graphistry import let, ref, n, e_forward, ge

# GFQL Let: Define a DAG of named operations
result = g.gfql(let({
    'persons': n({'type': 'person'}),
    'adults': ref('persons', [n({'age': ge(18)})]),  # Reference and filter persons
    'connections': [
        n({'type': 'person', 'age': ge(18)}),
        e_forward({'type': 'knows'}),
        n()  # Find connections from adults
    ]
}))

# Access any named result from the DAG
adults = result._nodes[result._nodes['adults']]
connections = result._edges[result._edges['connections']]

Key advantages of GFQL Let:

- Named subgraphs: Create reusable named graph operations like constants in code
- Dependency management: Automatically resolves dependencies between operations
- Composability: Build complex multi-stage analyses from simpler named operations
- GPU preservation: All operations maintain GPU acceleration when available
- Clean semantics: Express complex graph analyses as clear, declarative DAGs
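To illustrate what dependency resolution between named bindings involves, here is a small plain-Python sketch that evaluates bindings in topological order using the standard library's graphlib (Python 3.9+). It models the general DAG-evaluation idea only; the bindings structure and run_bindings helper are hypothetical and do not reflect how GFQL represents or executes Let bindings.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

people = [
    {"name": "Ann", "age": 30},
    {"name": "Bo", "age": 12},
    {"name": "Cy", "age": 45},
]

# Each binding: (names it depends on, function of already-computed results).
# Bindings are deliberately declared out of order.
Binding = Tuple[List[str], Callable[[Dict[str, Any]], Any]]
bindings: Dict[str, Binding] = {
    "adult_names": (["adults"], lambda env: [p["name"] for p in env["adults"]]),
    "adults": (["persons"], lambda env: [p for p in env["persons"] if p["age"] >= 18]),
    "persons": ([], lambda env: list(people)),
}


def run_bindings(bindings: Dict[str, Binding]) -> Dict[str, Any]:
    """Evaluate each binding after everything it references."""
    graph = {name: deps for name, (deps, _) in bindings.items()}
    env: Dict[str, Any] = {}
    # static_order() yields dependencies before their dependents
    for name in TopologicalSorter(graph).static_order():
        _, compute = bindings[name]
        env[name] = compute(env)
    return env


env = run_bindings(bindings)
print(list(env), env["adult_names"])
```

Declaration order does not matter in this model: the sorter places persons before adults and adults before adult_names, and a cyclic set of bindings would raise graphlib.CycleError.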

Leveraging GPU Acceleration

GFQL runs the same query on four interchangeable engines, all returning identical results:

pandas: CPU, the default.
polars: CPU columnar, up to ~38x over pandas, no GPU.
cudf: NVIDIA GPU.
polars-gpu: NVIDIA GPU.

engine='auto' resolves to cudf for cuDF input and to pandas otherwise. The polars and polars-gpu engines are explicit opt-in: auto never selects them, so a graph built on Polars frames and run with the default engine is coerced to pandas; pass engine='polars' to stay native. Neither Polars engine silently bridges to another engine: polars-gpu is GPU-or-error, and unsupported Polars/Cypher shapes are declined during validation, compilation, or planning, before execution, rather than falling back to pandas. See Choosing an Engine for the decision matrix and benchmarks.

When you use cuDF (GPU) dataframes with engine='auto', GFQL executes queries on the GPU for massive speedups.
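The engine selection rule described above can be summarized as a small decision function. The sketch below is a plain-Python model of that rule as stated in this section; resolve_engine and its frame_module argument are hypothetical and are not part of the graphistry API.

```python
EXPLICIT_ENGINES = {"pandas", "cudf", "polars", "polars-gpu"}


def resolve_engine(requested: str, frame_module: str) -> str:
    """Model of the documented rule.

    requested: the engine argument, e.g. 'auto' or 'polars'.
    frame_module: top-level module of the input dataframe, e.g. 'pandas', 'cudf', 'polars'.
    """
    if requested in EXPLICIT_ENGINES:
        # An explicit choice is honored as given
        return requested
    if requested != "auto":
        raise ValueError(f"Unknown engine: {requested!r}")
    # 'auto' only ever picks cudf or pandas; it never selects a Polars engine
    return "cudf" if frame_module == "cudf" else "pandas"


print(resolve_engine("auto", "cudf"))      # cudf
print(resolve_engine("auto", "polars"))    # pandas
print(resolve_engine("polars", "polars"))  # polars
```

The second call shows the case the text warns about: Polars input with the default engine ends up on pandas unless the Polars engine is requested explicitly.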

Automatic GPU Acceleration (cuDF)

Example: Run GFQL queries with GPU dataframes.

import cudf
import graphistry

# Load data into GPU dataframes
e_gdf = cudf.read_parquet('edges.parquet')
n_gdf = cudf.read_parquet('nodes.parquet')

# Create a graph with GPU dataframes
g_gpu = graphistry.edges(e_gdf, 'src', 'dst').nodes(n_gdf, 'id')

# Run GFQL query (executes on GPU)
g_result = g_gpu.gfql([ ... ])  # Your GFQL query here
print('Number of resulting edges:', len(g_result._edges))

Selecting an Engine Explicitly

Example: set the engine for a CPU columnar speedup or to force a specific GPU engine.

g_result = g.gfql([ ... ], engine='polars')        # CPU columnar, no GPU
g_result = g_gpu.gfql([ ... ], engine='cudf')       # NVIDIA GPU, eager
g_result = g_gpu.gfql([ ... ], engine='polars-gpu') # NVIDIA GPU, fused plan

Run Remotely

You may want to run GFQL remotely, for example when the data is already remote, such as in Hub or cloud storage, and you have faster remote GPU servers to process it.

Bind to Remote Data and Query

Example: Bind to remote data and run queries on remote GPU resources.

import graphistry
from graphistry import n, e

g = graphistry.bind(dataset_id='my-dataset-id')

nodes_df = g.gfql_remote([ n() ])._nodes

Upload Data and Run GPU Python Remotely

Example: Upload local data to a remote GPU server and run full GPU Python tasks on it.

import graphistry
from graphistry import n, e

# Fully self-contained so it can be transferred
def my_remote_trim_graph_task(g):
    # Trick: You can also put database fetch calls here!
    return (g
        .nodes(g._nodes[:10])
        .edges(g._edges[:10])
    )

# Upload any local graph data to the remote server
# (g1 is assumed to be an existing local graph)
g2 = g1.upload()
print(g2._dataset_id, g2._nodes_file_id, g2._edges_file_id)

# Compute on it remotely
g_result = g2.python_remote_g(my_remote_trim_graph_task)
print('Number of resulting edges:', len(g_result._edges))

See also python_remote_table() and python_remote_json() for returning other types of data.

Visualizing GFQL Results

GFQL integrates with PyGraphistry, allowing you to visualize your graphs with GPU-accelerated rendering.

Example: Visualize high PageRank nodes.

from graphistry import n, e

# Compute PageRank using cuGraph (GPU)
g_enriched = g_result.compute_cugraph('pagerank')

# Filter nodes with high PageRank
g_high_pagerank = g_enriched.gfql([
    n(query='pagerank > 0.1'), e(), n(query='pagerank > 0.1')
])

# Plot the subgraph
g_high_pagerank.plot()

Example visualization (graphviz):

Example visualization (interactive):

Learn More

Explore the following sections to dive deeper into GFQL’s capabilities:

10 Minutes to GFQL: A quickstart guide to get you up and running.
- 10 Minutes to GFQL
Hop & Chain Quick Reference: Learn how to chain multiple operations to build complex queries.
- GFQL Quick Reference
Predicates Quick Reference: Apply advanced filtering using predicates.
- GFQL Operator Reference

GFQL APIs

Access detailed documentation of GFQL’s API:

Chain Operations: Learn how to chain multiple operations to build complex queries.
- GFQL Chain Matcher
Hop Functions: Understand how to traverse the graph using hop functions.
- GFQL Hop Matcher
Predicates: Apply advanced filtering using predicates.
- GFQL Attribute Matchers