AI#

graphistry['ai'] provides a set of utilities for AI and machine learning workflows on graphs, with optional GPU support.

Featurize#

class graphistry.feature_utils.Embedding(df)#

Bases: object

Generates random embeddings of a given dimension, aligned with the index of the dataframe.

Parameters:: df (DataFrame)

fit(n_dim)#

Parameters:: n_dim (int)

fit_transform(n_dim)#

Parameters:: n_dim (int)

transform(ids)#

Return type:: DataFrame
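
A minimal sketch of the idea behind Embedding, not the library's implementation: one random vector per row, keyed by the same index labels as the input frame. Plain Python dicts stand in for the DataFrame, and random_embeddings and its seed argument are hypothetical names used only for illustration.

```python
import random

def random_embeddings(index, n_dim, seed=0):
    """Return {index_label: [n_dim random floats]} for every label in index."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return {label: [rng.gauss(0.0, 1.0) for _ in range(n_dim)] for label in index}

index = ["a", "b", "c"]  # stands in for df.index
emb = random_embeddings(index, n_dim=4)

# Every index label gets exactly one vector of the requested dimension.
print(list(emb) == index, {len(v) for v in emb.values()})
```

Keying the vectors by index label is what lets a later lookup by ids return rows in a predictable order.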

class graphistry.feature_utils.FastEncoder(df, y, kind='nodes')#

Bases: object

Parameters:

df (DataFrame)
y (DataFrame)

fit(src=None, dst=None, *args, **kwargs)#

fit_transform(src=None, dst=None, *args, **kwargs)#

scale(X=None, y=None, return_pipeline=False, *args, **kwargs)#

Fits new scaling functions on df, y via args-kwargs

Example:

import graphistry
from graphistry.features import SCALERS, SCALER_OPTIONS
print(SCALERS)
g = graphistry.nodes(df)
# set a scaling strategy for features and targets -- umap uses these, and its results depend on the choice.
g2 = g.umap(use_scaler='standard', use_scaler_target=None)

# later if you want to scale new data, you can do so
X, y = g2.transform(df, df, scaled=False)  # unscaled transformer output
# now scale with new settings
X_scaled, y_scaled = g2.scale(X, y, use_scaler='minmax', use_scaler_target='kbins', n_bins=5)
# fit some other pipeline
clf.fit(X_scaled, y_scaled)

args:

:X: pd.DataFrame of features
:y: pd.DataFrame of target features
:kind: str, one of 'nodes' or 'edges'
*args, **kwargs: passed to smart_scaler pipeline

returns:: scaled X, y
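
To make the scaler names concrete, here is a hedged, standard-library sketch of what 'standard' and 'minmax' scaling do to a single numeric column. It assumes these names follow the usual scikit-learn definitions (StandardScaler and MinMaxScaler); standard_scale and minmax_scale are illustrative helpers, not part of graphistry.

```python
from statistics import mean, pstdev

def standard_scale(values):
    """Shift to zero mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def minmax_scale(values):
    """Map the smallest value to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

column = [1.0, 2.0, 3.0]
standard = standard_scale(column)
minmax = minmax_scale(column)
print(standard, minmax)
```

Because the two scalers produce differently distributed features, anything fitted downstream (such as UMAP) can give different results depending on which one is chosen.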

transform(df, ydf=None)#

Raw transform, no scaling.

Parameters:

df (DataFrame)
ydf (DataFrame | None)

transform_scaled(df, ydf=None, scaling_pipeline=None, scaling_pipeline_target=None)#

class graphistry.feature_utils.FastMLB(mlb, in_column, out_columns)#

Bases: object

fit(X, y=None)#

get_feature_names_in()#

get_feature_names_out()#

transform(df)#

class graphistry.feature_utils.FeatureMixin(*a, **kw)#

Bases: ComputeMixin

FeatureMixin for automatic featurization of nodes and edges DataFrames. Subclasses UMAPMixin for umap-ing of automatic features.

Usage:

g = graphistry.nodes(df, 'node_column')
g2 = g.featurize()

or for edges,

g = graphistry.edges(df, 'src', 'dst')
g2 = g.featurize(kind='edges')

or chain them for both nodes and edges,

g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node_column')
g2 = g.featurize().featurize(kind='edges')
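
For kind='edges', the featurize() documentation describes a pairwise src-to-dst feature block built with a MultiLabelBinarizer. The sketch below shows the general shape of such a block under that assumption: one row per edge, one column per node, with ones at the source and destination. pairwise_edge_block is a hypothetical helper, and the library's actual column naming and ordering may differ.

```python
def pairwise_edge_block(edges):
    """Multi-hot encode each (src, dst) pair over the sorted node vocabulary."""
    vocab = sorted({node for edge in edges for node in edge})
    col = {node: i for i, node in enumerate(vocab)}
    rows = []
    for src, dst in edges:
        row = [0] * len(vocab)
        row[col[src]] = 1  # mark the source node
        row[col[dst]] = 1  # mark the destination node
        rows.append(row)
    return vocab, rows

edges = [("a", "b"), ("b", "c"), ("c", "a")]
vocab, block = pairwise_edge_block(edges)
for edge, row in zip(edges, block):
    print(edge, row)
```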

DGL_graph: Any | None#

addStyle(fg=None, bg=None, page=None, logo=None)#

Parameters:

fg (Dict[str, Any] | None)
bg (Dict[str, Any] | None)
page (Dict[str, Any] | None)
logo (Dict[str, Any] | None)

Return type:

Plottable

base_url_client(v=None)#

Parameters:: v (str | None)
Return type:: str

base_url_server(v=None)#

Parameters:: v (str | None)
Return type:: str

bind(source=None, destination=None, node=None, edge=None, edge_title=None, edge_label=None, edge_color=None, edge_weight=None, edge_size=None, edge_opacity=None, edge_icon=None, edge_source_color=None, edge_destination_color=None, point_title=None, point_label=None, point_color=None, point_weight=None, point_size=None, point_opacity=None, point_icon=None, point_x=None, point_y=None, point_longitude=None, point_latitude=None, dataset_id=None, url=None, nodes_file_id=None, edges_file_id=None, schema=None)#

Parameters:

source (str | None)
destination (str | None)
node (str | None)
edge (str | None)
edge_title (str | None)
edge_label (str | None)
edge_color (str | None)
edge_weight (str | None)
edge_size (str | None)
edge_opacity (str | None)
edge_icon (str | None)
edge_source_color (str | None)
edge_destination_color (str | None)
point_title (str | None)
point_label (str | None)
point_color (str | None)
point_weight (str | None)
point_size (str | None)
point_opacity (str | None)
point_icon (str | None)
point_x (str | None)
point_y (str | None)
point_longitude (str | None)
point_latitude (str | None)
dataset_id (str | None)
url (str | None)
nodes_file_id (str | None)
edges_file_id (str | None)
schema (Any | None)

Return type:

Plottable

chain(*args, **kwargs)#

Deprecated since version 2.XX.X: Use gfql() instead for a unified API that supports both chains and DAGs.

Chain a list of ASTObject (node/edge) traversal operations

Return the subgraph of matches according to the list of node & edge matchers. If any matchers are named, add a correspondingly named boolean-valued column to the output.

For direct calls, exposes convenience List[ASTObject]. Internal operations should prefer Chain.

Use engine='cudf' to force automatic GPU acceleration mode

Parameters:

ops – List[ASTObject] Various node and edge matchers
validate_schema – Whether to validate the chain against the graph schema before executing
policy – Optional policy dict for hooks
context – Optional ExecutionContext for tracking execution state
start_nodes – Optional node wavefront for the first traversal step

Returns:

Plotter

Return type:

Plotter

chain_remote(*args, **kwargs)#

Deprecated since version 2.XX.X: Use gfql_remote() instead for a unified API that supports both chains and DAGs.

Remotely run GFQL chain query on a remote dataset.

Uses the latest bound _dataset_id, and uploads current dataset if not already bound. Note that rebinding calls of edges() and nodes() reset the _dataset_id binding.

Parameters:

chain (Union[Chain, List[ASTObject], Dict[str, JSONVal], ASTLet, str]) – GFQL query as a Python object, serialized GFQL JSON, or Cypher string
api_token (Optional[str]) – Optional JWT token. If not provided, refreshes JWT and uses that.
dataset_id (Optional[str]) – Optional dataset_id. If not provided, falls back to self._dataset_id. If that is also unset, uploads the current data, stores the resulting dataset_id, and runs GFQL against it.
output_type (OutputType) – Whether to return nodes and edges ("all", default), a Plottable with just nodes ("nodes"), or a Plottable with just edges ("edges"). For just a dataframe of the resultant graph shape (output_type="shape"), use chain_remote_shape() instead.
format (Optional[FormatType]) – Format in which to fetch results. A columnar format such as parquet is recommended, and is the default when output_type is not shape.
df_export_args (Optional[Dict[str, Any]]) – Any additional parameters to pass in when the server parses data.
node_col_subset (Optional[List[str]]) – When the server returns nodes, which property subset to return. Defaults to all.
edge_col_subset (Optional[List[str]]) – When the server returns edges, which property subset to return. Defaults to all.
engine (EngineAbstractType) – Override which run mode GFQL uses. Defaults to 'auto', which auto-detects based on DataFrame type. Also accepts 'pandas' or 'cudf'.
validate (bool) – Whether to locally test code and, if uploading data, the data. Default true.
persist (bool) – Whether to persist the dataset on the server and return dataset_id for immediate URL generation. Default false.

Example: Explicitly upload graph and return subgraph where nodes have at least one edge

import pandas
import graphistry
from graphistry import n, e
es = pandas.DataFrame({'src': [0,1,2], 'dst': [1,2,0]})
g1 = graphistry.edges(es, 'src', 'dst').upload()
assert g1._dataset_id, "Graph should have uploaded"

g2 = g1.chain_remote([n(), e(), n()])
print(f'dataset id: {g2._dataset_id}, # nodes: {len(g2._nodes)}')

Example: Return subgraph where nodes have at least one edge, with implicit upload

import pandas
import graphistry
from graphistry import n, e
es = pandas.DataFrame({'src': [0,1,2], 'dst': [1,2,0]})
g1 = graphistry.edges(es, 'src', 'dst')
g2 = g1.chain_remote([n(), e(), n()])
print(f'dataset id: {g2._dataset_id}, # nodes: {len(g2._nodes)}')

Example: Return subgraph where nodes have at least one edge, with implicit upload, and force GPU mode

import pandas
import graphistry
from graphistry import n, e
es = pandas.DataFrame({'src': [0,1,2], 'dst': [1,2,0]})
g1 = graphistry.edges(es, 'src', 'dst')
g2 = g1.chain_remote([n(), e(), n()], engine='cudf')
print(f'dataset id: {g2._dataset_id}, # nodes: {len(g2._nodes)}')

Return type:: Plottable

chain_remote_shape(*args, **kwargs)#

Deprecated since version 2.XX.X: Use gfql_remote_shape() instead for a unified API that supports both chains and DAGs.

Like chain_remote(), except instead of returning a Plottable, returns a pd.DataFrame of the shape of the resulting graph.

Useful as a fast success indicator: when a match finds hits, it returns just the metadata and avoids the need to return a full graph.

Example: Upload graph and compute number of nodes with at least one edge

import pandas
import graphistry
from graphistry import n, e
es = pandas.DataFrame({'src': [0,1,2], 'dst': [1,2,0]})
g1 = graphistry.edges(es, 'src', 'dst').upload()
assert g1._dataset_id, "Graph should have uploaded"

shape_df = g1.chain_remote_shape([n(), e(), n()])
print(shape_df)

Example: Compute number of nodes with at least one edge, with implicit upload, and force GPU mode

import pandas
import graphistry
from graphistry import n, e
es = pandas.DataFrame({'src': [0,1,2], 'dst': [1,2,0]})
g1 = graphistry.edges(es, 'src', 'dst')

shape_df = g1.chain_remote_shape([n(), e(), n()], engine='cudf')
print(shape_df)

Return type:: DataFrame

client_protocol_hostname(v=None)#

Parameters:: v (str | None)
Return type:: str

collapse(node, attribute, column, self_edges=False, unwrap=False, verbose=False)#

Topology-aware collapse by given column attribute starting at node

Traverses directed graph from start node node and collapses clusters of nodes that share the same property so that topology is preserved.

Parameters:

node (str | int) – start node to begin traversal
attribute (str | int) – the given attribute to collapse over within column
column (str | int) – the column of nodes DataFrame that contains attribute to collapse over
self_edges (bool) – whether to include self edges in the collapsed graph
unwrap (bool) – whether to unwrap the collapsed graph into a single node
verbose (bool) – whether to print out collapse summary information

Returns:

A new Graphistry instance with nodes and edges DataFrames containing the collapsed nodes and edges given by column attribute. The nodes and edges DataFrames contain six new columns, collapse_{node | edges} and final_{node | edges}, while the original (node, src, dst) columns are left untouched.

Return type:

Plottable

collections(collections=None, show_collections=None, collections_global_node_color=None, collections_global_edge_color=None, validate='autofix', warn=True)#

Parameters:

collections (str | CollectionSet | CollectionIntersection | List[CollectionSet | CollectionIntersection] | None)
show_collections (bool | None)
collections_global_node_color (str | None)
collections_global_edge_color (str | None)
validate (Literal['strict', 'strict-fast', 'autofix'] | bool)
warn (bool)

Return type:

Plottable

compute_cugraph(alg, out_col=None, params={}, kind='Graph', directed=True, G=None)#

Parameters:

alg (str)
out_col (str | None)
params (dict)
kind (Literal['Graph', 'MultiGraph', 'BiPartiteGraph'])
G (Any | None)

Return type:

Plottable

compute_igraph(alg, out_col=None, directed=None, use_vids=False, params={}, stringify_rich_types=True)#

Parameters:

alg (str)
out_col (str | None)
directed (bool | None)
use_vids (bool)
params (dict)
stringify_rich_types (bool)

Return type:

Plottable

compute_networkx(alg, out_col=None, params=None, directed=True, G=None)#

Parameters:

alg (str)
out_col (str | None)
params (Dict[str, Any] | None)
directed (bool)
G (Any | None)

Return type:

Plottable

copy()#

Return type:: Plottable

description(description)#

Parameters:: description (str)
Return type:: Plottable

drop_nodes(nodes)#: return g with any nodes/edges involving the node id series removed
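
As a rough illustration of the semantics described for drop_nodes, the sketch below removes a set of node ids together with every edge that touches them. It works on plain lists rather than DataFrames, and the standalone drop_nodes function here is a hypothetical stand-in, not the library method.

```python
def drop_nodes(node_ids, edges, to_drop):
    """Remove the given ids from the node list and every edge that touches them."""
    drop = set(to_drop)
    kept_nodes = [n for n in node_ids if n not in drop]
    kept_edges = [(s, d) for s, d in edges if s not in drop and d not in drop]
    return kept_nodes, kept_edges

nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a")]
kept_nodes, kept_edges = drop_nodes(nodes, edges, ["b"])
print(kept_nodes, kept_edges)
```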

edges(edges, source=None, destination=None, edge=None, *args, **kwargs)#

Parameters:

edges (Callable | Any)
source (str | None)
destination (str | None)
edge (str | None)
args (Any)
kwargs (Any)

Return type:

Plottable

embed(relation, proto='DistMult', embedding_dim=32, use_feat=False, X=None, epochs=2, batch_size=32, train_split=0.8, sample_size=1000, num_steps=50, lr=0.01, inplace=False, device='cpu', evaluate=True, *args, **kwargs)#

Parameters:

relation (str)
proto (str | Callable[[Any, Any, Any], Any] | None)
embedding_dim (int)
use_feat (bool)
X (DataFrame | np.ndarray | List[str] | None)
epochs (int)
batch_size (int)
train_split (float | int)
sample_size (int)
num_steps (int)
lr (float)
inplace (bool | None)
device (str | None)
evaluate (bool)

Return type:

Plottable

encode_axis(rows=[])#

Parameters:: rows (List[Dict])
Return type:: Plottable

encode_edge_badge(column, position='TopRight', categorical_mapping=Ellipsis, continuous_binning=Ellipsis, default_mapping=Ellipsis, comparator=Ellipsis, color=Ellipsis, bg=Ellipsis, fg=Ellipsis, for_current=False, for_default=True, as_text=Ellipsis, blend_mode=Ellipsis, style=Ellipsis, border=Ellipsis, shape=Ellipsis)#

Parameters:

column (str)
position (str)
categorical_mapping (Dict[Any, Any] | None)
continuous_binning (List[Any] | None)
default_mapping (Any | None)
comparator (Callable[[Any, Any], int] | None)
color (str | None)
bg (str | None)
fg (str | None)
for_current (bool)
for_default (bool)
as_text (bool | None)
blend_mode (str | None)
style (Dict[str, Any] | None)
border (Dict[str, Any] | None)
shape (str | None)

Return type:

Plottable

encode_edge_color(column, palette=Ellipsis, as_categorical=Ellipsis, as_continuous=Ellipsis, categorical_mapping=Ellipsis, default_mapping=Ellipsis, for_default=True, for_current=False)#

Parameters:

column (str)
palette (List[str] | None)
as_categorical (bool | None)
as_continuous (bool | None)
categorical_mapping (Dict[Any, Any] | None)
default_mapping (str | None)
for_default (bool)
for_current (bool)

Return type:

Plottable

encode_edge_icon(column, categorical_mapping=Ellipsis, continuous_binning=Ellipsis, default_mapping=Ellipsis, comparator=Ellipsis, for_default=True, for_current=False, as_text=False, blend_mode=Ellipsis, style=Ellipsis, border=Ellipsis, shape=Ellipsis)#

Parameters:

column (str)
categorical_mapping (Dict[Any, str] | None)
continuous_binning (List[Any] | None)
default_mapping (str | None)
comparator (Callable[[Any, Any], int] | None)
for_default (bool)
for_current (bool)
as_text (bool)
blend_mode (str | None)
style (Dict[str, Any] | None)
border (Dict[str, Any] | None)
shape (str | None)

Return type:

Plottable

encode_edge_label(*args, **kwargs)#

Parameters:

args (Any)
kwargs (Any)

Return type:

Plottable

encode_edge_opacity(*args, **kwargs)#

Parameters:

args (Any)
kwargs (Any)

Return type:

Plottable

encode_edge_size(*args, **kwargs)#

Parameters:

args (Any)
kwargs (Any)

Return type:

Plottable

encode_edge_title(*args, **kwargs)#

Parameters:

args (Any)
kwargs (Any)

Return type:

Plottable

encode_edge_weight(*args, **kwargs)#

Parameters:

args (Any)
kwargs (Any)

Return type:

Plottable

encode_point_badge(column, position='TopRight', categorical_mapping=Ellipsis, continuous_binning=Ellipsis, default_mapping=Ellipsis, comparator=Ellipsis, color=Ellipsis, bg=Ellipsis, fg=Ellipsis, for_current=False, for_default=True, as_text=Ellipsis, blend_mode=Ellipsis, style=Ellipsis, border=Ellipsis, shape=Ellipsis)#

Parameters:

column (str)
position (str)
categorical_mapping (Dict[Any, Any] | None)
continuous_binning (List[Any] | None)
default_mapping (Any | None)
comparator (Callable[[Any, Any], int] | None)
color (str | None)
bg (str | None)
fg (str | None)
for_current (bool)
for_default (bool)
as_text (bool | None)
blend_mode (str | None)
style (Dict[str, Any] | None)
border (Dict[str, Any] | None)
shape (str | None)

Return type:

Plottable

encode_point_color(column, palette=Ellipsis, as_categorical=Ellipsis, as_continuous=Ellipsis, categorical_mapping=Ellipsis, default_mapping=Ellipsis, for_default=True, for_current=False)#

Parameters:

column (str)
palette (List[str] | None)
as_categorical (bool | None)
as_continuous (bool | None)
categorical_mapping (Dict[Any, Any] | None)
default_mapping (str | None)
for_default (bool)
for_current (bool)

Return type:

Plottable

encode_point_icon(column, categorical_mapping=Ellipsis, continuous_binning=Ellipsis, default_mapping=Ellipsis, comparator=Ellipsis, for_default=True, for_current=False, as_text=False, blend_mode=Ellipsis, style=Ellipsis, border=Ellipsis, shape=Ellipsis)#

Parameters:

column (str)
categorical_mapping (Dict[Any, str] | None)
continuous_binning (List[Any] | None)
default_mapping (str | None)
comparator (Callable[[Any, Any], int] | None)
for_default (bool)
for_current (bool)
as_text (bool)
blend_mode (str | None)
style (Dict[str, Any] | None)
border (Dict[str, Any] | None)
shape (str | None)

Return type:

Plottable

encode_point_label(*args, **kwargs)#

Parameters:

args (Any)
kwargs (Any)

Return type:

Plottable

encode_point_opacity(*args, **kwargs)#

Parameters:

args (Any)
kwargs (Any)

Return type:

Plottable

encode_point_size(column, categorical_mapping=Ellipsis, default_mapping=Ellipsis, for_default=True, for_current=False)#

Parameters:

column (str)
categorical_mapping (Dict[Any, int | float] | None)
default_mapping (int | float | None)
for_default (bool)
for_current (bool)

Return type:

Plottable

encode_point_title(*args, **kwargs)#

Parameters:

args (Any)
kwargs (Any)

Return type:

Plottable

fa2_layout(fa2_params=None, circle_layout_params=None, singleton_layout=None, partition_key=None, engine='auto', allow_cpu_fallback=False)#

Parameters:

fa2_params (Dict[str, Any] | None)
circle_layout_params (Dict[str, Any] | None)
singleton_layout (Callable[[Plottable, Tuple[float, float, float, float] | Any], Plottable] | None)
partition_key (str | None)
engine (EngineAbstract | Literal['pandas', 'cudf', 'dask', 'dask_cudf', 'auto'])
allow_cpu_fallback (bool)

Return type:

Plottable

featurize(kind='nodes', X=None, y=None, use_scaler=None, use_scaler_target=None, cardinality_threshold=40, cardinality_threshold_target=400, n_topics=42, n_topics_target=12, multilabel=False, embedding=False, use_ngrams=False, ngram_range=(1, 3), max_df=0.2, min_df=3, min_words=4.5, model_name='paraphrase-MiniLM-L6-v2', impute=True, n_quantiles=100, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', similarity=None, categories='auto', keep_n_decimals=5, remove_node_column=True, inplace=False, feature_engine='auto', dbscan=False, min_dist=0.5, min_samples=1, memoize=True, verbose=False)#

Featurize Nodes or Edges of the underlying nodes/edges DataFrames.

Parameters:

kind (str) – specify whether to featurize nodes or edges. Edge featurization includes a pairwise src-to-dst feature block using a MultiLabelBinarizer, with any other columns being treated the same way as with nodes featurization.
X (List[str] | str | DataFrame | None) – Optional input, default None. If symbolic, evaluated against self data based on kind. If None, will featurize all columns of DataFrame
y (List[str] | str | DataFrame | None) – Optional Target(s) columns or explicit DataFrame, default None
use_scaler (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'] | None) – selects which scaler to use to scale the data (missing values are imputed automatically using the mean strategy). See the scikit-learn documentation at https://scikit-learn.org/stable/modules/preprocessing.html. Here 'standard' corresponds to 'StandardScaler' in scikit-learn.
use_scaler_target (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'] | None) – selects which scaler to scale the target
cardinality_threshold (int) – skrub threshold on cardinality of categorical labels across columns. If the value is greater than the threshold, will run GapEncoder (a topic model) on the column. If below, will one-hot encode. Default 40.
cardinality_threshold_target (int) – similar to cardinality_threshold, but for target features. Default is set high (400), as targets generally want to be one-hot encoded, but sometimes it can be useful to use GapEncoder (ie, set the threshold lower) to create regressive targets, especially when those targets are textual/softly categorical and have semantic meaning across different labels. Eg, suppose a column has fields like ['Application Fraud', 'Other Statuses', 'Lost/Stolen Fraud', 'Investigation Fraud', …]: the GapEncoder will concentrate the 'Fraud' labels together.
n_topics (int) – the number of topics to use in the GapEncoder if cardinality_thresholds is saturated. Default is 42, but good rule of thumb is to consult the Johnson-Lindenstrauss Lemma https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma or use the simplified random walk estimate => n_topics_lower_bound ~ (pi/2) * (N-documents)**(1/4)
n_topics_target (int) – the number of topics to use in the GapEncoder if cardinality_thresholds_target is saturated for the target(s). Default 12.
min_words (float) – sets threshold on how many words to consider in a textual column if it is to be considered in the text processing pipeline. Set this very high if you want any textual columns to bypass the transformer, in favor of GapEncoder (topic modeling). Set to 0 to force all named columns to be encoded as textual (embedding)
model_name (str) – Sentence Transformer model to use. Default Paraphrase model makes useful vectors, but at cost of encoding time. If faster encoding is needed, average_word_embeddings_komninos is useful and produces less semantically relevant vectors. Please see sentence_transformer (https://www.sbert.net/) library for all available models.
multilabel (bool) – if True, will encode a single target column composed of lists of lists as multilabel outputs. This only works with y=[‘a_single_col’], default False
embedding (bool) – If True, produces a random node embedding of size n_topics default, False. If no node features are provided, will produce random embeddings (for GNN models, for example)
use_ngrams (bool) – If True, will encode textual columns as TfIdf Vectors, default, False.
ngram_range (tuple) – if use_ngrams=True, can set ngram_range, eg: tuple = (1, 3)
max_df (float) – if use_ngrams=True, set max word frequency to consider in vocabulary eg: max_df = 0.2,
min_df (int) – if use_ngrams=True, set min word count to consider in vocabulary eg: min_df = 3 or 0.00001
categories (str | None) – Optional[str] in [“auto”, “k-means”, “most_frequent”], decides which category to select in Similarity Encoding, default ‘auto’
impute (bool) – Whether to impute missing values, default True
n_quantiles (int) – if use_scaler = 'quantile', sets the number of quantiles, default 100.
output_distribution (str) – if use_scaler = 'quantile', selects the output distribution from ["normal", "uniform"].
quantile_range – if use_scaler = 'robust'|'quantile', sets the quantile range.
n_bins (int) – number of bins to use in kbins discretizer, default 10
encode (str) – encoding for KBinsDiscretizer, can be one of onehot, onehot-dense, ordinal, default 'ordinal'
strategy (str) – strategy for KBinsDiscretizer, can be one of uniform, quantile, kmeans, default 'uniform'
dbscan (bool) – whether to run DBSCAN, default False.
min_dist (float) – DBSCAN eps parameter, default 0.5.
min_samples (int) – DBSCAN min_samples parameter, default 1.
keep_n_decimals (int) – number of decimals to keep
remove_node_column (bool) – whether to remove node column so it is not featurized, default True.
inplace (bool) – whether to modify the current graphistry instance in place instead of returning a new one, default False.
memoize (bool) – whether to store and reuse results across runs, default True.
similarity (str | None)
feature_engine (Literal['none', 'pandas', 'skrub', 'torch', 'dirty_cat', 'auto'])
verbose (bool)

Returns:

graphistry instance with new attributes set by the featurization process.
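
Two of the parameter rules above are simple enough to sketch in plain Python: the random walk estimate for n_topics, (pi/2) * N**(1/4), and the cardinality threshold that decides between one-hot encoding and GapEncoder. Both helpers are hypothetical illustrations. Rounding the estimate up, and treating a cardinality exactly equal to the threshold as one-hot, are assumptions that the text does not specify.

```python
import math

def n_topics_lower_bound(n_documents):
    """Random walk estimate from the n_topics description: (pi/2) * N**(1/4)."""
    return math.ceil((math.pi / 2) * n_documents ** 0.25)  # rounding up is an assumption

def choose_encoder(n_unique, cardinality_threshold=40):
    """Above the threshold use the topic model, otherwise one-hot encode."""
    return "GapEncoder" if n_unique > cardinality_threshold else "one-hot"

estimate = n_topics_lower_bound(10_000)
low_card = choose_encoder(12)
high_card = choose_encoder(5_000)
print(estimate, low_card, high_card)
```

For 10,000 documents the estimate comes out well below the default n_topics of 42, which suggests the default leaves headroom for moderately sized datasets.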

featurize_or_get_edges_dataframe_if_X_is_None(X=None, y=None, use_scaler=None, use_scaler_target=None, cardinality_threshold=40, cardinality_threshold_target=400, n_topics=42, n_topics_target=7, multilabel=False, use_ngrams=False, ngram_range=(1, 3), max_df=0.2, min_df=3, min_words=2.5, model_name='paraphrase-MiniLM-L6-v2', similarity=None, categories='auto', impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', keep_n_decimals=5, feature_engine='pandas', reuse_if_existing=False, memoize=True, verbose=False)#

helper method gets edge feature and target matrix if X, y are not specified

Parameters:

X (List[str] | str | DataFrame | None) – Data Matrix
y (List[str] | str | DataFrame | None) – target, default None
use_scaler (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'] | None)
use_scaler_target (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'] | None)
cardinality_threshold (int)
cardinality_threshold_target (int)
n_topics (int)
n_topics_target (int)
multilabel (bool)
use_ngrams (bool)
ngram_range (tuple)
max_df (float)
min_df (int)
min_words (float)
model_name (str)
similarity (str | None)
categories (str | None)
impute (bool)
n_quantiles (int)
output_distribution (str)
n_bins (int)
encode (str)
strategy (str)
keep_n_decimals (int)
feature_engine (Literal['none', 'pandas', 'skrub', 'torch'])
memoize (bool)
verbose (bool)

Returns:

data X and y

Return type:

Tuple[DataFrame, DataFrame | None, object]

featurize_or_get_nodes_dataframe_if_X_is_None(X=None, y=None, use_scaler=None, use_scaler_target=None, cardinality_threshold=40, cardinality_threshold_target=400, n_topics=42, n_topics_target=7, multilabel=False, embedding=False, use_ngrams=False, ngram_range=(1, 3), max_df=0.2, min_df=3, min_words=2.5, model_name='paraphrase-MiniLM-L6-v2', similarity=None, categories='auto', impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', keep_n_decimals=5, remove_node_column=True, feature_engine='pandas', reuse_if_existing=False, memoize=True, verbose=False)#

Helper method that gets the node feature and target matrices if X, y are not specified. If X, y are specified, sets them as the _node_features and _node_target attributes.

Parameters:

X (List[str] | str | DataFrame | None)
y (List[str] | str | DataFrame | None)
use_scaler (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'] | None)
use_scaler_target (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'] | None)
cardinality_threshold (int)
cardinality_threshold_target (int)
n_topics (int)
n_topics_target (int)
multilabel (bool)
embedding (bool)
use_ngrams (bool)
ngram_range (tuple)
max_df (float)
min_df (int)
min_words (float)
model_name (str)
similarity (str | None)
categories (str | None)
impute (bool)
n_quantiles (int)
output_distribution (str)
n_bins (int)
encode (str)
strategy (str)
keep_n_decimals (int)
remove_node_column (bool)
feature_engine (Literal['none', 'pandas', 'skrub', 'torch'])
memoize (bool)
verbose (bool)

Return type:

Tuple[DataFrame, DataFrame, object]

filter_edges_by_dict(*args, **kwargs)#: filter edges to those that match all values in filter_dict

filter_nodes_by_dict(*args, **kwargs)#: filter nodes to those that match all values in filter_dict
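
The "match all values in filter_dict" rule can be pictured as a conjunction of equality tests, one per key. The sketch below applies it to a list of row dictionaries. filter_by_dict is a hypothetical helper for illustration, and the library methods operate on the graph's DataFrames instead.

```python
def filter_by_dict(rows, filter_dict):
    """Keep rows whose values equal every key/value pair in filter_dict."""
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in filter_dict.items())
    ]

nodes = [
    {"id": "a", "type": "user", "country": "US"},
    {"id": "b", "type": "user", "country": "FR"},
    {"id": "c", "type": "device", "country": "US"},
]
matched = filter_by_dict(nodes, {"type": "user", "country": "US"})
print([row["id"] for row in matched])
```

Note that in this sketch an empty filter_dict matches every row, since there are no conditions to fail.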

filter_weighted_edges(scale=1.0, index_to_nodes_dict=None, inplace=False, kind='nodes')#

Parameters:

scale (float)
index_to_nodes_dict (Dict | None)
inplace (bool)
kind (Literal['nodes', 'edges'])

Return type:

Plottable | None

from_cugraph(G, node_attributes=None, edge_attributes=None, load_nodes=True, load_edges=True, merge_if_existing=True)#

Parameters:

node_attributes (List[str] | None)
edge_attributes (List[str] | None)
load_nodes (bool)
load_edges (bool)
merge_if_existing (bool)

Return type:

Plottable

from_igraph(ig, node_attributes=None, edge_attributes=None, load_nodes=True, load_edges=True, merge_if_existing=True)#

Parameters:

ig (Any)
node_attributes (List[str] | None)
edge_attributes (List[str] | None)
load_nodes (bool)
load_edges (bool)
merge_if_existing (bool)

Return type:

Plottable

from_networkx(G)#

Parameters:: G (Any)
Return type:: Plottable

get_degrees(col='degree', degree_in='degree_in', degree_out='degree_out')#

Decorate nodes table with degree info

Edges must be dataframe-like: pandas, cudf, …

Parameters determine generated column names

Warning: Self-cycles are currently double-counted. This may change.

Example: Generate degree columns

edges = pd.DataFrame({'s': ['a','b','c','d'], 'd': ['c','c','e','e']})
g = graphistry.edges(edges, 's', 'd')
print(g._nodes)  # None
g2 = g.get_degrees()
print(g2._nodes)  # pd.DataFrame with 'id', 'degree', 'degree_in', 'degree_out'

Parameters:

col (str)
degree_in (str)
degree_out (str)
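
For intuition, the degree columns can be reproduced from an edge list with two counters. This is a standard-library sketch, not the library's DataFrame implementation, and it assumes degree is the sum of in-degree and out-degree, which is consistent with the warning that self-cycles are double-counted.

```python
from collections import Counter

def degrees(edges):
    """Return {node: (degree, degree_in, degree_out)} from (src, dst) pairs."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    nodes = sorted(set(out_deg) | set(in_deg))
    return {n: (in_deg[n] + out_deg[n], in_deg[n], out_deg[n]) for n in nodes}

# Same edges as the example above: s -> d
edges = list(zip(["a", "b", "c", "d"], ["c", "c", "e", "e"]))
deg = degrees(edges)
loop = degrees([("x", "x")])  # a self-cycle counts once as in and once as out
print(deg, loop)
```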

get_indegrees(col='degree_in')#

See get_degrees

Parameters:: col (str)

get_matrix(columns=None, kind='nodes', target=False)#

Returns the feature matrix. If columns are specified, returns a matrix with only the columns that contain the string column_part in their name. X = g.get_matrix(['feature1', 'feature2']) will retrieve a feature matrix with only the columns that contain the string feature1 or feature2 in their name. Most useful for topic modeling, where the column names are of the form topic_0: descriptor, topic_1: descriptor, etc. Can retrieve unique columns in the original dataframe, or actual topic features like [ip_part, shoes, preference_x, etc]. A powerful way to retrieve features from a featurized graph by column or (top) features of interest.

Example:

# get the full feature matrices
X = g.get_matrix()
y = g.get_matrix(target=True)

# get subset of features, or topics, given topic model encoding
X = g2.get_matrix(['172', 'percent'])
X.columns
    => ['ip_172.56.104.67', 'ip_172.58.129.252', 'item_percent']
# or in targets
y = g2.get_matrix(['total', 'percent'], target=True)
y.columns
    => ['basket_price_total', 'conversion_percent', 'CTR_percent', 'CVR_percent']

# not as useful for sbert features. 

Caveats:

if you have a column name that is a substring of another column name, you may get unexpected results.

Args:

columns (Union[List, str]):: list of column names or a single column name that may exist in columns of the feature matrix. If None, returns original feature matrix
kind (str, optional):: Node or Edge features. Defaults to ‘nodes’.
target (bool, optional):: If True, returns the target matrix. Defaults to False.

Returns:

pd.DataFrame: feature matrix with only the columns that contain the string column_part in their name.

Parameters:

columns (List | str | None)
kind (Literal['nodes', 'edges'])
target (bool)

Return type:

DataFrame
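
The substring matching described above, including the caveat about overlapping names, can be sketched on a plain list of column names. select_columns is a hypothetical helper, and the column names are made up for illustration.

```python
def select_columns(columns, parts=None):
    """Return the column names that contain any of the given substrings."""
    if parts is None:
        return list(columns)  # no filter: keep everything
    if isinstance(parts, str):
        parts = [parts]
    return [c for c in columns if any(p in c for p in parts)]

columns = ["ip_172.56.104.67", "ip_172.58.129.252", "item_percent", "price", "price_total"]
picked = select_columns(columns, ["172", "percent"])
overlap = select_columns(columns, "price")  # also matches 'price_total'
print(picked, overlap)
```

The second call shows the caveat in practice: asking for 'price' also returns 'price_total', because the match is by substring rather than by exact name.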

get_outdegrees(col='degree_out')#

See get_degrees

Parameters:: col (str)

get_topological_levels(level_col='level', allow_cycles=True, warn_cycles=True, remove_self_loops=True)#

Label nodes on column level_col based on topological sort depth. Supports pandas + cudf, using parallelism within each level computation.

Options:

allow_cycles: if False and detects a cycle, throw ValueException, else break cycle by picking a lowest-in-degree node
warn_cycles: if True and detects a cycle, proceed with a warning
remove_self_loops: preprocess by removing self-cycles. Avoids allow_cycles=False, warn_cycles=True messages.

Example:

edges_df = pd.DataFrame({'s': ['a', 'b', 'c', 'd'], 'd': ['b', 'c', 'e', 'e']})
g = graphistry.edges(edges_df, 's', 'd')
g2 = g.get_topological_levels()
g2._nodes.info()  # pd.DataFrame with 'id', 'level'

Parameters:

level_col (str)
allow_cycles (bool)
warn_cycles (bool)
remove_self_loops (bool)

Return type:

Plottable
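
One way to picture level labelling is Kahn-style peeling: repeatedly remove every node whose in-degree is zero and record the round in which it was removed. The sketch below is a standard-library illustration of that idea, not the library's implementation. It drops self-loops, and unlike the allow_cycles=True behaviour it simply raises on any remaining cycle.

```python
from collections import defaultdict

def topological_levels(edges):
    """Label each node with the round in which Kahn-style peeling removes it."""
    succ = defaultdict(list)
    in_deg = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        if src == dst:
            continue  # drop self-loops, mirroring remove_self_loops=True
        succ[src].append(dst)
        in_deg[dst] += 1
        nodes.update((src, dst))
    levels = {}
    frontier = sorted(n for n in nodes if in_deg[n] == 0)
    level = 0
    while frontier:
        next_frontier = []
        for node in frontier:
            levels[node] = level
            for child in succ[node]:
                in_deg[child] -= 1
                if in_deg[child] == 0:
                    next_frontier.append(child)
        frontier = sorted(next_frontier)
        level += 1
    if len(levels) != len(nodes):
        raise ValueError("cycle detected; this sketch does not break cycles")
    return levels

# Same edges as the example above: a -> b -> c -> e and d -> e
edges = list(zip(["a", "b", "c", "d"], ["b", "c", "e", "e"]))
levels = topological_levels(edges)
print(levels)
```

All nodes within one round are independent of each other, which is why each level can be computed in parallel.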

gfql(*args, **kwargs)#

Execute a GFQL query - either a chain or a DAG

Unified entrypoint that automatically detects query type and dispatches to the appropriate execution engine.

Parameters:

query – GFQL query - ASTObject, List[ASTObject], Chain, ASTLet, dict, or supported query string
engine – Execution engine (auto, pandas, cudf)
output – For DAGs, name of binding to return (default: last executed)
policy – Optional policy hooks for external control (preload, postload, precall, postcall phases)
where – Optional same-path constraints for list/Chain queries
language – Optional string-query language selector. Defaults to "cypher" when query is a string.
params – Optional parameter dictionary for string-query compilation
validate – When True, run local preflight validation before execution via g.gfql_validate(...).
shortest_path_backend – Backend for shortestPath execution: "auto" (default), "igraph" (require igraph, raise if missing), "cugraph" (require cugraph, raise if missing), or "bfs" (always use DataFrame BFS). "auto" tries cugraph on CUDF engine, igraph on pandas, falls back to BFS silently.

Returns:

Resulting Plottable

Return type:

Plottable

gfql_remote(chain, api_token=None, dataset_id=None, output_type='all', format=None, df_export_args=None, node_col_subset=None, edge_col_subset=None, engine='auto', validate=True, persist=False, params=None, output=None)#

Run GFQL query remotely.

This is the remote execution version of gfql(). It supports chains, Let/DAG patterns, and Cypher strings.

The query is compiled locally and sent to the server as wire-protocol JSON. A gfql_query field carries the full typed envelope (including WHERE clauses); gfql_operations carries a flat array for backward compatibility with older servers.

Parameters:

chain (Chain | List[ASTObject] | ASTLet | Dict[str, None | bool | str | float | int | List[Any] | Dict[str, Any]] | str) – GFQL query — Chain, List[ASTObject], ASTLet, Dict, or Cypher string (compiled locally before sending).
params (Dict[str, Any] | None) – Optional parameter dict for Cypher string queries (e.g., params={"val": 10} for $val references).
api_token (str | None)
dataset_id (str | None)
output_type (Literal['all', 'nodes', 'edges', 'shape'])
format (Literal['json', 'csv', 'parquet'] | None)
df_export_args (Dict[str, Any] | None)
node_col_subset (List[str] | None)
edge_col_subset (List[str] | None)
engine (EngineAbstract | Literal['pandas', 'cudf', 'dask', 'dask_cudf', 'auto'])
validate (bool)
persist (bool)
output (str | None)

Return type:

Plottable

Example:

# Chain (existing)
g.gfql_remote([n(), e(), n()])

# Cypher string with params
g.gfql_remote(
    "MATCH (n) WHERE n.score > $cutoff RETURN n",
    params={"cutoff": 10},
)

# GRAPH constructor
g.gfql_remote("GRAPH { MATCH (a)-[r]->(b) WHERE a.score > 5 }")

See chain_remote() for additional parameter documentation.

gfql_remote_shape(chain, api_token=None, dataset_id=None, format=None, df_export_args=None, node_col_subset=None, edge_col_subset=None, engine='auto', validate=True, persist=False)#

Get shape metadata for remote GFQL query execution.

This is the remote shape version of gfql(). Returns metadata about the resulting graph without downloading the full data.

See chain_remote_shape() for detailed documentation (chain_remote_shape is deprecated).

Parameters:

chain (Chain | List[ASTObject] | ASTLet | Dict[str, None | bool | str | float | int | List[Any] | Dict[str, Any]] | str)
api_token (str | None)
dataset_id (str | None)
format (Literal['json', 'csv', 'parquet'] | None)
df_export_args (Dict[str, Any] | None)
node_col_subset (List[str] | None)
edge_col_subset (List[str] | None)
engine (EngineAbstract | Literal['pandas', 'cudf', 'dask', 'dask_cudf', 'auto'])
validate (bool)
persist (bool)

Return type:

DataFrame

gfql_validate(*args, **kwargs)#

Validate a GFQL/Cypher query without executing it.

Raises structured GFQL exceptions on validation failures and never dispatches query execution operators.

graph(ig)#

Parameters:: ig (Any)
Return type:: Plottable

hop(*args, **kwargs)#

Given a graph and some source nodes, return subgraph of all paths within k-hops from the sources

This can be faster than the equivalent chain([…]) call that wraps it with additional steps

See chain() examples for examples of many of the parameters

g: Plotter
nodes: dataframe with id column matching g._node. None signifies all nodes (default).
hops: consider paths of length 1 to ‘hops’ steps, if any (default 1). Shorthand for max_hops.
min_hops/max_hops: inclusive traversal bounds; defaults preserve legacy behavior (min=1 unless max=0; max defaults to hops).
output_min_hops/output_max_hops: optional output slice applied after traversal; defaults keep all traversed hops up to max_hops. Useful for showing a subrange (e.g., min/max = 2..4 but display only hops 3..4).
label_node_hops/label_edge_hops: optional column names for hop numbers (omit or None to skip). Nodes record the first retained hop step they are reached (1 = first expansion); when min_hops prunes shorter branches, labels reflect the shortest retained path. Edges record the hop step that traversed them.
label_seeds: when True and labeling, also write hop 0 for seed nodes in the node label column.
to_fixed_point: keep hopping until no new nodes are found (ignores hops)
direction: ‘forward’, ‘reverse’, ‘undirected’
edge_match: dict of kv-pairs to exact match (see also: filter_edges_by_dict)
source_node_match: dict of kv-pairs to match nodes before hopping (including intermediate)
destination_node_match: dict of kv-pairs to match nodes after hopping (including intermediate)
source_node_query: dataframe query to match nodes before hopping (including intermediate)
destination_node_query: dataframe query to match nodes after hopping (including intermediate)
edge_query: dataframe query to match edges before hopping (including intermediate)
return_as_wave_front: Exclude starting node(s) in return, returning only encountered nodes
include_zero_hop_seed: internal Cypher opt-in for exact zero-hop path semantics
Note: chain() reverse passes set return_as_wave_front=True and use target_wave_front to constrain reachability.
target_wave_front: Only consider these nodes + self._nodes for reachability
engine: ‘auto’, ‘pandas’, ‘cudf’ (GPU)
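To make the hop labeling concrete, here is a minimal pure-Python sketch of a bounded traversal that records the first hop at which each node is reached, with seeds labeled 0. The function hop_labels and its arguments are hypothetical; it models only hops and direction, ignores the match/query filters and min_hops, and is not the library's DataFrame implementation.

```python
from typing import Dict, Iterable, List, Set, Tuple


def hop_labels(
    edges: List[Tuple[str, str]],
    seeds: Iterable[str],
    hops: int = 1,
    direction: str = 'forward',
) -> Dict[str, int]:
    """Return {node: first hop step at which it was reached}; seeds get 0."""
    reached: Dict[str, int] = {s: 0 for s in seeds}
    frontier: Set[str] = set(reached)
    for step in range(1, hops + 1):
        next_frontier: Set[str] = set()
        for src, dst in edges:
            # Follow the edge src -> dst
            if direction in ('forward', 'undirected') and src in frontier and dst not in reached:
                next_frontier.add(dst)
            # Follow the edge backwards, dst -> src
            if direction in ('reverse', 'undirected') and dst in frontier and src not in reached:
                next_frontier.add(src)
        if not next_frontier:
            break  # nothing new found; further hops cannot reach more nodes
        for node in next_frontier:
            reached[node] = step
        frontier = next_frontier
    return reached


path_edges = [('a', 'b'), ('b', 'c'), ('c', 'd')]
forward_two = hop_labels(path_edges, ['a'], hops=2)
reverse_one = hop_labels(path_edges, ['c'], hops=1, direction='reverse')
print(forward_two, reverse_one)
```

Passing a large hops value to this sketch behaves like to_fixed_point in spirit, because the loop stops as soon as a step finds no new nodes.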

hypergraph(raw_events=None, *, entity_types=None, opts={}, drop_na=True, drop_edge_attrs=False, verbose=True, direct=False, engine='auto', npartitions=None, chunksize=None, from_edges=False, return_as='graph')#

Parameters:

raw_events (Any | None)
entity_types (List[str] | None)
opts (dict)
drop_na (bool)
drop_edge_attrs (bool)
verbose (bool)
direct (bool)
engine (EngineAbstract | Literal['pandas', 'cudf', 'dask', 'dask_cudf', 'auto'])
npartitions (int | None)
chunksize (int | None)
from_edges (bool)
return_as (Literal['graph', 'all', 'entities', 'events', 'edges', 'nodes'])

Return type:

Plottable | HypergraphResult | Any

igraph2pandas(ig)#

Parameters:: ig (Any)
Return type:: Tuple[DataFrame, DataFrame]

infer_labels()#

Return type:: Plottable

keep_nodes(nodes)#

Limit nodes and edges to those selected by parameter nodes.

For edges, both source and destination must be in nodes.

Nodes can be a list or series of node IDs, or a dictionary. When a dictionary, each key corresponds to a node column, and nodes will be included when all match.
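The edge rule for keep_nodes (an edge survives only when both endpoints are kept) can be sketched in plain Python. The helper keep_nodes_sketch is hypothetical and covers only the list-of-IDs form, not the dictionary form or DataFrame inputs.

```python
from typing import Iterable, List, Tuple


def keep_nodes_sketch(
    node_ids: List[str],
    edges: List[Tuple[str, str]],
    keep: Iterable[str],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Keep selected nodes, and only the edges whose two endpoints are both kept."""
    keep_set = set(keep)
    kept_nodes = [n for n in node_ids if n in keep_set]
    kept_edges = [(s, d) for s, d in edges if s in keep_set and d in keep_set]
    return kept_nodes, kept_edges


all_nodes = ['a', 'b', 'c', 'd']
all_edges = [('a', 'b'), ('b', 'c'), ('c', 'd')]

# ('b', 'c') is dropped because 'c' is not among the kept nodes.
kept_nodes, kept_edges = keep_nodes_sketch(all_nodes, all_edges, ['a', 'b', 'd'])
print(kept_nodes, kept_edges)
```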

layout_cugraph(layout='force_atlas2', params={}, kind='Graph', directed=True, G=None, bind_position=True, x_out_col='x', y_out_col='y', play=0)#

Parameters:

layout (str)
params (dict)
kind (Literal['Graph', 'MultiGraph', 'BiPartiteGraph'])
G (Any | None)
bind_position (bool)
x_out_col (str)
y_out_col (str)
play (int | None)

Return type:

Plottable

layout_graphviz(prog='dot', args=None, directed=True, strict=False, graph_attr=None, node_attr=None, edge_attr=None, skip_styling=False, render_to_disk=False, path=None, format=None)#

Parameters:

prog (Literal['acyclic', 'ccomps', 'circo', 'dot', 'fdp', 'gc', 'gvcolor', 'gvpr', 'neato', 'nop', 'osage', 'patchwork', 'sccmap', 'sfdp', 'tred', 'twopi', 'unflatten'])
args (str | None)
directed (bool)
strict (bool)
graph_attr (Dict[Literal['_background', 'bb', 'beautify', 'bgcolor', 'center', 'charset', 'class', 'clusterrank', 'colorscheme', 'comment', 'compound', 'concentrate', 'Damping', 'defaultdist', 'dim', 'dimen', 'diredgeconstraints', 'dpi', 'epsilon', 'esep', 'fontcolor', 'fontname', 'fontnames', 'fontpath', 'fontsize', 'forcelabels', 'gradientangle', 'href', 'id', 'imagepath', 'inputscale', 'K', 'label', 'label_scheme', 'labeljust', 'labelloc', 'landscape', 'layerlistsep', 'layers', 'layerselect', 'layersep', 'layout', 'levels', 'levelsgap', 'lheight', 'linelength', 'lp', 'lwidth', 'margin', 'maxiter', 'mclimit', 'mindist', 'mode', 'model', 'newrank', 'nodesep', 'nojustify', 'normalize', 'notranslate', 'nslimit', 'nslimit1', 'oneblock', 'ordering', 'orientation', 'outputorder', 'overlap', 'overlap_scaling', 'overlap_shrink', 'pack', 'packmode', 'pad', 'page', 'pagedir', 'quadtree', 'quantum', 'rankdir', 'ranksep', 'ratio', 'remincross', 'repulsiveforce', 'resolution', 'root', 'rotate', 'rotation', 'scale', 'searchsize', 'sep', 'showboxes', 'size', 'smoothing', 'sortv', 'splines', 'start', 'style', 'stylesheet', 'target', 'TBbalance', 'tooltip', 'truecolor', 'URL', 'viewport', 'voro_margin', 'xdotversion'], ~typing.Any] | None)
node_attr (Dict[Literal['area', 'class', 'color', 'colorscheme', 'comment', 'distortion', 'fillcolor', 'fixedsize', 'fontcolor', 'fontname', 'fontsize', 'gradientangle', 'group', 'height', 'href', 'id', 'image', 'imagepos', 'imagescale', 'label', 'labelloc', 'layer', 'margin', 'nojustify', 'ordering', 'orientation', 'penwidth', 'peripheries', 'pin', 'pos', 'rects', 'regular', 'root', 'samplepoints', 'shape', 'shapefile', 'showboxes', 'sides', 'skew', 'sortv', 'style', 'target', 'tooltip', 'URL', 'vertices', 'width', 'xlabel', 'xlp', 'z'], ~typing.Any] | None)
edge_attr (Dict[Literal['arrowhead', 'arrowsize', 'arrowtail', 'class', 'color', 'colorscheme', 'comment', 'constraint', 'decorate', 'dir', 'edgehref', 'edgetarget', 'edgetooltip', 'edgeURL', 'fillcolor', 'fontcolor', 'fontname', 'fontsize', 'head_lp', 'headclip', 'headhref', 'headlabel', 'headport', 'headtarget', 'headtooltip', 'headURL', 'href', 'id', 'label', 'labelangle', 'labeldistance', 'labelfloat', 'labelfontcolor', 'labelfontname', 'labelfontsize', 'labelhref', 'labeltarget', 'labeltooltip', 'labelURL', 'layer', 'len', 'lhead', 'lp', 'ltail', 'minlen', 'nojustify', 'penwidth', 'pos', 'samehead', 'sametail', 'showboxes', 'style', 'tail_lp', 'tailclip', 'tailhref', 'taillabel', 'tailport', 'tailtarget', 'tailtooltip', 'tailURL', 'target', 'tooltip', 'URL', 'weight', 'xlabel', 'xlp'], ~typing.Any] | None)
skip_styling (bool)
render_to_disk (bool)
path (str | None)
format (Literal['canon', 'cmap', 'cmapx', 'cmapx_np', 'dia', 'dot', 'fig', 'gd', 'gd2', 'gif', 'hpgl', 'imap', 'imap_np', 'ismap', 'jpe', 'jpeg', 'jpg', 'mif', 'mp', 'pcl', 'pdf', 'pic', 'plain', 'plain-ext', 'png', 'ps', 'ps2', 'svg', 'svgz', 'vml', 'vmlz', 'vrml', 'vtx', 'wbmp', 'xdot', 'xlib'] | None)

Return type:

Plottable

layout_igraph(layout, directed=None, use_vids=False, bind_position=True, x_out_col='x', y_out_col='y', play=0, params={})#

Parameters:

layout (str)
directed (bool | None)
use_vids (bool)
bind_position (bool)
x_out_col (str)
y_out_col (str)
play (int | None)
params (dict)

Return type:

Plottable

layout_settings(play=None, locked_x=None, locked_y=None, locked_r=None, left=None, top=None, right=None, bottom=None, lin_log=None, strong_gravity=None, dissuade_hubs=None, edge_influence=None, precision_vs_speed=None, gravity=None, scaling_ratio=None)#

Parameters:

play (int | None)
locked_x (bool | None)
locked_y (bool | None)
locked_r (bool | None)
left (float | None)
top (float | None)
right (float | None)
bottom (float | None)
lin_log (bool | None)
strong_gravity (bool | None)
dissuade_hubs (bool | None)
edge_influence (float | None)
precision_vs_speed (float | None)
gravity (float | None)
scaling_ratio (float | None)

Return type:

Plottable

materialize_nodes(reuse=True, engine=EngineAbstract.AUTO)#

Generate g._nodes based on g._edges

Uses g._node for node id if exists, else ‘id’

Edges must be dataframe-like: cudf, pandas, …

When reuse=True and g._nodes is not None, use it

Example: Generate nodes

edges = pd.DataFrame({'s': ['a','b','c','d'], 'd': ['c','c','e','e']})
g = graphistry.edges(edges, 's', 'd')
print(g._nodes)  # None
g2 = g.materialize_nodes()
print(g2._nodes)  # pd.DataFrame

Parameters:

reuse (bool)
engine (EngineAbstract | str)

Return type:

Plottable

name(name)#

Parameters:: name (str)
Return type:: Plottable

networkx2pandas(G)#

Parameters:: G (Any)
Return type:: Tuple[DataFrame, DataFrame]

networkx_checkoverlap(g)#

Parameters:: g (Any)
Return type:: None

nodes(nodes, node=None, *args, **kwargs)#

Parameters:

nodes (Callable | Any)
node (str | None)
args (Any)
kwargs (Any)

Return type:

Plottable

pandas2igraph(edges, directed=True)#

Parameters:

edges (DataFrame)
directed (bool)

Return type:

Any

pipe(graph_transform, *args, **kwargs)#

Parameters:

graph_transform (Callable)
args (Any)
kwargs (Any)

Return type:

Plottable

plot(graph=None, nodes=None, name=None, description=None, render='auto', skip_upload=False, as_files=False, memoize=True, erase_files_on_fail=True, extra_html='', override_html_style=None, validate='autofix', warn=True, schema_validate=False)#

Parameters:

graph (Any | None)
nodes (Any | None)
name (str | None)
description (str | None)
render (bool | Literal['auto'] | Literal['g', 'url', 'ipython', 'databricks', 'browser'] | None)
skip_upload (bool)
as_files (bool)
memoize (bool)
erase_files_on_fail (bool)
extra_html (str)
override_html_style (str | None)
validate (Literal['strict', 'strict-fast', 'autofix'] | bool)
warn (bool)
schema_validate (Literal['strict', 'autofix'] | bool)

Return type:

Any

privacy(mode=None, notify=None, invited_users=None, message=None, mode_action=None)#

Parameters:

mode (Literal['private', 'organization', 'public'] | None)
notify (bool | None)
invited_users (List[str] | None)
message (str | None)
mode_action (Literal['10', '20'] | None)

Return type:

Plottable

protocol(v=None)#

Parameters:: v (str | None)
Return type:: str

prune_self_edges()#

python_remote_g(*args, **kwargs)#

Remotely run Python code on a remote dataset that returns a Plottable

Uses the latest bound _dataset_id, and uploads current dataset if not already bound. Note that rebinding calls of edges() and nodes() reset the _dataset_id binding.

Parameters:

code (Union[str, Callable[..., object]]) – Python code that includes a top-level function def task(g: Plottable) -> Union[str, Dict].
api_token (Optional[str]) – Optional JWT token. If not provided, refreshes JWT and uses that.
dataset_id (Optional[str]) – Optional dataset_id. If not provided, will fallback to self._dataset_id. If not defined, will upload current data, store that dataset_id, and run code against that.
format (Optional[FormatType]) – What format to fetch results. Defaults to ‘parquet’.
output_type (Optional[OutputTypeGraph]) – What shape of output to fetch. Defaults to ‘all’. Options include ‘nodes’, ‘edges’, ‘all’ (both). For other variants, see python_remote_shape and python_remote_json.
engine (EngineAbstractType) – Override which run mode GFQL uses. Defaults to ‘auto’ which auto-detects based on DataFrame type. Also accepts ‘pandas’ or ‘cudf’.
run_label (Optional[str]) – Optional label for the run for serverside job tracking.
validate (bool) – Whether to locally test code, and if uploading data, the data. Default true.

Return type:

Any

Example: Upload data and count the results

import graphistry
import pandas
es = pandas.DataFrame({'src': [0,1,2], 'dst': [1,2,0]})
# Parenthesize the chained calls so the expression can span several lines
g1 = (
    graphistry
    .edges(es, source='src', destination='dst')
    .upload()
)
assert g1._dataset_id is not None, "Upload failed: no dataset_id bound"
g2 = g1.python_remote_g(
    code='''
        from graphistry import Plottable

        def task(g: Plottable) -> Plottable:
            return g
    ''',
    engine='cudf')
num_edges = len(g2._edges)
print(f'num_edges: {num_edges}')

python_remote_json(*args, **kwargs)#

Remotely run Python code on a remote dataset that returns json

Uses the latest bound _dataset_id, and uploads current dataset if not already bound. Note that rebinding calls of edges() and nodes() reset the _dataset_id binding.

Parameters:

code (Union[str, Callable[..., object]]) – Python code that includes a top-level function def task(g: Plottable) -> Union[str, Dict].
api_token (Optional[str]) – Optional JWT token. If not provided, refreshes JWT and uses that.
dataset_id (Optional[str]) – Optional dataset_id. If not provided, will fallback to self._dataset_id. If not defined, will upload current data, store that dataset_id, and run code against that.
engine (EngineAbstractType) – Override which run mode GFQL uses. Defaults to ‘auto’ which auto-detects based on DataFrame type. Also accepts ‘pandas’ or ‘cudf’.
run_label (Optional[str]) – Optional label for the run for serverside job tracking.
validate (bool) – Whether to locally test code, and if uploading data, the data. Default true.

Return type:

Any

Example: Upload data and count the results

import graphistry
import pandas
es = pandas.DataFrame({'src': [0,1,2], 'dst': [1,2,0]})
# Parenthesize the chained calls so the expression can span several lines
g1 = (
    graphistry
    .edges(es, source='src', destination='dst')
    .upload()
)
assert g1._dataset_id is not None, "Upload failed: no dataset_id bound"
obj = g1.python_remote_json(
    code='''
        from typing import Any, Dict
        from graphistry import Plottable

        def task(g: Plottable) -> Dict[str, Any]:
            return {'num_edges': len(g._edges)}
    ''',
    engine='cudf')
num_edges = obj['num_edges']
print(f'num_edges: {num_edges}')

python_remote_table(*args, **kwargs)#

Remotely run Python code on a remote dataset that returns a table

Uses the latest bound _dataset_id, and uploads current dataset if not already bound. Note that rebinding calls of edges() and nodes() reset the _dataset_id binding.

Parameters:

code (Union[str, Callable[..., object]]) – Python code that includes a top-level function def task(g: Plottable) -> Union[str, Dict].
api_token (Optional[str]) – Optional JWT token. If not provided, refreshes JWT and uses that.
dataset_id (Optional[str]) – Optional dataset_id. If not provided, will fallback to self._dataset_id. If not defined, will upload current data, store that dataset_id, and run code against that.
format (Optional[FormatType]) – What format to fetch results. Defaults to ‘parquet’.
output_type (Optional[OutputTypeGraph]) – What shape of output to fetch. Defaults to ‘table’. Options include ‘table’, ‘nodes’, and ‘edges’.
engine (EngineAbstractType) – Override which run mode GFQL uses. Defaults to ‘auto’ which auto-detects based on DataFrame type. Also accepts ‘pandas’ or ‘cudf’.
run_label (Optional[str]) – Optional label for the run for serverside job tracking.
validate (bool) – Whether to locally test code, and if uploading data, the data. Default true.

Return type:

Any

Example: Upload data and count the results

import graphistry
import pandas
es = pandas.DataFrame({'src': [0,1,2], 'dst': [1,2,0]})
# Parenthesize the chained calls so the expression can span several lines
g1 = (
    graphistry
    .edges(es, source='src', destination='dst')
    .upload()
)
assert g1._dataset_id is not None, "Upload failed: no dataset_id bound"
edges_df = g1.python_remote_table(
    code='''
        from typing import Any
        from graphistry import Plottable

        def task(g: Plottable) -> Any:
            # Returns the edges table (a DataFrame)
            return g._edges
    ''',
    engine='cudf')
num_edges = len(edges_df)
print(f'num_edges: {num_edges}')

reset_caches()#

Return type:: None

scale(df=None, y=None, kind='nodes', use_scaler=None, use_scaler_target=None, impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', keep_n_decimals=5, return_scalers=False)#

Scale data using the same scalers as used in the featurization step.

Example

g = graphistry.nodes(df)
X, y = g.featurize().scale(kind='nodes', use_scaler='robust', use_scaler_target='kbins', n_bins=3)

# or 
g = graphistry.nodes(df)
# set a scaling strategy for features and targets -- umap uses those and produces different results depending.
g2 = g.umap(use_scaler='standard', use_scaler_target=None)

# later if you want to scale new data, you can do so
X, y = g2.transform(df, df, return_graph=False, scaled=False)  # unscaled transformer output
X_scaled, y_scaled = g2.scale(X, y, use_scaler='minmax', use_scaler_target='kbins', n_bins=5)
# fit some other pipeline
clf.fit(X_scaled, y_scaled)

Args:

df:

pd.DataFrame, raw data to transform, if None, will use data from featurization fit

y:

pd.DataFrame, optional target data

kind:

str, one of nodes, edges

use_scaler:

Scaling transformer

use_scaler_target:

Scaling transformer on target

impute:

bool, if True, will impute missing values

n_quantiles:

int, number of quantiles to use for quantile scaler

output_distribution:

str, one of normal, uniform, lognormal

quantile_range:

tuple, range of quantiles to use for quantile scaler

n_bins:

int, number of bins to use for KBinsDiscretizer

encode:

str, one of ordinal, onehot, onehot-dense, binary

strategy:

str, one of uniform, quantile, kmeans

keep_n_decimals:

int, number of decimals to keep after scaling

return_scalers:

bool, if True, will return the scalers used to scale the data

Returns:

(X, y) transformed data, or (X, y, scaler, scaler_target) if return_scalers is True

Parameters:

df (DataFrame | None)
y (DataFrame | None)
kind (str)
use_scaler (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'] | None)
use_scaler_target (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'] | None)
impute (bool)
n_quantiles (int)
output_distribution (str)
n_bins (int)
encode (str)
strategy (str)
keep_n_decimals (int)
return_scalers (bool)
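To give a feel for what use_scaler='kbins' with encode='ordinal' and strategy='uniform' produces, here is a minimal pure-Python sketch of equal-width binning. The function kbins_uniform_ordinal is hypothetical; it assumes the usual equal-width convention (bins span the observed min to max, and the maximum falls in the last bin) and is not the scaler object the library builds.

```python
from typing import List


def kbins_uniform_ordinal(values: List[float], n_bins: int) -> List[int]:
    """Map each value to the index of its equal-width bin, from 0 to n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # constant column: everything shares one bin
    width = (hi - lo) / n_bins
    # The maximum would compute to index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


binned = kbins_uniform_ordinal([0.0, 1.0, 2.5, 5.0, 9.9, 10.0], n_bins=5)
print(binned)
```

With strategy='quantile' the bin edges would instead follow the data's quantiles, so bins hold roughly equal counts rather than equal widths.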

scene_settings(menu=None, info=None, show_arrows=None, point_size=None, edge_curvature=None, edge_opacity=None, point_opacity=None)#

Parameters:

menu (bool | None)
info (bool | None)
show_arrows (bool | None)
point_size (float | None)
edge_curvature (float | None)
edge_opacity (float | None)
point_opacity (float | None)

Return type:

Plottable

search(query, cols=None, thresh=5000, fuzzy=True, top_n=10)#

Parameters:

query (str)
thresh (float)
fuzzy (bool)
top_n (int)

search_graph(query, scale=0.5, top_n=100, thresh=5000, broader=False, inplace=False)#

Parameters:

query (str)
scale (float)
top_n (int)
thresh (float)
broader (bool)
inplace (bool)

Return type:

Plottable

server(v=None)#

Parameters:: v (str | None)
Return type:: str

session: ClientSession#

settings(height=None, url_params=None, render=None, validate='autofix', warn=True)#

Parameters:

height (int | None)
url_params (Dict[str, None | str | int | float | bool | List[SettingsValue] | Dict[str, SettingsValue]] | None)
render (bool | Literal['auto'] | Literal['g', 'url', 'ipython', 'databricks', 'browser'] | None)
validate (Literal['strict', 'strict-fast', 'autofix'] | bool)
warn (bool)

Return type:

Plottable

style(fg=None, bg=None, page=None, logo=None)#

Parameters:

fg (Dict[str, Any] | None)
bg (Dict[str, Any] | None)
page (Dict[str, Any] | None)
logo (Dict[str, Any] | None)

Return type:

Plottable

to_arrow(table=None, validate='autofix', warn=True, schema_validate=False, schema_table='edges')#

Parameters:

table (Any | None)
validate (Literal['strict', 'strict-fast', 'autofix'] | bool)
warn (bool)
schema_validate (Literal['strict', 'autofix'] | bool)
schema_table (str)

Return type:

Any | None

to_cudf()#

Convert to GPU mode by converting any defined nodes and edges to cudf dataframes

When nodes or edges are already cudf dataframes, they are left as is

Parameters:: g (Plottable) – Graphistry object
Returns:: Graphistry object
Return type:: Plottable

to_cugraph(directed=True, include_nodes=True, node_attributes=None, edge_attributes=None, kind='Graph')#

Parameters:

directed (bool)
include_nodes (bool)
node_attributes (List[str] | None)
edge_attributes (List[str] | None)
kind (Literal['Graph', 'MultiGraph', 'BiPartiteGraph'])

Return type:

Any

to_igraph(directed=True, include_nodes=True, node_attributes=None, edge_attributes=None, use_vids=False)#

Parameters:

directed (bool)
include_nodes (bool)
node_attributes (List[str] | None)
edge_attributes (List[str] | None)
use_vids (bool)

Return type:

Any

to_pandas()#

Convert nodes and edges to pandas DataFrames.

Supports all input types: cuDF, Arrow, Polars, Spark, dask, and pandas (identity).

Return type:: Plottable

transform(df: DataFrame, y: DataFrame | None = None, kind: str = 'nodes', min_dist: str | float | int = 'auto', n_neighbors: int = 7, merge_policy: bool = False, sample: int | None = None, *, return_graph: Literal[True] = True, scaled: bool = True, verbose: bool = False) → Plottable#

transform(df: DataFrame, y: DataFrame | None = None, kind: str = 'nodes', min_dist: str | float | int = 'auto', n_neighbors: int = 7, merge_policy: bool = False, sample: int | None = None, *, return_graph: Literal[False], scaled: bool = True, verbose: bool = False) → Tuple[DataFrame, DataFrame]

Transform new data and append to existing graph, or return dataframes

args:

df:

pd.DataFrame, raw data to transform

y:

pd.DataFrame, optional

kind:

str # one of nodes, edges

return_graph:

bool, if True, will return a graph with inferred edges.

merge_policy:

bool, if True, adds batch to existing graph nodes via nearest neighbors. If False, will infer edges only between nodes in the batch, default False

min_dist:

float, if return_graph is True, will use this value in NN search, or ‘auto’ to infer a good value. min_dist represents the maximum distance between two samples for one to be considered as in the neighborhood of the other.

sample:

int, if return_graph is True, will use sample edges of existing graph to fill out the new graph

n_neighbors:

int, if return_graph is True, will use this value for n_neighbors in Nearest Neighbors search

scaled:

bool, if True, will use scaled transformation of data set during featurization, default True

verbose:

bool, if True, will print metadata about the graph construction, default False

Returns:

X, y: pd.DataFrame, transformed data if return_graph is False or a graphistry Plottable with inferred edges if return_graph is True
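The interplay of n_neighbors and min_dist when linking a new batch to existing points can be sketched in plain Python. The helper infer_edges and the 2D coordinates are hypothetical; the sketch assumes Euclidean distance and a brute-force search, and only shows how a neighbor count and a distance cutoff jointly limit inferred edges. It is not the library's nearest-neighbor code.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def infer_edges(
    new_points: Dict[str, Point],
    existing_points: Dict[str, Point],
    n_neighbors: int = 2,
    min_dist: float = 1.0,
) -> List[Tuple[str, str, float]]:
    """Link each new point to at most n_neighbors existing points within min_dist."""
    edges: List[Tuple[str, str, float]] = []
    for new_id, new_xy in new_points.items():
        # Brute-force: sort all existing points by distance to the new point
        candidates = sorted(
            (math.dist(new_xy, xy), old_id) for old_id, xy in existing_points.items()
        )
        for distance, old_id in candidates[:n_neighbors]:
            if distance <= min_dist:
                edges.append((new_id, old_id, distance))
    return edges


existing = {'a': (0.0, 0.0), 'b': (1.0, 0.0), 'c': (5.0, 5.0)}
batch = {'x': (0.5, 0.0)}

# 'c' is among the 3 nearest neighbors but lies farther than min_dist, so it is skipped.
inferred = infer_edges(batch, existing, n_neighbors=3, min_dist=1.0)
print(inferred)
```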

transform_umap(df, y=None, kind='nodes', min_dist='auto', n_neighbors=7, merge_policy=False, sample=None, *, return_graph=True, fit_umap_embedding=True, umap_transform_kwargs={})#

Parameters:

df (DataFrame)
y (DataFrame | None)
kind (Literal['nodes', 'edges'])
min_dist (str | float | int)
n_neighbors (int)
merge_policy (bool)
sample (int | None)
return_graph (bool)
fit_umap_embedding (bool)
umap_transform_kwargs (Dict[str, Any])

Return type:

Tuple[DataFrame, DataFrame, DataFrame] | Plottable

umap(X=None, y=None, kind='nodes', scale=1.0, n_neighbors=12, min_dist=0.1, spread=0.5, local_connectivity=1, repulsion_strength=1, negative_sample_rate=5, n_components=2, metric='euclidean', suffix='', play=0, encode_position=True, encode_weight=True, dbscan=False, engine='auto', feature_engine='auto', inplace=False, memoize=True, umap_kwargs={}, umap_fit_kwargs={}, umap_transform_kwargs={}, **featurize_kwargs)#

Parameters:

X (DataFrame | np.ndarray | List[str] | None)
y (DataFrame | np.ndarray | List[str] | None)
kind (Literal['nodes', 'edges'])
scale (float)
n_neighbors (int)
min_dist (float)
spread (float)
local_connectivity (int)
repulsion_strength (float)
negative_sample_rate (int)
n_components (int)
metric (str)
suffix (str)
play (int | None)
encode_position (bool)
encode_weight (bool)
dbscan (bool)
engine (Literal['auto', 'cuml', 'umap_learn'])
feature_engine (str)
inplace (bool)
memoize (bool)
umap_kwargs (Dict[str, Any])
umap_fit_kwargs (Dict[str, Any])
umap_transform_kwargs (Dict[str, Any])
featurize_kwargs (Any)

Return type:

Plottable | None

umap_fit(X, y=None, umap_fit_kwargs={})#

Parameters:

X (DataFrame)
y (DataFrame | None)
umap_fit_kwargs (Dict[str, Any])

Return type:

Plottable

umap_lazy_init(res, n_neighbors=12, min_dist=0.1, spread=0.5, local_connectivity=1, repulsion_strength=1, negative_sample_rate=5, n_components=2, metric='euclidean', engine='auto', suffix='', umap_kwargs={}, umap_fit_kwargs={}, umap_transform_kwargs={})#

Parameters:

res (Plottable)
n_neighbors (int)
min_dist (float)
spread (float)
local_connectivity (int)
repulsion_strength (float)
negative_sample_rate (int)
n_components (int)
metric (str)
engine (Literal['auto', 'cuml', 'umap_learn'])
suffix (str)
umap_kwargs (Dict[str, Any])
umap_fit_kwargs (Dict[str, Any])
umap_transform_kwargs (Dict[str, Any])

Return type:

Plottable

upload(memoize=True, erase_files_on_fail=True, validate='autofix', warn=True, schema_validate=False)#

Parameters:

memoize (bool)
erase_files_on_fail (bool)
validate (Literal['strict', 'strict-fast', 'autofix'] | bool)
warn (bool)
schema_validate (Literal['strict', 'autofix'] | bool)

Return type:

Plottable

property url: str | None#

validate_arrow_schema(table='edges', *, validate='strict', warn=True)#

Parameters:

table (str)
validate (Literal['strict', 'autofix'] | bool)
warn (bool)

Return type:

Any | None

class graphistry.feature_utils.callThrough(x)#

Bases: object

graphistry.feature_utils.check_if_textual_column(df, col, confidence=0.35, min_words=2.5)#

Checks if col column of df is textual or not using basic heuristics

Parameters:

df (DataFrame) – DataFrame
col (str) – column name
confidence (float) – threshold value between 0 and 1. If more than this fraction of the elements in column col are of type str, the column passes on to the next stage of evaluation. Default 0.35
min_words (float) – mean minimum words threshold. If mean words across col is greater than this, it is deemed textual. Default 2.5

Returns:

bool, whether column is textual or not

Return type:

bool
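The two-stage heuristic described above can be sketched in plain Python over a list of values. The function looks_textual is hypothetical and follows one reading of the parameter descriptions (first a str-fraction check against confidence, then a mean-word-count check against min_words); the library's exact comparisons may differ.

```python
from typing import Any, Sequence


def looks_textual(values: Sequence[Any], confidence: float = 0.35, min_words: float = 2.5) -> bool:
    """Stage 1: enough str elements? Stage 2: enough words per string on average?"""
    strings = [v for v in values if isinstance(v, str)]
    if not values or not strings:
        return False
    if len(strings) / len(values) < confidence:
        return False  # too few string elements to evaluate as text
    mean_words = sum(len(s.split()) for s in strings) / len(strings)
    return mean_words > min_words


sentences = ['the quick brown fox', 'jumps over the lazy dog', 3]
codes = ['a', 'b', 'c']
mostly_numbers = [1, 2, 3, 'some long text here']
print(looks_textual(sentences), looks_textual(codes), looks_textual(mostly_numbers))
```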

graphistry.feature_utils.concat_text(df, text_cols)#

graphistry.feature_utils.drop_duplicates_with_warning(df)#

Parameters:: df (DataFrame)
Return type:: DataFrame

graphistry.feature_utils.encode_edges(edf, src, dst, mlb, fit=False)#

edge encoder – creates multilabelBinarizer on edge pairs.

Args:

edf (pd.DataFrame): edge dataframe
src (string): source column
dst (string): destination column
mlb (sklearn): multilabelBinarizer
fit (bool, optional): If true, fits multilabelBinarizer. Defaults to False.

Returns:: tuple: pd.DataFrame, multilabelBinarizer
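To show what multi-label binarizing edge pairs looks like, the following pure-Python sketch builds one indicator row per edge, with a 1 in the columns of its source and destination nodes. The helper encode_edge_pairs is hypothetical and stands in for the sklearn binarizer; sorting the labels is an assumption about column order.

```python
from typing import List, Tuple


def encode_edge_pairs(edges: List[Tuple[str, str]]) -> Tuple[List[List[int]], List[str]]:
    """Return (indicator rows, column labels) for a list of (src, dst) pairs."""
    labels = sorted({node for pair in edges for node in pair})
    column = {label: i for i, label in enumerate(labels)}
    rows: List[List[int]] = []
    for src, dst in edges:
        row = [0] * len(labels)
        row[column[src]] = 1
        row[column[dst]] = 1
        rows.append(row)
    return rows, labels


rows, labels = encode_edge_pairs([('a', 'b'), ('b', 'c')])
print(labels)
print(rows)
```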

graphistry.feature_utils.encode_multi_target(ydf, mlb=None)#

graphistry.feature_utils.encode_textual(df, min_words=2.5, model_name='paraphrase-MiniLM-L6-v2', use_ngrams=False, ngram_range=(1, 3), max_df=0.2, min_df=3)#

Parameters:

df (DataFrame)
min_words (float)
model_name (str)
use_ngrams (bool)
ngram_range (tuple)
max_df (float)
min_df (int)

Return type:

Tuple[DataFrame, List, Any]

graphistry.feature_utils.features_without_target(df, y=None)#

Checks if y DataFrame column name is in df, and removes it from df if so

Parameters:

df (DataFrame) – model DataFrame
y (List | str | DataFrame | None) – target DataFrame

Returns:

DataFrames of model and target

Return type:

DataFrame

graphistry.feature_utils.find_bad_set_columns(df, bad_set=['[]'])#

Finds columns that if not coerced to strings, will break processors.

Parameters:

df (DataFrame) – DataFrame
bad_set (List) – List of strings to look for.

Returns:

list

graphistry.feature_utils.fit_pipeline(X, transformer, keep_n_decimals=5)#

Helper to fit a DataFrame over a transformer pipeline. Rounds the resulting matrix X to keep_n_decimals decimal places if not 0. This helps when the transformer pipeline includes a scaler or imputer, which sometimes introduces small negative numbers, while umap metrics like Hellinger need positive input.

Parameters:

X (DataFrame) – DataFrame to transform.
transformer – Pipeline object to fit and transform
keep_n_decimals (int) – Int of how many decimal places to keep in rounded transformed data

Return type:

DataFrame

graphistry.feature_utils.get_cardinality_ratio(df)#

Calculates the ratio of unique values to total number of rows of DataFrame

Parameters:: df (DataFrame) – DataFrame
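
For a single column, the ratio described above reduces to a one-line computation. The sketch below is illustrative only: it works on a plain sequence rather than a DataFrame, and returning 0.0 for an empty column is an assumption.

```python
from typing import Hashable, Sequence


def cardinality_ratio(values: Sequence[Hashable]) -> float:
    """Unique values divided by total rows; 0.0 for an empty column (assumption)."""
    if not values:
        return 0.0
    return len(set(values)) / len(values)


ratio = cardinality_ratio(["a", "b", "a", "c"])
```

A ratio near 1 suggests an identifier-like column, while a ratio near 0 suggests a low-cardinality categorical.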

graphistry.feature_utils.get_dataframe_by_column_dtype(df, include=None, exclude=None)#

graphistry.feature_utils.get_matrix_by_column_part(X, column_part)#

Get the feature matrix by column part existing in column names.

Parameters:

X (DataFrame)
column_part (str)

Return type:

DataFrame

graphistry.feature_utils.get_matrix_by_column_parts(X, column_parts)#

Get the feature matrix by column parts list existing in column names.

Parameters:

X (DataFrame)
column_parts (list | str | None)

Return type:

DataFrame
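
Selecting by column part amounts to substring matching on column names. The sketch below shows that matching on a list of names; `columns_matching_parts` is a hypothetical helper, and treating None as "keep everything" is an assumption rather than documented behavior.

```python
from typing import List, Optional, Sequence, Union


def columns_matching_parts(
    columns: Sequence[str], column_parts: Optional[Union[str, List[str]]]
) -> List[str]:
    """Return column names containing any of the given substrings, in original order."""
    if column_parts is None:
        return list(columns)  # assumption: no filter means keep every column
    parts = [column_parts] if isinstance(column_parts, str) else column_parts
    return [name for name in columns if any(part in name for part in parts)]


feature_columns = ["ip_172", "ip_10", "location_NY", "alert_high"]
selected = columns_matching_parts(feature_columns, ["ip_", "alert"])
```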

graphistry.feature_utils.get_numeric_transformers(ndf, y=None)#

graphistry.feature_utils.get_preprocessing_pipeline(use_scaler='robust', impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='quantile')#

Helper function for imputing and scaling np.ndarray data using different scaling transformers.

Parameters:

X – np.ndarray
impute (bool) – whether to run imputing or not
use_scaler (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile']) – Selects scaling transformer
n_quantiles (int) – if use_scaler = ‘quantile’, sets the quantile bin size.
output_distribution (str) – if use_scaler = ‘quantile’, can return distribution as [“normal”, “uniform”]
quantile_range – if use_scaler = ‘robust’/’quantile’, sets the quantile range.
n_bins (int) – number of bins to use in kbins discretizer
encode (str) – encoding for KBinsDiscretizer, can be one of onehot, onehot-dense, ordinal, default ‘ordinal’
strategy (str) – strategy for KBinsDiscretizer, can be one of uniform, quantile, kmeans, default ‘quantile’

Returns:

scaled array, imputer instances or None, scaler instance or None

Return type:

Any
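
The combination of imputing and robust scaling can be illustrated on one column in plain Python. The sketch assumes median imputation and the usual robust-scaling formula (x - median) / (upper quantile - lower quantile); it stands in for a scikit-learn style pipeline and is not the library's code. `impute_and_robust_scale` is a hypothetical name.

```python
import statistics
from typing import List, Optional, Tuple


def impute_and_robust_scale(
    column: List[Optional[float]], quantile_range: Tuple[int, int] = (25, 75)
) -> List[float]:
    """Fill missing values with the median, then scale by (x - median) / quantile span."""
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in column]

    # 99 cut points give the 1st..99th percentiles; index p - 1 is the p-th percentile.
    percentiles = statistics.quantiles(filled, n=100, method="inclusive")
    low, high = (percentiles[p - 1] for p in quantile_range)
    span = (high - low) or 1.0  # avoid dividing by zero on constant columns
    return [(v - median) / span for v in filled]


scaled = impute_and_robust_scale([1.0, 2.0, None, 4.0, 5.0])
```

Because the center and span come from quantiles rather than the mean and variance, a single extreme value shifts the scaled output far less than it would under standard scaling.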

graphistry.feature_utils.get_text_preprocessor(ngram_range=(1, 3), max_df=0.2, min_df=3)#

graphistry.feature_utils.get_textual_columns(df, min_words=2.5)#

Collects columns from df that it deems are textual.

Parameters:

df (DataFrame) – DataFrame
min_words (float)

Returns:

list of columns names

Return type:

List
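
One plausible reading of the min_words threshold is an average word count per column. The sketch below applies that reading to a dict of columns; the strict `>` comparison, the whitespace tokenization, and the rule that only all-string columns qualify are assumptions, and `textual_columns` is a hypothetical helper.

```python
from typing import Dict, List


def textual_columns(table: Dict[str, List], min_words: float = 2.5) -> List[str]:
    """Return columns whose string values average more than `min_words` words."""
    selected = []
    for name, values in table.items():
        texts = [v for v in values if isinstance(v, str)]
        if not texts or len(texts) < len(values):
            continue  # assumption: only all-string columns are candidates
        mean_words = sum(len(text.split()) for text in texts) / len(texts)
        if mean_words > min_words:
            selected.append(name)
    return selected


table = {
    "title": ["graph neural networks in practice", "a short intro to umap"],
    "tag": ["ml", "viz"],
    "views": [10, 25],
}
```

With the default threshold only `title` qualifies; lowering min_words to 0 also admits the single-word `tag` column.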

graphistry.feature_utils.group_columns_by_dtypes(df, verbose=True)#

Parameters:

df (DataFrame)
verbose (bool)

Return type:

Dict

graphistry.feature_utils.identity(x)#

graphistry.feature_utils.impute_and_scale_df(df, use_scaler='robust', impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', keep_n_decimals=5)#

Parameters:

df (DataFrame)
use_scaler (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'])
impute (bool)
n_quantiles (int)
output_distribution (str)
n_bins (int)
encode (str)
strategy (str)
keep_n_decimals (int)

Return type:

Tuple[DataFrame, Any]

graphistry.feature_utils.is_cudf_df(df)#

Parameters:: df (Any)
Return type:: bool

graphistry.feature_utils.is_cudf_s(s)#

Parameters:: s (Any)
Return type:: bool

graphistry.feature_utils.is_dataframe_all_numeric(df)#

Parameters:: df (DataFrame)
Return type:: bool

graphistry.feature_utils.make_array(X)#

graphistry.feature_utils.normalize_X_y(X, y, feature_names_in=None, target_names_in=None)#

Prepare for most finicky featurizers: drop duplicates, and remove targets from data

Warns on fixed violations

Parameters:

X (DataFrame)
y (DataFrame)
feature_names_in (Index | None)
target_names_in (Index | None)

Return type:

Tuple[DataFrame, DataFrame]

graphistry.feature_utils.passthrough_df_cols(df, columns)#

graphistry.feature_utils.process_dirty_dataframes(ndf, y, cardinality_threshold=40, cardinality_threshold_target=400, n_topics=42, n_topics_target=7, similarity=None, categories='auto', multilabel=False, feature_engine='pandas')#

skrub encoder for record level data. Will automatically turn inhomogeneous dataframe into matrix using smart conversion tricks.

Parameters:

ndf (DataFrame) – node DataFrame
y (DataFrame | None) – target DataFrame or series
cardinality_threshold (int) – For ndf columns, below this threshold, encoder is OneHot, above, it is GapEncoder
cardinality_threshold_target (int) – For target columns, below this threshold, encoder is OneHot, above, it is GapEncoder
n_topics (int) – number of topics for GapEncoder, default 42
similarity (str | None) – one of ‘ngram’, ‘levenshtein-ratio’, ‘jaro’, or ‘jaro-winkler’. The type of pairwise string similarity to use. If None or False, uses a TableVectorizer
n_topics_target (int)
categories (str | None)
multilabel (bool)
feature_engine (Literal['none', 'pandas', 'skrub', 'torch'])

Returns:

Encoded data matrix and target (if not None), the data encoder, and the label encoder.

Return type:

Tuple[DataFrame, DataFrame | None, Any, Any]
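
The cardinality threshold rule can be pictured as a per-column switch between encoder families. The sketch below only names the family a column would fall into; it does not build any encoder, `choose_encoder` is hypothetical, and the handling of a column exactly at the threshold is an assumption.

```python
from typing import Hashable, Sequence


def choose_encoder(values: Sequence[Hashable], cardinality_threshold: int = 40) -> str:
    """Name the encoder family a categorical column would get under the threshold rule."""
    n_unique = len(set(values))
    # Assumption: a column exactly at the threshold is treated as low cardinality.
    return "one-hot" if n_unique <= cardinality_threshold else "gap"


low = choose_encoder(["red", "green", "blue", "red"])
high = choose_encoder([f"user_{i}" for i in range(1000)])
```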

graphistry.feature_utils.process_edge_dataframes(edf, y, src, dst, cardinality_threshold=40, cardinality_threshold_target=400, n_topics=42, n_topics_target=7, use_scaler=None, use_scaler_target=None, multilabel=False, use_ngrams=False, ngram_range=(1, 3), max_df=0.2, min_df=3, min_words=2.5, model_name='paraphrase-MiniLM-L6-v2', similarity=None, categories='auto', impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', keep_n_decimals=5, feature_engine='pandas')#

Custom Edge-record encoder. Uses a MultiLabelBinarizer to generate a src/dst vector and then process_textual_or_other_dataframes that encodes any other data present in edf, textual or not.

Parameters:

edf (DataFrame) – pandas DataFrame of edge features
y (DataFrame) – pandas DataFrame of edge labels
src (str) – source column to select in edf
dst (str) – destination column to select in edf
use_scaler (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'] | None) – Scaling transformer
use_scaler_target (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'] | None) – Scaling transformer for target
cardinality_threshold (int)
cardinality_threshold_target (int)
n_topics (int)
n_topics_target (int)
multilabel (bool)
use_ngrams (bool)
ngram_range (tuple)
max_df (float)
min_df (int)
min_words (float)
model_name (str)
similarity (str | None)
categories (str | None)
impute (bool)
n_quantiles (int)
output_distribution (str)
n_bins (int)
encode (str)
strategy (str)
keep_n_decimals (int)
feature_engine (Literal['none', 'pandas', 'skrub', 'torch'])

Returns:

Encoded data matrix and target (if not None), the data encoders, and the label encoder.

Return type:

Tuple[DataFrame, DataFrame, DataFrame, DataFrame, List[Any], Any, Any | None, Any | None, Any, List[str]]

graphistry.feature_utils.process_nodes_dataframes(df, y, cardinality_threshold=40, cardinality_threshold_target=400, n_topics=42, n_topics_target=7, use_scaler='robust', use_scaler_target='kbins', multilabel=False, embedding=False, use_ngrams=False, ngram_range=(1, 3), max_df=0.2, min_df=3, min_words=2.5, model_name='paraphrase-MiniLM-L6-v2', similarity=None, categories='auto', impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', keep_n_decimals=5, feature_engine='pandas')#

Automatic Deep Learning Embedding/ngrams of Textual Features, with the rest of the columns taken care of by skrub

Parameters:

df (DataFrame) – pandas DataFrame of data
y (DataFrame) – pandas DataFrame of targets
n_topics (int) – number of topics in Gap Encoder
n_topics_target (int) – number of topics in Gap Encoder for target
use_scaler (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile']) – Scaling transformer
use_scaler_target (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile']) – Scaling transformer for target
confidence – Number between 0 and 1, will pass column for textual processing if total entries are string like in a column and above this relative threshold.
min_words (float) – Sets the threshold for average number of words to include column for textual sentence encoding. Lower values means that columns will be labeled textual and sent to sentence-encoder. Set to 0 to force named columns as textual.
model_name (str) – SentenceTransformer model name. See available list at https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models
cardinality_threshold (int)
cardinality_threshold_target (int)
multilabel (bool)
embedding (bool)
use_ngrams (bool)
ngram_range (tuple)
max_df (float)
min_df (int)
similarity (str | None)
categories (str | None)
impute (bool)
n_quantiles (int)
output_distribution (str)
n_bins (int)
encode (str)
strategy (str)
keep_n_decimals (int)
feature_engine (Literal['none', 'pandas', 'skrub', 'torch'])

Returns:

X_enc, y_enc, data_encoder, label_encoder, scaling_pipeline, scaling_pipeline_target, text_model, text_cols,

Return type:

Tuple[DataFrame, Any, DataFrame, Any, Any, Any, Any | None, Any | None, Any, List[str]]

graphistry.feature_utils.remove_internal_namespace_if_present(df)#

Some transformations below add columns to the DataFrame; this method removes them before featurization. Will not drop if suffix is added during UMAP-ing.

Parameters:: df (DataFrame) – DataFrame
Returns:: DataFrame with dropped columns in reserved namespace
Return type:: DataFrame

graphistry.feature_utils.remove_node_column_from_symbolic(X_symbolic, node)#

graphistry.feature_utils.resolve_X(df, X)#

Parameters:

df (DataFrame | None)
X (List[str] | str | DataFrame | None)

Return type:

DataFrame

graphistry.feature_utils.resolve_feature_engine(feature_engine)#

Parameters:: feature_engine (Literal['none', 'pandas', 'skrub', 'torch', 'dirty_cat', 'auto'])
Return type:: Literal[‘none’, ‘pandas’, ‘skrub’, ‘torch’]

graphistry.feature_utils.resolve_scaler(use_scaler, feature_engine)#

Parameters:

use_scaler (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'] | None)
feature_engine (Literal['none', 'pandas', 'skrub', 'torch'])

Return type:

Literal[‘none’, ‘kbins’, ‘standard’, ‘robust’, ‘minmax’, ‘quantile’]

graphistry.feature_utils.resolve_scaler_target(use_scaler_target, feature_engine, multilabel)#

Parameters:

use_scaler_target (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'] | None)
feature_engine (Literal['none', 'pandas', 'skrub', 'torch'])
multilabel (bool)

Return type:

Literal[‘none’, ‘kbins’, ‘standard’, ‘robust’, ‘minmax’, ‘quantile’]

graphistry.feature_utils.resolve_y(df, y)#

Parameters:

df (DataFrame | None)
y (List[str] | str | DataFrame | None)

Return type:

DataFrame

graphistry.feature_utils.reuse_featurization(g, memoize, metadata)#

Parameters:

g (Plottable)
memoize (bool)
metadata (Any)

graphistry.feature_utils.safe_divide(a, b)#

graphistry.feature_utils.set_currency_to_float(df, col, return_float=True)#

Parameters:

df (DataFrame)
col (str)
return_float (bool)
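
A currency string can be turned into a number by stripping everything except digits, the minus sign, and the decimal point. The sketch below shows that idea for a single value; it is an assumption about the general technique, not a copy of the library's implementation, and `money_to_number` is a hypothetical name.

```python
import re
from decimal import Decimal
from typing import Union


def money_to_number(money: str, return_float: bool = True) -> Union[float, Decimal]:
    """Strip currency symbols and thousands separators, then parse the number."""
    digits = re.sub(r"[^\d\-.]", "", money)  # keep digits, minus sign, decimal point
    value = Decimal(digits)
    return float(value) if return_float else value


amount = money_to_number("$1,234.50")
```

Returning a Decimal (return_float=False) keeps the exact written precision, which can matter for monetary sums.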

graphistry.feature_utils.set_to_bool(df, col, value)#

Parameters:

df (DataFrame)
col (str)
value (Any)

graphistry.feature_utils.set_to_datetime(df, cols, new_col)#

Parameters:

df (DataFrame)
cols (List)
new_col (str)

graphistry.feature_utils.set_to_numeric(df, cols, fill_value=0.0)#

Parameters:

df (DataFrame)
cols (List)
fill_value (float)

graphistry.feature_utils.smart_scaler(X_enc, y_enc, use_scaler, use_scaler_target, impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', keep_n_decimals=5)#

Parameters:

use_scaler (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'])
use_scaler_target (Literal['none', 'kbins', 'standard', 'robust', 'minmax', 'quantile'])
impute (bool)
n_quantiles (int)
output_distribution (str)
n_bins (int)
encode (str)
strategy (str)
keep_n_decimals (int)

graphistry.feature_utils.transform(df, ydf, res, kind, src, dst, feature_names_in, target_names_in)#

Parameters:

df (DataFrame)
ydf (DataFrame | None)
res (List)
kind (str)
feature_names_in (Index)
target_names_in (Index)

Return type:

Tuple[DataFrame, DataFrame]

graphistry.feature_utils.transform_dirty(df, data_encoder, name='')#

Parameters:

df (DataFrame)
data_encoder (Any)
name (str)

Return type:

DataFrame

graphistry.feature_utils.transform_text(df, text_model, text_cols)#

Parameters:

df (DataFrame)
text_model (Any)
text_cols (List | str)

Return type:

DataFrame

graphistry.feature_utils.where_is_currency_column(df, col)#

Parameters:

df (DataFrame)
col (str)

graphistry.feature_utils.FeatureEngine#: alias of Literal[‘none’, ‘pandas’, ‘skrub’, ‘torch’, ‘dirty_cat’, ‘auto’]

graphistry.feature_utils.FeatureEngineConcrete#: alias of Literal[‘none’, ‘pandas’, ‘skrub’, ‘torch’]

UMAP#

class graphistry.umap_utils.UMAPMixin(*a, **kw)#

Bases: object

UMAP Mixin for automagic UMAPing

filter_weighted_edges(scale=1.0, index_to_nodes_dict=None, inplace=False, kind='nodes')#

Filter edges based on _weighted_edges_df (ex: from .umap())

Parameters:

scale (float)
index_to_nodes_dict (Dict | None)
inplace (bool)
kind (str)

transform_umap(df: DataFrame, y: DataFrame | None = None, kind: Literal['nodes', 'edges'] = 'nodes', min_dist: str | float | int = 'auto', n_neighbors: int = 7, merge_policy: bool = False, sample: int | None = None, *, return_graph: Literal[True] = True, fit_umap_embedding: bool = True, umap_transform_kwargs: Dict[str, Any] = {}) → Plottable#

transform_umap(df: DataFrame, y: DataFrame | None = None, kind: Literal['nodes', 'edges'] = 'nodes', min_dist: str | float | int = 'auto', n_neighbors: int = 7, merge_policy: bool = False, sample: int | None = None, *, return_graph: Literal[False], fit_umap_embedding: bool = True, umap_transform_kwargs: Dict[str, Any] = {}) → Tuple[DataFrame, DataFrame, DataFrame]

Transforms data into UMAP embedding

Args:

df:: Dataframe to transform
y:: Target column
kind:: One of nodes or edges
min_dist:: Epsilon for including neighbors in infer_graph
n_neighbors:: Number of neighbors to use for contextualization
merge_policy:: if True, use the previous graph and add the new batch to the existing graph’s neighbors; useful to contextualize new data against the existing graph. If False, sample is irrelevant.
sample:: Sample number of existing graph’s neighbors to use for contextualization; helps make denser graphs
return_graph:: Whether to return a graph or just the embeddings
fit_umap_embedding:: Whether to infer graph from the UMAP embedding on the new data, default True

umap(X: DataFrame | ndarray | List[str] | None = None, y: DataFrame | ndarray | List[str] | None = None, kind: Literal['nodes', 'edges'] = 'nodes', scale: float = 1.0, n_neighbors: int = 12, min_dist: float = 0.1, spread: float = 0.5, local_connectivity: int = 1, repulsion_strength: float = 1, negative_sample_rate: int = 5, n_components: int = 2, metric: str = 'euclidean', suffix: str = '', play: int | None = 0, encode_position: bool = True, encode_weight: bool = True, dbscan: bool = False, engine: Literal['cuml', 'umap_learn', 'auto'] = 'auto', feature_engine: str = 'auto', inplace: Literal[False] = False, memoize: bool = True, umap_kwargs: Dict[str, Any] = {}, umap_fit_kwargs: Dict[str, Any] = {}, umap_transform_kwargs: Dict[str, Any] = {}, **featurize_kwargs) → Plottable#

umap(X: DataFrame | ndarray | List[str] | None = None, y: DataFrame | ndarray | List[str] | None = None, kind: Literal['nodes', 'edges'] = 'nodes', scale: float = 1.0, n_neighbors: int = 12, min_dist: float = 0.1, spread: float = 0.5, local_connectivity: int = 1, repulsion_strength: float = 1, negative_sample_rate: int = 5, n_components: int = 2, metric: str = 'euclidean', suffix: str = '', play: int | None = 0, encode_position: bool = True, encode_weight: bool = True, dbscan: bool = False, engine: Literal['cuml', 'umap_learn', 'auto'] = 'auto', feature_engine: str = 'auto', *, inplace: Literal[True], memoize: bool = True, umap_kwargs: Dict[str, Any] = {}, umap_fit_kwargs: Dict[str, Any] = {}, umap_transform_kwargs: Dict[str, Any] = {}, **featurize_kwargs) → None

UMAP the featurized nodes or edges data, or pass in your own X, y (optional) dataframes of values

Example

>>> import graphistry
>>> import pandas as pd
>>> g = graphistry.nodes(pd.DataFrame({'node': [0,1,2], 'data': [1,2,3], 'meta': ['a', 'b', 'c']}))
>>> g2 = g.umap(n_components=3, spread=1.0, min_dist=0.1, n_neighbors=12, negative_sample_rate=5, local_connectivity=1, repulsion_strength=1.0, metric='euclidean', suffix='', play=0, encode_position=True, encode_weight=True, dbscan=False, engine='auto', feature_engine='auto', inplace=False, memoize=True)
>>> g2.plot()

Parameters

X:

either a DataFrame or ndarray of features, or column names to featurize

y:

either a DataFrame or ndarray of targets, or column names to featurize targets

kind:

nodes or edges or None. If None, expects explicit X, y (optional) matrices, and will not associate them to nodes or edges. If X, y (optional) is given with kind = [nodes, edges], it will associate the new matrices to nodes or edges attributes.

scale:

multiplicative scale for pruning the weighted edge DataFrame obtained from UMAP, in the range [0, ..), with the high end meaning keep all edges

n_neighbors:

UMAP number of nearest neighbors to include for UMAP connectivity, lower makes more compact layouts. Minimum 2

min_dist:

UMAP float between 0 and 1, lower makes more compact layouts.

spread:

UMAP spread of values for relaxation

local_connectivity:

UMAP connectivity parameter

repulsion_strength:

UMAP repulsion strength

negative_sample_rate:

UMAP negative sampling rate

n_components:

number of components in the UMAP projection, default 2

metric:

UMAP metric, default ‘euclidean’. See the UMAP-LEARN documentation (https://umap-learn.readthedocs.io/en/latest/parameters.html) for more.

suffix:

optional suffix to add to x, y attributes of umap.

play:

Graphistry play parameter, default 0, how much to evolve the network during clustering. 0 preserves the original UMAP layout.

encode_weight:

if True, will set new edges_df from implicit UMAP, default True.

encode_position:

whether to set default plotting bindings – positions x,y from umap for .plot(), default True

dbscan:

whether to run DBSCAN on the UMAP embedding, default False.

engine:

selects which engine to use to calculate UMAP: default “auto” will use cuML if available, otherwise UMAP-LEARN.

feature_engine:

How to encode data (“none”, “auto”, “pandas”, “skrub”, “torch”)

inplace:

whether to modify the current object, default False. When False, returns a new object, useful for chaining in a functional paradigm.

memoize:

whether to memoize the results of this method, default True.

umap_kwargs:

Optional kwargs to pass to underlying UMAP library constructor

umap_fit_kwargs:

Optional kwargs to pass to underlying UMAP fit method, including fit part of fit_transform

umap_transform_kwargs:

Optional kwargs to pass to underlying UMAP transform method, including transform part of fit_transform

featurize_kwargs:

Optional kwargs to pass to .featurize()

Returns:: self, with attributes set with new data

umap_fit(X, y=None, umap_fit_kwargs={})#

Parameters:

X (DataFrame)
y (DataFrame | None)
umap_fit_kwargs (Dict[str, Any])

Parameters:

res (Plottable)
n_neighbors (int)
min_dist (float)
spread (float)
local_connectivity (int)
repulsion_strength (float)
negative_sample_rate (int)
n_components (int)
metric (str)
engine (Literal['cuml', 'umap_learn', 'auto'])
suffix (str)
umap_kwargs (Dict[str, Any])
umap_fit_kwargs (Dict[str, Any])
umap_transform_kwargs (Dict[str, Any])

graphistry.umap_utils.assert_imported()#

graphistry.umap_utils.assert_imported_cuml()#

graphistry.umap_utils.is_legacy_cuml()#

graphistry.umap_utils.make_safe_umap_gpu_dataframes(X, y, engine)#

Parameters:

X (DataFrame)
y (DataFrame | None)
engine (Literal['cuml', 'umap_learn'])

Return type:

Tuple[DataFrame, DataFrame | None]

graphistry.umap_utils.prune_weighted_edges_df_and_relabel_nodes(wdf, scale=0.1, index_to_nodes_dict=None)#

Prune the weighted edge DataFrame so to return high fidelity similarity scores.

Parameters:

wdf (DataFrame | Any) – weighted edge DataFrame gotten via UMAP
scale (float) – lower values mean fewer edges (those > max - scale * std)
index_to_nodes_dict (Dict | None) – dict of index to node name; remap src/dst values if provided

Returns:

pd.DataFrame

Return type:

DataFrame
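
The threshold formula given in the scale description, max - scale * std, can be tried out on a small edge list. The sketch below follows that formula literally, using the sample standard deviation as an assumption; the library's actual thresholding may differ, and `prune_weighted_edges` is a hypothetical stand-in that works on tuples rather than a DataFrame.

```python
import statistics
from typing import List, Tuple

Edge = Tuple[str, str, float]  # (source, destination, weight)


def prune_weighted_edges(edges: List[Edge], scale: float = 0.1) -> List[Edge]:
    """Keep edges whose weight exceeds max(weight) - scale * std(weight)."""
    weights = [weight for _, _, weight in edges]
    # Assumption: sample standard deviation; needs at least two edges.
    threshold = max(weights) - scale * statistics.stdev(weights)
    return [edge for edge in edges if edge[2] > threshold]


edges = [("a", "b", 0.1), ("a", "c", 0.5), ("b", "c", 0.9), ("c", "d", 1.0)]
strongest = prune_weighted_edges(edges, scale=0.1)
```

Raising scale lowers the threshold, so more edges survive; a large enough value keeps every edge.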

graphistry.umap_utils.resolve_umap_engine(engine)#

Parameters:: engine (Literal['cuml', 'umap_learn', 'auto'])
Return type:: Literal[‘cuml’, ‘umap_learn’]

graphistry.umap_utils.reuse_umap(g, memoize, metadata)#

Parameters:

g (Plottable)
memoize (bool)
metadata (Any)

Return type:

Plottable | None

graphistry.umap_utils.umap_graph_to_weighted_edges(umap_graph, engine, is_legacy, cfg=graphistry.constants)#

Parameters:: engine (Literal['cuml', 'umap_learn'])

graphistry.umap_utils.umap_model_to_engine(v)#

Parameters:: v (Any)
Return type:: Literal[‘cuml’, ‘umap_learn’] | None

Semantic Search#

class graphistry.text_utils.SearchToGraphMixin(*a, **kw)#

Bases: object

assert_features_line_up_with_nodes()#

assert_fitted()#

build_index(angular=False, n_trees=None)#

classmethod load_search_instance(savepath)#

save_search_instance(savepath)#

search(query, cols=None, thresh=5000, fuzzy=True, top_n=10)#

Natural language query over nodes that returns a dataframe of results sorted by relevance column “distance”.

If node data is not yet feature-encoded (and explicit edges are given), run automatic feature engineering:

g2 = g.featurize(kind='nodes', X=['text_col_1', ..],
    min_words=0  # forces all named columns to be textually encoded
)

If edges do not yet exist, generate them via:

g2 = g.umap(kind='nodes', X=['text_col_1', ..],
    min_words=0  # forces all named columns to be textually encoded
)

If an index is not yet built, it is generated on the fly at search time. Otherwise, call g2.build_index() to build it ahead of time.

Args:

query (str):: natural language query.
cols (list or str, optional):: if fuzzy=False, select which column to query. Defaults to None since fuzzy=True by default.
thresh (float, optional):: distance threshold from query vector to returned results. Defaults to 5000, set large just in case, but could be as low as 10.
fuzzy (bool, optional):: if True, uses embedding + annoy index for recall, otherwise does string matching over given cols. Defaults to True.
top_n (int, optional):: how many results to return. Defaults to 10.

Returns:

pd.DataFrame: rank ordered dataframe of results matching query
vector_encoding_of_query: vector encoding of query via given transformer/ngrams model if fuzzy=True, else None

Parameters:

query (str)
thresh (float)
fuzzy (bool)
top_n (int)

Return type:

Tuple[DataFrame, ndarray[tuple[Any, …], dtype[float32]] | ndarray[tuple[Any, …], dtype[float64]] | None]
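
How thresh and top_n interact can be shown with a toy ranking. The sketch below swaps the approximate Annoy index for an exact brute-force Euclidean search over a handful of made-up vectors; `rank_by_distance` and the sample embeddings are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def rank_by_distance(
    query_vec: Sequence[float],
    node_vecs: Dict[str, Sequence[float]],
    thresh: float = 5000,
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Return up to `top_n` (node, distance) pairs with distance below `thresh`, nearest first."""
    scored = [(node, math.dist(query_vec, vec)) for node, vec in node_vecs.items()]
    kept = [pair for pair in scored if pair[1] < thresh]
    return sorted(kept, key=lambda pair: pair[1])[:top_n]


embeddings = {"n1": [0.0, 0.0], "n2": [3.0, 4.0], "n3": [6.0, 8.0]}
hits = rank_by_distance([0.0, 0.0], embeddings, thresh=7.0, top_n=2)
```

The threshold filters first and top_n truncates afterwards, so a tight thresh can return fewer than top_n rows.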

search_graph(query, scale=0.5, top_n=100, thresh=5000, broader=False, inplace=False)#

Input a natural language query and return a graph of results.

See help(g.search) for more information

Args:

query (str):: query input eg “coding best practices”
scale (float, optional):: edge weight threshold. Defaults to 0.5.
top_n (int, optional):: how many results to return. Defaults to 100.
thresh (float, optional):: distance threshold from query vector to returned results. Defaults to 5000, set large just in case, but could be as low as 10.
broader (bool, optional):: if True, will retrieve entities connected via an edge that were not necessarily bubbled up in the results_dataframe. Defaults to False.
inplace (bool, optional):: whether to return new instance (default) or mutate self. Defaults to False.

Returns:

graphistry Instance: g

Parameters:

query (str)
scale (float)
top_n (int)
thresh (float)
broader (bool)
inplace (bool)

Return type:

Plottable

DBSCAN#

class graphistry.compute.cluster.ClusterMixin(*a, **kw)#

Bases: object

dbscan(min_dist=0.2, min_samples=1, cols=None, kind='nodes', fit_umap_embedding=True, target=False, verbose=False, engine_dbscan='auto', *args, **kwargs)#

DBSCAN clustering on CPU or GPU, inferred automatically. Adds a _dbscan column to nodes or edges.

NOTE: g.transform_dbscan(..) currently unsupported on GPU.

Saves model as g._dbscan_nodes or g._dbscan_edges

Examples:

g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node')

# cluster by UMAP embeddings
kind = 'nodes'  # or 'edges'
g2 = g.umap(kind=kind).dbscan(kind=kind)
print(g2._nodes['_dbscan'])  # or g2._edges['_dbscan'] when kind='edges'

# dbscan in umap or featurize API
g2 = g.umap(dbscan=True, min_dist=1.2, min_samples=2, **kwargs)
# or, here dbscan is inferred from features, not umap embeddings
g2 = g.featurize(dbscan=True, min_dist=1.2, min_samples=2, **kwargs)

# and via chaining,
g2 = g.umap().dbscan(min_dist=1.2, min_samples=2, **kwargs)

# cluster by feature embeddings
g2 = g.featurize().dbscan(**kwargs)

# cluster by a given set of feature column attributes, or with target=True
g2 = g.featurize().dbscan(cols=['ip_172', 'location', 'alert'], target=False, **kwargs)

# equivalent to above (ie, with cols != None, fit_umap_embedding=True will still use the features dataframe, rather than UMAP embeddings)
g2 = g.umap().dbscan(cols=['ip_172', 'location', 'alert'], fit_umap_embedding=True, **kwargs)  # or fit_umap_embedding=False

g2.plot() # color by `_dbscan` column

Useful:

Enriching the graph with cluster labels from UMAP is useful for visualizing clusters in the graph by color, size, etc, as well as assessing metrics per cluster, e.g. graphistry/pygraphistry

Args:

min_dist float:: The maximum distance between two samples for them to be considered as in the same neighborhood.
kind str:: ‘nodes’ or ‘edges’
cols:: list of columns to use for clustering given g.featurize has been run, nice way to slice features or targets by fragments of interest, e.g. [‘ip_172’, ‘location’, ‘ssh’, ‘warnings’]
fit_umap_embedding bool:: whether to use UMAP embeddings or features dataframe to cluster DBSCAN
min_samples:: The number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself.
target:: whether to use the target column as the clustering feature

Parameters:

min_dist (float)
min_samples (int)
cols (List | str | None)
kind (Literal['nodes', 'edges'])
fit_umap_embedding (bool)
target (bool)
verbose (bool)
engine_dbscan (Literal['cuml', 'sklearn', 'auto'])

transform_dbscan(df, y=None, min_dist='auto', infer_umap_embedding=False, sample=None, n_neighbors=None, kind='nodes', return_graph=True, verbose=False)#

Transforms a minibatch dataframe to one with a new column ‘_dbscan’ containing the DBSCAN cluster labels on the minibatch and generates a graph with the minibatch and the original graph, with edges between the minibatch and the original graph inferred from the umap embedding or features dataframe. Graph nodes | edges will be colored by ‘_dbscan’ column.

Examples:

fit:
    g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node')
    g2 = g.featurize().dbscan()

predict:

    emb, X, _, ndf = g2.transform_dbscan(ndf, return_graph=False)
    # or
    g3 = g2.transform_dbscan(ndf, return_graph=True)
    g3.plot()

likewise for umap:

fit:
    g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node')
    g2 = g.umap(X=.., y=..).dbscan()

predict:

    emb, X, y, ndf = g2.transform_dbscan(ndf, ndf, return_graph=False)
    # or
    g3 = g2.transform_dbscan(ndf, ndf, return_graph=True)
    g3.plot()

Args:

df:: dataframe to transform
y:: optional labels dataframe
min_dist:: The maximum distance between two samples for them to be considered as in the same neighborhood. Smaller values will result in fewer edges between the minibatch and the original graph. Default ‘auto’, which infers min_dist from the mean distance and std of new points to the original graph
infer_umap_embedding:: whether to use UMAP embeddings or features dataframe when inferring edges between the minibatch and the original graph. Default False, uses the features dataframe
sample:: number of samples to use when inferring edges between the minibatch and the original graph, if None, will only use closest point to the minibatch. If greater than 0, will sample the closest sample points in existing graph to pull in more edges. Default None
kind:: ‘nodes’ or ‘edges’
return_graph:: whether to return a graph or the (emb, X, y, minibatch df enriched with DBSCAN labels), default True. The inferred graph supports kind=’nodes’ only.
verbose:: whether to print out progress, default False

Parameters:

df (DataFrame)
y (DataFrame | None)
min_dist (float | str)
infer_umap_embedding (bool)
sample (int | None)
n_neighbors (int | None)
kind (str)
return_graph (bool)
verbose (bool)

graphistry.compute.cluster.dbscan_fit_inplace(res, dbscan, kind='nodes', cols=None, use_umap_embedding=True, target=False, verbose=False)#

Fits clustering on UMAP embeddings if use_umap_embedding is True, otherwise on the features dataframe, or on the target dataframe if target is True.

Sets:

res._dbscan_edges or res._dbscan_nodes to the DBSCAN model
res._edges or res._nodes gains column _dbscan

Args:

res:: graphistry graph
kind:: ‘nodes’ or ‘edges’
cols:: list of columns to use for clustering given g.featurize has been run
use_umap_embedding:: whether to use UMAP embeddings or features dataframe for clustering (default: True)
target:: whether to use the target dataframe or features dataframe (typically False, for features)

Parameters:

res (Plottable)
dbscan (Any)
kind (Literal['nodes', 'edges'])
cols (List | str | None)
use_umap_embedding (bool)
target (bool)
verbose (bool)

Return type:

None

graphistry.compute.cluster.dbscan_predict_cuml(X, model)#

Parameters:

X (Any)
model (Any)

Return type:

Any

graphistry.compute.cluster.dbscan_predict_sklearn(X, model)#

DBSCAN has no predict per se, so we reverse engineer one here from https://stackoverflow.com/questions/27822752/scikit-learn-predicting-new-points-with-dbscan

Parameters:

X (DataFrame)
model (Any)

Return type:

ndarray
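
The usual way to "predict" with a fitted DBSCAN is to give a new point the label of its nearest core sample when that sample lies within eps, and -1 (noise) otherwise. The sketch below shows that rule in plain Python. With scikit-learn, the core points and their labels would come from the fitted model's `components_`, `labels_`, and `core_sample_indices_` attributes; here they are passed in directly, and `dbscan_predict` is a hypothetical name rather than the library's function.

```python
import math
from typing import List, Sequence


def dbscan_predict(
    new_points: Sequence[Sequence[float]],
    core_points: Sequence[Sequence[float]],
    core_labels: Sequence[int],
    eps: float,
) -> List[int]:
    """Label each new point with its nearest core sample's cluster, or -1 if none is within eps."""
    labels = []
    for point in new_points:
        distances = [math.dist(point, core) for core in core_points]
        nearest = min(range(len(core_points)), key=distances.__getitem__)
        labels.append(core_labels[nearest] if distances[nearest] < eps else -1)
    return labels


core_points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
core_labels = [0, 0, 1]
predicted = dbscan_predict(
    [[0.05, 0.0], [5.1, 5.0], [20.0, 20.0]], core_points, core_labels, eps=0.5
)
```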

graphistry.compute.cluster.get_model_matrix(g, kind, cols, umap, target)#

Allows for a single function to get the model matrix for both nodes and edges as well as targets, embeddings, and features

Args:

g:: graphistry graph
kind:: ‘nodes’ or ‘edges’
cols:: list of columns to use for clustering given g.featurize has been run
umap:: whether to use UMAP embeddings or features dataframe
target:: whether to use the target dataframe or features dataframe

Returns:

pd.DataFrame: dataframe of model matrix given the inputs

Parameters:

g (Plottable)
kind (Literal['nodes', 'edges'])
cols (List | str | None)

Return type:

Any

graphistry.compute.cluster.make_safe_gpu_dataframes(X, y, engine)#

Coerce a dataframe to pd vs cudf based on engine

Parameters:

X (Any | None)
y (Any | None)
engine (Engine)

Return type:

Tuple[Any | None, Any | None]

graphistry.compute.cluster.resolve_dbscan_engine(engine, g_or_df=None)#

Resolves the engine to use for DBSCAN clustering

If ‘auto’, decide by checking whether cuml or sklearn is installed and, if provided, the natural type of the dataset. GPU is used if both a GPU dataset is given and a GPU library is installed. Otherwise, the CPU library is used.

Parameters:

engine (Literal['cuml', 'sklearn', 'auto'])
g_or_df (Any | None)

Return type:

Literal[‘cuml’, ‘sklearn’]
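
The resolution rule described above can be sketched as a small function. This is an illustration of the stated logic, not the library's code: `resolve_engine` and `looks_like_gpu_data` are hypothetical, the GPU check is a simple module-name heuristic, and falling back to CPU when no dataset is given is an assumption.

```python
import importlib.util
from typing import Any, Optional


def looks_like_gpu_data(obj: Any) -> bool:
    """Heuristic: treat objects defined in a `cudf` module as GPU datasets."""
    return type(obj).__module__.split(".")[0] == "cudf"


def resolve_engine(engine: str, g_or_df: Optional[Any] = None) -> str:
    """Resolve 'auto' to 'cuml' or 'sklearn'; pass explicit choices through."""
    if engine in ("cuml", "sklearn"):
        return engine
    if engine != "auto":
        raise ValueError(f"unknown engine: {engine!r}")
    # find_spec checks whether the package is installed without importing it.
    has_cuml = importlib.util.find_spec("cuml") is not None
    if has_cuml and g_or_df is not None and looks_like_gpu_data(g_or_df):
        return "cuml"
    return "sklearn"


chosen = resolve_engine("auto", g_or_df=[1, 2, 3])
```

A plain Python list is not a GPU dataset, so the call above resolves to the CPU engine whether or not cuml is installed.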

RGCN#

class graphistry.networks.DotProductPredictor#

Bases: object

forward(graph, h)#

class graphistry.networks.GCN(in_feats, h_feats, num_classes)#

Bases: object

forward(g, in_feat)#

class graphistry.networks.HeteroClassifier(in_dim, hidden_dim, n_classes, rel_names)#

Bases: object

forward(g)#

class graphistry.networks.HeteroEmbed(num_nodes, num_rels, d, proto, node_features=None, device='cpu', reg=0.01)#

Bases: object

Parameters:

num_nodes (int)
num_rels (int)
d (int)

loss(node_embedding, triplets, labels)#

score(node_embedding, triplets)#

class graphistry.networks.LinkPredModel(in_features, hidden_features, out_features)#

Bases: object

forward(g, x)#

class graphistry.networks.LinkPredModelMultiOutput(in_features, hidden_features, out_features, out_classes)#

Bases: object

embed(g, x)#

forward(g, x)#

class graphistry.networks.MLPPredictor(in_features, out_classes)#

Bases: object

Prediction function that uses an MLP to predict a vector for each edge. Such a vector can be used in further downstream tasks, e.g. as the logits of a categorical distribution.

apply_edges(edges)#

forward(graph, h)#

class graphistry.networks.RGCN(in_feats, hid_feats, out_feats, rel_names)#

Bases: object

Heterograph model that gathers messages from neighbors along all edge types. The module dgl.nn.pytorch.HeteroGraphConv (also available in MXNet and TensorFlow) performs message passing on all edge types, combining a separate graph convolution module for each edge type.

returns:: torch model with forward pass methods, suitable for fitting in the standard way

forward(graph, inputs)#

class graphistry.networks.RGCNEmbed(d, num_nodes, num_rels, hidden=None, device='cpu')#

Bases: object

forward(g, node_features=None)#

class graphistry.networks.SAGE(in_feats, hid_feats, out_feats)#

Bases: object

forward(graph, inputs)#

graphistry.networks.train_link_pred(model, G, epochs=100, use_cross_entropy_loss=False)#

HeterographEmbedModuleMixin#

class graphistry.embed_utils.EmbedDistScore#

Bases: object

static DistMult(h, r, t)#

Parameters:

h (Any)
r (Any)
t (Any)

Return type:

Any

static RotatE(h, r, t)#

Parameters:

h (Any)
r (Any)
t (Any)

Return type:

Any

static TransE(h, r, t)#

Parameters:

h (Any)
r (Any)
t (Any)

Return type:

Any
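
For orientation, the three scoring functions named above are commonly defined as follows in the knowledge-graph embedding literature. The sketch below is dependency-free and uses textbook formulations on plain Python sequences. The function names are hypothetical, and the choice of L1 norm, the sign convention (higher means more plausible), and the phase-angle encoding of the RotatE relation are assumptions. The library's own versions accept `h`, `r`, `t` of type `Any` and may differ in these details.

```python
import cmath
from typing import Sequence


def transe_score(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """TransE: a relation acts as a translation, so h + r should land near t."""
    # Negative L1 distance: 0 is a perfect match, more negative is worse.
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))


def distmult_score(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """DistMult: trilinear product, i.e. a dot product weighted by the relation."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def rotate_score(h: Sequence[complex], phases: Sequence[float], t: Sequence[complex]) -> float:
    """RotatE: a relation rotates each complex coordinate of h by a phase angle."""
    # exp(i * phase) has unit modulus, so multiplying by it is a pure rotation.
    return -sum(abs(hi * cmath.exp(1j * p) - ti) for hi, p, ti in zip(h, phases, t))


if __name__ == "__main__":
    print(transe_score([1.0, 2.0], [0.5, -1.0], [1.5, 1.0]))
    print(distmult_score([1, 2], [3, 4], [5, 6]))
    print(rotate_score([1 + 0j], [cmath.pi / 2], [1j]))
```

TransE and RotatE are distance-based, so a triple that fits perfectly scores zero, while DistMult is a similarity and grows with agreement between the three vectors.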

class graphistry.embed_utils.HeterographEmbedModuleMixin(*args, **kwargs)#

Bases: ComputeMixin

Embed a graph using a relational graph convolutional network (RGCN), and return a new graphistry graph with the embeddings as node attributes.

Parameters#

relation (str): column to use as the relation between nodes
proto (ProtoSymbolic): metric to use, one of ['TransE', 'RotatE', 'DistMult'], or provide your own. Defaults to 'DistMult'.
embedding_dim (int): relation embedding dimension. Defaults to 32
use_feat (bool): whether to featurize nodes; if False, random embeddings are produced and shaped during training. Defaults to True
X (XSymbolic): which columns in the nodes dataframe to featurize. Inherits args from graphistry.featurize(). Defaults to None.
epochs (int): number of training epochs. Defaults to 2
batch_size (int): batch size. Defaults to 32
train_split (Union[float, int]): fraction of the data used for training, between 0 and 1. Defaults to 0.8.
sample_size (int): sample size. Defaults to 1000
num_steps (int): number of steps. Defaults to 50
lr (float): learning rate. Defaults to 0.002
inplace (Optional[bool]): whether to operate in place
device (Optional[str]): accelerator. Defaults to "cpu"
evaluate (bool): whether to evaluate. Defaults to False.

Returns#

self : graphistry instance

Parameters:

relation (str)
proto (str | Callable[[Any, Any, Any], Any] | None)
embedding_dim (int)
use_feat (bool)
X (DataFrame | ndarray | List[str] | None)
epochs (int)
batch_size (int)
train_split (float | int)
sample_size (int)
num_steps (int)
lr (float)
inplace (bool | None)
device (str | None)
evaluate (bool)

Return type:

Plottable

predict_links(source=None, relation=None, destination=None, threshold=0.5, anomalous=False, retain_old_edges=False, return_dataframe=False)#

Predicts links over all combinations of the given sources, relations, and destinations.

Parameters#

source (list): targeted source nodes. Defaults to None (all).
relation (list): targeted relations. Defaults to None (all).
destination (list): targeted destinations. Defaults to None (all).
threshold (Optional[float]): probability threshold. Defaults to 0.5
retain_old_edges (Optional[bool]): whether to include old edges in the predicted graph. Defaults to False.
return_dataframe (Optional[bool]): whether to return a dataframe instead of a graphistry instance. Defaults to False.
anomalous (Optional[bool]): if True, returns the edges with score < threshold, i.e. low-confidence (anomalous) edges. Defaults to False.

Returns#

Graphistry instance (or a dataframe, if return_dataframe is True) containing the corresponding source, relation, destination, and score columns, where score >= threshold if anomalous is False, else score <= threshold

Parameters:

source (list | None)
relation (list | None)
destination (list | None)
threshold (float | None)
anomalous (bool | None)
retain_old_edges (bool | None)
return_dataframe (bool | None)

Return type:

Plottable
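
The interaction between `threshold` and `anomalous` can be illustrated with a small filter over already-scored candidate edges. This is a minimal sketch, not the library's code: `filter_scored_edges` and the `src`, `rel`, `dst`, `score` keys are hypothetical, and the sketch assumes a strict `<` comparison in anomalous mode, which is one of the two boundary conventions mentioned in the descriptions above.

```python
from typing import Dict, List, Union

Edge = Dict[str, Union[str, float]]


def filter_scored_edges(
    edges: List[Edge], threshold: float = 0.5, anomalous: bool = False
) -> List[Edge]:
    """Keep confident edges by default, or low-confidence edges when anomalous=True."""
    if anomalous:
        # Assumption: strict comparison, so an edge exactly at the threshold is not anomalous.
        return [e for e in edges if e["score"] < threshold]
    return [e for e in edges if e["score"] >= threshold]


if __name__ == "__main__":
    candidates: List[Edge] = [
        {"src": "a", "rel": "knows", "dst": "b", "score": 0.91},
        {"src": "a", "rel": "knows", "dst": "c", "score": 0.50},
        {"src": "b", "rel": "owns", "dst": "d", "score": 0.12},
    ]
    likely = filter_scored_edges(candidates, threshold=0.5)
    unlikely = filter_scored_edges(candidates, threshold=0.5, anomalous=True)
    print([e["dst"] for e in likely])    # ['b', 'c']
    print([e["dst"] for e in unlikely])  # ['d']
```

Raising the threshold narrows the predicted set and widens the anomalous set, so the same scores can serve both link prediction and anomaly detection.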

predict_links_all(threshold=0.5, anomalous=False, retain_old_edges=False, return_dataframe=False)#

Predicts links over the entire graph, given a threshold.

Parameters#

threshold (Optional[float]): probability threshold. Defaults to 0.5
anomalous (Optional[bool]): if True, returns the edges with score < threshold, i.e. low-confidence (anomalous) edges. Defaults to False.
retain_old_edges (Optional[bool]): whether to include old edges in the predicted graph. Defaults to False.
return_dataframe (Optional[bool]): whether to return a dataframe instead of a graphistry instance. Defaults to False.

Returns#

Plottable: graphistry graph instance containing all predicted or anomalous links, or a dataframe if return_dataframe is True

Parameters:

threshold (float | None)
anomalous (bool | None)
retain_old_edges (bool | None)
return_dataframe (bool | None)

Return type:

Plottable

class graphistry.embed_utils.SubgraphIterator(g, sample_size=3000, num_steps=1000)#

Bases: object

Parameters:

sample_size (int)
num_steps (int)

graphistry.embed_utils.check_cudf()#

graphistry.embed_utils.log(msg)#

Parameters:: msg (str)
Return type:: None
