Plugins

Utilities

Featurize

class graphistry.feature_utils.Embedding(df)

Bases: object

Generates random embeddings of a given dimension that align with the index of the DataFrame.

Parameters

df (DataFrame) –

fit(n_dim)
Parameters

n_dim (int) –

fit_transform(n_dim)
Parameters

n_dim (int) –

transform(ids)
Return type

DataFrame

class graphistry.feature_utils.FastEncoder(df, y=None, kind='nodes')

Bases: object

fit(src=None, dst=None, *args, **kwargs)
fit_transform(src=None, dst=None, *args, **kwargs)
scale(X=None, y=None, return_pipeline=False, *args, **kwargs)

Fits new scaling functions on df, y via args-kwargs

Example:
from graphistry.features import SCALERS, SCALER_OPTIONS
print(SCALERS)
g = graphistry.nodes(df)
# set a scaling strategy for features and targets -- umap uses those and produces different results depending.
g2 = g.umap(use_scaler='standard', use_scaler_target=None)

# later if you want to scale new data, you can do so
X, y = g2.transform(df, df, return_graph=False, scaled=False)  # unscaled transformer output as DataFrames
# now scale with new settings
X_scaled, y_scaled = g2.scale(X, y, use_scaler='minmax', use_scaler_target='kbins', n_bins=5)
# fit some other pipeline
clf.fit(X_scaled, y_scaled)

args:

:X: pd.DataFrame of features
:y: pd.DataFrame of target features
:kind: str, one of 'nodes' or 'edges'
*args, **kwargs: passed to smart_scaler pipeline
returns:

scaled X, y

transform(df, ydf=None)

Raw transform, no scaling.

transform_scaled(df, ydf=None, scaling_pipeline=None, scaling_pipeline_target=None)
class graphistry.feature_utils.FastMLB(mlb, in_column, out_columns)

Bases: object

fit(X, y=None)
get_feature_names_in()
get_feature_names_out()
transform(df)
class graphistry.feature_utils.FeatureMixin(*args, **kwargs)

Bases: object

FeatureMixin for automatic featurization of nodes and edges DataFrames. Subclasses UMAPMixin for umap-ing of automatic features.

Usage:

g = graphistry.nodes(df, 'node_column')
g2 = g.featurize()

or for edges,

g = graphistry.edges(df, 'src', 'dst')
g2 = g.featurize(kind='edges')

or chain them for both nodes and edges,

g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node_column')
g2 = g.featurize().featurize(kind='edges')
featurize(kind='nodes', X=None, y=None, use_scaler=None, use_scaler_target=None, cardinality_threshold=40, cardinality_threshold_target=400, n_topics=42, n_topics_target=12, multilabel=False, embedding=False, use_ngrams=False, ngram_range=(1, 3), max_df=0.2, min_df=3, min_words=4.5, model_name='paraphrase-MiniLM-L6-v2', impute=True, n_quantiles=100, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', similarity=None, categories='auto', keep_n_decimals=5, remove_node_column=True, inplace=False, feature_engine='auto', dbscan=False, min_dist=0.5, min_samples=1, memoize=True, verbose=False)

Featurize Nodes or Edges of the underlying nodes/edges DataFrames.

Parameters
  • kind (str) – specify whether to featurize nodes or edges. Edge featurization includes a pairwise src-to-dst feature block using a MultiLabelBinarizer, with any other columns being treated the same way as with nodes featurization.

  • X (Union[List[str], str, DataFrame, None]) – Optional input, default None. If symbolic, evaluated against self data based on kind. If None, will featurize all columns of DataFrame

  • y (Union[List[str], str, DataFrame, None]) – Optional Target(s) columns or explicit DataFrame, default None

  • use_scaler (Optional[str]) – selects which scaler to use to scale the data (missing values are automatically imputed using the mean strategy). Options are “minmax”, “quantile”, “standard”, “robust”, “kbins”; default None. Please see the scikit-learn documentation https://scikit-learn.org/stable/modules/preprocessing.html. Here ‘standard’ corresponds to ‘StandardScaler’ in scikit-learn.

  • cardinality_threshold (int) – dirty_cat threshold on cardinality of categorical labels across columns. If the number of unique values in a column is greater than the threshold, will run GapEncoder (a topic model) on the column. If below, will one-hot encode. Default 40.

  • cardinality_threshold_target (int) – similar to cardinality_threshold, but for target features. Default is set high (400), as targets generally want to be one-hot encoded, but sometimes it can be useful to use GapEncoder (i.e., set the threshold lower) to create regressive targets, especially when those targets are textual/softly categorical and have semantic meaning across different labels. E.g., suppose a column has fields like [‘Application Fraud’, ‘Other Statuses’, ‘Lost/Stolen Fraud’, ‘Investigation Fraud’, …]; the GapEncoder will concentrate the ‘Fraud’ labels together.

  • n_topics (int) – the number of topics to use in the GapEncoder if cardinality_threshold is exceeded. Default is 42, but a good rule of thumb is to consult the Johnson-Lindenstrauss Lemma https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma or use the simplified random walk estimate => n_topics_lower_bound ~ (pi/2) * (N-documents)**(1/4)

  • n_topics_target (int) – the number of topics to use in the GapEncoder if cardinality_threshold_target is exceeded for the target(s). Default 12.

  • min_words (float) – sets threshold on how many words to consider in a textual column if it is to be considered in the text processing pipeline. Set this very high if you want any textual columns to bypass the transformer, in favor of GapEncoder (topic modeling). Set to 0 to force all named columns to be encoded as textual (embedding)

  • model_name (str) – Sentence Transformer model to use. The default Paraphrase model makes useful vectors, but at the cost of encoding time. If faster encoding is needed, average_word_embeddings_komninos is useful, though it produces less semantically relevant vectors. Please see the sentence_transformer (https://www.sbert.net/) library for all available models.

  • multilabel (bool) – if True, will encode a single target column composed of lists of lists as multilabel outputs. This only works with y=[‘a_single_col’], default False

  • embedding (bool) – If True, produces a random node embedding of size n_topics, default False. If no node features are provided, will produce random embeddings (for GNN models, for example)

  • use_ngrams (bool) – If True, will encode textual columns as TfIdf Vectors, default False.

  • ngram_range (tuple) – if use_ngrams=True, can set ngram_range, eg: tuple = (1, 3)

  • max_df (float) – if use_ngrams=True, set max word frequency to consider in vocabulary, eg: max_df = 0.2

  • min_df (int) – if use_ngrams=True, set min word count to consider in vocabulary eg: min_df = 3 or 0.00001

  • categories (Optional[str]) – Optional[str] in [“auto”, “k-means”, “most_frequent”], decides which category to select in Similarity Encoding, default ‘auto’

  • impute (bool) – Whether to impute missing values, default True

  • n_quantiles (int) – if use_scaler = ‘quantile’, sets the number of quantiles, default 100.

  • output_distribution (str) – if use_scaler = ‘quantile’, can return distribution as [“normal”, “uniform”]

  • quantile_range – if use_scaler = ‘robust’|’quantile’, sets the quantile range.

  • n_bins (int) – number of bins to use in kbins discretizer, default 10

  • encode (str) – encoding for KBinsDiscretizer, can be one of onehot, onehot-dense, ordinal, default ‘ordinal’

  • strategy (str) – strategy for KBinsDiscretizer, can be one of uniform, quantile, kmeans, default ‘uniform’

  • dbscan (bool) – whether to run DBSCAN, default False.

  • min_dist (float) – DBSCAN eps parameter, default 0.5.

  • min_samples (int) – DBSCAN min_samples parameter, default 1.

  • keep_n_decimals (int) – number of decimals to keep

  • remove_node_column (bool) – whether to remove node column so it is not featurized, default True.

  • inplace (bool) – whether to modify the current graphistry instance in place instead of returning a new one, default False.

  • memoize (bool) – whether to store and reuse results across runs, default True.

  • use_scaler_target (Optional[str]) –

  • similarity (Optional[str]) –

  • feature_engine (Literal[‘none’, ‘pandas’, ‘dirty_cat’, ‘torch’, ‘auto’]) –

  • verbose (bool) –

Returns

graphistry instance with new attributes set by the featurization process.
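The n_topics rule of thumb listed in the parameters above can be evaluated directly. The following is only a sketch of that arithmetic; the helper name n_topics_lower_bound is hypothetical and is not part of graphistry.

```python
import math


def n_topics_lower_bound(n_documents: int) -> float:
    """Simplified random walk estimate: (pi/2) * N**(1/4)."""
    return (math.pi / 2) * n_documents ** 0.25


# Print the estimate for a few dataset sizes (number of rows/documents).
for n in (100, 10_000, 1_000_000):
    print(n, round(n_topics_lower_bound(n), 1))
```

The estimate grows slowly with the number of documents, so it is best read as a floor to compare against the default rather than a recommended value.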

get_matrix(columns=None, kind='nodes', target=False)

Returns the feature matrix. If columns are specified, returns a matrix with only the columns that contain the string column_part in their name. `X = g.get_matrix(['feature1', 'feature2'])` will retrieve a feature matrix with only the columns that contain the string feature1 or feature2 in their name. Most useful for topic modeling, where the column names are of the form topic_0: descriptor, topic_1: descriptor, etc. Can retrieve unique columns in the original dataframe, or actual topic features like [ip_part, shoes, preference_x, etc]. A powerful way to retrieve features from a featurized graph by column or (top) features of interest.

Example:

# get the full feature matrices
X = g.get_matrix()
y = g.get_matrix(target=True)

# get subset of features, or topics, given topic model encoding
X = g2.get_matrix(['172', 'percent'])
X.columns
    => ['ip_172.56.104.67', 'ip_172.58.129.252', 'item_percent']
# or in targets
y = g2.get_matrix(['total', 'percent'], target=True)
y.columns
    => ['basket_price_total', 'conversion_percent', 'CTR_percent', 'CVR_percent']

# not as useful for sbert features. 
Caveats:
  • if you have a column name that is a substring of another column name, you may get unexpected results.

Args:
columns (Union[List, str])

list of column names or a single column name that may exist in columns of the feature matrix. If None, returns original feature matrix

kind (str, optional)

Node or Edge features. Defaults to ‘nodes’.

target (bool, optional)

If True, returns the target matrix. Defaults to False.

Returns:

pd.DataFrame: feature matrix with only the columns that contain the string column_part in their name.

Parameters
  • columns (Union[List, str, None]) –

  • kind (str) –

  • target (bool) –

Return type

DataFrame
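The substring matching described for get_matrix, including its caveat, can be illustrated on plain column names. This is a minimal pure-Python sketch; select_columns_by_parts is a hypothetical helper, and the column order returned by graphistry itself may differ.

```python
from typing import List, Optional, Union


def select_columns_by_parts(columns: List[str],
                            parts: Optional[Union[List[str], str]]) -> List[str]:
    """Keep the column names that contain any of the given substrings."""
    if parts is None:
        return list(columns)  # no filter: return every column
    if isinstance(parts, str):
        parts = [parts]       # accept a single substring as well as a list
    return [c for c in columns if any(p in c for p in parts)]


feature_columns = ['ip_172.56.104.67', 'ip_172.58.129.252', 'item_percent', 'price']
print(select_columns_by_parts(feature_columns, ['172', 'percent']))

# Caveat: 'ip' is also a substring of 'zip_code', so both columns match.
print(select_columns_by_parts(['ip_a', 'zip_code'], 'ip'))
```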

scale(df=None, y=None, kind='nodes', use_scaler=None, use_scaler_target=None, impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', keep_n_decimals=5, return_scalers=False)

Scale data using the same scalers as used in the featurization step.

Example

g = graphistry.nodes(df)
X, y = g.featurize().scale(kind='nodes', use_scaler='robust', use_scaler_target='kbins', n_bins=3)

# or 
g = graphistry.nodes(df)
# set a scaling strategy for features and targets -- umap uses those and produces different results depending.
g2 = g.umap(use_scaler='standard', use_scaler_target=None)

# later if you want to scale new data, you can do so
X, y = g2.transform(df, df, return_graph=False, scaled=False)  # unscaled DataFrames
X_scaled, y_scaled = g2.scale(X, y, use_scaler='minmax', use_scaler_target='kbins', n_bins=5)
# fit some other pipeline
clf.fit(X_scaled, y_scaled)

Args:

df

pd.DataFrame, raw data to transform, if None, will use data from featurization fit

y

pd.DataFrame, optional target data

kind

str, one of nodes, edges

use_scaler

str, optional, one of minmax, robust, standard, kbins, quantile

use_scaler_target

str, optional, one of minmax, robust, standard, kbins, quantile

impute

bool, if True, will impute missing values

n_quantiles

int, number of quantiles to use for quantile scaler

output_distribution

str, one of normal, uniform

quantile_range

tuple, range of quantiles to use for quantile scaler

n_bins

int, number of bins to use for KBinsDiscretizer

encode

str, one of ordinal, onehot, onehot-dense

strategy

str, one of uniform, quantile, kmeans

keep_n_decimals

int, number of decimals to keep after scaling

return_scalers

bool, if True, will return the scalers used to scale the data

Returns:

(X, y) scaled data, or (X, y, scaler, scaler_target) if return_scalers is True

Parameters
  • df (Optional[DataFrame]) –

  • y (Optional[DataFrame]) –

  • kind (str) –

  • use_scaler (Optional[str]) –

  • use_scaler_target (Optional[str]) –

  • impute (bool) –

  • n_quantiles (int) –

  • output_distribution (str) –

  • n_bins (int) –

  • encode (str) –

  • strategy (str) –

  • keep_n_decimals (int) –

  • return_scalers (bool) –
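To make the scaler options more concrete, here is a pure-Python sketch of what 'minmax' scaling and 'kbins' with strategy='uniform' and encode='ordinal' compute for one numeric column. It assumes these options map onto scikit-learn's MinMaxScaler and KBinsDiscretizer; the function names are hypothetical and this is not graphistry's implementation.

```python
from typing import List


def minmax_scale(values: List[float]) -> List[float]:
    """Rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / span for v in values]


def uniform_bins(values: List[float], n_bins: int) -> List[int]:
    """Assign each value an ordinal code for one of n_bins equal-width bins."""
    scaled = minmax_scale(values)
    # The maximum value falls into the last bin rather than a bin of its own.
    return [min(int(s * n_bins), n_bins - 1) for s in scaled]


column = [0.0, 2.5, 5.0, 7.5, 10.0]
print(minmax_scale(column))
print(uniform_bins(column, n_bins=4))
```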

transform(df, y=None, kind='nodes', min_dist='auto', n_neighbors=7, merge_policy=False, sample=None, return_graph=True, scaled=True, verbose=False)

Transform new data and append to existing graph, or return dataframes

args:

df

pd.DataFrame, raw data to transform

y

pd.DataFrame, optional

kind

str, one of nodes, edges

return_graph

bool, if True, will return a graph with inferred edges.

merge_policy

bool, if True, adds batch to existing graph nodes via nearest neighbors. If False, will infer edges only between nodes in the batch, default False

min_dist

float, if return_graph is True, will use this value in NN search, or ‘auto’ to infer a good value. min_dist represents the maximum distance between two samples for one to be considered as in the neighborhood of the other.

sample

int, if return_graph is True, will sample this many edges of the existing graph to fill out the new graph

n_neighbors

int, if return_graph is True, will use this value for n_neighbors in Nearest Neighbors search

scaled

bool, if True, will use scaled transformation of data set during featurization, default True

verbose

bool, if True, will print metadata about the graph construction, default False

Returns:

X, y: pd.DataFrame, transformed data if return_graph is False or a graphistry Plottable with inferred edges if return_graph is True

Parameters
  • df (DataFrame) –

  • y (Optional[DataFrame]) –

  • kind (str) –

  • min_dist (Union[str, float, int]) –

  • n_neighbors (int) –

  • merge_policy (bool) –

  • sample (Optional[int]) –

  • return_graph (bool) –

  • scaled (bool) –

  • verbose (bool) –
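The role of min_dist as a neighborhood radius can be sketched without any graph library. The example below links samples in a batch whose Euclidean distance is at most min_dist. It illustrates the concept only: infer_edges is a hypothetical helper, and graphistry's own search also involves n_neighbors, the 'auto' setting and merge_policy, none of which are modeled here.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def infer_edges(points: Dict[str, Tuple[float, ...]],
                min_dist: float) -> List[Tuple[str, str, float]]:
    """Link every pair of samples whose Euclidean distance is <= min_dist."""
    edges = []
    for (a, pa), (b, pb) in combinations(points.items(), 2):
        distance = math.dist(pa, pb)
        if distance <= min_dist:
            edges.append((a, b, distance))
    return edges


batch = {'n1': (0.0, 0.0), 'n2': (0.3, 0.4), 'n3': (3.0, 4.0)}
print(infer_edges(batch, min_dist=1.0))   # only the close pair is linked
print(infer_edges(batch, min_dist=10.0))  # a large radius links every pair
```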

graphistry.feature_utils.assert_imported()
graphistry.feature_utils.assert_imported_text()
class graphistry.feature_utils.callThrough(x)

Bases: object

graphistry.feature_utils.check_if_textual_column(df, col, confidence=0.35, min_words=2.5)

Checks if col column of df is textual or not using basic heuristics

Parameters
  • df (DataFrame) – DataFrame

  • col – column name

  • confidence (float) – threshold float value between 0 and 1. If more than this fraction of the elements in column col are of type str, the column is passed on to the next stage of evaluation. Default 0.35

  • min_words (float) – mean minimum words threshold. If mean words across col is greater than this, it is deemed textual. Default 2.5

Return type

bool

Returns

bool, whether column is textual or not
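The two thresholds above suggest a two-stage check, sketched below on a plain list of values. How graphistry computes each stage exactly (for example, which rows enter the word-count mean) is an assumption here, and looks_textual is a hypothetical name.

```python
from typing import Any, List


def looks_textual(values: List[Any], confidence: float = 0.35,
                  min_words: float = 2.5) -> bool:
    """Two-stage heuristic: enough strings, then enough words per string."""
    if not values:
        return False
    strings = [v for v in values if isinstance(v, str)]
    # Stage 1: the fraction of str entries must exceed `confidence`.
    if len(strings) / len(values) <= confidence:
        return False
    # Stage 2: the mean word count of those strings must exceed `min_words`.
    mean_words = sum(len(s.split()) for s in strings) / len(strings)
    return mean_words > min_words


print(looks_textual(['red', 'blue', 'green']))  # short labels
print(looks_textual(['the quick brown fox', 'jumps over the lazy dog', None]))
print(looks_textual([1, 2, 3, 'a b c d']))      # mostly numeric
```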

graphistry.feature_utils.concat_text(df, text_cols)
graphistry.feature_utils.encode_edges(edf, src, dst, mlb, fit=False)

edge encoder – creates multilabelBinarizer on edge pairs.

Args:

edf (pd.DataFrame): edge dataframe
src (string): source column
dst (string): destination column
mlb (sklearn): multilabelBinarizer
fit (bool, optional): If true, fits multilabelBinarizer. Defaults to False.

Returns

tuple: pd.DataFrame, multilabelBinarizer
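For intuition, multi-label binarization of edge pairs turns each (src, dst) pair into a row with a 1 in the columns of its two endpoints. The sketch below reproduces that idea in pure Python; binarize_edges is a hypothetical helper, and sorting the class labels is an assumption modeled on scikit-learn's MultiLabelBinarizer.

```python
from typing import List, Tuple


def binarize_edges(edges: List[Tuple[str, str]]) -> Tuple[List[List[int]], List[str]]:
    """Encode each (src, dst) pair as a multi-hot row over all node labels."""
    classes = sorted({node for pair in edges for node in pair})
    position = {node: i for i, node in enumerate(classes)}
    rows = []
    for src, dst in edges:
        row = [0] * len(classes)
        row[position[src]] = 1  # mark the source node
        row[position[dst]] = 1  # mark the destination node
        rows.append(row)
    return rows, classes


rows, classes = binarize_edges([('a', 'b'), ('b', 'c'), ('a', 'c')])
print(classes)
for row in rows:
    print(row)
```

Note that this encoding does not record edge direction: ('a', 'b') and ('b', 'a') produce the same row.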

graphistry.feature_utils.encode_multi_target(ydf, mlb=None)
graphistry.feature_utils.encode_textual(df, min_words=2.5, model_name='paraphrase-MiniLM-L6-v2', use_ngrams=False, ngram_range=(1, 3), max_df=0.2, min_df=3)
Parameters
  • df (DataFrame) –

  • min_words (float) –

  • model_name (str) –

  • use_ngrams (bool) –

  • ngram_range (tuple) –

  • max_df (float) –

  • min_df (int) –

Return type

Tuple[DataFrame, List, Any]

graphistry.feature_utils.features_without_target(df, y=None)

Checks if y DataFrame column name is in df, and removes it from df if so

Parameters
  • df (DataFrame) – model DataFrame

  • y (Union[List, str, DataFrame, None]) – target DataFrame

Return type

DataFrame

Returns

DataFrames of model and target

graphistry.feature_utils.find_bad_set_columns(df, bad_set=['[]'])

Finds columns that if not coerced to strings, will break processors.

Parameters
  • df (DataFrame) – DataFrame

  • bad_set (List) – List of strings to look for.

Returns

list

graphistry.feature_utils.fit_pipeline(X, transformer, keep_n_decimals=5)

Helper to fit DataFrame over transformer pipeline. Rounds the resulting matrix X to keep_n_decimals decimal places if not 0. This helps when the transformer pipeline scales or imputes, which sometimes introduces small negative numbers, while UMAP metrics like Hellinger need to be positive.

Parameters
  • X (DataFrame) – DataFrame to transform.

  • transformer – Pipeline object to fit and transform

  • keep_n_decimals (int) – Int of how many decimal places to keep in rounded transformed data

Return type

DataFrame

graphistry.feature_utils.get_cardinality_ratio(df)

Calculates the ratio of unique values to total number of rows of DataFrame

Parameters

df (DataFrame) – DataFrame
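A cardinality ratio of this kind can be sketched on a plain mapping of column name to values. Computing it per column is an assumption of this example, and cardinality_ratio is a hypothetical name rather than the library function.

```python
from typing import Dict, Hashable, List


def cardinality_ratio(columns: Dict[str, List[Hashable]]) -> Dict[str, float]:
    """Number of unique values divided by the number of rows, per column."""
    return {name: len(set(values)) / len(values)
            for name, values in columns.items() if values}


table = {
    'user_id': ['u1', 'u2', 'u3', 'u4'],  # every value is unique
    'country': ['US', 'US', 'FR', 'US'],  # few distinct values
}
print(cardinality_ratio(table))
```

A ratio near 1 points to identifier-like columns, while a low ratio points to categorical columns.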

graphistry.feature_utils.get_dataframe_by_column_dtype(df, include=None, exclude=None)
graphistry.feature_utils.get_matrix_by_column_part(X, column_part)

Get the feature matrix by column part existing in column names.

Parameters
  • X (DataFrame) –

  • column_part (str) –

Return type

DataFrame

graphistry.feature_utils.get_matrix_by_column_parts(X, column_parts)

Get the feature matrix by column parts list existing in column names.

Parameters
  • X (DataFrame) –

  • column_parts (Union[list, str, None]) –

Return type

DataFrame

graphistry.feature_utils.get_numeric_transformers(ndf, y=None)
graphistry.feature_utils.get_preprocessing_pipeline(use_scaler='robust', impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='quantile')

Helper function for imputing and scaling np.ndarray data using different scaling transformers.

Parameters
  • X – np.ndarray

  • impute (bool) – whether to run imputing or not

  • use_scaler (str) – None or a string in [“minmax”, “quantile”, “standard”, “robust”, “kbins”], selects scaling transformer, default ‘robust’

  • n_quantiles (int) – if use_scaler = ‘quantile’, sets the quantile bin size.

  • output_distribution (str) – if use_scaler = ‘quantile’, can return distribution as [“normal”, “uniform”]

  • quantile_range – if use_scaler = ‘robust’/’quantile’, sets the quantile range.

  • n_bins (int) – number of bins to use in kbins discretizer

  • encode (str) – encoding for KBinsDiscretizer, can be one of onehot, onehot-dense, ordinal, default ‘ordinal’

  • strategy (str) – strategy for KBinsDiscretizer, can be one of uniform, quantile, kmeans, default ‘quantile’

Return type

Any

Returns

scaled array, imputer instances or None, scaler instance or None

graphistry.feature_utils.get_text_preprocessor(ngram_range=(1, 3), max_df=0.2, min_df=3)
graphistry.feature_utils.get_textual_columns(df, min_words=2.5)

Collects columns from df that it deems are textual.

Parameters
  • df (DataFrame) – DataFrame

  • min_words (float) –

Return type

List

Returns

list of columns names

graphistry.feature_utils.group_columns_by_dtypes(df, verbose=True)
Parameters
  • df (DataFrame) –

  • verbose (bool) –

Return type

Dict

graphistry.feature_utils.identity(x)
graphistry.feature_utils.impute_and_scale_df(df, use_scaler='robust', impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', keep_n_decimals=5)
Parameters
  • df (DataFrame) –

  • use_scaler (str) –

  • impute (bool) –

  • n_quantiles (int) –

  • output_distribution (str) –

  • n_bins (int) –

  • encode (str) –

  • strategy (str) –

  • keep_n_decimals (int) –

Return type

Tuple[DataFrame, Any]

graphistry.feature_utils.is_dataframe_all_numeric(df)
Parameters

df (DataFrame) –

Return type

bool

graphistry.feature_utils.lazy_import_has_dependancy_text()
graphistry.feature_utils.lazy_import_has_dirty_cat()
graphistry.feature_utils.lazy_import_has_min_dependancy()
graphistry.feature_utils.make_array(X)
graphistry.feature_utils.passthrough_df_cols(df, columns)
graphistry.feature_utils.process_dirty_dataframes(ndf, y, cardinality_threshold=40, cardinality_threshold_target=400, n_topics=42, n_topics_target=7, similarity=None, categories='auto', multilabel=False)

Dirty_Cat encoder for record level data. Will automatically turn inhomogeneous dataframe into matrix using smart conversion tricks.

Parameters
  • ndf (DataFrame) – node DataFrame

  • y (Optional[DataFrame]) – target DataFrame or series

  • cardinality_threshold (int) – For ndf columns, below this threshold, encoder is OneHot, above, it is GapEncoder

  • cardinality_threshold_target (int) – For target columns, below this threshold, encoder is OneHot, above, it is GapEncoder

  • n_topics (int) – number of topics for GapEncoder, default 42

  • use_scaler – None or string in [‘minmax’, ‘standard’, ‘robust’, ‘quantile’]

  • similarity (Optional[str]) – one of ‘ngram’, ‘levenshtein-ratio’, ‘jaro’, or ‘jaro-winkler’. The type of pairwise string similarity to use. If None or False, uses a SuperVectorizer

  • n_topics_target (int) –

  • categories (Optional[str]) –

  • multilabel (bool) –

Return type

Tuple[DataFrame, Optional[DataFrame], Any, Any]

Returns

Encoded data matrix and target (if not None), the data encoder, and the label encoder.
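The cardinality threshold rule described in the parameters above can be sketched as a simple decision on the number of unique values in a column. This only illustrates the rule; choose_encoder is a hypothetical helper, and treating a column exactly at the threshold as one-hot is an assumption.

```python
from typing import List


def choose_encoder(values: List[str], cardinality_threshold: int = 40) -> str:
    """Return the encoder family the threshold rule would select."""
    n_unique = len(set(values))
    # Above the threshold: topic model; otherwise: one-hot encoding.
    return 'GapEncoder' if n_unique > cardinality_threshold else 'OneHotEncoder'


colors = ['red', 'green', 'blue'] * 10                 # 3 unique values
titles = [f'incident report {i}' for i in range(100)]  # 100 unique values
print(choose_encoder(colors))
print(choose_encoder(titles))
print(choose_encoder(titles, cardinality_threshold=400))  # e.g. the target default
```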

graphistry.feature_utils.process_edge_dataframes(edf, y, src, dst, cardinality_threshold=40, cardinality_threshold_target=400, n_topics=42, n_topics_target=7, use_scaler=None, use_scaler_target=None, multilabel=False, use_ngrams=False, ngram_range=(1, 3), max_df=0.2, min_df=3, min_words=2.5, model_name='paraphrase-MiniLM-L6-v2', similarity=None, categories='auto', impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', keep_n_decimals=5, feature_engine='pandas')

Custom Edge-record encoder. Uses a MultiLabelBinarizer to generate a src/dst vector and then process_textual_or_other_dataframes that encodes any other data present in edf, textual or not.

Parameters
  • edf (DataFrame) – pandas DataFrame of edge features

  • y (DataFrame) – pandas DataFrame of edge labels

  • src (str) – source column to select in edf

  • dst (str) – destination column to select in edf

  • use_scaler (Optional[str]) – None or string in [‘minmax’, ‘standard’, ‘robust’, ‘quantile’]

  • cardinality_threshold (int) –

  • cardinality_threshold_target (int) –

  • n_topics (int) –

  • n_topics_target (int) –

  • use_scaler_target (Optional[str]) –

  • multilabel (bool) –

  • use_ngrams (bool) –

  • ngram_range (tuple) –

  • max_df (float) –

  • min_df (int) –

  • min_words (float) –

  • model_name (str) –

  • similarity (Optional[str]) –

  • categories (Optional[str]) –

  • impute (bool) –

  • n_quantiles (int) –

  • output_distribution (str) –

  • n_bins (int) –

  • encode (str) –

  • strategy (str) –

  • keep_n_decimals (int) –

  • feature_engine (Literal[‘none’, ‘pandas’, ‘dirty_cat’, ‘torch’]) –

Return type

Tuple[DataFrame, DataFrame, DataFrame, DataFrame, List[Any], Any, Optional[Any], Optional[Any], Any, List[str]]

Returns

Encoded data matrix and target (if not None), the data encoders, and the label encoder.

graphistry.feature_utils.process_nodes_dataframes(df, y, cardinality_threshold=40, cardinality_threshold_target=400, n_topics=42, n_topics_target=7, use_scaler='robust', use_scaler_target='kbins', multilabel=False, embedding=False, use_ngrams=False, ngram_range=(1, 3), max_df=0.2, min_df=3, min_words=2.5, model_name='paraphrase-MiniLM-L6-v2', similarity=None, categories='auto', impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', keep_n_decimals=5, feature_engine='pandas')

Automatic Deep Learning Embedding/ngrams of Textual Features, with the rest of the columns taken care of by dirty_cat

Parameters
  • df (DataFrame) – pandas DataFrame of data

  • y (DataFrame) – pandas DataFrame of targets

  • use_scaler (Optional[str]) – None or string in [‘minmax’, ‘standard’, ‘robust’, ‘quantile’]

  • n_topics (int) – number of topics in Gap Encoder

  • confidence – Number between 0 and 1. A column is passed on for textual processing if the fraction of its entries that are string-like is above this threshold.

  • min_words (float) – Sets the threshold for the average number of words needed to include a column for textual sentence encoding. Lower values mean that more columns will be labeled textual and sent to the sentence encoder. Set to 0 to force named columns as textual.

  • model_name (str) – SentenceTransformer model name. See available list at https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models

  • cardinality_threshold (int) –

  • cardinality_threshold_target (int) –

  • n_topics_target (int) –

  • use_scaler_target (Optional[str]) –

  • multilabel (bool) –

  • embedding (bool) –

  • use_ngrams (bool) –

  • ngram_range (tuple) –

  • max_df (float) –

  • min_df (int) –

  • similarity (Optional[str]) –

  • categories (Optional[str]) –

  • impute (bool) –

  • n_quantiles (int) –

  • output_distribution (str) –

  • n_bins (int) –

  • encode (str) –

  • strategy (str) –

  • keep_n_decimals (int) –

  • feature_engine (Literal[‘none’, ‘pandas’, ‘dirty_cat’, ‘torch’]) –

Return type

Tuple[DataFrame, Any, DataFrame, Any, Any, Any, Optional[Any], Optional[Any], Any, List[str]]

Returns

X_enc, y_enc, data_encoder, label_encoder, scaling_pipeline, scaling_pipeline_target, text_model, text_cols,

graphistry.feature_utils.prune_weighted_edges_df_and_relabel_nodes(wdf, scale=0.1, index_to_nodes_dict=None)

Prune the weighted edge DataFrame so to return high fidelity similarity scores.

Parameters
  • wdf (DataFrame) – weighted edge DataFrame gotten via UMAP

  • scale (float) – lower values mean fewer edges are kept; edges are kept when weight > (max - scale * std)

  • index_to_nodes_dict (Optional[Dict]) – dict of index to node name; remap src/dst values if provided

Return type

DataFrame

Returns

pd.DataFrame
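The pruning rule hinted at by the scale parameter can be sketched on a list of weighted edges. This follows only the threshold expression given above (max - scale * std); the use of the sample standard deviation and the name prune_edges are assumptions, and the library's actual pruning may differ in detail.

```python
import statistics
from typing import List, Tuple

Edge = Tuple[str, str, float]


def prune_edges(edges: List[Edge], scale: float = 0.1) -> List[Edge]:
    """Keep edges whose weight exceeds max - scale * std of all weights."""
    weights = [w for _, _, w in edges]
    threshold = max(weights) - scale * statistics.stdev(weights)
    return [edge for edge in edges if edge[2] > threshold]


weighted = [('a', 'b', 0.9), ('a', 'c', 0.8), ('b', 'c', 0.5), ('c', 'd', 0.1)]
print(prune_edges(weighted, scale=0.1))  # strict: only the strongest edge
print(prune_edges(weighted, scale=1.0))  # looser: more edges survive
print(prune_edges(weighted, scale=3.0))  # very loose: all edges survive
```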

graphistry.feature_utils.remove_internal_namespace_if_present(df)

Some transformations below add columns to the DataFrame; this method removes them before featurization. Will not drop if suffix is added during UMAP-ing.

Parameters

df (DataFrame) – DataFrame

Returns

DataFrame with dropped columns in reserved namespace

graphistry.feature_utils.remove_node_column_from_symbolic(X_symbolic, node)
graphistry.feature_utils.resolve_X(df, X)
Parameters
  • df (Optional[DataFrame]) –

  • X (Union[List[str], str, DataFrame, None]) –

Return type

DataFrame

graphistry.feature_utils.resolve_feature_engine(feature_engine)
Parameters

feature_engine (Literal[‘none’, ‘pandas’, ‘dirty_cat’, ‘torch’, ‘auto’]) –

Return type

Literal[‘none’, ‘pandas’, ‘dirty_cat’, ‘torch’]

graphistry.feature_utils.resolve_y(df, y)
Parameters
  • df (Optional[DataFrame]) –

  • y (Union[List[str], str, DataFrame, None]) –

Return type

DataFrame

graphistry.feature_utils.reuse_featurization(g, memoize, metadata)
Parameters
  • g (Plottable) –

  • memoize (bool) –

  • metadata (Any) –

graphistry.feature_utils.safe_divide(a, b)
graphistry.feature_utils.set_currency_to_float(df, col, return_float=True)
Parameters
  • df (DataFrame) –

  • col (str) –

  • return_float (bool) –

graphistry.feature_utils.set_to_bool(df, col, value)
Parameters
  • df (DataFrame) –

  • col (str) –

  • value (Any) –

graphistry.feature_utils.set_to_datetime(df, cols, new_col)
Parameters
  • df (DataFrame) –

  • cols (List) –

  • new_col (str) –

graphistry.feature_utils.set_to_numeric(df, cols, fill_value=0.0)
Parameters
  • df (DataFrame) –

  • cols (List) –

  • fill_value (float) –

graphistry.feature_utils.smart_scaler(X_enc, y_enc, use_scaler, use_scaler_target, impute=True, n_quantiles=10, output_distribution='normal', quantile_range=(25, 75), n_bins=10, encode='ordinal', strategy='uniform', keep_n_decimals=5)
Parameters
  • impute (bool) –

  • n_quantiles (int) –

  • output_distribution (str) –

  • n_bins (int) –

  • encode (str) –

  • strategy (str) –

  • keep_n_decimals (int) –

graphistry.feature_utils.transform(df, ydf, res, kind, src, dst)
Parameters
  • df (DataFrame) –

  • ydf (DataFrame) –

  • res (List) –

  • kind (str) –

Return type

Tuple[DataFrame, DataFrame]

graphistry.feature_utils.transform_dirty(df, data_encoder, name='')
Parameters
  • df (DataFrame) –

  • data_encoder (Any) –

  • name (str) –

Return type

DataFrame

graphistry.feature_utils.transform_text(df, text_model, text_cols)
Parameters
  • df (DataFrame) –

  • text_model (Any) –

  • text_cols (Union[List, str]) –

Return type

DataFrame

graphistry.feature_utils.where_is_currency_column(df, col)
Parameters
  • df (DataFrame) –

  • col (str) –

UMAP

class graphistry.umap_utils.UMAPMixin(*args, **kwargs)

Bases: object

UMAP Mixin for automagic UMAPing

filter_weighted_edges(scale=1.0, index_to_nodes_dict=None, inplace=False, kind='nodes')

Filter edges based on _weighted_edges_df (ex: from .umap())

Parameters
  • scale (float) –

  • index_to_nodes_dict (Optional[Dict]) –

  • inplace (bool) –

  • kind (str) –

transform_umap(df, y=None, kind='nodes', min_dist='auto', n_neighbors=7, merge_policy=False, sample=None, return_graph=True, fit_umap_embedding=True, verbose=False)

Transforms data into UMAP embedding

Args:
df

Dataframe to transform

y

Target column

kind

One of nodes or edges

min_dist

Epsilon for including neighbors in infer_graph

n_neighbors

Number of neighbors to use for contextualization

merge_policy

if True, use previous graph, adding new batch to existing graph’s neighbors. Useful to contextualize new data against existing graph. If False, sample is irrelevant.

sample

Sample number of existing graph’s neighbors to use for contextualization – helps make denser graphs

return_graph

Whether to return a graph or just the embeddings

fit_umap_embedding

Whether to infer graph from the UMAP embedding on the new data, default True

verbose

Whether to print information about the graph inference

Parameters
  • df (DataFrame) –

  • y (Optional[DataFrame]) –

  • kind (str) –

  • min_dist (Union[str, float, int]) –

  • n_neighbors (int) –

  • merge_policy (bool) –

  • sample (Optional[int]) –

  • return_graph (bool) –

  • fit_umap_embedding (bool) –

  • verbose (bool) –

Return type

Union[Tuple[DataFrame, DataFrame, DataFrame], Plottable]

umap(X=None, y=None, kind='nodes', scale=1.0, n_neighbors=12, min_dist=0.1, spread=0.5, local_connectivity=1, repulsion_strength=1, negative_sample_rate=5, n_components=2, metric='euclidean', suffix='', play=0, encode_position=True, encode_weight=True, dbscan=False, engine='auto', feature_engine='auto', inplace=False, memoize=True, verbose=False, **featurize_kwargs)

UMAP the featurized nodes or edges data, or pass in your own X, y (optional) dataframes of values

Example

>>> import graphistry
>>> import pandas as pd
>>> g = graphistry.nodes(pd.DataFrame({'node': [0,1,2], 'data': [1,2,3], 'meta': ['a', 'b', 'c']}))
>>> g2 = g.umap(n_components=3, spread=1.0, min_dist=0.1, n_neighbors=12, negative_sample_rate=5, local_connectivity=1, repulsion_strength=1.0, metric='euclidean', suffix='', play=0, encode_position=True, encode_weight=True, dbscan=False, engine='auto', feature_engine='auto', inplace=False, memoize=True, verbose=False)
>>> g2.plot()

Parameters

X

either a DataFrame or ndarray of features, or column names to featurize

y

either a DataFrame or ndarray of targets, or column names to featurize targets

kind

nodes or edges or None. If None, expects explicit X, y (optional) matrices and will not associate them with nodes or edges. If X, y (optional) is given with kind in [nodes, edges], it will associate the new matrices with the nodes or edges attributes.

scale

multiplicative scale for pruning the weighted edge DataFrame obtained from UMAP, in the range [0, ..), with the high end meaning keep all edges

n_neighbors

UMAP number of nearest neighbors to include for UMAP connectivity, lower makes more compact layouts. Minimum 2

min_dist

UMAP float between 0 and 1, lower makes more compact layouts.

spread

UMAP spread of values for relaxation

local_connectivity

UMAP connectivity parameter

repulsion_strength

UMAP repulsion strength

negative_sample_rate

UMAP negative sampling rate

n_components

number of components in the UMAP projection, default 2

metric

UMAP metric, default ‘euclidean’. See the UMAP-LEARN documentation (https://umap-learn.readthedocs.io/en/latest/parameters.html) for more.

suffix

optional suffix to add to x, y attributes of umap.

play

Graphistry play parameter, default 0, how much to evolve the network during clustering. 0 preserves the original UMAP layout.

encode_weight

if True, will set new edges_df from implicit UMAP, default True.

encode_position

whether to set default plotting bindings – positions x,y from umap for .plot(), default True

dbscan

whether to run DBSCAN on the UMAP embedding, default False.

engine

selects which engine to use to calculate UMAP: default “auto” will use cuML if available, otherwise UMAP-LEARN.

feature_engine

How to encode data (“none”, “auto”, “pandas”, “dirty_cat”, “torch”)

inplace

whether to modify the current object, default False. When False, returns a new object, which is useful for chaining in a functional paradigm.

memoize

whether to memoize the results of this method, default True.

verbose

whether to print out extra information, default False.

Returns

self, with attributes set with new data

Parameters
  • X (Union[List[str], str, DataFrame, None]) –

  • y (Union[List[str], str, DataFrame, None]) –

  • kind (str) –

  • scale (float) –

  • n_neighbors (int) –

  • min_dist (float) –

  • spread (float) –

  • local_connectivity (int) –

  • repulsion_strength (float) –

  • negative_sample_rate (int) –

  • n_components (int) –

  • metric (str) –

  • suffix (str) –

  • play (Optional[int]) –

  • encode_position (bool) –

  • encode_weight (bool) –

  • dbscan (bool) –

  • engine (Literal[‘cuml’, ‘umap_learn’, ‘auto’]) –

  • feature_engine (str) –

  • inplace (bool) –

  • memoize (bool) –

  • verbose (bool) –

umap_fit(X, y=None, verbose=False)
Parameters
  • X (DataFrame) –

  • y (Optional[DataFrame]) –

umap_lazy_init(res, n_neighbors=12, min_dist=0.1, spread=0.5, local_connectivity=1, repulsion_strength=1, negative_sample_rate=5, n_components=2, metric='euclidean', engine='auto', suffix='', verbose=False)
Parameters
  • n_neighbors (int) –

  • min_dist (float) –

  • spread (float) –

  • local_connectivity (int) –

  • repulsion_strength (float) –

  • negative_sample_rate (int) –

  • n_components (int) –

  • metric (str) –

  • engine (Literal[‘cuml’, ‘umap_learn’, ‘auto’]) –

  • suffix (str) –

  • verbose (bool) –

graphistry.umap_utils.assert_imported()
graphistry.umap_utils.assert_imported_cuml()
graphistry.umap_utils.is_legacy_cuml()
graphistry.umap_utils.lazy_cudf_import_has_dependancy()
graphistry.umap_utils.lazy_cuml_import_has_dependancy()
graphistry.umap_utils.lazy_umap_import_has_dependancy()
graphistry.umap_utils.make_safe_gpu_dataframes(X, y, engine)
graphistry.umap_utils.resolve_umap_engine(engine)
Parameters

engine (Literal[‘cuml’, ‘umap_learn’, ‘auto’]) –

Return type

Literal[‘cuml’, ‘umap_learn’]
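The 'auto' resolution described for the engine parameter (cuML if available, otherwise UMAP-LEARN) can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name resolve_engine_sketch and the use of importlib to detect cuml are assumptions made for the example.

```python
import importlib.util

def resolve_engine_sketch(engine: str = "auto") -> str:
    """Hypothetical stand-in for engine resolution; not graphistry's own code."""
    if engine in ("cuml", "umap_learn"):
        return engine  # explicit choices pass through unchanged
    if engine == "auto":
        # Assumption: "available" means the cuml package can be located
        has_cuml = importlib.util.find_spec("cuml") is not None
        return "cuml" if has_cuml else "umap_learn"
    raise ValueError(f"engine must be 'cuml', 'umap_learn', or 'auto', got {engine!r}")

print(resolve_engine_sketch("auto"))
```

The return value is always one of the two concrete engines, matching the narrower return type shown above.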

graphistry.umap_utils.reuse_umap(g, memoize, metadata)
Parameters
  • g (Plottable) –

  • memoize (bool) –

  • metadata (Any) –

graphistry.umap_utils.umap_graph_to_weighted_edges(umap_graph, engine, is_legacy, cfg=<module 'graphistry.constants'>)

Semantic Search

class graphistry.text_utils.SearchToGraphMixin(*args, **kwargs)

Bases: object

assert_features_line_up_with_nodes()
assert_fitted()
build_index(angular=False, n_trees=None)
classmethod load_search_instance(savepath)
save_search_instance(savepath)
search(query, cols=None, thresh=5000, fuzzy=True, top_n=10)

Natural language query over nodes that returns a dataframe of results sorted by relevance column “distance”.

If node data is not yet feature-encoded (and explicit edges are given), run automatic feature engineering:

g2 = g.featurize(kind='nodes', X=['text_col_1', ..],
    min_words=0  # forces all named columns to be textually encoded
)

If edges do not yet exist, generate them via

g2 = g.umap(kind='nodes', X=['text_col_1', ..],
    min_words=0  # forces all named columns to be textually encoded
)

If an index is not yet built, it is generated on the fly via g2.build_index() at search time. Alternatively, call g2.build_index() to build it ahead of time.

Args:
query (str)

natural language query.

cols (list or str, optional)

if fuzzy=False, select which column to query. Defaults to None since fuzzy=True by default.

thresh (float, optional)

distance threshold from query vector to returned results. Defaults to 5000, set large just in case, but could be as low as 10.

fuzzy (bool, optional)

if True, uses embedding + annoy index for recall; otherwise does string matching over the given cols. Defaults to True.

top_n (int, optional)

how many results to return. Defaults to 10.

Returns:

pd.DataFrame, vector_encoding_of_query: rank-ordered dataframe of results matching the query, and the vector encoding of the query via the given transformer/ngrams model if fuzzy=True, else None

Parameters
  • query (str) –

  • thresh (float) –

  • fuzzy (bool) –

  • top_n (int) –
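How thresh and top_n interact with the “distance” ordering can be illustrated with a small standalone sketch. It does not call graphistry; the function rank_results, the dict input, and the strict less-than comparison against thresh are all assumptions made for illustration.

```python
def rank_results(distances, thresh=5000, top_n=10):
    """distances: dict mapping a node id to its distance from the query vector."""
    # Keep only candidates closer than the threshold (strict comparison assumed)
    kept = [(node, d) for node, d in distances.items() if d < thresh]
    # Smaller distance means more relevant, so sort ascending
    kept.sort(key=lambda pair: pair[1])
    return kept[:top_n]

hits = rank_results({"a": 0.4, "b": 12.0, "c": 0.1}, thresh=10, top_n=2)
print(hits)  # [('c', 0.1), ('a', 0.4)]
```

With the default thresh of 5000 nearly every candidate passes, so top_n is usually the limit that matters in practice.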

search_graph(query, scale=0.5, top_n=100, thresh=5000, broader=False, inplace=False)
Input a natural language query and return a graph of results.

See help(g.search) for more information

Args:
query (str)

query input, e.g. “coding best practices”

scale (float, optional)

edge weight threshold. Defaults to 0.5.

top_n (int, optional)

how many results to return. Defaults to 100.

thresh (float, optional)

distance threshold from query vector to returned results. Defaults to 5000, set large just in case, but could be as low as 10.

broader (bool, optional)

if True, will retrieve entities connected via an edge that were not necessarily bubbled up in the results_dataframe. Defaults to False.

inplace (bool, optional)

whether to return new instance (default) or mutate self. Defaults to False.

Returns:

graphistry Instance: g

Parameters
  • query (str) –

  • scale (float) –

  • top_n (int) –

  • thresh (float) –

  • broader (bool) –

  • inplace (bool) –

DBScan

class graphistry.compute.cluster.ClusterMixin(*args, **kwargs)

Bases: object

dbscan(min_dist=0.2, min_samples=1, cols=None, kind='nodes', fit_umap_embedding=True, target=False, verbose=False, engine_dbscan='sklearn', *args, **kwargs)
DBSCAN clustering on CPU or GPU, inferred automatically. Adds a _dbscan column to nodes or edges.

NOTE: g.transform_dbscan(..) currently unsupported on GPU.

Examples:

g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node')

# cluster by UMAP embeddings
kind = 'nodes'  # or 'edges'
g2 = g.umap(kind=kind).dbscan(kind=kind)
print(g2._nodes['_dbscan'])  # or g2._edges['_dbscan'] when kind='edges'

# dbscan in umap or featurize API
g2 = g.umap(dbscan=True, min_dist=1.2, min_samples=2, **kwargs)
# or, here dbscan is inferred from features, not umap embeddings
g2 = g.featurize(dbscan=True, min_dist=1.2, min_samples=2, **kwargs)

# and via chaining,
g2 = g.umap().dbscan(min_dist=1.2, min_samples=2, **kwargs)

# cluster by feature embeddings
g2 = g.featurize().dbscan(**kwargs)

# cluster by a given set of feature column attributes, or with target=True
g2 = g.featurize().dbscan(cols=['ip_172', 'location', 'alert'], target=False, **kwargs)

# equivalent to above (i.e., cols is not None and fit_umap_embedding=True will still use the features dataframe, rather than UMAP embeddings)
g2 = g.umap().dbscan(cols=['ip_172', 'location', 'alert'], fit_umap_embedding=True, **kwargs)  # or fit_umap_embedding=False

g2.plot() # color by `_dbscan` column
Useful:

Enriching the graph with cluster labels from UMAP is useful for visualizing clusters in the graph by color, size, etc, as well as assessing metrics per cluster, e.g. https://github.com/graphistry/pygraphistry/blob/master/demos/ai/cyber/cyber-redteam-umap-demo.ipynb

Args:
min_dist float

The maximum distance between two samples for them to be considered as in the same neighborhood.

kind str

‘nodes’ or ‘edges’

cols

list of columns to use for clustering, given that g.featurize has been run; a convenient way to slice features or targets by fragments of interest, e.g. [‘ip_172’, ‘location’, ‘ssh’, ‘warnings’]

fit_umap_embedding bool

whether to use the UMAP embeddings or the features dataframe for DBSCAN clustering

min_samples

The number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself.

target

whether to use the target column as the clustering feature

Parameters
  • min_dist (float) –

  • min_samples (int) –

  • cols (Union[List, str, None]) –

  • kind (str) –

  • fit_umap_embedding (bool) –

  • target (bool) –

  • verbose (bool) –

  • engine_dbscan (str) –
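The min_dist and min_samples descriptions above follow the usual DBSCAN core-point rule, which can be sketched in plain Python. This is an illustration of the rule only, under the assumption of Euclidean distance and an inclusive radius; is_core_point is a hypothetical helper, not part of graphistry.

```python
import math

def is_core_point(index, points, min_dist=0.2, min_samples=1):
    """True if points[index] has at least min_samples points within min_dist.

    The count includes the point itself, as stated in the min_samples description.
    """
    center = points[index]
    neighbors = sum(1 for p in points if math.dist(center, p) <= min_dist)
    return neighbors >= min_samples

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(is_core_point(0, points, min_dist=0.2, min_samples=3))  # True
print(is_core_point(3, points, min_dist=0.2, min_samples=3))  # False
```

With the default min_samples=1 every point counts as a core point, so raising min_samples is what lets isolated points be treated as noise.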

transform_dbscan(df, y=None, min_dist='auto', infer_umap_embedding=False, sample=None, n_neighbors=None, kind='nodes', return_graph=True, verbose=False)

Transforms a minibatch dataframe into one with a new column ‘_dbscan’ containing the DBSCAN cluster labels for the minibatch. It also generates a graph combining the minibatch and the original graph, with edges between the two inferred from the UMAP embedding or the features dataframe. Graph nodes or edges will be colored by the ‘_dbscan’ column.

Examples:

fit:
    g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node')
    g2 = g.featurize().dbscan()

predict:

    emb, X, _, ndf = g2.transform_dbscan(ndf, return_graph=False)
    # or
    g3 = g2.transform_dbscan(ndf, return_graph=True)
    g3.plot()

likewise for umap:

fit:
    g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node')
    g2 = g.umap(X=.., y=..).dbscan()

predict:

    emb, X, y, ndf = g2.transform_dbscan(ndf, ndf, return_graph=False)
    # or
    g3 = g2.transform_dbscan(ndf, ndf, return_graph=True)
    g3.plot()
Args:
df

dataframe to transform

y

optional labels dataframe

min_dist

The maximum distance between two samples for them to be considered as in the same neighborhood. Smaller values result in fewer edges between the minibatch and the original graph. Default ‘auto’ infers min_dist from the mean distance and std of new points to the original graph.

infer_umap_embedding

whether to use UMAP embeddings or features dataframe when inferring edges between the minibatch and the original graph. Default False, uses the features dataframe

sample

number of samples to use when inferring edges between the minibatch and the original graph, if None, will only use closest point to the minibatch. If greater than 0, will sample the closest sample points in existing graph to pull in more edges. Default None

kind

‘nodes’ or ‘edges’

return_graph

whether to return a graph or the tuple (emb, X, y, minibatch df enriched with DBSCAN labels), default True. The inferred graph supports kind=’nodes’ only.

verbose

whether to print out progress, default False

Parameters
  • df (DataFrame) –

  • y (Optional[DataFrame]) –

  • min_dist (Union[float, str]) –

  • infer_umap_embedding (bool) –

  • sample (Optional[int]) –

  • n_neighbors (Optional[int]) –

  • kind (str) –

  • return_graph (bool) –

  • verbose (bool) –

graphistry.compute.cluster.dbscan_fit(g, dbscan, kind='nodes', cols=None, use_umap_embedding=True, target=False, verbose=False)
Fits clustering on the UMAP embeddings if use_umap_embedding is True, otherwise on the features dataframe, or on the target dataframe if target is True.

Args:
g

graphistry graph

kind

‘nodes’ or ‘edges’

cols

list of columns to use for clustering given g.featurize has been run

use_umap_embedding

whether to use UMAP embeddings or features dataframe for clustering (default: True)

Parameters
  • g (Any) –

  • dbscan (Any) –

  • kind (str) –

  • cols (Union[List, str, None]) –

  • use_umap_embedding (bool) –

  • target (bool) –

  • verbose (bool) –

graphistry.compute.cluster.dbscan_predict(X, model)

DBSCAN has no predict per se, so we reverse engineer one here from https://stackoverflow.com/questions/27822752/scikit-learn-predicting-new-points-with-dbscan

Parameters
  • X (DataFrame) –

  • model (Any) –
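The idea of predicting with a fitted DBSCAN can be sketched without any dependencies: a new point takes the label of a core sample that lies within the neighborhood radius, and is treated as noise (-1) otherwise. The sketch below is an assumption-laden variant, not the library's code: it picks the nearest qualifying core sample, uses Euclidean distance, and takes the core samples, their labels, and eps as plain arguments rather than reading them from a model object.

```python
import math

def dbscan_predict_sketch(new_points, core_points, core_labels, eps):
    """Label each new point by its nearest core sample, or -1 if none is within eps."""
    labels = []
    for p in new_points:
        best_label, best_dist = -1, eps
        for core, label in zip(core_points, core_labels):
            d = math.dist(p, core)
            if d < best_dist:  # closer than eps and than any earlier match
                best_label, best_dist = label, d
        labels.append(best_label)
    return labels

core_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
core_labels = [0, 0, 1]
new_points = [(0.05, 0.02), (5.1, 5.0), (2.5, 2.5)]
print(dbscan_predict_sketch(new_points, core_points, core_labels, eps=0.3))  # [0, 1, -1]
```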

graphistry.compute.cluster.get_model_matrix(g, kind, cols, umap, target)

Allows for a single function to get the model matrix for both nodes and edges as well as targets, embeddings, and features

Args:
g

graphistry graph

kind

‘nodes’ or ‘edges’

cols

list of columns to use for clustering given g.featurize has been run

umap

whether to use UMAP embeddings or features dataframe

target

whether to use the target dataframe or features dataframe

Returns:

pd.DataFrame: dataframe of model matrix given the inputs

Parameters
  • kind (str) –

  • cols (Union[List, str, None]) –

graphistry.compute.cluster.lazy_cudf_import_has_dependancy()
graphistry.compute.cluster.lazy_dbscan_import_has_dependency()
graphistry.compute.cluster.make_safe_gpu_dataframes(X, y, engine)

helper method to coerce a dataframe to the correct type (pd vs cudf)

graphistry.compute.cluster.resolve_cpu_gpu_engine(engine)
Parameters

engine (Literal[‘cuml’, ‘umap_learn’, ‘auto’]) –

Return type

Literal[‘cuml’, ‘umap_learn’]

Arrow uploader Module

class graphistry.arrow_uploader.ArrowUploader(server_base_path='http://nginx', view_base_path='http://localhost', name=None, description=None, edges=None, nodes=None, node_encodings=None, edge_encodings=None, token=None, dataset_id=None, metadata=None, certificate_validation=True, org_name=None)

Bases: object

Parameters
  • edges (Optional[Table]) –

  • nodes (Optional[Table]) –

  • org_name (Optional[str]) –

arrow_to_buffer(table)
Parameters

table (Table) –

cascade_privacy_settings(mode=None, notify=None, invited_users=None, mode_action=None, message=None)
Cascade:
  • local (passed in)

  • global

  • hard-coded

Parameters
  • mode (Optional[Literal[‘private’, ‘organization’, ‘public’]]) –

  • notify (Optional[bool]) –

  • invited_users (Optional[List[str]]) –

  • mode_action (Optional[str]) –

  • message (Optional[str]) –
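The cascade order listed above (local, then global, then hard-coded) amounts to taking the first value that is set. A minimal sketch, assuming that "not set" means None and using made-up default values purely for illustration:

```python
def first_set(*candidates):
    """Return the first candidate that is not None, or None if all are unset."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None

def cascade_privacy_sketch(local, global_settings, hard_coded):
    """Resolve each setting as local -> global -> hard-coded (hypothetical helper)."""
    return {
        key: first_set(local.get(key), global_settings.get(key), default)
        for key, default in hard_coded.items()
    }

settings = cascade_privacy_sketch(
    local={"mode": "public"},            # passed in by the caller
    global_settings={"notify": True},    # session-wide settings
    hard_coded={"mode": "private", "notify": False, "invited_users": []},  # illustrative defaults
)
print(settings)  # {'mode': 'public', 'notify': True, 'invited_users': []}
```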

property certificate_validation
create_dataset(json, validate=True)
Parameters

validate (bool) –

property dataset_id
Return type

str

property description
Return type

str

property edge_encodings
property edges
Return type

Optional[Table]

g_to_edge_bindings(g)
g_to_edge_encodings(g)
g_to_node_bindings(g)
g_to_node_encodings(g)
login(username, password, org_name=None)
maybe_bindings(g, bindings, base={})
maybe_post_share_link(g)

Skip if .privacy() was never called. Returns True/False based on whether it was called.

Return type

bool

property metadata
property name
Return type

str

property node_encodings
property nodes
Return type

Optional[Table]

property org_name
Return type

Optional[str]

pkey_login(personal_key_id, personal_key_secret, org_name=None)
post(as_files=True, memoize=True, validate=True)

Note: likely want to pair with self.maybe_post_share_link(g)

Parameters
  • as_files (bool) –

  • memoize (bool) –

  • validate (bool) –

post_arrow(arr, graph_type, opts='')
Parameters
  • arr (Table) –

  • graph_type (str) –

  • opts (str) –

post_arrow_generic(sub_path, tok, arr, opts='')
Parameters
  • sub_path (str) –

  • tok (str) –

  • arr (Table) –

Return type

Response

post_edges_arrow(arr=None, opts='')
Parameters

arr (Optional[Table]) –

post_edges_file(file_path, file_type='csv')
post_file(file_path, graph_type='edges', file_type='csv')
post_g(g, name=None, description=None)

Warning: main post() does not call this

post_nodes_arrow(arr=None, opts='')
Parameters

arr (Optional[Table]) –

post_nodes_file(file_path, file_type='csv')

Set sharing settings. Any settings not passed here will cascade from PyGraphistry or defaults

Parameters
  • obj_pk (str) –

  • obj_type (str) –

  • privacy (Optional[Privacy]) –

refresh(token=None)
property server_base_path
Return type

str

sso_get_token(state)

Koa, 04 May 2022: Use state to get token

sso_login(org_name=None, idp_name=None)

Koa, 04 May 2022: Get SSO login auth_url or token

property token
Return type

str

verify(token=None)
Return type

bool

property view_base_path
Return type

str

Arrow File Uploader Module

class graphistry.ArrowFileUploader.ArrowFileUploader(uploader)

Bases: object

Implement file API with focus on Arrow support

Memoization in this class is based on reference equality, while plotter is based on hash. That means the plotter resolves different-identity value matches, so by the time ArrowFileUploader compares, identities are unified for faster reference-based checks.

Example: Upload files with per-session memoization

uploader : ArrowUploader
arr : pa.Table
afu = ArrowFileUploader(uploader)

file1_id = afu.create_and_post_file(arr)[0]
file2_id = afu.create_and_post_file(arr)[0]

assert file1_id == file2_id  # memoizes by default (memory-safe: weak refs)

Example: Explicitly create a file and upload data for it

uploader : ArrowUploader
arr : pa.Table
afu = ArrowFileUploader(uploader)

file1_id = afu.create_file()
afu.post_arrow(arr, file1_id)

file2_id = afu.create_file()
afu.post_arrow(arr, file2_id)

assert file1_id != file2_id

create_and_post_file(arr, file_id=None, file_opts={}, upload_url_opts='erase=true', memoize=True)

Create file and upload data for it.

Default upload_url_opts=’erase=true’ throws exceptions on parse errors and deletes upload.

Default memoize=True skips uploading ‘arr’ when previously uploaded in current session

See File REST API for file_opts (file create) and upload_url_opts (file upload)

Parameters
  • arr (Table) –

  • file_id (Optional[str]) –

  • file_opts (dict) –

  • upload_url_opts (str) –

  • memoize (bool) –

Return type

Tuple[str, dict]

create_file(file_opts={})

Creates File and returns file_id str.

Defaults:
  • file_type: ‘arrow’

See File REST API for file_opts

Parameters

file_opts (dict) –

Return type

str

post_arrow(arr, file_id, url_opts='erase=true')

Upload new data to existing file id

Default url_opts=’erase=true’ throws exceptions on parse errors and deletes upload.

See File REST API for url_opts (file upload)

Parameters
  • arr (Table) –

  • file_id (str) –

  • url_opts (str) –

Return type

dict

uploader: Any = None
graphistry.ArrowFileUploader.DF_TO_FILE_ID_CACHE: weakref.WeakKeyDictionary = <WeakKeyDictionary>
NOTE: Will switch to pa.Table -> … when RAPIDS upgrades from pyarrow, which adds weakref support

class graphistry.ArrowFileUploader.MemoizedFileUpload(file_id, output)

Bases: object

Parameters
  • file_id (str) –

  • output (dict) –

file_id: str
output: dict
class graphistry.ArrowFileUploader.WrappedTable(arr)

Bases: object

Parameters

arr (Table) –

arr: pyarrow.lib.Table
graphistry.ArrowFileUploader.cache_arr(arr)

Hold reference to most recent memoization entries. Hack until RAPIDS supports Arrow 2.0, when pa.Table becomes weakly referenceable.
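The reference-based memoization described in this module can be approximated with the standard library alone. The sketch below is not the module's code: Wrapped, CACHE, upload_once, and the fake uploader are hypothetical names, and a plain list stands in for the Arrow table.

```python
import weakref

class Wrapped:
    """Weak-referenceable wrapper for an object that may not support weakrefs itself."""
    def __init__(self, arr):
        self.arr = arr

CACHE = weakref.WeakKeyDictionary()  # Wrapped -> file id; keys hash by identity
_recent = []                         # strong references keep recent entries alive

def upload_once(wrapped, do_upload):
    """Upload wrapped.arr unless this exact wrapper object was uploaded before."""
    if wrapped in CACHE:
        return CACHE[wrapped]
    file_id = do_upload(wrapped.arr)
    CACHE[wrapped] = file_id
    _recent.append(wrapped)
    return file_id

calls = []
def fake_upload(arr):
    calls.append(arr)
    return f"file-{len(calls)}"

w = Wrapped([1, 2, 3])
first = upload_once(w, fake_upload)
second = upload_once(w, fake_upload)
print(first == second, len(calls))  # True 1
```

Because Wrapped does not define __eq__ or __hash__, lookups compare by object identity, so a second wrapper around equal data would trigger a fresh upload.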

Validation

Versioneer

Git implementation of _version.py.

exception graphistry._version.NotThisMethod

Bases: Exception

Exception raised if a method is not valid for the current scenario.

class graphistry._version.VersioneerConfig

Bases: object

Container for Versioneer configuration parameters.

graphistry._version.get_config()

Create, populate and return the VersioneerConfig() object.

graphistry._version.get_keywords()

Get the keywords needed to look up the version information.

graphistry._version.get_versions()

Get version information or return default if unable to do so.

graphistry._version.git_get_keywords(versionfile_abs)

Extract version information from the given file.

graphistry._version.git_pieces_from_vcs(tag_prefix, root, verbose, run_command=<function run_command>)

Get version from ‘git describe’ in the root of the source tree.

This only gets called if the git-archive ‘subst’ keywords were not expanded, and _version.py hasn’t already been rewritten with a short version string, meaning we’re inside a checked out source tree.

graphistry._version.git_versions_from_keywords(keywords, tag_prefix, verbose)

Get version information from git keywords.

graphistry._version.plus_or_dot(pieces)

Return a + if we don’t already have one, else return a .

graphistry._version.register_vcs_handler(vcs, method)

Create decorator to mark a method as the handler of a VCS.
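A decorator factory of this kind typically records the decorated function in a registry keyed by VCS and method name. The following is a sketch of that pattern, assuming a module-level dictionary; HANDLERS_SKETCH, register_handler_sketch, and example_get_keywords are illustrative names.

```python
HANDLERS_SKETCH = {}  # vcs name -> {method name -> handler function}

def register_handler_sketch(vcs, method):
    """Return a decorator that stores the function under HANDLERS_SKETCH[vcs][method]."""
    def decorate(func):
        HANDLERS_SKETCH.setdefault(vcs, {})[method] = func
        return func  # the function itself is returned unchanged
    return decorate

@register_handler_sketch("git", "get_keywords")
def example_get_keywords(versionfile_abs):
    return {}

print(HANDLERS_SKETCH["git"]["get_keywords"] is example_get_keywords)  # True
```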

graphistry._version.render(pieces, style)

Render the given version pieces into the requested style.

graphistry._version.render_git_describe(pieces)

TAG[-DISTANCE-gHEX][-dirty].

Like ‘git describe --tags --dirty --always’.

Exceptions: 1: no tags. HEX[-dirty] (note: no ‘g’ prefix)
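The TAG[-DISTANCE-gHEX][-dirty] format, including the no-tags exception, can be reproduced in a few lines. This is a sketch of the format only; the dictionary keys ("closest-tag", "distance", "short", "dirty") are assumptions about the shape of pieces, and the sample values are made up.

```python
def render_git_describe_sketch(pieces):
    """Build TAG[-DISTANCE-gHEX][-dirty], or HEX[-dirty] when there is no tag."""
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"]:
            rendered += "-%d-g%s" % (pieces["distance"], pieces["short"])
    else:
        rendered = pieces["short"]  # exception 1: no 'g' prefix
    if pieces["dirty"]:
        rendered += "-dirty"
    return rendered

print(render_git_describe_sketch(
    {"closest-tag": "v1.0", "distance": 2, "short": "abc1234", "dirty": True}
))  # v1.0-2-gabc1234-dirty
```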

graphistry._version.render_git_describe_long(pieces)

TAG-DISTANCE-gHEX[-dirty].

Like ‘git describe --tags --dirty --always --long’. The distance/hash is unconditional.

Exceptions: 1: no tags. HEX[-dirty] (note: no ‘g’ prefix)

graphistry._version.render_pep440(pieces)

Build up version string, with post-release “local version identifier”.

Our goal: TAG[+DISTANCE.gHEX[.dirty]] . Note that if you get a tagged build and then dirty it, you’ll get TAG+0.gHEX.dirty

Exceptions: 1: no tags. git_describe was just HEX. 0+untagged.DISTANCE.gHEX[.dirty]
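The TAG[+DISTANCE.gHEX[.dirty]] rule and its no-tags exception can be sketched as below, together with the plus-or-dot separator choice described earlier. The keys of pieces and the sample values are assumptions for illustration, not taken from this module.

```python
def plus_or_dot_sketch(pieces):
    """Return '.' if the tag already contains a '+', otherwise '+'."""
    return "." if "+" in (pieces.get("closest-tag") or "") else "+"

def render_pep440_sketch(pieces):
    """Build TAG[+DISTANCE.gHEX[.dirty]], or 0+untagged.DISTANCE.gHEX[.dirty] without a tag."""
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"] or pieces["dirty"]:
            rendered += plus_or_dot_sketch(pieces)
            rendered += "%d.g%s" % (pieces["distance"], pieces["short"])
            if pieces["dirty"]:
                rendered += ".dirty"
    else:
        rendered = "0+untagged.%d.g%s" % (pieces["distance"], pieces["short"])
        if pieces["dirty"]:
            rendered += ".dirty"
    return rendered

# A tagged build that was then modified locally: TAG+0.gHEX.dirty
print(render_pep440_sketch(
    {"closest-tag": "1.2", "distance": 0, "short": "abc1234", "dirty": True}
))  # 1.2+0.gabc1234.dirty
```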

graphistry._version.render_pep440_old(pieces)

TAG[.postDISTANCE[.dev0]] .

The “.dev0” means dirty.

Exceptions: 1: no tags. 0.postDISTANCE[.dev0]

graphistry._version.render_pep440_post(pieces)

TAG[.postDISTANCE[.dev0]+gHEX] .

The “.dev0” means dirty. Note that .dev0 sorts backwards (a dirty tree will appear “older” than the corresponding clean one), but you shouldn’t be releasing software with -dirty anyways.

Exceptions: 1: no tags. 0.postDISTANCE[.dev0]

graphistry._version.render_pep440_pre(pieces)

TAG[.post0.devDISTANCE] – No -dirty.

Exceptions: 1: no tags. 0.post0.devDISTANCE

graphistry._version.run_command(commands, args, cwd=None, verbose=False, hide_stderr=False, env=None)

Call the given command(s).

graphistry._version.versions_from_parentdir(parentdir_prefix, root, verbose)

Try to determine the version from the parent directory name.

Source tarballs conventionally unpack into a directory that includes both the project name and a version string. We will also support searching up two directory levels for an appropriately named parent directory.
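The parent-directory lookup can be sketched as a short loop that checks the starting directory and then up to two levels above it. The function name, the None return on failure, and the example path and version are assumptions for illustration; the real function's return value and error handling are not described here.

```python
import os

def version_from_parentdir_sketch(parentdir_prefix, root):
    """Return the version suffix of the first matching directory name, else None."""
    for _ in range(3):  # the starting directory plus two levels up
        dirname = os.path.basename(root)
        if dirname.startswith(parentdir_prefix):
            return dirname[len(parentdir_prefix):]
        root = os.path.dirname(root)
    return None

print(version_from_parentdir_sketch("graphistry-", "/tmp/graphistry-0.28.0/src"))  # 0.28.0
```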