The age-old question: Bot or Not? Attacker or Friendly?#
We load a botnet dataset, produce features from the raw data, and build a GNN model that predicts bot traffic. Attacking bots make up less than one percent of the total data.
Manual feature engineering for this data might take a form like NagabhushanS/Machine-Learning-Based-Botnet-Detection.
Here we create features and models automatically using the g.featurize and g.umap APIs, demonstrating fast ML pipelines over complex, multimodal data with little more effort than setting a few parameters.
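As a preview, a minimal featurize call might look like the sketch below (commented out, not run here); `df`, `colA`, `colB`, and `label` are placeholder names rather than columns from this dataset.
[ ]:
# Hedged sketch of the g.featurize API (placeholder names, not run here):
# given a dataframe `df` with mixed-type columns, featurize auto-encodes them
# into a numeric feature matrix and, optionally, an encoded target.
# g = graphistry.nodes(df)
# g_feat = g.featurize(kind='nodes', X=['colA', 'colB'], y=['label'])
# X, y = g_feat._node_features, g_feat._node_target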
[ ]:
#! pip install --upgrade graphistry[ai]
[ ]:
# cd ..
[ ]:
import os
import graphistry
from graphistry.features import ModelDict
import torch
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from collections import Counter
from importlib import reload
import warnings
warnings.filterwarnings('ignore')
[ ]:
## Add your Graphistry Hub username and password here (read from environment variables below)
graphistry.register(api=3, protocol="https", server="hub.graphistry.com", username=os.environ['USERNAME'], password=os.environ['GRAPHISTRY_PASSWORD'])
[ ]:
# some plot helpers
def fast_plot(g, attr, mask=None, cols=None, interpolation=None):
    plt.figure(figsize=(17, 10))
    if cols is None:
        cols = np.arange(getattr(g, attr).shape[1])
    if mask is not None:
        plt.imshow(getattr(g, attr)[mask].values[:, cols], aspect='auto', cmap='hot', interpolation=interpolation)
    else:
        plt.imshow(getattr(g, attr).values[:, cols], aspect='auto', cmap='hot', interpolation=interpolation)
We import the CTU-13 malware dataset#
You can find a number of related datasets here: https://www.stratosphereips.org/datasets-ctu13
[ ]:
edf = pd.read_csv('https://gist.githubusercontent.com/silkspace/33bde3e69ae24fee1298a66d1e00b467/raw/dc66bd6f1687270be7098f94b3929d6a055b4438/malware_bots.csv', index_col=0)
[ ]:
edf
[ ]:
# label which rows are Botnet traffic vs. not
T = edf.Label.apply(lambda x: True if 'Botnet' in x else False)
[ ]:
T
[ ]:
bot = edf[T]
nbot = edf[~T]
print(f'Botnet abundance: {100*len(bot)/len(edf):0.2f}%')  # botnet traffic makes up a tiny fraction of the total
# let's balance the dataset to a 10:1 ratio, for speed and demonstration purposes
negs = nbot.sample(10*len(bot))
edf = pd.concat([bot, negs])  # top part of the frame is bot traffic, then all non-bot traffic
edf = edf.drop_duplicates()
# a useful integer indicator for later: 1 if Botnet, else 0
Y = edf.Label.apply(lambda x: 1 if 'Botnet' in x else 0)
# later we will exploit any meaning shared between the full labels via a latent (topic) distribution
# add it to the dataframe
edf['bot'] = Y
[ ]:
# name some columns for edges and features
src = 'SrcAddr'
dst = 'DstAddr'
good_cols_with_edges = ['Dur', 'Proto', 'Sport',
'Dport', 'State', 'TotPkts', 'TotBytes', 'SrcBytes', src, dst]
good_cols_without_edges = ['Dur', 'Proto', 'Sport',
'Dport', 'State', 'TotPkts', 'TotBytes', 'SrcBytes']
## some encoding parameters
n_topics = 20
n_topics_target = 7
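Before encoding, it helps to check the cardinality of the categorical columns; high-cardinality fields such as ports are what the n_topics topic-model encoding is for. A quick check (these columns are part of good_cols_with_edges above):
[ ]:
# quick cardinality check: columns below the cardinality threshold get simple categorical
# encodings, while high-cardinality columns (ports, states) get topic-model encoded
# into `n_topics` dimensions during featurization
edf[['Proto', 'Sport', 'Dport', 'State']].nunique()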
Fast Incident Response#
An Incident Responder needs to quickly find which IP is the attacker.
If, say, a predictive model enriched the data, responders could repeat the pipeline on new data, drastically reducing the search space.
They could then see affected computers and log, manage, escalate, and triage alerts using Graphistry playbook integrations.
We will use Graphistry[ai] to build such a predictive pipeline: one that finds offending nodes (attacker IPs) as well as the systems and patterns they exploit, detecting both deviations from benign behavior and instances of known attack behaviors.
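In sketch form, the pipeline built below could be reapplied to fresh traffic roughly as follows (comments only; new_traffic_df is a hypothetical dataframe, and clf / good_cols_with_edges come from later cells):
[ ]:
# Rough sketch of the responder workflow (not run; names refer to later cells):
#   g_new = graphistry.edges(new_traffic_df, src, dst)   # ingest new flows
#   ... featurize the new flows with the same fitted encoders used in training ...
#   scores = clf.predict_proba(new_features)[:, 1]       # rank edges by bot likelihood
#   ... triage the highest-scoring SrcAddr/DstAddr pairs in the Graphistry UI ...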
[ ]:
# load the data using the edges API
g = graphistry.edges(edf, src, dst)
[ ]:
g.plot()
[ ]:
# Let's featurize and reduce the dimensionality of the dataset
[ ]:
# let's UMAP the edge data: featurization + dimensionality reduction in one call
g2 = g.umap(kind='edges',
            X=good_cols_with_edges,
            y=['bot'],
            use_scaler='quantile',
            use_scaler_target=None,
            cardinality_threshold=20,
            cardinality_threshold_target=2,
            n_topics=n_topics,
            n_topics_target=n_topics_target,
            n_bins=n_topics_target,
            metric='euclidean',
            n_neighbors=12)
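The same parameters can be bundled into a reusable dict (or the ModelDict helper imported above) and splatted into the call, which makes parameter sweeps easy. A plain-dict sketch (umap_params is our own name):
[ ]:
# reusable parameter bundle, equivalent to the call above
umap_params = dict(
    kind='edges', X=good_cols_with_edges, y=['bot'],
    use_scaler='quantile', use_scaler_target=None,
    cardinality_threshold=20, cardinality_threshold_target=2,
    n_topics=n_topics, n_topics_target=n_topics_target,
    n_bins=n_topics_target, metric='euclidean', n_neighbors=12,
)
# g2 = g.umap(**umap_params)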
[ ]:
x, y = g2._edge_features, g2._edge_target
x
[ ]:
# we can spot that the edge features differ between bot and non-bot rows (the first rows are bot traffic, from the concat above)
fast_plot(g2, '_edge_features', mask=[True]*1500 + [False]*(len(x)-1500))
Do we have a predictive model?#
Using the x, y we get from auto-featurization, we fit two RandomForest models (a classifier and a regressor).
[ ]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
[ ]:
clf = RandomForestClassifier()
rlf = RandomForestRegressor()
[ ]:
X_train, x_test, y_train, y_test = train_test_split(x, y)
rlf.fit(X_train, y_train)
rlf.score(x_test, y_test)
[ ]:
X_train, x_test, y_train, y_test = train_test_split(x, y)
clf.fit(X_train, y_train).score(x_test, y_test)
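A classifier like this is directly useful for triage: we can rank every edge by its predicted bot probability and surface the most suspicious flows first. A quick sketch (bot_score is our own column name; scores on rows the model trained on are optimistic):
[ ]:
# rank edges by predicted bot probability to prioritize incident response
# (assumes g2._edges and the feature matrix x have the same row order/length)
proba = clf.predict_proba(x)[:, 1]  # probability of the positive (bot) class
suspects = g2._edges.assign(bot_score=proba).sort_values('bot_score', ascending=False)
suspects[[src, dst, 'bot_score']].head(20)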
[ ]:
lengthy_computation = False
if lengthy_computation:  # if you have patience or GPUs
    from sklearn.inspection import permutation_importance
    r = permutation_importance(clf, x_test, y_test,
                               n_repeats=10,
                               random_state=0)
    for i in r.importances_mean.argsort()[::-1]:
        if r.importances_mean[i] - 2 * r.importances_std[i] > 0:
            print(f"{x.columns[i]:<8}"
                  f"{r.importances_mean[i]:.3f}"
                  f" +/- {r.importances_std[i]:.3f}")
    tops = r.importances_mean.argsort()[::-1][:10]
else:
    tops = clf.feature_importances_.argsort()[::-1][:10]
tops
[ ]:
# top features that predict bot or not -- since we included the IP addresses, the model easily finds the 'feature' (ie, effectively the target)
x.columns[tops]
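A quick way to see how dominant those leaky features are is to plot the classifier's importances for the top columns (optional):
[ ]:
# optional: bar chart of the top feature importances from the fitted classifier
plt.figure(figsize=(10, 4))
plt.bar([str(c) for c in x.columns[tops]], clf.feature_importances_[tops])
plt.xticks(rotation=77)
plt.title('Top features for bot vs. not')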
Let’s remove edges and see if there is a model using just ‘common features’ (ie, no IP addresses)#
Given those learnings, we want to see whether there is a model that does not use edge information (ie, no IP addresses, only connection metadata).
[ ]:
g3 = g.nodes(edf)  # treat edf as the node dataframe to featurize; since we aren't using src/dst, no need to pass it to .edges(..)
g3 = g3.umap(kind='nodes',
             X=good_cols_without_edges,
             y='bot',
             scale=0.1,
             use_scaler='quantile',
             use_scaler_target=None,
             cardinality_threshold=20,
             cardinality_threshold_target=20,
             n_topics=n_topics,
             n_topics_target=n_topics_target,
             n_bins=n_topics,
             metric='euclidean',
             n_neighbors=20)
[ ]:
X = g3._node_features
y = g3._node_target
[ ]:
X
[ ]:
y
[ ]:
fast_plot(g3, '_node_features') # one can clearly see that bot vs non-bot features are different
[ ]:
X_train, x_test, y_train, y_test = train_test_split(X, y)
clf.fit(X_train, y_train).score(x_test, y_test)
[ ]:
# a regressor also gives a sense of feature sensitivities, if you are interested
X_train, x_test, y_train, y_test = train_test_split(X, y)
rlf.fit(X_train, y_train).score(x_test, y_test)
[ ]:
lengthy_computation = False
if lengthy_computation:  # if you have patience or GPUs
    from sklearn.inspection import permutation_importance
    r = permutation_importance(clf, x_test, y_test,
                               n_repeats=10,
                               random_state=0)
    for i in r.importances_mean.argsort()[::-1]:
        if r.importances_mean[i] - 2 * r.importances_std[i] > 0:
            print(f"{X.columns[i]:<8}"
                  f"{r.importances_mean[i]:.3f}"
                  f" +/- {r.importances_std[i]:.3f}")
    tops = r.importances_mean.argsort()[::-1][:10]
else:
    tops = clf.feature_importances_.argsort()[::-1][:10]
tops
[ ]:
topcols = X.columns[tops]
topcols
[ ]:
nres = X[y.values == 0][topcols].describe()  # not bot
res = X[y.values == 1][topcols].describe()  # bot
res
Hence, even with only common features (no IP addresses), featurization and UMAP cluster botnet traffic together#
[ ]:
a = res.loc['mean'] / nres.loc['mean']
a.plot(kind='bar')
print('Bot Mean divided by Non-Bot Mean')
[ ]:
b = nres.loc['mean'] / res.loc['mean']
b.plot(kind='bar', rot=77)
print('Not-Bot Mean divided by Bot Mean')
Now we dive deeper#
Let’s encode the graph as a DGL graph for use in Machine Learning#
[ ]:
from graphistry.networks import LinkPredModelMultiOutput, train_link_pred
[ ]:
# first, let's examine the cardinality of labels and get a sense of the different flows
cnt = Counter(g2._edges['Label'])
cnt.most_common()
[ ]:
## can we learn a better representation of these labels?
len(cnt)
[ ]:
# let's build a GNN model with the 70+ distinct 'Label' values reduced to an n_topics_target-dimensional representation
g4 = g.build_gnn(y_edges = 'Label',
use_node_scaler='quantile',
use_node_scaler_target=None,
cardinality_threshold=2,
cardinality_threshold_target=2,
n_topics=n_topics,
n_topics_target=n_topics_target,
)
[ ]:
g4._edge_target  # the topic encoding captures label structure (eg `Label: *, download, botnet`) beyond just bot vs. not
[ ]:
fast_plot(g4, '_edge_target')  # easy to see bot vs. not in the regression-style (topic) target
[ ]:
# the deep learning graph
G = g4._dgl_graph
# define the model from the data
node_features = G.ndata["feature"].float()
n_feat = node_features.shape[1]
# we are predicting edges
edge_label = G.edata["target"]
labels = edge_label.argmax(1)  # turn the regression-style (topic) target into a class index
n_targets = edge_label.shape[1]
train_mask = G.edata["train_mask"]
test_mask = G.edata["test_mask"]
latent_dim = 32
n_output_feats = 16 # this is equal to the latent dim output of the SAGE net
model = LinkPredModelMultiOutput(n_feat, latent_dim, n_output_feats, n_targets)
pred = model(G, node_features) # the untrained graph
assert G.num_edges() == pred.shape[0], "something went wrong"
print(f"output of model has same length as the number of edges: {pred.shape[0]}")
print(f"number of edges: {G.num_edges()}\n")
# Train model
train_link_pred(model, G, epochs=2900)
[ ]:
# trained comparison
logits = model(G, node_features)
pred = logits.argmax(1)
accuracy = sum(pred[test_mask] == labels[test_mask]) / len(pred[test_mask])
print("-" * 30)
print(f"Final Accuracy: {100 * accuracy:.2f}%")
Takeaways:
* we encode targets in a latent space learned from the messy, multi-valued labels
* we can do inductive prediction on new graphs using the GNN model
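Inductive use means the trained model can score a new graph built the same way, without retraining. A commented sketch (new_edf and g_new are hypothetical; the new graph's features must be encoded consistently with training so their dimensionality matches n_feat):
[ ]:
# Hedged sketch of inductive prediction on a new graph (not run):
# g_new = graphistry.edges(new_edf, src, dst).build_gnn(y_edges='Label',
#                                                       n_topics=n_topics,
#                                                       n_topics_target=n_topics_target)
# G_new = g_new._dgl_graph
# new_logits = model(G_new, G_new.ndata['feature'].float())  # same forward pass as above
# new_pred = new_logits.argmax(1)                            # predicted label class per edge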
[ ]:
cnts = Counter(np.array(labels)).most_common()
cnts
[ ]:
# a majority-class baseline (always predicting the most common class) would score:
print(f'{100*cnts[0][1]/len(labels):.2f}%')
[ ]:
# get the learned node embeddings from the trained GNN's SAGE encoder
enc = model.sage(G, node_features.float())
enc
[ ]:
plt.figure(figsize=(15,7))
plt.imshow(enc.detach().numpy(), aspect='auto')
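The learned embeddings enc can also be explored directly, e.g. with a quick 2D reduction to eyeball cluster structure (optional sketch; assumes umap-learn is installed, which the g.umap calls above rely on):
[ ]:
# optional: 2D view of the learned node embeddings
# import umap
# xy = umap.UMAP(n_components=2).fit_transform(enc.detach().numpy())
# plt.figure(figsize=(8, 6))
# plt.scatter(xy[:, 0], xy[:, 1], s=3)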
Contributions#
Now we know how to take raw data and turn it into actionable features and models using the Graphistry[ai] API.
Integrate them into your pipelines and join our Slack!
[ ]: