Identity data anomaly detection: SSH session anomaly detection with RGCNs#
SSH logs from secrepo: Replace with any event data
Detects and visualizes anomalous connections based on communication topology & event type
Unsupervised graph neural network: RGCN
Runs on both CPU + GPU: Toggle is_gpu
For background, see the RGCN intro: intro-story.ipynb
Dependencies & data#
[1]:
#! pip cache remove graphistry
#! pip install --no-cache --user https://github.com/graphistry/pygraphistry/archive/heteroembed.zip
#! pip install --user --no-input "torch==1.11.0" -f https://download.pytorch.org/whl/cu113/torch_stable.html
#! pip install --user dgl-cu113 dglgo -f https://data.dgl.ai/wheels/repo.html
! python -c "import torch; print(torch.cuda.is_available())"
[2]:
#! wget https://www.secrepo.com/maccdc2012/ssh.log.gz
#! gunzip ssh.log.gz
#! ls -alh ssh*
! head -n 5 ssh.log
1331901011.840000 CTHcOo3BARDOPDjYue 192.168.202.68 53633 192.168.28.254 22 failure INBOUND SSH-2.0-OpenSSH_5.0 SSH-1.99-Cisco-1.25 - - - - -
1331901030.210000 CBHpSz2Zi3rdKbAvwd 192.168.202.68 35820 192.168.23.254 22 failure INBOUND SSH-2.0-OpenSSH_5.0 SSH-1.99-Cisco-1.25 - - - - -
1331901032.030000 C2h6wz2S5MWTiAk6Hb 192.168.202.68 36254 192.168.26.254 22 failure INBOUND SSH-2.0-OpenSSH_5.0 SSH-1.99-Cisco-1.25 - - - - -
1331901034.340000 CeY76r1JXPbjJS8yKb 192.168.202.68 37764 192.168.27.102 22 failure INBOUND SSH-2.0-OpenSSH_5.0 SSH-2.0-OpenSSH_5.8p1 Debian-1ubuntu3 - - - - -
1331901041.920000 CPJHML3uGn4IV2MGWi 192.168.202.68 40244 192.168.27.101 22 failure INBOUND SSH-2.0-OpenSSH_5.0 SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1 - - - - -
Imports#
[3]:
import pandas as pd
import graphistry
graphistry.register(
    # Free GPU server API key: https://www.graphistry.com/get-started
    api=3, username='***', password='***',
    protocol='https', server='hub.graphistry.com', client_protocol_hostname='https://hub.graphistry.com'
)
Load data#
[4]:
df = pd.read_csv(
    './ssh.log', sep='\t',
    names=['time', 'key', 'src_ip', 'src_port', 'dst_ip', 'dst_port', 'msg', 'dir',
           'o1', 'o2', 'o3', 'o4', 'o5', 'o6', 'o7']
)
df.sample(5)
[4]:
| | time | key | src_ip | src_port | dst_ip | dst_port | msg | dir | o1 | o2 | o3 | o4 | o5 | o6 | o7 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6174 | 1.332014e+09 | CA7Epl2hovHB7Zm4a9 | 192.168.202.141 | 7200 | 192.168.229.101 | 22 | failure | INBOUND | - | SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1 | - | - | - | - | - |
1554 | 1.331919e+09 | CwL1tJHLLzytUAaH2 | 192.168.202.110 | 49584 | 192.168.229.101 | 22 | failure | INBOUND | SSH-2.0-OpenSSH_5.0 | SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1 | - | - | - | - | - |
4032 | 1.332000e+09 | C40EOw3sbeRoypxQKi | 192.168.202.140 | 48131 | 192.168.25.203 | 22 | undetermined | INBOUND | SSH-2.0-OpenSSH_5.0 | SSH-2.0-OpenSSH_5.8p1 Debian-1ubuntu3 | - | - | - | - | - |
4691 | 1.332011e+09 | CGSKwo4O56EzNTUqN2 | 192.168.202.90 | 48951 | 192.168.23.254 | 22 | failure | INBOUND | SSH-2.0-OpenSSH_5.3p1 Debian-3ubuntu6 | SSH-1.99-Cisco-1.25 | - | - | - | - | - |
1460 | 1.331918e+09 | CY0O0Q2vWPnFKJAJNe | 192.168.204.45 | 58408 | 192.168.25.253 | 22 | failure | INBOUND | SSH-2.0-OpenSSH_5.0 | SSH-2.0-OpenSSH_4.5 | - | - | - | - | - |
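For exploration, it can help to turn the epoch-seconds `time` column into timestamps and to treat the `-` placeholder as missing. The following is a minimal sketch under those assumptions, not part of the original notebook: it reads two rows copied from the log head through `io.StringIO` instead of the file, and `clean` is a hypothetical name.

```python
import io
import pandas as pd

cols = ['time', 'key', 'src_ip', 'src_port', 'dst_ip', 'dst_port', 'msg', 'dir',
        'o1', 'o2', 'o3', 'o4', 'o5', 'o6', 'o7']

# Two tab-separated rows copied from the head of ssh.log (stand-in for the file)
sample = (
    "1331901011.840000\tCTHcOo3BARDOPDjYue\t192.168.202.68\t53633\t192.168.28.254\t22\t"
    "failure\tINBOUND\tSSH-2.0-OpenSSH_5.0\tSSH-1.99-Cisco-1.25\t-\t-\t-\t-\t-\n"
    "1331901034.340000\tCeY76r1JXPbjJS8yKb\t192.168.202.68\t37764\t192.168.27.102\t22\t"
    "failure\tINBOUND\tSSH-2.0-OpenSSH_5.0\tSSH-2.0-OpenSSH_5.8p1 Debian-1ubuntu3\t-\t-\t-\t-\t-\n"
)

clean = pd.read_csv(
    io.StringIO(sample), sep='\t', names=cols,
    na_values=['-'],  # assumption: "-" means "unset" in this log
)

# Epoch seconds -> pandas timestamps (timezone-naive, interpreted as UTC)
clean['time'] = pd.to_datetime(clean['time'], unit='s')

print(clean[['time', 'src_ip', 'dst_ip', 'o1', 'o3']])
```

Marking `-` as missing changes what a column looks like as a relation type, so this cleaned copy is meant for inspection; the training cells keep using `df` as loaded above.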
Train#
help(g.embed) for options
relation: pick an edge column whose values serve as relation types, so learning weights each type differently
See other notebooks for adding node features
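The train cell notes that `dst_port` is always 22, so using it as the relation runs as a GCN instead of an RGCN. A quick way to screen candidate relation columns is to count distinct values per column. The sketch below assumes a toy edge table (`edges`, with values taken from the sample rows above) in place of `df`; `candidates`, `n_types`, and `usable` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for df, using values that appear in the sample rows
edges = pd.DataFrame({
    'src_ip': ['192.168.202.141', '192.168.202.110', '192.168.202.140', '192.168.202.90'],
    'dst_ip': ['192.168.229.101', '192.168.229.101', '192.168.25.203', '192.168.23.254'],
    'dst_port': [22, 22, 22, 22],
    'msg': ['failure', 'failure', 'undetermined', 'failure'],
    'o1': ['-', 'SSH-2.0-OpenSSH_5.0', 'SSH-2.0-OpenSSH_5.0',
           'SSH-2.0-OpenSSH_5.3p1 Debian-3ubuntu6'],
})

candidates = ['dst_port', 'msg', 'o1']

# Number of distinct values per candidate column = number of relation types
n_types = edges[candidates].nunique()

# A column with one distinct value cannot separate edges into relation types
usable = n_types[n_types > 1].index.tolist()

print(n_types)
print(usable)
```

On the full log the counts will differ; the point is only that a single-valued column adds no relational signal.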
[5]:
is_gpu = True
dev0 = 'cpu'
if is_gpu:
    dev0 = 'cuda'
g = graphistry.edges(df, 'src_ip', 'dst_ip') # graph
[ ]:
g2 = g.embed(  # rerun until happy with quality
    device=dev0,
    #relation='dst_port',  # always 22, so runs as a GCN instead of RGCN
    relation='o1',  # split by software type
    #==== OPTIONAL: NODE FEATURES ====
    #requires node feature data, ex: g = graphistry.nodes(nodes_df, node_id_col).edges(..
    #use_feat=True,
    #X=[g._node] + good_feats_col_names,
    #cardinality_threshold=len(g._edges)+1,  #optional: avoid topic modeling on high-cardinality cols
    #min_words=len(g._edges)+1,  #optional: avoid topic modeling on high-cardinality cols
    epochs=10
)
Score#
score: prediction score from RGCN
low_score: True when the score is more than 2 standard deviations below the average score
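The two-standard-deviation rule can be checked on toy values before applying it to model output. This is a minimal sketch with made-up scores (not RGCN output); `scores`, `threshold`, and `low` are hypothetical names.

```python
import pandas as pd

# Made-up scores: nine typical edges and one edge the model finds unlikely
scores = pd.Series([0.90] * 9 + [0.10])

# pandas uses the sample standard deviation (ddof=1) by default
threshold = scores.mean() - 2 * scores.std()

# True only for scores below the threshold
low = scores < threshold

print(round(threshold, 3))
print(low.tolist())
```

Because the threshold depends on the mean and spread of the whole score distribution, the set of flagged edges changes whenever the model is retrained.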
[10]:
%%time
def to_cpu(tensor):
    """
    Helper for switching between is_gpu=True/False to avoid coercion errors
    """
    if is_gpu:
        return tensor.cpu()
    else:
        return tensor

score2 = pd.Series(to_cpu(g2._score(g2._triplets)).numpy())
df2 = df.assign(
    score=score2,
    low_score=(score2 < (score2.mean() - 2 * score2.std()))  # True for unusually low prediction scores
)
df2[['score', 'low_score'] + list(df2.columns[:10])].sort_values(by=['score'])[:5]
CPU times: user 36.6 ms, sys: 0 ns, total: 36.6 ms
Wall time: 6.87 ms
/home/graphistry/.local/lib/python3.8/site-packages/graphistry/embed_utils.py:459: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
emb = torch.tensor(self._embeddings)
[10]:
| | score | low_score | time | key | src_ip | src_port | dst_ip | dst_port | msg | dir | o1 | o2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4273 | 0.017218 | True | 1.332001e+09 | CvpN0F4oRP5Pc895fc | 192.168.202.136 | 47495 | 192.168.229.101 | 22 | undetermined | INBOUND | - | SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1 |
4369 | 0.018677 | True | 1.332006e+09 | COqmtb2K1yl9ptmBC | 192.168.202.143 | 37624 | 192.168.229.156 | 22 | undetermined | INBOUND | - | SSH-2.0-OpenSSH_4.3 |
2847 | 0.023266 | True | 1.331931e+09 | CH5EtE1xtwQmyxf5s1 | 192.168.203.63 | 53667 | 192.168.23.101 | 22 | undetermined | INBOUND | - | SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1 |
2844 | 0.023587 | True | 1.331931e+09 | CsO3K9zNNojTSGFhk | 192.168.202.110 | 36493 | 192.168.229.101 | 22 | failure | INBOUND | SSH-2.0-OpenSSH_5.3p1 Debian-3ubuntu6 | SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1 |
2977 | 0.023587 | True | 1.331931e+09 | CObILv2xfzVJkUXY6 | 192.168.202.110 | 36511 | 192.168.229.101 | 22 | failure | INBOUND | SSH-2.0-OpenSSH_5.3p1 Debian-3ubuntu6 | SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1 |
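Once edges are flagged, one possible triage step is to count flagged edges per source IP. The sketch below is an illustration rather than part of the original notebook: it rebuilds the five lowest-scoring rows shown above as `top`, and in the notebook `df2` would take its place. `top` and `flagged_by_src` are hypothetical names.

```python
import pandas as pd

# The five lowest-scoring rows from the output above (subset of columns)
top = pd.DataFrame({
    'score': [0.017218, 0.018677, 0.023266, 0.023587, 0.023587],
    'low_score': [True, True, True, True, True],
    'src_ip': ['192.168.202.136', '192.168.202.143', '192.168.203.63',
               '192.168.202.110', '192.168.202.110'],
    'dst_ip': ['192.168.229.101', '192.168.229.156', '192.168.23.101',
               '192.168.229.101', '192.168.229.101'],
})

# Keep flagged edges only, then count them per source IP, most frequent first
flagged_by_src = (
    top.loc[top['low_score']]
       .groupby('src_ip')
       .size()
       .sort_values(ascending=False)
)

print(flagged_by_src)
```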
Visualize#
Color edges red when low prediction score
[9]:
(g2
    .edges(df2)
    .encode_edge_color('low_score', categorical_mapping={'true': 'red', 'false': 'blue'})
    .settings(url_params={'strongGravity': 'true', 'play': 0})
).plot()
[9]:
Next steps#
RGCN intro: intro-story.ipynb
In-depth RGCN: advanced-identity-protection-40m.ipynb
UMAP demo for 97% alert volume reduction & alert correlation
PyGraphistry (py, oss) + Graphistry Hub (free)
Dashboarding with graph-app-kit (containerized, gpu, graph Streamlit)
Happy to help: email info@graphistry.com and let’s chat!