Identity data anomaly detection: SSH session anomaly detection with RGCNs

  • SSH logs from secrepo: Replace with any event data

  • Detects and visualizes anomalous connections based on communication topology & event type

  • Unsupervised graph neural network: RGCN

  • Runs on both CPU + GPU: Toggle is_gpu

For background, see the RGCN intro: intro-story.ipynb

Dependencies & data#

[1]:
#! pip cache remove graphistry
#! pip install --no-cache --user https://github.com/graphistry/pygraphistry/archive/heteroembed.zip

#! pip install --user --no-input "torch==1.11.0" -f https://download.pytorch.org/whl/cu113/torch_stable.html
#! pip install --user dgl-cu113 dglgo -f https://data.dgl.ai/wheels/repo.html
! python -c "import torch; print(torch.cuda.is_available())"
[2]:
#! wget https://www.secrepo.com/maccdc2012/ssh.log.gz
#! gunzip ssh.log.gz
#! ls -alh ssh*
! head -n 5 ssh.log
1331901011.840000       CTHcOo3BARDOPDjYue      192.168.202.68  53633   192.168.28.254  22      failure INBOUND SSH-2.0-OpenSSH_5.0     SSH-1.99-Cisco-1.25     -       -       -       -       -
1331901030.210000       CBHpSz2Zi3rdKbAvwd      192.168.202.68  35820   192.168.23.254  22      failure INBOUND SSH-2.0-OpenSSH_5.0     SSH-1.99-Cisco-1.25     -       -       -       -       -
1331901032.030000       C2h6wz2S5MWTiAk6Hb      192.168.202.68  36254   192.168.26.254  22      failure INBOUND SSH-2.0-OpenSSH_5.0     SSH-1.99-Cisco-1.25     -       -       -       -       -
1331901034.340000       CeY76r1JXPbjJS8yKb      192.168.202.68  37764   192.168.27.102  22      failure INBOUND SSH-2.0-OpenSSH_5.0     SSH-2.0-OpenSSH_5.8p1 Debian-1ubuntu3   -       -       -       -       -
1331901041.920000       CPJHML3uGn4IV2MGWi      192.168.202.68  40244   192.168.27.101  22      failure INBOUND SSH-2.0-OpenSSH_5.0     SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1   -       -       -       -       -

Imports#

[3]:
import pandas as pd
import graphistry
graphistry.register(
    #Free gpu server API key: https://www.graphistry.com/get-started
    api=3, username='***', password='***',
    protocol='https', server='hub.graphistry.com', client_protocol_hostname='https://hub.graphistry.com'
)

Load data#

[4]:
df = pd.read_csv(
    './ssh.log', sep='\t',
    names=['time', 'key', 'src_ip', 'src_port', 'dst_ip', 'dst_port', 'msg', 'dir',
           'o1', 'o2', 'o3', 'o4', 'o5', 'o6', 'o7']
)
df.sample(5)
[4]:
time key src_ip src_port dst_ip dst_port msg dir o1 o2 o3 o4 o5 o6 o7
6174 1.332014e+09 CA7Epl2hovHB7Zm4a9 192.168.202.141 7200 192.168.229.101 22 failure INBOUND - SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1 - - - - -
1554 1.331919e+09 CwL1tJHLLzytUAaH2 192.168.202.110 49584 192.168.229.101 22 failure INBOUND SSH-2.0-OpenSSH_5.0 SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1 - - - - -
4032 1.332000e+09 C40EOw3sbeRoypxQKi 192.168.202.140 48131 192.168.25.203 22 undetermined INBOUND SSH-2.0-OpenSSH_5.0 SSH-2.0-OpenSSH_5.8p1 Debian-1ubuntu3 - - - - -
4691 1.332011e+09 CGSKwo4O56EzNTUqN2 192.168.202.90 48951 192.168.23.254 22 failure INBOUND SSH-2.0-OpenSSH_5.3p1 Debian-3ubuntu6 SSH-1.99-Cisco-1.25 - - - - -
1460 1.331918e+09 CY0O0Q2vWPnFKJAJNe 192.168.204.45 58408 192.168.25.253 22 failure INBOUND SSH-2.0-OpenSSH_5.0 SSH-2.0-OpenSSH_4.5 - - - - -

Train#

  • help(g.embed) for options

  • relation: pick an edge column whose values define the relation types; the RGCN learns separate weights for each type

  • See other notebooks for adding node features
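Before choosing a relation column, it can help to check how many distinct values it has: a column with a single value (such as dst_port, which is always 22 here) gives the model only one relation type. The following is a minimal sketch using a few rows copied from the sample output above; relation_counts and is_useful_relation are hypothetical helper names, not part of PyGraphistry.

```python
from collections import Counter

# A few edges copied from the sample output above (dst_port and o1 only).
rows = [
    {"dst_port": 22, "o1": "SSH-2.0-OpenSSH_5.0"},
    {"dst_port": 22, "o1": "SSH-2.0-OpenSSH_5.0"},
    {"dst_port": 22, "o1": "-"},
    {"dst_port": 22, "o1": "SSH-2.0-OpenSSH_5.3p1 Debian-3ubuntu6"},
    {"dst_port": 22, "o1": "SSH-2.0-OpenSSH_5.0"},
]


def relation_counts(rows, col):
    """Count how many edges fall under each value of a candidate relation column."""
    return Counter(row[col] for row in rows)


def is_useful_relation(rows, col):
    """A column with only one distinct value cannot separate edges into relation types."""
    return len(relation_counts(rows, col)) > 1


print(relation_counts(rows, "o1"))
print(is_useful_relation(rows, "dst_port"))
print(is_useful_relation(rows, "o1"))
```

On the full dataframe, df['o1'].value_counts() gives the same kind of summary. Note that the "-" placeholder counts as its own value, so edges with no recorded client software would form a separate relation type.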

[5]:
is_gpu = True
dev0 = 'cpu'
if is_gpu:
    dev0 = 'cuda'

g = graphistry.edges(df, 'src_ip', 'dst_ip')  # graph
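If the notebook may run on machines without a GPU, the device string can be chosen defensively instead of hard-coded. This is a sketch under the assumption that falling back to CPU is acceptable; pick_device is a hypothetical helper, not a PyGraphistry function.

```python
def pick_device(is_gpu: bool) -> str:
    """Return 'cuda' only when it is requested and actually available, else 'cpu'."""
    if not is_gpu:
        return "cpu"
    try:
        import torch  # may be missing on a minimal CPU-only setup
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


dev0 = pick_device(is_gpu=True)
print(dev0)
```

Calling .cpu() on a tensor that is already on the CPU is a no-op, so later code that moves results to the CPU should still work after a fallback.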
[ ]:
g2 = g.embed(  # rerun until happy with quality
    device=dev0,

    #relation='dst_port', # always 22, so runs as a GCN instead of RGCN
    relation='o1', # split by sw type

    #==== OPTIONAL: NODE FEATURES ====
    #requires node feature data, ex: g = graphistry.nodes(nodes_df, node_id_col).edges(..
    #use_feat=True,
    #X=[g._node] + good_feats_col_names,
    #cardinality_threshold=len(g._edges)+1, #optional: avoid topic modeling on high-cardinality cols
    #min_words=len(g._edges)+1, #optional: avoid topic modeling on high-cardinality cols

    epochs=10
)

Score#

  • score: prediction score from RGCN

  • low_score: True when the score is more than 2 standard deviations below the mean score
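The low_score rule can be checked by hand on a small list. The sketch below uses made-up scores (not output from the model) and the standard library only; statistics.stdev computes the sample standard deviation, which is also the default for pandas Series.std().

```python
import statistics

# Made-up scores for illustration: nine typical edges and one unusually low one.
scores = [0.90, 0.85, 0.88, 0.92, 0.87, 0.91, 0.89, 0.90, 0.86, 0.02]

mean = statistics.mean(scores)
std = statistics.stdev(scores)  # sample standard deviation (ddof=1)
threshold = mean - 2 * std      # same rule as the low_score flag

# Flag every score that falls below the threshold.
low_score = [s < threshold for s in scores]

print(f"threshold = {threshold:.4f}")
print(low_score)  # only the last score is flagged
```

Because the threshold depends on the mean and spread of the scores themselves, a single extreme value widens the standard deviation and can hide milder anomalies; the factor of 2 is a tunable choice rather than a fixed rule.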

[10]:
%%time
def to_cpu(tensor):
    """
    Helper for switching between is_gpu=True/False to avoid coercion errors
    """
    if is_gpu:
        return tensor.cpu()
    else:
        return tensor

score2 = pd.Series(to_cpu(g2._score(g2._triplets)).numpy())

df2 = df.assign(
    score=score2,
    low_score=(score2 < (score2.mean() - 2 * score2.std())) # True for unusually low prediction scores
)
df2[['score', 'low_score'] + list(df2.columns[:10])].sort_values(by=['score'])[:5]
CPU times: user 36.6 ms, sys: 0 ns, total: 36.6 ms
Wall time: 6.87 ms
/home/graphistry/.local/lib/python3.8/site-packages/graphistry/embed_utils.py:459: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  emb = torch.tensor(self._embeddings)
[10]:
score low_score time key src_ip src_port dst_ip dst_port msg dir o1 o2
4273 0.017218 True 1.332001e+09 CvpN0F4oRP5Pc895fc 192.168.202.136 47495 192.168.229.101 22 undetermined INBOUND - SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1
4369 0.018677 True 1.332006e+09 COqmtb2K1yl9ptmBC 192.168.202.143 37624 192.168.229.156 22 undetermined INBOUND - SSH-2.0-OpenSSH_4.3
2847 0.023266 True 1.331931e+09 CH5EtE1xtwQmyxf5s1 192.168.203.63 53667 192.168.23.101 22 undetermined INBOUND - SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1
2844 0.023587 True 1.331931e+09 CsO3K9zNNojTSGFhk 192.168.202.110 36493 192.168.229.101 22 failure INBOUND SSH-2.0-OpenSSH_5.3p1 Debian-3ubuntu6 SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1
2977 0.023587 True 1.331931e+09 CObILv2xfzVJkUXY6 192.168.202.110 36511 192.168.229.101 22 failure INBOUND SSH-2.0-OpenSSH_5.3p1 Debian-3ubuntu6 SSH-2.0-OpenSSH_5.8p1 Debian-7ubuntu1

Visualize#

Color edges red when the prediction score is low, and blue otherwise

[9]:
(g2
 .edges(df2)
 .encode_edge_color('low_score', categorical_mapping={'true': 'red', 'false': 'blue'})
 .settings(url_params={'strongGravity': 'true', 'play': 0})
).plot()

Next steps#