Your first graph neural network: Detecting suspicious logins with link prediction#
Graphistry - Leo Meyerovich, Alex Morrise, Tanmoy Sarkar
Infosec Jupyterthon 2022, December 2022
Alert on & visualize anomalous identity events

* Demo dataset: 1.6B Windows events over 58 days => logins by 12K users over 14K systems
* Adapts to any identity system with logins
* => Can we identify accounts & computers acting anomalously? Resources being oddly accessed?
* => Can we spot the red team?
* => Operations: identity incident alerting + identity data investigations
* Community/contact for help handling bigger-than-memory & additional features
* Techniques explored: Graph AI
  * RGCN (primary) - powerful with tweaking and in a pipeline
  * UMAP (secondary) - surprisingly effective with little tweaking
  * Runs on both CPU + multi-GPU
* Tools: PyGraphistry[AI], DGL + PyTorch, and NVIDIA RAPIDS / umap-learn
Dependencies#
Installs#
Run the installs below…

… or Docker (free): https://hub.docker.com/r/graphistry/graphistry-nvidia

… or AWS/Azure/enterprise (paid): https://www.graphistry.com/get-started
[4]:
#! pip install --user --no-input "torch==1.11.0" -f https://download.pytorch.org/whl/cu113/torch_stable.html
[2]:
# ! python -c "import torch; print(torch.cuda.is_available())"
True
[3]:
#! pip install --user dgl-cu113 dglgo -f https://data.dgl.ai/wheels/repo.html
[4]:
#! pip cache remove graphistry
#! pip install --no-cache --user https://github.com/graphistry/pygraphistry/archive/heteroembed.zip
#! pip install --user git+https://github.com/graphistry/pygraphistry.git@fe894b556d078a35d1109c5cb78150818950d66d
#! pip install --user git+https://github.com/graphistry/pygraphistry.git@d7e87c0faf762be5c8779c71caf8da13fae48041
#! pip install --user git+https://github.com/graphistry/pygraphistry.git@6f47e034dcf8e96f241be66aca6a6a6374e7bf83
Imports#
[5]:
import pandas as pd
import os
from joblib import load, dump
from collections import Counter
import torch
print('torch', torch.__version__)
import numpy as np
import random
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
import graphistry
print('graphistry', graphistry.__version__)
torch 1.11.0+cu113
graphistry 0.28.4+78.g6f47e03
graphistry#
[6]:
#graphistry.register(
# api=3,
# protocol="https", server="hub.graphistry.com", client_protocol_hostname='https://hub.graphistry.com',
# username = '***', password='***'
#)
Data#
Winlogs demo

* Tested on a 40M row sample (2 x RTX), a 4M row sample (3080 Ti), and a 32 GB CPU box
* auth: auth.txt.gz (extractable from winlogs)
* optional: red team data (749 events flagged from auth.txt.gz)

Custom

* Use any auth data like winlogs; load as a dataframe named auth (see the sketch after this list)
* key fields: auth_type (arbitrary values), src_computer (arbitrary values), dst_computer (arbitrary values); other fields are nice but not necessary
* some evaluation steps benefit from a red team set, but it is not necessary
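A hedged sketch for the custom path: map any identity log into the dataframe shape this notebook expects. The source file and column names here are hypothetical; this would replace the demo parquet load below.

[ ]:
# Hedged sketch for custom data (hypothetical file & source column names):
# rename whatever your export uses into the fields this notebook expects
import pandas as pd

auth = pd.read_csv('my_auth.csv').rename(columns={
    'EventTime': 'time',
    'AuthenticationPackageName': 'auth_type',
    'SourceWorkstation': 'src_computer',
    'DestinationHost': 'dst_computer',
})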
Load identity events + optional red team data#
[7]:
%%time
# full auth.txt is 1,051,430,459 (~1.05B) rows; this parquet is a 40M row sample
auth = pd.read_parquet('data/auth40m.parquet')
print(auth.shape)
auth.head(5)
(40000000, 9)
CPU times: user 11.7 s, sys: 1.88 s, total: 13.6 s
Wall time: 6.14 s
[7]:
time | src_domain | dst_domain | src_computer | dst_computer | auth_type | logontype | authentication_orientation | success_or_failure | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | ANONYMOUS LOGON@C586 | ANONYMOUS LOGON@C586 | C1250 | C586 | NTLM | Network | LogOn | Success |
1 | 1 | ANONYMOUS LOGON@C586 | ANONYMOUS LOGON@C586 | C586 | C586 | ? | Network | LogOff | Success |
2 | 1 | C101$@DOM1 | C101$@DOM1 | C988 | C988 | ? | Network | LogOff | Success |
3 | 1 | C1020$@DOM1 | SYSTEM@C1020 | C1020 | C1020 | Negotiate | Service | LogOn | Success |
4 | 1 | C1021$@DOM1 | C1021$@DOM1 | C1021 | C625 | Kerberos | Network | LogOn | Success |
Red team comparison data (optional)#
[8]:
if True:
    red_team = pd.read_csv('data/redteam.txt', header=None)
else:
    # or use whatever comparison set: the method is unsupervised,
    # so red team labels are only used for reporting
    red_team = auth.head(749)
print(red_team.shape)
red_team.head(5)
(749, 4)
[8]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 150885 | U620@DOM1 | C17693 | C1003 |
1 | 151036 | U748@DOM1 | C17693 | C305 |
2 | 151648 | U748@DOM1 | C17693 | C728 |
3 | 151993 | U6115@DOM1 | C17693 | C1173 |
4 | 153792 | U636@DOM1 | C17693 | C294 |
[9]:
cols = auth.columns
# red_team columns: 0=time, 1=user@domain, 2=src_computer, 3=dst_computer;
# note both src_domain and dst_domain are matched against the user field (col 1)
anom_label = auth[
    auth[cols[0]].isin(red_team[0]) &  # time
    auth[cols[1]].isin(red_team[1]) &  # src_domain
    auth[cols[2]].isin(red_team[1]) &  # dst_domain
    auth[cols[3]].isin(red_team[2]) &  # src_computer
    auth[cols[4]].isin(red_team[3])    # dst_computer
].reset_index()
anom_label
[9]:
index | time | src_domain | dst_domain | src_computer | dst_computer | auth_type | logontype | authentication_orientation | success_or_failure | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 29627157 | 150885 | U620@DOM1 | U620@DOM1 | C17693 | C1003 | NTLM | Network | LogOn | Success |
1 | 29657341 | 151036 | U748@DOM1 | U748@DOM1 | C17693 | C305 | NTLM | Network | LogOn | Success |
2 | 29776464 | 151648 | U748@DOM1 | U748@DOM1 | C17693 | C728 | NTLM | Network | LogOn | Success |
3 | 29850298 | 151993 | U6115@DOM1 | U6115@DOM1 | C17693 | C1173 | NTLM | Network | LogOn | Success |
4 | 30188660 | 153792 | U636@DOM1 | U636@DOM1 | C17693 | C294 | NTLM | Network | LogOn | Success |
5 | 30425163 | 155219 | U748@DOM1 | U748@DOM1 | C17693 | C5693 | NTLM | Network | LogOn | Success |
6 | 30458238 | 155399 | U748@DOM1 | U748@DOM1 | C17693 | C152 | NTLM | Network | LogOn | Success |
7 | 30468055 | 155460 | U748@DOM1 | U748@DOM1 | C17693 | C2341 | NTLM | Network | LogOn | Success |
8 | 30489739 | 155591 | U748@DOM1 | U748@DOM1 | C17693 | C332 | NTLM | Network | LogOn | Success |
9 | 30662347 | 156658 | U748@DOM1 | U748@DOM1 | C17693 | C4280 | NTLM | Network | LogOn | Success |
10 | 39964825 | 210086 | U748@DOM1 | U748@DOM1 | C18025 | C1493 | NTLM | Network | LogOn | Success |
[10]:
anom_label['RED'] = 1
# enrich the edges with the red team data
ind = auth.index.isin(anom_label['index'])
auth.loc[ind, 'RED'] = 1
auth.loc[~ind, 'RED'] = 0
print('# red', auth['RED'].sum())
# let's see the bad infiltration point
auth[auth.src_computer == 'C17693'].sample(10)
# red 11.0
[10]:
time | src_domain | dst_domain | src_computer | dst_computer | auth_type | logontype | authentication_orientation | success_or_failure | RED | |
---|---|---|---|---|---|---|---|---|---|---|
39201345 | 206645 | U8350@DOM1 | U8350@DOM1 | C17693 | C467 | NTLM | Network | LogOn | Success | 0.0 |
29776464 | 151648 | U748@DOM1 | U748@DOM1 | C17693 | C728 | NTLM | Network | LogOn | Success | 1.0 |
38906146 | 205356 | U748@DOM1 | U748@DOM1 | C17693 | C3007 | NTLM | Network | LogOn | Success | 0.0 |
30188660 | 153792 | U636@DOM1 | U636@DOM1 | C17693 | C294 | NTLM | Network | LogOn | Success | 1.0 |
28845530 | 147076 | U8009@C926 | U8009@C926 | C17693 | C926 | NTLM | Network | LogOn | Fail | 0.0 |
28860620 | 147141 | U8009@C1759 | U8009@C1759 | C17693 | C1759 | NTLM | Network | LogOn | Fail | 0.0 |
39208487 | 206678 | U8350@DOM1 | U8350@DOM1 | C17693 | C529 | NTLM | Network | LogOn | Success | 0.0 |
39126261 | 206295 | U8350@DOM1 | U8350@DOM1 | C17693 | C529 | NTLM | Network | LogOn | Success | 0.0 |
29620541 | 150847 | U620@DOM1 | U620@DOM1 | C17693 | C1759 | NTLM | Network | LogOn | Success | 0.0 |
28359627 | 145015 | U1723@C1759 | U1723@C1759 | C17693 | C1759 | NTLM | Network | LogOn | Fail | 0.0 |
Quick viz#
[11]:
graphistry.edges(
    pd.concat([anom_label, auth.sample(100000)], ignore_index=True, sort=False),
    'src_computer', 'dst_computer'
).plot()
[11]:
Down-sample (optional)#
[12]:
if True:
    SAMPLE_PERCENT = 0.10
    print(f'sampling down {SAMPLE_PERCENT * 100}%')
    all_auth = auth
    all_auth_unred = all_auth[ all_auth.RED == 0 ]
    all_auth_red = all_auth[ all_auth.RED == 1 ]
    # keep every red team edge but only a fraction of the benign ones
    auth = pd.concat(
        [
            all_auth_unred.sample(frac=SAMPLE_PERCENT, random_state=RANDOM_SEED),
            all_auth_red
        ],
        ignore_index=True,
        sort=False)
print(auth.shape)
sampling down 10.0%
(4000010, 10)
[13]:
# sanity check: verify the red team rows survived the down-sample (the data is timelike)
auth[
    auth['src_computer'].isin(set(red_team[2])) &
    auth['dst_computer'].isin(set(red_team[3]))
].reset_index()
[13]:
index | time | src_domain | dst_domain | src_computer | dst_computer | auth_type | logontype | authentication_orientation | success_or_failure | RED | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1467943 | 206678 | U8350@DOM1 | U8350@DOM1 | C17693 | C529 | NTLM | Network | LogOn | Success | 0.0 |
1 | 3999999 | 150885 | U620@DOM1 | U620@DOM1 | C17693 | C1003 | NTLM | Network | LogOn | Success | 1.0 |
2 | 4000000 | 151036 | U748@DOM1 | U748@DOM1 | C17693 | C305 | NTLM | Network | LogOn | Success | 1.0 |
3 | 4000001 | 151648 | U748@DOM1 | U748@DOM1 | C17693 | C728 | NTLM | Network | LogOn | Success | 1.0 |
4 | 4000002 | 151993 | U6115@DOM1 | U6115@DOM1 | C17693 | C1173 | NTLM | Network | LogOn | Success | 1.0 |
5 | 4000003 | 153792 | U636@DOM1 | U636@DOM1 | C17693 | C294 | NTLM | Network | LogOn | Success | 1.0 |
6 | 4000004 | 155219 | U748@DOM1 | U748@DOM1 | C17693 | C5693 | NTLM | Network | LogOn | Success | 1.0 |
7 | 4000005 | 155399 | U748@DOM1 | U748@DOM1 | C17693 | C152 | NTLM | Network | LogOn | Success | 1.0 |
8 | 4000006 | 155460 | U748@DOM1 | U748@DOM1 | C17693 | C2341 | NTLM | Network | LogOn | Success | 1.0 |
9 | 4000007 | 155591 | U748@DOM1 | U748@DOM1 | C17693 | C332 | NTLM | Network | LogOn | Success | 1.0 |
10 | 4000008 | 156658 | U748@DOM1 | U748@DOM1 | C17693 | C4280 | NTLM | Network | LogOn | Success | 1.0 |
11 | 4000009 | 210086 | U748@DOM1 | U748@DOM1 | C18025 | C1493 | NTLM | Network | LogOn | Success | 1.0 |
Configure#
Identify src vs dst columns#
The RGCN connects homogeneous nodes (ex: computers) through heterogeneous relationships (ex: event types)
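A hedged illustration of that shape: each login event becomes a (source, relation, destination) triple over computer nodes.

[ ]:
# Hedged illustration: the RGCN consumes (source, relation, destination)
# triples over computer nodes, e.g. ('C1250', 'NTLM', 'C586')
print(auth[['src_computer', 'auth_type', 'dst_computer']].head(3).values.tolist())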
[14]:
SRC = 'src_computer'
DST = 'dst_computer'
#SRC = 'src_domain'
#DST = 'dst_domain'
NODE = 'computer'
##DST_OPP = 'dst_domain'
#SRC_OPP = 'src_computer'
#DST_OPP = 'dst_computer'
Feature engineering#
Node features - optional enrichment & event summaries#
* Provided: if node features are available, handle them here
* Aggregated: we gather some here from the incoming vs outgoing edges
* Graph analytics: often useful to also add graph statistics (pagerank, centrality, …); see the sketch after this list
* For bigger-than-memory, use dask (CPU) or dask_cudf (GPU)
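A hedged sketch of the graph-statistics enrichment, assuming networkx is installed (cugraph is the GPU analogue):

[ ]:
# Hedged sketch: compute PageRank + degree per computer; the resulting table
# could be merged onto the `nodes` dataframe built below
import networkx as nx

G = nx.from_pandas_edgelist(auth, source=SRC, target=DST, create_using=nx.DiGraph)
pr = nx.pagerank(G)
graph_stats = pd.DataFrame({
    NODE: list(pr.keys()),
    'pagerank': list(pr.values()),
    'degree': [G.degree(n) for n in pr],
})
# later: nodes = nodes.merge(graph_stats, how='left', on=NODE)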
[15]:
mode = pd.Series.mode

# summarize outgoing edges per source computer (sampled for speed)
outgoing_conn_stats = auth[:1000].groupby([SRC]).agg({
    'time': ['min', 'max', 'mean'],
    'dst_domain': [mode, 'count'],
    'dst_computer': [mode, 'count'],
    'auth_type': [mode, 'count'],
    'logontype': [mode, 'count'],
    'authentication_orientation': [mode, 'count'],
    'success_or_failure': [mode, 'count']
})
outgoing_conn_stats.columns = [f'out_{c[0]}_{c[1]}' for c in outgoing_conn_stats.columns]

# summarize incoming edges per destination computer
incoming_conn_stats = auth[:1000].groupby([DST]).agg({
    'time': ['min', 'max', 'mean'],
    'src_domain': [mode, 'count'],
    'src_computer': [mode, 'count'],
    'auth_type': [mode, 'count'],
    'logontype': [mode, 'count'],
    'authentication_orientation': [mode, 'count'],
    'success_or_failure': [mode, 'count']
})
incoming_conn_stats.columns = [f'in_{c[0]}_{c[1]}' for c in incoming_conn_stats.columns]

# outgoing_conn_stats
# incoming_conn_stats

nodes = pd.DataFrame({NODE: pd.concat([auth[SRC], auth[DST]]).unique()})
nodes = nodes.merge(outgoing_conn_stats, how='left', left_on=NODE, right_on=SRC)
nodes = nodes.merge(incoming_conn_stats, how='left', left_on=NODE, right_on=DST)

for c in ['min', 'max', 'mean']:
    nodes[f'out_time_{c}'] = nodes[f'out_time_{c}'].fillna(0.)
    nodes[f'in_time_{c}'] = nodes[f'in_time_{c}'].fillna(0.)

def flatten_lst(o):
    # pd.Series.mode returns multiple values on ties; keep the first
    if isinstance(o, (list, np.ndarray)):
        return o[0] if len(o) else ''
    if isinstance(o, str):
        return o
    return ''

for c in ['mode']:
    for k in ['dst_domain', 'dst_computer', 'auth_type', 'logontype', 'authentication_orientation', 'success_or_failure']:
        nodes[f'out_{k}_{c}'] = nodes[f'out_{k}_{c}'].apply(flatten_lst)
    for k in ['src_domain', 'src_computer', 'auth_type', 'logontype', 'authentication_orientation', 'success_or_failure']:
        nodes[f'in_{k}_{c}'] = nodes[f'in_{k}_{c}'].apply(flatten_lst)

for c in ['count']:
    for k in ['dst_domain', 'dst_computer', 'auth_type', 'logontype', 'authentication_orientation', 'success_or_failure']:
        nodes[f'out_{k}_{c}'] = nodes[f'out_{k}_{c}'].fillna(0)
    for k in ['src_domain', 'src_computer', 'auth_type', 'logontype', 'authentication_orientation', 'success_or_failure']:
        nodes[f'in_{k}_{c}'] = nodes[f'in_{k}_{c}'].fillna(0)

print(nodes.shape)
nodes.sample(10)
(10887, 31)
[15]:
computer | out_time_min | out_time_max | out_time_mean | out_dst_domain_mode | out_dst_domain_count | out_dst_computer_mode | out_dst_computer_count | out_auth_type_mode | out_auth_type_count | ... | in_src_computer_mode | in_src_computer_count | in_auth_type_mode | in_auth_type_count | in_logontype_mode | in_logontype_count | in_authentication_orientation_mode | in_authentication_orientation_count | in_success_or_failure_mode | in_success_or_failure_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1157 | C7249 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||||||
10714 | C10703 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||||||
3872 | C6943 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||||||
8832 | C7472 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||||||
10732 | C9999 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||||||
2885 | C1555 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||||||
1701 | C5796 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||||||
7597 | C11731 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||||||
5051 | C7485 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||||||||
8477 | C1548 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
10 rows × 31 columns
Node features - auto-generation & compression#
[16]:
gg = graphistry.nodes(nodes, NODE)
g_sample = gg.featurize(
    cardinality_threshold=len(auth)+1,  # avoid topic modeling on high-cardinality cols
    min_words=len(auth)+1,  # avoid topic modeling on high-cardinality cols
)
X = g_sample._node_features
#print('memory', X.memory_usage())
print('sample low cardinality')
{
    c: X[c].nunique()
    for c in X.columns[:10]  # sampled; good to manually inspect for nice columns
    if X[c].nunique() < 10
}
sample low cardinality
[16]:
{'out_dst_domain_mode_': 2,
'out_dst_domain_mode_ANONYMOUS LOGON@C1065': 2,
'out_dst_domain_mode_ANONYMOUS LOGON@C1628': 2,
'out_dst_domain_mode_ANONYMOUS LOGON@C17899': 2,
'out_dst_domain_mode_ANONYMOUS LOGON@C457': 2,
'out_dst_domain_mode_ANONYMOUS LOGON@C467': 2,
'out_dst_domain_mode_ANONYMOUS LOGON@C528': 2,
'out_dst_domain_mode_ANONYMOUS LOGON@C529': 2,
'out_dst_domain_mode_ANONYMOUS LOGON@C586': 2,
'out_dst_domain_mode_ANONYMOUS LOGON@C625': 2}
[17]:
sums = X.sum(0)
good_feats = [c for c in X.columns if sums[c] > 10 or 'type' in c]
print('#total feats', len(X.columns))
print('#good feats (decent cardinality)', len(good_feats))
print(X[good_feats].sum(0).astype(int))
X[good_feats].sample(5)
#total feats 889
#good feats (decent cardinality) 74
out_dst_domain_mode_ 10411
out_dst_domain_mode_ANONYMOUS LOGON@C586 16
out_dst_computer_mode_ 10408
out_dst_computer_mode_C1065 25
out_dst_computer_mode_C2106 16
...
in_src_computer_count 1000
in_auth_type_count 1000
in_logontype_count 1000
in_authentication_orientation_count 1000
in_success_or_failure_count 1000
Length: 74, dtype: int64
[17]:
out_dst_domain_mode_ | out_dst_domain_mode_ANONYMOUS LOGON@C586 | out_dst_computer_mode_ | out_dst_computer_mode_C1065 | out_dst_computer_mode_C2106 | out_dst_computer_mode_C457 | out_dst_computer_mode_C467 | out_dst_computer_mode_C528 | out_dst_computer_mode_C529 | out_dst_computer_mode_C586 | ... | out_success_or_failure_count | in_time_min | in_time_max | in_time_mean | in_src_domain_count | in_src_computer_count | in_auth_type_count | in_logontype_count | in_authentication_orientation_count | in_success_or_failure_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3579 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
6966 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4599 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
409 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 199787.0 | 199787.0 | 199787.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
5072 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 74 columns
Edge features - Multi-column relationship types (optional)#
Untyped#
Useful for emulating a vanilla GCN via RGCN
[18]:
df = auth
df2a = df.assign(constant_edge_label=1)
Combos: reln as combined (auth_type x logontype)#
[19]:
df2 = df2a.assign(reln=df.auth_type + '::' + df.logontype)
df2.reln.sample(5)
[19]:
243922 Kerberos::Network
835852 Kerberos::Network
865963 ?::?
3562087 ?::Network
1788835 Kerberos::Network
Name: reln, dtype: object
Compression: reln as topic model over many columns or high-cardinality columns#
To prevent a large number of relationship types from blowing out memory, we can use tricks like topic models to cut them down to a more reasonable set
[20]:
g_runiq = graphistry.nodes(df2.drop_duplicates(subset=['reln']))
g_featurized = g_runiq.featurize(X=['reln'], min_words=1000000, n_topics=15, cardinality_threshold=2)
X = g_featurized._node_features
X.sample(5)
[20]:
reln: ntlm, wave, batch | reln: batch, ntlm, wave | reln: negotiate, ntlm, wave | reln: ntlm, wave, interactive | reln: microsoft_authentication_packa, microsoft_authentication_pack, microsoft_authentication_pac | reln: remoteinteractive, interactive, ntlm | reln: microsoft_authentication_package_v1, microsoft_authentication_package_v, microsoft_authentication_package_ | reln: kerberos, network, ntlm | reln: newcredentials, ntlm, wave | reln: microsoft_authentication_package_v1_0, microsoft_authentication_package_v1_, microsoft_authentication_package_v1 | reln: networkcleartext, ntlm, wave | reln: service, ntlm, wave | reln: unlock, wave, ntlm | reln: cachedinteractive, interactive, ntlm | reln: microsoft_authentica, microsoft_authentication_pa, microsoft_authentication_pac | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
339375 | 0.148557 | 0.056838 | 0.057913 | 0.152160 | 0.185454 | 0.174556 | 0.280331 | 0.054524 | 0.060744 | 70.747278 | 0.053815 | 0.058804 | 0.056780 | 0.522876 | 0.139371 |
7814 | 0.052674 | 0.055471 | 11.718799 | 1.717323 | 0.053624 | 25.518031 | 0.052828 | 0.054558 | 0.062141 | 0.060254 | 0.071018 | 0.085292 | 0.057539 | 0.137766 | 0.052682 |
683631 | 0.050558 | 0.053184 | 0.058496 | 0.053717 | 0.050803 | 0.051884 | 0.050711 | 0.051522 | 0.050903 | 0.050871 | 0.051656 | 0.052714 | 15.020143 | 0.052309 | 0.050528 |
237141 | 0.206005 | 0.056884 | 0.057404 | 0.057193 | 8.189937 | 0.054009 | 51.363086 | 0.262343 | 0.072549 | 0.211270 | 0.085901 | 0.055594 | 0.054310 | 0.057548 | 1.465967 |
4217 | 0.054613 | 0.074897 | 0.057616 | 0.056144 | 0.056909 | 0.056405 | 0.055506 | 0.066759 | 22.429283 | 0.053986 | 0.059091 | 0.056387 | 0.054664 | 0.058439 | 0.059300 |
[21]:
X_df2, y_df2 = g_featurized.transform(df2, ydf=None, kind='nodes')
reln_topic = pd.Series([ X_df2.columns[k] for k in X_df2.values.argmax(1) ])
df3 = df2.assign(reln_topic=reln_topic)
reln_topic.sample(10)
[21]:
392537 reln: kerberos, network, ntlm
3497381 reln: kerberos, network, ntlm
387017 reln: kerberos, network, ntlm
2356261 reln: kerberos, network, ntlm
1169846 reln: kerberos, network, ntlm
3038088 reln: kerberos, network, ntlm
2420173 reln: kerberos, network, ntlm
2441144 reln: batch, ntlm, wave
3933167 reln: kerberos, network, ntlm
2362703 reln: kerberos, network, ntlm
dtype: object
Comparison#
Any of these may be interesting as feature columns
[22]:
print({
    v: df3[v].nunique()
    for v in ['constant_edge_label', 'auth_type', 'logontype', 'reln', 'reln_topic']
})
df3[['constant_edge_label', 'auth_type', 'logontype', 'reln', 'reln_topic']].sample(10)
{'constant_edge_label': 1, 'auth_type': 17, 'logontype': 10, 'reln': 38, 'reln_topic': 15}
[22]:
constant_edge_label | auth_type | logontype | reln | reln_topic | |
---|---|---|---|---|---|
3181846 | 1 | ? | ? | ?::? | reln: batch, ntlm, wave |
3637183 | 1 | NTLM | Network | NTLM::Network | reln: kerberos, network, ntlm |
3527822 | 1 | Kerberos | Network | Kerberos::Network | reln: kerberos, network, ntlm |
2972171 | 1 | Kerberos | Network | Kerberos::Network | reln: kerberos, network, ntlm |
1844173 | 1 | ? | Network | ?::Network | reln: kerberos, network, ntlm |
3405729 | 1 | ? | Network | ?::Network | reln: kerberos, network, ntlm |
703185 | 1 | Kerberos | Network | Kerberos::Network | reln: kerberos, network, ntlm |
3827554 | 1 | ? | Network | ?::Network | reln: kerberos, network, ntlm |
2057222 | 1 | ? | Network | ?::Network | reln: kerberos, network, ntlm |
2072045 | 1 | Kerberos | Network | Kerberos::Network | reln: kerberos, network, ntlm |
Model training#
* Graph: (computer)-[event]->(computer)
* Edge relationship variants: untyped, auth_type, auth_type::logontype, topic model of auth_type::logontype
* Node feature variants: none, sampled
Graph shape#
[23]:
# simple: graphistry.edges(df3, SRC, DST)
g = (
    graphistry
    .nodes(pd.concat([nodes, X], axis=1), NODE)  ## adds computed node features
    .edges(df3, SRC, DST)
)
Node & edge data#
[24]:
print(g._nodes.shape)
g._nodes.sample(5)
(10904, 46)
[24]:
computer | out_time_min | out_time_max | out_time_mean | out_dst_domain_mode | out_dst_domain_count | out_dst_computer_mode | out_dst_computer_count | out_auth_type_mode | out_auth_type_count | ... | reln: remoteinteractive, interactive, ntlm | reln: microsoft_authentication_package_v1, microsoft_authentication_package_v, microsoft_authentication_package_ | reln: kerberos, network, ntlm | reln: newcredentials, ntlm, wave | reln: microsoft_authentication_package_v1_0, microsoft_authentication_package_v1_, microsoft_authentication_package_v1 | reln: networkcleartext, ntlm, wave | reln: service, ntlm, wave | reln: unlock, wave, ntlm | reln: cachedinteractive, interactive, ntlm | reln: microsoft_authentica, microsoft_authentication_pa, microsoft_authentication_pac | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
554 | C1476 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |||
8324 | C15964 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |||
10540 | C5684 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |||
1933 | C12857 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |||
2428 | C14040 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 46 columns
[25]:
print(g._edges.shape)
g._edges.sample(5)
(4000010, 13)
[25]:
time | src_domain | dst_domain | src_computer | dst_computer | auth_type | logontype | authentication_orientation | success_or_failure | RED | constant_edge_label | reln | reln_topic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2919189 | 165394 | C1620$@DOM1 | C1620$@DOM1 | C1621 | C528 | ? | ? | TGS | Success | 0.0 | 1 | ?::? | reln: batch, ntlm, wave |
3926399 | 45204 | ANONYMOUS LOGON@C586 | ANONYMOUS LOGON@C586 | C2654 | C586 | NTLM | Network | LogOn | Success | 0.0 | 1 | NTLM::Network | reln: kerberos, network, ntlm |
390234 | 14983 | U6@DOM1 | U6@DOM1 | C61 | C61 | ? | Network | LogOff | Success | 0.0 | 1 | ?::Network | reln: kerberos, network, ntlm |
2017626 | 47099 | C1085$@DOM1 | C1085$@DOM1 | C467 | C467 | ? | Network | LogOff | Success | 0.0 | 1 | ?::Network | reln: kerberos, network, ntlm |
3126304 | 138945 | ANONYMOUS LOGON@C2106 | ANONYMOUS LOGON@C2106 | C8416 | C2106 | NTLM | Network | LogOn | Success | 0.0 | 1 | NTLM::Network | reln: kerberos, network, ntlm |
Train#
[26]:
# If GPU/CPU RAM low, try to free up
if True:
    g.reset_caches()
    torch.cuda.empty_cache()
Resumable – want to run for more epochs? Just run the cell again 🚀
[ ]:
is_gpu = True

# CPU + GPU both work!
# GPUs tested:
#   20-40M rows: 2 x GPUs (24 GB GPU RAM ea)
#   4M rows: 1 x 3080 Ti GPU (12 GB GPU RAM)
# CPU: 32 GB RAM
dev0 = 'cpu'
if is_gpu:
    dev0 = 'cuda'

g2 = g.embed(
    #relation='constant_edge_label',
    #relation='auth_type',
    #relation='reln',
    #relation='reln_topic',
    #proto='RotatE',
    #proto='DistMult', ## default
    #proto='TransE',
    # if nodes were set via g.nodes(...), pass use_feat=True, X=['col1', ...],
    # and optionally y=['col10', ...]
    use_feat=True,  # nodes with attributes
    X=[NODE] + good_feats,
    #cardinality_threshold=len(g._edges)+1,  # avoid topic modeling on high-cardinality cols
    #min_words=len(g._edges)+1,  # avoid topic modeling on high-cardinality cols
    ##n_topics=21,
    train_split=0.99,
    batch_size=32,
    embedding_dim=32,
    evaluate=True,
    lr=0.004,  # 0.002, ..
    device=dev0,
    epochs=20,  # 20 if no features & 4M samples, 60+ if node features;
                # increase/rerun until score 85%+
)
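Training is resumable, so a hedged sketch for checkpointing the learned weights between sessions (the path is hypothetical; torch.save pickles the weights):

[ ]:
# Hedged sketch: persist/restore the trained RGCN weights (hypothetical path)
torch.save(g2._embed_model.state_dict(), 'rgcn_embed.pt')
# later: g2._embed_model.load_state_dict(torch.load('rgcn_embed.pt'))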
Inspect model#
[27]:
g2._embed_model
[27]:
HeteroEmbed(
  (rgcn): RGCNEmbed(
    (emb): Embedding(10888, 32)
    (rgc1): RelGraphConv(
      (linear_r): TypedLinear(in_size=32, out_size=32, num_types=1, regularizer=bdd, num_bases=32)
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (rgc2): RelGraphConv(
      (linear_r): TypedLinear(in_size=32, out_size=32, num_types=1)
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (dropout): Dropout(p=0.2, inplace=False)
  )
)
[28]:
g2._embed_model.relational_embedding.shape
[28]:
torch.Size([1, 32])
[29]:
(
    g2._embed_model.relational_embedding.min(),
    g2._embed_model.relational_embedding.mean(),
    g2._embed_model.relational_embedding.max()
)
[29]:
(tensor(-0.4868, grad_fn=<MinBackward1>),
tensor(0.0236, grad_fn=<MeanBackward0>),
tensor(0.4513, grad_fn=<MaxBackward1>))
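The [1, 32] shape reflects that no relation= column was passed above, so every edge shares a single relation type. A hedged sketch of the typed variant:

[ ]:
# Hedged sketch: with a typed relation column, expect one embedding row per
# edge type, i.e. roughly [17, 32] for the 17 distinct auth_type values
g3 = g.embed(relation='auth_type', embedding_dim=32, device=dev0, epochs=1)
print(g3._embed_model.relational_embedding.shape)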
Explore hits#
[28]:
def to_cpu(tensor):
    """
    Helper for moving tensors off the GPU (when is_gpu) to avoid coercion errors
    """
    if is_gpu:
        return tensor.cpu()
    else:
        return tensor
[37]:
%%time
score2 = g2._score(g2._triplets)
df['score'] = to_cpu(score2).numpy()
type(score2), to_cpu(score2).numpy().min(), to_cpu(score2).numpy().max()
CPU times: user 998 ms, sys: 27.8 ms, total: 1.03 s
Wall time: 744 ms
[37]:
(torch.Tensor, 3.049316e-08, 0.99901354)
[38]:
anom_edges2 = df[to_cpu(score2).numpy() < 0.03]
print('number within threshold', anom_edges2.shape)
anom_edges2[
    anom_edges2['time'].isin(set(red_team[0])) &
    anom_edges2['src_computer'].isin(set(red_team[2])) &
    anom_edges2['dst_computer'].isin(set(red_team[3]))
].reset_index()
number within threshold (924, 11)
[38]:
index | time | src_domain | dst_domain | src_computer | dst_computer | auth_type | logontype | authentication_orientation | success_or_failure | RED | score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4000000 | 151036 | U748@DOM1 | U748@DOM1 | C17693 | C305 | NTLM | Network | LogOn | Success | 1.0 | 0.006004 |
1 | 4000001 | 151648 | U748@DOM1 | U748@DOM1 | C17693 | C728 | NTLM | Network | LogOn | Success | 1.0 | 0.003731 |
2 | 4000003 | 153792 | U636@DOM1 | U636@DOM1 | C17693 | C294 | NTLM | Network | LogOn | Success | 1.0 | 0.020784 |
3 | 4000004 | 155219 | U748@DOM1 | U748@DOM1 | C17693 | C5693 | NTLM | Network | LogOn | Success | 1.0 | 0.023513 |
4 | 4000005 | 155399 | U748@DOM1 | U748@DOM1 | C17693 | C152 | NTLM | Network | LogOn | Success | 1.0 | 0.000488 |
5 | 4000007 | 155591 | U748@DOM1 | U748@DOM1 | C17693 | C332 | NTLM | Network | LogOn | Success | 1.0 | 0.006469 |
6 | 4000008 | 156658 | U748@DOM1 | U748@DOM1 | C17693 | C4280 | NTLM | Network | LogOn | Success | 1.0 | 0.005589 |
7 | 4000009 | 210086 | U748@DOM1 | U748@DOM1 | C18025 | C1493 | NTLM | Network | LogOn | Success | 1.0 | 0.009605 |
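Operationally, the same scores can drive alerting. A hedged sketch ranking the most surprising edges into an alert queue, reusing the 0.03 threshold from the cell above:

[ ]:
# Hedged sketch: lowest link-prediction scores first = most anomalous logins
alerts = df[df['score'] < 0.03].sort_values('score')
alerts[['time', 'src_computer', 'dst_computer', 'auth_type', 'score', 'RED']].head(20)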
[31]:
srcindx = df.src_computer.isin(anom_label.src_computer)
dstindx = df.dst_computer.isin(anom_label.dst_computer)
srcscore = score2.cpu().numpy()[srcindx]
dstscore = score2.cpu().numpy()[dstindx]
[ ]:
sdf = df[srcindx | dstindx]
sdf
[33]:
# from this graph alone it is not entirely obvious which nodes are red team
gsmall = graphistry.edges(sdf, 'src_computer', 'dst_computer')
url = gsmall.plot(render=False)
url
[33]:
'https://hub.graphistry.com/graph/graph.html?dataset=3f5ec495b70449abad38834abafbccfc&type=arrow&viztoken=b274cd2e-e255-49bb-a5ce-34c6c9d54357&usertag=23693246-pygraphistry-0.28.4+78.g6f47e03&splashAfter=1669998005&info=true'
[35]:
import matplotlib.pyplot as plt
import numpy as np
s = np.r_[srcscore, dstscore]
plt.hist(s)
[35]:
(array([ 43., 77., 107., 107., 170., 31., 243., 160., 0., 39.]),
array([0.01555263, 0.10742634, 0.19930004, 0.29117373, 0.38304743,
0.47492114, 0.5667948 , 0.6586685 , 0.7505422 , 0.8424159 ,
0.93428963], dtype=float32),
<BarContainer object of 10 artists>)

Are simpler methods better?#
Maybe conditional probability p(dst_computer | src_computer) can give us good histograms to tease out red team nodes?
[53]:
def conditional_probability(x, given, df):
    """Conditional probability over categorical variables:
    p(x|given) = p(x,given)/p(given)

    Args:
        x: the column variable of interest given the column 'given'
        given: the variable to hold fixed
        df: dataframe with columns [given, x]

    Returns:
        pd.Series: the conditional probability of x given the column 'given'
    """
    return df.groupby([given])[x].apply(lambda g: g.value_counts()/len(g))
[54]:
auth.src_computer.nunique(), auth.dst_computer.nunique()
[54]:
(10475, 10313)
[55]:
pd.concat([auth.src_computer, auth.dst_computer]).nunique()
[55]:
10887
[56]:
cnp = conditional_probability('dst_computer', given='src_computer', df=auth)
cnp
[56]:
src_computer
C1 C1 0.438462
C529 0.130769
C625 0.092308
C586 0.080769
U25 0.076923
...
C9997 C1065 0.010256
C585 0.010256
U7 0.005128
C801 0.005128
C2162 0.005128
Name: dst_computer, Length: 136119, dtype: float64
Conditional Graph#
[57]:
x=SRC
given=DST
condprobs = conditional_probability(x, given, df=auth)
cprob = pd.DataFrame(list(condprobs.index), columns=[given, x])
cprob['_probs'] = condprobs.values
[58]:
# now enrich the edges dataframe with the redteam data
indx = cprob.src_computer.isin(anom_label[SRC]) & cprob[DST].isin(anom_label[DST])
cprob.loc[indx, 'red'] = 1
cprob.loc[~indx, 'red'] = 0
[42]:
cg = graphistry.edges(cprob, x, given).bind(edge_weight='_probs')
cg.plot(render=False)
[42]:
'https://hub.graphistry.com/graph/graph.html?dataset=03bcc2f71a054a9c84fa5ce881f462bd&type=arrow&viztoken=abe69b16-c979-4ee5-b954-f6214d2d6b91&usertag=23693246-pygraphistry-refs/pull/408/head&splashAfter=1669918110&info=true'
Learning#
The conditional graph shows that most edge probabilities fall between 4e-7 and 0.03, and that bucket contains ~91k edges. The chance of surfacing a red team edge from it at random is about 10/91,000 ≈ 1e-4 – slim indeed
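A hedged sketch of that arithmetic against the cprob table built above:

[ ]:
# Hedged sketch of the bucket arithmetic
low = cprob[cprob['_probs'] < 0.03]
print(len(low))                     # edges in the low-probability bucket (~91k)
print(low['red'].sum() / len(low))  # chance a sampled edge is red (~1e-4)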
[43]:
# the reverse conditional: p(src_computer | dst_computer)
x = 'src_computer'
given = 'dst_computer'
condprobs2 = conditional_probability(x, given, df=auth)
cprob2 = pd.DataFrame(list(condprobs2.index), columns=[given, x])
cprob2['_probs'] = condprobs2.values
[44]:
# now enrich the edges dataframe with the redteam data
indx = cprob2.src_computer.isin(anom_label.src_computer) & cprob2.dst_computer.isin(anom_label.dst_computer)
cprob2.loc[indx, 'red'] = 1
cprob2.loc[~indx, 'red'] = 0
Can we do simple std analysis on red vs not?#
Not really
[56]:
red = X[:11].std(0).values  # feature stds over the first 11 rows (treated as the red team rows)
red
[56]:
array([6.69291932e+00, 6.94233955e+00, 4.03406261e+00, 1.24375581e+00,
8.19816652e+00, 5.34599137e+00, 8.55273272e+00, 8.82849580e+00,
5.97048788e+00, 7.98474697e-01, 7.14580528e-03, 4.22057963e-02,
2.00462442e-02, 2.38532804e+00, 8.04956952e-02])
[57]:
import numpy as np
indx = np.random.choice(range(11, len(X)), 11)
rand = X.iloc[indx].std(0).values
rand
[57]:
array([2.72630555e-01, 4.27446082e+00, 1.50943836e-02, 2.35438425e+01,
7.47794986e-02, 1.66500055e-01, 2.07862744e+00, 1.97366326e-01,
1.64508509e-02, 6.49012753e-01, 6.75271878e+00, 1.20960431e+01,
7.55023721e+00, 3.07596829e+01, 2.12004662e+01])
[58]:
blue = X[11:].std(0).values  # feature stds over the remaining (non-red) rows
blue
[58]:
array([11.94419565, 3.63960234, 2.87543166, 18.48754619, 3.58461871,
5.26787065, 20.78975583, 6.13758242, 2.29442998, 8.55447084,
5.97436329, 7.97298694, 8.26999525, 14.63120211, 13.53291024])
[61]:
from matplotlib import pyplot as plt
#plt.plot(red/blue)
plt.plot(red, c='r')
plt.plot(blue, c='b')
plt.plot(rand, c='g')
[61]:
[<matplotlib.lines.Line2D at 0x7f954901ef70>]

[62]:
plt.plot(red, c='r')
rands = []
for k in range(4):
    indx = np.random.choice(range(11, len(X)), 11)
    rand = X.iloc[indx].std(0).values
    plt.plot(rand)
    rands.append(rand)
plt.plot(red/np.mean(rands, axis=0))
[62]:
[<matplotlib.lines.Line2D at 0x7f953806d0d0>]

Resources#

* Dashboarding with graph-app-kit (containerized, GPU, graph Streamlit)
* GNN videos:
  * Euler (combining RNN + GNN): https://www.youtube.com/watch?v=1t124vguwJ8