GPU UMAP#

UMAP is a popular method of dimensionality reduction, a helpful technique for meaningful analysis of large, complex datasets. Graphistry provides convenient bindings for working with cuml.UMAP.

UMAP is:

* interested in the number of nearest neighbors
* non-linear, unlike longstanding methods such as PCA
* non-scaling, which keeps calculation fast
* stochastic and thus non-deterministic, and different libraries handle this differently, as you will see in this notebook
  * umap-learn states that “variance between runs will exist, however small”
  * cuml currently uses “exact kNN”. This may change in future releases
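
To make the "exact kNN" idea concrete, here is a minimal brute-force sketch in plain Python. It is not cuml's implementation; the function name `exact_knn` and the toy points are hypothetical, chosen only to show what an exact (non-approximate) neighbor search does: every point is compared against every other point.

```python
import math

def exact_knn(points, k):
    """Return, for each point, the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        # Compare against every other point: no approximation, O(n^2) overall.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

# Two tight pairs of points, far apart from each other (toy data).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
knn = exact_knn(points, k=1)
print(knn)
```

Approximate methods trade this exhaustive comparison for speed, which is one reason results can differ between libraries.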

clone and install graphistry, print version#

[9]:
import pandas as pd, networkx as nx
# !git clone https://github.com/graphistry/pygraphistry.git

from time import time
!pip install -U pygraphistry/ --quiet

import graphistry
graphistry.register(api=3,protocol="https", server="hub.graphistry.com", username='***', password='***')
graphistry.__version__
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[9]:
'0.27.2+4.ga674343.dirty'
[2]:
import pandas as pd, numpy as np
start_u = pd.to_datetime('2016-01-01').value//10**9
end_u = pd.to_datetime('2021-01-01').value//10**9
samples=1000
# df = pd.DataFrame(np.random.randint(,100,size=(samples, 1)), columns=['user_id', 'age', 'profile'])
df = pd.DataFrame(np.random.randint(18,75,size=(samples, 1)), columns=['age'])
df['user_id'] = np.random.randint(0,200,size=(samples, 1))
df['profile'] = np.random.randint(0,1000,size=(samples, 1))
df['date']=pd.to_datetime(np.random.randint(start_u, end_u, samples), unit='s').date

# df[['lat','lon']]=(np.round(np.random.uniform(, 180,size=(samples,2)), 5))
df['lon']=np.round(np.random.uniform(20, 24,size=(samples)), 2)
df['lat']=np.round(np.random.uniform(110, 120,size=(samples)), 2)
df['location']=df['lat'].astype(str) +","+ df["lon"].astype(str)
df.drop(columns=['lat','lon'],inplace=True)
df = df.astype(str)  # cast every column to string (applymap is deprecated in recent pandas)
df
[2]:
age user_id profile date location
0 32 185 357 2017-06-16 117.81,22.87
1 66 86 84 2020-03-30 110.07,20.52
2 28 26 862 2019-05-12 116.16,23.02
3 69 193 607 2019-03-11 112.21,23.25
4 34 27 4 2019-08-06 114.56,20.99
... ... ... ... ... ...
995 52 128 435 2016-10-19 115.3,23.67
996 67 116 97 2016-04-24 117.69,23.92
997 32 55 915 2018-11-07 113.63,22.74
998 72 68 148 2020-05-23 116.39,21.25
999 56 19 932 2016-04-23 116.2,23.54

1000 rows × 5 columns
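
Because the frame above is drawn from an unseeded random generator, each run produces different rows. The sketch below shows one way to make the synthetic data repeatable. It assumes NumPy's `default_rng`; the helper name `make_frame` and the seed value are hypothetical, and the `location` column is omitted for brevity. Seeding the data does not by itself make the UMAP embedding deterministic.

```python
import numpy as np
import pandas as pd

def make_frame(seed, samples=1000):
    """Build a synthetic frame like the one above, but from a seeded generator."""
    rng = np.random.default_rng(seed)  # seeded generator -> repeatable draws
    start_u = pd.Timestamp('2016-01-01').value // 10**9  # epoch seconds
    end_u = pd.Timestamp('2021-01-01').value // 10**9
    frame = pd.DataFrame({
        'age': rng.integers(18, 75, size=samples),
        'user_id': rng.integers(0, 200, size=samples),
        'profile': rng.integers(0, 1000, size=samples),
        'date': pd.to_datetime(rng.integers(start_u, end_u, size=samples), unit='s').date,
    })
    return frame.astype(str)  # cast every column to string, as in the cell above

df_a = make_frame(seed=42)
df_b = make_frame(seed=42)
print(df_a.equals(df_b))  # same seed, same rows
```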

[3]:
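g = graphistry.nodes(df)
t = time()
g2 = g.umap()
elapsed_min = (time() - t) / 60  # minutes elapsed; named so the built-in min() is not shadowed
rows_per_min = df.shape[0] / elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])
g2.plot()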
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional
['time: 0.14064184427261353 line/min: 7110.259433612426']
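
The timing pattern in the cell above is repeated throughout this notebook. A small helper like the following could wrap it. This is only a sketch: `timed` is a hypothetical name, it uses `time.perf_counter` rather than `time.time`, and the workload shown is a stand-in rather than a real `umap()` call.

```python
from time import perf_counter

def timed(fn, n_rows):
    """Run fn() and return (result, elapsed minutes, rows per minute)."""
    start = perf_counter()
    result = fn()
    elapsed_min = (perf_counter() - start) / 60
    # Guard against a zero reading on very fast calls.
    rows_per_min = n_rows / elapsed_min if elapsed_min > 0 else float('inf')
    return result, elapsed_min, rows_per_min

# Stand-in workload; in the notebook this would be e.g. lambda: g.umap()
result, elapsed_min, rows_per_min = timed(lambda: sum(range(100_000)), n_rows=1000)
print(f'time: {elapsed_min:.6f} min, rows/min: {rows_per_min:.0f}')
```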

Parameters: X and y, feature_engine, etc#

[4]:
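g = graphistry.nodes(df)
t = time()
g2 = g.umap(X=['user_id'], y=['date', 'location'])  # X: feature columns, y: target columns
elapsed_min = (time() - t) / 60  # minutes elapsed; named so the built-in min() is not shadowed
rows_per_min = df.shape[0] / elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])
g2.plot()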
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional
['time: 0.0287002166112264 line/min: 34842.94260026035']
[5]:
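g = graphistry.nodes(df)
t = time()
g2 = g.umap(X=['user_id'], y=['date', 'location'], feature_engine='torch')
elapsed_min = (time() - t) / 60  # minutes elapsed; named so the built-in min() is not shadowed
rows_per_min = df.shape[0] / elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])
g2.plot()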
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional
['time: 0.0024895787239074705 line/min: 401674.38386140653']

The next cell tests several other UMAP parameters (n_neighbors, min_dist, spread, local_connectivity, n_components, metric):

[6]:
g = graphistry.nodes(df)
t = time()
g2 = g.umap(
    X=['user_id'], y=['date', 'location'], feature_engine='torch',
    n_neighbors=2, min_dist=.1, spread=.1, local_connectivity=2,
    n_components=5, metric='hellinger',
)
elapsed_min = (time() - t) / 60  # minutes elapsed; named so the built-in min() is not shadowed
rows_per_min = df.shape[0] / elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])
g2.plot(render=False)
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional
['time: 0.0022179365158081056 line/min: 450869.5325013168']
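
The cell above passes `metric='hellinger'`. For readers unfamiliar with it, the sketch below computes the textbook Hellinger distance between two discrete probability distributions. It illustrates the formula only and is not taken from umap-learn or cuml, whose implementations and supported metrics may differ; the function name and the sample inputs are hypothetical.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Assumes p and q are equal-length sequences of non-negative values
    that each sum to 1.
    """
    if len(p) != len(q):
        raise ValueError('p and q must have the same length')
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)

same = hellinger([0.5, 0.5], [0.5, 0.5])      # identical distributions
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])  # no overlapping mass
print(same, disjoint)
```

The distance is bounded between 0 (identical distributions) and 1 (no overlap), which is why it is usually applied to non-negative, histogram-like features.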

test engine flag to see speed boost#

[7]:
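g = graphistry.nodes(df)
t = time()
g2 = g.umap(engine='cuml')  # GPU implementation
elapsed_min = (time() - t) / 60  # minutes elapsed; named so the built-in min() is not shadowed
rows_per_min = df.shape[0] / elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])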
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional
['time: 0.00446544885635376 line/min: 223941.65338544376']
[8]:
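g = graphistry.nodes(df)
t = time()
g2 = g.umap(engine='umap_learn')  # note: this can take appreciable time, depending on the sample count defined above
elapsed_min = (time() - t) / 60  # minutes elapsed; named so the built-in min() is not shadowed
rows_per_min = df.shape[0] / elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])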
* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional
['time: 0.11818180878957113 line/min: 8461.539134001174']
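
To put the two timings above side by side, the following arithmetic uses the exact values printed by the `cuml` and `umap_learn` cells. It reflects a single run on 1000 synthetic rows, so treat the ratio (roughly 26x here) as indicative only.

```python
samples = 1000

# Elapsed minutes as printed by the two cells above.
cuml_min = 0.00446544885635376
umap_learn_min = 0.11818180878957113

speedup = umap_learn_min / cuml_min   # how many times faster cuml was in this run
cuml_rows_per_min = samples / cuml_min
umap_learn_rows_per_min = samples / umap_learn_min

print(f'cuml speedup over umap_learn: {speedup:.1f}x')
```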

Now let's look at some real data:#

[12]:
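G = pd.read_csv('pygraphistry/demos/data/honeypot.csv')

g = graphistry.nodes(G)
t = time()
g3 = g.umap(engine='cuml')  # switch to engine='umap_learn' to compare against the CPU implementation
elapsed_min = (time() - t) / 60  # minutes elapsed; named so the built-in min() is not shadowed
rows_per_min = G.shape[0] / elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])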
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (220, 0) in UMAP fit, as it is not one dimensional
['time: 0.008098324139912924 line/min: 27166.11439590581']
[13]:
print(g3._edges.info())
g3._edges.sample(5)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2410 entries, 0 to 2821
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   _src_implicit  2410 non-null   int32
 1   _dst_implicit  2410 non-null   int32
 2   _weight        2410 non-null   float32
dtypes: float32(1), int32(2)
memory usage: 47.1 KB
None
[13]:
_src_implicit _dst_implicit _weight
671 51 123 0.017956
2123 167 194 0.663975
1761 139 78 0.113361
2444 191 3 0.999991
2441 190 152 0.544303
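
The `_weight` column shown above can be used to keep only the stronger implicit edges. The sketch below rebuilds the five sampled rows as a standalone frame so it runs on its own. The 0.5 cutoff is a hypothetical value for illustration, not a recommended setting.

```python
import pandas as pd

# The five sampled rows from the output above, rebuilt as a standalone frame.
edges = pd.DataFrame(
    {
        '_src_implicit': [51, 167, 139, 191, 190],
        '_dst_implicit': [123, 194, 78, 3, 152],
        '_weight': [0.017956, 0.663975, 0.113361, 0.999991, 0.544303],
    },
    index=[671, 2123, 1761, 2444, 2441],
)

threshold = 0.5  # hypothetical cutoff
strong = edges[edges['_weight'] >= threshold].sort_values('_weight', ascending=False)
print(strong)
```

In the notebook itself, `g3._edges` would take the place of the hand-built `edges` frame.
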
[16]:
#g3.plot()

Next steps#