GPU UMAP#
UMAP is a popular method of dimensionality reduction, a helpful technique for meaningful analysis of large, complex datasets. Graphistry provides convenient bindings for working with cuml.UMAP.
UMAP is:

* driven by the number of nearest neighbors considered
* non-linear, unlike longstanding methods such as PCA
* non-scaling, which keeps calculations fast
* stochastic and thus non-deterministic – and different libraries handle this differently, as you will see in this notebook
  * umap-learn states that "variance between runs will exist, however small"
  * cuml currently uses "exact kNN". This may change in future releases
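To make the last bullet concrete: "exact kNN" compares every point against every other point, rather than using an approximate index. A minimal numpy sketch of the idea (illustrative only, not cuML's actual implementation):

```python
import numpy as np

def exact_knn(X, k):
    """Brute-force exact k-nearest neighbors: compute all pairwise
    distances, then take the k closest indices per row."""
    # Pairwise squared Euclidean distances, shape (n, n)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude self-matches
    return np.argsort(d2, axis=1)[:, :k]  # indices of the k closest points

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(exact_knn(X, 1).ravel())  # → [1 0 3 2]
```

This is O(n²) in points, which is why approximate methods (like umap-learn's NN-descent) exist for very large datasets.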
Further reading:
Part I: CPU Baseline in Python Pandas
Part III: GPU SQL - deprecated as Dask-SQL replaced BlazingSQL in the RAPIDS ecosystem
clone and install graphistry, print version#
[9]:
import pandas as pd, networkx as nx
# !git clone https://github.com/graphistry/pygraphistry.git
from time import time
!pip install -U pygraphistry/ --quiet
import graphistry
graphistry.register(api=3,protocol="https", server="hub.graphistry.com", username='***', password='***')
graphistry.__version__
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[9]:
'0.27.2+4.ga674343.dirty'
[2]:
import pandas as pd, numpy as np
start_u = pd.to_datetime('2016-01-01').value//10**9
end_u = pd.to_datetime('2021-01-01').value//10**9
samples=1000
# df = pd.DataFrame(np.random.randint(,100,size=(samples, 1)), columns=['user_id', 'age', 'profile'])
df = pd.DataFrame(np.random.randint(18,75,size=(samples, 1)), columns=['age'])
df['user_id'] = np.random.randint(0,200,size=(samples, 1))
df['profile'] = np.random.randint(0,1000,size=(samples, 1))
df['date']=pd.to_datetime(np.random.randint(start_u, end_u, samples), unit='s').date
# df[['lat','lon']]=(np.round(np.random.uniform(, 180,size=(samples,2)), 5))
df['lat'] = np.round(np.random.uniform(20, 24, size=(samples)), 2)
df['lon'] = np.round(np.random.uniform(110, 120, size=(samples)), 2)
df['location'] = df['lon'].astype(str) + "," + df['lat'].astype(str)
df.drop(columns=['lat','lon'],inplace=True)
df = df.applymap(str)
df
[2]:
|     | age | user_id | profile | date       | location     |
|-----|-----|---------|---------|------------|--------------|
| 0   | 32  | 185     | 357     | 2017-06-16 | 117.81,22.87 |
| 1   | 66  | 86      | 84      | 2020-03-30 | 110.07,20.52 |
| 2   | 28  | 26      | 862     | 2019-05-12 | 116.16,23.02 |
| 3   | 69  | 193     | 607     | 2019-03-11 | 112.21,23.25 |
| 4   | 34  | 27      | 4       | 2019-08-06 | 114.56,20.99 |
| ... | ... | ...     | ...     | ...        | ...          |
| 995 | 52  | 128     | 435     | 2016-10-19 | 115.3,23.67  |
| 996 | 67  | 116     | 97      | 2016-04-24 | 117.69,23.92 |
| 997 | 32  | 55      | 915     | 2018-11-07 | 113.63,22.74 |
| 998 | 72  | 68      | 148     | 2020-05-23 | 116.39,21.25 |
| 999 | 56  | 19      | 932     | 2016-04-23 | 116.2,23.54  |
1000 rows × 5 columns
[3]:
g = graphistry.nodes(df)
t = time()
g2 = g.umap()
mins = (time() - t) / 60  # avoid shadowing the built-in min
lines_per_min = df.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
g2.plot()
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional
['time: 0.14064184427261353 line/min: 7110.259433612426']
Parameters: X and y, feature_engine, etc#
[4]:
g = graphistry.nodes(df)
t = time()
g2 = g.umap(X=['user_id'], y=['date', 'location'])
mins = (time() - t) / 60
lines_per_min = df.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
g2.plot()
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional
['time: 0.0287002166112264 line/min: 34842.94260026035']
[5]:
g = graphistry.nodes(df)
t = time()
g2 = g.umap(X=['user_id'], y=['date', 'location'], feature_engine='torch')
mins = (time() - t) / 60
lines_per_min = df.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
g2.plot()
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional
['time: 0.0024895787239074705 line/min: 401674.38386140653']
Testing various other parameters#
[6]:
g = graphistry.nodes(df)
t = time()
g2 = g.umap(X=['user_id'], y=['date', 'location'], feature_engine='torch',
            n_neighbors=2, min_dist=0.1, spread=0.1, local_connectivity=2,
            n_components=5, metric='hellinger')
mins = (time() - t) / 60
lines_per_min = df.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
g2.plot(render=False)
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional
['time: 0.0022179365158081056 line/min: 450869.5325013168']
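For reference, the `metric='hellinger'` passed above is the Hellinger distance between discrete probability distributions, so it expects non-negative inputs. A quick numpy sketch of what it computes (an illustration, not cuML's internal code):

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions:
    # H(p, q) = (1 / sqrt(2)) * || sqrt(p) - sqrt(q) ||_2, bounded in [0, 1]
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

# Identical distributions → distance 0; disjoint support → distance 1
print(hellinger(np.array([0.5, 0.5]), np.array([0.5, 0.5])))            # → 0.0
print(round(hellinger(np.array([1.0, 0.0]), np.array([0.0, 1.0])), 3))  # → 1.0
```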
Test the engine flag to see the speed boost#
[7]:
g = graphistry.nodes(df)
t = time()
g2 = g.umap(engine='cuml')
mins = (time() - t) / 60
lines_per_min = df.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional
['time: 0.00446544885635376 line/min: 223941.65338544376']
[8]:
g = graphistry.nodes(df)
t = time()
g2 = g.umap(engine='umap_learn')  # note this will take appreciable time depending on sample count defined above
mins = (time() - t) / 60
lines_per_min = df.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional
['time: 0.11818180878957113 line/min: 8461.539134001174']
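Comparing the two runs above, the cuml engine processes roughly 26x more rows per minute than umap_learn on this 1,000-row sample (a rough figure that will vary with hardware and data size):

```python
# Rates taken from the 'line/min' outputs of the two cells above
cuml_rate = 223941.65338544376
umap_learn_rate = 8461.539134001174

print(round(cuml_rate / umap_learn_rate, 1))  # → 26.5
```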
Now let's look at some real data#
[12]:
G=pd.read_csv('pygraphistry/demos/data/honeypot.csv')
g = graphistry.nodes(G)
t = time()
g3 = g.umap(engine='cuml')
mins = (time() - t) / 60
lines_per_min = G.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (220, 0) in UMAP fit, as it is not one dimensional
['time: 0.008098324139912924 line/min: 27166.11439590581']
[13]:
print(g3._edges.info())
g3._edges.sample(5)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2410 entries, 0 to 2821
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 _src_implicit 2410 non-null int32
1 _dst_implicit 2410 non-null int32
2 _weight 2410 non-null float32
dtypes: float32(1), int32(2)
memory usage: 47.1 KB
None
[13]:
|      | _src_implicit | _dst_implicit | _weight  |
|------|---------------|---------------|----------|
| 671  | 51            | 123           | 0.017956 |
| 2123 | 167           | 194           | 0.663975 |
| 1761 | 139           | 78            | 0.113361 |
| 2444 | 191           | 3             | 0.999991 |
| 2441 | 190           | 152           | 0.544303 |
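The _weight column holds the strength of each UMAP similarity edge, so you can prune weak links with ordinary pandas before plotting. A small sketch using made-up rows shaped like the sample above (the 0.5 threshold is an arbitrary choice for illustration):

```python
import pandas as pd

# Toy edges shaped like g3._edges: source, destination, similarity weight
edges = pd.DataFrame({
    '_src_implicit': [51, 167, 139, 191, 190],
    '_dst_implicit': [123, 194, 78, 3, 152],
    '_weight': [0.017956, 0.663975, 0.113361, 0.999991, 0.544303],
})

# Keep only strong similarity links
strong = edges[edges['_weight'] > 0.5]
print(len(strong))  # → 3
```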
[16]:
#g3.plot()
Next steps#
Part I: CPU Baseline in Python Pandas
Part III: GPU SQL - deprecated as Dask-SQL replaced BlazingSQL in the RAPIDS ecosystem