GPU UMAP#
UMAP is a popular method of dimensionality reduction, a helpful technique for meaningful analysis of large, complex datasets. Graphistry provides convenient bindings for working with cuml.UMAP.
UMAP is:

* driven by the number of nearest neighbors considered
* non-linear, unlike longstanding methods such as PCA
* non-scaling, which keeps calculations fast
* stochastic and thus non-deterministic – and different libraries handle this differently, as you will see in this notebook
  * umap-learn states that "variance between runs will exist, however small"
  * cuml currently uses "exact kNN". This may change in future releases
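To make the last bullet concrete: "exact kNN" compares every point against every other point, rather than using an approximate index. A minimal numpy sketch of the idea (illustrative only, not cuML's actual implementation):

```python
import numpy as np

def exact_knn(X, k):
    """Brute-force exact k-nearest neighbors: compute all pairwise
    distances, then take the k closest indices per row."""
    # Pairwise squared Euclidean distances, shape (n, n)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude self-matches
    return np.argsort(d2, axis=1)[:, :k]  # indices of the k closest points

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(exact_knn(X, 1).ravel())  # → [1 0 3 2]
```

This is O(n²) in points, which is why approximate methods (like umap-learn's NN-descent) exist for very large datasets.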
Further reading:
Part I: CPU Baseline in Python Pandas
Part III: GPU SQL - deprecated as Dask-SQL replaced BlazingSQL in the RAPIDS ecosystem
clone and install graphistry, print version#
[9]:
import pandas as pd, networkx as nx
# !git clone https://github.com/graphistry/pygraphistry.git
from time import time
!pip install -U pygraphistry/ --quiet
import graphistry
graphistry.register(api=3,protocol="https", server="hub.graphistry.com", username='***', password='***')
graphistry.__version__
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[9]:
'0.27.2+4.ga674343.dirty'
[2]:
import pandas as pd, numpy as np
start_u = pd.to_datetime('2016-01-01').value//10**9
end_u = pd.to_datetime('2021-01-01').value//10**9
samples=1000
# df = pd.DataFrame(np.random.randint(,100,size=(samples, 1)), columns=['user_id', 'age', 'profile'])
df = pd.DataFrame(np.random.randint(18,75,size=(samples, 1)), columns=['age'])
df['user_id'] = np.random.randint(0,200,size=(samples, 1))
df['profile'] = np.random.randint(0,1000,size=(samples, 1))
df['date']=pd.to_datetime(np.random.randint(start_u, end_u, samples), unit='s').date
# df[['lat','lon']]=(np.round(np.random.uniform(, 180,size=(samples,2)), 5))
df['lat'] = np.round(np.random.uniform(20, 24, size=(samples)), 2)
df['lon'] = np.round(np.random.uniform(110, 120, size=(samples)), 2)
df['location'] = df['lon'].astype(str) + "," + df['lat'].astype(str)
df.drop(columns=['lat','lon'],inplace=True)
df = df.applymap(str)
df
[2]:
|     | age | user_id | profile | date       | location     |
|-----|-----|---------|---------|------------|--------------|
| 0   | 32  | 185     | 357     | 2017-06-16 | 117.81,22.87 |
| 1   | 66  | 86      | 84      | 2020-03-30 | 110.07,20.52 |
| 2   | 28  | 26      | 862     | 2019-05-12 | 116.16,23.02 |
| 3   | 69  | 193     | 607     | 2019-03-11 | 112.21,23.25 |
| 4   | 34  | 27      | 4       | 2019-08-06 | 114.56,20.99 |
| ... | ... | ...     | ...     | ...        | ...          |
| 995 | 52  | 128     | 435     | 2016-10-19 | 115.3,23.67  |
| 996 | 67  | 116     | 97      | 2016-04-24 | 117.69,23.92 |
| 997 | 32  | 55      | 915     | 2018-11-07 | 113.63,22.74 |
| 998 | 72  | 68      | 148     | 2020-05-23 | 116.39,21.25 |
| 999 | 56  | 19      | 932     | 2016-04-23 | 116.2,23.54  |
1000 rows × 5 columns
[3]:
g = graphistry.nodes(df)
t = time()
g2 = g.umap()
mins = (time() - t) / 60  # avoid shadowing the built-in min
lines_per_min = df.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
g2.plot()
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional
['time: 0.14064184427261353 line/min: 7110.259433612426']
Parameters: X and y, feature_engine, etc#
[4]:
g = graphistry.nodes(df)
t = time()
g2 = g.umap(X=['user_id'], y=['date', 'location'])
mins = (time() - t) / 60
lines_per_min = df.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
g2.plot()
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional
['time: 0.0287002166112264 line/min: 34842.94260026035']
[5]:
g = graphistry.nodes(df)
t = time()
g2 = g.umap(X=['user_id'], y=['date', 'location'], feature_engine='torch')
mins = (time() - t) / 60
lines_per_min = df.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
g2.plot()
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional
['time: 0.0024895787239074705 line/min: 401674.38386140653']
Testing various other parameters#
[6]:
g = graphistry.nodes(df)
t = time()
g2 = g.umap(X=['user_id'], y=['date', 'location'], feature_engine='torch',
            n_neighbors=2, min_dist=0.1, spread=0.1, local_connectivity=2,
            n_components=5, metric='hellinger')
mins = (time() - t) / 60
lines_per_min = df.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
g2.plot(render=False)
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional
['time: 0.0022179365158081056 line/min: 450869.5325013168']
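For reference, the `metric='hellinger'` passed above is the Hellinger distance between discrete probability distributions, so it expects non-negative inputs. A quick numpy sketch of what it computes (an illustration, not cuML's internal code):

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions:
    # H(p, q) = (1 / sqrt(2)) * || sqrt(p) - sqrt(q) ||_2, bounded in [0, 1]
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

# Identical distributions → distance 0; disjoint support → distance 1
print(hellinger(np.array([0.5, 0.5]), np.array([0.5, 0.5])))            # → 0.0
print(round(hellinger(np.array([1.0, 0.0]), np.array([0.0, 1.0])), 3))  # → 1.0
```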
Test the engine flag to see the speed boost#
[7]:
g = graphistry.nodes(df)
t = time()
g2 = g.umap(engine='cuml')
mins = (time() - t) / 60
lines_per_min = df.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional
['time: 0.00446544885635376 line/min: 223941.65338544376']
[8]:
g = graphistry.nodes(df)
t = time()
g2 = g.umap(engine='umap_learn')  # note this will take appreciable time depending on sample count defined above
mins = (time() - t) / 60
lines_per_min = df.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional
['time: 0.11818180878957113 line/min: 8461.539134001174']
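Comparing the two runs above, the cuml engine processes roughly 26x more rows per minute than umap_learn on this 1,000-row sample (a rough figure that will vary with hardware and data size):

```python
# Rates taken from the 'line/min' outputs of the two cells above
cuml_rate = 223941.65338544376
umap_learn_rate = 8461.539134001174

print(round(cuml_rate / umap_learn_rate, 1))  # → 26.5
```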
Now let's look at some real data#
[12]:
G=pd.read_csv('pygraphistry/demos/data/honeypot.csv')
g = graphistry.nodes(G)
t = time()
g3 = g.umap(engine='cuml')
mins = (time() - t) / 60
lines_per_min = G.shape[0] / mins
print(['time: ' + str(mins) + ' line/min: ' + str(lines_per_min)])
! Failed umap speedup attempt. Continuing without memoization speedups.* Ignoring target column of shape (220, 0) in UMAP fit, as it is not one dimensional
['time: 0.008098324139912924 line/min: 27166.11439590581']
[13]:
print(g3._edges.info())
g3._edges.sample(5)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2410 entries, 0 to 2821
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 _src_implicit 2410 non-null int32
1 _dst_implicit 2410 non-null int32
2 _weight 2410 non-null float32
dtypes: float32(1), int32(2)
memory usage: 47.1 KB
None
[13]:
|      | _src_implicit | _dst_implicit | _weight  |
|------|---------------|---------------|----------|
| 671  | 51            | 123           | 0.017956 |
| 2123 | 167           | 194           | 0.663975 |
| 1761 | 139           | 78            | 0.113361 |
| 2444 | 191           | 3             | 0.999991 |
| 2441 | 190           | 152           | 0.544303 |
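The _weight column holds the strength of each UMAP similarity edge, so you can prune weak links with ordinary pandas before plotting. A small sketch using made-up rows shaped like the sample above (the 0.5 threshold is an arbitrary choice for illustration):

```python
import pandas as pd

# Toy edges shaped like g3._edges: source, destination, similarity weight
edges = pd.DataFrame({
    '_src_implicit': [51, 167, 139, 191, 190],
    '_dst_implicit': [123, 194, 78, 3, 152],
    '_weight': [0.017956, 0.663975, 0.113361, 0.999991, 0.544303],
})

# Keep only strong similarity links
strong = edges[edges['_weight'] > 0.5]
print(len(strong))  # → 3
```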
[16]:
#g3.plot()
Next steps#
Part I: CPU Baseline in Python Pandas
Part III: GPU SQL - deprecated as Dask-SQL replaced BlazingSQL in the RAPIDS ecosystem