Clustering with Python |
您所在的位置:网站首页 › 苍蝇拍图画 › Clustering with Python |
Clustering with Python — HDBSCANMost effective clustering method using DBSCAN and Hierarchical clustering https://hdbscan.readthedocs.io/en/latest/index.html How HDBSCAN Works - hdbscan 0.8.1 documentationHDBSCAN is a clustering algorithm developed by Campello, Moulavi, and Sander. It extends DBSCAN by converting it into a…hdbscan.readthedocs.io HDBSCAN IntuitionThe simple way to do this is to define a new distance metric between points which we will call the mutual reachability distance. We define mutual reachability distance as follows: For this we need 3 things - Core distance of point A , Core Distance of point B , Distance between A and B 1- Use k neighbours of 6 points ( whatever you want to start with) and get a core point A , and the core distance of point A 2 — find another core point with a larger circle and this is Core point B and core distance of point B Progressively we find another point and its core distance 3 Calculate the Distance Between A and B = Mutual Reachable Distance: If distance between A and B is smaller than the the core distance of A or B the mutual reachable distance is marked as lesser Same way if the mutual reachable distance here is greater than core distance of A and Core distance of B Using the Mutual Reachability Distance we build the minimum spanning tree Now consider a threshold value, starting high, and steadily being lowered. Drop any edges with weight above that threshold. As we drop edges we will start to disconnect the graph into connected components. Eventually we will have a hierarchy of connected components (from completely connected to completely disconnected) at varying threshold levels. Build the cluster hierarchy by defining the Min Cluster PointsAny points that are outside this range are dropped out Condense the clusters and extract the points The circled out are the points inside the condensed clusters , To make this concrete we need a notion of minimum cluster size which we take as a parameter to HDBSCAN. More Reading : https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html Python Implementationhttps://hdbscan.readthedocs.io/en/latest/index.html Install hdbscan !pip install hdbscan import hdbscan import hdbscanhdbscan = hdbscan.HDBSCAN(min_cluster_size=10, min_samples = 5) labels = hdbscan.fit_predict(data)hdbscan.condensed_tree_.plot(select_clusters=True)np.unique(labels) Note the horizontal lines in the number of 6 clusters min_cluster_size = is guessed looking at the scatter plot min_smaples = number of points in k neighbour hdbscan.condensed_tree_.plot(select_clusters=True) — shows the points The returned array has negative labels = outliers and many labels hdbscan.single_linkage_tree_.plot()this will plot the tree DO NOT COMPUTE THIS as it will break your computer takes a lot of time Fine tune the model:To get a reasonable number of labels … increase or decrease the number of cluster and number of points = min_samples eg: labels = number of clusters Plot the results This clearly does not look correct , so we need lesser clusters Predict New Data Points with HDBSCAN1- In the clusterer — add parameter prediction data = True hdbscan = hdbscan.HDBSCAN(min_cluster_size=12, min_samples = 3, prediction_data=True) 2- Use the approximate_predict( hdbscan,[[34,90]]) >>>> approximate_predict(clusterer, new points)Read more : https://hdbscan.readthedocs.io/en/latest/prediction_tutorial.html Considerations : the data used originally does not have any clear clusters , and hdbscan accurately calculated that, consider a dataset which has clusters so we can see more clear results : Comparisons between different algorithms |
今日新闻 |
推荐新闻 |
CopyRight 2018-2019 办公设备维修网 版权所有 豫ICP备15022753号-3 |