Class 1
Summary of the characteristics of various clustering methods:
![](https://i-blog.csdnimg.cn/blog_migrate/10d5a0c4136202c2b9e081d525649017.png)
sklearn.cluster.KMeans
from sklearn.cluster import KMeans
KMeans(n_clusters=8,init='k-means++',n_init=10,max_iter=300,tol=0.0001,precompute_distances='auto',verbose=0,random_state=None,copy_x=True,n_jobs=1,algorithm='auto')
#n_clusters: number of clusters;
#max_iter: maximum number of iterations per initialization;
#n_init: number of different centroid initializations to try (the best result is kept);
#tol: tolerance on the within-cluster sum of squares to declare convergence;
#init='k-means++': makes the initial centroids far apart from each other;
Algorithm key points:
1. Partition the training data X into k clusters.
2. Objective function: minimize the within-cluster sum of squares (inertia), i.e. the sum over all samples of the squared distance to the nearest centroid.
3. Drawbacks:
   1) KMeans assumes clusters are convex and isotropic; it fits elongated clusters or manifolds with irregular shapes poorly.
   2) Euclidean distances become increasingly inflated as the number of features (dimensions) grows, which hurts convergence. A good remedy is to first reduce the number of features to an acceptable range with a dimensionality-reduction tool such as PCA, and then compute the Euclidean distances.
   3) Poorly chosen initial centroids may make the model converge to a local minimum; the parameter init='k-means++' spreads the initial centroids apart, so the model converges to a better result.
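The points above can be sketched with a minimal, self-contained run on synthetic blob data; the parameter values here are illustrative, not prescriptive.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated Gaussian blobs in 2-D.
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.6,
                       random_state=0)

# init='k-means++' spreads the initial centroids apart; n_init runs
# the algorithm several times and keeps the run with the best inertia.
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_.shape)   # one centroid per cluster: (3, 2)
print(km.inertia_)                 # within-cluster sum of squares
```

`inertia_` is the objective value the algorithm minimizes; comparing it across different `n_clusters` values is a common (if rough) way to pick k.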
sklearn.cluster.MiniBatchKMeans
from sklearn.cluster import MiniBatchKMeans
MiniBatchKMeans(n_clusters=8,init='k-means++',max_iter=100,batch_size=100,verbose=0,compute_labels=True,random_state=None,tol=0.0,max_no_improvement=10,init_size=None,n_init=3,reassignment_ratio=0.01)
#n_clusters: number of clusters;
#batch_size: size of the random subset used to fit KMeans in each step;
#compute_labels=True: apply the fit obtained on the mini-batches to the whole dataset;
#tol: change in the objective function below which convergence is declared;
Notes: 1) When the dataset is very large, Mini Batch K-Means is a good choice. 2) If the clusters in the dataset are not roughly hyperspherical (i.e., not convex), the clustering result will be poor. 3) Reference post: BIRCH clustering algorithm principles.
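A minimal sketch of the mini-batch variant on a larger synthetic set; `batch_size` and `n_init` are illustrative values, mirroring the signature above.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger dataset, where fitting on mini-batches pays off.
X, _ = make_blobs(n_samples=2000, centers=4, cluster_std=0.7,
                  random_state=0)

# Each iteration updates the centroids from a random batch_size-sized
# subset, trading a little accuracy for much lower fitting time.
mbk = MiniBatchKMeans(n_clusters=4, init='k-means++', batch_size=100,
                      n_init=3, random_state=0)
mbk.fit(X)

print(mbk.cluster_centers_.shape)  # (4, 2)
```

With `compute_labels=True` (the default), `mbk.labels_` holds a label for every sample in X, not just for the mini-batches seen during fitting.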
Clustering performance evaluation
from sklearn import metrics
metrics.adjusted_rand_score(labels_true,labels_pred)
metrics.adjusted_mutual_info_score(labels_true,labels_pred)
metrics.homogeneity_score(labels_true,labels_pred)
metrics.completeness_score(labels_true,labels_pred)
metrics.v_measure_score(labels_true,labels_pred)
metrics.homogeneity_completeness_v_measure(labels_true,labels_pred)
metrics.fowlkes_mallows_score(labels_true,labels_pred)
metrics.silhouette_score(X,labels,metric='euclidean')
metrics.calinski_harabasz_score(X,labels)
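A small sketch of the ground-truth-based metrics listed above. The key property shown here is that scores such as ARI are invariant to a permutation of the cluster label values:

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
perfect = [1, 1, 1, 0, 0, 0]   # same partition, labels swapped
noisy = [0, 0, 1, 1, 2, 2]     # splits each true cluster in two

ari_perfect = metrics.adjusted_rand_score(labels_true, perfect)
ari_noisy = metrics.adjusted_rand_score(labels_true, noisy)

print(ari_perfect)             # 1.0: relabelling does not hurt the score
print(ari_perfect > ari_noisy) # True: the split partition scores lower
```

When no ground truth is available, `silhouette_score` and `calinski_harabasz_score` evaluate the clustering from X and the predicted labels alone.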
Reference: Clustering (scikit-learn user guide)
Class 2
sklearn.cluster.Biclustering
Introduction to Biclustering
Biclustering clusters rows and columns simultaneously; each (rows, columns) cluster is called a bicluster. During clustering, the rows and columns of the data matrix are rearranged. For example, biclustering a (10, 10) data matrix may produce a (3, 2) bicluster (submatrix). sklearn.cluster.bicluster provides two biclustering classes:
SpectralBiclustering
sklearn.cluster.bicluster.SpectralBiclustering(n_clusters=3,method='bistochastic',n_components=6,n_best=3,svd_method='randomized',n_svd_vecs=None,mini_batch=False,init='k-means++',n_init=10,n_jobs=1,random_state=None)
This algorithm finds a hidden checkerboard structure of biclusters; within each checkerboard block the values are nearly constant, so the checkerboard structure provides an approximation of the original data. The checkerboard structure is shown below: ![](https://i-blog.csdnimg.cn/blog_migrate/39d9e787c19378618918a8094a404748.png)
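A sketch of recovering a planted checkerboard structure. Note that in recent scikit-learn versions the class is imported directly from `sklearn.cluster` (the `sklearn.cluster.bicluster` path used above is the older location):

```python
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard

# A 30x30 matrix with a hidden 3x3 checkerboard of blocks, plus noise.
data, rows, cols = make_checkerboard(shape=(30, 30), n_clusters=(3, 3),
                                     noise=1, random_state=0)

model = SpectralBiclustering(n_clusters=(3, 3), random_state=0)
model.fit(data)

# Every row and every column is assigned to one row/column cluster.
print(model.row_labels_.shape, model.column_labels_.shape)
```

Reordering the matrix by `row_labels_` and `column_labels_` makes the checkerboard blocks visible, which is how plots like the one above are produced.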
SpectralCoclustering
sklearn.cluster.bicluster.SpectralCoclustering(n_clusters=3,svd_method='randomized',n_svd_vecs=None,mini_batch=False,init='k-means++',n_init=10,n_jobs=1,random_state=None)
This algorithm finds a diagonal structure in which each bicluster on the diagonal corresponds to high values in the data matrix. It normalizes the graph of the data matrix and approximates the normalized cut of this graph to find heavy subgraphs. The structure is shown below: ![](https://i-blog.csdnimg.cn/blog_migrate/bde1a74416f89a1d0a334e4049e9f793.png)
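A companion sketch for the co-clustering case, using a planted block-diagonal matrix (again importing from `sklearn.cluster` as in recent versions):

```python
from sklearn.cluster import SpectralCoclustering
from sklearn.datasets import make_biclusters

# A 30x30 matrix containing 3 planted biclusters of high values.
data, rows, cols = make_biclusters(shape=(30, 30), n_clusters=3,
                                   noise=5, random_state=0)

model = SpectralCoclustering(n_clusters=3, svd_method='randomized',
                             random_state=0)
model.fit(data)

# Unlike the checkerboard model, each row and each column belongs to
# exactly one bicluster, yielding the diagonal structure.
print(model.row_labels_.shape, model.column_labels_.shape)
```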