### K

2020-01-25 03:55

The k-means algorithm:

http://mahout.apache.org/users/clustering/k-means-clustering.html

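The k-means algorithm takes an input parameter k and partitions n data objects into k clusters such that objects within the same cluster are highly similar, while objects in different clusters have low similarity. Cluster similarity is computed with respect to a "center object" (the centroid, or center of gravity) obtained as the mean of the objects in each cluster.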
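As a hedged formalization of the "center object" idea: the symbols $C_j$ (the set of objects in cluster $j$) and $\mu_j$ (its center), and the choice of Euclidean distance, are notation assumed here for illustration and are not taken from the source. The center is the mean of the cluster's members, and an object's similarity to a cluster is judged by its distance to that center:

$$
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i, \qquad d(x_i, \mu_j) = \lVert x_i - \mu_j \rVert_2
$$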

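The algorithm works as follows. First, k objects are chosen arbitrarily from the n data objects as the initial cluster centers. Each remaining object is then assigned to the cluster whose center it is most similar to, with similarity measured by distance. Next, the center of each resulting cluster is recomputed as the mean of all objects in that cluster. This process repeats until the criterion function converges. The mean squared error is typically used as the criterion function. The resulting k clusters are as compact as possible internally and as well separated from one another as possible.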
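The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Mahout or Matlab implementation: the names `kmeans`, `squared_distance`, `mean_point`, `max_iter` and `seed` are hypothetical, squared Euclidean distance is assumed as the dissimilarity measure, and the loop stops when cluster assignments stop changing rather than by monitoring the criterion function directly.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of tuples) into k groups.

    Returns (labels, centers, sse), where sse is the within-cluster
    sum of squared distances to the assigned centers.
    """
    rng = random.Random(seed)
    # Pick k of the data objects as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean_point(members)
    sse = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centers, sse = kmeans(data, k=2)
    print(labels, centers, sse)
```

Fixing `seed` makes a run reproducible. In practice the result depends on the initial centers, so several restarts with different seeds are commonly compared by their `sse`.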

k-Means is a simple but well-known algorithm for grouping objects (clustering). All objects need to be represented as a set of numerical features. In addition, the user has to specify the number of groups (referred to as k) to identify.

Each object can be thought of as being represented by some feature vector in an n-dimensional space, n being the number of all features used to describe the objects to cluster. The algorithm then randomly chooses k points in that vector space; these points serve as the initial centers of the clusters. Afterwards, each object is assigned to the center it is closest to. Usually the distance measure is chosen by the user and determined by the learning task.

After that, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centers is repeated until the process converges. The algorithm can be proven to converge after a finite number of iterations.

Several tweaks concerning distance measure, initial center choice and computation of new average centers have been explored, as well as the estimation of the number of clusters k. Yet the main principle always remains the same.

Here is a short shell script outline that will get you started quickly with k-means. This does the following:

• Accepts a clustering type: (1) kmeans, (2) fuzzykmeans, (3) lda, or (4) streamingkmeans
• Gets the Reuters dataset
• Runs `org.apache.lucene.benchmark.utils.ExtractReuters` to generate the extracted output files (the script does this when run directly)
• Runs seqdirectory to convert reuters-out to SequenceFile format
• Runs seq2sparse to convert SequenceFiles to sparse vector format
• Runs k-means with 20 clusters
• Runs clusterdump to show results

After following through the output that scrolls past, reading the code will offer you a better understanding.

Here is a supplementary Matlab implementation:

function [cid,nr,centers] = cskmeans(x,k,nc)
% CSKMEANS K-Means clustering - general method.
%
% This implements the more general k-means algorithm, where
% HMEANS is used to find the initial partition and then each
% observation is examined for further improvements in minimizing
% the within-group sum of squares.
%
% [CID,NR,CENTERS] = CSKMEANS(X,K,NC) Performs K-means
% clustering using the data given in X.
%
% INPUTS: X is the n x d matrix of data,
% where each row indicates an observation. K indicates
% the number of desired clusters. NC is a k x d matrix for the
% initial cluster centers. If NC is not specified, then the
% centers will be randomly chosen from the observations.
%
% OUTPUTS: CID provides a set of n indexes indicating cluster
% membership for each point. NR is the number of observations
% in each cluster. CENTERS is a matrix, where each row
% corresponds to a cluster center.
%
% W. L. and A. R. Martinez, 9/15/01
% Computational Statistics Toolbox

warning off
[n,d] = size(x);
if nargin < 3
    % Then pick some observations to be the cluster centers.
    ind = ceil(n*rand(1,k));
    % We will add some noise to make it interesting.
    nc = x(ind,:) + randn(k,d);
end
% set up storage
% integer 1,...,k indicating cluster membership
cid = zeros(1,n);
% Make this different to get the loop started.
oldcid = ones(1,n);
% The number in each cluster.
nr = zeros(1,k);
% Set up maximum number of iterations.
maxiter = 100;
iter = 1;
while ~isequal(cid,oldcid) && iter < maxiter
    % Implement the hmeans algorithm