I'm using sklearn.cluster.AgglomerativeClustering. It begins with one cluster per data point and iteratively merges the two "closest" clusters.

For 'seuclidean', use DistParameter to specify another value for S. For 'mahalanobis', use DistParameter to specify another value for C, where the matrix C is symmetric and positive definite. For 'minkowski' (Minkowski distance), the default exponent is 2. Use DistParameter to specify a different exponent P, where P is a positive scalar value of the exponent.
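To make the role of the exponent concrete, here is a minimal sketch of the Minkowski distance in plain Python. It is not the pdist implementation; the function name `minkowski` and its `p` argument are illustrative names standing in for the exponent P described above.

```python
def minkowski(x, y, p=2.0):
    """Minkowski distance between two equal-length sequences.

    p is the exponent (hypothetical stand-in for the exponent P);
    the default of 2 gives the ordinary Euclidean distance.
    """
    if p <= 0:
        raise ValueError("exponent p must be a positive scalar")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Default exponent 2 (Euclidean) versus exponent 1 (city block).
print(minkowski([0, 0], [3, 4]))
print(minkowski([0, 0], [3, 4], p=1))
```

Changing only `p` changes how strongly large coordinate differences dominate the result, which is why the exponent is exposed as a separate parameter.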

One minus the sample Spearman's rank correlation between observations (treated as sequences of values).

Custom distance function handle. ZI is a 1-by-n vector containing a single observation. ZJ is an m2-by-n matrix containing multiple observations. If your data is not sparse, you can generally compute distance more quickly by using a built-in distance instead of a function handle. For more information, see Distance Metrics.

Distance metric and distance metric option, specified as a cell array of the comma-separated pair consisting of the two input arguments Distance and DistParameter of the function pdist.

This argument is valid only for specifying 'seuclidean', 'minkowski', or 'mahalanobis'.

Flag for the 'savememory' option, specified as either 'on' or 'off'. The 'on' setting causes linkage to construct clusters without computing the distance matrix. The 'on' setting is available only when method is 'centroid', 'median', or 'ward' and metric is 'euclidean'. When value is 'on', the linkage run time is proportional to the number of dimensions (number of columns of X).

When value is 'off', the linkage memory requirement is proportional to N^2, where N is the number of observations. The best (least-time) setting to use for value depends on the problem dimensions, number of observations, and available memory. The default value setting is a rough approximation of an optimal setting.

The default is 'on' when X has 20 columns or fewer, or the computer does not have enough memory to store the distance matrix. Otherwise, the default is 'off'. Example: 'savememory','on'.

Distances, specified as a numeric vector with the same format as the output of the pdist function: distances arranged in the order (2,1), (3,1), ...

Agglomerative hierarchical cluster tree, returned as a numeric matrix.

Z is an (m - 1)-by-3 matrix, where m is the number of observations in the original data. Columns 1 and 2 of Z contain cluster indices linked in pairs to form a binary tree. The leaf nodes are numbered from 1 to m. Leaf nodes are the singleton clusters from which all higher clusters are built. The m - 1 higher clusters correspond to the interior nodes of the clustering tree. Z(I,3) contains the linkage distance between the two clusters merged in row Z(I,:).

For example, consider building a tree with 30 initial nodes. Suppose that cluster 5 and cluster 7 are combined at step 12, and that the distance between them at that step is 1. Then Z(12,:) is [5 7 1]. The newly formed cluster has index 12 + 30 = 42. If cluster 42 appears in a later row, then the function is combining the cluster created at step 12 into a larger cluster.

A linkage is the distance between two clusters.
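The tree layout described above (leaves numbered 1 to m, one row per merge, and the cluster created in row I receiving index m + I) can be sketched with a naive agglomerative loop. This is an illustrative pure-Python version using single linkage, not the linkage function itself; the name `single_linkage_tree` is hypothetical, and the quadratic search is written for clarity rather than speed.

```python
import math


def single_linkage_tree(points):
    """Naive agglomerative clustering returning rows [a, b, distance].

    Leaves carry 1-based indices 1..m; the cluster formed in row `step`
    is assigned index m + step, mirroring the Z layout described above.
    """
    m = len(points)
    clusters = {i + 1: [i] for i in range(m)}  # index -> member positions
    Z = []
    for step in range(1, m):
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                # Single linkage: smallest distance between any two members.
                d = min(math.dist(points[p], points[q])
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        Z.append([a, b, d])
        # Replace the two merged clusters with the new, higher-numbered one.
        clusters[m + step] = clusters.pop(a) + clusters.pop(b)
    return Z


for row in single_linkage_tree([(0, 0), (1, 0), (5, 0), (5, 2)]):
    print(row)
```

With four points, indices 5 and 6 in the later rows refer to the clusters created in rows 1 and 2, in the same way that cluster 42 refers to row 12 in the 30-node example.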

Cluster r is formed from clusters p and q. Single linkage, also called nearest neighbor, uses the smallest distance between objects in the two clusters. Complete linkage, also called farthest neighbor, uses the largest distance between objects in the two clusters. Average linkage uses the average distance between all pairs of objects in any two clusters.
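The three definitions above differ only in how they reduce the set of pairwise distances between two clusters. A minimal pure-Python sketch, assuming Euclidean distance between objects and using illustrative function names:

```python
import math
from itertools import product


def pair_distances(r, s):
    """Euclidean distances between every object in r and every object in s."""
    return [math.dist(x, y) for x, y in product(r, s)]


def single(r, s):
    """Nearest neighbor: smallest pairwise distance."""
    return min(pair_distances(r, s))


def complete(r, s):
    """Farthest neighbor: largest pairwise distance."""
    return max(pair_distances(r, s))


def average(r, s):
    """Mean of all pairwise distances."""
    d = pair_distances(r, s)
    return sum(d) / len(d)


r = [(0, 0), (1, 0)]
s = [(3, 0), (5, 0)]
print(single(r, s), complete(r, s), average(r, s))
```

For any pair of clusters the single value is never larger than the average, and the average is never larger than the complete value.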

Centroid linkage uses the Euclidean distance between the centroids of the two clusters. Median linkage uses the Euclidean distance between weighted centroids of the two clusters. Ward's linkage uses the incremental sum of squares, that is, the increase in the total within-cluster sum of squares as a result of joining two clusters. The within-cluster sum of squares is defined as the sum of the squares of the distances between all objects in the cluster and the centroid of the cluster.

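The sum of squares metric is equivalent to the following distance metric d(r,s), which is the formula linkage uses:

$$d(r,s) = \sqrt{\frac{2 n_r n_s}{n_r + n_s}} \, \lVert \bar{x}_r - \bar{x}_s \rVert_2$$

Here $n_r$ and $n_s$ are the numbers of objects in clusters r and s, and $\bar{x}_r$ and $\bar{x}_s$ are their centroids. In some references, Ward's linkage does not use the factor of 2 multiplying $n_r n_s$. The linkage function uses this factor so that the distance between two singleton clusters is the same as the Euclidean distance.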
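The relationship between Ward's distance and the sum of squares can be checked numerically. The sketch below assumes the form with the factor of 2, that is, sqrt(2 n_r n_s / (n_r + n_s)) times the Euclidean distance between centroids; under that assumption the squared distance equals twice the increase in the total within-cluster sum of squares. All function names are illustrative.

```python
import math


def centroid(cluster):
    """Coordinate-wise mean of the objects in a cluster."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))


def within_ss(cluster):
    """Sum of squared distances from each object to the cluster centroid."""
    mu = centroid(cluster)
    return sum(math.dist(x, mu) ** 2 for x in cluster)


def ward_distance(r, s):
    """Ward's distance, assuming the variant with the factor of 2."""
    nr, ns = len(r), len(s)
    scale = math.sqrt(2 * nr * ns / (nr + ns))
    return scale * math.dist(centroid(r), centroid(s))


r = [(0, 0), (2, 0)]
s = [(4, 3)]
increase = within_ss(r + s) - within_ss(r) - within_ss(s)
print(ward_distance(r, s) ** 2, 2 * increase)

# For two singleton clusters the value reduces to the Euclidean distance.
print(ward_distance([(0, 0)], [(3, 4)]))
```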

Weighted average linkage uses a recursive definition for the distance between two clusters. If cluster r was created by combining clusters p and q, the distance between r and another cluster s is defined as the average of the distance between p and s and the distance between q and s.

Computing linkage(y) can be slow when y is a vector representation of the distance matrix.

For the 'centroid', 'median', and 'ward' methods, linkage checks whether y is a Euclidean distance. Avoid this time-consuming check by passing in X instead of y.

The 'centroid' and 'median' methods can produce a cluster tree that is not monotonic. This result occurs when the distance from the union of two clusters, r and s, to a third cluster is less than the distance between r and s. In this case, in a dendrogram drawn with the default orientation, the path from a leaf to the root node takes some downward steps.
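A three-point example is enough to see a non-monotonic tree under centroid linkage. The sketch below is plain Python with a hypothetical helper name: it merges the closest pair, then measures the distance from the merged centroid to the remaining point. With a nearly equilateral triangle, the second merge height comes out smaller than the first.

```python
import math


def centroid_merge_heights(a, b, c):
    """Return (first, second) merge heights for three points.

    The closest pair is merged first; the second height is the distance
    from that pair's centroid to the remaining point.
    """
    candidates = [((a, b), c), ((a, c), b), ((b, c), a)]
    (p, q), rest = min(candidates, key=lambda t: math.dist(*t[0]))
    first = math.dist(p, q)
    mid = tuple((u + v) / 2 for u, v in zip(p, q))  # centroid of the pair
    second = math.dist(mid, rest)
    return first, second


first, second = centroid_merge_heights((0.0, 0.0), (2.0, 0.0), (1.0, 1.8))
print(first, second, second < first)
```

The inversion happens because the centroid of the merged pair sits closer to the third point than the two original points were to each other.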

