Workshop: Hierarchical clustering
This exercise illustrates hierarchical clustering and model-based clustering using the Iris data. We will work through the hierarchical clustering algorithm and see how it can be visualised in different ways. We will also learn how to use the agglomerative coefficient to select an appropriate linkage in hierarchical clustering. By the end of the workshop, we will understand why model-based clustering can be a stable alternative to non-model-based clustering, and you will learn how to implement the models in R.
Hierarchical clustering
The first step is to standardise the data.
iris.scaled <- scale(iris[, -5])
The next step is to compute the Euclidean distances and perform hierarchical clustering with complete linkage (other linkages such as single, average, centroid and ward.D can also be used).
# Compute distances and hierarchical clustering
dd <- dist(iris.scaled, method = "euclidean")
hc <- hclust(dd, method = "complete")
hc
##
## Call:
## hclust(d = dd, method = "complete")
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 150
We can visualise the object using a dendrogram. This will enable us to determine where to cut the tree, which in turn determines the number of clusters.
dendros <- as.dendrogram(hc)
plot(dendros, main = "Iris data - Complete linkage",
     ylab = "Height")
abline(h = 0.2, lty = 2, col = "red")
abline(h = 0.428, lty = 2, col = "blue") # point 116 and 142 are merged
Hs <- hc$height[(length(hc$height) - 4):length(hc$height)]
abline(h = Hs, col = 3, lty = 2)
The height axis on the dendrogram displays the dissimilarity between observations and/or clusters. The horizontal bars indicate the point at which two clusters/observations are merged. Thus, observations 137 and 149 are merged at a distance (height) of 0.2 (see the red line and carefully read the numbers on the x-axis). Also, observations 111 and 116 are merged at a height of about 0.428 (see the blue line).
The dissimilarity score at which clusters are merged can only increase as we run the algorithm. In other words, we can draw a proper dendrogram, in which the height of a parent is always greater than the height of its daughters. This is called the no-inversion property, and it is illustrated by the dashed green lines in the plot.
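We can check this property directly from the fitted object: hclust() stores the merge heights in merge order, so for a linkage without inversions they should never decrease. A minimal check:
# Heights should be non-decreasing for complete linkage (no inversions)
all(diff(hc$height) >= 0)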
The visualisation above can be improved by using the fviz_cluster() and fviz_dend() functions from the factoextra package (the observation numbers will be clearly visible) as follows:
library(factoextra)
fviz_cluster(list(data=iris.scaled, cluster=cutree(hc, 3)), ellipse.type = "norm")
fviz_dend(hc, k = 3, # Cut in three groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800"),
color_labels_by_k = TRUE, # color labels by groups
ggtheme = theme_gray() # Change theme
)
We can also examine the number of observations in each of the three clusters.
hcut <- cutree(hc, k = 3)
table(hcut)
## hcut
## 1 2 3
## 49 24 77
The aggregate() function can be used to compute the mean of each variable by cluster using the original data.
aggregate(iris[, -5], by=list(cluster=hcut), mean)
## cluster Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 5.016327 3.451020 1.465306 0.244898
## 2 2 5.512500 2.466667 3.866667 1.170833
## 3 3 6.472727 2.990909 5.183117 1.815584
We have used complete linkage. For practical purposes, we may want to examine which of the linkages (we learnt about three in the lecture, but there are more than three) is best for our data set. We can use the agnes function (an acronym for Agglomerative Nesting) from the cluster package to do this. It computes the so-called agglomerative coefficient (AC): for each observation, the dissimilarity to the first cluster it is merged with is divided by the dissimilarity of the final merger, and the AC is the average of one minus these ratios across all observations. The AC measures the strength of the clustering structure; values closer to 1 indicate a more balanced clustering structure, while values closer to 0 suggest less well-balanced clusters. The higher the value, the better.
library(cluster)
<- c("average","single","complete")
m names(m) <- c("average","single","complete")
# function to compute coefficient
<- function(x){
ac agnes(iris.scaled, method = x)$ac
}
We can now use the map functions from the tidyverse package, specifically map_dbl(), to obtain the agglomerative coefficients. Note that the map functions transform their input by applying a function to each element of a list or atomic vector (use Google to check the meaning of atomic vector) and returning an object of the same length as the input.
library(tidyverse)
map_dbl(m,ac)
## average single complete
## 0.9035705 0.8023794 0.9438858
The highest ac value is 94.4%, so complete linkage is the best choice among the three, and we are on course with our use of complete linkage.
We can also visualise the data using a phylogenic-like tree, as in the lecture notes. To do this, load the igraph package and combine it with the fviz_dend() function, as follows:
library(igraph)
fviz_dend(hc, k = 3, k_colors = "jco", type = "phylogenic", repel = TRUE)
We can compare the original species labels with the results from hierarchical clustering:
dd <- cbind(iris, cluster = hcut)
table(dd$Species, dd$cluster)
##
## 1 2 3
## setosa 49 1 0
## versicolor 0 21 29
## virginica 0 2 48
Overall, 78.7% correct classification is obtained from the algorithm (that is, ((49+21+48)/150)*100 = 78.7%).
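Matching cluster 1 to setosa, cluster 2 to versicolor and cluster 3 to virginica as in the table above, this figure can be reproduced from the diagonal of the table; a minimal sketch using the dd object created above:
# Correctly classified observations sit on the diagonal of the 3 x 3 table
tab <- table(dd$Species, dd$cluster)
sum(diag(tab)) / sum(tab) # approximately 0.787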
Model-based clustering
The clustering methods that we have discussed so far neither involve any sort of model nor are based on a density estimate. Model-based clustering treats the data as coming from a distribution that is a mixture of two or more clusters. Each cluster \(k\) is modelled by a normal (Gaussian) distribution, which is characterised by the parameters:
- \(\mu_{k}:\) mean vector
- \(\Sigma_{k}:\) covariance matrix
- \(p_{k}:\) the associated probability (mixing proportion) in the mixture. Each point has a probability of belonging to each cluster.
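Putting these together, the assumed density of an observation \(x\) is a weighted sum of Gaussian components, with the mixing proportions summing to one:
\[f(x) = \sum_{k=1}^{K} p_{k}\, \mathcal{N}(x \mid \mu_{k}, \Sigma_{k}), \qquad \sum_{k=1}^{K} p_{k} = 1\]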
The model parameters can be estimated using the Expectation-Maximisation (EM) algorithm, initialised by hierarchical model-based clustering. Each cluster \(k\) is centred at its mean \(\mu_{k}\), with increased density for points near the mean. The method is implemented in the mclust package in R.
Since the model is fitted by maximum likelihood estimation, criteria such as the BIC (Bayesian Information Criterion) and AIC (Akaike Information Criterion) can be used to select the number of clusters.
\[AIC = -2\log L + 2p\] \[BIC = -2\log L + \log(n)\,p\] where \(\log L\) is the model log-likelihood, \(p\) is the number of parameters and \(n\) is the number of observations. In general, we want these criteria to be small; the model with the smallest value of the criterion is selected as the best.
BIC is often used for selecting the number of clusters. If you did not use this criterion for variable selection in your Machine Learning module, please see here.
Note, however, that the BIC implemented in mclust is defined with a positive sign, so there the convention is the higher the value, the better. A model is fitted for a varying number of clusters, and the number \(k\) that gives the highest BIC value is selected for the data.
Let us use the package on the iris data set.
library(mclust)
iris.scaled <- scale(iris[, -5]) # Standardise the data
mc <- Mclust(iris.scaled, G = 3) # Model-based clustering; G = 3 requests 3 clusters
summary(mc)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3
## components:
##
## log-likelihood n df BIC ICL
## -288.5255 150 44 -797.519 -800.7391
##
## Clustering table:
## 1 2 3
## 50 45 55
The result indicates that there are 50 observations in cluster 1, 45 in cluster 2 and 55 in cluster 3.
We can examine the variables created within the object mc:
names(mc)
## [1] "call" "data" "modelName" "n"
## [5] "d" "G" "BIC" "loglik"
## [9] "df" "bic" "icl" "hypvol"
## [13] "parameters" "z" "classification" "uncertainty"
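Among these components, loglik, df, n and bic can be used to verify the sign convention mentioned earlier (mclust reports BIC as 2 log L minus the penalty, so higher is better); a quick check:
# Should reproduce mc$bic (about -797.5 for this fit)
2 * mc$loglik - mc$df * log(mc$n)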
The value of utmost importance in the created object is classification. These are the newly created cluster labels.
mc$classification
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 3 2
## [75] 2 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3
As a small task, use the table() function in R to compare the true class labels to the class labels obtained from mclust. Does this result offer any improvement over the K-means clustering results in Workshop 3?
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 45 5
## virginica 0 0 50
If you implemented this correctly, you will see that setosa is perfectly classified (50 out of 50), while 5 observations are misclassified in versicolor (45 out of 50 correct). Overall, about 97% correct classification is obtained under model-based clustering, whereas it was 88% under K-medoids and 83.3% under K-means.
You can obtain four different types of plot from the mc object that was created.
Model-based clustering plots:
1: BIC
2: classification
3: uncertainty
4: density
We can plot the classification as follows:
plot(mc, what = "classification")
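The other plot types are requested in the same way by changing the what argument; for example:
plot(mc, what = "BIC") # BIC values
plot(mc, what = "uncertainty") # classification uncertainty
plot(mc, what = "density") # estimated density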
We can identify the 5 observations that were misclassified and calculate the classification error rate as follows.
classError(mc$classification, iris[,5])
## $misclassified
## [1] 69 71 73 78 84
##
## $errorRate
## [1] 0.03333333
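If we want to inspect these observations in the original data, we can index the data frame by the misclassified row numbers; a minimal sketch:
# Rows flagged as misclassified by classError()
iris[classError(mc$classification, iris[, 5])$misclassified, ]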
Observe that the mclust output contains the comment "Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3 components". This refers to the parametrisation of the covariance matrix \(\Sigma_{k}\) used. The model options available in the mclust package are represented by identifiers including EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV and VVV. So, VVV (geometric features: volume, shape, orientation) has been used for our example. The first letter refers to volume, the second to shape and the third to orientation. E stands for "equal", V for "variable" and I for "coordinate axes". VEI, for example, implies that the clusters have variable volume, the same shape and orientation equal to the coordinate axes. EEE means that the clusters have the same volume, shape and orientation in p-dimensional space. To find out more about the model options, type ?mclustModelNames in the RStudio console.
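A convenient way to list the identifiers used by mclust for multivariate data is via mclust.options(); a minimal sketch, assuming the package is loaded:
# Model names considered by the EM step for multivariate data
mclust.options("emModelNames")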
To enforce VVV as the model option, for example, you can do the following:
mc_option <- Mclust(iris.scaled, G = 3, modelNames = "VVV")
Now, let us try to compare the performance of model options VVI, EVE and VEI to VVV.
You are required to fit the models using the three model options, create a classification table and compare the percentage of correct classification with what we obtained for VVV.
mc_option_VVI <- Mclust(iris.scaled, G = 3, modelNames = "VVI")
table(iris[, 5], mc_option_VVI$classification)
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 49 1
## virginica 0 14 36
(50+49+36)/150
## [1] 0.9
mc_option_EVE <- Mclust(iris.scaled, G = 3, modelNames = "EVE")
table(iris[, 5], mc_option_EVE$classification)
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 50 0
## virginica 0 16 34
(50+50+34)/150
## [1] 0.8933333
mc_option_VEI <- Mclust(iris.scaled, G = 3, modelNames = "VEI")
table(iris[, 5], mc_option_VEI$classification)
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 48 2
## virginica 0 4 46
(50+48+46)/150
## [1] 0.96
mc_option_VVV <- Mclust(iris.scaled, G = 3, modelNames = "VVV")
table(iris[, 5], mc_option_VVV$classification)
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 45 5
## virginica 0 0 50
(50+45+50)/150
## [1] 0.9666667
Clearly, the automatically selected model option, VVV, is the best with about 97% classification accuracy, followed by VEI at 96%. VVI achieves 90% and EVE is the worst at 89.3%, although both are still better than all the previous clustering methods (K-means, K-medoids and hierarchical).
As stated previously, one can use the BIC criterion (higher values are desirable) to select the number of clusters (G). Consider the case below:
mc_bic <- Mclust(iris.scaled, G = 3)
mc_bic$BIC
## Bayesian Information Criterion (BIC):
## EII VII EEI VEI EVI VVI EEE
## 3 -1214.525 -1222.844 -1029.733 -995.8343 -1014.515 -961.3205 -849.6448
## VEE EVE VVE EEV VEV EVV VVV
## 3 -822.076 -862.9077 -828.8731 -875.3724 -797.5483 -872.7515 -797.519
##
## Top 3 models based on the BIC criterion:
## VVV,3 VEV,3 VEE,3
## -797.5190 -797.5483 -822.0760
You will observe that the results included the top 3 model options based on the BIC criterion.
Now, let’s use the BIC criterion to select the number of clusters from the set G = {2, 3, 4, 5}.
mc_bic2 <- Mclust(iris.scaled, G = 2)
mc_bic2$BIC
## Bayesian Information Criterion (BIC):
## EII VII EEI VEI EVI VVI EEE VEE
## 2 -1344.053 -1325.273 -1259.646 -1172.96 -1223.986 -1074.229 -904.775 -873.0048
## EVE VVE EEV VEV EVV VVV
## 2 -906.9282 -880.7933 -873.659 -802.5253 -875.0084 -790.6956
##
## Top 3 models based on the BIC criterion:
## VVV,2 VEV,2 VEE,2
## -790.6956 -802.5253 -873.0048
mc_bic3 <- Mclust(iris.scaled, G = 3)
mc_bic3$BIC
## Bayesian Information Criterion (BIC):
## EII VII EEI VEI EVI VVI EEE
## 3 -1214.525 -1222.844 -1029.733 -995.8343 -1014.515 -961.3205 -849.6448
## VEE EVE VVE EEV VEV EVV VVV
## 3 -822.076 -862.9077 -828.8731 -875.3724 -797.5483 -872.7515 -797.519
##
## Top 3 models based on the BIC criterion:
## VVV,3 VEV,3 VEE,3
## -797.5190 -797.5483 -822.0760
mc_bic4 <- Mclust(iris.scaled, G = 4)
mc_bic4$BIC
## Bayesian Information Criterion (BIC):
## EII VII EEI VEI EVI VVI EEE
## 4 -1177.163 -1139.392 -1044.112 -965.1348 -1054.223 -967.7021 -862.7323
## VEE EVE VVE EEV VEV EVV VVV
## 4 -821.5149 -902.9964 -881.127 -930.1693 -821.4057 -941.9898 -847.2777
##
## Top 3 models based on the BIC criterion:
## VEV,4 VEE,4 VVV,4
## -821.4057 -821.5149 -847.2777
mc_bic5 <- Mclust(iris.scaled, G = 5)
mc_bic5$BIC
## Bayesian Information Criterion (BIC):
## EII VII EEI VEI EVI VVI EEE VEE
## 5 -1135.642 -1107.791 -958.6219 -905.0241 -983.5054 -928.1301 -821.5159 NA
## EVE VVE EEV VEV EVV VVV
## 5 -901.9539 -853.4148 -877.2201 -837.6111 NA -893.2973
##
## Top 3 models based on the BIC criterion:
## EEE,5 VEV,5 VVE,5
## -821.5159 -837.6111 -853.4148
The results show that \(G = 2\) has the highest BIC value with the model option VVV. The next is \(G = 3\). Although the best number of clusters is chosen to be two, this is not far from the three clusters that we have used for the data set (the natural subgroups for the data are indeed three).
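Rather than fitting each value of G separately, the same comparison can be done in a single call by passing a range to G; Mclust then selects the best combination of model option and number of clusters according to its BIC. A minimal sketch (the object name mc_all is just for illustration):
# Fit all model options for G = 2, ..., 5 and keep the best by BIC
mc_all <- Mclust(iris.scaled, G = 2:5)
summary(mc_all$BIC) # top models across all values of G considered
mc_all$G # selected number of clusters
mc_all$modelName # selected covariance parametrisation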