HC Teo (15 Jul 2019)
Why clustering?
Clustering is a form of exploratory data mining that groups similar objects into clusters. In ecology, clustering environmental variables is an important tool for characterising vegetation communities for conservation (Lechner et al., 2016).
In this example, we use remotely-sensed data to identify urban ponds and lakes in the Klang Valley, Malaysia. From the remotely-sensed images, we derived landscape metrics for each of the 1261 water bodies: area (log-transformed), perimeter (log-transformed), Perimeter-Area Ratio (PAR), shape index, fractal dimension, median Red/Green band value, median Red/Blue band value and land cover within 100 m. We cluster these environmental variables to get an initial idea of what clusters of ponds there might be, to facilitate field-based sampling.
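As a concrete illustration, the shape metrics can be computed directly from each polygon's area and perimeter. A minimal sketch in R, assuming hypothetical area_m2 and perim_m columns and the common FRAGSTATS-style definitions (the exact formulas used for our metrics may differ):
#hypothetical example polygons; column names are illustrative only
ponds <- data.frame(area_m2 = c(500, 12000), perim_m = c(90, 600))
ponds$log_area  <- log10(ponds$area_m2)                            #log-transformed area
ponds$log_perim <- log10(ponds$perim_m)                            #log-transformed perimeter
ponds$par       <- ponds$perim_m / ponds$area_m2                   #Perimeter-Area Ratio
ponds$shape     <- ponds$perim_m / (2 * sqrt(pi * ponds$area_m2))  #shape index; 1 for a circle
ponds$fractal   <- 2 * log(0.25 * ponds$perim_m) / log(ponds$area_m2)  #fractal dimension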
Load packages and read data
First, we need to load the packages.
library(factoextra) #to visualise clustering data
library(ggpubr) #for plotting
library(umap) #for umap
library(dbscan) #for hdbscan
library(ggfortify) #for autoplot
library(vegan) #for decorana
Next, we read in the data.
#set seed to ensure reproducibility
set.seed(1)
#landscape metrics for each water body (area and perimeter already log-transformed)
df <- read.csv("ponds_only_log.csv")
#land cover within 100 m of each water body
tab100 <- read.csv("tabulate100.csv")
#append the four land-cover columns to the metrics
df <- cbind(df, tab100[,1:4])
#standardise all variables to zero mean and unit variance
df.scaled <- as.data.frame(scale(df))
Dimensionality reduction using PCA
Principal Components Analysis (PCA) is one of the more commonly known dimensionality reduction methods. Let’s see how it performs.
data.pca <- prcomp(df, center = TRUE, scale. = TRUE)
summary(data.pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6    PC7
## Standard deviation     1.7883 1.6470 1.3817 1.1860 1.03341 0.70296 0.2834
## Proportion of Variance 0.2907 0.2466 0.1736 0.1279 0.09708 0.04492 0.0073
## Cumulative Proportion  0.2907 0.5373 0.7109 0.8388 0.93584 0.98077 0.9881
##                            PC8     PC9    PC10     PC11
## Standard deviation     0.25785 0.20101 0.15599 0.007111
## Proportion of Variance 0.00604 0.00367 0.00221 0.000000
## Cumulative Proportion  0.99411 0.99778 1.00000 1.000000
autoplot(data.pca, data = df)
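The first two components capture only about 54% of the variance, so this flat two-dimensional PCA view discards a fair amount of structure. factoextra, loaded earlier, gives a quick scree plot of the same decomposition:
#scree plot: proportion of variance explained by each principal component
fviz_eig(data.pca)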
Dimensionality reduction using DCA
Detrended Correspondence Analysis (DCA) is an ordination method widely used by ecologists, and it works well on sparse data.
#run DCA on the variables
data.dca <- decorana(df)
#extract the scores for each water body from the decorana object
dcadf <- data.frame(data.dca[2])
#keep the first two DCA axes
dcadf2 <- dcadf[1:2]
plot(dcadf2)
Dimensionality reduction using UMAP
One of the increasingly popular dimensionality reduction algorithms in data science is Uniform Manifold Approximation and Projection (UMAP). It performs well in preserving the local and global structure of the data, and can be used as a pre-processing step prior to clustering.
Run UMAP on the data frame containing all variables. It seems to perform pretty well.
#embed all variables into two dimensions
df.umap <- umap(df, n_components = 2)
#keep the embedding coordinates as a data frame (columns V1 and V2)
df.umap <- as.data.frame(df.umap$layout)
plot(df.umap)
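The embedding above uses the package defaults. UMAP's output is shaped mainly by n_neighbors (trading local against global structure) and min_dist (how tightly points pack). A minimal sketch of overriding them through the umap package's config object, with values chosen purely for illustration:
#copy the package defaults and override the two most influential parameters
custom.config <- umap.defaults
custom.config$n_components <- 2
custom.config$n_neighbors <- 30    #larger values emphasise global structure
custom.config$min_dist <- 0.5      #larger values spread points more loosely
df.umap.alt <- umap(df, config = custom.config)
plot(as.data.frame(df.umap.alt$layout))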
Clustering using HDBSCAN
HDBSCAN is a popular density-based hierarchical clustering algorithm in data science. It automatically identifies and extracts stable clusters from the data, but it also labels some points as noise and leaves them out of every cluster. That is a strength if you want outlier detection and a weakness if you do not. Since we want every water body assigned to a cluster, HDBSCAN is not suitable here.
Just for demonstration, we will run HDBSCAN on the DCA-ordinated data. So many noise points!
#minPts sets the smallest cluster size HDBSCAN will consider
hdbcluster <- hdbscan(dcadf2, minPts = 20)
hdbcluster
## HDBSCAN clustering for 1013 objects.
## Parameters: minPts = 20
## The clustering contains 3 cluster(s) and 762 noise points.
##
##   0   1   2   3
## 762 166  44  41
##
## Available fields: cluster, minPts, cluster_scores,
## membership_prob, outlier_scores, hc
#colour by cluster assignment; noise points (cluster 0) plot in black
plot(dcadf2, col = hdbcluster$cluster + 1, pch = 20)
Clustering validation measures
Since HDBSCAN excludes noise points, we turn instead to more traditional methods that assign every point to a cluster. We will try hierarchical clustering, k-means and Partitioning Around Medoids (PAM).
Before clustering, we check the clustering validation measures to decide on the best number of clusters.
Connectivity should be minimised, while both the Dunn index and the silhouette width should be maximised. We will look at between 2 and 10 clusters.
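To make the two maximised measures concrete, here is a toy sketch with two well-separated clusters: the silhouette width comes from silhouette() in the cluster package (which clValid loads below), and the Dunn index is computed by hand from its definition, the smallest between-cluster distance divided by the largest within-cluster distance.
library(cluster)
#two well-separated toy clusters in two dimensions
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
lab <- rep(1:2, each = 10)
d.toy <- dist(toy)
#average silhouette width: close to 1 for well-separated clusters
mean(silhouette(lab, d.toy)[, "sil_width"])
#Dunn index: min between-cluster distance / max within-cluster distance
dm <- as.matrix(d.toy)
same <- outer(lab, lab, "==")
min(dm[!same]) / max(dm[same & upper.tri(dm)])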
library(clValid)
## Loading required package: cluster
# Compute clValid
clmethods <- c("hierarchical","kmeans","pam")
intern <- clValid(df.umap, nClust = 2:10,
                  clMethods = clmethods, validation = "internal", maxitems = 1200)
## Warning in clValid(df.umap, nClust = 2:10, clMethods = clmethods,
## validation = "internal", : rownames for data not specified, using
## 1:nrow(data)
# Summary
summary(intern)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10
##
## Validation Measures:
##                                   2       3       4       5       6       7       8       9      10
##
## hierarchical Connectivity    0.2361  0.4611  2.6512  4.1163  8.3048 10.9611 15.0659 18.6246 18.6246
##              Dunn            0.0275  0.0315  0.0225  0.0282  0.0324  0.0324  0.0371  0.0501  0.0501
##              Silhouette      0.4863  0.5236  0.5391  0.5079  0.4568  0.4520  0.4912  0.4764  0.4813
## kmeans       Connectivity    6.0012 12.4575 14.3837 15.4242 17.9389 31.1048 27.0258 41.4306 50.8548
##              Dunn            0.0057  0.0217  0.0158  0.0178  0.0088  0.0130  0.0056  0.0153  0.0113
##              Silhouette      0.4800  0.5524  0.5532  0.5210  0.4865  0.5162  0.5091  0.4870  0.4927
## pam          Connectivity    3.6075 12.4575  8.4659 10.2353 23.2944 33.6583 45.2889 51.0825 58.2718
##              Dunn            0.0136  0.0217  0.0056  0.0119  0.0069  0.0088  0.0102  0.0102  0.0102
##              Silhouette      0.4800  0.5524  0.5553  0.5123  0.5016  0.5121  0.5008  0.4841  0.4893
##
## Optimal Scores:
##
##              Score  Method       Clusters
## Connectivity 0.2361 hierarchical 2
## Dunn         0.0501 hierarchical 9
## Silhouette   0.5553 pam          4
plot(intern)
No single model wins on every measure: connectivity favours 2 hierarchical clusters, the Dunn index 9 hierarchical clusters, and silhouette width 4 PAM clusters. Balancing these scores against the plots, we proceed with hierarchical clustering at 6 and 8 clusters and k-means at 4 clusters (its silhouette peak).
Hierarchical clustering for 6 clusters
#compute distance between vectors
d <- dist(df.umap)
#run hierarchical clustering using Ward's method
groups <- hclust(d,method="ward.D")
#plot dendrogram, use hang to ensure that labels fall below tree
plot(groups, hang=-1,labels=FALSE)
#cut into 6 subtrees
rect.hclust(groups,6)
cut <- cutree(groups,k=6)
dfh <- df.umap
dfh$cluster <- cut
dfh$cluster <- as.factor(dfh$cluster)
write.csv(cut,file="df_hc6.csv")
Hierarchical clustering for 8 clusters
#reuse the distance matrix d computed above
#run hierarchical clustering using Ward's method
groups2 <- hclust(d,method="ward.D")
#plot dendrogram, use hang to ensure that labels fall below tree
plot(groups2, hang=-1,labels=FALSE)
#cut into 8 subtrees
rect.hclust(groups2,8)
cut2 <- cutree(groups2,k=8)
dfh2 <- df.umap
dfh2$cluster <- cut2
dfh2$cluster <- as.factor(dfh2$cluster)
write.csv(cut2,file="df_hc8.csv")
k-means clustering for 4 clusters
#k-means with 4 centres on the UMAP embedding
km <- kmeans(df.umap,centers=4)
dfk <- df.umap
dfk$cluster <- as.factor(km$cluster)
write.csv(km$cluster,file="df_km4.csv")
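PAM actually posted the best silhouette width overall (0.5553 at 4 clusters), although we do not pursue it further here. For comparison, a minimal sketch of running it on the same embedding; pam() is from the cluster package, and the output filename is hypothetical:
#PAM with 4 medoids on the UMAP embedding
pm <- pam(df.umap, k = 4)
dfp <- df.umap
dfp$cluster <- as.factor(pm$clustering)
write.csv(pm$clustering, file = "df_pam4.csv")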
Scatter plots
Export the cluster assignments, merge them with the aquaculture data in Excel, then import the result back into R to visualise as scatter plots.
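Alternatively, the join can stay in R. A minimal sketch with merge(), where aquaculture.csv, its pond_id key, and the assumption that rows are in pond order are all hypothetical:
#hypothetical file with one row per pond: pond_id, aquaculture, ...
aqua <- read.csv("aquaculture.csv")
dfh$pond_id <- seq_len(nrow(dfh))  #assumes row order matches pond ids
dfh.aqua <- merge(dfh, aqua, by="pond_id", all.x=TRUE)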
Draw the hierarchical 6-cluster plot.
p6 <- ggplot(data=dfh, aes(x=V1, y=V2, colour=cluster)) +
  geom_point(size=2.5) +
  stat_chull(aes(fill=cluster), geom='polygon', alpha=0.2) +
  labs(x="UMAP 1", y="UMAP 2") +
  ggtitle("Hierarchical 6-cluster plot")
p6
ggsave("6Hclust.jpg", plot=p6, width=7, height=6)
Draw the hierarchical 8-cluster plot.
p8 <- ggplot(data=dfh2, aes(x=V1, y=V2, colour=cluster)) +
  geom_point(size=2.5) +
  stat_chull(aes(fill=cluster), geom='polygon', alpha=0.2) +
  labs(x="UMAP 1", y="UMAP 2") +
  ggtitle("Hierarchical 8-cluster plot")
p8
ggsave("8Hclust.jpg", plot=p8, width=7, height=6)
Draw the k-means 4-cluster plot.
pk <- ggplot(data=dfk, aes(x=V1, y=V2, colour=cluster)) +
  geom_point(size=2.5) +
  stat_chull(aes(fill=cluster), geom='polygon', alpha=0.2) +
  labs(x="UMAP 1", y="UMAP 2") +
  ggtitle("k-means 4-cluster plot")
pk
ggsave("4kmeans.jpg", plot=pk, width=7, height=6)
References
Lechner, A.M., McCaffrey, N., McKenna, P., Venables, W.N., & Hunter, J.T. (2016). Ecoregionalization classification of wetlands based on a cluster analysis of environmental data. Applied Vegetation Science, 19(4), 724-735. https://doi.org/10.1111/avsc.12248