#k-means #clustering

Here is the promised example of k-means clustering. It is based on Ledolter, Johannes. 2013. Data Mining and Business Analytics with R. Hoboken, New Jersey: Wiley.

df = read.csv("http://www.biz.uiowa.edu/faculty/jledolter/DataMining/protein.csv")
View(df)
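#What is the right number of clusters?
#One idea is to look at the within-cluster sum of squared errors
#(SSE) for each candidate number of clusters.
 
#SSE for a single cluster (k = 1): (n - 1) times the summed
#variances of all variables (column 1 holds the country names)
wss <- (nrow(df)-1)*sum(apply(df[,-1],2,var))
 
#SSE for k = 2, ..., 20 clusters
for (i in 2:20) wss[i] <- sum(kmeans(df[,-1],centers=i)$withinss)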
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

#The "elbow" of the curve indicates a suitable number of clusters:
#beyond a certain point, additional clusters no longer reduce
#the SSE by much.
#In this case, we might decide to take 5 clusters.
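 
#A possible refinement (a sketch, not part of the original example):
#kmeans() starts from randomly chosen centers, so the curve can
#change from run to run. Fixing the seed and using several random
#starts usually gives a more stable curve. The name wss2 and the
#value nstart=25 are arbitrary choices made for this illustration.
set.seed(123)
wss2 <- sapply(1:20, function(k)
  kmeans(df[,-1], centers=k, nstart=25)$tot.withinss)
plot(1:20, wss2, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")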
 
#There are many other, more sophisticated approaches.
#The NbClust package provides 30 indices for determining the number
#of clusters in a dataset.
 
#requires the NbClust package
library(NbClust)
set.seed(123)
nb <- NbClust(df[,-1], method = "kmeans")
#Warning messages show that the Beale test failed,
#possibly because there are too few observations.
#NbClust suggests 2 (or 3) clusters. We can plot the number of
#clusters proposed by the individual indices:
par(mfrow=c(1,1)) #NbClust changed this parameter, so we reset it
hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
#Changing the seed in set.seed() will lead to different results!
#There is still some randomness in the method.
 

#kmeans clustering
set.seed(1234)
clus = kmeans(df[,-1], 2)
clus
o=order(clus$cluster)
data.frame(df$Country[o], clus$cluster[o])
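 
#To see what distinguishes the clusters, one option (a sketch) is
#to look at the cluster sizes and at the mean of each variable
#within each cluster:
clus$size
round(clus$centers, 1)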
 
#plotting the clusters
plot(df$RedMeat, df$WhiteMeat, type="n", xlab="Red Meat", 
     ylab="White Meat")
text(x=df$RedMeat, y=df$WhiteMeat, labels=df$Country,
     col=clus$cluster+1)
#The plot shows only two variables, but all were used
#for clustering!
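 
#To get an impression of the other variables as well, one option
#(a sketch) is a scatterplot matrix of all pairs of variables,
#colored by cluster membership:
pairs(df[,-1], col=clus$cluster+1)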

  
#the same with 3 clusters:
clus = kmeans(df[,-1], 3)
plot(df$RedMeat, df$WhiteMeat, type="n", xlab="Red Meat", 
     ylab="White Meat")
text(x=df$RedMeat, y=df$WhiteMeat, labels=df$Country,
     col=clus$cluster+1)
 
#A nicer way to plot (requires the cluster package,
#which provides clusplot())
rownames(df)=df$Country
library(cluster)
clusplot(df[,-1], clus$cluster, color=TRUE, shade=FALSE, 
         labels=2, lines=0, cex=.8, main="Protein Clusters")

#The plot uses the first two principal components as 
#dimensions. These are not the same as the original variables!
#What are principal components and what can I do with them?
#Wait for the next post!
