
#k-means #clustering



Here is the promised example of k-means clustering. It is based on Ledolter, Johannes. 2013. Business analytics and data mining with R. Hoboken, New Jersey: Wiley.


df = read.csv("http://www.biz.uiowa.edu/faculty/jledolter/DataMining/protein.csv")
View(df)
#What is the right number of clusters?
#One idea is to look at the sum of squared errors (SSE) for
#each possible number of clusters.
 
#SSE for a single cluster (the total sum of squares);
#column 1 holds the country names and is left out:
wss <- (nrow(df)-1)*sum(apply(df[,-1],2,var))
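#Side check, as a sketch on made-up toy data (not the protein data):
#the formula above is the total sum of squares around the overall
#mean, which kmeans() also reports as $totss. The names "toy" and
#"toy_total" are only illustrative. The last line should return TRUE.
set.seed(1)
toy <- matrix(rnorm(40), ncol = 2)
toy_total <- (nrow(toy) - 1) * sum(apply(toy, 2, var))
all.equal(toy_total, kmeans(toy, centers = 2)$totss)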
 
for (i in 2:20) wss[i] <- sum(kmeans(df[,-1],centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares") 
#The "elbow" in the plot indicates a suitable number of clusters. The
#idea is that beyond a certain point additional clusters do not
#reduce the SSE much.
#In this case, we might decide to take 5 clusters.
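#Because kmeans() starts from random centers, the curve can look
#different from run to run. A sketch of a steadier variant, assuming
#made-up toy data with three groups (the names "toy_x" and "toy_wss"
#are illustrative): nstart = 25 tries 25 random starts per k and
#keeps the best solution.
set.seed(42)
toy_x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
               matrix(rnorm(40, mean = 5), ncol = 2),
               matrix(rnorm(40, mean = 10), ncol = 2))
toy_wss <- sapply(2:8, function(k)
  kmeans(toy_x, centers = k, nstart = 25)$tot.withinss)
plot(2:8, toy_wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")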
 
#There are many more, and more sophisticated, ways to choose:
#the NbClust package provides 30 indices to determine the number
#of clusters in a dataset.
 
#requires :: NbClust
library(NbClust)
set.seed(123)
nb <- NbClust(df[,-1], method = "kmeans")
#Warning messages show that the Beale test failed.
#Too few observations?
#NbClust suggests 2 (or 3) clusters. We can plot the different
#results:
par(mfrow=c(1,1)) #NbClust had changed this parameter, so
#we set it back...
hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
#Changing the seed in set.seed() will lead to different results!
#There is still some randomness in the method.
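#A small sketch of that randomness, on made-up toy data (the names
#"toy_r", "run1" and "run2" are illustrative): with the same seed,
#kmeans() reproduces the same partition, so the last line should
#return TRUE. A different seed is free to give another partition.
set.seed(1)
toy_r <- matrix(rnorm(100), ncol = 2)
set.seed(123)
run1 <- kmeans(toy_r, centers = 3)
set.seed(123)
run2 <- kmeans(toy_r, centers = 3)
identical(run1$cluster, run2$cluster)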
 
#kmeans clustering
set.seed(1234)
clus = kmeans(df[,-1], 2)
clus
o=order(clus$cluster)
data.frame(df$Country[o], clus$cluster[o])
 
#plotting the cluster
plot(df$RedMeat, df$WhiteMeat, type="n", xlab="Red Meat", ylab="White Meat")
text(x=df$RedMeat, y=df$WhiteMeat, labels=df$Country, col=clus$cluster+1)
#The plot shows only two variables, but all were used
#for clustering!
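#To read a k-means result, the cluster sizes and the cluster means
#are usually the first things to look at. A sketch on made-up toy
#data ("toy_km" is an illustrative name, not part of the example):
set.seed(1)
toy_km <- kmeans(matrix(rnorm(60), ncol = 3), centers = 2, nstart = 10)
toy_km$size     #number of observations in each cluster
toy_km$centers  #mean of every variable within each cluster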
  
#same with 3 clusters:
clus = kmeans(df[,-1], 3)
plot(df$RedMeat, df$WhiteMeat, type="n", xlab="Red Meat", 
     ylab="White Meat")
text(x=df$RedMeat, y=df$WhiteMeat, labels=df$Country,
     col=clus$cluster+1)
 
#A nicer way to plot (requires::fpc, cluster)
rownames(df)=df$Country
library(fpc)
library(cluster) #clusplot() is provided by the cluster package
clusplot(df[,-1], clus$cluster, color=TRUE, shade=FALSE, 
         labels=2, lines=0, cex=.8, main="Protein Clusters") 
 
#The plot takes the first two principal components as
#dimensions. These are not the same as the original variables!
#What are principal components and what can I do with them?
#Wait for the next post!
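#Until then, a rough sketch of the idea on made-up toy data (all
#names starting with "toy_" are illustrative): prcomp() computes the
#principal components, and plotting the first two with cluster
#colors gives a picture of the same kind. This only approximates
#what clusplot() draws and is not meant as its exact method.
set.seed(1)
toy_p <- matrix(rnorm(90), ncol = 3)
toy_cl <- kmeans(toy_p, centers = 3, nstart = 10)$cluster
toy_pc <- prcomp(toy_p, scale. = TRUE)
plot(toy_pc$x[, 1], toy_pc$x[, 2], col = toy_cl + 1,
     xlab = "PC 1", ylab = "PC 2")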