Here is the promised example of k-means clustering. It is based on Ledolter,
Johannes. 2013. Data Mining and Business Analytics with R. Hoboken, New Jersey: Wiley.
df = read.csv("http://www.biz.uiowa.edu/faculty/jledolter/DataMining/protein.csv")
View(df)

# What is the right number of clusters?
# One idea is to look at the sum of squared errors (SSE) for
# each possible number of clusters.
# Formula to calculate SSE (first entry: one single cluster):
wss <- (nrow(df)-1)*sum(apply(df[,-1],2,var))
for (i in 2:20) wss[i] <- sum(kmeans(df[,-1],centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
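To make the SSE less of a black box, here is a minimal sketch that recomputes it by hand. It runs on a small simulated matrix `x`, which is a hypothetical stand-in and not the protein data. It checks that the `(n - 1) * sum of variances` shortcut equals the SSE for one cluster, and that `withinss` sums the squared distances of each point to its own cluster centre.

```r
# Hypothetical stand-in data: 30 observations, 3 numeric variables.
set.seed(1)
x <- matrix(rnorm(90), ncol = 3)

# SSE for a single cluster: squared distances to the overall column means.
sse_one <- sum(scale(x, center = TRUE, scale = FALSE)^2)
# Same value via the shortcut: (n - 1) * sum of the column variances.
sse_shortcut <- (nrow(x) - 1) * sum(apply(x, 2, var))

# SSE for k clusters: squared distances of each point to its own centre.
km <- kmeans(x, centers = 3)
sse_manual <- sum((x - km$centers[km$cluster, ])^2)

all.equal(sse_one, sse_shortcut)         # shortcut matches the definition
all.equal(sse_manual, sum(km$withinss))  # manual SSE matches kmeans
```

Replacing `x` with `as.matrix(df[,-1])` should give the same checks on the protein data.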
# The "elbow" now indicates the optimal number of clusters. The
# idea is that beyond a certain point additional clusters do not
# reduce the SSE much.
# In this case, we might decide to take 5 clusters.
# But there are many more, and more sophisticated, ways:
# The NbClust package provides 30 indices to determine the number
# of clusters in a dataset.

# requires: NbClust
library(NbClust)
set.seed(123)
nb <- NbClust(df[,-1], method = "kmeans")
# Warning messages show that the Beale test failed.
# Too few observations?
# NbClust suggests 2 (or 3) clusters. We can plot the different
# results:
par(mfrow=c(1,1)) # NbClust had changed this parameter, so we set it back
hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
# Changing set.seed will lead to different results! There is still
# some randomness in the method.

# k-means clustering
set.seed(1234)
clus = kmeans(df[,-1], 2)
clus
o = order(clus$cluster)
data.frame(df$Country[o], clus$cluster[o])

# Plotting the clusters
plot(df$RedMeat, df$WhiteMeat, type="n", xlab="Red Meat", ylab="White Meat")
text(x=df$RedMeat, y=df$WhiteMeat, labels=df$Country, col=clus$cluster+1)
# The plot shows only two variables, but all were used
# for clustering!
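The seed dependence noted above comes from the random starting centres that k-means uses. One common way to dampen it is the `nstart` argument of `kmeans`, which runs several random starts and keeps the best one. The sketch below is only an illustration: the simulated matrix `x` and the value `nstart = 25` are assumptions, not part of the original analysis.

```r
# Hypothetical stand-in data: two loose groups in 4 dimensions.
set.seed(42)
x <- rbind(matrix(rnorm(80, mean = 0), ncol = 4),
           matrix(rnorm(80, mean = 2), ncol = 4))

# One random start per call: the resulting SSE can depend on the seed.
single <- sapply(1:10, function(s) {
  set.seed(s)
  kmeans(x, centers = 3)$tot.withinss
})

# 25 random starts per call: kmeans keeps the solution with the lowest SSE.
multi <- sapply(1:10, function(s) {
  set.seed(s)
  kmeans(x, centers = 3, nstart = 25)$tot.withinss
})

range(single)  # spread of the SSE across seeds with one start
range(multi)   # typically a narrower spread with several starts
```

On the protein data, the corresponding call would be `kmeans(df[,-1], 2, nstart = 25)`. A seed is still needed for exact reproducibility.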
# Same with 3 clusters:
clus = kmeans(df[,-1], 3)
plot(df$RedMeat, df$WhiteMeat, type="n", xlab="Red Meat", ylab="White Meat")
text(x=df$RedMeat, y=df$WhiteMeat, labels=df$Country, col=clus$cluster+1)

# A nicer way to plot (requires: cluster, which provides clusplot)
rownames(df) = df$Country
library(cluster)
clusplot(df[,-1], clus$cluster, color=TRUE, shade=FALSE, labels=2, lines=0,
         cex=.8, main="Protein Clusters")
# The plot uses the first two principal components as its
# dimensions. These are not the same as the original variables!
# What are principal components and what can I do with them?
# Wait for the next post!