Data Mining the Soccer World Cup 2014

Predicting the results of sporting events is always fun, and an enormous challenge. This is especially true in a tournament like the soccer World Cup, where the competing teams very rarely play against each other. In soccer, luck is one of the most important factors (which means most of the available data is noise), especially in a knockout tournament. So what could be more hopeless, and more fun, than using data mining to predict the results? The codecentric guys did an excellent job of providing data on past tournaments. But the first matchday is coming soon, and we need a dataset for the current tournament to apply a model to. That is what I built.
The R script creates a new dataset with all possible pairings of the teams playing in the 2014 World Cup, and it reconstructs the current values of the codecentric features for these matches. So, if you build a model on the codecentric data, you can easily apply it to this new dataset to predict the outcomes. In the following posts, I will demonstrate some models, and hopefully my students will be inspired to build some models themselves.
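
To make the idea concrete, here is a minimal sketch on toy data: for every pairing, look up each team's most recent feature value in the match history and recompute the difference feature. The column names (`home_rating`, `away_rating`, `rating_diff`) and the helper `latest_rating` are hypothetical and only stand in for the real codecentric features.

```r
# Toy history of past matches; all column names are hypothetical,
# not the real codecentric feature names.
history <- data.frame(
  date        = as.Date(c("2013-06-01", "2013-09-10", "2014-03-05", "2014-05-30")),
  home        = c("BRA", "ESP", "HRV", "BRA"),
  away        = c("ESP", "HRV", "BRA", "HRV"),
  home_rating = c(1.0, 2.1, 0.4, 1.3),
  away_rating = c(2.0, 0.5, 1.2, 0.6),
  stringsAsFactors = FALSE
)

# Return a team's most recent rating, whether it played home or away.
latest_rating <- function(team, hist) {
  # Stack home and away appearances into one long table
  long <- data.frame(
    date   = c(hist$date, hist$date),
    team   = c(hist$home, hist$away),
    rating = c(hist$home_rating, hist$away_rating),
    stringsAsFactors = FALSE
  )
  rows <- long[long$team == team, ]
  rows$rating[which.max(as.numeric(rows$date))]
}

teams <- c("BRA", "ESP", "HRV")
pairs <- t(combn(teams, 2))   # one row per possible pairing

new_matches <- data.frame(
  home = pairs[, 1],
  away = pairs[, 2],
  stringsAsFactors = FALSE
)
new_matches$home_rating <- sapply(new_matches$home, latest_rating,
                                  hist = history, USE.NAMES = FALSE)
new_matches$away_rating <- sapply(new_matches$away, latest_rating,
                                  hist = history, USE.NAMES = FALSE)

# Difference features are recomputed from the fresh home/away values
new_matches$rating_diff <- new_matches$home_rating - new_matches$away_rating
print(new_matches)
```

With 32 teams instead of three, `combn` yields 496 pairings, so the new dataset covers every match that could possibly take place.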

#load the codecentric data from

df=read.csv2("games-with-graph-features.csv", as.is=T)
#select relevant columns
df=df[,-c(1,2,5:24,27,28,31:38, 48,49,110:132)]
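#many columns are read as characters, I don't know why...
#(assumption: wrongClass holds the indices of the character columns)
wrongClass = which(sapply(df, is.character))
for(i in wrongClass[-c(1,2)]) df[,i]=as.numeric(df[,i])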
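Teams = c("BRA","HRV","MEX","CMR","ESP","NLD","CHL","AUS",
          "COL","GRC","CIV","JPN","URY","CRI","ENG","ITA",
          "CHE","ECU","FRA","HND","ARG","BIH","IRN","NGA",
          "DEU","PRT","GHA","USA","BEL","DZA","RUS","KOR")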
#all names should be in the dataframe
Teams %in% df[,1]

#build a matrix with all combinations
TeamsWC = t(combn(Teams,2))
WC = matrix(NA,dim(TeamsWC)[1],dim(df)[2]-2)
WC = as.data.frame(WC)

#create a function to catch the latest values for the teams

#automatically grab values for "home" columns
HomeCols = which(grepl("home",colnames(WC))&!grepl("minus",colnames(WC)))
for(i in HomeCols[-c(1:3)]){

#and the same for away...
AwayCols = which(grepl("away",colnames(WC))&!grepl("minus",colnames(WC)))
for(i in AwayCols[-c(1:3)]){

#let's re-calculate the difference columns

#...and set the goals to 0, so that no NA is causing trouble...

#save and start building model!
write.csv(WC,"WorldCup2014.csv", row.names=F)
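
Once the file is written, applying a model comes down to fitting on the historical matches and calling `predict` with the new pairings as `newdata`. The following is only a sketch on simulated data: the columns `rating_diff` and `goal_diff` are hypothetical, and a plain linear model stands in for whatever model you prefer. The one real requirement is that the new dataset carries the same predictor column names as the training data.

```r
set.seed(1)

# Toy stand-in for the historical matches; column names are hypothetical.
train <- data.frame(rating_diff = rnorm(200))
train$goal_diff <- 1.5 * train$rating_diff + rnorm(200, sd = 0.5)

# Fit on past matches...
fit <- lm(goal_diff ~ rating_diff, data = train)

# ...then score the new pairings. In practice this frame would come from
# read.csv("WorldCup2014.csv"); here a small toy frame stands in for it.
new_matches <- data.frame(
  home        = c("BRA", "BRA", "ESP"),
  away        = c("ESP", "HRV", "HRV"),
  rating_diff = c(-0.8, 0.7, 1.5),
  stringsAsFactors = FALSE
)
new_matches$pred_goal_diff <- unname(predict(fit, newdata = new_matches))
print(new_matches)
```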

Here is a boxplot of my first predictions. According to them, Spain will become world champion. More on this later.

Comments welcome!


  1. I added a group variable:

    #now we create a variable for the groups
    GroupA = c("BRA","HRV","MEX","CMR")
    GroupB = c("ESP","NLD","CHL","AUS")
    GroupC = c("COL","GRC","CIV","JPN")
    GroupD = c("URY","CRI","ENG","ITA")
    GroupE = c("CHE","ECU","FRA","HND")
    GroupF = c("ARG","BIH","IRN","NGA")
    GroupG = c("DEU","PRT","GHA","USA")
    GroupH = c("BEL","DZA","RUS","KOR")

    #return the group letter for a team code
    getGroup = function(x){
      return(ifelse(any(GroupA %in% x), "A",
             ifelse(any(GroupB %in% x), "B",
             ifelse(any(GroupC %in% x), "C",
             ifelse(any(GroupD %in% x), "D",
             ifelse(any(GroupE %in% x), "E",
             ifelse(any(GroupF %in% x), "F",
             ifelse(any(GroupG %in% x), "G","H"))))))))
    }

    WC$Group1 = sapply(as.character(WC[,1]), getGroup)
    WC$Group2 = sapply(as.character(WC[,2]), getGroup)
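
A possible alternative to the nested `ifelse` calls is a named lookup vector, which also makes it easy to flag the pairings that actually occur in the group stage. This is just a sketch: the names `team_group`, `Team1`, `Team2` and `GroupStage` are made up for illustration, and the small `WC` frame below is a toy stand-in for the real one.

```r
groups <- list(
  A = c("BRA","HRV","MEX","CMR"),
  B = c("ESP","NLD","CHL","AUS"),
  C = c("COL","GRC","CIV","JPN"),
  D = c("URY","CRI","ENG","ITA"),
  E = c("CHE","ECU","FRA","HND"),
  F = c("ARG","BIH","IRN","NGA"),
  G = c("DEU","PRT","GHA","USA"),
  H = c("BEL","DZA","RUS","KOR")
)

# Named vector mapping team code -> group letter
team_group <- setNames(rep(names(groups), lengths(groups)),
                       unlist(groups, use.names = FALSE))

# Toy pairings standing in for the first two columns of WC
WC <- data.frame(Team1 = c("BRA", "BRA", "DEU"),
                 Team2 = c("HRV", "ESP", "USA"),
                 stringsAsFactors = FALSE)

WC$Group1 <- unname(team_group[WC$Team1])
WC$Group2 <- unname(team_group[WC$Team2])

# Two teams meet in the group stage only if they share a group
WC$GroupStage <- WC$Group1 == WC$Group2
print(WC)
```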


