DataMining the Soccer World Cup 2014

Predicting the results of sports events is always fun, and an enormous challenge. This is especially true for a tournament like the soccer World Cup, where the competing teams very rarely play against each other. In soccer, luck is one of the most important factors (which means most of the available data is noise), especially in a knockout tournament. So what could be more hopeless, and more fun, than using data mining to predict the results? The codecentric guys did an excellent job of providing data on past tournaments. But the first matchday is coming soon, and to apply a model we need a dataset for the current tournament. So I built one.

The R script creates a new dataset with all possible pairings of teams in the 2014 World Cup, and it reconstructs the current values of the codecentric features for these matches. So if you build a model on the codecentric data, you can apply it to this new dataset to predict the outcomes. In the following posts, I will demonstrate some models, and hopefully my students will be inspired to build some models themselves.
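To make the two steps concrete, here is a minimal sketch on a made-up match history. The data frame `hist`, its column names, and the helper `latest` are hypothetical and not part of the codecentric data; the actual script works on column positions instead of names.

```r
# Toy match history, oldest row first (all values are made up for illustration)
hist <- data.frame(
  home = c("AAA", "BBB", "AAA", "CCC"),
  away = c("BBB", "CCC", "CCC", "AAA"),
  home_rating = c(1.0, 2.0, 1.5, 3.0),
  stringsAsFactors = FALSE
)

# Step 1: every unordered pairing of the teams, one row per pairing
teams <- c("AAA", "BBB", "CCC")
pairs <- t(combn(teams, 2))
n_pairs <- nrow(pairs)   # equals choose(length(teams), 2)

# Step 2: the most recent value of a feature for a team is taken from
# the last row in which that team appears in the given column
latest <- function(data, team_col, team, value_col) {
  rows <- which(data[[team_col]] == team)
  if (length(rows) == 0) return(NA)   # team never appears in that role
  data[max(rows), value_col]
}

# AAA's last home game is row 3, so its rating is read from there
aaa_rating <- latest(hist, "home", "AAA", "home_rating")
```

With the 32 World Cup teams, the same construction yields choose(32, 2) = 496 pairings. Note that taking the last matching row only gives the "latest" value if the rows are in chronological order.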

#load the codecentric data from
#https://raw.githubusercontent.com/codecentric/soccer-prediction-2014/master/3-compute-features/2-league-based/output/soccerDataWithFeatures.csv
#note: the file name below differs from the one in the URL above;
#point read.csv2 at your local copy

df=read.csv2("games-with-graph-features.csv", as.is=TRUE)
#select relevant columns
df=df[,-c(1,2,5:24,27,28,31:38, 48,49,110:132)]
colnames(df)
#many columns are read as character, I don't know why...
wrongClass=as.numeric(which(sapply(df,class)=="character"))
for(i in wrongClass[-c(1,2)]) df[,i]=as.numeric(df[,i])
Teams = c("BRA","HRV","MEX","CMR","ESP","NLD","CHL","AUS",
          "COL","GRC","CIV","JPN","URY","CRI","ENG","ITA",
          "CHE","ECU","FRA","HND","ARG","BIH","IRN","NGA",
          "DEU","PRT","GHA","USA","BEL","DZA","RUS","KOR")
#all names should be in the dataframe
Teams %in% df[,1]

#build a matrix with all combinations
TeamsWC = t(combn(Teams,2))
WC = matrix(NA,dim(TeamsWC)[1],dim(df)[2]-2)
WC = as.data.frame(WC)
WC=cbind(TeamsWC,WC)
colnames(WC)=colnames(df)
WC[,1:2]=TeamsWC[,1:2]

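#create a function to fetch the latest values for the teams
#ha: column holding the team name (1 = home, 2 = away)
#team: team code, value: column of the feature to look up
#assumes the rows of df are in chronological order
getValue=function(ha,team,value){
  df[max(which(df[,ha]==team)),value]
}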

#automatically grab the values for the "home" columns
HomeCols = which(grepl("home",colnames(WC))&!grepl("minus",colnames(WC)))
for(i in HomeCols[-c(1:3)]){
  WC[,i]=mapply(getValue,1,WC[,1],i)
}

#and the same for away...
AwayCols = which(grepl("away",colnames(WC))&!grepl("minus",colnames(WC)))
for(i in AwayCols[-c(1:3)]){
  WC[,i]=mapply(getValue,2,WC[,2],i)
}

#let's recalculate the difference columns
WC[,11]=WC[,7]-WC[,9]
WC[,12]=WC[,8]-WC[,10]
WC[,15]=WC[,13]-WC[,14]
WC[,seq(18,75,3)]=WC[,seq(16,73,3)]-WC[,seq(17,74,3)]

#...and set the goals to 0, so that no NA is causing trouble...
WC[,c(3:6)]=0
View(WC)

#save and start building a model!
write.csv(WC,"WorldCup2014.csv", row.names=FALSE)
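Once the file is saved, a model trained on the historical data can be applied to it with `predict`. The sketch below uses simulated data and a plain linear regression purely for illustration; the column names `rating_diff` and `goal_diff` are hypothetical, and this is not necessarily the kind of model used in the follow-up posts.

```r
set.seed(1)

# Hypothetical training data standing in for the historical matches:
# one feature (a rating difference) and one outcome (a goal difference)
train <- data.frame(rating_diff = rnorm(200))
train$goal_diff <- 0.8 * train$rating_diff + rnorm(200)

# Fit the model on the historical matches
fit <- lm(goal_diff ~ rating_diff, data = train)

# Hypothetical stand-in for the new dataset: same feature column,
# outcome unknown (team codes are placeholders)
new_matches <- data.frame(
  home = c("AAA", "AAA"),
  away = c("BBB", "CCC"),
  rating_diff = c(1.2, -0.4),
  stringsAsFactors = FALSE
)

# predict() only needs the columns used in the model formula;
# extra columns in newdata are ignored
new_matches$pred_goal_diff <- predict(fit, newdata = new_matches)
```

The key requirement is that the new dataset carries the same feature columns, with the same names and types, as the data the model was trained on.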

And here is a boxplot of my first predictions. According to the model, Spain will become world champion. More on this later.

Comments welcome!

Comments

Simon Hegelich, 13 June 2014 at 11:11
I added a group variable:

#now we create a variable for the groups
GroupA = c("BRA","HRV","MEX","CMR")
GroupB = c("ESP","NLD","CHL","AUS")
GroupC = c("COL","GRC","CIV","JPN")
GroupD = c("URY","CRI","ENG","ITA")
GroupE = c("CHE","ECU","FRA","HND")
GroupF = c("ARG","BIH","IRN","NGA")
GroupG = c("DEU","PRT","GHA","USA")
GroupH = c("BEL","DZA","RUS","KOR")

getGroup = function(x){
  return(ifelse(any(GroupA %in% x), "A",
         ifelse(any(GroupB %in% x), "B",
         ifelse(any(GroupC %in% x), "C",
         ifelse(any(GroupD %in% x), "D",
         ifelse(any(GroupE %in% x), "E",
         ifelse(any(GroupF %in% x), "F",
         ifelse(any(GroupG %in% x), "G","H"))))))))
}

WC$Group1 = sapply(as.character(WC[,1]), getGroup)
WC$Group2 = sapply(as.character(WC[,2]), getGroup)
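As an alternative to the nested `ifelse` calls, the same mapping can be sketched as a named lookup vector. This is only one possible variant, not the code used above; the names `groups`, `team_group`, and `WC_toy` and the columns `home`, `away`, and `group_stage` are hypothetical.

```r
# Group membership as a named list (same teams as in the vectors above)
groups <- list(
  A = c("BRA","HRV","MEX","CMR"), B = c("ESP","NLD","CHL","AUS"),
  C = c("COL","GRC","CIV","JPN"), D = c("URY","CRI","ENG","ITA"),
  E = c("CHE","ECU","FRA","HND"), F = c("ARG","BIH","IRN","NGA"),
  G = c("DEU","PRT","GHA","USA"), H = c("BEL","DZA","RUS","KOR")
)

# Named vector: team code -> group letter
team_group <- setNames(rep(names(groups), lengths(groups)),
                       unlist(groups, use.names = FALSE))

# Small stand-in for the first two columns of WC
WC_toy <- data.frame(home = c("BRA","BRA","ESP"),
                     away = c("HRV","ESP","NLD"),
                     stringsAsFactors = FALSE)

# Vectorized lookup, no sapply needed
WC_toy$Group1 <- unname(team_group[WC_toy$home])
WC_toy$Group2 <- unname(team_group[WC_toy$away])

# Two teams from the same group meet in the group stage
WC_toy$group_stage <- WC_toy$Group1 == WC_toy$Group2
```

Comparing the two group columns in this way marks the group-stage matches: with 8 groups of 4 teams, 8 * choose(4, 2) = 48 of the 496 pairings.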

Political Data Science
