Predicting the results of sports events is always fun, and
an enormous challenge, especially in a tournament like the soccer World Cup:
the teams that compete rarely play against each other, and in soccer luck is one
of the most important factors (which means most of the available data
represents noise), especially in a knockout tournament. So what could be
more hopeless, and more fun, than using data mining to predict the results? The
codecentric guys did an excellent job of providing data on past tournaments.
But matchday is coming soon, and we need a dataset for the current tournament to
apply a model to. So I built one.
The R script creates a new dataset with all possible
pairings of teams in the 2014 World Cup, and it
reconstructs the current values of the codecentric features for these matches.
So, if you build a model on the codecentric data, you can easily apply it to this new
dataset to predict the outcomes. In the following posts, I will demonstrate some
models, and hopefully my students will be inspired to build some models
themselves.
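To make the "all possible pairings" step concrete, here is a minimal sketch on a toy vector of four teams. The names `toyTeams` and `toyPairs` are made up for this illustration and are not part of the script below.

```r
# Toy illustration: every unordered pairing of four teams
toyTeams <- c("BRA", "HRV", "MEX", "CMR")

# combn() returns one pairing per column; t() turns the pairings into rows
toyPairs <- t(combn(toyTeams, 2))

nrow(toyPairs)   # choose(4, 2) = 6 pairings
choose(32, 2)    # 496 pairings for the full field of 32 teams
```

Each pairing appears only once, so the team listed first ends up in the first ("home") column of the new dataset.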
#load the codecentric data from
#https://raw.githubusercontent.com/codecentric/soccer-prediction-2014/master/3-compute-features/2-league-based/output/soccerDataWithFeatures.csv
df=read.csv2("games-with-graph-features.csv", as.is=TRUE)
#select relevant columns
df=df[,-c(1,2,5:24,27,28,31:38, 48,49,110:132)]
colnames(df)
#many columns are read as character, I don't know why...
#convert them to numeric, skipping the first two character columns (the team names)
wrongClass=as.numeric(which(sapply(df,class)=="character"))
for(i in wrongClass[-c(1,2)]) df[,i]=as.numeric(df[,i])
Teams = c("BRA","HRV","MEX","CMR","ESP","NLD","CHL","AUS",
"COL","GRC","CIV","JPN","URY","CRI","ENG","ITA",
"CHE","ECU","FRA","HND","ARG","BIH","IRN","NGA",
"DEU","PRT","GHA","USA","BEL","DZA","RUS","KOR")
#all names should be in the dataframe
Teams %in% df[,1]
#build a matrix with all combinations
TeamsWC = t(combn(Teams,2))
WC = matrix(NA,dim(TeamsWC)[1],dim(df)[2]-2)
WC = as.data.frame(WC)
WC=cbind(TeamsWC,WC)
colnames(WC)=colnames(df)
WC[,1:2]=TeamsWC[,1:2]
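#create a function to fetch the latest value for a team:
#ha is the team column in df (1 = home, 2 = away), value is the feature column;
#"latest" means the last row in which the team appears, which assumes df is in chronological order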
getValue=function(ha,team,value){
df[max(which(df[,ha]==team)),value]
}
#automatically grab the values for the "home" columns
HomeCols = which(grepl("home",colnames(WC))&!grepl("minus",colnames(WC)))
for(i in HomeCols[-c(1:3)]){
WC[,i]=mapply(getValue,1,WC[,1],i)
}
#and the same for away...
AwayCols = which(grepl("away",colnames(WC))&!grepl("minus",colnames(WC)))
for(i in AwayCols[-c(1:3)]){
WC[,i]=mapply(getValue,2,WC[,2],i)
}
#let's recalculate the difference columns
WC[,11]=WC[,7]-WC[,9]
WC[,12]=WC[,8]-WC[,10]
WC[,15]=WC[,13]-WC[,14]
WC[,seq(18,75,3)]=WC[,seq(16,73,3)]-WC[,seq(17,74,3)]
#...and set the goals to 0, so that no NA is causing trouble...
WC[,c(3:6)]=0
View(WC)
#save and start building model!
write.csv(WC,"WorldCup2014.csv", row.names=F)
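The workflow "fit on past matches, predict on the new dataset" might look roughly like the sketch below. Everything in it is a stand-in: the toy data frame `past`, the column names `strength_diff` and `goal_diff`, and the choice of a plain linear model are assumptions for illustration only, not the codecentric features and not the model behind my predictions.

```r
# Toy stand-in for historical matches (NOT the codecentric data)
set.seed(1)
past <- data.frame(strength_diff = rnorm(50))
past$goal_diff <- 0.8 * past$strength_diff + rnorm(50)

# Fit on past matches ...
fit <- lm(goal_diff ~ strength_diff, data = past)

# ... and predict for new pairings that carry the same feature column
newMatches <- data.frame(strength_diff = c(-1, 0, 1))
pred <- predict(fit, newdata = newMatches)
pred
```

With the real data, `past` would be the cleaned `df` and `newMatches` would be `WC`. `predict()` needs the same feature column names in both, which is why the script copies `colnames(df)` onto `WC`.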
And here is a boxplot of my first predictions. As you can see, Spain will become world champion. More on this later.
Comments welcome!
I added a group variable:
#now we create a variable for the groups
GroupA = c("BRA","HRV","MEX","CMR")
GroupB = c("ESP","NLD","CHL","AUS")
GroupC = c("COL","GRC","CIV","JPN")
GroupD = c("URY","CRI","ENG","ITA")
GroupE = c("CHE","ECU","FRA","HND")
GroupF = c("ARG","BIH","IRN","NGA")
GroupG = c("DEU","PRT","GHA","USA")
GroupH = c("BEL","DZA","RUS","KOR")
getGroup = function(x){
return(ifelse(any(GroupA %in% x), "A",
ifelse(any(GroupB %in% x), "B",
ifelse(any(GroupC %in% x), "C",
ifelse(any(GroupD %in% x), "D",
ifelse(any(GroupE %in% x), "E",
ifelse(any(GroupF %in% x), "F",
ifelse(any(GroupG %in% x), "G","H"))))))))
}
WC$Group1 = sapply(as.character(WC[,1]), getGroup)
WC$Group2 = sapply(as.character(WC[,2]), getGroup)
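The nested `ifelse()` could also be written as a lookup table. A possible sketch, assuming the same eight groups as above; `groups` and `groupOf` are names introduced here for illustration:

```r
# Same eight groups as above, collected in one named list
groups <- list(
  A = c("BRA","HRV","MEX","CMR"), B = c("ESP","NLD","CHL","AUS"),
  C = c("COL","GRC","CIV","JPN"), D = c("URY","CRI","ENG","ITA"),
  E = c("CHE","ECU","FRA","HND"), F = c("ARG","BIH","IRN","NGA"),
  G = c("DEU","PRT","GHA","USA"), H = c("BEL","DZA","RUS","KOR")
)

# Named vector: team code -> group letter
groupOf <- setNames(rep(names(groups), lengths(groups)),
                    unlist(groups, use.names = FALSE))

unname(groupOf[c("BRA", "DEU", "KOR")])   # "A" "G" "H"

# Usage with the dataset from the post:
# WC$Group1 = groupOf[as.character(WC[,1])]
# WC$Group2 = groupOf[as.character(WC[,2])]
```

One difference in behavior: an unknown team code gives `NA` here, whereas `getGroup` falls through to "H" for anything that is not in groups A to G.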