How big is big? Are you fit for real big data?
These are some questions I have been thinking about. Luckily, there is a Kaggle competition going on whose aim is to predict the click-through rate in a huge dataset of webpage visits. The task is to predict the probability that 4.5 million users will click on an advertisement. The training dataset consists of 40 million (!!!) lines of user data.
In this little series I will share my experience in trying to handle this mess.
First problem: How to load the data?
Loading the CSV file the usual way, with read.csv, takes much too long.
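For a rough baseline, the standard approach can be timed with system.time. This is just a sketch; the file name train.csv comes from the competition, and the actual timing of course depends on the machine:

# Baseline: read the competition file with base R's read.csv and measure the time
system.time(
  train <- read.csv("train.csv", colClasses = "character")
)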
The package "data.table" includes the fread function, which is much faster. By setting colClasses to "character" all columns are loaded as character class
Nevertheless, reading nearly 6 GB will still take some time...
library(data.table)
train <- fread("train.csv", colClasses = "character")
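After loading, a quick sanity check might look like the following sketch. Note that the column name "click" is an assumption based on the competition data and does not appear in the code above:

dim(train)                                  # should report roughly 40 million rows
format(object.size(train), units = "auto")  # memory footprint of the all-character table
train[, click := as.numeric(click)]         # assumed target column, converted back to numeric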