In this post, the k-nearest-neighbors algorithm is used to classify Twitter data.
We have 1000 tweets about fracking, labeled "c" = contra fracking, "p" = pro fracking, "n" = neutral to fracking, and "nr" = not related to fracking.
The data can be found here.
The data is already cleaned: missing values have been coded as -9999, then all numeric variables have been normalized with this R-function:
normalize = function(x) (x-min(x))/(max(x)-min(x))
Dates are transformed as well, as can be seen here.
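To apply this normalization to the whole data set, something like the following could be used (a sketch; df stands for any data frame holding the numeric meta-data columns):
#normalize every numeric column of a data frame (sketch)
num_cols = sapply(df, is.numeric)
df[num_cols] = lapply(df[num_cols], normalize)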
The k-nearest-neighbors algorithm is a "lazy" learner. It makes no assumptions about the underlying distributions (a non-parametric approach!). It simply calculates distances in the feature space. In that sense it does not really "learn" anything, because it never builds an abstraction of the data.
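To make the "distances in the feature space" idea concrete, here is a minimal hand-rolled version for a single test point (purely illustrative; train_x is a matrix of training features, train_y their labels, test_x one new observation):
#toy version of what knn() does internally (sketch)
knn_one = function(train_x, train_y, test_x, k=3) {
  #Euclidean distance from the test point to every training point
  d = sqrt(rowSums((train_x - matrix(test_x, nrow(train_x), length(test_x), byrow=TRUE))^2))
  nearest = order(d)[1:k]
  #majority vote among the k nearest labels
  names(which.max(table(train_y[nearest])))
}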
We want to test whether k-nearest-neighbors can classify the tweets from the meta-data alone (without looking at the text...).
We used half of the data for training and the other half for testing. Here is the R-script:
#load the data (of course it has to be in the working directory...)
dfn=read.csv("FrackTweets.csv")
#remove the text-column
dfn=dfn[,-1]
#create 500 random numbers to allocate test and training set
samp=sample(1000,500)
#and now run the k-nearest-neighbors
library(class)
bof=knn(train=dfn[samp,-dim(dfn)[2]], test=dfn[-samp,-dim(dfn)[2]], cl=dfn$codes[samp], k=32)
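The value k=32 is a tuning choice (the -dim(dfn)[2] in the call drops the last column, which holds the codes, so only the meta-data goes into the distance calculation). A quick, rough way to compare a few candidates for k is to re-run knn() and look at the share of correct test predictions (a sketch reusing dfn and samp from above):
#compare a few values of k by test-set accuracy (sketch)
for (k in c(1, 5, 15, 32, 50)) {
  pred = knn(train=dfn[samp,-dim(dfn)[2]], test=dfn[-samp,-dim(dfn)[2]], cl=dfn$codes[samp], k=k)
  cat("k =", k, " accuracy =", mean(pred == dfn$codes[-samp]), "\n")
}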
How good is this classification?
The package gmodels has a nice function for cross-tables:
library(gmodels)
CrossTable(x=dfn$codes[-samp], y=bof, prop.chisq=F)
Total Observations in Table:  500

                  | bof
 dfn$codes[-samp] |         c |         n |        nr | Row Total |
------------------|-----------|-----------|-----------|-----------|
                c |       216 |        11 |         3 |       230 |
                  |     0.939 |     0.048 |     0.013 |     0.460 |
                  |     0.457 |     0.550 |     0.429 |           |
                  |     0.432 |     0.022 |     0.006 |           |
------------------|-----------|-----------|-----------|-----------|
                n |       114 |         5 |         0 |       119 |
                  |     0.958 |     0.042 |     0.000 |     0.238 |
                  |     0.241 |     0.250 |     0.000 |           |
                  |     0.228 |     0.010 |     0.000 |           |
------------------|-----------|-----------|-----------|-----------|
               nr |        97 |         4 |         3 |       104 |
                  |     0.933 |     0.038 |     0.029 |     0.208 |
                  |     0.205 |     0.200 |     0.429 |           |
                  |     0.194 |     0.008 |     0.006 |           |
------------------|-----------|-----------|-----------|-----------|
                p |        46 |         0 |         1 |        47 |
                  |     0.979 |     0.000 |     0.021 |     0.094 |
                  |     0.097 |     0.000 |     0.143 |           |
                  |     0.092 |     0.000 |     0.002 |           |
------------------|-----------|-----------|-----------|-----------|
     Column Total |       473 |        20 |         7 |       500 |
                  |     0.946 |     0.040 |     0.014 |           |
------------------|-----------|-----------|-----------|-----------|
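A single summary number can also be computed directly: the overall share of correctly classified test tweets (a sketch; note in the column totals above how strongly the predictions lean towards "c"):
#overall accuracy on the test half (sketch)
mean(bof == dfn$codes[-samp])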
What do you think? Is it a good classifier?