In this post, the k-nearest-neighbors algorithm is used to classify Twitter data.
We have 1000 tweets about fracking, labeled "c" = contra fracking, "p" = pro fracking, "n" = neutral to fracking, and "nr" = not related to fracking.
The data can be found here.
The data is already cleaned: missing values have been coded as -9999, then all numeric variables have been normalized with this R-function:
normalize = function(x) (x-min(x))/(max(x)-min(x))
Dates are transformed as well, as can be seen here.
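To apply this normalization to the whole data set, something like the following could be used (a sketch; df stands for any data frame holding the numeric meta-data columns):
#normalize every numeric column of a data frame (sketch)
num_cols = sapply(df, is.numeric)
df[num_cols] = lapply(df[num_cols], normalize)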
The k-nearest-neighbors algorithm is a "lazy" learner. It makes no assumptions about the underlying distributions (a non-parametric approach!). It simply calculates distances in the feature space. In that sense it does not really "learn" anything, because it never builds an abstraction of the data.
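To make the "distances in the feature space" idea concrete, here is a minimal hand-rolled version for a single test point (purely illustrative; train_x is a matrix of training features, train_y their labels, test_x one new observation):
#toy version of what knn() does internally (sketch)
knn_one = function(train_x, train_y, test_x, k=3) {
  #Euclidean distance from the test point to every training point
  d = sqrt(rowSums((train_x - matrix(test_x, nrow(train_x), length(test_x), byrow=TRUE))^2))
  nearest = order(d)[1:k]
  #majority vote among the k nearest labels
  names(which.max(table(train_y[nearest])))
}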
We want to test whether k-nearest-neighbors can classify the tweets from the meta-data alone (without looking at the text...).
We used half of the data for training and the other half for testing. Here is the R-script:
#load the data (of course it has to be in the working directory...)
dfn=read.csv("FrackTweets.csv")
#remove the text-column
dfn=dfn[,-1]
#create 500 random numbers to allocate test and training set
samp=sample(1000,500)
#and now run the k-nearest-neighbors
library(class)
bof=knn(train=dfn[samp,-dim(dfn)[2]], test=dfn[-samp,-dim(dfn)[2]], cl=dfn$codes[samp], k=32)
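The value k=32 is a tuning choice (the -dim(dfn)[2] in the call drops the last column, which holds the codes, so only the meta-data goes into the distance calculation). A quick, rough way to compare a few candidates for k is to re-run knn() and look at the share of correct test predictions (a sketch reusing dfn and samp from above):
#compare a few values of k by test-set accuracy (sketch)
for (k in c(1, 5, 15, 32, 50)) {
  pred = knn(train=dfn[samp,-dim(dfn)[2]], test=dfn[-samp,-dim(dfn)[2]], cl=dfn$codes[samp], k=k)
  cat("k =", k, " accuracy =", mean(pred == dfn$codes[-samp]), "\n")
}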
How good is this classification?
The package gmodels has a nice function for cross-tables:
library(gmodels)
CrossTable(x=dfn$codes[-samp], y=bof, prop.chisq=F)
Total Observations in Table:  500

                  | bof
 dfn$codes[-samp] |         c |         n |        nr | Row Total |
------------------|-----------|-----------|-----------|-----------|
                c |       216 |        11 |         3 |       230 |
                  |     0.939 |     0.048 |     0.013 |     0.460 |
                  |     0.457 |     0.550 |     0.429 |           |
                  |     0.432 |     0.022 |     0.006 |           |
------------------|-----------|-----------|-----------|-----------|
                n |       114 |         5 |         0 |       119 |
                  |     0.958 |     0.042 |     0.000 |     0.238 |
                  |     0.241 |     0.250 |     0.000 |           |
                  |     0.228 |     0.010 |     0.000 |           |
------------------|-----------|-----------|-----------|-----------|
               nr |        97 |         4 |         3 |       104 |
                  |     0.933 |     0.038 |     0.029 |     0.208 |
                  |     0.205 |     0.200 |     0.429 |           |
                  |     0.194 |     0.008 |     0.006 |           |
------------------|-----------|-----------|-----------|-----------|
                p |        46 |         0 |         1 |        47 |
                  |     0.979 |     0.000 |     0.021 |     0.094 |
                  |     0.097 |     0.000 |     0.143 |           |
                  |     0.092 |     0.000 |     0.002 |           |
------------------|-----------|-----------|-----------|-----------|
     Column Total |       473 |        20 |         7 |       500 |
                  |     0.946 |     0.040 |     0.014 |           |
------------------|-----------|-----------|-----------|-----------|
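A single summary number can also be computed directly: the overall share of correctly classified test tweets (a sketch; note in the column totals above how strongly the predictions lean towards "c"):
#overall accuracy on the test half (sketch)
mean(bof == dfn$codes[-samp])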
What do you think? Is it a good classifier?