My new publication in @EPA_Journal is #openSource.
Abstract
The article introduces machine learning algorithms for political scientists.
These approaches should not be seen as new methods for old problems. Rather, it
is important to understand the different logic of the machine learning
approach. Here, data is analyzed without theoretical assumptions about possible
causalities. Models are optimized according to their accuracy and robustness.
While the computer can do this work more or less alone, it is the researcher's
duty to make sense of these models afterwards. Visualization of machine
learning results therefore becomes very important and is the focus of this
paper. The methods that are presented and compared are decision trees, bagging,
and random forests. The latter are more advanced versions of the former,
relying on bootstrapping procedures. To demonstrate these methods, extreme
shifts in the US budget and their connection to the attention of political
actors are analyzed. The paper presents a comparison of the accuracy of
different models based on ROC curves and shows how to interpret random forest
models with the help of visualizations. The aim of the article is to provide an
example of how these methods can be used in political science and to highlight
possible pitfalls as well as advantages of machine learning.
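To make the bootstrapping idea behind bagging concrete, here is a minimal base-R sketch (not part of the article's replication code; the toy data and the threshold "stump" classifier are invented for illustration): each bootstrap replicate draws n observations with replacement, a simple classifier is fit on each replicate, and the ensemble predicts by majority vote.

```r
# Toy illustration of bagging via the bootstrap.
# NOT the budget data from the article -- data and classifier are made up.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- as.integer(x + rnorm(n, sd = 0.5) > 0)   # noisy binary outcome

# A deliberately simple base learner: pick the threshold on x
# that minimizes the training misclassification rate
fit_stump <- function(x, y) {
  cuts <- sort(unique(x))
  errs <- sapply(cuts, function(cc) mean((x > cc) != y))
  cuts[which.min(errs)]
}

B <- 25                                 # number of bootstrap replicates
stumps <- replicate(B, {
  idx <- sample(n, replace = TRUE)      # bootstrap sample: n draws WITH replacement
  fit_stump(x[idx], y[idx])
})

# Ensemble prediction: majority vote over the B stumps
bagged_pred <- function(xnew) {
  votes <- sapply(stumps, function(cc) xnew > cc)   # n x B matrix of votes
  as.integer(rowMeans(votes) > 0.5)
}

mean(bagged_pred(x) == y)               # training accuracy of the bagged ensemble
```

Random forests add one more ingredient on top of this: at each split only a random subset of the predictors is considered, which decorrelates the individual trees.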
And here is the R code as supporting information:
##########################
### R-script for
###
### Decision trees and random forests:
### Machine learning techniques to classify rare events
###
### European Policy Analysis
### Simon Hegelich
###
### Contact:
### Prof. Dr. Simon Hegelich
### Political Data Science
### Technical University of Munich TUM
### Bavarian School of Public Policy
### Email: hegelich@hfpm.de
### http://politicaldatascience.blogspot.de
##########################

### Downloading data from the Policy Agendas Project
### This can take some time and requires a good internet connection
budget = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/budget/instances.csv?from=1945&to=2014", as.is=T)
congress = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/ch/instances.csv?from=1945&to=2014", as.is=T)
laws = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/pl/instances.csv?from=1945&to=2014", as.is=T)
rowCall = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/rc/instances.csv?from=1945&to=2014", as.is=T)
eo = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/eo/instances.csv?from=1945&to=2014", as.is=T)
sou = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/sou/instances.csv?from=1945&to=2014", as.is=T)
supCourt = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/sc/instances.csv?from=1945&to=2014", as.is=T)
gallup = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/mip/instances.csv?from=1945&to=2014", as.is=T)

### The next table was constructed by the author.
### It shows matches in topics and budgets
TopicsCross = read.csv("TopicsCross.csv", as.is=T)
TopicsCross
##    BudgetCode                                          BudgetTopic PAPCode
## 1          50                                     National Defense      16
## 2         150                                International Affairs      19
## 3         250               General Science, Space, and Technology      NA
## 4         270                                               Energy       8
## 5         300                    Natural Resources and Environment       7
## 6         350                                          Agriculture       4
## 7         370                          Commerce and Housing Credit      NA
## 8         400                                       Transportation      10
## 9         450                   Community and Regional Development      NA
## 10        500 Education, Training, Employment, and Social Services       6
## 11        550                                               Health       3
## 12        570                                             Medicare      NA
## 13        600                                      Income Security      NA
## 14        650                                      Social Security      13
## 15        700                       Veterans Benefits and Services      NA
## 16        750                            Administration of Justice      12
## 17        800                                   General Government      20
## 18        900                                         Net Interest      NA
## 19        920                                           Allowances      NA
## 20        950                    Undistributed Offsetting Receipts      NA
## 21         NA                                                            1
## 22         NA                                                            2
## 23         NA                                                            5
## 24         NA                                                            9
## 25         NA                                                           14
## 26         NA                                                           15
## 27         NA                                                           17
## 28         NA                                                           18
## 29         NA                                                           21
##                                                 PAPTopic
## 1                                                Defense
## 2                  International Affairs and Foreign Aid
## 3
## 4                                                 Energy
## 5                                            Environment
## 6                                            Agriculture
## 7
## 8                                         Transportation
## 9
## 10                                             Education
## 11                                                Health
## 12
## 13
## 14                                        Social Welfare
## 15
## 16                         Law, Crime, and Family Issues
## 17                                 Government Operations
## 18
## 19
## 20
## 21                                        Macroeconomics
## 22    Civil Rights, Minority Issues, and Civil Liberties
## 23                                  Labor and Employment
## 24                                           Immigration
## 25               Community Development and Housing Issues
## 26               Banking, Finance, and Domestic Commerce
## 27         Space, Science, Technology and Communications
## 28                                         Foreign Trade
## 29                      Public Lands and Water Management

budcd = TopicsCross$BudgetCode[!(is.na(TopicsCross$PAPCode))]
budcd = budcd[!is.na(budcd)]
budgetSel = subset(budget, select=c(pctChng, TopicCode, Year, Congress),
                   MajorFunction==1 & TopicCode %in% budcd)
hist(budgetSel$pctChng)
### Function to match topicCodes with budgetCodes
budToPAP = function(x) TopicsCross$PAPCode[TopicsCross$BudgetCode==x][1]

### Function to count the mentions of topics per year
getCount = function(code, year, YearCol, TopicCol){
  years = which(YearCol==year)
  codes = which(TopicCol==budToPAP(code))
  return(length(which(years %in% codes)))
}

### Construction of the final data frame
budgetSel$congress = NA
for(i in 1:dim(budgetSel)[1]){
  budgetSel$congress[i] = getCount(budgetSel$TopicCode[i], budgetSel$Year[i]-2, congress$Year, congress$MajorTopic)
}
budgetSel$congress[budgetSel$Year>2012 | budgetSel$Year<1948] = NA

budgetSel$eo = NA
for(i in 1:dim(budgetSel)[1]){
  budgetSel$eo[i] = getCount(budgetSel$TopicCode[i], budgetSel$Year[i]-2, eo$Year, eo$MajorTopic)
}

budgetSel$gallup = NA
for(i in 1:dim(budgetSel)[1]){
  if(any(gallup$Year==budgetSel$Year[i]-2 & gallup$MajorTopic==budToPAP(budgetSel$TopicCode[i])))
    budgetSel$gallup[i] = gallup$Percentage[which(gallup$Year==budgetSel$Year[i]-2 & gallup$MajorTopic==budToPAP(budgetSel$TopicCode[i]))]
}

budgetSel$laws = NA
for(i in 1:dim(budgetSel)[1]){
  budgetSel$laws[i] = getCount(budgetSel$TopicCode[i], budgetSel$Year[i]-2, laws$Year, laws$MajorTopic)
}
budgetSel$laws[budgetSel$Year<1950] = NA

# rowCall does not have enough data points and is not used
# budgetSel$rowCall = NULL
# for(i in 1:dim(budgetSel)[1]){
#   budgetSel$rowCall[i] = getCount(budgetSel$TopicCode[i], budgetSel$Year[i]-2, rowCall$Year, rowCall$MajorTopic)
# }
# budgetSel$rowCall[budgetSel$Year>2004] = NA

budgetSel$sou = NA
for(i in 1:dim(budgetSel)[1]){
  budgetSel$sou[i] = getCount(budgetSel$TopicCode[i], budgetSel$Year[i]-2, sou$Year, sou$Majortopic)
}
budgetSel$sou[budgetSel$Year<1948] = NA

# supCourt has too much missing data
# budgetSel$supCourt = NULL
# for(i in 1:dim(budgetSel)[1]){
#   budgetSel$supCourt[i] = getCount(budgetSel$TopicCode[i], budgetSel$Year[i]-2, supCourt$Year, supCourt$MajorTopic)
# }
# budgetSel$supCourt[budgetSel$Year>2011] = NA

### Save the data as csv
write.csv(budgetSel, "BudgetMasterLag2.csv", row.names=F)

### Read the data in again
df <- read.csv("BudgetMasterLag2.csv")
summary(df)
##     pctChng            TopicCode        Year         Congress
##  Min.   :-218.248   Min.   : 50   Min.   :1947   Min.   : 80.0
##  1st Qu.:  -6.801   1st Qu.:270   1st Qu.:1964   1st Qu.: 88.0
##  Median :   3.226   Median :375   Median :1980   Median : 96.5
##  Mean   :  13.498   Mean   :412   Mean   :1980   Mean   : 96.5
##  3rd Qu.:  14.673   3rd Qu.:550   3rd Qu.:1997   3rd Qu.:105.0
##  Max.   :1025.630   Max.   :800   Max.   :2014   Max.   :113.0
##
##     congress            eo            gallup             laws
##  Min.   :  5.00   Min.   : 0.00   Min.   :0.00000   Min.   :  0.00
##  1st Qu.: 38.00   1st Qu.: 0.00   1st Qu.:0.00000   1st Qu.:  6.00
##  Median : 71.00   Median : 2.00   Median :0.01127   Median : 11.00
##  Mean   : 89.61   Mean   : 4.49   Mean   :0.04459   Mean   : 19.11
##  3rd Qu.:122.75   3rd Qu.: 6.00   3rd Qu.:0.04579   3rd Qu.: 21.00
##  Max.   :367.00   Max.   :65.00   Max.   :0.53333   Max.   :128.00
##  NA's   :30                       NA's   :40        NA's   :30
##       sou
##  Min.   :  0.00
##  1st Qu.:  1.00
##  Median :  6.00
##  Mean   : 16.78
##  3rd Qu.: 22.00
##  Max.   :336.00
##  NA's   :10

df$TopicCode = as.factor(df$TopicCode)
df = na.omit(df)

### Find punctuations: see Hegelich/Fraune/Knollmann 2015
### http://onlinelibrary.wiley.com/doi/10.1111/psj.12089/abstract
IQRs = by(df$pctChng, df$TopicCode, IQR)
df$Punc = NA
for(i in 1:length(IQRs)){
  x = df$pctChng[df$TopicCode==levels(df$TopicCode)[i]]
  Punc = ifelse((x > quantile(x, .75) + IQRs[[i]]*1.5) |
                (x < quantile(x, .25) - IQRs[[i]]*1.5), TRUE, FALSE)
  df$Punc[df$TopicCode==levels(df$TopicCode)[i]] = Punc
}

for (i in 1:length(levels(df$TopicCode))){
  df$congS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$congress[df$TopicCode==levels(df$TopicCode)[i]])
}
for (i in 1:length(levels(df$TopicCode))){
  df$eoS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$eo[df$TopicCode==levels(df$TopicCode)[i]])
}
for (i in 1:length(levels(df$TopicCode))){
  df$gallupS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$gallup[df$TopicCode==levels(df$TopicCode)[i]])
}
for (i in 1:length(levels(df$TopicCode))){
  df$lawsS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$laws[df$TopicCode==levels(df$TopicCode)[i]])
}
# for (i in 1:length(levels(df$TopicCode))){
#   df$rowCallS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$rowCall[df$TopicCode==levels(df$TopicCode)[i]])
# }
for (i in 1:length(levels(df$TopicCode))){
  df$souS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$sou[df$TopicCode==levels(df$TopicCode)[i]])
}
# for (i in 1:length(levels(df$TopicCode))){
#   df$supCourtS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$supCourt[df$TopicCode==levels(df$TopicCode)[i]])
# }

colnames(df)[4] = "legislative"
df$Punc = as.factor(df$Punc)
colnames(df)
##  [1] "pctChng"     "TopicCode"   "Year"        "legislative" "congress"
##  [6] "eo"          "gallup"      "laws"        "sou"         "Punc"
## [11] "congS"       "eoS"         "gallupS"     "lawsS"       "souS"

### Select relevant data
df <- df[,-c(4, 11:15)]
summary(df)
##     pctChng          TopicCode        Year         congress
##  Min.   :-218.248   50     : 61   Min.   :1950   Min.   :  5.00
##  1st Qu.:  -5.734   150    : 61   1st Qu.:1967   1st Qu.: 41.00
##  Median :   3.809   270    : 61   Median :1982   Median : 73.00
##  Mean   :  14.157   300    : 61   Mean   :1982   Mean   : 90.43
##  3rd Qu.:  15.137   350    : 61   3rd Qu.:1997   3rd Qu.:123.00
##  Max.   :1025.630   400    : 61   Max.   :2012   Max.   :349.00
##                     (Other):244
##       eo            gallup             laws             sou
##  Min.   : 0.0   Min.   :0.00000   Min.   :  0.00   Min.   :  0.00
##  1st Qu.: 0.0   1st Qu.:0.00000   1st Qu.:  6.00   1st Qu.:  1.00
##  Median : 2.0   Median :0.01109   Median : 12.00   Median :  5.00
##  Mean   : 4.2   Mean   :0.04511   Mean   : 19.56   Mean   : 15.53
##  3rd Qu.: 6.0   3rd Qu.:0.04578   3rd Qu.: 21.00   3rd Qu.: 20.00
##  Max.   :46.0   Max.   :0.53333   Max.   :128.00   Max.   :336.00
##
##     Punc
##  FALSE:553
##  TRUE : 57
##
##
##
##

library(psych)
pairs.panels(df)
#png("HistBud.png", units="cm", width=16, height=8, res=600)
hist(df$pctChng, breaks="FD", main="", xlab="Annual Percentage Change", prob=T)
x <- seq(0,100,0.01)
curve(dnorm(x, mean(df$pctChng), sd(df$pctChng)), add=T)
#dev.off()

dfSub = subset(df, select=c("Punc", "Year", "sou"))
puncs <- which(dfSub$Punc==T)
imc <- which(dfSub$Punc==F)
set.seed(2)
sel <- c(sample(puncs, 25), sample(imc, 25))
dfSub2 <- dfSub[sel, ]

library(tree)
set.seed(1)
subtree = tree(Punc~., data=dfSub2)
# png("SubTree.png", unit="cm", type="cairo", height=12, width=16, res=600)
plot(subtree)
text(subtree, pretty=0)
# dev.off()
summary(subtree)
##
## Classification tree:
## tree(formula = Punc ~ ., data = dfSub2)
## Number of terminal nodes:  8
## Residual mean deviance:  0.6694 = 28.11 / 42
## Misclassification error rate: 0.14 = 7 / 50
subtree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
##
##  1) root 50 69.310 FALSE ( 0.5000 0.5000 )
##    2) Year < 1959 10  6.502 TRUE ( 0.1000 0.9000 )
##      4) Year < 1951.5 5  5.004 TRUE ( 0.2000 0.8000 ) *
##      5) Year > 1951.5 5  0.000 TRUE ( 0.0000 1.0000 ) *
##    3) Year > 1959 40 53.840 FALSE ( 0.6000 0.4000 )
##      6) sou < 54 32 44.360 TRUE ( 0.5000 0.5000 )
##       12) sou < 1.5 7  5.742 FALSE ( 0.8571 0.1429 ) *
##       13) sou > 1.5 25 33.650 TRUE ( 0.4000 0.6000 )
##         26) Year < 1989.5 15 17.400 TRUE ( 0.2667 0.7333 )
##           52) Year < 1975 9 12.370 TRUE ( 0.4444 0.5556 ) *
##           53) Year > 1975 6  0.000 TRUE ( 0.0000 1.0000 ) *
##         27) Year > 1989.5 10 13.460 FALSE ( 0.6000 0.4000 )
##           54) Year < 2003.5 5  0.000 FALSE ( 1.0000 0.0000 ) *
##           55) Year > 2003.5 5  5.004 TRUE ( 0.2000 0.8000 ) *
##      7) sou > 54 8  0.000 FALSE ( 1.0000 0.0000 ) *

### Find pruned tree
cv.subtree <- cv.tree(subtree, FUN=prune.misclass)
cv.subtree
## $size
## [1] 8 6 2 1
##
## $dev
## [1] 23 23 24 35
##
## $k
## [1] -Inf  0.0  2.5  8.0
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune"         "tree.sequence"
par(mfrow = c(1,2))
plot(cv.subtree$size, cv.subtree$dev, type="b")
plot(cv.subtree$k, cv.subtree$dev, type="b")
par(mfrow = c(1,1))
prune.subtree <- prune.misclass(subtree, best=6)
#png("simpleTree.png", 800,600)
plot(prune.subtree)
text(prune.subtree, pretty=0)
#dev.off()
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
##     %+%, alpha

#png("Devide.png", units="cm", width=16, height=12, res=600)
ggplot(dfSub2, aes(x=Year, y=sou, shape=Punc)) +
  geom_point(size=4) +
  theme_bw() +
  scale_shape_discrete(labels=c("Incremental", "Punctuation"), solid=F) +
  xlab("Year") + ylab("State of the Union Speeches") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))
#dev.off()

#png("Devided.png", units="cm", width=16, height=12, res=600)
ggplot(dfSub2, aes(x=Year, y=sou, shape=Punc)) +
  geom_point(size=4) +
  theme_bw() +
  scale_shape_discrete(labels=c("Incremental", "Punctuation"), solid=F) +
  geom_vline(xintercept = 1959) +
  geom_segment(aes(x = 1959, y = 54, xend = Inf, yend = 54)) +
  geom_segment(aes(x = 1959, y = 1.5, xend = Inf, yend = 1.5), linetype="dashed") +
  geom_segment(aes(x = 1989.5, y = 1.5, xend = 1989.5, yend = 54), linetype="dashed") +
  geom_segment(aes(x = 2003.5, y = 1.5, xend = 2003.5, yend = 54), linetype="dashed") +
  xlab("Year") + ylab("State of the Union Speeches") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))
#dev.off()

#png(file="PunctuationDist.png", type="cairo", units="cm", width=8, height=8, res=600)
plot(df$Punc, main="Distribution of Punctuations")
#dev.off()

colnames(df)
## [1] "pctChng"   "TopicCode" "Year"      "congress"  "eo"        "gallup"
## [7] "laws"      "sou"       "Punc"

### Percentage changes have to be removed from the data,
### because this variable is nearly the same as Punc.
### Could be used for a cool regression project...
df <- df[,-1]

### Final correlation plot for article
#png(file="CorrPanels.png", type="cairo", units="cm", width=18, height=18, res=600)
pairs.panels(df, lm=T, ellipses=F, scale=F, rug=F)
#dev.off()

# Check if Punc and TopicCode are factors.
# sapply(df, class)

### Model testing
set.seed(555)
sel <- sample(dim(df)[1], (dim(df)[1]/3)*2)
tr <- df[sel,]
te <- df[-sel, ]

### Logistic regression
fit1 <- glm(Punc~., fam=binomial, data=tr)
predsReg <- predict(fit1, type="response", newdata=te)

### Decision tree
library(rpart)
fitTree <- rpart(Punc~., data=tr)
predTreeTr <- predict(fitTree, type="class")
predTree <- predict(fitTree, newdata=te)[,2]

library(descr)
Prediction <- predTreeTr
Real <- tr$Punc
crosstab(Prediction, Real)
##    Cell Contents
## |-------------------------|
## |                   Count |
## |-------------------------|
##
## ==================================
##              Real
## Prediction   FALSE   TRUE   Total
## ----------------------------------
## FALSE          366     27     393
## ----------------------------------
## TRUE             5      8      13
## ----------------------------------
## Total          371     35     406
## ==================================

Prediction <- predict(fitTree, type="class", newdata=te)
Real <- te$Punc
crosstab(Prediction, Real)
##    Cell Contents
## |-------------------------|
## |                   Count |
## |-------------------------|
##
## ==================================
##              Real
## Prediction   FALSE   TRUE   Total
## ----------------------------------
## FALSE          178     21     199
## ----------------------------------
## TRUE             4      1       5
## ----------------------------------
## Total          182     22     204
## ==================================

library(rpart.plot)
#png(file="TrainTree.png", type="cairo", units="cm", width=12, height=12, res=600)
rpart.plot(fitTree, type=0, extra=2, cex=1)
#dev.off()
summary(fitTree)
## Call:
## rpart(formula = Punc ~ ., data = tr)
##   n= 406
##
##           CP nsplit rel error xerror      xstd
## 1 0.04285714      0 1.0000000      1 0.1615809
## 2 0.01000000      2 0.9142857      1 0.1615809
##
## Variable importance
##      Year TopicCode  congress    gallup      laws
##        50        28        12         6         3
##
## Node number 1: 406 observations,    complexity param=0.04285714
##   predicted class=FALSE  expected loss=0.0862069  P(node) =1
##     class counts:   371    35
##    probabilities: 0.914 0.086
##   left son=2 (384 obs) right son=3 (22 obs)
##   Primary splits:
##       Year      < 1952.5     to the right, improve=4.8499870, (0 missing)
##       TopicCode splits as RLRRRRLRRL,      improve=1.8075810, (0 missing)
##       eo        < 4.5        to the right, improve=0.9136482, (0 missing)
##       sou       < 30.5       to the right, improve=0.8550369, (0 missing)
##       congress  < 72.5       to the right, improve=0.7677503, (0 missing)
##
## Node number 2: 384 observations
##   predicted class=FALSE  expected loss=0.06770833  P(node) =0.9458128
##     class counts:   358    26
##    probabilities: 0.932 0.068
##
## Node number 3: 22 observations,    complexity param=0.04285714
##   predicted class=FALSE  expected loss=0.4090909  P(node) =0.05418719
##     class counts:    13     9
##    probabilities: 0.591 0.409
##   left son=6 (9 obs) right son=7 (13 obs)
##   Primary splits:
##       TopicCode splits as RLRLRRLRRL,      improve=2.7047400, (0 missing)
##       congress  < 56.5       to the right, improve=1.0637140, (0 missing)
##       gallup    < 0.01869159 to the right, improve=0.6363636, (0 missing)
##       eo        < 6          to the right, improve=0.4363636, (0 missing)
##       laws      < 14         to the right, improve=0.3030303, (0 missing)
##   Surrogate splits:
##       congress < 14.5       to the left,  agree=0.773, adj=0.444, (0 split)
##       gallup   < 0.01869159 to the right, agree=0.682, adj=0.222, (0 split)
##       laws     < 7          to the left,  agree=0.636, adj=0.111, (0 split)
##
## Node number 6: 9 observations
##   predicted class=FALSE  expected loss=0.1111111  P(node) =0.02216749
##     class counts:     8     1
##    probabilities: 0.889 0.111
##
## Node number 7: 13 observations
##   predicted class=TRUE  expected loss=0.3846154  P(node) =0.0320197
##     class counts:     5     8
##    probabilities: 0.385 0.615

### Random forest
y = tr$Punc
x = tr[,-8]
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
##     margin
## The following object is masked from 'package:psych':
##
##     outlier

rf <- randomForest(x, y, ntree=301, proximity=TRUE, importance=TRUE, keep.forest=T)
rf2 <- randomForest(x, y, ntree=301, proximity=TRUE, importance=TRUE, keep.forest=T, cutoff=c(0.8,0.2))
bag <- randomForest(x, y, ntree=301, mtry=7)

#png(file="RFTrees.png", type="cairo", units="cm", width=24, height=12, res=300)
par(mfrow=c(1,2))
plot(bag, col=c("black", "lightgrey", "darkgrey"), lwd=2, lty=c(1,2,3), main="Bagging")
legend("right", colnames(bag$err.rate), lwd=2, lty=c(1,2,3), col=c("black", "lightgrey", "darkgrey"))
plot(rf, col=c("black", "lightgrey", "darkgrey"), lwd=2, lty=c(1,2,3), main="Random Forest")
legend("right", colnames(rf$err.rate), lwd=2, lty=c(1,2,3), col=c("black", "lightgrey", "darkgrey"))
# dev.off()
par(mfrow=c(1,1))

### MDSplot (was not used in article)
MDSplot(rf, y, k=2)
## Warning in RColorBrewer::brewer.pal(nlevs, "Set1"): minimal value for n is 3,
## returning requested palette with 3 different levels

rf
##
## Call:
##  randomForest(x = x, y = y, ntree = 301, importance = TRUE, proximity = TRUE, keep.forest = T)
##                Type of random forest: classification
##                      Number of trees: 301
## No. of variables tried at each split: 2
##
##         OOB estimate of  error rate: 9.36%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE   368    3 0.008086253
## TRUE     35    0 1.000000000

rf2
##
## Call:
##  randomForest(x = x, y = y, ntree = 301, cutoff = c(0.8, 0.2), importance = TRUE, proximity = TRUE, keep.forest = T)
##                Type of random forest: classification
##                      Number of trees: 301
## No. of variables tried at each split: 2
##
##         OOB estimate of  error rate: 13.55%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE   340   31  0.08355795
## TRUE     24   11  0.68571429

bag
##
## Call:
##  randomForest(x = x, y = y, ntree = 301, mtry = 7)
##                Type of random forest: classification
##                      Number of trees: 301
## No. of variables tried at each split: 7
##
##         OOB estimate of  error rate: 9.61%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE   364    7  0.01886792
## TRUE     32    3  0.91428571

predsRF <- predict(rf, type="prob", newdata=te)[,2]
predsRF2 <- predict(rf2, type="prob", newdata=te)[,2]
predsBag <- predict(bag, type="prob", newdata=te)[,2]

Prediction <- predict(bag, type="class", newdata=te)
Real <- te$Punc
crosstab(Prediction, Real)
##    Cell Contents
## |-------------------------|
## |                   Count |
## |-------------------------|
##
## ==================================
##              Real
## Prediction   FALSE   TRUE   Total
## ----------------------------------
## FALSE          180     20     200
## ----------------------------------
## TRUE             2      2       4
## ----------------------------------
## Total          182     22     204
## ==================================

Prediction <- predict(rf, type="class", newdata=te)
Real <- te$Punc
crosstab(Prediction, Real)
##    Cell Contents
## |-------------------------|
## |                   Count |
## |-------------------------|
##
## ==================================
##              Real
## Prediction   FALSE   TRUE   Total
## ----------------------------------
## FALSE          182     21     203
## ----------------------------------
## TRUE             0      1       1
## ----------------------------------
## Total          182     22     204
## ==================================

Prediction <- predict(rf2, type="class", newdata=te)
Real <- te$Punc
crosstab(Prediction, Real)
##    Cell Contents
## |-------------------------|
## |                   Count |
## |-------------------------|
##
## ==================================
##              Real
## Prediction   FALSE   TRUE   Total
## ----------------------------------
## FALSE          165     16     181
## ----------------------------------
## TRUE            17      6      23
## ----------------------------------
## Total          182     22     204
## ==================================

colnames(tr)
## [1] "TopicCode" "Year"      "congress"  "eo"        "gallup"    "laws"
## [7] "sou"       "Punc"

#png("PartialPlot.png", height=12, width=20, units="cm", type="cairo", res=600)
par(mfrow=c(1,2))
partialPlot(rf, pred.data=tr, x.var="Year", "TRUE")
partialPlot(rf, pred.data=tr, x.var="congress", "TRUE")
#dev.off()
par(mfrow=c(1,1))

#png("VarImpPlot.png", height=9, width=16, units="cm", type="cairo", res=600)
varImpPlot(rf, main="")
#dev.off()

library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
##     lowess

# png("ROC.png", height=18, width=16, units="cm", type="cairo", res=600)
plot(performance(prediction(predsReg, te$Punc), 'tpr', 'fpr'), col="lightgrey", colorize=F, lwd=2)
plot(performance(prediction(predsRF, te$Punc), 'tpr', 'fpr'), col="black", add=T, lwd=2)
plot(performance(prediction(predTree, te$Punc), 'tpr', 'fpr'), col="black", lty=2, add=T, colorize=F, lwd=2)
plot(performance(prediction(predsBag, te$Punc), 'tpr', 'fpr'), col="darkgrey", lty=3, add=T, colorize=F, lwd=2)
# plot(performance(prediction(predsRF2, te$Punc), 'tpr', 'fpr'), col="red", lty=3, add=T, colorize=F, lwd=2)
abline(a=0, b=1, lty=4, col="grey")
legend("bottomright",
       legend=c("Random Forest", "Decision Tree", "Bagging", "Logistic Reg.", "Random Class."),
       col=c("black", "black", "darkgrey", "grey", "grey"),
       lty=c(1,2,3,1,4), lwd=c(2,2,2,2,1))
# dev.off()

logAUC <- as.numeric(performance(prediction(predsReg, te$Punc), "auc")@y.values)
rfAUC <- as.numeric(performance(prediction(predsRF, te$Punc), "auc")@y.values)
rf2AUC <- as.numeric(performance(prediction(predsRF2, te$Punc), "auc")@y.values)
bagAUC <- as.numeric(performance(prediction(predsBag, te$Punc), "auc")@y.values)
treeAUC <- as.numeric(performance(prediction(predTree, te$Punc), "auc")@y.values)
AUCs <- c(logAUC, rfAUC, bagAUC, treeAUC)

# png("AUCs.png", height=18, width=16, units="cm", type="cairo", res=600)
barplot(AUCs, width=0.5, space=0.2,
        names=c("Logistic Reg.", "Random Forest", "Bagging", "Decision Tree"),
        horiz=T, xpd=F, xlim=c(0,1), density=c(10,20,30,40),
        col=c("lightgrey", "black", "darkgrey", "white"))
abline(v=0.75, lty=2, col="black")
abline(v=0.5, lty=3, col="black")
# dev.off()

### Comparison of final predictions (was not used in article)
RFPred <- predict(rf, type="response", newdata=te)
TreePred <- predict(fitTree, type="class", newdata=te)
LogPred <- predict(fit1, type="response", newdata=te)
LogPred <- ifelse(LogPred>0.5, "TRUE", "FALSE")
LogPred <- as.factor(LogPred)
punctuations <- te$Punc
table(RFPred, punctuations)
##       punctuations
## RFPred  FALSE TRUE
##   FALSE   182   21
##   TRUE      0    1
table(LogPred, punctuations)
##        punctuations
## LogPred FALSE TRUE
##   FALSE   181   21
##   TRUE      1    1
table(TreePred, punctuations)
##         punctuations
## TreePred FALSE TRUE
##   FALSE    178   21
##   TRUE       4    1
par(mfrow=c(1,2))
plot(RFPred~punctuations)
plot(LogPred~punctuations)