My new publication in @EPA_Journal is #openSource.
Abstract
The article introduces machine learning algorithms for political scientists.
These approaches should not be seen as new methods for old problems. Rather, it
is important to understand the different logic of the machine learning
approach. Here, data is analyzed without theoretical assumptions about possible
causalities. Models are optimized according to their accuracy and robustness.
While the computer can do this work more or less alone, it is the researcher's
duty to make sense of these models afterwards. Visualization of machine
learning results therefore becomes very important and is the focus of this
paper. The methods that are presented and compared are decision trees, bagging,
and random forests. The latter are more advanced versions of the former,
relying on bootstrapping procedures. To demonstrate these methods, extreme
shifts in the US budget and their connection to the attention of political
actors are analyzed. The paper presents a comparison of the accuracy of
different models based on ROC curves and shows how to interpret random forest
models with the help of visualizations. The aim of the article is to provide an
example of how these methods can be used in political science and to highlight
possible pitfalls as well as advantages of machine learning.
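To make the bootstrapping idea behind bagging concrete, here is a minimal base-R sketch (not part of the article's replication code; the toy data and the threshold "stump" classifier are invented for illustration): each bootstrap replicate draws n observations with replacement, a simple classifier is fit on each replicate, and the ensemble predicts by majority vote.

```r
# Toy illustration of bagging via the bootstrap.
# NOT the budget data from the article -- data and classifier are made up.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- as.integer(x + rnorm(n, sd = 0.5) > 0)   # noisy binary outcome

# A deliberately simple base learner: pick the threshold on x
# that minimizes the training misclassification rate
fit_stump <- function(x, y) {
  cuts <- sort(unique(x))
  errs <- sapply(cuts, function(cc) mean((x > cc) != y))
  cuts[which.min(errs)]
}

B <- 25                                 # number of bootstrap replicates
stumps <- replicate(B, {
  idx <- sample(n, replace = TRUE)      # bootstrap sample: n draws WITH replacement
  fit_stump(x[idx], y[idx])
})

# Ensemble prediction: majority vote over the B stumps
bagged_pred <- function(xnew) {
  votes <- sapply(stumps, function(cc) xnew > cc)   # n x B matrix of votes
  as.integer(rowMeans(votes) > 0.5)
}

mean(bagged_pred(x) == y)               # training accuracy of the bagged ensemble
```

Random forests add one more ingredient on top of this: at each split only a random subset of the predictors is considered, which decorrelates the individual trees.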
And here is the R code as supporting information:
##########################
### R-script for
###
### Decision trees and random forests:
### Machine learning techniques to classify rare events
###
### European Policy Analysis
### Simon Hegelich
###
### Contact:
### Prof. Dr. Simon Hegelich
### Political Data Science
### Technical University of Munich TUM
### Bavarian School of Public Policy
### Email: hegelich@hfpm.de
### http://politicaldatascience.blogspot.de
##########################

### Downloading data from the Policy Agendas Project
### This can take some time and requires a good internet connection
budget = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/budget/instances.csv?from=1945&to=2014", as.is=T)
congress = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/ch/instances.csv?from=1945&to=2014", as.is=T)
laws = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/pl/instances.csv?from=1945&to=2014", as.is=T)
rowCall = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/rc/instances.csv?from=1945&to=2014", as.is=T)
eo = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/eo/instances.csv?from=1945&to=2014", as.is=T)
sou = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/sou/instances.csv?from=1945&to=2014", as.is=T)
supCourt = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/sc/instances.csv?from=1945&to=2014", as.is=T)
gallup = read.csv("http://www.utexas.edu/cola/_webservices/policyagendas/mip/instances.csv?from=1945&to=2014", as.is=T)

### The next table was constructed by the author.
### It shows matches in topics and budgets
TopicsCross = read.csv("TopicsCross.csv", as.is=T)
TopicsCross
##    BudgetCode                                          BudgetTopic PAPCode
## 1          50                                     National Defense      16
## 2         150                                International Affairs      19
## 3         250               General Science, Space, and Technology      NA
## 4         270                                               Energy       8
## 5         300                    Natural Resources and Environment       7
## 6         350                                          Agriculture       4
## 7         370                          Commerce and Housing Credit      NA
## 8         400                                       Transportation      10
## 9         450                   Community and Regional Development      NA
## 10        500 Education, Training, Employment, and Social Services       6
## 11        550                                               Health       3
## 12        570                                             Medicare      NA
## 13        600                                      Income Security      NA
## 14        650                                      Social Security      13
## 15        700                       Veterans Benefits and Services      NA
## 16        750                            Administration of Justice      12
## 17        800                                   General Government      20
## 18        900                                         Net Interest      NA
## 19        920                                           Allowances      NA
## 20        950                    Undistributed Offsetting Receipts      NA
## 21         NA                                                            1
## 22         NA                                                            2
## 23         NA                                                            5
## 24         NA                                                            9
## 25         NA                                                           14
## 26         NA                                                           15
## 27         NA                                                           17
## 28         NA                                                           18
## 29         NA                                                           21
##                                                 PAPTopic
## 1                                                Defense
## 2                  International Affairs and Foreign Aid
## 3
## 4                                                 Energy
## 5                                            Environment
## 6                                            Agriculture
## 7
## 8                                         Transportation
## 9
## 10                                             Education
## 11                                                Health
## 12
## 13
## 14                                        Social Welfare
## 15
## 16                         Law, Crime, and Family Issues
## 17                                 Government Operations
## 18
## 19
## 20
## 21                                        Macroeconomics
## 22    Civil Rights, Minority Issues, and Civil Liberties
## 23                                  Labor and Employment
## 24                                           Immigration
## 25               Community Development and Housing Issues
## 26               Banking, Finance, and Domestic Commerce
## 27         Space, Science, Technology and Communications
## 28                                         Foreign Trade
## 29                      Public Lands and Water Management

budcd = TopicsCross$BudgetCode[!(is.na(TopicsCross$PAPCode))]
budcd = budcd[!is.na(budcd)]
budgetSel = subset(budget, select=c(pctChng, TopicCode, Year, Congress),
                   MajorFunction==1 & TopicCode %in% budcd)
hist(budgetSel$pctChng)
### Function to match topicCodes with budgetCodes
budToPAP = function(x) TopicsCross$PAPCode[TopicsCross$BudgetCode==x][1]

### Function to count the mentions of topics per year
getCount = function(code, year, YearCol, TopicCol){
  years = which(YearCol==year)
  codes = which(TopicCol==budToPAP(code))
  return(length(which(years %in% codes)))
}

### Construction of the final data frame
budgetSel$congress = NA
for(i in 1:dim(budgetSel)[1]){
  budgetSel$congress[i] = getCount(budgetSel$TopicCode[i], budgetSel$Year[i]-2, congress$Year, congress$MajorTopic)
}
budgetSel$congress[budgetSel$Year>2012 | budgetSel$Year<1948] = NA

budgetSel$eo = NA
for(i in 1:dim(budgetSel)[1]){
  budgetSel$eo[i] = getCount(budgetSel$TopicCode[i], budgetSel$Year[i]-2, eo$Year, eo$MajorTopic)
}

budgetSel$gallup = NA
for(i in 1:dim(budgetSel)[1]){
  if(any(gallup$Year==budgetSel$Year[i]-2 & gallup$MajorTopic==budToPAP(budgetSel$TopicCode[i])))
    budgetSel$gallup[i] = gallup$Percentage[which(gallup$Year==budgetSel$Year[i]-2 & gallup$MajorTopic==budToPAP(budgetSel$TopicCode[i]))]
}

budgetSel$laws = NA
for(i in 1:dim(budgetSel)[1]){
  budgetSel$laws[i] = getCount(budgetSel$TopicCode[i], budgetSel$Year[i]-2, laws$Year, laws$MajorTopic)
}
budgetSel$laws[budgetSel$Year<1950] = NA

# rowCall does not have enough data points and is not used
# budgetSel$rowCall = NULL
# for(i in 1:dim(budgetSel)[1]){
#   budgetSel$rowCall[i] = getCount(budgetSel$TopicCode[i], budgetSel$Year[i]-2, rowCall$Year, rowCall$MajorTopic)
# }
# budgetSel$rowCall[budgetSel$Year>2004] = NA

budgetSel$sou = NA
for(i in 1:dim(budgetSel)[1]){
  budgetSel$sou[i] = getCount(budgetSel$TopicCode[i], budgetSel$Year[i]-2, sou$Year, sou$Majortopic)
}
budgetSel$sou[budgetSel$Year<1948] = NA

# supCourt has too much missing data
# budgetSel$supCourt = NULL
# for(i in 1:dim(budgetSel)[1]){
#   budgetSel$supCourt[i] = getCount(budgetSel$TopicCode[i], budgetSel$Year[i]-2, supCourt$Year, supCourt$MajorTopic)
# }
# budgetSel$supCourt[budgetSel$Year>2011] = NA

### Save the data as csv
write.csv(budgetSel, "BudgetMasterLag2.csv", row.names=F)

### Read the data in again
df <- read.csv("BudgetMasterLag2.csv")
summary(df)
##     pctChng            TopicCode        Year         Congress
##  Min.   :-218.248   Min.   : 50   Min.   :1947   Min.   : 80.0
##  1st Qu.:  -6.801   1st Qu.:270   1st Qu.:1964   1st Qu.: 88.0
##  Median :   3.226   Median :375   Median :1980   Median : 96.5
##  Mean   :  13.498   Mean   :412   Mean   :1980   Mean   : 96.5
##  3rd Qu.:  14.673   3rd Qu.:550   3rd Qu.:1997   3rd Qu.:105.0
##  Max.   :1025.630   Max.   :800   Max.   :2014   Max.   :113.0
##
##     congress            eo            gallup             laws
##  Min.   :  5.00   Min.   : 0.00   Min.   :0.00000   Min.   :  0.00
##  1st Qu.: 38.00   1st Qu.: 0.00   1st Qu.:0.00000   1st Qu.:  6.00
##  Median : 71.00   Median : 2.00   Median :0.01127   Median : 11.00
##  Mean   : 89.61   Mean   : 4.49   Mean   :0.04459   Mean   : 19.11
##  3rd Qu.:122.75   3rd Qu.: 6.00   3rd Qu.:0.04579   3rd Qu.: 21.00
##  Max.   :367.00   Max.   :65.00   Max.   :0.53333   Max.   :128.00
##  NA's   :30                       NA's   :40        NA's   :30
##       sou
##  Min.   :  0.00
##  1st Qu.:  1.00
##  Median :  6.00
##  Mean   : 16.78
##  3rd Qu.: 22.00
##  Max.   :336.00
##  NA's   :10

df$TopicCode = as.factor(df$TopicCode)
df = na.omit(df)

### Find punctuations: see Hegelich/Fraune/Knollmann 2015
### http://onlinelibrary.wiley.com/doi/10.1111/psj.12089/abstract
IQRs = by(df$pctChng, df$TopicCode, IQR)
df$Punc = NA
for(i in 1:length(IQRs)){
  x = df$pctChng[df$TopicCode==levels(df$TopicCode)[i]]
  Punc = ifelse((x > quantile(x, .75) + IQRs[[i]]*1.5) |
                (x < quantile(x, .25) - IQRs[[i]]*1.5), TRUE, FALSE)
  df$Punc[df$TopicCode==levels(df$TopicCode)[i]] = Punc
}

for (i in 1:length(levels(df$TopicCode))){
  df$congS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$congress[df$TopicCode==levels(df$TopicCode)[i]])
}
for (i in 1:length(levels(df$TopicCode))){
  df$eoS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$eo[df$TopicCode==levels(df$TopicCode)[i]])
}
for (i in 1:length(levels(df$TopicCode))){
  df$gallupS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$gallup[df$TopicCode==levels(df$TopicCode)[i]])
}
for (i in 1:length(levels(df$TopicCode))){
  df$lawsS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$laws[df$TopicCode==levels(df$TopicCode)[i]])
}
# for (i in 1:length(levels(df$TopicCode))){
#   df$rowCallS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$rowCall[df$TopicCode==levels(df$TopicCode)[i]])
# }
for (i in 1:length(levels(df$TopicCode))){
  df$souS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$sou[df$TopicCode==levels(df$TopicCode)[i]])
}
# for (i in 1:length(levels(df$TopicCode))){
#   df$supCourtS[df$TopicCode==levels(df$TopicCode)[i]] = scale(df$supCourt[df$TopicCode==levels(df$TopicCode)[i]])
# }

colnames(df)[4] = "legislative"
df$Punc = as.factor(df$Punc)
colnames(df)
##  [1] "pctChng"     "TopicCode"   "Year"        "legislative" "congress"
##  [6] "eo"          "gallup"      "laws"        "sou"         "Punc"
## [11] "congS"       "eoS"         "gallupS"     "lawsS"       "souS"

### Select relevant data
df <- df[,-c(4, 11:15)]
summary(df)
##     pctChng          TopicCode        Year         congress
##  Min.   :-218.248   50     : 61   Min.   :1950   Min.   :  5.00
##  1st Qu.:  -5.734   150    : 61   1st Qu.:1967   1st Qu.: 41.00
##  Median :   3.809   270    : 61   Median :1982   Median : 73.00
##  Mean   :  14.157   300    : 61   Mean   :1982   Mean   : 90.43
##  3rd Qu.:  15.137   350    : 61   3rd Qu.:1997   3rd Qu.:123.00
##  Max.   :1025.630   400    : 61   Max.   :2012   Max.   :349.00
##                     (Other):244
##       eo            gallup             laws             sou
##  Min.   : 0.0   Min.   :0.00000   Min.   :  0.00   Min.   :  0.00
##  1st Qu.: 0.0   1st Qu.:0.00000   1st Qu.:  6.00   1st Qu.:  1.00
##  Median : 2.0   Median :0.01109   Median : 12.00   Median :  5.00
##  Mean   : 4.2   Mean   :0.04511   Mean   : 19.56   Mean   : 15.53
##  3rd Qu.: 6.0   3rd Qu.:0.04578   3rd Qu.: 21.00   3rd Qu.: 20.00
##  Max.   :46.0   Max.   :0.53333   Max.   :128.00   Max.   :336.00
##
##     Punc
##  FALSE:553
##  TRUE : 57
##
##
##
##

library(psych)
pairs.panels(df)
#png("HistBud.png", units="cm", width=16, height=8, res=600)
hist(df$pctChng, breaks="FD", main="", xlab="Annual Percentage Change", prob=T)
x <- seq(0,100,0.01)
curve(dnorm(x, mean(df$pctChng), sd(df$pctChng)), add=T)
#dev.off()

dfSub = subset(df, select=c("Punc", "Year", "sou"))
puncs <- which(dfSub$Punc==T)
imc <- which(dfSub$Punc==F)
set.seed(2)
sel <- c(sample(puncs, 25), sample(imc, 25))
dfSub2 <- dfSub[sel, ]

library(tree)
set.seed(1)
subtree = tree(Punc~., data=dfSub2)
# png("SubTree.png", unit="cm", type="cairo", height=12, width=16, res=600)
plot(subtree)
text(subtree, pretty=0)
# dev.off()
summary(subtree)
##
## Classification tree:
## tree(formula = Punc ~ ., data = dfSub2)
## Number of terminal nodes:  8
## Residual mean deviance:  0.6694 = 28.11 / 42
## Misclassification error rate: 0.14 = 7 / 50
subtree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
##
##  1) root 50 69.310 FALSE ( 0.5000 0.5000 )
##    2) Year < 1959 10  6.502 TRUE ( 0.1000 0.9000 )
##      4) Year < 1951.5 5  5.004 TRUE ( 0.2000 0.8000 ) *
##      5) Year > 1951.5 5  0.000 TRUE ( 0.0000 1.0000 ) *
##    3) Year > 1959 40 53.840 FALSE ( 0.6000 0.4000 )
##      6) sou < 54 32 44.360 TRUE ( 0.5000 0.5000 )
##       12) sou < 1.5 7  5.742 FALSE ( 0.8571 0.1429 ) *
##       13) sou > 1.5 25 33.650 TRUE ( 0.4000 0.6000 )
##         26) Year < 1989.5 15 17.400 TRUE ( 0.2667 0.7333 )
##           52) Year < 1975 9 12.370 TRUE ( 0.4444 0.5556 ) *
##           53) Year > 1975 6  0.000 TRUE ( 0.0000 1.0000 ) *
##         27) Year > 1989.5 10 13.460 FALSE ( 0.6000 0.4000 )
##           54) Year < 2003.5 5  0.000 FALSE ( 1.0000 0.0000 ) *
##           55) Year > 2003.5 5  5.004 TRUE ( 0.2000 0.8000 ) *
##      7) sou > 54 8  0.000 FALSE ( 1.0000 0.0000 ) *

### Find pruned tree
cv.subtree <- cv.tree(subtree, FUN=prune.misclass)
cv.subtree
## $size
## [1] 8 6 2 1
##
## $dev
## [1] 23 23 24 35
##
## $k
## [1] -Inf  0.0  2.5  8.0
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune"         "tree.sequence"
par(mfrow = c(1,2))
plot(cv.subtree$size, cv.subtree$dev, type="b")
plot(cv.subtree$k, cv.subtree$dev, type="b")
par(mfrow = c(1,1))
prune.subtree <- prune.misclass(subtree, best=6)
#png("simpleTree.png", 800,600)
plot(prune.subtree)
text(prune.subtree, pretty=0)
#dev.off()
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
##     %+%, alpha

#png("Devide.png", units="cm", width=16, height=12, res=600)
ggplot(dfSub2, aes(x=Year, y=sou, shape=Punc)) +
  geom_point(size=4) +
  theme_bw() +
  scale_shape_discrete(labels=c("Incremental", "Punctuation"), solid=F) +
  xlab("Year") + ylab("State of the Union Speeches") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))
#dev.off()

#png("Devided.png", units="cm", width=16, height=12, res=600)
ggplot(dfSub2, aes(x=Year, y=sou, shape=Punc)) +
  geom_point(size=4) +
  theme_bw() +
  scale_shape_discrete(labels=c("Incremental", "Punctuation"), solid=F) +
  geom_vline(xintercept = 1959) +
  geom_segment(aes(x = 1959, y = 54, xend = Inf, yend = 54)) +
  geom_segment(aes(x = 1959, y = 1.5, xend = Inf, yend = 1.5), linetype="dashed") +
  geom_segment(aes(x = 1989.5, y = 1.5, xend = 1989.5, yend = 54), linetype="dashed") +
  geom_segment(aes(x = 2003.5, y = 1.5, xend = 2003.5, yend = 54), linetype="dashed") +
  xlab("Year") + ylab("State of the Union Speeches") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))
#dev.off()

#png(file="PunctuationDist.png", type="cairo", units="cm", width=8, height=8, res=600)
plot(df$Punc, main="Distribution of Punctuations")
#dev.off()

colnames(df)
## [1] "pctChng"   "TopicCode" "Year"      "congress"  "eo"        "gallup"
## [7] "laws"      "sou"       "Punc"

### Percentage changes have to be removed from the data,
### because this variable is nearly the same as Punc.
### Could be used for a cool regression project...
df <- df[,-1]

### Final correlation plot for article
#png(file="CorrPanels.png", type="cairo", units="cm", width=18, height=18, res=600)
pairs.panels(df, lm=T, ellipses=F, scale=F, rug=F)
#dev.off()

# Check if Punc and TopicCode are factors.
# sapply(df, class)

### Model testing
set.seed(555)
sel <- sample(dim(df)[1], (dim(df)[1]/3)*2)
tr <- df[sel,]
te <- df[-sel, ]

### Logistic regression
fit1 <- glm(Punc~., fam=binomial, data=tr)
predsReg <- predict(fit1, type="response", newdata=te)

### Decision tree
library(rpart)
fitTree <- rpart(Punc~., data=tr)
predTreeTr <- predict(fitTree, type="class")
predTree <- predict(fitTree, newdata=te)[,2]

library(descr)
Prediction <- predTreeTr
Real <- tr$Punc
crosstab(Prediction, Real)
##    Cell Contents
## |-------------------------|
## |                   Count |
## |-------------------------|
##
## ==================================
##              Real
## Prediction   FALSE   TRUE   Total
## ----------------------------------
## FALSE          366     27     393
## ----------------------------------
## TRUE             5      8      13
## ----------------------------------
## Total          371     35     406
## ==================================

Prediction <- predict(fitTree, type="class", newdata=te)
Real <- te$Punc
crosstab(Prediction, Real)
##    Cell Contents
## |-------------------------|
## |                   Count |
## |-------------------------|
##
## ==================================
##              Real
## Prediction   FALSE   TRUE   Total
## ----------------------------------
## FALSE          178     21     199
## ----------------------------------
## TRUE             4      1       5
## ----------------------------------
## Total          182     22     204
## ==================================

library(rpart.plot)
#png(file="TrainTree.png", type="cairo", units="cm", width=12, height=12, res=600)
rpart.plot(fitTree, type=0, extra=2, cex=1)
#dev.off()
summary(fitTree)
## Call:
## rpart(formula = Punc ~ ., data = tr)
##   n= 406
##
##           CP nsplit rel error xerror      xstd
## 1 0.04285714      0 1.0000000      1 0.1615809
## 2 0.01000000      2 0.9142857      1 0.1615809
##
## Variable importance
##      Year TopicCode  congress    gallup      laws
##        50        28        12         6         3
##
## Node number 1: 406 observations,    complexity param=0.04285714
##   predicted class=FALSE  expected loss=0.0862069  P(node) =1
##     class counts:   371    35
##    probabilities: 0.914 0.086
##   left son=2 (384 obs) right son=3 (22 obs)
##   Primary splits:
##       Year      < 1952.5     to the right, improve=4.8499870, (0 missing)
##       TopicCode splits as RLRRRRLRRL,      improve=1.8075810, (0 missing)
##       eo        < 4.5        to the right, improve=0.9136482, (0 missing)
##       sou       < 30.5       to the right, improve=0.8550369, (0 missing)
##       congress  < 72.5       to the right, improve=0.7677503, (0 missing)
##
## Node number 2: 384 observations
##   predicted class=FALSE  expected loss=0.06770833  P(node) =0.9458128
##     class counts:   358    26
##    probabilities: 0.932 0.068
##
## Node number 3: 22 observations,    complexity param=0.04285714
##   predicted class=FALSE  expected loss=0.4090909  P(node) =0.05418719
##     class counts:    13     9
##    probabilities: 0.591 0.409
##   left son=6 (9 obs) right son=7 (13 obs)
##   Primary splits:
##       TopicCode splits as RLRLRRLRRL,      improve=2.7047400, (0 missing)
##       congress  < 56.5       to the right, improve=1.0637140, (0 missing)
##       gallup    < 0.01869159 to the right, improve=0.6363636, (0 missing)
##       eo        < 6          to the right, improve=0.4363636, (0 missing)
##       laws      < 14         to the right, improve=0.3030303, (0 missing)
##   Surrogate splits:
##       congress < 14.5       to the left,  agree=0.773, adj=0.444, (0 split)
##       gallup   < 0.01869159 to the right, agree=0.682, adj=0.222, (0 split)
##       laws     < 7          to the left,  agree=0.636, adj=0.111, (0 split)
##
## Node number 6: 9 observations
##   predicted class=FALSE  expected loss=0.1111111  P(node) =0.02216749
##     class counts:     8     1
##    probabilities: 0.889 0.111
##
## Node number 7: 13 observations
##   predicted class=TRUE  expected loss=0.3846154  P(node) =0.0320197
##     class counts:     5     8
##    probabilities: 0.385 0.615

### Random forest
y = tr$Punc
x = tr[,-8]
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
##     margin
## The following object is masked from 'package:psych':
##
##     outlier

rf <- randomForest(x, y, ntree=301, proximity=TRUE, importance=TRUE, keep.forest=T)
rf2 <- randomForest(x, y, ntree=301, proximity=TRUE, importance=TRUE, keep.forest=T, cutoff=c(0.8,0.2))
bag <- randomForest(x, y, ntree=301, mtry=7)

#png(file="RFTrees.png", type="cairo", units="cm", width=24, height=12, res=300)
par(mfrow=c(1,2))
plot(bag, col=c("black", "lightgrey", "darkgrey"), lwd=2, lty=c(1,2,3), main="Bagging")
legend("right", colnames(bag$err.rate), lwd=2, lty=c(1,2,3), col=c("black", "lightgrey", "darkgrey"))
plot(rf, col=c("black", "lightgrey", "darkgrey"), lwd=2, lty=c(1,2,3), main="Random Forest")
legend("right", colnames(rf$err.rate), lwd=2, lty=c(1,2,3), col=c("black", "lightgrey", "darkgrey"))
# dev.off()
par(mfrow=c(1,1))

### MDSplot (was not used in article)
MDSplot(rf, y, k=2)
## Warning in RColorBrewer::brewer.pal(nlevs, "Set1"): minimal value for n is 3,
## returning requested palette with 3 different levels

rf
##
## Call:
##  randomForest(x = x, y = y, ntree = 301, importance = TRUE, proximity = TRUE, keep.forest = T)
##                Type of random forest: classification
##                      Number of trees: 301
## No. of variables tried at each split: 2
##
##         OOB estimate of  error rate: 9.36%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE   368    3 0.008086253
## TRUE     35    0 1.000000000

rf2
##
## Call:
##  randomForest(x = x, y = y, ntree = 301, cutoff = c(0.8, 0.2), importance = TRUE, proximity = TRUE, keep.forest = T)
##                Type of random forest: classification
##                      Number of trees: 301
## No. of variables tried at each split: 2
##
##         OOB estimate of  error rate: 13.55%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE   340   31  0.08355795
## TRUE     24   11  0.68571429

bag
##
## Call:
##  randomForest(x = x, y = y, ntree = 301, mtry = 7)
##                Type of random forest: classification
##                      Number of trees: 301
## No. of variables tried at each split: 7
##
##         OOB estimate of  error rate: 9.61%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE   364    7  0.01886792
## TRUE     32    3  0.91428571

predsRF <- predict(rf, type="prob", newdata=te)[,2]
predsRF2 <- predict(rf2, type="prob", newdata=te)[,2]
predsBag <- predict(bag, type="prob", newdata=te)[,2]

Prediction <- predict(bag, type="class", newdata=te)
Real <- te$Punc
crosstab(Prediction, Real)
##    Cell Contents
## |-------------------------|
## |                   Count |
## |-------------------------|
##
## ==================================
##              Real
## Prediction   FALSE   TRUE   Total
## ----------------------------------
## FALSE          180     20     200
## ----------------------------------
## TRUE             2      2       4
## ----------------------------------
## Total          182     22     204
## ==================================

Prediction <- predict(rf, type="class", newdata=te)
Real <- te$Punc
crosstab(Prediction, Real)
##    Cell Contents
## |-------------------------|
## |                   Count |
## |-------------------------|
##
## ==================================
##              Real
## Prediction   FALSE   TRUE   Total
## ----------------------------------
## FALSE          182     21     203
## ----------------------------------
## TRUE             0      1       1
## ----------------------------------
## Total          182     22     204
## ==================================

Prediction <- predict(rf2, type="class", newdata=te)
Real <- te$Punc
crosstab(Prediction, Real)
##    Cell Contents
## |-------------------------|
## |                   Count |
## |-------------------------|
##
## ==================================
##              Real
## Prediction   FALSE   TRUE   Total
## ----------------------------------
## FALSE          165     16     181
## ----------------------------------
## TRUE            17      6      23
## ----------------------------------
## Total          182     22     204
## ==================================

colnames(tr)
## [1] "TopicCode" "Year"      "congress"  "eo"        "gallup"    "laws"
## [7] "sou"       "Punc"

#png("PartialPlot.png", height=12, width=20, units="cm", type="cairo", res=600)
par(mfrow=c(1,2))
partialPlot(rf, pred.data=tr, x.var="Year", "TRUE")
partialPlot(rf, pred.data=tr, x.var="congress", "TRUE")
#dev.off()
par(mfrow=c(1,1))

#png("VarImpPlot.png", height=9, width=16, units="cm", type="cairo", res=600)
varImpPlot(rf, main="")
#dev.off()

library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
##     lowess

# png("ROC.png", height=18, width=16, units="cm", type="cairo", res=600)
plot(performance(prediction(predsReg, te$Punc), 'tpr', 'fpr'), col="lightgrey", colorize=F, lwd=2)
plot(performance(prediction(predsRF, te$Punc), 'tpr', 'fpr'), col="black", add=T, lwd=2)
plot(performance(prediction(predTree, te$Punc), 'tpr', 'fpr'), col="black", lty=2, add=T, colorize=F, lwd=2)
plot(performance(prediction(predsBag, te$Punc), 'tpr', 'fpr'), col="darkgrey", lty=3, add=T, colorize=F, lwd=2)
# plot(performance(prediction(predsRF2, te$Punc), 'tpr', 'fpr'), col="red", lty=3, add=T, colorize=F, lwd=2)
abline(a=0, b=1, lty=4, col="grey")
legend("bottomright",
       legend=c("Random Forest", "Decision Tree", "Bagging", "Logistic Reg.", "Random Class."),
       col=c("black", "black", "darkgrey", "grey", "grey"),
       lty=c(1,2,3,1,4), lwd=c(2,2,2,2,1))
# dev.off()

logAUC <- as.numeric(performance(prediction(predsReg, te$Punc), "auc")@y.values)
rfAUC <- as.numeric(performance(prediction(predsRF, te$Punc), "auc")@y.values)
rf2AUC <- as.numeric(performance(prediction(predsRF2, te$Punc), "auc")@y.values)
bagAUC <- as.numeric(performance(prediction(predsBag, te$Punc), "auc")@y.values)
treeAUC <- as.numeric(performance(prediction(predTree, te$Punc), "auc")@y.values)
AUCs <- c(logAUC, rfAUC, bagAUC, treeAUC)

# png("AUCs.png", height=18, width=16, units="cm", type="cairo", res=600)
barplot(AUCs, width=0.5, space=0.2,
        names=c("Logistic Reg.", "Random Forest", "Bagging", "Decision Tree"),
        horiz=T, xpd=F, xlim=c(0,1), density=c(10,20,30,40),
        col=c("lightgrey", "black", "darkgrey", "white"))
abline(v=0.75, lty=2, col="black")
abline(v=0.5, lty=3, col="black")
# dev.off()

### Comparison of final predictions (was not used in article)
RFPred <- predict(rf, type="response", newdata=te)
TreePred <- predict(fitTree, type="class", newdata=te)
LogPred <- predict(fit1, type="response", newdata=te)
LogPred <- ifelse(LogPred>0.5, "TRUE", "FALSE")
LogPred <- as.factor(LogPred)
punctuations <- te$Punc
table(RFPred, punctuations)
##       punctuations
## RFPred  FALSE TRUE
##   FALSE   182   21
##   TRUE      0    1
table(LogPred, punctuations)
##        punctuations
## LogPred FALSE TRUE
##   FALSE   181   21
##   TRUE      1    1
table(TreePred, punctuations)
##         punctuations
## TreePred FALSE TRUE
##   FALSE    178   21
##   TRUE       4    1
par(mfrow=c(1,2))
plot(RFPred~punctuations)
plot(LogPred~punctuations)