
Posts

Showing posts from 2014.

Using ngrams with #RTextTools

This little example shows a workaround for a bug in RTextTools: using ngramLength leads to an error. Instead, we can build the document-term matrix with the RWeka and tm libraries and then go back to RTextTools:

library(RTextTools)
library(RWeka)
library(tm)

texts <- c("This is the first document.", "Is this a text?", "This is the second file.", "This is the third text.", "File is not this.")

# Tokenizer that emits word trigrams only (min = max = 3)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Build the document-term matrix with tm, using the trigram tokenizer
dtm <- DocumentTermMatrix(Corpus(VectorSource(texts)), control = list(weighting = weightTf, tokenize = TrigramTokenizer))
as.matrix(dtm)

# Labels for the five texts, then hand the matrix back to RTextTools
isText <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
container <- create_container(dtm, isText, virgin = FALSE, trainSize = 1:3, testSize = 4:5)
models = train_models(container, algorithm = c("SVM", …
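To make it clearer what a word-trigram tokenizer feeds into the document-term matrix, here is a minimal base-R sketch. The helper `word_ngrams` is hypothetical and written only for illustration; it is not part of RWeka, tm, or RTextTools, and its lower-casing and punctuation handling are simplifying assumptions that may differ from what `NGramTokenizer` does.

```r
# Hypothetical helper (not from RWeka/tm): split a string into word n-grams.
word_ngrams <- function(text, n = 3) {
  # Lower-case, strip punctuation, then split on whitespace
  words <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Texts shorter than n words yield no n-grams
  if (length(words) < n) return(character(0))
  # Slide a window of n words over the token vector
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

trigrams <- word_ngrams("This is the first document.")
print(trigrams)
# "this is the" "is the first" "the first document"
```

Each distinct trigram would become one column of the matrix, which is why even a handful of short texts produces a wide and sparse `dtm`.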

#Hipster hairstyles and #Bluecard are correlated

I just came across the following relationship: according to Google Correlate, search queries in Germany for the terms "EU blue card" and "hipster frisuren" (hipster hairstyles) correlate very strongly. Do highly qualified migrants in Germany head straight to the hairdresser? More likely a nice spurious correlation...
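One common way such spurious correlations arise is a shared time trend. The following sketch uses made-up data, not the Google Correlate series: two independent random series that both grow over time correlate strongly, and the association largely disappears once the trend is removed by differencing. The slopes and the seed are arbitrary assumptions.

```r
set.seed(42)
weeks <- 1:200

# Two hypothetical, unrelated series that both happen to grow over time
series_a <- 0.05 * weeks + rnorm(200)
series_b <- 0.05 * weeks + rnorm(200)

# Correlation of the raw series is driven by the shared trend
raw_cor <- cor(series_a, series_b)

# First differences remove the trend; what is left is independent noise
diff_cor <- cor(diff(series_a), diff(series_b))

print(c(raw = raw_cor, differenced = diff_cor))
```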

My favorite #statisticsjoke

Was @RitaOra #hacked? Probably not really... #Twitter-Forensics

According to news.com.au, Rita Ora announced on Twitter that she would release a new song if the tweet got 100K retweets. The tweet was then deleted, and Rita Ora claimed her account had been hacked. The idea of Twitter forensics is to verify such claims. Luckily, I saved Rita's tweets before they were deleted... The first tweet had the id 528090474041856000, but it has since been deleted. Two other tweets were deleted as well: "RT @niallsrita: where her 3.9m followers at when you need them smh" and "Let's just do it bots! Dropping a new song Monday keep re tweeting!! #boooooom". Interestingly, another tweet was posted from Instagram just 10 minutes after the first one and six minutes before the cited RT: "Morning!!! Happy Halloween! Bring costume to work day! #thevoiceuk http://t.co/NzLnfLKV3g". It does not look very professional to start such a campaign and then just change the subject. According to the meta-dat
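One piece of metadata survives deletion as long as the tweet ID is known: Twitter's Snowflake IDs embed the creation time in their upper bits (milliseconds since a Twitter-specific epoch, followed by 22 lower bits). This is background knowledge about the ID format, not something the post states, and `tweet_id_to_time` is a hypothetical helper. A minimal sketch:

```r
# Twitter's Snowflake epoch in milliseconds since 1970-01-01 UTC
twitter_epoch_ms <- 1288834974657

# Hypothetical helper: recover the creation time embedded in a tweet ID.
tweet_id_to_time <- function(id) {
  # The upper bits hold milliseconds since the Twitter epoch; dropping the
  # low 22 bits is the same as an integer division by 2^22.
  # R stores the ID as a double, which loses a few low-order bits, but
  # those bits do not carry the timestamp.
  ms_since_epoch <- floor(id / 2^22)
  as.POSIXct((ms_since_epoch + twitter_epoch_ms) / 1000,
             origin = "1970-01-01", tz = "UTC")
}

created <- tweet_id_to_time(528090474041856000)
print(format(created, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
```

For the ID quoted above, this lands on the morning of 31 October 2014 (UTC), which fits the "Happy Halloween" tweet posted minutes later.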

#SimulatedData #R #caret

I just noticed a very cool function in the caret package that I would like to share. The package can produce simulated data, which is very useful for Monte Carlo simulations or when you just want to try something out... In addition, the psych package has one of the best correlation tables.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1)
df <- twoClassSim(5000, intercept = -13)
## Loading required package: MASS
summary(df)
##    TwoFactor1       TwoFactor2        Linear01         Linear02
##  Min.   :-4.940   Min.   :-5.017   Min.   :-4.303   Min.   :-3.683
##  1st Qu.:-0.956   1st Qu.:-0.969   1st Qu.:-0.696   1st Qu.:-0.668
##  Median : 0.015   Median :-0.021   Median :-0.045   Median : 0.007
##  Mean   : 0.002   Mean   :-0.010   Mean   :-0.022   Mean   : 0.014
##  3rd Qu.: 0.978   3rd Qu.: 0.974   3rd Qu.: 0.645   3rd Qu.: 0.669
##  Max.   : 5.076   Max.   : 5.179   Max.   : 3.728   Max.
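For readers without caret installed, the general idea of simulating two-class data can be sketched in base R. This is not caret's `twoClassSim()` algorithm: `simulate_two_class` is a hypothetical helper, and its two predictors and coefficients are arbitrary assumptions chosen for illustration. It only shows how an intercept on the logit scale shifts the class balance.

```r
# Hypothetical helper, NOT caret's twoClassSim(): a bare-bones simulator.
simulate_two_class <- function(n, intercept = 0) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  # Linear predictor on the logit scale; coefficients are arbitrary choices
  eta <- intercept + 2 * x1 - 1 * x2
  prob <- plogis(eta)  # inverse logit gives P(Class1)
  y <- rbinom(n, size = 1, prob = prob)
  data.frame(x1 = x1, x2 = x2,
             Class = factor(ifelse(y == 1, "Class1", "Class2"),
                            levels = c("Class1", "Class2")))
}

set.seed(1)
sim <- simulate_two_class(5000, intercept = -1)
print(table(sim$Class))   # a negative intercept makes Class1 the minority
print(summary(sim[, c("x1", "x2")]))
```

Wrapping such a generator in `replicate()` is one way to build a small Monte Carlo study on top of it.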

What is wrong with US #energy #budget 2000?

The Policy Agendas Project (PAP) database shows an annual percentage change of MINUS 218 PER CENT (!!!) for the year 2000. I kept wondering how this could be. In 1999, budget authority for energy (subtopic code 270) was 981 million dollars. In 2000, budget authority was -1184 million dollars. So there were earnings instead of spending. Mathematically, the PAP data is correct. But why would the government suddenly make money with energy instead of paying for it? Or is it a mistake in the data? The subtopics of the PAP do not help much: We can see that the earnings come from the subtopic "energy supply" (code 271). To find out what is really going on, I wrote a little script to compare the PAP data with the official data from the Office of Management and Budget:

# comparing OMB energy budget with PAP energy budget
# OMB
OMB <- read.csv("http://www.gpo.gov/fdsys/pkg/BUDGET-2015-DB/csv/BUDGET-2015-DB-1.csv", as.is = TRUE)
OMB
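As a side note on the arithmetic, a percentage change below -100 per cent is only possible when the value changes sign. A minimal sketch using the two rounded figures quoted in the post (the variable names are my own):

```r
budget_1999 <- 981     # million dollars, as quoted in the post
budget_2000 <- -1184   # million dollars, as quoted in the post

# Year-over-year percentage change relative to the 1999 value
pct_change <- (budget_2000 - budget_1999) / budget_1999 * 100
print(round(pct_change, 1))
```

With these rounded inputs the result is about -220.7 per cent, close to but not exactly the -218 per cent reported by the PAP. The gap may come from rounding or from the dollar basis of the underlying values, which the post does not specify.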