
#RTutorial: Using R to Harvest the Twitter Streaming API

Initializing the Twitter API

In this tutorial, the so-called Streaming API from Twitter is used. This API provides real-time access to Twitter, so the results depend on what is actually going on right now. Before we start, we have to initialize the Twitter API. Using it requires a consumer key and a consumer secret, so you have to register as a developer who is creating a Twitter app. Create a Twitter account and then sign in at https://apps.twitter.com/. The account has to be verified with a phone number; this can be done on the Twitter webpage in the account settings. Fill in name, description, and any valid URL with a leading “http://”. It is important NOT to provide any callback URL, because otherwise the registration from R will not work. After this, you will see a summary of your newly created app with a link to “manage keys and access tokens”. The consumer key and consumer secret that can be found there have to be copied into the following R script to save a permanent authentication token.

Julian Hillebrand has written a detailed explanation of the registration process on his blog.

# We use the package "streamR", which is designed to connect to the
# Twitter Streaming API.
# Packages are loaded into R with the command library().
library(streamR)
# In addition, we need the "ROAuth" package to establish the
# authentication.
library(ROAuth)
# The following three lines assign the URLs that are needed for the
# OAuth handshake.
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
# The strings within the quotation marks have to be replaced with the
# actual consumerKey and consumerSecret.
consumerKey <- "myconsumerkey1122"
consumerSecret <- "myconsumersecret112233445566"
# The next two lines establish a connection to the Twitter API.
# The system will print a URL that should be opened in a browser to receive a PIN.
# This PIN has to be entered in the R-console.
my_oauth <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = requestURL,
                             accessURL = accessURL,
                             authURL = authURL)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
# Once the connection is established, we can save it so that we do not
# have to repeat this process.
save(my_oauth, file = "my_oauth.Rdata")

Sometimes this approach fails because of an unexpected problem: Twitter checks the date and time of the computer from which the API request is sent, and if the clock is not set correctly, the request is blocked. Even a deviation of a few seconds may lead to an error. Any later session analyzing Twitter can then simply start by loading the connection data stored in the my_oauth file. The necessary command is:

load("my_oauth.Rdata")
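In practice, both steps can be combined. The following is a minimal sketch (assuming the my_oauth object from the script above has been created with OAuthFactory$new): reuse the saved credentials if the file exists, otherwise run the handshake once and save the result.

# A minimal sketch: reuse saved credentials if they exist, otherwise
# run the OAuth handshake once and save the result for later sessions.
if (file.exists("my_oauth.Rdata")) {
  load("my_oauth.Rdata")
} else {
  my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem",
                                          package = "RCurl"))
  save(my_oauth, file = "my_oauth.Rdata")
}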

Searching on the Twitter API

By default, a search on the Streaming API runs endlessly: as long as the program is running, additional tweets are appended to the file specified by the “file” parameter. We can, however, specify a “timeout” parameter that ends the search once the defined time span is reached.
The following code specifies the parameters to search for half a minute for all tweets from California. For the later examples, the timeout was set to 30 minutes to increase the number of tweets.

file <- "tweets.json"
track <- NULL
follow <- NULL
loc <- c(-125, 30, -114, 42)
lang <- NULL
minutes <- 0.5
time <- 60 * minutes
tweets <- NULL
filterStream(file.name = file, 
             track = track,
             follow = follow, 
             locations = loc, 
             language = lang,
             timeout = time, 
             tweets = tweets, 
             oauth = my_oauth,
             verbose = TRUE)
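The Streaming API can also filter by keyword instead of location. The following minimal sketch uses the track parameter of filterStream (the search term "rstats" and the file name are just placeholder examples):

# Collect tweets containing a keyword for 60 seconds.
filterStream(file.name = "keyword_tweets.json",
             track = "rstats",
             timeout = 60,
             oauth = my_oauth,
             verbose = TRUE)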

Cleaning the data

In our working directory we will now find a file named “tweets.json” (as set by the “file” parameter). A JavaScript Object Notation (JSON) file is structured like a nested list: all the information of every tweet is written there, but we do not have a table with rows and columns yet.
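To get a feeling for this raw format, we can peek at the first collected tweet in the file (an optional check, not part of the original pipeline):

# Optional: inspect the raw JSON of the first collected tweet.
readLines(file, n = 1)

Fortunately, the streamR package has a function to convert the JSON data into a conventional table: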

tweets.df <- parseTweets(file)
# Now we can inspect the table and save it.
View(tweets.df)
save(tweets.df, file = "tweetsDF.RDATA")

The data is now nicely structured in a table, with every tweet in one row and 42 columns for all variables.
To demonstrate the importance of data cleaning and manipulation, we will now create an additional column with the hashtags used in the tweets. Hashtags (every word in a tweet starting with the sign “#”) are a very important element of the Twitter structure (Java et al. 2007). In this example, we want to find out whether the usage of hashtags shows regional differences. To find the hashtags in the tweets, we use regular expressions. Regular expressions are a common way to specify patterns of character strings for search or replacement operations. In this case, we are looking for the character # followed by an undefined number of alphabetical or numerical characters. Written as a regular expression, this pattern becomes “#[:alnum:]+”. The expression [:alnum:] means “any alphanumeric character”, and the + symbol means that the preceding item will be matched one or more times.
To create a column with the hashtags, the string extract function (str_extract) from the package “stringr” is used.

library(stringr)
tweets.df$hashtags <- str_extract(tweets.df$text, "#[:alnum:]+")
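Note that str_extract() keeps only the first hashtag of each tweet. If all hashtags are of interest, str_extract_all() returns them as a list, one character vector per tweet (a minimal sketch, not used in the rest of this tutorial):

# Optional: extract all hashtags per tweet instead of only the first.
all_tags <- str_extract_all(tweets.df$text, "#[:alnum:]+")
head(all_tags)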

In R, every object belongs to a specified class. For example, the columns that contain the numbers of friends or followers are assigned to the class numeric; the computer therefore expects only numbers in these columns. Right now, all columns containing text are assigned to the class character. But for the hashtags column something else is needed: since we want to count the frequencies of the diverse hashtags to look for trends, the hashtags should be treated as a categorical variable. This is achieved by using the class factor.

tweets.df$hashtags <- as.factor(tweets.df$hashtags)
summary(tweets.df$hashtags)

The summary function returns the different hashtags and their frequencies. We can do the same with the column full_name. This column contains the name of the location the tweet was sent from, or – in case the specific location is not shared on Twitter – the name of the user’s hometown.

tweets.df$full_name <- as.factor(tweets.df$full_name)
summary(tweets.df$full_name)
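With many distinct levels, summary() lumps the rarer ones into an “(Other)” category. To list the most frequent values first, the counts can be sorted (an optional sketch, not part of the original tutorial):

# Optional: the ten most frequent hashtags and locations.
head(sort(table(tweets.df$hashtags), decreasing = TRUE), 10)
head(sort(table(tweets.df$full_name), decreasing = TRUE), 10)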

The advantage of converting these columns into categorical variables (class factor) is that we can now use them directly in any kind of prediction model. But before we start testing the hypothesis, a final step in the cleaning process has to be done: as can be seen in the summary of the full_name column, there are many different locations. If we ran a model on the complete data, this would require a lot of computational power. We will therefore reduce the data so that only the most common places are included. This task is a very good example to introduce R-style indexing. The following four lines of code show the solution for this task:


library(plyr)
sel <- count(tweets.df$full_name)$x[count(tweets.df$full_name)$freq>100]
sel <- tweets.df$full_name %in% sel
test.df <- tweets.df[sel,]

The idea is to create an object sel that indicates all rows of the original table that should be selected. The criterion for selection is that the location specified in full_name is mentioned more than 100 times in all the tweets. To identify these locations, it is necessary to count the frequencies of all levels of our categorical variable, in a similar way to what the summary function did. A convenient command for this is count() from the plyr package. The function count() returns a table (an object of class data.frame) with a column x containing the names (i.e., the locations) and a column freq containing the frequencies. A very convenient feature of R programming is that function calls can be treated in the same way as the objects they return. In our example, this means we can index the result of the count function directly. Any column of a data.frame can be addressed by its name, with the syntax data.frame$columnName. We have used this way of indexing already with tweets.df$full_name. Now we can address the x column of the count call in the same way: count(tweets.df$full_name)$x. This command alone returns all names of locations in alphabetical order.
There is a second way of indexing R objects. Within square brackets “[ ]” we can specify which elements of the object are relevant. This can be done either by position, using numbers (e.g. [1], vectors of numbers [c(1,3,5,6)], or sequences [1:6]), or by a vector of TRUE/FALSE values. The latter is done in the above example: the second call of the count function asks which frequencies have a value above 100 and returns a TRUE or FALSE for each row. The first command line therefore finds the names of the locations that occur more than 100 times and stores these names in the object sel.
The next line then asks which location names are elements of the vector sel. The answer to this question is again a vector of TRUE/FALSE values with the length of the total number of tweets. This object is stored as sel, which overwrites the existing object. At first glance, overwriting an existing object might seem quite confusing, but the goal in the example was to create a selector that specifies the most common locations. The first line did not finish this job; it contained an intermediate step. To avoid constructing too many objects – and thereby increasing the complexity of the program – it makes sense to overwrite objects until the desired form is reached.
The final line creates a new object test.df containing all the tweets from tweets.df that are indexed by the sel object. The careful reader will notice the comma in the command tweets.df[sel,]. Tables (like tweets.df) are multi-dimensional objects: they consist of rows and columns. To index an element in a multi-dimensional object, all dimensions must be addressed. The sel object specifies the rows; the value after the comma specifies the columns. Since we want all columns, we do not specify this value at all, but we still need the comma. Indexing objects in R might be a confusing experience at the beginning, but the different methods described here turn out to be very consistent once the underlying principles are known. Being able to understand complex indexing is a key competence for reading R code and handling different data structures.
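To make this indexing logic transparent, here is a self-contained toy example with made-up data. It follows the same pattern as the code above (using base R’s table() instead of plyr’s count()), just small enough to check by hand:

# A toy data.frame with five rows.
df <- data.frame(city  = c("LA", "SF", "LA", "SD", "LA"),
                 value = 1:5)
# Count how often each city occurs.
tab <- table(df$city)
# Names of the cities that occur more than once: "LA".
common <- names(tab[tab > 1])
# Logical vector, one entry per row: TRUE, FALSE, TRUE, FALSE, TRUE.
sel <- df$city %in% common
# Keep the selected rows and all columns (note the comma).
df[sel, ]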

Finally, a last step of cleaning should be applied to the new table we want to test our assumptions on: droplevels() removes the factor levels that no longer occur after the subsetting, so the rare locations do not linger as empty categories in the models below.


test.df$full_name <- droplevels(test.df$full_name)
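An optional check makes the effect visible:

# Optional: compare the number of factor levels before and after.
nlevels(tweets.df$full_name)  # all locations in the full data
nlevels(test.df$full_name)    # only the frequent locations remain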

Testing the hypothesis

Since R is a statistical environment, a major focus of its development has been on tools for statistical testing. In this chapter, linear regression serves as an example, but the capacities of R go far beyond this. The hypothesis of this tutorial is that there are regional differences on Twitter. If there are regional differences, our cleaned variable full_name should have an effect on followers_count.

fit <- lm(followers_count ~ full_name, data=test.df)
summary(fit)

In this example, the effect of each location is significant (at least one “*” for every predictor). Therefore, one might argue that we have evidence of a regional effect. But the explanatory power of the model is extremely limited (adjusted R² of ca. 0.001) and the model as a whole is not significant (p-value ca. 0.15).
We can now try to add other predictor variables, e.g. friends_count. The idea would be that the number of followers depends on the location and on the number of friends a user has. This can be done with the following commands:

fit2 <- lm(followers_count ~ full_name + friends_count, data=test.df)
summary(fit2)
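Whether friends_count actually improves the model can be checked by comparing the two nested models with an F-test (a standard follow-up, not shown in the original tutorial):

# F-test comparing the nested models fit and fit2.
anova(fit, fit2)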

Visualization

The following code is quite complex. We have to create different objects to construct a map of California and repeat some of the cleaning steps to reduce the complexity for plotting. To explain these commands, comments have been added directly to the code. The reader is encouraged to copy this example and to experiment with different parameter settings.

# Two additional packages are needed (the map_data() call below also
# requires the "maps" package to be installed):
library(ggplot2)
library(grid)
# Create an object containing the boundaries of California as
# longitude and latitude.
map.data <- map_data("state", region=c("california"))
# We only need the longitude and latitude values of the tweets.
# These are put in a new object.
points <- data.frame(x = as.numeric(tweets.df$place_lon), 
                     y = as.numeric(tweets.df$place_lat))
# This line is needed for the second plot, when hashtags are added.
points$hashtags <- tweets.df$hashtags
# The next lines remove points that are not specified or that lie
# too far away from California.
points[!is.na(tweets.df$lon), "x"] <- as.numeric(tweets.df$lon)[!is.na(tweets.df$lon)]
points[!is.na(tweets.df$lat), "y"] <- as.numeric(tweets.df$lat)[!is.na(tweets.df$lat)]
points <- points[(points$y > 25 & points$y < 42), ]
points <- points[points$x < -114,]
# The following code creates the graphic.
mapPlot <- ggplot(map.data) + # ggplot is the basic plotting function used.
  # The following lines define the map-areas.
  geom_map(aes(map_id = region), 
           map = map.data, 
           fill = "white", 
           color = "grey20", 
           size = 0.25) +  
  expand_limits(x = map.data$long, 
                y = map.data$lat) + 
  # The following parameters could be altered to insert axes, title, etc.
  theme(axis.line = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank(), 
        axis.title = element_blank(), 
        panel.background = element_blank(), 
        panel.border = element_blank(), 
        panel.grid.major = element_blank(), 
        plot.background = element_blank(), 
        plot.margin = unit(0 * c(-1.5, -1.5, -1.5, -1.5), "lines")) + 
  # The next line plots points for each tweet. Size, transparency (alpha) 
  # and color could be altered.
  geom_point(data = points, 
             aes(x = x, y = y), 
             size = 2, 
             alpha = 1/20, 
             color = "steelblue")
 
mapPlot # This command plots the object.


This visualization shows us the spatial distribution of tweets. Some of the points are not in California, because the coordinates we sent to Twitter are not that exact. In metropolitan areas like Los Angeles or San Francisco, more people seem to use Twitter – which is of course not very surprising. Nevertheless, the spatial distribution of tweets can be very important for a lot of different questions, especially when combined with other filters. For example, we can ask which hashtags are used in which region.
Following exactly the same cleaning procedure as described above, we can identify the hashtags that are used more than twice in our dataset.

sel <- count(points$hashtags)$x[count(points$hashtags)$freq>2]
sel <- sel[!is.na(sel)]
sel <- points$hashtags %in% sel
hashs <- points[sel,]
hashs <- hashs[!duplicated(hashs$x),]

This new table hashs can now be used to add the hashtags to the existing plot. Since we stored the plot in an object of its own (mapPlot), we can add further features to it without re-running all the code.

mapPlot2 <- mapPlot +
  geom_text(data = hashs, 
            aes(x = x, y = y, label = hashtags), 
            position = position_jitter(width=0, height=1),
            size = 4, 
            alpha = 1/2, 
            color = "black") 
 
mapPlot2
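To keep the result, the plot object can be written to disk with ggsave(); the file name and dimensions here are just placeholders:

# Save the final map as a PNG file (width and height in inches).
ggsave("california_hashtags.png", plot = mapPlot2, width = 8, height = 6)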
