
#RTutorial: Using R to Harvest the Twitter Streaming API

Initializing the Twitter API

This tutorial uses Twitter's so-called Streaming API. This API provides real-time access to Twitter, so the results depend on what is going on right now. Before we start, we have to initialize the Twitter API. Using the API requires a consumer key and a consumer secret, so you have to register as a developer and create a Twitter app. Create a Twitter account and sign in at https://apps.twitter.com/. The account has to be verified with a phone number; this can be done in the account settings on the Twitter webpage. Fill in a name, a description, and any valid URL with a leading “http://”. It is important NOT to provide a callback URL, because otherwise the registration from R will not work. Afterwards, you will see a summary of your newly created app with a link to “manage keys and access tokens”. The consumer key and consumer secret found there have to be copied into the following R script, which saves a permanent authentication token.

Julian Hillebrand has written a detailed explanation of the registration process on his blog.

# We use the package "streamR", which is designed to connect to the Twitter
# Streaming API.
# Packages are loaded into R with the command library() 
library(streamR)
# In addition, we need the "ROAuth" package to establish an
# authentication.
library(ROAuth)
# The following lines assign the right values to the variables that
# are needed for the API call.
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
# The strings within the quotation marks have to be replaced with your actual
# consumerKey and consumerSecret.
consumerKey <- "myconsumerkey1122"
consumerSecret <- "myconsumersecret112233445566"
# The next two lines establish a connection to the Twitter API.
# The system will print a URL which should be opened in a browser to receive
# a PIN. This PIN has to be entered in the R console.
my_oauth <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = requestURL,
                             accessURL = accessURL,
                             authURL = authURL)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
# Once the connection is established we can save it so that we do not have
# to repeat this process.
save(my_oauth, file = "my_oauth.Rdata")

Sometimes this approach fails because of an unexpected problem: Twitter checks the date and time of the computer from which the API request is sent. If the clock is not set correctly, the request is blocked – even a deviation of a few seconds may lead to an error. Once the token is saved, any later session analyzing Twitter can simply start by loading the connection data stored in the my_oauth file. The necessary command is:

load("my_oauth.Rdata")
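A convenient pattern for later sessions (a minimal sketch reusing the objects defined above) is to load the saved token when it exists and to run the handshake only on the first run:

if (file.exists("my_oauth.Rdata")) {
  # Reuse the stored authentication token.
  load("my_oauth.Rdata")
} else {
  # First run: create the token, perform the handshake, and store it.
  my_oauth <- OAuthFactory$new(consumerKey = consumerKey,
                               consumerSecret = consumerSecret,
                               requestURL = requestURL,
                               accessURL = accessURL,
                               authURL = authURL)
  my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
  save(my_oauth, file = "my_oauth.Rdata")
}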

Searching on the Twitter API

By default, a search on the Streaming API runs endlessly: as long as the program is running, additional tweets are appended to the file specified by the “file” parameter. But we can set a “timeout” parameter that ends the search when the defined time span is reached.
The following code sets the parameters to search for half a minute for all tweets from California (loc defines a bounding box that roughly covers the state). For the later examples, the timeout was set to 30 minutes to increase the number of tweets.

file <- "tweets.json"
track <- NULL
follow <- NULL
loc <- c(-125, 30, -114, 42)
lang <- NULL
minutes <- 0.5
time <- 60 * minutes
tweets <- NULL
filterStream(file.name = file, 
             track = track,
             follow = follow, 
             locations = loc, 
             language = lang,
             timeout = time, 
             tweets = tweets, 
             oauth = my_oauth,
             verbose = TRUE)
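The same function can also filter by keyword instead of location. A minimal sketch (the keyword "rstats", the file name, and the 60-second timeout are arbitrary choices) using the track parameter of filterStream():

# Collect tweets containing the keyword "rstats" for 60 seconds.
filterStream(file.name = "rstats_tweets.json",
             track = "rstats",
             timeout = 60,
             oauth = my_oauth,
             verbose = TRUE)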

Cleaning the data

In our working directory we will now find a file named “tweets.json” (as set by the “file” parameter). A JavaScript Object Notation (JSON) file is structured like a list: all the information on every tweet is written there, but we do not yet have a table with rows and columns. Fortunately, the streamR package has a function that converts the JSON data into a conventional table:

tweets.df <- parseTweets(file)
# Now we can inspect the table and save it.
View(tweets.df)
save(tweets.df, file = "tweetsDF.RDATA")

The data is now nicely structured in a table with every tweet in one row and 42 columns for all variables.
To demonstrate the importance of data cleaning and manipulation, we will now create an additional column with the hashtags used in the tweets. Hashtags (every word in a tweet starting with the sign “#”) are a very important element of the Twitter structure (Java et al. 2007). In this example, we want to find out whether the usage of hashtags shows regional differences. To find the hashtags in the tweets, we use regular expressions. Regular expressions are a common way to specify patterns of character strings for search or replacement operations. In this case, we are looking for the character # followed by an undefined number of other alphabetical or numerical characters. Written as a regular expression, this pattern becomes “#[:alnum:]+”. The class [:alnum:] means “any alphanumerical character”, and the + symbol means that the preceding item will be matched one or more times.
To create a column with the hashtags, the string extraction function str_extract() from the package “stringr” is used. Note that str_extract() returns only the first match, i.e. the first hashtag of each tweet.

library(stringr)
tweets.df$hashtags <- str_extract(tweets.df$text, "#[:alnum:]+")
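To see what the pattern matches, here is a toy example (the sample text is made up); str_extract() returns only the first hashtag, while str_extract_all() returns all of them:

# A made-up example tweet to illustrate the pattern.
example <- "Sunny day at the beach #california #surf"
str_extract(example, "#[:alnum:]+")      # "#california" (first match only)
str_extract_all(example, "#[:alnum:]+")  # a list containing all hashtags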

In R, every object belongs to a specified class. For example, the columns that contain the numbers of friends or followers are assigned to the class numeric; the computer therefore expects only numbers in these columns. Right now, all columns containing text are assigned to the class character. But for the hashtags column, something else is needed: since we want to count the frequencies of the various hashtags to look for trends, the hashtags should be treated as a categorical variable. This is achieved by using the class factor.

tweets.df$hashtags <- as.factor(tweets.df$hashtags)
summary(tweets.df$hashtags)
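With many distinct hashtags, the plain summary becomes hard to read. A small sketch (the cutoff of ten is arbitrary) that lists only the most frequent ones:

# Show the ten most frequent hashtags in decreasing order.
head(sort(summary(tweets.df$hashtags), decreasing = TRUE), 10)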

The summary function returns the different hashtags and their frequencies. We can do the same with the column full_name. This column contains the name of the location the tweet was sent from, or – in case the specific location is not shared on Twitter – the name of the user’s hometown.

tweets.df$full_name <- as.factor(tweets.df$full_name)
summary(tweets.df$full_name)

The advantage of converting these columns into categorical variables (class factor) is that we can now use them directly in any kind of prediction model. But before we start testing the hypothesis, a final step in the cleaning process has to be done: as can be seen in the summary of the full_name column, there are many different locations. If we ran a model on the complete data, this would require a lot of computational power. We will therefore reduce the data so that only the most common places are included. This task is a very good example for introducing R-style indexing. The following four lines of code show the solution to this task:


library(plyr)
sel <- count(tweets.df$full_name)$x[count(tweets.df$full_name)$freq>100]
sel <- tweets.df$full_name %in% sel
test.df <- tweets.df[sel,]

The idea is to create an object sel that indicates which rows of the original table should be selected. The criterion for selection is that the location specified in full_name is mentioned more than 100 times in all the tweets. To identify these locations, it is necessary to count the frequencies of all the different levels of our categorical variable, much as the summary function did. The best command for this is count() from the plyr package. The function count() returns a table (an object of class data.frame) with a column x containing the names (i.e., the locations) and a column freq containing the frequencies. A very convenient feature of R programming is that a function call can be treated in the same way as the object it returns. In our example, this means we can index the count() call directly. Any column of a data.frame can be addressed by its name, with the syntax data.frame$columnName. We have already used this kind of indexing with tweets.df$full_name. Now we can address the x column of the count() call in the same way: count(tweets.df$full_name)$x. This command alone returns all names of locations in alphabetical order.
There is a second way of indexing R objects. Within square brackets “[ ]” we can specify which elements of an object are relevant. This can be done either by indexing positions with numbers (e.g. [1], vectors of numbers [c(1,3,5,6)], or sequences [c(1:6)]) or with a vector of TRUE/FALSE values. The latter is done in the example above: the second call of the count() function asks which rows have a frequency above 100 and returns a TRUE or FALSE value for each row. The first command line therefore finds the names of the locations with a frequency above 100 and stores these names in the object sel.
The next line then asks which location names are elements of the vector sel. The answer to this question is again a vector of TRUE/FALSE values, with one entry per tweet. This object is stored as sel, which overwrites the existing object. At first glance, overwriting an existing object might seem confusing. But the goal in the example was to create a selector that specifies the most common locations; the first line did not finish this job but contained an intermediate step. To avoid constructing too many objects – and thereby increasing the complexity of the program – it makes sense to overwrite objects until the desired form is reached.
The final line creates a new object test.df containing all the tweets from tweets.df that are indexed by the sel object. The careful reader will notice the comma in the command tweets.df[sel,]. Tables (like tweets.df) are multi-dimensional objects: they consist of rows and columns. To index an element in a multi-dimensional object, all dimensions must be addressed. The sel object specifies the rows; the value after the comma specifies the columns. Since we want all columns, we do not specify this value at all, but we still need the comma. Indexing objects in R can be a confusing experience at the beginning, but the different methods described here turn out to be very consistent once the underlying principles are known. Being able to understand complex indexing is a key competence for reading R code and handling different data structures.
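As a toy illustration of both indexing styles (the vector and the threshold are invented):

# A small named vector of invented frequencies.
freqs <- c(LA = 150, Fresno = 20, SF = 300)
# Name-based filtering, as in the count() example above:
names(freqs)[freqs > 100]    # "LA" "SF"
# Logical indexing with a TRUE/FALSE vector:
freqs[c(TRUE, FALSE, TRUE)]  # the first and third element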

Finally, a last step of cleaning should be applied to the new table we want to test our assumptions on:


test.df$full_name <- droplevels(test.df$full_name)
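The reason is that a factor keeps all levels of the original data, even those that no longer occur after subsetting. A minimal sketch with invented data shows what droplevels() does:

# A factor with three levels, subset to drop one of them.
f <- factor(c("LA", "SF", "LA", "Fresno"))
f_sub <- f[f != "Fresno"]
levels(f_sub)             # still "Fresno" "LA" "SF"
levels(droplevels(f_sub)) # only "LA" "SF"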

Testing the hypothesis

Since R is a statistical environment, a major focus of its development has been on tools for statistical testing. In this section, linear regression serves as an example, but the capabilities of R go far beyond this. The hypothesis of this tutorial is that there are regional differences on Twitter. If there are regional differences, our cleaned variable full_name should have an effect on followers_count.

fit <- lm(followers_count ~ full_name, data=test.df)
summary(fit)

In this example, the effect of each location is significant (at least one “*” for every predictor). One might therefore argue that we have evidence for a regional effect. But the explanatory power of the model is extremely limited (adjusted R² of ca. 0.001) and the model as a whole is not significant (p-value ca. 0.15).
We can now try to add other predictor variables, e.g. friends_count. The idea would be that the number of followers depends on the location and on the number of friends a user has. This can be done with the following commands:

fit2 <- lm(followers_count ~ full_name + friends_count, data=test.df)
summary(fit2)
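To check whether friends_count actually improves the model, the two nested fits can be compared with a standard F-test; a minimal sketch:

# Compare the nested models: does friends_count improve the fit significantly?
anova(fit, fit2)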

Visualization

The following code is quite complex: we have to create several objects to construct a map of California and repeat some of the cleaning steps to reduce the complexity for plotting. To make these commands easier to understand, comments have been added directly to the code. The reader is encouraged to copy this example and experiment with different parameter settings.

# Two additional packages are needed:
library(ggplot2)
library(grid)
# Note: map_data() below also requires the "maps" package to be installed.
# Create an object containing the boundaries of California as
# longitude and latitude.
map.data <- map_data("state", region=c("california"))
# We only need the longitude and latitude values of the tweets.
# These are put in a new object.
points <- data.frame(x = as.numeric(tweets.df$place_lon), 
                     y = as.numeric(tweets.df$place_lat))
# This line is needed for the second plot, when hashtags are added.
points$hashtags <- tweets.df$hashtags
# The next lines remove points that are not specified or that lie
# too far away from California.
points[!is.na(tweets.df$lon), "x"] <- as.numeric(tweets.df$lon)[!is.na(tweets.df$lon)]
points[!is.na(tweets.df$lat), "y"] <- as.numeric(tweets.df$lat)[!is.na(tweets.df$lat)]
points <- points[(points$y > 25 & points$y < 42), ]
points <- points[points$x < -114,]
# The following code creates the graphic.
mapPlot <- ggplot(map.data) + # ggplot is the basic plotting function used.
  # The following lines define the map-areas.
  geom_map(aes(map_id = region), 
           map = map.data, 
           fill = "white", 
           color = "grey20", 
           size = 0.25) +  
  expand_limits(x = map.data$long, 
                y = map.data$lat) + 
  # The following parameters could be altered to insert axes, title, etc.
  theme(axis.line = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank(), 
        axis.title = element_blank(), 
        panel.background = element_blank(), 
        panel.border = element_blank(), 
        panel.grid.major = element_blank(), 
        plot.background = element_blank(), 
        plot.margin = unit(0 * c(-1.5, -1.5, -1.5, -1.5), "lines")) + 
  # The next line plots points for each tweet. Size, transparency (alpha) 
  # and color could be altered.
  geom_point(data = points, 
             aes(x = x, y = y), 
             size = 2, 
             alpha = 1/20, 
             color = "steelblue")
 
mapPlot # This command plots the object.


This visualization shows us the spatial distribution of tweets. Some of the points are not in California because the coordinates we sent to Twitter are not that exact. In metropolitan areas like Los Angeles or San Francisco, more people seem to use Twitter – which is of course not very surprising. Nevertheless, the spatial distribution of tweets can be very important for many different questions, especially when combined with other filters. For example, we can ask which hashtags are used in which region.
Following the same cleaning procedure as described above, we can identify the hashtags that occur more than twice in our dataset.

sel <- count(points$hashtags)$x[count(points$hashtags)$freq>2]
sel <- sel[!is.na(sel)]
sel <- points$hashtags %in% sel
hashs <- points[sel,]
hashs <- hashs[!duplicated(hashs$x),]

This new table hashs can now be used to add the hashtags to the existing plot. Since we created an object of its own containing the plot (mapPlot), we can now add additional features to this plot without re-running all the code.

mapPlot2 <- mapPlot +
  geom_text(data = hashs, 
            aes(x = x, y = y, label = hashtags), 
            position = position_jitter(width=0, height=1),
            size = 4, 
            alpha = 1/2, 
            color = "black") 
 
mapPlot2
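If the plot should be kept, it can be written to disk with ggplot2's ggsave(); a minimal sketch (the file name and dimensions are arbitrary):

# Save the annotated map as a PNG file.
ggsave("california_hashtags.png", mapPlot2, width = 8, height = 6, dpi = 300)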
