Initializing the Twitter API
In this tutorial, the so-called Streaming API from Twitter is used. This API provides real-time access to Twitter, so the results depend on what is going on right now. Before we start, we have to initialize the Twitter API. To use it, a consumer key and a consumer secret are required. You therefore have to register as a developer who is creating a Twitter app. Create a Twitter account and then sign in at https://apps.twitter.com/. The account has to be verified with a phone number; this can be done on the Twitter webpage in the account settings. Fill in a name, a description and any valid URL with a leading “http://”. It is important NOT to provide a callback URL, because otherwise the registration from R will not work. After this, you will see a summary of your newly created app with a link to “manage keys and access tokens”. The consumer key and consumer secret found there have to be copied into the following R script to save a permanent authentication token.
Julian Hillebrand has written a detailed explanation of the registration process on his blog.
# We use the package "streamR", which is designed to connect to the
# Twitter Streaming API. Packages are loaded into R with library().
library(streamR)
# In addition, we need the "ROAuth" package to establish an authentication.
library(ROAuth)
# The following lines assign the right values to the variables that are
# needed for the API call.
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
# The strings within the quotation marks have to be replaced with the
# actual consumerKey and consumerSecret.
consumerKey <- "myconsumerkey1122"
consumerSecret <- "myconsumersecret112233445566"
# The next two commands establish a connection to the Twitter API.
# The system will print a URL which should be copied into a browser to
# receive a PIN number. This PIN has to be entered in the R console.
my_oauth <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = requestURL,
                             accessURL = accessURL,
                             authURL = authURL)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem",
                                        package = "RCurl"))
# Once the connection is established, we can save it so that we do not
# have to repeat this process.
save(my_oauth, file = "my_oauth.Rdata")
Sometimes this approach fails because of an unexpected problem: Twitter checks the date and time of the computer from which the API request is sent. If the time is not exactly correct, the request is blocked; even a deviation of a few seconds may lead to an error. In later sessions, any Twitter analysis can simply start from the connection data stored in the my_oauth file. The necessary command is:
load("my_oauth.Rdata")
Searching on the Twitter API
By default, a search on the Streaming API runs endlessly: as long as the program is running, additional tweets are appended to the file specified by the “file” parameter. But we can set a “timeout” parameter that ends the search when the defined time span is reached.
The following code specifies the parameters to search for one minute for all tweets from California. For the later examples, the timeout was set to 30 minutes to increase the number of tweets.
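The code itself is not preserved in this copy of the post. Based on the description, the call was probably similar to the following sketch; the bounding box coordinates for California are an assumption (they match the coordinate filters used later in the plotting section):

```r
# Load the stored authentication token (see above).
load("my_oauth.Rdata")
library(streamR)

# Collect tweets geolocated in a bounding box roughly covering California
# (south-west corner first, then north-east corner) for 60 seconds.
# For the later examples, timeout was set to 1800 (30 minutes).
filterStream(file.name = "tweets.json",
             locations = c(-125, 25, -114, 42),
             timeout = 60,
             oauth = my_oauth)
```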
Cleaning the data
In our working directory we will now find a file named “tweets.json” (as set by the “file” parameter). A JavaScript Object Notation (JSON) file is structured like a list: all the information of every tweet is written there, but we do not have a table with rows and columns yet. Fortunately, the streamR package provides the parseTweets() function to convert JSON data into a conventional table. The data is then nicely structured as a table with every tweet in one row and 42 columns for all variables.
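The conversion step, omitted in this copy of the post, is a single call to streamR's parser:

```r
library(streamR)

# Parse the collected JSON file into a data.frame with one row per tweet.
tweets.df <- parseTweets("tweets.json")
```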
To demonstrate the importance of data cleaning and manipulation, we will now create an additional column with the hashtags used in the tweets. Hashtags (every word in a tweet starting with the sign “#”) are a very important element of the Twitter structure (Java et al. 2007). In this example, we want to find out whether the usage of hashtags shows regional differentiation. To find the hashtags in the tweets, we use regular expressions. Regular expressions are a common way to specify patterns of character strings for search or replacement operations. In this case, we are looking for the character # followed by an undefined number of alphabetical or numerical characters. Written as a regular expression, this pattern becomes “#[:alnum:]+”. The class [:alnum:] means “any alphanumeric character”. The + symbol means: “the preceding item will be matched one or more times.”
To create a column with the hashtags, the string extract function str_extract() from the package “stringr” is used.
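The extraction itself is not shown in this copy; a self-contained sketch of the idea, with a toy vector standing in for the tweet texts, looks like this:

```r
library(stringr)

# Extract the first hashtag from each text; texts without a hashtag get NA.
texts <- c("Sunny day in #California today", "no tags here")
str_extract(texts, "#[:alnum:]+")
# returns "#California" NA

# Applied to the parsed tweets, this creates the new column:
# tweets.df$hashtags <- str_extract(tweets.df$text, "#[:alnum:]+")
```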
In R, every object belongs to a specified class. For example, the columns that contain the numbers of friends or followers are assigned to the class numeric; the computer therefore expects only numbers in these columns. Right now, all columns containing text are assigned to the class character. But for the hashtags column, something else is needed. Since we want to count the frequencies of the different hashtags to look for trends, the hashtags should be treated as a categorical variable. This is achieved by using the class factor.
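The conversion is a one-liner. Here is a self-contained sketch, with a toy vector standing in for the hashtags column:

```r
# A toy stand-in for tweets.df$hashtags:
tags <- c("#maps", "#rstats", "#maps", NA)

# Convert to a categorical variable.
tags <- as.factor(tags)
class(tags)
# returns "factor"
levels(tags)
# returns "#maps" "#rstats"

# Applied to the real data:
# tweets.df$hashtags <- as.factor(tweets.df$hashtags)
```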
The summary() function returns the different hashtags and their frequencies. We can do the same with the column full_name. This column contains the name of the location the tweet was sent from or, in case the specific location is not shared on Twitter, the name of the user’s hometown.
The advantage of converting these columns into categorical variables (class factor) is that we can now use them directly in any kind of prediction model. But before we start testing the hypothesis, a final step in the cleaning process has to be done: as can be seen in the summary of the full_name column, there are many different locations. If we ran a model on the complete data, this would require a lot of computational power. We will therefore reduce the data so that only the most common places are included. This task is a very good example for introducing R-style indexing. The following four lines of code show the solution to this task:
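The four lines themselves are missing from this copy of the post. Judging from the explanation that follows, they likely looked like this sketch (count() from the plyr package, the threshold of 100, and the object names sel and test.df are all taken from the surrounding text):

```r
library(plyr)

# Names of all locations mentioned more than 100 times.
sel <- count(tweets.df$full_name)$x[count(tweets.df$full_name)$freq > 100]

# For every tweet: is its location one of these common locations?
sel <- tweets.df$full_name %in% sel

# Keep only the tweets from the most common locations.
test.df <- tweets.df[sel, ]
```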
The idea is to create an object sel that indicates all rows of the original table that should be selected. The criterion for selection is that the location specified in full_name is mentioned more than 100 times in all the tweets. To identify these locations, it is necessary to count the frequencies of all the different levels of our categorical variable, similar to what the summary() function did. The best command for this is count(), which can be found in the plyr package. The function count() returns a table (an object of class data.frame) with a column x containing the names (i.e., the locations) and a column freq containing the frequencies.
A very clever feature of R programming is that function calls can be treated in the same way as the objects they return. This means that in our example we can index the count() call directly. Any column of a data.frame can be addressed by its name; the syntax for this is data.frame$columnName. We have already used this way of indexing with tweets.df$full_name. Now we can address the x column of the count() call in the same way: count(tweets.df$full_name)$x. This command alone returns all names of locations in alphabetical order.
There is a second way of indexing R objects. Within square brackets “[ ]” we can specify which elements of the object are relevant. This can be done either by indexing positions with numbers (e.g. [1], vectors of numbers [c(1,3,5,6)], or sequences [1:6]), or by a vector of TRUE/FALSE values. The latter is done in the above example: the second call of the count() function asks which frequencies have a value above 100 and returns a TRUE or FALSE value for each row. The command line above therefore finds the names of the locations that have a frequency above 100 and stores these names in the object sel.
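These indexing methods can be tried out on a small toy vector:

```r
x <- c(10, 20, 30, 40)

x[2]
# returns 20

x[c(1, 3)]
# returns 10 30

x[x > 25]
# returns 30 40
```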
The next line then asks which location names are elements of the vector sel. The answer to this question is again a vector of TRUE/FALSE values, with one value for every tweet. This object is stored as sel, which overwrites the existing object. At first glance, overwriting an existing object might seem quite confusing. But the goal in the example was to create a selector that specifies the most common locations. The first line did not finish this job but rather contained an intermediate step. To avoid constructing too many objects, and thereby increasing the complexity of the program, it makes sense to overwrite objects until the desired form is reached.
The final line now creates a new object test.df containing all the tweets from tweets.df that are indexed by the sel object. The careful reader will notice the comma in the command tweets.df[sel, ]. Tables (like tweets.df) are multi-dimensional objects: they consist of rows and columns. To index an element in a multi-dimensional object, all dimensions must be addressed. The sel object specifies the rows; the value after the comma specifies the columns. Since we want all columns, we do not specify this value at all, but we still need the comma. Indexing objects in R can be a confusing experience at the beginning, but the different methods described here turn out to be very consistent once the underlying principles are known. Being able to understand complex indexing is a key competence for understanding R code and handling different data structures.
Finally, a last step of cleaning should be applied to the new table we want to test our assumptions on:
test.df$full_name <- droplevels(test.df$full_name)
Testing the hypothesis
In this example, the effect of each location is significant (at least one “*” for every predictor). One might therefore argue that we have evidence for a regional effect. But the explanatory power of the model is extremely limited (adjusted R² of ca. 0.001) and the model as a whole is not significant (p-value ca. 0.15).
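The model call described above is missing from this copy. A regression of this kind, predicting the follower count from the location, might look like the following sketch (the exact formula is an assumption based on the surrounding text):

```r
# Predict the number of followers from the tweet's location.
model1 <- lm(followers_count ~ full_name, data = test.df)
summary(model1)
```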
We can now try to add other predictor variables, e.g. friends_count. The idea would be that the number of followers depends on the location and on the number of friends a user has. This can be done with the following commands:
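The commands themselves are not preserved here; following the description, they were probably similar to:

```r
# Add the number of friends as a second predictor.
model2 <- lm(followers_count ~ full_name + friends_count, data = test.df)
summary(model2)
```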
Visualization
The following code is quite complex. We have to create
different objects to construct a map of California and repeat some of the
cleaning steps to reduce the complexity for plotting. To understand these
commands, comments have been added directly to the code. The reader is
encouraged to copy this example and to experiment with different parameter
settings.
# Two additional packages are needed:
library(ggplot2)
library(grid)
# Create an object containing the boundaries of California as
# longitude and latitude.
map.data <- map_data("state", region = c("california"))
# We only need the long and lat values from the data.
# These are put in a new object.
points <- data.frame(x = as.numeric(tweets.df$place_lon),
                     y = as.numeric(tweets.df$place_lat))
# This line is needed for the second plot, when hashtags are added.
points$hashtags <- tweets.df$hashtags
# The next lines are just used to remove points that are not specified or
# are accidentally too far away from California.
points[!is.na(tweets.df$lon), "x"] <- as.numeric(tweets.df$lon)[!is.na(tweets.df$lon)]
points[!is.na(tweets.df$lat), "y"] <- as.numeric(tweets.df$lat)[!is.na(tweets.df$lat)]
points <- points[(points$y > 25 & points$y < 42), ]
points <- points[points$x < -114, ]
# The following code creates the graphic.
mapPlot <- ggplot(map.data) +  # ggplot() is the basic plotting function used.
    # The following lines define the map areas.
    geom_map(aes(map_id = region), map = map.data,
             fill = "white", color = "grey20", size = 0.25) +
    expand_limits(x = map.data$long, y = map.data$lat) +
    # The following parameters could be altered to insert axes, a title, etc.
    theme(axis.line = element_blank(),
          axis.text = element_blank(),
          axis.ticks = element_blank(),
          axis.title = element_blank(),
          panel.background = element_blank(),
          panel.border = element_blank(),
          panel.grid.major = element_blank(),
          plot.background = element_blank(),
          plot.margin = unit(0 * c(-1.5, -1.5, -1.5, -1.5), "lines")) +
    # The next line plots a point for each tweet. Size, transparency (alpha)
    # and color could be altered.
    geom_point(data = points, aes(x = x, y = y),
               size = 2, alpha = 1/20, color = "steelblue")
mapPlot  # This command plots the object.
This visualization shows us the spatial distribution of tweets. Some of the points are not in California, because the coordinates we sent to Twitter are not that exact. In metropolitan areas like Los Angeles or San Francisco, more people seem to use Twitter, which is of course not very surprising. Nevertheless, the spatial distribution of tweets can be very important for a lot of different questions, especially when combined with other filters. For example, we can ask which hashtags are used in which region.
Following exactly the same cleaning procedure as described
above, we can identify the hashtags that are used more than two times in our
dataset.
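The filtering code is omitted in this copy. Transferring the earlier selection pattern to the hashtags, it presumably looked like this sketch (the object name hashs is taken from the following paragraph):

```r
library(plyr)

# Names of all hashtags used more than two times.
sel <- count(points$hashtags)$x[count(points$hashtags)$freq > 2]

# Keep only the points whose hashtag is one of these common hashtags.
hashs <- points[points$hashtags %in% sel, ]
```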
This new table hashs can now be used to add the hashtags to the existing plot. Since we created an object of its own containing the plot (mapPlot), we can now add additional features to this plot without re-running all the code.
mapPlot2 <- mapPlot +
    geom_text(data = hashs, aes(x = x, y = y, label = hashtags),
              position = position_jitter(width = 0, height = 1),
              size = 4, alpha = 1/2, color = "black")
mapPlot2