#cssws15 2nd GESIS Computational Social Science Winter Symposium 2015

Workshop on #R and Twitter

On 1st Dec. 2015 I will give a workshop at the GESIS Computational Social Science Winter Symposium 2015.

Using R to harvest the Twitter Streaming API

R is a free programming language and software environment for statistical computing and graphics. Because R is a full programming language, its use is not limited to statistics. Many R packages already cover the whole spectrum of social media analysis, from web scraping to text mining. This is a hands-on tutorial on analyzing Twitter data in R. In four steps, the tutorial demonstrates how to get the desired data, how to “clean” it, how to analyze it, and how to visualize the results. By following these steps, participants will learn about the general structure of R, its basic grammar, some relevant packages for social media analysis, the Twitter Streaming API, and some basic programming concepts. The tutorial addresses readers with little previous knowledge of programming. The aim is to demonstrate that the effort of learning a programming language, instead of using off-the-shelf solutions, is rewarded with greater flexibility for creative social media analysis. The R language has become one of the most prominent tools among statisticians and data miners. Because it is free software, and because of its outstanding capabilities, which are extended by a huge community of contributors, R has become the first choice for teaching statistics at many universities.

Example plot from the tutorial.

Data science is more than just statistics. Statisticians collect data very carefully in order to obtain representative samples, structured in nice and neat tables. Social media data is not structured this way. It is just out there somewhere on the internet, and we have to get it onto our computer and transform it into a form that is suitable for analysis. This first step normally involves the use of an API. In a second step, social media data normally has to be cleaned: there might be duplicates, missing values, or wrongly specified objects. In a third step, the cleaned data can be used to test hypotheses, to find hidden patterns, or to analyze it with descriptive statistics or more advanced machine learning algorithms. Whatever we find out in the end, it is very important to present the results in a way that drastically reduces the complexity of the original data. For this last step, data science has developed visualization tools. All four steps can be done directly in R.

In the tutorial we will connect to Twitter and get tweets located in California. This sample will then be analyzed and visualized. The research question for this example is the following: Are trends on Twitter regionally localized? On the one hand, Twitter is a global social media platform that connects people all over the world. It is therefore reasonable to argue that regional differences are not so important on Twitter. On the other hand, people use Twitter to communicate about what is going on in their real lives. Since real life takes place in a specific location, it is reasonable to argue that information on Twitter should show a lot of regional differences.
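To make the four steps a bit more concrete, here is a minimal sketch in base R. It is not the code from the tutorial: the `tweets` data frame is hypothetical toy data standing in for harvested tweets (so step one, the API call, is skipped), the column names are made up for illustration, and the California bounding box uses rough, assumed coordinates.

```r
# Step 1 (get): toy data standing in for harvested tweets.
# All values are hypothetical; the tutorial collects real tweets via the API.
tweets <- data.frame(
  id   = c(1, 2, 2, 3, 4, 5),
  text = c("Surf is up #beach", "Traffic again #LA", "Traffic again #LA",
           "Snow today #winter", NA, "Sunset at the pier #beach"),
  lon  = c(-118.5, -118.2, -118.2, -74.0, -122.4, -117.2),
  lat  = c(34.0, 34.1, 34.1, 40.7, 37.8, 32.7),
  stringsAsFactors = FALSE
)

# Step 2 (clean): drop duplicated tweet ids and rows with missing text.
clean <- tweets[!duplicated(tweets$id) & !is.na(tweets$text), ]

# Rough bounding box for California (assumed values, for illustration only).
in_ca <- clean$lon >= -124.5 & clean$lon <= -114.1 &
         clean$lat >= 32.5 & clean$lat <= 42.0
clean$region <- ifelse(in_ca, "California", "elsewhere")

# Step 3 (analyze): extract hashtags and count them per region.
tags <- regmatches(clean$text, gregexpr("#[[:alnum:]_]+", clean$text))
hashtags <- data.frame(
  region = rep(clean$region, lengths(tags)),  # repeat region once per hashtag
  tag    = tolower(unlist(tags)),
  stringsAsFactors = FALSE
)
counts <- table(hashtags$tag, hashtags$region)

# Step 4 (visualize): grouped bar chart of hashtag counts by region.
barplot(t(counts), beside = TRUE, legend.text = TRUE, las = 2,
        main = "Hashtag counts by region (toy data)")
```

If trends were strongly regional, we would expect the hashtag counts inside and outside the bounding box to look quite different. With real data the sample would of course have to be much larger than six toy rows.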


All participants should bring a laptop with R and RStudio installed, as well as a valid Twitter account.

Political Data Science
