
The Social Science One Facebook Cooperation: A Systemic Failure

This blog post is a detailed background statement for this comment published in Nature: https://www.nature.com/articles/d41586-020-00828-5.

Fast-Read

The following text argues that the Social Science One initiative, celebrated by Facebook as a success, is a systemic failure. Privacy concerns are instrumentalized to withhold data from independent researchers while similar data remains available to private companies. The hurdles for researchers at Social Science One are set so high that it is nearly impossible to study the effect of social media on elections and democracy. This fits Facebook's strategic interest: to appear cooperative while controlling the outcomes of research. Social Science One is structured in a way that makes it easy for Facebook to instrumentalize the organization. We end with policy recommendations. An appendix shows that the dataset delivered by Facebook is more or less useless.

Background

Facebook has announced the release of a dataset that is shared with the research community via the independent organization Social Science One (SS1) (https://research.fb.com/blog/2020/02/new-privacy-protected-facebook-data-for-independent-research-on-social-medias-impact-on-democracy/). As described in the latest blog post by SS1, Facebook is only offering data to the research community under differential privacy constraints (https://socialscience.one/blog/unprecedented-facebook-urls-dataset-now-available-research-through-social-science-one, SS1-Blog). In the codebook, differential privacy is explained in terms of “plausible deniability”: “Differential privacy provides plausible deniability to people whose information is included in the data set. In this case, it’s impossible to determine—in a way that is significantly better than random guessing—whether a specific user took an action in these data, because differential privacy makes it impossible to isolate a specific row” (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TDOAPG, Codebook, 7). In the case of the URL-dataset, differential privacy means that random noise is added to the columns containing user information. The noise is sampled from a Gaussian distribution with mean zero and a standard deviation that has to be calibrated to the data distribution and the level of privacy that should be assured. In general, this method is a promising way to bridge the gap between privacy concerns and the research community’s need for access to data. But in the specific case of the URL-dataset recently shared via SS1, it has severely degraded the quality of the data without contributing anything substantial to users’ privacy.
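As a minimal sketch of the mechanism (not Facebook's actual implementation, and with made-up numbers), this is what adding zero-mean Gaussian noise to a count column looks like:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy example: true like-counts for five URLs (not real data)
true_likes = np.array([3000, 12, 0, 450, 7])

# Gaussian mechanism: add zero-mean noise whose standard deviation has to be
# calibrated to the data and the desired privacy level; sigma = 10 is only an
# illustrative value, not the one Facebook used for every column
sigma = 10
noisy_likes = true_likes + rng.normal(loc=0.0, scale=sigma, size=true_likes.shape)

print(noisy_likes)                            # individual counts are distorted ...
print(true_likes.mean(), noisy_likes.mean())  # ... while the mean is roughly preserved
```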

Whose Privacy?

The “trick” Facebook is using is hidden in the level of analysis. In other settings, where differential privacy might be useful, the level of analysis is the individual user. Normally, this means the data is structured in tables containing users and the corresponding user information in rows. As noted in the quote above, differential privacy makes it impossible to isolate a specific row - i.e., a user - and if anyone were identified in the data set, it would be more likely that this person is not the person with the attributes listed in this specific row. But the URL-dataset looks different. Here, each row-entry - or level of analysis - is represented by “the unique user-URL-action tuple, which can occur in the data only once” (Codebook, 8). There are two types of tables in the dataset linking URLs to user-actions: The URL Attributes Table contains general information on the URL, such as the domain, the date on which it has been shared on Facebook, labeling by third-party fact-checkers, and the three variables spam_usr_feedback, false_news_usr_feedback and hate_speech_usr_feedback. These last fields have been made differentially private. The second type of table is the so-called Breakdown Table. Here, we find differentially private information on the number of views, clicks, shares, comments and Facebook reactions (likes, loves, wows, hahas, sads, angrys), all broken down into buckets of users by country, age, gender and (for the US) political affinity. A sample of the data can be found here: http://bit.ly/FullURLsEG. Having a look at the structure really helps to understand the following argumentation.
None of the tables contains any information that is relevant to users’ privacy, if we take “plausible deniability” as the benchmark. Let’s say I am a female Facebook user, aged 25-35, in the US, and I liked a post including a URL mocking Trump. This behavior is stored in the dataset and linked to a URL. But from this dataset, it is impossible to identify me as the user. Let’s assume I am the only user in the bracket with this specific behavior. Still, I cannot be identified. The like could have been set by any of the 125 million monthly active users on Facebook. If I denied it, the chance that it was not me would be 1 - 1/125 million, which is very plausible. Differential privacy does not change anything for the user in this situation. Let’s say 3,000 users liked the same URL. Under differential privacy, this number is now changed randomly by adding Gaussian noise. The number could rise, for example, to 3,890 or drop to 136 (while the mean of all numbers in this column stays the same). How does this change affect my privacy? The only thing we can say now is that the real number might differ from the figure presented in the dataset, but this does not make it harder (or easier) to identify any single user.

Mathematical Confusion

Differential privacy is a complex approach, and Facebook will probably disagree with the argumentation above and stress that the aim of differential privacy is to make it “impossible to determine whether or not information about the action exists in the dataset at all—again, in a way substantially better than random guessing, and where ‘substantially’ is precisely quantified by a privacy parameter” (Codebook, 7). As explained above, it is true that in the actual dataset with differential privacy nobody can tell whether the totally anonymous like that represents, e.g., the 345th like of a URL exists in the dataset at all. And without differential privacy, the number 345 would mean that there had been exactly 345 likes for this specific URL. But in the quotation, Facebook has silently changed the level of analysis from user-URL-action tuple to action.
The codebook is formulated in a way that makes everyone think about users’ actions and users’ privacy, while the mathematical approach is focused on the privacy of the URL! Or more precisely (in the User-Activity-Table), the privacy of a URL in combination with a country variable and a bracket of users (the rows of the table). A good example of the confusion in the codebook is the following sentence: “The privacy protections are ‘action level’ (rather than the more familiar ‘user-level’ differential privacy) in that the granularity of what is protected is not a single user, but rather a single action, or user-URL-action (e.g. a user sharing a specific URL)” (Codebook, 8). There is a mathematical relation between the privacy of the unit (which is the user-URL-action tuple) and the single entries in this vector (in our case, the user-actions). Facebook uses an approach by Bun and Steinke (https://arxiv.org/pdf/1605.02065.pdf, Bun/Steinke 2019) called zero-concentrated differential privacy (zCDP). The authors clarify that the level of analysis referred to in their method is the single row in the dataset: “As is typical in the literature, we model a dataset as a multiset or tuple of n elements (or ‘rows’) in X^n, for some ‘data universe’ X , where each element represents one individual’s information” (Bun/Steinke 2019). The privacy that is made differential by Facebook is not the privacy of the users but the privacy of the URLs!
To understand what this means, I will try to explain the process of zCDP in non-mathematical terms: In a typical setting, we would have a table with rows for each individual and columns containing information about these individuals. The goal is that no individual (i.e., an entry in a row) can be identified via the presented information. To achieve this goal, random noise is added to all columns where differential privacy should be achieved. Random noise here means that every single entry is made bigger or smaller by adding a number that is sampled from a Gaussian distribution with mean zero. The mean of each column therefore does not change, but each entry now looks different. The problem is to estimate the necessary amount of noise to make it impossible to identify any individual row in the dataset. Thinking backwards, no individual row can be identified if, for any information in the final dataset, it is more likely that this information is a result of the random mechanism than that it was indeed present in the original dataset. To measure “more likely”, the differentially private data, as well as the original data, is considered to result from a probability distribution. This means there is a mathematical formula to calculate the probability of each value that could appear. The two probability distributions have to be different enough that the goal formulated above is reached, and Bun/Steinke 2019 provide a mathematical formulation for this problem as well as a way to calculate the amount of noise that is needed.
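To make the noise calibration concrete: in the Bun/Steinke framework, adding Gaussian noise with standard deviation σ to a quantity that any single unit can change by at most Δ (the “sensitivity”) satisfies ρ-zCDP with ρ = Δ²/(2σ²). A minimal sketch of solving this for σ, where the privacy budget ρ is our illustrative assumption and not the value Facebook used:

```python
import numpy as np

def gaussian_noise_scale(sensitivity: float, rho: float) -> float:
    """Noise standard deviation for the Gaussian mechanism under rho-zCDP.

    Bun/Steinke: Gaussian noise with scale sigma gives
    (sensitivity**2 / (2 * sigma**2))-zCDP, so we solve for sigma.
    """
    return sensitivity / np.sqrt(2 * rho)

# Illustrative assumptions, not taken from the codebook:
sensitivity = 1.0   # one user-URL-action tuple changes a count by at most 1
rho = 0.005         # hypothetical privacy budget

print(gaussian_noise_scale(sensitivity, rho))
# -> 10.0, the same order of magnitude as the sigma reported in the codebook
```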
Applied to the URL-dataset, this approach means that it is more likely that a specific combination of the numbers of views, likes, etc. is a result of the differential privacy process than that it was present in the original data. Thereby, it is now impossible to identify any URL from the users’ behavior. So, if anyone knows that there is a URL that has been shared X times, viewed Y times, and liked Z times, there is no way to use this information to identify the URL in the dataset.
But there is no real-world scenario for this problem. In the URL-dataset, the links of the URLs are given. There is no privacy of URLs. What Facebook is doing here is like throwing differential privacy noise on census data while at the same time keeping a column with all the names and addresses of the individuals.

Different Levels of Privacy

Facebook is not only publishing the names of the URLs whose metrics have been made differentially private; the company also provides a tool to find out which user has interacted in which way with a URL. Facebook offers a service called CrowdTangle, where anyone with access can directly search for URLs and then get much of the information about users’ interactions without differential privacy noise. The screenshot shows the results of a search for the first URL in the sample dataset.
How is it possible that similar information is highly sensitive with regard to privacy when integrated into the URL-dataset but harmless when retrieved via Facebook’s CrowdTangle service?
And it becomes even more absurd: In CrowdTangle, I can click on any post in the list (like the NYTimes post in the screenshot), and there I will find the Facebook names of all users who have reacted to the post (see screenshot).
The SS1 blog states that Facebook argues that the GDPR would require differential privacy measures. If this were the case, Facebook would have to shut down CrowdTangle immediately; and not only CrowdTangle but the regular Graph API as well. This API has been shut down for most researchers (https://developers.facebook.com/docs/apps/review/) but is still open to private companies (including the company Crimson Hexagon (now merged with Brandwatch) co-founded by SS1 chair Gary King). I cannot think of any legal argument for why sharing anonymized user-actions is problematic and has to be treated with differential privacy, while sharing non-anonymized user-actions on posts or URLs is considered a non-violation of the same legal framework.
There are even more inconsistencies. The URL-dataset only contains URLs that have been shared at least 100 times by public pages, and the data only goes back to January 2017. Under the headline “privacy protection” the codebook says: “being represented in our data requires having interacted with a URL shared at least approximately 100 times” (Codebook, 7). Why is this necessary when Facebook and SS1 argue that their above-mentioned differential privacy “guarantee is quantified by a precise mathematical bound” (Codebook, 7)? If the data were differentially private (as Facebook claims), there would be no need to choose such a small sample of the data (the 38 million URLs are only 0.1 percent of the data promised in the first announcement).
And again: If this limit is relevant for the URL-dataset, why does it not apply to CrowdTangle, where you can get information on any URL that has been shared publicly (like go.tum.de/980525, for example)?
And further restrictions apply to the URL-dataset only. It is not possible to download the dataset. Instead, all analysis has to be done in a special tool behind a Facebook firewall. Only researchers vetted by SS1 have access to this “research tool”. In contrast, other APIs are open to a much broader audience. Even third parties like the company mentioned above, Crimson Hexagon, provide access to the data they have collected from Facebook via special APIs (https://apidocs.crimsonhexagon.com/reference), while researchers at SS1 are not even allowed to share data within the research community.
To apply zCDP, Facebook has changed the data for the URL-dataset in an uncommon way: Instead of reporting the total numbers of user-actions, only the first like, share, etc. that a user has left on Facebook is integrated into the dataset. The justification for this is: “we define the unit of analysis as the unique user-URL-action tuple, which can occur in the data only once. This generally amounts to de-duplicating actions taken in the data. For example, if a user liked a post with the same URL more than once, we take the first instance and discard others” (Codebook, 8). Here, in my opinion, the authors of the codebook get confused by their own lack of clarity regarding the level of analysis. The user-URL-action tuple is unique as long as there are no two URLs with the same meta-data; how many times a user has interacted with a URL is irrelevant. But regardless of whether this step is necessary to guarantee differential privacy (which, as argued above, does not seem justified anyway), it only leads to another distortion of the signal in the URL-dataset.
And finally, data included in the URL-dataset is limited to the timespan between January 2017 and August 2019. 
These additional limitations of the URL-dataset do not fit the differential privacy narrative at all. If Facebook argues that the above-described differential privacy steps are necessary to guarantee users’ privacy, then they are, by definition, also sufficient. And therefore, any additional step is unnecessary.

Effects of the Special Protection of the URL-Dataset

Within the group of researchers with access to the URL-dataset, the biggest concern has been that differential privacy will make it impossible to run sound analyses on the dataset. Methods usually used for big data analysis cannot be trusted on the differentially private dataset. This mistrust is especially true for non-parametric approaches like k-nearest neighbours, support vector machines or neural networks with stochastic gradient descent optimization. Within these models, small subsets of the data (strongly distorted because of differential privacy) have a strong influence on the fit. Evans and King have proposed a solution for linear models that could work on the dataset (https://gking.harvard.edu/dpd, Evans/King 2020). Whether it is a good idea to run linear regression models on time-series data that is not normally distributed in the first place remains to be shown. In the appendix we show that the amount of noise makes it impossible to identify trends in the noisy data. But without detrending, a linear model is not the right method.
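A small simulated illustration of this point (all numbers are assumptions, not the real data): a weak upward trend in a log-normally distributed daily series is clearly measurable before adding noise, but the uncertainty of the estimated slope grows by more than an order of magnitude once Gaussian noise of the magnitude reported in the codebook is added.

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(365)

# Assumed toy series: a weak upward trend on top of log-normal daily counts
clean = 0.002 * days + rng.lognormal(mean=0.0, sigma=0.5, size=days.size)
noisy = clean + rng.normal(0, 10, size=days.size)   # sigma = 10, as for most columns

for name, series in (("clean", clean), ("noisy", noisy)):
    coef, cov = np.polyfit(days, series, 1, cov=True)
    print(f"{name}: slope = {coef[0]:.4f} +/- {np.sqrt(cov[0, 0]):.4f}")
# The slope uncertainty on the noisy series is so much larger that a trend of
# this size can no longer be established reliably.
```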
Besides these methodological questions, it is quite clear what cannot be done with this dataset in regards to the overall topic of the initiative: social media, elections and democracy.
By making the URLs “differentially private” in the above-described manner, Facebook has built a firewall against most research results that could shed light on the harmful effects of the platform on elections: 
Everyone familiar with disinformation campaigns will agree that we are talking about a small set of potentially harmful content. While the noise added to the data does not affect the mean of large samples, it is impossible to make any causal inference at the level of small samples or even individual URLs. Statements like “these 100 URLs have been among the 10 percent most shared” are no longer possible.
When talking about disinformation or fake news, it is crucial to understand how shares by private Facebook pages distribute this information. The dataset only contains URLs that have been shared at least 100 times by public pages. Without any doubt, this introduces a bias into the sample, and many, probably most, relevant URLs are not included. In addition, there is no justification for this precise threshold: Why 100 times and not just 1 time, or why not 146 times? Besides, any threshold that somehow mirrors the relevance of a post or URL would have to be country-specific: In Germany, there are fewer Facebook users than in the US. Therefore, a URL that has been shared publicly only 99 times in Germany might be more relevant than a URL that has been shared 200 times in the US.
Research on social media and elections requires interdisciplinary teams. Experts from political science, computer science, mathematics and other disciplines like psychology and communication science should work together. Because of the restricted access to the dataset, it is nearly impossible to integrate experts from other disciplines in case they are not part of the project right from the start.
When studying inauthentic user behavior on Facebook, hyperactive users - users who produce far more interactions than the median user - are essential (https://www.sciencedirect.com/science/article/pii/S2468696419300886, Papakyriakopoulos et al. 2020). But Facebook has decided to count only the first interaction. This reduction introduces another bias that is probably very relevant for the topic and most likely puts researchers in the SS1 track at a disadvantage compared to researchers working with other data sources. And there is a second problem with this approach: The way user-actions are counted in the URL-dataset differs from all data other researchers might have access to. Therefore, it is nearly impossible to compare results from the SS1 project with any other research or to reproduce the results without access to the URL-dataset. This makes it very easy to deny the relevance of anything researchers might find in this dataset.
And finally, the time-span of the dataset is the “elephant in the room”. Based on the discussions in politics and science, the 2016 US presidential election is the most important case for studying the effect of social media on elections and democracy. But this case is not included in the dataset, and neither is the Brexit campaign.
These issues - especially the inconsistency between the different levels of privacy - are known to Facebook and SS1 and have been raised by researchers, including our team. The now-released dataset shows that Facebook did not attempt to “heal” these shortcomings.

The Political Level

The Cambridge Analytica scandal and the pressure after the 2016 election have put Facebook in a very uncomfortable position. Mark Zuckerberg testified before Congress and promised to share data with researchers. Now Facebook says: “This release delivers on the commitment we made in July 2018 to share a data set that enables researchers to study information and misinformation on Facebook” (https://research.fb.com/blog/2020/02/new-privacy-protected-facebook-data-for-independent-research-on-social-medias-impact-on-democracy/, Facebook Research 2020). From a political science perspective, Facebook faces two threats at the moment. If the political system - not only in the US but also in other countries - is not convinced that Facebook is doing enough to safeguard elections - and allowing independent research has been identified as an essential part of such an approach - this might result in strict regulations. But if independent research came to the conclusion that Facebook had harmed democracy, political reactions could be even worse.
The best strategy for Facebook, therefore, is to integrate independent research and, at the same time, to control the conditions in a way that makes it doubtful that unwanted results pop up. And if that happened, strategies of plausible deniability should be developed right from the beginning. In my honest opinion, this is exactly what we are witnessing right now in the SS1 case. And it fits perfectly into a pattern some critics see in Facebook’s strategy for dealing with similar situations: Facebook is criticised, Mark Zuckerberg apologizes and promises to improve. Then Facebook becomes active by introducing minor steps that are promoted with immense lobbying efforts but do not change anything, since the mistake is a logical consequence of Facebook’s business model and not caused by an error.

The Instrumentalization of SS1

We have shown that Facebook inadequately applies differential privacy and that researchers in the SS1 track face artificial limitations that do not hold for other official data sources of Facebook. Therefore, these limitations cannot be justified with concerns about users’ privacy. Additionally, they have a highly plausible and relevant effect when it comes to studying the particular case of social media and democracy. We also showed that it fits a possible rational strategy for Facebook to engage in sharing data with researchers while at the same time trying to prevent them from finding anything suspicious.
In the following paragraphs, we will explain how SS1 could be instrumentalized for such an agenda. We are not saying that SS1 as an organization is aware of this instrumentalization. On the contrary, we believe that especially the two chairs of SS1 are fighting very hard to get the most out of this deal for the research community. Nevertheless, the events around SS1 would fit very well into a playbook on how to manipulate independent research.
According to an article on protocol.com citing Gary King (https://www.protocol.com/facebook-data-sharing-researchers, Lapowsky 2020), the whole SS1 initiative started with a phone call from Mark Zuckerberg immediately after the Cambridge Analytica scandal. “In 2018, days after the Cambridge Analytica news broke, King says he got a call from Facebook CEO Mark Zuckerberg asking him to study Facebook's impact on the election. But King recalls that Zuckerberg was reluctant to give him access to all of the data he needed and allow him to publish his findings without Facebook's approval. As an alternative, King and Zuckerberg devised a plan to open data up to outside researchers and allow Social Science One to vet their research proposals” (Lapowsky 2020). A few days later, on 9 April 2018, SS1 was announced (https://socialscience.one/faq-fb, SS1-FAQ). In the announcement, Gary King and Nate Persily laid out the structure of SS1: “In our structure, the first group is composed of independent academics who apply to study specific topics, are awarded data access, and have the freedom to publish without firm approval. The second group, serving as a trusted third party, includes senior, distinguished academics who sign non-disclosure agreements with the firm, forego the right to publish based on the data in return for complete access to the data and all other necessary information from the firm. This trusted third party thus provides a public service by certifying to the academic community the legitimacy of any data provided (or reporting publicly that the firm has reneged on its agreement)” (https://gking.harvard.edu/files/gking/files/partnerships.pdf, King/Persily 2019, 2-3). The “core group” that leads SS1 is described as follows: “The co-chairs of Social Science One are initially appointed by the firm and the nonprofit foundations. The co-chairs appoint other members with input from, but no decision-making authority by, the firm and foundations. The core group members, who sign restrictive confidentiality agreements, are compensated at fair market rates and, in highly charged partisan or otherwise sensitive environments, are paid by nonprofit foundations independent of the firm” (King/Persily 2019, 4).
Putting the facts together: SS1 goes back to an initiative of Mark Zuckerberg himself. He asked Gary King to do the analysis, but King did not accept the conditions. Both then agreed very quickly on an alternative concept. In this concept, the “trusted third party” 1) has signed restrictive non-disclosure agreements, and 2) is appointed directly by Facebook or by the people Facebook has selected (with input from Facebook).
In scientific organizations, there are usually two governance mechanisms for deciding on the leadership of the organization: one is competition, meaning an open position is announced and an elected commission decides in favour of one of the candidates; the second is a democratic election in which the relevant scientific community votes for a candidate. Appointment by a private company is normally not counted as a governance mechanism that secures the independence of a scientific organization. In addition, in the case of SS1 this is not just an initial appointment: No governance structure limits the power of the SS1 “commission” - there are no elections of members, no reporting duties and so on. Even being a researcher with a project in the SS1 track does not come with any participation rights within SS1.
This seems to be an ideal type of Facebook’s engagement with the scientific community. We have witnessed a similar strategy at our own university, the Technical University of Munich. Facebook is financing an Institute for Ethics in Artificial Intelligence, and all public statements claim that this institute is independent from Facebook. TUM still claims: “The new TUM Institute for Ethics in Artificial Intelligence is supported by Facebook without further specifications” (https://www.tum.de/nc/en/about-tum/news/press-releases/details/35188/). But leaked contracts show that the company has handpicked the director of this institute (https://www.sueddeutsche.de/muenchen/muenchen-tu-finanzierung-facebook-1.4723566). Similar to the SS1 case, this director does not have to be re-elected, researchers have no official status or influence in the institute, and there are no duties guaranteeing a minimum level of transparency.
The next “feature” of this kind of cooperation is an unequal distribution of resources. On the SS1 webpage, I could not find any information on how the organization itself is financed (besides the compensation of the core group members). The only hint is the sentence: “Social Science One is being incubated at Harvard's Institute for Quantitative Social Science, which is directed by Gary King” (https://socialscience.one/overview). In contrast, Facebook states: “Over the past two years, we have dedicated more than $11 million and more than 20 full-time staff members to this work, making Facebook the largest contributor to the project” (Facebook Research 2020). I believe it is very hard to “control” Facebook in such a setting.
In the SS1 case, there is another problematic coincidence: Gary King is co-founder of Crimson Hexagon, a data analytics company. Alex Pasterneck wrote in an article for fastcompany.com: “For years, King, director for the Institute for Quantitative Social Science at Harvard University, had been working closely with Facebook to gather its data for his research, and strategizing ways that Facebook could share more data with academics. A company he founded, Crimson Hexagon, has claimed to possess more social media data than any entity except for Facebook itself” (https://www.fastcompany.com/90412518/facebooks-plan-for-radical-transparency-was-too-radical).
But the company was suspended from access to the Facebook API in June 2018 (https://www.fastcompany.com/90219826/why-did-facebook-re-friend-a-data-firm-that-raised-spying-concerns). Some media outlets reported that Crimson Hexagon was being investigated by Facebook because of privacy concerns: “Facebook said early on that there was no evidence Crimson Hexagon had accessed non-public data, but the investigation centered around whether the startup was enabling customers to use social media data for surveillance” (https://www.bizjournals.com/boston/news/2018/08/17/facebook-lifts-suspension-on-crimson-hexagons.html). Some weeks later, the company’s access to the API was restored. According to FastCompany.com, Facebook said: “We appreciate their cooperation and look forward to working with them in the future” (https://www.fastcompany.com/90219826/why-did-facebook-re-friend-a-data-firm-that-raised-spying-concerns).
This means, in theory, that Facebook has leverage over Gary King. Cutting Crimson Hexagon off from access to Facebook would probably affect the value of the company. In addition, Crimson Hexagon was at this time probably already in negotiations with Brandwatch; the merger of the two companies was first announced in October 2018.
I am not at all saying that this situation had any influence on the work SS1 did. But to my understanding, this is typically called a conflict of interest. SS1 and Gary King did not make this conflict public until the media reported on it. There is not even a governance structure within SS1 for dealing with such conflicts of interest.
In addition, Gary King has used the Crimson Hexagon infrastructure for his research, as can be seen in his most influential publication on social media (https://gking.harvard.edu/files/censored.pdf, King et al. 2013). Limiting this company’s access to Facebook, therefore, also limits the research capacities of Gary King and his team at Harvard. Many researchers would be happy if access to the regular Facebook API were granted again and would prefer this to access to the URL-dataset. King still has this valuable access, while the research community is kept quiet with a vague promise to publish noisy data via SS1.
Finally, it is not clear whether SS1 is independent of the research community, as claimed in the concept. In the majority of the current projects, researchers from SS1 are involved. They are acting on both sides: the research community and SS1. Also, there is the danger that researchers could become dependent on SS1: It was promised several times in internal meetings that the teams now engaged will be granted access to new SS1 datasets without any project application. These promises create an incentive not to criticize SS1 - and they undermine the claim of strict peer review on a project basis. The latest development is that nearly all SS1 teams (excluding ours, of course) have agreed to work on a coordinated publication effort with a major scientific journal. To secure this opportunity to publish results that do not even exist yet (the data has been available for only a few weeks), everyone was asked by SS1 to agree not to publish any results or talk to the media until all studies are completed and the publisher has made a decision.

What should be done now?

In our honest opinion:
  • SS1 should publicly report that Facebook has reneged on its agreement.
  • SS1 should enter a moratorium to reflect on its own role and to make transparent all direct and indirect connections to Facebook.
  • Facebook should make the URL-dataset in its current version public. SS1 and Facebook have agreed that there are no privacy concerns, so the data can be shared with the whole research community.
  • Facebook should publish an improved URL-dataset without differential privacy, including all URLs that have been shared publicly at least once, going back to at least January 2016.
  • Facebook should stop appointing chairs or directors for scientific organizations.
  • The scientific community should reject all publications based on the URL-dataset, because without access to the data nothing is reproducible, and without independent access to the raw data no one can tell whether the dataset really contains what Facebook promises.
  • Politicians all over the world should think about how platforms like Facebook can be encouraged to work together with independent researchers and what kind of regulation is needed for this.
  • We still do not know whether Facebook and other platforms have a negative effect on our democracies. We cannot analyze this question because we do not have the data. If the platforms cannot prove that these concerns can be dismissed, strict regulations are necessary.

Appendix: How bad is the URL-dataset?

Following the above argumentation, the quality of the URL-dataset is ultimately not relevant, because it has become clear that Facebook is not cooperating with the research community. Nevertheless, some researchers might be tempted to use the dataset anyway, simply because it is available now. The following text evaluates the effect of the differential privacy noise. The codebook reports the noise parameters for every differentially private column (Codebook, 11). Two variables are essential for evaluating how much the Gaussian noise altered the original signal. The parameter σ indicates the standard deviation of the noise. The parameter k marks the 99th percentile of the original distribution: “For example, for shares, we first compute the total number of URLs shared by each user. We then compute the differentially private 99th percentile and round up to the nearest positive integer to get k” (Codebook, 11).
To understand the effect of the noise, it is essential to note that Facebook data is not normally distributed. The absence of a normal distribution means we do not find the bell shape of a Gaussian distribution where all the data points nicely group around the mean. Instead, data on Facebook is closer to a log-normal distribution (Papakyriakopoulos et al. 2020). Most users do not like anything on Facebook, but some users like many posts. In the dataset, this means that there are many rows where most of the user-actions have the value zero, while at the same time there are some URLs with very high values.
Adding Gaussian noise with mean zero does not change the mean of the original data. Therefore, we can extract this information from the dataset. We drew ten random samples with replacement (bootstrapping), each containing 0.01 percent of the data, from each differentially private column and then calculated the mean and the standard error of these values. The following plot shows the results:
The means of the variables are close to zero, except for views with a mean of 6.6. The error bars are so small that they are not visible, except, again, for views. These values already support our hypothesis that the original data is log-normally distributed: Most values are very small, but the variance - especially for views - seems to be quite high.
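A minimal sketch of this bootstrap (the array passed in is a placeholder for one differentially private column, which we cannot share):

```python
import numpy as np

def bootstrap_mean(values: np.ndarray, n_draws: int = 10, frac: float = 0.0001,
                   seed: int = 1) -> tuple[float, float]:
    """Mean and standard error of the mean over repeated samples drawn with replacement."""
    rng = np.random.default_rng(seed)
    size = max(1, int(frac * values.size))
    means = np.array([rng.choice(values, size=size, replace=True).mean()
                      for _ in range(n_draws)])
    return float(means.mean()), float(means.std(ddof=1) / np.sqrt(n_draws))

# Hypothetical usage, where `clicks` would be one column of the breakdown table:
# mean, stderr = bootstrap_mean(clicks)
```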
In signal processing, there is a measure for estimating the quality of a signal, called the signal-to-noise ratio (SNR) (https://en.wikipedia.org/wiki/Signal-to-noise_ratio). The idea is to divide the expected value of the signal by the expected value of the noise. In our case, the expected value of the signal is the mean. If the noise’s mean is zero - as in our case - it is common to take the variance of the noise as the denominator. So we come up with the following formula:

SNR = μ / σ²

where μ is the mean of the original signal (and of the noisy data as well) and σ is the standard deviation of the Gaussian noise.
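In code, the ratio is a one-liner; the numbers below are purely illustrative (the views mean of 6.6 from our bootstrap, combined with an assumed σ of 10):

```python
def signal_to_noise(mean_signal: float, sigma_noise: float) -> float:
    """SNR for zero-mean noise: mean of the signal divided by the noise variance."""
    return mean_signal / sigma_noise ** 2

print(signal_to_noise(6.6, 10.0))   # 0.066 -> already far more noise than signal
```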
An SNR higher than 1 indicates more signal than noise, while an SNR smaller than 1 means more noise than signal. The following plot shows the SNR for the variables:

It is easy to see that there is hardly any signal left in the data. Even in the “best case” of shares_no_click, the noise is approximately 4,000 times higher than the original signal. This does not come as a surprise to us: The problem is the log-normal distribution of the data. The distribution is so extremely skewed that the mean is not a robust estimate for the values, and the variance in the data is so high that the noise needed to hide the high view values is enormous. We pointed out this problem in a workshop with Facebook in the summer of 2019.
We can derive a different indicator illustrating the same problem. As described above, 99 percent of the data in a column is equal to or smaller than k. We can simulate the distribution of the noise by taking random samples from a Gaussian distribution with mean zero and the given σ. For most of the variables, k is one and σ is 10 (because k is rounded up to the next integer). From the sampled noise, we can derive the empirical cumulative distribution function (ECDF), which tells us the probability that the noise stays below a specific value. With σ = 10, the ECDF at the value -1 is 0.46, and at 1 it is 0.54. The probability of the noise being between -1 and 1, therefore, is only 0.08. In 91 percent of all cases, the noise has overwritten the original signal (the probability that the absolute value of the noise is bigger than 1 is 0.92; 99 percent of the entries are 1 or smaller; 0.99 * 0.92 = 0.91).
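Instead of simulating samples, the same numbers follow directly from the Gaussian CDF (a sketch using the σ = 10 and k = 1 mentioned above):

```python
from scipy.stats import norm

sigma = 10    # noise standard deviation for most columns
k = 1         # 99 percent of the entries are smaller than or equal to k

p_inside = norm.cdf(k, scale=sigma) - norm.cdf(-k, scale=sigma)   # ~0.08
p_overwritten = 0.99 * (1 - p_inside)                             # ~0.91

print(round(p_inside, 2), round(p_overwritten, 2))
```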
These extreme results emerge from rounding up the parameter k. For columns with higher user-action counts on average, the results are more moderate. For clicks, for example, 2 percent of the entries are overwritten by noise (but the noise is still extreme, as we will show).
We can go one step further: With the same bootstrap method we used to estimate the original mean, we calculate the standard deviation of the columns in the noisy dataset (see next figure).
Given that adding the noise has not changed the mean, we can now calculate the standard deviation of the original data with the following formula:

σ_Org = √(σ_URL² − σ_Noise²)

where σ_Org is the original standard deviation, σ_URL is the standard deviation measured in the noisy dataset and σ_Noise is the standard deviation of the noise.
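A sketch of this step with placeholder numbers (the real column values cannot be shown here):

```python
import numpy as np

def original_std(std_noisy: float, sigma_noise: float) -> float:
    """Standard deviation of the original data, assuming independent additive noise."""
    return float(np.sqrt(std_noisy ** 2 - sigma_noise ** 2))

print(original_std(10.5, 10.0))   # placeholder inputs, roughly 3.2
```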
Assuming the original data to be log-normally distributed (see Papakyriakopoulos et al. 2020), we can now simulate the data.
First, we have to transform the empirical standard deviation σ_Org and the mean μ we observed in the data into the parameters μ_lnorm and σ_lnorm with the following formulas:

μ_lnorm = ln(μ² / √(μ² + σ_Org²))
σ_lnorm = √(ln(1 + σ_Org² / μ²))
With these two parameters, we can now create samples following this distribution, which should be similar to the distribution of the original data.
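A sketch of this transformation and of drawing samples from the fitted distribution (the mean and standard deviation passed in are placeholders):

```python
import numpy as np

def lognormal_params(mean: float, std: float) -> tuple[float, float]:
    """Convert an empirical mean and standard deviation into log-normal parameters."""
    sigma_ln = np.sqrt(np.log(1 + std ** 2 / mean ** 2))
    mu_ln = np.log(mean ** 2 / np.sqrt(mean ** 2 + std ** 2))
    return float(mu_ln), float(sigma_ln)

rng = np.random.default_rng(7)
mu_ln, sigma_ln = lognormal_params(mean=0.5, std=3.2)       # placeholder moments
simulated = rng.lognormal(mean=mu_ln, sigma=sigma_ln, size=1000)
print(simulated.mean(), simulated.std())   # roughly reproduces the inputs, up to sampling noise
```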
Here is what a sample of 1,000 draws from this distribution looks like when we take the values from the clicks column:
The picture is not surprising. We have already seen that most entries are zero.
Now we add the same noise as in the URL-dataset and plot the new points in red:
This visualization again shows that hardly any signal is left after adding the noise. SS1 says they have delivered “trillions of numbers”. That is true. But the only thing you can do with this data is compare mean values of very large samples. From a data-science perspective, the URL-dataset is useless.
On the simulated data, we can compute a Spearman rank correlation. To preserve trends in the data, the ranking of the data points should stay more or less intact. Applying the Spearman rank correlation to 1 million samples from a distribution fitted to match the click distribution, and to the same data plus noise, results in a rho of 0.003 (with a p-value of 0.008). This means - again - that hardly anything of the original signal is preserved after the noise is added (rho ranges from -1 to 1, where 0 means no rank correlation and 1 a perfect positive rank correlation).
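For reference, a sketch of this final step; the log-normal parameters and the noise scale below are placeholders, so the exact rho will differ from the 0.003 we obtained on the distribution fitted to clicks:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)

# Placeholder parameters standing in for the distribution fitted to the clicks column
mu_ln, sigma_ln = -4.6, 2.1
sigma_noise = 10.0

original = rng.lognormal(mean=mu_ln, sigma=sigma_ln, size=1_000_000)
noisy = original + rng.normal(0.0, sigma_noise, size=original.size)

rho, p_value = spearmanr(original, noisy)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
# On the real fitted distribution this came out at rho = 0.003: the ranking of
# the data points is essentially destroyed by the noise.
```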
Facebook has worked on this dataset for more than a year now, yet they do not even provide the most basic statistics or tests. Any data scientist in the world would start with descriptive statistics and basic visualizations that are independent of any privacy issue. It is hard to believe that Facebook has no idea how bad the data is. It is also hard to believe that SS1 is not aware of these issues. Part of SS1’s deal with Facebook was that the core members get access to the real data. Either SS1 has never done this, or they are not reporting the results.

Personal Information

Simon Hegelich is professor for political data science at the Technical University of Munich (TUM). He leads the only research team at a German university in the Social Science One initiative. Simon takes part in the discussion round on digitalism and responsibility (https://about.fb.com/de/news/h/discussion-round-on-digitalism-and-responsibility/). In a project funded by the bidt, he is analyzing the potential of differential privacy (https://www.bidt.digital/bidt-foerdert-neun-forschungsprojekte-zur-digitalisierung/). Simon is not involved in the Institute for Ethics in Artificial Intelligence that Facebook has financed at TUM.

Fabienne Marco is a PhD student at the professorship for political data science at Technical University Munich and works in the SS1 project.

Joana Bayraktar is a research student assistant at the professorship for political data science at Technical University Munich and works in the SS1 project.

Morteza Shahrezaye is a post-doc researcher at the University of St. Gallen. He has a PhD in computer science from the Technical University of Munich.
