You can’t have failed to notice the news surrounding Cambridge Analytica and its pooling of Facebook data. But what is Cambridge Analytica? Why does it exist? How did it apparently ‘breach’ Facebook and gather the data of tens of millions of profiles?
So far, there is no evidence of a data breach per se: no data was taken which was not publicly available. The word ‘breach’ has been used somewhat liberally; the headlines could more accurately read ‘Cambridge Analytica used Facebook against its agreed terms’. To add further confusion, Cambridge Analytica may have found a loophole in Facebook’s terms by technically getting that data from other sources like search engines, which is why it’s not an open-and-shut case.
However, if you have to use the term ‘loophole’ you should probably be assessing the morality of what you are doing in the first place.
This data gathering event happened roughly two years ago, and little information has been directly released by the ICO. However, there is nothing too groundbreaking or complex in what Cambridge Analytica has achieved—only that it sold what most knowledgeable technology and data professionals would regard as ‘pretty crappy’ data.
We can make good assumptions based on common analytical techniques and technical knowledge about what has happened. But just because things are technically possible doesn't mean you should do them.
What happened and what data was gathered?
Firstly, it’s important to note that nothing was technically taken without permission. So, how do you gather a data set of 50+ million people on a closed platform? People were incentivised to take a personality test, and according to the BBC around 270,000 people did so.
This test was taken inside a Facebook app which asked for permission to view people’s profiles, which included access to their friend list and pages they have liked. The app asked various questions and created a psychological (and political) ‘profile’ of individuals.
From my own personal point of view, a line starts to appear here which says ‘do not cross’. The app also saved the friends list and the URLs of friends’ Facebook pages. This is not hard to do and is a feature of Facebook app development, usually used for innocent purposes, for example if the app had a ‘recommend to a friend’ function.
However, instead of using the friends list for its intended purpose, a script would then visit all of those links to friends’ profiles and save any information about them which was public. This includes things like pages they had liked publicly and posts they had made publicly. It wouldn't have been able to read anything which was set to private.
Here’s where it gets a bit spurious and the ethics of data use come into play. The app created what is called a ‘data pool’ around one person: the person who took the original test. Assumptions were then made that the friends of that person would have similar characteristics and demographics, and hold similar views, to the person who took the test, especially if they had liked things in common.
Essentially, it predicted that the whole data pool around that individual would produce similar results if everyone in it were to take the same personality test. This is called a ‘look-a-like’ cluster, a common term used in customer segmentation and analysis.
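To make the idea of a ‘look-a-like’ cluster concrete, here is a minimal sketch of how a test taker's profile could be projected onto friends who share public likes. It is an illustration only, not a description of what Cambridge Analytica actually ran: the similarity measure (Jaccard overlap), the 0.25 threshold, the function names and the sample data are all assumptions made for this example.

```python
def jaccard(a, b):
    """Share of liked pages two people have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lookalike_cluster(seed_likes, seed_label, friends_likes, threshold=0.25):
    """Assign the seed's label to friends whose public likes overlap enough.

    The label is a guess, not a measurement: no friend took the test.
    The threshold is a hypothetical value chosen for illustration.
    """
    cluster = {}
    for friend, likes in friends_likes.items():
        score = jaccard(seed_likes, likes)
        if score >= threshold:
            cluster[friend] = (seed_label, score)
    return cluster


# Hypothetical data: one test taker and the public likes of three friends.
seed_likes = {"Foo Fighters", "Miss Congeniality", "Top Gear"}
friends = {
    "friend_a": {"Foo Fighters", "Top Gear"},  # two likes in common
    "friend_b": {"Gardening Weekly"},          # nothing in common
    "friend_c": set(),                         # no public likes at all
}

cluster = lookalike_cluster(seed_likes, "profile_X", friends)
print(cluster)
```

Only friend_a ends up in the cluster here, and the label attached to them is inherited from the test taker rather than observed, which is exactly why this kind of data is probabilistic.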
Data pools such as this can get very large, very quickly. At least 270,000 people were originally profiled. On average, people have 700 friends on Facebook. If only 30% of the people in each friends list had some public data available (they liked the Foo Fighters or watched Miss Congeniality, for example), that would create a data set of 56.7 million ‘profiles’ (270,000 × 700 × 0.3).
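The arithmetic behind that figure is easy to check. The sketch below simply multiplies the three numbers quoted above; the function and variable names are made up for this example, and the result is an upper bound because it assumes no two test takers share any friends.

```python
# Figures quoted in the article; treating them as exact is an assumption.
TEST_TAKERS = 270_000   # people who took the personality test
AVG_FRIENDS = 700       # average friends per person
PUBLIC_SHARE = 0.30     # share of friends with some public data


def estimated_profiles(test_takers, avg_friends, public_share):
    """Upper-bound estimate: ignores friends shared between test takers."""
    return round(test_takers * avg_friends * public_share)


total = estimated_profiles(TEST_TAKERS, AVG_FRIENDS, PUBLIC_SHARE)
print(f"{total:,}")  # prints 56,700,000
```

In practice friends lists overlap, so the number of distinct people would be lower than this estimate.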
In reality, this is what Facebook does with its marketing platform, but in a less crude and more controlled way. It actually has this data, so it doesn’t have to make assumptions or extrapolate data for tens of millions of people from a few hundred thousand, and it has strict rules for advertisers about how its data can be used.
As much as people like to beat up Facebook, generally speaking it is very good at keeping individuals’ data safe and enabling just enough to be used for targeting purposes, but not enough to compromise information about an individual’s personal, political or religious beliefs.
For example, it will enable people who are interested in cars to be targeted, but won’t provide information on who is likely to vote for a particular political party. It’s the latter that Cambridge Analytica set out to do, most likely without Facebook's direct knowledge.
How common is data pooling?
Data pooling is common practice, and new regulations such as GDPR seek to help control its misuse. Data pooling is most widely used in digital advertising such as display or video on demand.
Digital advertising uses the same methodology that Cambridge Analytica employed, only with less sensitive subjects, and it is more robust mathematically because it is more deterministic (based on actual data) than probabilistic (guessing based on similar data).
The Cambridge Analytica data set of tens of millions pales into insignificance compared with what profile-based display advertising networks have. Players in that space have at least a billion profiles; three, four, five billion or more is very common.
It is common for content providers and marketers to interrupt individuals across channels with commercial messaging; this is how content can be consumed for free. However, marketers also have the responsibility to make these messages as interesting for a consumer as possible while keeping consumers’ data safe.