In a previous article, Nick Stringer from IAB UK explained how the EU wants to regulate the use of data on the Internet and why this might go too far when it comes to data-driven business models. My company is one of those data-driven businesses, and we help many publishers and websites across Europe deliver more relevant ads to their consumers in a privacy-friendly way, because our technology has strong data minimisation built in. The concept behind it is called ‘pseudonymisation’. It entered the policy discussion at a relatively late stage, even though it embodies some of the core principles of the proposed regulation.
Pseudonymisation describes a process in which recorded data is stripped of anything that would link it back to the individual it came from, while the granularity of the individual records is retained. This is not the case with the more familiar process of anonymisation. If one person is recorded as having an interest in sport and two people in culture, the pseudonymised data still registers as three people and three data records. Anonymisation, on the other hand, would convert that information into something like ‘a 33% interest in sport’.
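The three-person example above can be sketched in a few lines of code. This is a minimal, hypothetical illustration (the field names and identifiers are invented): pseudonymisation keeps one record per person under an opaque identifier, while anonymisation collapses the records into an aggregate from which the individuals are gone.

```python
import secrets

# Raw records: a direct identifier plus an interest (all values invented).
raw = [
    {"ip": "198.51.100.7", "interest": "sport"},
    {"ip": "203.0.113.12", "interest": "culture"},
    {"ip": "192.0.2.55",   "interest": "culture"},
]

def pseudonymise(records):
    """Replace the direct identifier with a random, machine-generated one."""
    return [{"pid": secrets.token_hex(8), "interest": r["interest"]}
            for r in records]

def anonymise(records):
    """Collapse records into aggregate shares; individual records are gone."""
    total = len(records)
    counts = {}
    for r in records:
        counts[r["interest"]] = counts.get(r["interest"], 0) + 1
    return {interest: count / total for interest, count in counts.items()}

pseudo = pseudonymise(raw)
print(len(pseudo))      # still three records, one per person
print(anonymise(raw))   # aggregate shares only, no individual records left
```

Note that the pseudonymised set still supports per-record analysis (three people, three records), whereas the anonymised result only supports the aggregate statistic.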
Pseudonymisation stands for privacy by design: it works by completely removing the characteristics of a data record that could be used to single out a person, such as an IP address or a cookie identifier, and replacing them with a machine-generated identifier. We should be clear that most online businesses do not collect data that can identify an individual directly.
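One possible way to produce the machine-generated identifier described above is a keyed hash (HMAC) of the original identifier. This is a sketch, not the author's specific implementation: the same input always yields the same pseudonym, so events from one browser can still be linked to each other, but without the secret key the original cookie identifier cannot be recovered from the stored value. The key name and event fields are assumptions for illustration.

```python
import hmac
import hashlib

# Assumption: this key is kept separate from the data store; deleting it
# later would make the pseudonyms irreversible even for the operator.
SECRET_KEY = b"rotate-and-protect-this-key"

def pseudonym(identifier: str) -> str:
    """Derive a stable, opaque pseudonym from a direct identifier."""
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

event = {"cookie_id": "abc-123", "page": "/news"}
# Only the pseudonym reaches storage; the cookie ID itself is never written.
stored = {"pid": pseudonym(event["cookie_id"]), "page": event["page"]}
```

A design caveat consistent with the next paragraph: the strength of this scheme rests entirely on who holds the key and how securely the original identifiers are discarded.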
Of course, the reliability of this process depends largely on how securely these characteristics are actually deleted, and on how difficult it would be to recreate the original reference to an individual. One option is to have a third party remove and delete the critical data under contract, so that the party processing the data is technically unable to store it in anything other than its pseudonymised state. In other contexts it can be adequate to pseudonymise the data within a company, so that restoring it remains at least theoretically possible; this can even be a legal requirement, for example to pursue infringements of the law. Admittedly, the easier it is to recreate the reference to an individual, the weaker the protective function of pseudonymisation becomes. In general, pseudonymisation plays its part in data protection from the very beginning, when data is generated and stored, and prevents critical data stores from being produced in the first place. In addition, a user in a pseudonymous data set does not need to be ‘forgotten’ (the ‘right to be forgotten’ is one of the core concepts of the new data protection regulation presented by EU Commissioner Reding in January), because the forgetting already takes place at the moment the data is stored.
However, a further dimension must be taken into account to make the protection secure: the granularity of the stored data. Especially in the era of big data and booming data volumes, the granularity itself can be so specific that references to individuals can be inferred even without direct identifiers. Combine indirect identifiers such as employer, make of car and preferred holiday destination, and a single person may already be pinned down. With modern algorithms and the volume of data available, individuals can be identified from their data trail even in supposedly harmless scenarios, for instance by recording their search queries over longer periods (where most people would give away their identity through ego-searching straight away).
So this is where effective pseudonymisation has to come in: to guarantee that the volume and granularity of the stored data do not create the conditions for references to individuals to be recreated ‘via the back door’. One way of ensuring this is to have the storage dimensions checked by a third party. Technical safeguards are another option, for instance ensuring that no combination of the available data can ever narrow the records down to a group of one. Companies that use pseudonymisation check their data on a regular basis to avoid the risk of suddenly accumulating critical data that could be used to single someone out, and therefore act with data protection in mind while collecting.
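The regular check described above can be sketched as a simple uniqueness audit over the indirect identifiers, essentially a k-anonymity check with k = 2: group the stored records by every combination of quasi-identifying fields and flag any combination that matches only a single person. The field names and records here are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Invented pseudonymised records with indirect identifiers still present.
records = [
    {"employer": "ACME",   "car": "estate", "destination": "Alps"},
    {"employer": "ACME",   "car": "estate", "destination": "Alps"},
    {"employer": "Globex", "car": "coupe",  "destination": "Rome"},
]

def singled_out(records, quasi_identifiers, k=2):
    """Return attribute combinations shared by fewer than k records."""
    risky = []
    for n in range(1, len(quasi_identifiers) + 1):
        for combo in combinations(quasi_identifiers, n):
            counts = Counter(tuple(r[field] for field in combo)
                             for r in records)
            risky.extend((combo, values)
                         for values, count in counts.items() if count < k)
    return risky

flagged = singled_out(records, ("employer", "car", "destination"))
# The Globex record is unique on every combination, so it gets flagged;
# the two identical ACME records protect each other.
```

A real audit would run over far more fields and records, but the principle is the same: any flagged combination means the granularity has become fine enough to single someone out.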
Why go to all that effort?
If the pseudonymisation process is successful and reliable, it benefits both users’ data privacy and data-hungry companies. Users benefit in several ways. One is that they can use online services that require registration or identification, such as commenting on a website or blog, without divulging their full identity. There is good reason for the Schleswig-Holstein Data Protection Commissioner’s recent warning to Facebook: under German law, the network does not provide its users with a legally secure option of using it pseudonymously. This can be a significant protective feature if, for instance, people who need to protect themselves from stalking want to use networks of this type and cannot afford to disclose their identity.
The other benefit to the user is an indirect result of the safeguards described above, because the company collecting the data has to implement numerous measures to effectively restrict the volume and accuracy of what it collects. Data privacy experts call this ‘data minimisation’. As a company, you are obliged to consider exactly what data you really need, and for how long, and then to optimise data collection so that only that data is stored, right from the start. For big data companies on the Internet that store several terabytes a day because their services are used globally, this quickly becomes a considerable undertaking. Pseudonymisation is the industry’s way of actively limiting itself with regard to data. And if a huge volume of data has been effectively pseudonymised, even the loss or hacking of the database can be relatively harmless: without references to individuals, the data is usually worthless to outsiders.
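Data minimisation at the point of collection can be as simple as an explicit allow-list of fields, applied before anything is written to storage, so that everything outside the list is dropped from the start. This is a minimal sketch with invented field names, not a description of any particular company's pipeline.

```python
# Assumption: the service has decided it only needs these three fields.
ALLOWED_FIELDS = {"pid", "page", "interest_category"}

def minimise(event: dict) -> dict:
    """Keep only the fields the service actually needs."""
    return {key: value for key, value in event.items()
            if key in ALLOWED_FIELDS}

incoming = {
    "pid": "a1b2c3",
    "page": "/news",
    "interest_category": "sport",
    "user_agent": "Mozilla/5.0 ...",    # not needed: dropped
    "referrer": "https://example.com",  # not needed: dropped
}
stored = minimise(incoming)  # only the three allowed fields survive
```

The design point is that the allow-list, not an after-the-fact deletion job, decides what exists in storage, which matches the article's "right from the start" framing.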
So why go to all that effort when it is so easy, and relatively cheap, just to store everything these days? It is true that the financial benefits of data minimisation are increasingly being eroded by plunging IT costs. There are, however, still two good reasons to opt for pseudonymisation. Firstly, it represents an active data protection strategy whose protective effect even the best security measures for critical data cannot match, so some companies will adopt it to demonstrate to their users that they handle their data seriously and respectfully. But this incentive alone is unlikely to convince as many companies as possible to go down the pseudonymisation route; the pull of data flows and their monetisation is too great.
However, there is a second, very effective lever to reward the effort of pseudonymisation, and it has existed in the German Telemedia Act (TMG) for many years. This approach can prevent companies from building critical data sets and should unquestionably be included in the current debate (and in the law itself) about the European General Data Protection Regulation. If data is verifiably stripped of direct references to individuals by means of reliable pseudonymisation, then collecting that data should be made easier as an incentive. More specifically, a right of objection for users should suffice in place of a consent requirement, as is already the case under the German Telemedia Act. With such a strong incentive, it would be possible to lead the booming big data industry down the pseudonymisation route, to the great benefit of all sides.