If Ad Tech Wants to Succeed, It Must Treat Identity and Privacy as One

by Lindsay Rowntree on 7th Dec 2020 in News

Following his appearance on ExchangeWire's latest episode of TraderTalk TV, David Reischer, product manager at Permutive, writes about how the ad tech industry can overcome the challenges surrounding identity and privacy in a post-cookie landscape.

Over the last year, there has been a lot of discussion around identity. When we have these discussions, however, I believe it’s important to keep in mind why we are having them. The driving forces behind these industry changes are regulators and browser vendors, and their motivation is clear: protecting user privacy. So as the industry discusses identity in a post-cookie world, we cannot do that in isolation - we always have to treat identity and privacy as one topic. Ad tech will not win an arms race against regulators and browsers, so we need to discuss sustainable solutions that adhere to modern privacy standards.

What are cross-domain identifiers?

When the industry speaks of identifiers, we usually mean cross-domain identifiers: identifiers that allow a solution to recognise a user across multiple domains and environments. In other words, an identifier that allows an ad-tech vendor to recognise that I’m the same person on domains A, B and C. These IDs power some of ad-tech’s core functionality for programmatic advertising, including:

Measurement and attribution
Frequency capping
Data targeting

For all of these use-cases, ad-tech relies on the capability to recognise that you are the same user, regardless of which domain you are on.

What types of cross-domain identifiers are there?

For a long time, the third-party cookie has been the predominant identifier on the open web. However, with the death of the third-party cookie across Safari and Firefox, and with Chrome’s announcement that it will phase third-party cookies out as well, the industry has started to look at alternatives.

Lately, we’ve been hearing a lot about different ID solutions. But generally, you can bucket any solution into one of three categories. And I think it’s really important to remember that any ID solution will boil down to one of these three. There is no magic and what’s technically possible is clearly determined by browsers.

Cookies

Everybody will be familiar with cookies, so they won’t need much explanation. In summary: a cookie is a small text file that a web server can store in your browser against its domain. This allows ad-tech.com to store a random ID in your browser and then re-identify you whenever a page you visit makes a request to ad-tech.com.
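To make the mechanism concrete, here is a minimal sketch of what a server on ad-tech.com might do with each incoming request. It is an illustration under assumptions, not any vendor's actual implementation: the cookie name `uid`, the one-year lifetime and the `handle_request` helper are all hypothetical.

```python
import uuid
from http.cookies import SimpleCookie
from typing import Optional, Tuple

COOKIE_NAME = "uid"  # hypothetical cookie name


def handle_request(cookie_header: str) -> Tuple[str, Optional[str]]:
    """Return (user_id, Set-Cookie value or None) for one request to ad-tech.com."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    if COOKIE_NAME in jar:
        # Returning browser: the ID it sends back re-identifies it.
        return jar[COOKIE_NAME].value, None

    # New browser: mint a random ID and ask the browser to store it.
    user_id = uuid.uuid4().hex
    out = SimpleCookie()
    out[COOKIE_NAME] = user_id
    out[COOKIE_NAME]["domain"] = "ad-tech.com"
    out[COOKIE_NAME]["path"] = "/"
    out[COOKIE_NAME]["max-age"] = 60 * 60 * 24 * 365  # assumed one-year lifetime
    return user_id, out[COOKIE_NAME].OutputString()


# First request carries no cookie, so an ID is minted.
first_id, set_cookie = handle_request("")
# Later requests carry the cookie, so the same ID comes back.
second_id, _ = handle_request(f"{COOKIE_NAME}={first_id}")
```

Because the browser attaches the cookie to every request it sends to ad-tech.com, the second call returns the same ID no matter which page triggered the request.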

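Whether a cookie is a first-party or third-party cookie depends on the context. First-party cookies are in the scope of the domain you are visiting, while third-party cookies belong to a different domain, such as ad-tech.com loaded within a publisher’s page. First-party cookies are not used for cross-domain tracking and they aren’t going away.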

Deterministic Identifiers

Deterministic identifiers are built on a hard identifier, such as a hashed email address. Often these hashed email addresses are then encrypted through some proprietary protocol, but the ID ultimately comes down to a hashed email address.

What’s important to know is that deterministic identifiers will only work where the user has logged in. So if you are logged in on domain A and B you can be targeted there, but not on domain C and D, where you aren’t logged in.
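As a rough sketch of why this works across domains only where you are logged in, consider the following. The normalisation rules (trim and lowercase), the choice of SHA-256 and the example domains are assumptions for illustration; real ID solutions define their own rules and usually add further proprietary steps.

```python
import hashlib
from typing import Optional


def hashed_email_id(email: str) -> str:
    """Derive an ID from an email address (normalisation here is an assumption)."""
    normalised = email.strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def id_for_domain(logged_in_email: Optional[str]) -> Optional[str]:
    # Without a login there is no email, and therefore no ID to derive.
    return hashed_email_id(logged_in_email) if logged_in_email else None


# Hypothetical login state per domain: None means the user is not logged in.
logins = {
    "domain-a.example": "Jane.Doe@example.com",
    "domain-b.example": " jane.doe@example.com",
    "domain-c.example": None,
}

ids = {domain: id_for_domain(email) for domain, email in logins.items()}
```

Domains A and B end up with the same ID because the same address was used to log in on both, while domain C has nothing to work with.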

Probabilistic identifiers / Fingerprinting

Lastly, there is a category of probabilistic identifiers. This approach takes a number of signals, for example your IP address, your user agent and certain device information like screen resolution or installed fonts. While on their own these signals don’t mean much, when you combine them they create a fingerprint that can identify an individual user, usually with a pretty high probability.

There are two types of fingerprinting, which are defined by the W3C:

Active fingerprinting: This requires code on the page to read things like installed fonts and your screen resolution.
Passive fingerprinting: This uses data points which are included in each HTTP request by default. This includes things like your IP address, cookies and user-agent.
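The idea of combining weak signals into one identifier can be sketched as follows. This is a deliberate simplification: it hashes an exact set of signals, whereas real probabilistic solutions score similarity between signal sets, which is where the "probabilistic" comes from. The signal names and values are made up for illustration.

```python
import hashlib


def fingerprint(signals: dict) -> str:
    """Combine individually weak signals into a single identifier."""
    # Sort the keys so the same signals always produce the same string.
    canonical = "|".join(f"{key}={signals[key]}" for key in sorted(signals))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Passive signals arrive with every HTTP request.
passive = {"ip": "203.0.113.7", "user_agent": "ExampleBrowser/1.0"}
# Active signals need code on the page to read them.
active = {"screen": "1920x1080", "fonts": "Arial,Helvetica,Verdana"}

visit_on_domain_a = fingerprint({**passive, **active})
visit_on_domain_b = fingerprint({**passive, **active})
other_device = fingerprint({**passive, "screen": "1366x768", "fonts": "Arial"})
```

The two visits produce the same fingerprint without anything being stored in the browser, while a device with a different screen and font list behind the same IP address produces a different one.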

What are the privacy implications of each of these?

As we speak about these different types of identifiers, it’s important to also look at them from a privacy perspective:

Cookies

While the way cookies have been used hasn’t been great for privacy, they do actually offer relatively good controls for the end user. You are able to delete your cookies and therefore reset your identity. This means even if a solution creates a new cookie for you, it will be a new identifier that cannot be reconnected to data which has previously been collected against your ID.

Deterministic

Deterministic identifiers still offer some controls for the end user. You can log out from a specific site and delete your cookies, which means you can no longer be identified. This requires an extra step and additional knowledge.

However, the problem is that the end user has no way to reset their ID. Any data that has been collected against your ID in the bidstream can be reconnected, as soon as you log in again. While a cookie is typically based on a random string, deterministic IDs are based on your email address. And your email address is unlikely to change over time.
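The difference in resettability can be shown with a small sketch. The `history` store, the `record` helper and the page names are hypothetical stand-ins for whatever a vendor keeps against an ID, and SHA-256 over a lowercased address is an assumed hashing scheme.

```python
import hashlib
import uuid

history = {}  # hypothetical vendor-side store: ID -> pages seen


def record(user_id: str, page: str) -> list:
    """Append a page view to whatever is already stored against this ID."""
    history.setdefault(user_id, []).append(page)
    return history[user_id]


def email_id(email: str) -> str:
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


# Cookie-based ID: after deleting cookies, the next visit gets a fresh random ID.
cookie_id = uuid.uuid4().hex
record(cookie_id, "news/article-1")
cookie_id_after_reset = uuid.uuid4().hex
cookie_history = record(cookie_id_after_reset, "news/article-2")

# Email-based ID: logging out and back in yields exactly the same ID again.
record(email_id("jane.doe@example.com"), "news/article-1")
email_history = record(email_id("jane.doe@example.com"), "news/article-2")
```

After the reset, the cookie-based history contains only the new page view, whereas the email-based history contains both, because the ID never changed.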

Probabilistic / Fingerprinting

As we come to fingerprinting, there’s no doubt that this approach is the worst for the end user. Unlike the other approaches, there are absolutely no controls for the end user to reset their identity. As a user, I cannot just change my IP address, my user agent or my screen resolution. Even when I browse in an incognito tab, I’ll still have the same IP address and user agent, which is really worrying.

What are the privacy learnings for the new open web?

Fingerprinting isn’t a great idea

Browser vendors have made it really clear that they’ll do whatever is necessary to stop fingerprinting. Chrome has already announced that it will drastically reduce the amount of information in the user agent string. It has also proposed the Privacy Budget, which would restrict a site’s access to further information once it has read too many properties that can be used to calculate a probabilistic ID.

Safari’s WebKit team has already limited access to several web APIs and clearly stated that it will treat any tracking circumvention with the same seriousness as the exploitation of security vulnerabilities.

The open web will be split into two

Today

In today’s world around 55% of the open web can be recognised through cookies. Less than 5% are authenticated - this means users who are logged in with a publisher. And around 40% are anonymous - which means publishers can still recognise these users, but they are anonymous to the buy-side.

Tomorrow

We know that the ad-tech world will look quite different in around 14 months’ time. Recognised users are going away. The number of authenticated users might go up, but it’s unlikely this will be greater than 10% any time soon. So the default state of the open web becomes anonymous - whether that’s 85% or 95% of users.

Across 95% of traffic, publishers will increasingly look like walled gardens. Because publishers can still recognise 100% of their users, data will continue to exist within each individual publisher environment. Publishers can still understand all the behaviours and interests of their users. But the connective tissue that links all of these environments, in the form of cross-domain identity, goes away.

If we just focus on ID solutions, we’ll disregard 95% of users. This will be bad for both publishers and advertisers. However, building solutions for the anonymous web cannot be at the expense of privacy.

Privacy by design → Limit data

When building new solutions we need to adopt privacy-by-design principles. We need to look at specific use-cases and understand how we can limit data exposure as much as possible. For regulators, the amount of data exposed in the bidstream has been an area of focus.

If we look at use-cases for targeting users on a 1-1 basis, user IDs are today being exposed in the bidstream together with lots of other data points. Rich contextual data, specifically, has allowed ad-tech to build up massive databases of user IDs and what sites they’ve visited, what content they’ve consumed, and when and where they performed any action.

We need to ask whether we can limit the amount of data in the bidstream to just what is required: in this example, just user IDs. Without contextual data, it would be impossible to build these databases. I believe this becomes even more important when user IDs are based on email addresses.

At the same time, we can expose contextual and audience signals to the bidstream if no user ID is present. This won’t allow anyone to aggregate these data points against a user ID and it’s therefore much more privacy safe. Ultimately it’s the ecosystem’s, and specifically the publishers’, responsibility to limit what is being exposed. For publishers there are two clear objectives moving forward:

Protect their users’ privacy - if you are selling audiences (with contextual data) do not connect them to IDs
Protect their first-party assets - if you have an ID on a user, do not connect it to any contextual data.
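The two objectives above amount to one rule: a request may carry an ID or contextual and audience signals, never both. A minimal sketch of a publisher-side filter could look like this. The field names are hypothetical and are not taken from OpenRTB or any other specification, and the `expose_id` flag stands in for whatever business logic decides how a given impression is sold.

```python
ID_FIELDS = {"user_id"}
CONTEXT_FIELDS = {"page_url", "content_category", "audience_segments"}


def limit_bid_request(request: dict, expose_id: bool) -> dict:
    """Return a copy of the request that exposes an ID or context, never both."""
    if expose_id and request.get("user_id"):
        dropped = CONTEXT_FIELDS  # selling on ID: strip everything contextual
    else:
        dropped = ID_FIELDS  # selling on context and audiences: strip the ID
    return {key: value for key, value in request.items() if key not in dropped}


# Hypothetical request before any filtering.
raw_request = {
    "user_id": "abc123",
    "page_url": "https://publisher.example/sport/match-report",
    "content_category": "sport",
    "audience_segments": ["football_fans"],
    "ad_slot": "300x250",
}

id_only = limit_bid_request(raw_request, expose_id=True)
context_only = limit_bid_request(raw_request, expose_id=False)
```

In the first case the buyer sees the ID and the ad slot but learns nothing about the page; in the second it sees the page and the audience but has no ID to aggregate them against.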