Like anyone in our space, data has always been a big part of our investment and focus. Given the aggressive growth of our client base, the amount of data being processed per client, and the exploding role of data in the ecosystem, we place data at the centre of our strategy, viewing our growing data mountain as an asset to be mined and not just a technical challenge. The better we do at that, the more our current business excels and the more new avenues open up.
What does Big Data mean to Improve?
It’s something that is simply core to what we do. Like I said on the panel: if you have a problem and you throw data at it, you usually end up with two problems. We understand this data is a real asset and isn’t just an aspect of our business that we have to deal with. This is where the concept of Big Data moves outside of technology; it is about embracing the challenges because we know what we can gain from it. This is actually quite different from the patterns in more traditional companies. For most companies, large databases were often something that had to be built for reasons non-core to the business. Consequently, they were viewed more as a cost-centre than as enablers of new profit lines.
How does the European market for data-driven advertising compare to the US — particularly in its use of SSPs?
The US has much higher data liquidity, which means that third-party data can be very accurate and user-specific, and there is a huge market for data and data exchanges. Due to fragmentation and the strict European data regulations, third-party data will play a more limited role in Europe.
What, if anything, have we learned from overseas? What is unique to Europe in relation to the way that we are using technology to deal with Big Data?
In Europe, data will become highly relevant for every publisher, but publishers need a completely different approach here. The quality of internal data pools and internal data generation will be the key differentiator for European publishers. Therefore, we offer our clients ways to help them build and integrate internal data pools into our SSP solution.
When selecting vendors to provide solutions to Big Data problems, what are the evaluation criteria you do/will apply?
It’s a combination of capability, price-performance and perceived risk. An easy mistake is to treat this as a purely technical activity: run the benchmarks, compare feature sets and so on.
Putting in place some sort of Big Data technology isn’t necessarily that hard, but integrating it with the rest of your systems, hooking it into the data flow, and aligning it with existing and new business processes – that’s usually a lot harder. Keeping it all running when your data volumes increase tenfold, while maintaining that integration, is harder still.
Can you describe what the Improve data stack looks like right now?
We’ve got quite a broad range of processing which is reflected in our diverse technology stack. Today we service our main application with a combination of MySQL and Postgres relational database clusters. Alongside this we use Hadoop to produce derived data sets from our ad server logs, and in addition we use some other NoSQL technologies such as MongoDB and Redis for particular point solutions.
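The Hadoop side of that stack is essentially map-reduce over log files. As a rough illustration only – the log format, field names and values below are invented, not Improve Digital’s actual schema – the pattern of deriving aggregate data sets from ad-server logs can be sketched in pure Python:

```python
from collections import defaultdict

# Hypothetical ad-server log lines: "timestamp publisher_id event"
LOG_LINES = [
    "2012-06-01T10:00:01 pub-7 impression",
    "2012-06-01T10:00:02 pub-7 click",
    "2012-06-01T10:00:03 pub-9 impression",
    "2012-06-01T10:00:04 pub-7 impression",
]

def map_phase(lines):
    """The 'map' step: emit a ((publisher, event), 1) pair per log line."""
    for line in lines:
        _, publisher, event = line.split()
        yield (publisher, event), 1

def reduce_phase(pairs):
    """The 'reduce' step: sum the counts for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

derived = reduce_phase(map_phase(LOG_LINES))
# e.g. derived[("pub-7", "impression")] == 2
```

In a real Hadoop deployment the map and reduce functions run distributed across the cluster over files in HDFS, but the contract – stateless mapping to key/value pairs, then aggregation per key – is the same.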
We’ve got a well-segmented architecture where each component in the platform has well-defined responsibilities. Looking at our technical roadmap, this allows us to expand specific capabilities in a focused, well-partitioned fashion. Ask the same question in a year and the technology platform will likely look quite different.
What is Improve’s approach to data storage? How are you leveraging cloud computing?
The intersection of data storage and cloud computing is a pretty interesting one, especially for a company like us with systems deployed in multiple locations across continents. It’s a trade-off between distributing data storage and processing around those locations – more complex but more scalable – and largely centralising where data lives and is analysed – the easier but less scalable approach.
We’ve adopted a hybrid model where we distribute some functions and centralise others, picking the best approach for each task while of course ensuring our system is resilient to failure at any one location. Similarly we use a combination of our own dedicated physical hardware in addition to extensive use of Amazon Web Services. This allows us to be very performance-focused and responsive; with such a broad geographic footprint we ensure we serve users as quickly as possible, while having the ability to dynamically add new capacity on demand. This all gives us great flexibility while not relying on one vendor.
Given that much of the data collected in the real-time advertising space is predominantly event-driven, i.e. structured, where do Big Data capabilities add value? Is it typically around speed and size? Is variety fairly standard?
For us, the formats are pretty well-defined and at least semi-structured. I’d highlight scale as the defining characteristic of the value of any solution. What is needed is a solution that retains the required performance characteristics of data access and processing as the volumes increase.
This is often a key value of a Big Data technology: architectural stability through growth. Looking at traditional data systems, a good rule of thumb would be that a given architecture would start to creak when data volumes increased by a factor of two or three. However, any data-driven company today – particularly those of us processing machine-generated data – needs to be thinking about how to handle ten or a hundred times as much data. Though the actual deployments will look very different, many of the Big Data technologies can retain a common architecture through these magnitudes of growth. That may sound a little abstract, but it’s the difference between adding a few more servers every few months and having to rebuild the system from scratch every year.
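One common mechanism behind that kind of architectural stability is consistent hashing, which many distributed data stores use so that adding a server remaps only a fraction of the keys rather than forcing a full redistribution. This is a generic illustration of the idea, not a description of any particular product’s internals; the server and key names are invented:

```python
import bisect
import hashlib

def _h(s):
    """Stable hash of a string onto a large integer ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring: each node is placed at many
    virtual positions; a key belongs to the next node clockwise."""
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (hash, node)
        for n in nodes:
            self.add(n)

    def add(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_h(f"{node}:{i}"), node))

    def lookup(self, key):
        idx = bisect.bisect(self._ring, (_h(key), ""))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
before = {k: ring.lookup(k) for k in (f"event-{i}" for i in range(1000))}

ring.add("server-d")  # grow the cluster by one node
moved = sum(1 for k, n in before.items() if ring.lookup(k) != n)
# Only the keys now owned by server-d move; the rest stay where they were.
```

With a naive `hash(key) % num_servers` scheme, adding that fourth server would remap roughly three quarters of the keys; here it is roughly one quarter, which is what makes incremental growth cheap.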
Where do you sit on the SQL versus NoSQL debate?
Use the right tool for the job. As mentioned earlier, we use a combination of traditional SQL and NoSQL technologies, and I’m very comfortable with that. I think there is a case of history repeating itself here. To me, one of the most compelling drivers behind NoSQL in its early days was a reaction to traditional system approaches that often saw data pushed into a SQL database just because it was a known technology – even when the data was not intrinsically relational or was going to hit scales where the RDBMS would start to creak. NoSQL technologies stepped into this gap and provide a great complement to relational technologies.
Now, though, there’s a risk that with the buzz around NoSQL the same will happen in reverse: data will get pushed into a NoSQL store even if the nature of the data means this is a sub-optimal choice. I truly believe that most enterprises with any sort of large data volumes will likely need to build a platform comprising both SQL and NoSQL technologies; it shouldn’t be seen as an either/or debate.
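That “right tool for the job” split can be made concrete with a small sketch: relational entities that need joins and constraints go to a SQL store, while high-volume, schema-free counters go to a key-value store. The schema and names here are hypothetical, and a plain dict stands in for a NoSQL store such as Redis:

```python
import sqlite3

# Relational side: publisher accounts -- intrinsically tabular,
# joinable, and constrained, so a SQL database is the natural home.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE publishers (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO publishers VALUES (1, 'Example Media')")

# Key-value side: per-user frequency counters -- huge write volume,
# no relational structure, a natural fit for a store like Redis
# (stubbed here with a dict so the sketch is self-contained).
kv_store = {}

def record_impression(user_id):
    kv_store[user_id] = kv_store.get(user_id, 0) + 1

record_impression("user-42")
record_impression("user-42")

name = db.execute("SELECT name FROM publishers WHERE id = 1").fetchone()[0]
# name == "Example Media"; kv_store["user-42"] == 2
```

The point is the routing decision, not the specific stores: each data set lands in the technology whose access pattern and scaling model it matches.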
Are we seeing a talent shift occur in the people required to mine Big Data analytics?
It’s possibly more a change in perspective and visibility than anything else. There have been people working on machine-learning algorithms forever, but traditionally this has been in industries that could afford the huge sums required by older approaches to large-scale data processing. The availability of Big Data technologies is bringing the ability to process large data sets into the reach of even quite small companies. It’s a natural evolution: once access to large amounts of data becomes commonplace, you have more demand for people who know how to delve into that data and pull out the nuggets. That’s fantastic, because it places you right at the intersection of business, technology and mathematics – somewhere hugely fertile for innovation.
In terms of the capabilities Improve has around Big Data, how does this add bottom-line value to publishers? How does it help improve yield performance?
There are two sides to this, the automatic and the manual. Many of the previous questions have touched on the former; building that data platform with the capabilities to derive the important facts and trends from the historical data.
However, on the manual side, the goal is to bring the right information as close as possible to the point of decision. The last thing we want to do is just dump more data on our clients’ screens; we have to match the data with better tools that allow them to get the information they want, when they need it. Data also has an interesting way of ageing: at some point it moves from being a generally interesting historical record to being actionable knowledge about recent events that someone can incorporate into decisions made now. Everything we are doing at Improve Digital in terms of Big Data is anchored in these principles, and it’s a road of continuous improvement; there are always better ways to process the data and more timely ways to present it to our clients.