Over the last few years, the demand for data science solutions has skyrocketed. However, not every organization has the expertise needed to achieve meaningful results.
Recently, there has been a noticeable increase in the number of companies looking to upgrade their processes in one way or another. While some followed the trends and implemented solutions like password managers and mobile VPNs, others decided to take it a few steps further and enhance their business operations with data analytics.
To discuss the challenges and benefits of data science, we invited Maxime Agostini, Cofounder & CEO of Sarus – a company revolutionizing the way data is shared and accessed.
What has your journey been like since your launch? How did the idea of Sarus originate?
With Nicolas and Vincent, we started Sarus with the intuition that data science and privacy could be made fully compatible, but that this required a change in how data was being shared. We previously built a successful company in the advertising world, helping publishers better monetize their inventory. The way personal data was being handled by advertisers was broken, putting everyone’s privacy at risk. But privacy regulations were not supposed to hinder innovation either. They just called for something to change in how data was accessed or shared. Sarus is this new paradigm: one where personal data can be leveraged without being shared with anyone.
In 2020, we started exploring privacy-enhancing technologies and how they can support our vision. In 2021, we joined Y Combinator, launched our MVP, and signed our first clients.
Can you introduce us to what you do? How is AI incorporated into Sarus?
Sarus brings a privacy layer directly into the modern data stack. Sarus enables practitioners to work on personal data without ever accessing it. It works as a data access service that filters queries and enforces the highest standards of privacy.
Each data processing job is filtered by Sarus and, when necessary, converted into a version that complies with the dataset’s privacy policies. In most cases, this means using differential privacy to guarantee it’s anonymous. When the query focuses on individual records, synthetic data is returned instead.
We both use and power AI research! AI is used in our synthetic data generation models to provide high-fidelity data for any data type. But more importantly, Sarus lets practitioners train AI models on data they would not be able to access easily.
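The routing rule described in this answer (aggregate jobs get a differentially private result, jobs that target individual records get synthetic data) can be illustrated with a minimal sketch. It is only an illustration of the rule as stated in the interview, not Sarus's implementation or API: the names `Job`, `Handling`, and `choose_handling`, and the `returns_individual_records` flag, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Handling(Enum):
    """How a job's result is made safe before it is returned (hypothetical labels)."""
    DIFFERENTIAL_PRIVACY = "differential_privacy"
    SYNTHETIC_DATA = "synthetic_data"


@dataclass(frozen=True)
class Job:
    """A simplified description of a data processing job (hypothetical)."""
    description: str
    returns_individual_records: bool


def choose_handling(job: Job) -> Handling:
    """Pick a privacy treatment following the rule described in the interview."""
    # Jobs that look at individual rows cannot be answered with noisy aggregates,
    # so they are served from synthetic data instead.
    if job.returns_individual_records:
        return Handling.SYNTHETIC_DATA
    # Aggregate jobs are answered with a differentially private result.
    return Handling.DIFFERENTIAL_PRIVACY


if __name__ == "__main__":
    jobs = [
        Job("average basket value per region", returns_individual_records=False),
        Job("preview the first 10 customer rows", returns_individual_records=True),
    ]
    for job in jobs:
        print(f"{job.description}: {choose_handling(job).value}")
```

A real access layer would have to infer what a job returns from the query itself rather than from a hand-set flag; the flag keeps the sketch short.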
You mention differential privacy quite often when describing your solutions. Would you like to share more about this technology?
Differential privacy is a mathematical definition of what it takes for some information to be anonymous. When it comes to anonymizing data, all our intuitions eventually break down. Removing or altering fields (data masking or de-identification) leaves room for someone with the right information to re-identify some individuals. Even aggregates can expose personal information to an attacker who possesses a marginally different aggregate. In 2006, researchers proposed a new framework to approach privacy risk. They set out the requirements to limit how much personal information can be inferred from an output: differential privacy. It works by injecting noise into the output of any computation. If the noise covers the impact of adding or removing some individuals, their privacy can be protected. If not, someone with a slightly different dataset may be able to carry out a differencing attack.
Having a formal definition means that it can be applied to all data types and all types of information produced. It can finally scale the production of anonymous information!
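To make the noise-injection idea concrete, here is a minimal sketch of the Laplace mechanism, one standard way to make a counting query differentially private. It is not taken from the interview and is not presented as Sarus's implementation: the function names `laplace_noise` and `dp_count`, the privacy parameter `epsilon`, and the sample ages are all assumptions chosen for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace distribution centered at 0."""
    # The difference of two independent Exp(1) draws follows Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def dp_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that satisfy `predicate`."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one individual changes a count by at most 1,
    # so the sensitivity of this query is 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [23, 35, 41, 52, 67, 29, 38]   # hypothetical dataset
    ages_without_one = ages[:-1]          # same dataset minus one individual

    def over_30(age: int) -> bool:
        return age > 30

    # The two exact counts differ by 1; the noise is on the same scale,
    # so a single noisy answer does not reveal which dataset was used.
    print(dp_count(ages, over_30, epsilon=0.5, rng=rng))
    print(dp_count(ages_without_one, over_30, epsilon=0.5, rng=rng))
```

A smaller `epsilon` means more noise and stronger protection, at the cost of accuracy. Comparing the two exact counts is the kind of differencing attack mentioned above; the noise is what blurs that comparison.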
How did the recent global events affect your field of work? Were there any new challenges you had to adapt to?
Data security is probably more important than ever. There is growing awareness of the risks, including the risk that information released today may be used in the future to carry out cyber attacks. Even encrypted data can no longer be shared lightly with the advent of store-now-decrypt-later attacks. Another benefit of differential privacy is that its protection holds regardless of any additional information or computing power that may become available in the future.
What are some of the worst mistakes companies make when handling large amounts of sensitive data?
A common mistake is to try to solve the problem by anonymizing the data. Unfortunately, perfect anonymization is impossible. Usually, companies go down the rabbit hole of anonymization by deleting more and more information, only to be left with outstanding risks and too little value left to extract. Or they decide to reinforce processes and further limit data access across the company, which gets in the way of data fluidity and, eventually, efficiency and innovation.
Besides data science solutions, what other technologies do you think would greatly enhance business operations?
Organizations invest a lot in data science because it is innovative and fancy, but there is still low-hanging fruit in how data can enhance business operations. Simply accessing data for collaboration is not a solved problem yet. Privacy technologies and data clean rooms also make it possible to unlock simple data use cases that are still unaddressed. There is arguably more value in unlocking more data for simple use cases than in training fancier data science models on data that has already been studied.
In your opinion, what kinds of threats should organizations be prepared to tackle in the next few years? What security measures are essential in combating these threats?
Cybersecurity threats are unlikely to go down in the future. This is the number one area of attention and there will not be a single solution. Data security ops is one component that has often been overlooked because internal users are assumed to be trustworthy. But this assumption may not be sustainable, especially with more data being made accessible to more practitioners. As organizations expand their data operations, the likelihood of one user being compromised is growing, and the potential impacts are getting bigger. Of course, one should start with the basic security practices (MFA, strong passwords, strict permissioning), but we expect the data architecture to be more and more influenced by security concerns. Reducing the attack surface is a general approach that can be useful here: instead of granting full read access to many data practitioners, providing safe and privacy-preserving access through tools such as Sarus greatly reduces the number of potential leaks. As in other areas of cybersecurity, zero trust is becoming the baseline.
What tips would you give to companies looking to get more value out of their data?
The main advice is to invest in the tools that will support data democratization. Sourcing and processing data at scale is now well covered by cloud vendors. The modern data stack makes it easy. But making this data usable by many stakeholders remains a largely untapped opportunity. Data can have value even beyond internal stakeholders. Data clean rooms and data exchanges are in their infancy. Democratization will require solving privacy and data security at scale, which is what companies like Sarus are building.
Tell us, what’s next for Sarus?
This quarter, we are releasing a major extension of the Sarus product that brings privacy directly into all modern data infrastructures, of any size, for any data pipeline. This opens the door to integrating natively with data warehouses and accelerates deployment and time-to-value for our clients. We also just graduated from Y Combinator, which gave us a first foothold in the US, where we see the most traction for privacy and data security solutions. We will be growing our operations there significantly!