Shalini Kurapati, Clearbox AI: “synthetic data solves the privacy problems that companies face when implementing AI”


With so much business data these days, handling and analyzing large datasets requires a new, privacy-first approach.

Over the last few years, the demand for new digital tools like antivirus security measures or automation tools skyrocketed. While companies are especially eager to implement AI solutions, our guest today emphasizes that not every company has the expertise needed to use this technology to its full potential.

To discuss how AI technologies can help overcome data challenges, we invited Shalini Kurapati, Co-founder and CEO of Clearbox AI – a company that allows organizations to share and access sensitive data through synthetic data generation powered by AI.

Would you like to share what the journey has been like for Clearbox AI?

Clearbox AI’s journey has been propelled by our team’s pragmatic idealism and a penchant for innovation. We’ve always loved to implement the newest AI technologies to solve business problems. Our professional and personal paths across Europe converged in Turin, Italy, where we decided to join forces to pursue a shared mission: to enable responsible AI adoption in companies. That’s how Clearbox AI was born in 2019.

We have since worked hard on listening closely to customers and understanding the challenges they face in harnessing AI technologies. AI progress in companies is often hindered by problems related to privacy, fairness, and bias, which stem from data access, sharing, and quality issues.

As a startup, our journey is always evolving and (pleasantly) surprising us. We are currently at an exciting stage where we derive satisfaction from helping companies overcome these challenges with tangible results, all while grounding our work in the principles of ethical AI.

As a matter of fact, we were recently selected as awardees of the Women TechEU project by the European Commission. Our solutions play a fundamental role in the goals of the project in terms of bias mitigation for a more inclusive tech landscape. Our company mission aligns with these objectives because it sits at the intersection of ethics and technology.

Can you tell us a little bit about what you do? How is Artificial Intelligence incorporated into your services?

In one line – we provide high-quality synthetic data. This is artificial data generated by AI on the basis of an original dataset. The result is a new dataset that is statistically similar to the original one, but different enough that it doesn’t give away personal information.
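To make the idea concrete, here is a deliberately tiny sketch in Python. It is not Clearbox AI's data engine, which the interview describes only as a proprietary implementation of generative models. It assumes a made-up two-column dataset (age, income), fits a simple correlated Gaussian to it, and samples new rows from that fit. The function names `fit_two_columns` and `sample_synthetic` are hypothetical, and a toy like this offers no privacy guarantee by itself.

```python
import random
import statistics


def fit_two_columns(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    rho = max(-1.0, min(1.0, cov / (sx * sy)))  # clamp against rounding error
    return mx, my, sx, sy, rho


def sample_synthetic(params, n, seed=0):
    """Draw n new rows from the fitted correlated Gaussian."""
    mx, my, sx, sy, rho = params
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 into the second column so the correlation is preserved.
        y = my + sy * (rho * z1 + (1 - rho ** 2) ** 0.5 * z2)
        rows.append((x, y))
    return rows


# Hypothetical "original" data: (age, income) pairs invented for illustration.
rng = random.Random(1)
original = []
for _ in range(200):
    age = rng.gauss(40, 10)
    income = 800 * age + rng.gauss(0, 5000)
    original.append((age, income))

params = fit_two_columns(original)
synthetic = sample_synthetic(params, n=2000)
```

The synthetic rows follow the same overall averages and age-income relationship as the original, yet none of them is a copy of an original row. Real generators handle many columns, mixed data types, and far richer structure than this sketch does.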

Synthetic data is an effective privacy-enhancing solution to access and share sensitive data within and outside organizations in a safe and responsible manner while respecting regulations like GDPR and CCPA. It offers high utility with a low risk of re-identification. Additionally, it can be used to safely test and improve AI models, mitigate bias and automate testing.

We generate synthetic data through our data engine, powered by our proprietary implementation of generative models. For each synthetically generated dataset, we automatically provide reports on data quality and privacy diagnostics.

We packaged our product offering into a flexible Enterprise Solution that can be tailored to client infrastructure and data needs.

In your opinion, which industries would greatly benefit from implementing synthetic data solutions?

We think synthetic data has industry-agnostic applications. It can be used by companies that have established data science teams or by those that are just beginning to set up their AI infrastructure; both will benefit immensely. A good chunk of AI projects stall or fail due to the lack of production-ready data. Synthetic data removes such hurdles and accelerates AI adoption. Gartner projects that 60% of the data used for the development of AI and analytics projects will be synthetically generated by 2024.

In particular, we see fertile ground for synthetic data uptake in the banking, finance, insurance, retail, mobility, and healthcare industries. For instance, we are currently serving a major bank that requested our synthetic data generation solution for data augmentation and data processing in accordance with privacy regulations.

Have the recent global events encouraged you to integrate any new vital features?

We’ve always had a focus on responsible AI innovation that not only brings about prosperity but also respects citizen rights and applicable laws. Regardless of current events, we have been doubling down on our efforts to advance our privacy metrics and bias mitigation modules. The related features offer insights into the risk of re-identification of personal data, along with options to correct imbalances in datasets, to ensure better performance while guarding against ethical and privacy risks. We will also release open source modules on data profiling and bias mitigation in the autumn of 2022.
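As an illustration of what a re-identification risk metric can look like, the sketch below computes the distance from each synthetic row to its closest original row and reports the share of synthetic rows that land suspiciously close to a real one. This is one commonly used proxy, not necessarily the metric Clearbox AI implements; the function names, the toy rows, and the threshold are all assumptions made for the example.

```python
import math


def distance_to_closest_record(synthetic, original):
    """For each synthetic row, return the Euclidean distance to its nearest original row."""
    return [min(math.dist(s, o) for o in original) for s in synthetic]


def share_too_close(synthetic, original, threshold):
    """Fraction of synthetic rows that lie within `threshold` of some original row."""
    distances = distance_to_closest_record(synthetic, original)
    return sum(d <= threshold for d in distances) / len(distances)


# Invented toy rows; in practice the features would be scaled to comparable ranges first.
original = [(34.0, 52.0), (45.0, 61.0), (29.0, 40.0)]
synthetic = [(34.0, 52.0), (40.0, 58.0), (60.0, 90.0)]

distances = distance_to_closest_record(synthetic, original)
risk = share_too_close(synthetic, original, threshold=1.0)
```

Here the first synthetic row is an exact copy of an original record, so its distance is zero and it counts toward the risk share, while the other two rows sit well away from any real record.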

Since AI is a relatively new technology, people still tend to have some misconceptions and myths regarding it. Which ones do you notice most often?

One of the most pervasive myths is that AI is some kind of turnkey solution that will magically transform entire business processes and outcomes. What is often missed is that a lot of groundwork is needed first: specialized infrastructure, highly skilled professionals, stakeholder buy-in with the right expectations, and a collaborative framework for iterative testing and development.

We are committed to dispelling these myths and spreading awareness about topics concerning AI, for example by publishing periodic interviews with experts to exchange ideas and engage in enriching discussions.

What are some of the worst mistakes companies make when handling large amounts of data?

To be fair, handling large amounts of data has only been a thing for the past decade or so, so it’s natural that there are no right answers or established standards and manuals to follow. While we don’t expect perfection from the beginning, because data and AI technologies are evolving fast, there are several aspects companies should be aware of to avoid making big mistakes. These start with setting up robust and optimized data pipelines and mechanisms to ensure data quality and availability, and extend to overarching monitoring mechanisms, not only for compliance but also for easy debugging of project results stemming from that data.

Having said that, if I have to indicate one overarching issue, it’s the misdirected focus on data quantity rather than quality – bigger is not always better. Inaccurate, incomplete, or poor-quality data, however large, will not deliver successful AI or analytics projects. Moreover, such data can be very expensive to collect and maintain. It is also computationally inefficient to process, which adds extra costs and creates a negative environmental impact.

What predictions do you have for the future of AI technology?

I think in the near and medium term, generative AI will play a major role in the advancement of AI technology, not to mention the shift towards no-code and low-code AI, which will further reduce the barriers to entry into the AI space. There will be an increased focus on smaller but higher-quality datasets. Privacy-preserving AI techniques like federated learning will continue to make strides with increased efficiency. However, with advanced AI models, the computational resources required to train and run these models may create a computational divide. There is a high risk that within a few years only big companies will be able to afford training generative models unless we work on making models more efficient or computational resources more affordable.

In this age of ever-evolving technology, what do you think are the key security practices both businesses and individuals should adopt?

First and foremost, both companies and individuals should have a minimum awareness and understanding of the types of security risks they may face. It’s naturally impossible to detect and mitigate every possible security risk, but there are many best practices to follow using a risk-based approach.

Security measures in companies are often not standalone; they have to be a combination of technical, organizational, policy, and people measures. The Five Safes framework of the UK Data Service is a great place to start. What that entails differs across companies, but for us it’s:

  • Safe data: data and models are treated to address any confidentiality concerns.
  • Safe projects: projects are set up in safe environments with access controls.
  • Safe people: personnel are trained and authorized to use data and model assets safely.
  • Safe settings: a secure development environment, both on-premises and in the cloud, with state-of-the-art security configurations to prevent unauthorized use.
  • Safe outputs: outputs are screened and approved as non-disclosive.

Companies should have good data management and governance practices so that they have an accurate and reliable picture of how well these safe practices are being followed.

Would you like to share what’s next for Clearbox AI?

Our goal is to make synthetic data mainstream to accelerate digital transformation and innovation in companies. We have a market-ready product that can be securely installed either on-premises or in a company’s private cloud to safely generate high-quality synthetic data for a wide range of uses. We already have our first large clients using our solutions, so commercially we would like to step up and establish ourselves as market leaders in our niche. Technically, we would like to focus on using synthetic data to make AI models more interpretable, unbiased, and fair. We would also like to explore hybrid approaches to generating synthetic data, for example by using generative models together with agent-based simulations.