AI is having a moment. Recent advances in deep learning have led to bigger and better models and increased adoption in mission-critical settings (self-driving cars, healthcare, malware detection, etc.). When deployed in uncontrolled, open-world environments, these models must remain reliable and perform their intended function.
Consider the context of autonomous vehicles. The camera sensors in AV systems are subject to varying weather and lighting conditions, along with real-world perturbations (e.g. damaged stop signs). Whatever the circumstances, the system needs to respond accurately.
Unlike traditional software engineering, it is challenging to build safety guarantees into machine learning (ML) systems, given the black-box nature of the algorithms and the dynamic system architecture. Research is underway to address ML's limitations around interpretability, verifiability and performance, giving rise to the broader field of Responsible (and Safe) AI. This has led to the emergence of start-ups and spin-offs attacking various parts of this problem.
This post is an attempt to summarise what Responsible AI is, and its challenges and opportunities, including interesting commercial approaches, winning factors and challenges start-ups face.
“Deploying models is hard. Trusting predictions is harder.”
Until a few years ago, efforts in the machine learning world were heavily focused on giving programs the ability to learn and perform seemingly simple tasks. It took us 35 years to go from creating a program capable of learning to play noughts and crosses (1961) to a computer capable of beating a human chess champion (1996).
Nowadays, innovation is hard to keep up with, but the level of interest points to a certain degree of maturity in the space - at least of inputs, if not outcomes.
The interest comes from the quest to fulfill the long-held promise of abstracting lower-order tasks from humans, improving everyday life, and even aiding in solving known and unknown challenges. Thinking about the complexity of our lives and real-world contexts makes it somewhat understandable that the systems and techniques needed to solve these problems have evolved to be increasingly complex. While we have made significant progress in controlled ‘lab-like’ environments, a lot of the recent conversation in the AI community has focused on two dimensions:
Deploying models into production
Only 30-50% of AI projects make it from pilot into production and those that do take an average of nine months to do so. This was echoed in the most recent tech leaders survey we did at MMC, where MLOps emerged as the top priority for tech and data teams in the coming quarters.
Trusting AI systems to perform reliably
Another challenge is ensuring AI systems deployed into production can be trusted to perform reliably. If guarantees break in critical settings, the consequences can be severe - causing financial and real-world damage.
Enter Responsible AI.
What is Responsible AI?
In the broadest sense, Responsible AI refers to ensuring that AI is deployed in ways that do not harm humanity. Responsible AI covers human concepts like alignment, social and political aspects like fairness and bias, alongside more technical concepts like system reliability, safety, robustness, transparency and security.
While the human concepts are fascinating, this post covers the technical research and interventions to identify and avoid unintended AI behaviour, including some exciting start-ups building on this opportunity.
Why is this interesting right now?
We’re not saying the Responsible AI market has exploded. It’s still early days. While MLOps adoption might be in the single digits, we estimate the adoption of responsible or safe AI practices to be even lower.
We at MMC Ventures partner with start-ups in the early days when ingredients are added, and the pot starts to simmer. Below are some signals that the dish might not be far from being served.
Increasing research focus
Research papers centred on Responsible AI-related themes have grown >10x in the past 5 years.
The scope of machine learning is expanding from standalone improvements in business processes to entire processes and strategies being designed around it. Additionally, within the last 12 months, we have seen the emergence of research labs like Anthropic, Conjecture AI and Redwood focused exclusively on AI safety and alignment - raising significant funding rounds.
Big tech is leading the way on frameworks, with Google publishing its 4th update and Microsoft making its standard public earlier this year. Anecdotally, key executives like OpenAI's Chief Scientist, Ilya Sutskever, have recently pivoted to spending 50% of their time on safety research. We believe this is bound to spill over to tech teams everywhere.
Emergence of foundational models
Since the introduction of Transformers by Google Brain’s team in 2017, we have seen a massive leap in the emergence of foundational models exceedingly good at performing a wide variety of tasks, often out of the box. This has nudged the role of AI from transactional to creative domains, taking it from cost to revenue territory.
Although these models are massively powerful and promise significant benefits, they can also be black boxes. The unique mix of power and opacity in these deep learning models has prompted OpenAI and DeepMind, two companies at the forefront of this development, to proactively acknowledge and address the concepts of safety, misuse and harm.
Increased adoption and awareness of the existing vulnerabilities are paving the way for new regulations. In 2021, The European Commission proposed a regulatory and legal framework - the AI Act. In the context of safety, high-risk systems (medical devices, recruitment, etc.) will need to i) validate the use of high-quality training, validation and testing data, ii) establish traceability and auditability, and iii) ensure robustness, accuracy and cybersecurity.
On similar lines, EU GDPR (Recital 71) points to the right to explainability.
In October 2022, the US followed suit, with The White House releasing a Blueprint for an AI Bill of Rights, which states that “[s]ystems should undergo pre-deployment testing, risk identification and mitigation, and ongoing monitoring that demonstrate they are safe and effective based on their intended use, mitigation of unsafe outcomes including those beyond the intended use, and adherence to domain-specific standards.” This will be a key driver of conversation velocity in the coming years, and we at MMC are closely monitoring the timing here.
Reactions to Twitter’s ethical AI team’s layoff this week sum up the shift:
"…seeing a “sense of resolve” [among clients] coming out of the news of Twitter’s ethical AI layoffs. “They’re saying, ‘we can be better than that, we’re going to enable [responsible AI teams] and start getting them to put stuff into practice,’” she said. “Like, let’s move away from the thought leadership stuff about this and get rubber on the road.”
The challenges and opportunities of Responsible AI
For anyone who’s been down the MLOps rabbit hole and thought it was complex, let me start by painting an overwhelming breakdown of the components of Responsible AI.
The above framework shows the number of modules that need to fit together to fulfill the promise of Responsible AI. Like any new space, activity is fragmented and we see numerous start-ups take shots at different parts of the problem, often covering multiple modules.
By mapping the MLOps value chain as the attack vector, four critical areas of interventions for Responsible AI emerge, where a broad enough platform with substantial value can be created.
1. Data quality
Data is at the heart of all machine learning systems, so safety considerations must enter the process chain here: data quality and reliability should be thought through from the moment teams start to gather, prepare and refine data. As Andrew Ng suggests, if 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.
The momentum behind the Data-Centric AI movement is refocusing attention on challenges like the availability of high-quality labelled data for training and validation. As teams plug this gap, often sourcing datasets from outsourced labelling vendors, issues like label errors on large-scale annotated datasets can crop up, leading to incorrect outcomes, alongside issues like bias and underrepresentation that need to be diagnosed and fixed. The good part is that a lot of the current (and in-development) techniques are designed to be generalisable across datasets.
And there are a number of start-ups exploiting this - further pointing to both the complexity of the problem and the importance of data.
Although there are start-ups working on seemingly small niches here, better data is the most significant way of ensuring better systems. Each niche ends up having a massive addressable market, and we remain bullish on the start-up value creation in this segment.
- As a starting point, reliability is addressed by deploying appropriate tests on incoming data through (modern, but now standard) testing platforms like Monte Carlo and open source libraries like Great Expectations (or ‘dbt-fy’ it like everything else through elementary data or re_data)
- The pursuit of better data quality is squarely behind the improvements in annotation techniques. Modern annotation companies like Labelbox and V7 Labs are bringing a range of model-assisted labelling features to improve annotation time. Humanloop pushes this further through techniques like active learning - selectively labelling valuable samples (“edge cases” from an unlabelled pool) for increased efficiency - and programmatic labelling - rules-based approximations for faster labelling. Start-ups like Quality Match provide a platform to triple-check the work of annotation providers and identify annotation errors and edge cases in incoming data
- In the last two years, Feature Stores have emerged as the formal bridge between data pipelines and ML workflows. Most of the tests around data processing and transformation are rightfully integrated at the feature store component, e.g. Hopsworks integrates with Great Expectations to ensure consistent data for downstream processes, in addition to supporting the testing of feature pipelines and training pipelines locally before they are deployed to production
- Companies like Synthesized and Synativ have clever approaches of generating synthetic images to address edge cases, dataset imbalances and overall reduce the real-world data preparation burden to get models up and running
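To make the first bullet concrete, here is a minimal sketch of the kind of declarative checks these tools run on incoming data, written as a hand-rolled stand-in for a library like Great Expectations (the record fields, thresholds and function names here are hypothetical illustrations, not any vendor's API):

```python
# Hypothetical data quality checks in the spirit of Great Expectations:
# each "expectation" validates a batch of records before it reaches
# training or feature pipelines.

def expect_not_null(rows, column):
    """Fail if any record is missing a value for `column`."""
    bad = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {"expectation": f"{column} not null", "success": not bad, "failed_rows": bad}

def expect_between(rows, column, low, high):
    """Fail if any present value falls outside the expected range."""
    bad = [i for i, r in enumerate(rows)
           if r.get(column) is not None and not (low <= r[column] <= high)]
    return {"expectation": f"{column} in [{low}, {high}]", "success": not bad, "failed_rows": bad}

def validate(rows, expectations):
    """Run every expectation and report overall pass/fail plus details."""
    results = [check(rows) for check in expectations]
    return all(r["success"] for r in results), results

# Toy batch of sensor readings with one null and one out-of-range value.
batch = [{"speed_kmh": 42.0}, {"speed_kmh": None}, {"speed_kmh": 310.0}]
ok, report = validate(batch, [
    lambda rows: expect_not_null(rows, "speed_kmh"),
    lambda rows: expect_between(rows, "speed_kmh", 0, 200),
])
print(ok)  # False: the batch should be quarantined, not trained on
```

The value of the commercial platforms is less in any single check than in running suites like this continuously, versioning them, and surfacing the failed rows to the right owner.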
2. Model experimentation and early explainability
Specifying the desired model behaviour is a design requirement before any system development. However, it is not straightforward when dealing with dynamic and high-dimensional data, like images. Data scientists start attacking any problem by setting up draft models and analysing the initial outputs, hoping to get to a workable model through iterations.
This phase can be messy, with issues ranging from the mechanical (manually setting up data pipelines for each edited iteration) to the conceptual (a lack of clear understanding of how the machine converts high-dimensional data into its internal representations, making it hard to determine where to begin).
Additionally, complex neural networks have multiple computation steps, and tracing output dependency on input data and/or individual nodes of the model is a complex problem. The challenge of explainability shows up here and continues to be relevant through further stages of development. Explainability needs are contextual, and can vary by use cases and industries. For example, in a logistics setting, a route optimisation program might need to explain to the driver if the suggested route is considering time or cost. The optimisation engineer in the back-end might want a deeper understanding of the factors involved in making the decision.
Each individual problem is big enough and start-ups are more focussed on valuable niches here:
- instill.tech is building reliable and replicable pipelines for unstructured data. Think Airbyte for computer vision data
- Tools like Rerun.io push for visualisation-driven development by mapping out unstructured computer vision data in an interactive spatial space. This is helpful when quickly prototyping new algorithms and debugging early edge cases
- Start-ups like Tensorleap and Tenyks are combining explainability techniques to help ML developers peek inside the black box and pinpoint where and why the algorithms and/or datasets fail (and how to fix them)
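To illustrate the simplest end of the explainability spectrum, here is a sketch of permutation feature importance, a common model-agnostic technique: shuffle one feature at a time and measure the accuracy drop. The model and data below are toy stand-ins, not taken from any of the start-ups above:

```python
# Hypothetical sketch of permutation feature importance. A large accuracy
# drop after shuffling a feature means the model leans heavily on it.
import random

def model(x):
    # Toy "model": predicts 1 when feature 0 is positive; feature 1 is ignored.
    return 1 if x[0] > 0 else 0

def accuracy(model, X, y):
    return sum(model(x) == yi for x, yi in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, seed=0):
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    shuffled = [row[feature] for row in X]
    rng.shuffle(shuffled)
    # Rebuild the dataset with only `feature` permuted across rows.
    X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, shuffled)]
    return baseline - accuracy(model, X_perm, y)

X = [[1, 5], [-1, 5], [2, -3], [-2, -3]] * 25   # 100 samples
y = [1, 0, 1, 0] * 25
print(permutation_importance(model, X, y, feature=0))  # large drop: important
print(permutation_importance(model, X, y, feature=1))  # zero drop: ignored
```

Production tools layer far more on top (attributions per prediction, dataset slicing, failure clustering), but the underlying question is the same: which inputs is the model actually using?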
Experimentation as a stage is core to the data scientist’s work. Tech interventions, although thin, will have applicability across teams. High word-of-mouth and low competition will aid faster adoption of tools in this category. We are opportunistically on the lookout for tools that unlock enough value for data teams.
3. Testing, QA and debugging
After a few cycles of refining models and training them on high-quality data, teams might feel confident in their model’s ability to perform its intended function. Following DevOps learnings, the next step is systems testing before entering the production phase. However, unlike traditional software testing, ML testing is much more complex, given the need to cover three big areas - data, model and infrastructure.
Defining tests requires a fixed baseline of expected behaviour, which is not straightforward given the lack of transparency. Models are not static; they change and evolve every time they are re-trained on new data. Real-world behaviour is complex to model, so even comprehensive test suites need to be broad. Moreover, test coverage is difficult to define for ML systems: unlike the number of lines of code covered (a key metric in traditional software), coverage for an ML system has to account for its parameters and the data it may encounter, not just its code paths.
The source paper for the image above is a good starting point for teams looking for a list of tests. There is, however, a need for a platform that enables self-serve testing modules, the creation of new ones, and their reuse in future projects.
In contrast to the above two areas, start-ups in this segment aim to build a one-stop comprehensive suite covering a variety of tests. Given the early days, it is interesting to note how these start-ups are experimenting with different positioning messages for their customers.
- Robust Intelligence promises ML integrity through a firewall approach covering initial testing and ongoing monitoring of incoming data
- LatticeFlow offers auto-diagnosis of data and models, finding potential blind spots without needing explicitly defined tests. Plus, its synthetic data creation capability is tightly integrated to ‘fix’ vulnerabilities - a gap most vendors realise they have to plug
- Lakera and Giskard are both creating a comprehensive library of standard ML tests out of the box with plans to source the long tail of tests through community contributions
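To illustrate what a standard out-of-the-box ML test can look like, here is a minimal sketch of behavioural testing in the spirit of approaches like CheckList: a minimum-functionality test on unambiguous examples and an invariance test under meaning-preserving edits. The keyword-based sentiment model is a hypothetical stand-in:

```python
# Hypothetical behavioural tests against a toy keyword-based sentiment model.
# Commercial suites ship hundreds of such tests out of the box.

POSITIVE = {"great", "good", "excellent"}
NEGATIVE = {"bad", "awful", "terrible"}

def sentiment(text):
    """Toy model: counts positive vs negative keywords."""
    words = set(text.lower().strip(".!?").split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "pos" if score > 0 else "neg" if score < 0 else "neutral"

def test_minimum_functionality():
    # The model must get unambiguous examples right.
    assert sentiment("The service was great") == "pos"
    assert sentiment("The food was awful") == "neg"

def test_invariance_to_irrelevant_edits():
    # Prediction should not flip under casing/punctuation changes.
    assert sentiment("The service was great") == sentiment("the service was GREAT!")

test_minimum_functionality()
test_invariance_to_irrelevant_edits()
print("all behavioural tests passed")
```

Note that these tests pin down behaviour, not implementation - which is exactly why they survive re-training, the property that makes ML testing different from unit testing code.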
Testing is the first of the ‘broad enough’ attack vectors for building a valuable platform. The timing of the intervention (pre-production) can work in favour of start-ups here to closely understand workflow needs and expand usage.
4. Monitoring and observability
Monitoring is a well-established concept in traditional software engineering. Tools for monitoring applications and infrastructure are found in most stacks. As more models come into production, teams naturally want to understand how the system behaves over time when exposed to real-world data.
In addition to being opaque, complex ML models and neural networks also fail silently. A model might continue giving an output accepted by downstream systems without being flagged. There might be a skew between training and production data - either due to real-world conditions or a mistake in the feature transformation steps. Issues like data drift (changing distributions of incoming data, causing performance degradation) and concept drift (changes in the relationship between inputs and outputs over time) inevitably show up. For example, a model might tag “the band tonight was sick!” as negative if it were trained on text from before “sick” became contextually positive slang. As in traditional engineering, alert fatigue can also be a challenge - e.g. a surplus of alerts triggered even when overall model performance is adequate.
This is the space where I see the most start-up activity. It is understandable, as observing system behaviour is essential across all use cases, even in the absence of safety considerations.
- Platforms like Fiddler AI and others are executing on the vision of emerging as the end-to-end platform enabling most monitoring, testing and eventually reporting modules for their customers
- NannyML is innovating at the individual prediction level, with algorithms that estimate performance metrics even before ground-truth labels become available
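As a concrete example of the drift checks such platforms run, here is a sketch of the Population Stability Index (PSI), a widely used drift metric: bin the training-time (reference) distribution of a feature, compare live traffic against those bins, and alert above a threshold. The data and the 0.2 alert threshold below are illustrative, not any vendor's defaults:

```python
# Hypothetical PSI-based data drift check. PSI near 0 means the live
# distribution matches the reference; above ~0.2 is a common alert threshold.
import math

def psi(reference, live, bins=10):
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def frequencies(values):
        counts = [0] * bins
        for v in values:
            i = sum(v > e for e in edges)   # index of the bin containing v
            counts[i] += 1
        # Smooth zero counts so the log term below is defined.
        return [max(c, 0.5) / len(values) for c in counts]

    p, q = frequencies(reference), frequencies(live)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

reference = [i / 100 for i in range(1000)]           # uniform on [0, 10)
stable    = [i / 100 + 0.003 for i in range(1000)]   # essentially unchanged
shifted   = [i / 200 + 5 for i in range(1000)]       # mass moved to [5, 10)

print(f"stable PSI:  {psi(reference, stable):.3f}")   # well under 0.2
print(f"shifted PSI: {psi(reference, shifted):.3f}")  # well over 0.2: alert
```

A production monitor runs this per feature per time window, which is where the alert-fatigue problem mentioned above comes from - a naive threshold on every feature fires constantly.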
On the painkiller-versus-vitamin spectrum, monitoring is the painkiller: it is the first thing most ML teams reach for in the emergency room. This makes it the other significant route for value consolidation.
What makes the space exciting
If you’ve read this far, here’s a brief overview of what I find interesting as an investor:
- New category
Responsible AI solves a new class of problems for ML teams, who have so far been focused on model development. This creates a new market in which there are still islands of value left, and we think it will be an important one. Start-ups are, however, quick to dissociate themselves from the MLOps hype, though safety considerations might soon get more attention from overwhelmed data scientists
- Possibility to innovate on business models
PaaS or SaaS will remain the dominant business model, but Responsible AI also presents some new monetisation metrics. Conversations are still early, and the lack of consensus opens an opportunity to try different approaches. On the technical side, the number of models and tests is easy to define and measure, but these metrics are static and do not fully represent the business value of detecting edge cases and failure points. The amount of data processed is more representative of volume, but customers are increasingly tired of paying additional fees every time they interact with their data. Start-ups are innovating to move closer to business-side metrics - the number of external integrations and failure-detection SLAs. The next wave of start-ups has an opportunity to use clever combinations of these
- Sticky revenue
The ongoing nature of continuous testing, monitoring, and perpetual iteration of models should make the usage and revenues sticky, and hence more valuable. Once the monetisation unit is defined, it might be easier to drive high NRR.
- Regulation-driven adoption
Regulation is already starting to create adoption pressure for clients in healthcare and transportation, with consumer-facing experiences to be affected soon. Building a product in lockstep with regulatory requirements, while making it easier for customers to run audits and reports, opens a good tailwind for customer acquisition. Timing on privacy regulation worked great for Collibra’s data governance products. We believe Responsible AI start-ups have a similar opportunity.
Challenges start-ups will need to overcome
- Is it a platform or a feature?
This continues to be the question every team has to answer. How do they go from solving narrow problems to being embedded as a broad enough platform that fits the customer's everyday workflow? Cutting through the noise will be challenging. Teams at the forefront of driving a mindset and workflow shift (beyond tooling alone) might find better adoption. Integrations with the existing MLOps stack and frameworks will be critical for frictionless adoption, making way for commercial partnerships.
- Managing multiple stakeholders
ML teams have multiple stakeholders involved in the delivery of systems, and Responsible AI interventions will pass through similar profiles - ML engineers, (conventional) software engineers, data scientists and business functions, with compliance and safety teams coming soon. Data scientists are tasked with most of the safety outcomes and make for the prominent and common choice of initial user. Building killer user journeys for the other profiles can be an area of differentiation, especially with collaboration becoming important out of the box
- Balancing both commercial and academic focus
Given that a lot of research on techniques and algorithms is still underway, Responsible AI start-ups will need to continue to participate and contribute alongside the academic community. The IP and differentiation will mainly come from the depth of tests, monitored metrics and debugging algorithms. Continuing focus on research is important for two reasons: i) research-focused start-ups will be more attractive for talent, which is mostly coming from academia, at least in the short term and ii) regulatory bodies are still figuring out the taxonomy of the space. Research and thought leadership will be key for co-developing industry standards and getting an edge
- Demonstrating exit potential
All-in-one AI infrastructure platforms are being developed by each of the Big 3 cloud providers, as well as vendors like Databricks and DataRobot, to fulfil broader MLOps needs. Responsible AI indeed goes beyond MLOps, but these players may present the most likely exit route. The economic slowdown might make it harder for start-ups to demonstrate growth - critical to being well-positioned for any potential M&A. Winning teams will focus on all the above to build a strong business
We at MMC are early-stage investors in AI businesses (Signal AI, Peak, Senseye, Recycleye) and the infrastructure layer (Snowplow, Ably, MindsDB, Tyk). We distinguish ourselves by going deeper - on the technologies we are investing in, and on the partnerships we build with the entrepreneurs applying them. If you are a founder building on the topics above, or are actively thinking about some of the variables discussed here, we’d love to talk and potentially involve you in our community of technical leaders. For further research and insights, click here.