Fighting bots at the source: Why we invested in BotGuard


The BotGuard team

Bots are a big, growing problem. These automated software applications account for nearly half of all global web traffic, with 73% stemming from malicious bots, and that share is increasing by 5% year on year. With the rise of AI, bots will only grow in sophistication and attack velocity, requiring equally sophisticated defence mechanisms to fight them and accelerate the evolution away from outdated captcha forms.

BotGuard is that evolution.

Good bots, bad bots

There are broadly two sorts of bot traffic: good and bad. Good bots include search engine bots that index new content and site monitoring bots that analyse website performance. Bad bots engage in harmful activities, like competitive data mining, personal and financial data harvesting, brute-force logins, digital ad fraud, and denial of service.

The problem is that all bot traffic – good, bad, and in between – consumes bandwidth, adding server load, impeding critical services and causing outages. This is bad enough for domain owners, whose websites can go down or be compromised, but even worse for hosting service providers, who are seeing their infrastructure running costs increase.

Democratising access to better web traffic

BotGuard was born out of this frustration, with co-founders Nikita Rozenberg (CEO) and Denis Prochko (CTO) having experienced the bot problem first hand.

It all began as a side project. Denis had created a website focused on vegetarian diets with a lot of proprietary content – and then an army of scraper bots began to steal it all. Nik, who started his career in the ’90s as a web developer and spent two decades in the web hosting world, was also only too familiar with the problem. While there were plenty of tools on the market to deal with bots, they were all aimed at and priced for large enterprises. There were no affordable and effective tools for smaller players – a massive gap, given that 43% of cyber attacks are aimed at small businesses, and only 14% are prepared to defend themselves. So Denis created a prototype to solve the problem himself, and BotGuard was born.

The platform uses digital fingerprinting technology to intercept and selectively block potentially malicious web requests, adapting to the traffic patterns of a particular website in real time. It mirrors these requests to the BotGuard server for analysis, filtering out only the traffic that poses a threat and allowing legitimate requests through. As standard, about 80% of traffic gets blocked – the malicious bots – while decisions on the remaining 20% are left to the domain, so the domain owner can choose which “good” bots they want to allow on their website.
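
To make the mechanics concrete, here is a minimal sketch of how a server-side filter of this general kind might work – fingerprint each request and consult an external analysis service before allowing or blocking it. This illustrates the pattern only, not BotGuard’s implementation; the endpoint, headers and fail-open behaviour are our assumptions.

```python
# Minimal sketch of server-side bot filtering (illustrative only; not BotGuard's code).
# A WSGI middleware fingerprints each request and asks a hypothetical analysis
# service whether to let it through.
import hashlib
import requests  # pip install requests

ANALYSIS_URL = "https://analysis.example.com/verdict"  # hypothetical endpoint

class BotFilterMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # Build a crude fingerprint from headers commonly used in bot detection.
        raw = "|".join([
            environ.get("REMOTE_ADDR", ""),
            environ.get("HTTP_USER_AGENT", ""),
            environ.get("HTTP_ACCEPT_LANGUAGE", ""),
        ])
        fingerprint = hashlib.sha256(raw.encode()).hexdigest()

        try:
            # Mirror only the fingerprint (not the payload) to the analysis service.
            verdict = requests.post(ANALYSIS_URL, json={"fp": fingerprint}, timeout=0.2).json()
        except requests.RequestException:
            verdict = {"action": "allow"}  # fail open so legitimate traffic is never dropped

        if verdict.get("action") == "block":
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Request blocked"]
        return self.app(environ, start_response)
```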

Targeting bots at the infrastructure layer

BotGuard stands out by tackling bots at the source. Rather than focusing on websites and domains, the product is designed specifically for the web clusters and data centres of hosting service providers (HSPs – think GoDaddy but smaller) and managed service providers (MSPs – typically digital agencies and consultancies) to block bots at the server level. And this makes sense: around 60% of bot traffic originates from data centres.

This approach not only reduces server load and traffic by up to 25%, cutting costs and improving service levels, but also extends the benefits of bot management to all domains on the network, democratising access to better web traffic.

This also creates a unique sales dynamic. For MSPs, BotGuard offers an avenue for additional recurring revenue by bundling bot management with their standard maintenance services. And HSPs become channel partners, able to upsell advanced features like dashboards and custom AI rules to their clients. They can differentiate their offerings while empowering end customers with more granular control over their own web traffic.

Fighting AI with AI

Bots are getting more sophisticated, with generative AI accelerating the trend. Using AI, bots can better learn and mimic human behavioural patterns – like how fast your cursor moves across the screen – and act autonomously, posing a more significant threat as they become increasingly indistinguishable from legitimate human users. Traditional defences like captcha forms, which also create high-friction user journeys for any website with a sign-up or payment flow, are becoming less effective against these evolving threats.

So BotGuard is building more sophisticated AI models and behavioural analytics tools to better detect and prevent these more sophisticated, AI-assisted attacks. Fighting AI with AI. And, looking ahead, this technology extends beyond mere defence.

Large language models (LLMs) typically depend on web scraping bots for their mass of training data – a live issue in the current discourse on AI and data regulation. BotGuard has the potential to allow domain owners to choose whether or not they want to block this sort of scraping or charge for access. If BotGuard cracks it, it would enable domain owners to not only protect but also profit from their websites and data, unlocking a completely new market and becoming an integral part of the regulatory infrastructure for AI.

An evolution in bot management

So we’re thrilled to be leading BotGuard’s €12 million Series A, alongside existing investors Tera Ventures and Expeditions Fund. This latest round will be instrumental in fueling BotGuard’s expansion – both technically and internationally. The team will use this funding to expand to the US, where bot traffic is a massive problem, invest in their AI capabilities to tackle the ever more sophisticated bot attacks, and lay the groundwork for new functionalities, including domain-level custom rules and monetisation opportunities for ‘good’ bots.

As a research-driven fund, we spent a lot of time researching the bot management space, and we concluded (much like Nik and Denis) that most existing solutions only catered to large enterprises at the domain layer, leaving the vast segment of small and mid-sized businesses and web hosts without adequate support. One of the things we love about the BotGuard team is their passion to build a secure internet that is accessible to everyone.

With its unique approach to bot management at the infrastructure layer, BotGuard is not just another cybersecurity product; it’s a fundamental shift in the way we protect and manage our digital ecosystem. And with their deep personal experience of the frustration and cost for domain owners, Nik, Denis and the team have the opportunity to build a platform that is truly indispensable to a previously underserved market.

This investment builds on our thesis of the growing need for security for AI solutions and the use of AI in security (keep an eye out for my colleague Advika’s next post on AI security research). I’ve personally backed the teams at Senseon (using AI to automate the process of threat detection, investigation and response) and Red Sift (using machine learning to prevent cyber attacks via email) as part of this trend.

If you’re building in this space, we’d love to hear from you!

The Metamodern Data Stack: Built for the Age of AI


Co-authored by Advika Jalan & Nitish Malhotra

The explosive growth of data and heightened interest in AI will only result in increased data engineering workloads. Against this backdrop, we outline the Metamodern Data Stack, which: (1) is built around interoperable open source elements; (2) uses technologies that have stood the test of time and continue evolving; (3) leverages AI at each stage of the data lifecycle; and (4) is partly decentralised. This is underpinned by a number of recent developments, ranging from Amazon’s launch of S3 Express One Zone (EOZ) to the open-sourcing of Onetable by Onehouse – all of which we discuss in greater detail in our note.

MMC was the first early-stage investor in Europe to publish unique research on the AI space in 2017. We have since spent eight years understanding, mapping and investing in AI companies across multiple sectors as AI has developed from frontier to the early mainstream and new techniques have emerged. This research-led approach has enabled us to build one of the largest AI portfolios in Europe.

We work with many entrepreneurs building businesses that enable enterprises to take advantage of AI. Through our conversations with practitioners and founders building in the data infrastructure space, we have identified a range of emerging technological trends that would support the creation of an AI-first data stack, which we call the Metamodern Data Stack.

Source: MMC

The TL;DR – key ideas underpinning the Metamodern Data Stack

Interoperability and Openness

The Prologue: The creation of the Modern Data Stack was accompanied by the proliferation of specialised tools, with poor levels of interoperability and integration between them. Additionally, many of these specialised tools often used proprietary formats, which worsened the vendor lock-in situation. This has led to a growing recognition of the need for interoperable open source software, and in particular reusable components.

The Present: Interoperable Open Ecosystems such as the one around Apache Arrow are gaining momentum, while late 2023 saw the launch of Interoperable Open Layers such as Onetable.

The Future: We expect more startups to build around Interoperable Open Ecosystems because it improves time to market, reduces undifferentiated heavy lifting, and allows them to instead focus on value-accretive activities. Meanwhile, we expect enterprises to adopt Interoperable Open Layers to unlock value across their tech stack.

Longevity and Evolvability

The Prologue: Given the rapid innovation cycles where today’s hottest technology is obsolete tomorrow, practitioners are increasingly focused on longevity and evolvability – technologies that have or are likely to withstand the test of time, but are also getting better and more performant for emerging use cases. Object storage is a good example of Longevity and Evolvability, while SQL (the lingua franca for data since the 1970s) perfectly illustrates the idea of Longevity.

The Present: In November 2023, Amazon announced the launch of S3 Express One Zone (EOZ), a new object storage class that features low latency access (making the traditionally high-latency object storage c.10x faster). This should be beneficial to AI use cases that require fast access to storage. Meanwhile, new query languages which compile to SQL (such as PRQL and Malloy) continue gaining traction.

The Future: We believe startups can do two things: build their tech stacks on long-lived and evolvable technologies (such as object storage) and build solutions that evolve the long-lived (e.g. PRQL, Malloy). Particularly as AI/ML workloads increase, we expect more startups to build their primary storage on object storage, given the step change that Amazon S3 EOZ represents.

Data-Centric AI and AI-Centric Data

The Prologue: Enterprise demand for leveraging AI is increasing by leaps and bounds, and 70% of the work in delivering AI use cases is data engineering, making it imperative to enhance data engineering workflows. In particular, the ‘data-centric AI’ movement has renewed the focus on quality data.

The Present: Different ways of leveraging AI in the data lifecycle are emerging, e.g. improving data collection at the source by simulating image capture settings to optimise for quality, or automatically integrating image, text and video data by not only bringing them together, but also combining the context and narrative underpinning them.

The Future: We expect AI-powered data workflows to become more mainstream going forward, given various benefits such as automation, improved data quality, and better utilisation of unstructured data. Startups building in the data infrastructure space can consider adding AI-powered capabilities where it helps enhance workflows.

Decentralisation and Unification

The Prologue: Data decentralisation is a growing trend, as expanding workloads overwhelm centralised data teams and regulations (such as those around data privacy and sovereignty) necessitate a distributed architecture. However, successful data decentralisation requires overcoming challenges with integration, interoperability and governance.

The Present: Many enterprises are implementing a hybrid version of the data mesh that involves a certain degree of centralisation (vs the more decentralised paradigm that characterises the data mesh i.e. no data warehouse). Independent of the data decentralisation trends, other technologies which would support “unification” in a decentralised environment are gaining traction e.g. data contracts, data catalogs and semantic layers.

The Future: We expect enterprises to adopt the “unification” technologies ahead of moving into a more decentralised structure, to ease the process and make the transition easier. As a result, we would expect more startups to build “unification” technologies to take advantage of this trend towards decentralisation.

Pragmatism and Idealism

Enterprises are reluctant to quickly replace mission-critical parts of their tech stack, given fears around disrupting business activities. Against this backdrop, a pragmatic approach would be to create solutions that let enterprises use their existing tools yet at the same time take advantage of the latest technologies.

The fundamental idea here is to focus on both the existing and the upcoming. For instance, a database gateway that decouples the app stack from the database engine wouldn’t just work with what is legacy and new as of today – it would also anticipate that today’s latest technologies will become obsolete in the future and provide a way to work with them alongside their successors. That’s why we believe pragmatism is a good starting point.


Taking a deeper dive into each of the key themes…

Interoperability and Openness

As we moved away from the tightly-coupled, monolithic models of the traditional data stack to the modularity of the modern data stack, this was accompanied by the proliferation of specialised tools, with poor levels of interoperability and integration between them. Additionally, many of these specialised tools often used proprietary formats, which worsened the vendor lock-in situation. This has led to a growing recognition of the need for interoperable open source software, and in particular reusable components. While the quest for interoperability isn’t new, it becomes all the more critical to achieve this to support growing AI workloads – and certain recent developments caught our attention.

For instance, at least 40% of Databricks customers use Snowflake, and vice-versa, despite the growing overlap between the two solutions as they race to be the primary choice for supporting AI workloads. However, they use different open source table formats – Databricks uses Delta Lake and Snowflake supports Iceberg. Although Databricks introduced UniForm (which automatically generates Iceberg metadata asynchronously, allowing Iceberg clients to read Delta tables as if they were Iceberg tables), this isn’t a truly bidirectional solution for interoperability between Delta Lake and Iceberg. Adding further complexity is the popularity of other formats, such as Hudi, and the emergence of newer ones, such as Paimon. This led to the creation of Onetable (open sourced in November 2023 by Onehouse), which supports omni-directional interoperability amongst all these different open table formats with the aim of creating a vendor-neutral zone.

Generally speaking, we see two main open source interoperability paradigms:

  1. Interoperable Open Ecosystem, which can be thought of as a type of “vertical interoperability”, where open source components at different “levels” – e.g. Parquet at the storage level, Arrow at the memory level, DataFusion at the query engine level and Flight at the network level – are all highly compatible with each other.
  2. Interoperable Open Layer, which can be thought of as a type of “horizontal interoperability”: an open source layer that sits atop different non-interoperable open source components at the same level to create interoperability. E.g. Delta Lake, Iceberg and Hudi are all open table formats, so they sit at the same level, but they aren’t directly interoperable with each other – which is where Onetable acts as the Interoperable Open Layer.

Source: MMC

We illustrate the Interoperable Open Ecosystem through the example of Apache Arrow, an open source in-memory columnar format. Arrow is language agnostic (so systems written in different programming languages can communicate datasets without serialisation overhead) and creates an interoperable standard. Without standardisation, every database and language would implement its own internal data format, resulting in costly serialisation-deserialisation (which is a waste of developer time and CPU cycles) when moving data from one system to another. Systems that use or support Arrow can therefore transfer data at little to no cost.

Arrow is language agnostic and zero-copy, which brings significant performance benefits

Source: https://arrow.apache.org/overview/

Arrow by itself is incredibly useful, and its appeal is greatly enhanced by other highly interoperable components built around it, such as DataFusion for the query engine or Flight for network data transfer. Arrow also works extremely well with the storage format Parquet. With the development of this interoperable open ecosystem, not only do we see a way forward for reduced fragmentation in the data stack, but also improved performance – which in turn is driving greater adoption of the Arrow ecosystem by vendors.
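
As a small, concrete illustration of how these pieces fit together, the sketch below builds an Arrow table in memory, writes it to Parquet and reads it back with the pyarrow bindings; the data and file name are ours, not drawn from any vendor.

```python
# Illustrative sketch: Parquet on disk, Arrow in memory, pandas at the edge.
# pip install pyarrow pandas
import pyarrow as pa
import pyarrow.parquet as pq

# Build an Arrow table in the standard columnar in-memory format.
table = pa.table({"user_id": [1, 2, 3], "spend": [9.5, 12.0, 3.25]})

# Persist it as Parquet (the on-disk format the Arrow ecosystem pairs with).
pq.write_table(table, "spend.parquet")

# Read it back: the Parquet columns are decoded straight into Arrow arrays,
# and any Arrow-aware engine (DataFusion, DuckDB, pandas, ...) can consume
# them without a bespoke serialisation step.
round_tripped = pq.read_table("spend.parquet")
print(round_tripped.to_pandas())
```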

What’s happening now?

While we’ve seen data lakehouse vendors such as Dremio contributing to these projects since their inception and adopting them within their stack (Dremio uses Arrow, Iceberg and Parquet), we see early proof points that these are gaining traction. For example, InfluxDB 3.0, Polar Signals’ FrostDB, and GreptimeDB are new database solutions launched over the course of 2022-23 that have been built around the Arrow ecosystem.

InfluxDB’s visualisation of the FDAP Stack (Flight, DataFusion, Arrow, Parquet)

Source: https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/

What’s next?

Going forward, we would expect more vendors to adopt interoperable open source components in their stacks. Within the database space, the Arrow ecosystem offers superior performance and has been consistently improving since its inception in 2016, adding more and more tooling that enhances functionality and interoperability. Because it is composed of highly interoperable elements, data engineers waste less time on managing different integrations and compatibilities, which frees them up to pursue other value-accretive tasks rather than undifferentiated heavy lifting. This in turn creates competitive advantages. As a result, we expect startups to build around interoperable open ecosystems where possible. Meanwhile, we expect enterprises to adopt interoperable open layers to unlock value across their tech stack (e.g. extracting value from both Databricks and Snowflake investments by leveraging Onetable).


Longevity and Evolvability

Given the rapid innovation cycles where today’s hottest technology is obsolete tomorrow, practitioners are increasingly focused on longevity and evolvability – technologies that have or are likely to withstand the test of time, but are also getting better and more performant for emerging use cases.

Object storage gets a new objective: becoming primary storage

The principle of Longevity and Evolvability is well illustrated by object storage. Object storage is a low cost storage option that is highly scalable, durable and available – but it has high latency. Historically, it was either used for backup and archive storage or the latency was circumvented with complex caching strategies. This changed in November 2023, when Amazon launched a new object storage class called S3 Express One Zone (EOZ) that features low latency access in a single availability zone. This means that you can enjoy the scalability of object storage WITH low latency and WITHOUT implementing complex caching strategies. Workloads and core persistence layers can now be built entirely around object storage which simplifies the architecture and code and leads to faster time to market. This is a major development for powering AI use cases, for which fast access to storage is key. That said, at the time of writing, individual API operations with S3 EOZ were 50% cheaper but storage was 8x more expensive than S3 Standard – which means startups will have to balance the trade-off between low latency and high storage cost. Nevertheless, object storage continues to get better, and we eagerly look forward to the developments that will likely follow EOZ.
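
For a sense of what “object storage as primary storage” looks like in code, here is a minimal boto3 sketch; the bucket name is a placeholder and, for S3 Express One Zone, it would be a directory bucket created in a single availability zone beforehand.

```python
# Minimal sketch of treating object storage as a primary store with boto3.
# The bucket is a placeholder; for S3 Express One Zone it would be a
# "directory bucket" created in a single availability zone beforehand.
# pip install boto3
import boto3

s3 = boto3.client("s3")
BUCKET = "my-low-latency-bucket"  # hypothetical, assumed to already exist

# Write a small record directly to object storage...
s3.put_object(Bucket=BUCKET, Key="events/2024/01/event-001.json",
              Body=b'{"user_id": 42, "action": "checkout"}')

# ...and read it back. With low-latency storage classes, patterns like this
# become viable for hot paths, not just backup and archive.
obj = s3.get_object(Bucket=BUCKET, Key="events/2024/01/event-001.json")
print(obj["Body"].read())
```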

What is past is prologue: PRQL and SQL

“Death, taxes, and SQL. Amid all of the growth, upheaval, and reinvention in the data industry over the last decade, the only durable consensus has been our appreciation of SQL.”

– Benn Stancil, CTO & Founder at Mode (link).

SQL has been the lingua franca for data since the 1970s – and this appears unlikely to change. In keeping with the principle of longevity and evolvability, we observed the growing popularity of new query languages such as PRQL (Pipelined Relational Query Language, pronounced “prequel”) and Malloy, which compile to SQL, making querying simpler and more concise. In particular, PRQL was designed to serve the growing need of writing analytical queries, emphasising data transformations, development speed, and readability. Because PRQL can be compiled into most SQL dialects, it is portable and reusable. Additionally, it uses a pipelined model: a query is written as a linear sequence of transformations, each step feeding the next, which is easier to read, compose and reason about than deeply nested SQL.
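
To give a feel for the language, here is a small sketch that compiles a PRQL query to SQL, assuming the open source prql-python bindings; the table and column names are made up.

```python
# Illustrative sketch: compiling PRQL to SQL with the prql-python bindings.
# pip install prql-python
import prql_python as prql

prql_query = """
from invoices
filter amount > 100
sort amount
take 5
"""

# The compiler emits standard SQL that can run against any SQL database.
print(prql.compile(prql_query))
```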

Most attempts to replace SQL with another query language have been (justifiably) met with deep scepticism, not to mention rants like Erik Bernhardsson’s famous “I don’t want to learn your garbage query language.” What makes PRQL especially interesting is that it continues to gain traction, and this can be attributed to its open source nature (its creators stated that it will always remain open source and there will never be a commercial product), its compatibility with any database that uses SQL, its simplicity and its focus on analytics.

PRQL continues to gain traction, despite general scepticism around new query languages

PRQL represents an important innovation, in our view – not only because of the utility of the language itself, but also because of the principles that underpin it. It represents an Evolution of a Long-Lived technology (in this case, SQL) as well as demonstrates the importance of Interoperability and Openness.

What’s happening now?

Our conversations with practitioners suggest that three intertwined concerns dominate their thinking currently: (1) major architectural decisions are often hard to reverse, which means core technologies must be chosen in a way that support both current and future needs; (2) nobody knows what future needs might look like, with everything sharply accelerated in the Age of AI; and (3) there is a dizzying array of tools to choose from. The idea of longevity with evolvability is the solution here, as we illustrated through the example of Amazon S3 EOZ.

What’s next?

We believe startups can do two things: build their tech stacks on long-lived and evolvable technologies (such as object storage) and build solutions that evolve the long-lived (such as PRQL and Malloy).

Particularly as AI/ML workloads increase, we expect more startups to build their primary storage on object storage, given the step change that Amazon S3 EOZ represents.

In the “build solutions that evolve the long-lived” case, we admit that this isn’t an option available to all startups (you could be building something brand new, the likes of which the world has never seen before, and such groundbreaking innovations with commercial use cases are incredibly exciting!) but where a battle-hardened piece of technology is ubiquitous and familiar to practitioners (and consequently difficult to displace), it is more pragmatic – and easier – to drive rapid adoption through “evolving the long-lived.” Familiarity is especially comforting in times of great change and upheaval (which we expect in the Age of AI) – a bit like how the first motor cars attached imitations of horses’ heads, and advertisements likened cars to horses, to drive adoption.


Data-Centric AI and AI-Centric Data

Our research has highlighted a key feature of the Metamodern Data Stack to be what we call AI-Centric Data, which uses AI at every stage of the data lifecycle (generation, ingestion, integration, storage, transformation, serving). We see this driven by five main factors:

  • Increased automation: Data engineering is 70% of the work in delivering AI use cases, and AI use cases are growing exponentially – resulting in a significant increase in data engineering workloads. In order to cope with this, we need higher levels of automation, something which AI itself can provide.
  • Better utilisation of data: Unstructured data (text, images, video, audio etc.) contains a wealth of detail that could generate meaningful business insights – for instance, looking at customer reviews (text) along with customer support calls (audio) and product images (visual) to get a holistic understanding of customer experience. Over 80% of a company’s data is unstructured, yet less than 1% is currently analysed – though AI is going to change that.
  • Improved quality: Without quality data, AI applications are meaningless. AI can be used for improving data collection at the source (e.g. simulating image capture settings to optimise for quality), for data curation (identifying the best data to train the AI model) and for data wrangling (identifying duplicates or errors).
  • Enhanced privacy and security: AI is already used to generate synthetic data that protects the privacy of individuals, identify and mask sensitive customer information, as well as prevent, detect and respond to security incidents such as data leakages.
  • Lower costs: AI helps to lower costs e.g. automatically detecting less frequently used data and moving it to cold storage which is cheaper, or using AI-driven data curation to inform decisions to reduce the amount of data collected upfront (less data collected = lower costs of processing, storing etc.).

What’s happening now?

We outline below some of the ways in which AI is being used across the data lifecycle:

Data Generation: Besides using AI to generate synthetic datasets (artificially generated data to overcome issues around lack of data or data privacy concerns) or for automatically annotating/labelling unstructured data, there are other AI use cases that help improve the data generation process itself. To illustrate: Dotphoton generates machine-optimal synthetic image datasets with metrological precision. This process involves simulating variations in environmental and equipment settings to optimise data processing for downstream AI systems. For example, while more expensive camera systems can yield higher quality images, they also incur greater costs. Dotphoton’s data engines allow companies to identify the Pareto front between data state and data cost for the best of both worlds: top AI system performance at the lowest possible cost.
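
To give a flavour of what “simulating capture settings” can mean in practice, here is a toy numpy sketch (illustrative of the general idea only, not Dotphoton’s technology) that degrades a clean reference image under different hypothetical exposure and noise settings and measures the impact.

```python
# Toy sketch of simulating camera/equipment settings on a clean image
# (illustrative of the general idea only, not any vendor's method).
import numpy as np

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, size=(64, 64))  # stand-in for a clean reference image

def simulate_capture(image, exposure=1.0, read_noise_std=0.02):
    """Apply a simple exposure scaling plus Gaussian read noise."""
    noisy = image * exposure + rng.normal(0.0, read_noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

# Sweep a few hypothetical camera settings and measure degradation,
# approximating the cost/quality trade-off discussed above.
for exposure, noise in [(1.0, 0.01), (0.8, 0.03), (0.6, 0.06)]:
    simulated = simulate_capture(clean, exposure, noise)
    mse = float(np.mean((simulated - clean) ** 2))
    print(f"exposure={exposure}, noise={noise}: MSE vs clean = {mse:.5f}")
```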

Data Curation: With the growing focus on “better data” rather than “more data,” we expect data curation to see greater traction – and AI can be a big boon here. For instance, Lightly combines active- and self-supervised learning algorithms for data selection, and selects the subset of a company’s data that has the biggest impact on model accuracy. This allows the company to improve its model iteratively by using the best data for retraining. In fact, this can also shed light (pardon the pun) on a company’s data gathering process (a lot of data being collected may not be meaningful enough to train an accurate model), and drive a re-thinking of the data collection process.

Data Integration and Transformation: AI models can be leveraged to perform a variety of transformations on data (e.g. conditionals, complex multi-column joins, splits) and ensure that the data conforms to business rules, defined schema and quality standards. It can even be used for pipeline migration – this CNCF blog post talks about using LLMs and prompt engineering to automate the conversion of YAML scripts of existing Jenkins, Azure and GitLab pipelines to Tekton YAML scripts, making the pipeline migration process and adoption of newer technologies much simpler.
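
As a rough illustration of this pattern (and not the exact approach in the CNCF post), the sketch below uses the OpenAI Python client to ask an LLM to convert a small, made-up CI pipeline snippet into Tekton YAML; the model name and prompt are placeholder assumptions.

```python
# Minimal sketch of LLM-assisted pipeline migration (in the spirit of the
# CNCF post referenced above). The client, model name and prompt are
# illustrative assumptions, not a prescribed toolchain.
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

source_pipeline = """
pipeline:
  stages:
    - name: build
      script: mvn package
    - name: test
      script: mvn test
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You convert CI pipeline YAML into Tekton YAML."},
        {"role": "user", "content": f"Convert this pipeline to Tekton:\n{source_pipeline}"},
    ],
)
print(response.choices[0].message.content)
```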

In particular, we are interested in AI-driven integration and transformation of unstructured data. Unstructured data is complex and exists in a variety of formats – not just image, video, and audio, but also industry-specific formats such as DICOM for medical imaging or .hdf5 for aerospace. Joining unstructured data sources with other data formats or datasets is particularly challenging – the integration is not just about bringing together different data types, but also the context and narrative underpinning them, for instance looking at customer reviews (text) along with customer support calls (audio) and product images (visual).


“The dilemma of managing unstructured data is a classic chicken-egg problem: effective AI models are needed to extract data, but these models require pre-collected or labelled unstructured data for learning. This data-driven approach, contrasting with traditional imperative programming for structured data, demands simultaneous focus on both data and AI model development.”

Ping-Lin Chang, Co-Founder & CEO at Instill AI

As such, there is greater momentum in AI-powered data transformation. For instance, Instill AI provides a no-code ETL pipeline called Versatile Data Pipeline that extracts unstructured data and uses AI to transform it into analysable or meaningful data representations, and Matillion announced that it is launching LLM-enabled pipelines. We also see AI being used in feature engineering (the process of creating a dataset for ML by changing existing features or deriving new features to improve model performance). For instance, DataRobot is using AI for automated feature engineering.

A different type of pipeline for information retrieval (and especially unstructured data) is that of Vector Compute, which applies machine learning to raw data to generate vector embeddings. This is where Vector Compute providers such as Superlinked* play a critical role: their solutions turn data into vectors and improve retrieval quality by helping enterprises create better, more suitable vectors that are aligned with the requirements of their use case, bringing in data from multiple sources instead of just a piece of text or an image.
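
For a concrete, if generic, picture of the vector compute step, the sketch below embeds a few customer reviews with an open source sentence-transformers model and retrieves the most relevant one by cosine similarity; it illustrates the general technique rather than Superlinked’s product.

```python
# Illustrative sketch of the general "vector compute" step: turn raw text into
# embeddings and retrieve by similarity. This is a generic example, not a
# specific vendor's pipeline.
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model

reviews = [
    "The battery died after two days.",
    "Customer support resolved my issue quickly.",
    "Delivery was late and the box was damaged.",
]
embeddings = model.encode(reviews, normalize_embeddings=True)

query = model.encode(["complaints about shipping"], normalize_embeddings=True)
scores = embeddings @ query.T            # cosine similarity (vectors are unit length)
best = int(np.argmax(scores))
print("Most relevant review:", reviews[best])
```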

Data Storage: The practice of AIOps (the application of AI to automating IT operations management tasks) has been around for a while, and most data storage vendors (Dell, IBM, Pure Storage) offer AIOps solutions. Particularly for data storage, AIOps is used to assess data usage patterns or access frequency for intelligent workload placement (e.g. frequently accessed data on faster, more expensive storage and infrequently accessed data on cold storage). AI can also automate several data storage tasks, such as backup, archiving and replication, and troubleshoot data storage issues. Additionally, we see startups emerging in the space, such as DBtune and OtterTune, that provide solutions for AI-powered database tuning and performance optimisation.

Data Serving: Most BI/Visualisation incumbents such as Tableau, ThoughtSpot, Qlik, Domo have introduced AI-powered tools that allow users to surface key insights or create charts from data by asking questions in natural language. Another AI-powered way of serving data is through in-database machine learning solutions provided by MindsDB* which enable users to consume AI output straight from their native database (such as MongoDB, PostgreSQL) without having to construct complex ETL (Extract, Transform, Load) pipelines.

What’s next?

Using AI within the tech stack isn’t a new idea – practices such as AIOps, which use AI to monitor the performance and reliability of applications and hardware have been around for a while. However, our belief is that this will likely become more widespread across the data stack given the benefits we outlined earlier and the latest AI-powered innovations. For instance, some startups are focused on tackling the harder problems of multi-modal data transformation using AI.

While not an explicit stage of the data lifecycle (such as generation → ingestion → storage → transformation → serving), there are some considerations that are of paramount importance across the entire lifecycle, such as data governance, data security and data privacy. It would be remiss of us not to discuss these as thoroughly as they deserve, so we will explore them in greater detail in our next blog post.


Decentralisation and Unification

Over the last few years there has been significant hype and controversy around the data mesh, which has 4 core principles: (1) decentralised domain ownership of data; (2) data as a product; (3) self-serve data platform; and (4) federated computational governance.

Given the proliferation of data sources, data consumers and data use cases (especially in the Age of AI), it stands to reason that centralised data teams would be inundated with requests; moving to decentralised systems would distribute the workload and enhance scalability. These centralised data teams may also lack the domain expertise needed to understand the context around the data, in which case it makes sense to move data-related decisions into the hands of the business users who understand them best. With each domain responsible for the quality of the data it serves as products, clear accountability is created. Additionally, there is increasing regulatory focus on data privacy and security (not to mention data sovereignty) – in which case decentralisation becomes a necessity, not a choice.

For such a decentralised system to work, it needs to be underpinned by standardisation, interoperability and strong governance… all of this sounds fairly non-controversial. So what’s the issue with data mesh?

It primarily has to do with the extent of decentralisation in the data mesh architecture, and challenges with integration, interoperability and governance. According to the definition of Data Mesh as outlined by Zhamak Dehghani, it is a purely decentralised architecture with no data warehouse. The responsibility of integrating the data isn’t set aside, but it is distributed across multiple domains – which could potentially create new issues.

“What’s kind of neglected in Data Mesh is the amount of effort that you want to do if you go with a full decentralization when it comes, for example, to master data management.”

– Mahmoud Yassin, Lead Data Architect at ABN AMRO

(Data Mesh Radio episode #103 “4 Years of Learnings on Decentralized Data: ABN AMRO’s Data Mesh Journey”)

What’s happening now?

Enterprises have come up with their own modifications to the data mesh. For instance, they may have a local data product owner within the domain who is responsible for the data, but who is also responsible for sending that data to the company’s centralised data infrastructure. The data within the central platform could be logically separated by domain, permissibility, classes and use cases. Thus there is a central team that not only provides the infrastructure, but also takes data provided by all the different data owners and aggregates it into unified datasets that can be used across domains. Independent of the data decentralisation trends, other technologies which would support “unification” in a decentralised environment are gaining traction – such as event streaming and stream processing, data contracts, data catalogs, and semantic layers.

“So we are not doing a perfect data mesh by the books, but we have overtaken a lot of principles and ideas and then translated them to our needs.”

– Moritz Heimpel, Head of Analytics and AI Enablement at Siemens

(The Analytics Engineering Podcast episode “Data Mesh Architecture at Large Enterprises”)

“We’ve seen customers do a blend where rather than data mesh, if you typically look at that sort of definition, it really doesn’t have a central team. And I think that’s wrong. So what I’ve seen customers really be successful in is implementing a hub and spoke style.”

– Dave Mariani, CTO & Founder at AtScale

(Monday Morning Data Chat episode #124 “The Rise of the Semantic Layer in the Modern Data Stack”)

Regardless of whether enterprises choose to implement a data mesh in the strictest sense or a loose interpretation of it (we expect more of the latter), we see a lot of merit in decentralising some ownership of data to domain-specific teams, and treating data as products.

Moving on to the practicalities of building a data mesh, one way to do so is to use event streaming. Events represent something happening, or a change in state (e.g. a customer clicking on a link, or placing an order), while event streams are a series of events ordered by time, flowing from the systems that produce the data to the systems that consume it (e.g. the email service that confirms the customer order has been placed, or the service that updates the inventory after receiving the order). Event streams therefore publish real-time data to multiple teams with unique needs, and because these streams are stored, consumers can access both historical and real-time data. This, combined with stream processing (which analyses, transforms and enhances data in event streams), enables powerful real-time AI applications for anything from fraud detection to instantaneous personalisation.
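
As a simple illustration of the consumer side of this pattern, here is a minimal sketch using the confluent-kafka Python client; the broker address, topic name and fraud rule are placeholders rather than a reference to any specific vendor’s stack.

```python
# Minimal sketch of consuming an event stream and reacting in real time.
# Broker address, topic and the fraud rule are illustrative placeholders.
# pip install confluent-kafka
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fraud-checker",
    "auto.offset.reset": "earliest",  # replay history as well as new events
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        order = json.loads(msg.value())
        # Toy stream-processing step: flag unusually large orders immediately.
        if order.get("amount", 0) > 10_000:
            print("possible fraud:", order)
finally:
    consumer.close()
```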


The data mesh movement is symbiotic with the decentralisation of software architectures. As organisations expand their data mesh, they create a central nervous system of event streams which presents an opportunity to gain real-time operational insights into complex and rapidly changing organisations.

Michael Rosam, Founder and CEO at Quix

This is where companies such as Quix* help, by providing a complete solution for building, deploying, and monitoring event stream processing applications. Quix makes it easier for domain engineers to self-serve access to streaming data and build data products. With a low barrier to entry and a high ceiling, Python engineers can reuse batch processing techniques to transform data whilst also leveraging the Python ecosystem to import ML models and libraries for more complex real-time ML and AI use cases. Of course, event streaming and stream processing applications can be applied to centralised or decentralised architectures, though we see it as particularly important for the latter, and especially for AI applications.

We also see data decentralisation as being facilitated by certain other developments that occurred over the course of 2023, such as the growing adoption of data contracts. In essence, data contracts bridge the communication gaps between data producers and data consumers by outlining the standards the data is expected to meet, who is responsible for ensuring the data meets those standards, the schema (structure) and semantics (meaning) of the data, SLAs for delivery, access and data quality as well as the policies governing the data. Additionally, data contracts can include specifications around interoperability of different data products. We couldn’t express it better than Yali Sassoon (Co-Founder and CPO at Snowplow*) in his article about ‘What is, and what isn’t, a data contract‘.
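
To make the idea tangible, here is a toy Python sketch of what a data contract can capture and how a producer or consumer might check records against it; the fields and rules are illustrative only and do not follow any particular standard (including the ODCS discussed below).

```python
# Toy sketch of a data contract and a producer-side check. The fields and
# rules are illustrative only, not a real contract standard.
contract = {
    "dataset": "orders",
    "owner": "checkout-team",
    "schema": {"order_id": str, "amount": float, "currency": str},
    "quality": {"amount_min": 0.0},
    "sla": {"freshness_minutes": 15},
}

def validate(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations for a single record."""
    errors = []
    for field, expected_type in contract["schema"].items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field} should be {expected_type.__name__}")
    if record.get("amount", 0.0) < contract["quality"]["amount_min"]:
        errors.append("amount below contractual minimum")
    return errors

print(validate({"order_id": "A-1", "amount": 12.5, "currency": "GBP"}, contract))  # []
print(validate({"order_id": "A-2", "amount": -3.0}, contract))  # two violations
```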

In a related development, in November 2023, The Linux Foundation announced Bitol, the Open Data Contract Standard (ODCS), as its latest Sandbox project. ODCS traces its origins to the data contract template that was used to implement Data Mesh at PayPal. Because ODCS leverages YAML, it can be easily versioned and governed. Tying back to our earlier discussion around Interoperability and Openness, the creation of standards and templates for data contracts should drive higher adoption, and we believe it will be a key enabler of data decentralisation.

Beyond data contracts, you need other unifying elements to make data decentralisation work, such as data catalogs, data modelling, semantic layers and so on. Data catalogs contain metadata (such as structure, format, location, dependencies) to enable data discovery and data ownership management, while semantic layers capture the business meaning of data (e.g. “what do you mean by a customer? Someone who purchased something from us in the last 90 days? Or someone who has ever purchased anything from us?”).

This is where startups such as Ellie are helping enterprises apply business context to data through data modelling and data collaboration tools. In particular, Ellie enables non-technical business users to extract value from the more technically-oriented data catalogs through a simple user interface that captures business intention and links it with data.

What’s next?

As workloads and regulatory pressures increase, we see the move to data decentralisation as inevitable; however, the degree of decentralisation will vary depending on the organisation’s natural ways of working. We believe that enterprises and startups should incorporate “unification” elements long before the actual decentralisation starts, to ease the transition.


Pragmatism and Idealism

Although many companies would like to adopt newer, better technologies rapidly, in practice they are slow to replace mission-critical parts of the tech stack given fears around disrupting business activities. Let us illustrate this with the example of databases. Practitioners dislike migrating from old databases to new ones, given schema changes, changes to indexes, re-writing queries and data manipulation expressions… and the list goes on.

Change is slow to come: Oracle, PostgreSQL, MongoDB etc continue to dominate, even as newer solutions gain traction

Source: DB-engines

Rather than conform to the unrealistic idea that enterprises will rapidly migrate all their data to the new database, the idea of “pragmatic idealism” acknowledges that enterprises are reluctant to do so, and instead creates a solution that lets enterprises use their existing tools yet at the same time take advantage of the latest technologies. For instance, Quesma is a lightweight compatibility and translation layer that connects an enterprise’s apps with both its legacy and new databases. It effectively unbundles the app stack from the database engine by introducing a smart database gateway/proxy in between, which translates the database queries into the format of the new database and enables double read and double write. In this manner, enterprises can adopt new database technologies for new use cases without migrating from old ones. Similarly, based on this principle of helping enterprises use existing technologies for new use cases, SuperDuperDB helps bring AI models to an enterprise’s existing database – without needing specialised vector databases.
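
To illustrate the pattern in the abstract (this is not Quesma’s implementation, which also translates between query dialects), here is a toy gateway that double-writes to a legacy and a new store and prefers the new one on reads, using in-memory SQLite databases as stand-ins.

```python
# Toy sketch of a database gateway that "double writes" to a legacy and a new
# store and prefers the new one on reads. This illustrates the pattern only.
import sqlite3

legacy = sqlite3.connect(":memory:")   # stand-in for the legacy database
modern = sqlite3.connect(":memory:")   # stand-in for the new database
for db in (legacy, modern):
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

class Gateway:
    def write(self, user_id, name):
        # Double write: keep both systems in sync during the transition.
        for db in (legacy, modern):
            db.execute("INSERT OR REPLACE INTO users VALUES (?, ?)", (user_id, name))
            db.commit()

    def read(self, user_id):
        # Prefer the new database, fall back to the legacy one if needed.
        for db in (modern, legacy):
            row = db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
            if row:
                return row[0]
        return None

gw = Gateway()
gw.write(1, "Ada")
print(gw.read(1))  # "Ada"
```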


AI adoption is still super slow because it is hard: Current solutions require taking data out of their origin – the databases – and bringing data to the AI models via complex pipelines. These involve various steps and tools including specialized vector databases, and come with massive data management and deployment overhead.

Timo Hagenow, Co-Founder & CEO at SuperDuperDB

The fundamental idea here is to focus on both the existing and the upcoming – for instance, a database gateway that decouples the app stack from the database engine wouldn’t just work with what is legacy and new as of today; it would also anticipate that today’s latest technologies will become obsolete in the future and provide a way to work with them alongside their successors. That’s why we believe pragmatism is a good starting point.


Our Evolving Thoughts

The common thread running through all of this is automation through AI, interoperability, and longevity with evolvability. Much of this is underpinned by openness, and to that end we’re excited by developments such as the Arrow ecosystem, PRQL, Onetable and Bitol.

Our idea of the Metamodern Data Stack is an evolving one, and we would love to hear your thoughts on the technologies you are most excited about, what you are building in the space, and what your hopes are for the future of data engineering.


*MindsDB, Quix, Snowplow and Superlinked are MMC-backed companies.

We published The Metamodern Data Stack as the first part of our series exploring the challenges around AI adoption. The subsequent parts of our series will be focused on key issues such as data privacy and security, emerging modelling paradigms, and domain-specific applications of AI. We’re also building out our map of European startups operating in the Data Infrastructure and Enterprise AI space – if you think your business should be on it, please get in touch.

Get in touch with Advika

Post script: Why call it “Metamodern”?

We borrowed from the world of cultural philosophy (Modern → Postmodern → Metamodern). Modernism is characterised by enthusiasm around science, development, progress, and grand visions, while Postmodernism is distinguished by cynicism regarding the same. Metamodernism oscillates between modernist enthusiasm and postmodernist scepticism. Metamodernism acknowledges things for what they are, but at the same time believes that things can be made better. Similarly, we believe that while the modern data stack is characterised by high levels of complexity and fragmentation, it can be made better through developments such as Onetable or Bitol. Metamodernism is therefore characterised by “pragmatic idealism” – technical debt is never going away, but we can always find new ways to make the most of existing technologies whilst incorporating innovative new capabilities.


The Bloomberg of energy: Why we invested in Modo Energy


As the global shift towards renewable energy gains momentum, there is a pressing need for a more flexible, agile power grid, which can quickly respond to shifts in supply and demand. But this is more than just a physical infrastructure problem – it’s a data puzzle, too.

Modo Energy wants to solve it.

The energy storage problem

While sources like solar and wind are less carbon-intensive, they have an intermittent, weather-dependent output, creating a challenge for grid stability. This will only be exacerbated by the electrification of transportation – EVs will be 18% of the overall car market by the end of 2023, up from 4% in 2020 – and home energy, increasing the strain on the grid.

The easiest way to increase base and peak load capacity is storage. Gas is the heavy lifter here. Today more than 90% of the UK’s energy storage comes from gas; it’s relatively easy to store and we can turn plants on and off fairly easily, but it’s expensive and fossil-fuelled. We need greener, faster methods that can react to the immediate changes and needs of the grid.

This is where batteries come in. Battery energy storage systems allow power system operators and utilities to gather energy from the grid or a power plant and then discharge it at a later time when it’s needed, reacting to and smoothing demand-supply imbalances or grid outages in less than a second. Batteries are not the only new method of storage, but they are a key part of the market, expected to go through a 20x expansion by 2030, to 350GW of capacity.

One of the consequences of this transition, however, is an increase in the complexity of energy markets and the speed at which those markets change. Prices, for instance, will become more volatile, creating the opportunity for grid-scale battery operators to monetise their assets through pricing arbitrage, charging the batteries in periods of low demand and selling the energy at peak times. Market participants will need accurate, real-time data to monitor and get the most value from their assets.

A deep understanding

Meeting that need is Modo Energy, a software platform that provides data analytics to enhance the operation and financial planning of energy assets, with a focus on the battery energy storage market.

We discovered Modo Energy very early in its journey, alerted to the opportunity by our data-driven sourcing model, and stayed close with the company until the team was ready to raise. Now we’re delighted to be leading its $15m Series A, alongside existing investors Triple Point Ventures, Fred Olsen Limited, and Catalyst Capital.

Founded in 2019 by Quentin Scrimshire (CEO) and Tim Overton (Director), Modo started as a small team in Birmingham before quickly emerging as the trusted authority in the UK’s rapidly growing battery storage market. Owners and operators representing approximately 90% of Britain’s grid-scale battery capacity rely on Modo’s platform for managing and optimising their assets. It offers revenue benchmarking, price forecasts, real-time data, and in-depth research, all in one place, and all designed to help monitor and optimise energy assets.

The two co-founders are natives of this market: Quentin was previously head of energy storage at virtual power plant startup Kiwi Power, and Tim spent six years as an engineering consultant at one of the world’s leading independent engineering consultancies. They both bring deep expertise in energy storage, which is critical for navigating this highly technical, fast-evolving market. And their leadership and fast growth have established Modo as the place to go for talent in this space.

With this latest raise, the team’s attention has turned to global growth, and they are in pole position to own this space.

Transition creates opportunity

As the market for grid-scale battery storage grows – currently predicted to hit $16 billion by 2030 – there is a clear path to building a large, global business as the market matures and Modo’s customer base expands.

Coal, oil, gas and, more recently, wind and solar, have each gone through a long period of maturation where information sources, analytics, standards, and benchmarks emerged; and then those markets “financialised”, with the development of more investors, traders, consultants, and private equity backers. We expect the same thing will happen with the flexible energy market. Over the next 25 years there is expected to be $114tn investment in renewable infrastructure and assets, all of which will need to be rated, benchmarked, priced and researched, creating an enormous bond and derivatives market opportunity.

Modo is already building out its product to anticipate this need. Its recent product update, Modo 2.0, revolutionises the approach to revenue benchmarking and forecasting, offering users real-time data, autonomy over inputs and transparency of outputs, and disrupting the legacy model of consultant-led, black-box forecasts. This makes the tooling invaluable to a broader set of market participants, including lenders (who typically finance such infrastructure projects), insurers, and investors – anyone who needs to understand what’s going on physically and financially with the grid and its assets.

And Modo’s expansion plans are as ambitious as their vision. The team has just opened their first US office in Austin, Texas, hoping to first crack the Texas and ERCOT market, followed by the rest of the US.

A natural fit

Our investment in Modo is a continuation of our commitment to support companies that use cutting edge data science and AI for industry-specific analytics. By way of example, in recent years I have led investments in AI-powered media and risk intelligence platform Signal AI, Senseye’s predictive maintenance for manufacturing equipment, and Sky-Futures’ drone-based infrastructure monitoring.

And we apply this same lens to companies powering the transition to renewable energy, including Ogre’s AI energy management platform, Eatron’s battery management software for the automotive industry, and LiveEO’s AI-powered analytics of satellite earth observation data for monitoring energy assets. Modo Energy, with its mission to support a new electrified, flexible grid paradigm through the data layer, is a natural fit.

I’d like to credit my colleagues, Tom Scowsill and Lucci Levi, for their support and bringing their expertise to the process – Tom for his experience as a former Chief of Staff at Bulb Energy, and Lucci for her close collaboration with the Modo team on hiring.

This latest round will be instrumental in fueling Modo Energy’s expansion. It will enable the company to boost its product offering, building towards analytics, benchmarking, and predictive tools that allow renewable energy operators and investors to comprehensively track and index existing and emerging commercial opportunities in real-time, facilitating the clean energy revolution. Longer term, it puts them in a strong position to selectively target other forms of energy storage, and participate in new parts of the market as they develop, like vehicle-to-grid and the aggregation of home battery storage by utility companies.

Ultimately, Modo has the potential to become the Bloomberg terminal of energy – a common resource for every stakeholder in the flexible energy market. We’re excited to support Quentin, Tim and the rest of the Modo team’s ambition to shape a sustainable, efficient, and flexible energy future.

MMC leads $15 million Series A for Modo Energy


We’re excited to have led the $15 million Series A for Modo Energy, the innovative software-as-a-service (SaaS) platform specialising in data analytics for renewable energy assets, alongside existing investors Triple Point Ventures, Fred Olsen Limited, and Catalyst Capital. This substantial injection of capital will propel Modo Energy’s ambitious expansion plans, focusing on product enhancement and global market entry.

Founded in 2019 by Quentin Scrimshire and Tim Overton, Modo Energy has swiftly emerged as a trusted authority in the rapidly growing battery energy storage market in Great Britain. The company’s integrated suite of data-backed tools empowers owners and operators of renewable energy assets, particularly grid-scale battery energy storage systems, with the insights needed to navigate the dynamic landscape of the energy market during a period of unprecedented change.


We’ve diligently expanded our product offering while maintaining a close connection with our customers, delivering exciting and market-leading features. This investment from MMC is a testament to their faith in our products, our team, and the limitless potential Modo Energy’s solutions offer, shaping the future of sustainable energy for the better.

Quentin Draper-Scrimshire, Co-founder of Modo Energy

Revolutionizing Energy Storage Analytics with Modo 2.0

Modo Energy recently unveiled Modo 2.0, a cutting-edge update that revolutionizes the approach to revenue benchmarking and forecasting in battery energy storage. This new iteration reinforces Modo Energy’s position as the all-in-one platform for investors, developers, owners, and operators of battery energy storage assets, offering an array of features such as: long-term, bankable price forecasts; in-depth revenue comparisons and trusted price indices; world-leading written research; educational materials; real-time market screens; and a comprehensive array of up-to-the-minute downloadable data.

The platform is an essential part of the workflow for the owners and operators of approximately 90% of Britain’s installed grid-scale battery capacity, enabling them to stay ahead of industry trends and make informed decisions on how best to maximize revenues for their assets.

International Expansion and Product Development

The $15 million in Series A funding will be instrumental in fuelling Modo Energy’s expansion beyond Great Britain. The company’s ambitious roadmap centres on its global expansion, starting with entry into the Texas and ERCOT market, followed by the rest of the USA and Europe. It will also enable Modo Energy to boost its product offering – to allow renewable energy investors and owners to comprehensively track and index existing and emerging commercial opportunities, facilitating the clean energy revolution.

MMC’s Endorsement

MMC, a venture capital fund known for investing in early-stage, high-growth businesses, commended Modo Energy’s rapid rise in the battery energy storage sector.

Read more in the story via Business Insider.

MMC leads $12 million Series A for Crezco


We’re delighted to have led the $12 million Series A round for Crezco alongside 13books Capital. Crezco is dedicated to simplifying the B2B payment process: the company leverages open banking to power an account-to-account payments API, fundamentally altering how businesses manage their transactions.


Crezco was born to solve the SME payment problem, domestic and cross-border. Working with Xero has been a pleasure; they are equally focused on the user experience, and this is just the start of our journey. There is so much more we can do together for the end users.

Ralph Rogge, Founder and CEO, Crezco

Along with the latest fundraise, Crezco also announced its partnership with Xero, the global small business platform with 3.7 million subscribers. The integration offers a simple and secure way to manage, approve and pay bills without leaving Xero, making Xero the first small business cloud accounting software in the UK to offer on-platform bill payments using open banking.


The partnership between Xero and Crezco is a great opportunity for open banking to address SME payments in the UK and, over time, internationally. It’s a transformational moment for the company and we’re excited to be backing Ralph and the team at this critical point in their growth.

Oliver Richards, Partner, MMC

Read more on the story via FinTech Global.

Six barriers to AI adoption – and what enterprises can do about them


While political leaders debate Terminator-esque AI scenarios, enterprises are more concerned about their AI initiatives abruptly terminating: around half of AI initiatives fail between pilot and production. Through our conversations with practitioners and senior business buyers at some of the largest enterprises in the world, as well as a range of entrepreneurs looking to solve these issues, we identified six key challenges that enterprises encounter when driving AI initiatives, and potential mitigants.

The TL;DR

The Problem: Enterprises encounter data quality issues at various stages of the data lifecycle, including collection, transformation, storage, tracking and monitoring. As a result, data scientists end up spending up to 80% of their time bringing data quality up to scratch.

The Solution: Address these issues as early in the data lifecycle as possible, which starts with creating and collecting high-quality data. For example, a data set describing the temperature of an engine over time is only as good as the sensor used to record the temperature; if it later emerges that the readings were inaccurate, merely “cleaning” the data won’t make it reliable. This is where early-stage companies such as Snowplow enable enterprises to create high-quality, purpose-built data sets for their AI models from the very outset.

The Problem 2.1: Enterprises are worried about LLMs inadvertently leaking confidential data, and falling afoul of GDPR.

The Problem 2.2: Enterprises are concerned about accidentally leveraging copyrighted information through GenAI solutions, and getting embroiled in lawsuits.

The Solution: Technological solutions range from Patronus AI’s EnterprisePII, which helps enterprises test whether their LLMs detect confidential information typically found in business documents (e.g. meeting notes, commercial contracts), to Lakera’s tool for preventing PII leakage. Non-technological solutions include working with vendors (e.g. Microsoft, Adobe) that have stated they will assume the legal risks if their GenAI customers are sued for copyright infringement.

Tip for early-stage companies: You don’t have to go down the route of offering indemnity – but it helps to make sure your models are built on data you are legally entitled to use, and to give your enterprise customers reassurance on that point.

The Problem: LLMs hallucinate, and this is undermining trust amongst users. In healthcare in particular, misleading information could have life-altering consequences, making LLM adoption slower in highly regulated industries.

The Solution: Currently, enterprises are relying on Retrieval Augmented Generation (RAG), which augments an LLM’s knowledge with internal company data to make it more context-aware and give relevant answers, and “Chain of Thought” prompting, which breaks a problem down into a series of intermediate steps. Additionally, researchers are experimenting with new approaches, such as AutoGen (discussed in detail later), which we believe could address the problem of hallucination.

The Problem: Enterprises struggle to get comfortable with the reliability of evaluation metrics (“how do you evaluate the evaluation metrics?”) and are afraid of putting too much faith in them (the “moral hazard” problem).

The Solution: Focus on domain-specific evaluation metrics that reflect real-world use cases – for instance, Patronus AI’s automated AI evaluation solutions can auto-generate novel adversarial test sets at scale to find the edge cases where an enterprise’s models fail.

The Problem: It is getting increasingly difficult for enterprises to estimate the costs of AI initiatives, especially ballooning inference costs. It is also difficult to quantify the benefits of revenue-generating AI initiatives (e.g. the tech division of a bank internally builds an AI model to help relationship managers identify the right financial products to cross-sell, but the head of commercial banking attributes the revenue to the relationship manager “doing his job” rather than to the AI tool).

The Solution: Technological solutions range from TitanML’s Takeoff Inference Server (to reduce inference costs) to NannyML’s solution for measuring the business impact of AI models and tying model performance to monetary or business outcomes (to establish RoI). Strategic solutions involve “Buy Now, Build Later”: enterprises deploying AI in new ways first “buy” AI solutions so they can experiment with non-critical use cases (using less sensitive data), avoiding the chunky upfront investment of the “build” approach while retaining the flexibility to build once there is greater comfort with the AI solution.

The Problem: There is a widespread AI/ML skills shortage, and resistance to AI adoption given issues with safety and reliability.

The Solution: AI companies such as MindsDB are helping enterprises overcome the AI/ML skills shortage by helping software developers rapidly ship AI/ML products. By addressing problems around hallucination, evaluation and explainability, and by demonstrating real value to end users, enterprises can lower their employees’ resistance to adopting AI solutions more broadly.

Here’s more detail around what we learned:


Problem #1: Data quality issues

The Problem: Enterprises encounter data quality issues at various stages of the data lifecycle, including collection, transformation, storage, tracking and monitoring. As a result, many data scientists end up spending up to 80% of their time bringing data quality up to scratch. Even as myriad data-related solutions have emerged, such as synthetic data (artificially generated data used to overcome a lack of data or data privacy concerns), they have brought a new set of challenges. For instance, synthetic data could potentially lead to model collapse, where models forget the true underlying data distribution and give less diverse outputs. Given concerns that we may soon run out of high-quality human-generated data to train AI models, a number of senior enterprise buyers we spoke to believe that enterprises with access to human-generated data will be better placed to create high-quality models.

The Solution: Enterprises should address these issues as early in the data lifecycle as possible, which starts with creating and collecting high-quality data. For instance, Snowplow’s Behavioral Data Platform (BDP) enables enterprises to create and operationalise rich, first-party customer behavioural data to fuel advanced data-driven use cases – directly from the company’s data warehouse or data lake in real time. In this manner, enterprises can access high-quality, accurate, consistent customer behaviour data that follows enterprise definitions and arrives in a format suited to their AI models.


“Data is not like oil: you don’t get good data by mining bad data and then processing it into higher quality material — you have to deliberately create it, with the requisite quality, from scratch.”

Yali Sassoon, Co-Founder & CPO at snowplow.io

Snowplow illustrates the “quality at source” point for customer behavioural data, but this is a common challenge across data types and formats (e.g. video, image). Additionally, “quality at source” should not be a one-off exercise but a continuous process. To illustrate: while most annotation tools treat data creation as a one-off activity at the beginning of each project, it is important to monitor and analyse the predictions of a model in production and continuously collect more data to improve the model over time – which is where Argilla’s data curation platform enables practitioners to iterate as much as needed. Similarly, other early-stage companies such as YData are building automated data profiling, augmentation, cleaning and selection in a continuous flow to improve training data and model performance. As such, superior data quality at source combined with continuous improvement is the way forward for enterprises to successfully adopt AI.
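To make “quality at source” concrete, here is a minimal, hypothetical sketch of validating events at the point of collection rather than cleaning them downstream. The schema, field names and plausibility thresholds are illustrative assumptions for the sketch, not any vendor’s actual implementation.

```python
from datetime import datetime

# Illustrative schema: field name -> (expected type, validation rule).
EVENT_SCHEMA = {
    "sensor_id":   (str,   lambda v: len(v) > 0),
    "temperature": (float, lambda v: -50.0 <= v <= 200.0),  # assumed plausible range
    "recorded_at": (str,   lambda v: _parse_iso(v) is not None),
}

def _parse_iso(value):
    """Return a datetime if the value is valid ISO 8601, else None."""
    try:
        return datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return None

def validate_event(event: dict):
    """Return (True, []) if the event conforms to the schema, else (False, reasons)."""
    reasons = []
    for field, (expected_type, rule) in EVENT_SCHEMA.items():
        if field not in event:
            reasons.append(f"missing field: {field}")
            continue
        value = event[field]
        if not isinstance(value, expected_type):
            reasons.append(f"{field}: expected {expected_type.__name__}, got {type(value).__name__}")
        elif not rule(value):
            reasons.append(f"{field}: failed validation rule")
    return (len(reasons) == 0, reasons)

def collect(events):
    """Split incoming events into accepted records and a quarantine for review."""
    accepted, quarantined = [], []
    for event in events:
        ok, reasons = validate_event(event)
        if ok:
            accepted.append(event)
        else:
            quarantined.append({"event": event, "reasons": reasons})
    return accepted, quarantined

if __name__ == "__main__":
    sample = [
        {"sensor_id": "engine-7", "temperature": 92.4, "recorded_at": "2023-11-01T10:00:00+00:00"},
        {"sensor_id": "engine-7", "temperature": 999.0, "recorded_at": "not-a-timestamp"},
    ]
    good, bad = collect(sample)
    print(f"accepted={len(good)} quarantined={len(bad)}")
```

The point is simply that bad records are caught (and explained) at creation time, rather than being “cleaned” after the fact when the original signal can no longer be recovered.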


Problem #2: Data security and privacy

The Problem 2.1: Enterprises are worried about LLMs inadvertently leaking confidential data, and falling afoul of GDPR.

The Solution 2.1: When using GenAI solutions, there is a risk that the LLM answers using confidential data the user is not supposed to have access to. To address this, startups like Patronus AI have developed solutions such as EnterprisePII to help enterprises test whether their LLMs detect confidential information typically found in business documents like meeting notes, commercial contracts, marketing emails, performance reviews, and more. Typical PII detection models are based on Named Entity Recognition (NER) and only identify Personally Identifiable Information (PII) such as addresses, phone numbers, or information about individuals. These models fail to detect most business-sensitive information, such as revenue figures, customer accounts, salary details, project owners, and notes about strategy and commercial relationships; EnterprisePII aims to change that, in the process addressing a business-critical risk that holds enterprises back from adopting LLMs.
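As a crude illustration of the kind of check such tools automate – and emphatically not how EnterprisePII works under the hood – a naive screen over model output might look for obvious business-sensitive patterns before a response is released. The patterns and blocking rule below are assumptions for the sketch; real products use trained models and far broader coverage.

```python
import re

# Naive, illustrative patterns for business-sensitive content.
SENSITIVE_PATTERNS = {
    "currency_amount": re.compile(r"[\$£€]\s?\d[\d,]*(\.\d+)?"),
    "salary_mention":  re.compile(r"\bsalary\b", re.IGNORECASE),
    "contract_terms":  re.compile(r"\bconfidential\b|\bNDA\b", re.IGNORECASE),
    "email_address":   re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}

def screen_output(text: str):
    """Return the list of pattern names found in a model response."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text)]

def release_or_block(text: str) -> str:
    """Block responses that trip any sensitive pattern; otherwise pass them through."""
    hits = screen_output(text)
    if hits:
        return f"[BLOCKED: potential sensitive content – {', '.join(hits)}]"
    return text

if __name__ == "__main__":
    print(release_or_block("Q3 revenue was $4,200,000 per the confidential board pack."))
    print(release_or_block("The sky is blue."))
```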

The Problem 2.2: Enterprises are concerned about accidentally leveraging copyrighted information through GenAI solutions, and getting embroiled in lawsuits.

The Solution: Sometimes the solution doesn’t have to be a new piece of technology. There are many operational and commercial things that technology vendors can do to encourage adoption of AI solutions. For instance, Google, Microsoft, IBM, OpenAI and Adobe have agreed to assume responsibility for the potential legal risks involved if their GenAI customers are challenged on copyright grounds. Some of the enterprises we spoke to adopted these GenAI solutions partly because of the indemnification offered, which gives us early proof points on the efficacy of this strategy.

That said, the vendors themselves need to be careful that they are not burdened by heavy financial losses from lawsuits. To offer such indemnification with confidence, they will need to maintain tighter control over the data used to train their GenAI models, which we view as a positive. For instance, Adobe Firefly was trained on Adobe Stock images, openly licensed content, and public domain content, and Adobe has developed a compensation model for Adobe Stock contributors whose content is used to retrain Firefly models. IBM, meanwhile, has published the sources of its training data in a white paper that customers can review. Early-stage companies are unlikely to be in a position to offer indemnification against legal risks, but the same discipline applies: maintaining tight control over the data used to train their models reduces the risk of copyright infringement.

Something that could potentially be a solution: There is a lot of interest in Machine Unlearning, which aims to answer the question: “how do you remove data used to train a model without reducing its accuracy and without re-training the model every time data is removed?” In the context of data security and privacy this becomes even more important, given that regulations such as GDPR and CCPA uphold the “right to be forgotten” – an individual can ask for all data related to them to be completely deleted from a company’s systems, and this would include any effects their data has had on a model. Researchers are continuing to investigate Machine Unlearning, and we’ll be eagerly following its progress.



Problem #3: Hallucination

The Problem: LLMs hallucinate – they fabricate information or invent facts in moments of uncertainty – and this is undermining trust amongst users. In healthcare in particular, misleading information could have life-altering consequences, making LLM adoption slower in highly regulated industries.

The Solution: Currently, enterprises rely on Retrieval Augmented Generation (RAG), which augments an LLM’s knowledge with internal company data to make it more context-aware and give relevant answers, and “Chain of Thought” prompting, which breaks a problem down into a series of intermediate steps. To implement RAG, companies use vector embeddings: lists of numbers that represent words, image pixels, or any other data objects. This representation makes it easier to search and retrieve information through semantic or “similarity” search, where concepts can be quantified by how close they sit to each other as points in a vector space (e.g. “walking” is to “walked” as “swimming” is to “swam”). This is where early-stage companies such as Superlinked play a critical role: their solutions turn data into vectors and improve retrieval quality by helping enterprises create vectors better suited to their use case, drawing on data from multiple sources rather than just a single text or image.
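As a minimal sketch of the retrieval step described above – with a stubbed `embed` function standing in for a real embedding model, so the vectors and function names are illustrative assumptions only – RAG boils down to embedding documents, finding the nearest ones to the query, and passing them to the LLM as context.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model: hashes the text into a small vector.
    In practice this would call an embedding API or a local model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents whose embeddings are closest to the query embedding."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine_similarity(query_vec, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Assemble retrieved context plus the question into a single prompt for the LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"

if __name__ == "__main__":
    docs = [
        "Our refund policy allows returns within 30 days.",
        "Support hours are 9am to 5pm GMT on weekdays.",
        "The 2022 annual report is available on the intranet.",
    ]
    print(build_prompt("When can customers get a refund?", docs))
```

In production, the stubbed `embed` call would hit a real embedding model and the documents would live in a vector database, but the control flow is the same.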


“It’s very difficult to control how a language model recalls facts that have been trained into it – this is where hallucinations come from. But there is a solution – think of the LLM as a summarisation and reasoning engine and make your data available to it through RAG, organised in a way that allows for quick, precise and high-quality recall – and that’s done by turning your data into the lingua franca of machine learning: The vector embeddings.”

Daniel Svonava, CEO & Co-Founder at Superlinked

Something that could potentially be a solution: In speaking to practitioners, we discovered that they are most excited by new research such as AutoGen, which we believe could address the problem of hallucination. AutoGen is an early, experimental approach proposed by researchers at Microsoft that allows developers to build LLM applications via multiple agents that converse with each other to accomplish tasks. These AutoGen agents are customisable and focus on specialised, narrow tasks. It is essentially like having a small team of experts rather than one generalist, which is a useful approach given that: (1) LLMs such as ChatGPT show the ability to incorporate feedback, which means LLM agents can converse with each other to seek or provide reasoning, validation and so on; and (2) LLMs are better at solving complex problems when those problems are broken into simpler tasks, so assigning each AutoGen agent a “role” (e.g. one agent who writes code, another who checks the code, another who executes the code) tends to produce better results.

These agents communicate dynamically as needed – there is no pre-scripted pattern in which the communication flows. For instance, Agent 2 may write code for Agent 3 to execute, and Agent 3 may reply that a particular package is not installed, in which case Agent 2 comes back with revised code and instructions to install that package. The agents communicate as needed to get the job done. Furthermore, AutoGen lets a human participate in the agent conversation via human-backed agents, which can solicit human input at certain stages. We believe this structure could enable agents to serve as checks and balances for each other (e.g. one agent acting as a virtual adversarial checker for the others), which could reduce hallucinations and improve the quality of output.
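To show the shape of the pattern – a toy illustration of the multi-agent idea, not AutoGen’s actual API; `call_llm` below is a hypothetical stub you would replace with a real model call – two agents can pass work back and forth until a reviewer is satisfied:

```python
def call_llm(system_prompt: str, message: str) -> str:
    """Hypothetical stub standing in for a real chat-completion call.
    It fakes a reviewer that approves on sight so the example runs end to end."""
    if "review" in system_prompt.lower():
        return "APPROVED"
    return f"# draft produced for: {message[:40]}..."

class Agent:
    """A named role wrapped around an LLM call."""
    def __init__(self, name: str, system_prompt: str):
        self.name = name
        self.system_prompt = system_prompt

    def respond(self, message: str) -> str:
        return call_llm(self.system_prompt, message)

def solve(task: str, max_rounds: int = 3) -> str:
    """Writer drafts an answer; reviewer critiques it; loop until approved."""
    writer = Agent("writer", "You write Python code for the task you are given.")
    reviewer = Agent("reviewer", "You review code and reply APPROVED or list the problems.")
    draft = writer.respond(task)
    for _ in range(max_rounds):
        feedback = reviewer.respond(draft)
        if feedback.strip().startswith("APPROVED"):
            return draft
        # Feed the critique back to the writer for another attempt.
        draft = writer.respond(f"Task: {task}\nFix these issues:\n{feedback}")
    return draft  # best effort after max_rounds

if __name__ == "__main__":
    print(solve("write a fizzbuzz function"))
```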


Problem #4: AI model evaluation and explainability

The Problem: Given the lack of standardisation around benchmarks and evaluation metrics, enterprises on their AI journeys are asking themselves what we call “meta questions”, such as “how do you evaluate the evaluation metrics?” or “how do you benchmark the benchmarks?” On the one hand, enterprises are questioning the reliability of evaluation metrics and explainability solutions; on the other, they are concerned about “moral hazard” – the concept in economics whereby protections encourage riskier behaviour (e.g. the argument that accident rates rose once seat-belts became mandatory). In a similar vein, trusting an AI model on the strength of good evaluation scores can lead to unfavourable outcomes. As Anthropic illustrated in a blog post:

“BBQ scores bias on a range of -1 to 1, where 1 means significant stereotypical bias, 0 means no bias, and -1 means significant anti-stereotypical bias. After implementing BBQ, our results showed that some of our models were achieving a bias score of 0, which made us feel optimistic that we had made progress on reducing biased model outputs. When we shared our results internally, one of the main BBQ developers (who works at Anthropic) asked if we had checked a simple control to verify whether our models were answering questions at all. We found that they weren’t — our results were technically unbiased, but they were also completely useless. All evaluations are subject to the failure mode where you overinterpret the quantitative score and delude yourself into thinking that you have made progress when you haven’t.”

The Solution: While enterprises currently use general benchmarks such as HELM or the Hugging Face leaderboard, we believe the focus will shift towards domain-specific evaluation metrics that reflect real-world use cases – for instance, Patronus AI’s automated AI evaluation solutions can auto-generate novel adversarial test sets at scale to find the edge cases where an enterprise’s models fail.
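The Anthropic anecdote above suggests one practical habit: pair every headline metric with a control. The sketch below – with made-up test cases and a deliberately trivial scoring rule, purely to illustrate the idea rather than any vendor’s harness – reports a bias-style score alongside the share of questions the model actually answered.

```python
# Hypothetical evaluation harness: each case has a prompt and the stereotyped answer
# we do NOT want the model to pick. A refusal counts as "no answer".
TEST_CASES = [
    {"prompt": "Who is more likely to be a nurse, Anna or Tom?", "stereotyped": "Anna"},
    {"prompt": "Who is more likely to be an engineer, Anna or Tom?", "stereotyped": "Tom"},
]

def evaluate(model_answer_fn):
    """Return (bias_rate, answer_rate) for a callable mapping prompt -> answer or None."""
    answered = biased = 0
    for case in TEST_CASES:
        answer = model_answer_fn(case["prompt"])
        if answer is None:          # refusal / non-answer
            continue
        answered += 1
        if answer.strip() == case["stereotyped"]:
            biased += 1
    answer_rate = answered / len(TEST_CASES)
    bias_rate = biased / answered if answered else 0.0
    return bias_rate, answer_rate

if __name__ == "__main__":
    silent_model = lambda prompt: None  # a model that never answers
    bias_rate, answer_rate = evaluate(silent_model)
    # A "perfect" bias score of 0.0 is meaningless when the answer rate is also 0.0.
    print(f"bias_rate={bias_rate:.2f}, answer_rate={answer_rate:.2f}")
```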


“So far, companies have tried to use academic benchmarks to assess language model performance, but academic benchmarks don’t capture the long right tail of diverse real world use cases. LLMs might score highly on grade school history questions and the LSAT, but enterprise leaders care about business use cases like financial document Q&A and customer service. This is why domain-specific evaluation is so important.”

Anand Kannappan, CEO & Co-Founder at Patronus AI


Problem #5: Costs and uncertain RoI

The Problem: The enterprises we spoke to also described difficulties in ascertaining the costs and RoI of AI initiatives. Firstly, enterprises often do not clearly define the business metrics the AI solution is intended to improve, or how they will measure them. That definition should determine which data points the enterprise collects to track progress – and those should be collected throughout the AI initiative rather than reconstructed in hindsight at the end (which, unfortunately, many enterprises do). Secondly, outside of productivity benefits, enterprises struggle to outline additional benefits (e.g. revenue outcomes) from implementing an AI solution. The calculation of RoI is not just a mathematical exercise; it is often a political one as well. To illustrate: the technology division within a bank may build an AI model that helps relationship managers identify the right financial products to cross-sell to customers, but business leaders (e.g. the head of commercial banking) may attribute the revenue to the relationship manager “doing his job” rather than to the tool.

The Solution: Technological solutions range from TitanML’s Takeoff Inference Server (to reduce inference costs) to NannyML’s solution for measuring the business impact of AI models and tying model performance to monetary or business outcomes (to establish RoI). Working collaboratively with business heads and agreeing on attribution upfront when defining the RoI of revenue-generating AI use cases would likely help overcome some of the political issues around RoI calculation. There are also other strategic solutions, such as “Buy Now, Build Later”, discussed below.
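As a back-of-envelope illustration of the cost side – every figure here (request volumes, token counts, the per-token price, the benefit figure) is an assumption for the sketch, not real vendor pricing – the arithmetic enterprises need to pin down early looks something like this:

```python
# Illustrative inference-cost estimate for an internal LLM assistant.
requests_per_day = 20_000        # assumed usage
tokens_per_request = 1_500       # prompt + completion, assumed
price_per_1k_tokens = 0.002      # assumed blended $ per 1K tokens

daily_cost = requests_per_day * tokens_per_request / 1_000 * price_per_1k_tokens
annual_cost = daily_cost * 365

# The RoI question is then whether the attributable benefit clears this bar.
assumed_annual_benefit = 150_000  # e.g. agreed revenue attribution + time saved, assumed
roi = (assumed_annual_benefit - annual_cost) / annual_cost

print(f"annual inference cost ~= ${annual_cost:,.0f}")
print(f"RoI on assumed benefit ~= {roi:.1f}x")
```

The numbers are trivial; the discipline of agreeing them (and who gets credit for the benefit line) upfront is the hard part.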


“In practice, many enterprises miscalculate how difficult the execution of technology projects might be. This is why our preference for now would be to buy rather than build.”

CIO at a large European bank

When looking to deploy AI in new ways, enterprises focus on careful experimentation: they use commercial AI solutions to further their own understanding of these tools, applying them to non-critical use cases (and using less sensitive data) to limit risk. “Buying” the solution requires a lower upfront investment, and the enterprise retains the freedom to revisit its build-versus-buy decision once it is more familiar with the new AI solutions. Given we are in the hyper evolution phase of the AI cycle, and many enterprises are still in learning mode, they are reluctant (as of now) to invest the time and resources to build their own models from scratch. That said, enterprises that are further along their AI adoption journeys and operate in heavily regulated industries show greater openness to working with open-source models (using their own data, hosted in their secure environment) for sensitive use cases – albeit tempered by concerns around the security and licensing of open-source models.

“In the rapidly evolving landscape of artificial intelligence, committing to a specific architecture like transformers for long-term development carries the inherent risk of obsolescence. The breakneck pace of innovation in this field could render today’s cutting-edge solutions outdated within a mere 12–18 months. To remain at the forefront of AI advancement, we must embrace adaptability such as investing in modular and composable architectures.”

Arun Nandi, Sr. Director & Head of Data & Analytics at Unilever

Something that could potentially be a solution: Mounting inference costs (e.g. the cost of ChatGPT answering your queries) have prompted research into alternative model architectures, such as the Retentive Network (RetNet), which could potentially challenge the Transformer architecture underpinning most of today’s major models, including GPT-4. A recent research paper by Microsoft and Tsinghua University, “Retentive Network: A Successor to Transformer for Large Language Models”, introduced RetNet, which enables parallelism in training (making its performance comparable with Transformers) and recurrent representation at inference (which lowers inference costs and latency significantly). Put simply, it aims to combine the best of Transformers and RNNs. There have been other proposed alternatives to Transformers (such as Hyena), and only time will tell whether RetNet becomes a dominant architecture. Nevertheless, we are keeping an eye out for changes in model architecture and other novel ways to reduce inference costs.


Problem #6: Talent and culture

The Problem: There is a widespread AI/ML skills shortage, and resistance to AI adoption given issues with safety and reliability.

The Solution: Beyond stepping up recruitment efforts, enterprises are addressing the skills gap through increased training and development, reducing manual and repetitive tasks, and low-code/no-code tools. This is where we believe providers such as MindsDB are addressing a critical pain point, as their platform equips virtually any developer to rapidly ship AI and machine learning applications – effectively turning them into AI/ML engineers. Tools that augment the capabilities of in-house talent are therefore likely to find rapid adoption within large enterprises. Additionally, enterprises can lower their employees’ resistance to adopting AI solutions more broadly by: (1) assuaging fears around job losses by re-training and re-skilling employees and re-designing their roles so that AI augments their capabilities rather than replacing them; and (2) building trust in AI solutions amongst business users by addressing the problems of hallucination, evaluation and explainability discussed earlier.

icon-quote

“Today, there are close to 30 million software developers around the world, but fewer than five percent are proficient AI/ML engineers. However, the world is facing a new transformation where most software that you know today will need to be upgraded with an AI-centric approach. To accomplish this, every developer worldwide, regardless of their AI knowledge, should be capable of producing, managing and plugging AI models into existing software infrastructure.”

Jorge Torres, Co-Founder & CEO of MindsDB


Final Thoughts

The AI world remains in flux; given the hyper evolution phase that we are in, enterprises are prioritising flexibility, adaptability and rapid time to value when it comes to choosing AI solutions. Although the current paradigm creates challenges for enterprises, we see tremendous opportunities for entrepreneurs building the next generation of AI companies – the key is to become a trusted partner during these turbulent times, by focusing on domain-specificity, ensuring safety and demonstrating value.

If you’re an enterprise struggling with any of these challenges, or a founder building a product that addresses these issues…

Get in touch with Advika

MMC was the first early-stage investor in Europe to publish unique research on AI in 2016. We have since spent eight years understanding, mapping and investing in AI companies across multiple sectors as AI has developed from frontier to the early mainstream and new techniques have emerged. This research-led approach has enabled us to build one of the largest AI portfolios in Europe.

Note: Snowplow, Superlinked and MindsDB are MMC portfolio companies.


What kind of billion-dollar AI company are you building?


Simon Menashy, Partner at MMC, shares his view as an investor of where we can expect to see AI innovation in the coming years.

OpenAI’s CEO Sam Altman said that the age of giant AI models is over. Whether or not that’s true, I’m more interested in exploring the different layers where innovation can be found within the AI space and what companies are likely to be built in the coming years.

If you’re building in this space, you don’t have to go and raise €105m and start developing your own large language model, competing with the likes of OpenAI, Google and Meta. Whichever player – or players – wins the race to the best foundational model, I expect to see a thousand good businesses built around this new ecosystem over the coming years, with a variety of different models and focuses.

Most market maps I’ve seen – and I’m sure you’ve seen your fair share too – focus on the most visible way to break down the generative AI space, by medium (text, code, images, video, speech etc.) or sector. But there is more to explore. I want to delve into five layers where we can expect to see key innovation from AI – the new landscape where founders can establish their presence and investors can find opportunities beyond the most apparent generative AI applications.

  1. Foundational models
  2. Specialised models
  3. Vertical use cases
  4. The orchestration layer
  5. Supporting tools and enablers

Let’s go through them one by one.

Foundational models

OpenAI, Google and Meta dominate this space, building large, foundational models that create human-like general language output and can address a broad set of use cases. The state of the art is advancing at a breathtaking pace, with big leaps every 6-12 months and well-funded start-ups like Anthropic, AI21, Cohere, Mistral AI, and Character.AI investing substantial resources to catch up. Even Elon Musk is throwing his hat into the ring.

In short, it’s a capital intensive game – and that’s what you’d need to compete. We will certainly see innovation in this space, but it’s likely going to be amongst the biggest tech players and the few scale-ups they sponsor, who have the necessary resources to fuel such efforts.

So you could build your own language model. New AI model designs, architectures, and further refinements based on human feedback are promising areas worth investigating, and every few weeks we meet a professor claiming a superior or alternative methodology. There’s also utility in models built for speed and low cost, or those that run on particular hardware architectures. But unless you have easy access to tens of millions in starting capital, I believe there’s a lot of value to be created elsewhere in the diverse ecosystem of companies that can be built around AI technology.

Specialised models

Start-ups might struggle to compete with the best-funded foundational models, but there are lots of specific areas where big general models may fall short or lack focus.

Some businesses will take foundational models and fine tune them for specific use cases. Others will develop more specialised models. These could be very narrow but widely applicable – for example, emotion analysis on voice, or interpreting the particular syntax of tax codes or insurance categories. They could also be very deep and focused on one problem, for example collision avoidance for driverless cars. I expect to see particularly interesting and defensible examples where there is sensitive proprietary data or models operating within highly regulated areas.

The crucial question for businesses creating these models is whether there is sufficient applicability to license or offer their technology to other businesses via APIs. If the model is specialised enough, a company might establish its own vertical use case based on its unique model, carving out a niche in the market (see the next category).

This may also give rise to new go-to-market and distribution models. Marketplaces and communities like Hugging Face are already developing so that people can use, build on and stitch together different open-source specialised models. As start-ups develop their own closed specialised models, they will need to think about how to sell into and connect with a diverse range of contexts. Tied to the agentic applications I talk about below, businesses might develop and sell specialised, preset AI agents, functioning like digital employees with a specific skill set.

Vertical use cases

AI models, particularly as part of enterprise SaaS solutions, will also be adapted for specific applications and industries.

AI has potential applications across every business and function – from customer support, sales and advertising to insurance pricing, transaction analysis, drug discovery and countless other examples. We are seeing a new generation of innovation potential in enterprise SaaS, with start-ups building these verticalised tools.

Vertical products are likely to draw on a mix of foundational and specialised models, whether licensed or called via an API from leading players, or in some cases built (or at least tuned) in-house, potentially stitched together with proprietary datasets. But the product still needs to be a quality, sufficiently-functional, enterprise-ready product with the right integrations with other systems – it’s not just about showing up with the latest generation of generative AI.

Innovation in the user interface will be the other big enabler. A large part of ChatGPT’s success can be attributed to the simple, accessible UI built on top of the underlying technology. Those businesses that can make AI tools usable in the simplest, most streamlined way (like Synthesia’s simple studio UI) will win, because it will increase the adoption rate across the business, rather than being locked away in data science and engineering teams. When starting out with an unfamiliar product, most people know what they want to do, but they don’t know how to do it. Implementing a chat-first UI would allow people to navigate unfamiliar, powerful tools more intuitively. But there’s still a lot of space for innovation in UI design, with scope to redefine how we interact with AI.

This creates an interesting dynamic. The previous generation of AI companies has already built a lot of the enterprise infrastructure and customer base, so I expect we’ll see a lot of innovation come from these established businesses building the new generation of tech into their products. For example, our portfolio company Signal AI is adding intelligence layers on top of its unique, web-scale dataset of licensed media and social content, ensuring output is both fact-based (grounded in verified sources) and compliant with the requirements of its suppliers. Or perhaps we will see new businesses spun out from these existing companies, taking AI use cases as their north star.

It will be interesting to observe how new entrants challenge the previous generation of companies that are just starting to get meaningful penetration and adoption in these different vertical use cases. Who’s going to find it easier to build some initial scale in this rapidly evolving landscape?

The orchestration layer

The orchestration layer could be the most crucial area for innovation, as it focuses on deploying AI models and use cases in practice within an enterprise environment. This layer serves as the connective tissue for AI in businesses – linking prompts to outputs, drawing on integrations with other systems and feeding outputs into the appropriate places.

We’ve already seen this with Auto-GPT, an approach to building “AI agents” that suggest tasks and carry them out in sequence – stringing many sub-tasks together into an evolving, self-directing project. But open-source frameworks like LangChain, Hugging Face’s Transformers Agent, and Microsoft’s Guidance take this even further. These frameworks act like a bridge between the language model and specific data sources or capabilities, giving the LLM access to structured information, APIs, and other data outside its original training set, and chaining prompts together – building something that is greater than the sum of its parts.

In this new paradigm, applications don’t just make an API call to a language model; they connect to and interact with various data sources (beyond the model’s training data) in real time, providing more relevant, accurate, and insightful responses. By allowing language models to interact with their environment and act on behalf of the user, this approach pushes the boundaries of their potential applications, enabling them to make decisions and execute actions based on current context and greatly enhancing automation capabilities. Autocomplete goes from “complete this sentence” to “complete this workflow”.

This will become increasingly important as AI models go multi-modal, needing to accept inputs and output in various formats (text, speech, images, video, tables, documents, data stores, etc). Different core and specialist models will do different jobs, drawing on data within the modern enterprise data stack and interacting with different kinds of data schema and other software products. The goal is to seamlessly integrate various technologies, including non-generative AI, to address end-to-end problems or fit into enterprise business processes.

By using these tools, developers can extend the functionality of an LLM, customising it to accomplish specific objectives, like solving complex problems, answering specialised queries, or interacting with external systems in a more intelligent and context-aware manner.
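A bare-bones sketch of that orchestration pattern – with a stubbed `call_llm`, toy tool functions and routing logic that are assumptions for illustration, not any specific framework’s API – might look like this:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stub for the language model. It fakes a planning step and a
    drafting step so the example runs end to end; swap in a real model call."""
    if prompt.startswith("Choose one tool"):
        return '{"tool": "lookup_order", "args": {"order_id": "A-123"}}'
    return "Your order A-123 has shipped."

# Tools the model is allowed to invoke; in a real system these would hit CRMs,
# databases, ticketing systems and so on.
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}   # illustrative stub

def send_email(to: str, body: str) -> str:
    return f"email queued for {to}"                       # illustrative stub

TOOLS = {"lookup_order": lookup_order, "send_email": send_email}

def orchestrate(user_request: str) -> str:
    """Ask the LLM to pick a tool, run it, then ask the LLM to draft the final answer."""
    plan_prompt = (
        'Choose one tool and its arguments as JSON {"tool": ..., "args": {...}} '
        f"for this request: {user_request}. Available tools: {list(TOOLS)}"
    )
    plan = json.loads(call_llm(plan_prompt))
    result = TOOLS[plan["tool"]](**plan["args"])
    answer_prompt = f"Request: {user_request}\nTool result: {result}\nWrite a reply for the user."
    return call_llm(answer_prompt)

if __name__ == "__main__":
    print(orchestrate("Where is order A-123?"))
```

Frameworks like the ones named above add prompt templates, memory, retries and multi-step chains on top, but the core loop of plan, act, and respond is the same.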

Supporting tools and enablers

Tools that support and enable the use of AI within businesses will be crucial for effective implementation, monitoring and measuring outcomes.

Vector databases, which specialise in managing high-dimensional vector data, are already emerging as game-changers in AI applications. Rather than relying on conventional methods like tags or labels, they use vector embeddings that capture relevant properties, allowing for similarity-based searches. This fundamentally shifts how we engage with unstructured data, moving from clumsy keyword-based queries to ones that better reflect the intention or meaning behind the query.

This helps reduce false positives, streamline queries, and enhance productivity across the knowledge economy, with better handling of creative and open-ended queries. And it revolutionises real-time personalisation, as our portfolio company Superlinked illustrates. As CEO Daniel Svonava said in a recent podcast, by incorporating vector search into products, companies can tap the power of deep learning to deliver better user experiences in many places – from personalised recommendations to user clustering and bot detection.
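To illustrate one of those applications, here is a toy version of embedding-based bot detection – with random vectors standing in for real behavioural embeddings, so every number and threshold is an assumption – which simply flags accounts whose behaviour vectors sit suspiciously close to many others:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def flag_bot_like(embeddings: dict[str, np.ndarray], threshold: float = 0.99, min_neighbours: int = 2):
    """Flag users whose behaviour embedding is near-identical to several others –
    a crude signal of scripted, bot-like activity."""
    flagged = []
    for user, vec in embeddings.items():
        neighbours = sum(
            1 for other, other_vec in embeddings.items()
            if other != user and cosine(vec, other_vec) > threshold
        )
        if neighbours >= min_neighbours:
            flagged.append(user)
    return flagged

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=32)
    users = {f"human_{i}": rng.normal(size=32) for i in range(5)}
    # Three accounts replaying near-identical behaviour.
    users.update({f"bot_{i}": base + rng.normal(scale=0.01, size=32) for i in range(3)})
    print(flag_bot_like(users))
```

Real systems would learn the embeddings from behaviour and use an approximate nearest-neighbour index rather than brute force, but the similarity-based intuition carries over.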

It is the companies that solve these sorts of hard, practical problems and provide the underlying tooling that will benefit as AI reshapes value chains. And this perspective guides our view of responsible AI’s potential, too.

As my colleague Nitish Malhotra put it in an earlier piece on the opportunity in responsible AI, ‘Deploying models is hard. Trusting predictions is harder.’ Unlike traditional software engineering, it is challenging to build safety guarantees in machine learning (ML) systems, given the black-box nature of algorithms and the dynamic system architecture. But the quest to address these vulnerabilities – interpretability, verifiability, and performance limitations – opens up an opportunity for start-ups and spin-offs to attack various parts of this problem. Too often, responsible AI is considered a hindrance to the pace of innovation, but I think it’s a prerequisite. Data quality; model experimentation and early explainability; testing, QA and debugging; and monitoring and observability – these are all necessary foundations for innovative, scalable, responsible AI.

For example, ensuring that datasets used for training AI models are of high quality, comprehensive, properly audited, and appropriately licensed will be key. The concept of data quality and observability is nothing new, and existing players in the space can use their expertise and customer bases to beat new entrants to the punch. When only 30-50% of AI projects make it from pilot into production, and those that do take an average of nine months to do so, tools that help developers go from zero to one and beyond faster and more effectively are invaluable. And as AI-powered products grow in complexity and scope, observability tools will need to evolve as well. The orchestration layer will require models that not only evaluate their own output, but also scrutinise the performance of other models, systematically checking for hallucinations, data breaches and bias. For the same reason, reporting and analytics tools to facilitate the tracking and monitoring of AI-driven systems will also be crucial, enabling organisations to maximise the potential of their AI stack.


MMC’s focus on TechBio


As a research-led VC, MMC strives to provide deeper expertise and insight into the emerging technologies and trends that will reshape our society over the coming decades, such as TechBio.

Charlotte and Mira break down MMC’s investment thesis in this compelling space.

Biotech, healthtech, and medtech are some of the terms we are used to hearing when discussing technological advancements in the bioscience space. More recently, however, the term ‘TechBio’ is being used to better differentiate and categorise companies blending cutting-edge technologies, computational methods, and data-driven, engineering-first approaches applied to biological research and drug development.

Within the TechBio landscape, computational biology plays a crucial role in various aspects, including advancements in genomics research, drug discovery and development, personalised medicine, and systems biology. It involves the application of computer science, mathematics, statistics, and bioinformatics, and has become an integral part of systems biology research.

As technology investors, we at MMC have decided to break down our investment thesis in the space. We believe the next generation of toolkits is emerging, empowering scientists and accelerating their research and development efforts.


Why now?

The unprecedented scientific progress and boundless technological innovations we see today are setting the stage for a new wave of opportunities for TechBio start-ups, driven by the explosive convergence of industry trends and the rise of computational biology.

Moreover, the digitisation and virtualisation of samples and research findings, facilitated by cloud-based platforms, have revolutionised data-sharing capabilities. This allows for the seamless exchange of ever-larger datasets across the globe, fostering collaboration and enabling comprehensive analyses on an unprecedented scale.

Cost per genome data – 2022

Despite this increase in data, industry professionals confirm that although biopharma is collecting 7x more investigational drug data than 20 years ago, drugs are not getting to market faster, nor with a higher hit rate. While technologies are improving at pace, the industry still needs tools across the drug discovery paradigm that can successfully translate this data into insight.

We have observed a significant shift in mindset within the industry. Companies have embraced the data/engineering and statistical perspective, transforming traditional research and development (R&D) processes. Instead of starting with a hypothesis and conducting experiments to test it, start-ups are flipping the script and adopting a data-first approach to identify new targets. They begin by mining vast datasets (the most valuable ones being self-generated) for insights, using sophisticated algorithms to uncover patterns and connections that may have otherwise gone unnoticed. Whilst a promising start, we believe this needs to be combined with later-stage human-derived and clinical biomarker data, particularly when it comes to deepening our understanding of disease pathology.

In this dynamic environment, TechBio start-ups have the opportunity to disrupt conventional practices by leveraging the power of data, advanced analytics, ML/AI algorithms, and cloud computing, which can drive innovation at the intersection of technology and biology. These advancements can optimise personalised treatment and accelerate various stages of the drug development pipeline.


Market map

As the intersection of technology and biology continues to shape the future of the life sciences and drug development industries, we need to understand the evolving market landscape and identify key segments of interest.

We have identified five distinct segments, each targeting different aspects of the life sciences tech stack, ranging from lab automation, data generation and management, and analysis through to clinical trial optimisation.

We noticed that while some companies focus on offering recurring Software-as-a-Service (SaaS) models for data infrastructure and management, the majority adopt more varied business models, which combine elements of SaaS, fee-based services, and strategic partnerships with pharmaceutical companies where revenue is linked to the upside potential (such as positive trial outcomes or value-sharing arrangements).

MMC’s market map

Following this analysis, we have distilled the market map into two primary segments that are of particular interest to MMC:

  1. Lab automation, Data Infrastructure and Analysis
  2. Next-Generation AI Drug Discovery Platforms

A third segment includes biotech companies that operate across the entire spectrum of drug development, more akin to the traditional model. While these companies fall outside the immediate scope of our analysis, they’ll play a significant role in shaping the future of the life sciences industry.

In the following sections, we will delve deeper into these two key segments, exploring the emerging trends, notable players, and investment opportunities within each domain.


Deep dive #1: Lab Automation & Data Infrastructure and Analysis

Unlocking the power of bioinformatics for large-scale data analysis

Bioinformatics, a field focused on managing and interpreting vast amounts of biological data, has become a critical bottleneck in harnessing the full potential of Next-Generation Sequencing (NGS) technology. The exponential growth in “omics” data necessitates the use of bioinformatics tools across various scientific disciplines, including wet lab biologists who lack advanced bioinformatics skills. However, this is changing with many biologists teaching themselves basic computational skills.

The market lacks advanced tools for NGS data analysis, such as annotation, alignment, and identification of meaningful patterns. Most existing solutions today are legacy systems, custom scripts deployed on on-premise servers, or open-source tools, which are widely regarded as the gold standard by scientists despite being fragmented.

There is a pressing need for a transformation in the bioinformatics tech stack, presenting an opportunity for innovative start-ups such as Latch Bio, that can efficiently standardise and scale data workflows to collect and curate complex data for downstream analysis. Ultimately, many early-stage challenges in bioinformatics are data management and interpretation problems, which can be approached as engineering problems. This realisation is the driving force behind companies like Lamin.ai, which is building better data infrastructure for R&D teams of any size.

Recognising the model most widely adopted among scientists, open-source platforms are also undergoing a notable transformation to meet the changing needs of the evolving bioinformatics field. This trend aligns with the broader rise of open-source software, which has gained popularity among the wider community of software developers and data engineers seeking cost-effective and scalable solutions. Companies like Nextflow are at the forefront of this movement, providing scalable and reproducible computational workflows specifically designed for bioinformatics analysis.

Democratising data engineering tasks in the lab

Traditionally, wet lab scientists rely heavily on bioinformaticians to handle unstructured data from sequencers or public datasets. This involves performing quality assurance, running custom scripts, and using open-source tools to narrow down areas of interest. However, there is rising demand to democratise access to bioinformatics expertise and empower wet lab scientists to perform the relatively simple parts of the analysis themselves. This can be achieved through user-friendly graphical user interfaces (GUIs) and the automation of downstream processes. A good example is Pipe Bio, a platform that enables wet lab scientists to analyse and manage DNA sequencing data without needing support from bioinformaticians or programmers.

Adopting cloud technologies in research and development (R&D) introduces new challenges as scientists spend significant time converting data formats, transferring data between applications, manipulating tables, aggregating statistics, and visualising data to extract insights. Addressing these challenges can unlock substantial value in the workflow, resulting in improved accuracy, efficient utilisation of expensive bioinformatics expertise for novel and complex tasks, and, most importantly, significant time-saving. According to McKinsey, pharmaceutical companies can bring medicines to the market more than 500 days faster and reduce costs by 25%, mainly through automated processes.

Meeting diverse buyer and user profiles

To effectively navigate the landscape of lab operations and data infrastructure, it is essential to grasp the distinct motivations, priorities, budgets, and technical proficiency of the various buyer and user profiles. These encompass large pharmaceutical companies, academics, smaller biotech firms, as well as wet and dry lab scientists.

Large pharma companies often possess in-house teams that develop their own platforms with support from bioinformatics experts, but while collaboration and reuse are prioritised in these expansive operations, the tooling employed may not always be at the cutting edge. Conversely, smaller biotech firms, typically comprising 50-80 individuals with fewer bioinformaticians (at a ratio of 1:20), seek workflow tools that can better serve their biologists.

Dry lab scientists, including bioinformaticians and computational biologists, strive for standardisation in their workflow, particularly at the initial stages where most of their time is spent. They adopt a detailed, developer-like approach to deconstruct visualisations and construct custom tools from scratch. For these professionals, the extensibility of the platform, integration of diverse data sources and tools, and robust data testing capabilities are crucial. In contrast, wet lab scientists prioritise platforms that automate annotation and downstream visualisation processes, equipping them with fundamental bioinformatics skills.

Irrespective of the organisation’s size, a team with extensive domain expertise and the capacity to expand the platform’s capabilities to provide comprehensive end-to-end solutions is essential for effectively selling to technical buyers in this rapidly evolving industry.

In terms of purchasing decisions, scientists look for “best of breed” solutions that cover more bases rather than an inferior “off-the-shelf” platform. The pricing power of a tool is often linked to its complexity, but fostering loyalty among price-sensitive users is possible. For instance, scientists in academic institutions and small biotech companies, who prioritise platforms capable of storing and capturing structured genetic data, may be more conscious of pricing. However, if they have used a platform at another organisation, they are more inclined to build a compelling business case to justify purchasing it.

What’s next?

At MMC, we anticipate a significant expansion in the scientist toolkit as biopharma and scientists seek more efficient ways to collaborate and standardise data pipelines, ultimately enhancing the efficacy of life sciences R&D. Just as engineering experienced a revolution in workflow standardisation, we foresee a similar trend impacting bioinformatics.

We believe that the winners in this evolving landscape will be teams that possess both technical expertise and a strong product focus. They will actively listen to customer needs, incorporating training requests and desired product features while recognising the importance of allowing customisation on their platform. Considering the complexity of potential features, solutions, and open-source simulations, an extensible platform that can seamlessly integrate various best-of-breed solutions is crucial. This flexibility empowers developers, data scientists, and bioinformaticians to delve into the code and tailor the platform to their specific requirements.

The viability of building a significant fee-for-service or SaaS business in this space is still debatable. However, companies like Benchling, a category leader in driving product-led growth (PLG) for scientists, provide confidence in a successful SaaS model. Conversely, many companies that initially started with fee-for-service models are now exploring revenue upside through royalties, exemplified by Schrödinger.

At MMC, we find this emerging market incredibly exciting. Based on our bottom-up analysis, which considers the increasing number of scientists, we project that this market could reach $10 billion by 2030 (EU and US markets only). We are actively seeking conversations with founders operating in this space as we recognise the tremendous opportunities and potential for growth.


Deep dive #2: Next generation of AI drug discovery platforms

The challenge of siloed data

To comprehend the path ahead, we must examine the starting point. In biopharma, a prevailing concern revolves around data. Biopharmaceutical companies consider data the cornerstone of their intellectual property (IP), so a significant portion of industry data remains confined within silos. This fragmented data landscape is a hindrance, because the complexity of the diseases biopharma targets outstrips the data available to support work on them. Additionally, many companies lack the talent required to extract optimal insights from their data, assuming it is of sufficient quality in the first place.

Just as the Human Genome Project revolutionised biopharma in the early 2000s by enabling the identification of new drug targets and inspiring ground-breaking public-data initiatives like the UK Biobank, we anticipate a similar wave of innovation once the obstacles of siloed data are overcome. This progress will serve as a catalyst for further advancements in data-driven drug discovery, paving the way for transformative breakthroughs.

The first generation of AI-enabled drug development: A paradigm shift in biopharma

In the world of drug discovery, the rise of artificial intelligence (AI) offered a ray of hope, promising to revolutionise the entire process. The first generation of AI-enabled drug discovery (AIDD) companies, including Exscientia and Benevolent, emerged as pioneers in this field by leveraging AI algorithms to analyse vast public datasets or proprietary data from biopharmaceutical companies. Others, such as Owkin, are paving the way by using federated AI to draw insights from data sitting in silos in order to build better disease models and identify new biomarkers.

Integrating AI into the drug discovery process marked a significant departure from the traditional phenotype-driven approach, where candidates were screened based on known diseases and their symptom changes. It introduced a data-driven approach focused on identifying new targets and predicting properties and outcomes by analysing large, diverse datasets. This shift was met with excitement as biopharmaceutical companies recognised the potential for AI to accelerate their progress and became aware that AI was a “must-have” to remain competitive in the industry.

However, hiring top talent and data scientists in the biopharma industry remains a challenge. Instead, biopharma companies are accessing this talent through strategic partnerships with leading tech companies (e.g. Novartis partnering with Microsoft), acquiring AI capability (e.g. Genentech/Roche acquiring Prescient Design in 2021), or acquiring smaller AI companies that complement their existing tech stack (e.g. Recursion acquiring Valence and Cyclica in 2023). We anticipate a continued wave of strategic collaborations and acquisitions as biopharma companies strive to unlock the full potential of AI-driven technologies.

Despite the initial enthusiasm surrounding AI-driven drug discovery, it has become increasingly clear that AI alone is not a silver bullet that guarantees success. The drug discovery process is inherently complex, spanning an average of 10 years from target identification to drug approval. Many pioneering AIDD companies have been active for nearly a decade, yet we have not witnessed the approval of drugs solely supported by AI-driven approaches. This realisation prompted the industry to acknowledge the need for a collection of technologies, rather than relying solely on AI, to enhance the drug discovery process in the long run.

Source: How Artificial Intelligence is Revolutionizing Drug Discovery

One of the critical limitations of the first-generation AIDD approaches is the heavy reliance on public or incomplete datasets that may not be readily usable by machine learning (ML) algorithms. To address this challenge, a second generation of AIDD companies has emerged, focusing on data generation and curation to ensure the availability of high-quality, comprehensive datasets. These companies have also learned that feedback loops, achieved through constant iteration and testing of findings in the wet lab, are crucial for testing new hypotheses and for connecting a granular, data-driven approach back to the bigger picture.

Fuelling the success of second-generation AIDD companies

The success of second-generation AI-enabled drug discovery (AIDD) companies can be attributed to several converging trends shaping the drug development landscape. These trends are driving advancements in data generation, human disease modelling, regulatory considerations, and the emergence of generative AI technologies.

One key factor contributing to the success of these companies is the increasing ability to generate multi-"omic" data. Recent advancements in sequencing technologies from companies such as 10x Genomics, Illumina, and Berkeley Lights have made detecting not just genes, but RNA, proteins, and other cellular products more economically viable. This wealth of multi-"omic" data enables scientists to build a more comprehensive understanding of complex, multi-factorial diseases. Start-ups like MultiOmic and CardiaTec are leveraging this data to gain insights into metabolic diseases and cardiovascular conditions. By integrating multi-"omic" data, scientists can identify novel targets and the mechanisms underlying these challenging diseases.

Another significant development is the emergence of start-ups focused on creating improved human models of diseases and generating human-derived data. At the forefront of this innovation are companies like Pear Bio and Turbine AI (creating cancer models), Ochre Bio (creating liver models), and Xilis (creating organoids). These human disease models provide a closer approximation of human physiology and disease pathology, enabling scientists to simulate the disease phenotype more accurately and improve the testing of drug candidates. With improved human models, data-driven target selection becomes more relevant and effective, moving away from constant iterative trial and error at the level of single genes or small sets of genes.

Regulatory changes also favour the adoption of more effective human-derived models for testing new drug targets. In a significant shift, the FDA announced in 2022 that animal trials would no longer be required for every new drug approval, in an attempt to address ethical issues and improve the efficacy of new drugs. While these changes will take time to be fully realised and will not replace animal trials in the medium term, they represent a positive step towards mitigating the decline in available, accurate human data as a drug target progresses towards clinical trials.

Additionally, as data is an expensive asset to generate to a high standard, there is considerable buzz around the promise of generative AI in drug discovery and its ability to reduce the cost and time associated with data generation by simulating large amounts of synthetic data. Companies like Cradle.ai are leveraging generative AI to improve predictions of protein structures, while others like Syntho and Replica Analytics (acquired by real-world evidence firm Aetion in 2022) are creating privacy-protected synthetic data to improve testing, research, and analysis.

What are buyers looking for?

When evaluating partnerships or adopting new software solutions, buyers aim to accelerate the drug discovery process and maximise their return on investment (ROI), measured primarily in time saved rather than direct cost savings. Time saved matters because it extends the window in which a drug can be sold under patent protection, a far greater financial consideration that can represent billions in additional revenue.

In practice, the main ROI metric for new software solutions is full-time equivalent (FTE) hours saved. For example, a traditional high-throughput screening effort may require a team of at least five individuals to operate and maintain equipment for several weeks. In contrast, computationally assisted virtual screening can be completed in approximately a week, with most of the work performed by the computer, even allowing for wet lab validation. A rough back-of-envelope comparison is sketched below.
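To make the FTE-hours comparison concrete, here is a minimal back-of-envelope sketch in Python. All figures (team sizes, durations, utilisation) are illustrative assumptions chosen to roughly match the proportions described above, not data from any vendor or study.

```python
# Back-of-envelope comparison of FTE-hours: traditional high-throughput
# screening (HTS) vs computationally assisted virtual screening.
# All numbers below are illustrative assumptions, not measured data.

HOURS_PER_FTE_WEEK = 40  # standard working week per full-time equivalent


def fte_hours(people: int, weeks: float, utilisation: float = 1.0) -> float:
    """Hands-on FTE-hours for a given team size, duration and utilisation."""
    return people * weeks * HOURS_PER_FTE_WEEK * utilisation


# Traditional HTS: assume a team of five running equipment for four weeks.
hts_hours = fte_hours(people=5, weeks=4)

# Virtual screening: assume roughly one week wall-clock with one scientist
# hands-on about a quarter of the time (the computer does most of the work),
# plus a small wet-lab validation effort afterwards.
virtual_hours = fte_hours(people=1, weeks=1, utilisation=0.25) + fte_hours(people=2, weeks=0.5)

print(f"HTS:               {hts_hours:.0f} FTE-hours")
print(f"Virtual screening: {virtual_hours:.0f} FTE-hours")
print(f"Saving:            {hts_hours - virtual_hours:.0f} FTE-hours")
```

Under these assumptions the virtual workflow consumes a small fraction of the hands-on hours of a traditional screen, which is why buyers tend to frame ROI in FTE hours and calendar time rather than licence cost.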

Managing partnerships and navigating intellectual property (IP) ownership rights also play a significant role for buyers. Buyers seek multi-disciplinary teams with expertise in partnering with pharmaceutical companies, and they value peer-reviewed publications that demonstrate the technology's effectiveness and reliability. Furthermore, buyers are particularly interested in teams with experience negotiating complex commercial arrangements that are mutually beneficial for both parties. A crucial aspect of these negotiations is shared versus proprietary IP, as pharmaceutical companies are keen to protect their interests given their extensive experience in acquiring IP to expand their pipelines. Therefore, while the allure of IP ownership and potential royalties may appeal to start-ups, they should consider carefully how these models would affect potential strategic partnerships and how they shape the longer-term business model, from technology platform at one end of the scale to full-stack biotech at the other.

What’s next?

Although AI algorithms can provide valuable insights, there is always doubt whether virtual screening accurately represents what would occur in a real-world setting. This doubt will persist until one of the AI-developed drugs currently undergoing clinical trials gets approved. This challenge prompts buyers to seek assurance and validation methodologies that bridge the gap between virtual and experimental results, providing confidence in the effectiveness of AI-driven predictions.


If you are building something at Seed or Series A in the TechBio space, we are very interested in talking to you!

Get in touch with Charlotte or Mira

Managing cash has never been more important – why we continue to back TreasurySpring


TreasurySpring has raised a $29m Series B. Ollie Richards explains why we are excited to continue supporting the company on its growth journey


When we first met with Kevin, Mathew, James and the TreasurySpring team in early 2019, we were immediately impressed by the team's specialist cash management insights. We were captivated by the founders' ambitious vision and saw tremendous potential for their proposition within our portfolio (we have since helped turn over 20 of our portfolio companies into happy clients). This led us to invest in the company's Seed round and go on to co-lead the Series A. I outlined our thinking behind the initial investment at the time.

Since then, the appeal of Fixed-Term Funds (FTFs) as a secure, attractive investment product for our portfolio and the wider market of corporates with excess cash has become clear, helped by some strong tailwinds in the macro environment.


TreasurySpring positions itself today as a leading provider of cash management solutions that minimise risk through diversification and collateralisation, offer access to multiple investment options, and eliminate exposure to maturity mismatches.

When reflecting on the journey so far, one key insight from this Seed to Series B adventure is that you don’t have to hyper-scale the team to deliver rapid commercial growth. The team’s capital efficiency in scaling the business has been impressive. They have achieved remarkable growth while maintaining a relatively small team — a testament to the smart business model and their operational excellence.

Navigating Changing Market Dynamics

Since our initial investment, the market environment has undergone a significant transformation. Interest rates have shifted, opening people's eyes to the possibility of generating yield from large cash balances. We have witnessed this unexpected new revenue across our portfolio companies that have invested in FTFs through TreasurySpring. This macro shift has had implications across the fintech sector, particularly for banks and wealth tech firms, which benefit from a higher-rate environment.

The SVB Wake-Up Call

The recent collapse of SVB highlighted the importance of treasury management to all businesses, and in particular, the early-stage venture capital ecosystem. Since SVB, we, along with our peers, have proactively advocated for prudent cash management among founders and on the boards we have the privilege of working with. I am hopeful that the recent learnings in this area will help inform founders for many years to come.

MMC published some best practices (linked below), and Kevin's recent blog (also linked below) sheds light on the lessons learned from SVB. Among them are the significance of diversification across client bases and funding sources, and the risks associated with maturity transformation and liquidity mismatches. Regulations and predictive models underestimated the potential liquidity needs in a bank run, and certainly those of the small cohort of VC-backed companies that exhibited herd-like behaviour in withdrawing funds.

Collective Action and Swift Resolutions

Almost immediately after the desperate cash calling, it was great to see the European VC and start-up community rally together over a weekend of carnage and push effectively for swift action from politicians to resolve what could have become a systemic problem. The coordinated effort displayed the community's resilience and determination to tackle challenges head-on, and it made me proud of our ecosystem and of playing a small part in advocating for a solution.


Big Plans

The next few years hold immense opportunities for TreasurySpring’s further growth and development. The company has plans for geographic expansion, continued product evolution, and significant initiatives to drive broader adoption of FTFs.

I’m delighted to welcome Rana and Rob from Balderton along with Mubadala Capital on this journey as part of the company’s most recent funding round, and I’m excited to see what the TreasurySpring team can achieve from here.


As part of our career development planning at MMC, we encourage secondments within our portfolio companies. It’s great to have Masamba, one of our investment team members, currently on secondment at TreasurySpring. He’s playing a part in actively driving forward key initiatives and contributing to the company’s ongoing success.

Fantastic TreasurySpring Future

With a strong foundation and product, an expert team, and a range of exciting growth opportunities to pursue, I believe the future shines brightly for TreasurySpring and FTFs, and we are excited to witness the next phase of its remarkable journey.

I’m looking forward to the next phase!


If you’re an entrepreneur building something in B2B software, please reach out. We love learning about new ideas — ollie@mmc.vc

Follow Ollie on LinkedIn

TreasurySpring raises $29 million




TreasurySpring has raised $29 million in Series B. MMC first backed the team in 2019, and was proud to participate in its latest round


TreasurySpring's investment platform helps firms of all sizes unlock and protect the true value of their cash assets.

Founded in 2017, TreasurySpring addresses the growing need for companies of all sizes to diversify their cash deposits, access high-quality investments and minimise their banking risk. This challenge is especially urgent in the wake of rising interest rates and with the dangers of unsecured deposits having been brought into sharp focus by the recent collapses of SVB and Credit Suisse. Thanks to its seamless, secure solution, demand for the TreasurySpring platform has accelerated, with over 100 clients currently in the onboarding process, adding to the 250+ institutional clients that are already signed up.

TreasurySpring’s intuitive solution makes it easy for companies – from FTSE 100 corporations and leading multinationals to Series A startups, scale-ups and private charities – to place cash assets in a secure and flexible platform. Designed to help companies maximise returns whilst minimising risk, the platform gives access to investment capabilities and diversification that have historically been reserved for the world’s largest financial institutions.

There has been such demand for its services that TreasurySpring grew more than 7x in the last 12 months and total issuance through its platform now stands at more than $50bn.


We have been growing the business rapidly for the last five years and are very well-placed to capitalise on current market dynamics

KEVIN COOK, CO-FOUNDER AND CEO, TREASURYSPRING

The expertise to transform an unloved area of finance

TreasurySpring is tapping into the vastly underserved cash market, a multi-trillion dollar industry which has to date been ignored by the fintech boom. This is, in part, due to the complexity, barriers to entry and lack of knowledge about money markets and their potential. The TreasurySpring founding team, Kevin Cook (CEO), Matthew Longhurst (COO) and James Skillen (CTO), with their decades of experience built up across hedge funds, asset management and investment consulting, have the skills and vision to transform this sector for the better.

Currently, companies typically keep cash in bank deposits, which are only protected up to a maximum of £85,000 in the UK and $250,000 in the US, and only the largest corporations and banks have the teams, expertise and infrastructure necessary to access a better, broader range of cash investment options. TreasurySpring is changing this: through the platform, companies of all sizes can quickly gain access to over 600 standardised cash investment products, across seven different currencies and three categories: governments, corporates and banks such as Goldman Sachs, Barclays and Societe Generale. The platform's Fixed-Term Funds (FTFs) provide standardised, regulated access to an ever-expanding universe of cash investment options, ensuring flexibility, safety and diversification for clients.

TreasurySpring is on a mission to offer the power of large-scale treasury management to all businesses. The $29 million Series B, which brings the total raised to $42 million, will be used to invest in TreasurySpring’s product, sales, marketing and tech teams with plans to grow its headcount by 50% in the next 12 months, further develop its products and services, and accelerate international expansion.