Gladia believes that real-time processing represents the next frontier of audio transcription APIs

French startup Gladia, which offers a speech recognition application programming interface (API), has raised $16 million in a Series A funding round. Essentially, Gladia’s API allows you to convert any audio file to text with high accuracy and fast turnaround time.

While Amazon, Microsoft and Google all offer speech-to-text APIs as part of their cloud hosting product suites, their offerings are not as powerful as newer models from specialized startups.

There has been tremendous progress in this area in recent years, especially after OpenAI released Whisper. Gladia competes with other well-funded companies in the space, such as AssemblyAI, Deepgram and Speechmatics.

Gladia originally offered a fine-tuned version of Whisper, OpenAI’s speech-to-text model, with some much-needed improvements. For example, the startup supports diarization out of the box – it can detect when multiple speakers are involved in a conversation and separate the recording and transcribed text depending on who is speaking.

Gladia supports 100 languages and a variety of accents. This reporter can confirm that it works: we used Gladia to transcribe a few interviews, and accents were not an issue.

The startup offers its speech-to-text model as a hosted API that customers can integrate into their own applications and services. Over 600 companies use Gladia, including several meeting recorders and note-taking assistants such as Attention, Circleback, Method Financial, Recall, Sana, and Veed.io.

This particular use case is interesting because many companies need to chain API calls. They first convert speech into text, which they then feed into a large language model (LLM) such as GPT-4o or Claude 3.5 Sonnet to extract knowledge from large walls of text.
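To make that chaining concrete, here is a minimal sketch of the two-step pipeline in Python. The Gladia endpoint, header and response fields are placeholders for illustration rather than the documented API; the second step uses OpenAI’s standard chat completions client.

```python
# Illustrative two-step pipeline: transcribe audio with a speech-to-text API,
# then pass the transcript to an LLM to extract a summary.
# The Gladia endpoint, header and field names below are hypothetical placeholders.
import requests
from openai import OpenAI

GLADIA_KEY = "your-gladia-api-key"                           # placeholder
TRANSCRIBE_URL = "https://api.gladia.io/v2/transcription"    # hypothetical endpoint

# Step 1: speech to text (hypothetical request/response shape)
resp = requests.post(
    TRANSCRIBE_URL,
    headers={"x-gladia-key": GLADIA_KEY},
    json={"audio_url": "https://example.com/interview.mp3", "diarization": True},
    timeout=60,
)
transcript = resp.json().get("transcription", "")

# Step 2: feed the transcript to an LLM to condense the wall of text
client = OpenAI()  # reads OPENAI_API_KEY from the environment
summary = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Summarize this meeting transcript as bullet points."},
        {"role": "user", "content": transcript},
    ],
)
print(summary.choices[0].message.content)
```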

With the new funding, Gladia aims to simplify this pipeline by integrating audio intelligence and LLM-based tasks into a single API call. For example, a customer could receive a conversation summary generated from a handful of bullet points without relying on a third-party LLM API.
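A consolidated call might look something like the sketch below. The `summarization` flag and the response fields are assumptions for illustration only, since Gladia has not detailed the final interface.

```python
# Hypothetical sketch of a single consolidated call that returns both the
# transcript and an LLM-generated summary. Endpoint and option names are
# illustrative assumptions, not Gladia's documented API.
import requests

resp = requests.post(
    "https://api.gladia.io/v2/transcription",   # hypothetical endpoint
    headers={"x-gladia-key": "your-gladia-api-key"},
    json={
        "audio_url": "https://example.com/interview.mp3",
        "diarization": True,
        "summarization": True,   # assumed audio-intelligence flag
    },
    timeout=60,
)
result = resp.json()
print(result.get("transcription"))
print(result.get("summary"))
```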

The other problem Gladia wants to solve is latency. You may have seen demos of real-time audio conversations with an AI-based call agent (11x has a good demo on its website), and these systems need to be able to transcribe in near real-time to make such conversations sound as human-like as possible.

“We found that real-time is generally not very good in terms of quality in the market. And people had a strange use case: they performed real-time processing, then grabbed the audio and ran it in batch. We asked ourselves, ‘Why are you doing this?’ They told us, ‘The quality is not good with real-time processing, so we batch transcribe it afterwards,’” co-founder and CEO Jean-Louis Quéguiner (pictured above, right) told TechCrunch.

Gladia has addressed this problem and is currently able to transcribe a live conversation with a latency of under 300 milliseconds. The company claims that real-time processing is now more or less as good as the standard asynchronous batch transcription API, but without proper testing it’s difficult for us to judge. As Quéguiner says, the startup aims for “batch quality with real-time capability.”
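For the real-time path, a client would typically stream audio over a WebSocket and read partial transcripts back as they arrive. The sketch below assumes a hypothetical live endpoint and message format; Gladia’s actual streaming protocol may differ.

```python
# Minimal sketch of streaming audio for low-latency transcription over a
# WebSocket. The endpoint, handshake and message shape are hypothetical.
import asyncio
import json
import websockets

LIVE_URL = "wss://api.gladia.io/v2/live"   # hypothetical endpoint

async def stream(audio_chunks):
    async with websockets.connect(LIVE_URL) as ws:
        # Hypothetical handshake: declare the API key and audio format up front.
        await ws.send(json.dumps({
            "x_gladia_key": "your-gladia-api-key",
            "encoding": "wav/pcm",
            "sample_rate": 16000,
        }))
        for chunk in audio_chunks:              # chunk: raw PCM bytes from a mic or SIP leg
            await ws.send(chunk)
            reply = json.loads(await ws.recv()) # assumes one partial transcript per chunk
            print(reply.get("transcription", ""))

# asyncio.run(stream(read_microphone()))  # read_microphone() is left to the caller
```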

Aside from AI call agents, you can imagine a call center leveraging these real-time capabilities to help human agents find relevant information during a call. “Our single API is compatible with all existing tech stacks and protocols, including SIP, VoIP, FreeSwitch and Asterisk,” said co-founder and CTO Jonathan Soto (pictured above, left) in a statement.

XAnge led the Series A funding round. Illuminate Financial, XTX Ventures, Athletico Ventures, Gaingels, Mana Ventures, Motier Ventures, Roosh Ventures and Soma Capital also participated.

Gladia believes we are on the verge of a “ChatGPT moment” for audio applications. GPT technology has been around for years, but ChatGPT has really popularized LLMs with its chat-like interface for consumers.

When Apple or Google start integrating transcription models into iOS or Android, consumers will begin to understand the value of automated transcription in the apps they use. Developers will then likely integrate audio functionality into their products, and this is where API providers like Gladia will come into play.