Two approaches to latency in AI conference calls: a technical overview

Introduction

Today we are introducing two distinct products designed to deploy AI agents in conference calls. One integrates with Google Meet. The other operates on a purpose-built native infrastructure. This post outlines the architectural rationale behind each, and the criteria for choosing between them.

Throughout the development of our AI platform for business calls, one question has come up consistently in enterprise conversations: "Is this compatible with Google Meet?"

The answer is yes — and we have now formalized that capability into a dedicated product. At the same time, we are releasing a second product that takes a fundamentally different approach. This post provides a technical account of both, the design decisions behind each, and a framework for determining which is most appropriate for a given use case.

New Product #1

Callin for Google Meet

Our first announcement today is Callin for Google Meet: an AI voice translation layer that plugs directly into your existing Google Meet calls — no workflow changes, no new tools to install, no team training required.

How it works

Callin for Google Meet operates as an AI participant in the call. It connects via the Google Meet API, captures each speaker's audio in real time, runs transcription and translation through our AI model, and delivers the translated audio back to the configured participants.


Architecture — Callin for Google Meet

Speaker A (EN) ──▶ Google Meet Server ──▶ Callin Bot
                                              ↕
                                 STT → LLM translate → TTS
                                              ↕
Speaker B (IT) ◀── Google Meet Server ◀── Callin Bot

* Audio packets transit through Google's infrastructure before reaching the bot
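
To make the serial structure concrete, here is a minimal sketch of a single translation step in Python. The transcribe, translate, and synthesize callables are hypothetical stand-ins, not the actual Callin or Google Meet APIs.

```python
from typing import Callable

def serial_translation_step(
    audio_chunk: bytes,
    transcribe: Callable[[bytes], str],   # STT model (hypothetical stand-in)
    translate: Callable[[str], str],      # LLM translation (hypothetical stand-in)
    synthesize: Callable[[str], bytes],   # TTS model (hypothetical stand-in)
) -> bytes:
    """One utterance in, one translated utterance out.

    Each stage must finish before the next begins: the Meet integration
    receives buffered chunks, so the pipeline is inherently serial.
    """
    text = transcribe(audio_chunk)        # wait for the full transcription
    translated = translate(text)          # then translate the complete text
    return synthesize(translated)         # then synthesize the full reply
```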

Rationale

In most enterprise environments, Google Meet is already the established standard. Procurement processes have been completed, IT policies mandate its use, and integration with Google Workspace is deeply embedded in day-to-day workflows. Proposing a platform migration as a prerequisite for adopting AI translation is rarely a viable path, particularly for organizations operating at scale.

Callin for Google Meet is designed to meet organizations where they already operate. It requires no changes to existing infrastructure, no retraining of staff, and no disruption to established meeting workflows. Teams continue using Meet; the AI translation layer operates transparently within it.


Ideal use case: companies with international teams already on Google Workspace, regular client calls in different languages, or multilingual commercial negotiations where switching platforms is not a viable option in the near term.

Architectural constraints

Transparency about the limitations of this integration is central to how we communicate about our products. Callin for Google Meet operates within structural boundaries that are inherent to Google's infrastructure and cannot be addressed at the application layer.


// Latency breakdown — Callin for Google Meet (internal measurements)
Network first-hop (client → Meet server)   ~80–150 ms
Meet internal routing                      ~40–80 ms
Bot audio capture + buffering              ~60–120 ms
STT (speech-to-text)                       ~150–250 ms
LLM translation                            ~80–150 ms
TTS (text-to-speech) + playback            ~100–180 ms
───────────────────────────────────────────────────────
End-to-end latency (estimated)             ~600–900 ms
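
Because every stage in this path runs serially, the latencies simply add. A quick back-of-envelope check on the figures above (a sketch, not our measurement harness):

```python
# Per-stage ranges copied from the breakdown above, in milliseconds.
MEET_STAGES_MS = {
    "network_first_hop":     (80, 150),
    "meet_internal_routing": (40, 80),
    "bot_capture_buffering": (60, 120),
    "stt":                   (150, 250),
    "llm_translation":       (80, 150),
    "tts_playback":          (100, 180),
}

best = sum(lo for lo, _ in MEET_STAGES_MS.values())    # 510 ms
worst = sum(hi for _, hi in MEET_STAGES_MS.values())   # 930 ms
print(f"serial envelope: {best}-{worst} ms")
# The ~600–900 ms estimate quoted above is a typical-case band inside this
# raw envelope; stages rarely all hit their best or worst case together.
```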

The primary source of latency is not the AI model itself, but the audio pipeline imposed by Google Meet's architecture. Meet was not designed as a real-time relay for AI agents. Its internal buffering, routing logic, and infrastructure — optimized for high-quality video delivery between human participants — introduce structural overhead that cannot be reduced from an external integration point.

New Product #2

Callin Voice Engine

The second announcement represents the longer-term infrastructure investment undertaken by our engineering team. Callin Voice Engine is our native audio infrastructure: a real-time stack designed from the ground up for production AI voice agents, with no architectural dependency on third-party video conferencing platforms.

Context and motivation

Google Meet is purpose-built for a well-defined objective: delivering high-definition video to human participants in a shared session. It addresses that problem effectively. AI voice agents, however, operate under a substantially different set of requirements.

For a conversational AI agent, response latency is not merely a performance metric — it is a determinant of whether the interaction reads as natural or mechanical. Below approximately 300 ms, conversation feels fluid and responsive. In the 400–700 ms range, users begin to perceive an unnatural pause. Beyond 700 ms, conversational continuity breaks down: users hesitate, question whether the agent has processed their input, and may repeat themselves.

The consequence extends beyond user experience quality. At elevated latency, the perceived reliability of the product itself is affected — a particularly significant consideration in customer-facing deployments.

Architecture overview

Callin Voice Engine is built on a native WebRTC stack optimized specifically for AI pipeline workloads. The following sections describe each architectural layer and its contribution to end-to-end latency reduction.

1 · UDP transport — eliminating head-of-line blocking

Callin Voice Engine uses UDP as its base transport layer, paired with SRTP (Secure Real-time Transport Protocol) for encryption at the packet level.

TCP — the protocol underlying HTTP and WebSocket — enforces ordered, reliable delivery. When a packet is lost in transit, the receiving buffer stalls until retransmission is complete. In real-time audio streaming, this behavior is counterproductive: a lost audio packet is less harmful than a delayed one. The codec is designed to handle packet loss gracefully; it cannot compensate for a pipeline blocked on TCP retransmission.

UDP foregoes delivery guarantees in favor of continuity. Resilience is handled at the application layer by the codec, removing TCP's structural blocking entirely from the latency-critical path.
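
As an illustration of the transport behavior, the sketch below ships sequence-numbered audio frames over bare UDP in Python: a lost datagram is simply gone, and nothing upstream stalls waiting for it. Production WebRTC adds RTP framing and SRTP encryption on top; this is the underlying principle, not our stack.

```python
import socket

def stream_frames(frames, host: str = "127.0.0.1", port: int = 5004) -> None:
    """Fire-and-forget audio frames over UDP: no retransmission, no blocking."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for seq, frame in enumerate(frames):
            # A 2-byte wrapping sequence number lets the receiver detect loss
            # and reordering (real deployments use RTP headers for this).
            sock.sendto((seq % 65536).to_bytes(2, "big") + frame, (host, port))
            # If this datagram is dropped in transit, it is never resent:
            # the codec's FEC/PLC conceals the gap at the receiver instead.
    finally:
        sock.close()
```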

2 · Distributed edge infrastructure — first-hop at ~13 ms (P50)

Network propagation delay is bounded by physical constraints: light in optical fiber propagates at approximately 200,000 km/s, which sets an absolute floor on round-trip times. The practical lever available to engineers is geography: reducing the distance between the client and the first server node.

Callin Voice Engine operates on a globally distributed network of edge nodes, positioned to achieve a median first-hop latency of approximately 13 ms between client and our infrastructure. Once traffic enters our network, it is routed over private inter-datacenter backbones — dedicated connections with lower and more consistent latency profiles than public Internet paths, which any Meet-based integration must rely on for the full end-to-end route.
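
A back-of-envelope calculation makes the geographic argument concrete. The distances below are illustrative, not a description of Callin's actual topology:

```python
# Light in optical fiber covers roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200  # ≈ 200,000 km/s

def one_way_floor_ms(distance_km: float) -> float:
    """Lower bound on one-way latency from propagation alone."""
    return distance_km / FIBER_KM_PER_MS

print(one_way_floor_ms(100))    # edge node ~100 km away:         0.5 ms
print(one_way_floor_ms(6000))   # transatlantic server ~6,000 km: 30.0 ms
```

Queuing, last-mile access, and serialization add on top of these floors, which is why a measured first hop of ~13 ms at P50 implies a node that is physically close to the client.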

3 · SFU architecture — zero transcoding overhead

Callin Voice Engine operates on a Selective Forwarding Unit (SFU) model, as opposed to a Multipoint Control Unit (MCU).

An MCU decodes each incoming audio stream, mixes the signals, re-encodes the output, and redistributes it to participants. Each codec pass introduces additional latency — typically 60–150 ms per transcoding leg. An SFU, by contrast, performs selective packet forwarding without decoding or re-encoding the payload. Our media server routes audio packets as received; the bitstream is never touched. The result is a forwarding path with no transcoding overhead.
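
A minimal sketch of the forwarding model follows; subscription management and socket plumbing are elided, and this is not the Callin media server:

```python
class Subscriber:
    """Hypothetical placeholder for a peer connection's packet sink."""
    def send(self, packet: bytes) -> None: ...

def sfu_forward(packet: bytes, subscribers: list[Subscriber]) -> None:
    # The payload goes out exactly as it came in: no decode, no mix,
    # no re-encode, so no codec pass on the forwarding path.
    for sub in subscribers:
        sub.send(packet)

# An MCU would instead do, per leg:
#   pcm = decode(packet); mixed = mix(pcm, others); packet = encode(mixed)
# paying the 60–150 ms transcoding cost that the SFU path never incurs.
```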

4 · Opus tuned for AI — 10 ms frames, DTX, in-band FEC

The audio codec in Callin Voice Engine is Opus, configured with parameters specific to AI use cases (a configuration sketch follows the list):

  • Frame size: 10 ms — halves packetization delay versus the 20 ms default, with no perceptible change in speech quality.

  • DTX (Discontinuous Transmission) — nothing is transmitted during silence. Reduces bandwidth and AI pipeline load during natural speech pauses.

  • In-band FEC — Forward Error Correction embedded in the stream: if a packet is lost, the next one carries enough information to reconstruct it. No fallback, no retransmission request.
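
The sketch below expresses this configuration as an explicit profile. The dictionary shape is hypothetical, but each knob maps to a real libopus encoder control (OPUS_SET_DTX, OPUS_SET_INBAND_FEC, OPUS_SET_PACKET_LOSS_PERC):

```python
# Hypothetical configuration profile; the keys mirror real libopus ctls,
# but the dict itself is illustrative, not a Callin API.
OPUS_AI_PROFILE = {
    "sample_rate_hz": 48_000,
    "channels": 1,
    "application": "voip",    # Opus's speech-optimized mode
    "frame_ms": 10,           # vs. the 20 ms default: halves packetization delay
    "dtx": True,              # transmit nothing during silence
    "inband_fec": True,       # next packet carries recovery data for a lost one
    "expected_loss_pct": 5,   # loss rate the FEC redundancy is budgeted for
}

# 10 ms at 48 kHz mono is 480 samples per frame:
samples_per_frame = (
    OPUS_AI_PROFILE["sample_rate_hz"] * OPUS_AI_PROFILE["frame_ms"] // 1000
)
print(samples_per_frame)  # -> 480
```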

5 · Adaptive jitter buffer — target 10–20 ms

The jitter buffer is a common source of latency in audio systems, and general-purpose implementations frequently over-provision it: standard configurations target 40–80 ms to stay robust across variable network conditions. Callin Voice Engine sets the target to 10–20 ms, adapting dynamically to observed real-time jitter.

This configuration is sustainable because traffic operates on a private edge network with substantially lower and more consistent jitter than public Internet paths. The reduced buffer target is not an accepted trade-off — it is a direct consequence of the infrastructure properties the network is designed to provide.
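
One way to express such an adaptation rule is sketched below: size the buffer from smoothed jitter statistics and clamp it to the target band. The EWMA smoothing and k-factor are illustrative choices, not Callin's production logic.

```python
class AdaptiveJitterBuffer:
    """Tracks inter-arrival jitter and derives a clamped buffer target."""

    def __init__(self, floor_ms: float = 10.0, ceil_ms: float = 20.0,
                 k: float = 3.0, alpha: float = 0.05):
        self.floor_ms, self.ceil_ms = floor_ms, ceil_ms
        self.k, self.alpha = k, alpha   # safety factor and EWMA smoothing
        self.mean = 0.0                 # smoothed jitter estimate (ms)
        self.var = 0.0                  # smoothed jitter variance (ms²)

    def observe(self, jitter_ms: float) -> float:
        """Feed one inter-arrival jitter sample; return the new target depth."""
        delta = jitter_ms - self.mean
        self.mean += self.alpha * delta
        self.var += self.alpha * (delta * delta - self.var)
        target = self.mean + self.k * self.var ** 0.5
        # Clamp to the 10–20 ms band the private network makes sustainable.
        return min(self.ceil_ms, max(self.floor_ms, target))
```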

6 · Continuous audio streaming — parallel pipeline execution

From an AI pipeline perspective, this is arguably the most consequential architectural difference — and the one least apparent from raw latency figures alone.

In a Meet-based integration, audio is necessarily delivered in discrete chunks. The STT model processes each chunk sequentially; the LLM begins only once a complete utterance has been transcribed; TTS synthesis follows. Each stage must conclude before the next can begin. The pipeline is inherently serial.

Callin Voice Engine delivers audio as a continuous stream. The STT model begins transcription on the first incoming frames, while the speaker is still talking. The LLM receives transcription tokens progressively and can initiate response generation before the utterance is complete. The TTS model begins synthesizing the first output tokens before the LLM has finished its completion.

The three stages execute concurrently rather than sequentially. End-to-end perceived latency is therefore determined by the slowest stage at its margin — not by the cumulative sum of all stage durations.


AI Pipeline — Callin Voice Engine vs. serial approach

✕ Serial pipeline (Meet-based approach)

Audio chunk → Full STT → Full LLM → Full TTS → Playback

✓ Streaming pipeline (Callin Voice Engine)

Audio stream → STT streaming
                   ↕ token stream
               LLM streaming
                   ↕ token stream
               TTS streaming → Playback
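
In code, the streaming hand-off is naturally expressed as concurrent stages connected by queues. The sketch below uses Python's asyncio; stt_stream, llm_stream, tts_stream, and play are hypothetical stand-ins for whichever models and audio sink a deployment wires in, not the Callin Voice Engine API.

```python
import asyncio
from typing import AsyncIterator

async def streaming_pipeline(
    audio_frames: AsyncIterator[bytes],
    stt_stream,   # hypothetical: audio frames -> async transcript tokens
    llm_stream,   # hypothetical: transcript token queue -> async reply tokens
    tts_stream,   # hypothetical: reply token queue -> async audio chunks
    play,         # hypothetical: audio sink
) -> None:
    text_q: asyncio.Queue = asyncio.Queue()   # STT -> LLM hand-off
    reply_q: asyncio.Queue = asyncio.Queue()  # LLM -> TTS hand-off

    async def stt_stage():
        # Emits tokens while the speaker is still talking.
        async for token in stt_stream(audio_frames):
            await text_q.put(token)
        await text_q.put(None)  # end-of-utterance sentinel

    async def llm_stage():
        # Starts generating before the transcript is complete.
        async for token in llm_stream(text_q):
            await reply_q.put(token)
        await reply_q.put(None)

    async def tts_stage():
        # Begins synthesis on the first reply tokens.
        async for chunk in tts_stream(reply_q):
            play(chunk)

    # All three stages run concurrently: perceived latency tracks the
    # slowest stage at its margin, not the sum of stage durations.
    await asyncio.gather(stt_stage(), llm_stage(), tts_stage())
```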

Head-to-head: measured latency

The figures below come from internal measurements on real sessions. Google Meet values are measured on calls with participants in Western Europe; Callin Voice Engine values are measured under comparable network conditions.

Component                   Callin for Meet    Callin Voice Engine   Δ gained
─────────────────────────── ────────────────── ───────────────────── ───────────
First-hop network           80–150 ms          ~13 ms (P50)          −70–140 ms
Routing / internal server   40–80 ms           5–15 ms (SFU)         −35–65 ms
Jitter buffer               40–80 ms           10–20 ms              −30–60 ms
Transcoding overhead        60–150 ms          0 ms (pure SFU)       −60–150 ms
Codec frame size            20 ms (default)    10 ms                 −10 ms
STT → LLM → TTS pipeline    serial ~500 ms     parallel ~200 ms      ~−300 ms
End-to-end (estimated)      600–900 ms         180–350 ms            ~3× faster


// Latency breakdown — Callin Voice Engine
Network first-hop (client → edge node)        ~13 ms (P50)
SFU forwarding (no transcoding)               ~5–15 ms
Jitter buffer (aggressive, private network)   ~10–20 ms
STT streaming (starts on first frames)        ~80–120 ms
LLM (streaming, parallel to STT)              ~60–100 ms
TTS (streaming, parallel to LLM)              ~40–80 ms
──────────────────────────────────────────────────────────
End-to-end latency (estimated)                ~180–350 ms

Selecting the appropriate product

The two products are designed for meaningfully different deployment contexts. The following criteria are intended to support an informed evaluation:


Callin for Google Meet is well-suited when:
The organization is already standardized on Google Workspace · Platform migration is not feasible within the current IT or contractual framework · Call volumes are moderate (<50/day) · End-to-end latency in the 600–900 ms range is acceptable for the intended use case (internal meetings, training sessions, non-time-critical multilingual workflows).


Callin Voice Engine is the appropriate choice when:
The deployment involves a production AI voice agent · The use case is customer-facing, where latency has a direct and measurable impact on user experience and retention · Full programmatic control over the audio pipeline is required (VAD configuration, interruption handling, custom turn-taking logic) · The system must support high call volumes with predictable, scalable infrastructure.


Closing remarks

Both products reflect a consistent underlying principle: the appropriate solution depends on the specific constraints of the problem at hand, not on a single architectural preference.

Callin for Google Meet minimizes the adoption effort for organizations that are already invested in that platform. Callin Voice Engine removes the architectural ceilings for teams operating under tighter latency requirements or building customer-facing products where conversational quality is a differentiator.

The distinction matters because latency in AI voice is not an abstract engineering concern. At 200 ms, a conversation with an AI agent reads as natural. At 800 ms, the interaction pattern shifts — users adapt to the delay, responses feel scripted, and the interface resembles automated telephone systems of a prior era rather than a genuinely conversational product.

For teams evaluating Callin Voice Engine for production workloads, our engineering team is available for dedicated architecture reviews.

Methodological notes
Latency measurements reported above are averages from controlled test sessions with participants in Western Europe (Milan, Frankfurt, Paris) on consumer fiber connections (50–200 Mbps). Real-world values vary based on geography, network quality, and AI model load. "Callin for Google Meet" figures reflect structural overhead inherent to the Google Meet architecture, independent of Callin's implementation. Full benchmarks are available on request for enterprise customers.