Real-Time Video AI Agents: Beyond Voice Mode
Voice mode changed how many people think about AI assistants. Now a new category is pushing further: agents that take live video in, process it, and talk back on video in near real time. Alibaba's Wan-Streamer research paper, published on arXiv in late June 2026, put this idea in front of a wide audience. Commentators called it the end of voice mode. The reality is more nuanced.
This post breaks down what real-time video AI agents actually are, which tools you can use today, and what Wan-Streamer signals about where the category is heading. No two of these tools do the same thing, and the differences matter if you're building something or evaluating a purchase.
What separates this category from video generation
Video generation tools -- Sora, RunwayML, Kling, or Alibaba's own Wan 2.x series -- take a prompt or reference image and produce a video clip. You wait for the render. There is no conversation happening.
Real-time video AI agents are different in three ways:
- They listen and respond. The agent takes in your voice (and sometimes your face) and produces spoken output, not a rendered clip.
- They do it live. The interaction happens in under a second, close enough to feel like a phone call.
- Some output a video face. The agent's response comes back as a talking avatar, not just audio.
The third capability is the rare one. Most tools today get to points one and two but stop short of outputting a live video face back to you. That gap is exactly what Wan-Streamer is trying to close architecturally.
The tools you can use now
Hume EVI 3
Hume's Empathic Voice Interface (EVI) is the most emotionally aware voice agent available via API. EVI 3, released in 2025, is built as a unified speech-language model rather than a pipeline of separate components: it handles both language generation and voice characteristics in a single model. That means it can hear irritation in your tone and soften its response, or pick up uncertainty and hold the pause longer.
Latency lands around 300 milliseconds -- fast enough for natural back-and-forth. The API supports any voice you design or license, plus voice cloning from minimal audio samples. You can also plug in Claude, Gemini, or another LLM as the reasoning layer while EVI 3 handles the voice interface.
What Hume does not do is send you a video face back. The output is audio. That puts it firmly in the audio-out quadrant: video in (optionally), audio out.
Pricing: The free tier includes 5 minutes of EVI usage per month. The Starter plan is $3/month with 40 minutes included and additional minutes at $0.07/min. Pro is $70/month with 1,200 minutes included, then $0.06/min. Business is $500/month with 12,500 minutes included, then $0.04/min.
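To see how these tiers compare at a given usage level, here is a minimal sketch of the arithmetic. It assumes each plan's per-minute figure is the overage rate for minutes beyond the included allowance, and that overage is uncapped. The `Plan` class and both function names are made up for illustration and are not part of any Hume SDK. The free tier is left out because no overage rate is given for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_fee: float      # USD per month
    included_minutes: int   # minutes bundled into the fee
    overage_rate: float     # USD per extra minute (assumed reading of the per-minute figure)


# Figures copied from the pricing paragraph above; check Hume's pricing page before relying on them.
PLANS = [
    Plan("Starter", 3.0, 40, 0.07),
    Plan("Pro", 70.0, 1200, 0.06),
    Plan("Business", 500.0, 12500, 0.04),
]


def monthly_cost(plan: Plan, minutes_used: int) -> float:
    """Flat fee plus overage for any minutes beyond the allowance."""
    extra_minutes = max(0, minutes_used - plan.included_minutes)
    return plan.monthly_fee + extra_minutes * plan.overage_rate


def cheapest_plan(minutes_used: int) -> Plan:
    """Return the plan with the lowest estimated bill for this usage."""
    return min(PLANS, key=lambda plan: monthly_cost(plan, minutes_used))


if __name__ == "__main__":
    for minutes in (40, 500, 2000, 30000):
        plan = cheapest_plan(minutes)
        print(f"{minutes:>6} min -> {plan.name}: ${monthly_cost(plan, minutes):.2f}")
```

Under these assumptions, Starter plus overage stays cheaper than Pro until just under 1,000 minutes a month, so the headline plan price alone is a poor guide.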
Best for: Developers building customer support bots, coaching tools, or any agent where emotional tone matters more than seeing a face.
ElevenLabs Conversational AI
ElevenLabs is known for text-to-speech quality, and its Conversational AI product (ElevenAgents) extends that into full turn-taking voice conversations. You configure an agent with a system prompt, assign it a voice, and connect it via API or their hosted platform. The agent handles speech recognition, LLM reasoning, and voice synthesis in one stack.
Voice quality is the main draw: the agents run on the same voice models behind ElevenLabs' text-to-speech products. Latency runs around 300 milliseconds for the speech-to-speech loop, depending on the LLM you connect.
Like Hume, the output is audio only. No video avatar. Pricing is billed by conversation minute, which keeps it predictable for outbound calling or scheduled voice interactions.
Pricing: Free tier includes 15 call minutes per month with up to 4 concurrent calls. The Creator plan is $22/month (currently $11 for the first month) with 275 minutes. Pro is $99/month with 1,238 minutes. Additional minutes above your plan allowance cost $0.08 per minute (burst pricing at $0.16/min if you exceed concurrency). Note: LLM usage and telephony are billed separately on top.
Best for: Outbound voice agents, interactive voice response systems, or any use case where your existing ElevenLabs voice library needs to hold a real conversation.
See the full ElevenLabs review for more on their voice library and TTS capabilities.
HeyGen LiveAvatar
HeyGen's LiveAvatar is the most video-forward tool on this list that's actually available. You submit a short video of yourself or your brand representative, HeyGen creates a streaming avatar model, and from that point the avatar streams in real time via WebRTC. You connect an LLM (ChatGPT being the most common integration) to power the responses, and the avatar's mouth, gestures, and expressions sync to the generated speech.
You get the "talking AI face" experience that audio-only agents do not offer. The trade-off is architecture: unlike Wan-Streamer's single unified model, HeyGen LiveAvatar is a modular pipeline. An LLM generates the text, a voice model speaks it, and a separate animation layer drives the avatar. Each handoff adds latency, which is why the overall response time is measurably higher than EVI 3 or ElevenLabs -- typically over one second end-to-end, depending on the LLM and connection speed.
Output is up to 1080p on higher plans. The Starter plan caps each session at 5 minutes; higher tiers extend this.
Pricing: LiveAvatar uses a credit system on top of a subscription. Paid plans start at $29/month for the Creator tier and scale up through Pro (from $49/month) and Business ($149/month plus per-seat fees). Credits fund streaming time: 1 credit covers 30 seconds in Full mode or 60 seconds in Lite mode. API access for streaming sessions is usage-based; check the LiveAvatar pricing page for current per-credit rates, as HeyGen has updated their tier structure in 2026.
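The credit conversion above is easy to budget for. This sketch assumes partial credits are rounded up to a whole credit, which the source does not confirm. The function name and mode keys are illustrative, not HeyGen API identifiers.

```python
import math

# Seconds of streaming covered by one credit, as stated in the pricing paragraph above.
SECONDS_PER_CREDIT = {"full": 30, "lite": 60}


def credits_for_session(duration_seconds: float, mode: str = "full") -> int:
    """Estimate credits consumed by one streaming session.

    Assumption: a partially used credit is billed as a whole credit.
    """
    if mode not in SECONDS_PER_CREDIT:
        raise ValueError(f"unknown mode: {mode!r}")
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return math.ceil(duration_seconds / SECONDS_PER_CREDIT[mode])


if __name__ == "__main__":
    # A 5-minute session, the cap mentioned for the Starter plan.
    print(credits_for_session(300, "full"))  # 10
    print(credits_for_session(300, "lite"))  # 5
```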
Best for: Sales kiosks, customer-facing education platforms, or anywhere you want a visible AI representative rather than a voice in someone's ear.
See the full HeyGen review for more on their avatar and video generation tools.
GPT-4o Realtime API
OpenAI's Realtime API gives you low-latency, full-duplex voice conversation with GPT-4o. Audio goes in, audio comes out, and the model handles speech recognition, language processing, and synthesis as a single step rather than three chained services. Interruption handling is built in -- you can speak over the model and it adjusts.
First-audio latency is under 250 milliseconds in favorable conditions. The model understands tone, pace, and emphasis in addition to words, so it can respond more appropriately to how something is said.
Like Hume and ElevenLabs, the output is audio only. OpenAI has not shipped a video output mode. It is a capable, well-documented API if you want the intelligence of GPT-4o in a voice wrapper -- but there is no avatar.
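Pricing: The Realtime API is priced by audio tokens. Current rates (as of mid-2026) are approximately $32 per million audio input tokens and $64 per million audio output tokens. At roughly 600 input tokens and 1,200 output tokens per minute of audio, that is about $0.02/min for input audio and $0.08/min for output audio. (The often-quoted $0.06 and $0.24 per-minute figures correspond to the older rates of $100 and $200 per million tokens.) Actual costs depend on conversation length and response verbosity. There is no free tier for the Realtime API; standard OpenAI account credits apply.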
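The token-to-minute conversion can be sketched as below. The tokens-per-minute constants are assumptions (about 600 for input audio and 1,200 for output audio, the ratio implied by OpenAI's launch-era per-minute figures). The function names are illustrative. Text tokens and cached-input discounts are ignored.

```python
# Rates quoted in the pricing paragraph above (USD per million audio tokens).
INPUT_USD_PER_M_TOKENS = 32.0
OUTPUT_USD_PER_M_TOKENS = 64.0

# Assumed token densities; not stated in this post. Adjust if OpenAI documents other values.
INPUT_TOKENS_PER_MIN = 600
OUTPUT_TOKENS_PER_MIN = 1200


def audio_cost_per_minute(usd_per_million_tokens: float, tokens_per_minute: int) -> float:
    """Convert a per-million-token rate into USD per minute of audio."""
    return usd_per_million_tokens * tokens_per_minute / 1_000_000


def conversation_cost(user_minutes: float, model_minutes: float) -> float:
    """Estimate audio cost for a call: user speech is input, model speech is output."""
    input_cost = user_minutes * audio_cost_per_minute(INPUT_USD_PER_M_TOKENS, INPUT_TOKENS_PER_MIN)
    output_cost = model_minutes * audio_cost_per_minute(OUTPUT_USD_PER_M_TOKENS, OUTPUT_TOKENS_PER_MIN)
    return input_cost + output_cost


if __name__ == "__main__":
    print(f"input:  ${audio_cost_per_minute(INPUT_USD_PER_M_TOKENS, INPUT_TOKENS_PER_MIN):.4f}/min")
    print(f"output: ${audio_cost_per_minute(OUTPUT_USD_PER_M_TOKENS, OUTPUT_TOKENS_PER_MIN):.4f}/min")
    print(f"10 min user + 5 min model: ${conversation_cost(10, 5):.3f}")
```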
Best for: Developers already in the OpenAI ecosystem who want reliable, low-latency voice agents with GPT-4o intelligence and minimal infrastructure overhead.
The research signal: Wan-Streamer v0.1
Alibaba's Wan Team published arXiv paper 2606.25041 on June 23, 2026 (revised June 25). The paper describes a single causal Transformer that handles video input, audio input, and text input simultaneously, and produces video output, audio output, and text output -- all within one model, without separate ASR, LLM, TTS, or avatar animation modules.
The numbers from the paper: model-side response latency of roughly 200 milliseconds; total interaction latency of roughly 550 milliseconds once 350 milliseconds of network overhead is included; 25 fps video output; and streaming units of 160 milliseconds. The current output resolution is 192p, intentionally low because this is a proof of concept, not a shipping product.
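These figures fit together with simple arithmetic. The sketch below uses only the numbers quoted above. The frames-per-unit value is derived here, not taken from the paper, and assumes each streaming unit carries exactly 160 milliseconds of video at the stated frame rate.

```python
# Figures as quoted in this post from the Wan-Streamer paper.
MODEL_LATENCY_MS = 200
NETWORK_OVERHEAD_MS = 350
VIDEO_FPS = 25
STREAMING_UNIT_MS = 160


def total_interaction_latency_ms(model_ms: int, network_ms: int) -> int:
    """Model-side latency plus network overhead."""
    return model_ms + network_ms


def frames_per_unit(fps: int, unit_ms: int) -> float:
    """Video frames covered by one streaming unit (derived, not stated in the paper)."""
    return fps * unit_ms / 1000


def units_per_second(unit_ms: int) -> float:
    """How many streaming units the model must emit per second to keep up."""
    return 1000 / unit_ms


if __name__ == "__main__":
    print(total_interaction_latency_ms(MODEL_LATENCY_MS, NETWORK_OVERHEAD_MS))  # 550
    print(frames_per_unit(VIDEO_FPS, STREAMING_UNIT_MS))                        # 4.0
    print(units_per_second(STREAMING_UNIT_MS))                                  # 6.25
```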
The architectural claim is significant. The modular tools above, such as ElevenLabs and HeyGen, chain specialized components, and each handoff adds latency and synchronization error. Hume EVI 3 and GPT-4o Realtime unify speech and language in one model but stop at audio output. Wan-Streamer's architecture learns perception, turn-taking, generation, and cross-modal sync jointly, end-to-end, across video as well as audio. The paper argues this is why it can see your face and respond on video while staying under 600 milliseconds total.
What Wan-Streamer is not:
- Not a tool you can use. There is no public API, no open-source code, no model weights on HuggingFace, and no pricing disclosure as of late June 2026. The website wan-streamer.com exists but is access-gated.
- Not part of the Wan 2.x video generation lineage in any practical sense. The Wan 2.x series (2.1 through 2.7) generates video clips from prompts. Wan-Streamer is an interactive conversational agent. Same team, different product.
- Not a finished product. The paper calls v0.1 a proof of concept. The 192p output resolution reinforces that.
The social signal around the paper -- 320,000+ views on a single X post -- reflects genuine interest in the architectural idea, not evidence that you can go build something with it today. If you need a real-time video agent this month, look at the four tools above.
If you want to watch this space, the signals to track are: a GitHub repository under the Wan-Video org, model weights appearing on HuggingFace under Wan-AI, or a public API announcement from Alibaba Cloud.
How to choose
| What you need | Tool to look at |
|---|---|
| Emotional nuance in voice, no avatar needed | Hume EVI 3 |
| Best voice quality for outbound calling or telephony | ElevenLabs Conversational AI |
| A visible AI avatar face in the conversation | HeyGen LiveAvatar |
| GPT-4o intelligence in a voice wrapper | GPT-4o Realtime API |
| Full video in + video out in a single unified model | Wait for Wan-Streamer or a competitor to ship |
The four live tools are genuinely useful today. They just occupy different points on the audio-video spectrum. Hume and ElevenLabs are strong on voice quality and low latency but produce no video face. HeyGen produces a video face but with higher latency and a modular architecture. GPT-4o Realtime is the most capable reasoning layer but stops at audio.
Wan-Streamer's contribution is proving -- at research level -- that all four capabilities (video in, audio in, video out, audio out) can live in one unified model without the synchronization overhead of a pipeline. Whether Alibaba ships a usable version of that architecture, or whether OpenAI, Google, or someone else gets there first, is an open question.
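For readers who prefer the comparison as data, the table and latency figures in this post can be encoded as a small lookup. This is only a sketch: the `Tool` class and `shortlist` function are illustrative names, and the latency values are the approximate figures quoted above. They are measured differently per vendor (first-audio, speech-to-speech, end-to-end), so they are not strictly comparable. HeyGen's "over one second" is entered as a 1,000 ms lower bound.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tool:
    name: str
    video_out: bool          # does the agent stream a video face back?
    approx_latency_ms: int   # rough figure quoted in this post
    available: bool          # usable today, per this post


TOOLS = [
    Tool("Hume EVI 3", False, 300, True),
    Tool("ElevenLabs Conversational AI", False, 300, True),
    Tool("HeyGen LiveAvatar", True, 1000, True),   # "over one second": lower bound
    Tool("GPT-4o Realtime API", False, 250, True),
    Tool("Wan-Streamer v0.1", True, 550, False),   # research paper only
]


def shortlist(
    need_video_out: bool,
    max_latency_ms: Optional[int] = None,
    only_available: bool = True,
) -> List[str]:
    """Return matching tool names, lowest quoted latency first."""
    matches = [
        tool for tool in TOOLS
        if (tool.video_out or not need_video_out)
        and (max_latency_ms is None or tool.approx_latency_ms <= max_latency_ms)
        and (tool.available or not only_available)
    ]
    return [tool.name for tool in sorted(matches, key=lambda tool: tool.approx_latency_ms)]


if __name__ == "__main__":
    print(shortlist(need_video_out=True))
    print(shortlist(need_video_out=False, max_latency_ms=300))
```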
FAQ
What is a real-time video AI agent? A real-time video AI agent is an AI system that takes live video and audio as input from a user, processes it, and responds with synthesized speech or video output -- all within a sub-second latency window. The term distinguishes these tools from video generation tools, which produce video clips on demand but do not hold a live conversation.
Is Wan-Streamer available to use? No. As of late June 2026, Wan-Streamer v0.1 is a research paper published by Alibaba's Wan Team on arXiv (2606.25041). There is no public API, no open-source code release, no model weights, and no pricing information. The project website exists but is access-gated. Treat it as a technical signal about where the category is heading, not a tool you can deploy.
Which real-time voice AI agent has the lowest latency? GPT-4o Realtime API targets under 250 milliseconds for first-audio response. Hume EVI 3 targets approximately 300 milliseconds. ElevenLabs Conversational AI is in a similar range. HeyGen LiveAvatar, which also outputs a video face, typically runs over one second end-to-end due to its modular architecture.
Does HeyGen LiveAvatar actually see you during the conversation? HeyGen LiveAvatar streams a pre-built avatar model that responds via a connected LLM. The avatar does not use your camera feed as a real-time visual input to the AI model -- it produces video output but does not process video input in the way Wan-Streamer is designed to. Most live use cases connect it to a voice pipeline: your audio goes in, the LLM generates a response, and the avatar lip-syncs the output.
What is the difference between Hume EVI 3 and ElevenLabs Conversational AI? Both are voice-only conversational AI agents (no video output). Hume EVI 3 is built around emotional intelligence: it reads tone and expression in your voice and adjusts its own vocal delivery accordingly. ElevenLabs Conversational AI is built around voice quality, drawing on its extensive voice model library. Both have free tiers; Hume's paid plans start at $3/month, while ElevenLabs' Creator plan is $22/month. For telephony and outbound calling, ElevenLabs has more native integrations.
