The Question

Design a Scalable Video Conferencing System

Design the core backend architecture for a real-time video conferencing platform like Zoom. The system must support millions of daily users and meetings with up to 100 concurrent participants. Focus specifically on minimizing end-to-end latency, handling media distribution at scale, and ensuring high availability during server failures. Detail your choice of media routing architecture (SFU vs. MCU), how signaling and handshakes are managed, and how you would optimize the network path for a global user base.

WebRTC

SFU

UDP

SRTP

WebSockets

Redis

PostgreSQL

Anycast DNS

Simulcast

Questions & Insights

Clarifying Questions

Scale: What is the targeted number of concurrent participants per meeting and total Daily Active Users (DAU)?

Assumption: 10M DAU, supporting up to 100 participants per meeting for the MVP.

Latency: What is the acceptable end-to-end latency for audio/video?

Assumption: Sub-150ms for a seamless conversational experience.

Features: Do we need advanced features like cloud recording, transcriptions, or breakout rooms in the MVP?

Assumption: No. Focus strictly on real-time video, audio, screen sharing, and simple chat (YAGNI).

Client Support: Are we targeting Web (WebRTC) or native apps?

Assumption: Backend will support a generic Real-time Transport Protocol (RTP) approach, primarily optimized for WebRTC-compatible clients.

Thinking Process

The core challenge of Zoom is not the "Meeting CRUD" but the low-latency distribution of high-bandwidth media streams.

How do we handle media distribution at scale? We use a Selective Forwarding Unit (SFU) architecture rather than a Multipoint Control Unit (MCU) to minimize server-side transcoding and reduce latency.

How do clients discover and connect to the right media server? A Signaling Service manages the handshake (SDP/ICE) and assigns clients to the closest available media node.

How do we maintain "Presence" and state? A distributed Session Store (Redis) tracks who is in which room and which media server is hosting the session.

How do we handle network volatility? Implement Simulcast, where clients push multiple bitrates, and the SFU forwards the appropriate one based on the receiver's downlink.
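The Simulcast forwarding decision can be sketched as follows. This is a minimal illustration, not a definitive SFU implementation: the layer names, the bitrate ladder, and the `headroom` factor are assumptions chosen for the example, and a real SFU would take the downlink figure from its bandwidth estimator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulcastLayer:
    rid: str           # layer identifier (hypothetical names: "q", "h", "f")
    bitrate_kbps: int  # approximate bitrate the sender publishes for this layer


# Hypothetical bitrate ladder; real values depend on codec and client settings.
LAYERS = [
    SimulcastLayer("q", 150),   # low quality
    SimulcastLayer("h", 500),   # medium quality
    SimulcastLayer("f", 1200),  # full quality
]


def pick_layer(layers, downlink_kbps, headroom=0.8):
    """Return the highest-bitrate layer that fits within headroom * downlink.

    Falls back to the lowest layer when nothing fits, so the receiver
    still gets some video rather than none.
    """
    if not layers:
        raise ValueError("at least one simulcast layer is required")
    budget = downlink_kbps * headroom
    ordered = sorted(layers, key=lambda layer: layer.bitrate_kbps)
    chosen = ordered[0]
    for layer in ordered:
        if layer.bitrate_kbps <= budget:
            chosen = layer
    return chosen
```

The headroom factor leaves part of the estimated downlink unused so that audio and retransmissions are not starved; the 0.8 value is illustrative only.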

Bonus Points

Cascading SFUs: For very large meetings, connect SFUs in a tree structure across regions to localize traffic and reduce cross-continental bandwidth costs.

Global Edge Optimization: Deploy Media Servers at the "Edge" (POPs) to terminate UDP streams as close to the user as possible, using a private backbone for inter-server communication.

SVC (Scalable Video Coding): Use H.264/VP9 SVC so that a single bitstream can be peeled into layers (resolution/frame rate) by the SFU, without the multiple client-side encoders that Simulcast requires.

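Loss Recovery and Jitter Buffer Management: The SFU monitors packet loss per stream and uses RTCP feedback to maintain stream integrity: NACK (Negative Acknowledgement) to request retransmission of missing packets, and PLI (Picture Loss Indication) to request a fresh keyframe when a receiver can no longer decode. Receiver-side jitter buffers absorb reordering and delay variation.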
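The loss monitoring that would trigger a NACK can be sketched as gap detection over RTP sequence numbers, which are 16 bits wide and wrap around. The function name and the "half the sequence space" rule for treating a packet as late are assumptions made for this illustration.

```python
SEQ_MOD = 1 << 16  # RTP sequence numbers are 16-bit and wrap around


def missing_between(highest_seen, received):
    """Return the sequence numbers skipped between highest_seen and received.

    Returns an empty list for in-order, duplicate, or late (reordered)
    packets. A forward distance of half the sequence space or more is
    treated as an older packet arriving late rather than a huge gap.
    """
    gap = (received - highest_seen) % SEQ_MOD
    if gap == 0 or gap >= SEQ_MOD // 2:
        return []
    return [(highest_seen + i) % SEQ_MOD for i in range(1, gap)]
```

A server could place the returned numbers in a NACK, and fall back to a PLI when the gap is too large to be worth retransmitting; that threshold is a tuning decision not covered here.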

Design Breakdown

Functional Requirements

Core Use Cases:

Create/Join a meeting via a unique ID.

Real-time Synchronous Video/Audio streaming.

Screen sharing capability.

Real-time text chat during the meeting.

Scope Control:

In-scope: Signaling, Media Routing (SFU), Presence, Meeting Metadata.

Out-of-scope: Recording, Virtual Backgrounds (Client-side), PSTN integration (Dial-in), End-to-end encryption (E2EE) key management (assume standard TLS/SRTP for MVP).

Non-Functional Requirements

Scale: Support 100k concurrent meetings.

Latency: < 150ms glass-to-glass latency.

Availability & Reliability: 99.99% availability; if a media server fails, clients should reconnect to a new one within 5 seconds.

Consistency: Eventual consistency for meeting history; Strong consistency for session occupancy (preventing over-filling rooms).

Fault Tolerance: Media servers must be stateless regarding long-term data; if one dies, the signaling layer reroutes users.

Estimation

Traffic:

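10M DAU, average 2 meetings/day = 20M meeting joins/day. Since most meetings have several participants, this is an upper bound on meetings created per day.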

Peak concurrent users (PCU): 1M.

Audio/Video Bandwidth: ~1Mbps per participant (high quality).

Storage:

Meeting metadata: 1KB per meeting. Taking 20M meetings/day as an upper bound, 20M * 1KB = 20GB/day.

Bandwidth:

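1M PCU * 1Mbps = 1 Tbps of aggregate ingress at peak. Because an SFU forwards each stream to multiple receivers, aggregate egress is a multiple of this figure, depending on how many streams each participant receives. This necessitates a decentralized Media Server distribution.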
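The back-of-envelope figures above can be reproduced with a few lines of arithmetic. The inputs are the stated assumptions (10M DAU, 2 meetings per user per day, 1KB of metadata, 1M PCU, 1Mbps per participant, 100k concurrent meetings); the variable names are illustrative.

```python
DAU = 10_000_000
MEETINGS_PER_USER_PER_DAY = 2
METADATA_BYTES_PER_MEETING = 1_000   # 1 KB
PEAK_CONCURRENT_USERS = 1_000_000
MBPS_PER_PARTICIPANT = 1
CONCURRENT_MEETINGS = 100_000        # from the non-functional requirements

# Daily volume used for the storage estimate.
per_day = DAU * MEETINGS_PER_USER_PER_DAY
metadata_gb_per_day = per_day * METADATA_BYTES_PER_MEETING / 1e9

# Aggregate media bandwidth at peak: 1M participants at 1 Mbps each.
peak_media_tbps = PEAK_CONCURRENT_USERS * MBPS_PER_PARTICIPANT / 1e6

# Cross-check: implied average meeting size at peak.
avg_participants_per_meeting = PEAK_CONCURRENT_USERS / CONCURRENT_MEETINGS

print(per_day, metadata_gb_per_day, peak_media_tbps, avg_participants_per_meeting)
```

The implied average of 10 participants per meeting is well under the 100-participant cap, so the per-meeting limit constrains the largest rooms rather than the typical one.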

Blueprint

Concise Summary: A geo-distributed architecture using a central Signaling/API layer to manage metadata and a fleet of edge-deployed SFUs (Selective Forwarding Units) to route media streams with minimal latency.

Major Components:

API Gateway/Signaling: Manages WebRTC handshakes and meeting orchestration.

Meeting Service: Handles meeting creation, validation, and participant lifecycle.

Selective Forwarding Unit (SFU): Routes media packets from one sender to many receivers without transcoding.

Presence/Session Store: Tracks real-time mapping of User -> Meeting -> SFU Node.

Simplicity Audit: We avoid an MCU (which decodes, mixes, and re-encodes streams on the server) because it is CPU-intensive and adds latency. We use WebSockets for signaling because they are widely supported and robust enough for the MVP.

Architecture Decision Rationale:

SFU vs MCU: SFU is chosen for the MVP because it scales horizontally easily and puts the encoding burden on the clients.

Functional Satisfaction: Covers join/stream/chat.

Non-functional Satisfaction: Edge deployment of SFUs minimizes latency; stateless API services ensure high availability.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:

Anycast DNS: Routes users to the nearest Signaling and SFU cluster based on network proximity.

Security & Perimeter:

API Gateway: Handles JWT-based authentication.

Rate Limiting: Applied to "Create Meeting" and "Join" endpoints to prevent DDoS on signaling resources.

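TLS/SRTP Termination: Signaling uses WSS (Secure WebSockets); Media uses SRTP (Secure RTP). For WebRTC clients, the SRTP keys are derived from a DTLS handshake on the media path (DTLS-SRTP), and Signaling carries the certificate fingerprints in the SDP so each side can verify its peer.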
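The rate limiting on the "Create Meeting" and "Join" endpoints could be implemented in several ways; the sketch below uses a token bucket, which is an assumption of this example rather than something the design prescribes. The class name and parameters are hypothetical, and the clock is injectable so the behavior is deterministic.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens/second."""

    def __init__(self, rate_per_sec, capacity, now=None):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so initial bursts succeed
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Consume one token if available; return whether the request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In practice one bucket would be kept per user or per source IP at the gateway; a shared store would be needed if several gateway instances must enforce the same limit.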

Service

Topology & Scaling:

Stateless API Services: Scaled based on CPU/Request count.

Signaling Service: Stateful via Persistent WebSockets; uses heartbeats to detect client disconnection.
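Heartbeat-based disconnect detection can be sketched as a map from client to last-seen time that is swept periodically. The class name and the 15-second default timeout are assumptions for illustration; timestamps are passed in explicitly to keep the example deterministic.

```python
class HeartbeatTracker:
    """Tracks the last heartbeat per client and reports clients that timed out."""

    def __init__(self, timeout_sec=15.0):
        self.timeout_sec = timeout_sec
        self.last_seen = {}  # client_id -> timestamp of the last heartbeat

    def beat(self, client_id, now):
        """Record a heartbeat received from client_id at time `now`."""
        self.last_seen[client_id] = now

    def expired(self, now):
        """Return clients whose last heartbeat is older than the timeout, and forget them."""
        dead = [c for c, t in self.last_seen.items() if now - t > self.timeout_sec]
        for client_id in dead:
            del self.last_seen[client_id]
        return dead
```

The Signaling Service would call `expired` on a timer and, for each returned client, remove it from presence and notify the other participants.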

API Schema Design:

POST /v1/meetings: Creates a meeting; Returns meeting_id.

POST /v1/meetings/{id}/join: Validates user and registers them as a participant; Returns an SFU_IP and Token for media handshake. (POST rather than GET, because joining changes server state.)

Signaling (WebSocket): offer, answer, ice-candidate messages for WebRTC negotiation.
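The three signaling message types could be carried as small JSON envelopes over the WebSocket. The envelope fields below (`type`, `meeting_id`, `from`, `payload`) are hypothetical names chosen for this sketch; only the message types come from the schema above.

```python
import json

SIGNALING_TYPES = {"offer", "answer", "ice-candidate"}


def encode_message(msg_type, meeting_id, sender_id, payload):
    """Serialize a signaling message, rejecting unknown types."""
    if msg_type not in SIGNALING_TYPES:
        raise ValueError(f"unsupported signaling type: {msg_type}")
    return json.dumps({
        "type": msg_type,
        "meeting_id": meeting_id,
        "from": sender_id,
        "payload": payload,  # e.g. {"sdp": "..."} or {"candidate": "..."}
    })


def decode_message(raw):
    """Parse a signaling message and validate its type."""
    msg = json.loads(raw)
    if msg.get("type") not in SIGNALING_TYPES:
        raise ValueError(f"unsupported signaling type: {msg.get('type')}")
    return msg
```

Validating the type on both encode and decode keeps malformed client messages from reaching the negotiation logic.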

Resilience & Reliability:

SFU Health Checks: If an SFU node fails, the Meeting Service marks it as "unhealthy" and the Signaling service instructs clients to reconnect to a different SFU node.

Storage

Access Pattern:

Write-heavy for Presence (heartbeats).

Read-heavy for Meeting Metadata during join.

Database Table Design:

Meetings Table (PostgreSQL): id (UUID), host_id, start_time, status, config (JSON).

Participants Table (PostgreSQL): id, meeting_id, user_id, joined_at.

Technical Selection:

PostgreSQL: For metadata due to strong consistency requirements for meeting ownership and auditing.

Redis: For Session Store. High-speed TTL-based storage for "Who is on which SFU".

Distribution Logic:

Shard Meetings table by meeting_id.
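Sharding by meeting_id can be sketched as hashing the ID to a shard index. The function name and the choice of SHA-256 with simple modulo placement are assumptions of this example; modulo placement moves most keys when the shard count changes, so a production system might prefer a scheme that limits that movement.

```python
import hashlib


def shard_for(meeting_id: str, num_shards: int) -> int:
    """Map a meeting ID to a shard index in [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(meeting_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer; a stable hash is required,
    # so Python's built-in hash() (randomized per process) is not suitable.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because all rows for a meeting share the same meeting_id, lookups during join touch a single shard.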

Cache

Purpose & Justification: Redis acts as the Presence Store. It maps meeting_id -> sfu_node_id and user_id -> status.

Key-Value Schema:

Key: meeting:sfu:{meeting_id}, Value: sfu_ip_address, TTL: 2 hours (refreshed while the meeting is active, so that longer meetings keep their mapping).

Key: presence:{meeting_id}, Value: Set of UserIDs.

Failure Handling: If Redis fails, use the Meeting Service's fallback logic to rediscover SFUs (though this causes temporary join latency).
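The key schema and the occupancy guarantee (no over-filled rooms) can be illustrated with an in-memory stand-in for the presence store. This is a sketch, not Redis client code: the class and method names are hypothetical, and the lock plays the role of whatever atomic check-and-add mechanism the real store provides (for example a server-side script).

```python
import threading

MAX_PARTICIPANTS = 100  # per-meeting cap from the requirements


class PresenceStore:
    """In-memory stand-in that mirrors the two key patterns of the schema."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sfu = {}      # "meeting:sfu:{meeting_id}" -> (sfu address, expiry time)
        self._members = {}  # "presence:{meeting_id}" -> set of user IDs

    def assign_sfu(self, meeting_id, sfu_addr, now, ttl_sec=7200):
        """Set or refresh the meeting -> SFU mapping with a TTL (2 hours by default)."""
        with self._lock:
            self._sfu[f"meeting:sfu:{meeting_id}"] = (sfu_addr, now + ttl_sec)

    def get_sfu(self, meeting_id, now):
        """Return the SFU address, or None if unset or expired."""
        with self._lock:
            entry = self._sfu.get(f"meeting:sfu:{meeting_id}")
            if entry is None or entry[1] <= now:
                return None
            return entry[0]

    def join(self, meeting_id, user_id, limit=MAX_PARTICIPANTS):
        """Atomically add a user unless the room is full; rejoining is idempotent."""
        with self._lock:
            members = self._members.setdefault(f"presence:{meeting_id}", set())
            if user_id in members:
                return True
            if len(members) >= limit:
                return False
            members.add(user_id)
            return True

    def leave(self, meeting_id, user_id):
        """Remove a user from the meeting's presence set if present."""
        with self._lock:
            self._members.get(f"presence:{meeting_id}", set()).discard(user_id)
```

The important property is that the size check and the insertion happen as one step; performing them as two separate round trips would let concurrent joins exceed the limit.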

Messaging

Purpose & Decoupling: Used for the meeting chat and internal events (e.g., "Participant Joined" to trigger UI updates).

Event / Topic Schema: Topic: meeting.events.{meeting_id}.

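Technical Selection: Redis Pub/Sub for the MVP. It provides low-latency message delivery for users currently online in a specific meeting without the complexity of Kafka. Delivery is fire-and-forget: messages published while a client is disconnected are not replayed, which is acceptable for in-meeting chat in the MVP.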

Infrastructure (Optional)

Observability:

Metrics: Monitor "Packet Loss Ratio" and "Round Trip Time (RTT)" per SFU node.

Distributed Coordination:

Service Discovery: SFU nodes register themselves with a registry (e.g., Consul or internal DB) with their current load (stream count) so the Signaling service can perform load balancing.
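Load-aware SFU assignment from the registry can be sketched as filtering out unhealthy or full nodes, preferring the client's region, and picking the least-utilized candidate. The `SfuNode` fields and the region-first policy are assumptions of this example.

```python
from dataclasses import dataclass


@dataclass
class SfuNode:
    node_id: str
    region: str
    stream_count: int  # current load reported to the registry
    max_streams: int   # capacity of the node
    healthy: bool = True


def pick_sfu(nodes, region):
    """Return the least-utilized healthy node, preferring the client's region.

    Falls back to other regions when no local node has capacity, and
    returns None when no node can accept another stream.
    """
    candidates = [n for n in nodes if n.healthy and n.stream_count < n.max_streams]
    if not candidates:
        return None
    local = [n for n in candidates if n.region == region]
    pool = local or candidates
    return min(pool, key=lambda n: n.stream_count / n.max_streams)
```

The same function can serve failover: once a node is marked unhealthy it drops out of the candidate list, and affected clients are assigned a new node on reconnect.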

Wrap Up

Advanced Topics

Trade-offs (SFU vs Mesh):

We reject P2P Mesh because it doesn't scale beyond 3-4 people: each user must upload n-1 copies of their stream, which quickly saturates typical uplinks and client CPU.

We chose SFU over MCU to keep server costs lower and latency minimal, acknowledging that clients need more downlink bandwidth.
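The trade-off can be made concrete by counting media streams per client in each topology. This is a simplified model: it counts one published stream per sender (ignoring extra Simulcast layers) and assumes every participant receives everyone else.

```python
def per_client_streams(topology, n):
    """Return (uploads, downloads) per client for an n-party call."""
    if n < 2:
        return (0, 0)
    if topology == "mesh":
        return (n - 1, n - 1)  # one copy sent to, and received from, every peer
    if topology == "sfu":
        return (1, n - 1)      # publish once; the SFU fans out to receivers
    if topology == "mcu":
        return (1, 1)          # publish once; receive one mixed stream
    raise ValueError(f"unknown topology: {topology}")


for topology in ("mesh", "sfu", "mcu"):
    print(topology, per_client_streams(topology, 10))
```

For a 10-person call this gives 9 uploads per client in a mesh against 1 with an SFU, at the cost of 9 downloads that an MCU would collapse into 1 by spending server CPU on mixing.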

Reliability:

Reconnection Logic: If the media (UDP) path drops, the client first attempts a "warm" reconnect to the same SFU via an ICE restart, which keeps the existing session; if that fails, it falls back to a full rejoin through Signaling.

Bottleneck Analysis:

The primary bottleneck is the SFU's Network I/O. Optimization involves using kernel-level packet forwarding (XDP/eBPF) or specialized libraries like DPDK if scale exceeds standard Linux socket performance.

Security:

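Media is encrypted with SRTP. The Signaling server facilitates the key exchange by relaying the DTLS fingerprints in the SDP, but it is not in the media path. Because the SFU terminates SRTP, encryption is hop-by-hop (client to SFU), consistent with E2EE being out of scope for the MVP.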