The Question

Scalable Video Conferencing System (Zoom/Webex)

Design a globally distributed video conferencing system capable of supporting millions of concurrent users. The system should handle high-density meetings (up to 1,000 participants) with sub-200ms latency. Detail the media routing strategy (SFU vs MCU), the signaling mechanism for session establishment, and how you would handle varying network conditions and global scaling across multiple regions.

WebRTC

SFU

UDP

SRTP

WebSockets

Redis

PostgreSQL

Simulcast

SVC

TURN/STUN

Questions & Insights

Clarifying Questions

Scale and Participant Limit: What is the maximum number of participants per meeting for the MVP, and what load should we expect (peak concurrent users, or daily participants, e.g., 300M per day)?

Geographic Distribution: Does the system need to support global low-latency connectivity across different continents?

Feature Set: Are we prioritizing core video/audio and screen sharing over advanced features like break-out rooms, virtual backgrounds, or cloud recording for the MVP?

Network Constraints: Should the design account for varying network conditions (3G/4G vs. Fiber) and firewall traversals (NAT)?

Assumptions:

Scale: Support up to 1,000 participants per meeting; 10M concurrent users globally.

Latency: End-to-end latency must be < 200ms for a "real-time" feel.

MVP Focus: Real-time audio, video, screen sharing, and group chat.

Connectivity: Use WebRTC standards for media transport.

Thinking Process

Core Bottleneck: How do we distribute high-bandwidth video streams to hundreds of participants without melting the sender's upload bandwidth?

Progressive Logic:

Signaling: How do users find each other and negotiate media capabilities? (WebSockets + Metadata Service).

Media Routing: Why use a Selective Forwarding Unit (SFU) instead of a Multipoint Control Unit (MCU) or P2P? (SFU provides the best balance of scalability and server CPU cost).

Global Latency: How do we ensure a user in London and a user in Tokyo have a low-latency experience? (Geo-located Media PoPs).

Adaptability: How do we handle a participant with a poor 3G connection in a high-def meeting? (Simulcast/SVC).

Bonus Points

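Simulcast & SVC: With simulcast, the client sends three encodings of the same video (low, medium, and high resolution) instead of one, and the SFU forwards the appropriate encoding based on each receiver's bandwidth. With SVC, the client sends a single layered stream, and the SFU drops enhancement layers for receivers that cannot afford them.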

Cascading SFUs: For massive meetings (10k+), SFUs are linked in a tree structure to distribute the load across regions.

Custom Congestion Control: Implementing GCC (Google Congestion Control) or BBR at the application layer over UDP to handle packet loss gracefully.

E2EE (End-to-End Encryption): Implementing Insertable Streams (WebRTC) to ensure the SFU only routes encrypted packets without having the keys to decrypt them.
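The simulcast forwarding decision described above can be sketched as a small selection function. This is a minimal illustration, not a production algorithm: the layer bitrates, the `headroom` factor, and the names `SimulcastLayer` and `select_layer` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulcastLayer:
    name: str
    bitrate_kbps: int


# Hypothetical layer ladder; real bitrates depend on codec and content.
LAYERS = [
    SimulcastLayer("low", 150),
    SimulcastLayer("med", 500),
    SimulcastLayer("high", 1000),
]


def select_layer(estimated_kbps, layers=LAYERS, headroom=0.8):
    """Return the highest-bitrate layer that fits the receiver's budget.

    The budget is headroom * estimated bandwidth, leaving room for audio
    and retransmissions. Returns None when no video layer fits, in which
    case the caller can fall back to audio-only.
    """
    budget = estimated_kbps * headroom
    fitting = [layer for layer in layers if layer.bitrate_kbps <= budget]
    if not fitting:
        return None
    return max(fitting, key=lambda layer: layer.bitrate_kbps)
```

A receiver estimated at 700 kbps would get the "med" layer here (budget 560 kbps), while one at 100 kbps would get no video at all.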

Design Breakdown

Functional Requirements

Core Use Cases:

Create/Join a meeting via a unique ID/URL.

Real-time Audio/Video streaming with <200ms latency.

Screen sharing capabilities.

In-meeting text chat.

Participant presence (Who is in the room).

Scope Control:

In-Scope: Signaling, Media Routing (SFU), Presence, and basic Metadata management.

Out-of-Scope: Cloud recording, Transcription/Captions, Virtual backgrounds (client-side), and PSTN (Phone) integration for MVP.

Non-Functional Requirements

Scale: Support 10M+ concurrent participants.

Latency: < 200ms end-to-end; Jitter buffer management to handle network fluctuations.

Availability: 99.99% availability; meetings must persist even if a single signaling server fails.

Consistency: Eventual consistency for meeting metadata; strong consistency for presence during the session.

Reliability: Graceful degradation (Audio-only mode if bandwidth drops).

Security: SRTP for media in transit, using AES-256 cipher suites; TLS (HTTPS/WSS) for signaling.

Estimation

Traffic:

10M concurrent users. If average meeting size is 10, that's 1M concurrent meetings.

Video Stream: 1Mbps (720p). Audio: 50kbps.

Aggregated Bandwidth: 10M users * 1Mbps = 10 Tbps of ingress (upload) throughput. Egress is a multiple of that, since each participant receives several streams.

Storage:

Meeting metadata is small (~2KB per meeting). 100M meetings/day = 200GB/day.

Bandwidth:

Signaling: Low (JSON over WebSockets).

Media: High (UDP/SRTP). This is the primary cost driver.
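The back-of-the-envelope numbers above can be reproduced in a few lines. The ingress and storage figures come straight from the estimates; the egress figure is only an illustration and rests on assumed values (4 received streams per user at an average of 500 kbps each) that are not part of the original estimate.

```python
USERS = 10_000_000
VIDEO_KBPS = 1_000            # 1 Mbps per 720p stream
MEETINGS_PER_DAY = 100_000_000
METADATA_KB = 2

KBPS_PER_TBPS = 1_000_000_000
KB_PER_GB = 1_000_000


def ingress_tbps(users=USERS, kbps=VIDEO_KBPS):
    """Total upload into the SFU fleet: one video stream per user."""
    return users * kbps / KBPS_PER_TBPS


def egress_tbps(users=USERS, streams_received=4, avg_kbps=500):
    """Total download from the SFU fleet.

    streams_received and avg_kbps are assumptions for illustration;
    they depend on Last-N settings and simulcast layer selection.
    """
    return users * streams_received * avg_kbps / KBPS_PER_TBPS


def metadata_gb_per_day(meetings=MEETINGS_PER_DAY, kb=METADATA_KB):
    """Daily metadata volume in decimal gigabytes."""
    return meetings * kb / KB_PER_GB


print(ingress_tbps())         # 10.0
print(egress_tbps())          # 20.0
print(metadata_gb_per_day())  # 200.0
```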

Blueprint

The architecture uses a decoupled signaling and media path. Signaling (joining, leaving, permissions) happens over HTTPS/WebSockets, while media (audio/video) flows through a global network of Selective Forwarding Units (SFUs).

Signaling Service: Manages meeting state, participant lists, and WebRTC SDP (Session Description Protocol) exchange.

SFU (Media Server): Acts as a high-performance packet router. It receives one stream from a sender and forwards it to N receivers without transcoding.

Presence Service: Tracks who is online/offline in a meeting using a fast in-memory store.

Simplicity Audit: By choosing SFU over MCU, we avoid expensive server-side video transcoding, so capacity scales with network I/O and packet-forwarding CPU rather than with encoding cost.

Architecture Decision Rationale:

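Why SFU?: An MCU decodes, mixes, and re-encodes video on the server, which is too CPU-intensive at this scale. P2P (mesh) breaks down beyond a handful of participants because each client must upload a separate copy of its stream to every peer. SFU is the common choice for large-scale conferencing.

Functional Satisfaction: Covers all real-time needs and scales to 1,000 participants when combined with Simulcast and Last-N forwarding.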

Non-functional Satisfaction: PoP-based deployment minimizes physical distance, reducing latency.
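To make the SFU vs MCU vs P2P comparison concrete, the per-client stream counts for each topology can be written down directly. This is a simplified model that counts one video stream per participant and ignores simulcast layers and Last-N limits; the function name is hypothetical.

```python
def streams_per_client(topology, n):
    """Return (uplink, downlink) video stream counts for one client
    in an n-party call.

    mesh: every client sends to and receives from each of the n-1 peers.
    sfu:  one upload to the server, n-1 forwarded streams back.
    mcu:  one upload, one mixed composite stream back.
    """
    if n < 2:
        raise ValueError("a call needs at least two participants")
    if topology == "mesh":
        return (n - 1, n - 1)
    if topology == "sfu":
        return (1, n - 1)
    if topology == "mcu":
        return (1, 1)
    raise ValueError(f"unknown topology: {topology}")


for topology in ("mesh", "sfu", "mcu"):
    print(topology, streams_per_client(topology, 10))
# mesh (9, 9)
# sfu (1, 9)
# mcu (1, 1)
```

The mesh uplink grows with meeting size, which is what exhausts the sender's upload bandwidth. The SFU keeps uplink constant and pushes the fan-out cost to the server's network, while the MCU trades that for server-side transcoding.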

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Global Traffic Routing: Latency-based DNS routing to direct users to the nearest Signaling and Media PoP.

API Gateway: Handles SSL termination and JWT-based authentication.

UDP Entry: SFUs listen on a range of UDP ports. STUN lets clients discover their public address for NAT traversal; TURN servers (often integrated into the SFU edge) relay media when symmetric NATs or firewalls block direct UDP.
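Latency-based routing is normally delegated to the DNS or traffic-management layer, but the decision it makes can be sketched as follows. The PoP names, the RTT measurements, and the `pick_pop` helper are hypothetical and only illustrate the selection rule.

```python
def pick_pop(rtt_ms_by_pop, healthy=None):
    """Return the PoP with the lowest measured round-trip time.

    rtt_ms_by_pop maps a PoP name to the client's measured RTT in ms.
    healthy, if given, restricts the choice to PoPs currently in service.
    """
    candidates = {
        pop: rtt
        for pop, rtt in rtt_ms_by_pop.items()
        if healthy is None or pop in healthy
    }
    if not candidates:
        raise LookupError("no healthy PoP available")
    return min(candidates, key=candidates.get)


measurements = {"lhr": 12, "nrt": 240, "iad": 85}
print(pick_pop(measurements))                          # lhr
print(pick_pop(measurements, healthy={"nrt", "iad"}))  # iad
```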

Service

Signaling Service:

Protocol: WebSockets (Socket.io or raw WS) for real-time events.

Responsibilities: SDP Exchange (Offer/Answer), ICE Candidate exchange, Room management.

SLA: 99.9% (Critical for joining, but media can continue if signaling blips).
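The relay role of the Signaling Service can be sketched without any network code. In the sketch below each endpoint (a client or the SFU) has an outbox that stands in for its WebSocket connection. The message shape (`type`, `to`, `from`) and the class name are assumptions for illustration, not a defined protocol.

```python
import json
from collections import deque

# Message types the server relays between endpoints.
RELAYED_TYPES = {"offer", "answer", "ice-candidate"}


class SignalingRoom:
    """In-memory stand-in for one meeting's signaling state."""

    def __init__(self, meeting_id):
        self.meeting_id = meeting_id
        self.outboxes = {}  # endpoint id -> deque of JSON strings

    def join(self, endpoint_id):
        self.outboxes[endpoint_id] = deque()
        self._broadcast({"type": "joined", "user": endpoint_id},
                        exclude=endpoint_id)

    def leave(self, endpoint_id):
        self.outboxes.pop(endpoint_id, None)
        self._broadcast({"type": "left", "user": endpoint_id})

    def relay(self, sender, raw):
        """Forward an SDP or ICE message to the endpoint named in 'to'."""
        msg = json.loads(raw)
        if msg.get("type") not in RELAYED_TYPES:
            raise ValueError("unsupported message type")
        target = msg.get("to")
        if sender not in self.outboxes or target not in self.outboxes:
            raise KeyError("sender or target is not in the room")
        msg["from"] = sender  # stamped server-side, never trusted from client
        self.outboxes[target].append(json.dumps(msg))

    def _broadcast(self, event, exclude=None):
        payload = json.dumps(event)
        for endpoint_id, outbox in self.outboxes.items():
            if endpoint_id != exclude:
                outbox.append(payload)
```

The server never inspects the SDP body; it only validates the envelope and routes it, which keeps the signaling tier stateless apart from room membership.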

Presence Service:

Heartbeat mechanism (every 5-10s) to detect disconnected users.

Pub/Sub to notify other participants when a user joins/leaves.
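A minimal sketch of the heartbeat logic, assuming a 15-second timeout (roughly two to three missed heartbeats) and an in-process dictionary in place of the shared in-memory store. The class and method names are hypothetical. The clock is injectable so the behavior can be checked without sleeping.

```python
import time


class PresenceTracker:
    """Tracks the last heartbeat per (meeting, user) and expires stale ones."""

    def __init__(self, timeout_s=15.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = {}  # (meeting_id, user_id) -> last heartbeat time

    def heartbeat(self, meeting_id, user_id):
        """Record a heartbeat. Returns True on the first one, which is
        the moment to publish a "joined" event to the room."""
        key = (meeting_id, user_id)
        is_new = key not in self.last_seen
        self.last_seen[key] = self.clock()
        return is_new

    def sweep(self):
        """Drop entries whose heartbeat is older than the timeout and
        return them so the caller can publish "left" events."""
        now = self.clock()
        expired = [key for key, seen in self.last_seen.items()
                   if now - seen > self.timeout_s]
        for key in expired:
            del self.last_seen[key]
        return expired
```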

Storage

Access Pattern:

Write-heavy when meetings start; Read-heavy for session validation.

Database (PostgreSQL):

Meetings: (meeting_id [PK], host_id, start_time, settings_json).

Participants: (id, meeting_id, user_id, joined_at).

Technical Selection: Relational DB (Postgres) is sufficient for metadata for the MVP. It provides ACID for meeting creation and easy querying.
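The two tables above can be sketched as a schema plus a transactional create. To keep the example self-contained it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so the column types (TEXT for timestamps and JSON) are assumptions; in Postgres they would more likely be TIMESTAMPTZ and JSONB. The index and the helper names are also illustrative.

```python
import sqlite3

SCHEMA = """
CREATE TABLE meetings (
    meeting_id    TEXT PRIMARY KEY,
    host_id       TEXT NOT NULL,
    start_time    TEXT NOT NULL,
    settings_json TEXT NOT NULL DEFAULT '{}'
);
CREATE TABLE participants (
    id         INTEGER PRIMARY KEY,
    meeting_id TEXT NOT NULL REFERENCES meetings(meeting_id),
    user_id    TEXT NOT NULL,
    joined_at  TEXT NOT NULL
);
CREATE INDEX idx_participants_meeting ON participants(meeting_id);
"""


def open_db():
    """Open an in-memory database with the sketch schema applied."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def create_meeting(conn, meeting_id, host_id, start_time):
    """Insert the meeting and its host as first participant atomically."""
    with conn:  # commits on success, rolls back if either insert fails
        conn.execute(
            "INSERT INTO meetings (meeting_id, host_id, start_time) "
            "VALUES (?, ?, ?)",
            (meeting_id, host_id, start_time),
        )
        conn.execute(
            "INSERT INTO participants (meeting_id, user_id, joined_at) "
            "VALUES (?, ?, ?)",
            (meeting_id, host_id, start_time),
        )
```

A duplicate `meeting_id` fails the primary-key check on the first insert, and the transaction leaves no partial participant row behind.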

Cache

Purpose: Store transient session state and presence.

Schema:

meeting:{id}:participants -> Set of user IDs.

user:{id}:status -> Hash with fields status ("online"), meeting_id, and sfu_endpoint.

Technical Selection: Redis. Its high throughput and built-in TTLs are perfect for session-based presence.
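The key schema above might be written and refreshed as follows. The function accepts any client object that exposes the `sadd`, `hset`, and `expire` methods of redis-py, so it can be exercised without a running server. The 30-second TTL and the helper names are assumptions.

```python
def participants_key(meeting_id):
    return f"meeting:{meeting_id}:participants"


def status_key(user_id):
    return f"user:{user_id}:status"


def record_join(r, meeting_id, user_id, sfu_endpoint, ttl_s=30):
    """Add the user to the meeting set and write a status hash with a TTL.

    r is assumed to behave like a redis-py client. The status key expires
    unless a later heartbeat calls this again, which is how a crashed
    client eventually shows as offline.
    """
    r.sadd(participants_key(meeting_id), user_id)
    r.hset(
        status_key(user_id),
        mapping={
            "status": "online",
            "meeting_id": meeting_id,
            "sfu_endpoint": sfu_endpoint,
        },
    )
    r.expire(status_key(user_id), ttl_s)
```

Note that the set entry has no TTL of its own in this sketch; a sweeper or a leave handler would still need to remove the user from `meeting:{id}:participants`.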

Wrap Up

Advanced Topics

Trade-offs (SFU vs MCU): We chose SFU. The trade-off is higher client-side CPU usage (decoding multiple streams), but it enables significantly higher server-side scalability and lower end-to-end latency.

Reliability:

SFU Failure: If an SFU node dies, the Signaling service detects the heartbeat loss and triggers a "re-connect" event to all clients in that room, assigning them a new SFU.

Jitter Buffer: Clients implement a jitter buffer to reorder UDP packets that arrive out of sequence and to smooth out variable arrival delay before playback.
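The reordering half of a jitter buffer can be sketched as below. This is a deliberately simplified, count-based version: real jitter buffers are time-based and adapt their depth to measured jitter, and RTP sequence numbers wrap at 16 bits, which this sketch ignores. The `depth` parameter and class name are assumptions.

```python
class JitterBuffer:
    """Releases packets in sequence order, tolerating limited reordering."""

    def __init__(self, depth=3, first_seq=0):
        self.depth = depth          # max packets held while waiting on a gap
        self.next_seq = first_seq   # next sequence number to release
        self.pending = {}           # seq -> payload

    def push(self, seq, payload):
        """Accept one packet and return payloads that are now in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # late arrival or duplicate: drop it
        self.pending[seq] = payload
        released = []
        while self.pending:
            if self.next_seq in self.pending:
                released.append(self.pending.pop(self.next_seq))
                self.next_seq += 1
            elif len(self.pending) > self.depth:
                # The gap has outlived the buffer: treat the missing
                # packet as lost and skip ahead to the oldest one held.
                self.next_seq = min(self.pending)
            else:
                break  # keep waiting for the missing packet
        return released


jb = JitterBuffer(depth=2)
print(jb.push(0, "a"))  # ['a']
print(jb.push(2, "c"))  # []
print(jb.push(1, "b"))  # ['b', 'c']
```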

Security:

SRTP: Encrypts the media payload.

Room Passwords: Scoped tokens required to join a specific meeting_id.
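One way to scope a join token to a single meeting_id is an HMAC-signed claim set, sketched below with the standard library only. This is a simplified stand-in, not JWT and not a vetted token format; the claim names, the one-hour TTL, and the function names are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time


def issue_token(secret, meeting_id, user_id, ttl_s=3600, now=None):
    """Return 'base64(claims).hexsig' bound to one meeting and user."""
    now = time.time() if now is None else now
    claims = {"meeting_id": meeting_id, "user_id": user_id,
              "exp": int(now) + ttl_s}
    body = base64.urlsafe_b64encode(
        json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"


def verify_token(secret, token, meeting_id, now=None):
    """Return the claims if the signature, meeting, and expiry all check
    out; otherwise return None."""
    now = time.time() if now is None else now
    try:
        body, sig = token.rsplit(".", 1)
    except ValueError:
        return None
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["meeting_id"] != meeting_id or claims["exp"] < now:
        return None
    return claims
```

A token issued for one meeting is rejected when presented for another, even though its signature is valid.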

Optimization (Bandwidth):

Last-N-Speakers: The SFU only forwards the video of the top 5-9 active speakers to save receiver bandwidth. Others are sent as audio-only or low-res thumbnails.
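The Last-N selection can be sketched as a ranking over recent audio levels. How the levels are measured and smoothed is outside this sketch; the input is assumed to be a per-user score where higher means more active, and the function name is hypothetical.

```python
def last_n_speakers(audio_levels, n=5):
    """Split participants into (forward_video, downgrade) lists.

    audio_levels maps user_id -> recent smoothed audio level.
    Ties are broken by user_id so the selection is stable between calls.
    """
    ranked = sorted(audio_levels, key=lambda user: (-audio_levels[user], user))
    return ranked[:n], ranked[n:]


levels = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.5}
video, downgraded = last_n_speakers(levels, n=2)
print(video)       # ['a', 'c']
print(downgraded)  # ['d', 'b']
```

Participants in the second list would be sent as audio-only or as low-res thumbnails, which caps each receiver's downlink regardless of meeting size.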