The Question

Large-Scale Video Conferencing Infrastructure Design

Design a globally distributed video conferencing system capable of supporting millions of concurrent users. The architecture must handle real-time audio/video streaming with sub-200ms latency, screen sharing, and cloud recording. Explain your choice of media routing architecture (SFU vs MCU), how you manage global signaling state, and how the system scales to accommodate massive webinars with over 50,000 participants while ensuring high availability and fault tolerance.
Tags: WebRTC, SFU, WebSocket, Redis, PostgreSQL, Kafka, S3, DTLS, SRTP, FFmpeg, Anycast, ICE
Questions & Insights

Clarifying Questions

What is the maximum number of participants per meeting?
Assumption: Support up to 1,000 participants for standard meetings and 50,000+ for webinars.
What is the target end-to-end latency for audio/video?
Assumption: Sub-200ms for a high-quality interactive experience.
How do we handle global distribution?
Assumption: Users are worldwide; we need regional Edge PoPs (Points of Presence) to minimize the "first-mile" latency.
What are the recording requirements?
Assumption: Meetings can be recorded to the cloud (S3) and processed asynchronously for playback.
Is screen sharing treated differently from camera video?
Assumption: Yes, screen sharing requires higher resolution/lower frame rate and higher reliability (TCP-like characteristics within the media stream).

Thinking Process

The core challenge of a video conferencing system is managing high-bandwidth, low-latency data streams while maintaining state across a distributed environment.
Media Routing: How do we efficiently route video/audio packets without the massive CPU overhead of transcoding every stream (SFU vs. MCU)?
Signaling & State: How do we manage the "handshake" (SDP/ICE) and real-time room metadata (who is muted, who is talking)?
Global Orchestration: How do we ensure a user in London and a user in NYC connect to a media server that provides the lowest latency for both?
Resilience: How does the system handle a Media Server crashing mid-meeting without dropping the call?

Bonus Points

Scalable Video Coding (SVC): Instead of sending multiple independent bitrates (simulcast), use SVC layers so the SFU can drop enhancement layers for low-bandwidth clients without re-encoding.
Cascading SFUs: For massive meetings, link SFUs in a tree structure across regions to minimize inter-continental bandwidth and latency.
Jitter Buffer Management: Implement sophisticated client-side adaptive jitter buffers to handle network fluctuations.
UDP Hole Punching & ICE: Deep understanding of STUN/TURN for NAT traversal to ensure peer-to-peer or peer-to-server connectivity in restrictive networks.
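Simulcast and SVC both reduce to the same forwarding decision at the SFU: pick the highest quality tier that fits a receiver's estimated downlink. A minimal sketch, assuming hypothetical layer bitrates (not from any real SFU):

```python
# Hypothetical sketch: an SFU choosing which simulcast/SVC tier to
# forward to each receiver based on its estimated downlink bandwidth.
# Layer names and bitrates (kbps) are illustrative assumptions.
LAYERS = [("low", 150), ("medium", 500), ("high", 1500)]  # kbps, ascending

def select_layer(estimated_kbps: int, headroom: float = 0.8) -> str:
    """Pick the highest layer that fits within the receiver's budget.

    A headroom factor keeps some bandwidth free for audio, retransmits,
    and estimation error; always fall back to the lowest layer.
    """
    budget = estimated_kbps * headroom
    chosen = LAYERS[0][0]
    for name, kbps in LAYERS:
        if kbps <= budget:
            chosen = name
    return chosen
```

The same function works for SVC if each "layer" is a cumulative set of scalability layers rather than an independent simulcast encoding.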
Design Breakdown

Functional Requirements

Core Use Cases:
Real-time 1:1 and group audio/video calls.
Meeting scheduling and join via URL.
Instant screen sharing.
Breakout rooms (partitioning a main meeting into sub-sessions).
Cloud recording and retrieval.
Scope Control:
In-scope: Core media engine, signaling, scheduling, and basic recording.
Out-of-scope: Virtual backgrounds (client-side), transcription/AI summary (MVP+), and PSTN (telephone) integration.

Non-Functional Requirements

Scale: 10 million Daily Active Users (DAU); 1 million Peak Concurrent Users (PCU).
Latency: < 200ms latency for media; < 500ms for signaling/chat.
Availability & Reliability: 99.99% availability; seamless reconnection on network switch (e.g., WiFi to LTE).
Consistency: Eventual consistency for meeting metadata; strong consistency for auth and scheduling.
Security & Privacy: End-to-end encryption (E2EE) options, AES-256 for media in transit.

Estimation

Traffic Estimation:
1M Peak Concurrent Users.
Average stream: 1 Mbps (720p).
Total Bandwidth: 1,000,000 * 1 Mbps = 1 Tbps aggregate ingress; SFU egress is several times higher, since each stream is forwarded to every other subscriber.
Storage Estimation:
5% of meetings recorded.
1-hour recording = 500 MB.
10k recordings/day = 5 TB/day. 1.8 PB/year.
QPS:
Signaling (Join/Leave/Mute): ~100k QPS at peak.
Metadata (Scheduling): ~1k QPS.
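The back-of-envelope numbers above can be reproduced directly:

```python
# Back-of-envelope check of the traffic and storage estimates above.
PCU = 1_000_000        # peak concurrent users
STREAM_MBPS = 1        # average 720p stream

# 1M streams at 1 Mbps = 1 Tbps aggregate
aggregate_tbps = PCU * STREAM_MBPS / 1_000_000

# 10k recordings/day at 500 MB each
recordings_per_day = 10_000
recording_mb = 500
storage_tb_per_day = recordings_per_day * recording_mb / 1_000_000
storage_pb_per_year = storage_tb_per_day * 365 / 1000  # ~1.8 PB
```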

Blueprint

The MVP focuses on a Selective Forwarding Unit (SFU) architecture. Unlike an MCU (which mixes video), an SFU simply forwards video packets from the sender to all other participants. This is highly scalable as it requires minimal CPU on the server.
Major Components:
Signal Service: Handles WebSockets for room orchestration, SDP exchange, and state changes (mute/unmute).
Selective Forwarding Unit (SFU): The media engine that routes UDP packets between participants.
Meeting Service: Manages the lifecycle of meetings, scheduling, and metadata in a persistent store.
Media Pipeline (Kafka + Worker): Handles the asynchronous task of persisting and stitching media streams for recording.
Simplicity Audit: By using SFU instead of MCU, we avoid the complexity of GPU-accelerated transcoding for the MVP.
Architecture Decision Rationale:
SFU Architecture: Best for high-scale as it shifts the "mixing" burden to the client.
WebSockets for Signaling: Necessary for real-time state synchronization.
Geographic Sharding: Meetings are hosted on media servers closest to the "center of gravity" of participants.
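The "center of gravity" placement can be sketched as picking the media region that minimizes total estimated RTT across participants. The latency matrix below is made up for the example:

```python
# Illustrative "center of gravity" placement: choose the media region
# minimizing the sum of estimated RTTs to all participants.
# Region names and RTT values are assumptions for the sketch.
RTT_MS = {  # participant location -> {media region -> est. RTT in ms}
    "london": {"eu-west": 10, "us-east": 80, "ap-south": 150},
    "nyc":    {"eu-west": 70, "us-east": 10, "ap-south": 200},
}
MEDIA_REGIONS = ["eu-west", "us-east", "ap-south"]

def pick_region(participant_locations: list[str]) -> str:
    """Return the media region with the lowest total RTT."""
    return min(
        MEDIA_REGIONS,
        key=lambda r: sum(RTT_MS[loc][r] for loc in participant_locations),
    )
```

A production placer would also weigh current node load and capacity, not latency alone.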

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:
Use Anycast IP to route users to the nearest regional PoP.
Geo-DNS resolves the Signal Service and Meeting Service to the closest data center.
Security & Perimeter:
API Gateway handles JWT validation and rate limiting.
WebRTC uses DTLS/SRTP for media encryption.

Service

Topology & Scaling:
Signal Service: Stateless, horizontally scaled. Uses Redis Pub/Sub to sync events across instances if participants in the same meeting are connected to different signal nodes.
SFU: Stateful during the session. If an SFU node reaches capacity, new meetings are scheduled on other nodes.
API Schema Design:
POST /v1/meetings: Create a meeting. Returns meeting_id and join_token.
GET /v1/meetings/{id}/join: Returns the IP of the assigned SFU and Signal server.
WebSocket Events: USER_JOINED, USER_LEFT, MUTE_CHANGED, SCREENSHARE_STARTED.
Resilience & Reliability:
ICE Restart: If the connection drops, the client triggers an ICE restart to find a new network path or reconnect to the SFU.
SFU Heartbeats: Signal service monitors SFU health. If an SFU fails, Signal Service instructs clients to reconnect to a standby SFU.
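The heartbeat-based failover can be sketched as two pure functions: detect dead SFUs from stale heartbeats, then remap their meetings to a standby node. The timeout value is illustrative:

```python
# Minimal sketch of the SFU heartbeat logic described above. The
# timeout constant is an illustrative assumption.
HEARTBEAT_TIMEOUT_S = 10

def dead_sfus(last_heartbeat: dict[str, float], now: float) -> list[str]:
    """Return SFU node ids whose last heartbeat is older than the timeout."""
    return [node for node, ts in last_heartbeat.items()
            if now - ts > HEARTBEAT_TIMEOUT_S]

def failover(assignments: dict[str, str], dead: list[str],
             standby: str) -> dict[str, str]:
    """Return new meeting -> SFU assignments with dead nodes replaced."""
    return {m: (standby if sfu in dead else sfu)
            for m, sfu in assignments.items()}
```

After reassignment, the Signal Service would push a reconnect instruction (new SFU address) to every client in the affected meetings.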

Storage

Access Pattern:
High read/write for SessionDB (Redis) as users join/leave and change state.
High read/low write for MeetingDB (PostgreSQL) for scheduling.
Database Table Design:
Meetings: id, host_id, start_time, duration, status, settings_json.
Participants: id, meeting_id, user_id, joined_at, left_at.
Technical Selection:
PostgreSQL: For relational integrity and scheduling.
Redis: For ephemeral session state (TTL set to meeting duration).
Distribution Logic: Shard PostgreSQL by host_id or meeting_id.

Cache

Purpose & Justification: Redis stores the "Active Meeting Registry"—which SFU is hosting which meeting.
Key-Value Schema: meeting:{id}:sfu_ip -> string. meeting:{id}:participants -> Set<user_id>.
Failure Handling: If Redis fails, the system can reconstruct state by querying SFUs directly (Health check/State sync), but Redis provides a centralized low-latency view.

Messaging

Purpose & Decoupling: Kafka decouples the real-time media forwarding from the heavy recording/transcoding process.
Event / Topic Schema: meeting.record.start, meeting.record.stop. Payload includes sfu_node_id and stream_ids.
Technical Selection: Kafka for high throughput and durability.
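An illustrative payload for the meeting.record.start topic; sfu_node_id and stream_ids come from the schema above, while the remaining field and method names are assumptions:

```python
# Illustrative event payload for the meeting.record.start Kafka topic.
import json
from dataclasses import asdict, dataclass, field

@dataclass
class RecordStartEvent:
    meeting_id: str
    sfu_node_id: str            # which SFU holds the live streams
    stream_ids: list[str] = field(default_factory=list)

    def to_kafka_value(self) -> bytes:
        """Serialize for the Kafka message value (key would be meeting_id)."""
        return json.dumps(asdict(self)).encode("utf-8")
```

Keying the message by meeting_id keeps all recording events for one meeting in order on a single partition.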

Data Processing

Processing Model: The Recording Worker pulls raw media chunks from the SFU (or a temporary buffer) and stitches them together into an MP4/WebM file.
Correctness Guarantees: Use sequence numbers in RTP packets to ensure frames are stitched in the correct chronological order despite network jitter.
Technical Selection: FFmpeg-based custom workers or GStreamer pipelines.
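The reordering step can be sketched as a sort on RTP sequence numbers, which are 16-bit and wrap around at 65536 (so a plain numeric sort would mis-order packets near the wrap point):

```python
# Sketch of reordering captured RTP packets by sequence number before
# stitching, handling 16-bit wraparound (RTP seq numbers are mod 65536).
def reorder(packets: list[tuple[int, bytes]]) -> list[bytes]:
    """packets: (seq, payload) pairs in arrival order; returns payloads
    in sequence order, assuming reordering spans well under 32768 packets."""
    if not packets:
        return []
    base = packets[0][0]
    # Distance from the first seq, mod 2^16, gives a total order that
    # survives the 65535 -> 0 wrap.
    return [p for _, p in sorted(packets, key=lambda sp: (sp[0] - base) % 65536)]
```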
Wrap Up

Advanced Topics

Trade-offs (SFU vs. MCU): We chose SFU for the MVP because it scales better. The trade-off is higher bandwidth consumption for the client (receiving N-1 streams). To mitigate this, we use Simulcast: the sender uploads 3 resolutions, and the SFU forwards the best one for each receiver's bandwidth.
Reliability: Use NACK (Negative Acknowledgement) for packet loss. If a packet is missing, the receiver asks the SFU to re-send it.
Scalability (10x): To support 50k users, we use Cascading SFUs. A "Master SFU" receives the presenter's stream and forwards it to several "Relay SFUs," which then serve chunks of 500-1000 users each.
Security: Implement Waiting Rooms and Passcodes at the Meeting Service layer to prevent "Zoom-bombing."
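The cascading-SFU fan-out above is a simple capacity calculation: with each relay serving up to 1,000 viewers, a 50k-participant webinar needs on the order of 50 relays behind the master SFU.

```python
# Back-of-envelope relay fan-out for the cascading-SFU webinar design.
# per_relay capacity is taken from the 500-1000 range in the text.
import math

def relay_count(viewers: int, per_relay: int = 1000) -> int:
    """Number of relay SFUs needed to serve all viewers."""
    return math.ceil(viewers / per_relay)
```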