The Question

Scalable Video Conferencing Platform Design

Design the frontend architecture for a real-time video conferencing application similar to Zoom or Microsoft Teams. Your solution should address the challenges of managing multiple high-bandwidth WebRTC streams, maintaining UI responsiveness during large meetings (up to 50 participants), and implementing a robust signaling and state synchronization mechanism. Detail your strategy for rendering optimization, media stream management, and handling varied network conditions while ensuring a seamless user experience across modern browsers.

WebRTC

React

SFU

WebSockets

Zustand

Tailwind CSS

TypeScript

Intersection Observer API

Questions & Insights

Clarifying Questions

What is the maximum number of participants per meeting for the MVP?

Assumption: We will support up to 50 participants per room. While large-scale webinars are a future goal, the MVP focuses on high-quality small-to-medium team collaboration.

Is the focus on Peer-to-Peer (P2P) or a media server-based architecture?

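Assumption: We will use a Selective Forwarding Unit (SFU) architecture. A P2P mesh stops scaling beyond roughly 3-4 participants because each client must encode and upload a separate copy of its stream to every other peer, which quickly exhausts upload bandwidth and CPU on client devices.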

Which platforms are prioritized?

Assumption: Modern desktop browsers (Chrome, Safari, Firefox) are the primary target. Mobile web is secondary; native apps are out of scope for the MVP.

What are the must-have collaboration features?

Assumption: Video/Audio streaming, Screen Sharing, and a simple text-based Chat are the core functional requirements.

Crash Strategy

Core Bottleneck: Efficiently rendering and managing multiple high-bandwidth media streams without blocking the browser's main thread or exhausting the user's CPU.

Key Strategy:

How do we manage the lifecycle of WebRTC connections and media tracks? (Establish a robust MediaStream Manager).

How do we ensure the UI remains responsive when 50 videos are active? (Implement Grid Virtualization and Dynamic Resolution Scaling).

How do we synchronize state (mute/unmute, hand raises) across all clients? (Utilize a low-latency Signaling Service via WebSockets).

How do we handle network instability? (Implement Adaptive Bitrate (ABR) logic and ICE restart mechanisms).
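
The recovery side of the last point can be sketched as a small decision function. This is a minimal sketch, not a definitive policy: the RecoveryAction names, the 3-second grace period, and the limit of 3 restarts are all assumptions made for illustration. In the browser, the "restart-ice" action would map to RTCPeerConnection.restartIce() followed by renegotiation through the signaling channel.

```typescript
// ICE connection states as defined by the WebRTC specification.
type IceState =
  | "new" | "checking" | "connected" | "completed"
  | "disconnected" | "failed" | "closed";

// Hypothetical action names used only in this sketch.
type RecoveryAction = "none" | "wait" | "restart-ice" | "rejoin";

// Assumed policy: tolerate brief disconnects, restart ICE when the
// connection stays down or fails, and rejoin after repeated restarts.
function decideRecovery(
  state: IceState,
  msInState: number,
  restartsSoFar: number,
  maxRestarts: number = 3,
  graceMs: number = 3000,
): RecoveryAction {
  if (state === "connected" || state === "completed") return "none";
  if (state === "closed") return "rejoin";
  if (state === "disconnected" && msInState < graceMs) return "wait";
  if (state === "disconnected" || state === "failed") {
    return restartsSoFar < maxRestarts ? "restart-ice" : "rejoin";
  }
  return "wait"; // "new" and "checking": negotiation is still in progress
}
```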

Elite Bonus Points

Audio-Only Mode: Automatically downgrading to audio-only for participants with low bandwidth to maintain session persistence.

Canvas Overlay for Annotations: Using a transparent Canvas layer over screen shares for real-time collaboration.

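Web Workers for Signaling: Offloading the WebSocket connection and signaling message parsing to a Web Worker to keep the UI thread clear for 60fps rendering. RTCPeerConnection itself is only available on the main thread, so the worker forwards parsed offers, answers, and ICE candidates to the main thread, which applies them.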

Simulcast Support: Sending multiple resolutions of the same video stream to the SFU, allowing the server to forward a lower-quality layer to participants with weak connections or small tiles.
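
For the simulcast point, a minimal sketch of a three-layer encoding ladder. The rid names ("q", "h", "f"), the bitrate split, and the function name are assumptions chosen for illustration; the fields themselves (rid, scaleResolutionDownBy, maxBitrate) match the browser's RTCRtpEncodingParameters.

```typescript
// Structurally compatible with the browser's RTCRtpEncodingParameters.
interface SimulcastEncoding {
  rid: string;
  scaleResolutionDownBy: number;
  maxBitrate: number; // bits per second
}

// Hypothetical ladder: quarter, half, and full resolution.
function buildSimulcastEncodings(fullBitrate: number = 1200000): SimulcastEncoding[] {
  return [
    { rid: "q", scaleResolutionDownBy: 4, maxBitrate: Math.round(fullBitrate / 8) },
    { rid: "h", scaleResolutionDownBy: 2, maxBitrate: Math.round(fullBitrate / 3) },
    { rid: "f", scaleResolutionDownBy: 1, maxBitrate: fullBitrate },
  ];
}
```

In a browser, this array would be passed as sendEncodings when adding the camera track, for example pc.addTransceiver(videoTrack, { direction: "sendonly", sendEncodings: buildSimulcastEncodings() }). Whether the SFU actually selects layers per subscriber depends on the SFU in use.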

Design Breakdown

Requirements

Functional Requirements:

Join/Leave meetings via unique URLs.

Real-time Audio/Video streaming with Mute/Unmute capabilities.

Screen sharing (presenter mode).

Side-panel Chat for text communication.

Participant list with status indicators (e.g., "speaking", "muted").

Non-Functional Requirements:

Latency: Sub-200ms glass-to-glass latency for audio/video.

Performance: Maintain 30-60fps UI responsiveness even with multiple video tracks.

Scalability: Support up to 50 concurrent video streams in a single room.

Accessibility: ARIA labels for media controls and keyboard shortcuts (e.g., 'M' to mute).

Security: End-to-end encryption (E2EE) or at least encryption in transit (DTLS/SRTP).

Design Summary

Concise Summary: A React-based SPA utilizing WebRTC and an SFU backend to facilitate multi-party video conferencing, focusing on a centralized Media Management layer to decouple stream logic from UI components.

Major Components:

App Shell: The persistent container managing authentication, routing, and global error boundaries.

Meeting Manager: An orchestration service (Application Layer) that handles joining rooms, signaling, and participant state.

MediaStream Manager: A domain-level service responsible for WebRTC PeerConnections, track lifecycle, and device constraints.

Video Grid: A smart layout component that dynamically calculates tile sizes and prioritizes "Active Speakers."
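
The tile-size calculation in the Video Grid could look roughly like the sketch below. It assumes 16:9 tiles, ignores gaps between tiles, and uses a hypothetical function name; it simply tries every column count and keeps the one that gives the largest tile.

```typescript
interface GridLayout {
  cols: number;
  rows: number;
  tileWidth: number;
  tileHeight: number;
}

// Pick the column count that maximizes tile size for `count` tiles
// inside a container of the given pixel size.
function computeGrid(
  count: number,
  width: number,
  height: number,
  aspect: number = 16 / 9,
): GridLayout {
  const tiles = Math.max(1, count);
  let best: GridLayout = { cols: 1, rows: 1, tileWidth: 0, tileHeight: 0 };
  for (let cols = 1; cols <= tiles; cols++) {
    const rows = Math.ceil(tiles / cols);
    // A tile is limited either by its cell width or by its cell height.
    const tileWidth = Math.min(width / cols, (height / rows) * aspect);
    if (tileWidth > best.tileWidth) {
      best = { cols, rows, tileWidth, tileHeight: tileWidth / aspect };
    }
  }
  return best;
}
```

For example, 9 tiles in a 1600x900 container resolve to a 3x3 grid, and 4 tiles to a 2x2 grid. The resulting cols value can drive a CSS Grid template directly.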

CUJ Walkthrough:

User lands on a URL -> Meeting Manager fetches room metadata -> MediaStream Manager requests camera permissions -> Signaling Client connects to SFU -> Media flows into Video Grid -> Control Bar allows interaction.

Simplicity Audit: This architecture uses an SFU to simplify the frontend's connection logic (1 Up/N Down instead of Mesh). It avoids complex global state for media tracks, keeping them in a dedicated manager.
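
The "1 Up/N Down instead of Mesh" claim can be made concrete with a small piece of arithmetic. The function and type names below are hypothetical.

```typescript
interface StreamCounts {
  upstreams: number;
  downstreams: number;
}

// Per-client media stream counts for a room of `participants` people.
function streamCounts(participants: number, topology: "mesh" | "sfu"): StreamCounts {
  const others = Math.max(0, participants - 1);
  return topology === "mesh"
    ? { upstreams: others, downstreams: others } // one copy sent to every peer
    : { upstreams: 1, downstreams: others }; // single upload, the SFU fans it out
}
```

In a 50-person room, a mesh client would have to upload 49 copies of its stream, while an SFU client uploads one. Download counts are the same in both cases, which is why the rendering and quality-tier optimizations elsewhere in this design still matter.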

Architecture Decision Rationale:

Why this architecture? SFU is the industry standard for scalability. Separation of the Media Layer from the UI Layer allows for independent testing and easier transitions if the underlying WebRTC library changes.

Requirement Satisfaction: WebRTC handles low latency; CSS Grid/Intersection Observer in the Presentation Layer handles UI scale; WebSocket Signaling ensures real-time state sync.

System Diagram

Architecture Deep Dive

Presentation Layer

Component Hierarchy: The App Shell wraps the Meeting Layout. The layout splits into a Video Grid (central) and a Sidebar Chat. The Video Grid dynamically renders Video Card components based on the participant list.

Interaction Layer: Includes the Control Bar for toggling tracks. Input validation ensures only valid Meeting IDs are processed. Accessibility is handled by managing focus when sidebars open and providing high-contrast icons for muted states.

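Rendering Layer: Uses Client-Side Rendering (CSR). To optimize performance, we use Intersection Observer on Video Cards: if a participant's video is scrolled out of view, we detach the srcObject from the <video> element so the browser stops rendering it, which saves GPU/CPU cycles. Detaching alone does not stop the SFU from sending that stream, so the client should also ask the SFU to pause it if bandwidth is a concern.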
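
A minimal sketch of the attach/detach logic behind the Intersection Observer optimization. The VisibilityBinder class and the VideoLike and EntryLike types are hypothetical; they mirror only the parts of HTMLVideoElement and IntersectionObserverEntry that the sketch touches, so the logic can be exercised without a DOM.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings.
interface VideoLike {
  srcObject: unknown;
}
interface EntryLike {
  target: unknown;
  isIntersecting: boolean;
}

// Remembers which stream belongs to which <video> element and
// attaches or detaches it as the element enters or leaves the viewport.
class VisibilityBinder {
  private streams = new Map<unknown, unknown>();

  bind(video: VideoLike, stream: unknown): void {
    this.streams.set(video, stream);
  }

  unbind(video: VideoLike): void {
    this.streams.delete(video);
    video.srcObject = null;
  }

  // Intended to be used as the IntersectionObserver callback.
  handleEntries = (entries: EntryLike[]): void => {
    for (const entry of entries) {
      if (!this.streams.has(entry.target)) continue;
      const video = entry.target as VideoLike;
      // Attach when on screen, detach when scrolled out of view.
      video.srcObject = entry.isIntersecting ? this.streams.get(video) : null;
    }
  };
}
```

In the browser, the wiring would be roughly: const observer = new IntersectionObserver(binder.handleEntries, { threshold: 0.1 }); followed by observer.observe(videoElement) for each Video Card. The 0.1 threshold is an arbitrary choice for illustration.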

UI Frameworks: React for component tree management, Tailwind CSS for responsive layouts, and Lucide-React for accessible iconography.

Application Layer

Data Fetching Layer: REST for initial room configuration; WebSockets (via Signaling Client) for real-time participant events (join, leave, mute).

State Management Layer: A centralized store (e.g., Zustand or Redux) tracks the "Participant List" and "Local User State." Media streams are kept out of the global store (stored in the Domain Layer) to prevent unnecessary re-renders of the entire app when a stream's internal metadata changes.
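
The split between store state and media objects might be sketched as follows. To stay dependency-free, this uses a tiny hand-rolled store as a stand-in for Zustand or Redux; createRoomStore, ParticipantState, and streamRegistry are hypothetical names, and unknown stands in for MediaStream.

```typescript
interface ParticipantState {
  id: string;
  name: string;
  audioMuted: boolean;
  videoMuted: boolean;
}
interface RoomState {
  participants: Record<string, ParticipantState>;
}
type Listener = (state: RoomState) => void;

// Stand-in for a Zustand/Redux store: holds serializable UI state only.
function createRoomStore() {
  let state: RoomState = { participants: {} };
  const listeners = new Set<Listener>();
  return {
    getState: (): RoomState => state,
    subscribe(listener: Listener): () => void {
      listeners.add(listener);
      return () => {
        listeners.delete(listener);
      };
    },
    upsertParticipant(p: ParticipantState): void {
      state = { participants: { ...state.participants, [p.id]: p } };
      listeners.forEach((l) => l(state));
    },
  };
}

// Media objects live outside the store, keyed by participant ID,
// so attaching or replacing a stream never notifies store subscribers.
const streamRegistry = new Map<string, unknown>();
```

A Video Card would read mute flags from the store and look up its stream in the registry by participant ID.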

Routing & Navigation: Simple URL-based routing (e.g., /room/:id). Route guards check for camera/mic permissions before allowing entry into the meeting.
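
A minimal sketch of extracting and validating the meeting ID from a /room/:id path. The ID format used here (three lowercase letter groups such as "abc-defg-hij") is purely an assumption for illustration; the real format depends on how the backend issues room IDs.

```typescript
// Assumed ID format for illustration only, e.g. "abc-defg-hij".
const ROOM_ID_PATTERN = /^[a-z]{3}-[a-z]{4}-[a-z]{3}$/;

// Returns the room ID for a valid /room/:id path, or null otherwise.
function parseRoomPath(path: string): string | null {
  const match = /^\/room\/([^/?#]+)\/?$/.exec(path);
  const id = match ? match[1] : undefined;
  return id !== undefined && ROOM_ID_PATTERN.test(id) ? id : null;
}
```

A route guard can call this first and only proceed to the camera/mic permission check when it returns a non-null ID.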

Domain Layer

Business Rules: Logic for "Who is the active speaker?" based on audio level analysis. Logic for "Maximum concurrent videos" (e.g., limit to 12 visible videos, others become avatars).
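
The active-speaker rule could be sketched as below, assuming each participant's audio level arrives as a number between 0 and 1 (for example from RTCPeerConnection.getStats() or from the SFU). The smoothing factor, silence threshold, and switching margin are assumptions; the margin keeps the highlight from flickering between two people talking at similar volume.

```typescript
interface SpeakerSample {
  id: string;
  audioLevel: number; // expected range: 0 to 1
}

class ActiveSpeakerDetector {
  private smoothed = new Map<string, number>();
  private current: string | null = null;
  private readonly alpha = 0.3; // weight of the newest sample
  private readonly threshold = 0.05; // below this, treat as silence
  private readonly margin = 1.5; // challenger must be this much louder

  // Feed one batch of samples; returns the current active speaker ID.
  update(samples: SpeakerSample[]): string | null {
    let loudest: string | null = null;
    let loudestLevel = 0;
    for (const sample of samples) {
      const previous = this.smoothed.get(sample.id) ?? 0;
      // Exponential moving average to ignore short spikes.
      const level = previous + this.alpha * (sample.audioLevel - previous);
      this.smoothed.set(sample.id, level);
      if (level > loudestLevel) {
        loudest = sample.id;
        loudestLevel = level;
      }
    }
    let currentLevel = 0;
    if (this.current !== null) {
      currentLevel = this.smoothed.get(this.current) ?? 0;
    }
    if (
      loudest !== null &&
      loudestLevel >= this.threshold &&
      loudestLevel > currentLevel * this.margin
    ) {
      this.current = loudest;
    }
    return this.current;
  }
}
```

This sketch does not handle participants leaving the room; a fuller version would remove their entries from the smoothed map.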

Entities / Models: Participant model includes ID, name, track status (audio/video/screen), and role (host/attendee).

Inter-layer Contracts: The MediaStream Manager exposes a clean API to the Application Layer: toggleVideo(), startScreenShare(), and an event emitter for onTrackReceived.
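
A minimal sketch of that contract, under stated assumptions: the class name, setLocalVideoTrack, and emitTrack are hypothetical helpers, and TrackLike mirrors only the kind and enabled fields of MediaStreamTrack. startScreenShare() is omitted here; in a browser it would wrap navigator.mediaDevices.getDisplayMedia().

```typescript
// Mirrors the parts of MediaStreamTrack this sketch needs.
interface TrackLike {
  kind: string;
  enabled: boolean;
}
type TrackListener = (participantId: string, track: TrackLike) => void;

class MediaStreamManagerSketch {
  private listeners = new Set<TrackListener>();
  private localVideo: TrackLike | null = null;

  setLocalVideoTrack(track: TrackLike): void {
    this.localVideo = track;
  }

  // Flips the local camera track and returns the new enabled state.
  // Returns false when no camera track has been set.
  toggleVideo(): boolean {
    if (this.localVideo === null) return false;
    this.localVideo.enabled = !this.localVideo.enabled;
    return this.localVideo.enabled;
  }

  // Subscribe to remote tracks; returns an unsubscribe function.
  onTrackReceived(listener: TrackListener): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Would be called from the RTCPeerConnection "track" event handler.
  emitTrack(participantId: string, track: TrackLike): void {
    this.listeners.forEach((listener) => listener(participantId, track));
  }
}
```

The Application Layer only sees toggleVideo() and onTrackReceived(); it never touches the peer connection directly.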

Infrastructure Layer

API / Network: Uses WebRTC for media. The Signaling Client uses WebSockets to exchange SDP (Session Description Protocol) and ICE Candidates with the SFU.
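
The messages exchanged over the signaling WebSocket might be modeled as a discriminated union with a defensive parser. The message shapes and type names below are assumptions; the actual wire format is dictated by the SFU's signaling protocol.

```typescript
// Hypothetical wire format for this sketch.
type SignalMessage =
  | { type: "offer" | "answer"; sdp: string }
  | { type: "candidate"; candidate: string; sdpMid: string | null }
  | { type: "participant-muted"; participantId: string; muted: boolean };

// Parses one raw WebSocket frame; returns null for anything malformed.
function parseSignal(raw: string): SignalMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== "object" || data === null) return null;
  const { type, sdp, candidate, sdpMid, participantId, muted } =
    data as Record<string, unknown>;

  if ((type === "offer" || type === "answer") && typeof sdp === "string") {
    return { type, sdp };
  }
  if (type === "candidate" && typeof candidate === "string") {
    return { type, candidate, sdpMid: typeof sdpMid === "string" ? sdpMid : null };
  }
  if (
    type === "participant-muted" &&
    typeof participantId === "string" &&
    typeof muted === "boolean"
  ) {
    return { type, participantId, muted };
  }
  return null; // unknown or incomplete message
}
```

Offers and answers would then be applied with setRemoteDescription(), and candidates with addIceCandidate(), on the peer connection.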

Storage: localStorage stores user preferences (preferred camera/mic ID, username).
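
Because localStorage contents can be missing, stale, or hand-edited, the stored preferences are best read defensively. A minimal sketch, assuming a hypothetical "meeting.prefs" key and UserPrefs shape:

```typescript
interface UserPrefs {
  username: string;
  cameraId: string | null;
  micId: string | null;
}

const PREFS_KEY = "meeting.prefs"; // hypothetical storage key
const DEFAULT_PREFS: UserPrefs = { username: "", cameraId: null, micId: null };

// Accepts the raw value from localStorage.getItem(PREFS_KEY) and always
// returns a well-formed preferences object.
function parsePrefs(raw: string | null): UserPrefs {
  if (raw === null) return { ...DEFAULT_PREFS };
  try {
    const data: unknown = JSON.parse(raw);
    if (typeof data !== "object" || data === null) return { ...DEFAULT_PREFS };
    const { username, cameraId, micId } = data as Record<string, unknown>;
    return {
      username: typeof username === "string" ? username : "",
      cameraId: typeof cameraId === "string" ? cameraId : null,
      micId: typeof micId === "string" ? micId : null,
    };
  } catch {
    return { ...DEFAULT_PREFS }; // corrupted JSON
  }
}
```

In the browser this would be called as parsePrefs(localStorage.getItem(PREFS_KEY)). A stored device ID may no longer exist, so it should still be checked against navigator.mediaDevices.enumerateDevices() before use.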

SFU Interaction: The frontend sends one "Upstream" (local mic/cam) and receives multiple "Downstreams" (remote participants). The SFU handles the heavy lifting of multiplexing these streams.

Wrap Up

Trade-offs: We chose an SFU over P2P. This means higher server costs for the provider, but significantly better user experience and battery life for participants.

Optimization: For the 50-person limit, we implement "Pagination" or "Speaker View." Only the top 6-9 active speakers get high-resolution video; others get low-res or just audio/avatars.
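
The tiering described above can be sketched as a simple assignment over a list ranked by recent speaking activity. The default slot counts (9 high-resolution tiles, 12 visible tiles in total) are taken from the figures in this design; the function and type names are hypothetical.

```typescript
type Quality = "high" | "low" | "audio-only";

// `rankedIds` is ordered by recent speaking activity, most active first.
function assignQuality(
  rankedIds: string[],
  highSlots: number = 9,
  visibleSlots: number = 12,
): Map<string, Quality> {
  const plan = new Map<string, Quality>();
  rankedIds.forEach((id, index) => {
    if (index < highSlots) plan.set(id, "high");
    else if (index < visibleSlots) plan.set(id, "low");
    else plan.set(id, "audio-only"); // rendered as an avatar
  });
  return plan;
}
```

With 50 participants this yields 9 high-resolution tiles, 3 low-resolution tiles, and 38 avatars. The client would send the resulting plan to the SFU so that it forwards the matching layer, or no video at all, for each participant.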

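Security: All media traffic is encrypted in transit via DTLS-SRTP. This encryption is hop-by-hop: the SFU terminates it and can access the media, so for the MVP we trust the SFU. The roadmap is to implement "Insertable Streams" (encoded transforms) for true E2EE in browsers.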