The Question

Scalable Peer-to-Peer File Distribution System

Design a high-performance distributed file transfer system similar to BitTorrent. The system must support millions of concurrent users sharing large files (GBs to TBs) while minimizing server bandwidth costs. Focus on the architecture for peer discovery (Tracker), metadata management, and the logic for direct peer-to-peer data exchange. Address challenges such as peer churn, data integrity verification, and incentives to prevent leeching (free-riding).

Redis

PostgreSQL

P2P

SHA-256

REST

L7 Load Balancing

TCP Pipelining

Questions & Insights

Clarifying Questions

Scale of Operation: What is the expected number of active torrents and concurrent peers (DAU/Peak)?

Centralization: Should we design a fully decentralized system (DHT-based) or a tracker-based approach for the MVP?

File Size & Integrity: What is the maximum file size supported, and how is data integrity verified (e.g., SHA-1/SHA-256 hashes of pieces)?

Network Constraints: Are we targeting public internet users (NAT traversal issues) or a controlled data center environment?

Assumptions for MVP:

Scale: 10 million DAU, 1 million concurrent "swarms" (groups of peers for a specific file).

Architecture: Hybrid approach using a Centralized Tracker for the MVP (simpler to implement/manage than DHT) with P2P data exchange.

Integrity: Files are split into fixed-size pieces (e.g., 256KB - 1MB), each verified by a hash.

Scope: Focus on the Tracker, Metadata management, and the Peer-to-Peer protocol logic.

Thinking Process

Core Bottleneck: The primary bottleneck in large-scale file distribution is the server's outbound bandwidth. We solve this by offloading 99% of data transfer to the peers themselves.

Key Progression:

How does a client find "who has the file"? (The Tracker/Discovery problem).

How is a file represented to ensure it can be downloaded in parallel? (The Metadata/Segmentation problem).

How do peers decide which piece to download and who to upload to? (The Scheduling & Game Theory problem).

How do we ensure the system remains healthy when peers leave? (The Churn/Availability problem).

Bonus Points

Rarest-First Algorithm: Prioritize downloading the pieces that are least common in the swarm to increase the "redundancy" of the file segments across the network.

Tit-for-Tat (Game Theory): Implement a choking/unchoking mechanism where peers reward those who upload to them, preventing "leeching" and ensuring network health.

Optimistic Unchoking: Occasionally upload to a random peer to discover better-performing neighbors and allow new peers to bootstrap into the swarm.

End-Game Mode: When a client has almost all pieces, it sends requests for the remaining pieces to all known peers to minimize the "tail latency" of the final few blocks.

Design Breakdown

Functional Requirements

Core Use Cases:

Create/Publish: A "seeder" generates a metadata file (.torrent) and registers with the tracker.

Discovery: A "leecher" queries the tracker to get a list of active peers for a specific hash.

P2P Transfer: Peers exchange pieces of the file directly using a bitfield to track progress.

Reporting: Peers periodically "announce" their status (downloaded/remaining) to the tracker.

Scope Control:

In-Scope: Tracker API, Peer discovery, Metadata storage, Piece selection logic.

Out-of-Scope: Global search engine (indexing torrents), DHT implementation (Kademlia), UI/UX for the client.

Non-Functional Requirements

Scale: Support millions of concurrent swarms and high-frequency "announce" requests.

Latency: Tracker responses must be < 100ms to allow peers to start downloading quickly.

Availability: The tracker must be highly available; if it goes down, new peers cannot join swarms.

Consistency: Eventual consistency for peer lists is acceptable.

Fault Tolerance: Handle high peer "churn" (peers joining and leaving abruptly).

Security: Integrity checks for every piece via hashes to prevent poisoning attacks.

Estimation

Traffic:

10M DAU, each announcing every 5 minutes.

10,000,000 / 300s \approx 33,000

Announce QPS.

Storage:

1M active torrents. Metadata (name, hashes)

\approx 50KB

per torrent.

1,000,000 \times 50KB = 50GB

for metadata.

Peer Swarm lists: 1M torrents

\times

50 peers per list

\times

16 bytes (IP+Port)

\approx 800MB

in memory.

Bandwidth:

Tracker:

33,000 QPS \times 1KB \approx 33MB/s

(negligible for modern clusters).

P2P Data: Handled by peers, not our infrastructure.

Blueprint

Concise Summary: A centralized Tracker-based system where a metadata store handles file descriptors and a high-performance cache tracks peer-to-swarm mappings.

Major Components:

Tracker Service: A stateless API that manages the "Announce" protocol and returns peer lists.

Metadata DB: Stores the .torrent file information and piece hashes.

Swarm Cache (Redis): Stores transient lists of active peer IPs for each file hash.

Peer Client: The logic running on user machines to manage piece requests and bitfields.

Simplicity Audit: We use a centralized Tracker instead of a DHT (Distributed Hash Table) because DHTs are harder to debug and have higher discovery latency for the MVP.

Architecture Decision Rationale:

Scalability: Decoupling the tracker from data transfer allows the system to scale infinitely relative to file size.

Reliability: Using Redis for the peer lists ensures sub-millisecond lookups for the high-frequency Announce requests.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Tracker traffic is dynamic; we use a Global Load Balancer (L7) with Anycast IP for low-latency access to the nearest tracker instance.

Security:

Rate limiting on "Announce" requests to prevent Tracker DDoS.

SSL/TLS termination at the Gateway.

Service

Topology: Stateless Tracker nodes deployed in multiple Availability Zones.

API Schema:

GET /announce:

Params: info_hash, peer_id, port, downloaded, left, event (started/stopped/completed).

Response: interval (seconds until next announce), peers (List of binary IP/Port).

Resilience:

Client-side Retry: If tracker is down, clients use exponential backoff.

Interval Adjustment: During high load, the tracker increases the interval value in the response to slow down clients.

Storage

Access Pattern:

Metadata DB: Low-frequency writes (creation), high-frequency reads (getting info_hash details).

Swarm Cache: Extremely high-frequency read/writes (peer heartbeats).

Technical Selection:

Metadata DB: PostgreSQL with a JSONB column for piece hashes or a separate pieces table.

Swarm Cache: Redis. Use the SET or ZSET (Sorted Set) data structure where the score is the last-seen timestamp to easily expire dead peers.

Database Schema:

torrents: id, info_hash (Primary Key), name, piece_length, created_at.

pieces: torrent_id, index, hash_value.

Cache

Purpose: To manage the list of active peers for every info_hash.

Key-Value Schema:

Key: swarm:{info_hash}

Value: Set of ip:port strings.

TTL: 10-15 minutes (slightly longer than the announce interval).

Failure Handling: If Redis fails, the tracker becomes "read-only" (cannot add new peers to swarms) until the cache is restored.

Wrap Up

Advanced Topics

Trade-offs (PACELC): We prioritize Availability and Partition Tolerance (AP). If a tracker instance is out of sync, it might return a slightly stale peer list, which is fine as the client will simply fail to connect to those specific peers and move on.

Reliability & Failure Handling:

Stale Peer Cleanup: A background worker or Redis TTL removes peers that haven't announced in > 10 minutes.

Optimization (The Client):

Bitfield: Peers exchange a bitfield (a compact array of bits) representing which pieces they have. This minimizes bandwidth for signaling.

Pipelining: Clients request multiple blocks (sub-pieces) at once to saturate the TCP window and maximize throughput.

Security:

Hash Verification: Every downloaded piece is hashed and compared against the metadata file. If a peer sends bad data, they are blacklisted by the client.

Distinguishing Insights:

To bridge the gap to a fully decentralized system later, the Tracker can return a "bootstrap node" list for a DHT (Kademlia), allowing the system to transition from tracker-based to trackerless without a hard fork.