SQS
Cheat Sheet
Prime Use Case
Use SQS when you need to decouple components of a distributed system to ensure asynchronous processing, buffer against traffic spikes, or manage work distribution among a pool of consumers.
Critical Tradeoffs
- At-least-once delivery (Standard) vs. exactly-once processing (FIFO)
- Latency overhead of polling vs. immediate processing
- Managed simplicity vs. limited message size and retention
Killer Senior Insight
The 'Visibility Timeout' is the most critical lever in SQS; setting it incorrectly is the primary cause of either duplicate processing (too short) or stalled recovery (too long) in production systems.
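As a concrete sketch (assuming boto3; the queue name, timeout values, and DLQ ARN are illustrative, not prescriptive), the visibility timeout and its companion safety net, the redrive policy, are both plain queue attributes:

```python
import json

# Illustrative values -- tune VisibilityTimeout to roughly your p99
# processing time plus a buffer. Too short -> duplicate deliveries;
# too long -> a crashed consumer stalls redelivery for the full timeout.
QUEUE_ATTRIBUTES = {
    "VisibilityTimeout": "120",              # seconds (AWS default is 30)
    "ReceiveMessageWaitTimeSeconds": "20",   # long polling by default
    "MessageRetentionPeriod": "1209600",     # 14 days, the maximum
    "RedrivePolicy": json.dumps({
        "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
        "maxReceiveCount": "5",              # failures before the DLQ
    }),
}

def create_orders_queue(sqs, name="orders"):
    """sqs is a boto3 SQS client, e.g. boto3.client('sqs')."""
    return sqs.create_queue(QueueName=name,
                            Attributes=QUEUE_ATTRIBUTES)["QueueUrl"]
```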
Recognition
Common Interview Phrases
Common Scenarios
- Order processing pipelines in e-commerce
- Asynchronous task execution in serverless (Lambda) architectures
- Buffering log data before ingestion into a database
- Implementing the 'Producer-Consumer' pattern at cloud scale
Anti-patterns to Avoid
- Using SQS for real-time, low-latency request-response cycles
- Storing large files directly in the message (exceeding 256KB)
- Using it as a permanent data store (retention is capped at 14 days; the default is 4)
- Broadcasting messages to multiple subscribers (use SNS for that)
The Problem
The Fundamental Issue
Tight coupling between services leads to cascading failures and inability to handle variable load.
What breaks without it
Synchronous calls block the producer, leading to thread exhaustion
Downstream service outages cause immediate upstream failure
Sudden traffic bursts overwhelm and crash consumer databases or APIs
Why alternatives fail
Database-based queues suffer from locking contention and scaling bottlenecks at high throughput
In-memory queues (like those in Sidekiq or Celery) lose data if the application node crashes
Self-managed brokers (RabbitMQ) require significant operational overhead for high availability and scaling
Mental Model
The Intuition
Think of SQS as a Post Office Box (PO Box). A sender drops a letter in the box (Producer). The letter stays there until the recipient (Consumer) checks the box. When the recipient takes the letter, they tell the post office 'I'm reading this, don't let anyone else see it for 30 seconds.' If they don't finish and delete it in that time, the letter becomes visible again for someone else to try.
Key Mechanics
SendMessage: Producer pushes a message to the queue
ReceiveMessage: Consumer polls for messages (Short vs. Long Polling)
Visibility Timeout: The period a message is hidden from other consumers after being picked up
DeleteMessage: Consumer must explicitly delete the message after successful processing
Dead Letter Queue (DLQ): Where 'poison pill' messages go after failing multiple times
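The receive/hide/delete cycle above can be modeled with a toy in-memory queue. This is not the AWS API, only an illustration of the semantics:

```python
import time

class ToyQueue:
    """Toy in-memory model of SQS visibility-timeout semantics.

    Illustrative only -- real SQS is distributed and issues a fresh
    receipt handle on every receive."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # handle -> [body, visible_again_at]
        self._next_handle = 0

    def send_message(self, body):
        self._messages[str(self._next_handle)] = [body, 0.0]
        self._next_handle += 1

    def receive_message(self, now=None):
        """Return (handle, body) and hide the message, or None if empty."""
        now = time.monotonic() if now is None else now
        for handle, entry in self._messages.items():
            if entry[1] <= now:                       # currently visible
                entry[1] = now + self.visibility_timeout
                return handle, entry[0]
        return None

    def delete_message(self, handle):
        """Must be called after successful processing, or the message
        reappears once the visibility timeout lapses."""
        self._messages.pop(handle, None)
```

Receiving at t=0 hides the message until t=30; a consumer that fails to delete it in time sees it redelivered, which is exactly the duplicate-processing failure mode a too-short timeout produces.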
Framework
When it's the best choice
- AWS-native environments where operational zero-touch is preferred
- Workloads requiring massive horizontal scaling (virtually unlimited throughput for Standard queues)
- Scenarios where strict ordering is not required (Standard), or FIFO throughput limits fit (300 msg/s per API action without batching, 3,000 with batching; higher with high-throughput mode)
When to avoid
- When sub-millisecond latency is required between producer and consumer
- When you need to 'replay' messages (use Kafka or Kinesis instead)
- When you need a 'Pub/Sub' model where one message goes to many different services
Fast Heuristics
Tradeoffs
Strengths
- Virtually unlimited throughput (Standard queues) with no manual intervention
- Highly durable (messages stored across multiple Availability Zones)
- Lets producers and consumers scale independently
- Pay-per-use pricing model is cost-effective for low/medium volumes
Weaknesses
- Standard queues do not guarantee strict ordering
- Standard queues may deliver duplicate messages (At-least-once)
- Message size limit of 256KB (requires S3 'Claim Check' pattern for larger data)
- Polling architecture introduces a small amount of inherent latency
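The 'Claim Check' workaround for the 256KB limit can be sketched as follows. `upload_to_s3` is a hypothetical helper (e.g. wrapping `s3.put_object`), and the key scheme is an assumption:

```python
import json
import uuid

SQS_MAX_BYTES = 256 * 1024  # SQS message size limit

def to_sqs_body(payload: str, upload_to_s3) -> str:
    """Claim-check sketch: small payloads travel inline; large ones are
    written to S3 and the queue carries only a pointer.

    upload_to_s3(key, data) is an assumed helper. In practice, leave
    headroom below the limit for the JSON envelope and any attributes."""
    if len(payload.encode("utf-8")) <= SQS_MAX_BYTES:
        return json.dumps({"inline": payload})
    key = f"claims/{uuid.uuid4()}"
    upload_to_s3(key, payload)
    return json.dumps({"s3_key": key})
```

The consumer does the inverse: if the body carries an `s3_key`, fetch the object before processing, and delete it only after the SQS message itself is deleted.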
Alternatives
vs. Kafka
When it wins
High-throughput stream processing and message replayability (log-based)
Key Difference
Kafka is a distributed log where messages stay after being read; SQS is a transient queue where messages are deleted after processing.
vs. RabbitMQ
When it wins
Complex routing logic (headers, exchange types) and very low latency requirements
Key Difference
RabbitMQ is a traditional message broker that pushes messages to consumers; SQS requires consumers to pull (poll) messages.
vs. Kinesis
When it wins
Real-time data streaming and analytics on large volumes of data with strict ordering per shard
Key Difference
Kinesis allows multiple consumers to read the same stream at different offsets; SQS messages are typically consumed by one worker.
Execution
Must-hit talking points
- Mention 'Long Polling' (WaitTimeSeconds) to reduce costs and empty responses
- Discuss 'Idempotency' as a requirement for handling SQS duplicates
- Explain the 'Visibility Timeout' and how to extend it for long-running tasks
- Highlight 'Dead Letter Queues' (DLQ) for handling unprocessable messages
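For the long-running-task point, the usual pattern is a 'heartbeat' that keeps extending the visibility timeout while work is in flight. A sketch, assuming a boto3-style client; the interval values are illustrative:

```python
import threading

def process_with_heartbeat(sqs, queue_url, message, handler,
                           extension=60, heartbeat_every=20.0):
    """Run handler(message) while periodically calling
    ChangeMessageVisibility so the message stays hidden, then delete it.

    sqs is a boto3 SQS client (or any object with the same two methods)."""
    done = threading.Event()

    def heartbeat():
        # Re-hide the message every heartbeat_every seconds until done.
        while not done.wait(heartbeat_every):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
                VisibilityTimeout=extension,
            )

    t = threading.Thread(target=heartbeat, daemon=True)
    t.start()
    try:
        handler(message)
        # Delete only after the handler succeeds; on failure the message
        # becomes visible again and is retried (eventually hitting the DLQ).
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=message["ReceiptHandle"])
    finally:
        done.set()
        t.join()
```

Keep `heartbeat_every` comfortably below `extension` so the message never flickers back into visibility between extensions.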
Anticipate follow-ups
- Q: How do you handle messages larger than 256KB? (Answer: S3 Claim Check pattern)
- Q: How do you ensure exactly-once processing? (Answer: FIFO queues or idempotency keys in the DB)
- Q: How do you scale consumers? (Answer: CloudWatch alarms on 'ApproximateNumberOfMessagesVisible' triggering Auto Scaling)
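For the exactly-once follow-up, the consumer-side half is a dedup check keyed on the message ID (or a business ID such as an order number). A minimal sketch with an injected store; in production the set would be a database table written atomically with the side effect:

```python
def handle_once(message_id, body, processed_ids, apply_effect):
    """Skip messages that were already applied.

    processed_ids stands in for a durable store; apply_effect is the
    real side effect (e.g. charging a customer). Returns True if the
    effect actually ran."""
    if message_id in processed_ids:
        return False  # duplicate delivery -- safe to delete and move on
    apply_effect(body)
    # In production, record the ID in the same transaction as the effect,
    # or a crash between these two lines reintroduces the duplicate.
    processed_ids.add(message_id)
    return True
```

With this in place, the at-least-once duplicates Standard queues can produce become harmless no-ops.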
Red Flags
Setting the Visibility Timeout too short.
Why it fails: The consumer is still processing, but the message becomes visible again, leading to a second consumer picking it up and causing duplicate work or race conditions.
Not implementing idempotency in the consumer.
Why it fails: SQS Standard guarantees 'at-least-once' delivery. Without idempotency, your system will eventually process the same message twice, leading to corrupted data (e.g., double-charging a customer).
Using SQS for high-frequency request-response.
Why it fails: The overhead of polling and the distributed nature of SQS adds latency (tens to hundreds of milliseconds) that is unacceptable for synchronous UI interactions.