SQS

Amazon Simple Queue Service (SQS) is a fully managed, highly scalable distributed message queuing service designed to decouple and scale microservices, distributed systems, and serverless applications.

Cheat Sheet

Prime Use Case

Use SQS when you need to decouple components of a distributed system to ensure asynchronous processing, buffer against traffic spikes, or manage work distribution among a pool of consumers.

Critical Tradeoffs

  • At-least-once delivery vs. Exactly-once delivery (Standard vs. FIFO)
  • Latency overhead of polling vs. immediate processing
  • Managed simplicity vs. limited message size and retention

Killer Senior Insight

The 'Visibility Timeout' is the most critical lever in SQS; setting it incorrectly is the primary cause of either duplicate processing (too short) or stalled recovery (too long) in production systems.

Recognition

Common Interview Phrases

Need to handle 'spiky' traffic without crashing downstream services
Requirement for asynchronous background processing (e.g., image resizing, email sending)
Decoupling a fast producer from a slow consumer
Ensuring system resilience where the producer shouldn't care if the consumer is temporarily down

Common Scenarios

  • Order processing pipelines in e-commerce
  • Asynchronous task execution in serverless (Lambda) architectures
  • Buffering log data before ingestion into a database
  • Implementing the 'Producer-Consumer' pattern at cloud scale

Anti-patterns to Avoid

  • Using SQS for real-time, low-latency request-response cycles
  • Storing large files directly in the message (exceeding 256KB)
  • Using it as a permanent data store (maximum message retention is 14 days)
  • Broadcasting messages to multiple subscribers (use SNS for that)

The Problem

The Fundamental Issue

Tight coupling between services leads to cascading failures and inability to handle variable load.

What breaks without it

Synchronous calls block the producer, leading to thread exhaustion

Downstream service outages cause immediate upstream failure

Sudden traffic bursts overwhelm and crash consumer databases or APIs

Why alternatives fail

Database-based queues suffer from locking contention and scaling bottlenecks at high throughput

In-memory queues (like those in Sidekiq or Celery) lose data if the application node crashes

Self-managed brokers (RabbitMQ) require significant operational overhead for high availability and scaling

Mental Model

The Intuition

Think of SQS as a Post Office Box (PO Box). A sender drops a letter in the box (Producer). The letter stays there until the recipient (Consumer) checks the box. When the recipient takes the letter, they tell the post office 'I'm reading this, don't let anyone else see it for 30 seconds.' If they don't finish and delete it in that time, the letter becomes visible again for someone else to try.
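The PO Box analogy can be made concrete with a toy in-memory queue. This is purely an illustration of the visibility-timeout semantics, not real SQS behavior or API; the class name and method signatures are invented for this sketch.

```python
import time

class ToyQueue:
    """In-memory sketch of SQS visibility-timeout semantics (illustration only)."""

    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # id -> [body, invisible_until]
        self._next_id = 0

    def send(self, body):
        self._messages[self._next_id] = [body, 0.0]
        self._next_id += 1

    def receive(self, now=None):
        """Return (id, body) of one visible message, hiding it for the timeout."""
        now = time.monotonic() if now is None else now
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout  # hide from other consumers
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        """Consumer acknowledges success; the message is gone for good."""
        self._messages.pop(msg_id, None)
```

Note that a received message is never removed by `receive`: it only becomes invisible. If the consumer crashes before calling `delete`, the clock runs out and the message reappears, which is exactly the redelivery behavior described above.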

Key Mechanics

1. SendMessage: Producer pushes a message to the queue
2. ReceiveMessage: Consumer polls for messages (Short vs. Long Polling)
3. Visibility Timeout: The period a message is hidden from other consumers after being picked up
4. DeleteMessage: Consumer must explicitly delete the message after successful processing
5. Dead Letter Queue (DLQ): Where 'poison pill' messages go after failing multiple times
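The receive/process/delete cycle can be sketched as a consumer loop. This is a boto3-shaped sketch, not a definitive implementation: the client is injected so the control flow is visible without live AWS credentials, and `handle` and `max_batches` are hypothetical names for this example.

```python
def drain_queue(sqs, queue_url, handle, max_batches=1):
    """Receive, process, and delete messages from an SQS-style queue.

    `sqs` is assumed to expose boto3-shaped receive_message/delete_message;
    `handle` is the caller's processing callback.
    """
    processed = 0
    for _ in range(max_batches):
        # Long polling (WaitTimeSeconds) avoids tight loops of empty responses.
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,
        )
        for msg in resp.get("Messages", []):
            handle(msg["Body"])  # if this raises, the message reappears after the timeout
            # Delete ONLY after successful processing (mechanic 4 above).
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
            processed += 1
    return processed
```

The key design point is the order of operations: processing happens before `delete_message`, so a crash mid-handler leaves the message in the queue for redelivery rather than losing it.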

Framework

When it's the best choice

  • AWS-native environments where minimal operational overhead is preferred
  • Workloads requiring massive horizontal scaling (virtually unlimited throughput for Standard queues)
  • Scenarios where message order is not strictly required (Standard), or where FIFO's throughput limits (3,000 msg/s with batching; higher with high-throughput mode) are sufficient

When to avoid

  • When sub-millisecond latency is required between producer and consumer
  • When you need to 'replay' messages (use Kafka or Kinesis instead)
  • When you need a 'Pub/Sub' model where one message goes to many different services

Fast Heuristics

If you need 'Exactly-once' and 'Ordering', use SQS FIFO
If you need 'Fan-out' to multiple consumers, use SNS + SQS
If you need to process streams of data with multiple 'pointers' to the same data, use Kinesis

Tradeoffs

Strengths

  • Virtually unlimited throughput with no manual scaling (Standard queues)
  • Highly durable (messages stored across multiple Availability Zones)
  • Lets producers and consumers scale independently
  • Pay-per-use pricing model is cost-effective for low/medium volumes

Weaknesses

  • Standard queues do not guarantee strict ordering
  • Standard queues may deliver duplicate messages (At-least-once)
  • Message size limit of 256KB (requires S3 'Claim Check' pattern for larger data)
  • Polling architecture introduces a small amount of inherent latency
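The S3 'Claim Check' workaround for the 256KB limit can be sketched as follows. This is a hedged, boto3-shaped illustration: the clients are injected, and the key scheme and JSON envelope format are assumptions of this example, not an AWS convention.

```python
import json
import uuid

SQS_MAX_BYTES = 256 * 1024  # SQS message size limit

def send_with_claim_check(sqs, s3, queue_url, bucket, payload: bytes):
    """Send a payload via SQS, offloading oversized bodies to S3 (claim check).

    `sqs`/`s3` are assumed to expose boto3-shaped send_message/put_object.
    """
    if len(payload) <= SQS_MAX_BYTES:
        body = json.dumps({"inline": payload.decode("utf-8")})
    else:
        key = f"claim-checks/{uuid.uuid4()}"              # hypothetical key scheme
        s3.put_object(Bucket=bucket, Key=key, Body=payload)
        body = json.dumps({"s3_bucket": bucket, "s3_key": key})  # pointer only
    sqs.send_message(QueueUrl=queue_url, MessageBody=body)
    return body
```

The consumer side mirrors this: on receipt, it checks for an `s3_key` pointer and fetches the real payload from S3 before processing.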

Alternatives

Apache Kafka

When it wins

High-throughput stream processing and message replayability (log-based)

Key Difference

Kafka is a distributed log where messages stay after being read; SQS is a transient queue where messages are deleted after processing.

RabbitMQ

When it wins

Complex routing logic (headers, exchange types) and very low latency requirements

Key Difference

RabbitMQ is a traditional message broker that pushes messages to consumers; SQS requires consumers to pull (poll) messages.

Amazon Kinesis

When it wins

Real-time data streaming and analytics on large volumes of data with strict ordering per shard

Key Difference

Kinesis allows multiple consumers to read the same stream at different offsets; SQS messages are typically consumed by one worker.

Execution

Must-hit talking points

  • Mention 'Long Polling' (WaitTimeSeconds) to reduce costs and empty responses
  • Discuss 'Idempotency' as a requirement for handling SQS duplicates
  • Explain the 'Visibility Timeout' and how to extend it for long-running tasks
  • Highlight 'Dead Letter Queues' (DLQ) for handling unprocessable messages

Anticipate follow-ups

  • Q: How do you handle messages larger than 256KB? (Answer: S3 Claim Check pattern)
  • Q: How do you ensure exactly-once processing? (Answer: FIFO queues, or idempotency keys in the DB)
  • Q: How do you scale consumers? (Answer: CloudWatch Alarms on 'ApproximateNumberOfMessagesVisible' triggering Auto Scaling)

Red Flags

Setting the Visibility Timeout too short.

Why it fails: The consumer is still processing, but the message becomes visible again, leading to a second consumer picking it up and causing duplicate work or race conditions.

Not implementing idempotency in the consumer.

Why it fails: SQS Standard guarantees 'at-least-once' delivery. Without idempotency, your system will eventually process the same message twice, leading to corrupted data (e.g., double-charging a customer).
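The idempotency guard can be as simple as a deduplication check keyed on the message ID before any side effect. The sketch below uses an in-memory set for clarity; a real system would use a durable store (e.g. a database unique constraint), and the wrapper shape is an invention of this example.

```python
def make_idempotent(handler):
    """Wrap a handler so duplicate deliveries of the same message are no-ops.

    In-memory set for illustration only; production systems need a durable,
    shared store so deduplication survives restarts and multiple workers.
    """
    seen = set()

    def wrapped(message_id, body):
        if message_id in seen:
            return "skipped"           # duplicate delivery: do nothing
        handler(body)                  # the side effect (charge, email, ...)
        seen.add(message_id)           # record only after success
        return "processed"

    return wrapped
```

Recording the ID only after the handler succeeds matters: marking it first and then crashing would silently drop the message on redelivery.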

Using SQS for high-frequency request-response.

Why it fails: The overhead of polling and the distributed nature of SQS add latency (tens to hundreds of milliseconds) that is unacceptable for synchronous UI interactions.