Elasticsearch
Cheat Sheet
Prime Use Case
Use when you need high-performance full-text search, complex filtering, or real-time ad-hoc analytics across massive semi-structured datasets.
Critical Tradeoffs
- Near-Real-Time (NRT) vs. Immediate Consistency
- High Memory/Heap Consumption vs. Search Performance
- Write Throughput vs. Indexing Latency
- Schema Flexibility vs. Mapping Explosion
Killer Senior Insight
Elasticsearch isn't a database; it's a distributed inverted index. Its power comes from the 'Segment' architecture where immutable files allow for OS-level page caching, but this necessitates a 'Merge' process that is the primary source of I/O spikes and performance jitter.
Recognition
Common Interview Phrases
Common Scenarios
- E-commerce product catalogs with complex filtering.
- Log aggregation and analysis (ELK/EFK stack).
- Geospatial search (finding 'points of interest' nearby).
- Application performance monitoring (APM) dashboards.
Anti-patterns to Avoid
- Using it as the primary source of truth for ACID transactions.
- Storing frequently updated counters (leads to versioning overhead and segment fragmentation).
- Relational data with heavy 'Join' requirements (ES is effectively flat).
The Problem
The Fundamental Issue
Traditional B-Tree based relational databases cannot perform efficient full-text searches or complex multi-dimensional filtering at scale.
What breaks without it
SQL 'LIKE %query%' predicates cannot use B-Tree indexes, forcing full table scans that can take down production databases.
Ranking results by relevance (TF-IDF or BM25) becomes prohibitively expensive to compute at query time.
Adding new searchable fields requires expensive schema migrations.
Why alternatives fail
Standard RDBMS indexes are optimized for exact matches or prefix ranges, not tokenized text.
In-memory caches (Redis) lack the complex query DSL and ranking logic.
NoSQL stores (Cassandra/DynamoDB) lack secondary indexing flexibility for ad-hoc combinations of filters.
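To make the contrast concrete, here is the kind of ad-hoc combination of full-text relevance and structured filters that is awkward in an RDBMS but idiomatic in the ES Query DSL. This is a sketch: the index and field names ('title', 'in_stock', 'price') are hypothetical, not from any real schema.

```python
# One request combines a scored, analyzed full-text clause with
# cached, unscored structured filters -- no schema migration needed
# to add another filter dimension later.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": "wireless headphones"}}  # scored (BM25)
            ],
            "filter": [                                       # unscored, cacheable
                {"term": {"in_stock": True}},
                {"range": {"price": {"lte": 200}}},
            ],
        }
    },
    "size": 10,
}
```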
Mental Model
The Intuition
Imagine a massive library where, instead of a card catalog sorted by title, you have a giant index at the back of a book that lists every single word ever written and exactly which page and line it appears on. To find something, you look up the word and instantly see all locations.
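The library analogy above can be sketched in a few lines. This is a toy illustration of the data structure, not how Lucene encodes postings on disk:

```python
# A toy inverted index: map each term to the set of document IDs
# containing it, then answer AND queries by intersecting postings.
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, *terms):
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "quick brown fox",
    2: "quick red fox",
    3: "lazy brown dog",
}
index = build_inverted_index(docs)
print(sorted(search(index, "quick", "fox")))  # → [1, 2]
```

The lookup never scans documents; it scans the (much smaller) term dictionary, which is why full-text search stays fast as the corpus grows.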
Key Mechanics
Inverted Index: Mapping terms to document IDs.
Document Routing: Using 'hash(routing_key) % primary_shards' to locate data.
Segments: Immutable Lucene sub-indexes that are periodically merged.
Translog: A write-ahead log that ensures data durability before segments are flushed to disk.
Cluster Coordination: Master nodes managing cluster state while Data nodes handle CRUD and search.
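The routing formula above can be sketched directly. Note this is a simplified stand-in: real ES hashes the routing key with Murmur3, while MD5 is used here only to keep the example self-contained and deterministic.

```python
# Sketch of document routing: shard = hash(routing_key) % primary_shards.
# Real Elasticsearch uses Murmur3; MD5 here is a stand-in.
import hashlib

def route(routing_key: str, num_primary_shards: int) -> int:
    digest = hashlib.md5(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards

# The same key always lands on the same shard -- which is also why
# the primary shard count cannot change after index creation
# without a full reindex.
shard = route("user-42", 5)
assert route("user-42", 5) == shard
```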
Framework
When it's the best choice
- When read-to-write ratio is high and search latency is critical.
- When data is semi-structured (JSON) and fields may vary across documents.
- When you need to scale horizontally by simply adding more nodes to a cluster.
When to avoid
- When you have highly relational data that requires frequent joins.
- When you have a very high volume of updates to existing documents (under the hood, an ES update marks the old document as deleted and re-indexes a new one, churning segments).
- When you cannot tolerate any window of stale reads or data loss (ES is near-real-time, and consistency across replicas is eventual).
Fast Heuristics
Tradeoffs
Strengths
- Massive horizontal scalability via sharding.
- Rich Query DSL supporting aggregations, geo-queries, and fuzzy matching.
- High availability through cross-node replication.
- Schemaless-ish: Dynamic mapping allows for evolving data structures.
Weaknesses
- Significant JVM Heap management overhead (Garbage Collection pauses).
- Split-brain risk in poorly configured clusters (though improved in v7+).
- Deep pagination is expensive (each shard must fetch and sort 'from + size' documents, so cost grows with page depth).
- Storage overhead due to inverted indices and stored fields.
Alternatives
Apache Solr
When it wins
In enterprise environments requiring heavy XML support or legacy Hadoop integration.
Key Difference
Solr is more 'static' configuration-heavy; ES was built for dynamic, cloud-native environments from day one.
Algolia
When it wins
When you want a managed SaaS solution for front-end search with zero infra management.
Key Difference
Algolia is a proprietary hosted service; ES is open-core/source and can be self-hosted.
PostgreSQL (full-text search)
When it wins
When the dataset is small and you want to avoid the operational complexity of a second cluster.
Key Difference
Postgres is a monolithic RDBMS; ES is a distributed search engine.
Typesense / Meilisearch
When it wins
When you need a lightweight, simpler alternative for smaller datasets with great out-of-the-box defaults.
Key Difference
Written in C++/Rust, focusing on developer experience over the massive feature set of ES.
Execution
Must-hit talking points
- Explain the 'Refresh' vs 'Flush' interval and how it affects NRT search.
- Discuss Shard Strategy: Over-sharding leads to 'small shard' problems and heap pressure.
- Mention 'Doc Values' for aggregations (columnar storage on disk).
- Highlight the 'Circuit Breaker' mechanism that prevents OOM by killing expensive queries.
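The refresh/flush talking point can be made concrete with index-level settings. These are real Elasticsearch setting names, but the values shown are illustrative tradeoff knobs, not recommendations:

```python
# Settings controlling the NRT tradeoff. A longer refresh_interval raises
# indexing throughput (fewer small segments) at the cost of search freshness;
# async translog durability trades a small data-loss window for write speed.
settings = {
    "index": {
        "refresh_interval": "30s",     # default 1s; "-1" disables refresh
        "number_of_shards": 3,         # fixed at index creation
        "number_of_replicas": 1,
        "translog": {
            "durability": "async",     # default "request" fsyncs per request
            "sync_interval": "5s",
        },
    }
}
```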
Anticipate follow-ups
- Q: How do you handle the 'Deep Pagination' problem? (Search After API vs. Scroll API).
- Q: How do you handle a 'Mapping Explosion' when users can define arbitrary fields?
- Q: What is your strategy for Zero-Downtime Reindexing?
- Q: How does ES handle node failure and shard rebalancing?
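For the deep-pagination follow-up, a sketch of the search_after pattern helps: instead of 'from' + 'size' (which forces every shard to materialize from+size hits), the client passes the sort values of the last hit it received. The sort field names here ('created_at', 'id' as a unique keyword tiebreaker) are illustrative.

```python
# Build a search body for the next page using search_after.
# A unique tiebreaker field in the sort is required so pagination
# is stable when many documents share the same created_at value.
def next_page(last_sort_values=None, page_size=20):
    body = {
        "size": page_size,
        "sort": [{"created_at": "desc"}, {"id": "asc"}],
    }
    if last_sort_values is not None:
        body["search_after"] = last_sort_values
    return body

first = next_page()
# After receiving results, feed the last hit's sort values back in:
second = next_page(last_sort_values=["2024-05-01T12:00:00Z", "doc-981"])
```

Unlike the Scroll API, search_after keeps no server-side state, which makes it the usual choice for user-facing pagination.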
Red Flags
Treating Elasticsearch as a primary relational database.
Why it fails: Lack of ACID transactions and expensive 'joins' (via parent-child or nested types) lead to data integrity issues and performance collapse.
Using too many shards for a small dataset.
Why it fails: Every shard is a Lucene index that consumes CPU, file descriptors, and memory. Thousands of tiny shards will crush the Master node's cluster state management.
Ignoring the 'Split Brain' scenario in older versions.
Why it fails: If 'discovery.zen.minimum_master_nodes' is not set to a quorum (N/2 + 1), a network partition can create two independent clusters, leading to permanent data divergence.
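The quorum rule behind that setting is simple arithmetic; a minimal sketch (since 7.x the coordination layer computes this automatically, so the setting is gone):

```python
# Quorum rule that prevented split-brain in pre-7.x clusters:
# minimum_master_nodes must be a strict majority of master-eligible nodes.
def minimum_master_nodes(master_eligible: int) -> int:
    return master_eligible // 2 + 1

assert minimum_master_nodes(3) == 2  # tolerates losing 1 node
assert minimum_master_nodes(5) == 3  # tolerates losing 2 nodes
# With 2 master-eligible nodes the quorum is 2, so losing either node
# halts elections -- which is why 3 is the recommended minimum.
```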