Elasticsearch

A distributed, RESTful search and analytics engine built on top of Apache Lucene, designed for horizontal scalability, reliability, and real-time search capabilities.

Cheat Sheet

Prime Use Case

Use when you need high-performance full-text search, complex filtering, or real-time ad-hoc analytics across massive semi-structured datasets.

Critical Tradeoffs

  • Near-Real-Time (NRT) vs. Immediate Consistency
  • High Memory/Heap Consumption vs. Search Performance
  • Write Throughput vs. Indexing Latency
  • Schema Flexibility vs. Mapping Explosion

Killer Senior Insight

Elasticsearch isn't a database; it's a distributed inverted index. Its power comes from the 'Segment' architecture where immutable files allow for OS-level page caching, but this necessitates a 'Merge' process that is the primary source of I/O spikes and performance jitter.
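The segment lifecycle behind that insight can be sketched in a few lines. This is a toy model with made-up names, not the Lucene API: refreshes write immutable segments, and merges rewrite many small ones into a larger one.

```python
# Toy model of the Lucene segment lifecycle (all names illustrative).
segments: list[tuple[str, ...]] = []

def refresh(buffered_docs: list[str]) -> None:
    """A refresh writes the in-memory buffer out as a new, immutable segment."""
    segments.append(tuple(buffered_docs))

def merge() -> None:
    """A background merge rewrites many small segments into one larger
    segment -- the I/O spike mentioned above."""
    merged = tuple(doc for seg in segments for doc in seg)
    segments.clear()
    segments.append(merged)

for batch in (["a", "b"], ["c"], ["d", "e"]):
    refresh(batch)
assert len(segments) == 3   # small segments accumulate between merges

merge()
assert segments == [("a", "b", "c", "d", "e")]
```

Because segments are never modified in place, the OS page cache stays valid; the price is the merge churn modeled by `merge()`.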

Recognition

Common Interview Phrases

  • "We need 'search-as-you-type' functionality."
  • "The system must handle fuzzy matching and relevance scoring."
  • "We need to aggregate logs or metrics across thousands of nodes."
  • "Users need to filter products by multiple dynamic attributes (faceting)."

Common Scenarios

  • E-commerce product catalogs with complex filtering.
  • Log aggregation and analysis (ELK/EFK stack).
  • Geospatial search (finding 'points of interest' nearby).
  • Application performance monitoring (APM) dashboards.

Anti-patterns to Avoid

  • Using it as the primary source of truth for ACID transactions.
  • Storing frequently updated counters (leads to versioning overhead and segment fragmentation).
  • Relational data with heavy 'Join' requirements (ES is effectively flat).

The Problem

The Fundamental Issue

Traditional B-Tree-based relational databases cannot perform efficient full-text searches or complex multi-dimensional filtering at scale.

What breaks without it

SQL LIKE '%query%' predicates cannot use B-Tree indexes, forcing full table scans that can bring down production databases.

Ranking results by relevance (TF-IDF or BM25) becomes prohibitively expensive to compute at query time.

Adding new searchable fields requires expensive schema migrations.
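To make the ranking point concrete, here is a minimal TF-IDF scorer. Elasticsearch actually defaults to the BM25 variant, and the corpus and function names here are invented purely for illustration:

```python
import math
from collections import Counter

# Hypothetical three-document corpus.
docs = {
    1: "quick brown fox",
    2: "quick quick dog",
    3: "lazy brown dog",
}

def tf_idf(term: str, doc_id: int) -> float:
    """Score one term against one document: term frequency x inverse document frequency."""
    words = docs[doc_id].split()
    tf = Counter(words)[term] / len(words)                     # how often the term appears here
    df = sum(1 for text in docs.values() if term in text.split())
    idf = math.log(len(docs) / df)                             # rarer terms weigh more
    return tf * idf

# 'quick' appears twice in doc 2, so doc 2 outranks doc 1 for that term.
assert tf_idf("quick", 2) > tf_idf("quick", 1)
```

Doing this per query over millions of rows in an RDBMS means recomputing term statistics on the fly; a search engine precomputes them at index time.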

Why alternatives fail

Standard RDBMS indexes are optimized for exact matches or prefix ranges, not tokenized text.

In-memory caches (Redis) lack the complex query DSL and ranking logic.

NoSQL stores (Cassandra/DynamoDB) lack secondary indexing flexibility for ad-hoc combinations of filters.

Mental Model

The Intuition

Imagine a massive library where, instead of a card catalog sorted by title, you have a giant index at the back of a book that lists every single word ever written and exactly which page and line it appears on. To find something, you look up the word and instantly see all locations.
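That back-of-the-book index is exactly an inverted index. A minimal sketch (real analyzers also lowercase, stem, and strip stopwords):

```python
from collections import defaultdict

# Hypothetical corpus: document ID -> text.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick lazy fox",
}

# Inverted index: term -> set of document IDs containing it.
index: dict[str, set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# A term query is a dictionary lookup, not a scan over every document.
print(sorted(index["quick"]))                    # [1, 3]
# An AND query is a set intersection of two postings lists.
print(sorted(index["quick"] & index["lazy"]))    # [3]
```

Lucene's postings lists are compressed and disk-resident rather than Python sets, but the lookup-then-intersect shape is the same.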

Key Mechanics

1. Inverted Index: Mapping terms to document IDs.

2. Document Routing: Using 'hash(routing_key) % primary_shards' to locate data.

3. Segments: Immutable Lucene sub-indexes that are periodically merged.

4. Translog: A write-ahead log that ensures data durability before segments are flushed to disk.

5. Cluster Coordination: Master nodes managing cluster state while Data nodes handle CRUD and search.
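The routing formula can be sketched directly. Elasticsearch actually hashes the _routing value with Murmur3; MD5 stands in here only to keep the sketch in the standard library:

```python
import hashlib

def route(routing_key: str, num_primary_shards: int) -> int:
    """Pick the primary shard for a document: hash(routing_key) % primary_shards."""
    digest = int(hashlib.md5(routing_key.encode()).hexdigest(), 16)
    return digest % num_primary_shards

shard = route("user-42", 5)
assert 0 <= shard < 5
# The same key always lands on the same shard -- which is also why the
# primary shard count cannot change without a full reindex.
assert route("user-42", 5) == shard
```

The modulo is the key design constraint: change `num_primary_shards` and every existing document maps to a different shard.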

Framework

When it's the best choice

  • When read-to-write ratio is high and search latency is critical.
  • When data is semi-structured (JSON) and fields may vary across documents.
  • When you need to scale horizontally by simply adding more nodes to a cluster.

When to avoid

  • When you have highly relational data that requires frequent joins.
  • When you have a very high volume of updates to existing documents (an ES update marks the old document as deleted and reindexes the new version).
  • When you cannot tolerate stale reads or any risk of data loss (ES is eventually consistent across replicas).

Fast Heuristics

  • If 'Full-text + Ranking' then Elasticsearch.
  • If 'Strict ACID + Relations' then PostgreSQL.
  • If 'Simple Key-Value + Sub-millisecond' then Redis.
  • If 'Massive Write Volume + Simple Query' then Cassandra.

Tradeoffs

Strengths

  • Massive horizontal scalability via sharding.
  • Rich Query DSL supporting aggregations, geo-queries, and fuzzy matching.
  • High availability through cross-node replication.
  • Schemaless-ish: Dynamic mapping allows for evolving data structures.

Weaknesses

  • Significant JVM Heap management overhead (Garbage Collection pauses).
  • Split-brain risk in poorly configured clusters (though improved in v7+).
  • Deep pagination is expensive: every shard must fetch and sort from + size hits before the coordinating node merges them.
  • Storage overhead due to inverted indices and stored fields.

Alternatives

Apache Solr

When it wins

In enterprise environments requiring heavy XML support or legacy Hadoop integration.

Key Difference

Solr leans on heavy, static configuration; ES was built for dynamic, cloud-native environments from day one.

Algolia

When it wins

When you want a managed SaaS solution for front-end search with zero infra management.

Key Difference

Algolia is a proprietary hosted service; ES is open-core/source and can be self-hosted.

PostgreSQL GIN Indexes

When it wins

When the dataset is small and you want to avoid the operational complexity of a second cluster.

Key Difference

Postgres is a monolithic RDBMS; ES is a distributed search engine.

Typesense / Meilisearch

When it wins

When you need a lightweight, simpler alternative for smaller datasets with great out-of-the-box defaults.

Key Difference

Typesense is written in C++ and Meilisearch in Rust; both favor developer experience over the massive feature set of ES.

Execution

Must-hit talking points

  • Explain the 'Refresh' vs 'Flush' interval and how it affects NRT search.
  • Discuss Shard Strategy: Over-sharding leads to 'small shard' problems and heap pressure.
  • Mention 'Doc Values' for aggregations (columnar storage on disk).
  • Highlight the 'Circuit Breaker' mechanism that prevents OOM by killing expensive queries.

Anticipate follow-ups

  • Q: How do you handle the 'Deep Pagination' problem? (the search_after parameter vs. the Scroll API)
  • Q: How do you handle a 'Mapping Explosion' when users can define arbitrary fields?
  • Q: What is your strategy for zero-downtime reindexing?
  • Q: How does ES handle node failure and shard rebalancing?

Red Flags

Treating Elasticsearch as a primary relational database.

Why it fails: Lack of ACID transactions and expensive 'joins' (via parent-child or nested types) lead to data integrity issues and performance collapse.

Using too many shards for a small dataset.

Why it fails: Every shard is a Lucene index that consumes CPU, file descriptors, and memory. Thousands of tiny shards will crush the Master node's cluster state management.

Ignoring the 'Split Brain' scenario in older versions.

Why it fails: If 'discovery.zen.minimum_master_nodes' is not set to a quorum (N/2 + 1), a network partition can create two independent clusters, leading to permanent data divergence.
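The quorum arithmetic is worth having at your fingertips (Elasticsearch 7+ computes this automatically via voting configurations, so the setting only applies to pre-7.x clusters):

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum for pre-7.x clusters: floor(N / 2) + 1 master-eligible nodes."""
    return master_eligible // 2 + 1

assert minimum_master_nodes(3) == 2   # a 3-master cluster survives one node loss
assert minimum_master_nodes(5) == 3   # a 5-master cluster survives two
# With only 2 master-eligible nodes the quorum is 2: zero fault tolerance,
# which is why 3 is the practical minimum.
assert minimum_master_nodes(2) == 2
```

Under a network partition, only the side holding a quorum can elect a master; the minority side refuses writes instead of diverging.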