Logistic Regression

A probabilistic linear model that estimates the probability of a binary outcome by passing a linear combination of features through the logistic (sigmoid) function.

Cheat Sheet

Prime Use Case

Use as a baseline for binary classification, especially when interpretability, low-latency inference, and well-calibrated probabilities are required in high-dimensional sparse feature spaces.

Critical Tradeoffs

  • High interpretability vs. limited capacity for non-linear relationships
  • Extremely low inference latency vs. heavy reliance on manual feature engineering
  • Excellent calibration out-of-the-box vs. sensitivity to outliers and multicollinearity

Killer Senior Insight

In large-scale production systems like AdTech or Search, Logistic Regression is rarely just a 'simple model'; it is the backbone of high-throughput ranking when combined with feature hashing and online learning algorithms like FTRL-Proximal.
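A minimal sketch of that setup, assuming scikit-learn is available: the hashing trick maps arbitrary categorical IDs into a fixed-width sparse space, and SGDClassifier with logistic loss stands in for FTRL-Proximal, which scikit-learn does not ship (loss="log_loss" requires scikit-learn ≥ 1.1; older versions call it "log"). The minibatches generator and the field=value tokens are illustrative.

```python
# Hashed sparse features + online logistic regression.
# SGDClassifier stands in here for FTRL-Proximal (not available in scikit-learn).
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

hasher = FeatureHasher(n_features=2**20, input_type="string")   # fixed-width hashed space
clf = SGDClassifier(loss="log_loss", penalty="l1", alpha=1e-6)  # logistic loss, sparse weights

def minibatches():
    # Stand-in for a real event stream: each example is a list of "field=value" tokens.
    yield (["user=123", "ad=987", "site=news.example"], 1)
    yield (["user=456", "ad=987", "site=sports.example"], 0)

for tokens, label in minibatches():
    X = hasher.transform([tokens])               # 1 x 2^20 sparse row
    clf.partial_fit(X, [label], classes=[0, 1])  # one streaming update per example

print(clf.predict_proba(hasher.transform([["user=123", "ad=987"]]))[:, 1])
```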

Recognition

Common Interview Phrases

  • The interviewer emphasizes 'low latency' or 'high throughput' requirements.
  • The problem involves massive, sparse feature sets (e.g., millions of categorical IDs).
  • There is a strict requirement for model explainability (e.g., 'Why was this loan denied?').
  • The system needs to provide a well-calibrated probability, not just a hard classification.

Common Scenarios

  • Click-Through Rate (CTR) prediction in real-time bidding.
  • Initial stage of a multi-stage ranking funnel.
  • Fraud detection where decision transparency is legally mandated.
  • Medical risk scoring based on clinical indicators.

Anti-patterns to Avoid

  • Using raw Logistic Regression for image or audio data without deep feature extractors.
  • Proposing it for small datasets with complex, non-linear interactions without cross-features.
  • Using it when the target variable is continuous (this is a classification model, despite the name).

The Problem

The Fundamental Issue

Mapping an unbounded linear combination of features to a bounded [0, 1] probability space while maintaining a convex loss function for efficient optimization.

What breaks without it

Linear Regression would predict values outside [0, 1], making probability interpretation impossible.

Hard classifiers (like SVMs) do not naturally provide the probability scores needed for ranking or expected value calculations.

Non-convex models (like Deep Learning) may converge to local minima, making them harder to debug in simple scenarios.

Why alternatives fail

Decision Trees are computationally expensive for high-cardinality categorical features (e.g., UserID).

SVMs do not scale well to millions of rows and do not provide calibrated probabilities without expensive post-processing like Platt Scaling.
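A small illustration of that calibration cost on synthetic data (a sketch, not a benchmark): LR exposes predict_proba directly, while the linear SVM needs the extra cross-fitted sigmoid models of Platt scaling.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X, y)
p_lr = lr.predict_proba(X)[:, 1]                 # probabilities come for free

# LinearSVC only yields decision scores; Platt scaling fits extra sigmoid models on CV folds.
svm_cal = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3).fit(X, y)
p_svm = svm_cal.predict_proba(X)[:, 1]
```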

Neural Networks require significantly more compute and data to outperform a well-tuned LR on tabular data.

Mental Model

The Intuition

Imagine a light dimmer switch. Instead of just 'On' or 'Off', the switch calculates a score based on various inputs (room brightness, time of day) and slides to a specific percentage of brightness. Logistic Regression is that slider, converting a raw score into a probability.

Key Mechanics

1. Linear predictor: z = w0 + w1x1 + ... + wnxn
2. Sigmoid activation: p = 1 / (1 + exp(-z))
3. Log-Loss (cross-entropy) objective function
4. Maximum Likelihood Estimation (MLE) for parameter optimization
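A minimal NumPy sketch tying the four mechanics together; the update is plain batch gradient descent on the log-loss, which is equivalent to maximizing the likelihood (names and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    # 2. squashes the unbounded linear predictor into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    # 3. cross-entropy / negative log-likelihood
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    X = np.hstack([np.ones((len(X), 1)), X])  # bias column -> w0
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)                    # 1. linear predictor z = Xw, then 2. sigmoid
        w -= lr * X.T @ (p - y) / len(y)      # 4. gradient step = MLE via minimizing log-loss
    return w

# Toy usage on a roughly linearly separable problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
w = fit_logistic(X, y)
print(log_loss(y, sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ w)))
```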

Framework

When it's the best choice

  • When features are sparse and high-dimensional (e.g., bag-of-words, one-hot encoded IDs).
  • When the model must be updated frequently via online learning (streaming data).
  • When the model serves as a 'first-pass' filter in a recommendation system.

When to avoid

  • When the relationship between features and target is highly non-linear (e.g., XOR patterns).
  • When you have 'wide' data with few samples but many features (unless using strong L1 regularization).
  • When feature engineering resources are limited and you need the model to 'learn' interactions automatically.

Fast Heuristics

  • If latency < 5 ms and features are sparse → Logistic Regression.
  • If data is dense and non-linear → GBDT or Random Forest.
  • If data is unstructured (image/text) → CNN/Transformer.

Tradeoffs

Strengths

  • Among the fastest inference times of standard ML models: scoring is a single dot product plus a sigmoid.
  • Outputs are naturally calibrated probabilities.
  • Convex optimization guarantees a global optimum.
  • Weights directly represent feature importance (odds ratios; see the sketch below).
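A sketch of reading fitted weights as odds ratios, using scikit-learn's breast-cancer toy dataset; standardizing first makes the ratios comparable across features:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=5000).fit(StandardScaler().fit_transform(X), y)

# exp(weight) = multiplicative change in the odds per one standard deviation of the feature.
odds_ratios = np.exp(model.coef_[0])
top = sorted(zip(X.columns, odds_ratios), key=lambda t: abs(np.log(t[1])), reverse=True)[:5]
for name, ratio in top:
    print(f"{name}: odds ratio {ratio:.2f}")
```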

Weaknesses

  • Cannot capture feature interactions (e.g., A AND B) without manual cross-product features.
  • Assumes independence of irrelevant alternatives (in multinomial cases).
  • Highly sensitive to multicollinearity which can inflate weight variance.

Alternatives

Gradient Boosted Decision Trees (GBDT)

When it wins

When working with dense, tabular data where non-linear relationships dominate.

Key Difference

GBDTs learn non-linear decision boundaries through ensembles of trees rather than a single linear hyperplane.

Factorization Machines (FM)

When it wins

When you need to capture second-order feature interactions in sparse data without manual engineering.

Key Difference

FMs model interactions by learning a latent vector for each feature, allowing it to generalize to unseen feature combinations.
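A sketch of the second-order FM score described above, using the standard O(k·n) reformulation of the pairwise term; w0, w, and V are assumed to be already-learned parameters:

```python
import numpy as np

def fm_predict_proba(x, w0, w, V):
    """Second-order FM: w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j, passed through a sigmoid.

    x: (n,) feature vector; w: (n,) linear weights; V: (n, k) one latent vector per feature.
    The pairwise term uses the identity
        sum_{i<j} <V_i, V_j> x_i x_j = 0.5 * sum_f [(sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2],
    so unseen feature pairs still interact through their latent vectors.
    """
    linear = w0 + w @ x
    pairwise = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return 1.0 / (1.0 + np.exp(-(linear + pairwise)))

# Toy usage with random "learned" parameters.
rng = np.random.default_rng(0)
n, k = 6, 3
print(fm_predict_proba(rng.normal(size=n), 0.0, rng.normal(size=n), rng.normal(size=(n, k))))
```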

Support Vector Machines (SVM)

When it wins

When the margin between classes is the primary concern and the dataset is relatively small.

Key Difference

SVMs maximize the margin between classes using hinge loss, whereas LR maximizes the likelihood of the data.
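The two objectives side by side, with labels in {-1, +1} and z the signed margin y·(w·x + b) (a sketch for intuition):

```python
import numpy as np

z = np.linspace(-3, 3, 13)          # signed margin y * (w @ x + b)
hinge = np.maximum(0.0, 1.0 - z)    # SVM: exactly zero once the margin is met, no probabilities
logistic = np.log1p(np.exp(-z))     # LR: negative log-likelihood, strictly positive everywhere
```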

Execution

Must-hit talking points

  • Mention 'Log-Loss' as the optimization objective to show mathematical grounding.
  • Discuss 'L1 (Lasso) vs L2 (Ridge) Regularization' and how L1 can be used for feature selection (see the sketch after this list).
  • Explain 'Calibration'—why a predicted 0.7 should actually mean a 70% chance in the real world.
  • Highlight 'Feature Engineering' (binning, crossing, hashing) as the primary way to improve LR performance.
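A minimal sketch of the L1-vs-L2 point on synthetic data: the L1 penalty drives most uninformative weights exactly to zero (feature selection), while L2 only shrinks them (hyperparameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=50, n_informative=5, random_state=0)
X = StandardScaler().fit_transform(X)   # scale first so the penalty treats features fairly

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

print("non-zero weights, L1:", int((l1.coef_ != 0).sum()))   # sparse
print("non-zero weights, L2:", int((l2.coef_ != 0).sum()))   # dense, just smaller
```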

Anticipate follow-ups

  • Q: How do you handle imbalanced classes? (Downsampling, class weights, or adjusting the threshold; sketched after this list).
  • Q: How do you scale LR to billions of examples? (Distributed SGD or parameter servers).
  • Q: What happens if two features are perfectly correlated? (Weights become unstable/non-unique).
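For the imbalanced-classes follow-up, a sketch of the two cheapest levers in scikit-learn: re-weighting the loss and moving the decision threshold (the 0.30 cutoff is purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# ~5% positives
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)

# Lever 1: re-weight the log-loss so minority-class errors cost more.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Lever 2: keep the model, move the decision threshold off 0.5 based on precision/recall needs.
proba = clf.predict_proba(X)[:, 1]
flagged = proba >= 0.30
```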

Red Flags

Failing to normalize/scale features when using regularization.

Why it fails: Regularization penalizes the magnitude of weights; if features are on different scales, the penalty is applied unfairly, leading to biased coefficients.
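The standard guard against this is to put scaling and the model in one pipeline, so the penalty sees features on comparable scales and the exact same transform is reused at inference time (a sketch; X_train/y_train are placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler is fit on training data inside the pipeline, then reused for every prediction,
# so the L2 penalty shrinks all coefficients on a comparable scale.
model = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", C=1.0))
# model.fit(X_train, y_train); model.predict_proba(X_test)
```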

Assuming Logistic Regression handles outliers well.

Why it fails: The sigmoid function can still be heavily influenced by extreme values in the linear combination, pulling the decision boundary away from the optimal position.

Using it for multi-class problems without specifying the strategy.

Why it fails: Standard LR is binary; you must choose between One-vs-Rest (OvR) or Multinomial (Softmax) approaches, which have different computational and theoretical implications.
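A sketch contrasting the two strategies on the iris toy dataset; with its default lbfgs solver, recent scikit-learn fits a multinomial (softmax) model for multi-class targets, while OneVsRestClassifier makes the OvR choice explicit:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

softmax = LogisticRegression(max_iter=1000).fit(X, y)                   # one joint softmax model
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # K independent binary models

print(softmax.predict_proba(X[:1]))   # rows sum to 1 by construction
print(ovr.predict_proba(X[:1]))       # per-class binary scores renormalized to sum to 1
```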