CNN
Cheat Sheet
Prime Use Case
When the input data exhibits strong spatial or temporal local correlations, such as images, video frames, or audio spectrograms.
Critical Tradeoffs
- High parameter efficiency via weight sharing vs. limited global context capture
- Translation invariance vs. loss of precise spatial location (e.g., in max pooling)
- Inductive bias for local patterns vs. high computational cost of deep feature maps
Killer Senior Insight
CNNs succeed not just because they are deep, but because they impose a 'spatial inductive bias' that matches the structure of natural images: nearby pixels are more strongly related than distant ones.
Recognition
Common Interview Phrases
Common Scenarios
- Object Detection and Segmentation in Autonomous Driving
- Medical Imaging Analysis (MRI/CT scans)
- Facial Recognition and Biometric Verification
- Content Moderation (NSFW detection)
Anti-patterns to Avoid
- Using CNNs for tabular data with no inherent spatial ordering.
- Applying standard CNNs to non-Euclidean data like social graphs (use GNNs instead).
- Using deep CNNs for very small datasets without transfer learning.
The Problem
The Fundamental Issue
The 'Curse of Dimensionality' in image processing: a fully connected layer needs millions of weights even for a single small image (a 224x224 RGB input feeding 1,000 units already needs about 150 million), which leads to massive overfitting.
What breaks without it
Model size becomes prohibitive for memory (O(N*M) weights for a fully connected layer with N inputs and M outputs).
Loss of spatial hierarchy; the model treats pixels as independent features.
Lack of translation invariance; a cat in the top-left is 'different' from a cat in the bottom-right.
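The size gap can be made concrete with a little arithmetic. The sketch below is illustrative only: the 224x224 RGB input, the 1,000-unit dense layer, and the 64-filter 3x3 convolution are assumed numbers chosen for the example, not values taken from any particular model.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


# Assumed example input: a 224x224 RGB image, flattened for the dense layer.
n_inputs = 224 * 224 * 3
fc_count = dense_params(n_inputs, 1000)   # every pixel connects to every unit
conv_count = conv2d_params(3, 64, 3)      # one small kernel shared across positions

print(f"dense: {fc_count:,} parameters")
print(f"conv:  {conv_count:,} parameters")
```

The convolution's parameter count does not depend on the image resolution at all, which is the weight-sharing effect in practice.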
Why alternatives fail
MLPs (Multi-Layer Perceptrons) fail to scale to high-resolution inputs and ignore the 2D structure.
Traditional Computer Vision (SIFT/HOG) requires manual feature engineering and lacks the representational power of deep learning.
Mental Model
The Intuition
Imagine sliding a small magnifying glass (kernel) across a large map. The glass only looks for one specific shape (like a crossroad). Every time it finds that shape, it marks a 'hit' on a new, smaller map. By stacking these maps, you go from finding simple lines to finding complex landmarks.
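The magnifying-glass picture can be sketched in a few lines of plain Python. This is a minimal illustration, assuming stride 1, no padding, and a single channel. The function name `conv2d_valid` and the hand-picked edge kernel are hypothetical; in a real CNN the kernel values are learned. As in most deep learning libraries, the kernel is not flipped, so this is technically cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the patch under it and sum the products.
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# A 4x4 "map" with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

The resulting 3x3 map is non-zero only in the column where the edge sits, which is the 'hit' on the smaller map.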
Key Mechanics
Convolution: At each position, a learnable kernel is multiplied element-wise with the input patch under it and the products are summed.
Pooling: Downsampling to reduce dimensionality and provide local translation invariance.
Receptive Field: The specific area of the input image that affects a particular neuron's output.
1x1 Convolutions: Used for dimensionality reduction across channels (channel pooling) without changing spatial dimensions; the convolution itself is linear, so any added non-linearity comes from the activation that follows it.
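Two of these mechanics, pooling and 1x1 convolutions, are easy to sketch in plain Python. The helpers `max_pool2d` and `conv1x1` below are hypothetical names written for this illustration, assuming non-overlapping pooling windows and a channels-last layout; they are not a library API.

```python
def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling (stride equals window size) on a 2D list."""
    pooled = []
    for i in range(0, len(fmap) - size + 1, size):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, size):
            window = [fmap[i + di][j + dj] for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


def conv1x1(pixels, weights):
    """Mix channels at every pixel.

    pixels:  nested list shaped [H][W][C_in]
    weights: nested list shaped [C_out][C_in] (one row per output channel)
    """
    return [
        [[sum(w * x for w, x in zip(w_row, px)) for w_row in weights] for px in row]
        for row in pixels
    ]


# The same "hit" at two positions inside one 2x2 window pools to the same value.
hit_top_left = [[9, 0], [0, 0]]
hit_bottom_right = [[0, 0], [0, 9]]
same_after_pooling = max_pool2d(hit_top_left) == max_pool2d(hit_bottom_right)

# A 1x2 image with 3 channels is reduced to 2 channels; H and W are unchanged.
pixels = [[[1, 2, 3], [4, 5, 6]]]
weights = [[1, 0, 0], [0, 1, 1]]
reduced = conv1x1(pixels, weights)
```

The pooling comparison shows local translation invariance and, at the same time, why pooling discards the exact position of a feature.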
Framework
When it's the best choice
- Latency-sensitive vision tasks where specialized hardware (TPU/NPU) acceleration is available.
- Scenarios with limited training data where the CNN's strong inductive bias helps limit overfitting.
When to avoid
- Tasks requiring global reasoning across the entire image from the first layer (use Transformers).
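- Extremely long-range sequence modeling where receptive field growth is too slow: with stacked standard convolutions, connecting two positions L apart takes O(L) layers, versus O(1) for self-attention.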
Fast Heuristics
Tradeoffs
Strengths
- Parameter efficiency through weight sharing.
- Hierarchical feature learning (edges -> textures -> parts -> objects).
- Highly optimized kernels available in all major deep learning libraries.
Weaknesses
- Fixed input size requirements when the network ends in fully connected layers (usually solved by adaptive pooling or resizing).
- Poor at capturing long-range dependencies without very deep stacks.
- Susceptibility to adversarial attacks (small pixel changes can flip predictions).
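The fixed-input-size weakness comes from the classifier head, not from the convolutions themselves. A minimal sketch of the adaptive pooling workaround, assuming global average pooling and feature maps stored as one 2D list per channel (the function name `global_avg_pool` is hypothetical):

```python
def global_avg_pool(feature_maps):
    """Collapse each channel's [H][W] map to a single number, whatever H and W are."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


# Two inputs with different spatial sizes but the same number of channels (2).
small = [[[1, 3], [5, 7]], [[0, 0], [0, 4]]]                 # 2 channels, 2x2
large = [[[1] * 5 for _ in range(3)], [[2] * 5 for _ in range(3)]]  # 2 channels, 3x5

small_vec = global_avg_pool(small)
large_vec = global_avg_pool(large)
print(small_vec, large_vec)
```

Both vectors have one entry per channel, so the same fully connected head can follow regardless of the input resolution.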
Alternatives
Vision Transformers (ViT)
When it wins
When massive datasets are available and global context is critical.
Key Difference
Uses Self-Attention instead of Convolutions; has no inherent spatial inductive bias.
Capsule Networks
When it wins
When viewpoint invariance and part-whole relationships are more important than translation invariance.
Key Difference
Uses 'routing' instead of pooling to preserve spatial orientation and pose.
MLP-Mixer
When it wins
When a simpler, non-convolutional, non-attentional architecture is desired for competitive performance.
Key Difference
Repeatedly applies MLPs across patches and channels.
Execution
Must-hit talking points
- Explain the calculation of the Receptive Field and why it matters for object size.
- Discuss the 'Vanishing Gradient' problem and how Residual Connections (ResNet) mitigate it by giving gradients a shortcut path around each block.
- Mention Depthwise Separable Convolutions for mobile-optimized architectures.
- Highlight the importance of Data Augmentation (rotation, scaling) to improve invariance.
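For the depthwise separable talking point, a quick weight count usually makes the case. The sketch below ignores biases, and the layer sizes (128 input channels, 256 output channels, 3x3 kernel) are assumed values picked for illustration.

```python
def standard_conv_weights(c_in, c_out, k):
    """Every output channel has its own k x k kernel for every input channel."""
    return c_in * c_out * k * k


def depthwise_separable_weights(c_in, c_out, k):
    """One k x k kernel per input channel, then a 1x1 conv to mix channels."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


# Assumed layer sizes for the comparison.
standard = standard_conv_weights(128, 256, 3)
separable = depthwise_separable_weights(128, 256, 3)
print(f"standard:  {standard:,}")
print(f"separable: {separable:,}")
print(f"reduction: {standard / separable:.1f}x")
```

For a 3x3 kernel the saving approaches 9x as the number of output channels grows, which is why this factorization shows up in mobile-oriented architectures.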
Anticipate follow-ups
- Q: How do you handle varying input resolutions in a production pipeline?
- Q: What is the difference between Dilated (Atrous) Convolutions and standard ones?
- Q: How would you optimize a CNN for inference on an FPGA or mobile device?
Red Flags
Ignoring the Receptive Field size.
Why it fails: If the receptive field is smaller than the object you are trying to detect, the model will never 'see' the whole object, leading to poor performance.
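A receptive field check can be done on paper or with a short helper. This is a sketch of the standard recurrence for a plain chain of convolution or pooling layers, assuming each layer is described by a hypothetical `(kernel_size, stride, dilation)` tuple; it does not account for branches or skip connections.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers: list of (kernel_size, stride, dilation), ordered from input to output.
    """
    rf = 1      # a single input pixel sees itself
    jump = 1    # distance in input pixels between adjacent units at this depth
    for kernel_size, stride, dilation in layers:
        effective_k = dilation * (kernel_size - 1) + 1
        rf += (effective_k - 1) * jump
        jump *= stride
    return rf


three_plain = receptive_field([(3, 1, 1)] * 3)                   # three 3x3 convs
with_stride = receptive_field([(3, 2, 1), (3, 1, 1)])            # stride-2 first layer
dilated = receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)])     # dilations 1, 2, 4
print(three_plain, with_stride, dilated)
```

Comparing the result against the pixel size of the objects in the dataset shows whether more depth, stride, or dilation is needed. The dilated stack covers a much larger area than the plain one with the same number of weights.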
Over-using Max Pooling.
Why it fails: Aggressive pooling throws away spatial information (the 'where' info), which is catastrophic for tasks like image segmentation or pose estimation.
Not normalizing inputs.
Why it fails: CNNs are sensitive to the scale of input pixel values; feeding raw 0-255 values instead of normalizing to [0,1] or [-1,1] can lead to slow convergence or unstable training.
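A minimal sketch of the two scalings mentioned above, assuming 8-bit pixel values in the range 0 to 255. The helper names are hypothetical; real pipelines often go further and standardize each channel with dataset statistics.

```python
def scale_to_unit(pixels):
    """Map 8-bit pixel values from [0, 255] to [0, 1]."""
    return [p / 255.0 for p in pixels]


def scale_to_signed_unit(pixels):
    """Map 8-bit pixel values from [0, 255] to [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


raw = [0, 51, 255]
unit = scale_to_unit(raw)
signed = scale_to_signed_unit(raw)
print(unit, signed)
```

Whichever range is chosen, the same transform has to be applied at training and inference time.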