What is the CAP theorem and how does it affect system design?

The CAP theorem states that a distributed system can only guarantee two of three properties at the same time: Consistency (all nodes see the same data), Availability (every request gets a response), and Partition Tolerance (system works despite network failures). In practice, since network partitions are unavoidable, you choose between CP (e.g., HBase, ZooKeeper) and AP (e.g., Cassandra, DynamoDB) systems based on your business requirements.

How do you design a rate limiter?

Common algorithms: Token Bucket (allows bursts, refill at fixed rate), Leaky Bucket (smooth output, no bursts), Fixed Window Counter (simple, boundary spike risk), and Sliding Window Log/Counter (accurate, no boundary spikes). For distributed systems, store counters in Redis with INCR + EXPIRE. In Spring Boot, use Resilience4j RateLimiter or Bucket4j backed by Redis.

What is consistent hashing and when do you use it?

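Consistent hashing maps both data keys and servers onto a ring; each key belongs to the first server found clockwise from its position. When a server is added or removed, only about K/N keys are remapped (K = keys, N = nodes), whereas traditional hash mod N remaps nearly all keys. Use it for distributed caches (for example Memcached client libraries), partitioned stores such as Cassandra, load balancers routing sticky sessions, and sharding databases. Redis Cluster solves the same problem differently, with 16,384 fixed hash slots instead of a ring.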

SQL vs NoSQL — how do you choose?

Choose SQL (PostgreSQL, MySQL) when you need ACID transactions, complex joins, and your data schema is stable. Choose NoSQL when you need horizontal scalability at massive scale, flexible schema, or specific access patterns: key-value (Redis — caching, sessions), document (MongoDB — catalogs, user profiles), wide-column (Cassandra — time-series, write-heavy), graph (Neo4j — social networks, recommendations).

What are the key microservices design patterns?

Core patterns: API Gateway (single entry point), Service Discovery (Eureka/Consul), Circuit Breaker (Resilience4j — fail fast), Saga (distributed transactions via events), Outbox Pattern (reliable event publishing), CQRS (separate read/write models), Event Sourcing (store events not state), Sidecar (cross-cutting concerns in a proxy). In Spring Boot, implement with Spring Cloud Gateway, Resilience4j, and Spring Kafka.

System Design Hub

Everything a Java engineer needs to ace system design interviews — scalability, microservices, caching, Kafka, database design, and FAANG-level patterns. Built around real Spring Boot implementations.

📋 What's in This Hub

1. Scalability Fundamentals
2. Microservices Patterns
3. Database Design & Scaling
4. Caching Strategies
5. Messaging & Event-Driven
6. Key Design Patterns
7. Case Studies
8. Interview Q&A
9. 90-Day Roadmap
10. All Articles

The 4 Qualities of Every Well-Designed System

Every system design question is really asking how you balance these four

📈

Scalability

Handle 10× traffic without redesigning. Horizontal scaling, stateless services, sharding, read replicas, CDN.

🛡️

Reliability

Keep working when components fail. Redundancy, circuit breakers, retries with back-off, graceful degradation.

⚡

Performance

Fast response under load. Caching at every layer, async processing, connection pooling, efficient DB queries.

🔧

Maintainability

Easy to change and operate. SOLID principles, clear service boundaries, observability, API versioning.

1. Scalability Fundamentals

The concepts every interviewer expects you to know cold

Horizontal vs Vertical Scaling

Vertical = bigger machine. Horizontal = more machines. Prefer horizontal for availability; vertical has a hard ceiling. Stateless services scale horizontally; stateful ones need sticky sessions or external state stores (Redis).

Load Balancing

Distribute traffic across instances: Round Robin (equal load), Least Connections (variable request times), IP Hash (sticky sessions), Consistent Hash (cache affinity). In AWS: ALB for HTTP/HTTPS, NLB for TCP/UDP at scale.

CAP Theorem

Distributed systems can guarantee only 2 of 3: Consistency, Availability, Partition Tolerance. Since partitions are unavoidable, choose CP (HBase, ZooKeeper, etcd) or AP (Cassandra, DynamoDB, CouchDB) based on your tolerance for stale reads vs unavailability.

Consistent Hashing

Maps servers and keys onto a ring. Adding/removing a server remaps only ~K/N keys. Used in Cassandra, CDNs, and any distributed cache where you want minimal reshuffling on topology changes. Redis Cluster pursues the same goal with 16,384 fixed hash slots rather than a ring.
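
The ring lookup can be sketched with a sorted map. This is a minimal illustration, assuming Java 17 or later; the class name `HashRing`, the virtual-node count, and the choice of MD5 for spreading positions are assumptions made for the example, not something this hub prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Minimal consistent-hash ring with virtual nodes (illustrative, not production code). */
final class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    HashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    /** Places the server on the ring at several positions to even out the load. */
    void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    /** Walks clockwise from the key's position to the first virtual node, wrapping at the end. */
    String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    /** First 8 bytes of an MD5 digest, used only to spread positions around the ring. */
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Adding a fourth server to a three-server ring should move roughly a quarter of the keys, and every moved key lands on the new server; the rest stay where they were.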

Rate Limiting

Protect your API from overuse. Algorithms: Token Bucket (allow bursts), Leaky Bucket (smooth rate), Sliding Window Counter (accurate, no boundary spikes). Implement with Resilience4j or Bucket4j + Redis in Spring Boot.
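
To make the Token Bucket idea concrete, here is a minimal single-node sketch in plain Java. It is not the Resilience4j or Bucket4j API; the class name `TokenBucket` and the decision to pass the clock in as a parameter (so the behaviour is easy to test) are assumptions of this example.

```java
/** Minimal single-node token bucket; the caller supplies the current time in nanoseconds. */
final class TokenBucket {
    private final double capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(int capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;              // start full so an initial burst is allowed
        this.lastRefillNanos = nowNanos;
    }

    /** Refills based on elapsed time, then consumes one token if one is available. */
    synchronized boolean tryAcquire(long nowNanos) {
        if (nowNanos > lastRefillNanos) {
            double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
            tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
            lastRefillNanos = nowNanos;
        }
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;                        // caller would answer 429 Too Many Requests
    }
}
```

In a real service the caller would pass `System.nanoTime()`. The capacity sets the largest burst, and the refill rate sets the sustained request rate.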

Back-of-Envelope Estimation

Quickly size your system: QPS, storage per day/year, bandwidth. Useful ratios: 1 million RPS = ~1000 servers at 1000 RPS/server; 1 TB/day = ~12 MB/sec write throughput. Practice estimating before designing.
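
The two ratios above are plain arithmetic, sketched here so the numbers can be checked. The class and method names are hypothetical, and a decimal terabyte (10^12 bytes) is assumed.

```java
/** Back-of-envelope helpers; all names are illustrative. */
final class Estimates {
    static final long SECONDS_PER_DAY = 86_400L;

    /** Average bytes per second needed to absorb the given daily volume. */
    static double bytesPerSecond(double bytesPerDay) {
        return bytesPerDay / SECONDS_PER_DAY;
    }

    /** Servers needed for a target request rate, rounding up. */
    static long serversFor(long targetRps, long rpsPerServer) {
        return (targetRps + rpsPerServer - 1) / rpsPerServer;
    }

    public static void main(String[] args) {
        double oneTerabyte = 1e12; // decimal TB, an assumption of this sketch
        System.out.println("1 TB/day in MB/s: " + bytesPerSecond(oneTerabyte) / 1e6);
        System.out.println("servers for 1M RPS: " + serversFor(1_000_000, 1_000));
    }
}
```

1 TB/day works out to about 11.6 MB/s on average, which the text rounds to 12 MB/s; peak traffic is usually a multiple of the average, so size for the peak.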

2. Microservices Patterns with Spring Boot

The patterns that separate junior from senior designs — with Spring Boot implementations

Microservices

Microservices Design Patterns Explained with Spring Boot

API Gateway, Circuit Breaker, Service Discovery, Saga, CQRS — complete guide with Spring Boot implementations and real trade-offs.

📅 2026-04-28

Microservices

Microservices Design Patterns with Spring Boot: A Complete Guide

Deep dive into API Gateway, Circuit Breaker with Resilience4j, distributed tracing, and service mesh concepts.

📅 2026-03-29

Microservices

Building a Scalable Spring Boot Microservices Example with Docker

Step-by-step: build multiple Spring Boot services, containerize with Docker, orchestrate with Docker Compose, add service discovery.

📅 2026-03-25

Pattern	Problem it solves	Spring Boot tool
API Gateway	Single entry point, auth, rate limiting, routing	Spring Cloud Gateway
Circuit Breaker	Fail fast when downstream is slow/down	Resilience4j
Service Discovery	Dynamic service location without hardcoded IPs	Eureka / Consul
Saga	Distributed transactions across services	Spring Kafka + compensating events
Outbox Pattern	Guarantee event is published when DB write succeeds	Debezium + Kafka / Spring Batch
CQRS	Separate read and write models for performance	Spring Data + separate query service
Bulkhead	Isolate failures — one slow service doesn't starve others	Resilience4j Bulkhead
Sidecar	Cross-cutting concerns (logging, tracing) without changing service code	Envoy / Istio sidecar
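
To show what "fail fast" means in the Circuit Breaker row, here is a minimal state machine in plain Java. It is a teaching sketch, not the Resilience4j API: the class name, the consecutive-failure threshold, and the explicit clock parameter are all assumptions of this example.

```java
import java.util.function.Supplier;

/** Minimal circuit breaker: opens after N consecutive failures, retries after a cool-down. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAtMillis;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized State state() {
        return state;
    }

    /**
     * Runs the action, or returns the fallback immediately while the breaker is open.
     * Holding the lock during the call keeps the sketch short; a real breaker would not.
     */
    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAtMillis < openMillis) {
                return fallback.get();       // fail fast without touching the downstream
            }
            state = State.HALF_OPEN;         // cool-down elapsed: allow one trial call
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAtMillis = nowMillis;
            }
            return fallback.get();
        }
    }
}
```

Production libraries typically track a failure rate over a sliding window rather than consecutive failures, which is one reason to prefer them over a hand-rolled breaker.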

3. Database Design & Scaling

SQL vs NoSQL, sharding, replication, and the ACID vs BASE trade-off

SQL vs NoSQL — choosing right

SQL (PostgreSQL, MySQL): ACID, complex joins, stable schema. Best for financial data, orders, user accounts.
NoSQL: flexible schema, horizontal scale. Key-value (Redis), Document (MongoDB), Wide-column (Cassandra), Graph (Neo4j).

Read Replicas & Sharding

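Read replicas: route reads to replicas, writes to primary. Works when reads >> writes (common in most apps). Replication is usually asynchronous, so send any read that must see a just-committed write to the primary.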
Sharding: partition data horizontally — range-based, hash-based, directory-based. Adds complexity; try replicas and caching first.

Indexing Strategies

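B-Tree index for range queries and equality. Composite index: column order matters because only a leftmost prefix of the index can be used; put columns filtered by equality before range columns, and among those favour the more selective ones. Covering index: includes all queried columns, so the query is answered from the index alone. Partial index for filtered queries. Avoid over-indexing, since every index slows writes.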

ACID vs BASE

ACID (SQL): Atomicity, Consistency, Isolation, Durability — strong guarantees, harder to scale.
BASE (NoSQL): Basically Available, Soft state, Eventually consistent — better scalability, weaker guarantees. Many modern systems mix both.

Connection Pooling (HikariCP)

Spring Boot uses HikariCP by default. Key settings: maximumPoolSize (default 10), connectionTimeout (default 30s; reduce for fast-fail), idleTimeout. Rule of thumb: pool size ≈ (2 × CPU cores) + number of disk spindles. Treat that as a starting point and tune from measured connection wait times; a larger pool is not automatically faster.

Database Migration (Flyway / Liquibase)

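Version-control your schema changes. Flyway: SQL-based, simple, runs on startup. Liquibase: XML/YAML/JSON changesets, rollback support. Both integrate natively with Spring Boot: add the dependency and migrations run on startup, configured through spring.flyway.* or spring.liquibase.* properties (for example spring.flyway.enabled=true).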

4. Caching Strategies

Cache at the right layer with the right strategy — and know when not to cache

Strategy	How it works	Best for	Risk
Cache-Aside (Lazy Loading)	App checks cache first; on miss, fetches from DB and populates cache	Read-heavy, cacheable objects	Cache stampede on cold start
Write-Through	Write to the cache and the DB synchronously on every write	Reads that must see the latest write; workloads that tolerate slower writes	Writes slower; cache fills with rarely-read data
Write-Behind (Write-Back)	Write to cache only; async flush to DB later	Very write-heavy workloads	Data loss if cache fails before flush
Read-Through	Cache sits in front of DB; cache fetches data on miss	Simpler application code	First request always slow; need TTL discipline
Refresh-Ahead	Proactively refresh cache before TTL expires	Predictable access patterns	Wasteful if predictions wrong
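
The Cache-Aside row can be sketched as follows, assuming Java 17 or later. The in-memory map stands in for Redis and the `loader` function stands in for a database query; both, like the class name `CacheAside`, are assumptions made for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Cache-aside sketch: check the cache, fall back to the loader on a miss, then populate. */
final class CacheAside<K, V> {
    private record Entry<T>(T value, long expiresAtMillis) {}

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    private final long ttlMillis;

    CacheAside(Function<K, V> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    V get(K key, long nowMillis) {
        Entry<V> hit = cache.get(key);
        if (hit != null && hit.expiresAtMillis() > nowMillis) {
            return hit.value();                  // cache hit
        }
        V loaded = loader.apply(key);            // miss or expired: go to the source of truth
        if (loaded != null) {
            cache.put(key, new Entry<>(loaded, nowMillis + ttlMillis));
        }
        return loaded;
    }

    /** Call after writing to the database so the next read reloads fresh data. */
    void invalidate(K key) {
        cache.remove(key);
    }
}
```

Note that two concurrent misses for the same key both call the loader here, which is exactly the stampede risk listed in the table.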

⚡ Cache Eviction Policies

LRU — evict least recently used (most common)
LFU — evict least frequently used (better for skewed access)
TTL — expire after fixed time (simplest, prevents stale data)
FIFO — evict oldest entry first (rarely optimal)
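
For LRU specifically, the JDK already provides the building block: a `LinkedHashMap` in access order with `removeEldestEntry` overridden. The sketch below is single-threaded; the class name `LruCache` is an assumption of this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** LRU cache on top of LinkedHashMap; not thread-safe without external synchronisation. */
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true);      // accessOrder = true: get() moves an entry to the "recent" end
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insert; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

With a capacity of 2, inserting a, then b, reading a, then inserting c evicts b, because a was touched more recently.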

🔴 Cache Problems to Know

Cache Penetration — requests for non-existent keys bypass cache. Fix: bloom filter or cache null results.
Cache Avalanche — many keys expire simultaneously. Fix: stagger TTLs with jitter.
Cache Stampede — many requests hit DB on cache miss. Fix: mutex lock or probabilistic early expiry.
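
Two of these fixes are small enough to sketch directly: TTL jitter for avalanche, and a per-key "single flight" for stampede, used here in place of a distributed mutex. The class names are hypothetical and the code covers one JVM only; across several instances you would need a shared lock or accept one load per instance.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

final class CacheGuards {
    /** Avalanche fix: spread expiry over [base, base + maxJitter] so keys do not expire together. */
    static long ttlWithJitter(long baseMillis, long maxJitterMillis) {
        return baseMillis + ThreadLocalRandom.current().nextLong(maxJitterMillis + 1);
    }
}

/** Stampede fix: concurrent misses for one key share a single in-flight load. */
final class SingleFlight<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    V load(K key, Function<K, V> loader) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            return existing.join();          // another caller is already loading this key
        }
        try {
            V value = loader.apply(key);
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e);   // waiters see the failure instead of hanging
            throw e;
        } finally {
            inFlight.remove(key, mine);      // later misses trigger a fresh load
        }
    }
}
```

The caller would wrap its database read in `load(...)` on a cache miss and write the result to the cache with a jittered TTL.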

5. Messaging & Event-Driven Architecture

Decouple services, absorb traffic spikes, and build resilient async systems with Kafka

Kafka

Building Scalable Systems with Event-Driven Architecture using Spring Boot and Kafka

Producers, consumers, topics, partitions, consumer groups, offset management, and reliability guarantees — with Spring Kafka.

📅 2026-04-19

Kafka

Event-Driven Architecture with Spring Boot and Kafka: Complete Guide

Design patterns for event-driven systems: event sourcing, CQRS, saga choreography, dead letter topics, and exactly-once semantics.

📅 2026-03-27

Case Study

System Design: Scalable Notification System

End-to-end design of a push/email/SMS notification system — fan-out, priority queues, deduplication, and delivery guarantees.

📅 2026

Kafka vs RabbitMQ — When to Use Which

Use Kafka when:

High throughput (millions of msgs/sec)
Replay events (event sourcing, audit logs)
Multiple independent consumers per topic
Stream processing (Kafka Streams, Flink)

Use RabbitMQ when:

Complex routing (topic/fanout/direct exchanges)
Message TTL and per-message priority
Low-latency task queues (job processing)
Simpler ops and smaller scale (< 10K msg/sec)

6. Key Architectural Patterns

The patterns you will be asked to draw on a whiteboard at Meta/FAANG interviews

CQRS (Command Query Responsibility Segregation)

Separate the write model (Commands → DB) from the read model (Queries → optimised read store). Read side can use a denormalised view, Elasticsearch, or Redis. Dramatically improves read performance at the cost of eventual consistency.

Event Sourcing

Store events (things that happened) instead of current state. Replay events to rebuild state at any point in time. Natural fit with CQRS. Use Kafka as the event log. Downside: query complexity, eventual consistency, schema evolution.
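
"Replay events to rebuild state" amounts to folding over the event history. A minimal sketch, assuming Java 17 or later and a hypothetical bank-account domain (the event types and class name are invented for this example):

```java
import java.util.List;

/** Event-sourcing sketch: state is derived from events, never stored directly. */
final class AccountEvents {
    interface Event {}
    record Deposited(long cents) implements Event {}
    record Withdrawn(long cents) implements Event {}

    /** Current balance is a left fold over the event history. */
    static long balanceAfter(List<Event> history) {
        long balance = 0;
        for (Event e : history) {
            if (e instanceof Deposited d) {
                balance += d.cents();
            } else if (e instanceof Withdrawn w) {
                balance -= w.cents();
            }
        }
        return balance;
    }
}
```

Replaying only a prefix of the history gives the state as of that point in time, which is what makes audits and debugging straightforward with this pattern.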

Saga Pattern (Distributed Transactions)

Choreography: services emit and react to events — no central coordinator, loose coupling. Orchestration: a saga orchestrator directs each step — simpler to track, single point of complexity. Use orchestration when business logic is complex.

Outbox Pattern

Write to DB and to an outbox table in the same transaction. A relay process reads the outbox and publishes to Kafka. Guarantees at-least-once delivery without distributed transactions. Essential when your service writes to DB and emits events.

Strangler Fig

Incrementally migrate a monolith to microservices by routing new features to new services while keeping the old system running. Route via API Gateway. Retire old code when traffic is fully migrated. Low-risk, incremental approach.

Bulkhead

Isolate critical services from slow ones using separate thread pools (Resilience4j ThreadPoolBulkhead) or separate service instances. If the recommendation service hangs, the checkout service keeps working. Named after ship hull compartments.
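
The isolation idea can also be shown with a semaphore instead of a separate thread pool. The sketch below is a semaphore-based variant, not the Resilience4j ThreadPoolBulkhead named above; the class name and the reject-when-full policy are assumptions of this example.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Semaphore bulkhead: caps concurrent calls to one dependency and rejects the overflow. */
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> action, Supplier<T> whenFull) {
        if (!permits.tryAcquire()) {
            return whenFull.get();           // compartment is full: shed load instead of queueing
        }
        try {
            return action.get();
        } finally {
            permits.release();               // always free the slot, even if the action throws
        }
    }
}
```

Giving the recommendation client its own small bulkhead means a hang there can occupy at most that many request threads, leaving the rest free for checkout.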

7. System Design Case Studies

Classic interview problems — how to approach each one in 45 minutes

System	Key challenges	Core components	Scale hint
URL Shortener (TinyURL)	Unique ID generation, redirect latency, analytics	Base62 encoding, Redis cache, DB for mapping, CDN for redirects	100M URLs, 10B redirects/day
Notification System	Fan-out at scale, deduplication, delivery guarantees, priority	Kafka topics per channel, priority queue, idempotency key, retry with DLQ	10M notifications/day
Rate Limiter	Distributed state, accuracy vs performance, API key vs user	Redis INCR+EXPIRE (fixed window counter) or sorted sets (sliding window log), Token Bucket, API Gateway layer	1M RPS across 10 servers
News Feed / Social Feed	Fan-out vs fan-in, ranking, pagination, celebrity users	Push model (write to each follower's feed), pull for celebrities, Redis sorted sets	500M users, 10M posts/day
Search Autocomplete	Low latency, prefix matching, trending suggestions	Trie (in-memory), or Elasticsearch prefix query, Redis sorted set for frequency	10M searches/day, <100ms p99
Distributed Cache	Consistent hashing, eviction, replication, hot key problem	Consistent hash ring, LRU eviction, primary/replica replication, client-side sharding	1TB cache, 1M QPS

Deep Dive: Scalable Notification System →

8. System Design Interview Q&A

The questions Meta, Google, and Amazon actually ask — with structured answers

Q1. How would you design a URL shortener like TinyURL?

Clarify: read-heavy (100:1 read/write), need custom aliases? analytics?
Write path: generate unique 7-char Base62 ID (counter + Base62 encode, or MD5+truncate), store short_id → long_url in a DB (PostgreSQL or DynamoDB).
Read path: GET /{id} → check Redis cache first → on miss, DB lookup → HTTP 301/302 redirect. Use Redis with TTL for hot URLs (cache-aside).
Scale: Read replicas for DB, CDN for the redirect response, consistent hashing to shard the cache.
Trade-offs: 301 (permanent) vs 302 (temporary) — 301 is cached by browser (fewer hits, less analytics); 302 is not cached (more server hits, full analytics).
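
The "counter + Base62 encode" option on the write path can be sketched as below. The alphabet order (digits, upper case, lower case) and the class name are assumptions of this example; any fixed ordering of the 62 characters works as long as encode and decode agree.

```java
/** Base62 codec for turning a numeric ID into a short URL token. */
final class Base62 {
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62)));   // least significant digit first
            id /= 62;
        }
        return sb.reverse().toString();
    }

    static long decode(String token) {
        long id = 0;
        for (char c : token.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("not a Base62 character: " + c);
            }
            id = id * 62 + digit;
        }
        return id;
    }
}
```

Seven characters cover 62^7 = 3,521,614,606,208 IDs, about 3.5 trillion, which is far more than the 100M URLs in the scale hint. Sequential counters produce guessable tokens, so scramble the counter first if that matters.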

Q2. How would you design a rate limiter for an API?

Clarify: per user, per IP, per API key? Global or per-service? Hard limit or soft?
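Algorithm choice: Sliding Window Log (most accurate, no boundary spike): keep one Redis sorted set per user, with each request's timestamp as the score. If memory per user is a concern, a Sliding Window Counter approximates the same limit from two fixed-window counts.
Redis implementation: ZREMRANGEBYSCORE user_id:requests 0 (now - window) to drop old entries, ZADD user_id:requests now <unique request id> to record the request, then ZCARD to count. The member must be unique rather than the bare timestamp, or two requests in the same millisecond collapse into one entry. Run the three commands in MULTI/EXEC or a Lua script so they apply atomically. If count > limit → 429 Too Many Requests.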
Spring Boot: Implement as a HandlerInterceptor or Spring Cloud Gateway filter with Bucket4j + Redis backend.
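Edge cases: What if Redis is down? Fail open (allow) or fail closed (block)? Fail open favours availability and suits most public APIs; fail closed is the safer choice when the limit protects something costly or security-sensitive, such as login attempts.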
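
The sorted-set steps can be mirrored in memory to see the logic without a Redis instance. In this sketch each deque plays the role of one sorted set of request timestamps; the class name and the explicit clock parameter are assumptions. Unlike a version that records every request before counting, this one does not log rejected requests.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** In-memory sliding window log: at most `limit` requests per client in any window. */
final class SlidingWindowLog {
    private final int limit;
    private final long windowMillis;
    private final Map<String, Deque<Long>> logs = new HashMap<>();

    SlidingWindowLog(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean allow(String clientId, long nowMillis) {
        Deque<Long> log = logs.computeIfAbsent(clientId, k -> new ArrayDeque<>());
        // Like ZREMRANGEBYSCORE: drop timestamps that have left the window.
        while (!log.isEmpty() && log.peekFirst() <= nowMillis - windowMillis) {
            log.pollFirst();
        }
        // Like ZCARD: count what is left in the window.
        if (log.size() >= limit) {
            return false;                    // caller answers 429 Too Many Requests
        }
        log.addLast(nowMillis);              // like ZADD: record this request
        return true;
    }
}
```

Memory grows with the number of requests kept per client, which is the main cost of the log approach compared with a counter.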

Q3. How would you design a notification system?

Clarify: channels (push, email, SMS), volume (10M/day), priority (transactional > promotional), delivery guarantee (at-least-once).
Architecture: API Service → Kafka topics per channel → Channel Workers → 3rd party (FCM/APNs for push, SendGrid for email, Twilio for SMS).
Reliability: Idempotency key on each notification to prevent duplicates. Dead-letter topic for failed messages. Retry with exponential back-off.
Fan-out at scale: If user has 10M followers, don't send 10M DB writes synchronously — use a fan-out worker that reads the follower list and publishes to Kafka in batches.
Priority: Separate Kafka topics (high-priority, low-priority) with different consumer lag SLAs.
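
The retry delay under Reliability can be computed as below. This sketch adds random jitter on top of exponential back-off, which the answer above does not mention but which helps avoid synchronised retries; the class name, the cap, and the "full jitter" choice are assumptions of this example.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Retry delay helper: exponential growth with a cap, plus random jitter. */
final class Backoff {
    /** Upper bound for the given attempt (0-based): min(cap, base * 2^attempt). */
    static long maxDelayMillis(int attempt, long baseMillis, long capMillis) {
        int shift = Math.min(Math.max(attempt, 0), 30);   // clamp so the shift cannot overflow
        return Math.min(capMillis, baseMillis * (1L << shift));
    }

    /** Full jitter: a uniformly random delay between 0 and the upper bound, inclusive. */
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long ceiling = maxDelayMillis(attempt, baseMillis, capMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

With a 100 ms base and a 10 s cap, the bounds for attempts 0 to 3 are 100, 200, 400, and 800 ms. After a fixed number of attempts the worker would route the message to the dead-letter topic instead of retrying again.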

Q4. CAP Theorem — explain with a real example.

A distributed database across two data centers is separated by a network partition (P is unavoidable). Now you must choose:
CP (Consistent + Partition Tolerant): Return an error or wait until the network heals rather than return stale data. Example: a bank transfer — stale balance data could cause double-spend. Use: ZooKeeper, HBase, etcd.
AP (Available + Partition Tolerant): Return the best available data even if stale. Example: a shopping cart — showing a slightly stale cart is acceptable. Use: Cassandra, DynamoDB (eventually consistent reads), CouchDB.
Nuance: CAP only describes behaviour during a partition. PACELC extends it: even without partitions, there is a latency-consistency trade-off. Modern systems like DynamoDB let you choose consistency level per request.

Q5. How do you handle the database bottleneck as traffic grows?

Step-by-step progression (don't jump straight to sharding):
1. Query optimisation — proper indexes, avoid N+1 queries, use EXPLAIN ANALYZE.
2. Connection pooling — HikariCP, tune pool size to 2×CPU+disk spindles.
3. Caching — Redis cache-aside for frequently read, rarely changing data.
4. Read replicas — route reads to replicas; for read-heavy apps this takes most of the load off the primary.
5. Vertical scaling — bigger instance (quick win, limited ceiling).
6. Sharding — partition by user ID or region. Adds complexity; do this last.
7. NoSQL migration — if access pattern is key-value or time-series, move that data to a purpose-built store.

Q6. Explain the Saga pattern for distributed transactions.

A distributed transaction (e.g., place order → reserve inventory → charge payment) cannot use a single DB transaction across services. Saga breaks it into local transactions with compensating actions on failure.
Choreography-based Saga: Each service publishes an event when its step succeeds. Next service listens and acts. On failure, the service publishes a failure event and each prior service runs its compensating action. Simple, loose coupling, hard to track overall state.
Orchestration-based Saga: A central orchestrator (a Spring Boot service) calls each step and handles failures by calling compensating endpoints. Easier to debug and monitor. Recommended when the workflow is complex.
Spring Boot: Implement with Spring Kafka (choreography) or with Spring Statemachine (orchestration). Library: Axon Framework provides built-in saga support.
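
The orchestration variant reduces to "run steps in order, and on failure undo the completed ones in reverse". Here is a minimal in-process sketch, assuming Java 17 or later; the `Step` record and class name are hypothetical, and in a real system each action and compensation would be a call to another service, with the saga's progress persisted so it survives a restart.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Orchestration-based saga sketch with compensating actions. */
final class SagaOrchestrator {
    /** One local transaction plus the action that undoes it. */
    record Step(String name, Runnable action, Runnable compensation) {}

    /** Returns true if every step succeeded; otherwise compensates in reverse and returns false. */
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action().run();
                completed.push(step);        // remember it in case a later step fails
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation().run();   // most recent step first
                }
                return false;
            }
        }
        return true;
    }
}
```

For the order example above, a failed payment would trigger "release inventory" and then "cancel order", leaving no step half-applied. Compensations should be idempotent, because a retry after a crash may run them twice.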

9. The 90-Day Java + System Design Roadmap

Focused on what actually matters for Meta/FAANG — skip what doesn't

Month 1 — Java Foundations & Spring Boot Mastery

W1

Core Java Deep Dive

Java memory model, GC algorithms (G1, ZGC), virtual threads (Project Loom), CompletableFuture, ExecutorService, happens-before guarantees.

Can you explain the Java memory model from first principles?
What is a happens-before relationship?
When do you use virtual threads vs platform threads?

W2

Spring Boot Internals

Auto-configuration internals, bean lifecycle, @Conditional annotations, Spring Security filter chain, JPA N+1 problem, Spring Boot Actuator.

How does @SpringBootApplication auto-configure beans?
How does SecurityFilterChain work under the hood?
How do you detect and fix the N+1 select problem?

W3

Testing with JUnit 5 & Mockito

Write unit tests with @Mock/@InjectMocks, slice tests with @WebMvcTest, integration tests with @SpringBootTest + Testcontainers, ArgumentCaptor, verify patterns.

Difference between @Mock and @MockBean?
How do you test a Kafka consumer with Testcontainers?
What is a test pyramid and why does it matter?

W4

SOLID + Design Patterns in Practice

Apply SOLID to a real codebase. Master the 10 most asked patterns: Factory, Builder, Strategy, Observer, Decorator, Proxy, Singleton, Command, Template Method, Chain of Responsibility.

Identify where each pattern is used in Spring itself
Refactor a God class to follow SOLID
Implement Strategy pattern for payment processing

Month 2 — System Design from Scratch

W5

Scalability & Estimation

Back-of-envelope estimation, horizontal vs vertical scaling, load balancing algorithms, stateless design, CDN, consistent hashing.

Estimate storage for 100M users uploading 1 photo/day at 200KB each
Design a load balancer that routes by consistent hash
When do you choose sticky sessions over stateless?

W6

Databases & Caching

SQL vs NoSQL decision framework, read replicas, sharding strategies, caching patterns (cache-aside, write-through), Redis data structures, cache stampede/avalanche/penetration fixes.

Design a caching layer for a product catalog (10M products, 1M RPS reads)
How do you handle hot keys in Redis?
When would you choose Cassandra over PostgreSQL?

W7

Microservices & Kafka

API Gateway, Circuit Breaker (Resilience4j), service discovery, Saga patterns, Outbox pattern, Kafka producers/consumers, partitions, consumer groups, at-least-once delivery.

How do you implement a distributed transaction across 3 services without 2PC?
What happens when a Kafka consumer crashes mid-processing?
How do you prevent a slow downstream from cascading to all services?

W8

Design 3 Classic Systems End-to-End

URL shortener, notification system, rate limiter. For each: gather requirements, estimate scale, design components, identify bottlenecks, iterate.

Draw full architecture diagrams on paper (no IDE)
Time yourself: 45 minutes per system
Record yourself explaining — watch it back

Month 3 — Interview Simulation & Polish

W9

DSA: Patterns Not Problems

Focus on 14 patterns, including Sliding Window, Two Pointers, Fast/Slow Pointers, Merge Intervals, Cyclic Sort, BFS/DFS, Dynamic Programming (Memoization + Tabulation), Topological Sort, and Union-Find.

W10

Behavioural + Leadership Principles

STAR format for 10 key situations. Meta specifically tests: Move Fast, Be Direct, Build Awesome Things. Prepare 5 strong leadership stories covering conflict, failure, ambiguity, impact, and collaboration.

W11-12

Mock Interviews — Full Loop Simulation

2 coding interviews (LeetCode medium/hard), 1 system design, 1 behavioural. Use pramp.com, interviewing.io, or find a peer. After each mock: write a debrief noting what you'd do differently.

Can you explain your solution clearly while coding?
Do you clarify requirements before jumping in?
Do you proactively mention trade-offs?

All System Design & Architecture Articles

Microservices

Start Your System Design Journey Today

11 years of Java experience + system design mastery = FAANG-ready. Start with the 90-day roadmap above.

View 90-Day Roadmap → Interview Prep Hub →
