System Design Hub
Everything a Java engineer needs to ace system design interviews — scalability, microservices, caching, Kafka, database design, and FAANG-level patterns. Built around real Spring Boot implementations.
The 4 Qualities of Every Well-Designed System
Every system design question is really asking how you balance these four
Scalability
Handle 10× traffic without redesigning. Horizontal scaling, stateless services, sharding, read replicas, CDN.
Reliability
Keep working when components fail. Redundancy, circuit breakers, retries with back-off, graceful degradation.
Performance
Fast response under load. Caching at every layer, async processing, connection pooling, efficient DB queries.
Maintainability
Easy to change and operate. SOLID principles, clear service boundaries, observability, API versioning.
1. Scalability Fundamentals
The concepts every interviewer expects you to know cold
Horizontal vs Vertical Scaling
Vertical = bigger machine. Horizontal = more machines. Prefer horizontal for availability; vertical has a hard ceiling. Stateless services scale horizontally; stateful ones need sticky sessions or external state stores (Redis).
Load Balancing
Distribute traffic across instances: Round Robin (equal load), Least Connections (variable request times), IP Hash (sticky sessions), Consistent Hash (cache affinity). In AWS: ALB for HTTP/HTTPS, NLB for TCP/UDP at scale.
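As a rough illustration of two of the algorithms above, here is a minimal in-memory sketch of Round Robin and Least Connections in plain Java. The class names and the `pick`/`acquire`/`release` methods are hypothetical and chosen for this example only; a real load balancer (ALB, NLB, NGINX) also handles health checks, weights and timeouts.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class Balancers {
    /** Round Robin: hand out backends in a fixed rotation. */
    static class RoundRobin {
        private final List<String> backends;
        private final AtomicInteger next = new AtomicInteger();

        RoundRobin(List<String> backends) {
            this.backends = List.copyOf(backends);
        }

        String pick() {
            // floorMod keeps the index valid even after the counter overflows
            int i = Math.floorMod(next.getAndIncrement(), backends.size());
            return backends.get(i);
        }
    }

    /** Least Connections: prefer the backend with the fewest in-flight requests. */
    static class LeastConnections {
        private final Map<String, AtomicInteger> active = new ConcurrentHashMap<>();

        LeastConnections(List<String> backends) {
            for (String b : backends) {
                active.put(b, new AtomicInteger());
            }
        }

        /** Picks a backend and counts the new request against it. */
        String acquire() {
            String best = null;
            int bestCount = Integer.MAX_VALUE;
            for (Map.Entry<String, AtomicInteger> e : active.entrySet()) {
                int count = e.getValue().get();
                if (count < bestCount) {
                    best = e.getKey();
                    bestCount = count;
                }
            }
            // Simplification: scan and increment are not one atomic step
            active.get(best).incrementAndGet();
            return best;
        }

        /** Call when the request finishes so the count goes back down. */
        void release(String backend) {
            active.get(backend).decrementAndGet();
        }
    }
}
```

Round Robin ignores how long requests take, which is why Least Connections tends to behave better when request durations vary widely.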
CAP Theorem
Distributed systems can guarantee only 2 of 3: Consistency, Availability, Partition Tolerance. Since partitions are unavoidable, choose CP (HBase, ZooKeeper, etcd) or AP (Cassandra, DynamoDB, CouchDB) based on your tolerance for stale reads vs unavailability.
Consistent Hashing
Maps servers and keys onto a ring. Adding or removing a server remaps only about K/N keys (K = total keys, N = servers), whereas hash(key) mod N would remap almost all of them. Used in Redis Cluster, Cassandra, CDNs, and any distributed cache where you want minimal reshuffling on topology changes.
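The ring described above can be sketched with a sorted map. This is a minimal illustration, not how Redis Cluster or Cassandra implement it: the class name, the virtual-node count and the choice of MD5 as the hash are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Minimal consistent-hash ring with virtual nodes (illustrative sketch). */
class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes; // ring positions per server (assumed tunable)

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    /** Walk clockwise from the key's position to the first virtual node. */
    String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        // Past the last position: wrap around to the start of the ring
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    /** First 8 bytes of an MD5 digest as a long; any well-mixed hash would do. */
    private static long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 ships with every JDK
        }
    }
}
```

Removing a server only moves the keys that were mapped to it; every other key keeps its owner, which is the property that makes the ring attractive for caches.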
Rate Limiting
Protect your API from overuse. Algorithms: Token Bucket (allow bursts), Leaky Bucket (smooth rate), Sliding Window Counter (accurate, no boundary spikes). Implement with Resilience4j or Bucket4j + Redis in Spring Boot.
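To make the Token Bucket idea concrete, here is a small single-node sketch in plain Java. It is not the Bucket4j or Resilience4j API; the class name, constructor parameters and the injectable nanosecond clock (added so the behaviour can be tested without sleeping) are assumptions for this example.

```java
import java.util.function.LongSupplier;

/** Single-node token bucket: allows bursts up to capacity, then a steady refill rate. */
class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private final LongSupplier nanoClock;
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double refillPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.nanoClock = nanoClock;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefill = nanoClock.getAsLong();
    }

    /** Returns true if the request may proceed, false if it should be rejected. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        // Add the tokens earned since the last call, capped at capacity
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production code the clock would typically be `System::nanoTime`, and the bucket state would live in Redis so that all instances share one limit.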
Back-of-Envelope Estimation
Quickly size your system: QPS, storage per day/year, bandwidth. Useful ratios: 1 million RPS = ~1000 servers at 1000 RPS/server; 1 TB/day = ~12 MB/sec write throughput. Practice estimating before designing.
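The two ratios above are plain arithmetic, and a tiny helper makes them easy to check. The class and method names are hypothetical; decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) are assumed.

```java
/** Back-of-envelope helpers (decimal units assumed). */
class Estimates {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Servers needed to absorb a peak request rate, rounded up. */
    static long serversNeeded(long peakRps, long rpsPerServer) {
        return (peakRps + rpsPerServer - 1) / rpsPerServer;
    }

    /** Average write throughput in MB/s for a given daily volume in bytes. */
    static double megabytesPerSecond(double bytesPerDay) {
        return bytesPerDay / SECONDS_PER_DAY / 1_000_000.0;
    }
}
```

`serversNeeded(1_000_000, 1_000)` gives 1000, and `megabytesPerSecond(1e12)` gives roughly 11.6, which matches the "about 12 MB/sec" figure. These are averages; peak traffic is usually sized with an extra multiplier on top.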
2. Microservices Patterns with Spring Boot
The patterns that separate junior from senior designs — with Spring Boot implementations
Microservices Design Patterns Explained with Spring Boot
API Gateway, Circuit Breaker, Service Discovery, Saga, CQRS — complete guide with Spring Boot implementations and real trade-offs.
[Microservices] Microservices Design Patterns with Spring Boot: A Complete Guide
Deep dive into API Gateway, Circuit Breaker with Resilience4j, distributed tracing, and service mesh concepts.
[Microservices] Building a Scalable Spring Boot Microservices Example with Docker
Step-by-step: build multiple Spring Boot services, containerize with Docker, orchestrate with Docker Compose, add service discovery.
| Pattern | Problem it solves | Spring Boot tool |
|---|---|---|
| API Gateway | Single entry point, auth, rate limiting, routing | Spring Cloud Gateway |
| Circuit Breaker | Fail fast when downstream is slow/down | Resilience4j |
| Service Discovery | Dynamic service location without hardcoded IPs | Eureka / Consul |
| Saga | Distributed transactions across services | Spring Kafka + compensating events |
| Outbox Pattern | Guarantee event is published when DB write succeeds | Debezium + Kafka / Spring Batch |
| CQRS | Separate read and write models for performance | Spring Data + separate query service |
| Bulkhead | Isolate failures — one slow service doesn't starve others | Resilience4j Bulkhead |
| Sidecar | Cross-cutting concerns (logging, tracing) without changing service code | Envoy / Istio sidecar |
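To show what "fail fast when downstream is slow/down" means mechanically, here is a deliberately small circuit breaker state machine in plain Java. It is a sketch, not the Resilience4j API: the class name, the consecutive-failure threshold and the injectable millisecond clock are assumptions for this example (Resilience4j itself uses sliding-window failure rates).

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal circuit breaker: CLOSED -> OPEN after N failures, HALF_OPEN after a wait. */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Runs the action unless the breaker is open; the fallback covers both cases. */
    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openMillis) {
                return fallback.get(); // fail fast, downstream is not called
            }
            state = State.HALF_OPEN;   // wait elapsed: let one trial call through
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

Holding the lock while the action runs serialises calls, which keeps the sketch short but would not be acceptable in a real service; a library such as Resilience4j handles concurrency, metrics and configuration for you.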
3. Database Design & Scaling
SQL vs NoSQL, sharding, replication, and the ACID vs BASE trade-off
SQL vs NoSQL — choosing right
SQL (PostgreSQL, MySQL): ACID, complex joins, stable schema. Best for financial data, orders, user accounts.
NoSQL: flexible schema, horizontal scale. Key-value (Redis), Document (MongoDB), Wide-column (Cassandra), Graph (Neo4j).
Read Replicas & Sharding
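Read replicas: route reads to replicas and writes to the primary. Works when reads >> writes (common in most apps). Replication is usually asynchronous, so reads that must see a just-committed write should still go to the primary.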
Sharding: partition data horizontally — range-based, hash-based, directory-based. Adds complexity; try replicas and caching first.
Indexing Strategies
B-Tree index for range queries and equality. Composite index: column order matters (a common heuristic is most selective first, but putting equality-filtered columns before range or sort columns usually matters more). Covering index: includes all queried columns. Partial index for filtered queries. Avoid over-indexing: every index slows writes.
ACID vs BASE
ACID (SQL): Atomicity, Consistency, Isolation, Durability — strong guarantees, harder to scale.
BASE (NoSQL): Basically Available, Soft state, Eventually consistent — better scalability, weaker guarantees. Many modern systems mix both.
Connection Pooling (HikariCP)
Spring Boot uses HikariCP by default. Key settings: maximumPoolSize (default 10; measure before raising it, since an oversized pool often reduces throughput), connectionTimeout (default 30s; reduce for fast-fail), idleTimeout. Rule of thumb: pool size ≈ (2 × CPU cores) + number of disk spindles.
Database Migration (Flyway / Liquibase)
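Version-control your schema changes. Flyway: SQL-based, simple, runs on startup. Liquibase: XML/YAML/JSON changesets, rollback support. Both integrate with Spring Boot auto-configuration: with the library on the classpath, migrations run at startup and are configured under spring.flyway.* or spring.liquibase.* respectively.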
4. Caching Strategies
Cache at the right layer with the right strategy — and know when not to cache
| Strategy | How it works | Best for | Risk |
|---|---|---|---|
| Cache-Aside (Lazy Loading) | App checks cache first; on miss, fetches from DB and populates cache | Read-heavy, cacheable objects | Cache stampede on cold start |
| Write-Through | Write to cache and DB synchronously on every write | Reads that must not see stale data; workloads that can tolerate slower writes | Writes slower; cache fills with rarely-read data |
| Write-Behind (Write-Back) | Write to cache only; async flush to DB later | Very write-heavy workloads | Data loss if cache fails before flush |
| Read-Through | Cache sits in front of DB; cache fetches data on miss | Simpler application code | First request always slow; need TTL discipline |
| Refresh-Ahead | Proactively refresh cache before TTL expires | Predictable access patterns | Wasteful if predictions wrong |
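The Cache-Aside row in the table can be sketched in a few lines of plain Java. This is an in-memory stand-in, assuming a `Function` plays the role of the database call and a map plays the role of Redis; the class name and the injectable clock are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Cache-aside with a fixed TTL: the application, not the cache, talks to the DB. */
class CacheAside<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;

        Entry(V value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> database; // stand-in for a repository call
    private final long ttlMillis;
    private final LongSupplier clock;

    CacheAside(Function<K, V> database, long ttlMillis, LongSupplier clock) {
        this.database = database;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> hit = cache.get(key);
        if (hit != null && hit.expiresAt > now) {
            return hit.value;                       // cache hit
        }
        V loaded = database.apply(key);             // cache miss: go to the DB
        if (loaded != null) {
            cache.put(key, new Entry<>(loaded, now + ttlMillis));
        }
        return loaded;
    }

    /** Call after writing to the DB so the next read reloads fresh data. */
    void evict(K key) {
        cache.remove(key);
    }
}
```

Evicting on write rather than updating the cache in place is a common choice because it avoids caching a value that a concurrent writer has already replaced.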
Eviction policies:
- LRU: evict least recently used (most common)
- LFU: evict least frequently used (better for skewed access)
- TTL: expire after a fixed time (simplest; bounds how stale an entry can get)
- FIFO: evict oldest entry first (rarely optimal)
Common cache problems:
- Cache Penetration: requests for non-existent keys bypass the cache and hit the DB every time. Fix: bloom filter or cache null results.
- Cache Avalanche: many keys expire simultaneously. Fix: stagger TTLs with jitter.
- Cache Stampede: many concurrent requests hit the DB on the same cache miss. Fix: mutex lock or probabilistic early expiry.
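Two of the fixes above, a per-key lock against stampedes and jittered TTLs against avalanches, can be sketched as follows. This is a single-JVM illustration with hypothetical names; across several instances the lock would need to live in a shared store such as Redis.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

/** De-duplicates concurrent loads of the same key ("single flight"). */
class StampedeGuard<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stand-in for the expensive DB query

    StampedeGuard(Function<K, V> loader) {
        this.loader = loader;
    }

    V load(K key) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            return existing.join(); // someone else is loading: wait for their result
        }
        try {
            V value = loader.apply(key);
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e);
            throw e;
        } finally {
            inFlight.remove(key, mine); // later misses start a fresh load
        }
    }

    /** Spreads expiry times: base TTL plus or minus up to jitterFraction of it. */
    static long jitteredTtlMillis(long baseMillis, double jitterFraction) {
        long spread = (long) (baseMillis * jitterFraction);
        return baseMillis + ThreadLocalRandom.current().nextLong(-spread, spread + 1);
    }
}
```

The guard only collapses loads that overlap in time; it does not cache anything itself, so it is meant to sit between a cache miss and the database call.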
5. Messaging & Event-Driven Architecture
Decouple services, absorb traffic spikes, and build resilient async systems with Kafka
Building Scalable Systems with Event-Driven Architecture using Spring Boot and Kafka
Producers, consumers, topics, partitions, consumer groups, offset management, and reliability guarantees — with Spring Kafka.
[Kafka] Event-Driven Architecture with Spring Boot and Kafka: Complete Guide
Design patterns for event-driven systems: event sourcing, CQRS, saga choreography, dead letter topics, and exactly-once semantics.
[Case Study] System Design: Scalable Notification System
End-to-end design of a push/email/SMS notification system — fan-out, priority queues, deduplication, and delivery guarantees.
Kafka vs RabbitMQ — When to Use Which
Use Kafka when:
- High throughput (millions of msgs/sec)
- Replay events (event sourcing, audit logs)
- Multiple independent consumers per topic
- Stream processing (Kafka Streams, Flink)
Use RabbitMQ when:
- Complex routing (topic/fanout/direct exchanges)
- Message TTL and per-message priority
- Low-latency task queues (job processing)
- Simpler ops and smaller scale (< 10K msg/sec)
6. Key Architectural Patterns
The patterns you will be asked to draw on a whiteboard at Meta/FAANG interviews
CQRS (Command Query Responsibility Segregation)
Separate the write model (Commands → DB) from the read model (Queries → optimised read store). Read side can use a denormalised view, Elasticsearch, or Redis. Dramatically improves read performance at the cost of eventual consistency.
Event Sourcing
Store events (things that happened) instead of current state. Replay events to rebuild state at any point in time. Natural fit with CQRS. Use Kafka as the event log. Downside: query complexity, eventual consistency, schema evolution.
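A toy example may help with "replay events to rebuild state". The sketch below keeps an in-memory event log for a bank account; the `AccountEvents`, `Deposited` and `Withdrawn` names are invented for illustration, and a real system would persist the log (for example in Kafka or a database table).

```java
import java.util.ArrayList;
import java.util.List;

/** Event-sourced account: state is derived by replaying the event log. */
class AccountEvents {
    interface Event {
        long apply(long balance);
    }

    static final class Deposited implements Event {
        final long amount;

        Deposited(long amount) {
            this.amount = amount;
        }

        public long apply(long balance) {
            return balance + amount;
        }
    }

    static final class Withdrawn implements Event {
        final long amount;

        Withdrawn(long amount) {
            this.amount = amount;
        }

        public long apply(long balance) {
            return balance - amount;
        }
    }

    private final List<Event> log = new ArrayList<>(); // append-only

    void append(Event event) {
        log.add(event);
    }

    /** Balance after the first `version` events, i.e. state at a point in time. */
    long balanceAt(int version) {
        long balance = 0;
        for (Event event : log.subList(0, version)) {
            balance = event.apply(balance);
        }
        return balance;
    }

    long currentBalance() {
        return balanceAt(log.size());
    }
}
```

Replaying from the start gets slower as the log grows, which is why event-sourced systems usually add periodic snapshots.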
Saga Pattern (Distributed Transactions)
Choreography: services emit and react to events — no central coordinator, loose coupling. Orchestration: a saga orchestrator directs each step — simpler to track, single point of complexity. Use orchestration when business logic is complex.
Outbox Pattern
Write to DB and to an outbox table in the same transaction. A relay process reads the outbox and publishes to Kafka. Guarantees at-least-once delivery without distributed transactions. Essential when your service writes to DB and emits events.
Strangler Fig
Incrementally migrate a monolith to microservices by routing new features to new services while keeping the old system running. Route via API Gateway. Retire old code when traffic is fully migrated. Low-risk, incremental approach.
Bulkhead
Isolate critical services from slow ones using separate thread pools (Resilience4j ThreadPoolBulkhead) or separate service instances. If the recommendation service hangs, the checkout service keeps working. Named after ship hull compartments.
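The text above mentions thread-pool isolation; the sketch below swaps in the simpler semaphore-based variant of the same idea, which caps concurrent calls to one dependency without a dedicated pool. The class and method names are hypothetical and this is not the Resilience4j API.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Semaphore bulkhead: at most N concurrent calls reach the protected dependency. */
class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action if a permit is free, otherwise returns the rejection result. */
    <T> T call(Supplier<T> action, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get(); // compartment is full: shed load instead of queueing
        }
        try {
            return action.get();
        } finally {
            permits.release();
        }
    }
}
```

Giving the recommendation client its own bulkhead means a hang there can occupy at most its own permits, so threads remain available for checkout.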
7. System Design Case Studies
Classic interview problems — how to approach each one in 45 minutes
| System | Key challenges | Core components | Scale hint |
|---|---|---|---|
| URL Shortener (TinyURL) | Unique ID generation, redirect latency, analytics | Base62 encoding, Redis cache, DB for mapping, CDN for redirects | 100M URLs, 10B redirects/day |
| Notification System | Fan-out at scale, deduplication, delivery guarantees, priority | Kafka topics per channel, priority queue, idempotency key, retry with DLQ | 10M notifications/day |
| Rate Limiter | Distributed state, accuracy vs performance, API key vs user | Redis INCR+EXPIRE (fixed window counter) or sorted sets (sliding window log), Token Bucket, API Gateway layer | 1M RPS across 10 servers |
| News Feed / Social Feed | Fan-out vs fan-in, ranking, pagination, celebrity users | Push model (write to each follower's feed), pull for celebrities, Redis sorted sets | 500M users, 10M posts/day |
| Search Autocomplete | Low latency, prefix matching, trending suggestions | Trie (in-memory), or Elasticsearch prefix query, Redis sorted set for frequency | 10M searches/day, <100ms p99 |
| Distributed Cache | Consistent hashing, eviction, replication, hot key problem | Consistent hash ring, LRU eviction, primary/replica replication, client-side sharding | 1TB cache, 1M QPS |
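For the Search Autocomplete row, the in-memory trie can be sketched as below. It returns suggestions in alphabetical order only; ranking by frequency or trending score, as the table suggests, would need extra per-node data. The class and method names are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Prefix trie that returns up to `limit` completions in alphabetical order. */
class PrefixTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.terminal = true;
    }

    List<String> suggest(String prefix, int limit) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return new ArrayList<>(); // no stored word has this prefix
            }
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(prefix), out, limit);
        return out;
    }

    /** Depth-first walk below the prefix node, stopping once `limit` is reached. */
    private void collect(Node node, StringBuilder path, List<String> out, int limit) {
        if (out.size() >= limit) {
            return;
        }
        if (node.terminal) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> child : node.children.entrySet()) {
            path.append(child.getKey());
            collect(child.getValue(), path, out, limit);
            path.setLength(path.length() - 1);
        }
    }
}
```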
8. System Design Interview Q&A
The questions Meta, Google, and Amazon actually ask — with structured answers
Q1. How would you design a URL shortener like TinyURL?
Clarify: read-heavy (100:1 read/write), need custom aliases? analytics?
Write path: generate unique 7-char Base62 ID (counter + Base62 encode, or MD5+truncate), store short_id → long_url in a DB (PostgreSQL or DynamoDB).
Read path: GET /{id} → check Redis cache first → on miss, DB lookup → HTTP 301/302 redirect. Use Redis with TTL for hot URLs (cache-aside).
Scale: Read replicas for DB, CDN for the redirect response, consistent hashing to shard the cache.
Trade-offs: 301 (permanent) vs 302 (temporary) — 301 is cached by browser (fewer hits, less analytics); 302 is not cached (more server hits, full analytics).
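The "counter + Base62 encode" option from the write path can be sketched like this. The alphabet order (digits, upper case, lower case) is an arbitrary choice for the example, and the class name is hypothetical; the counter itself would come from a database sequence or a distributed ID generator.

```java
/** Base62 encoding of a numeric ID into a short URL-safe string. */
class Base62 {
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62))); // least significant digit first
            id /= 62;
        }
        return sb.reverse().toString();
    }

    static long decode(String shortId) {
        long id = 0;
        for (char c : shortId.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("not a Base62 character: " + c);
            }
            id = id * 62 + digit;
        }
        return id;
    }
}
```

Seven Base62 characters cover 62^7, about 3.5 trillion IDs, which is comfortably above the 100M URLs in the scale hint. Sequential IDs are guessable, so some designs shuffle or salt the counter before encoding.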
Q2. How would you design a rate limiter for an API?
Clarify: per user, per IP, per API key? Global or per-service? Hard limit or soft?
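Algorithm choice: Sliding Window Log (most accurate, no boundary spike): keep one Redis sorted set per user with one entry per request, scored by its timestamp. Memory grows with the limit; a Sliding Window Counter approximates the same result with two fixed-window counters.
Redis implementation: ZADD user_id:requests <timestamp> <unique-request-id> (using the timestamp as the member as well would collapse requests that arrive in the same millisecond), then ZREMRANGEBYSCORE to remove entries older than the window and ZCARD to count. Run the commands atomically (MULTI/EXEC or a Lua script) and set an EXPIRE on the key. If count > limit → 429 Too Many Requests.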
Spring Boot: Implement as a HandlerInterceptor or Spring Cloud Gateway filter with Bucket4j + Redis backend.
Edge cases: What if Redis is down? Fail open (allow) or fail closed (block)? Fail open is safer for availability.
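The sorted-set approach above (ZADD, ZREMRANGEBYSCORE, ZCARD) can be mimicked in memory to see the logic without Redis. This is a single-node sketch with hypothetical names; unlike the Redis recipe it checks the count before recording the request, so rejected requests do not consume quota.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** In-memory sliding window log: one timestamp per accepted request, per client. */
class SlidingWindowLogLimiter {
    private final int limit;
    private final long windowMillis;
    private final LongSupplier clock;
    private final Map<String, Deque<Long>> logs = new ConcurrentHashMap<>();

    SlidingWindowLogLimiter(int limit, long windowMillis, LongSupplier clock) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    /** Returns false when the caller should answer 429 Too Many Requests. */
    boolean allow(String clientId) {
        long now = clock.getAsLong();
        Deque<Long> log = logs.computeIfAbsent(clientId, k -> new ArrayDeque<>());
        synchronized (log) {
            // Like ZREMRANGEBYSCORE: drop timestamps that fell out of the window
            while (!log.isEmpty() && log.peekFirst() <= now - windowMillis) {
                log.pollFirst();
            }
            // Like ZCARD: count what is left
            if (log.size() >= limit) {
                return false;
            }
            // Like ZADD: record this request
            log.addLast(now);
            return true;
        }
    }
}
```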
Q3. How would you design a notification system?
Clarify: channels (push, email, SMS), volume (10M/day), priority (transactional > promotional), delivery guarantee (at-least-once).
Architecture: API Service → Kafka topics per channel → Channel Workers → 3rd party (FCM/APNs for push, SendGrid for email, Twilio for SMS).
Reliability: Idempotency key on each notification to prevent duplicates. Dead-letter topic for failed messages. Retry with exponential back-off.
Fan-out at scale: If a user has 10M followers, don't perform 10M writes synchronously in the request path; use a fan-out worker that reads the follower list and publishes to Kafka in batches.
Priority: Separate Kafka topics (high-priority, low-priority) with different consumer lag SLAs.
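The idempotency-key idea from the reliability step can be sketched as a thin wrapper around the channel client. Here a `Consumer<String>` stands in for the FCM, SendGrid or Twilio call and an in-memory set stands in for a shared store such as Redis; both are assumptions for the example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Drops duplicate deliveries that carry an idempotency key seen before. */
class IdempotentSender {
    private final Set<String> seenKeys = ConcurrentHashMap.newKeySet();
    private final Consumer<String> channel; // stand-in for the third-party client

    IdempotentSender(Consumer<String> channel) {
        this.channel = channel;
    }

    /** Returns true if the payload was sent, false if the key was a duplicate. */
    boolean deliver(String idempotencyKey, String payload) {
        if (!seenKeys.add(idempotencyKey)) {
            return false; // already handled: at-least-once redelivery is harmless
        }
        try {
            channel.accept(payload);
            return true;
        } catch (RuntimeException e) {
            seenKeys.remove(idempotencyKey); // forget the key so a retry can succeed
            throw e;
        }
    }
}
```

In a real deployment the key set needs a TTL and must be shared by all workers, otherwise a retry that lands on another instance would be sent twice.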
Q4. CAP Theorem — explain with a real example.
A distributed database across two data centers is separated by a network partition (P is unavoidable). Now you must choose:
CP (Consistent + Partition Tolerant): Return an error or wait until the network heals rather than return stale data. Example: a bank transfer — stale balance data could cause double-spend. Use: ZooKeeper, HBase, etcd.
AP (Available + Partition Tolerant): Return the best available data even if stale. Example: a shopping cart — showing a slightly stale cart is acceptable. Use: Cassandra, DynamoDB (eventually consistent reads), CouchDB.
Nuance: CAP only describes behaviour during a partition. PACELC extends it: if there is a Partition, trade Availability against Consistency; Else (normal operation), trade Latency against Consistency. Modern systems like DynamoDB let you choose consistency level per request.
Q5. How do you handle the database bottleneck as traffic grows?
Step-by-step progression (don't jump straight to sharding):
1. Query optimisation — proper indexes, avoid N+1 queries, use EXPLAIN ANALYZE.
2. Connection pooling — HikariCP, tune pool size to 2×CPU+disk spindles.
3. Caching — Redis cache-aside for frequently read, rarely changing data.
4. Read replicas: route reads that can tolerate replication lag to replicas; in read-heavy apps this takes most of the query load off the primary.
5. Vertical scaling — bigger instance (quick win, limited ceiling).
6. Sharding — partition by user ID or region. Adds complexity; do this last.
7. NoSQL migration — if access pattern is key-value or time-series, move that data to a purpose-built store.
Q6. Explain the Saga pattern for distributed transactions.
A distributed transaction (e.g., place order → reserve inventory → charge payment) cannot use a single DB transaction across services. Saga breaks it into local transactions with compensating actions on failure.
Choreography-based Saga: Each service publishes an event when its step succeeds. Next service listens and acts. On failure, the service publishes a failure event and each prior service runs its compensating action. Simple, loose coupling, hard to track overall state.
Orchestration-based Saga: A central orchestrator (a Spring Boot service) calls each step and handles failures by calling compensating endpoints. Easier to debug and monitor. Recommended when the workflow is complex.
Spring Boot: Implement with Spring Kafka (choreography) or model the orchestrator with Spring Statemachine (orchestration). Axon Framework also provides built-in saga support.
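The orchestration variant boils down to "run steps in order, and on failure undo the completed ones in reverse". A minimal in-process sketch of that control flow follows; the class and step names are hypothetical, and a real orchestrator would persist its progress and call remote services rather than run local `Runnable`s.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Runs saga steps in order; on failure, compensates completed steps in reverse. */
class SagaOrchestrator {
    private static final class Step {
        final String name;
        final Runnable action;
        final Runnable compensation;

        Step(String name, Runnable action, Runnable compensation) {
            this.name = name;
            this.action = action;
            this.compensation = compensation;
        }
    }

    private final List<Step> steps = new ArrayList<>();

    SagaOrchestrator step(String name, Runnable action, Runnable compensation) {
        steps.add(new Step(name, action, compensation));
        return this;
    }

    /** Returns true if every step succeeded, false if the saga was rolled back. */
    boolean run() {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action.run();
                completed.push(step);
            } catch (RuntimeException e) {
                // Undo in reverse order; the failed step itself made no change
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }
}
```

For the order example above, a payment failure would trigger "release inventory" and then "cancel order". Compensations should be idempotent, since an orchestrator that crashes mid-rollback has to be able to run them again.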
9. The 90-Day Java + System Design Roadmap
Focused on what actually matters for Meta/FAANG — skip what doesn't
Month 1 — Java Foundations & Spring Boot Mastery
Core Java Deep Dive
Java memory model, GC algorithms (G1, ZGC), virtual threads (Project Loom), CompletableFuture, ExecutorService, happens-before guarantees.
- Can you explain the Java memory model from first principles?
- What is a happens-before relationship?
- When do you use virtual threads vs platform threads?
Spring Boot Internals
Auto-configuration internals, bean lifecycle, @Conditional annotations, Spring Security filter chain, JPA N+1 problem, Spring Boot Actuator.
- How does @SpringBootApplication auto-configure beans?
- How does SecurityFilterChain work under the hood?
- How do you detect and fix the N+1 select problem?
Testing with JUnit 5 & Mockito
Write unit tests with @Mock/@InjectMocks, slice tests with @WebMvcTest, integration tests with @SpringBootTest + Testcontainers, ArgumentCaptor, verify patterns.
- Difference between @Mock and @MockBean?
- How do you test a Kafka consumer with Testcontainers?
- What is a test pyramid and why does it matter?
SOLID + Design Patterns in Practice
Apply SOLID to a real codebase. Master the 10 most asked patterns: Factory, Builder, Strategy, Observer, Decorator, Proxy, Singleton, Command, Template Method, Chain of Responsibility.
- Identify where each pattern is used in Spring itself
- Refactor a God class to follow SOLID
- Implement Strategy pattern for payment processing
Month 2 — System Design from Scratch
Scalability & Estimation
Back-of-envelope estimation, horizontal vs vertical scaling, load balancing algorithms, stateless design, CDN, consistent hashing.
- Estimate storage for 100M users uploading 1 photo/day at 200KB each
- Design a load balancer that routes by consistent hash
- When do you choose sticky sessions over stateless?
Databases & Caching
SQL vs NoSQL decision framework, read replicas, sharding strategies, caching patterns (cache-aside, write-through), Redis data structures, cache stampede/avalanche/penetration fixes.
- Design a caching layer for a product catalog (10M products, 1M RPS reads)
- How do you handle hot keys in Redis?
- When would you choose Cassandra over PostgreSQL?
Microservices & Kafka
API Gateway, Circuit Breaker (Resilience4j), service discovery, Saga patterns, Outbox pattern, Kafka producers/consumers, partitions, consumer groups, at-least-once delivery.
- How do you implement a distributed transaction across 3 services without 2PC?
- What happens when a Kafka consumer crashes mid-processing?
- How do you prevent a slow downstream from cascading to all services?
Design 3 Classic Systems End-to-End
URL shortener, notification system, rate limiter. For each: gather requirements, estimate scale, design components, identify bottlenecks, iterate.
- Draw full architecture diagrams on paper (no IDE)
- Time yourself: 45 minutes per system
- Record yourself explaining — watch it back
Month 3 — Interview Simulation & Polish
DSA: Patterns Not Problems
Focus on about 14 recurring patterns, including Sliding Window, Two Pointers, Fast/Slow Pointers, Merge Intervals, Cyclic Sort, BFS/DFS, Dynamic Programming (Memoization + Tabulation), Topological Sort, and Union-Find.
Behavioural + Leadership Principles
STAR format for 10 key situations. Meta specifically tests: Move Fast, Be Direct, Build Awesome Things. Prepare 5 strong leadership stories covering conflict, failure, ambiguity, impact, and collaboration.
Mock Interviews — Full Loop Simulation
2 coding interviews (LeetCode medium/hard), 1 system design, 1 behavioural. Use pramp.com, interviewing.io, or find a peer. After each mock: write a debrief noting what you'd do differently.
- Can you explain your solution clearly while coding?
- Do you clarify requirements before jumping in?
- Do you proactively mention trade-offs?
All System Design & Architecture Articles
Microservices Design Patterns Explained with Spring Boot
[Kafka] Building Scalable Systems with Event-Driven Architecture using Spring Boot and Kafka
[Microservices] Microservices Design Patterns with Spring Boot: A Complete Guide
[Kafka] Event-Driven Architecture with Spring Boot and Kafka: Complete Guide
[Microservices] Microservices Design Patterns Explained with Spring Boot and Examples
[Case Study] System Design: Scalable Notification System
Start Your System Design Journey Today
11 years of Java experience + system design mastery = FAANG-ready. Start with the 90-day roadmap above.