System Design Hub

Everything a Java engineer needs to ace system design interviews — scalability, microservices, caching, Kafka, database design, and FAANG-level patterns. Built around real Spring Boot implementations.

The 4 Qualities of Every Well-Designed System

Every system design question is really asking how you balance these four

📈

Scalability

Handle 10× traffic without redesigning. Horizontal scaling, stateless services, sharding, read replicas, CDN.

🛡️

Reliability

Keep working when components fail. Redundancy, circuit breakers, retries with back-off, graceful degradation.

Performance

Fast response under load. Caching at every layer, async processing, connection pooling, efficient DB queries.

🔧

Maintainability

Easy to change and operate. SOLID principles, clear service boundaries, observability, API versioning.

1. Scalability Fundamentals

The concepts every interviewer expects you to know cold

Horizontal vs Vertical Scaling

Vertical = bigger machine. Horizontal = more machines. Prefer horizontal for availability; vertical has a hard ceiling. Stateless services scale horizontally; stateful ones need sticky sessions or external state stores (Redis).

Load Balancing

Distribute traffic across instances: Round Robin (equal load), Least Connections (variable request times), IP Hash (sticky sessions), Consistent Hash (cache affinity). In AWS: ALB for HTTP/HTTPS, NLB for TCP/UDP at scale.

CAP Theorem

Distributed systems can guarantee only 2 of 3: Consistency, Availability, Partition Tolerance. Since partitions are unavoidable, choose CP (HBase, ZooKeeper, etcd) or AP (Cassandra, DynamoDB, CouchDB) based on your tolerance for stale reads vs unavailability.

Consistent Hashing

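Maps servers and keys onto a ring. Adding/removing a server remaps only ~K/N keys (K keys, N servers). Used in Cassandra, Dynamo-style stores, CDNs, and any distributed cache where you want minimal reshuffling on topology changes. Redis Cluster solves the same problem with 16,384 fixed hash slots rather than a ring.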
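
A minimal sketch of the ring, assuming MD5 as the hash function and a configurable number of virtual nodes per server. The class and method names are hypothetical, chosen only for illustration; production libraries differ in hash choice and replication handling.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative consistent-hash ring; names are hypothetical. Not thread-safe. */
final class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    /** Places the server on the ring at several positions to even out the load. */
    void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    /** Walks clockwise from the key's position to the first virtual node. */
    String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    /** First 8 bytes of an MD5 digest, used as a position on the ring. */
    private static long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
    }
}
```

When a fourth server joins a three-server ring, only the keys that now fall just before one of its virtual nodes move, and they all move to the new server; the rest stay where they were.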

Rate Limiting

Protect your API from overuse. Algorithms: Token Bucket (allow bursts), Leaky Bucket (smooth rate), Sliding Window Counter (accurate, no boundary spikes). Implement with Resilience4j or Bucket4j + Redis in Spring Boot.
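
To make the Token Bucket idea concrete, here is a small in-memory sketch in plain Java. It is not the Bucket4j or Resilience4j API; the class name and the choice to pass the clock in as a parameter (to keep it testable) are assumptions for illustration.

```java
/** Illustrative token bucket; names are hypothetical. The caller supplies the clock. */
final class TokenBucket {
    private final long capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefillNanos = nowNanos;
    }

    /** Returns true and spends one token if the request may proceed. */
    synchronized boolean tryAcquire(long nowNanos) {
        long elapsed = Math.max(0, nowNanos - lastRefillNanos);
        // Refill in proportion to elapsed time, never above capacity.
        tokens = Math.min(capacity, tokens + elapsed / 1_000_000_000.0 * refillPerSecond);
        lastRefillNanos = Math.max(lastRefillNanos, nowNanos);
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

A bucket with capacity 3 and a refill rate of 1 token per second lets three back-to-back requests through, rejects the fourth, and admits two more after two seconds of idle time.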

Back-of-Envelope Estimation

Quickly size your system: QPS, storage per day/year, bandwidth. Useful ratios: 1 million RPS = ~1000 servers at 1000 RPS/server; 1 TB/day = ~12 MB/sec write throughput. Practice estimating before designing.
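
The two ratios above are plain arithmetic, and it can help to write them down once. The helper below is a hypothetical sketch that assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and average rather than peak load.

```java
/** Illustrative back-of-envelope helpers; names and unit conventions are assumptions. */
final class Estimates {
    static final long SECONDS_PER_DAY = 86_400;

    /** Servers needed to absorb a given request rate, rounded up. */
    static long serversNeeded(long requestsPerSecond, long rpsPerServer) {
        return (requestsPerSecond + rpsPerServer - 1) / rpsPerServer;
    }

    /** Average write throughput in MB/s for a daily volume in bytes. */
    static double megabytesPerSecond(long bytesPerDay) {
        return bytesPerDay / (double) SECONDS_PER_DAY / 1_000_000.0;
    }
}
```

`Estimates.serversNeeded(1_000_000, 1_000)` gives 1000 servers, and `Estimates.megabytesPerSecond(1_000_000_000_000L)` gives roughly 11.6 MB/s, which rounds to the ~12 MB/sec figure quoted above.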

2. Microservices Patterns with Spring Boot

The patterns that separate junior from senior designs — with Spring Boot implementations

| Pattern | Problem it solves | Spring Boot tool |
| --- | --- | --- |
| API Gateway | Single entry point, auth, rate limiting, routing | Spring Cloud Gateway |
| Circuit Breaker | Fail fast when downstream is slow/down | Resilience4j |
| Service Discovery | Dynamic service location without hardcoded IPs | Eureka / Consul |
| Saga | Distributed transactions across services | Spring Kafka + compensating events |
| Outbox Pattern | Guarantee event is published when DB write succeeds | Debezium + Kafka / Spring Batch |
| CQRS | Separate read and write models for performance | Spring Data + separate query service |
| Bulkhead | Isolate failures so one slow service doesn't starve others | Resilience4j Bulkhead |
| Sidecar | Cross-cutting concerns (logging, tracing) without changing service code | Envoy / Istio sidecar |
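
To show what "fail fast" means for the Circuit Breaker row, here is a deliberately simplified state machine in plain Java. It is not the Resilience4j API: the class name, the consecutive-failure threshold, and the caller-supplied clock are assumptions for illustration, and the single lock serialises calls, which a real implementation would avoid.

```java
import java.util.function.Supplier;

/** Illustrative circuit breaker; names are hypothetical and calls are serialised for simplicity. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAtMillis;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized State state() {
        return state;
    }

    synchronized <T> T call(Supplier<T> task, Supplier<T> fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAtMillis < openMillis) {
                return fallback.get(); // fail fast without touching the downstream
            }
            state = State.HALF_OPEN; // wait is over: let one trial call through
        }
        try {
            T result = task.get();
            state = State.CLOSED;
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAtMillis = nowMillis;
            }
            return fallback.get();
        }
    }
}
```

After the threshold of consecutive failures, calls return the fallback immediately until the open period has elapsed; one successful trial call then closes the breaker again.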

3. Database Design & Scaling

SQL vs NoSQL, sharding, replication, and the ACID vs BASE trade-off

SQL vs NoSQL — choosing right

SQL (PostgreSQL, MySQL): ACID, complex joins, stable schema. Best for financial data, orders, user accounts.
NoSQL: flexible schema, horizontal scale. Key-value (Redis), Document (MongoDB), Wide-column (Cassandra), Graph (Neo4j).

Read Replicas & Sharding

Read replicas: route reads to replicas and writes to the primary; keep reads that must see a just-committed write on the primary, because replicas lag slightly. Works when reads >> writes (common in most apps).
Sharding: partition data horizontally — range-based, hash-based, directory-based. Adds complexity; try replicas and caching first.

Indexing Strategies

B-Tree index for range queries and equality. Composite index: column order matters (equality-filtered columns first, then range or sort columns; among equality columns, prefer the more selective). Covering index: includes all queried columns. Partial index for filtered queries. Avoid over-indexing, since every index slows writes.

ACID vs BASE

ACID (SQL): Atomicity, Consistency, Isolation, Durability — strong guarantees, harder to scale.
BASE (NoSQL): Basically Available, Soft state, Eventually consistent — better scalability, weaker guarantees. Many modern systems mix both.

Connection Pooling (HikariCP)

Spring Boot uses HikariCP by default. Key settings: maximumPoolSize (default 10 — usually too low), connectionTimeout (30s — reduce for fast-fail), idleTimeout. Rule of thumb: pool size ≈ (2 × CPU cores) + number of disk spindles.

Database Migration (Flyway / Liquibase)

Version-control your schema changes. Flyway: SQL-based, simple, runs on startup. Liquibase: XML/YAML/JSON changesets, rollback support. Spring Boot auto-configures whichever one is on the classpath; toggle them with spring.flyway.enabled and spring.liquibase.enabled respectively.

4. Caching Strategies

Cache at the right layer with the right strategy — and know when not to cache

| Strategy | How it works | Best for | Risk |
| --- | --- | --- | --- |
| Cache-Aside (Lazy Loading) | App checks cache first; on miss, fetches from DB and populates cache | Read-heavy, cacheable objects | Cache stampede on cold start |
| Write-Through | Write to cache and DB synchronously on every write | Reads that must see the latest write, where slower writes are acceptable | Writes slower; cache fills with rarely-read data |
| Write-Behind (Write-Back) | Write to cache only; async flush to DB later | Very write-heavy workloads | Data loss if cache fails before flush |
| Read-Through | Cache sits in front of DB; cache fetches data on miss | Simpler application code | First request always slow; need TTL discipline |
| Refresh-Ahead | Proactively refresh cache before TTL expires | Predictable access patterns | Wasteful if predictions wrong |
⚡ Cache Eviction Policies
  • LRU — evict least recently used (most common)
  • LFU — evict least frequently used (better for skewed access)
  • TTL — expire after fixed time (simplest, prevents stale data)
  • FIFO — evict oldest entry first (rarely optimal)
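
For a feel of how LRU eviction behaves, the JDK's `LinkedHashMap` can be turned into a small LRU cache by enabling access order and overriding `removeEldestEntry`. The class name below is hypothetical, and the sketch is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative LRU cache built on LinkedHashMap; not thread-safe. */
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true: order runs from least to most recently used
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insert; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

With a capacity of 2, inserting `a` and `b`, reading `a`, then inserting `c` evicts `b`, because `a` was touched more recently.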
🔴 Cache Problems to Know
  • Cache Penetration — requests for non-existent keys bypass cache. Fix: bloom filter or cache null results.
  • Cache Avalanche — many keys expire simultaneously. Fix: stagger TTLs with jitter.
  • Cache Stampede — many requests hit DB on cache miss. Fix: mutex lock or probabilistic early expiry.
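
Two of the fixes above, TTL jitter and a per-key lock, fit in a short in-process sketch. It assumes a `ConcurrentHashMap` stands in for the cache and that the loader (a hypothetical function representing the DB call) is quick; holding a map bin while loading is acceptable for illustration but not for slow loaders. It requires Java 16+ for records.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

/** Illustrative cache-aside wrapper; names are hypothetical. */
final class CacheAside<K, V> {
    private record Entry<T>(T value, long expiresAtMillis) {}

    private final ConcurrentHashMap<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    private final long baseTtlMillis;
    private final long maxJitterMillis;

    CacheAside(Function<K, V> loader, long baseTtlMillis, long maxJitterMillis) {
        this.loader = loader;
        this.baseTtlMillis = baseTtlMillis;
        this.maxJitterMillis = maxJitterMillis;
    }

    V get(K key, long nowMillis) {
        Entry<V> entry = cache.compute(key, (k, current) -> {
            if (current != null && current.expiresAtMillis() > nowMillis) {
                return current; // hit: still fresh
            }
            // Miss or expired: compute() runs atomically per key, so concurrent
            // callers for the same key wait here instead of all hitting the DB.
            V loaded = loader.apply(k);
            // Random jitter spreads expiry times so keys do not all expire together.
            long jitter = ThreadLocalRandom.current().nextLong(maxJitterMillis + 1);
            return new Entry<>(loaded, nowMillis + baseTtlMillis + jitter);
        });
        return entry.value();
    }
}
```

A loader that returns `null` for a missing row is cached like any other value, which is the "cache null results" defence against penetration.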

5. Messaging & Event-Driven Architecture

Decouple services, absorb traffic spikes, and build resilient async systems with Kafka

Kafka vs RabbitMQ — When to Use Which

Use Kafka when:

  • High throughput (millions of msgs/sec)
  • Replay events (event sourcing, audit logs)
  • Multiple independent consumers per topic
  • Stream processing (Kafka Streams, Flink)

Use RabbitMQ when:

  • Complex routing (topic/fanout/direct exchanges)
  • Message TTL and per-message priority
  • Low-latency task queues (job processing)
  • Simpler ops and smaller scale (< 10K msg/sec)

6. Key Architectural Patterns

The patterns you will be asked to draw on a whiteboard at Meta/FAANG interviews

CQRS (Command Query Responsibility Segregation)

Separate the write model (Commands → DB) from the read model (Queries → optimised read store). Read side can use a denormalised view, Elasticsearch, or Redis. Dramatically improves read performance at the cost of eventual consistency.

Event Sourcing

Store events (things that happened) instead of current state. Replay events to rebuild state at any point in time. Natural fit with CQRS. Use Kafka as the event log. Downside: query complexity, eventual consistency, schema evolution.

Saga Pattern (Distributed Transactions)

Choreography: services emit and react to events — no central coordinator, loose coupling. Orchestration: a saga orchestrator directs each step — simpler to track, single point of complexity. Use orchestration when business logic is complex.

Outbox Pattern

Write to DB and to an outbox table in the same transaction. A relay process reads the outbox and publishes to Kafka. Guarantees at-least-once delivery without distributed transactions. Essential when your service writes to DB and emits events.

Strangler Fig

Incrementally migrate a monolith to microservices by routing new features to new services while keeping the old system running. Route via API Gateway. Retire old code when traffic is fully migrated. Low-risk, incremental approach.

Bulkhead

Isolate critical services from slow ones using separate thread pools (Resilience4j ThreadPoolBulkhead) or separate service instances. If the recommendation service hangs, the checkout service keeps working. Named after ship hull compartments.
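
The isolation idea can also be shown with a semaphore instead of a thread pool. The sketch below is plain Java rather than the Resilience4j API; the class name and the fallback-on-rejection behaviour are assumptions for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Illustrative semaphore bulkhead; names are hypothetical. */
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the task if a permit is free; otherwise returns the fallback immediately. */
    <T> T call(Supplier<T> task, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // compartment is full: reject instead of queueing
        }
        try {
            return task.get();
        } finally {
            permits.release();
        }
    }
}
```

Giving the recommendation client its own bulkhead caps how many caller threads it can tie up, so a hang there cannot exhaust the threads that checkout needs.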

7. System Design Case Studies

Classic interview problems — how to approach each one in 45 minutes

| System | Key challenges | Core components | Scale hint |
| --- | --- | --- | --- |
| URL Shortener (TinyURL) | Unique ID generation, redirect latency, analytics | Base62 encoding, Redis cache, DB for mapping, CDN for redirects | 100M URLs, 10B redirects/day |
| Notification System | Fan-out at scale, deduplication, delivery guarantees, priority | Kafka topics per channel, priority queue, idempotency key, retry with DLQ | 10M notifications/day |
| Rate Limiter | Distributed state, accuracy vs performance, API key vs user | Redis INCR+EXPIRE (fixed window counter), Token Bucket, API Gateway layer | 1M RPS across 10 servers |
| News Feed / Social Feed | Fan-out vs fan-in, ranking, pagination, celebrity users | Push model (write to each follower's feed), pull for celebrities, Redis sorted sets | 500M users, 10M posts/day |
| Search Autocomplete | Low latency, prefix matching, trending suggestions | Trie (in-memory), or Elasticsearch prefix query, Redis sorted set for frequency | 10M searches/day, <100ms p99 |
| Distributed Cache | Consistent hashing, eviction, replication, hot key problem | Consistent hash ring, LRU eviction, primary/replica replication, client-side sharding | 1TB cache, 1M QPS |

8. System Design Interview Q&A

The questions Meta, Google, and Amazon actually ask — with structured answers

Q1. How would you design a URL shortener like TinyURL?

Clarify: read-heavy (100:1 read/write), need custom aliases? analytics?
Write path: generate unique 7-char Base62 ID (counter + Base62 encode, or MD5+truncate), store short_id → long_url in a DB (PostgreSQL or DynamoDB).
Read path: GET /{id} → check Redis cache first → on miss, DB lookup → HTTP 301/302 redirect. Use Redis with TTL for hot URLs (cache-aside).
Scale: Read replicas for DB, CDN for the redirect response, consistent hashing to shard the cache.
Trade-offs: 301 (permanent) vs 302 (temporary) — 301 is cached by browser (fewer hits, less analytics); 302 is not cached (more server hits, full analytics).
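
The "counter + Base62 encode" option from the write path can be sketched as follows. The alphabet order is an assumption (any fixed ordering of the 62 characters works), and the class name is hypothetical.

```java
/** Illustrative Base62 codec for numeric IDs; alphabet order is an assumption. */
final class Base62 {
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    /** Encodes a non-negative counter value as a Base62 string. */
    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        }
        return sb.reverse().toString(); // digits were produced least significant first
    }

    /** Decodes a Base62 string back to the counter value. */
    static long decode(String shortId) {
        long id = 0;
        for (char c : shortId.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("not a Base62 character: " + c);
            }
            id = id * 62 + digit;
        }
        return id;
    }
}
```

Seven characters cover 62^7 (about 3.5 trillion) IDs, comfortably above the 100M URLs in the scale hint. Sequential counters produce guessable IDs, so shuffle or offset the counter if that matters.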

Q2. How would you design a rate limiter for an API?

Clarify: per user, per IP, per API key? Global or per-service? Hard limit or soft?
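Algorithm choice: Sliding Window Log (most accurate, no boundary spike): keep one Redis sorted set per user, with one entry per request scored by its timestamp. A Sliding Window Counter is the cheaper approximation when memory per user matters.
Redis implementation: ZADD user_id:requests <timestamp> <unique-request-id>, then ZREMRANGEBYSCORE to remove entries older than the window, ZCARD to count. If count > limit → 429 Too Many Requests. Use a unique member rather than the timestamp itself, otherwise two requests in the same millisecond collapse into one entry; run the three commands in MULTI/EXEC or a Lua script so they apply atomically.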
Spring Boot: Implement as a HandlerInterceptor or Spring Cloud Gateway filter with Bucket4j + Redis backend.
Edge cases: What if Redis is down? Fail open (allow) or fail closed (block)? Fail open is safer for availability.
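
The sorted-set steps above amount to keeping a log of request timestamps per user. An in-memory Java analogue, useful for reasoning about the logic before adding Redis, might look like this. The class name is hypothetical, and unlike the Redis sequence it counts before recording, so rejected requests do not consume quota.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative in-memory limiter that keeps a timestamp log per client. */
final class SlidingWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private final Map<String, Deque<Long>> logs = new ConcurrentHashMap<>();

    SlidingWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    boolean allow(String clientId, long nowMillis) {
        Deque<Long> log = logs.computeIfAbsent(clientId, id -> new ArrayDeque<>());
        synchronized (log) {
            // Like ZREMRANGEBYSCORE: drop timestamps that have left the window.
            while (!log.isEmpty() && log.peekFirst() <= nowMillis - windowMillis) {
                log.pollFirst();
            }
            // Like ZCARD: count what remains.
            if (log.size() >= limit) {
                return false; // caller responds with 429 Too Many Requests
            }
            log.addLast(nowMillis); // like ZADD
            return true;
        }
    }
}
```

Memory grows with the number of requests inside the window, which is the usual reason to fall back to a counter-based approximation at high limits.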

Q3. How would you design a notification system?

Clarify: channels (push, email, SMS), volume (10M/day), priority (transactional > promotional), delivery guarantee (at-least-once).
Architecture: API Service → Kafka topics per channel → Channel Workers → 3rd party (FCM/APNs for push, SendGrid for email, Twilio for SMS).
Reliability: Idempotency key on each notification to prevent duplicates. Dead-letter topic for failed messages. Retry with exponential back-off.
Fan-out at scale: If user has 10M followers, don't send 10M DB writes synchronously — use a fan-out worker that reads the follower list and publishes to Kafka in batches.
Priority: Separate Kafka topics (high-priority, low-priority) with different consumer lag SLAs.
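
"Retry with exponential back-off" usually also means a cap and some randomness, so that failed deliveries do not all retry at the same instant. A minimal sketch, assuming the "full jitter" variant (a random delay between zero and the exponential ceiling) and hypothetical names:

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative retry-delay calculator; names and the full-jitter choice are assumptions. */
final class Backoff {
    /** Exponential ceiling: base, 2*base, 4*base, ... capped at capMillis. */
    static long ceilingMillis(int attempt, long baseMillis, long capMillis) {
        long delay = Math.min(baseMillis, capMillis);
        for (int i = 0; i < attempt && delay < capMillis; i++) {
            // Guard the doubling so it can neither overflow nor exceed the cap.
            delay = (delay > capMillis / 2) ? capMillis : delay * 2;
        }
        return delay;
    }

    /** Picks a random delay in [0, ceiling] so retries spread out over time. */
    static long jitteredMillis(int attempt, long baseMillis, long capMillis) {
        long ceiling = ceilingMillis(attempt, baseMillis, capMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

With a 100 ms base and a 10 s cap, the ceilings run 100, 200, 400, 800 ms and so on until they hit 10 s; after a fixed number of attempts the message goes to the dead-letter topic.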

Q4. CAP Theorem — explain with a real example.

A distributed database across two data centers is separated by a network partition (P is unavoidable). Now you must choose:
CP (Consistent + Partition Tolerant): Return an error or wait until the network heals rather than return stale data. Example: a bank transfer — stale balance data could cause double-spend. Use: ZooKeeper, HBase, etcd.
AP (Available + Partition Tolerant): Return the best available data even if stale. Example: a shopping cart — showing a slightly stale cart is acceptable. Use: Cassandra, DynamoDB (eventually consistent reads), CouchDB.
Nuance: CAP is binary but PACELC adds latency: even without partitions, there's a latency-consistency trade-off. Modern systems like DynamoDB let you choose consistency level per request.

Q5. How do you handle the database bottleneck as traffic grows?

Step-by-step progression (don't jump straight to sharding):
1. Query optimisation — proper indexes, avoid N+1 queries, use EXPLAIN ANALYZE.
2. Connection pooling — HikariCP, tune pool size to 2×CPU+disk spindles.
3. Caching — Redis cache-aside for frequently read, rarely changing data.
4. Read replicas: route reads that can tolerate replication lag to replicas; in read-heavy apps this takes most of the load off the primary.
5. Vertical scaling — bigger instance (quick win, limited ceiling).
6. Sharding — partition by user ID or region. Adds complexity; do this last.
7. NoSQL migration — if access pattern is key-value or time-series, move that data to a purpose-built store.

Q6. Explain the Saga pattern for distributed transactions.

A distributed transaction (e.g., place order → reserve inventory → charge payment) cannot use a single DB transaction across services. Saga breaks it into local transactions with compensating actions on failure.
Choreography-based Saga: Each service publishes an event when its step succeeds. Next service listens and acts. On failure, the service publishes a failure event and each prior service runs its compensating action. Simple, loose coupling, hard to track overall state.
Orchestration-based Saga: A central orchestrator (a Spring Boot service) calls each step and handles failures by calling compensating endpoints. Easier to debug and monitor. Recommended when the workflow is complex.
Spring Boot: Implement with Spring Kafka (choreography) or Spring Statemachine (orchestration). Library: Axon Framework provides built-in saga support.
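
The orchestration variant reduces to "run steps in order; on failure, undo the completed ones in reverse". The sketch below shows only that control flow in plain Java, with hypothetical names; a real orchestrator would also persist saga state and call remote services instead of local lambdas. It requires Java 16+ for records.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Illustrative in-process saga orchestrator; names are hypothetical. */
final class SagaOrchestrator {
    /** One local transaction plus the action that undoes it. */
    record Step(String name, Runnable action, Runnable compensation) {}

    /** Runs steps in order; on failure, compensates completed steps in reverse order. */
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action().run();
                completed.push(step);
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation().run();
                }
                return false;
            }
        }
        return true;
    }
}
```

For the order example, a failed payment step triggers "release inventory" and then "cancel order"; the failed step itself is not compensated because it never completed.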

9. The 90-Day Java + System Design Roadmap

Focused on what actually matters for Meta/FAANG — skip what doesn't

Month 1 — Java Foundations & Spring Boot Mastery

W1

Core Java Deep Dive

Java memory model, GC algorithms (G1, ZGC), virtual threads (Project Loom), CompletableFuture, ExecutorService, happens-before guarantees.

  • Can you explain the Java memory model from first principles?
  • What is a happens-before relationship?
  • When do you use virtual threads vs platform threads?
W2

Spring Boot Internals

Auto-configuration internals, bean lifecycle, @Conditional annotations, Spring Security filter chain, JPA N+1 problem, Spring Boot Actuator.

  • How does @SpringBootApplication auto-configure beans?
  • How does SecurityFilterChain work under the hood?
  • How do you detect and fix the N+1 select problem?
W3

Testing with JUnit 5 & Mockito

Write unit tests with @Mock/@InjectMocks, slice tests with @WebMvcTest, integration tests with @SpringBootTest + Testcontainers, ArgumentCaptor, verify patterns.

  • Difference between @Mock and @MockBean?
  • How do you test a Kafka consumer with Testcontainers?
  • What is a test pyramid and why does it matter?
W4

SOLID + Design Patterns in Practice

Apply SOLID to a real codebase. Master the 10 most asked patterns: Factory, Builder, Strategy, Observer, Decorator, Proxy, Singleton, Command, Template Method, Chain of Responsibility.

  • Identify where each pattern is used in Spring itself
  • Refactor a God class to follow SOLID
  • Implement Strategy pattern for payment processing

Month 2 — System Design from Scratch

W5

Scalability & Estimation

Back-of-envelope estimation, horizontal vs vertical scaling, load balancing algorithms, stateless design, CDN, consistent hashing.

  • Estimate storage for 100M users uploading 1 photo/day at 200KB each
  • Design a load balancer that routes by consistent hash
  • When do you choose sticky sessions over stateless?
W6

Databases & Caching

SQL vs NoSQL decision framework, read replicas, sharding strategies, caching patterns (cache-aside, write-through), Redis data structures, cache stampede/avalanche/penetration fixes.

  • Design a caching layer for a product catalog (10M products, 1M RPS reads)
  • How do you handle hot keys in Redis?
  • When would you choose Cassandra over PostgreSQL?
W7

Microservices & Kafka

API Gateway, Circuit Breaker (Resilience4j), service discovery, Saga patterns, Outbox pattern, Kafka producers/consumers, partitions, consumer groups, at-least-once delivery.

  • How do you implement a distributed transaction across 3 services without 2PC?
  • What happens when a Kafka consumer crashes mid-processing?
  • How do you prevent a slow downstream from cascading to all services?
W8

Design 3 Classic Systems End-to-End

URL shortener, notification system, rate limiter. For each: gather requirements, estimate scale, design components, identify bottlenecks, iterate.

  • Draw full architecture diagrams on paper (no IDE)
  • Time yourself: 45 minutes per system
  • Record yourself explaining — watch it back

Month 3 — Interview Simulation & Polish

W9

DSA: Patterns Not Problems

Focus on 14 patterns, including: Sliding Window, Two Pointers, Fast/Slow Pointers, Merge Intervals, Cyclic Sort, BFS/DFS, Dynamic Programming (Memoization + Tabulation), Topological Sort, Union-Find.

W10

Behavioural + Leadership Principles

STAR format for 10 key situations. Meta specifically tests: Move Fast, Be Direct, Build Awesome Things. Prepare 5 strong leadership stories covering conflict, failure, ambiguity, impact, and collaboration.

W11-12

Mock Interviews — Full Loop Simulation

2 coding interviews (LeetCode medium/hard), 1 system design, 1 behavioural. Use pramp.com, interviewing.io, or find a peer. After each mock: write a debrief noting what you'd do differently.

  • Can you explain your solution clearly while coding?
  • Do you clarify requirements before jumping in?
  • Do you proactively mention trade-offs?

Start Your System Design Journey Today

11 years of Java experience + system design mastery = FAANG-ready. Start with the 90-day roadmap above.
