Ad Click Aggregation: System Design

Requirements & API: Ad Click Aggregation

The first move in any interview: define requirements and sketch the API before drawing a single box.

Functional requirements

•Ingest click events from ad serving infrastructure at 100K+ events/sec.
•Aggregate click counts by (ad_id, time_window) in near-real time (< 1 minute lag).
•Detect and deduplicate fraudulent or duplicate click events.
•Provide accurate hourly and daily aggregates for advertiser billing.
•Support ad-hoc queries: total clicks for ad A in the last 7 days.

Non-functional requirements

•Exactly-once counting: no click counted twice, no valid click dropped.
•Fault tolerance: a worker crash must not cause data loss or double-counting.
•Throughput: sustain 100K events/sec with 10x burst headroom.
•Freshness: near-real-time aggregates available within 60 seconds for advertiser dashboards.

API contract

POST /events/click { click_id, ad_id, advertiser_id, user_id, timestamp, ip, user_agent } → 202

Fire-and-forget from ad servers. Returns immediately; processing is async.

GET /aggregates?ad_id=X&start=T1&end=T2&granularity=hour → [{ window, clicks, spend }]

Billing-grade aggregates from the batch layer. Accurate but up to 1 hour lag.

GET /aggregates/realtime?ad_id=X&window=5m → { clicks, spend }

Real-time approximate counts from the stream layer. Under 1 min lag, may have minor corrections.

About Ad Click Aggregation

Every time you click an ad, a system somewhere counts it. That count is the basis for billing an advertiser billions of dollars per year. The system must be accurate: under-counting loses revenue, over-counting triggers advertiser disputes and regulatory risk. It must be fast: advertisers need near-real-time dashboards to optimize their campaigns. And it must handle 100,000+ click events per second at peak.

The fundamental challenge is exactly-once aggregation at high throughput. If a click is counted twice (due to duplicate event delivery or a processing retry), the advertiser overpays. If it's dropped (due to a crashed worker), the publisher loses revenue. The design must guarantee every valid click is counted exactly once, with a clear audit trail.

Two architectural patterns compete here. The Lambda architecture runs a real-time stream processing layer (Flink, Spark Streaming) for low-latency approximate counts, and a batch processing layer (Spark, MapReduce) for accurate historical reconciliation. The two results are merged at query time. The Kappa architecture simplifies this by using only the stream layer, making the stream log replayable enough to serve as the source of truth for both real-time and historical queries.

The data model is deliberately simple: each event is a (click_id, ad_id, advertiser_id, user_id, timestamp, metadata) tuple. The aggregations are counts and spend rollups grouped by (ad_id, time_window). The complexity isn't in the data model. It's in ensuring those counts are correct under every failure mode.