Kafka Partition Design for IoT: Throughput and Ordering


Kafka partition design is the main lever for scaling IoT ingestion. A poor partitioning strategy causes hot brokers, lagging consumers, and broken ordering for device state events.

This guide gives a practical model for choosing partition count and topic layout.

1. Core Rules

  • Ordering is guaranteed only inside one partition.
  • Within a consumer group, each partition is consumed by at most one consumer instance at a time.
  • More partitions increase parallelism, but also increase operational overhead.

2. Throughput-Based Partition Sizing

A simple starting formula:

partitions = max(ceil(target_throughput / producer_capacity_per_partition),
                 ceil(target_throughput / consumer_capacity_per_partition))

Example:

  • target throughput: 120 MB/s
  • producer capacity per partition: 12 MB/s
  • consumer capacity per partition: 10 MB/s

Result:

  • producer side needs 10 partitions
  • consumer side needs 12 partitions
  • choose at least 12 partitions
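The sizing formula above can be sketched as a small helper (the function name and signature are illustrative, not a standard API):

```python
import math

def required_partitions(target_mbps: float,
                        producer_mbps_per_partition: float,
                        consumer_mbps_per_partition: float) -> int:
    """Partitions needed so that neither the producer side nor the
    consumer side saturates at the target throughput."""
    producer_side = math.ceil(target_mbps / producer_mbps_per_partition)
    consumer_side = math.ceil(target_mbps / consumer_mbps_per_partition)
    return max(producer_side, consumer_side)

# Worked example from above: 120 MB/s target, 12 MB/s produce capacity,
# 10 MB/s consume capacity per partition.
print(required_partitions(120, 12, 10))  # -> 12
```

Fractional results always round up: a partition cannot absorb a fraction of the remaining load.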

3. Consumer Scaling Constraint

If a consumer group has 16 instances and the topic has 12 partitions, 4 instances stay idle. For independent parallelism, the partition count should be at least the expected maximum number of consumer instances in critical groups.
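The constraint is simple arithmetic, sketched here with an illustrative helper:

```python
def group_utilization(partitions: int, consumers: int) -> tuple:
    """Active vs. idle instances in one consumer group.

    Each partition is assigned to at most one instance in the group,
    so any instances beyond the partition count sit idle.
    """
    active = min(partitions, consumers)
    idle = consumers - active
    return active, idle

# 12 partitions, 16 instances: 4 instances do nothing.
print(group_utilization(12, 16))  # -> (12, 4)
```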

4. IoT Ordering Strategy

Classify events by ordering requirement.

Strong ordering required:

  • device lifecycle events (online, offline, reboot)
  • control commands and acknowledgements
  • financial or billing events

Weak ordering acceptable:

  • periodic telemetry snapshots
  • non-critical sensor aggregates

For strong ordering events, use device_id as the message key so that all events for a given device land in the same partition.

{
  "device_id": "dev-1001",
  "event_type": "state_change",
  "state": "offline",
  "timestamp": "2026-03-05T00:00:00Z"
}
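The key-to-partition mapping can be illustrated with a toy partitioner. Kafka's default partitioner hashes the key bytes with murmur2; CRC32 stands in here purely so the sketch stays self-contained and deterministic:

```python
import zlib

def partition_for_key(key: str, num_partitions: int) -> int:
    # Illustration only: Kafka's default partitioner uses murmur2
    # on the key bytes, not CRC32. The point is that the mapping is
    # deterministic, so one key always lands in one partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Every "dev-1001" event maps to the same partition of a 24-partition topic.
p1 = partition_for_key("dev-1001", 24)
p2 = partition_for_key("dev-1001", 24)
assert p1 == p2
```

Note that increasing the partition count changes the mapping, which is why repartitioning a keyed topic breaks ordering guarantees across the boundary.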

5. Topic Layout for IoT

Use a bounded set of domain topics instead of one topic per device.

Recommended:

  • iot.telemetry.raw
  • iot.device.state
  • iot.alerts
  • iot.commands

Avoid:

  • iot.device.<id> for every device (metadata explosion)

6. Broker and Cluster Guardrails

Operational limits to watch:

  • partitions per broker
  • leader balance per broker
  • consumer lag by partition
  • under-replicated partitions

If one broker is leader for too many hot partitions, rebalance partition leadership and review key skew.

7. Key Skew Detection

Even with enough partitions, one key can become a hotspot.

Track:

  • top keys by message rate
  • bytes in per partition
  • p95/p99 produce latency per partition
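Given per-partition inbound byte counts from those metrics, skew can be flagged with a simple threshold check (function name and the 2x threshold are illustrative choices):

```python
def skewed_partitions(bytes_in: dict, threshold: float = 2.0) -> list:
    """Flag partitions whose inbound byte volume exceeds `threshold`
    times the mean across the topic's partitions."""
    mean = sum(bytes_in.values()) / len(bytes_in)
    return [p for p, b in bytes_in.items() if b > threshold * mean]

# One hot device key concentrating traffic on partition 3:
sample = {0: 10_000, 1: 11_000, 2: 9_500, 3: 95_000}
print(skewed_partitions(sample))  # -> [3]
```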

Mitigation options:

  • split oversized tenants into dedicated keys or topics
  • move heavy device cohorts to a separate topic
  • apply producer-side batching/compression tuning

8. Reference Design for Mid-Size Fleet

Scenario:

  • 150k devices
  • mixed telemetry and state events
  • 3 critical consumers

Suggested baseline:

  • iot.telemetry.raw: 48 partitions, key by device_id
  • iot.device.state: 24 partitions, key by device_id
  • iot.commands: 24 partitions, key by device_id
  • replication factor: 3
  • min in-sync replicas: 2
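The baseline above could be created with the standard kafka-topics.sh tool, for example (the broker address is a placeholder; iot.alerts is omitted here because the baseline does not size it):

```shell
# Create the baseline topics with RF=3 and min.insync.replicas=2.
kafka-topics.sh --bootstrap-server broker1:9092 --create \
  --topic iot.telemetry.raw --partitions 48 --replication-factor 3 \
  --config min.insync.replicas=2

kafka-topics.sh --bootstrap-server broker1:9092 --create \
  --topic iot.device.state --partitions 24 --replication-factor 3 \
  --config min.insync.replicas=2

kafka-topics.sh --bootstrap-server broker1:9092 --create \
  --topic iot.commands --partitions 24 --replication-factor 3 \
  --config min.insync.replicas=2
```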

Adjust based on measured lag and broker saturation, not guesswork.

9. Rollout Plan

  1. Start with estimated partition count from throughput and consumer sizing.
  2. Run load test with realistic device traffic distribution.
  3. Measure lag, produce latency, and key skew.
  4. Increase partitions before production saturation.
  5. Revisit quarterly as fleet size changes.

Correct partition design gives stable throughput and predictable ordering without overloading brokers.