Async Image Batching for High-Volume Stores

Within the Image Parsing & Computer Vision Workflows section, async image batching is the layer that decides how many shelf frames travel through a GPU detector at once, and when. High-volume retail generates imagery at a scale that overwhelms synchronous request-response inference: a national grocery banner running a daily store-walk can emit tens of thousands of high-resolution gondola photos inside a single morning window, and every frame must become a structured compliance record before a category manager’s briefing. Blocking HTTP handlers and single-threaded batch scripts saturate immediately under that load. Async batching decouples photo ingestion from model inference, buffers frames in a durable broker, and assembles them into right-sized groups so the pipeline can absorb audit surges, keep expensive GPUs fully utilized, and still hold a predictable latency SLA.

This page specifies the batching component itself — the contract it consumes, the aggregation logic that turns a frame stream into inference batches, the tuning knobs that keep VRAM bounded, and the failure modes that show up only at scale. The concrete worker runtime that executes these batches is covered in the companion walkthrough on Implementing Celery for Async Shelf Photo Processing; here the focus is the batching strategy that runs in front of any task queue.

The aggregator buckets frames by fixture class and resolution tier, seals each batch on size or a 0.75s timeout, and absorbs surges in the durable broker — with VRAM telemetry shedding batch size and undispatched batches falling to a dead-letter queue rather than being lost.

Concept & Data Contract

The batching layer consumes a stream of validated capture envelopes and produces inference batches: contiguous groups of frames that share enough characteristics to run through one model invocation efficiently. It does not run detection itself, and it does not re-validate payloads — by the time a frame reaches the aggregator it has already passed the ingestion boundary described in the parent section, so the envelope’s capture_id, fixture_class, planogram_id, and image_uri are trustworthy. The aggregator’s only job is to answer two questions: which frames belong together, and when is a group ready to dispatch.

Grouping is not arbitrary. Frames are bucketed by the attributes that determine inference cost and the detector that will handle them — primarily fixture_class and image resolution tier — because mixing a 4K refrigerated-cooler capture with a low-res handheld endcap shot in the same tensor forces padding to the largest frame and wastes GPU memory on every smaller one. The routing decision that picks the actual detector for each bucket lives in Vision Model Routing for Shelf Detection; the aggregator simply respects the bucket key that routing layer assigns so a batch never straddles two model tiers.

A batch is emitted under either of two triggers — it fills to the target size, or a per-bucket timer expires — whichever comes first. The timeout is what bounds tail latency for slow-filling buckets: a rarely-photographed fixture class should not wait indefinitely for 32 peers that may never arrive during off-peak hours.

The typed contracts make the boundary explicit:

from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import NewType

CaptureId = NewType("CaptureId", str)


class ResolutionTier(str, Enum):
    LOW = "low"        # < 2 MP, handheld / degraded
    STANDARD = "std"   # 2-8 MP, associate mobile
    HIGH = "high"      # > 8 MP, fixed aisle / robotic


@dataclass(frozen=True, slots=True)
class CaptureEnvelope:
    """One validated shelf frame as it leaves the ingestion boundary."""
    capture_id: CaptureId
    store_id: str
    fixture_id: str
    fixture_class: str
    planogram_id: str
    resolution_tier: ResolutionTier
    image_uri: str
    capture_timestamp: str  # ISO-8601, UTC


@dataclass(frozen=True, slots=True)
class InferenceBatch:
    """A dispatch-ready group sharing one model tier and resolution profile."""
    batch_id: str
    bucket_key: str                  # f"{fixture_class}:{resolution_tier}"
    model_tier: str                  # assigned by the routing layer
    frames: tuple[CaptureEnvelope, ...]
    sealed_reason: str               # "size" | "timeout" | "shutdown"
    created_at: str = field(default="")

Downstream, each frame in a batch resolves into a compliance record carrying the familiar typed fields — planogram_id, fixture_id, compliance_percentage, out_of_stock_flags, misplaced_sku_list, and capture_timestamp — but those are produced after detection and post-processing, not by the batching layer itself.
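To make the bucketing rule concrete, here is a minimal sketch of how a bucket key could be derived from a frame’s pixel dimensions. The megapixel thresholds come from the ResolutionTier comments above; the helper names resolution_tier_for and bucket_key, and the fixture class strings, are hypothetical and only illustrate the shape of the key.

```python
from enum import Enum


class ResolutionTier(str, Enum):
    LOW = "low"        # < 2 MP
    STANDARD = "std"   # 2-8 MP
    HIGH = "high"      # > 8 MP


def resolution_tier_for(width_px: int, height_px: int) -> ResolutionTier:
    """Classify a frame by megapixels (hypothetical helper)."""
    megapixels = (width_px * height_px) / 1_000_000
    if megapixels < 2:
        return ResolutionTier.LOW
    if megapixels <= 8:
        return ResolutionTier.STANDARD
    return ResolutionTier.HIGH


def bucket_key(fixture_class: str, tier: ResolutionTier) -> str:
    """Build the key shape documented on InferenceBatch.bucket_key."""
    return f"{fixture_class}:{tier.value}"


if __name__ == "__main__":
    # 3840x2160 is roughly 8.3 MP, so it falls in the HIGH tier.
    print(bucket_key("cooler", resolution_tier_for(3840, 2160)))
    # 1280x720 is under 1 MP, so it falls in the LOW tier.
    print(bucket_key("endcap", resolution_tier_for(1280, 720)))
```

Because the key is a plain string, it can double as a broker partition key or a metric label without extra serialization.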

Implementation Architecture

The aggregator is a small, stateful service that sits between a durable message broker and the GPU worker pool. We back it with Redis Streams or RabbitMQ rather than an in-memory asyncio.Queue for one reason: durability. A frame that has been acknowledged to the store’s uploader but not yet inferred must survive a worker restart, and a broker with consumer groups and persistence gives that guarantee while also providing the backpressure signal (queue depth) the size controller reads. The choice of asyncio for the aggregator process is deliberate too — batching is an I/O-bound coordination problem (broker reads, object-store fetches, dispatch writes), so a single event loop holds thousands of in-flight frames without thread overhead, and the CPU-heavy detection stays isolated in separate GPU workers.

The core is a per-bucket accumulator with a size trigger and a timeout trigger. Each bucket flushes independently, so a fast-filling cooler-aisle bucket never blocks a slow endcap bucket:

import asyncio
import logging
import time
import uuid
from collections import defaultdict

logger = logging.getLogger("batcher")


class DynamicBatcher:
    """Aggregates validated frames into per-bucket inference batches."""

    def __init__(
        self,
        dispatch,                       # async callable(InferenceBatch) -> None
        *,
        target_size: int = 32,
        max_wait_s: float = 0.75,
        max_inflight_batches: int = 8,
    ) -> None:
        self._dispatch = dispatch
        self._target_size = target_size
        self._max_wait_s = max_wait_s
        self._sem = asyncio.Semaphore(max_inflight_batches)
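        self._buffers: dict[str, list[CaptureEnvelope]] = defaultdict(list)
        self._timers: dict[str, asyncio.TimerHandle] = {}
        self._tasks: set[asyncio.Task] = set()   # keeps timeout flushes referenced
        self._lock = asyncio.Lock()

    def _on_timeout(self, bucket: str, model_tier: str) -> None:
        # call_later needs a plain callback, so schedule the coroutine as a task
        # and hold a reference until it finishes (the loop keeps only weak refs).
        task = asyncio.create_task(self._flush(bucket, "timeout", model_tier))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def add(self, frame: CaptureEnvelope, model_tier: str) -> None:
        bucket = f"{frame.fixture_class}:{frame.resolution_tier.value}"
        async with self._lock:
            buf = self._buffers[bucket]
            buf.append(frame)
            if bucket not in self._timers:
                loop = asyncio.get_running_loop()
                self._timers[bucket] = loop.call_later(
                    self._max_wait_s, self._on_timeout, bucket, model_tier
                )
            full = len(buf) >= self._target_size
        # Flush after releasing the lock: asyncio.Lock is not reentrant and
        # _flush acquires it, so flushing inside the block above would deadlock.
        if full:
            await self._flush(bucket, "size", model_tier)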

    async def _flush(self, bucket: str, reason: str, model_tier: str) -> None:
        async with self._lock:
            frames = self._buffers.pop(bucket, [])
            timer = self._timers.pop(bucket, None)
            if timer is not None:
                timer.cancel()
        if not frames:
            return
        batch = InferenceBatch(
            batch_id=uuid.uuid4().hex,
            bucket_key=bucket,
            model_tier=model_tier,
            frames=tuple(frames),
            sealed_reason=reason,
            created_at=time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        )
        # Bound concurrent GPU pressure: never have more than N batches in flight.
        async with self._sem:
            try:
                await self._dispatch(batch)
            except Exception:
                logger.exception("dispatch failed for batch %s (%d frames)",
                                 batch.batch_id, len(frames))
                raise  # surface to the broker so the frames are re-queued, not lost

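Two design points carry their weight in production. The Semaphore capping max_inflight_batches is the real safety valve: it converts an unbounded surge of incoming frames into bounded GPU pressure, so a promotional-reset spike grows the broker backlog (cheap, durable) instead of the GPU memory footprint (expensive, fatal). And _flush re-raises on dispatch failure rather than swallowing it. Because the frames came from a broker with at-least-once delivery, a batch that is never acknowledged is simply redelivered, which is exactly the behavior you want when a GPU worker dies mid-batch. That guarantee depends on the consumer acknowledging frames only after dispatch succeeds: on the timeout path the flush runs in a task scheduled by the timer callback, so the exception reaches the log rather than the caller of add, and redelivery rests on the missing acknowledgement alone.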
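The bounding effect of the semaphore is easy to check in isolation. The sketch below is not the aggregator itself: it fires a burst of fake dispatches through an asyncio.Semaphore and records the peak number running at once. The function name run_surge and the 10 ms sleep standing in for inference time are assumptions made for illustration.

```python
import asyncio


async def run_surge(n_batches: int, max_inflight: int) -> int:
    """Dispatch n_batches concurrently and return the peak number in flight."""
    sem = asyncio.Semaphore(max_inflight)
    inflight = 0
    peak = 0

    async def fake_dispatch() -> None:
        nonlocal inflight, peak
        async with sem:
            inflight += 1
            peak = max(peak, inflight)
            await asyncio.sleep(0.01)  # stand-in for GPU inference time
            inflight -= 1

    await asyncio.gather(*(fake_dispatch() for _ in range(n_batches)))
    return peak


if __name__ == "__main__":
    # 40 batches arrive at once, but the peak in flight never exceeds 8.
    print(asyncio.run(run_surge(n_batches=40, max_inflight=8)))
```

However large the burst, the peak stays at the semaphore’s limit; the excess waits in the caller, which in the real pipeline means it waits in the broker.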

Production Configuration & Tuning

Batch sizing is the single most consequential knob, and it is a memory-and-cost decision, not an accuracy one. Undersized batches pay fixed per-invocation overhead (kernel launches, host-to-device transfer, Python dispatch) on too few frames; oversized batches risk a CUDA out-of-memory kill or blow past the latency budget. The defaults below are a sane starting point for a mixed grocery fleet and should be overridden per deployment via environment variables, so the same container image can run with different limits on an A10G than on an L4:

import os

CONFIG = {
    # Frames per batch at the HIGH resolution tier. Scale up only after
    # confirming steady-state VRAM headroom under load test.
    "BATCH_TARGET_HIGH": int(os.getenv("BATCH_TARGET_HIGH", "8")),
    "BATCH_TARGET_STD": int(os.getenv("BATCH_TARGET_STD", "32")),
    "BATCH_TARGET_LOW": int(os.getenv("BATCH_TARGET_LOW", "64")),

    # Seal a partial batch after this long so slow buckets don't starve.
    "BATCH_MAX_WAIT_S": float(os.getenv("BATCH_MAX_WAIT_S", "0.75")),

    # Hard ceiling on concurrent batches per GPU worker.
    "MAX_INFLIGHT_BATCHES": int(os.getenv("MAX_INFLIGHT_BATCHES", "8")),

    # Shed batch size when device memory utilization crosses this fraction.
    "VRAM_HIGH_WATER": float(os.getenv("VRAM_HIGH_WATER", "0.85")),

    # Latency SLA after which the circuit breaker throttles ingestion (seconds).
    "SCORING_SLA_S": float(os.getenv("SCORING_SLA_S", "2.5")),
}

The numbers that matter and why they are set where they are:

Resolution-tiered targets. A HIGH tier 4K cooler frame can carry an order of magnitude more pixels than a LOW handheld shot, so a fixed batch size across tiers either wastes memory on small frames or OOMs on large ones. Keep BATCH_TARGET_HIGH near 8 and let BATCH_TARGET_LOW rise to 64.
Adaptive shedding at 0.85 VRAM. A background sampler reads device memory each cycle; when utilization crosses VRAM_HIGH_WATER, the size controller halves the target for that bucket and holds it there until utilization recovers. Holding the reduced target, rather than re-evaluating it on every sample, is what gives the controller hysteresis and prevents the oscillation you get from reacting to a single noisy reading.
The 0.75s seal timeout is tuned against the SLA, not the throughput. If end-to-end capture-to-verdict must stay under 2.5s, the aggregator can only afford a fraction of that budget waiting to fill a batch.
Inference-time cache hygiene. GPU workers should clear the framework allocator cache between batch cycles to prevent VRAM fragmentation; the per-worker teardown pattern for that is part of the Celery worker configuration for this stage.
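The adaptive shedding described above could be sketched as a small two-threshold controller. This is one possible reading, not the production size controller: the SizeController name, the low_water recovery threshold of 0.70, and the choice to halve once rather than repeatedly are all assumptions. Only the 0.85 high-water mark comes from the configuration.

```python
from dataclasses import dataclass


@dataclass
class SizeController:
    """Halve a bucket's batch target under VRAM pressure, restore it on recovery."""
    base_target: int
    high_water: float = 0.85   # VRAM_HIGH_WATER from the config
    low_water: float = 0.70    # hypothetical recovery threshold, not from the source
    min_target: int = 1
    shedding: bool = False

    def __post_init__(self) -> None:
        self.current_target = self.base_target

    def observe(self, vram_utilization: float) -> int:
        """Feed one utilization sample (0.0 to 1.0); return the target to use."""
        if not self.shedding and vram_utilization >= self.high_water:
            # Pressure detected: drop to half the configured target.
            self.shedding = True
            self.current_target = max(self.min_target, self.base_target // 2)
        elif self.shedding and vram_utilization <= self.low_water:
            # Only restore once utilization is clearly back under control.
            self.shedding = False
            self.current_target = self.base_target
        return self.current_target


if __name__ == "__main__":
    controller = SizeController(base_target=32)
    for sample in (0.80, 0.90, 0.80, 0.65):
        print(sample, controller.observe(sample))
```

The gap between the two thresholds is what keeps the target from flapping when readings hover around 0.85; a sample of 0.80 after shedding leaves the reduced target in place.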

A circuit breaker enforces the SLA from the other direction: when downstream scoring latency exceeds SCORING_SLA_S, the breaker stops pulling new frames from the broker so the backlog grows in durable storage rather than letting category managers receive stale verdicts dressed up as fresh ones.
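A minimal sketch of such a breaker follows, assuming it trips on the mean of a short rolling window of scoring latencies. The ScoringBreaker name, the window length, and the use of a mean instead of a percentile are illustrative choices; the 2.5 s default mirrors SCORING_SLA_S.

```python
from collections import deque


class ScoringBreaker:
    """Signals the consumer to stop pulling frames while scoring is over the SLA."""

    def __init__(self, sla_s: float = 2.5, window: int = 20) -> None:
        self._sla_s = sla_s
        # Only the most recent `window` samples are kept.
        self._samples: deque[float] = deque(maxlen=window)

    def record(self, latency_s: float) -> None:
        """Record one downstream scoring latency in seconds."""
        self._samples.append(latency_s)

    @property
    def open(self) -> bool:
        """True when the windowed mean exceeds the SLA."""
        if not self._samples:
            return False
        return sum(self._samples) / len(self._samples) > self._sla_s


if __name__ == "__main__":
    breaker = ScoringBreaker(sla_s=2.5, window=3)
    for latency in (1.0, 1.2, 0.9, 4.0, 5.0, 6.0):
        breaker.record(latency)
        print(latency, breaker.open)
```

A consumer loop would check breaker.open before each broker read and sleep briefly while it is true, so unread frames stay in the broker rather than in process memory.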

Failure Modes & Debugging Workflow

When throughput collapses or compliance numbers go stale during a peak audit window, work the diagnosis in this order instead of guessing:

1. Confirm the bottleneck is GPU, not broker. Compare broker queue depth against GPU utilization. A deep, growing queue with the GPU pinned near 100% is correct saturation: add workers. A deep queue with the GPU idle means the aggregator is not dispatching. Check that the in-flight semaphore has not been drained to zero permits by dispatch calls that never return, and that no bucket’s flush task crashed silently.
2. Check for tail-latency starvation on rare buckets. If specific fixture classes report verdicts minutes late while common ones are fast, the seal timeout is too long relative to that bucket’s arrival rate. Confirm by logging sealed_reason: a healthy mix is mostly "size"; a flood of "timeout" seals on one bucket means it never fills and should get a shorter BATCH_MAX_WAIT_S or a coarser bucket key.
3. Reproduce OOM kills under controlled load. A worker that dies with CUDA out-of-memory under peak but not in staging is almost always batching a resolution tier larger than tested. Replay the offending batch with nvidia-smi sampling; if peak VRAM tracks frame count linearly past the 0.85 water mark, the adaptive shedder is not firing. Verify that the VRAM sampler thread is alive and that its reading is per-device memory, not host RSS.
4. Trace frames that never produce a record. A capture_id acknowledged by the uploader but absent from the compliance store is a dropped frame. Because dispatch re-raises on failure, the frame should have returned to the broker; if it did not, a flush swallowed an exception or a batch was sealed with "shutdown" and never re-enqueued. Route these to a dead-letter path carrying image_uri, batch_id, and the failure reason so they can be replayed, not lost. This is the same forensic pattern detailed in Error Handling in Computer Vision Pipelines.
5. Separate batching drift from detection drift. If accuracy degrades but throughput is healthy, the batching layer is probably innocent. Confirm by checking whether confidence scores dropped uniformly (a model or preprocessing regression) or only for one fixture class (a routing or localization issue handled during Bounding Box Extraction & SKU Localization). Batching faults change latency and completeness, not per-frame accuracy.
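For the starvation check in the workflow above, the sealed_reason mix can be tallied straight from logged batches. The sketch below assumes each log entry has already been reduced to a (bucket_key, sealed_reason) pair. The function name, the 50% threshold, and the minimum sample count are hypothetical starting points, not tuned values.

```python
from collections import Counter, defaultdict
from typing import Iterable


def timeout_heavy_buckets(
    seals: Iterable[tuple[str, str]],
    threshold: float = 0.5,
    min_batches: int = 10,
) -> dict[str, float]:
    """Return {bucket_key: timeout_share} for buckets sealed mostly by timeout."""
    per_bucket: dict[str, Counter] = defaultdict(Counter)
    for bucket_key, sealed_reason in seals:
        per_bucket[bucket_key][sealed_reason] += 1

    flagged: dict[str, float] = {}
    for bucket_key, reasons in per_bucket.items():
        total = sum(reasons.values())
        share = reasons["timeout"] / total
        # Skip buckets with too few batches to say anything meaningful.
        if total >= min_batches and share > threshold:
            flagged[bucket_key] = share
    return flagged


if __name__ == "__main__":
    sample = (
        [("cooler:high", "size")] * 9
        + [("cooler:high", "timeout")]
        + [("endcap:low", "timeout")] * 12
    )
    print(timeout_heavy_buckets(sample))
```

A bucket that shows up here is a candidate for a shorter seal timeout or a coarser bucket key, as described in the starvation step.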

The recurring root cause behind most batching incidents is the same: a size target chosen against average load and never re-validated against a worst-case frame. Load-test with synthetic imagery that simulates the largest resolution, heaviest occlusion, and densest facings you expect during a national promotional rollout, and set targets against that worst case rather than the median.

Scaling & Performance Benchmarks

Async batching scales on three axes: broker throughput, GPU concurrency, and the aggregator’s own coordination overhead. The aggregator is cheap — it holds frames in memory and does no inference — so a single event-loop process comfortably fronts a worker pool until it saturates the broker connection, at which point you shard by bucket_key across aggregator replicas rather than scaling the single instance vertically.
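Sharding by bucket_key needs a hash that is stable across processes; Python’s built-in hash() is salted per interpreter for strings, so replicas would disagree. A minimal sketch using hashlib is shown below. The replica_for name and the plain modulo assignment are assumptions, and modulo sharding remaps most keys whenever the replica count changes.

```python
import hashlib


def replica_for(bucket_key: str, n_replicas: int) -> int:
    """Map a bucket key to a replica index that every process agrees on."""
    if n_replicas < 1:
        raise ValueError("n_replicas must be at least 1")
    digest = hashlib.sha256(bucket_key.encode("utf-8")).digest()
    # The first 8 bytes give a large, evenly spread integer.
    return int.from_bytes(digest[:8], "big") % n_replicas


if __name__ == "__main__":
    for key in ("cooler:high", "endcap:low", "gondola:std"):
        print(key, replica_for(key, n_replicas=4))
```

Routing every frame of a bucket to the same replica keeps that bucket’s buffer and timer in one place, so no cross-replica coordination is needed to seal a batch.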

The metric that actually governs capacity is queue depth relative to drain rate. Target a steady-state broker depth near zero during normal trading, and alert when depth exceeds what one full sweep of GPU workers can clear within a single BATCH_MAX_WAIT_S window — sustained growth past that point predicts an unbounded backlog and a missed briefing. Autoscale GPU workers on queue depth, not CPU, because the workers are memory-bound and a CPU-based trigger will under-provision exactly when a surge hits.
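The alert threshold and the autoscaling target both reduce to simple arithmetic on drain rate. The sketch below assumes a measured per-worker throughput figure, which the source does not give; the numbers in the usage example are placeholders. It also ignores frames that keep arriving while the backlog drains, so treat the results as lower bounds.

```python
import math


def alert_depth(workers: int, frames_per_s_per_worker: float,
                window_s: float = 0.75) -> int:
    """Frames the whole pool can drain in one BATCH_MAX_WAIT_S window."""
    return math.floor(workers * frames_per_s_per_worker * window_s)


def workers_needed(queue_depth: int, frames_per_s_per_worker: float,
                   drain_within_s: float) -> int:
    """Smallest worker count that clears queue_depth within drain_within_s."""
    per_worker_capacity = frames_per_s_per_worker * drain_within_s
    return max(1, math.ceil(queue_depth / per_worker_capacity))


if __name__ == "__main__":
    # Placeholder throughput: 100 frames/s per worker.
    print(alert_depth(workers=4, frames_per_s_per_worker=100.0))
    print(workers_needed(queue_depth=3000, frames_per_s_per_worker=100.0,
                         drain_within_s=10.0))
```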

For latency, report two numbers separately. End-to-end capture-to-verdict P95 must hold under the platform budget (commonly 2.5s–30s depending on whether the consumer is a live dashboard or an overnight report). The aggregator’s own contribution — enqueue-to-dispatch — should stay a small, stable fraction of that, dominated by the seal timeout; if it creeps up, the inflight-batch semaphore is the constraint and you need more GPU concurrency, not a bigger batch. Emit OpenTelemetry spans per stage and track batch_queue_depth, gpu_inference_latency_ms, batch_fill_ratio, and sku_detection_confidence so a regression can be attributed to a stage rather than chased across the whole pipeline.
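Two of those numbers are cheap to compute from raw samples before any tracing backend is in place. The sketch below uses a nearest-rank P95 and a plain fill ratio; the function names are illustrative, and a production system would more likely read the percentile from histogram metrics.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def batch_fill_ratio(frame_count: int, target_size: int) -> float:
    """Fraction of the target a sealed batch actually carried."""
    return frame_count / target_size


if __name__ == "__main__":
    latencies = [float(i) for i in range(1, 101)]
    print(p95(latencies))
    print(batch_fill_ratio(12, 32))
```

A low fill ratio alongside mostly timeout seals points at a slow-filling bucket, while a fill ratio near 1.0 with rising enqueue-to-dispatch latency points at the in-flight ceiling.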

On cost: larger batches amortize fixed per-invocation overhead and raise GPU utilization, which is the single biggest lever on inference spend — but only up to the VRAM ceiling, beyond which you trade a marginal utilization gain for OOM risk. The economical operating point is the largest batch that holds steady-state VRAM below 0.85 for your worst-case resolution tier, paired with deferring non-urgent buckets (ambient store-condition imagery) to off-peak windows so the GPU fleet is sized for compliance traffic, not for every frame a store happens to capture.
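As a first estimate of that operating point, the largest safe batch can be derived from a linear memory model: a fixed model footprint plus a per-frame cost at the worst-case resolution. Both the linear model and every figure in the example are assumptions; measured peak VRAM under load test should override the result.

```python
def largest_safe_batch(
    total_vram_gib: float,
    model_gib: float,
    per_frame_gib: float,
    high_water: float = 0.85,
) -> int:
    """Largest frame count that keeps estimated VRAM under the high-water mark."""
    budget = total_vram_gib * high_water - model_gib
    if budget <= 0 or per_frame_gib <= 0:
        return 0
    return int(budget // per_frame_gib)


if __name__ == "__main__":
    # Hypothetical device and footprints: 24 GiB card, 4 GiB of weights,
    # 2 GiB of activations per worst-case HIGH-tier frame.
    print(largest_safe_batch(24.0, 4.0, 2.0))
```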

Frequently Asked Questions

How do I choose a batch size for shelf photos versus generic image classification? Shelf frames vary enormously in resolution and density, so a single global batch size is wrong by construction. Bucket by fixture_class and resolution tier, then set each bucket’s target to the largest count that keeps steady-state VRAM under 0.85 for that tier’s worst-case frame — roughly 8 for HIGH 4K captures and up to 64 for LOW handheld shots. Tune against a load test that replays your heaviest frames, not the median.

Why use a durable broker instead of an in-memory async queue for batching? Because a frame acknowledged to a store uploader must survive a worker crash. An in-memory queue loses everything on restart, silently dropping compliance records. A broker with consumer groups and persistence gives at-least-once delivery, so an unacknowledged batch is redelivered rather than lost, and it exposes queue depth as the backpressure and autoscaling signal the size controller reads.

What stops a surge of uploads from OOM-killing the GPU workers? A semaphore caps max_inflight_batches per worker, so an arrival surge grows the durable broker backlog instead of GPU memory. On top of that, an adaptive shedder halves the per-bucket batch size when device memory crosses the 0.85 water mark, with hysteresis so it reacts to a sustained trend rather than one noisy reading. Memory pressure is absorbed in cheap storage, never on the device.

How does batching interact with model routing? Routing assigns each frame a model tier and bucket key before it reaches the aggregator, and a batch never straddles two tiers. The batching layer respects that key and groups only same-tier, same-resolution frames so one tensor invocation runs a single detector efficiently. The routing logic itself, including secondary-endpoint failover, lives in the Vision Model Routing component.

What latency should the batching stage itself add? Only its seal timeout, typically a fraction of a second. The aggregator does no inference, so its enqueue-to-dispatch contribution should be small and stable — dominated by BATCH_MAX_WAIT_S for slow-filling buckets and near zero for buckets that fill to target. If that number creeps up, the inflight-batch ceiling is the constraint and the fix is more GPU concurrency, not a larger batch.

Related

Implementing Celery for Async Shelf Photo Processing: the worker runtime, queue topology, and retry config that execute these batches
Vision Model Routing for Shelf Detection: how each batch’s model tier and bucket key are assigned upstream
Bounding Box Extraction & SKU Localization: the detection stage that consumes batched frames and emits facings
Error Handling in Computer Vision Pipelines: dead-letter forensics and retry semantics for dropped or failed batches
Image Parsing & Computer Vision Workflows: the workflow section this batching layer belongs to
