Implementing Celery for Async Shelf Photo Processing

This guide sits within the Async Image Batching for High-Volume Stores component and solves one specific operational task: wiring a Celery task queue so that thousands of high-resolution shelf photographs per hour move off the request path and through vision inference without timeouts, connection-pool exhaustion, or lost frames. When field auditors, smart carts, and fixed aisle cameras all push captures at once, a synchronous handler that decodes, normalizes, runs detection, and commits compliance results inside one HTTP request will block, back up, and shed load during exactly the peak audit windows that matter most. The steps below stand up a Redis broker, a routed worker topology, fault-tolerant retries, and a compliance result backend so a planogram program can absorb burst traffic and still report at the shelf, aisle, and store level.

Prerequisites

Before applying this page, confirm the following are in place:

Python 3.11+ with celery[redis], opencv-python-headless, numpy, and your inference runtime (torch or onnxruntime) installed in the worker image.
A Redis instance reachable from every worker, reserved for queue traffic (this guide uses database index 0 for the broker and 1 for the result backend).
A detection model already selected per store format. Model selection itself belongs to Vision Model Routing for Shelf Detection; this page consumes that decision as a routing key.
A planogram compliance schema to write into, using the typed fields the rest of the site assumes — planogram_id, fixture_id, compliance_percentage, out_of_stock_flags, capture_timestamp.
GPU inference nodes with CUDA_VISIBLE_DEVICES pinned per worker, plus CPU nodes for decode and normalization.

Step-by-Step Implementation

Step 1: Harden the Redis broker against frame loss

Configure Redis so a broker restart or rolling deploy never silently drops a task payload. Enable AOF (Append-Only File) logging alongside periodic RDB snapshots, and set maxmemory-policy noeviction so the broker refuses new writes rather than evicting queued captures under memory pressure. Isolate queue traffic onto a dedicated database index and apply network ACLs so application session caching never shares the same keyspace. RabbitMQ remains viable when you need strict global FIFO ordering, but shelf pipelines are stateless and idempotent at the task level, so they favor Redis throughput and graceful degradation over ordering guarantees.

# redis.conf (broker)
appendonly yes
appendfsync everysec
save 900 1
maxmemory-policy noeviction
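
These settings can drift between environments, so it may help to check them at deploy time. The sketch below assumes the live settings have already been fetched into a plain dict of strings (for example with redis-py's `config_get()` on a client created with `decode_responses=True`). The function name `find_broker_misconfig` and the choice of checks are illustrative, not part of the original setup.

```python
def find_broker_misconfig(config: dict[str, str]) -> list[str]:
    """Return human-readable problems found in a broker's Redis settings.

    `config` maps setting names to string values (assumed to come from
    a CONFIG GET call; no Redis connection is made here).
    """
    problems = []
    if config.get("appendonly") != "yes":
        problems.append("appendonly should be 'yes' so queued tasks survive a restart")
    if config.get("appendfsync") not in {"everysec", "always"}:
        problems.append("appendfsync should be 'everysec' or 'always'")
    if config.get("maxmemory-policy") != "noeviction":
        problems.append("maxmemory-policy should be 'noeviction' so queued tasks are never evicted")
    return problems


# Hypothetical inputs: one broker that matches redis.conf above, one that does not.
good = {"appendonly": "yes", "appendfsync": "everysec", "maxmemory-policy": "noeviction"}
bad = {"appendonly": "no", "appendfsync": "everysec", "maxmemory-policy": "allkeys-lru"}
```

Calling `find_broker_misconfig(bad)` returns one message per violated setting, which a deploy script could print before refusing to start workers.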

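Step 2: Define the queue topology and worker-safe defaults

Split work into one queue per pipeline stage and per store format so CPU decode never starves GPU inference. Set task_acks_late=True so a message is acknowledged only after its task finishes, task_reject_on_worker_lost=True so a task whose worker process dies mid-inference is requeued instead of marked failed, and worker_prefetch_multiplier=1 so each worker process reserves only one message at a time. With the Redis transport, a message that is never acknowledged is redelivered once visibility_timeout expires, so keep that value (3600 seconds here) longer than your slowest task.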

# celery_app.py
import os
from celery import Celery
from kombu import Queue, Exchange

BROKER_URL = os.getenv("CELERY_BROKER_URL", "redis://redis-broker:6379/0")
RESULT_BACKEND = os.getenv("CELERY_RESULT_BACKEND", "redis://redis-broker:6379/1")

app = Celery("shelf_vision", broker=BROKER_URL, backend=RESULT_BACKEND, include=["shelf_tasks"])

app.conf.task_queues = (
    Queue("ingestion", Exchange("shelf"), routing_key="ingestion"),
    Queue("preprocessing", Exchange("shelf"), routing_key="preprocessing"),
    Queue("gondola_inference", Exchange("shelf"), routing_key="gondola"),
    Queue("endcap_inference", Exchange("shelf"), routing_key="endcap"),
    Queue("compliance_aggregation", Exchange("shelf"), routing_key="compliance"),
)

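app.conf.task_routes = {
    "shelf_tasks.validate_and_ingest": {"queue": "ingestion"},
    "shelf_tasks.normalize_lighting": {"queue": "preprocessing"},
    # route_detection only dispatches, so it shares the CPU preprocessing pool;
    # without a route it would land on the default queue, which no worker here consumes.
    "shelf_tasks.route_detection": {"queue": "preprocessing"},
    "shelf_tasks.run_gondola_detection": {"queue": "gondola_inference"},
    "shelf_tasks.run_endcap_detection": {"queue": "endcap_inference"},
    "shelf_tasks.aggregate_compliance": {"queue": "compliance_aggregation"},
}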

app.conf.update(
    task_serializer="json",
    accept_content=["json"],
    result_serializer="json",
    timezone="UTC",
    enable_utc=True,
    task_acks_late=True,
    worker_prefetch_multiplier=1,
    task_reject_on_worker_lost=True,
    broker_transport_options={"visibility_timeout": 3600, "retry_on_timeout": True},
)
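
A routing table like this is easy to get subtly wrong: a task can point at a queue that was never declared, or at a queue no worker listens on. The following is a minimal sketch of a pre-deploy check, assuming you pass in the declared queue names, the `task_routes` mapping, and the set of queues your workers are started with. The helper name `find_routing_gaps` and the sample deployment are hypothetical.

```python
def find_routing_gaps(
    declared_queues: set[str],
    task_routes: dict[str, dict[str, str]],
    consumed_queues: set[str],
) -> dict[str, list[str]]:
    """Report tasks routed to undeclared queues and routed queues nobody consumes."""
    undeclared = sorted(
        task for task, route in task_routes.items()
        if route["queue"] not in declared_queues
    )
    routed_queues = {route["queue"] for route in task_routes.values()}
    unconsumed = sorted(routed_queues - consumed_queues)
    return {"undeclared_targets": undeclared, "unconsumed_queues": unconsumed}


# Hypothetical deployment: two declared queues, and only a CPU worker running.
declared = {"ingestion", "gondola_inference"}
routes = {
    "shelf_tasks.validate_and_ingest": {"queue": "ingestion"},
    "shelf_tasks.run_gondola_detection": {"queue": "gondola_inference"},
    "shelf_tasks.run_endcap_detection": {"queue": "endcap_inference"},
}
gaps = find_routing_gaps(declared, routes, consumed_queues={"ingestion"})
```

Here `gaps` flags the endcap task as pointing at an undeclared queue and lists both inference queues as unconsumed, which is the kind of gap that otherwise shows up only as silently growing queue depth.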

Step 3: Decompose ingestion into a chained task graph

A monolithic task that handles upload, preprocessing, inference, and the database commit will fail as a unit on any partial outage. Break the work into discrete, chainable tasks: ingestion validates metadata, applies EXIF orientation correction, strips PII from metadata headers, and derives a deterministic task ID. Tie that ID to a SHA-256 hash of the raw image bytes so a redelivered, unacknowledged message never produces a duplicate detection result.

# shelf_tasks.py
import gc
import hashlib
import logging
import os

import cv2
import numpy as np
import torch
from celery import chain

from celery_app import app

logger = logging.getLogger(__name__)


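@app.task(
    bind=True,
    autoretry_for=(Exception,),
    # A malformed payload will not fix itself, so exclude KeyError from
    # automatic retries (dont_autoretry_for requires Celery 5.3+).
    dont_autoretry_for=(KeyError,),
    retry_backoff=True,
    retry_backoff_max=600,
    retry_jitter=True,
    max_retries=3,
    name="shelf_tasks.validate_and_ingest",
)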
def validate_and_ingest(self, image_payload: dict) -> str:
    """Validate EXIF, derive a deterministic ID, and route to preprocessing."""
    try:
        image_hash = hashlib.sha256(image_payload["raw_bytes"]).hexdigest()
        self.update_state(state="PROGRESS", meta={"step": "validation", "hash": image_hash})
        corrected_bytes = apply_exif_orientation(image_payload["raw_bytes"])

        store_format = image_payload.get("store_format", "standard")
        routing_key = "endcap" if store_format == "flagship" else "gondola"

        chain(
            normalize_lighting.s(corrected_bytes, image_hash),
            route_detection.s(routing_key),
            aggregate_compliance.s(image_payload["store_id"]),
        ).apply_async()
        return image_hash
    except KeyError as exc:
        logger.error("Malformed payload, no retry: %s", exc)
        raise
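
The task above computes the hash but does not show how it becomes a task ID. One possible approach, sketched below under stated assumptions, maps the SHA-256 digest to a UUID-formatted ID that can be passed to `apply_async(task_id=...)`. The namespace constant `SHELF_NAMESPACE`, its `shelf-vision.example` seed, and the helper name are hypothetical. Note that Celery does not reject a second task with the same ID, so the ID helps tracing and result lookup while the actual dedupe still has to happen at commit time.

```python
import hashlib
import uuid

# Hypothetical namespace so IDs from this pipeline cannot collide with other UUIDv5 users.
SHELF_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "shelf-vision.example")


def deterministic_task_id(raw_bytes: bytes) -> tuple[str, str]:
    """Return (image_hash, task_id), both derived only from the image content."""
    image_hash = hashlib.sha256(raw_bytes).hexdigest()
    # UUIDv5 over the hex digest yields the same ID every time the same bytes arrive.
    task_id = str(uuid.uuid5(SHELF_NAMESPACE, image_hash))
    return image_hash, task_id


first = deterministic_task_id(b"example-frame-bytes")
again = deterministic_task_id(b"example-frame-bytes")
other = deterministic_task_id(b"a-different-frame")
```

Because `first` and `again` are identical, a redelivered capture can be recognized by ID alone, while `other` gets a distinct one.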

Step 4: Normalize retail lighting on CPU workers

Fluorescent aisles, daylight zones, and freezer glare wreck detection confidence before the model ever runs. Apply CLAHE on the L channel in LAB space so contrast is equalized without blowing out packaging color. This stage is CPU- and I/O-bound, so it runs on the high-concurrency preprocessing pool, never on the GPU nodes.

@app.task(bind=True, name="shelf_tasks.normalize_lighting", max_retries=2)
def normalize_lighting(self, image_bytes: bytes, image_hash: str) -> tuple[bytes, str]:
    """Equalize contrast with CLAHE to offset retail lighting variance."""
    decoded = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), cv2.IMREAD_COLOR)
    if decoded is None:
        # A corrupt frame will not decode on a later attempt, so fail without retrying.
        raise ValueError(f"Undecodable frame for hash={image_hash}")
    try:
        lab = cv2.cvtColor(decoded, cv2.COLOR_BGR2LAB)
        l_chan, a_chan, b_chan = cv2.split(lab)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        # Equalize only the lightness channel so packaging colors are preserved.
        merged = cv2.merge([clahe.apply(l_chan), a_chan, b_chan])
        normalized = cv2.cvtColor(merged, cv2.COLOR_LAB2BGR)
        ok, buffer = cv2.imencode(".jpg", normalized)
        if not ok:
            raise RuntimeError(f"JPEG encode failed for hash={image_hash}")
        return buffer.tobytes(), image_hash
    except Exception as exc:
        logger.warning("Lighting normalization failed: %s", exc)
        raise self.retry(exc=exc, countdown=10)

Step 5: Route to the right detector and pin the GPU

Dispatch standard gondola shots to a lightweight detector and complex endcap or promotional displays to a heavier one, keeping each on its own queue. Pin every GPU worker to a single CUDA device and cap concurrency at 1 per process: two PyTorch or ONNX Runtime sessions competing for one card's memory can exhaust VRAM, which tends to surface as a worker killed without a traceback or as unreliable detections. Force garbage collection and clear the CUDA cache at the end of each task so cached allocations do not accumulate across thousands of sequential inferences.

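@app.task(bind=True, name="shelf_tasks.route_detection")
def route_detection(self, payload: tuple[bytes, str], routing_key: str):
    """Hand the frame to the detector that matches its routing key."""
    image_bytes, image_hash = payload
    target = run_endcap_detection if routing_key == "endcap" else run_gondola_detection
    # replace() swaps this task for the detector inside the chain, so
    # aggregate_compliance receives the detections rather than the
    # AsyncResult that .delay() would return.
    return self.replace(target.s(image_bytes, image_hash))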


@app.task(bind=True, name="shelf_tasks.run_gondola_detection", max_retries=1)
def run_gondola_detection(self, image_bytes: bytes, image_hash: str) -> list[dict]:
    """Lightweight inference for standard gondola shelves."""
    try:
        detections = run_inference_pipeline(image_bytes, model_variant="yolov8n_shelf")
        return detections
    except Exception as exc:
        raise self.retry(exc=exc, countdown=30)
    finally:
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

Step 6: Aggregate detections into compliance results

The final task in the chain scores detections against the master planogram and writes a structured result through the Celery result backend. Normalize outputs against the planogram schema so downstream dashboards read a stable contract. The same compliance_percentage value is what the Threshold Tuning for Compliance Accuracy module calibrates against, which keeps the scoring boundary consistent across the platform.

{
  "planogram_id": "PLN-2026-0421",
  "fixture_id": "AISLE-07-BAY-03",
  "store_id": "STR-1184",
  "compliance_percentage": 94.2,
  "out_of_stock_flags": ["SKU-88231", "SKU-90114"],
  "capture_timestamp": "2026-06-28T14:07:33Z",
  "status": "COMMITTED"
}
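
The guide does not show how the score itself is computed, so the following is only a sketch of one plausible scoring rule: count detected facings per SKU, cap each at the planogram's expected count, and divide by the total expected. The detection shape (`{"sku": ...}`), the `expected_facings` mapping, and the function name `score_compliance` are assumptions for illustration; your planogram master data may call for a different rule.

```python
from collections import Counter
from datetime import datetime, timezone


def score_compliance(
    detections: list[dict],
    expected_facings: dict[str, int],
    *,
    planogram_id: str,
    fixture_id: str,
    store_id: str,
    captured_at: datetime,
) -> dict:
    """Build a compliance record from detections (assumed shape: {"sku": str})."""
    detected = Counter(d["sku"] for d in detections)
    total_expected = sum(expected_facings.values())
    # Extra facings beyond the planogram count do not earn extra credit.
    matched = sum(min(detected[sku], count) for sku, count in expected_facings.items())
    percentage = round(100.0 * matched / total_expected, 1) if total_expected else 0.0
    return {
        "planogram_id": planogram_id,
        "fixture_id": fixture_id,
        "store_id": store_id,
        "compliance_percentage": percentage,
        "out_of_stock_flags": sorted(s for s in expected_facings if detected[s] == 0),
        # captured_at is assumed to be timezone-aware.
        "capture_timestamp": captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "status": "COMMITTED",
    }


record = score_compliance(
    [{"sku": "SKU-1"}, {"sku": "SKU-1"}, {"sku": "SKU-2"}],
    {"SKU-1": 2, "SKU-2": 1, "SKU-3": 1},
    planogram_id="PLN-2026-0421",
    fixture_id="AISLE-07-BAY-03",
    store_id="STR-1184",
    captured_at=datetime(2026, 6, 28, 14, 7, 33, tzinfo=timezone.utc),
)
```

In this example three of four expected facings are present, so `record["compliance_percentage"]` is 75.0 and `SKU-3` is the only out-of-stock flag.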

Step 7: Launch CPU and GPU worker pools

Run process-isolated workers with concurrency matched to each node’s role. Size the CPU pool to physical cores minus one to leave headroom for OS scheduling and broker heartbeats; hold GPU inference at 1.

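# CPU node: ingest, decode + normalize, and compliance aggregation
# (--concurrency=7 assumes 8 physical cores; compliance_aggregation is
# listed here because no other worker in this guide consumes it)
celery -A celery_app worker --pool=prefork --concurrency=7 \
  --loglevel=INFO -Q preprocessing,ingestion,compliance_aggregation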

# GPU node: inference only, one session per card
CUDA_VISIBLE_DEVICES=0 celery -A celery_app worker --pool=prefork \
  --concurrency=1 --loglevel=INFO -Q gondola_inference,endcap_inference
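
On a node with several cards, the GPU command above has to be repeated once per device. The sketch below generates those command lines, assuming one worker per card and a distinct node name per worker through Celery's `-n` option (`%h` expands to the hostname). The helper name `gpu_worker_commands` and the `gpu{i}` naming scheme are illustrative choices.

```python
def gpu_worker_commands(num_gpus: int, queues: list[str], app: str = "celery_app") -> list[str]:
    """Return one `celery worker` command per GPU, each pinned to a single device."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    queue_arg = ",".join(queues)
    return [
        # One process per card: CUDA_VISIBLE_DEVICES hides every other device.
        f"CUDA_VISIBLE_DEVICES={i} celery -A {app} worker --pool=prefork "
        f"--concurrency=1 --loglevel=INFO -n gpu{i}@%h -Q {queue_arg}"
        for i in range(num_gpus)
    ]


commands = gpu_worker_commands(2, ["gondola_inference", "endcap_inference"])
```

Each string in `commands` can be handed to a process supervisor; the unique `-n` value keeps the two workers on one host from sharing a node name.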

Verification & Testing

Confirm the pipeline is production-ready before pointing live stores at it:

Idempotency. Submit the same payload twice and assert both runs resolve to one image_hash and a single committed compliance row. Duplicate rows mean the SHA-256 dedupe in Step 3 is not wired to your persistence layer.
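Late-ack recovery. Kill a GPU worker mid-inference (kill -9) and confirm the in-flight message reappears on gondola_inference and is picked up by a sibling worker, proving task_acks_late and task_reject_on_worker_lost hold. If the kill takes down the worker's parent process as well, Redis redelivers the message only after visibility_timeout expires, so either wait out the 3600 seconds set in Step 2 or lower the value for the test.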
Backpressure. Push 5000 payloads and watch queue depth; the broker should buffer rather than evict, and the evicted_keys counter in redis-cli info stats must stay at zero.
GPU stability. Run 2000 sequential inferences and watch nvidia-smi; resident VRAM should plateau, not climb. A monotonic rise suggests that tensors are still referenced between tasks or that the empty_cache() call in the finally block is not being reached.
Latency SLA. With Flower or a Prometheus exporter, assert p95 end-to-end latency stays under your audit-window target and inference success rate holds above 99.5%.
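
The latency and success-rate check in the last item can be expressed as a small assertion helper. This is a sketch that assumes you have already exported per-task end-to-end latencies (in seconds) and success flags from Flower or Prometheus into plain lists; the function name `meets_sla` and the sample numbers are hypothetical.

```python
import statistics


def meets_sla(
    latencies_s: list[float],
    outcomes: list[bool],
    p95_target_s: float,
    min_success_rate: float = 0.995,
) -> dict:
    """Compare p95 latency and success rate against the audit-window targets."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(latencies_s, n=100, method="inclusive")[94]
    success_rate = sum(outcomes) / len(outcomes)
    return {
        "p95_s": p95,
        "success_rate": success_rate,
        "ok": p95 < p95_target_s and success_rate > min_success_rate,
    }


# Hypothetical samples: latencies of 1..100 seconds, and two outcome sets.
sample_latencies = [float(n) for n in range(1, 101)]
healthy = meets_sla(sample_latencies, [True] * 1000, p95_target_s=120.0)
degraded = meets_sla(sample_latencies, [True] * 990 + [False] * 10, p95_target_s=120.0)
```

`healthy["ok"]` is true, while `degraded["ok"]` is false because a 99.0% success rate falls below the 99.5% floor.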

Troubleshooting

Symptom	Root cause	Remediation
Queue depth grows unbounded during peak audits	GPU concurrency too low or one detector starved by shared queue	Scale `gondola_inference` workers horizontally and keep endcap traffic on its own queue so heavy frames never block light ones
Workers silently die under load, no traceback	Two inference sessions share one CUDA device and trigger an OOM kill	Pin `CUDA_VISIBLE_DEVICES` per worker and hold `--concurrency=1` on GPU nodes
Same shelf scored twice in the dashboard	Broker redelivered an unacknowledged message and dedupe is missing	Key the commit on the Step 3 SHA-256 `image_hash` and upsert rather than insert
Tasks retry forever on a corrupt frame	Blanket `autoretry_for=(Exception,)` catches permanent decode errors	Raise non-retryable errors (e.g. `KeyError`, `ValueError`) past the retry guard and route them to a dead-letter queue for inspection, as covered in Debugging Vision Model Drift in Retail Environments
Lost tasks after a broker restart	Redis running without persistence or with an evicting policy	Enable AOF and set `maxmemory-policy noeviction` per Step 1

Frequently Asked Questions

Should I use Redis or RabbitMQ as the broker? Use Redis for stateless shelf inference where idempotency is enforced at the task level — it gives lower latency and simpler operations. Reach for RabbitMQ only when you genuinely need strict global FIFO ordering, which shelf compliance pipelines almost never do.

Why set worker_prefetch_multiplier to 1? Long-running inference tasks should not be hoarded by a single worker. A prefetch of 1 keeps messages on the broker until a worker is actually free, so a slow or dying node cannot strand a batch of captures it will never finish.

How do I stop duplicate compliance rows when the broker redelivers a message? Derive a deterministic ID from the SHA-256 hash of the raw image bytes at ingestion and make the final commit an upsert keyed on that hash. Redelivery then overwrites rather than duplicates, which is the same idempotency guarantee the batching layer relies on.
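
An upsert keyed on the image hash might look like the sketch below. It uses the standard-library `sqlite3` module purely so the example runs anywhere (the `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or newer, and PostgreSQL accepts the same clause). The table name `compliance_results`, its columns, and the helper names are assumptions, not the site's actual schema.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) results table with image_hash as the primary key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS compliance_results ("
        "image_hash TEXT PRIMARY KEY, "
        "store_id TEXT NOT NULL, "
        "compliance_percentage REAL NOT NULL)"
    )
    return conn


def commit_result(conn: sqlite3.Connection, image_hash: str, store_id: str, pct: float) -> None:
    """Insert the result, or overwrite the existing row for the same image hash."""
    conn.execute(
        """
        INSERT INTO compliance_results (image_hash, store_id, compliance_percentage)
        VALUES (?, ?, ?)
        ON CONFLICT(image_hash) DO UPDATE SET
            store_id = excluded.store_id,
            compliance_percentage = excluded.compliance_percentage
        """,
        (image_hash, store_id, pct),
    )
    conn.commit()


conn = open_store()
commit_result(conn, "abc123", "STR-1184", 94.2)
commit_result(conn, "abc123", "STR-1184", 94.2)  # simulated broker redelivery
row_count = conn.execute("SELECT COUNT(*) FROM compliance_results").fetchone()[0]
```

After the simulated redelivery `row_count` is still 1, which is the property the idempotency check in Verification & Testing looks for.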

Where should model selection live — in Celery or upstream? Keep Celery responsible only for routing to the correct queue. The decision of which architecture handles which fixture belongs to the routing logic described in Vision Model Routing for Shelf Detection; the task graph just carries the resulting key.

Related

Async Image Batching for High-Volume Stores — the parent component this Celery topology implements.
Vision Model Routing for Shelf Detection — how the per-fixture detector decision that drives Step 5 routing is made.
Error Handling in Computer Vision Pipelines — dead-letter queue forensics and retry policy for the failure modes above.
