Why raise the box loss weight when training YOLOv8 for shelf detection?

On dense grocery shelving a box that is a few pixels off merges two adjacent facings, so spatial precision matters more than classification early in training. Raising the box loss weight toward 10.0 relative to cls and dfl forces the anchor-free head to nail localization first, after which classification converges on top of accurate boxes rather than fighting them.

When should I use INT8 quantization instead of FP16 for the exported model?

Default to FP16 on modern NVIDIA hardware because it preserves accuracy with minimal effort. Only move to INT8 after running calibration on a representative dataset, and only if its mAP gap to FP16 stays under 2 percent on a holdout. Un-calibrated INT8 degrades accuracy badly on low-contrast packaging, so it is a deliberate optimization, never a default.

Why disable mosaic augmentation late in training?

Mosaic stitches four images into one training tile, which helps the model generalize early but distorts precise shelf boundaries and barcode regions that matter for planogram alignment. Disabling it for the final 60 epochs lets the model fine-tune on undistorted, real shelf geometry so its boxes line up with actual facings.

How does the optimized detector connect to planogram compliance?

The post-processing pass maps every surviving detection to a shelf_id by validating its centroid against the calibrated shelf grid, producing typed records of class, confidence, box, and shelf level. Those records are exactly what the position-validation and threshold-tuning stages consume to compute a compliance score, so shelf-grid assignment lives in detection, not the compliance layer.

My mAP is strong in training but accuracy drops in stores — what now?

That pattern is distribution drift, not a training defect. The live capture distribution has moved away from the training set as stores re-light and packaging refreshes. Diagnose it with the drift-monitoring workflow and retrain or re-band on recent captures rather than re-tuning hyperparameters on stale data.

Optimizing YOLOv8 for Grocery Shelf Detection

This page sits under Vision Model Routing for Shelf Detection and answers one precise question: how do you turn a stock YOLOv8 checkpoint into the dense-packing detector that the router dispatches high-resolution gondola frames to? Out of the box, YOLOv8 treats every bounding box as an independent classification target, which collapses under real grocery conditions: facings packed 8-wide with millimetre gaps, repeated packaging geometry that fools the classification head, and lighting that swings from 6500K daylight at the entrance to warm refrigerated spill. The fix is not one hyperparameter. It is an ordered procedure of annotation discipline, loss-weighted training, a hardened export, and a deterministic post-processing pass. Each step below is independently verifiable, so the model graduates from a generic detector into a compliance-grade one whose output the bounding box extraction & SKU localization stage can trust.

Prerequisites & Context

This procedure assumes detection — not routing — is your problem. If recall is fine but phantom boxes are inflating share-of-shelf, suppression belongs in Reducing False Positives in SKU Bounding Boxes instead. Confirm the following before you start:

Runtime: Python 3.11+ with ultralytics>=8.1, torch, opencv-python, and numpy on a CUDA host. TensorRT compilation needs trtexec on the PATH.
Labelled dataset: SKU-level annotations in YOLO format that preserve shelf-level hierarchy — each facing labelled to its catalog sku, not a generic product class — with shelf edges and promotional signage labelled as explicit negative classes.
Fixture metadata: the per-fixture shelf-grid coordinates from one-time store calibration, so the post-processing pass can map a detection’s centroid to a shelf_id rather than guessing.
Hardware target: the GPU tier the router declares for this endpoint (the dense-packing tier expects a modern NVIDIA card supporting FP16).
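A baseline number: the validation mAP of yolov8n.pt fine-tuned on your dataset with default hyperparameters, so every change below is measured against a real starting point rather than intuition. The stock COCO-pretrained checkpoint has no SKU classes, so it cannot be scored on your labels directly.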

The goal is high recall on tightly packed facings and clean spatial mapping, so each step is paired with a metric you can assert on.

Step 1: Curate the Dataset and Mine Hard Negatives

Standard COCO-style annotation collapses under retail density because it ignores shelf hierarchy and the planogram grid. Effective shelf detection needs SKU-level labels that preserve the spatial relationship between facings, shelf edges, and signage. Use stratified sampling so low-velocity SKUs, seasonal endcaps, and temporary displays are represented in proportion to how often they actually appear — not how often they are photographed.
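
The stratified sampling described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual sampler: it assumes each capture already carries a stratum tag (for example "core_aisle" or "endcap") and that you have an estimate of each stratum's real-world share. The function name `stratified_sample` and both input structures are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(
    captures: list[tuple[str, str]],
    target_share: dict[str, float],
    total: int,
    seed: int = 0,
) -> list[str]:
    """Pick about `total` image ids so each stratum matches its target share.

    captures: (image_id, stratum) pairs.
    target_share: stratum -> fraction of the sample it should occupy.
    A stratum with fewer images than its quota contributes all it has.
    """
    rng = random.Random(seed)  # seeded so the split is reproducible
    by_stratum: dict[str, list[str]] = defaultdict(list)
    for image_id, stratum in captures:
        by_stratum[stratum].append(image_id)

    picked: list[str] = []
    for stratum, share in target_share.items():
        pool = sorted(by_stratum.get(stratum, []))
        quota = min(len(pool), round(total * share))
        picked.extend(rng.sample(pool, quota))
    return picked
```

Driving the quota from `target_share` rather than from the size of each pool is what keeps heavily photographed aisles from crowding out rarely photographed displays.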

Then apply sliding-window tiling so the network learns local typography and packaging texture instead of global scene context, and mine hard negatives aggressively: inject crops of adjacent SKUs with near-identical palettes, partially occluded facings, and empty gaps. This forces the classification head to discriminate on fine packaging cues rather than defaulting to a confident wrong answer.

# hard_negative_sampler.py
from pathlib import Path

import cv2
import numpy as np


def generate_hard_negatives(
    image_dir: Path,
    output_dir: Path,
    patch_px: int = 128,
    stride_px: int = 64,
) -> int:
    """Extract overlapping adjacent-facing patches and apply controlled color
    jitter so the classifier learns fine-grained discrimination. Returns the
    number of negative crops written."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for img_path in sorted(Path(image_dir).glob("*.jpg")):
        img = cv2.imread(str(img_path))
        if img is None:
            continue  # skip unreadable frames rather than crashing the run
        h, w = img.shape[:2]
        # +1 so a window ending exactly on the image edge is still included.
        for y in range(0, h - patch_px + 1, stride_px):
            for x in range(0, w - patch_px + 1, stride_px):
                patch = img[y:y + patch_px, x:x + patch_px]
                # Simulate lighting variance across daylight/refrigerated zones.
                jittered = cv2.convertScaleAbs(
                    patch,
                    alpha=float(np.random.uniform(0.9, 1.1)),
                    beta=float(np.random.uniform(-10, 10)),
                )
                out_name = f"hn_{img_path.stem}_{x}_{y}.jpg"
                # Count only crops that were actually written to disk.
                if cv2.imwrite(str(output_dir / out_name), jittered):
                    written += 1
    return written

Verify this step: confirm hard negatives are roughly 15–25% of the training set and that every catalog sku has at least 50 labelled instances; classes below that floor are where precision will later crater.
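
A small check along these lines can turn that verification into an assertion. It is a sketch under stated assumptions: labels are standard YOLO `.txt` files whose first field on each line is the class id, and how you count hard-negative crops depends on your own pipeline, so that count is passed in. The function names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_instances(label_dir: Path) -> Counter:
    """Count labelled boxes per class id across YOLO-format .txt label files."""
    counts: Counter = Counter()
    for label_path in sorted(label_dir.glob("*.txt")):
        for line in label_path.read_text().splitlines():
            fields = line.split()
            if fields:  # ignore blank lines
                counts[int(fields[0])] += 1
    return counts


def classes_below_floor(
    counts: Counter, class_ids: list[int], floor: int = 50
) -> list[int]:
    """Return class ids with fewer than `floor` instances, absent ones included."""
    return [c for c in class_ids if counts[c] < floor]


def hard_negative_share(n_hard_negatives: int, n_total: int) -> float:
    """Fraction of the training set made up of hard-negative crops."""
    return n_hard_negatives / n_total if n_total else 0.0
```

In a CI job you might assert that `classes_below_floor(...)` is empty and that `hard_negative_share(...)` falls between 0.15 and 0.25 before any training run is launched.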

Step 2: Tune Loss Weighting and the Augmentation Schedule

YOLOv8 is anchor-free, which simplifies deployment but makes loss weighting the lever that matters. Early in training, raise the box loss weight relative to cls and dfl so the network nails spatial localization before classification converges — on dense shelves, a box that is 5px off merges two facings. Control augmentation tightly: disable mosaic after epoch 60 so synthetic blending stops distorting shelf boundaries and barcode regions, and cap mixup at 0.1 to avoid unrealistic color gradients that degrade SKU discrimination. Drive the learning rate with cosine annealing and a 5-epoch linear warmup to stabilize gradients across high-density batches.

# custom_shelf_training.yaml
task: detect
mode: train
model: yolov8n.pt
data: shelf_dataset.yaml
epochs: 120
patience: 15
batch: 32
imgsz: 1280          # dense facings need resolution; do not drop below 1280
optimizer: AdamW
lr0: 0.001
lrf: 0.01
cos_lr: True         # cosine annealing; the Ultralytics default is linear decay
warmup_epochs: 5
warmup_bias_lr: 0.0
warmup_momentum: 0.8
box: 10.0            # prioritize localization over classification early
cls: 0.5
dfl: 1.5
mosaic: 1.0
mixup: 0.1
copy_paste: 0.0
close_mosaic: 60     # mosaic off for the final 60 epochs
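
To make the warmup-plus-cosine shape concrete, the following is a generic per-epoch sketch of such a schedule using the `lr0`, `lrf`, `warmup_epochs`, and `epochs` values from the config. It is an illustration of the curve only and is not the Ultralytics scheduler, which warms up per iteration and handles bias learning rates separately. The function name `lr_at_epoch` is hypothetical.

```python
import math


def lr_at_epoch(
    epoch: int,
    epochs: int = 120,
    lr0: float = 0.001,
    lrf: float = 0.01,
    warmup_epochs: int = 5,
) -> float:
    """Linear warmup to lr0, then cosine decay down to lr0 * lrf."""
    if epoch < warmup_epochs:
        return lr0 * (epoch + 1) / warmup_epochs  # linear ramp up to lr0
    # Progress through the decay phase, 0.0 at its start and 1.0 at the last epoch.
    progress = (epoch - warmup_epochs) / max(1, epochs - 1 - warmup_epochs)
    final_lr = lr0 * lrf
    return final_lr + 0.5 * (lr0 - final_lr) * (1.0 + math.cos(math.pi * progress))
```

With the values above the rate reaches 0.001 at the end of warmup and decays to 0.00001 by the last epoch.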

Monitor validation mAP@0.5:0.95 alongside the precision-recall curve, isolating the bottom 20% of SKUs by frequency. If class precision drops below 0.85, halve lr0 and feed the next cycle the targeted hard negatives from Step 1 rather than training longer at the same rate. These are the same low-frequency precision floors that Threshold Tuning for Compliance Accuracy later depends on when it converts detections into a compliance score.
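
The low-frequency check above could be expressed as a small helper like this one. It is a sketch that assumes you already have per-class instance counts and per-class precision from your validation run as plain dictionaries; the function name and input shapes are hypothetical.

```python
import math


def low_frequency_failures(
    instance_counts: dict[int, int],
    class_precision: dict[int, float],
    bottom_fraction: float = 0.20,
    floor: float = 0.85,
) -> list[int]:
    """Return the rarest classes whose precision is below `floor`.

    "Rarest" means the bottom `bottom_fraction` of classes by instance count.
    A class with no recorded precision is treated as 0.0 and therefore fails.
    """
    if not instance_counts:
        return []
    # Sort by count, breaking ties by class id so the result is deterministic.
    ranked = sorted(instance_counts, key=lambda c: (instance_counts[c], c))
    n_tail = max(1, math.ceil(len(ranked) * bottom_fraction))
    return [c for c in ranked[:n_tail] if class_precision.get(c, 0.0) < floor]
```

A non-empty result is the signal to halve `lr0` and feed those specific classes more hard negatives in the next cycle.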

Step 3: Export and Compile for Production Latency

A trained .pt checkpoint is not a production artifact. Export it to ONNX, then compile a TensorRT engine so the router’s latency budget for this endpoint holds under a morning store-walk burst. Use FP16 on modern NVIDIA hardware; reserve INT8 quantization for after you have run calibration on a representative dataset, because un-calibrated INT8 wrecks accuracy on low-contrast packaging. Enable dynamic shapes so varying capture resolutions do not trigger constant memory reallocation.

# export_and_compile.py
import subprocess
from pathlib import Path

from ultralytics import YOLO


def optimize_for_production(
    weights_path: str,
    output_dir: Path,
    precision: str = "fp16",
) -> Path:
    """Export YOLOv8 weights to ONNX, then compile a TensorRT engine. Raises
    CalledProcessError if trtexec fails, so a broken build never ships."""
    if precision not in {"fp16", "int8"}:
        raise ValueError(f"unsupported precision: {precision!r}")
    output_dir.mkdir(parents=True, exist_ok=True)

    model = YOLO(weights_path)
    onnx_path = model.export(
        format="onnx",
        dynamic=True,
        half=False,  # keep the ONNX graph FP32; trtexec applies the precision below
        simplify=True,
    )

    engine_path = output_dir / f"shelf_detector_{precision}.engine"
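    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        # For int8, also supply a calibration cache (trtexec --calib=<file>);
        # without one the engine is built un-calibrated.
        f"--{precision}",
        "--minShapes=images:1x3x640x640",
        "--optShapes=images:4x3x1280x1280",
        "--maxShapes=images:8x3x1280x1280",
    ]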
    subprocess.run(cmd, check=True)
    return engine_path

Verify this step: benchmark the engine on a held-out batch and assert P95 inference latency lands inside the endpoint’s declared budget, and that FP16 mAP stays within 1% of the FP32 checkpoint. A larger gap means the export, not the training, regressed.
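
One way to measure the P95 figure is sketched below. It assumes you wrap a single inference call in a zero-argument callable that blocks until the result is ready (on a GPU that means synchronizing before returning); the names `nearest_rank_percentile` and `p95_latency_ms` are hypothetical, and the nearest-rank definition of a percentile is one common choice among several.

```python
import math
import time
from typing import Callable


def nearest_rank_percentile(samples: list[float], pct: int) -> float:
    """Smallest sample with at least `pct` percent of samples at or below it."""
    if not samples:
        raise ValueError("no samples to summarise")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_latency_ms(
    run_once: Callable[[], object], warmup: int = 10, iters: int = 200
) -> float:
    """Time `run_once` repeatedly and return the P95 latency in milliseconds."""
    for _ in range(warmup):
        run_once()  # exclude lazy initialisation from the timed runs
    samples: list[float] = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return nearest_rank_percentile(samples, 95)
```

The assertion is then a single comparison of `p95_latency_ms(...)` against the budget the router declares for this endpoint.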

Step 4: Map Detections to the Planogram Grid

Raw model output is a list of boxes with no shelf semantics. A deterministic post-processing pass turns them into planogram-ready records. Apply Non-Maximum Suppression with an IoU threshold tuned for dense facings (0.45–0.55), then run a spatial constraint that validates each box centroid against the calibrated shelf grid, dropping phantom detections from reflections or floor clutter. For tilted captures, recover the fronto-parallel plane with a homography before the grid test so shelf-level assignment is meaningful.
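
To show what the IoU threshold controls, here is a minimal greedy, per-class NMS in plain Python. It is an illustrative sketch rather than the production path, since inference runtimes normally provide their own NMS; the `Box` layout mirrors the `[x1, y1, x2, y2, score, class]` rows used on this page, and the function names are hypothetical.

```python
Box = tuple[float, float, float, float, float, int]  # x1, y1, x2, y2, score, class


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: list[Box], iou_threshold: float = 0.5) -> list[Box]:
    """Keep the highest-scoring box and drop same-class boxes overlapping it."""
    kept: list[Box] = []
    for cand in sorted(boxes, key=lambda b: b[4], reverse=True):
        # A candidate survives if it overlaps no kept box of the same class
        # by more than the threshold.
        if all(cand[5] != k[5] or iou(cand, k) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept
```

Two facings that merely touch have an IoU of zero and both survive, while a near-duplicate of the same facing is suppressed. Lowering the threshold toward 0.45 suppresses more aggressively, which is the trade-off to tune for very tight packing.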

# shelf_postprocessor.py
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class MappedDetection:
    sku_class: int
    confidence: float
    bbox: tuple[int, int, int, int]
    shelf_id: int


class ShelfPlanogramAligner:
    """Apply confidence filtering and shelf-grid validation to raw detections."""

    def __init__(
        self,
        shelf_grid: list[tuple[int, int]],
        conf_threshold: float = 0.65,
    ) -> None:
        # shelf_grid: vertical (y1, y2) band per shelf level, top to bottom.
        self.shelf_grid = shelf_grid
        self.conf_threshold = conf_threshold

    def filter_and_map(self, detections: np.ndarray) -> list[MappedDetection]:
        """detections: array of [x1, y1, x2, y2, score, class] rows."""
        mapped: list[MappedDetection] = []
        for x1, y1, x2, y2, score, cls in detections:
            if score < self.conf_threshold:
                continue
            cy = (y1 + y2) / 2.0  # vertical centroid drives shelf assignment
            shelf_id = self._resolve_shelf_level(cy)
            if shelf_id is None:
                continue  # outside any valid shelf band — reflection or clutter
            mapped.append(
                MappedDetection(
                    sku_class=int(cls),
                    confidence=float(score),
                    bbox=(int(x1), int(y1), int(x2), int(y2)),
                    shelf_id=shelf_id,
                )
            )
        return mapped

    def _resolve_shelf_level(self, cy: float) -> int | None:
        for level, (sy1, sy2) in enumerate(self.shelf_grid):
            if sy1 <= cy <= sy2:
                return level
        return None

These MappedDetection records (class, confidence, box, and shelf_id) are exactly what the downstream position validation algorithms for planograms expect, which is why the shelf-grid assignment belongs here and not in the compliance layer.
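
For the tilted-capture case mentioned at the start of this step, the centroid can be mapped through a homography before the shelf-band test. The sketch below only applies a 3x3 matrix that is assumed to have been estimated elsewhere (for example from four shelf-corner correspondences during fixture calibration); estimating it is not shown, and the names `warp_point` and `rectified_centroid` are hypothetical.

```python
Matrix3 = tuple[
    tuple[float, float, float],
    tuple[float, float, float],
    tuple[float, float, float],
]


def warp_point(h: Matrix3, x: float, y: float) -> tuple[float, float]:
    """Map an image point through a 3x3 homography (projective transform)."""
    xs = h[0][0] * x + h[0][1] * y + h[0][2]
    ys = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return xs / w, ys / w  # divide out the projective scale


def rectified_centroid(
    h: Matrix3, bbox: tuple[float, float, float, float]
) -> tuple[float, float]:
    """Centroid of an (x1, y1, x2, y2) box, expressed in the rectified plane."""
    x1, y1, x2, y2 = bbox
    return warp_point(h, (x1 + x2) / 2.0, (y1 + y2) / 2.0)
```

The vertical component of the rectified centroid is what you would compare against the shelf bands, provided those bands are defined in the same rectified coordinate frame.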

Verification & Testing

Verify the optimized model with deterministic assertions rather than trusting a single mAP figure:

Dense-recall holds. On a validation slice of 8-wide packed facings, assert post-export recall stays within 2% of the FP32 checkpoint and that facing count per shelf matches the label — undercounting means localization, not classification, regressed.
Low-frequency precision floor. Assert the bottom 20% of SKUs by frequency hold class precision at or above 0.85; if not, the hard-negative mix in Step 1 is too thin for those classes.
Latency budget. Assert P95 engine latency on the target GPU sits inside the router’s declared budget for this endpoint, measured on the optShapes resolution.
Grid mapping is total. Feed ShelfPlanogramAligner a frame with a known reflection box above the top shelf band and assert it is dropped (shelf_id is None), while every real facing resolves to a shelf_id.
No silent INT8 regression. If you ship INT8, assert its mAP gap to FP16 is under 2% on the calibration holdout; a wider gap means recalibrate, do not ship.

A healthy run shows mAP@0.5:0.95 climbing while the facing-count error per shelf falls toward zero and P95 engine latency stays inside the budget at every batch size in the optimization profile.
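
The facing-count error can be computed per shelf with something as small as the sketch below. It assumes you have the `shelf_id` of every mapped detection for a frame and a labelled facing count per shelf; the function name `facing_count_error` is hypothetical.

```python
from collections import Counter


def facing_count_error(
    predicted_shelf_ids: list[int], labelled_counts: dict[int, int]
) -> dict[int, int]:
    """Per-shelf signed error: predicted facings minus labelled facings.

    Negative values mean undercounting (for example merged facings);
    positive values mean extra or phantom detections.
    """
    predicted = Counter(predicted_shelf_ids)
    shelves = set(predicted) | set(labelled_counts)
    return {s: predicted[s] - labelled_counts.get(s, 0) for s in sorted(shelves)}
```

Tracking the sign as well as the magnitude helps separate a localization regression (negative errors on dense shelves) from a suppression problem (positive errors).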

Troubleshooting

| Symptom | Likely root cause | Remediation |
| --- | --- | --- |
| Adjacent facings merged into one box (undercounted) | `box` loss weight too low, or `imgsz` dropped below `1280` so packed edges blur | Raise `box` toward `10.0`, keep `imgsz` at `1280`, and lower the NMS IoU toward `0.45` for that density |
| Low-velocity SKUs misclassified at high confidence | Insufficient hard negatives, so the head defaults to a confident wrong class | Re-run `generate_hard_negatives` on adjacent-SKU crops and confirm each class clears the `50`-instance floor |
| mAP good in training, poor in production | Distribution drift between training set and live captures | This is a drift problem: route it to Debugging Vision Model Drift in Retail Environments rather than retraining on the same data |
| Box fragmentation on glossy or specular packaging | Highlights saturate edges so the detector splits one facing | Apply CLAHE at preprocessing before inference; route unrecoverable frames to quarantine |
| Latency spikes during peak store hours | Inference contending with capture on the same thread | Decouple capture from inference with Async Image Batching for High-Volume Stores so throughput is jitter-independent |
| False positives on promotional cardboard displays | No negative class for signage; geometry never filters them | Label signage as an explicit negative class and enforce the aspect-ratio and ROI checks from the suppression pass |

Related

Vision Model Routing for Shelf Detection: the routing layer that dispatches dense-packing frames to this optimized detector
Reducing False Positives in SKU Bounding Boxes: the post-detection suppression pass that runs on this model’s output
Bounding Box Extraction & SKU Localization: the stage that consumes these mapped detections and turns them into planogram-ready coordinates
