ESL Synchronization

Architecture Overview

The ESL synchronization system is event-driven, with fallback mechanisms that keep labels consistent when messages are lost or workers crash. When a user modifies data affecting Electronic Shelf Labels (ESLs), STREAM writes tasks to the database and publishes NATS messages so the tasks can be processed.

The key architectural constraint is per-label task ordering: all tasks affecting the same label must be processed sequentially, in ascending order of their seq (a global Postgres sequence number). This is enforced with Redis distributed locking, keyed by the label's global MAC address, so that at most one worker handles a given label at a time. That worker processes all pending tasks for the locked label in seq order before releasing the lock.

To handle NATS message loss and worker crashes, two fallback mechanisms run continuously:

Polling Loop (every 10s): Scans the database for pending or failed tasks and spawns a worker per label to process them.
Recovery Loop (every 30s): Detects tasks stuck in processing for more than 2 minutes (indicating a worker crash) and marks them as failed so that the polling loop retries them.

[Figure: UML diagram]

Operational Details

Event-Driven Primary Path

When changes arrive through NATS, Solaris acquires a Redis lock for the affected label. This prevents concurrent workers from processing tasks for the same label out of order. The worker then processes all pending tasks for that label sequentially in seq order, pushing images to SoluM before releasing the lock.
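The lock-then-drain behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual Solaris code: the Task fields, the "done" status, the InMemoryLocks class (a stand-in for the Redis lock) and the push callback (a stand-in for pushing an image to SoluM) are all assumptions made for the example.

```python
import threading
from dataclasses import dataclass


@dataclass
class Task:
    seq: int                  # global sequence number
    label_mac: str            # label identifier, used as the lock key
    status: str = "pending"


class InMemoryLocks:
    """Hypothetical stand-in for the Redis lock (one holder per key)."""

    def __init__(self):
        self._held = set()
        self._guard = threading.Lock()

    def acquire(self, key):
        with self._guard:
            if key in self._held:
                return False  # someone else already holds this label
            self._held.add(key)
            return True

    def release(self, key):
        with self._guard:
            self._held.discard(key)


def process_label(label_mac, tasks, locks, push):
    """Process all pending tasks for one label in seq order under its lock."""
    if not locks.acquire(label_mac):
        return False  # another worker owns this label; do nothing
    try:
        pending = [
            t for t in tasks
            if t.label_mac == label_mac and t.status == "pending"
        ]
        for task in sorted(pending, key=lambda t: t.seq):
            task.status = "processing"
            push(task)            # assumed hook for the SoluM image push
            task.status = "done"  # assumed terminal status
        return True
    finally:
        locks.release(label_mac)  # released even if push raises
```

Releasing the lock in a finally block keeps a failed push from leaving the label locked within the same process. A crashed worker is a different case, which the recovery loop covers.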

Polling Loop (Fallback for Lost Messages)

Every 10 seconds, the polling mechanism scans the database for pending or failed ESL tasks that haven't been picked up. When it finds any, it spawns one worker for each label that has such tasks. This ensures that even if a NATS message is lost, tasks are eventually processed without manual intervention.
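One polling pass could look something like the sketch below. It assumes tasks are available as dictionaries with status, label_mac and seq keys (standing in for database rows) and that spawn_worker is whatever starts a per-label worker. Both names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Statuses the polling loop picks up, per the description above.
RETRYABLE = {"pending", "failed"}


def labels_needing_work(tasks):
    """Group pending/failed tasks by label, each group sorted by seq."""
    by_label = defaultdict(list)
    for task in tasks:
        if task["status"] in RETRYABLE:
            by_label[task["label_mac"]].append(task)
    return {
        mac: sorted(group, key=lambda t: t["seq"])
        for mac, group in by_label.items()
    }


def poll_once(tasks, spawn_worker):
    """Run one polling pass and return the number of workers spawned."""
    work = labels_needing_work(tasks)
    for mac, group in work.items():
        spawn_worker(mac, group)  # one worker per label, never per task
    return len(work)
```

In a running service, poll_once would be called on a 10-second timer. Spawning per label rather than per task matches the per-label locking, since a single worker drains everything queued for that label.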

Recovery Loop (Fallback for Crashed Workers)

Every 30 seconds, the recovery mechanism checks for tasks that have been in the "processing" state for longer than 2 minutes, a strong indicator that the worker crashed. These tasks are marked as failed, which makes them visible to the polling loop on its next cycle. The polling loop then retries them by spawning a fresh worker.
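The stuck-task check can be illustrated with a short sketch. It assumes each task records when processing began in a started_at field. That field name, like the dictionary representation of a task, is an assumption for this example and not taken from the real schema.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the description above: processing for longer than 2 minutes.
STUCK_AFTER = timedelta(minutes=2)


def recover_stuck(tasks, now):
    """Mark tasks stuck in 'processing' as failed and return their seqs."""
    recovered = []
    for task in tasks:
        stuck = (
            task["status"] == "processing"
            and now - task["started_at"] > STUCK_AFTER
        )
        if stuck:
            task["status"] = "failed"  # now visible to the polling loop
            recovered.append(task["seq"])
    return recovered
```

Passing now in as a parameter instead of reading the clock inside the function makes the threshold easy to test. A production version would more likely be a single UPDATE statement against the database.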

Consistency Through Sequencing

The seq ordering is critical. Tasks are written to the database in a single transaction together with their sequence numbers, which places every task in one total order. Redis locking by label prevents concurrent processing within a label. The seq field then ensures that when multiple jobs affecting overlapping labels are created in quick succession, each label's tasks are still processed in the correct order, keeping data consistent across all affected labels.
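A small worked example may help show why the ordering matters when two jobs overlap. The sketch below uses an in-process counter as a stand-in for the global Postgres sequence. The create_job and final_content helpers and the content values are purely illustrative.

```python
import itertools


def create_job(job_name, labels, content, counter):
    """Create one task per label, drawing seq values from a shared counter."""
    return [
        {"seq": next(counter), "job": job_name, "label_mac": mac, "content": content}
        for mac in labels
    ]


def final_content(tasks):
    """Return what each label shows after its tasks are applied in seq order."""
    shown = {}
    # Sorting globally by seq also orders each label's own tasks by seq.
    for task in sorted(tasks, key=lambda t: t["seq"]):
        shown[task["label_mac"]] = task["content"]
    return shown


counter = itertools.count(1)  # stand-in for the global Postgres sequence
job_a = create_job("A", ["AA", "BB"], "1.99", counter)
job_b = create_job("B", ["BB", "CC"], "2.49", counter)

# Even if the tasks are seen in reverse order, label BB ends on job B's content.
result = final_content(list(reversed(job_a + job_b)))
```

Label BB is touched by both jobs. Because job B's task for BB carries the higher seq, it is applied last, so BB ends up showing job B's content regardless of the order in which the tasks were discovered.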