Published April 2026 · Last verified April 2026
Log.
Real incident write-ups from the studio's Project Learning Ledger. The bug ledger and the test suite are the same artefact.
Total entries: 10 · CL-FAC-001 – CL-FAC-185
[ A ] · The ledger
Every regression, race, or silently wrong default that ships through the gate becomes a permanent test and a tagged log entry. The entry is not a post-mortem document buried in a wiki; it is a machine-readable rule that is injected into the prompt of the next build. When the factory encounters the same failure mode again, the test catches it before the code leaves the local branch.
This is the operating tenet "Capture & reuse": a buffer overflow found in the WRIE autonomy core becomes a permanent test case for every subsequent C++ build, and a schema mismatch caught in URIP becomes a validation rule for all future API designs. The ledger is the institutional memory of the operation, and it compounds.
[ B ] · Entries
Ten incidents. Ten permanent tests.
Each entry below traces back to a real audit finding, a failed gate, or a production regression. The code, the file path, and the fix are all verifiable.
CL-FAC-091
WRIE
2026-02-14
High
Pose orientation messages must be normalised to [-π, π] before bus publish.
During a 50-robot live load test at Addverb Noida, intermittent navigation drift appeared when robots crossed the ±π boundary. The VDA5050 MQTT gateway was publishing raw Euler angles from the C++ FMS without normalisation, causing the React dashboard to render flipped heading vectors. The drift only manifested at the boundary, so standard regression tests missed it. The fix was a single atan2(sin(θ), cos(θ)) wrap in the telemetry serializer. A property test now generates 10⁶ random angles and asserts the normalised output is always within [-π, π].
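The wrap and its property test can be sketched in a few lines. This is a minimal Python illustration of the atan2 identity named above, not the C++ serializer itself; the function names, the sample count, and the input range are assumptions chosen for the example.

```python
import math
import random


def normalise_angle(theta: float) -> float:
    """Wrap an angle in radians into [-pi, pi] via atan2(sin, cos)."""
    return math.atan2(math.sin(theta), math.cos(theta))


def check_normalisation(samples: int = 100_000, seed: int = 0) -> None:
    """Property check: wrapped angles stay in range and keep their direction."""
    rng = random.Random(seed)
    for _ in range(samples):
        theta = rng.uniform(-1e3, 1e3)  # assumed input range for the sketch
        wrapped = normalise_angle(theta)
        assert -math.pi <= wrapped <= math.pi
        # The wrapped angle must point the same way as the raw angle.
        assert math.isclose(math.cos(wrapped), math.cos(theta), abs_tol=1e-9)
        assert math.isclose(math.sin(wrapped), math.sin(theta), abs_tol=1e-9)


if __name__ == "__main__":
    check_normalisation()
```

Checking direction as well as range matters: a test that only asserts the range would pass for a function that returns a constant.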
CL-FAC-177
Cross-cutting
2026-03-01
Critical
Capture every regression at gate-fail and inject a targeted fix-checklist into the next build prompt.
In early 2026 the factory shipped three consecutive builds with the same CORS misconfiguration because the fix was documented in a Slack thread, not in the prompt. The PMO Gate 10 check only verified the file existed, not that the content matched the contract. After gate-fail, the regression was rewritten as a structured checklist and injected into the system prompt for every subsequent build. The result: zero repeat regressions across the next 23 projects. The checklist is now version-controlled alongside the blueprint.
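One way to picture the injection step is a small renderer that turns ledger entries into a checklist block appended to the build prompt. The sketch below is an assumption about shape only: the `LedgerEntry` fields, the heading text, and the example check wording are hypothetical and do not come from the factory's actual prompt format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    entry_id: str            # e.g. "CL-FAC-177"
    rule: str                # the one-line rule from the ledger
    checks: tuple[str, ...]  # concrete items the next build must satisfy


def render_checklist(entries: list[LedgerEntry]) -> str:
    """Render ledger entries as a plain-text checklist."""
    lines = ["Regression checklist (from the learning ledger)"]
    for entry in entries:
        lines.append(f"- [{entry.entry_id}] {entry.rule}")
        for check in entry.checks:
            lines.append(f"  - [ ] {check}")
    return "\n".join(lines)


def build_prompt(base_prompt: str, entries: list[LedgerEntry]) -> str:
    """Append the checklist to the base prompt; leave it untouched if empty."""
    if not entries:
        return base_prompt
    return f"{base_prompt}\n\n{render_checklist(entries)}"


if __name__ == "__main__":
    cors = LedgerEntry(
        entry_id="CL-FAC-177",
        rule="Capture every regression at gate-fail.",
        checks=("CORS allowed origins match the contract",),  # hypothetical
    )
    print(build_prompt("Build the service described below.", [cors]))
```

Because the checklist is rendered from structured data, it can sit in version control next to the blueprint and be diffed like any other file.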
CL-FAC-178
URIP
2026-04-07
Critical
Trust Center validate_storage_uri() exists but is never called - SSRF and arbitrary file-read remain open.
All three external auditors independently flagged the same path: backend/services/trust_center_service.py:83 defines a URI validator, yet the publish flow stores the raw file_storage_uri and the download flow calls open() on it without validation. An attacker could pass file:///etc/passwd or an internal metadata endpoint. The fix wired the validator into the publish flow and added a deny-list for file:// URIs and loopback addresses. A contract test now asserts that every storage URI passes validation before persistence.
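A validate-before-persist check of this kind might look like the sketch below. It is not the body of validate_storage_uri(), which the entry does not show; the function names, the allowed-scheme set, and the extra link-local and private-range checks are assumptions made for illustration.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: the real allow-list is not given in the entry.
ALLOWED_SCHEMES = {"https", "s3"}


def is_safe_storage_uri(uri: str) -> bool:
    """Return True only for URIs with an allowed scheme and a non-internal host."""
    try:
        parts = urlsplit(uri)
    except ValueError:
        return False  # malformed URI, e.g. a broken IPv6 literal
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return False  # rejects file:// and anything not explicitly allowed
    host = parts.hostname
    if not host or host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name. A production check would also vet the resolved address.
        return True
    return not (addr.is_loopback or addr.is_link_local or addr.is_private)


def publish(uri: str, store: list[str]) -> None:
    """Persist a storage URI only after it passes validation."""
    if not is_safe_storage_uri(uri):
        raise ValueError(f"rejected storage URI: {uri!r}")
    store.append(uri)
```

An allow-list of schemes is used here instead of a pure deny-list because it fails closed when a new scheme appears; whether that matches the shipped fix is not stated in the entry.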
CL-FAC-179
WRIE
2026-02-28
Critical
O(N²) collision detection exceeds the 67 ms cycle budget at 20+ robots.
The FleetManager pairwise distance check in cpp/src/fleet/FleetManager.cpp:288 iterates over all robot pairs. At 50 robots this is 1,225 pairs, consuming 30–50 ms of the cycle and starving the MAPF planner. The code comment itself noted that the loop was only acceptable for fleets of <= 20 robots. A QuadTree spatial index already existed in cpp/src/navigation/QuadTree.h but was not wired into collision detection. The fix replaced the nested loop with QuadTree neighbour queries. A soak test now runs 100 robots for one hour and asserts collision_ms < 5 ms in every cycle.
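The pair count follows from n(n-1)/2, and the benefit of a spatial index is that each robot is only compared against nearby robots. The sketch below uses a uniform grid rather than the QuadTree from the C++ codebase, and Python rather than C++, purely to keep the example short; all names are hypothetical. It checks that the indexed query returns the same close pairs as the brute-force loop.

```python
from collections import defaultdict
from itertools import combinations
from math import floor, hypot

Point = tuple[float, float]


def pair_count(n: int) -> int:
    """Number of unordered pairs among n robots."""
    return n * (n - 1) // 2


def close_pairs_brute(points: list[Point], radius: float) -> set[tuple[int, int]]:
    """O(N^2) reference: test every pair."""
    return {
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if hypot(p[0] - q[0], p[1] - q[1]) <= radius
    }


def close_pairs_grid(points: list[Point], radius: float) -> set[tuple[int, int]]:
    """Bucket points into radius-sized cells; compare only neighbouring cells."""
    cells: dict[tuple[int, int], list[int]] = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(floor(x / radius), floor(y / radius))].append(idx)
    found: set[tuple[int, int]] = set()
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get avoids inserting empty cells while iterating.
                for j in cells.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j:
                            p, q = points[i], points[j]
                            if hypot(p[0] - q[0], p[1] - q[1]) <= radius:
                                found.add((i, j))
    return found
```

Two points within `radius` of each other can differ by at most one cell index per axis when the cell size equals `radius`, which is why scanning the 3×3 neighbourhood is sufficient.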
CL-FAC-180
URIP
2026-04-15
High
Connector pull fanout enqueues one task per tenant × connector with no concurrency cap or soft-time-limit.
Every fifteen minutes, Celery beat schedules a pull for all 31 connectors across every tenant. A single misbehaving vendor or an unbounded pagination loop can pin every worker indefinitely, because task_soft_time_limit was never configured and connector_pull_fanout lacks a chord cap. The incident was caught during a load test when one CrowdStrike tenant with 400K findings stalled the queue for 90 minutes. The fix added task_soft_time_limit=600 and a per-tenant concurrency semaphore. A simulation test now injects a slow connector and asserts queue depth stays under 10.
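The two controls, a time limit per pull and a concurrency cap per tenant, can be illustrated without Celery. The sketch below is a plain asyncio stand-in, not the production task code: `run_pulls`, the job tuple shape, and the small timing values are assumptions, and `asyncio.wait_for` plays the role that task_soft_time_limit plays in the real fix.

```python
import asyncio
from collections import defaultdict


async def run_pulls(jobs, per_tenant_cap: int = 2, time_limit_s: float = 0.2):
    """Run (tenant_id, connector, duration_s) jobs with a per-tenant cap
    and a per-job time limit. Returns (peak concurrency per tenant, timeouts)."""
    semaphores = defaultdict(lambda: asyncio.Semaphore(per_tenant_cap))
    running = defaultdict(int)
    peak = defaultdict(int)
    timed_out = []

    async def pull(tenant: str, connector: str, duration_s: float) -> None:
        async with semaphores[tenant]:  # caps concurrent pulls for this tenant
            running[tenant] += 1
            peak[tenant] = max(peak[tenant], running[tenant])
            try:
                # asyncio.sleep stands in for the vendor API call.
                await asyncio.wait_for(asyncio.sleep(duration_s), timeout=time_limit_s)
            except asyncio.TimeoutError:
                timed_out.append((tenant, connector))
            finally:
                running[tenant] -= 1

    await asyncio.gather(*(pull(*job) for job in jobs))
    return dict(peak), timed_out


if __name__ == "__main__":
    jobs = [("tenant-a", f"conn-{i}", 0.001) for i in range(5)]
    jobs.append(("tenant-b", "slow-vendor", 5.0))  # simulated stuck connector
    print(asyncio.run(run_pulls(jobs)))
```

The slow connector is cut off at the time limit and only ever occupies one of its own tenant's slots, so other tenants' pulls keep flowing.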
CL-FAC-181
WRIE
2026-03-31
High
SOURCE_TAG = "gazebo" is hard-coded even when every byte of data comes from Python kinematics.
A Kimi synthetic-data audit discovered that gazebo_bridge.py permanently welds the source tag to gazebo regardless of whether the data originates from real Gazebo physics or the VirtualRobot kinematic fallback. This caused downstream analytics to treat simulated trajectories as real sensor data, corrupting the digital-twin fidelity model. The fix made the tag dynamic: gazebo only after verifying the gz process and topic subscription. A bridge contract test now asserts the correct tag for each backend mode.
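The dynamic tag reduces to a small decision function plus a contract test per backend mode. The sketch below assumes two boolean probes (process alive, topic subscribed) and a fallback tag name of "virtual_robot"; neither the probe mechanism nor the fallback tag string is given in the entry, so both are hypothetical.

```python
from enum import Enum


class SourceTag(str, Enum):
    GAZEBO = "gazebo"
    VIRTUAL_ROBOT = "virtual_robot"  # hypothetical tag for the kinematic fallback


def resolve_source_tag(gz_process_alive: bool, topic_subscribed: bool) -> SourceTag:
    """Tag data as gazebo only when both checks pass; otherwise tag the fallback."""
    if gz_process_alive and topic_subscribed:
        return SourceTag.GAZEBO
    return SourceTag.VIRTUAL_ROBOT


def tag_sample(sample: dict, gz_process_alive: bool, topic_subscribed: bool) -> dict:
    """Return a copy of the sample with the resolved source tag attached."""
    tag = resolve_source_tag(gz_process_alive, topic_subscribed)
    return {**sample, "source": tag.value}
```

Defaulting to the fallback tag when either check fails means an unverified backend can never be reported as real physics.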
CL-FAC-182
AstroRattan
2026-04-21
Medium
Seventeen PDF report sections marked NOT IMPLEMENTED even though the backend engines existed and were importable.
During the Kundli engine audit, sections such as Aspects & Conjunctions, Yogas & Doshas, and Shadbala were rendered as missing boxes because the CLI orchestrator called non-existent wrapper names, while the real engines sat untouched in app/dasha_engine.py and app/transit_engine.py. The assembler had been copy-pasted from an older project and never updated for the new module signatures. The fix wired all 17 engines into the full-report payload and added an import-smoke test that fails if any engine is unreachable.
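An import-smoke test of this kind can be built on `importlib` alone. The sketch below is an assumption about shape: the registry format and the function name are hypothetical, and the demo uses standard-library modules because the real engine entry-point names are not given in the entry.

```python
import importlib


def find_unreachable(engines: list[tuple[str, str]]) -> list[str]:
    """Return a message for every (module_path, attribute) that cannot be
    imported or is not callable. An empty list means every engine is reachable."""
    problems = []
    for module_path, attr in engines:
        try:
            module = importlib.import_module(module_path)
        except ImportError as exc:
            problems.append(f"{module_path}: {exc}")
            continue
        if not callable(getattr(module, attr, None)):
            problems.append(f"{module_path}.{attr}: not found or not callable")
    return problems


if __name__ == "__main__":
    # Stand-in registry; a real one would list the report engines instead.
    registry = [("json", "dumps"), ("math", "sqrt")]
    problems = find_unreachable(registry)
    assert not problems, "unreachable engines:\n" + "\n".join(problems)
```

Driving the orchestrator and the smoke test from the same registry is what prevents the wrapper names from drifting away from the real modules again.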
CL-FAC-183
Mushroom Ki Mandi
2026-02-20
Medium
Sensor watchdog lacked a hard timer - a Wi-Fi blip caused indefinite silence without chip reboot.
In the field, a DHT11 bag-sensor stopped reporting after a transient router restart. The firmware’s soft watchdog only reinitialised the Wi-Fi radio, but the I2C bus for the DS18B20 remained hung because the error path returned without resetting the bus master. The gap was not visible in the dashboard because the last reading was cached as current. The fix added a two-stage watchdog: soft timer recovers the radio at 30 s, hard timer resets the entire ESP32 at 120 s. A soak test in the humidity chamber now asserts recovery within 90 s of injected bus faults.
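The two-stage decision logic can be modelled independently of the hardware. The sketch below is a Python model of the state machine only, using the 30 s and 120 s thresholds from the entry; the class and action names are hypothetical, and real firmware would drive the hard stage from a hardware timer rather than a polled clock.

```python
from enum import Enum

SOFT_TIMEOUT_S = 30   # soft stage: recover the radio
HARD_TIMEOUT_S = 120  # hard stage: reset the whole chip


class Action(Enum):
    NONE = "none"
    RECOVER_RADIO = "recover_radio"
    RESET_CHIP = "reset_chip"


class TwoStageWatchdog:
    """Decides what to do based on how long the sensor has been silent."""

    def __init__(self, now_s: float) -> None:
        self._last_reading_s = now_s
        self._soft_fired = False

    def feed(self, now_s: float) -> None:
        """Call on every successful sensor reading."""
        self._last_reading_s = now_s
        self._soft_fired = False

    def poll(self, now_s: float) -> Action:
        """Call periodically; returns the action due at this instant."""
        silent_for = now_s - self._last_reading_s
        if silent_for >= HARD_TIMEOUT_S:
            return Action.RESET_CHIP
        if silent_for >= SOFT_TIMEOUT_S and not self._soft_fired:
            self._soft_fired = True  # fire the soft stage once per silence
            return Action.RECOVER_RADIO
        return Action.NONE
```

The watchdog is fed by fresh readings, not by radio status, so a cached value or a reconnected radio with a hung bus still escalates to the hard stage.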
CL-FAC-184
URIP
2026-04-28
Medium
Audit-log hash chain breaks under concurrent appends because no advisory lock is held.
Two simultaneous state-changing requests both read the same prev_hash, then insert competing row_hash values, forking the integrity chain for that tenant. The verify_chain() utility existed but was never scheduled in production, so the fork went undetected for two weeks. The fix wraps every append in a Postgres advisory lock keyed by tenant_id. A weekly Celery task now runs verify_chain() and asserts the chain is unbroken for every tenant.
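The race and its fix can be reproduced in miniature. In the sketch below an in-process `threading.Lock` per tenant stands in for the Postgres advisory lock, and `chain_is_intact` plays the role of verify_chain(); the class, the SHA-256 row format, and the genesis value are assumptions for illustration, not the production schema.

```python
import hashlib
import threading
from collections import defaultdict

GENESIS = "0" * 64  # assumed prev_hash for the first row


def _row_hash(prev_hash: str, payload: str) -> str:
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


class AuditLog:
    def __init__(self) -> None:
        # tenant_id -> list of (prev_hash, payload, row_hash)
        self._rows: dict[str, list[tuple[str, str, str]]] = defaultdict(list)
        self._locks: dict[str, threading.Lock] = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()

    def _lock_for(self, tenant_id: str) -> threading.Lock:
        with self._locks_guard:  # protect creation of per-tenant locks
            return self._locks[tenant_id]

    def append(self, tenant_id: str, payload: str) -> str:
        # Read prev_hash and insert the new row under one lock, so two
        # writers can never build on the same prev_hash.
        with self._lock_for(tenant_id):
            rows = self._rows[tenant_id]
            prev_hash = rows[-1][2] if rows else GENESIS
            row_hash = _row_hash(prev_hash, payload)
            rows.append((prev_hash, payload, row_hash))
            return row_hash

    def chain_is_intact(self, tenant_id: str) -> bool:
        expected_prev = GENESIS
        for prev_hash, payload, row_hash in self._rows[tenant_id]:
            if prev_hash != expected_prev or _row_hash(prev_hash, payload) != row_hash:
                return False
            expected_prev = row_hash
        return True
```

Keying the lock by tenant keeps appends for different tenants fully parallel while serialising the read-then-insert step within one tenant's chain.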
CL-FAC-185
io-gita
2026-04-10
Low
Synthetic LiDAR scans used for KDTree calibration leaked into production as real sensor data.
The calibrate_iogita() routine in the integration example generated scans with np.random.uniform and fed them into the KDTree v5 engine. The comment said "replace with real scans in production", but the example code was copy-pasted into the fleet integration node without removal. During a cold-start benchmark, the engine reported 97.2% accuracy on synthetic data but only 81% on real Gazebo raycasts. The fix removed all synthetic calibration paths and added a require_real_scans assertion at import time. A calibration contract test now rejects any scan with zero variance.
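The zero-variance rule from the contract test can be sketched with the standard library. The function names are hypothetical, and the minimum-length and finite-value checks are extra assumptions added so the variance is well defined; only the zero-variance rejection comes from the entry.

```python
import math
from statistics import pvariance


def is_usable_scan(ranges: list[float]) -> bool:
    """Reject scans that are too short, contain non-finite values,
    or have zero variance (every beam reports the same range)."""
    if len(ranges) < 2:
        return False
    if not all(math.isfinite(r) for r in ranges):
        return False
    return pvariance(ranges) > 0.0


def require_usable_scans(scans: list[list[float]]) -> None:
    """Raise if any calibration scan fails the contract."""
    for index, scan in enumerate(scans):
        if not is_usable_scan(scan):
            raise ValueError(f"calibration scan {index} rejected by contract")
```

A variance check catches constant placeholder scans; it would not by itself distinguish uniformly random ranges from real ones, so it complements rather than replaces removing the synthetic code paths.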