Applied · Smart-grid stability · 100 operating points
Semantic Gravity, applied to a real-world prediction.
Same dataset. Same 12 features per row. Five systems answering the same question: will this 4-node power grid be stable or unstable? One system is the io-gita semantic-gravity engine — deterministic, sub-millisecond, free. Four are frontier LLMs. Below is what actually happened — measured numbers, isolated workspace, no fabrication.
The setup
The UCI Smart Grid Stability dataset is a synthetic 4-node DAE simulation from Schäfer et al. (2016). Each row is one operating point of a 4-node power network: 12 features describe reaction times (τ), nominal powers (p), and price elasticities (g). The label is whether that point is dynamically stable or unstable.
| Field | Meaning |
|---|---|
| tau1 .. tau4 | Reaction times of producer + 3 consumers (sec, ~0.5–10) |
| p1 .. p4 | Nominal power; p1>0 producer, p2/p3/p4<0 consumers |
| g1 .. g4 | Price-elasticity coefficients (~0–1, dimensionless) |
| stab | Continuous stability score (used to derive 4 quartile basins; not given to LLMs) |
| stabf | Binary label: stable / unstable (the ground truth we measure against) |
We took 100 rows as a deterministic holdout (seed=42), trained sg-engine on the remaining 9,900, and asked five systems for stable/unstable per row. LLM prompts ran in an isolated workspace (`/tmp/sg_isolated/`) containing only the prompt — no labeled CSV in scope.
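The split itself is a few lines. A minimal sketch, assuming a local copy of the UCI CSV with the column names from the table above; the file name here is illustrative, not the benchmark script:

```python
import pandas as pd

df = pd.read_csv("smart_grid_stability.csv")  # hypothetical local copy of the UCI data

# Deterministic 100-row holdout (seed=42); the remaining 9,900 rows train sg-engine.
test = df.sample(n=100, random_state=42)
train = df.drop(test.index)

features = [f"{p}{i}" for p in ("tau", "p", "g") for i in range(1, 5)]
X_train, X_test = train[features].to_numpy(), test[features].to_numpy()
y_test = test["stabf"]  # ground truth, never shown to any system
```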
The calculation, end-to-end
How sg-engine turns 9,900 training rows into a 4-basin topology — and how one new test row gets a prediction. Six steps. Every number below is from the actual run (`engine.json`).
Quartile-bin the training stab score
The continuous stab score on 9,900 training rows is split at its 25/50/75 quantiles into Q1..Q4. The bin edges come from training data only — there is no test-set leakage. Each quartile becomes a candidate attractor basin. Empirically: Q1 = 0% unstable, Q2 = 55% unstable, Q3 = 100% unstable, Q4 = 100% unstable.
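A sketch of the binning, continuing from the split above; numpy stands in for whatever the engine uses internally:

```python
import numpy as np

stab = train["stab"].to_numpy()
edges = np.quantile(stab, [0.25, 0.50, 0.75])  # fitted on the 9,900 training rows only

# Basin label per training row, 0..3 -> Q1..Q4; test rows never influence the edges.
basin = np.digitize(stab, edges)

# Empirical unstable rate per basin — the 0% / 55% / 100% / 100% figures above.
unstable = (train["stabf"] == "unstable").to_numpy()
rates = [unstable[basin == q].mean() for q in range(4)]
```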
Fit feature transforms on training data
RobustScaler + median imputation + PCA at 95% variance reduce the 12 raw features to 11 components. The fitted transformers are pinned in an ArtifactBundle so test rows are encoded with the exact same fit — no refitting at inference time.
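In scikit-learn terms the transform stack looks like the sketch below. `ArtifactBundle` itself is engine-internal, so pinning is shown simply as fit-once, transform-many:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA

encode = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("pca", PCA(n_components=0.95)),  # retains 11 of 12 components on this data
])
Z_train = encode.fit_transform(X_train)  # fit exactly once, on training rows
Z_test = encode.transform(X_test)        # test rows reuse the pinned fit
```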
Build prototypes (multimodal allowed)
build_prototypes() emits up to 3 prototypes per class via k-medoids when a class is multimodal. On Smart Grid: 4 quartile classes → 8 prototypes (multimodal split detected on Q2/Q3). Each prototype is the median feature vector of its sub-cluster.
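A rough sketch of the idea, with two honest substitutions: KMeans stands in for the engine's k-medoids, and the multimodality check (which decides whether a class splits at all) is elided — this version always splits each class.

```python
from sklearn.cluster import KMeans
import numpy as np

def build_prototypes(Z, labels, max_protos=3):
    protos, proto_class = [], []
    for q in np.unique(labels):
        Zq = Z[labels == q]
        k = min(max_protos, len(Zq))  # up to 3 sub-clusters per class
        sub = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Zq)
        for c in range(k):
            # Each prototype is the median feature vector of its sub-cluster.
            protos.append(np.median(Zq[sub == c], axis=0))
            proto_class.append(q)
    return np.array(protos), np.array(proto_class)

protos, proto_class = build_prototypes(Z_train, basin)
```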
Lift into D=10,000 with random bipolar atoms
11 PCA components are bound to 11 D=10,000 random bipolar atom vectors. Each prototype becomes Q = sum(component_value × atom_vector), giving 8 high-dimensional pattern attractors. Hebbian recall (W = Pᵀ·P / D) builds the energy landscape.
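The lift is small enough to write out directly. A sketch assuming the two formulas quoted above; the seed and the factored recall (never materialising the D×D matrix W) are choices of this sketch, not necessarily the engine's:

```python
import numpy as np

D = 10_000
rng = np.random.default_rng(0)
atoms = rng.choice([-1.0, 1.0], size=(protos.shape[1], D))  # 11 random bipolar atoms

def lift(z):
    # Bind each PCA component to its atom and superpose: Q = Σ_i z_i · atom_i
    return z @ atoms  # shape (D,)

P = np.stack([lift(p) for p in protos])  # pattern attractors, shape (n_protos, D)

def recall(x):
    # One Hebbian recall step, x <- sign(W·x) with W = Pᵀ·P / D,
    # computed as ((x·Pᵀ)·P)/D so W is never formed explicitly.
    return np.sign((x @ P.T) @ P / D)
```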
Map the directed transition atlas
Atlas.map() tests every potential basin→basin edge under forcing at α ∈ {0.1, 0.3, 0.5}. Result on Smart Grid: 6 edges per alpha (Q-quartile structure is sparse), 1-node SCC, top sinks at Q2 sub-prototypes. The atlas is the static topology over which all inference happens.
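Atlas.map() is internal to the engine, so the following is only one plausible reading of "testing an edge under forcing at α": nudge attractor i toward attractor j with strength α, relax under Hebbian recall, and record a directed edge if the state settles nearest to j. Treat every detail here as an assumption:

```python
def settles_to(x, steps=10):
    for _ in range(steps):
        x = recall(x)                   # relax on the energy landscape
    return int(np.argmax(x @ P.T))      # nearest pattern attractor

edges = set()
for alpha in (0.1, 0.3, 0.5):
    for i in range(len(P)):
        for j in range(len(P)):
            if i != j and settles_to(np.sign((1 - alpha) * P[i] + alpha * P[j])) == j:
                edges.add((i, j, alpha))  # directed basin i -> basin j under this forcing
```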
Per test row → predict()
transform(test_row, bundle) applies the fitted PCA. predict() returns the nearest prototype (= predicted basin) plus a confidence. Each basin has a known training-set unstable rate (Q1=0%, Q2=55%, Q3=100%, Q4=100%) → that rate is the predicted probability of unstable.
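End-to-end, inference is a few lines. A sketch continuing from the pieces above; the dot-product similarity and the ≥ 0.5 decision threshold are assumptions of this sketch:

```python
import numpy as np

# Per-basin unstable rates from the training run (Q1..Q4).
BASIN_UNSTABLE_RATE = {0: 0.00, 1: 0.55, 2: 1.00, 3: 1.00}

def predict(row):
    z = encode.transform(row.reshape(1, -1))[0]   # the pinned fit, never refit
    q = lift(z)                                    # into D=10,000
    nearest = int(np.argmax(q @ P.T))              # most similar attractor
    p_unstable = BASIN_UNSTABLE_RATE[int(proto_class[nearest])]
    return ("unstable" if p_unstable >= 0.5 else "stable"), p_unstable

label, p = predict(X_test[0])  # e.g. test row #0 from the table below
```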
Five outputs, one row.
Test row #0. Twelve features. Truth: stable. Each system below saw the identical input. No system was told the answer.
| Feature | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Value | 3.942 | 5.718 | 9.789 | 1.784 | 3.858 | -0.632 | -1.982 | -1.244 | 0.144 | 0.299 | 0.366 | 0.900 |
io-gita: the 12 features were PCA-projected, lifted into D=10,000 via Hebbian binding, and matched to the nearest prototype; that basin's empirical training-set unstable rate is the reported probability.
Kimi's CLI runs an agentic loop — for this benchmark it appears to have used code execution to compute the analytical Schäfer stability criterion. High accuracy, high latency.
Gemini: pure text-mode reasoning across all 100 rows in one call. Its confidence reads as a self-report, not a calibrated probability.
Codex (gpt-5.2) ran with `--full-auto`. Capable of code execution but with sharply varying behaviour across reruns; this run produced a coherent JSON array.
A fresh Claude subagent with no tools and no context from the benchmark — pure linguistic reasoning over the 12 features.
This is one of 100 rows. The full per-row table (25 columns × 100 rows) ships with the data and is available as JSON below.
The honest comparison
All numbers below are computed from the JSON files shipped with this page — no placeholder values, no rounding for marketing. Run 1 of each system is the primary; reruns are used for the determinism column only. A sketch of how these numbers are scored follows the summary bullets below.
| System | Accuracy | AUC | Determinism | ms / query | Reruns OK |
|---|---|---|---|---|---|
| io-gita | 80.0% | 0.840 | 100.0% | 0.078 ms | 3/3 |
| Kimi (CLI · code-exec) | 100.0% | 1.000 | 98.0% | 3,900 ms | 2/3 |
| Gemini (CLI · text) | 77.0% | 0.843 | 78.0% | 370.0 ms | 3/3 |
| Codex (CLI · code-exec) | 74.0% | 0.813 | 21.0% | 430.0 ms | 3/3 |
| Claude (subagent · text) | 72.0% | 0.677 | 89.0% | n/a | 3/3 |
- io-gita: deterministic topology + nearest-prototype assignment. 80% accuracy, AUC 0.84, perfectly reproducible, 0.078 ms/query — about 4,727× faster than the fastest LLM.
- Kimi: 100% accuracy because its CLI evidently invoked code execution to compute the Schäfer stability formula directly. A real result, but at 3.9 s per row — roughly seven minutes of wall clock per full 100-row call — and with one of three reruns failing silently.
- Gemini: 77% accuracy, AUC 0.84, but only 78% determinism — one row in five flips across reruns.
- Codex: 74% accuracy on run 1, but run 3 was degenerate (predicted every row as stable). Inter-run determinism dropped to 21%.
- Claude (subagent): 72% accuracy, AUC 0.68. Pure linguistic reasoning underperforms the engine on AUC.
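The scoring itself is unremarkable — accuracy and AUC over 100 rows. A sketch with hypothetical file and field names (the shipped JSON schema may differ):

```python
import json
from sklearn.metrics import accuracy_score, roc_auc_score

rows = json.load(open("results_run1.json"))   # hypothetical path
y_true = [r["truth"] == "unstable" for r in rows]
y_pred = [r["pred"] == "unstable" for r in rows]
p_unst = [r["p_unstable"] for r in rows]      # predicted P(unstable)

print("accuracy:", accuracy_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, p_unst))
```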
Determinism · same input, same answer?
For an audit-grade decision system, "did the same input produce the same answer?" is non-negotiable. We re-ran every system three times on the identical 100 rows.
Why this matters: if the same grid operating point is called stable in one query and unstable in the next, no operator can act on the answer. The engine is not "more accurate than every LLM" — Kimi-with-code-execution beat it on accuracy. But the engine is the only system that can be trusted to give the same answer twice.
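The determinism column is computed per row. A sketch assuming it is the share of rows whose answer is identical across all three reruns (field names hypothetical, as above):

```python
import json

runs = [json.load(open(f"results_run{k}.json")) for k in (1, 2, 3)]
preds = [[r["pred"] for r in run] for run in runs]

# A row counts as deterministic only if all three reruns agree on it.
stable_rows = [len({run[i] for run in preds}) == 1 for i in range(len(preds[0]))]
determinism = sum(stable_rows) / len(stable_rows)  # e.g. 1.00 for io-gita, 0.78 for Gemini
```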
The contamination story
The first time we ran this benchmark, every LLM got near-perfect accuracy. Suspicious. The cause: each LLM CLI was given workspace access to the io-gita repo — and that workspace contained `test/smart_grid/data/test_100.csv` with the ground-truth labels in plain sight. The models simply read the answers.
The fix: we re-ran the LLMs in `/tmp/sg_isolated/` — a workspace containing only the prompt file, no labeled data. Every accuracy figure on this page comes from the isolated re-run.
Why it's worth surfacing: in a real-world deployment, an LLM agent given filesystem or network access can accidentally — or deliberately — access the very data it was asked to predict on. A deterministic engine cannot. Auditability is not a feature you bolt on; it's a property of the architecture.
Where each one actually wins
Where the engine wins:
- Inference at 0.078 ms per query — 4,727× faster than the fastest LLM
- 100% determinism across reruns (vs 21%–98% for the LLMs)
- Auditable basin + path output: every prediction maps to a named region of the topology
- $0 marginal cost — runs locally on CPU, no API rate limits, no vendor lock-in
- One-time atlas build (10.4 s) is reusable across every future query
Where the LLMs win:
- Natural-language explanation per row (`reasoning` field) when prompted
- Zero-shot on novel data with no fitted pipeline
- Adapts task framing on the fly without retraining
- Kimi (with code-exec) achieves 100% by computing the underlying Schäfer formula directly
The engine and the LLMs solve overlapping but distinct problems. The deepest point of this benchmark isn't "engine beats LLM" — that's only true against text-only LLMs. It's that the engine is competitive on accuracy while being orders of magnitude faster, perfectly reproducible, free, and architecturally auditable. That combination is rare.
Caveats & reproducibility
The full benchmark — script, prompts, raw LLM transcripts, and aggregated JSON — lives in the io-gita repository at `test/smart_grid/llm_vs_engine/`. The page above renders straight from those JSON files; nothing is hard-coded.