Applied · Smart-grid stability · 100 operating points
Semantic Gravity, applied to a real-world prediction.
Same dataset. Same 12 features per row. Five systems answering the same question: will this 4-node power grid be stable or unstable? One system is the io-gita semantic-gravity engine — deterministic, sub-millisecond, free. Four are frontier LLMs. Below is what actually happened — measured numbers, isolated workspace, no fabrication.
The setup
The UCI Smart Grid Stability dataset is a synthetic 4-node DAE simulation from Schäfer et al. (2016). Each row is one operating point of a 4-node power network: 12 features describe reaction times (τ), nominal powers (p), and price elasticities (g). The label is whether that point is dynamically stable or unstable.
| Field | Meaning |
|---|---|
| tau1 .. tau4 | Reaction times of producer + 3 consumers (sec, ~0.5–10) |
| p1 .. p4 | Nominal power; p1>0 producer, p2/p3/p4<0 consumers |
| g1 .. g4 | Price-elasticity coefficients (~0–1, dimensionless) |
| stab | Continuous stability score (used to derive 4 quartile basins; not given to LLMs) |
| stabf | Binary label: stable / unstable (the ground truth we measure against) |
We took 100 rows as a deterministic holdout (seed=42), trained sg-engine on the remaining 9,900, and asked five systems for stable/unstable per row. LLM prompts ran in an isolated workspace (`/tmp/sg_isolated/`) containing only the prompt — no labeled CSV in scope.
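The split itself is a few lines. A minimal sketch, assuming a local copy of the UCI CSV with the column names from the table above; the file name here is illustrative, not the benchmark script:

```python
import pandas as pd

df = pd.read_csv("smart_grid_stability.csv")  # hypothetical local copy of the UCI data

# Deterministic 100-row holdout (seed=42); the remaining 9,900 rows train sg-engine.
test = df.sample(n=100, random_state=42)
train = df.drop(test.index)

features = [f"{p}{i}" for p in ("tau", "p", "g") for i in range(1, 5)]
X_train, X_test = train[features].to_numpy(), test[features].to_numpy()
y_test = test["stabf"]  # ground truth, never shown to any system
```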
The calculation, end-to-end
How sg-engine turns 9,900 training rows into a 4-basin topology — and how one new test row gets a prediction. Six steps. Every number below is from the actual run (`engine.json`).
Quartile-bin the training stab score
The continuous stab score on 9,900 training rows is split at its 25/50/75 quantiles into Q1..Q4. The bin edges come from training data only — there is no test-set leakage. Each quartile becomes a candidate attractor basin. Empirically: Q1 = 0% unstable, Q2 = 55% unstable, Q3 = 100% unstable, Q4 = 100% unstable.
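A sketch of the binning, continuing from the split above; numpy stands in for whatever the engine uses internally:

```python
import numpy as np

stab = train["stab"].to_numpy()
edges = np.quantile(stab, [0.25, 0.50, 0.75])  # fitted on the 9,900 training rows only

# Basin label per training row, 0..3 -> Q1..Q4; test rows never influence the edges.
basin = np.digitize(stab, edges)

# Empirical unstable rate per basin — the 0% / 55% / 100% / 100% figures above.
unstable = (train["stabf"] == "unstable").to_numpy()
rates = [unstable[basin == q].mean() for q in range(4)]
```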
Fit feature transforms on training data
RobustScaler + median imputation + PCA at 95% variance reduce the 12 raw features to 11 components. The fitted transformers are pinned in an ArtifactBundle so test rows are encoded with the exact same fit — no refitting at inference time.
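In scikit-learn terms the transform stack looks like the sketch below. `ArtifactBundle` itself is engine-internal, so pinning is shown simply as fit-once, transform-many:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA

encode = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("pca", PCA(n_components=0.95)),  # retains 11 of 12 components on this data
])
Z_train = encode.fit_transform(X_train)  # fit exactly once, on training rows
Z_test = encode.transform(X_test)        # test rows reuse the pinned fit
```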
Build prototypes (multimodal allowed)
build_prototypes() emits up to 3 prototypes per class via k-medoids when a class is multimodal. On Smart Grid: 4 quartile classes → 8 prototypes (multimodal split detected on Q2/Q3). Each prototype is the median feature vector of its sub-cluster.
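A rough sketch of the idea, with two honest substitutions: KMeans stands in for the engine's k-medoids, and the multimodality check (which decides whether a class splits at all) is elided — this version always splits each class.

```python
from sklearn.cluster import KMeans
import numpy as np

def build_prototypes(Z, labels, max_protos=3):
    protos, proto_class = [], []
    for q in np.unique(labels):
        Zq = Z[labels == q]
        k = min(max_protos, len(Zq))  # up to 3 sub-clusters per class
        sub = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Zq)
        for c in range(k):
            # Each prototype is the median feature vector of its sub-cluster.
            protos.append(np.median(Zq[sub == c], axis=0))
            proto_class.append(q)
    return np.array(protos), np.array(proto_class)

protos, proto_class = build_prototypes(Z_train, basin)
```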
Lift into D=10,000 with random bipolar atoms
11 PCA components are bound to 11 D=10,000 random bipolar atom vectors. Each prototype becomes Q = sum(component_value × atom_vector), giving 8 high-dimensional pattern attractors. Hebbian recall (W = Pᵀ·P / D) builds the energy landscape.
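The lift is small enough to write out directly. A sketch assuming the two formulas quoted above; the seed and the factored recall (never materialising the D×D matrix W) are choices of this sketch, not necessarily the engine's:

```python
import numpy as np

D = 10_000
rng = np.random.default_rng(0)
atoms = rng.choice([-1.0, 1.0], size=(protos.shape[1], D))  # 11 random bipolar atoms

def lift(z):
    # Bind each PCA component to its atom and superpose: Q = Σ_i z_i · atom_i
    return z @ atoms  # shape (D,)

P = np.stack([lift(p) for p in protos])  # pattern attractors, shape (n_protos, D)

def recall(x):
    # One Hebbian recall step, x <- sign(W·x) with W = Pᵀ·P / D,
    # computed as ((x·Pᵀ)·P)/D so W is never formed explicitly.
    return np.sign((x @ P.T) @ P / D)
```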
Map the directed transition atlas
Atlas.map() tests every potential basin→basin edge under forcing at α ∈ {0.1, 0.3, 0.5}. Result on Smart Grid: 6 edges per alpha (Q-quartile structure is sparse), 1-node SCC, top sinks at Q2 sub-prototypes. The atlas is the static topology over which all inference happens.
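Atlas.map() is internal to the engine, so the following is only one plausible reading of "testing an edge under forcing at α": nudge attractor i toward attractor j with strength α, relax under Hebbian recall, and record a directed edge if the state settles nearest to j. Treat every detail here as an assumption:

```python
def settles_to(x, steps=10):
    for _ in range(steps):
        x = recall(x)                   # relax on the energy landscape
    return int(np.argmax(x @ P.T))      # nearest pattern attractor

edges = set()
for alpha in (0.1, 0.3, 0.5):
    for i in range(len(P)):
        for j in range(len(P)):
            if i != j and settles_to(np.sign((1 - alpha) * P[i] + alpha * P[j])) == j:
                edges.add((i, j, alpha))  # directed basin i -> basin j under this forcing
```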
Per test row → predict()
transform(test_row, bundle) applies the fitted PCA. predict() returns the nearest prototype (= predicted basin) plus a confidence. Each basin has a known training-set unstable rate (Q1=0%, Q2=55%, Q3=100%, Q4=100%) → that rate is the predicted probability of unstable.
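End-to-end, inference is a few lines. A sketch continuing from the pieces above; the dot-product similarity and the ≥ 0.5 decision threshold are assumptions of this sketch:

```python
import numpy as np

# Per-basin unstable rates from the training run (Q1..Q4).
BASIN_UNSTABLE_RATE = {0: 0.00, 1: 0.55, 2: 1.00, 3: 1.00}

def predict(row):
    z = encode.transform(row.reshape(1, -1))[0]   # the pinned fit, never refit
    q = lift(z)                                    # into D=10,000
    nearest = int(np.argmax(q @ P.T))              # most similar attractor
    p_unstable = BASIN_UNSTABLE_RATE[int(proto_class[nearest])]
    return ("unstable" if p_unstable >= 0.5 else "stable"), p_unstable

label, p = predict(X_test[0])  # e.g. test row #0 from the table below
```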
Five outputs, one row.
Test row #0. Twelve features. Truth: stable. Each system below saw the identical input. No system was told the answer.
| Feature | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Value | 3.942 | 5.718 | 9.789 | 1.784 | 3.858 | -0.632 | -1.982 | -1.244 | 0.144 | 0.299 | 0.366 | 0.900 |
io-gita: the 12 features were PCA-projected, lifted into D=10,000 via Hebbian binding, and matched to the nearest prototype; that basin's empirical training-set unstable rate is the reported probability.
Kimi's CLI runs an agentic loop — for this benchmark it appears to have used code execution to compute the analytical Schäfer stability criterion. High accuracy, high latency.
Gemini: pure text-mode reasoning across all 100 rows in one call. Its confidence reads as a self-report, not a calibrated probability.
Codex (gpt-5.2) ran with `--full-auto`. Capable of code execution but with sharply varying behaviour across reruns; this run produced a coherent JSON array.
A fresh Claude subagent with no tools and no context from the benchmark — pure linguistic reasoning over the 12 features.
This is one of 100 rows. The full per-row table (25 columns × 100 rows) ships with the data and is available as JSON below.
The honest comparison
All numbers below are computed from the JSON files shipped with this page — no placeholder values, no rounding for marketing. Run 1 of each system is the primary; reruns are used for the determinism column only. A sketch of how these numbers are scored follows the summary bullets below.
| System | Accuracy | AUC | Determinism | ms / query | Reruns OK |
|---|---|---|---|---|---|
| io-gita | 80.0% | 0.840 | 100.0% | 0.078 ms | 3/3 |
| Kimi (CLI · code-exec) | 100.0% | 1.000 | 98.0% | 3,900 ms | 2/3 |
| Gemini (CLI · text) | 77.0% | 0.843 | 78.0% | 370.0 ms | 3/3 |
| Codex (CLI · code-exec) | 74.0% | 0.813 | 21.0% | 430.0 ms | 3/3 |
| Claude (subagent · text) | 72.0% | 0.677 | 89.0% | n/a | 3/3 |
- io-gita: deterministic topology + nearest-prototype assignment. 80% accuracy, AUC 0.84, perfectly reproducible, 0.078 ms/query — about 4,727× faster than the fastest LLM.
- Kimi: 100% accuracy because its CLI evidently invoked code execution to compute the Schäfer stability formula directly. A real result, but at 3.9 s per row — roughly seven minutes of wall clock per full 100-row call — and with one of three reruns failing silently.
- Gemini: 77% accuracy, AUC 0.84, but only 78% determinism — one row in five flips across reruns.
- Codex: 74% accuracy on run 1, but run 3 was degenerate (predicted every row as stable). Inter-run determinism dropped to 21%.
- Claude (subagent): 72% accuracy, AUC 0.68. Pure linguistic reasoning underperforms the engine on AUC.
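The scoring itself is unremarkable — accuracy and AUC over 100 rows. A sketch with hypothetical file and field names (the shipped JSON schema may differ):

```python
import json
from sklearn.metrics import accuracy_score, roc_auc_score

rows = json.load(open("results_run1.json"))   # hypothetical path
y_true = [r["truth"] == "unstable" for r in rows]
y_pred = [r["pred"] == "unstable" for r in rows]
p_unst = [r["p_unstable"] for r in rows]      # predicted P(unstable)

print("accuracy:", accuracy_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, p_unst))
```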
Determinism · same input, same answer?
For an audit-grade decision system, "did the same input produce the same answer?" is non-negotiable. We re-ran every system three times on the identical 100 rows.
Why this matters: if the same grid operating point is called stable in one query and unstable in the next, no operator can act on the answer. The engine is not "more accurate than every LLM" — Kimi-with-code-execution beat it on accuracy. But the engine is the only system that can be trusted to give the same answer twice.
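The determinism column is computed per row. A sketch assuming it is the share of rows whose answer is identical across all three reruns (field names hypothetical, as above):

```python
import json

runs = [json.load(open(f"results_run{k}.json")) for k in (1, 2, 3)]
preds = [[r["pred"] for r in run] for run in runs]

# A row counts as deterministic only if all three reruns agree on it.
stable_rows = [len({run[i] for run in preds}) == 1 for i in range(len(preds[0]))]
determinism = sum(stable_rows) / len(stable_rows)  # e.g. 1.00 for io-gita, 0.78 for Gemini
```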
The contamination story
The first time we ran this benchmark, every LLM got near-perfect accuracy. Suspicious. The cause: each LLM CLI was given workspace access to the io-gita repo — and that workspace contained `test/smart_grid/data/test_100.csv` with the ground-truth labels in plain sight. The models simply read the answers.
The fix: we re-ran the LLMs in `/tmp/sg_isolated/` — a workspace containing only the prompt file, no labeled data. Every accuracy figure on this page comes from the isolated re-run.
Why it's worth surfacing: in a real-world deployment, an LLM agent given filesystem or network access can accidentally — or deliberately — access the very data it was asked to predict on. A deterministic engine cannot. Auditability is not a feature you bolt on; it's a property of the architecture.
Where each one actually wins
Where the engine wins:
- Inference at 0.078 ms per query — 4,727× faster than the fastest LLM
- 100% determinism across reruns (vs 21%–98% for the LLMs)
- Auditable basin + path output: every prediction maps to a named region of the topology
- $0 marginal cost — runs locally on CPU, no API rate limits, no vendor lock-in
- One-time atlas build (10.4 s) is reusable across every future query
Where the LLMs win:
- Natural-language explanation per row (`reasoning` field) when prompted
- Zero-shot on novel data with no fitted pipeline
- Adapts task framing on the fly without retraining
- Kimi (with code-exec) achieves 100% by computing the underlying Schäfer formula directly
The engine and the LLMs solve overlapping but distinct problems. The deepest point of this benchmark isn't "engine beats LLM" — that's only true against text-only LLMs. It's that the engine is competitive on accuracy while being orders of magnitude faster, perfectly reproducible, free, and architecturally auditable. That combination is rare.
Caveats & reproducibility
The full benchmark — script, prompts, raw LLM transcripts, and aggregated JSON — lives in the io-gita repository at `test/smart_grid/llm_vs_engine/`. The page above renders straight from those JSON files; nothing is hard-coded.