LMSR Prediction Market Robustness

Under what conditions do Logarithmic Market Scoring Rule (LMSR) prediction markets fail to aggregate information?

Contributors: Mark Song, Minghe Liu, Zhaohua Zheng
CSE 5106: Multi-Agent Systems | Spring 2026

Approach

We use a multi-agent simulation with controlled stress tests. Across reruns, we vary:

  • noise intensity
  • adversary strength
  • informed-trader quality
  • liquidity depth
  • solvency structure
  • defense mechanisms

We then measure how often, how badly, and how long the market deviates from the true probability.

Market Model

LMSR setup

  • Outcome space: binary
  • Ground-truth probability: p* = 0.65
  • Liquidity parameter: b
    Larger b means deeper liquidity and less price impact per trade.
  • Trading horizon: 500 steps per run
  • Pricing rule: standard LMSR, where a trade costs the change in the cost function C

Core equations

The binary LMSR cost function and the YES-outcome price are:

\[ C(q_{\mathrm{yes}}, q_{\mathrm{no}})=b \log\left(e^{q_{\mathrm{yes}}/b}+e^{q_{\mathrm{no}}/b}\right) \]
\[ p_t=\frac{e^{q_{\mathrm{yes}}/b}}{e^{q_{\mathrm{yes}}/b}+e^{q_{\mathrm{no}}/b}} =\sigma\left(\frac{q_{\mathrm{yes}}-q_{\mathrm{no}}}{b}\right) \]
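The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the project's simulation code: the function names (`lmsr_cost`, `lmsr_price`, `buy_yes_cost`) and the example trade size are assumptions chosen for this sketch. It also shows the effect of b by pricing the same 10-share YES purchase at two liquidity levels.

```python
import math

def lmsr_cost(q_yes: float, q_no: float, b: float) -> float:
    """C(q_yes, q_no) = b * log(exp(q_yes/b) + exp(q_no/b)), in a stable form."""
    m = max(q_yes, q_no)  # factor out the larger exponent to avoid overflow
    return m + b * math.log(math.exp((q_yes - m) / b) + math.exp((q_no - m) / b))

def lmsr_price(q_yes: float, q_no: float, b: float) -> float:
    """YES price p_t = sigmoid((q_yes - q_no) / b)."""
    z = (q_yes - q_no) / b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def buy_yes_cost(q_yes: float, q_no: float, b: float, shares: float) -> float:
    """Amount a trader pays to buy `shares` YES shares: the change in C."""
    return lmsr_cost(q_yes + shares, q_no, b) - lmsr_cost(q_yes, q_no, b)

# Hypothetical example: the same 10-share YES purchase from an empty book.
for b in (10.0, 40.0):
    price_after = lmsr_price(10.0, 0.0, b)
    cost = buy_yes_cost(0.0, 0.0, b, 10.0)
    print(f"b={b:>4}: price after trade = {price_after:.4f}, cost = {cost:.4f}")
```

With b = 10 the price moves from 0.5 to about 0.73, while with b = 40 it moves only to about 0.56, which is the sense in which larger b means less price impact per trade.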

Mispricing is measured by Bernoulli KL divergence:

\[ \mathrm{KL}(p^* \Vert p_t)=p^* \log\frac{p^*}{p_t}+(1-p^*)\log\frac{1-p^*}{1-p_t} \]
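As a worked example of the mispricing measure, the sketch below evaluates the Bernoulli KL divergence for p* = 0.65. The function name `bernoulli_kl` and the clipping constant `eps` are assumptions added here to keep the logarithms finite; the source does not say how boundary prices are handled.

```python
import math

def bernoulli_kl(p_star: float, p_t: float, eps: float = 1e-12) -> float:
    """KL(p* || p_t) for a binary outcome, using natural logarithms."""
    # Assumption: clip the market price away from 0 and 1 so the logs stay finite.
    p_t = min(max(p_t, eps), 1.0 - eps)
    kl = 0.0
    if p_star > 0.0:
        kl += p_star * math.log(p_star / p_t)
    if p_star < 1.0:
        kl += (1.0 - p_star) * math.log((1.0 - p_star) / (1.0 - p_t))
    return kl

# A market sitting at the uninformed price 0.5 when the truth is 0.65.
print(round(bernoulli_kl(0.65, 0.50), 4))  # → 0.0457
```

For scale, a market stuck at 0.5 has a KL of roughly 0.046, and a perfectly calibrated market at 0.65 has a KL of 0.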

Evaluation logic

  • final_kl — final KL divergence from truth
  • p95_kl / p99_kl — tail mispricing during the run
  • kl_spike_rate — share of time steps with severe mispricing
  • recovery_time_after_last_adversary_rolling — time to recover after adversarial pressure ends
  • trade_fill_rate — executed orders / submitted orders
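The metrics listed above could be computed from a per-step KL path roughly as follows. This is a sketch under stated assumptions: the helper names, the nearest-rank percentile convention, and the `spike_threshold` value of 0.1 are all hypothetical, since the source does not specify how "severe mispricing" is defined or which percentile method is used.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100] (assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def summarize_run(kl_path, executed, submitted, spike_threshold=0.1):
    """Summary metrics for one run; spike_threshold is a hypothetical cutoff."""
    n = len(kl_path)
    return {
        "final_kl": kl_path[-1],
        "p95_kl": percentile(kl_path, 95),
        "p99_kl": percentile(kl_path, 99),
        "kl_spike_rate": sum(k > spike_threshold for k in kl_path) / n,
        "trade_fill_rate": executed / submitted if submitted else 0.0,
    }

# Made-up path: 95 calm steps followed by 5 badly mispriced steps.
example = summarize_run([0.01] * 95 + [0.5] * 5, executed=30, submitted=100)
print(example)
```

In this made-up path the p95 value stays low while p99 and the spike rate pick up the late disturbance, which is why the tail metrics are reported alongside final_kl.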

Informed-trader safety logic:

\[ \mathrm{safe\_seed\_rate} =\frac{1}{S}\sum_{s=1}^S \mathbf{1}\!\left[\min W^{\mathrm{wc}}_{\mathrm{Informed},s}\ge 0\right] \]
\[ \mathrm{composite\_failure\_rate} =\frac{1}{S}\sum_{s=1}^S \mathbf{1}\!\left[\mathrm{raw\_failure}_s \lor \min W^{\mathrm{wc}}_{\mathrm{Informed},s}<0\right] \]
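The two rates above reduce to simple indicator averages over seeds. The sketch below uses made-up inputs and hypothetical function names; it only illustrates that a cell can have a perfect safe-seed rate and still show a nonzero composite failure rate when a raw failure occurs.

```python
def safe_seed_rate(min_wealth_by_seed):
    """Share of seeds whose informed worst-case wealth never drops below 0."""
    return sum(w >= 0 for w in min_wealth_by_seed) / len(min_wealth_by_seed)

def composite_failure_rate(raw_failure_by_seed, min_wealth_by_seed):
    """Share of seeds with a raw failure OR negative informed worst-case wealth."""
    flags = [
        raw or w < 0
        for raw, w in zip(raw_failure_by_seed, min_wealth_by_seed)
    ]
    return sum(flags) / len(flags)

# Made-up inputs for 5 seeds: wealth never goes negative, one raw failure.
min_wealth = [12.0, 3.5, 0.0, 8.1, 20.4]
raw_failure = [False, False, True, False, False]
print(safe_seed_rate(min_wealth))                       # → 1.0
print(composite_failure_rate(raw_failure, min_wealth))  # → 0.2
```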

For the failure-boundary sweep, a run passes the realized-influence gate only if:

\[ \mathrm{gate}_s=\mathbf{1}\!\left[a_s\ge 0.15 \land n_s\ge 0.10 \land r_s\ge 0.05\right] \]

where, for run s, a_s is the adversary's active duration, n_s its executed notional share, and r_s its trade acceptance ratio.
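The gate is a conjunction of three threshold checks. A minimal sketch, assuming a hypothetical `AdversaryStats` record per run (the field names are illustrative; only the three thresholds come from the formula above):

```python
from dataclasses import dataclass

@dataclass
class AdversaryStats:
    """Hypothetical per-run adversary summary."""
    active_duration: float   # a_s
    notional_share: float    # n_s
    acceptance_ratio: float  # r_s

def passes_gate(s: AdversaryStats,
                min_active: float = 0.15,
                min_notional: float = 0.10,
                min_accept: float = 0.05) -> bool:
    """gate_s = 1 only if all three realized-influence thresholds are met."""
    return (s.active_duration >= min_active
            and s.notional_share >= min_notional
            and s.acceptance_ratio >= min_accept)

def gate_pass_rate(runs) -> float:
    """Share of runs in a cell that pass the gate."""
    return sum(passes_gate(r) for r in runs) / len(runs)

# Made-up runs: the second is active but moves too little notional.
runs = [AdversaryStats(0.40, 0.35, 0.20), AdversaryStats(0.40, 0.02, 0.20)]
print(gate_pass_rate(runs))  # → 0.5
```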

Threshold calibration

Failure thresholds were calibrated from healthy-control branches (scenario_no_adversary and scenario_tuned_recovery), not external benchmarks. The calibration pass selected:

  • final_kl ≤ 0.0072
  • avg_kl_after_burn_in ≤ 0.0056
  • p95_kl ≤ 0.0189
  • kl_spike_rate ≤ 0.0000
  • recovery_time_after_last_adversary_rolling ≤ 44.8
  • trade_fill_rate ≥ 0.2576

Important correction: these values were saved for documentation and later tuning, but the sweeps below still used the thresholds embedded in the live configs.
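For illustration, the calibrated values could be applied to a run's metrics as below. This is a sketch only: the sweeps used the thresholds in the live configs, and the function name, the dictionary layout, and the `flag_missing` behavior for undefined metrics are assumptions of this example.

```python
# Calibrated thresholds quoted above, stored as (bound, direction).
CALIBRATED = {
    "final_kl": (0.0072, "max"),
    "avg_kl_after_burn_in": (0.0056, "max"),
    "p95_kl": (0.0189, "max"),
    "kl_spike_rate": (0.0, "max"),
    "recovery_time_after_last_adversary_rolling": (44.8, "max"),
    "trade_fill_rate": (0.2576, "min"),
}

def breached_metrics(metrics, thresholds=CALIBRATED, flag_missing=True):
    """Return the names of metrics that violate their threshold.

    Assumption: a metric that is absent or None is flagged when
    flag_missing is True and skipped otherwise.
    """
    breached = []
    for name, (bound, direction) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            if flag_missing:
                breached.append(name)
            continue
        if direction == "max" and value > bound:
            breached.append(name)
        elif direction == "min" and value < bound:
            breached.append(name)
    return breached

# Made-up run that is healthy except for an undefined recovery time.
run = {
    "final_kl": 0.004, "avg_kl_after_burn_in": 0.005, "p95_kl": 0.015,
    "kl_spike_rate": 0.0, "recovery_time_after_last_adversary_rolling": None,
    "trade_fill_rate": 0.30,
}
print(breached_metrics(run))
print(breached_metrics(run, flag_missing=False))
```

The `flag_missing` switch makes explicit a design choice that matters for control runs: whether an undefined metric counts as a failure or is simply not evaluated.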

Agent Families

The simulation includes 3 agent families and 10 concrete agent types.

Informational agents

  • InformedAgent — trades on a noisy signal when belief and price differ enough
  • DelayedInformedAgent — acts on lagged signals
  • NoiseAgent — trades randomly
  • RegimeSwitchNoiseAgent — alternates between calm and bursty noise

Behavioral agents

  • MomentumAgent — follows recent price moves
  • MeanReversionAgent — trades against recent moves
  • HerdingTechnicalAgent — combines order flow and momentum into trend following

Adversarial agents

  • AdaptiveAdversary — pushes toward a target distribution and adjusts intensity
  • CollusiveAdversary — coordinates attack and rest phases across agents
  • TriggerAdversary — attacks only when the market becomes exploitable
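To make one of these descriptions concrete, the sketch below shows a possible decision rule for an informed trader: form a noisy belief about p*, then trade only when the belief and the price differ by more than a threshold. Everything here is hypothetical, including the function names, the Gaussian noise model, the clipping bounds, and the default `threshold` and `size`; the project's actual InformedAgent may differ.

```python
import random

def noisy_belief(p_star: float, sigma: float, rng: random.Random) -> float:
    """Hypothetical signal: truth plus Gaussian noise, clipped to [0.01, 0.99]."""
    return min(max(p_star + rng.gauss(0.0, sigma), 0.01), 0.99)

def informed_order(belief: float, price: float,
                   threshold: float = 0.05, size: float = 1.0) -> float:
    """Signed order: +size buys YES, -size buys NO, 0.0 means no trade."""
    gap = belief - price
    if abs(gap) < threshold:
        return 0.0  # belief and price are too close to justify a trade
    return size if gap > 0 else -size

rng = random.Random(0)  # fixed seed so the example is reproducible
belief = noisy_belief(0.65, sigma=0.05, rng=rng)
print(informed_order(belief, price=0.50))
```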

Experimental Design

The project has 4 complementary studies:

  1. Multi-seed scenario suite
    10 scenarios × 5 seeds = 50 runs

  2. Failure-boundary sweep
    3 noise × 3 adversary × 3 informed-strength × 10 seeds = 270 runs

  3. Liquidity / solvency frontier
    2 scenarios × 5 liquidity levels × 4 risk profiles × 5 seeds = 200 runs

  4. Defense-mechanism ablation
    32 combinations × 10 seeds = 320 runs
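The run counts above follow directly from the grid sizes, which can be checked with a short script. The study keys and the level labels used to enumerate the failure-boundary grid are assumptions of this sketch (the source names only some of the levels).

```python
from itertools import product
from math import prod

# Factor sizes for each study, as listed above.
studies = {
    "scenario_suite": (10, 5),             # scenarios × seeds
    "failure_boundary": (3, 3, 3, 10),     # noise × adversary × informed × seeds
    "liquidity_solvency": (2, 5, 4, 5),    # scenarios × liquidity × risk × seeds
    "defense_ablation": (32, 10),          # combinations × seeds
}
run_counts = {name: prod(factors) for name, factors in studies.items()}
total_runs = sum(run_counts.values())
print(run_counts, total_runs)

# Enumerating the failure-boundary cells (level labels are assumed).
noise_levels = ("low", "medium", "high")
adversary_levels = ("low", "medium", "high")
informed_levels = ("weak", "medium", "strong")
cells = list(product(noise_levels, adversary_levels, informed_levels))
print(len(cells))  # → 27
```

Across the four studies this comes to 840 runs in total, with 27 cells of 10 seeds each in the failure-boundary sweep.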

Results

1. Scenario results

The 10-scenario suite split into three groups.

All-flagged group

  • delayed_info
  • high_noise
  • technical_herding
  • low_liquidity
  • regime_noise
  • no_adversary* (flagged on an undefined recovery metric only; see the control caveat)

Partial-failure group

  • trigger_adversary: 1 / 5 seeds failed (failure_rate = 0.200)

Zero-failure group

  • strong_adversary
  • collusive_adversary
  • tuned_recovery

Main takeaway:
The worst outcomes came not from the strongest adversaries, but from stale information, dominant noise, and endogenous technical feedback.

Per-scenario failure rate over 5 seeds

2. Tail risk

Delayed information produced the worst tail behavior.

Worst average p95_kl

  • delayed_info: 2.4433
  • high_noise: 0.5487
  • technical_herding: 0.4837

delayed_info also showed the broadest failure signature, breaching the threshold on:

  • final_kl in 2 / 5 seeds
  • p95_kl in 5 / 5 seeds
  • kl_spike_rate in 5 / 5 seeds
  • recovery_time_after_last_adversary_rolling in 5 / 5 seeds
  • stress_duration_after_burn_in in 5 / 5 seeds

Average p95 KL across scenarios

3. Best-performing scenarios

Several adversarial scenarios performed unexpectedly well.

Best average final KL

  • trigger_adversary: 0.0003
  • tuned_recovery: 0.0020
  • collusive_adversary: 0.0029
  • strong_adversary: 0.0071

Best execution efficiency

  • high_noise: 0.6889
  • low_liquidity: 0.5960
  • regime_noise: 0.5766

Control caveat: no_adversary

All five seeds were flagged as failures only because recovery_time_after_last_adversary_rolling is undefined when no adversary exists. Quantitatively, the control remained strong:

  • avg final KL: 0.0037
  • avg p95 KL: 0.0154

Mean KL after burn-in by scenario

4. Two failure modes

The project separates economic failure from true adversarial stress.

A. Economic failure, no dominance

High final KL with low realized adversary share. These failures are driven by:

  • noise
  • stale information
  • weak informed order flow

B. True adversarial stress

The adversary is both:

  • active
  • economically influential

This category is much rarer than raw failure counts suggest.

Robustness is not just whether bad outcomes occur, but who causes them and through what mechanism.
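The distinction between the two failure modes can be expressed as a small classification over two per-run flags: whether the run failed, and whether the adversary passed the realized-influence gate. The labels and function name below are hypothetical, and the run outcomes are made up for illustration.

```python
from collections import Counter

def classify_run(failed: bool, gate_passed: bool) -> str:
    """Label a run by who caused the failure, if any (labels are illustrative)."""
    if not failed:
        return "no_failure"
    # A failure counts as adversarial only when the adversary was both
    # active and economically influential, i.e. the gate passed.
    return "adversarial_stress" if gate_passed else "economic_failure"

# Made-up (failed, gate_passed) outcomes for a handful of runs.
outcomes = [(True, False), (True, False), (True, True), (False, True), (False, False)]
tally = Counter(classify_run(f, g) for f, g in outcomes)
print(tally)
```

Counting raw failures alone would report three failures in this made-up sample; splitting by the gate shows that only one of them is attributable to realized adversarial influence.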

Realized adversary share vs average final KL

5. Failure boundary

Noise creates a sharp phase transition.

Low noise

Failure rises when informed traders are weak and adversarial pressure increases.

Failure heatmap under low noise

Medium noise

Strong informed traders remain safe, while weak-informed cells stay fragile.

Failure heatmap under medium noise

High noise

Failure becomes universal: P(failure) = 1.0 across the full grid. At that point, adversary strength no longer matters.

Failure heatmap under high noise

The worst raw cell in the rerun was noise = high, adversary = low, informed = weak, with:

  • failure rate: 1.000
  • avg p95 KL: 1.0297
  • gate-pass rate: 0.000

6. Gating diagnostic

A realized-influence gate reframes the apparent adversarial boundary. A run counts as true adversarial stress only if adversary activity, executed notional share, and acceptance ratio all clear the thresholds.

Many high-failure cells had a gate-pass rate of 0.0, meaning they failed economically without strong realized adversarial control. Calling these cells “adversarial breakdowns” would be misleading.

True adversarial stress corner

  • noise: low
  • adversary: high
  • informed strength: weak
  • gate-pass rate: 0.600
  • failure rate: 1.000
  • avg adversary notional share: 0.351

This is the clearest case where manipulation is both active and economically meaningful.

7. Liquidity × solvency frontier

High liquidity and solvency floors produced the cleanest safe region.

Safest overall cells

  • scenario: no_adversary
  • liquidity: b = 40
  • risk profile: strict_floor_0
  • safe-seed rate: 1.000
  • composite failure: 0.000
  • avg p95 KL: 0.0422

  • scenario: no_adversary
  • liquidity: b = 40
  • risk profile: buffered_floor_5
  • safe-seed rate: 1.000
  • composite failure: 0.000
  • avg p95 KL: 0.0516

Best safe cell under strong adversarial pressure

  • scenario: strong_adversary
  • liquidity: b = 40
  • risk profile: strict_floor_0
  • safe-seed rate: 1.000
  • composite failure: 0.200
  • avg p95 KL: 0.0188

Margin profiles

Margin-based profiles never produced a safe cell. Under strong_adversary, the most accurate margin cell still showed:

  • risk profile: margin_cap_100
  • avg p95 KL: 0.0053
  • informed worst-case terminal wealth: -48.071
  • composite failure: 1.000

Thus, low error alone did not imply safety.

Failure heatmap without adversary

Failure heatmap with strong adversary

8. Accuracy × safety

Solvency floors outperform margin caps when both accuracy and survival matter.

The most effective protections are structural:

  1. Deeper liquidity
    Reduces price impact per trade.

  2. Solvency floors
    Create the only fully safe region in the rerun.

  3. Separate accuracy from safety
    Some margin cells look accurate on p95_kl but still fail because informed traders can end with deeply negative worst-case wealth.

The broader implication is that the best protection comes from controlling market physics (price impact and bankruptcy), not from subtle fee tweaks.

Average p95 KL vs worst-case terminal wealth

9. Defense ablation

All 32 defense combinations converged to the same aggregate result in the live rerun.

Aggregate outcome across every combination

  • failure rate: 0.200
  • avg p95 KL: 0.0238
  • avg fill rate: 0.2773
  • combinations meeting a 0.10 target failure rate: 0

Likely interpretation

Under the current stressed default state, the tested toggles do not separate outcomes at the aggregate level. Either the base regime dominates the defense levers, or the summary metrics are too coarse to reveal narrower benefits.

Failure rate vs enabled defense components

Discussion

The main result contradicts the default intuition that adversaries are the dominant threat.

Expected vs observed

Robustness work often treats manipulation as the central risk. In these simulations, that is not what happens. Several adversarial scenarios perform well, and the realized-influence gate shows that many apparent adversarial failures are actually economic failures with low adversary participation.

Why this happens

The main bottleneck is information quality. Markets fail when:

  • informed signals are stale
  • noise becomes dominant
  • endogenous technical behavior distorts price discovery

LMSR aggregates well only when informed flow dominates the order stream.

What the defenses show

Behavioral nudges such as fee toggles were weak here. By contrast, liquidity depth and solvency floors created clear safe regions. The most effective interventions therefore act on market structure, not just trader incentives.