Autonomy · Benchmarks · April 2026

How Long Can Claude Mythos Work Alone?

Opus 4.6 can complete human tasks that take 12 hours. We estimate Mythos could handle tasks that take 16.

Sam Donahue · April 7, 2026 · Written before METR evaluation of Mythos

METR measures the longest human task an AI model can autonomously complete — writing code, debugging systems, running experiments. It is a measure of task complexity, not runtime: a 12-hour time horizon means the model can solve problems that would take a human roughly 12 hours, not that the model runs for 12 hours. The current leader is Opus 4.6 at ~12 hours. Mythos has no METR score yet.

We regress aggregate capability scores (IRT) against METR time horizons for 19 models. On reasoning-era models (n=14, R²=0.977), the fit predicts Mythos could complete human tasks of roughly 16 hours — a 33% increase over Opus 4.6.

Reasoning-era models: Mythos predicted at ~16-hour task complexity

Post-o1 models only (n=14, R²=0.977). Linear and quadratic converge. Faded points = pre-o1 models (not in fit). Error bars = METR CIs.

Data: METR-Horizon-v1.1 × Self-reported IRT (Ho et al.). Band = 90% bootstrap CI (400 resamples). Y-axis = human-equivalent task duration.

~16h

Predicted task complexity
(median, post-o1 fit)

12h

Current METR leader
(Opus 4.6)

+33%

Predicted increase
over Opus 4.6

0.977

R² (post-o1 fit)
LOO error: 9–11%

Estimates at a glance

Method	Estimate (human-task hours)	Source
IRT regression, post-o1 (n=14)	~16h (linear & quadratic converge)	This analysis
IRT regression, all models (n=19)	15.9–27.1h	This analysis
Alternative fits (power law, sigmoid, piecewise)	10–17h (4 of 6 fits)	This analysis
Individual benchmark ensemble (6 benchmarks)	10.5h median (5.4–18.1h)	This analysis
Anthropic internal task evals	40h-equiv. on 2/3 of tasks	System Card p. 34
Anthropic qualitative assessment	"Not close" to replacing engineers	System Card p. 45

Our estimates cluster around 10–17 hours across methods. Anthropic's internal task evaluations (40h-equivalent) are not directly comparable — they measure speedup on narrow tasks, not sustained autonomous completion of complex work. Their qualitative finding ("not close" to engineer replacement) is consistent with a model that handles day-length tasks but cannot reliably sustain multi-day autonomous work.

Background: METR and IRT

METR (Model Evaluation & Threat Research) evaluates AI autonomy by giving models real-world tasks of increasing complexity — coding challenges, system debugging, research experiments. The time horizon is the human-equivalent duration of the most complex task the model can complete. It reports two thresholds:

p50 — the median: the task duration the model completes successfully 50% of the time
p80 — the reliable threshold: the task duration it completes 80% of the time

Model	p50 (task hours)	Release
Claude Opus 4.6	~12 hours	Feb 2026
GPT-5.2	~5.9 hours	Dec 2025
GPT-5.3 Codex	~5.8 hours	Feb 2026
Claude Opus 4.5	~4.9 hours	Nov 2025
Gemini 3 Pro	~3.7 hours	Nov 2025

Source: Epoch AI / METR-Horizon-v1.1, retrieved March 21, 2026. Mythos has no METR score yet.

Item Response Theory (IRT) aggregates many benchmark scores into a single ability parameter per model, simultaneously estimating each benchmark's difficulty (Ho et al., as implemented by the Epoch Capabilities Index). We use self-reported IRT (from labs' own benchmark results) because Mythos's score of 186.6 is on that scale. Mixing self-reported with third-party IRT produces unstable extrapolations due to a systematic ~6–11 point gap between the scales.

We fit log(METR minutes) = a + b·IRT + c·IRT² to the 19 models with both METR and IRT scores, testing across three regime cutoffs and six functional forms.

Prior work: Epoch AI's baseline

Ho et al. ("A Rosetta Stone for AI Benchmarks", Section 3.1.1) established the baseline for this approach. They fit a linear map from their estimated capability score (C_m) to log(time horizon), finding:

time_horizon = exp(3.69 × C_m − 4.58)

R² = 0.85 on a 40% held-out validation set, using a 60/40 train-test split. This outperformed individual benchmarks (median R² = 0.62 across 18 benchmarks; GPQA Diamond alone R² = 0.75). They note that some benchmarks, including SWE-Bench Verified, are "actively detrimental" for predicting time horizons (footnote 8, p. 5).

Our approach differs in three ways:

Scale: We use self-reported IRT (anchored so Claude 3.5 Sonnet ≈ 130), not Epoch's C_m (WinoGrande-anchored, where 0 = WinoGrande difficulty). The scales are monotonically related but not identical.
Data: We match 19 models (vs Epoch's ~15–20 with overlap), with both p50 and p80, using the March 2026 METR v1.1 dataset which includes Opus 4.6 and GPT-5.x models not available when the Rosetta Stone paper was written (November 2025).
Validation: We use leave-one-out CV (median error 9–11% on post-o1) rather than a random train-test split, and test multiple functional forms beyond linear.

Method	R²	Validation	n	Mythos
Epoch (Ho et al.) — linear, C_m scale	0.85	40% holdout (1 trial)	~15–20	—
Ours — linear, all models	0.946	LOO: 22%	19	15.9h
Ours — linear, all models	0.909*	40% holdout (1000 trials)	19	15.4h [12.4–23.4]
Ours — linear, post-o1	0.977	LOO: 9%	14	16.4h
Ours — linear, post-o1	0.937*	40% holdout (1000 trials)	14	17.8h [14.5–21.5]

*Median test-set R² over 1000 random 60/40 splits. Brackets = 90% CI on Mythos prediction across splits. Higher R² vs Epoch likely reflects updated data (more frontier models with METR scores) and regime restriction, not a methodological improvement. Epoch's approach is the foundational method; we apply it with newer data and test robustness. Also see EA Forum post applying weighted regression with 8 post-o3 models.

Results

Full dataset (n=19): linear and quadratic diverge on older models

Including pre-o1 models (GPT-4, Claude 3 Opus, etc.) widens the quadratic prediction. Compare to the post-o1 chart above.

Three regimes, two fits each.

Regime	Fit	n	R²	LOO Error	p50 Prediction	p80 Prediction
All models	Linear	19	0.946	22%	15.9 hours	2.8 hours
All models	Quadratic	19	0.959	32%	27.1 hours	4.3 hours
Post-o1	Linear	14	0.977	9%	16.4 hours	2.7 hours
Post-o1	Quadratic	14	0.977	11%	16.1 hours	1.4 hours
Frontier IRT≥130	Linear	13	0.971	10%	16.8 hours	2.8 hours
Frontier IRT≥130	Quadratic	13	0.972	13%	14.6 hours	0.9 hours

LOO = Leave-one-out cross-validation median absolute percentage error. Post-o1 = models released after o1 (Dec 2024). Frontier = IRT ≥ 130.

Post-o1 regime (n=14): Linear and quadratic converge at ~16h with 9–11% LOO error. On the full dataset (n=19), they diverge (15.9h vs 27.1h) because pre-reasoning-era models weight the quadratic's curvature upward. The post-o1 subset has fewer free parameters relative to the data range and lower LOO error.

Beyond linear and quadratic

We also tested power law, sigmoid, cubic, and piecewise linear fits on the full n=19 dataset. The predictions cluster into two groups:

Fit Type	Params	R²	LOO	Mythos p50
Linear (log-space)	2	0.946	22%	15.9h
Power law	2	0.906	17%	10.1h
Sigmoid (log-space)	3	0.970	26%	11.8h
Piecewise linear (break=130)	4	0.972	18%	16.8h
Quadratic (log-space)	3	0.959	32%	27.1h
Cubic (log-space)	4	0.978	23%	7.7h

Four of six fits predict 10–17 hours. The quadratic (27h) is pulled up by curvature from older models. The cubic (7.7h) overfits and curves back down. Sigmoid and piecewise fits, which allow the functional form to change, land at 12–17 hours.

Individual benchmarks → METR

As a robustness check, we regress each benchmark individually against METR p50 (univariate log-linear fits), then predict Mythos from its system card score. Of the 27 self-reported benchmarks with ≥4 METR models reporting, only 6 also have Mythos scores available — a data coverage limitation, not a methodological one.

We also attempted multivariate Ridge regression across all available benchmarks. With only 7 overlapping features and 19 data points, the model extrapolates unstably (Mythos scores exceed the training range on every benchmark). The univariate ensemble below is more robust.

Benchmark	n	R²	LOO	Mythos	Data max	→ p50	Ref
BrowseComp	4	0.968	28%	86.9%	84.0%	14.9h	p. 191
SWE-bench Verified	14	0.856	40%	93.9%	80.9%	11.3h	p. 187
GPQA Diamond	17	0.846	33%	94.5%	92.4%	5.4h	p. 189
MMMLU	9	0.835	56%	92.7%	91.8%	9.7h	p. 189
HLE (no tools)	6	0.712	65%	56.8%	53.1%	18.1h	p. 191
Terminal-Bench 2.0	4	0.173	73%	82.0%	77.3%	8.7h	p. 188

Univariate log-linear fits. Mythos scores clipped to 115% of data max to limit extrapolation. Sorted by R². Terminal-Bench grayed: R² = 0.17 (poor fit, only 4 models). All predictions involve extrapolation.

Median across 6 benchmarks: 10.5 hours. R² > 0.3 subset (5 benchmarks): median 11.3 hours. These are lower than the IRT-based ~16h and have higher LOO errors (28–65% vs 9–18%). Individual benchmarks are noisier predictors and the extrapolation is more severe (Mythos exceeds every benchmark's data max). IRT aggregation compresses this noise, which is why the IRT-based estimates have better LOO.

What the system card says

No METR score, but Anthropic reports internal autonomy evaluations and their own IRT trajectory.

Autonomy evaluations

Anthropic's internal suite tests AI R&D capabilities with hour-equivalent thresholds (System Card Table 2.3.3.A, p. 34):

Task	Opus 4.5	Opus 4.6	Mythos	Threshold
Kernel task (speedup)	252×	190×	399×	300× = 40h eq.
Time Series (MSE)	5.71	5.80	4.55	<5.3 = 40h eq.
LLM Training (speedup)	16.5×	34×	51.9×	>4× = 4–8h eq.
Quadruped RL (score)	19.48	20.96	30.87	>12 = 4h eq.
Novel Compiler (%)	69.4%	65.8%	77.2%	90% = 40h eq.

Source: Claude Mythos Preview System Card, Table 2.3.3.A, p. 34.

Note: These are task-specific hour-equivalents on narrow evaluations, not METR's measure of sustained autonomous work across diverse tasks. The two metrics are not directly comparable.

Anthropic's internal ECI and what it means for our prediction

For the first time, Anthropic published their own IRT-based capability tracking (System Card Section 2.3.6, pp. 40–42), using the same method as Epoch AI's public ECI but with ~300 models and "hundreds of benchmarks, mostly internal." Two findings are directly relevant:

Benchmark saturation at the frontier (Figure 2.3.6.A, p. 40). Most benchmarks in their IRT fit cluster below ECI ~175. Very few exist at Mythos's level (~190 on their internal scale). Anthropic states: "The ECI is only as good as the underlying dataset, and there are currently few benchmarks at Claude Mythos Preview's current capability level to tightly calibrate its ECI score." This applies equally to our prediction: if Mythos's IRT score of 186.6 has wide uncertainty due to benchmark scarcity at the frontier, then our METR extrapolation inherits that uncertainty.

Accelerating capability trajectory (Figure 2.3.6.B, p. 42). The Anthropic frontier from Claude 3 Opus (~118 internal ECI, Jan 2024) to Mythos (~190, Apr 2026) shows a two-phase linear fit with slope ratio 1.86×–4.3× depending on breakpoint. Mythos "appears to be above the pre-Mythos Preview trend, although its error bars are quite large." They caution: "we do not know if this trend will continue with future models."

Reconciling their ECI with our IRT: Anthropic's internal ECI of ~190 and our self-reported IRT of 186.6 are not directly comparable — different benchmark sets, different anchoring. But both place Mythos roughly 10–15 points above Opus 4.6 on their respective scales, which is the gap that drives our METR prediction. The relative gap matters more than the absolute number.

Three other details from this section that are easy to miss:

Gains from human research, not AI. Anthropic interviewed the people involved and concludes the capability advances were made "without significant aid from the AI models available at the time" (p. 42). Mythos's jump is conventional R&D scaling, not recursive self-improvement.
4× productivity ≠ 4× progress. Internal survey found ~4× geometric mean productivity uplift from Mythos. But reaching 2× R&D acceleration "would require uplift roughly an order of magnitude larger" (p. 42). Productivity and progress have sharply diminishing returns between them.
Early AI-win claims deflated. "When we followed up on each claim, it appeared that the contribution was real, but smaller or differently shaped than initially understood" (p. 42). What initially looked like autonomous discovery was "reliable execution of a human-specified approach." Worth keeping in mind when interpreting METR scores: performing well on structured evaluations may overstate real-world autonomous capability.

Qualitative assessment from the system card

Task-level performance and sustained autonomy diverge.

Anthropic's internal survey (n=18, p. 35): 1/18 thought Mythos was a drop-in for an entry-level Research Scientist. Their conclusion (p. 45): "Claude Mythos Preview does not seem close to being able to substitute for Research Scientists and Research Engineers, especially relatively senior ones."

Documented failure modes (pp. 35–39):

Confabulation cascades — confident but contradictory explanations, corrected only after empirical pushback (pp. 37–38)
Grinding — ~160 extra experiments re-running identical code for lucky measurements (p. 39)
Factual errors — 4 independent errors over 38 turns, corrected only after user prompting (pp. 36–37)
Reward hacking — moved computation outside timing calls; found test sets used by graders (p. 35)

These failure modes — confabulation, grinding, factual errors — are the kinds of behaviors that degrade sustained autonomous performance. METR's time horizon measures exactly this: coherent multi-step work over extended periods. The gap between narrow task scores (40h equivalent) and qualitative assessment ("not close" to engineer replacement) is consistent with a p50 time horizon in the 10–20h range.

Summary

Method	Estimate	Source
IRT regression, post-o1 (n=14)	~16h (linear & quadratic converge)	This analysis
IRT regression, all models (n=19)	15.9–27.1h (linear–quadratic)	This analysis
Alternative fits (power, sigmoid, piecewise)	10–17h (4 of 6 fits)	This analysis
Univariate benchmark ensemble (6 benchmarks)	10.5h median (5.4–18.1h range)	This analysis
Anthropic's internal task evals	40h equiv. on 2/3 of tasks	System Card p. 34
Anthropic's qualitative assessment	"Not close" to engineer replacement	System Card p. 45

Our regression-based estimates cluster around 10–17 hours, with the tightest fit (post-o1, R²=0.977) converging at ~16h. Anthropic's internal task evaluations are not directly comparable to METR, and their qualitative assessment is consistent with a model that can sustain autonomous work for hours but not reliably for days. METR's actual evaluation, when published, will determine accuracy.

Aside: Meta Muse Spark

Also released this week (April 8, 2026), Meta's Muse Spark sits at IRT ~168 — nearly identical to Opus 4.5 (167.6) and 19 points below Mythos (186.6). Our model predicts ~5 task-hours for Muse Spark, vs ~16 for Mythos.

On aggregate, Muse Spark is competitive: it beats Opus 4.6 on 10 of 19 benchmarks, with large wins on multimodal (CharXiv +21.1, ERQA +13.1), health (HealthBench Hard +28.0), and competitive coding (LiveCodeBench Pro +9.3). But it trails on the benchmarks most predictive of autonomous work:

Agentic benchmark	Muse Spark	Opus 4.6	Gemini 3.1	GPT-5.4	Mythos
SWE-bench Verified	77.4	80.8	80.6	—	93.9
SWE-bench Pro	52.4	53.4	54.2	57.7	77.8
Terminal-Bench 2.0	59.0	65.4	68.5	75.1	82.0
DeepSearchQA	74.8	73.7	69.7	73.6	—
ARC AGI 2	42.5	63.3	76.5	76.1	—

Model	IRT	Predicted METR p50	Actual METR (if known)
Mythos Preview	186.6	~16h	TBD
Opus 4.6	177.0	~9h*	12h (actual)
Muse Spark (Thinking)	167.7	~5h	TBD
Opus 4.5	167.6	~5h	4.9h (actual)
Gemini 3 Pro	166.8	~5h	3.7h (actual)

*Opus 4.6 actual METR is 12h; model underpredicts by ~25%. Muse Spark benchmarks from Meta's release (April 8, 2026). Frontier model benchmarks from respective system cards / announcements.

The IRT of 168 is an aggregate that averages Spark's multimodal strengths with its agentic weaknesses. For METR prediction, the agentic benchmarks matter most, and on those Spark trails Opus 4.6 by 3–6 points. At IRT 168, our model places it near Opus 4.5 (actual METR: 4.9h) and Gemini 3 Pro (actual: 3.7h) — both of which would validate a ~5h prediction. The 19-point IRT gap to Mythos (168 → 187) translates to a predicted 3× difference in autonomous task complexity.

Robustness checks

80% reliability threshold (p80)

p80 measures the human-task duration a model completes successfully 80% of the time — a stricter bar than the median (p50).

At the 80% reliability bar, Mythos predicted at ~2–3 task-hours

All 19 models. The gap between p50 (~16h) and p80 (~2h) reflects high variance in autonomous performance on complex tasks.

Individual benchmark predictions

As a robustness check, we regress each benchmark individually against METR (univariate log-linear fits). Only 6 of 27 benchmarks have both METR model coverage (≥4) and a Mythos score.

Six benchmarks predict 5–18 task-hours (median 10.5h)

Each benchmark regressed independently against METR p50. Dot size = R². Mythos scores from system card.

Individual benchmarks are noisier (LOO 28–65% vs 9–11% for IRT) and Mythos exceeds the data maximum on all six, making every prediction an extrapolation. IRT aggregation compresses this noise, which is why IRT-based estimates have lower cross-validation error.

Scaling behavior

IRT 130 (Claude 3.5 Sonnet) → IRT 177 (Opus 4.6): +47 points, task complexity grew from ~20 min to ~12 hours (36×).
IRT 177 (Opus 4.6) → IRT 187 (Mythos): +10 points, predicted ~12h → ~16h (1.3×).

On the full dataset, linear and quadratic diverge (15.9h vs 27.1h). On post-o1 only, they converge (both ~16h). The data does not clearly distinguish functional forms in this regime — more frontier model evaluations will resolve whether the relationship remains log-linear or accelerates.

Caveats

Extrapolation risk

Mythos at IRT 186.6 is ~10 points beyond Opus 4.6 at 177.0. Our regression has never been tested in this range. The confidence intervals widen accordingly.

Self-reported score inflation

Our prior SPAR research found labs over-report by ~1.13 pp on average (bootstrap 95% CI: [0.58, 1.74], p=0.0005, n=180 benchmark pairs). Mythos's IRT of 186.6 is derived from self-reported scores. If inflated, the true capability and METR prediction would be lower.

Benchmark saturation

Multiple benchmarks are near ceiling for Mythos (GPQA 94.5%, USAMO 97.6%, SWE-bench Verified 93.9%). Anthropic acknowledges: "The supply of benchmarks at the frontier is still a bottleneck" (System Card p. 40). Saturated benchmarks compress IRT differences and may understate the true capability gap.

Opus 4.6 autonomy outlier

At 718 minutes, Opus 4.6 dramatically outperforms GPT-5.2 (352 min) despite similar IRT. Anthropic may have specifically optimized for autonomous task completion. Our regression, anchored partly by Opus 4.6, may overweight Anthropic-specific gains.

Reward hacking

The system card reports novel reward hacking: Mythos moved computation outside timing calls and found test sets used by graders (p. 35). If similar behaviors emerge in METR evaluations, the measured time horizon could be artificially inflated.

80th percentile data sparsity

While all 19 models now have matched p80 data, the 80th percentile threshold is much harder to reach and the absolute values are small (many under 5 minutes), amplifying noise. The ~2.7 hour post-o1 prediction should be treated as directional.

IRT fallback for Claude 3.5 Sonnet June '24

Claude 3.5 Sonnet June '24 uses third-party IRT (127.0) as fallback since no self-reported IRT was available — the only model requiring this fallback. This introduces a small inconsistency in the otherwise self-reported IRT dataset.

Benchmark scores vary across system cards

The same model on the same benchmark can report different scores in different system cards. Example: Opus 4.6 on CharXiv Reasoning scores 68.5% in its own system card but 61.5% in the Mythos system card. The Mythos card explains why (Section 6.11, p. 192): they changed the grader model (Sonnet 4 → Sonnet 4.6), expanded the tool condition (image-cropping → full Python), and preserved thinking traces during grading. Anthropic re-evaluated all prior models with the new harness for fair within-card comparison. Our IRT scores use whichever scores appeared in the SPAR master dataset at time of computation, which may mix harness versions across models. This is an unquantified source of noise in the IRT fits.

Full methodology

Regression model: log(METR minutes) = a + b·IRT + c·IRT² via numpy.polyfit, degree 1 and 2, fit in log-space.

Data: 19 models from METR-Horizon-v1.1 (Epoch AI), matched to self-reported IRT scores from SPAR master dataset. All 19 have both p50 and p80 METR scores. Three regimes tested: all models (n=19), post-o1 (n=14), and frontier IRT≥130 (n=13).

IRT computation: 2-parameter logistic IRT via scipy.optimize.least_squares, MMLU-Pro anchor, L2 regularization (0.1), minimum 3 benchmarks per model. Self-reported IRT preferred; third-party IRT used as fallback for Claude 3.5 Sonnet June '24 only (127.0).

Validation: Leave-one-out cross-validation. Median absolute percentage error ranges from 9% (post-o1 linear) to 32% (all models quadratic). 500 bootstrap resamples for 90% confidence bands.

Mythos IRT score: 186.6 (self-reported, from SPAR master dataset as of April 7, 2026).

System card: Claude Mythos Preview System Card, April 7, 2026. 243 pages. Page numbers cited throughout.

Appendix: all 19 matched models (raw data)

Every model used in the regression, sorted by IRT score. IRT source: SR = self-reported, 3P = third-party fallback. METR CIs from METR-Horizon-v1.1. All times in minutes unless labeled hours.

Model	IRT	Src	p50 (min)	p50	p50 CI	p80 (min)	p80	Release
Claude Opus 4.6	177.0	SR	718.8	12.0h	[319, 3950]	69.9	1.2h	2026-02-05
GPT-5.3 Codex	173.8	SR	349.5	5.8h	[192, 858]	54.7	0.9h	2026-02-05
GPT-5.2	169.7	SR	352.2	5.9h	[191, 862]	66.0	1.1h	2025-12-11
Claude Opus 4.5 (R)	167.6	SR	293.0	4.9h	[161, 639]	49.4	0.8h	2025-11-24
Gemini 3 Pro	166.8	SR	224.3	3.7h	[137, 387]	54.1	0.9h	2025-11-18
GPT-5.1	163.4	SR	223.7	3.7h	[135, 395]	50.6	0.8h	2025-11-19
GPT-5	162.2	SR	203.0	3.4h	[114, 407]	38.3	0.6h	2025-08-07
Claude 4.1 Opus	150.8	SR	100.5	1.7h	[60, 158]	23.5	0.4h	2025-08-05
o3	150.0	SR	119.7	2.0h	[73, 192]	30.0	0.5h	2025-04-16
Claude 4 Opus (R)	149.9	SR	100.4	1.7h	[60, 163]	20.4	0.3h	2025-05-22
Claude 3.7 Sonnet	140.1	SR	60.4	1.0h	[33, 107]	12.1	0.2h	2025-02-24
o1	134.6	SR	38.8	39m	[22, 67]	7.1	7m	2024-12-05
Claude 3.5 Sonnet (Oct)	130.0	SR	20.5	21m	[10, 40]	2.6	3m	2024-10-22
Claude 3.5 Sonnet (Jun)	127.0	3P	11.4	11m	[5, 23]	1.7	2m	2024-06-20
o1-preview	123.8	SR	20.3	20m	[12, 33]	4.4	4m	2024-09-12
GPT-4o	117.1	SR	7.0	7m	[4, 12]	1.3	1m	2024-05-13
Claude 3 Opus	112.3	SR	4.0	4m	[2, 9]	0.6	<1m	2024-03-04
GPT-4 Turbo	109.2	SR	4.0	4m	[2, 8]	0.8	<1m	2023-11-06
GPT-4	84.8	SR	4.0	4m	[2, 8]	0.9	<1m	2023-03-14

Data: METR-Horizon-v1.1 YAML matched to SPAR master dataset IRT scores. CSV available at final_matched_metr_irt.csv.

Appendix: full benchmark matrix underlying IRT scores (19 models × 101 benchmarks)

Each model's IRT score is computed from its self-reported benchmark results. The full matrix is 19 models × 101 benchmarks, but highly sparse — no single benchmark covers all 19 models, and models report between 0 and 43 benchmarks. IRT is designed to handle this sparsity, estimating a single ability parameter from whatever subset of benchmarks is available.

Model	IRT	Benchmarks reported	Top benchmarks (sample)
Opus 4.6	177.0	26	GPQA 91.3, SWE-V 80.8, AIME'25 99.8, MMMLU 91.1, HLE 53.1, ARC-AGI2 68.8, T-Bench 65.4
GPT-5.3 Codex	173.8	5	T-Bench 77.3, SWE-Lancer-IC 47.3, ...
GPT-5.2	169.7	23	GPQA 92.4, SWE-V 80.0, AIME'25 100, MMMLU 89.6, HLE 34.5, ARC-AGI2 52.9
Opus 4.5 (R)	167.6	10	GPQA 87.0, SWE-V 80.9, MMMLU 90.8, ARC-AGI2 37.6, T-Bench 59.3
Gemini 3 Pro	166.8	18	GPQA 91.9, SWE-V 76.2, AIME'25 100, MMMLU 91.8, HLE 45.8, T-Bench 54.2
GPT-5.1	163.4	9	GPQA 88.1, SWE-V 76.3, AIME'25 94.0
GPT-5	162.2	35	GPQA 85.7, SWE-V 74.9, AIME'25 94.6, MATH 84.7, MMLU 92.5, HLE 24.8
Opus 4.1	150.8	8	GPQA 80.9, SWE-V 74.5, AIME'25 78.0, MMMLU 89.5
o3	150.0	22	GPQA 83.3, SWE-V 69.1, AIME'25 86.4, HLE 14.7, ARC-AGI2 6.5
Opus 4 (R)	149.9	10	GPQA 79.6, SWE-V 72.5, AIME'25 75.5, MMMLU 88.8
Son 3.7	140.1	11	GPQA 84.8, SWE-V 70.3, AIME'25 54.8, MMMLU 86.1
o1	134.6	19	GPQA 78.0, SWE-V 41.0, MMMLU 87.7, MATH 96.4, MMLU 91.8
Son 3.5 (Oct)	130.0	19	GPQA 67.2, SWE-V 49.0, MATH 78.3, MMLU 90.4
Son 3.5 (Jun)	127.0*	0	No self-reported benchmarks (uses third-party IRT fallback)
o1-preview	123.8	8	GPQA 73.3, SWE-V 41.3, MATH 85.5, MMLU 90.8
GPT-4o	117.1	43	GPQA 70.1, SWE-V 33.2, MMMLU 81.4, MATH 76.6, MMLU 85.7, HLE 5.3
Opus 3	112.3	11	GPQA 50.4, MATH 60.1, MMLU 86.8
GPT-4 Turbo	109.2	6	GPQA 48.0, MATH 72.6, MMLU 86.5
GPT-4	84.8	12	GPQA 35.7, MATH 42.0, MMLU 86.4

Full 19×101 matrix available as benchmark_matrix_full.csv. SWE-V = SWE-bench Verified. T-Bench = Terminal-Bench 2.0. *Son 3.5 (Jun) uses third-party IRT (127.0). Benchmark counts vary because different labs report different benchmarks — IRT handles this sparsity by jointly estimating model ability and benchmark difficulty from the overlapping subset. Models with few benchmarks (e.g. GPT-5.3 Codex: 5) have wider IRT uncertainty.

Sources: Claude Mythos Preview System Card (Anthropic, April 7, 2026) • Epoch AI METR Autonomous Task Horizon dataset (v1.1, retrieved March 21, 2026) • Ho et al., "A Rosetta Stone for AI Benchmarks" (arXiv:2410.22832) • SPAR: Self-reported Performance Assessment Research (Donahue, February 2026) • Vladimir Nesov, HN comment on Mythos compute estimates

This article will be updated when METR publishes Claude Mythos Preview's autonomous task horizon score.