Opus 4.6 can complete human tasks that take 12 hours. We estimate Mythos could handle tasks that take 16.
Sam Donahue · April 7, 2026 · Written before METR evaluation of Mythos
METR measures the longest human task an AI model can autonomously complete — writing code, debugging systems, running experiments. It is a measure of task complexity, not runtime: a 12-hour time horizon means the model can solve problems that would take a human roughly 12 hours, not that the model runs for 12 hours. The current leader is Opus 4.6 at ~12 hours. Mythos has no METR score yet.
We regress aggregate capability scores (IRT) against METR time horizons for 19 models. On reasoning-era models (n=14, R²=0.977), the fit predicts Mythos could complete human tasks of roughly 16 hours — a 33% increase over Opus 4.6.
| Method | Estimate (human-task hours) | Source |
|---|---|---|
| IRT regression, post-o1 (n=14) | ~16h (linear & quadratic converge) | This analysis |
| IRT regression, all models (n=19) | 15.9–27.1h | This analysis |
| Alternative fits (power law, sigmoid, piecewise) | 10–17h (4 of 6 fits) | This analysis |
| Individual benchmark ensemble (6 benchmarks) | 10.5h median (5.4–18.1h) | This analysis |
| Anthropic internal task evals | 40h-equiv. on 2/3 of tasks | System Card p. 34 |
| Anthropic qualitative assessment | "Not close" to replacing engineers | System Card p. 45 |
Our estimates cluster around 10–17 hours across methods. Anthropic's internal task evaluations (40h-equivalent) are not directly comparable — they measure speedup on narrow tasks, not sustained autonomous completion of complex work. Their qualitative finding ("not close" to engineer replacement) is consistent with a model that handles day-length tasks but cannot reliably sustain multi-day autonomous work.
METR (Model Evaluation & Threat Research) evaluates AI autonomy by giving models real-world tasks of increasing complexity — coding challenges, system debugging, research experiments. The time horizon is the human-equivalent duration of the most complex task the model can complete. It reports two thresholds: p50, the task length the model completes half the time, and p80, a stricter 80%-success bar. Current p50 scores:
| Model | p50 (task hours) | Release |
|---|---|---|
| Claude Opus 4.6 | ~12 hours | Feb 2026 |
| GPT-5.2 | ~5.9 hours | Dec 2025 |
| GPT-5.3 Codex | ~5.8 hours | Feb 2026 |
| Claude Opus 4.5 | ~4.9 hours | Nov 2025 |
| Gemini 3 Pro | ~3.7 hours | Nov 2025 |
Source: Epoch AI / METR-Horizon-v1.1, retrieved March 21, 2026. Mythos has no METR score yet.
Item Response Theory (IRT) aggregates many benchmark scores into a single ability parameter per model, simultaneously estimating each benchmark's difficulty (Ho et al., as implemented by the Epoch Capabilities Index). We use self-reported IRT (from labs' own benchmark results) because Mythos's score of 186.6 is on that scale. Mixing self-reported with third-party IRT produces unstable extrapolations due to a systematic ~6–11 point gap between the scales.
We fit log(METR minutes) = a + b·IRT + c·IRT² to the 19 models with both METR and IRT scores, testing across three regime cutoffs and six functional forms.
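Under the hood this is a two-line fit. A minimal sketch, using seven of the post-o1 (IRT, p50) pairs from the data appendix; `predict_hours` is an illustrative helper, not the analysis code, and coefficients will differ slightly from the full n=14 fit:

```python
import numpy as np

# (IRT, METR p50 minutes) pairs for seven post-o1 models, taken from the
# data appendix table (o1 through Opus 4.6).
irt = np.array([134.6, 150.0, 162.2, 166.8, 167.6, 169.7, 177.0])
p50_min = np.array([38.8, 119.7, 203.0, 224.3, 293.0, 352.2, 718.8])

# log(METR minutes) = a + b*IRT, fit in log-space; deg=2 adds the c*IRT^2 term.
coeffs = np.polyfit(irt, np.log(p50_min), deg=1)

def predict_hours(irt_score):
    """Predicted METR p50 in human-task hours; extrapolates beyond IRT 177."""
    return float(np.exp(np.polyval(coeffs, irt_score)) / 60.0)

mythos_hours = predict_hours(186.6)
```

Even this seven-point subset lands in the same neighborhood as the full regression (mid-teens of hours for Mythos), which is the point of the robustness checks below.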
Ho et al. ("A Rosetta Stone for AI Benchmarks", Section 3.1.1) established the baseline for this approach. They fit a linear map from their estimated capability score (Cm) to log(time horizon), finding:
R² = 0.85 on a 40% held-out validation set, using a 60/40 train-test split. This outperformed individual benchmarks (median R² = 0.62 across 18 benchmarks; GPQA Diamond alone R² = 0.75). They note that some benchmarks, including SWE-Bench Verified, are "actively detrimental" for predicting time horizons (footnote 8, p. 5).
Our approach differs in three ways: newer data (more frontier models now have METR scores), regime restriction (e.g. post-o1 only), and robustness testing across functional forms and validation schemes:
| Method | R² | Validation | n | Mythos |
|---|---|---|---|---|
| Epoch (Ho et al.) — linear, Cm scale | 0.85 | 40% holdout (1 trial) | ~15–20 | — |
| Ours — linear, all models | 0.946 | LOO: 22% | 19 | 15.9h |
| Ours — linear, all models | 0.909* | 40% holdout (1000 trials) | 19 | 15.4h [12.4–23.4] |
| Ours — linear, post-o1 | 0.977 | LOO: 9% | 14 | 16.4h |
| Ours — linear, post-o1 | 0.937* | 40% holdout (1000 trials) | 14 | 17.8h [14.5–21.5] |
*Median test-set R² over 1000 random 60/40 splits. Brackets = 90% CI on Mythos prediction across splits. Higher R² vs Epoch likely reflects updated data (more frontier models with METR scores) and regime restriction, not a methodological improvement. Epoch's approach is the foundational method; we apply it with newer data and test robustness. See also an EA Forum post applying weighted regression to 8 post-o3 models.
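The repeated-split validation is simple to reproduce in outline. A sketch under the same assumptions, with a seven-model post-o1 subset from the data appendix standing in for the full dataset (so the exact numbers differ from the table):

```python
import numpy as np

# Post-o1 (IRT, p50 minutes) pairs from the data appendix.
irt = np.array([134.6, 150.0, 162.2, 166.8, 167.6, 169.7, 177.0])
log_p50 = np.log([38.8, 119.7, 203.0, 224.3, 293.0, 352.2, 718.8])

rng = np.random.default_rng(0)
test_r2, mythos_pred = [], []
for _ in range(1000):
    idx = rng.permutation(len(irt))
    cut = int(0.6 * len(irt))              # 60/40 train-test split
    tr, te = idx[:cut], idx[cut:]
    c = np.polyfit(irt[tr], log_p50[tr], 1)
    resid = log_p50[te] - np.polyval(c, irt[te])
    ss_tot = np.sum((log_p50[te] - log_p50[te].mean()) ** 2)
    test_r2.append(1 - np.sum(resid ** 2) / ss_tot)
    mythos_pred.append(np.exp(np.polyval(c, 186.6)) / 60)   # hours

median_r2 = float(np.median(test_r2))
ci_lo, ci_hi = np.percentile(mythos_pred, [5, 95])          # 90% CI
```

The spread of `mythos_pred` across splits is what the bracketed intervals in the table summarize.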
Three regimes, two fits each.
| Regime | Fit | n | R² | LOO Error | p50 Prediction | p80 Prediction |
|---|---|---|---|---|---|---|
| All models | Linear | 19 | 0.946 | 22% | 15.9 hours | 2.8 hours |
| All models | Quadratic | 19 | 0.959 | 32% | 27.1 hours | 4.3 hours |
| Post-o1 | Linear | 14 | 0.977 | 9% | 16.4 hours | 2.7 hours |
| Post-o1 | Quadratic | 14 | 0.977 | 11% | 16.1 hours | 1.4 hours |
| Frontier IRT≥130 | Linear | 13 | 0.971 | 10% | 16.8 hours | 2.8 hours |
| Frontier IRT≥130 | Quadratic | 13 | 0.972 | 13% | 14.6 hours | 0.9 hours |
LOO = Leave-one-out cross-validation median absolute percentage error. Post-o1 = models released after o1 (Dec 2024). Frontier = IRT ≥ 130.
Post-o1 regime (n=14): Linear and quadratic converge at ~16h with 9–11% LOO error. On the full dataset (n=19), they diverge (15.9h vs 27.1h) because pre-reasoning-era models pull the quadratic's curvature upward. The post-o1 subset covers a single capability regime, and both fits achieve lower LOO error there.
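The LOO metric in the tables is straightforward to compute. A sketch on a seven-model post-o1 subset from the data appendix (the 9% figure in the table uses all 14 post-o1 models, so this toy version will differ):

```python
import numpy as np

irt = np.array([134.6, 150.0, 162.2, 166.8, 167.6, 169.7, 177.0])
p50 = np.array([38.8, 119.7, 203.0, 224.3, 293.0, 352.2, 718.8])

errors = []
for i in range(len(irt)):
    mask = np.arange(len(irt)) != i             # hold out one model
    c = np.polyfit(irt[mask], np.log(p50[mask]), 1)
    pred = np.exp(np.polyval(c, irt[i]))        # predicted p50, minutes
    errors.append(abs(pred - p50[i]) / p50[i])  # absolute percentage error

loo_mape = float(np.median(errors))             # median, per the table note
```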
We also tested power law, sigmoid, cubic, and piecewise linear fits on the full n=19 dataset. The predictions cluster into two groups:
| Fit Type | Params | R² | LOO | Mythos p50 |
|---|---|---|---|---|
| Linear (log-space) | 2 | 0.946 | 22% | 15.9h |
| Power law | 2 | 0.906 | 17% | 10.1h |
| Sigmoid (log-space) | 3 | 0.970 | 26% | 11.8h |
| Piecewise linear (break=130) | 4 | 0.972 | 18% | 16.8h |
| Quadratic (log-space) | 3 | 0.959 | 32% | 27.1h |
| Cubic (log-space) | 4 | 0.978 | 23% | 7.7h |
Four of six fits predict 10–17 hours. The quadratic (27h) is pulled up by curvature from older models. The cubic (7.7h) overfits and curves back down. Sigmoid and piecewise fits, which allow the functional form to change, land at 12–17 hours.
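The power-law row, for instance, is just a linear fit in log-log space. A sketch on the post-o1 subset from the data appendix (so the result here will not match the full-dataset 10.1h figure above):

```python
import numpy as np

irt = np.array([134.6, 150.0, 162.2, 166.8, 167.6, 169.7, 177.0])
p50_min = np.array([38.8, 119.7, 203.0, 224.3, 293.0, 352.2, 718.8])

# Power law: minutes = A * IRT^b, i.e. log(minutes) linear in log(IRT).
coeffs = np.polyfit(np.log(irt), np.log(p50_min), 1)
power_pred_hours = float(np.exp(np.polyval(coeffs, np.log(186.6))) / 60)
```

Over the narrow post-o1 IRT range the power law and the log-linear fit are nearly indistinguishable; they only diverge when extrapolating across the full dataset.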
As a robustness check, we regress each benchmark individually against METR p50 (univariate log-linear fits), then predict Mythos from its system card score. Of the 27 self-reported benchmarks with ≥4 METR models reporting, only 6 also have Mythos scores available — a data coverage limitation, not a methodological one.
We also attempted multivariate Ridge regression across all available benchmarks. With only 7 overlapping features and 19 data points, the model extrapolates unstably (Mythos scores exceed the training range on every benchmark). The univariate ensemble below is more robust.
| Benchmark | n | R² | LOO | Mythos | Data max | → p50 | Ref |
|---|---|---|---|---|---|---|---|
| BrowseComp | 4 | 0.968 | 28% | 86.9% | 84.0% | 14.9h | p. 191 |
| SWE-bench Verified | 14 | 0.856 | 40% | 93.9% | 80.9% | 11.3h | p. 187 |
| GPQA Diamond | 17 | 0.846 | 33% | 94.5% | 92.4% | 5.4h | p. 189 |
| MMMLU | 9 | 0.835 | 56% | 92.7% | 91.8% | 9.7h | p. 189 |
| HLE (no tools) | 6 | 0.712 | 65% | 56.8% | 53.1% | 18.1h | p. 191 |
| Terminal-Bench 2.0 | 4 | 0.173 | 73% | 82.0% | 77.3% | 8.7h | p. 188 |
Univariate log-linear fits. Mythos scores clipped to 115% of data max to limit extrapolation. Sorted by R². Terminal-Bench 2.0 is included for completeness but fits poorly (R² = 0.17, only 4 models) and should be discounted. All predictions involve extrapolation.
Median across 6 benchmarks: 10.5 hours. R² > 0.3 subset (5 benchmarks): median 11.3 hours. These are lower than the IRT-based ~16h and have higher LOO errors (28–65% vs 9–18%). Individual benchmarks are noisier predictors and the extrapolation is more severe (Mythos exceeds every benchmark's data max). IRT aggregation compresses this noise, which is why the IRT-based estimates have better LOO.
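Each row of the table above is one of these univariate fits. A sketch for SWE-bench Verified, using eight models from the data appendix that report both numbers; the 115% clip is the same rule as in the table note:

```python
import numpy as np

# SWE-bench Verified score vs METR p50 minutes, from the data appendix
# (o1, Sonnet 3.5 Oct, o3, Opus 4, Opus 4.1, Gemini 3 Pro, Opus 4.6, Opus 4.5).
swe = np.array([41.0, 49.0, 69.1, 72.5, 74.5, 76.2, 80.8, 80.9])
p50 = np.array([38.8, 20.5, 119.7, 100.4, 100.5, 224.3, 718.8, 293.0])

c = np.polyfit(swe, np.log(p50), 1)

# Clip Mythos's 93.9 score to 115% of the observed max (80.9) before predicting.
mythos_score = min(93.9, 1.15 * swe.max())
pred_hours = float(np.exp(np.polyval(c, mythos_score)) / 60)
```

The median over the six such fits is the ensemble estimate reported above.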
No METR score, but Anthropic reports internal autonomy evaluations and their own IRT trajectory.
Anthropic's internal suite tests AI R&D capabilities with hour-equivalent thresholds (System Card Table 2.3.3.A, p. 34):
| Task | Opus 4.5 | Opus 4.6 | Mythos | Threshold |
|---|---|---|---|---|
| Kernel task (speedup) | 252× | 190× | 399× | 300× = 40h eq. |
| Time Series (MSE) | 5.71 | 5.80 | 4.55 | <5.3 = 40h eq. |
| LLM Training (speedup) | 16.5× | 34× | 51.9× | >4× = 4–8h eq. |
| Quadruped RL (score) | 19.48 | 20.96 | 30.87 | >12 = 4h eq. |
| Novel Compiler (%) | 69.4% | 65.8% | 77.2% | 90% = 40h eq. |
Source: Claude Mythos Preview System Card, Table 2.3.3.A, p. 34.
Note: These are task-specific hour-equivalents on narrow evaluations, not METR's measure of sustained autonomous work across diverse tasks. The two metrics are not directly comparable.
For the first time, Anthropic published their own IRT-based capability tracking (System Card Section 2.3.6, pp. 40–42), using the same method as Epoch AI's public ECI but with ~300 models and "hundreds of benchmarks, mostly internal." Two findings are directly relevant:
Benchmark saturation at the frontier (Figure 2.3.6.A, p. 40). Most benchmarks in their IRT fit cluster below ECI ~175. Very few exist at Mythos's level (~190 on their internal scale). Anthropic states: "The ECI is only as good as the underlying dataset, and there are currently few benchmarks at Claude Mythos Preview's current capability level to tightly calibrate its ECI score." This applies equally to our prediction: if Mythos's IRT score of 186.6 has wide uncertainty due to benchmark scarcity at the frontier, then our METR extrapolation inherits that uncertainty.
Accelerating capability trajectory (Figure 2.3.6.B, p. 42). The Anthropic frontier from Claude 3 Opus (~118 internal ECI, Jan 2024) to Mythos (~190, Apr 2026) shows a two-phase linear fit with slope ratio 1.86×–4.3× depending on breakpoint. Mythos "appears to be above the pre-Mythos Preview trend, although its error bars are quite large." They caution: "we do not know if this trend will continue with future models."
Reconciling their ECI with our IRT: Anthropic's internal ECI of ~190 and our self-reported IRT of 186.6 are not directly comparable — different benchmark sets, different anchoring. But both place Mythos roughly 10–15 points above Opus 4.6 on their respective scales, which is the gap that drives our METR prediction. The relative gap matters more than the absolute number.
Three other details from this section that are easy to miss:
Task-level performance and sustained autonomy diverge.
Anthropic's internal survey (n=18, p. 35): 1/18 thought Mythos was a drop-in for an entry-level Research Scientist. Their conclusion (p. 45): "Claude Mythos Preview does not seem close to being able to substitute for Research Scientists and Research Engineers, especially relatively senior ones."
Documented failure modes (pp. 35–39): confabulation of results, unproductive grinding on tasks, and factual errors.
These failure modes — confabulation, grinding, factual errors — are the kinds of behaviors that degrade sustained autonomous performance. METR's time horizon measures exactly this: coherent multi-step work over extended periods. The gap between narrow task scores (40h equivalent) and qualitative assessment ("not close" to engineer replacement) is consistent with a p50 time horizon in the 10–20h range.
| Method | Estimate | Source |
|---|---|---|
| IRT regression, post-o1 (n=14) | ~16h (linear & quadratic converge) | This analysis |
| IRT regression, all models (n=19) | 15.9–27.1h (linear–quadratic) | This analysis |
| Alternative fits (power, sigmoid, piecewise) | 10–17h (4 of 6 fits) | This analysis |
| Univariate benchmark ensemble (6 benchmarks) | 10.5h median (5.4–18.1h range) | This analysis |
| Anthropic's internal task evals | 40h equiv. on 2/3 of tasks | System Card p. 34 |
| Anthropic's qualitative assessment | "Not close" to engineer replacement | System Card p. 45 |
Our regression-based estimates cluster around 10–17 hours, with the tightest fit (post-o1, R²=0.977) converging at ~16h. Anthropic's internal task evaluations are not directly comparable to METR, and their qualitative assessment is consistent with a model that can sustain autonomous work for hours but not reliably for days. METR's actual evaluation, when published, will determine accuracy.
Also released this week (April 8, 2026), Meta's Muse Spark sits at IRT ~168 — nearly identical to Opus 4.5 (167.6) and 19 points below Mythos (186.6). Our model predicts ~5 task-hours for Muse Spark, vs ~16 for Mythos.
On aggregate, Muse Spark is competitive: it beats Opus 4.6 on 10 of 19 benchmarks, with large wins on multimodal (CharXiv +21.1, ERQA +13.1), health (HealthBench Hard +28.0), and competitive coding (LiveCodeBench Pro +9.3). But it trails on the benchmarks most predictive of autonomous work:
| Agentic benchmark | Muse Spark | Opus 4.6 | Gemini 3.1 | GPT-5.4 | Mythos |
|---|---|---|---|---|---|
| SWE-bench Verified | 77.4 | 80.8 | 80.6 | — | 93.9 |
| SWE-bench Pro | 52.4 | 53.4 | 54.2 | 57.7 | 77.8 |
| Terminal-Bench 2.0 | 59.0 | 65.4 | 68.5 | 75.1 | 82.0 |
| DeepSearchQA | 74.8 | 73.7 | 69.7 | 73.6 | — |
| ARC AGI 2 | 42.5 | 63.3 | 76.5 | 76.1 | — |
| Model | IRT | Predicted METR p50 | Actual METR (if known) |
|---|---|---|---|
| Mythos Preview | 186.6 | ~16h | TBD |
| Opus 4.6 | 177.0 | ~9h* | 12h (actual) |
| Muse Spark (Thinking) | 167.7 | ~5h | TBD |
| Opus 4.5 | 167.6 | ~5h | 4.9h (actual) |
| Gemini 3 Pro | 166.8 | ~5h | 3.7h (actual) |
*Opus 4.6 actual METR is 12h; model underpredicts by ~25%. Muse Spark benchmarks from Meta's release (April 8, 2026). Frontier model benchmarks from respective system cards / announcements.
The IRT of 168 is an aggregate that averages Spark's multimodal strengths with its agentic weaknesses. For METR prediction, the agentic benchmarks matter most, and on those Spark trails Opus 4.6 by 3–6 points. At IRT 168, our model places it near Opus 4.5 (actual METR: 4.9h) and Gemini 3 Pro (actual: 3.7h) — both of which would validate a ~5h prediction. The 19-point IRT gap to Mythos (168 → 187) translates to a predicted 3× difference in autonomous task complexity.
p80 measures the human-task duration a model completes successfully 80% of the time — a stricter bar than the median (p50).
As a robustness check, we regress each benchmark individually against METR (univariate log-linear fits). Only 6 of 27 benchmarks have both METR model coverage (≥4) and a Mythos score.
Individual benchmarks are noisier (LOO 28–65% vs 9–11% for IRT) and Mythos exceeds the data maximum on all six, making every prediction an extrapolation. IRT aggregation compresses this noise, which is why IRT-based estimates have lower cross-validation error.
IRT 130 (Claude 3.5 Sonnet) → IRT 177 (Opus 4.6): +47 points, task complexity grew from ~20 min to ~12 hours (36×).
IRT 177 (Opus 4.6) → IRT 187 (Mythos): +10 points, predicted ~12h → ~16h (1.3×).
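A quick arithmetic check on these two segments, using only the figures quoted above:

```python
import math

# 36x growth over 47 IRT points implies one doubling of task complexity
# roughly every 9 IRT points.
points_per_doubling = 47 / math.log2(36)

# At that rate, Mythos's +10 points over Opus 4.6 would imply ~2.1x, not 1.3x.
# The difference: 1.3x divides the fitted ~16h by Opus 4.6's ACTUAL 12h score,
# which sits above the fitted line (the model predicts ~9h for Opus 4.6).
implied_ratio = 2 ** (10 / points_per_doubling)
```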
On the full dataset, linear and quadratic diverge (15.9h vs 27.1h). On post-o1 only, they converge (both ~16h). The data does not clearly distinguish functional forms in this regime — more frontier model evaluations will resolve whether the relationship remains log-linear or accelerates.
Regression: log(METR minutes) = a + b·IRT + c·IRT², fit via numpy.polyfit (degree 1 and 2) in log-space.

IRT computation: scipy.optimize.least_squares with an MMLU-Pro anchor, L2 regularization (0.1), and a minimum of 3 benchmarks per model. Self-reported IRT preferred; third-party IRT used as fallback for Claude 3.5 Sonnet June '24 only (127.0).

Data table: every model used in the regression, sorted by IRT score. IRT source: SR = self-reported, 3P = third-party fallback. METR CIs from METR-Horizon-v1.1. All times in minutes unless labeled hours.
| Model | IRT | Src | p50 (min) | p50 | p50 CI | p80 (min) | p80 | Release |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 177.0 | SR | 718.8 | 12.0h | [319, 3950] | 69.9 | 1.2h | 2026-02-05 |
| GPT-5.3 Codex | 173.8 | SR | 349.5 | 5.8h | [192, 858] | 54.7 | 0.9h | 2026-02-05 |
| GPT-5.2 | 169.7 | SR | 352.2 | 5.9h | [191, 862] | 66.0 | 1.1h | 2025-12-11 |
| Claude Opus 4.5 (R) | 167.6 | SR | 293.0 | 4.9h | [161, 639] | 49.4 | 0.8h | 2025-11-24 |
| Gemini 3 Pro | 166.8 | SR | 224.3 | 3.7h | [137, 387] | 54.1 | 0.9h | 2025-11-18 |
| GPT-5.1 | 163.4 | SR | 223.7 | 3.7h | [135, 395] | 50.6 | 0.8h | 2025-11-19 |
| GPT-5 | 162.2 | SR | 203.0 | 3.4h | [114, 407] | 38.3 | 0.6h | 2025-08-07 |
| Claude 4.1 Opus | 150.8 | SR | 100.5 | 1.7h | [60, 158] | 23.5 | 0.4h | 2025-08-05 |
| o3 | 150.0 | SR | 119.7 | 2.0h | [73, 192] | 30.0 | 0.5h | 2025-04-16 |
| Claude 4 Opus (R) | 149.9 | SR | 100.4 | 1.7h | [60, 163] | 20.4 | 0.3h | 2025-05-22 |
| Claude 3.7 Sonnet | 140.1 | SR | 60.4 | 1.0h | [33, 107] | 12.1 | 0.2h | 2025-02-24 |
| o1 | 134.6 | SR | 38.8 | 39m | [22, 67] | 7.1 | 7m | 2024-12-05 |
| Claude 3.5 Sonnet (Oct) | 130.0 | SR | 20.5 | 21m | [10, 40] | 2.6 | 3m | 2024-10-22 |
| Claude 3.5 Sonnet (Jun) | 127.0 | 3P | 11.4 | 11m | [5, 23] | 1.7 | 2m | 2024-06-20 |
| o1-preview | 123.8 | SR | 20.3 | 20m | [12, 33] | 4.4 | 4m | 2024-09-12 |
| GPT-4o | 117.1 | SR | 7.0 | 7m | [4, 12] | 1.3 | 1m | 2024-05-13 |
| Claude 3 Opus | 112.3 | SR | 4.0 | 4m | [2, 9] | 0.6 | <1m | 2024-03-04 |
| GPT-4 Turbo | 109.2 | SR | 4.0 | 4m | [2, 8] | 0.8 | <1m | 2023-11-06 |
| GPT-4 | 84.8 | SR | 4.0 | 4m | [2, 8] | 0.9 | <1m | 2023-03-14 |
Data: METR-Horizon-v1.1 YAML matched to SPAR master dataset IRT scores. CSV available at final_matched_metr_irt.csv.
The IRT score for each model is computed from the self-reported benchmark scores below, plus others not shown (models have 5–43 total benchmarks each). The matrix is sparse: no single benchmark covers all 19 models, and GPQA (17/19) has the best coverage. IRT tolerates this sparsity by design. Mythos row shown for reference.
| Model | IRT | GPQA | SWE-V | AIME25 | MMMLU | MATH | MMLU | HLE | ARC-AGI2 | T-Bench | #SR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mythos Preview | 186.6 | 94.6 | 93.9 | — | 92.7 | — | — | 56.8 | — | 82.0 | 10 |
| Opus 4.6 | 177.0 | 91.3 | 80.8 | 99.8 | 91.1 | — | — | 53.1 | 68.8 | 65.4 | 26 |
| GPT-5.3 Codex | 173.8 | — | — | — | — | — | — | — | — | 77.3 | 5 |
| GPT-5.2 | 169.7 | 92.4 | 80.0 | 100.0 | 89.6 | — | — | 34.5 | 52.9 | — | 23 |
| Opus 4.5 (R) | 167.6 | 87.0 | 80.9 | — | 90.8 | — | — | — | 37.6 | 59.3 | 10 |
| Gemini 3 Pro | 166.8 | 91.9 | 76.2 | 100.0 | 91.8 | — | — | 45.8 | 31.1 | 54.2 | 18 |
| GPT-5.1 | 163.4 | 88.1 | 76.3 | 94.0 | — | — | — | — | — | — | 9 |
| GPT-5 | 162.2 | 85.7 | 74.9 | 94.6 | — | 84.7 | 92.5 | 24.8 | — | — | 35 |
| Opus 4.1 | 150.8 | 80.9 | 74.5 | 78.0 | 89.5 | — | — | — | — | — | 8 |
| o3 | 150.0 | 83.3 | 69.1 | 86.4 | — | — | — | 14.7 | 6.5 | — | 22 |
| Opus 4 (R) | 149.9 | 79.6 | 72.5 | 75.5 | 88.8 | — | — | — | 8.6 | — | 10 |
| Son 3.7 | 140.1 | 84.8 | 70.3 | 54.8 | 86.1 | — | — | — | — | — | 11 |
| o1 | 134.6 | 78.0 | 41.0 | — | 87.7 | 96.4 | 91.8 | — | — | — | 19 |
| Son 3.5 (Oct) | 130.0 | 67.2 | 49.0 | — | — | 78.3 | 90.4 | — | — | — | 19 |
| Son 3.5 (Jun) | 127.0* | — | — | — | — | — | — | — | — | — | 0 |
| o1-preview | 123.8 | 73.3 | 41.3 | — | — | 85.5 | 90.8 | — | — | — | 8 |
| GPT-4o | 117.1 | 70.1 | 33.2 | — | 81.4 | 76.6 | 85.7 | 5.3 | — | — | 43 |
| Opus 3 | 112.3 | 50.4 | — | — | — | 60.1 | 86.8 | — | — | — | 11 |
| GPT-4 Turbo | 109.2 | 48.0 | — | — | — | 72.6 | 86.5 | — | — | — | 6 |
| GPT-4 | 84.8 | 35.7 | — | — | — | 42.0 | 86.4 | — | — | — | 12 |
*Claude 3.5 Sonnet (Jun) uses third-party IRT (127.0) as fallback — no self-reported benchmarks available. #SR = total self-reported benchmarks feeding the IRT computation (including those not shown). SWE-V = SWE-bench Verified. T-Bench = Terminal-Bench 2.0. Dashes = not reported by the model's developer. IRT is computed from ALL available benchmarks per model, not just these 9 columns.
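The joint IRT fit over a sparse score matrix like this one can be sketched in miniature. A toy version under simplified assumptions: one ability per model, one difficulty per benchmark, scores modeled as a sigmoid of their difference, and a soft anchor standing in for the MMLU-Pro anchoring; the scores and scale below are illustrative, not the real pipeline:

```python
import numpy as np
from scipy.optimize import least_squares

scores = np.array([  # rows = models, cols = benchmarks, NaN = not reported
    [0.80, 0.60, np.nan],
    [0.90, 0.75, 0.40],
    [np.nan, 0.85, 0.55],
])
obs = ~np.isnan(scores)
n_m, n_b = scores.shape

def residuals(params, lam=0.1):
    a, d = params[:n_m], params[n_m:]
    # Predicted score: sigmoid(ability - difficulty), only at observed cells.
    pred = 1 / (1 + np.exp(-(a[:, None] - d[None, :])))
    res = (pred - scores)[obs]
    # L2 regularization plus a soft anchor pinning benchmark 0's difficulty.
    return np.concatenate([res, lam * params, [10 * d[0]]])

fit = least_squares(residuals, x0=np.zeros(n_m + n_b))
abilities = fit.x[:n_m]  # higher = more capable model
```

Because abilities and difficulties are fit jointly, a model missing some benchmarks is still placed on the common scale through the benchmarks it shares with others, which is how the sparse matrix above yields one score per model.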