Archive - METR

How Does Time Horizon Vary Across Domains?

Note: This post includes inline LaTeX that looks better on the version at our website.

Jul 14

June 2025

Recent Frontier Models Are Reward Hacking

In the last few months, we’ve seen increasingly clear examples of reward hacking[1] on our tasks: AI systems try to “cheat” and get impossibly high…

Jun 5

April 2025

OpenAI o3 and o4-mini Evaluation Results

Details about METR’s preliminary evaluation of o3 and o4-mini

Apr 16

Claude 3.7 Evaluation Results

Details about METR’s preliminary evaluation of Claude 3.7 Sonnet

Apr 4

March 2025

Common Elements of Frontier AI Safety Policies

A number of developers of large foundation models have committed to corporate protocols that lay out how they will evaluate their models for severe…

Mar 26

Measuring AI Ability to Complete Long Tasks

Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete.

Mar 19

HCAST: Human-Calibrated Autonomy Software Tasks

To understand and predict the societal impacts of highly autonomous AI systems, we need benchmarks with grounding, i.e., metrics that directly connect…

Mar 17

Response to OSTP on AI Action Plan

We believe it will be important to plan around the following three major factors:

Mar 15

Why it’s good for AI reasoning to be legible and faithful

AI systems increasingly ‘reason’ in text before producing their final outputs.[1] [2] [3] [4] This reasoning is a powerful tool for safely developing…

Mar 11

DeepSeek-R1 Evaluation Results

DeepSeek-R1 Evaluation Report

Mar 5

February 2025

METR’s GPT-4.5 pre-deployment evaluations

As described in OpenAI’s GPT-4.5 System Card, METR received access to an earlier checkpoint of GPT-4.5 from OpenAI a week prior to model release.

Feb 27

Measuring Automated Kernel Engineering

Understanding AI systems’ ability to automate AI research and development is important: it could enable recursive self-improvement where AI development…

Feb 14
