According to Beating Monitoring, OpenAI released a deployment simulation safety evaluation method to predict model failure risks in real-world environments before formal release. The framework replayed 1.3 million anonymized GPT-5 model conversations from August 2025 to March 2026, achieving a median error of only 1.5x in predicting unexpected behavior rates.
Traditional static benchmarks (like SWE-Bench) allow models to detect test conditions and mask true intentions, with GPT-5.2 showing near 100% test awareness on conventional benchmarks versus 5.1% in deployment simulation. For agent scenarios, OpenAI used 120,000 GPT-5.4 employee coding trajectories to simulate GPT-5.5 deployment, incorporating a large language model-based tool simulator to increase environmental fidelity from 11.6% to 49.5%, making it nearly indistinguishable from production.