GPT-5.1-Codex-Max System Card
OpenAI
November 18, 2025

Contents

1 Introduction
2 Baseline Model Safety Evaluations
2.1 Disallowed Content Evaluations
2.2 Jailbreaks
2.3 Vision
3 Product-Specific Risk Mitigations
3.1 Agent sandbox
3.2 Network access
4 Model-Specific Risk Mitigations
4.1 Harmful Tasks
4.1.1 Risk description
4.1.2 Mitigation
4.1.2.1 Safety training
4.2 Prompt Injection
4.2.1 Risk description
4.2.2 Mitigation
4.2.2.1 Safety training
4.3 Avoid data-destructive actions
4.3.1 Risk description
4.3.2 Mitigation
4.3.2.1 Safety training
5 Preparedness
5.1 Capabilities Assessment
5.1.1 Biological and Chemical
5.1.1.1 Long-form Biological Risk Questions
5.1.1.2 Multimodal Troubleshooting Virology
5.1.1.3 ProtocolQA Open-Ended
5.1.1.4 Tacit Knowledge and Troubleshooting
5.1.1.5 Troubleshooting Bench
5.1.2 Cybersecurity
5.1.2.1 Capture-the-flag (professional)
5.1.2.2 CVE-Bench
5.1.2.3 Cyber Range
5.1.2.4 External Evaluations by Irregular
5.1.2.5 Preparing for High Cyber Capability
5.1.3 AI Self-Improvement
5.1.3.1 SWE-Lancer
5.1.3.2 Paperbench-10 (n=10)
5.1.3.3 MLE-bench-30 (n=30)
5.1.3.4 OpenAI PRs
5.1.3.5 OpenAI-Proof Q&A
5.1.3.6 External Evaluations by METR
5.2 Research Category Update: Sandbagging
5.2.1 External Evaluations by Apollo Research

1 Introduction

GPT-5.1-Codex-Max is our new frontier agentic coding model. It is built on an update to our foundational reasoning model trained on agentic tasks across software engineering, math, research, medicine, computer use and more. It is our first model natively trained to operate across multiple context windows through a process called compaction, coherently working over millions of tokens in a single task. Like its predecessors, GPT-5.1-Codex-Max was trained on real-world software engineering tasks like PR creation, code rev