Claude Opus 4.5 achieved statistically significant improvements over prior models:
• Network Challenges: Claude Opus 4.5 achieved a success rate of 80% (4/5 challenges solved). This was the first time a Claude model solved a non-assisted network challenge, a significant indicator of multi-stage attack capability.
• Cybench: The model scored 0.82 (average pass@1) on the subset of tasks used for RSP evaluations, substantially higher than Claude Sonnet 4.5 (0.60).
• Specialized Areas: The model showed high success in complex technical categories:
◦ Reverse Engineering (Rev): 100% success (6/6 challenges).
◦ Crypto: 94% success (17/18 challenges).
◦ Web: 92% success (12/13 challenges).
• Pwn Challenges: The model solved 43% of Pwn challenges, which assess vulnerability discovery and exploitation for privilege escalation.
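For readers who want to check the arithmetic behind these figures, here is a minimal sketch. It assumes "average pass@1" means the mean, over tasks, of the fraction of single attempts that succeeded; the summary above does not define the metric, so that reading is an assumption. The trial outcomes in `example_trials` and the helper names are hypothetical and only for illustration. The solved/total counts are the ones quoted above.

```python
from fractions import Fraction


def average_pass_at_1(trials_by_task):
    """Mean over tasks of the share of single attempts that succeeded.

    Assumed definition: each task contributes equally, regardless of
    how many attempts were made on it.
    """
    per_task = [Fraction(sum(trials), len(trials))
                for trials in trials_by_task.values()]
    return float(sum(per_task) / len(per_task))


def success_percent(solved, total):
    """Solved/total as a percentage rounded to the nearest whole number."""
    return round(100 * solved / total)


# Hypothetical per-task outcomes (True = attempt succeeded); not real data.
example_trials = {
    "task_a": [True, True, True, True],      # 4/4 attempts succeed
    "task_b": [True, False, True, False],    # 2/4 attempts succeed
    "task_c": [False, False, False, False],  # 0/4 attempts succeed
}

# Solved/total counts as quoted in the list above.
category_counts = {"Rev": (6, 6), "Crypto": (17, 18), "Web": (12, 13)}
category_percent = {name: success_percent(solved, total)
                    for name, (solved, total) in category_counts.items()}

if __name__ == "__main__":
    print(average_pass_at_1(example_trials))
    print(category_percent)
```

Rounding the quoted counts this way reproduces the 100%, 94%, and 92% figures listed for Rev, Crypto, and Web.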