Refact.ai Agent achieved a new top score of 74.4% on SWE-bench Verified — the most widely used benchmark for evaluating AI agents in software engineering. By autonomously solving 372 out of 500 tasks, it now holds the best result on the leaderboard — and the highest score among open-source programming agents.
At the core of this run is Anthropic’s Claude 4 Sonnet model, which provided a significant boost in reasoning and coding capability. The new score surpasses our previous best — 70.4% with Claude 3.7 Sonnet — which was the #1 open-source result on the official SWE-bench leaderboard until now.
This milestone builds on our earlier SWE-bench efforts. In our previous blog post, we detailed the AI Agent setup that helped achieve a 70.4% score — and prior to that, a run that made Refact.ai the #1 AI Agent on SWE-bench Lite. The lessons learned there, plus the upgrades listed below, made this new SOTA score possible.
The full SWE-bench pipeline we used is open-source and fully reproducible end to end. Our mission is to empower developers with an autonomous AI that amplifies their capabilities. We believe the future of programming is open-source — transparent, customizable, and community-driven — with AI tools built by developers, for developers.
Our latest result isn’t just a number — it shows that Refact.ai Agent is a reliable autonomous programming partner. It handles real dev tasks end-to-end with precision, while you can review and guide every step. Start using Refact.ai today in VS Code or JetBrains to speed up your development workflow by offloading everyday tasks to an AI Agent that’s worth your trust.
Our AI Agent uses a deliberate approach focused on step-by-step problem solving and reliability. Key elements of our SWE-bench Verified setup in this run included a custom debug_script() sub-agent that fixes bugs and can modify/create new scripts.
How does Refact.ai Agent solve the SWE-bench Verified tasks? It follows a four-step strategy defined in its system prompt.
The Agent starts by exploring the problem: it uses tools like cat() to open files, and search_symbol_definition(), search_pattern(), etc. to locate relevant code. The Agent also uses compress_session(), ensuring it gathers the right context before attempting any changes.
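To make the exploration step concrete, here is a minimal sketch of what such tools could look like; these are simplified stand-ins written for this post, not the actual Refact.ai implementations:

```python
import re
from pathlib import Path

def cat(path: str, max_bytes: int = 50_000) -> str:
    """Return a file's contents, truncated so a single read can't flood the context window."""
    return Path(path).read_text(errors="replace")[:max_bytes]

def search_pattern(root: str, pattern: str) -> list[tuple[str, int, str]]:
    """Grep-style scan: returns (file, line number, line) for every regex match under root."""
    rx = re.compile(pattern)
    hits = []
    for f in Path(root).rglob("*.py"):
        for i, line in enumerate(f.read_text(errors="replace").splitlines(), start=1):
            if rx.search(line):
                hits.append((str(f), i, line.strip()))
    return hits

def search_symbol_definition(root: str, symbol: str) -> list[tuple[str, int, str]]:
    """Locate 'def <symbol>' or 'class <symbol>' definitions in the project."""
    return search_pattern(root, rf"^\s*(def|class)\s+{re.escape(symbol)}\b")
```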
At step two, the Agent reproduces the issue. It runs all existing tests to ensure a clean baseline, writes a script that triggers the bug (covering all possible edge cases), sets up the environment, and runs the script via shell("python ...") to confirm the failure. Then debug_script() takes over — a custom sub-agent that uses pdb to debug, modify, and generate scripts. Powered by Claude 4 with o4-mini for summarizing the debug info, it is called at least once — and up to three times — per task. In practice, it was really helpful for digging into the source of a problem.
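To illustrate the kind of plumbing such a sub-agent needs, here is a hedged sketch that runs a reproduction script under Python's built-in pdb and captures the trace; the function and its interface are hypothetical, not the real debug_script() implementation:

```python
import subprocess

def run_under_pdb(script_path: str, breakpoints: list[str], timeout: int = 300) -> str:
    """Run a reproduction script under pdb and return the captured debugger output.

    `breakpoints` are pdb break targets such as "pkg/module.py:42" or "module.function".
    """
    cmds: list[str] = []
    for bp in breakpoints:
        cmds += ["-c", f"break {bp}"]
    cmds += ["-c", "continue"]  # run until the first breakpoint or the crash
    proc = subprocess.run(
        ["python", "-m", "pdb", *cmds, script_path],
        capture_output=True, text=True, timeout=timeout,
        stdin=subprocess.DEVNULL,  # EOF at the (Pdb) prompt ends the session instead of hanging
    )
    # The combined output (stack frames, printed locals, the traceback itself)
    # is the raw material a summarizer like o4-mini would condense.
    return proc.stdout + proc.stderr
```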
Once debugging is complete, the Agent plans and applies the fix based on the debugging report. It updates project files directly, without creating patches and diffs. In the earlier run, this step used a separate strategic_planning() tool. With Claude 4 Sonnet, that's no longer needed — the model's reasoning is strong enough to handle this job on its own. Finally, the Agent checks its work: it re-runs the reproduction script and the project's existing tests to validate the fix. If all tests pass, it uses compress_session() to offload any debug or temporary files and optimize context usage before ending the run.
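A minimal sketch of that final check, assuming a pytest-based project (the actual test command differs per repository), might look like this:

```python
import subprocess

def fix_is_valid(repro_script: str, test_cmd=("python", "-m", "pytest", "-x", "-q")) -> bool:
    """Final validation: the reproduction script and the project's own test suite must both pass."""
    repro = subprocess.run(["python", repro_script], capture_output=True, text=True)
    if repro.returncode != 0:
        return False  # the original bug (or a new one) still triggers
    tests = subprocess.run(list(test_cmd), capture_output=True, text=True)
    return tests.returncode == 0
```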
Throughout the run, automatic guardrails help keep the Agent on track. These are mid-run messages, inserted into the chat as if from a simulated “user” when the model gets stuck or makes mistakes. A script statically monitors Claude 4’s outputs, and when needed, injects messages to guide the model back on course. For example, it may remind the model to open all visited files after debug_script(), or to follow correct implementation rules after planning. These small actions make a big difference in stability.
The entire run is fully autonomous: no manual inputs, no retries. Each task runs in a single session, with the Agent self-correcting and managing context to stay efficient and produce a single correct solution.
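As a rough illustration (our own hypothetical example, not Refact.ai's actual monitoring script), one such rule could compare the files named in the debugging report against the files the Agent has opened, and inject a reminder when some were never opened:

```python
def maybe_inject_guardrail(messages: list[dict], opened_files: set[str], debug_files: set[str]) -> None:
    """Append a simulated 'user' message if the Agent skipped opening files it debugged."""
    missing = debug_files - opened_files
    if missing:
        messages.append({
            "role": "user",  # injected by the monitoring script, not typed by a human
            "content": "Before editing, open the files visited during debugging: "
                       + ", ".join(sorted(missing)),
        })
```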
Several upgrades helped push the agent from 70.4% to 74.4%:
- Removed strategic_planning(): Previously, this tool (powered by o3) reasoned over debug_script() output and modified files. This is now fully handled by Claude 4 Sonnet.
- Limited cat() on folders: Opening entire folders at once could lead to context overflow. We've added a limit: if a folder contains more than 5 files, the Agent returns an error and asks for one-by-one access (“Too many files were requested. Please open files one by one”), as sketched below.
- Improved search_pattern().
- Refined debug_script() prompt.
All these improvements work together to make Refact.ai Agent more robust and efficient. Moving to Claude 4 Sonnet significantly boosted reasoning ability and allowed us to simplify the agent's loop while still solving more tasks. Meanwhile, the debug sub-agent and guardrails have been enhanced to ensure greater reliability throughout each run.
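For the cat() limit mentioned in the list above, the guard could look roughly like this minimal sketch (the helper name is hypothetical; the real tool lives in the open-source pipeline):

```python
from pathlib import Path

MAX_FILES = 5  # the per-folder limit described above

def cat_folder(path: str) -> str:
    """Read every file in a folder, refusing when the folder is too large for one call."""
    files = sorted(p for p in Path(path).iterdir() if p.is_file())
    if len(files) > MAX_FILES:
        return "Too many files were requested. Please open files one by one"
    return "\n\n".join(f"=== {p.name} ===\n{p.read_text(errors='replace')}" for p in files)
```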
The new #1 SWE-bench result shows the rapid progress of Refact.ai Agent. Ultimately, our focus isn’t only on benchmark scores; it’s on building an AI agent that truly works for real developers. The lessons learned and improvements made for SWE-bench are already finding their way into the product. That means when you use Refact.ai, you’re benefiting from the engineering approach that achieved this benchmark record.
Refact.ai Agent is an AI Agent for software engineering you can trust — and guide when needed. Autonomous when you want it, collaborative when you step in.
If you’re ready to work with an AI that understands your environment, works across your tools, and earns your trust one task at a time — Refact.ai is ready for you. Join our community, see what real developers are building end-to-end, and start programming with Refact.ai Agent today.