Refact.ai Agent has achieved the #1 score on SWE-bench Lite, solving 179 out of 300 tasks for a 59.7% success rate. This result puts Refact.ai at the top of the leaderboard and makes it the best-performing open-source AI agent on SWE-bench Lite to date.
SWE-bench Lite is a benchmark that evaluates LLM-based systems on real GitHub issues from popular open-source Python projects. Each task requires applying a bug fix or feature implementation, then validating the result through test execution. This makes the benchmark particularly valuable for understanding how AI tools will perform in actual production environments.
Refact.ai Agent takes a fully autonomous, iterative approach. It plans, executes, tests, and self-corrects, repeating steps as needed to reach a single correct solution with no user input. The benchmark setup was designed to reflect this autonomy-first philosophy:
- `refact-lsp`, a backend that connects the model to tools and the environment.
- The `deep_analysis()` tool: enhanced reasoning, powered by o4-mini.

What sets Refact.ai apart is that our AI Agent independently drives the entire process. While some solutions take a semi-autonomous approach, requiring users to manually invoke tools and guide the agent, Refact.ai Agent operates independently from start to finish.
Refact.ai’s SWE-bench Lite prompt follows a clear workflow. Step 4 asks the Agent to plan code changes with the `deep_analysis()` tool (powered by o4-mini) and apply them; a later step tests and evaluates the result (optionally calling `deep_analysis()` again). This workflow serves as high-level guidance, not hard rules. Refact.ai Agent uses it to form its own strategy, repeating, skipping, or reordering steps based on task context.
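To make the shape of this loop concrete, here is a minimal, purely illustrative Python sketch of a plan, apply, test, and self-correct cycle. The helpers (`plan_fix`, `apply_patch`, `run_tests`) are hypothetical placeholders, not Refact.ai Agent internals.

```python
# Purely illustrative sketch of a plan / apply / test / self-correct loop.
# plan_fix, apply_patch, and run_tests are hypothetical placeholders,
# not Refact.ai Agent internals.

def plan_fix(issue: str, feedback: str) -> str:
    # A real agent would call the LLM (and code-exploration tools) here.
    return f"plan for {issue!r}, taking into account: {feedback or 'no prior feedback'}"

def apply_patch(plan: str) -> str:
    # A real agent would edit files in the repository here.
    return f"patch derived from: {plan}"

def run_tests(patch: str) -> tuple[bool, str]:
    # A real agent would execute the project's test suite here.
    return True, "all tests passed"

def solve_issue(issue: str, max_attempts: int = 5) -> str | None:
    feedback = ""
    for _ in range(max_attempts):
        plan = plan_fix(issue, feedback)   # plan, informed by earlier failures
        patch = apply_patch(plan)          # apply the change
        ok, output = run_tests(patch)      # validate by running tests
        if ok:
            return patch                   # a single, correct final solution
        feedback = output                  # self-correct on the next iteration
    return None                            # attempt budget exhausted

if __name__ == "__main__":
    print(solve_issue("example GitHub issue"))
```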
For each SWE-bench problem, Refact.ai Agent made a single multi-step run aimed at producing one correct final solution.
Refact.ai uses Claude 3.7 Sonnet with a sampling temperature of 0.0 as its core model for SWE-bench Lite. It demonstrated exceptional capabilities for autonomous workflows: following multi-step instructions, understanding complex codebases, and maintaining context across long interactions.
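For readers who want to reproduce the sampling setup, the sketch below shows what a deterministic, temperature 0.0 request can look like. It assumes the Anthropic Python SDK; the model ID, prompt, and token limit are illustrative and not taken from the actual benchmark harness.

```python
# Minimal sketch of a temperature 0.0 call, assuming the Anthropic Python SDK.
# The model ID, prompt, and token limit are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=4096,
    temperature=0.0,  # near-deterministic sampling for reproducible runs
    messages=[
        {"role": "user", "content": "Fix the failing test described in this issue: ..."}
    ],
)
print(response.content[0].text)
```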
We previously paired Refact.ai with Claude 3.7 on the Polyglot benchmark, where it reached 93.3% with Thinking Mode and 92.9% without, the highest known scores to date on that task set.
One of the key features of Refact.ai’s approach is the `deep_analysis()` tool. It adds a structured, three-step reasoning process that improves solution quality at critical moments in the task flow.

`deep_analysis()` is powered by o4-mini, a small, fast reasoning model that handles the cognitive load of problem-solving so Claude 3.7 can focus on orchestration.
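To make this division of labor concrete, here is a hedged sketch of how a `deep_analysis()`-style hand-off to o4-mini could look. It assumes the OpenAI Python SDK; the function signature and prompt are hypothetical, not Refact.ai's actual tool implementation.

```python
# Hypothetical sketch of a deep_analysis()-style hand-off to a reasoning model,
# assuming the OpenAI Python SDK. Not Refact.ai's actual tool implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def deep_analysis(problem_statement: str, relevant_code: str) -> str:
    """Delegate heavy reasoning to o4-mini and return a distilled analysis."""
    completion = client.chat.completions.create(
        model="o4-mini",
        messages=[
            {
                "role": "user",
                "content": (
                    "Analyze the bug, propose a fix, and list risks.\n\n"
                    f"Issue:\n{problem_statement}\n\nCode:\n{relevant_code}"
                ),
            }
        ],
    )
    return completion.choices[0].message.content
```

The orchestrating model then sees only the distilled analysis, keeping its own context focused on executing the plan.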
The prompt for the `deep_analysis()` tool follows this pattern [also on GitHub]:
This structured loop is normally triggered during Step 4 of the benchmark prompt, when planning and applying code changes. In practice, though, Refact.ai Agent decides on its own when to use this tool.
While completing the benchmark, we observed that the Agent sometimes called `deep_analysis()` multiple times: first during planning, then again when testing and evaluating results. In other cases, it skipped the tool entirely. This shows that Refact.ai Agent doesn’t execute a rigid script, but instead forms its own strategy to get the task done right.
Refact.ai Agent has access to a variety of tools that allow it to interact with the entire development environment while solving tasks.
- `search()`, `regex_search()`, `definition()`, `references()`, `tree()`, `cat()` for exploring the codebase.
- `create_textdoc()`, `update_textdoc()` for creating and editing files.
- `shell()`, used to run Python tests and verify solutions.

These tools enable the AI Agent to navigate codebases, understand dependencies, make precise changes, and verify that its solutions work correctly. It uses them autonomously, deciding on its own what to use and when.
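As an illustration of how such tools can be wired together, the sketch below implements `shell()`- and `cat()`-like helpers plus a minimal name-to-function registry. The implementations are simplified and hypothetical, not the `refact-lsp` backend.

```python
# Illustrative sketch of shell()- and cat()-style tools with a simple registry.
# Simplified and hypothetical; not the refact-lsp backend.
import subprocess

def shell(command: str, timeout: int = 300) -> str:
    """Execute a shell command (e.g. a pytest run) and return its combined output."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return f"exit code {proc.returncode}\n{proc.stdout}{proc.stderr}"

def cat(path: str) -> str:
    """Return the contents of a file so the model can read it."""
    with open(path, encoding="utf-8") as f:
        return f.read()

# Minimal name -> callable registry the agent could pick from autonomously.
TOOLS = {"shell": shell, "cat": cat}

if __name__ == "__main__":
    print(TOOLS["shell"]("python -m pytest -q"))
```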
Although Refact.ai Agent can also interface with real-world tools (GitHub, Docker, PostgreSQL, etc.) and 1000+ tools via MCP servers, these integrations weren’t used in the benchmark run; they are, however, part of standard workflows in user environments.
Claude 3.7 Sonnet has a budget of 60 steps to complete a task. A step is a single AI action, such as modifying a file, listing directories, or running tests. The Agent strategically decides how to spend these steps, leading to clear, controlled solutions.
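One way to picture this constraint is a simple counter around each action, as in the short sketch below; `choose_action` is a hypothetical callback and the loop is illustrative, not the benchmark harness.

```python
# Illustrative step-budget enforcement around individual agent actions.
# choose_action is a hypothetical callback, not the actual benchmark harness.
MAX_STEPS = 60

def run_with_budget(task, choose_action):
    history = []
    for step in range(1, MAX_STEPS + 1):
        action = choose_action(task, history)  # e.g. edit a file, list a directory, run tests
        if action is None:                     # the agent considers the task complete
            break
        history.append((step, action()))       # execute the chosen tool call, record the result
    return history
```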
Out of 300 tasks in SWE-bench Lite:
Refact.ai Agent even solved two SWE-bench tasks that no other listed agent has (django-12589, sympy-21627), presumably thanks to the o4-mini model’s reasoning capabilities.
| Total Instances | Solved | Not solved | Solved (%) | Not solved (%) |
|---|---|---|---|---|
| 300 | 179 | 121 | 59.7% | 40.3% |
Results by repository:

- astropy/astropy: 3/6 (50.0%)
- django/django: 78/114 (68.4%)
- matplotlib/matplotlib: 11/23 (47.8%)
- mwaskom/seaborn: 2/4 (50.0%)
- pallets/flask: 0/3 (0.0%)
- psf/requests: 5/6 (83.3%)
- pydata/xarray: 2/5 (40.0%)
- pylint-dev/pylint: 3/6 (50.0%)
- pytest-dev/pytest: 10/17 (58.8%)
- scikit-learn/scikit-learn: 17/23 (73.9%)
- sphinx-doc/sphinx: 6/16 (37.5%)
- sympy/sympy: 43/77 (55.8%)
Refact.ai’s performance on SWE-bench Lite demonstrates that AI agents are becoming increasingly capable of handling real-world software engineering tasks autonomously — not just generating code, but planning, debugging, testing, and refining it with minimal human input.
Our next step is evaluating Refact.ai Agent on SWE-bench Verified, a benchmark with more rigorously validated tasks.
All of this is part of our open-source commitment. Developers can explore the system, understand how autonomy is implemented, and even contribute. We believe that as the baseline work of software development shifts to AI, human engineers will be free to focus on the more interesting and creative parts of the job — and invite developers to build the future of programming together.
This isn’t just about ranking highly on a benchmark — it’s about real-world coding impact. Refact.ai Agent helps developers and software companies:
Vibe coding is the future of software development — get it today.
Refact.ai’s autonomous AI Agent works like a senior developer in your IDE:
Try open-source Refact.ai in VS Code or JetBrains for your programming tasks and let us know what you think!