Refact.ai Agent has achieved the #1 score on SWE-bench Lite, solving 179 out of 300 tasks for a 59.7% success rate. This result puts Refact.ai at the top of the leaderboard and makes it the best-performing open-source AI agent on SWE-bench Lite to date.
SWE-bench Lite is a benchmark that evaluates LLM-based systems on real GitHub issues from popular open-source Python projects. Each task requires applying a bug fix or feature implementation, then validating the result through test execution. This makes the benchmark particularly valuable for understanding how AI tools will perform in actual production environments.
Refact.ai Agent takes a fully autonomous, iterative approach. It plans, executes, tests, and self-corrects, repeating steps as needed to reach a single correct solution with no user input. The benchmark setup was designed to reflect this autonomy-first philosophy:
- `refact-lsp`, a backend that connects the model to tools and the environment.
- The `deep_analysis()` tool: enhanced reasoning, powered by o4-mini.

What sets Refact.ai apart is that our AI Agent independently drives the entire process. While some solutions take a semi-autonomous approach, requiring users to manually invoke tools and guide the agent, Refact.ai Agent operates independently from start to finish.
Refact.ai’s SWE-bench Lite prompt follows a clear workflow. Step 4 asks the Agent to plan code changes with the `deep_analysis()` tool (powered by o4-mini) and apply them; a later step tests and evaluates the result (optionally calling `deep_analysis()` again). This workflow serves as high-level guidance, not hard rules. Refact.ai Agent uses it to form its own strategy, repeating, skipping, or reordering steps based on task context.
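To make the shape of this loop concrete, here is a minimal, purely illustrative Python sketch of a plan, apply, test, and self-correct cycle. The helpers (`plan_fix`, `apply_patch`, `run_tests`) are hypothetical placeholders, not Refact.ai Agent internals.

```python
# Purely illustrative sketch of a plan / apply / test / self-correct loop.
# plan_fix, apply_patch, and run_tests are hypothetical placeholders,
# not Refact.ai Agent internals.

def plan_fix(issue: str, feedback: str) -> str:
    # A real agent would call the LLM (and code-exploration tools) here.
    return f"plan for {issue!r}, taking into account: {feedback or 'no prior feedback'}"

def apply_patch(plan: str) -> str:
    # A real agent would edit files in the repository here.
    return f"patch derived from: {plan}"

def run_tests(patch: str) -> tuple[bool, str]:
    # A real agent would execute the project's test suite here.
    return True, "all tests passed"

def solve_issue(issue: str, max_attempts: int = 5) -> str | None:
    feedback = ""
    for _ in range(max_attempts):
        plan = plan_fix(issue, feedback)   # plan, informed by earlier failures
        patch = apply_patch(plan)          # apply the change
        ok, output = run_tests(patch)      # validate by running tests
        if ok:
            return patch                   # a single, correct final solution
        feedback = output                  # self-correct on the next iteration
    return None                            # attempt budget exhausted

if __name__ == "__main__":
    print(solve_issue("example GitHub issue"))
```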
For each SWE-bench problem, Refact.ai Agent made a single multi-step run aimed at producing one correct final solution.
Refact.ai uses Claude 3.7 Sonnet with a sampling temperature of 0.0 as its core model for SWE-bench Lite. It demonstrated exceptional capabilities for autonomous workflows: following multi-step instructions, understanding complex codebases, and maintaining context across long interactions.
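For readers who want to reproduce the sampling setup, the sketch below shows what a deterministic, temperature 0.0 request can look like. It assumes the Anthropic Python SDK; the model ID, prompt, and token limit are illustrative and not taken from the actual benchmark harness.

```python
# Minimal sketch of a temperature 0.0 call, assuming the Anthropic Python SDK.
# The model ID, prompt, and token limit are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=4096,
    temperature=0.0,  # near-deterministic sampling for reproducible runs
    messages=[
        {"role": "user", "content": "Fix the failing test described in this issue: ..."}
    ],
)
print(response.content[0].text)
```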
We previously paired Refact.ai with Claude 3.7 on the Polyglot benchmark, where it reached 93.3% with Thinking Mode and 92.9% without, the highest known scores to date on that task set.
One of the key features of Refact.ai’s approach is the `deep_analysis()` tool. It adds a structured, three-step reasoning process that improves solution quality at critical moments in the task flow.

`deep_analysis()` is powered by o4-mini, a small, fast reasoning model that handles the cognitive load of problem-solving so Claude 3.7 can focus on orchestration.
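To make this division of labor concrete, here is a hedged sketch of how a `deep_analysis()`-style hand-off to o4-mini could look. It assumes the OpenAI Python SDK; the function signature and prompt are hypothetical, not Refact.ai's actual tool implementation.

```python
# Hypothetical sketch of a deep_analysis()-style hand-off to a reasoning model,
# assuming the OpenAI Python SDK. Not Refact.ai's actual tool implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def deep_analysis(problem_statement: str, relevant_code: str) -> str:
    """Delegate heavy reasoning to o4-mini and return a distilled analysis."""
    completion = client.chat.completions.create(
        model="o4-mini",
        messages=[
            {
                "role": "user",
                "content": (
                    "Analyze the bug, propose a fix, and list risks.\n\n"
                    f"Issue:\n{problem_statement}\n\nCode:\n{relevant_code}"
                ),
            }
        ],
    )
    return completion.choices[0].message.content
```

The orchestrating model then sees only the distilled analysis, keeping its own context focused on executing the plan.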
The prompt for the `deep_analysis()` tool follows this pattern [also on GitHub]:
This structured loop is normally triggered during Step 4 of the benchmark prompt, when planning and applying code changes. In practice, though, Refact.ai Agent decides on its own when to use this tool.
While completing the benchmark, we observed that the Agent sometimes called `deep_analysis()` multiple times: first during planning, then again when testing and evaluating results. In other cases, it skipped the tool entirely. This shows that Refact.ai Agent doesn’t execute a rigid script, but instead forms its own strategy to get the task done right.
Refact.ai Agent has access to a variety of tools that allow it to interact with the entire development environment while solving tasks.
- `search()`, `regex_search()`, `definition()`, `references()`, `tree()`, `cat()` for exploring the codebase.
- `create_textdoc()`, `update_textdoc()` for creating and editing files.
- `shell()`, used to run Python tests and verify solutions.

These tools enable the AI Agent to navigate codebases, understand dependencies, make precise changes, and verify that its solutions work correctly. It uses them autonomously, deciding on its own what to use and when.
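As an illustration of how such tools can be wired together, the sketch below implements `shell()`- and `cat()`-like helpers plus a minimal name-to-function registry. The implementations are simplified and hypothetical, not the `refact-lsp` backend.

```python
# Illustrative sketch of shell()- and cat()-style tools with a simple registry.
# Simplified and hypothetical; not the refact-lsp backend.
import subprocess

def shell(command: str, timeout: int = 300) -> str:
    """Execute a shell command (e.g. a pytest run) and return its combined output."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return f"exit code {proc.returncode}\n{proc.stdout}{proc.stderr}"

def cat(path: str) -> str:
    """Return the contents of a file so the model can read it."""
    with open(path, encoding="utf-8") as f:
        return f.read()

# Minimal name -> callable registry the agent could pick from autonomously.
TOOLS = {"shell": shell, "cat": cat}

if __name__ == "__main__":
    print(TOOLS["shell"]("python -m pytest -q"))
```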
Although Refact.ai Agent can also interface with real-world tools (GitHub, Docker, PostgreSQL, etc.) and 1000+ tools via MCP servers, these integrations weren’t used in the benchmark run; they are, however, part of standard workflows in user environments.
Claude 3.7 Sonnet has a budget of 60 steps to complete a task. A step is a single AI action, such as modifying a file, listing directories, or running tests. The Agent strategically decides how to spend these steps, leading to clear, controlled solutions.
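One way to picture this constraint is a simple counter around each action, as in the short sketch below; `choose_action` is a hypothetical callback and the loop is illustrative, not the benchmark harness.

```python
# Illustrative step-budget enforcement around individual agent actions.
# choose_action is a hypothetical callback, not the actual benchmark harness.
MAX_STEPS = 60

def run_with_budget(task, choose_action):
    history = []
    for step in range(1, MAX_STEPS + 1):
        action = choose_action(task, history)  # e.g. edit a file, list a directory, run tests
        if action is None:                     # the agent considers the task complete
            break
        history.append((step, action()))       # execute the chosen tool call, record the result
    return history
```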
Out of 300 tasks in SWE-bench Lite:
Refact.ai Agent even solved two SWE-bench tasks that no other listed agent has (django-12589, sympy-21627), presumably thanks to the o4-mini model’s reasoning capabilities.
| Total Instances | Solved | Not solved | Solved (%) | Not solved (%) |
|---|---|---|---|---|
| 300 | 179 | 121 | 59.7% | 40.3% |
Results by repository:

- astropy/astropy: 3/6 (50.0%)
- django/django: 78/114 (68.4%)
- matplotlib/matplotlib: 11/23 (47.8%)
- mwaskom/seaborn: 2/4 (50.0%)
- pallets/flask: 0/3 (0.0%)
- psf/requests: 5/6 (83.3%)
- pydata/xarray: 2/5 (40.0%)
- pylint-dev/pylint: 3/6 (50.0%)
- pytest-dev/pytest: 10/17 (58.8%)
- scikit-learn/scikit-learn: 17/23 (73.9%)
- sphinx-doc/sphinx: 6/16 (37.5%)
- sympy/sympy: 43/77 (55.8%)
Refact.ai’s performance on SWE-bench Lite demonstrates that AI agents are becoming increasingly capable of handling real-world software engineering tasks autonomously — not just generating code, but planning, debugging, testing, and refining it with minimal human input.
Our next step is evaluating Refact.ai Agent on SWE-bench Verified, a benchmark with more rigorously validated tasks.
All of this is part of our open-source commitment. Developers can explore the system, understand how autonomy is implemented, and even contribute. We believe that as the baseline work of software development shifts to AI, human engineers will be free to focus on the more interesting and creative parts of the job — and invite developers to build the future of programming together.
This isn’t just about ranking highly on a benchmark — it’s about real-world coding impact. Refact.ai Agent helps developers and software companies:
Vibe coding is the future of software development — get it today.
Refact.ai’s autonomous AI Agent works like a senior developer in your IDE:
Try open-source Refact.ai in VS Code or JetBrains for your programming tasks and let us know what you think!