Fine-Tuning Code Completion Models: The Easy Way

March 27, 2024
by Oleg Klimov


How to Start Fine-tune on Server

  1. Follow the Self-hosted or Enterprise guide to start a server. To run the container, you need one of the following: a machine with an NVIDIA GPU and an NVIDIA-enabled Docker setup; a Runpod cloud GPU; or an AWS account.
  2. Open the server UI in a browser. In the “Projects” tab, create a project and add your source code files. You can use a link to a git repo (including a private repo) or upload a .zip file.
  3. Go to the “Finetune” tab and hit “Launch”. A finetune should be ready in about 5 hours; that estimate is for a single RTX 3090 and a codebase of 1000 files.

That’s all you really need to know. The rest of this document describes how to make it even better.

What Source Files To Choose?

Train on files that you think are good - this way the model will learn to emulate good style and practices. You can also see this as knowledge transfer, from the expert engineers in your company to everyone else: the model will write code the way your internal experts would.

What is LoRA?

It’s not necessary to finetune all the weights of a model. There are Parameter-Efficient Fine-Tuning (PEFT) methods, and the one we use all the time is called LoRA (Low-Rank Adaptation). It has several advantages: it trains faster, it needs much less memory to train, it retains the speed of the original model, and it’s possible to switch LoRAs very quickly during inference.
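To make the idea concrete, here is a minimal sketch of a LoRA-style linear layer in NumPy. The sizes and the alpha scaling are illustrative assumptions, not the exact configuration Refact uses; the point is that only the two small factors A and B are trained, while the big base matrix W stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 1024, 1024, 8   # illustrative sizes; r is the LoRA rank

W = rng.standard_normal((d_in, d_out)) * 0.02   # frozen base weight
A = rng.standard_normal((d_in, r)) * 0.02       # trainable low-rank factor
B = np.zeros((r, d_out))                        # zero-initialized, so training starts at the base model

def forward(x, alpha=16):
    # base projection plus the low-rank update, scaled by alpha / r
    return x @ W + (x @ A @ B) * (alpha / r)

full = W.size             # parameters if we unfroze the whole matrix
lora = A.size + B.size    # parameters LoRA actually trains
print(full, lora)         # 1048576 vs 16384 -- a 64x reduction at these sizes
```

The memory savings during training come from the optimizer only needing gradients and momentum for A and B, not for W.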

Compared to unfreezing all the weights of a model, it is also less prone to “catastrophic forgetting” - the phenomenon of forgetting previously learned skills when learning something new. It’s easy to see why from the math of how LoRA works: the finetune is a small additive update to the base weights, and by scaling that update down it’s always possible to get back to the original weights and skills of the base model. So nothing is ever lost during LoRA training: the process only needs to balance new information (larger LoRA weights) against retaining the base model’s capabilities (smaller LoRA weights).
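This “nothing is ever lost” property can be sketched in a few lines. Assuming the merged-weight view of LoRA (effective weight = W plus a scaled low-rank delta), shrinking the scale factor interpolates back to the base model exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64))        # frozen base weights
A = rng.standard_normal((64, 4)) * 0.1   # low-rank factors "learned" during finetuning
B = rng.standard_normal((4, 64)) * 0.1
x = rng.standard_normal((1, 64))

base = x @ W
for scale in (1.0, 0.5, 0.0):
    # scaling the additive delta trades new behavior against base behavior;
    # at scale 0 the output is exactly the base model again
    y = x @ (W + scale * (A @ B))
    print(scale, np.abs(y - base).max())
```

Full fine-tuning has no such knob: once all weights have moved, the original model is gone unless you kept a copy.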

How Much Information is Possible to Inject?

For a 3B model, the largest LoRA that is still possible to run with good hardware acceleration has about 150M parameters. That’s a lot - 150M was a popular size for entire models just a few years ago! But in practice, including feedback from our clients, we still see limits on how much information a LoRA can retain.
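A back-of-the-envelope calculation shows how a LoRA reaches that size. The architecture numbers below are hypothetical (they are not the exact specs of any particular 3B model), but the formula is general: each adapted matrix contributes two low-rank factors.

```python
# Illustrative, assumed architecture numbers for a ~3B model:
d_model   = 2560   # hidden size
n_layers  = 32
n_targets = 4      # e.g. the q, k, v and o projections in each block
rank      = 256    # LoRA rank

# each adapted square (d x d) matrix gets two factors: (d x r) and (r x d)
params_per_matrix = 2 * rank * d_model
total = n_layers * n_targets * params_per_matrix
print(f"{total / 1e6:.0f}M LoRA parameters")   # ~168M at these settings
```

So a LoRA in the 150M range already implies a fairly large rank; the capacity is real, but finite.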

From that experience, it appears that a LoRA likes to concentrate on a single language or topic - a useful line between a fine-tuning that will likely work and one that will not. Feeding it several dissimilar software projects written in different languages is probably a bad idea.

But fear not, in Refact you can train several finetunes and assign them to different teams!

Changing Hyperparameters and Comparing Runs

If you want to play around with learning rates, LoRA sizes, training steps, or weight decay, it’s convenient to do that within the Refact UI.

The first thing you need is a test set: just 1 to 3 source files that you think are representative of your codebase. They will be automatically excluded from the train set. By the way, you can upload individual files for this - it doesn’t have to be a .zip archive.

Calculated on the same files, the test loss is comparable between runs and models. Strictly speaking, for test losses to be comparable you also need the same tokenizer, and that differs between models - but in practice tokenizers behave similarly enough on English text.

Look for the lowest test loss among all the runs you try. The test loss is a good measure of how little the model is surprised by what it sees in your test files.
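“Surprise” here has a precise meaning: the test loss is the average negative log-likelihood per token, and its exponential is the perplexity. A small sketch with hypothetical per-token probabilities:

```python
import math

# Hypothetical probabilities the model assigned to the actual next tokens
# of a test file (higher = less surprised).
token_probs = [0.9, 0.6, 0.95, 0.3, 0.8]

# test loss = mean negative log-likelihood per token
loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(loss)   # the "effective branching factor" the model sees
print(round(loss, 3), round(perplexity, 3))
```

A lower test loss after fine-tuning means the model assigns higher probability to your real code, which is exactly what better completions look like.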

Make sure training goes through several epochs before the lowest test loss is reached.
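In other words, pick the checkpoint at the minimum of the test-loss curve, not the last one. A sketch with a made-up loss curve:

```python
# Hypothetical test loss measured at the end of each epoch.
losses = [1.40, 1.10, 0.95, 0.92, 0.97, 1.05]

# the best checkpoint is the one with the lowest test loss;
# rising loss after it suggests the model has started to overfit
best_epoch = min(range(len(losses)), key=losses.__getitem__)
print(best_epoch + 1, losses[best_epoch])   # epoch 4 is the sweet spot here
```

If the minimum lands on the very first epoch, the run probably needs a lower learning rate or more data; if it lands on the very last, train longer.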

Try Fine-Tuning in Enterprise:
3 months free! Refact is an AI coding assistant that boosts developers’ productivity by 80%. It features context-aware AI completions, in-IDE chat, and in-line code commands for faster, high-quality code delivery. As a secure alternative to Copilot, Refact can be deployed on-premises, ensuring 100% safety of your data. Maximize your software development efficiency with our AI solution for companies!