Kevin Lu
← Writing

Open sourcing a 1.5B parameter Next-Edit Autocomplete Model

We’re open sourcing Sweep Next-Edit, a locally runnable SOTA LLM for next-edit autocompletion.


Today we’re releasing Sweep Next-Edit: a 1.5B parameter model that predicts your next code edit before you make it. It can run locally on your laptop in under 500ms, and outperforms models over 4x its size on next-edit benchmarks.

We’re open sourcing the model weights so the community can build fast, privacy-preserving autocomplete for every IDE - VSCode, Neovim, Emacs, and beyond.

Next Edit Models: Speed vs Quality

Accuracy by Task Category

ModelBelowAboveTab to JumpFIMNoiseOverall
Sweep 7B89.84%75.51%82.03%74.22%100.00%81.28%
Sweep 3B82.03%57.14%82.03%60.16%100.00%72.12%
Sweep 1.5B71.88%46.94%74.22%56.25%96.88%67.82%
Sweep 0.5B63.28%53.06%66.41%43.75%96.88%61.30%
Mercury Coder71.88%53.06%70.31%64.06%62.50%54.09%
Zed Zeta67.19%69.39%62.50%35.94%87.50%43.27%
Continue Instinct10.94%8.16%23.44%37.50%34.38%25.30%
Qwen2.5-Coder-7B54.69%61.22%56.25%41.41%100.00%55.62%
Qwen3-8B34.38%22.45%54.69%42.19%96.88%48.27%
Qwen3-4B24.22%16.33%33.59%25.00%90.62%30.19%
Qwen3-1.7B14.06%14.29%14.06%15.62%68.75%17.54%

Why Other Models Struggle

The poor performance of Zeta and Instinct, such as extra or missing trailing lines, stems from suboptimal formatting and tokenization choices:

<|editable_region_start|> <|editable_region_end|>
Click or hover to highlight the tokens
SYSTEM_PROMPT = """You are Instinct, an intelligent next-edit predictor developed by Continue.dev.
...
"""

Source: https://github.com/continuedev/instinct/blob/2f888a1b152fdaab37a1c9043b749aabdf55a305/utils/datautils.py.

Further, these models had some other suboptimal train-time design choices:

Our Format

According to Qwen2.5 Coder’s technical report, their pretraining format is:

<|repo_name|>{repo_name}
<|file_sep|>{file_path_1}
{file_content_1}
<|file_sep|>{file_path_2}
{file_content_2}
<|file_sep|>{current_file}
Current file prefix
<|fim_prefix|>
{prefix}
<|fim_suffix|>
{suffix}
<|fim_middle|>{completion}

This is a great for training FIM but not next-edit autocomplete. Here’s our format:

<|file_sep|>{file_path_1}
{file_content_1}
<|file_sep|>{file_path_2}
{file_content_2}
<|file_sep|>{changed_file_1}.diff
original:
{before_changes_of_diff}
updated:
{after_changes_of_diff}
<|file_sep|>{changed_file_1}.diff
original:
{before_changes_of_diff}
updated:
{after_changes_of_diff}
<|file_sep|>original/{file_path}
{contents_prior_to_most_recent_change}
<|file_sep|>current/{file_path}
{current_state_of_contents}
<|file_sep|>updated/{file_path}
{updated_state_of_contents}

The two main differences from the base Qwen-2.5 Coder prompt are:

Diff Format

We placed every recent change in its own file separator block.

We also ran a “hyperparameter sweep” over prompt formats. There are many ways to represent diffs - unified diff format, side-by-side, original/updated blocks, old/new labels, with or without .diff markers. We used a genetic algorithm to find the optimal combination.

Here’s how the genetic algorithm worked:

  1. Initialize population: We started with 10 prompt format variants, each with different combinations of diff representation choices.
  2. Evaluate fitness: For each format, we ran inference on a held-out validation set and computed exact-match accuracy.
  3. Selection & breeding: The top-performing formats were kept, and we created new variants by combining elements from winning formats (e.g., taking the diff markers from one format and the label style from another).
  4. Mutation: We randomly tweaked some formats to explore new variations.
  5. Repeat: We ran this for 3 generations, evaluating ~30 different prompt formats in total.

The winning format used original/updated blocks without unified diff syntax, a result that surprised us at first.

Of the final winning formats, we found this to be the simplest one.

Our hypothesis is that the original / updated format is more readable for LLMs than unified diff formats. Why, you may ask? Consider the following diff:

             if (params.length >= 8 && params[7] != null) {
-                linkState = LinkState.valueOf(params[7].toUpperCase());
+                linkState = LinkState.valueOf(params[7].toUpperCase(Locale.ENGLISH));
             }

As a human, we effectively read this on a 2D screen, so it’s easy to see the linkState declarations line up and that Locale.ENGLISH was inserted. But to the model it looks something like this:

             if (params.length >= 8 && params[7] != null) {\n-                linkState = LinkState.valueOf(params[7].toUpperCase());\n+                linkState = LinkState.valueOf(params[7].toUpperCase(Locale.ENGLISH));\n             }

This is much harder to parse visually. Further, Qwen’s technical report does not mention adding any unified diffs to the pretraining data.

So we present it to the model in the following format:

original:
            if (params.length >= 8 && params[7] != null) {
                linkState = LinkState.valueOf(params[7].toUpperCase());
            }
updated:
            if (params.length >= 8 && params[7] != null) {
                linkState = LinkState.valueOf(params[7].toUpperCase(Locale.ENGLISH));
            }

This is the same reason why Claude Sonnet prefers using str replace instead of applying unified diffs, as it’s more similar to Claude’s pretraining data, making it easier for the model to generate.

Sliding Window

For the current window, we found a fixed window size of 10 lines above and below to be optimal.

Let’s look at Instinct’s prompt format:

20 lines above changes
<|editable_region_start|>
1 line above cursor
line containing cursor
5 lines below cursor
<|editable_region_end|>
20 lines below changes

The model’s task is to rewrite the 7 lines in the editable region (highlighted).

The issue we’ve seen in evals is that the model would sometimes continue generating after <|editable_region_end|>, which would add extra irrelevant changes. The reason is likely that the model sees the 20 lines below the editable region and is eager to continue generating it. Generally, the model doesn’t comprehend the editable region well, as this is out of distribution relative to its pretraining data.

Thus, we simplified this format to the following, removing the “editable” region entirely:

10 lines above cursor
line containing cursor
10 lines below cursor

Where the model’s task is to rewrite the entire 21 lines.

The downside is that generating 21 lines is very slow. But after implementing our custom n-gram implementation and early cancellation logic in TensorRT-LLM, the average warm autocomplete ends up completing in sub-100ms.

We also originally used an AST-based algorithm to compute optimal boundaries but found that the model trains better with a fixed window size.

The intuition behind AST-based boundaries was to align the editable region with logical code structures-functions, classes, or blocks. For example, if the user is editing inside a function, we’d expand the region to include the entire function body:

def process_data(items):
    # <-- editable region start (function body)
    results = []
    for item in items:
        if item.is_valid():
            results.append(item.transform())
    return results
    # <-- editable region end

This seemed elegant in theory, but in practice it caused two problems. First, inconsistent region sizes: function lengths vary wildly. Some are 5 lines, others are 30 lines. This made training unstable since the model saw dramatically different output lengths. Second, boundary prediction errors: the model had to learn both what to generate and where the AST boundary should end. When it got the boundary wrong, the entire output was marked incorrect.

With fixed windows, the model always knows exactly how many lines to output, letting it focus entirely on generating the right content.

Training Process

Supervised Fine-Tuning

We first ran supervised fine-tuning on ~100k training entries spanning next-edit, FIM and noisy suggestion reduction on an 8xH100 for 4 hours. We scraped from the most popular permissively-licensed repos on GitHub over the past year with a commit count filter. This causes a small bias towards more recent topics such as LLM inference or MCP servers, but it does better correlate with the type of work that the average user of Sweep does. We also upsampled to match the language distribution of most Jetbrains users, like Java, Kotlin, C#, PHP, and Ruby.

However, despite training on high-quality data during SFT, we found a few undesirable habits of the model:

On-Policy Reinforcement Learning

The initial supervised fine-tuning (SFT) data is valuable for teaching the model the task in a stable way, but it’s not enough to handle all edge cases. This is where reinforcement learning (RL) comes in.

The key difference between SFT and RL is how the model learns. With SFT, the model is shown the “correct” output and trained to reproduce it exactly. This works well for learning the basic task, but the model only sees one way to solve each problem. With RL, the model generates outputs freely, and we score them with reward functions. The model learns to produce more high-reward outputs and fewer low-reward ones without being told exactly what to generate.

Why does this matter for next-edit? In SFT, if the model generates code that’s functionally equivalent but formatted slightly differently, it gets penalized. In RL, we can design reward functions that recognize multiple valid solutions, letting the model explore and find outputs that are both correct and natural.

We then ran RL for another 2000 steps to remove these undesired behaviours. We added a few simple rewards:

  1. Check using tree-sitter whether the output file parses for common languages like Java, Python etc., when the completion is merged into the file.
  2. Regularization rewards, such as checking that the change is within certain bounds in size.

Graph showing parse reward increasing during reinforcement learning training

Try it out

The end result is a model that feels genuinely useful: fast enough that suggestions appear before you finish thinking about them, accurate enough that you trust them, and small enough to run entirely on your machine without sending code to the cloud. By open sourcing both the model and sharing these insights, we hope to save others from repeating our mistakes and help build a great autocomplete experience for all IDEs.

If you use JetBrains IDEs, we’d love for you to try out Sweep and let us know what you think. The plugin uses our custom inference engine combined with Sweep Next-Edit to provide lightning-fast and accurate next-edit suggestions as you code.

For those building their own tools, here is a minimal working example to help you get started: https://huggingface.co/sweepai/sweep-next-edit-1.5B/blob/main/run_model.py and we welcome contributions and feedback from the community. Whether you’re integrating it into VSCode, Neovim, or your own custom editor - we’re excited to see what you build!

Have questions or feedback? Reach out to us on Twitter/X or email us at team@sweep.dev.

Benchmark Details

Latency

Test Setup

Input/Output

Tokens Per Second (TPS) Calculation

Quality

The overall score is the average of five benchmarks:

  1. Next-edit suggestions below the cursor - making a close change below the cursor
  2. Next-edit suggestions above the cursor - making a close change above the cursor
  3. Tab-to-jump next-edit suggestions - making a change far from the cursor
  4. Fill-in-the-middle (FIM) completions - completing code fragments at the cursor position
  5. Noisiness evals - not suggesting a change when no changes are expected

For details on how prompt formatting for each model, see the Appendix.

Evaluation Metric

All five benchmarks use whitespace-agnostic exact-match accuracy as the evaluation metric.

We intentionally chose exact-match over softer metrics like CodeBLEU or LLM-as-a-judge (used by Instinct’s evaluation). The argument for softer metrics is that there are multiple correct ways to generate the same output. However, this reasoning doesn’t hold for next-edit autocomplete.

Next-edit suggestions are highly predictable. Most suggestions are low-entropy changes—repeating the user’s last edit, fixing a typo, or completing a pattern. In these situations, there’s usually only one correct answer, and the model needs to nail it consistently for users to trust the autocomplete.

“Almost correct” suggestions are often worse than no suggestion. When a model generates code that’s 90% right, the user still has to manually fix the remaining 10%. This interrupts their flow and can be more frustrating than having no suggestion at all. Partial credit doesn’t reflect the actual user experience.

Here’s a concrete example from our benchmark that illustrates why exact-match matters:

Recent Changes
(... more similar changes)
redisson-spring-data-18/...RedisClusterNodeDecoder.java:57-53
for (String flag : flagsStr.split(",")) {
- String flagValue = flag.toUpperCase().replaceAll("\?", "");
+ String flagValue = flag.toUpperCase(Locale.ENGLISH).replaceAll("\?", "");
flags.add(Flag.valueOf(flagValue));
redisson-spring-data-18/...RedisClusterNodeDecoder.java:75-71
if (params.length >= 8 && params[7] != null) {
- linkState = LinkState.valueOf(params[7].toUpperCase());
+ linkState = LinkState.valueOf(params[7].toUpperCase(Locale.ENGLISH));
}
Current Span to Edit
continuedev/instinct✗ Missing changes
(No changes made)
zed-industries/zeta✗ Extra irrelevant changes
String response = buf.toString(CharsetUtil.UTF_8);
- List<RedisClusterNode> nodes = new ArrayList<RedisClusterNode>();
+ List<RedisClusterNode> nodes = new ArrayList<>();
...
for (String flag : flagsStr.split(",")) {
- String flagValue = flag.toUpperCase().replaceAll("\\?", "");
+ String flagValue = flag.toUpperCase(Locale.ENGLISH).replaceAll("\\?", "");
...
address = new RedisURI(RedisURI.REDIS_PROTOCOL + addr);
- }
sweepai/sweep-next-edit-7B✓ Precise, relevant change only
for (String flag : flagsStr.split(",")) {
- String flagValue = flag.toUpperCase().replaceAll("\\?", "");
+ String flagValue = flag.toUpperCase(Locale.ENGLISH).replaceAll("\\?", "");
Key Insight: While all three models were shown the same recent changes pattern, only Sweep Next-Edit 1.5B made the precise, contextually appropriate edit. The other models either made no changes or introduced extraneous modifications that could break the code.

Appendix

Prompt Construction

<|im_start|>user
Here are contents of the user's file before any changes were made:
<current_file>
{original_file}
</current_file>

The user recently made the following changes:

<recent_changes>
{recent_changes}
</recent_changes>

Here's the section to edit:

<code_block>
{current_section}
</code_block>

Rewrite <code_block> to help the user finish writing their changes. Respond with only the updated contents of the <code_block>, wrapped in the XML tags.<|im_end|>
<|im_start|>assistant
I have inferred the user's intentions and will fully implement the user's changes.

<code_block>