How to Make LLMs Shut Up
December 21, 2024


This article is adapted from a talk I gave at the Sourcegraph Developer Tools Meetup at the Cloudflare offices in San Francisco on December 16, 2024.

I’m Daksh, co-founder of Greptile – AI that understands large codebases. Our most popular product is our AI code review bot. It leaves the first review on a PR, drawing on the full context of the broader codebase to surface bugs, anti-patterns, duplicated code, and more.

When we first launched this product, the biggest complaint by far was that the bot left too many comments. On a PR with 20 changes, it might leave up to 10 comments, at which point the PR author starts ignoring all of them.

To fix this, we needed to:

  • Figure out how to reduce the number of comments Greptile generated,
  • which meant figuring out which comments to drop,
  • which meant finding a way to evaluate the quality of each comment.

Two ideas here:

  1. GitHub lets developers react to comments with 👍/👎. We could use those reactions as a quality signal.
  2. We could check which comments the author actually addressed in code by scanning the diffs of subsequent commits.

We chose the latter, which also gave us a performance metric: the percentage of generated comments that authors actually addressed.
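For illustration, here is a minimal sketch of what the "was this comment addressed?" check could look like. This is not Greptile’s actual implementation: the Comment shape, the unified-diff parsing, and the line-proximity heuristic are all assumptions.

```python
# A minimal sketch of the "was this comment addressed?" check, not
# Greptile's actual implementation. The Comment shape, the unified-diff
# parsing, and the line-proximity heuristic are all assumptions.
import re
from dataclasses import dataclass

@dataclass
class Comment:
    path: str   # file the comment was left on
    line: int   # line number it points at

HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")

def changed_lines(diff: str) -> dict[str, set[int]]:
    """Parse a unified diff into {file path: line numbers touched}."""
    touched: dict[str, set[int]] = {}
    path, new_line = None, 0
    for raw in diff.splitlines():
        if raw.startswith("--- "):
            continue
        elif raw.startswith("+++ b/"):
            path = raw[len("+++ b/"):]
            touched.setdefault(path, set())
        elif (m := HUNK_RE.match(raw)):
            new_line = int(m.group(1))
        elif path is not None:
            if raw.startswith("+"):
                touched[path].add(new_line)
                new_line += 1
            elif raw.startswith("-"):
                touched[path].add(new_line)  # a deletion near this spot
            elif not raw.startswith("\\"):
                new_line += 1
    return touched

def was_addressed(comment: Comment, later_diffs: list[str], slack: int = 3) -> bool:
    """Count a comment as addressed if a later commit touched nearby lines."""
    for diff in later_diffs:
        nearby = changed_lines(diff).get(comment.path, set())
        if any(abs(n - comment.line) <= slack for n in nearby):
            return True
    return False
```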

We analyzed existing Greptile comments and found that about 19% were good, 2% were flat-out wrong, and 79% were nits – technically correct comments, but not something developers care about.

Here is an example of a nit: a comment like “Consider renaming this variable for clarity” – technically valid, but not worth a developer’s attention before merging.

Essentially, we had to teach LLMs (which are, in effect, paid by the token) to produce fewer, higher-quality comments.

Attempt 1: Prompting

Our first instinct was to prompt-engineer our way out of the problem.

Sadly, no amount of prompting techniques got the LLM to leave fewer nit comments.
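For a sense of what we tried, the instructions looked roughly like this (an illustrative sketch, not our production prompt):

```python
# Illustrative only: the kind of instruction we layered into the prompt.
NO_NITS_PROMPT = """You are a senior engineer reviewing a pull request.
Only comment on issues that could cause bugs, security problems, or real
maintenance cost. Do NOT comment on style, naming, or other nitpicks.
If nothing meets that bar, leave no comments at all."""
```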

Since LLMs are famously “few-shot learners”, we also tried putting a handful of examples of good and bad comments into Greptile’s prompt, hoping it would generalize the pattern.

This didn’t work either. If anything, it made the bot worse: instead of picking up the useful patterns in the examples (some would argue LLMs are architecturally incapable of this), it latched onto surface-level features.
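The few-shot framing looked something like the sketch below. The examples here are invented for illustration; in practice they came from real reviews.

```python
# Illustrative few-shot framing: labeled examples of kept vs. dismissed
# comments, prepended to the review prompt. The examples are made up.
FEW_SHOT_EXAMPLES = """Comments developers ADDRESSED:
- "This query runs inside a loop; batching it avoids N+1 database calls."
- "`user_id` can be None here, which will raise on the next line."

Comments developers IGNORED:
- "Consider renaming `tmp` to something more descriptive."
- "This function could be split into two smaller helpers."

Only produce comments like the ADDRESSED examples above."""
```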

Attempt 2: LLM-as-judge

Since we couldn’t stop the LLM from producing nit comments in the first place, we thought we’d add a filtering step: have an LLM rate the severity of each comment+diff pair on a scale of 1-10, and simply drop any comment rated below 7.
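In code, the filter looked roughly like this (a sketch assuming the OpenAI Python SDK; the model choice and prompt wording are illustrative):

```python
# Sketch of the LLM-as-judge filter, assuming the OpenAI Python SDK.
# Model choice and prompt wording are illustrative, not Greptile's own.
from openai import OpenAI

client = OpenAI()

def severity(comment: str, diff: str) -> int:
    """Ask a second LLM to rate a (comment, diff) pair from 1 to 10."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Rate the severity of this code review comment from 1 "
                "(pure nitpick) to 10 (critical bug). Reply with only "
                f"the number.\n\nDiff:\n{diff}\n\nComment:\n{comment}"
            ),
        }],
    )
    return int(resp.choices[0].message.content.strip())

def keep(comment: str, diff: str) -> bool:
    return severity(comment, diff) >= 7  # drop anything rated below 7
```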

Sadly, this also failed. LLMs turn out to be nearly random judges of their own output. It also made the bot much slower, since every comment now required an extra inference call.

Running out of ideas

At this point we were running low on ideas. We had learned three things:

  1. Prompting doesn’t solve this
  2. LLMs are poor judges of severity
  3. Nits are subjective – what counts as one varies from team to team

The third learning pointed us in the right direction: the bot needed to somehow infer where each team draws the line on nits, and then filter comments accordingly.

We considered fine-tuning, but cost, speed, and loss of portability (Greptile would no longer be model-agnostic) ruled it out.

Final attempt: Clustering

As a final attempt, we started generating embeddings of past comments that each team’s developers had addressed/upvoted or downvoted, and storing them in a per-team vector database. The idea: block any new comment that closely resembles a minimum number of downvoted past comments.

When Greptile generates a comment, we embed it and run it through a simple filter (sketched in code after this list):

  • If the comment’s cosine similarity to at least 3 distinct downvoted past comments exceeds a threshold, it is blocked.
  • If the same holds for at least 3 upvoted/addressed past comments, it passes.
  • If neither condition is met, or both are, the comment passes.
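Here is a minimal sketch of that filter. The embedding model, similarity threshold, and in-memory vector banks are stand-ins for a real vector database.

```python
# Minimal sketch of the clustering filter. The embedding model, the
# threshold, and the in-memory "vector banks" are illustrative stand-ins.
import numpy as np
from openai import OpenAI

client = OpenAI()
SIM_THRESHOLD = 0.85  # illustrative; tuned per deployment in practice
MIN_MATCHES = 3

def embed(text: str) -> np.ndarray:
    """Embed a comment and normalize it so dot product = cosine similarity."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    v = np.array(resp.data[0].embedding)
    return v / np.linalg.norm(v)

def should_post(comment: str, downvoted: list[np.ndarray],
                upvoted: list[np.ndarray]) -> bool:
    """Block a comment only if it resembles enough downvoted past comments
    and does not similarly resemble upvoted/addressed ones."""
    v = embed(comment)
    matches = lambda bank: sum(float(v @ u) > SIM_THRESHOLD for u in bank)
    bad, good = matches(downvoted), matches(upvoted)
    if bad >= MIN_MATCHES and good < MIN_MATCHES:
        return False  # looks like a known nit for this team: block it
    return True  # passes if it matches neither bank, or both
```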

Results

Remarkably, this worked! It turns out most nits fall into a fairly small number of clusters per team. Developers downvote nit comments, and once enough comments of a given type have been downvoted, the bot can filter out new comments of that type.

Within two weeks of launching this feature, existing teams saw their address rate (the percentage of Greptile comments addressed by developers before merging) climb from 19% to over 55%. While far from perfect, it is easily the most effective technique we have found for reducing LLM noise.

This remains an ongoing problem for us, and I may write a part two if we’re lucky enough to see another jump in address rate!
