Using AI tooling for code review

After @instagibbs reported success in using Claude for rebasing ln-symmetry, I had a go at messing with it too. While it seems very useful in general, I think one place that’s worth talking about is using it as a review aid: PR review has long been recognised (eg) as a bottleneck in Bitcoin Core at least, and AI is certainly lowering the barrier to creating new PRs, which won’t make things easier.

I think there’s a few different goals when reviewing a PR that are worth being a bit explicit about:

  • does this actually do what it’s supposed to?
  • is doing this a good thing? if it is a good thing, is it worth the cost? are there more important things that should take priority?
  • could this be done better?
  • are the changes properly tested?
  • do I understand the change well enough to fix any bugs in it that get found later?

With this in mind, I found two pretty strong biases that I didn’t think were very helpful:

  • Claude would largely take it for granted that the idea was good and worthwhile, and only review correctness.
  • Claude would largely try to do everything itself – form an opinion and state a conclusion and even offer to post that as a PR review, sometimes even before I’d had a chance to look at the commits myself.

I think the “is it worthwhile” one is probably just hard; it’s not that easy for humans either really. My test case for that was PR#34444, and as per the comment, even if its review contribution wasn’t useful, the exercise was still helpful for me overall.

But the “I’ll just do everything for you” aspect doesn’t seem so helpful – if you want to improve review, you want to add “brainpower” to the review process, not subtract it. So I’ve looked at a few approaches for this:

  • telling it to run a sub-agent and do a review, but keep the results of the review private until I’ve done my own review, so that we can then compare them (it immediately summarised the results when the agent finished, of course)
  • telling it that I’m stepping through each commit, and to quiz me on anything that seems complicated in the commit (this worked pretty well, and found a couple of things I didn’t understand fully). It helps to disable “show tips” via /config, though, or else the quiz answers will appear as suggested responses…
  • asking it for help doing things when I’m stepping through – in particular, I’ve always found it hard to get code coverage reports and iwyu/clang-format results; telling the AI to do that for me has been great
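To give a flavour of the last item, the formatting check is the easy half – Bitcoin Core ships clang-format-diff.py in contrib/devtools – while the coverage half below is only a rough sketch (the Coverage build type and the gcovr invocation are assumptions; exact flags and targets depend on your build setup):

```shell
# Check only the lines the branch touches against the project's .clang-format
git diff -U0 upstream/master...HEAD -- '*.cpp' '*.h' \
    | ./contrib/devtools/clang-format-diff.py -p1

# Coverage sketch: instrumented build, run the tests, summarise with gcovr
cmake -B build-cov -DCMAKE_BUILD_TYPE=Coverage
cmake --build build-cov -j"$(nproc)"
ctest --test-dir build-cov
gcovr --root . build-cov
```

Having the AI drive these and then interpret the uncovered lines is where most of the value was for me.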

I’m not sure how much it helped, but telling it about various resources related to the PR and having a sub-agent summarise those as a resource to consult before beginning the review might be a good idea.

Using it as a project manager (“oh, I should do X, remind me about that when I’m done with Y”) was also very helpful. I’ve gone so far as to feed it my various blog posts on Bitcoin so it can help prioritise tasks at a higher level too, and so far it seems like that works okay too.

Anyway, I used that approach very heavily in my review of PR#34257 and PR#34023 for cluster mempool related improvements, and found it pretty effective. The AI’s initial self-review almost immediately spotted an inverted condition in a test, and then having the AI assist with code-coverage reporting on the fuzz corpus helped validate that it really was behaving as it seemed. For the SFL optimisation stuff, I tried getting it to have a sub-agent review the code prior to the PR and suggest its own optimisations and then see if those matched the PR, or if anything was missing. It did claim to pick out the most valuable change, while missing most of the minor ones, but that didn’t really add anything to the review.

Probably the key bits of my overall prompting are these (mostly copied from a tweet I saw or self-generated by Claude):

## Core Principles

### Simplicity First
- Make every change as simple as possible. Impact minimal code.
- Find root causes. No temporary fixes. Senior developer standards.
- Changes should only touch what's necessary. Avoid introducing bugs.

### Plan Before Executing
- Enter plan mode for ANY non-trivial task (3+ files or architectural decisions)
- If something seems off, STOP and re-plan immediately - don't keep pushing
- Write detailed specs upfront to reduce ambiguity

### Verification Before Done
- Never mark a task complete without proving it works
- Run tests, check logs, demonstrate correctness
- Ask yourself: "Would a staff engineer approve this?"

### Self-Improvement Loop
- After ANY correction or finding: update `lessons.md` with the pattern
- Write rules that prevent the same mistake
- Review lessons at session start

### Autonomous Investigation
- When given a potential bug: just investigate it, don't ask for hand-holding
- Point at logs, errors, failing tests - then resolve them
- Zero context switching required from the user

### Interactive Assistance
- Claude assists; the user drives. Don't produce finished artifacts and
  offer to post them. Present work incrementally and wait for direction.
- When in doubt about the next step, ask rather than assume.

### Demand Elegance (Balanced)
- For non-trivial changes: pause and ask "is there a more elegant way?"
- Skip this for simple, obvious fixes - don't over-engineer

Now that I actually look at that, I guess the “Autonomous Investigation” and “Interactive Assistance” sections are in some conflict there. Oh well!

It’s not currently clear to me what resources from a session would be interesting to publish – so far I haven’t personally looked much at either the docs or the project files that it generates; I’m more just using them as the AI’s long-term memory so we can both pick back up where we left off. For conversations with ChatGPT and the like, just posting a log of the conversation can make sense, I think (eg), but it’s not clear to me what would be similar for an agent-based project.

(For those interested I’m running it in a dedicated VM, with a read-only github token for looking up PRs, and no push access/etc. I haven’t tried running with local models, or Codex or OpenClaw etc)

Anyway, thought I’d share. Anyone else trying something similar?


FWIW I’ve been heavily using Claude Code for review as well. I love your quiz idea, which I have not tried. re:SFL review, I found the biggest gain was in its ability to quickly make (counter) examples for diagrams, since I am pretty bad at making examples in general. So when I hit a conditional I don’t quite understand, I can just ask and it seems to nail it 100% of the time (maybe after long backtracking thought strings where it realizes the examples don’t hold, but it gets there!).

In general, asking LLMs verifiable things is the biggest gain, so things like counter-examples fit in really nicely in that paradigm.


I found this interesting: https://github.com/spaceraccoon/vulnerability-spoiler-alert

Reasons:

  1. It’s not easy to fix vulnerabilities in open source projects like Bitcoin Core.

    https://xcancel.com/fanquake/status/1977396233237864530 https://xcancel.com/christine_dkim/status/1978121490336796945

  2. The results in this experiment look good enough based on some of the repositories it scans every day.

Questions:

  1. How would this affect bitcoin core and other open source bitcoin projects?
  2. Has there ever been an incident in bitcoin core where a vulnerability was exploited before a release because it was easy to spot the fix in the commits?

Note: Such reviews can also be used to find bugs in the commits and not just the bug fix.


I have also been experimenting with almost exactly this using a review-core command in Claude Code. It’s certainly a long way from perfect, and if I’m honest it does spot a few too many false positives for my liking, but it is reasonably thorough, and often provides a decent starting spot to go from, or more regularly a sanity check against my own thoughts after I’ve done my own review. I haven’t decided yet if I prefer to run this before or after my own review, as both seem to have their merits (and downsides)…

The command I am currently using is:


# Bitcoin Core Code Review

You are assisting the user's review of Bitcoin Core code — security-critical software managing billions in value. Surface findings and concerns for the user to evaluate; don't produce a finished review or draw final conclusions on their behalf.

## Review Process

1. First, fetch upstream and identify what is being reviewed:
   ```bash
   git fetch upstream
   ```
   - If no argument provided, review the current branch's changes against `upstream/master`
   - If a PR number is provided, fetch and review that PR

2. Fetch PR context from GitHub (if reviewing a PR):
   ```bash
   gh pr view <number> --repo bitcoin/bitcoin
   ```
   The PR description often contains valuable context: motivation, design decisions, and related issues/PRs. Read it carefully before diving into code.

3. Find the merge base, then gather context by examining the diff:
   ```bash
   git merge-base upstream/master HEAD
   ```
   ```bash
   git diff <merge-base>..HEAD
   ```
   If the merge-base equals HEAD (PR already merged), diff against the parent of the first PR commit instead.
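   One sketch for the merged case (illustrative — assumes the default merge-commit message format):
   ```bash
   # Find the merge commit for the PR, then diff its two parents
   git log --merges --oneline --grep="#<number>" upstream/master
   git diff <merge-commit>^1 <merge-commit>^2
   ```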

4. Apply the checklist below systematically

## Priority Checklist

### Concept & Justification
- Is this change worth making? Don't take the premise for granted — question it
- Does the implementation actually achieve its stated goal?
- Is the benefit proportional to the complexity and review burden?

### Security & Consensus
- Could this change affect consensus behavior? Flag for extensive review
- Check for integer overflows, memory safety issues, undefined behavior
- Verify all user/network input is validated before use
- Look for DoS vectors (unbounded memory/CPU, slow paths attackers can trigger)
- Ensure cryptographic code uses constant-time operations where needed

### Threading & Locking
- Verify lock annotations (`EXCLUSIVE_LOCKS_REQUIRED`, `LOCKS_EXCLUDED`) are correct and complete
- Check lock ordering—would this introduce potential deadlocks?
- Confirm `AssertLockHeld()` calls match annotations
- Flag any use of `RecursiveMutex` where `Mutex` would suffice

### Code Quality
- Does the change do one thing well, or is it trying to do too much?
- Does this fix the root cause, or just the symptom?
- Are there unnecessary changes mixed in (refactors bundled with behavior changes)?
- Is the commit history logical and bisectable?
- Do commit messages explain *why*, not just *what*?
- Could a reviewer who didn't write this code maintain and debug it later?

### Testing
- Are there unit tests for new/changed logic?
- For RPC/P2P changes, are there functional tests?
- Do existing tests still make sense, or do they need updating?
- Consider edge cases: empty inputs, maximum values, malformed data

### Style & Conventions
- Named arguments for booleans: `Foo(/*flag=*/true)` not `Foo(true)`
- Prefer `std::optional` over sentinel values
- No `std::map::operator[]` for reads (use `.find()` or `.at()`)
- Initialize all members at declaration

## Design Analysis

After applying the checklist, step back and analyze the PR's approach at a higher level. Generate and answer these questions in your review under a "### Design Analysis" heading, after the line-by-line findings and before the summary.

### Motivation & Threat Model
- What problem or attack does this change address? Explain the threat model or gap being closed.
- Why does Bitcoin Core specifically benefit from this? Connect to concrete risks (DoS, privacy, censorship resistance, etc).
- If this is a defense/hardening measure, what's the residual risk with and without it?

### Approach Robustness
- Could the chosen approach produce incorrect behavior under any scenario? Walk through failure modes explicitly.
- What assumptions does the approach rely on? Are those assumptions guaranteed or merely conventional?
- If there's a fallback/retry mechanism: can it loop infinitely, mask real errors, or leave inconsistent state?
- What happens when upstream dependencies (Tor, OS, libs) change behavior? Is the approach brittle to version changes?

### Alternative Approaches
- Is there a simpler or more robust way to achieve the same goal? Consider what the PR chose *not* to do and why that tradeoff is or isn't justified.
- Would a different approach avoid failure modes present in the current one?
- If the current approach is the best one, say so and briefly explain why the alternatives are worse.

### Strategic Assessment
- What maintenance burden does this create? Will this need updating as dependencies (Tor, libraries, protocols) evolve?
- Does this establish a pattern or precedent that future PRs will follow? Is that a pattern we want?
- Is this the right layer to solve this problem? Should it be handled upstream, downstream, or by configuration instead of code?
- For Bitcoin Core specifically: does this increase attack surface, trust assumptions, or dependency coupling in a way that's disproportionate to the gain?

Be specific — name the scenarios and tradeoffs, don't just say "this could go wrong" or "there might be a better way."

## Output Format

For each concern found, report:

```
**[SEVERITY]** `file:line` - Brief description

Issue: What's wrong
Suggestion: How to fix it
```

Severities:
- **BLOCKING** - Must fix before merge (security, correctness, consensus)
- **SHOULD-FIX** - Strong recommendation (bugs, significant issues)
- **NIT** - Minor style or preference (optional to address)

## Summary Section

End with a summary:
- What the PR accomplishes (1-2 sentences)
- Overall assessment (ACK with suggestions / needs work / concept concerns)
- Key concerns if any
- What was done well

Be direct but constructive.


It’s kind of a mixture of things I specifically want it to consider, developer-notes.md, and various other additions/removals.

I have in a global CLAUDE.md things like:

### Self-Improvement Loop
- After ANY correction or finding: update `lessons.md` with the pattern
- Write rules that prevent the same mistake
- Review lessons at session start

I uploaded a few of my settings to GitHub at willcl-ark/.claude in case any are of interest.

One thing I definitely need to step up though is running it more containerised than I do currently. I’ve seen it firsthand ignore disallowed commands, or run them in alternative ways to bypass restrictions, and I don’t believe for a moment that it will e.g. follow the intent of never reading a .env file; I’ve seen it iterate over the environment in Python before, for example.

I like it. I think it probably ought to be possible (and relatively easy once you figure out how) to set up a collection of background info and prompts that you could stick in an .md file on GitHub, and have Claude/ChatGPT act as an automatic pr-review-club with 24/7 availability – possibly even just Claude/ChatGPT on the web without any local setup, maybe even usable via a free tier? I keep finding other things to work on, so haven’t tried this yet.

I have also been using Claude for code review for the last ~6 months. In my opinion, it is extremely good on technical, localized issues but still limited when it comes to more global issues.

My usual approach is not to give it an elaborate generic CLAUDE.md, but to have a highly interactive session, usually going commit by commit: When I just ask it to review a PR carefully, there are many types of issues it won’t find - but when I combine it with my own knowledge and intuition to nudge it towards certain areas (which is very PR dependent), even if I can’t formulate it very well, it gets much deeper into things and will usually work out the details on its own and get to a solid assessment whether there is an actual issue with the code.

So my experience with code review is similar to what Terence Tao just said about using AI for math problems:

Additionally, a lot of AI companies have this obsession with push-of-a-button, completely autonomous workflows where you give your task to the AI, and then you just go have a coffee, and you come back and the problem is solved. That’s actually not ideal. With difficult problems, you really want a conversation between humans and AI.


I don’t think the CLAUDE.md file really helps for review at all. It seems to add more value for coding.

I have also been experimenting with making GitHub review slightly less painful.

For the past few weeks I’ve been building ACKtopus (l0rinc/ACKtopus on GitHub), a fairly opinionated Tampermonkey browser userscript specialized for Bitcoin Core PR review.

The goal is mostly to patch over the most annoying GitHub review friction, plus add Bitcoin-Core-specific workflow helpers and some LLM-assisted tools:

  • copying the latest hash for ACKing
  • copy command to run only modified tests/benchmarks or range diff
  • copying PR commit/comment context for external LLMs
  • show inline commit/PR summaries and explanations
  • selection-based explain/fact-check/proofread actions
  • pending-review helpers
  • range-diff/re-review helpers
  • auto-collapsing unrelated files after force-push compares
  • commit-hash prefixes in draft review comments
  • asking LLM questions about comments, with navigation
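For the first item, the same hash can also be grabbed straight from the CLI (a sketch, assuming gh is available):

```shell
# Print the PR's current head commit hash, ready for an ACK
gh pr view 12345 --repo bitcoin/bitcoin --json headRefOid --jq .headRefOid
```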

It’s far from stable and I would not recommend relying on it for review judgment, but for me, it is mainly a review accelerator and a way to reduce GitHub friction. I’m sharing it because some of the individual features may still be useful to others, even if only as ideas to copy into their own setup rather than using the whole thing.

Given how scarce review is (and how capable LLMs have become), I think it makes sense to use every tool available to make review easier and safer.


I’ve been having success the last two weeks using openclaw to assist in managing PR reviews. I’ve since set up an agent for each of my open source projects.

This agent specifically is responsible for giving me alerts when PRs I am interested in need my attention. It also proactively provides me a thorough review of the changes since I last commented on the PR, so that I am prepared when pulling up GitHub. It saves what it’s reviewed and the latest commit hashes in memory so that every time it does its scan it can give me intelligent updates.

Additionally, I have the agent set up to automatically rebase my commits whenever there is a new conflict. I am having a lot of success with that. It can accurately rebase and update the PR, and add a comment that a rebase was completed. I did learn to make it clear that it can never delete tests that are in conflict, and to always validate through building and running all relevant tests.
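The validate-before-pushing step might look roughly like this (illustrative — build and test commands depend on the project):

```shell
# After an automated rebase: rebuild, run the tests, and only then update the PR
git rebase upstream/master
cmake --build build -j"$(nproc)"
ctest --test-dir build --output-on-failure
git push --force-with-lease origin HEAD
```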

This is very much a WIP and I would like to hear if others are having similar successes with openclaw/codex automations. I will be trying to slim down my AGENT.md, SOUL.md, and the cron job prompt as I discover what the most effective version of each is. Designing these files seems more art than science right now, but I’m going to try to integrate the things you all have mentioned here as well.