A Bitcoin-native LLM: dataset, architecture and open questions

Following up on my own question to @0xB10C: I inspected the dump directly, and the answer is better than I hoped.

The backup preserves full anchoring. Each review comment in pulls/{n}.json carries diff_hunk (the exact code fragment under review), path, line, commit_id/original_commit_id (so the anchoring survives rebases and force-pushes), in_reply_to_id for reply threading, and pull_request_review_id for grouping by review session. PR-level events (commits, force-pushes, review submissions) are in the same file.
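To make the field list above concrete, here is a minimal Python sketch that picks out the anchoring fields and groups comments by review session. It assumes the review comments have already been loaded from pulls/{n}.json into a list of dicts; where exactly they sit inside the file is not shown here, and anchor_of and group_by_review are hypothetical helper names of my own.

```python
from collections import defaultdict

# Anchoring fields named in the post. Whether every one of them is non-null
# on every comment is for the data to confirm, so missing ones become None.
ANCHOR_FIELDS = ("diff_hunk", "path", "line", "commit_id", "original_commit_id")


def anchor_of(comment):
    """Return only the anchoring fields of one review comment."""
    return {field: comment.get(field) for field in ANCHOR_FIELDS}


def group_by_review(comments):
    """Group review comments by pull_request_review_id (one review session each)."""
    sessions = defaultdict(list)
    for comment in comments:
        sessions[comment.get("pull_request_review_id")].append(comment)
    return dict(sessions)
```

Comments without a pull_request_review_id end up under the None key rather than being dropped, which keeps the grouping lossless.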

I sampled PRs across the full history: 2014 (#5159), 2015 (#6312), 2016 (#8149, SegWit), 2018 (#15006), 2020 (#19988, ~540 review comments), and 2023 (#28122, BIP352). Every single review comment carries its diff hunk. The format is uniform across twelve years.

Concretely, this means (code, critique, resolution) triples can be extracted directly from the JSON: diff_hunk gives the code, body gives the critique, and the resolution can be reconstructed from the reply thread plus subsequent commits in events. No re-alignment against git history is needed. As a bonus, #35506 (the AssumeUTXO PR from @l0rinc’s attribution example) is in there with the same structure, so the contested-topics angle is covered by the same corpus.
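The extraction described above could look roughly like this. It is a sketch under assumptions, not a tested pipeline against the real dump: it takes a list of review-comment dicts that is already in chronological order, assumes each has an "id" field alongside the fields named earlier, and only covers the reply-thread half of the resolution. Matching threads to subsequent commits in events is left out because I am not asserting the event schema here. extract_triples is a hypothetical name.

```python
from collections import defaultdict


def extract_triples(comments):
    """Return one dict per review thread: code, critique, and reply bodies."""
    by_id = {c["id"]: c for c in comments}

    def root_id(comment):
        # Follow in_reply_to_id up to the comment that opened the thread.
        # `seen` guards against a malformed cycle in the data.
        seen = set()
        while comment.get("in_reply_to_id") in by_id and comment["id"] not in seen:
            seen.add(comment["id"])
            comment = by_id[comment["in_reply_to_id"]]
        return comment["id"]

    # Collect reply bodies under the id of the thread-opening comment.
    replies = defaultdict(list)
    for c in comments:
        rid = root_id(c)
        if rid != c["id"]:
            replies[rid].append(c.get("body", ""))

    triples = []
    for c in comments:
        if root_id(c) == c["id"]:  # thread opener
            triples.append({
                "path": c.get("path"),
                "code": c.get("diff_hunk"),
                "critique": c.get("body"),
                "resolution_thread": replies.get(c["id"], []),
            })
    return triples
```

A thread with no replies still yields a triple with an empty resolution_thread, so unresolved or silently fixed comments can be filtered or handled separately later.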

This moves the bitcoin/bitcoin dump from “interesting source” to primary corpus candidate in my view. Thanks again for the pointer.