Dust Expiry: Clean the UTXO set from spam

[quote=“RobinLinus, post:22, topic:1707”]
Wouldn’t that require everyone to resync the chain, though?
[/quote]

Just a reindex, which has been required for upgrades before. Only a very small percentage of nodes are pruned, and otherwise the process could just be done invisibly in the background, so long as you upgraded before the requirement was active. (Even if pruned it could be done invisibly in the background too, though obviously with a bandwidth cost.)

If the node was pruned, didn’t want to pay the bandwidth cost, and didn’t mind a reduction in autonomous security, it would be easy to make the values importable.

A lot of these dust outputs are created by quite large transactions, so constructing this with a txout proof makes the proposal easy to argue against. Some (though, sure, not many) are created by transactions so big that the txout proof wouldn’t fit, so that form is confiscating. The point that the value of the outputs is small only gets you so far, because some imagine bitcoin worth millions per coin in the not-distant future, so I think it’s important that the non-confiscation is sincere and not just ‘technically true’. Besides, to the extent that the coins do get spent, we wouldn’t want them wasting the block weight cap reinserting over and over again junk that is already in the chain. :slight_smile:

I think a policy something like this could be interesting:

  • give utxos a score based on their value in sats and how long they’ve been in the utxo set, eg score = utxo_value - floor((tip_height - utxo_height)/2016) * 4000 (see the sketch after this list)
  • when the score for a utxo drops below zero, move it from the utxo set into an accumulator, and require an accumulator-inclusion proof when spending
  • treat the accumulator-inclusion proof as free as far as block weight limits are concerned
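
To make that concrete, here’s a minimal sketch of the rule in Python (the names and structure are just illustrative; only the 4000-sat and 2016-block constants come from the list above):

```python
# Illustrative sketch of the scoring rule above; constants/names are placeholders.
SCORE_SATS_PER_PERIOD = 4000   # value "burned off" per retarget period
PERIOD_BLOCKS = 2016           # one retarget period (~2 weeks)

def score(utxo_value: int, utxo_height: int, tip_height: int) -> int:
    """Value in sats, minus 4000 sats per full 2016-block period spent in the utxo set."""
    periods = (tip_height - utxo_height) // PERIOD_BLOCKS
    return utxo_value - periods * SCORE_SATS_PER_PERIOD

def in_accumulator(utxo_value: int, utxo_height: int, tip_height: int) -> bool:
    """A utxo is moved to the accumulator once its score drops below zero."""
    return score(utxo_value, utxo_height, tip_height) < 0

# A 3,000 sat output created at height 0 is still in the utxo set after ~1 week...
assert not in_accumulator(3_000, utxo_height=0, tip_height=1_000)
# ...but has been moved to the accumulator after two weeks.
assert in_accumulator(3_000, utxo_height=0, tip_height=2_016)
```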

That’s non-confiscatory, not only in the sense that any spendable utxo remains spendable, but also remains spendable with no increase in required fees.

(Having it be costed per byte at some ratio to witness bytes is also plausible, but I think it would be slightly annoying – the proofs required could change at each block, depending on how the accumulator changes, and could grow in size, resulting in your tx’s feerate decreasing unexpectedly. Having it be costed as a flat per-accumulator-input could also work, but that would still discourage cleanups of utxos in the accumulator, which might be something we’d prefer to encourage?)

Maintaining the accumulator-inclusion proof can be expensive (it may need to be updated every time the accumulator is updated, which requires maintaining a full node), however this can be outsourced, and can also be calculated from scratch at any point by reindexing the blockchain. I think a node running on consumer-level hardware could reasonably maintain a full set of utreexo accumulator-inclusion proofs for every utxo created in the past 20 years indefinitely, with costs only increasing roughly linearly as you bump the number of years up.

As of block 897,666, that rule would have the following impact:

  • currently, there are 171,366,598 utxos, using ~12GB of disk, with 19,867,488.3479_1527 bitcoins worth of value
  • after discarding based on the above rule, there would be 32,694,853 utxos remaining (ie reducing the utxo set size by ~81%, to perhaps 2GB or 3GB). The value of discarded utxos would total 34,761.7225_9465 bitcoins; about 25,000 sats each on average
  • under this rule, a utxo worth less than 4000 sats (~$4 currently) would be dropped after two weeks, and a utxo worth less than 104,000 sats (~$100 currently) would be dropped after a year
  • about 27% of utxos were created in the past year (since block 845520); of those, 68% would be discarded under the above rule
  • in the 1000 blocks prior to 897,666, there were only 31 spends of utxos that would have been in the accumulator under this rule out of 7,304,674 total spends (0.00042%). Those spends were across 18 blocks (1.8%). No doubt there are periods where that number is much higher, though.

In the context of utreexo, I think this is the interesting part: the downside of utreexo is that every user has to construct/maintain proofs in order to broadcast their transactions, or rely on bridge nodes to do the translation for them; but full bridge nodes are very expensive to run, and will only get more so as time goes on, which doesn’t seem ideal from a privacy/centralisation/efficiency point of view, at least to me.

In contrast, just doing it for old/low-value outputs that are rarely spent anyway means that most users are entirely unaffected, and that bridge nodes are only needed for people spending anything in the masses of low value outputs that are causing the problem in the first place. Likewise it reduces the bandwidth impact of adding proofs to the chain by quite a lot – you’re no longer adding proofs for utxos that are spent quickly, or that were high value, which is most spends. (With the parameters set at 4000 sats and 2016 blocks, for example, a 1.0 BTC utxo would not get moved to the accumulator until it had sat unspent in the utxo set for ~960 years.)

The downside, compared to utreexo, is that recent/high-value utxos are still stored individually, which still requires a database of a few GBs, rather than just the accumulator of a few kB. I believe that a value - floor(age/N)*V >= 0 calculation, along with the supply cap and blocksize limit, also implies a hard upper limit on the size of the utxo set, but I’m not sure offhand how you’d calculate it.

As far as implementation goes, the score calculation above could be rearranged to instead calculate acc_height = (floor(value/4000) + 1)*2016 + height – the height at which the score becomes negative and the utxo should be moved into the accumulator. This could either be maintained as a separate index (if our utxo db supported that), or you could scan through the utxo set for any potentially affected utxos during the 2015 blocks where nothing changes.
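
As a sanity check, here’s a sketch of that rearrangement (same illustrative constants as before), verifying against the original score formula that acc_height really is the first height at which the score goes negative:

```python
# Sketch of the rearranged rule: precompute the first tip height at which the
# score goes negative, so utxos can be indexed/queued by that height.
SCORE_SATS_PER_PERIOD = 4000
PERIOD_BLOCKS = 2016

def acc_height(value: int, height: int) -> int:
    """First tip height at which value - floor((tip - height)/2016)*4000 < 0."""
    return (value // SCORE_SATS_PER_PERIOD + 1) * PERIOD_BLOCKS + height

def score(value: int, height: int, tip: int) -> int:
    return value - ((tip - height) // PERIOD_BLOCKS) * SCORE_SATS_PER_PERIOD

# Spot-check a few values against the score definition.
for value in (0, 3_999, 4_000, 104_000, 100_000_000):
    tip = acc_height(value, height=0)
    assert score(value, 0, tip) < 0        # negative once acc_height is reached...
    assert score(value, 0, tip - 1) >= 0   # ...but not one block earlier
```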


Accumulator proofs that require updating, though, are problematic – I think probably a lot worse than them costing weight, even, particularly because they’re quite expensive for anyone who just wants to track all of them (an “index node”?). The idea that coins can be excluded from them is a good one, though if the criteria are too complicated I think people will justifiably worry that their coins will fall into them, and then they are even worse off because they didn’t track the proofs. That’s why I was favoring a simple absolute limit: if your coins are above this value they won’t get subjected to the accumulator. Also, the non-accumulated utxo set size has a fixed upper limit when you do that – the number of coins divided by that limit.

(Depending on how it’s set one could even bite the bullet and preallocate the usage!)

I wonder if it would be at all interesting to hybridize… E.g. use a txout-style proof that never changes for the coin’s identity, and a utxotree-style proof for just the spentness bit. The total proof size would be increased, but nodes that maintain spentness wouldn’t need the spentness proof. (And probably the utxotree proof is smaller than you imagine, since many spentness commitments could be stuffed under one leaf.)

I think that’s a good idea. Dust Expiry should reflect a balanced principle: while every UTXO imposes a cost on the system, non-dust UTXOs also provide value—since the holder’s conviction contributes to Bitcoin’s overall worth. The mechanism should penalize only those UTXOs so small that their cost clearly outweighs their contribution.

Do you mean an accumulator scheme that allows state updates using an inclusion proof? Otherwise, if updating requires knowledge of the entire set, the accumulator introduces unnecessary overhead.

Also, the set of unspent expired UTXOs isn’t actually that large: location pointers of the form

pointer = (block_height, tx_index, output_index)

can be naively encoded in under 8 bytes. That amounts to roughly 800 MB for the ~100 million spam UTXOs created in recent years.

Moreover, the set can be compressed significantly. First, since the order doesn’t matter, you can sort the entries to reduce entropy. Second, both block_height and output_index have quite low entropy and can be compressed efficiently. So you can likely get away with 4 bytes per entry. Perhaps even less than 2 bytes.
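
A rough sketch of both steps, with illustrative assumptions (3-byte height, 2-byte tx index, 2-byte output index for the naive encoding; a simple sort-then-delta pass standing in for the compression):

```python
# Sketch of the naive fixed-width encoding and the "sort, then delta-encode"
# idea above; the field widths are assumptions, not a proposal.
def pack_pointer(block_height: int, tx_index: int, output_index: int) -> bytes:
    """Naive fixed-width encoding: 3 + 2 + 2 = 7 bytes per pointer."""
    return (block_height.to_bytes(3, "big")
            + tx_index.to_bytes(2, "big")
            + output_index.to_bytes(2, "big"))

def delta_encode(pointers: list[tuple[int, int, int]]) -> list[tuple[int, int, int]]:
    """Order doesn't matter, so sort first; most block_height deltas are then zero
    or tiny, so a varint/entropy coder can get well under 8 bytes per entry."""
    out, prev_height = [], 0
    for height, tx_index, output_index in sorted(pointers):
        out.append((height - prev_height, tx_index, output_index))
        prev_height = height
    return out
```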

And since this cleanup mechanism disincentivizes UTXO spam attacks, the set likely stops growing so rapidly.

I agree that it can be compressed. The interest in exploring the applicability of commitments is that this scheme still leaves the UTXO cost O(N), just with a better constant factor.

It’s interesting to consider how that could evolve over time, e.g. once the reduced cost justifies doing something about it. Like, the utxotree stuff results in a kilobyte of state, against which 800MB (plus the non-dust utxo set) is very large indeed. Particularly where someone may want a circuit that validates (most of) a block, that difference matters. But the more efficient solutions have the problem of updatability, while the txo suggestion needs only a static proof, which is a big win.

At least I’ve always found it useful to try thinking a couple steps ahead of any proposal; sometimes it results in changing how I think about it.

That seems somewhat dubious to me; I could argue for that side, but it feels like a stretch.


Yeah, simple might be fine, though that becomes a different sort of dust limit, and maybe wallets will just decide “oh, I’ll ignore the complexity by just pretending values below that amount are unspendable / don’t exist at all”, making it more confiscatory in practice than in theory.

I don’t think it actually gives a very nice fixed upper limit; sending utxos at 20,000 sats ($20) or below straight to the accumulator still lets you potentially have 105 billion utxos at 20,001 sats each, using up about 7 terabytes of disk, assuming the entire 21M BTC is used for the attack. Of course, applying a budget constraint to the attack might fix that; if you’re discarding anything under as little as 4000 sats, then every additional 10GB added to the utxo set implies that ~5714 BTC ($570M) has been reserved by the attacker, and at 1sat/vb, I think recovering those funds would cost ~81 BTC ($8M, ~1.4% of the funds used).
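
For what it’s worth, here’s the back-of-the-envelope arithmetic behind those figures, assuming ~70 bytes of disk per utxo (consistent with the 171M utxos ≈ 12GB figure above) and ~57 vbytes per input when sweeping the funds back:

```python
# Back-of-the-envelope check of the figures above; the per-utxo disk usage and
# per-input vsize are assumptions, not measured values.
BYTES_PER_UTXO = 70        # ~12GB / 171M utxos
VBYTES_PER_INPUT = 57      # roughly a keypath taproot input

total_sats = 21_000_000 * 100_000_000
print(total_sats // 20_001)                            # ~105 billion utxos of 20,001 sats each
print(total_sats // 20_001 * BYTES_PER_UTXO / 1e12)    # ~7.3 TB of disk

utxos_per_10gb = 10**10 // BYTES_PER_UTXO              # ~143M utxos per additional 10GB
print(utxos_per_10gb * 4_000 / 1e8)                    # ~5714 BTC reserved at 4000 sats each
print(utxos_per_10gb * VBYTES_PER_INPUT / 1e8)         # ~81 BTC in fees to sweep at 1 sat/vb
```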

Adding the time factor in doesn’t really help here though, because if the entire network is conspiring to fill your disk, they can just create too many new utxos in each block and hit pretty high numbers before any time factor comes into play. So perhaps that also argues in favour of simplicity…

One thing maybe worth worrying about is if low-value utxos are used briefly as part of L2 systems (eg John Law’s systems, perhaps some of the Ark designs); being able to create a dust utxo then spend it quickly without having an extra layer of complexity thrust upon you is probably nice, and also fairly benign as far as everyone else is concerned.

Perhaps that just means that you should move things to the accumulator if they’re below 4000 sats and older than 1008 blocks, or something similar, though.

In the worst case, over ~1000 blocks, I think you could create somewhere around either 10M utxos (normal-ish transactions), 30M utxos (as many outputs as possible), or 100M utxos (dummy utxos that are either unspendable or anyone-can-spend, and probably use up less space individually in the utxo set), using up between 700MB and 7GB on disk.
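
Roughly, those numbers come from dividing ~1000 blocks’ worth of block space by an assumed vsize per created output:

```python
# Rough check of the worst-case numbers above; per-output vsizes are assumptions
# about what the spamming transactions might look like.
BLOCK_VBYTES = 1_000_000            # 4M weight units = 1M vbytes per block
budget = 1_000 * BLOCK_VBYTES       # ~1 week of fully used blocks

print(budget // 100)   # ~10M outputs from "normal-ish" transactions (~100 vb per new output)
print(budget // 31)    # ~32M outputs if blocks are packed with p2wpkh outputs (31 bytes each)
print(budget // 10)    # ~100M outputs from minimal dummy outputs (~10 bytes each)
```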

I guess something like “for the first 1000 blocks (1 week), you just need a txid/vout; after that, for the next 100,000 blocks (2 years), you need the txid/vout/block-hash/position-within-block to spend; and after that, you need to maintain/reconstruct a utreexo-style proof as well” could probably also work.

Honestly I think it would be the reverse – if the UTXO set could grow without placing a burden on nodes, there’s no longer the same argument for preventing dust outputs, so probably not really a reason to reject them from mining/relay, which in turn would invite them to be used more frequently in L2 designs (“this output’s available if we need it, for unusual paths when participants disappear or try to cheat or something else weird happens, but normally we’ll just ignore it and let it expire into the accumulator forever”), creating significantly more of them.

Which is less of an issue for log(n) scaling proposals, as log(n) is eventually a constant, and more of an issue for O(n) with a good initial constant!

On the other hand, the proof costing weight is plenty of incentive for the designs you mention to favor avoiding dust when they can.

On the subject of “your wallet might not implement stuff and abandon you” being functionally confiscatory: I think that’s a useful thing to consider, but I think it fundamentally doesn’t get you far, as we’ve seen exactly that happen with stuff like “scanning blocks is unusably slow for our SPV wallet, guess we’ll just shut down”. The standard for confiscation has to be, I think, an assumption that the user can rip out the private key and use it with something that does work, and that they’ll find it worth their time to do so.

Even Bitcoin Core has made it harder to use old wallets now; friction from software changing is just life.

As far as 7TB in storage costs goes, that doesn’t seem that immodest to me, thinking in the long term. It’s a pretty good improvement on literally infinite. :stuck_out_tongue:

The proof costing weight is a disincentive to spending dust, but not to creating it, though, so mostly makes this worse, rather than better, as far as I can see?

No effect on dust as a semaphore, I think. It disincentivizes creating many tiny outputs when you could create fewer larger ones, under the assumption that they’re actually going to be spent eventually and not just used for sequencing.

There was a time I would have been amenable to expiring near-dust UTXOs or introducing a “UTXO rot”, but since then alternative node designs have been proposed that either entirely remove the burden of maintaining a UTXO set or simply ditch the UTXO set.

Currently I lean towards the UTXO set being the problem we should do away with, not something we have to protect from bloating. Therefore this seems like something of no long-term importance for the time being. The considerations we would also have to make are related to the future block size, the parallelization of validation, and various technical bottlenecks in validation.

If I understand the tradeoffs correctly, Utreexo removes the need to involve IO in the validation/sync process, as it can be performed in memory; and if blocks stay at this size, then lack of parallelization is not a major concern. If sync is mainly IO-bound and not CPU-bound, and network speeds are no obstacle, then the accumulators solve our problem of syncing up nodes with full validation without the barbarous disk murder ritual.

It comes at a different cost: you have to have a constantly updated proof to spend the coins. You can’t go offline and spend coins you know you have, SPV-style, without having someone who has validated the chain provide you with an updated proof. And running an ‘indexer’ node that monitors proofs for all coins is very costly compared to running a node today. That model also increases the bandwidth needed for relaying transactions and blocks considerably. I think (but am not sure) that you can’t avoid the bandwidth increase by just having a today-style utxo set, because you’d need to provide proofs to non-local-utxo peers.

There is a reasonable argument that the asymptotic behavior means it will eventually be needed, but I think it’s harder to argue that it’s a good tradeoff now. So I think that makes alternatives with other tradeoffs interesting to discuss.


I was thinking about IBD / initial sync done via Utreexo-style accumulators in memory. This of course requires aid from nodes that can serve the blocks in that enriched form. However, this just means the blockchain is validated in memory without involving IO, not that you absolutely have to prune it. It becomes a smooth download-and-write-to-disk-once kind of experience.

Once the node has synced, it can switch to being a standalone full archiving node or even a Utreexo server, depending on configuration. It can also just hang on to the Utreexo forest roots and header chain (a few hundred megabytes). The point is that the size of the UTXO set would no longer be a detriment to low-powered node hardware doing a fully validating initial sync.

Of course we can also decide that the UTXO set was a premature optimization that backfired and go with an alternative node design like libbitcoin. That has its own tradeoffs again.