Stats on compact block reconstructions

I started recording the contents of inbound and outbound getblocktxn messages a week ago. This should allow for some insights into “are peers often missing the same transactions?” and “can we pre-fill the transactions we had to request ourselves?”. I haven’t taken a closer look at the data yet.

Also, I’ve changed one of my nodes to run with blockreconstructionextratxn=10000 and updated two nodes to a master that includes p2p: track and use all potential peers for orphan resolution #31397. Probably need to wait until the mempool fills up again to see the effects of this.


One other thing is that FIBRE is designed for UDP transmission to avoid delays due to retransmits; so redoing it over TCP via the existing p2p network would be a pretty big loss…

  • I changed node alice to run with blockreconstructionextratxn=10000 in early February. This had a noticeable effect over the following days, with slightly higher scores starting 2025-02-06. During the increased mempool activity between 2025-02-21 and 2025-03-06 it performed significantly better than my other nodes.
  • Node charlie and node erin were switched to a branch that includes p2p: track and use all potential peers for orphan resolution #31397 at the same time in early February. I don’t see any immediate improvement for these two nodes.
  • Node ian was running Bitcoin Core v26.1 until I switched all nodes to run v29.0rc1 release candidate. ian clearly performed worse than the other nodes before the update, which is expected as e.g. mempoolfullrbf wasn’t default in v26.1.
  • Node mike doesn’t allow inbound connections (while the other nodes do and usually have full inbound slots). This is noticeable in the reconstruction performance. Only having eight peers that inform mike about transactions is probably worse than having close to 100 peers that inform you about new transactions.

The stats from alice, charlie, and erin could indicate that orphans aren’t the problem, but that conflicts, replacements, and policy-invalid transactions (i.e. extra pool txns) cause low performance during high mempool activity. That said, I’m not sure these three months of data are enough to be certain yet.

I’ve started to look at the data I’ve been recording. It seems that many of my peers I announced a compact block to end up requesting very different sets of transactions (and usually larger sets) than I request from my peers. I assume many of them might be non-listening nodes like mike or run with a non-default configuration. This needs more work, but I hope to post more stats on requested transactions here at some point.

I’ve also noticed that my listening nodes running with the default configuration often independently request similar sets of transactions. This seems promising in regards to predictably prefilling transactions in our compact block announcements. My assumption would be that if we prefill:

  • transactions we had to request
  • transactions we took from our extra pool
  • and prefilled transactions we didn’t have in our mempool (i.e. prefilled txns that were announced to us and ended up being useful)

we can improve the propagation time among nodes that accept inbound connections and use a “Bitcoin Core” default policy. This in turn should improve block propagation time of the complete network as now more nodes know about the block earlier. Additionally, useful prefilled transactions don’t end up wasting bandwidth; only transactions that a peer already knew about would waste bandwidth. These improvements would probably be most noticeable during high mempool activity: the (main) goal wouldn’t be to bring the days with 93% (of reconstructions not needing to request a transaction) to 98% but rather the days with 45% to something like 90% for well-connected nodes.

Since Bitcoin Core only high-bandwidth/fast announces compact blocks to peers that specifically requested it from us (because we quickly gave them new blocks in the past), non-listening nodes that are badly connected won’t start sending wasteful announcements with many prefilled, well-known transactions to their peers.

I’ve started implementing this in 2025-03-prefill-compactblocks but it’s still work-in-progress:

  • limit the prefill amount to something like 10kB worth of transactions as per BIP152 implementation note #5. I think this is useful to avoid wasting too much bandwidth if a node does a high-bandwidth announcement but, for some reason, prefills a lot of well-known transactions in the announcement
  • cmpctblock debug logging on wasted bandwidth: log the number of bytes of transactions we already knew about when receiving a prefilled compact block. This can be tracked/monitored to determine if we’re wasting too much bandwidth by prefilling
  • since the positive effect on the network is only measurable with a wide(r) deployment of the prefilling patch, it’s probably worthwhile to do some Warnet simulations on this and test the improvement under different scenarios.

I’m not sure I understand why we would want to also prefill txs from our extra pool. The logic for extra pool inclusion would be the same for all nodes. So if we consider that our peers would have the same txs in their mempool then logically we would consider that our peers would have the same txs in their extra pool, no?

Yeah, good question. I don’t have data on this yet, but I think it makes sense to look at the extra_pools of nodes and see if they are similar or different. My assumption is that they aren’t too similar.

So if we consider that our peers would have the same txs in their mempool then logically we would consider that our peers would have the same txs in their extra pool, no?

A few arguments against extra pool similarity are:

  • the extra pool is quite small with only 100 transactions in it by default
  • mempool transactions are relayed with the hope that mempools converge, extra pool transactions are stopped at their first hop and aren’t relayed
  • you might have a peer that is sending a lot of transactions you’ll reject and put into your extra pool, but I might not have a connection to this peer, so our extra pools will be quite different

But our peers will also have the same default.

RBF replaced txs are put into the extra pool, and the replacing tx is still relayed. So they should converge. If we are going to search the orphanage anyways, we can stop putting orphans into the extra pool.

Would it be likely a miner will mine these rejected txs though? Not sure.

One point brought up by sipa here in a semi-related thread ([WIP] p2p: Add random txn's from mempool to GETBLOCKTXN by davidgumberg · Pull Request #27086 · bitcoin/bitcoin · GitHub) is that the number of TCP packets sent over could increase if we’re making the CMPCTBLOCK message larger with prefilledtxns. I think that is maybe one downside to prefilling transactions. Perhaps it’s possible to prefill transactions up to a certain total message size limit specifically for compact blocks?

EDIT: His point was actually about the GETBLOCKTXN causing more round trips, but the same thing applies.


0xB10C/2025-03-prefill-compactblocks is very interesting,

since the positive effect on the network is only measurable with a wide(r) deployment of the prefilling patch, it’s probably worthwhile to do some Warnet simulations on this and test the improvement under different scenarios.

I think one low effort way to perform a limited test of this patch on mainnet is to run a second node which only listens to CMPCTBLOCK announcements from manually-connected peers, and is manually connected to a 0xB10C/2025-03-prefill-compactblocks node. I’ve created a branch to try this: davidgumberg/5-20-25-cmpct-manual-only, I’ll try to run an experiment soon with two nodes.


My assumption would be that if we prefill:

  • transactions we had to request
  • transactions we took from our extra pool
  • prefilled transactions we didn’t have in our mempool (i.e. prefilled txns that were announced to us and ended up being useful)

I think the privacy concerns raised in bitcoin/bitcoin#27086 are relevant here. How can a node avoid:

  1. Providing a unique fingerprint by revealing its exact mempool policy in CMPCTBLOCK announcements.
  2. Revealing all of the non-standard transactions that belong to it by failing to include them in its prefill.

2. is more severe, and may be part of a class of problems (mempool’s special treatment of its own transactions) that is susceptible to a general fix outside of the scope of compact block prefill. Even if it’s impossible or infeasible to close all leaks of what’s in your mempool, it would be good to solve this.

One way of fixing this might be to add another instantiation of the mempool data structure (CTxMemPool), maybe called m_user_pool. Most of the code could go unchanged except for where it is desirable to give special treatment to user transactions, and these cases could be handled explicitly.

To solve 1., I wonder if there is a reasonably performant way to shift the prefills in the direction of prefilled transactions the node wouldn’t have included according to default mempool policy. This is not just for privacy, as I imagine this is the ideal set of transactions to include, strict mempools prefilling too much, and loose mempools prefilling too little.[1] If this would be too expensive to compute on CMPCTBLOCK receipt, maybe a variation of m_user_pool is possible, where a node maintains another CTxMemPool instance for all the transactions which default mempool policy would have excluded, but user supplied arguments have permitted. Or maybe the extra state is too expensive/complicated, and instead just performing an extra standardness check with the default policy on tx receipt and setting a flag on the tx (or keeping a map of flagged txs) is enough.

Maybe all of this is too complicated to implement proportional to its value here, but these could also be steps toward solving mempool fingerprinting more generally.[2]


the number of TCP packets sent over could increase if we’re making the CMPCTBLOCK message larger with prefilledtxns.

I am not very knowledgeable about TCP, but as I understand RFC 5681, the issue is not a message growing to a size where it has to be split across multiple packets/segments, but a message that grows too big to fit in the receiver-advertised window (rwnd) and the congestion window (cwnd) specified by RFC 5681 (or another congestion control algorithm). The smaller of these two (cwnd and rwnd) is the largest amount of data that can be transmitted in a single TCP round trip. It should be possible to get the relevant metrics for this from the tcp_info structure on *nix systems[3] doing something like:

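#include <linux/tcp.h>   // struct tcp_info, TCP_INFO (tcpi_snd_wnd needs Linux 5.4+)
#include <netinet/in.h>  // IPPROTO_TCP
#include <stdint.h>
#include <sys/socket.h>  // getsockopt

struct tcp_info info;
socklen_t info_len = sizeof(info);
// sockfd is the connected TCP socket to the peer; error handling omitted
getsockopt(sockfd, IPPROTO_TCP, TCP_INFO, &info, &info_len);

// congestion send window (# of segments) * mss (max segment size)
uint32_t cwnd_bytes = info.tcpi_snd_cwnd * info.tcpi_snd_mss;
// our peer's advertised receive window in bytes
uint32_t peer_rwnd_bytes = info.tcpi_snd_wnd;
// get the smaller one
uint32_t max_bytes_per_round_trip = cwnd_bytes < peer_rwnd_bytes ? cwnd_bytes : peer_rwnd_bytes;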

And the announcer could pack the prefill until it hits this limit. I am not sure how likely it is that constraining messages to this size would deter a second round trip from taking place, but it seems like a reasonable starting point.


  1. For better or for worse, such an approach would disadvantage nodes with stricter-than-default mempools in compact block reconstruction. ↩︎

  2. But maybe no general solution to mempool fingerprinting is possible, and nodes with non-default mempools shouldn’t have any expectation that they can’t be fingerprinted. ↩︎

  3. Linux, Mac, FreeBSD. It seems something similar is possible on Windows with SIO_TCP_INFO ↩︎

Prefilling is just a flawed part of the design, it was kinda tossed in because it was very easy to add and harmless if not used. After compact blocks were deployed I did a bunch of testing and was unable to make it do anything but harm.

The issues it has are several fold: it’s part of the compact block message so it blocks reception of the compact block in cases where it wasn’t needed. Peers also get compact blocks from multiple sources and so if they all use prefill then you waste N fold the bandwidth (or N-1 if one was indeed helpful). And then of course the extra data stuffs you further back into needing RTTs, thanks to window issues.

Then of course you have the issue that many missed transactions are missed because they were too large, which makes all the above issues much worse.

FIBRE being AGPL is a non-issue, parts could be re-licensed if needed. It has in it solutions to every one of the issues raised above, including the ability for extra data to be sent that helps even if the prediction of what was missed wasn’t accurate, allowing data from multiple peers to all contribute, and so on.

The use of UDP however, needed to get around the TCP window issues, would probably be challenging for widespread deployment due to the need for hole punching.

A lot of things have happened since then: Core has minisketch merged (though unused), and using that kind of tool I was able to get blocks in consistently 800-ish bytes before. A big reduction in compact block size would leave a lot of room for data to fill in missing transactions.

But if miners are regularly including hundreds of kilobytes that were never relayed I’m a bit dubious that any scheme is going to result in particularly good performance except between peers with extremely high dedicated bandwidth that can do manual congestion management (e.g. a FIBRE-like deployment of geographically dispersed data center nodes). Though the fact that it can help even if just some nodes run something faster is helpful: it makes development of stuff more interesting even if there isn’t a serious deployment story.

The issues it has are several fold: it’s part of the compact block message so it blocks reception of the compact block in cases where it wasn’t needed. Peers also get compact blocks from multiple sources and so if they all use prefill then you waste N fold the bandwidth (or N-1 if one was indeed helpful). And then of course the extra data stuffs you further back into needing RTTs, thanks to window issues.

Then of course you have the issue that many missed transactions are missed because they were too large, which makes all the above issues much worse.

I agree that in the extreme case, prefilling will not be helpful. But I’m optimistic that prefilling up to the TCP congestion window (no extra RTT) is not harmful. It seems reasonable to presume that, in general, a node’s operating system’s congestion control algorithm will reliably predict the maximum message that can be sent to a peer without incurring an extra round trip, and nodes with slow connections will tend to also have small windows, mitigating the redundant prefill cost. If it works as I understand, it seems like using the cwnd will scale nicely up and down with connection speeds, and offloads the engineering burden of this problem to kernel developers and the IETF.

It seems worth measuring what the typical sizes of compact block BLOCKTXN fulfillments are. I’ve made a branch that might help with this: (log: Additional compact block logging by davidgumberg · Pull Request #32582 · bitcoin/bitcoin · GitHub). It would also be useful to have some data on Bitcoin node congestion window sizes, and if these are close to each other in size, compact block reconstruction failures don’t go away, but conservatively prefilling might make them less frequent while incurring little additional cost.

A lot of things have happened since then: Core has minisketch merged (though unused), and using that kind of tool I was able to get blocks in consistently 800-ish bytes before. A big reduction in compact block size would leave a lot of room for data to fill in missing transactions.

Great idea, I see that on my node compact block messages hover around ~20kB, 800 bytes would leave a lot more overhead for prefills!

I am not sure whether the comment in PR 27086 I linked is referring to congestion issues or IPv4 fragmentation issues. I don’t have hard data, but I believe both contribute to latency issues here and sending data >> MTU (~1500 bytes) is going to lead to lots of fragmentation. Two links if you have the time:

I’m not really sure that pre-filling above MTU is worth it after reading the two above RFCs, but curious to hear thoughts.

EDIT: Sorry to cross-post, but I’ve TLDR’d the above two RFC’s in a related Lightning conversation here: Latency and Privacy in Lightning - #13 by Crypt-iQ

I think I’ve actually conflated IP reassembly with TCP reassembly. I think maybe hard data would be nice to have here?

There is no IP fragmentation involved in TCP transmissions (well, assuming PMTUD did its thing)… indeed, you’re conflating IP reassembly with TCP reassembly.

An interactive/modifiable version of all the data and plots is in a jupyter notebook here: https://davidgumberg.github.io/logkicker/lab/index.html?path=2025-07-11-first-report%2FPrefilling.ipynb

Summary

I connected two Bitcoin Core nodes running on mainnet, one prefilling transactions to a node that only received CMPCTBLOCK announcements from its prefilling peer. Even though the intended effects of prefilling transactions are network-wide, and it would be nice to have some more complicated topologies and scenarios tested in e.g. Warnet, this basic setup can be used to validate some of the basic assumptions of the effects of prefilling:

  1. Does prefilling work to prevent failed block reconstructions that otherwise require GETBLOCKTXN->BLOCKTXN roundtrips, irrespective of the cost of prefilling?
  2. Does prefilling result in a net reduction in block propagation times?

The results indicate that the answer to 1. is definitively yes. The metric used by 0xB10C/2025-03-prefill-compactblocks of prefilling the transactions we were missing from our mempool when performing block reconstruction resulted in an observed reconstruction rate of 98.25% for a node receiving prefilled CMPCTBLOCK announcements when both the prefilling node and the prefill-receiving node are running similar builds of Bitcoin Core, compared to the observed reconstruction rate of 61.81% on a node not receiving prefilled blocks. Some of those prefills, as pointed out by @Crypt-iQ and @gmaxwell above, exceeded the TCP window, and likely resulted in an additional round-trip, negating the benefit of prefilling. But, in my measurements, 85.78% of the prefills would have fit in the partially occupied TCP window a prefilling node sent the CMPCTBLOCK’s in. Projecting out, these measurements indicate that if all Bitcoin Core nodes had been prefilling during the period in which I measured data, the reconstruction rate would have been 93.07%, and we can likely do better by taking advantage of the fact, pointed out by @andrewtoth above, that similar peers will likely have similar vExtraTxn.

I think the following improvements should be made to 0xB10C/2025-03-prefill-compactblocks:

Definitely:
  • Only prefill up to the next TCP window boundary.
  • Always insert candidates from vExtraTxn last.
Maybe:
  • Within DoS limits (maybe a limit of 4 MiB per valid header), temporarily store a per-block cache of prefilled transactions you hear about, increasing the chances that you successfully reconstruct without having to wait for an RTT.
  • If the send window can’t fit all of the prefill candidates, prefill a random selection of candidates, always prefilling transactions not in vExtraTxn first.

Future investigations should:

  1. Use prefill-receiving nodes to measure the amount of duplicate / redundant data in the prefill.
  2. Use two peers with stable and high (maybe artificial?) latencies to easily estimate the number of round-trips that messages take to pass between them. There is also probably external tooling that can measure this.
  3. Measure / reason about effects of prefilling at a distance of more than one hop.
  4. Measure data about the GETBLOCKTXN messages that a prefilling node receives from random peers.

Latency and Bandwidth

Feel free to skip the math in this section or to skip reading this section entirely.

Taking a simplified view, the latency for a receiver to hear an unsolicited message (the scenario we care about in block relay) consists of transmission delay plus propagation delay:

\text{Latency} \approx \frac{\text{Data}}{\text{Bandwidth}} + \sim{\frac{1}{2}} * \text{Round-trip time}

Any time compact block reconstruction fails because the receiver was missing transactions, an additional round-trip-time (RTT) of requesting missing transactions and receiving them (GETBLOCKTXN->BLOCKTXN) must be paid in order to complete reconstruction, but at this point the amount of data that needs to be transmitted for reconstruction to succeed does not change. Where f is the probability for a block to fail reconstruction:

\text{Latency} \approx \frac{\text{Data}}{\text{Bandwidth}} + \sim{\frac{1}{2}}\text{RTT} + f * \text{RTT}

If we had perfect information about the transactions our peer will be missing, we should always send these along with the block announcements since we will pay basically the same transmission delay, minus the unnecessary round-trip. If we don’t have perfect information, then the worst we can do is send transactions which our peer already knew about, while not sending them transactions they didn’t know about, incurring the RTT anyways, plus the transmission time of the redundant data. Let’s say we send p extra prefill bytes, with each byte having a probability n of being redundant and prefilling p bytes gets us a reconstruction failure probability of f_{p}, then:

\text{Latency}_\text{Prefilling} \approx \frac{\text{Data}}{\text{Bandwidth}} + \frac{p * n}{\text{Bandwidth}} + \sim{\frac{1}{2}}\text{RTT} + f_p * \text{RTT}
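To make the two latency expressions easier to play with, here is a minimal sketch that evaluates them directly. The function and parameter names are my own, bandwidth is assumed to be in bytes per second, and the model is exactly the simplified one above (no TCP window effects).

```c
#include <assert.h>

/* Latency without prefilling: Data/Bandwidth + RTT/2 + f0 * RTT (seconds). */
double latency_no_prefill(double data_bytes, double bandwidth_Bps,
                          double rtt_s, double f0)
{
    assert(bandwidth_Bps > 0.0);
    return data_bytes / bandwidth_Bps + 0.5 * rtt_s + f0 * rtt_s;
}

/* Latency with p prefill bytes, each redundant with probability n,
 * and reconstruction failure probability f_p at that prefill size,
 * following the formula as written (only p * n is counted as extra). */
double latency_prefill(double data_bytes, double p_bytes, double n,
                       double bandwidth_Bps, double rtt_s, double f_p)
{
    assert(bandwidth_Bps > 0.0);
    return data_bytes / bandwidth_Bps + (p_bytes * n) / bandwidth_Bps
           + 0.5 * rtt_s + f_p * rtt_s;
}
```

For instance, with purely illustrative inputs of a 15000 byte announcement, 5 MiB/s, a 50 ms RTT, f0 = 0.38, and a 10 KiB fully redundant prefill that brings f_p to 0.02, the prefilled latency comes out lower because the saved 0.36 * RTT outweighs the added transmission time.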

Criterion for deciding if prefilling is advantageous

In order for prefilling latency to be better than or equal to no-prefilling latency, the following inequality must be satisfied:

\frac{p * n}{\text{Bandwidth}*\text{RTT}} \leq f_0 - f_p

Derivation

Suppose latency while prefilling is less than or equal to latency without prefilling, where b is bandwidth, r is the RTT, d is the size of the CMPCTBLOCK without prefill, p is the size of the prefill, n is the probability that a prefill byte is redundant, and f_p is the reconstruction failure rate at a given prefill size p:

\frac{d}{b} + \frac{pn}{b} + \frac{1}{2}r + {f_p}{r} \leq \frac{d}{b} + \frac{1}{2}r + {f_0}{r}

Subtracting the common terms \frac{d}{b} and \frac{1}{2}r from both sides:

\frac{pn}{b} + {f_p}{r} \leq {f_0}{r}

Subtracting {f_p}{r} from both sides:

\frac{pn}{b} \leq {f_0}{r} - {f_p}{r}

Dividing both sides by r:

\frac{pn}{{b} {r}} \leq {f_0} - {f_p}

If we plug in some example values, prefilling 10KiB with a bandwidth of 5 MiB/s and an RTT of 50ms (.050s) and use a worst case n of 1

\frac{10\text{KiB}*1}{5 \text{MiB/s} * 0.050\text{s}} = 0.039

In this case, if prefilling lowers the reconstruction failure rate by at least 3.9 percentage points, it is definitely better than not prefilling.
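The worked example can be checked numerically with a small sketch of the criterion. The helper names are hypothetical, bandwidth is in bytes per second, and RTT is in seconds.

```c
#include <assert.h>
#include <stdbool.h>

/* Left hand side of the criterion: (p * n) / (Bandwidth * RTT). */
double prefill_cost_ratio(double p_bytes, double n,
                          double bandwidth_Bps, double rtt_s)
{
    assert(bandwidth_Bps > 0.0 && rtt_s > 0.0);
    return (p_bytes * n) / (bandwidth_Bps * rtt_s);
}

/* True when the inequality (p * n) / (Bandwidth * RTT) <= f0 - f_p holds. */
bool prefill_is_worthwhile(double p_bytes, double n, double bandwidth_Bps,
                           double rtt_s, double f0, double f_p)
{
    return prefill_cost_ratio(p_bytes, n, bandwidth_Bps, rtt_s) <= f0 - f_p;
}
```

Calling `prefill_cost_ratio(10 * 1024, 1.0, 5 * 1024 * 1024, 0.050)` gives 10240 / 262144, which is about 0.039, matching the figure above. The denominator is the bandwidth-delay product, so the same prefill gets relatively cheaper as either bandwidth or RTT grows.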

Latency Cost of Prefilling

And we can quantify the latency cost of prefilling over not prefilling as:

\text{Latency}_\text{Prefilling} - \text{Latency}_\text{Not prefilling} = \frac{p*n}{\text{Bandwidth}} - \text{RTT} * (f_0 - f_p)

TCP windows and the costs of prefilling.

But, the use of TCP in the Bitcoin P2P protocol complicates this, because a sender will not send data exceeding the TCP window size in a single round-trip. Instead, they will send up to the window size in data, wait for an ACK from the receiver, and then send up to window bytes after the data which was ACKed. That means that if we exceed a single TCP window, we will have to pay an additional RTT in propagation latency (and a little bit of transmission latency for the overhead). And for each additional window we overflow, we will pay another RTT:

\text{TCP Latency} \approx \frac{\text{Data}}{\text{Bandwidth}} + \sim{\frac{1}{2}}\text{RTT} + f * \text{RTT} + \lfloor{\frac{\text{Data}}{\text{Window Size}}}\rfloor\text{RTT}

Note \lfloor a \rfloor meaning std::floor(a)

Doing a similar dance as above, where p is the prefill size and f_p is the probability of reconstruction failure at prefill size p, and n is the probability of a prefill byte being redundant:

\frac{p*n}{\text{Bandwidth}*\text{RTT}} \leq f_0 - f_p + \lfloor{\frac{\text{Data}}{\text{Window Size}}}\rfloor - \lfloor{\frac{\text{Data}+p}{\text{Window Size}}}\rfloor

The “TCP window” is the smaller of two values: the receiver advertised window (rwnd) and the sender-calculated congestion window (cwnd).
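The floor terms are easy to get wrong by one, so here is a minimal sketch of them using integer arithmetic. The function names are hypothetical, and the sketch follows the floor(Data / Window Size) model literally, which means a message exactly equal to the window size already counts as one extra round trip.

```c
#include <assert.h>
#include <stdint.h>

/* Extra round trips under the model: floor(Data / Window Size). */
uint64_t window_rtts(uint64_t data_bytes, uint64_t window_bytes)
{
    assert(window_bytes > 0);
    return data_bytes / window_bytes; /* unsigned division is floor */
}

/* Additional round trips caused by appending p prefill bytes:
 * floor((Data + p) / Window) - floor(Data / Window). */
uint64_t prefill_extra_rtts(uint64_t data_bytes, uint64_t p_bytes,
                            uint64_t window_bytes)
{
    return window_rtts(data_bytes + p_bytes, window_bytes)
           - window_rtts(data_bytes, window_bytes);
}

/* Largest prefill that keeps the difference above at zero, i.e. the
 * largest p with (Data mod Window) + p < Window. */
uint64_t max_prefill_without_extra_rtt(uint64_t data_bytes,
                                       uint64_t window_bytes)
{
    assert(window_bytes > 0);
    return window_bytes - 1 - (data_bytes % window_bytes);
}
```

For a 7450 byte announcement and a 14480 byte window, the bound is 7029 bytes: a prefill of that size adds no round trip under the model, and one more byte adds one.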

Overflowing current TCP window is always worse than doing nothing

The above formula establishes as a rule something which might have been intuited, that if the prefill causes us to exceed the current TCP window, then we will always do worse than if we hadn’t prefilled, since:

  1. f_0 - f_p \leq 1 since the largest number f_0 can be is 1, and the smallest number f_p can be is 0.
  2. \lfloor{\frac{\text{Data}}{\text{Window Size}}}\rfloor - \lfloor{\frac{\text{Data}+p}{\text{Window Size}}}\rfloor \leq -1 if the prefill overflows the current partially filled TCP window.
  3. If a \leq 1 and b \leq -1, then a + b \leq 0, so the right hand side of the formula is \leq 0.
  4. The left hand side of the inequality will always be \geq 0, since none of the variables on the left side can ever be negative.
  5. If lhs \geq 0 and 0 \geq rhs, then lhs \geq rhs, so the left hand side will never be less than the right hand side, therefore prefilling will never be beneficial (at best, when both sides are 0, it breaks even).

But, if we bound our prefill p so that we never increase the number of TCP windows used, i.e.: \lfloor{\frac{\text{Data}}{\text{Window Size}}}\rfloor - \lfloor{\frac{\text{Data}+p}{\text{Window Size}}}\rfloor = 0 which, I believe is easy to do, we can use the exact same formula as above to decide whether or not prefilling is effective:

\frac{p * n}{\text{Bandwidth}*\text{RTT}} \leq f_0 - f_p

Complication: TCP Retransmission

So far, I have assumed perfectly reliable networks, and this isn’t always the case: packets get lost, and in TCP that means waiting for a timeout and then retransmitting. But I believe the problem I’ve described above in relation to prefilling is very similar to the problem that the designers of TCP had in selecting a static window size, and later, dynamic window sizes through congestion control algorithms like those described in RFC 5681 and RFC 9438. Instead of the probability that a block reconstruction will fail, they deal with the probability that a packet will not arrive. In both cases, the consequence is an additional round-trip, and a core question is whether the marginal value of potentially saving a round-trip by packing in more data is worth the risk that retransmission will be necessary anyways. The analogy is imperfect, as there are many more concerns that TCP congestion control algorithms deal with, but I argue that the node can outsource the question: “How large of a message can we send and reasonably expect everything to arrive?” to its operating system’s congestion control implementation.

Complication: Cost of Bandwidth

In all of the above, I have assumed the cost of using bandwidth is 0 outside of the latency cost. I’ve done this because I believe the cost of the redundant transactions sent in compact block prefills is negligible. The data I measured below suggests that prefills will be on the order of ~20KiB, so the worst-case monthly bandwidth usage of prefilling, assuming every byte is redundant and did not need to be sent, and that you always receive a prefilled CMPCTBLOCK from three HB peers, is ~262 MiB (3 HB peers * 20 KiB * 6 blocks/hour * 24 hours * 31 days = 267,840 KiB).

Takeaways

I don’t think proving that the above inequality is satisfied is necessary for a prefilling solution. What I think it’s useful for is building an intuition of the problem, and setting theoretical boundaries on how effective prefilling needs to be to be worth it.

  • Nodes whose connections have a smaller Bandwidth * RTT (see bandwidth-delay product (BDP)) are likelier to suffer than to benefit from prefilling, e.g. nodes with low bandwidth and low ping. Nodes whose connections have large BDPs are likelier to benefit, e.g. high-bandwidth, high-latency connections (“Long Fat Networks” as described in RFC 7323).
  • If the redundant broadcast probability n is zero, prefilling is always worth it.

Data

The data was all taken from debug.log’s generated by the nodes and parsed with this python script: logkicker/compactblocks/logsparser.py at 285034d6833e34dfcb058ce37b30affede0333be · davidgumberg/logkicker · GitHub

CSV’s from the data collected can be found here: logkicker/compactblocks/2025-07-11-first-report at main · davidgumberg/logkicker · GitHub

Summary

The node receiving blocks that were not prefilled had a reconstruction rate of 61.81%, and the node receiving prefilled blocks had a reconstruction rate of 98.25%. But only 85.78% of the prefills would have fit in the current TCP window of the CMPCTBLOCK being announced, so projecting from those two figures, 93.07% of blocks could have been reconstructed without an additional TCP round trip. The vast majority of TCP windows observed were around ~15KiB. For a lot of the data around prefills, the averages are massive because of a few extreme outliers, but the vast majority of the time, a very small amount of prefill data is needed to prevent a GETBLOCKTXN->BLOCKTXN round trip: 65% of blocks observed needed 1KiB or less of prefill.
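One reading of the projection that reproduces the 93.07% figure is to credit only the share of prefills that fit in the window with the improvement over the baseline. This is an assumption about how the number was derived, not something spelled out here, and the function name is hypothetical.

```c
#include <assert.h>

/* Baseline rate plus the gain from prefilling, scaled by the fraction
 * of prefills that fit in the available TCP window. Rates are in
 * percent, fit_fraction is in [0, 1]. */
double projected_reconstruction_rate(double baseline_pct,
                                     double prefilled_pct,
                                     double fit_fraction)
{
    assert(fit_fraction >= 0.0 && fit_fraction <= 1.0);
    return baseline_pct + fit_fraction * (prefilled_pct - baseline_pct);
}
```

With the measured values, `projected_reconstruction_rate(61.81, 98.25, 0.8578)` is 61.81 + 0.8578 * 36.44, which is roughly 93.07.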

Prefill-Receiving Node: stats on CMPCTBLOCK’s received

This data was gathered from a node configured so that it would only receive CMPCTBLOCK announcements from our prefilling node. The main thing to see here is that reconstruction generally succeeds. The average reconstruction time metric is misleading, since we don’t count extra RTT’s that happen in the TCP layer in reconstruction time, just the time from when we receive the CMPCTBLOCK until the time we have reconstructed it.

49 out of 2793 blocks received failed reconstruction. (1.75%)
Reconstruction rate was 98.25%
Avg size of received block: 55851.93 bytes
Avg bytes missing from received blocks: 603.40 bytes
Avg bytes missing from blocks that failed reconstruction: 34393.55 bytes
Avg reconstruction time: 7.821697ms

Prefilling Node: stats on CMPCTBLOCK’s received

This data was gathered from the node that sends prefilled compact blocks to its peers. Because this node is otherwise unmodified, we can use its measurements on the receiving side as a baseline for block reconstruction on nodes today.

1101 out of 2883 blocks received failed reconstruction. (38.19%)
Reconstruction rate was 61.81%
Avg size of received block: 15957.20 bytes
Avg bytes missing from received blocks: 47849.36 bytes
Avg bytes missing from blocks that failed reconstruction: 125294.91 bytes
Avg reconstruction time: 25.741588ms

TCP Window data

There is a flaw in the window available bytes metric I have used, pointed out to me by @hodlinator, which is that I only did window_size - cmpctblock_size, and did not factor existing bytes queued to send to peers in vSendMsg. I anticipate this will have a small effect, but a branch which prefills up to the TCP window limit should take this into account.

Edit:

@andrewtoth has pointed out more complications in the available bytes metric: it will also have to take into account the current in-flight segments to the peer. On Linux, for example, this is tcpi_unacked (Multiplied by tcpi_snd_mss to get size in bytes). It will also have to account for bytes that are in the operating system’s send queue (tcpi_notsent_bytes). But I believe that because Bitcoin Core uses TCP_NODELAY the OS send buffer should generally be empty. On Linux, the sum of these two values can be obtained with SIOCOUTQ[1]:

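/* Needs <sys/ioctl.h> and <linux/sockios.h>; `socket` is the peer's socket fd. */
int bytes_inflight_and_unsent, err;
err = ioctl(socket, SIOCOUTQ, &bytes_inflight_and_unsent); /* err is -1 on failure */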
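
Putting the two corrections together, the available bytes metric could be computed roughly as follows. This is only a sketch under assumptions: the function name and all parameter names are hypothetical and not taken from Bitcoin Core, window_bytes is whatever send window estimate the measurement already uses, and a message that already exceeds the window is simply reported as having no room (the multi-round-trip case is not handled here).

```c
#include <stddef.h>

/* Hypothetical helper, not Bitcoin Core code.
 * Estimates how many bytes of prefill still fit in the current window
 * after subtracting everything already committed to it:
 *   - the compact block itself,
 *   - bytes already queued for this peer in vSendMsg,
 *   - bytes in flight or unsent in the OS (e.g. the SIOCOUTQ value). */
size_t available_prefill_bytes(size_t window_bytes,
                               size_t cmpctblock_bytes,
                               size_t queued_vsendmsg_bytes,
                               size_t inflight_and_unsent_bytes)
{
    size_t committed = cmpctblock_bytes
                     + queued_vsendmsg_bytes
                     + inflight_and_unsent_bytes;

    /* Clamp at zero instead of underflowing the unsigned type. */
    if (committed >= window_bytes) {
        return 0;
    }
    return window_bytes - committed;
}
```

For example, with a 14480 byte window, a 6000 byte compact block, 1000 bytes queued in vSendMsg and 3396 bytes reported by SIOCOUTQ, this leaves 4084 bytes for prefill rather than the 8480 bytes the original window_size - cmpctblock_size metric would report.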

I found this article helpful: Journey of Life: TCP socket send buffer deep dive (archive link)

TCP Window Size: Avg: 16128.76 bytes, Median: 14480.0, Mode: 14480
The mode represented 13076/26392 windows. (49.55%)
Avg. TCP window bytes used: 7449.62 bytes
Avg. TCP window bytes available: 8679.14 bytes

Prefilling node: stats on CMPCTBLOCK’s sent

The average CMPCTBLOCK we sent was 65732.80 bytes.
The average prefilled CMPCTBLOCK we sent was 91614.48 bytes.
The average not-prefilled CMPCTBLOCK we sent was 14483.78 bytes.
17536/26392 blocks were sent with prefills. (66.44%)
Avg available prefill bytes for all CMPCTBLOCK's we sent: 8679.14 bytes
Avg available prefill bytes for prefilled CMPCTBLOCK's we sent: 8362.04 bytes
Avg total prefill size for CMPCTBLOCK's we prefilled: 74593.04 bytes
15042/17536 prefilled blocks sent fit in the available bytes. (85.78%)

The average prefill size is notably large, but this is a consequence of some outlier blocks.
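
As a quick sanity check on the sent-block figures above, the overall average should be the count-weighted mean of the prefilled and not-prefilled averages. The helper below is a hypothetical illustration of that arithmetic, not part of the analysis scripts.

```c
#include <stddef.h>

/* Hypothetical helper: count-weighted mean of two group averages.
 * n_a and n_b are the group sizes, avg_a and avg_b the group means. */
double combined_average(size_t n_a, double avg_a, size_t n_b, double avg_b)
{
    size_t total = n_a + n_b;

    /* Avoid dividing by zero when both groups are empty. */
    if (total == 0) {
        return 0.0;
    }
    return ((double)n_a * avg_a + (double)n_b * avg_b) / (double)total;
}
```

Calling combined_average(17536, 91614.48, 26392 - 17536, 14483.78) comes out at roughly 65732.8 bytes, which agrees with the reported average size of all CMPCTBLOCK messages sent.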

vExtraTxnForCompact

But we can probably do even better. The above statistics are for prefilling with all of the transactions in vExtraTxnForCompact, and as I understand it, peers running the same branch of Bitcoin Core are very likely to have similar vExtraTxnForCompact contents. Reconstruction will therefore often succeed even without these transactions, so they should be the first candidates to leave out of the prefill. Unfortunately, their size is not something that Bitcoin Core logs. I tried to compute it with a heuristic, prefill_size - missing_txns_we_requested_size, but it turned out to be very inaccurate.

Overflowing window before pre-fill.

Interestingly, some compact blocks were already so large before prefilling that they required more than one TCP round-trip to send. While this circumstance is not ideal, prefilling performs better in these cases by taking advantage of it.

1432/26392 CMPCTBLOCK's sent were already over the window for a single RTT before prefilling. (5.43%)
Avg. available bytes for prefill in blocks that were already over a single RTT: 8555.57 bytes
1432/1432 excessively large blocks had prefills that fit. (100.00%)
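
One way to picture the room left over in such blocks is to model the send as a series of equally sized windows, one per round trip. This is a deliberately simplified sketch: it assumes a fixed window per round trip (real congestion windows grow over time, which this ignores), and both function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical model: number of round trips needed to send message_bytes
 * when at most window_bytes can be sent per round trip. */
size_t round_trips_needed(size_t message_bytes, size_t window_bytes)
{
    if (window_bytes == 0) {
        return 0; /* no meaningful answer without a window */
    }
    /* Ceiling division without floating point. */
    return (message_bytes + window_bytes - 1) / window_bytes;
}

/* Hypothetical model: unused bytes in the last window, i.e. how much
 * could be added without costing another round trip. */
size_t slack_in_last_window(size_t message_bytes, size_t window_bytes)
{
    if (window_bytes == 0) {
        return 0;
    }
    size_t used_in_last = message_bytes % window_bytes;
    return used_in_last == 0 ? 0 : window_bytes - used_in_last;
}
```

Under this model, a 20000 byte compact block with a 14480 byte window needs two round trips and leaves 8960 bytes of slack in the second one, so a prefill of up to that size would not add a further round trip.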

  1. man tcp incorrectly documents SIOCOUTQ as only the unsent bytes, but it’s both unsent and unacked ↩︎

8 Likes

Awesome work @davidgumberg. I haven’t gotten to dig into it deeper yet.

Just some brief updated stats on reconstructions over the past months, since I also posted them in policy: lower the default blockmintxfee, incrementalrelayfee, minrelaytxfee #33106.

Since log: Additional compact block logging #32582, we log the size of the requested transactions. Here, I plot the average size of the requested transactions per block per day.

In early June, we were requesting less than 10kB per block on average for the blocks where we needed to request something (about 40-50% of blocks). Currently, we are requesting close to 800kB of transactions on average for 70% of the blocks (the other 30% need no requests).

3 Likes

Nice work on this @davidgumberg, this is very helpful. I have a few questions.

Within DoS limits (maybe a limit of 4 MiB per valid header), temporarily store a per-block cache of prefilled transactions you hear about, increasing the chances that you successfully reconstruct without having to wait for an RTT.

Is this because we have 3 HB peers and we want to combine prefills across them to try and reconstruct?

So far, I have assumed perfectly reliable networks and this isn’t always the case, packets get lost, and in TCP that means waiting for a timeout, and then retransmitting

I think it would be helpful to run this experiment where one of the nodes is on the other side of the world (maybe China?) to see what packet loss looks like and how that affects reconstruction times with/without prefilling.

For the “Scatter plot of Missing Bytes/Reconstruction Time for a node whose peers don’t prefill”, some of the points have > 1 MiB missing and have reconstruction times at ~100ms. Do you know if these points are with peers with large window sizes? I would think that if that much data has to be sent across and the window sizes are ~14480, reconstruction times would be a bit higher… I’m wondering if you have any ideas on why the reconstruction times are low in those cases?

Edit: For the latency formulas, should they include the cost of transmitted data needed to reconstruct in the failure case? I know they are just to form an intuition, but I was also curious about that.