For the past few months (since plebfi miami), I’ve been working on a project to monitor the Lightning gossip network by collecting gossip messages from numerous nodes. The repo with the code used, the raw data collected before lightning++, and the slides from my talk is here:
Convergence delay (time for a message to propagate around the network) has dropped a significant amount since similar measurements in 2022, from ~500 to ~200 seconds for 75% propagation. I suspect this is due to LN implementations defaulting to making more P2P connections.
A significant subset of all messages received were sent to my node by less than 1/4th of its peers. This could be due to a poorly connected graph of P2P connections, or some filtering policy in the LN implementations.
For channel_update messages, which were 60% of total messages, 20% of channels had more than 144 messages. This is relevant when compared to the proposed gossip rate limit for Taproot gossip / gossip 1.5(?):
For node_announcement messages, which were 30% of total messages, 2.5% of nodes were announced more than 144 times. This seems like a unintended behavior of some implementation or node automation tool.
The total size of all unique messages collected over that day was 103.2 MB, which is greater than I expected. This does not account for the bandwidth overhead of receiving the same message from multiple peers, so node bandwidth usage is likely much higher.
Future Work
So far, I only collected data over a 24 hour period and analyzed that. In the near term I’ll be setting up permanent infrastructure to receive gossip from multiple geographic regions and neighborhoods in the P2P graph. I’ll also be adding functionality to broadcast gossip messages from these nodes, and observe their propagation. Work will continue in the gossip-observer repo linked above.
Getting Involved
If you have feedback on interesting metrics to compute, a request to track gossip propagation from your node, or anything else, feel free to message me here!
I’m also investigating how this raw data could be published so others can perform their own analysis. I’m particularly interested in options for unsupervised anomaly detection, but I’m also definitely not a data scientist.
A related subject is how we could switch the LN P2P network from message flooding to something closer to the design outlined in the Erlay paper and BIP. Some observations:
Latency / propagation delay is much less important for LN gossip; implementations will allow slightly outdated fees to apply to their channel, for example.
Gossip messages are signed by the sender, and are intended to be public. So the propagation method does not need to conceal information about the sender’s position nor connections in the LN P2P graph.
There has been some existing work on this subject:
A key detail is how to map gossip messages to Minisketch set elements. This mapping should be collision-resistant, and ideally the elements are short, since reconciliation time is quadratic wrt. element length.
In the Erlay design, collisions are avoided by generating a salt for each peer connection, and using that and the TXID as input to a fast non-cryptographic collision-resistant hash like SipHash. This means that the total amount of hashing operations needed scales with the number of peer connections, but having a shorter (32-bit) set element greatly reduces the compute needed for set construction and reconciliation.
Alternatively, set elements could be computed from a channel’s short channel ID (64 bits), plus the block height included in the gossip message. However, there are some issues with this approach.
One is that the components of a short channel ID (channel open block height, transaction index, and output index) don’t have a lot of entropy. For example, in a full block, one could have ~36000 1-in 1-out P2TR TXs of 111 vB, or 1 TX with 1 input and 93020 outputs with weight ~3999900 vB. In either case, the entropy of the TX + output index is only ~17 bits. Mixing in block heights does not add much to the total entropy, since a message can only use a block height from the last two weeks (~2016 possible values, 11 bits of entropy).
[Edit 19.11.25: I forgot about the max standard TX weight of 400k WU; so a full block could have 10 TXs with ~3600 outputs each. That doesn’t really change the total entropy of (TX index + output index) I think.]
In addition, we would need a different mapping for node_announcement messages.
Given that LN nodes already store some per-peer state on connection (re)establishment, I’m in favor of borrowing the Erlay approach and using a per-peer salt + a hash like SipHash to generate short Minisketch set elements for each gossip message.
Gossip v1.5 message format
Another detail is that the gossip 1.5 proposal allows for optional fields in gossip messages, which is not possible with current gossip messages:
However, I think peers can decide which fields to send to their peer based on their announced feature bits.
Other considerations
Compared to the Erlay design, I don’t think LN implementations would need to perform any message flooding at all. Combining limited flooding and set reconciliation is useful for reducing total propagation delay, but the LN gossip propagation delay is so large that it could be maintained through set reconciliation alone. This should greatly simplify implementation complexity as well.
With respect to implementation, one problem would be different nodes filtering messages with different policies. This would lead to persistent set differences, which would waste bandwidth and, in the worst case, cause reconciliations to fail. I suspect that these filtering policies are either legacy behavior, or adaptions to quirks in existing gossip behavior, and could largely be removed if we move to set reconciliation.
Future Work
In parallel to the gossip monitoring work, I’d like to get feedback from LN implementers on which designs could work for them. Is the per-peer state a significant drawback or source of complexity? Would there be an issue with more frequent (every 30 seconds?) set reconciliation and gossip message handling? Are there other issues that haven’t been considered?
I’ll update this thread if/when I have a more concrete proposal or draft BIP.
Minisketch has superlinear decoding costs, so decoding huge sketches will burn up a lot of CPU. One could reconcile more often to try to fight this, but every reconcile will imply some communication overheads since you’ll overshoot the unknown needed amount for a reconstruction.
The flooding in the erlay has the advantage that takes the bulk of the load off the sketch and lets the sketch fill in the small omissions which it’s good at doing.
If you were to use reconciliation only you might be better off using iblt instead of minisketch (maybe plus a very small minisketch to unjam stuck IBLT decodes). The overheads of iblt are much worse but that may be less significant if you’re running all the traffic through it.
I missed that benefit of the flooding in Erlay on my previous reads of the paper; that makes sense.
This opened a rabbit hole that actually led me to a very recent paper proposing an IBLT-based set reconciliation protocol, that may be well-suited for this use case.
Firstly, the CPISync implementation used for benchmarks in the Minisketch repo has moved and expanded to include some new protocols:
There is a related paper that explores some of the tradeoffs of sync with Cuckoo filters vs. CPI vs. IBLT:
Cuckoo filters looked interesting, but IIUC don’t fit our use case well since a participant learns which elements their counterparty is missing, not the opposite (which elements they should request). The bandwidth used also seems to have a high floor. Outside of that, the benchmarks focus on much larger sets than what’s relevant in the LN Gossip (or Erlay) setting.
I never looked at the MET-IBLT paper; it seems like a better IBLT sync scheme, but with no benefits compared to RIBLT.
Looking into cuckoo filters eventually led me to this paper, Rateless IBLT (RIBLT):
Which seems very promising. They even used the Minisketch library in their benchmarks! And there is a public implementation in Golang (+ impls in C++ and Rust linked in the README):
The main math & explanation is in Section 4, with comparison to alternatives in Section 7.
As a tl;dr:
To make an IBLT rateless / an infinite stream of coded symbols, we can extend a mostly-normal IBLT s.t. later coded symbols map to fewer and fewer inputs. If we have an efficient function to compute the indices of coded symbols an input must contribute to, we can encode our table efficiently even as the set size grows.
We can also extend the table incrementally, and send extensions until decoding succeeds, so we don’t need to estimate the set difference nor regenerate the IBLT if decode fails (similar to extending a Minisketch) (the paper doesn’t comment much on partial recovery, but Figure 6 has some simulation results related to this).
The bandwidth overhead seems to have a ceiling of ~1.75, even for sets with very few differences; it converges to ~1.35 as differences increase past 100, which seems significantly better than standard IBLT? (Figure 5).
For both large and small set differences (1-1000), encoding cost grows linearly (Figure 8) This also holds for total set size (FIgure 10). Worth noting that the ratio (set_differences)/(set_size) is quite small in most of their benchmarks.
Most of their benchmarks use an element size of 64 bits. IIUC the bandwidth overhead could be significantly reduced if an implementation was optimized for small sets and smaller elements; there are some relevant notes in Sections 7.1 and 7.2.
If these performance properties hold up, I agree that mixing frequent IBLT-based syncs + infrequent Minisketch usage is appealing. It may also be worth skipping Minisketch entirely and trade bandwidth overhead for the CPU savings; I’m not sure how LN implementations feel about that tradeoff tbh.
Another consideration is that the elements we want to exchange are very small (average message size <275 bytes), so there isn’t much room to add bandwidth overhead. These messages will still be small with gossip v2.
Being able to have larger (64+ bit) elements with only a small increased CPU cost (FIgure 11) may be a very significant benefit; I’ll take a look at these implementations soon and report back.
Just be careful with figures from papers they tend to be rather asymptotic. IBLT like schemes usually start with ~32-bits per difference in overhead for a checksum which is pretty bad when otherwise 30-bit members are fine >100% overhead before you even get to the overheads needed to achieve correct reconstruction. Some papers simply leave this overhead out of their figures, though I don’t recall if the rateless paper did that.
(FWIW, you can use minisketch in a kind of quasi rateless way by just dynamically sending more until the other side could recover– almost all the computation from a partial one can be conserved, as you don’t get to the expensive and non-reusable root finding step until you’re almost certain to have a correct decode, so long as you’re willing to take one or two extra elements overhead)
Fair; their main benchmarks are using 32-byte members and 8-byte checksums. I think the claimed overhead depends on the ratio between these two? Which would be worse for this application.
They do mention that you can shorten the checksum for small sets, but it definitely isn’t benchmarked. I think that is acceptable if the set members are already sufficiently collision resistant.
Ah, I didn’t fully realize that this was the tradeoff for bisection / sketch reuse from the original paper nor notes on the repo But I see it listed as a TODO. That seems like the best option then; having more than one communication round for bisection is also more acceptable for LN than Bitcoin TX broadcast.
I don’t have a real estimate nor intuition for how many differences to expect; I may be able to estimate that by looking at real-world traffic, but it seems like this also depends on the flooding timer skew across all of your peers? Perhaps that is another option for reducing differences. Eventually I’ll replay real-world traffic in simulation for a proper estimate.
Related, one simple adjustment to reduce differences would be to flood messages your node generates to your direct peers, instead of waiting for the reconciliation timer.
Sorry to join the conversation late, and I haven’t worked on this for years so my memory could be stale, but here’s a brain dump of our conclusions.
Short channel ids have a lot of bits to squeeze. Ultimate would be to refer to outputs by their index in the blockchain (i.e. the 123456th output), but you can easily use fewer than 64 bits for blocknum/txnum/outnum.
You maintain three minisketches. We tried a single, but it costs encoding bits, complexity and hits the O(N^2) harder.
Channel minisketch is just the scids. Importantly, you send your blockheight with the sketch, because that informs the failure case.
Channel update minisketch is compacted scid + direction bit + height. Note that height takes 10 or 11 bits, IIRC, since you don’t send expired ones. You can only reliably decode this once you have reconciled the channels sketch.
Node announce sketch is uses the same encoding, using the oldest channel attached to the node as its key. Again this assumes you reconciled the channels sketch first.
If you cannot encode an scid compactly, just send it as a series of “raw" entries. IIRC creating such an scid requires an exceptional number of txs in a block or exceptional output count.
With this scheme, you simply send the sketches every 60 seconds (like now), and your peer sends you what you’re missing.
Note that you can truncate the set you send if you want to save bandwidth, but really the cost is in the set maintenance, so maybe this is silly. My memory is that minisketch is fast in practice though, even if you keep a 64k (8k element) set which is our max message size anyway.
If reconstruction fails, there are several things you can do:
If block height differs, ignore. Time will sort it. Maybe include block hash here?
Enlarge your own set (or, send more of your set). If this allows your peer to reconstruct, it will learn that you cannot reconstruct, and it knows to send its largest set if it wasn’t already.
Wait for other peers. You might close the gap.
Existing gossip queries for recent changes (assuming a pile of old changes haven’t suddenly appeared). You know if you need announcements, updates or node anns.
Query for everything.
Oh, we added a “total entries" counter to each message, which gives a clue as well: if your peer has far fewer entries, it’s a cry for help
50,000 public channels means 100,000 updates per block maximum, so roughly 10,000 a minute.
In an ideal situation with two peers and you space out the updates so that they are 30 seconds apart, you can just handle that with a 64kb set (8k entries).
Of course not every channel updates at the maximum rate, and behaviour will definitely be bursty, but this was the back of envelope calculation that convinced me that this approach can work.
Thanks for reposting it here! Useful to have more context on this, and any snippets of earlier discussions.
Agreed; I think we could cut at least 6 bits off of txindex? And then use creative XORing to mix in missing bits that don’t have their own spot.
Not sure what you mean with the “raw entries” part? Do you mean, send the underlying gossip message outside of the set reconciliation messaging flow?
IIUC the cost is mostly in the final parts of decode; I think if you wanted to save bandwidth, you could create a sketch with an overestimate of the true set difference, and then send some portion of that. If your peer signals that decode would fail, you can send the rest / another chunk of the oversized set.
I didn’t consider the max message size at all; that sounds like an argument for using shorter (32-bit) set elements and using a per-peer salt
Not sure I understand this bit; I figure you mean the block height you suggested sending alongside the channel announcement sketch. Is this the blockheight of the newest channel announcement you’ve received, or something else?
Mm, I wonder if there are other values that would be useful to add in here.
Re: your follow-up post - I agree on the napkin math working out, and it’s even better if we’re doing one set per message type. And having more peers should also help.
Will think over the message ordering dependencies you mentioned in having three sketches vs. one. Fallback behavior #4 could be promising, where you rebuild a sketch from the messages you received over the last n minutes vs. just 1 minute, reattempt a reconciliation, and hopefully catch up to that peer. I think I’d like to avoid leaning too much on gossip queries if possible
No, I meant, say you use 24 bits for blknum, 14 for txindex, 12 for output number, 1 for direction and 12 for blocknumber, that uniquely identifies each update in 64 bits. But it someone uses the 4096th output of a tx for a channel, you can’t encode it in the set. For that, you simply encode it raw, outside the minisketch.
You don’t need to signal. When you get their sketch, it fails for you. So you go larger, if you can.
Per-peer salt penalizes scaling. Now each peer costs you more than bandwidth. If we can avoid this, we should, and I think we can.
The current blockheight you are aware of. This is important, because channel updates must be refreshed every 2048 blocks (IIRC in the new spec proposal), and you will accept new channel announcements on each block. So you expect non-zero differences in this case.
The block hash would allow you to see forks, if that happened.
Note that the long pole in the scaling tent is channel updates. Node announcements only change when you change IP address or something, and there are far fewer nodes than channels x 2. Channel announcements only happen once per channel.
I think in practice, implementations will fall back to asking for everything if reconciliation fails. It has the benefit of being simple, both to implement and test.
Yes, there’s a whole load of fun avoiding griefing here. Some rando connects and sends you 8k elements of junk, you don’t want to spend 1 second of CPU (based roughly on the graphs on the minisketch page). Fortunately you can limit it to some ceiling simply by trimming first. The ideal BOLT spec would give exact numbers here, but thorough benchmarks required (like, how long does it take the build the set for all the gossip in the first place? How long to maintain it on each block? etc).
Ah ok, makes sense now. I think another option is to use more bits per field, and then ‘overlap’ fields by XORing between the low-entropy bits of one field with the high-entropy bits of another.
As an example, a 24 bit blknum field supports another ~300 years of blockheights (we could probably shrink it?). blknum could be encoded little-endian (LE) / with the LSBs starting from index 0. txindex is expanded to 18 bits, also encoded as LE, but starts from index 20. So the two fields overlap on 4 bits:
Key[0:19] = LE(blknum)[0:19]
Key[20:23] = LE(blknum)[20:23] ⊕ LE(txindex)[0:3]
Key[24:37] = LE(txindex)[4:17]
IMO we can adjust the size of these fields based on current constraints like min. tx size, etc. such that the amount of overlap is minimal, but we wouldn’t need this fallback path.
This is the message flow I was thinking of in my response:
sequenceDiagram
participant Alice
participant Bob
Alice->>Bob: RequestSketch()
Bob-->>Alice: SketchDifferenceOfTwoFourths
Alice->>Alice: DecodeSketch()
alt DecodeSketch() failed
Alice->>Bob: RequestSketchExtension()
Bob-->>Alice: SketchDifferenceOfThirdFourth
Alice->>Alice: decode
end
I’m unclear on what you meant by “go larger”, but I think you meant that Alice should build a sketch from a larger set locally and retry reconciliation?. IIUC, if the number of differences exceeds the capacity of the sketch Bob sent Alice, Alice retrying reconciliation in this way would only succeed if the number of differences decreased between her first and second sets. Which may happen if she has reconciled with other peers in the meantime.
I think we’d probably want to do both? So if initial reconciliation fails, wait for timer_interval / 3 seconds, expand our local set, build a larger sketch, and try reconciliation again. If the 2nd reconciliation fails, ask Bob for a sketch extension and retry reconciliation a 3rd time. If it still fails, either:
Query Bob for all set elements and reconcile locally (lots of bandwidth, minimal CPU)
Give up on reconciliation with Bob for this interval, retry the normal reconciliation protocol in (timer_interval) * 2/3 seconds. Hope that our reconciliation with other peers reduces our difference with Bob.
Another option is to ask Bob for a fresh sketch of the same capacity, instead of an extension of the 1st sketch we received.
IIUC we mostly agree on the options for handling initial reconciliation failure; I think having evidence to decide will require simulation and tweaking parameters there.
My response earlier may have been missing a /s tag, sorry; I think we may just have a different mental model for which parts of this process consume more or less resources. From the benchmarks in the minisketch repo, and comments upthread, we know that:
Encoding a sketch is very fast; most of the compute for reconciliation is part of sketch merging & decoding.
Decode time is quadratic with # of differences.
Given current optimizations, decode time is 2x as long for 64-bit keys vs. 32-bit keys.
My argument re: salting and 32-bit set keys, is that the additional complexity of per-connection salts, and the extra work of computing more set keys (1 per connection vs. 1 global set key), is worth the expected ~2x savings in compute for sketch decoding.
As a supporting point, non-cryptographic hash functions like xxhash are extremely fast for small inputs. As a detracting point, I suspect that a per-peer salt would require using a bit more memory to keep track of per-peer sketch state.
I’m not married to a requirement that we use 32-bit keys, but I think it’s worth exploring / keeping in mind. Perhaps we can tweak other knobs, keep the # of differences down, and then not lose much by having longer 64-bit keys.
Makes sense, I agree
Fair; though I suspect maintaining a larger set / allowing messages to remain in the set for longer will help a lot with reconciliation reliability. Another question for simulation!
IIUC, the sketch decode has multiple steps, and most of the compute is in the final step. Before that final step, a decoder would know if decode is expected to yield a successful reconciliation with very high probability, before executing that final step. So an early exit should make the cost for that much smaller; I think this isn’t yet implemented in the minisketch library though.
I imagine you’d only want to accept sketches after explicitly requesting them, and you’d disconnect peers that repeatedly send malformed sketches.
I agree a lot of this will depend on some more benchmarking, especially since I think our set and difference sizes may be larger than what is covered on the minisketch repo right now. More varied benchmarking should be a straightforward first step before network-level simulation, I’ll look into it.
That API would be great, but meanwhile I think you just trim untrusted sources to use a small sketch, which bounds the time. If a failure to decode simply (eventually) results in you streaming all gossip, it’s no worse than the current case where a peer explicitly asks you for all gossip?