Gossip Observer: New project to monitor the Lightning P2P network

No problem, happy to have more feedback! There are some other problems I’ve heard about anecdotally from multiple implementers that aren’t really covered in prior literature (that I’m aware of):

  • For a well-connected node, like a Lightning service provider (LSP) that sells channels, or a routing node trying to earn a profit, adding more P2P connections and accepting gossip messages from all peers increases CPU usage significantly without improving its view of the network. As a result, bigger nodes take on extra complexity in filtering/rejecting gossip from peers and maintaining their network view with some secondary system (a rough sketch of that bookkeeping follows this list).
  • Even though I expect the P2P network to be well connected now, given the default number of connections implementations make (5+), there are reports of nodes missing messages related to entire subgraphs / neighborhoods of the payment network. So propagation of some messages may not be working reliably. This could also be caused by implementation-specific message filtering, which could be removed when moving to a sketch-based protocol.
  • There is a lot of complexity, across all implementations, around when to use gossip query messages to stay in sync with peers, as well as around policies for when to forward gossip messages.
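
To illustrate the first point, here’s a rough sketch (names and structure are hypothetical, not any implementation’s actual code) of the kind of secondary bookkeeping a well-connected node ends up maintaining just to throw away duplicate and stale gossip:

```python
# Minimal sketch of the per-peer gossip filtering a well-connected node ends up
# maintaining today. Names and policy are hypothetical; the point is the extra
# bookkeeping, not the exact rules.

class GossipFilter:
    def __init__(self):
        # Latest timestamp accepted per (short_channel_id, direction).
        self.latest_update = {}
        # Hashes of raw messages already processed (dedup across peers).
        self.seen = set()

    def should_process(self, msg_hash, scid, direction, timestamp):
        if msg_hash in self.seen:
            return False  # exact duplicate flooded to us by another peer
        prev = self.latest_update.get((scid, direction))
        if prev is not None and timestamp <= prev:
            return False  # stale or replayed channel_update
        self.seen.add(msg_hash)
        self.latest_update[(scid, direction)] = timestamp
        return True
```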

There should also be some savings of ‘CPU usage per P2P connection’, depending on the specifics of the sketch-based protocol.
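
To give a feel for where that saving would come from: with flooding, a node does per-message work on every connection even when the message is a duplicate, while with reconciliation the per-connection work is roughly proportional to the set difference with that peer. The toy round below exchanges plain sets of message IDs purely for illustration; a real protocol would exchange a compact sketch of the IDs (e.g. a minisketch-style encoding) rather than the full lists.

```python
# Toy reconciliation round between two peers, using plain sets of message IDs
# instead of an actual sketch. Only the symmetric difference needs to be
# transferred and processed, rather than every message on every connection.

def reconcile(ours: set, theirs: set):
    we_request = theirs - ours  # IDs we are missing and would fetch
    we_offer = ours - theirs    # IDs the peer is missing and would fetch
    return we_request, we_offer

ours = {"scid_1x1x1", "scid_2x2x2", "scid_3x3x3"}
theirs = {"scid_2x2x2", "scid_3x3x3", "scid_4x4x4"}
print(reconcile(ours, theirs))  # ({'scid_4x4x4'}, {'scid_1x1x1'})
```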

I don’t have a concrete number for the maximum amount of bandwidth implementations would tolerate, to be honest. The minimum amount needed (the total volume of unique messages) will grow as the network continues to grow, though, and we know that flooding is already ‘bad’ enough with the current number of P2P connections, and that we’ve already accrued many workarounds.

True - that value of 900 initial connections was just an arbitrary starting point tbh. I’m planning to have something more thought-out for the upcoming version of the observer :slight_smile:

Re: reachability - one theory I heard recently is that many of the nodes that have both Tor and clearnet addresses in their node_announcement have misconfigured routers / firewalls, such that they broadcast an IPv4 address in their node_announcement but can’t accept inbound IPv4 connections. I know Bitcoin Core has spent a lot of effort on this, with the (deprecated) UPnP support, and now NAT-PMP / PCP support. I suspect that implementations may broadcast IPv4 addresses without verifying that they can accept such connections.
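
One cheap mitigation (speculative on my part, I’m not aware of any implementation doing this) would be to probe the address before putting it in node_announcement, e.g. by asking an already-connected peer to dial back, or at minimum attempting a TCP connection to the advertised endpoint. A rough sketch of the trivial version, keeping in mind that a self-connect can still give a false positive through NAT hairpinning:

```python
import socket

# Speculative check before advertising an IPv4 address in node_announcement:
# can the advertised endpoint actually be reached? A more reliable test is
# asking a remote peer to dial back.

def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 203.0.113.7 is a placeholder address; 9735 is the default Lightning port.
if not tcp_reachable("203.0.113.7", 9735):
    print("skip advertising this IPv4 address in node_announcement")
```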

This may not affect normal operations like channel opens or broadcasting channel updates, since such a node can make an outbound connection to its counterparty and use keep-alives to work around NAT / router constraints, or just connect to the counterparty over Tor.

The upcoming version of the observer should have Tor support, which should also help reach a larger percentage of the network.

Some of that may be that my connection count was changing (decreasing) over time, to a final count of ~700 peers IIRC. So that peak at 700 likely still represents reliable propagation.

For the peak of 100, I think that may be related to different propagation behavior for certain message types; I’ll try to follow up on that.

Based on offline feedback from implementers (and my own opinion), reliability in converging to a full network view / being in sync is much more important than the convergence delay, followed by resource usage and implementation complexity.

I should be able to better observe the difference in network views over time once I start collecting data from multiple ‘observers’ at different positions in the P2P network.
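
The comparison itself is simple once that data exists: take the set of (short_channel_id, direction) pairs each observer has seen by some time T and look at the symmetric difference. A hypothetical metric:

```python
# Hypothetical metric for how far apart two observers' network views are at a
# given point in time: size of the symmetric difference relative to the union.

def view_divergence(view_a: set, view_b: set) -> float:
    union = view_a | view_b
    if not union:
        return 0.0
    return len(view_a ^ view_b) / len(union)

# e.g. sets of (short_channel_id, direction) pairs known to each observer
a = {("700000x1x0", 0), ("700000x1x0", 1), ("700123x2x1", 0)}
b = {("700000x1x0", 0), ("700123x2x1", 0), ("700456x3x0", 1)}
print(view_divergence(a, b))  # 0.5
```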

Achieving higher reliability via even more flooding connections is not a trade-off people want to make, and I think the gossip query behavior is the current substitute for that: a node may periodically query a peer for all messages from a certain timespan (e.g. send all messages from the last hour) to make sure it didn’t miss anything from flooding alone.
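
Concretely, that query behavior maps to BOLT 7’s gossip_timestamp_filter, which asks a peer to send the gossip it has with timestamps in a given window. A rough sketch of building that message (double-check the exact wire layout against BOLT 7 before relying on it):

```python
import struct
import time

# Sketch of BOLT 7's gossip_timestamp_filter (type 265): ask a peer for gossip
# whose timestamp falls in [first_timestamp, first_timestamp + timestamp_range).

GOSSIP_TIMESTAMP_FILTER = 265

def gossip_timestamp_filter(chain_hash: bytes, lookback_seconds: int = 3600) -> bytes:
    assert len(chain_hash) == 32
    first_timestamp = int(time.time()) - lookback_seconds
    timestamp_range = lookback_seconds
    # 2-byte big-endian message type, then chain_hash, then two u32 fields.
    return struct.pack(">H", GOSSIP_TIMESTAMP_FILTER) + chain_hash + struct.pack(
        ">II", first_timestamp, timestamp_range
    )

# e.g. ask the peer for everything from roughly the last hour
payload = gossip_timestamp_filter(chain_hash=b"\x00" * 32)  # placeholder chain_hash
```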