How to linearize your cluster

That just suggests the example problems being solved are effectively getting simpler as n,m increase, no?

I haven’t read the papers, but I’m not following how you construct a network flow where solving a max flow / min cut gives you a subset of txs that maximises f_C - \lambda s_C for a given \lambda. The DeepSeek approach seems like it could solve for a C that gives the largest feerate, by bisecting on different values for \lambda, but that seems more like finding the breakpoints in order?

If you have txs A at f/s = 100/1, B at 3980/50, C at 920/49 (with \lambda=5000/100=50), and where C spends B and B spends A, what’s the flow diagram that tells you the first/best breakpoint is AB vs C, rather than A vs BC?

But that’s just what the DeepSeek approach does. It finds a subset C with maximum weight under the side condition that C is a closure. And we choose the weight to be f-\lambda s for all nodes. It’s a closure because it’s a min-cut, and the original (dependency) edges get infinite capacity, so they are never cut. It maximizes weight because the capacity of the min-cut is exactly the total weight of the excluded positive-weight nodes plus the total absolute weight of the included negative-weight nodes.

The trick is that turning our feerate into a weight by introducing \lambda lets us simply sum up the weights into total weight instead of directly optimizing a ratio.

No, bisection does not find them in order. It always finds the optimal improvement for the diagram in the interval you are looking at. And then you choose the next interval to look in.

In this case the weights are 100-50=50 for A, 3980-2500=1480 for B, and 920-2450=-1530 for C. So the min-cut will pick AB as first subset/chunk.

Next you can choose to see if optimizing AB further is even better: the new \lambda (the new breakpoint) is the combined feerate of AB, which is 4080/51=80. Then the new weights become 20 for A, -20 for B, and -3000 for C. So the min-cut will pick A as the next best chunk. We could let the algorithm run again with the new breakpoint \lambda=100 and see that now we get the empty set or maybe A again, but we already know that a single node cannot be improved upon, so the optimal chunking is (A,B,C).
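
For concreteness, here is a tiny brute-force check of those numbers (Python; the function names are mine, and it simply enumerates all topologically closed subsets instead of running the min-cut construction, which is only sensible for a toy example like this):

```python
from itertools import combinations

# Toy cluster from above: C spends B, B spends A.
fee = {"A": 100, "B": 3980, "C": 920}
size = {"A": 1, "B": 50, "C": 49}
parents = {"A": set(), "B": {"A"}, "C": {"B"}}
txs = list(fee)

def closures():
    """Yield every topologically closed subset (each included tx has its parents included)."""
    for r in range(len(txs) + 1):
        for subset in map(set, combinations(txs, r)):
            if all(parents[t] <= subset for t in subset):
                yield subset

def best_closure(lam):
    """The closure maximizing sum(fee - lam*size), i.e. the best first chunk for feerate lam."""
    return max(closures(), key=lambda s: sum(fee[t] - lam * size[t] for t in s))

print(best_closure(5000 / 100))  # {'A', 'B'}: weights 50 + 1480; adding C (weight -1530) would not help
print(best_closure(4080 / 51))   # {'A'}: at lambda = 80 the weights are 20, -20, -3000
```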

1 Like

Oh, I think I follow now; I was missing something super basic just as I thought. So you have the network:

flowchart TB
         s --50--> A
         s --1480--> B
         C --1530--> t
         subgraph txs
         C --∞--> B --∞--> A
         end

and your max flow is 0, which means the residual network is the same as the original network, and the min cut sets are {s,A,B} (the nodes reachable from the source) and {C,t} (the nodes that can reach the sink). With a less trivial example you might have some flow from s to B to A to t (where B pays for its parent A whose feerate is below \lambda) in which case you’d do some actual work.
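
To make that concrete, here is a self-contained sketch (plain Edmonds-Karp rather than anything GGT-specific; the function name is mine) that computes the max flow and min cut for exactly this network:

```python
from collections import deque

INF = float("inf")

def max_flow_min_cut(nodes, edges, s, t):
    """Plain Edmonds-Karp max flow; returns (flow value, source side of a min cut)."""
    cap = {u: {} for u in nodes}
    for u, v, c in edges:
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)  # make sure the reverse (residual) edge exists
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        prev = {s: None}
        queue = deque([s])
        while queue and t not in prev:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in prev:
                    prev[v] = u
                    queue.append(v)
        if t not in prev:
            # No augmenting path left: the nodes reachable from s form the source side of a min cut.
            return flow, set(prev)
        # Push the bottleneck capacity along the path found.
        path, v = [], t
        while prev[v] is not None:
            path.append((prev[v], v))
            v = prev[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        flow += push

# Capacities are the weights at lambda = 50: A -> +50 (from source), B -> +1480 (from source),
# C -> -1530 (to sink); the dependency edges (child -> parent) are uncuttable.
nodes = ["s", "t", "A", "B", "C"]
edges = [("s", "A", 50), ("s", "B", 1480), ("C", "t", 1530),
         ("C", "B", INF), ("B", "A", INF)]
flow, source_side = max_flow_min_cut(nodes, edges, "s", "t")
print(flow, source_side)  # max flow 0, source side {s, A, B} -> first chunk is {A, B}
```

Re-running it with the \lambda=80 capacities (s→A: 20, B→t: 20, C→t: 3000) gives source side {s, A}, matching the next chunk {A} from the previous post.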

So… is the essence of the idea really just:

  • take your collection of txs, and its overall feerate \lambda
  • calculate a min cut of S (higher feerate) and T (lower feerate) on a network with n+m edges using f-\lambda s and \infty as the capacities
  • repeat the algorithm on both S and T to further improve things, prioritising S
  • run the algorithm at most 2k-1 times where k is the final/optimal number of chunks – each run will either split a chunk into two, or tell us we no longer need to do any more processing of that chunk

And I guess that means the different variants are simply about reusing info from the previous step (eg of ABC with \lambda_0=50) to make the next step (eg of AB with \lambda_1=80) faster than doing it from scratch, since the nature of the network is really restricted – all you’re doing is changing the value of \lambda and getting either a subset of S as you increase it, or a subset of T as you decrease it?
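
If that summary is correct, the whole procedure fits in a short recursion. A rough sketch (the closure step here is brute force standing in for the min cut, all names are mine, and it reproduces the (A,B,C) chunking from the example above):

```python
from itertools import combinations

def best_closure(subset, fee, size, parents, lam):
    """Brute-force max-weight closure of `subset` for weights fee - lam*size (min cut in practice)."""
    best, best_w = frozenset(), 0.0
    for r in range(1, len(subset) + 1):
        for cand in map(frozenset, combinations(subset, r)):
            # Only dependencies inside `subset` matter; anything outside was already chunked earlier.
            if all(parents[t] & subset <= cand for t in cand):
                w = sum(fee[t] - lam * size[t] for t in cand)
                if w > best_w:
                    best, best_w = cand, w
    return best

def chunks(subset, fee, size, parents):
    """Split `subset` at its overall feerate and recurse on both sides, high-feerate side first."""
    if not subset:
        return []
    lam = sum(fee[t] for t in subset) / sum(size[t] for t in subset)
    high = best_closure(subset, fee, size, parents, lam)
    if not high or high == subset:
        return [subset]  # no closure beats the overall feerate: this is a single chunk
    return chunks(high, fee, size, parents) + chunks(subset - high, fee, size, parents)

fee = {"A": 100, "B": 3980, "C": 920}
size = {"A": 1, "B": 50, "C": 49}
parents = {"A": set(), "B": {"A"}, "C": {"B"}}
print(chunks(frozenset(fee), fee, size, parents))  # [frozenset({'A'}), frozenset({'B'}), frozenset({'C'})]
```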

1 Like

I believe a different way of looking at this is that the min cut algorithm collects all the chunks from the optimal feerate diagram whose individual chunk feerate is greater than the target feerate \lambda_1. That makes Q_1 maximal, because a greater slope means the lines were moving apart as the cumulative size increased (and a lower slope from then on means they’re getting closer together).

(Finding the residual network from the max flow eliminates chunks with lower feerates: such a chunk is either only connected to the sink, or any higher-feerate children it has lose their connection to the source, because more flow is able to go to the sink than can come from the source.)

2 Likes

Oh, right, of course. That’s an easier way to think about it.

@sipa on which branch can one follow the progress of this work? Is it wip_memepool_fuzz?

Depending on what you mean by “this work”:

  • The old exponential algorithm (which the first post in this thread is about) is merged in Bitcoin Core’s master branch, along with various related things (postlinearizing, merging, LIMO, ancestor sort, …). The current work is building a higher abstraction layer around it to manage clusters, and integrating it in the mempool and validation code. See the tracking issue. This is currently taking most of my time.
  • The simplex-derived spanning forest algorithm that I recently posted about is implemented in my spanning_tree_linearization branch. I am currently no longer working on it, as the min-cut approaches being discussed here are more promising.
  • As for min-cut based linearization approaches, my thinking is to start experimenting with a from-scratch implementation of GGT (with FIFO or max-label strategies) as that seems to be the most promising balance between implementation complexity and worst-case asymptotic behavior, but I don’t have any code yet.

I expect that the min-cut work will eventually lead to a better cluster linearization algorithm that can be a drop-in replacement of the currently-merged code, but that’s probably a longer-term thing than getting the existing code actually operational.

1 Like

So, short update.

My plan is still to write a prototype of GGT (with max-distance selection strategy, not dynamic trees, so \mathcal{O}(n^2 \sqrt{m}) complexity) to experiment with, but no code yet.

@ajtowns and I spent some time thinking about how to represent capacities/flows internally. The paper only describes them as real numbers, but actual implementations need to represent them somehow:

  • The most obvious choice is using floating-point data types, but these involve potential rounding issues. More conceptually, it’s also unclear under what circumstances they actually yield exact results.
  • A line of thinking we followed for a while was representing them as exact fractions. It turns out that within one min-cut (ignoring GGT slicing), all capacities will be a multiple of 1/S, where S = \sum_{i \in G} \operatorname{size}(i), so one can easily represent all capacities/flows as integer multiples thereof, which can grow up to \mathcal{O}(FS) (with F = \sum_{i \in G} \operatorname{fee}(i)). Unfortunately, when slicing (descending into one side of the cut) in GGT, the flows are inherited by the child problem, which needs a different denominator. Bringing everything onto the same denominator could lead to ever-growing numerators.
  • A next line of thinking was to convert the problem from one denominator to another, by first mapping to the closest integer, and then getting rid of the remaining fractional parts using Flow Rounding, which finds a flow with integer numerators for the subproblem given a flow for the parent problem with integer numerators. It’s not very costly computationally, but seems nontrivial, and just feels unnecessary.
  • This paper suggests a simpler alternative: multiply the flows and capacities by a fixed constant (the same for the entire GGT problem, including subproblems) and represent them as rounded integer multiples. It argues that as long as this multiplier M is such that all distinct breakpoints (in our context: chunk feerates) multiplied by M are at least 2 apart, the found min-cuts will be exactly correct. No two (distinct) chunk feerates can differ by less than 1/(S^2 - S), so picking M=2S^2 would suffice for exact results. This involves \mathcal{O}(S^3 F) multiplication results internally, but 128-bit integers should more than suffice in practice.

EDIT: I realized that integers capable of representing 4 S^2 F suffice. The two places where multiplications happen are:

  • In the computation of the scaled group feerate for a group g, \lambda_g = -(M f_g) / s_g.
  • In the computation of from-source / to-sink capacity of a transaction x, c_x = M f_x + \lambda_g s_x.

There are two cases of an M f multiplication, and one \lambda_g s_x multiplication. In the latter it holds that s_x \leq s_g (since x \in g), thus |\lambda_g s_x| cannot exceed M |f_g| either, and |c_x| is bounded by 2MF. If M is set to 2S^2, this yields 4S^2F.
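
Plugging the toy cluster from earlier in the thread (A = 100/1, B = 3980/50, C = 920/49) into these formulas, just to illustrate the magnitudes (rounding details elided; the division happens to be exact here):

```python
# Toy cluster from earlier in the thread.
fee = {"A": 100, "B": 3980, "C": 920}
size = {"A": 1, "B": 50, "C": 49}

S = sum(size.values())     # 100
F = sum(fee.values())      # 5000
M = 2 * S * S              # 20000: keeps distinct chunk feerates at least 2 apart after scaling

group = ["A", "B", "C"]
f_g = sum(fee[t] for t in group)
s_g = sum(size[t] for t in group)
lam_g = -(M * f_g) // s_g  # scaled (negated) group feerate; exact here, rounded in general
for t in group:
    c = M * fee[t] + lam_g * size[t]
    print(t, c)            # A: 1000000, B: 29600000, C: -30600000 (20000x the earlier weights)
print(4 * S * S * F)       # 200000000: every |c_x| above stays well within this bound
```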

2 Likes

This approach seems appropriate. I’d even argue that we don’t much care for exact breakpoints if they are very close, so 64-bit arithmetic might be good enough.

Maybe these considerations are also another hint that calculating one breakpoint at a time using a simple min-cut algorithm is preferable for our small instances.

2 Likes

That is probably true. Even if cluster sizes are limited to 4 MWU, and feerates are limited to ~10000 sat/vB, M = 461 can be used, which suffices to separate chunks whose feerates are just 0.004 sat/vB apart.

That said, I think the cost of 128-bit vs 64-bit arithmetic is probably close to negligible. There is one division per chunk, and a few multiplications per transaction per chunk. Everything else (the bulk of min-cut itself) is just comparisons, additions, and subtractions. I expect the bulk of the runtime cost to come from iterating over nodes and edges, and maintaining the data structures for them.


Unrelatedly, the paper that introduced the \mathcal{O}(n^2 \sqrt{m}) bound for max-height push-relabel also looks interesting. It appears to contain recipes for building worst cases for which that complexity is actually reached. This may be useful for us for benchmarking worst cases.

5 Likes

Hi @sipa. As far as I understand, linearizing a cluster means any algorithm that takes as input a dependency graph of transactions and outputs those transactions in topological order (if B is a child of A, “B->A”, then A must come before B). I understand that filling blocks maximizing fees is a hard problem (literally), but some approximate heuristics can be used to obtain fairly good solutions, for example linearizing the clusters of transactions and merging the resulting sequences. But it is still not clear to me what the definition of an optimal linearization is. How do you measure the quality of one linearization vs another?

I am asking because I can think of other ways, besides the one described here, to linearize the cluster and I am not able to judge it good or bad because I lack the definition of optimality or the measure of goodness in this problem.

1 Like

I’ve made a number of posts on the topic, but it’s explained at least in:

Summarized very briefly:

  • A linearization is any topological ordering for the transactions in a cluster (i.e., parents always go before children).
  • The fee-size diagram for a linearization of a cluster is the convex hull through the points whose (x,y) are the (size,fee) of all prefixes of the linearization. The line sections of the diagram correspond to chunks of the linearization: groups of transactions which “pay for each other” (which can be seen as a generalization of CPFP).
  • A linearization L_1 is at least as good as a linearization L_2 for the same cluster if the diagram of L_1 is nowhere below the diagram of L_2. The intuition is that we don’t know ahead of time how much of the cluster we’ll be able to include in a block (when considering clusters in isolation), so we want something that is good “everywhere” - the diagram tells us (roughly, ignoring the fact that the convex-hulling makes it approximate) how good a linearization is for every possible prefix that gets included.
  • There is a proof that, by this definition, at least one optimal linearization exists for every cluster, which means a linearization that is at least as good as every other linearization for the cluster. In other words, making the convex-hull approximation is sufficient to simplify the problem enough so that a globally optimal linearization always exists, abstracting away the fact that we don’t know ahead of time how much of a cluster will be included in a block.

One way of constructing an optimal linearization is by finding the highest-feerate topologically-closed subset of the cluster, moving it to the front of the linearization, and then continuing with what remains of the cluster. GGT uses another approach, by subdividing the cluster into smaller and smaller portions, which end up being the chunks.
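
To make the chunk notion a bit more tangible: one simple way to compute the chunks of a given linearization is to scan it once and merge a group into its predecessor whenever the later group has the higher feerate. A rough sketch (names are mine):

```python
def chunks_of(linearization, fee, size):
    """Split a topologically valid linearization into chunks of non-increasing feerate."""
    chunks = []  # list of (txs, total_fee, total_size)
    for tx in linearization:
        chunks.append(([tx], fee[tx], size[tx]))
        # While the newest chunk has a higher feerate than the one before it, merge them:
        # the later transactions are "paying for" the earlier ones (generalized CPFP).
        while len(chunks) > 1 and chunks[-1][1] * chunks[-2][2] > chunks[-2][1] * chunks[-1][2]:
            txs2, f2, s2 = chunks.pop()
            txs1, f1, s1 = chunks.pop()
            chunks.append((txs1 + txs2, f1 + f2, s1 + s2))
    return chunks

fee = {"A": 100, "B": 3980, "C": 920}
size = {"A": 1, "B": 50, "C": 49}
print(chunks_of(["A", "B", "C"], fee, size))
# [(['A'], 100, 1), (['B'], 3980, 50), (['C'], 920, 49)] -- feerates 100 > 79.6 > ~18.8 already decrease
```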

2 Likes

If the “convex hull” part of Sipa’s description of optimality seems capricious to you, perhaps it would be helpful for me to share why I found it intuitive:

Consider the mining problem without the issue of dependencies. An obvious approximation algorithm for the problem is to sort transactions by feerate, and include transactions in that order until you can’t fit the next one.

One thing to observe about the algorithm is that if the last transaction you include exactly fills the block, then this greedy algorithm is optimal, not approximate. With that in mind, you can see that there is an obvious bound on the worst-case approximation error: you will at most miss out on fees for the unfilled bytes times the next best feerate. The relative error is also small so long as transactions are small compared to the block (since the loss is no more than the left-out transaction). Of course, in practice it’s less than that if you fill in the remaining space as best you can (which is what Bitcoin Core has done for a long time).

A similar kind of analogy exists for the “chunk” approximation: the constraint to chunks causes an approximation error in the otherwise elegant definition of optimality, but only when the available space doesn’t line up with chunk boundaries. And when it doesn’t, the amount of error is bounded by the maximum size of a chunk.

And since any kind of optimized linearization is going to have some less attractive parent transactions ahead of the child transactions that make them worthwhile, there will always be some approximation loss for any algorithm that isn’t figuring out the order just in time. I think an argument can also be made that it doesn’t make sense to optimize for any particular fullness location, because the distribution of the remaining space is going to look like a sum of arbitrarily many randomly sized chunks mod the typical chunk size, and will be pretty uniformly distributed (though admittedly that’s a total hand-wave, but it’s my intuition).

I think it’s also important to keep in mind that the linearization is usually unimportant:

The vast majority of all clusters have only one possible linearization, because there is only a single txn or the txns form a straight chain. And the linearization is irrelevant during both mining and eviction if either none of the transactions would be selected or all of them would be selected – which is, again, usually the case.

3 Likes

As I was pondering all the potential ways an adversary might abuse the fact that finding a good chunking/linearization can take time (even if it’s polynomial), the following idea struck me: why don’t we allow/expect someone submitting a cluster of transactions that conflicts with some mempool txns (RBF) or that improves the mempool diagram (CPFP) to show this improvement themselves by also submitting a chunking/linearization of all affected txs?

While we have been discussing in this thread that it is absolutely possible to calculate an optimal chunking in reasonable time, it might still be a burden to do this for every adversarially submitted tx on low spec hardware. But most txns don’t need this at all, and for those that do, Core could offer to do it locally and then forward this linearization with the txns. Checking if it really improves the status quo should be possible in linear time and in particular shouldn’t be harder than what is checked now.

There are surely encodings for such chunkings that add negligible space overhead to the transactions themselves.
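
For the checking side, a rough sketch of what a “diagram is nowhere below” comparison could look like, assuming both chunkings cover the same transactions and already have non-increasing chunk feerates (names are mine; this version is quadratic for brevity, but a single merged walk over both breakpoint lists would make it linear):

```python
def diagram(chunks):
    """Cumulative (size, fee) points of a chunking, starting at the origin."""
    pts, s, f = [(0, 0)], 0, 0
    for _, cf, cs in chunks:
        s, f = s + cs, f + cf
        pts.append((s, f))
    return pts

def fee_at(pts, x):
    """Evaluate the piecewise-linear diagram at cumulative size x."""
    for (s0, f0), (s1, f1) in zip(pts, pts[1:]):
        if s0 <= x <= s1:
            return f0 + (f1 - f0) * (x - s0) / (s1 - s0)
    return pts[-1][1]

def at_least_as_good(chunks1, chunks2):
    """True if the diagram of chunks1 is nowhere below that of chunks2 (same total size assumed)."""
    p1, p2 = diagram(chunks1), diagram(chunks2)
    breakpoints = sorted({s for s, _ in p1} | {s for s, _ in p2})
    return all(fee_at(p1, x) >= fee_at(p2, x) for x in breakpoints)

# Chunks as (txs, total_fee, total_size); "coarse" is the same txs with A and B merged, just for illustration.
best = [(["A"], 100, 1), (["B"], 3980, 50), (["C"], 920, 49)]
coarse = [(["A", "B"], 4080, 51), (["C"], 920, 49)]
print(at_least_as_good(best, coarse), at_least_as_good(coarse, best))  # True False
```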

Yeah, we’ve discussed the possibility of relaying linearizations or other cluster/chunking information along with transactions.

It’s hard to make it a requirement however, because the sender’s cluster might not be the same as your cluster, and requiring that they’re identical would permit even worse attacks, by an attacker attaching different incompatible transactions to the same cluster for different peers.

Still, it is possible to do this as an independent improvement, because the linearization merging algorithm can always be applied between the linearization received over the network, and the linearization the receiver already had for a cluster, even when the clusters aren’t exactly identical.

Because of the above however, I don’t think it can remove the need for having a fast-but-good-enough from-scratch linearization algorithm that can run at relay/tx processing time.

1 Like

I don’t think the clusters have to be identical. We just want the received linearization to beat any conflicting transactions or show the improvement by CPFP. Can you elaborate on a scenario where an attacker can do this in multiple ways? What would they gain? Would this enable free relay or prevent me from further improving the graph? I’m having a hard time envisioning a concrete attack.

In particular, we can always assume ancestor sets as the baseline. Beyond that, BYO linearization or optimize your whole mempool with the algorithms from this thread if you have spare resources.

Ok, taking a step back.

I’m interpreting your suggestion as follows:

Instead of computing linearizations locally every time a cluster changes, require peers to give you a better linearization than what you had before, along with the new transaction, in order for it to be acceptable.

This is nice, because it means the cost of computing linearizations can be deduplicated, and the responsibility to find it lands on the relayer, who probably already has it anyway or they wouldn’t have accepted/relayed it.

But what if someone wants to send you a transaction to attach to your cluster (incl. a linearization for the result), but you already have in the cluster a transaction that conflicts with the newly relayed one (and this conflicting one, for DoS-protection reasons, can’t be evicted)?

The rule cannot be “if the linearization received isn’t for the exact cluster you’re considering, reject it”, because that would in some cases make it trivial for an attacker to prevent a transaction from relaying at all (spam the network simultaneously with many different transactions attaching to the same cluster, so everyone has a different cluster).

The rule could be “merge the linearization received with the linearization you already have, for the part that overlaps, and if the result is better than the cluster you had, and it doesn’t conflict with anti-DoS rules, accept it”. However, I think this can still be used by an attacker to thwart relay, by attaching different other transactions to the cluster that offer significant linearization opportunities when combined with the new honest transaction being relayed. E.g., the honest transaction and the attacker’s attachment both fee-bump a parent, but the attacker’s transaction can bump more than the honest transaction alone. In this case, when the honest transaction is relayed along with a linearization, which does not include the already-accepted attacker transaction, it may appear worse (or not strictly better) than what the receiver already had.

Note that all of this is about using network-relayed linearizations as a replacement for requiring nodes to compute their own good from-scratch linearizations/improvements, which I fear could be exploited. If we’re just talking about additionally relaying linearizations and incorporating them along with self-computed ones, there is no problem.

Yes, that’s what I had in mind.

This is where I fail to see the problem ATM. Of course, it’s hard to prove that there isn’t any problem, but intuitively, an attacker would have to attach transactions that make the mempool objectively better (and provide linearizations for them if they are not trivial) and so has to spend as many resources as anybody else.

Sounds like fair competition to me. The attacker got there first and I hadn’t yet heard of their update to the mempool. OK, my tx gets rejected. Ideally with a reason why. But this way or through normal propagation I will learn of the newly attached attacker transaction eventually. When I do, I do the calculation again and resubmit with a better linearization. Of course the attacker could again attach something better in the meantime, but again it will cost them fees.

It may well be that my intuition here is way off, but it feels to me like this kind of dance should already be possible now, just with one less round of communication between me and the relay, because my linearization might be outdated.

Notice also that in this case, one min-cut calculation with the feerate that we are competing against would suffice for the relay node to check if the current cluster can be improved upon. And if this times out, the sender can still retry.

I was also wondering if the problem isn’t even easier in our setting because we receive the transactions online. That is, whether we receive one tx or a whole bundle, these new txs can only be dependent on clusters already in the mempool. But the mempool clusters cannot depend on the new txs. With this limitation, can there even arise linearizations that are better than merging a linearization for the new txs with a linearization of the existing clusters?