I have posted a comparison of this algorithm (SFL) with the GGT parametric min-cut algorithm, and the existing candidate set searching (CSS) algorithm from the other thread.
Here are some of the open problems still with SFL:
- Termination: there is a proof that if SFL terminates, the result is an optimal linearization. However, there is no proof that it always will, even though fuzzing seems to indicate that the “merge chunks by maximum feerate difference” heuristic is sufficient to never cause the same state to be repeated (which would suffice for termination, as there is only a finite number of possible states).
- Complexity bound: it would be good to have some bound on how many iterations the algorithm can go through. If the above observation (no repeated states) holds, then there is some bound on the number of iterations, as there are at most 2^m states (with m the number of dependencies), but this is not a very interesting bound. Somewhat stronger bounds are possible by taking into account that no more than n-1 dependencies can be active (with n the number of transactions), and that they form a spanning forest, but the bound remains exponential.
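To make the forest-state count concrete, here is a small brute-force sketch (the example cluster and its dependencies are made up for illustration) that counts how many subsets of the dependencies are acyclic, i.e., could form a valid set of active dependencies:

```python
from itertools import combinations

# Hypothetical cluster: 4 transactions, dependencies as (parent, child) pairs.
n = 4
deps = [(0, 1), (0, 2), (1, 3), (2, 3)]  # m = 4 dependencies
m = len(deps)

def is_forest(edges):
    # Union-find cycle check: a set of active dependencies is only a valid
    # SFL state if the edges form a forest (no undirected cycle).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
    return True

forests = sum(
    is_forest(subset)
    for k in range(m + 1)
    for subset in combinations(deps, k)
)
print(f"2^m = {2**m} dependency subsets, of which {forests} are forests")
```

In this toy example the gap is tiny (only the full 4-cycle is excluded), but it illustrates why the forest restriction alone still leaves an exponential number of states.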
- Equal-feerate chunk splitting: The algorithm works if splits are applied whenever the top subset has strictly higher feerate than the bottom, and merges whenever the bottom has feerate higher than or equal to the top. If instead splits are done when the top has higher-or-equal feerate, the algorithm may loop forever. If instead merges are only done when the bottom has strictly higher feerate than the top, the result may not be topological. As a result, the chunks that come out may contain multiple equal-feerate components that could be split off in a topologically valid manner, i.e., become separate equal-feerate chunks. It would be nice to find an algorithm that can efficiently find these, whether that’s done as part of SFL, or as a separate post-processing step. Note that this is not simply finding connected components within the chunks: the SFL chunks are always connected already.
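For illustration, here is a brute-force way (exponential, so emphatically not the efficient algorithm being asked for) to detect such splits in a hypothetical equal-feerate chunk; the transactions, fees, and sizes are made up:

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical connected equal-feerate chunk: every transaction has feerate 1.
fees  = {"p": 2, "c1": 3, "c2": 1}
sizes = {"p": 2, "c1": 3, "c2": 1}
parents = {"p": set(), "c1": {"p"}, "c2": {"p"}}

def feerate(subset):
    return Fraction(sum(fees[t] for t in subset), sum(sizes[t] for t in subset))

txs = list(fees)
chunk_rate = feerate(txs)

# Brute force: proper, nonempty, ancestor-closed (i.e., topologically valid)
# subsets whose feerate equals the whole chunk's feerate. Any such subset
# could be split off as a separate equal-feerate chunk.
splits = [
    set(s)
    for k in range(1, len(txs))
    for s in combinations(txs, k)
    if all(parents[t] <= set(s) for t in s) and feerate(s) == chunk_rate
]
print(splits)
```

Here the chunk is connected through the shared parent `p`, yet `{p}`, `{p, c1}` and `{p, c2}` are all valid equal-feerate prefixes, showing why connectedness alone does not settle the question.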
- Sub-chunk linearization: Somewhat related, SFL provides no way to order the transactions within a chunk. This only matters for block building with sub-chunk granularity, but intuitively, it feels like there is some information within the SFL state that may be useful. When a chunk is repeatedly split and re-merges with itself, it feels to me like it is really improving some implied sub-chunk linearization, until that sub-chunk linearization becomes good enough that it permanently splits the chunk in two. An “optimal sub-chunk linearization” would imply splitting equal-feerate chunks too, though beyond that, it’s not clear how to define optimality here.
- LIMO-like improvements: It would be useful if one could take an SFL state, and a topologically-valid subset, and make directed improvement steps such that the corresponding linearization is at least as good as moving that subset to the front of the linearization. This would permit things like mixing in ancestor sets into the state.
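The baseline operation the improvement steps would be compared against can be sketched as follows (a hypothetical helper of my own; it assumes the subset is ancestor-closed, so that the result stays topological):

```python
def move_to_front(linearization, subset):
    # New linearization: the (ancestor-closed) subset first, then the rest,
    # each part keeping its original relative order.
    return [t for t in linearization if t in subset] + \
           [t for t in linearization if t not in subset]

# Hypothetical usage: "c" depends on "a", so {"a", "c"} is ancestor-closed.
print(move_to_front(["a", "b", "c"], {"a", "c"}))
```

The open problem is then: given an SFL state and such a subset, find split/merge steps whose resulting linearization is at least as good as `move_to_front` applied to the state's linearization.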
- Merging states: a more general question is: can one, given two SFL states (sets of active dependencies), efficiently find a third SFL state whose linearization is at least as good as both linearizations corresponding to the input states? This is trivially possible by extracting linearizations from both, applying the linearization merging algorithm, and converting the result back to SFL, which is all possible in \mathcal{O}(n^2), but this loses valuable information. A final SFL state lets one decide that the corresponding linearization is optimal (by not having any more splits or merges to apply) in \mathcal{O}(n^2) time, while doing that for just a linearization is, as far as I know, equivalent to computing a (single) min-cut, which is \mathcal{O}(n^3) even with the most state-of-the-art algorithms.
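For concreteness, "at least as good" above refers to comparison by feerate diagram. A rough Python sketch of that comparison (helper names are my own; the chunking tie-break convention is one choice among the variants discussed in the equal-feerate bullet):

```python
from fractions import Fraction

def chunk(linearization, fees, sizes):
    # Greedy chunking: merge a new transaction into the preceding chunk
    # while it strictly raises that chunk's feerate.
    chunks = []  # list of [fee, size]
    for tx in linearization:
        chunks.append([fees[tx], sizes[tx]])
        while len(chunks) >= 2 and \
                Fraction(*chunks[-1]) > Fraction(*chunks[-2]):
            f, s = chunks.pop()
            chunks[-1][0] += f
            chunks[-1][1] += s
    return chunks

def diagram_at(chunks, x):
    # Cumulative fee at cumulative size x, interpolating linearly inside chunks.
    fee, size = Fraction(0), Fraction(0)
    for f, s in chunks:
        if x <= size + s:
            return fee + Fraction(f, s) * (x - size)
        fee += f
        size += s
    return fee

def at_least_as_good(lin_a, lin_b, fees, sizes):
    # lin_a is at least as good as lin_b if its feerate diagram is nowhere
    # below lin_b's; checking all chunk boundaries of both suffices, as the
    # diagrams are piecewise linear between them.
    ca, cb = chunk(lin_a, fees, sizes), chunk(lin_b, fees, sizes)
    points = set()
    for c in (ca, cb):
        acc = Fraction(0)
        for _, s in c:
            acc += s
            points.add(acc)
    return all(diagram_at(ca, x) >= diagram_at(cb, x) for x in points)

# Hypothetical example: two independent, equal-size transactions.
fees = {"a": 1, "b": 3}
sizes = {"a": 1, "b": 1}
good, bad = ["b", "a"], ["a", "b"]
print(at_least_as_good(good, bad, fees, sizes))
```

A merged SFL state would need its linearization to pass this check against both inputs; the open question is whether that is achievable without the \mathcal{O}(n^2) round-trip through linearizations.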