How to linearize your cluster

Sorry, I don’t think this is possible.

When you remove a chunk, it may be possible to find a way to keep the capacities of the chunk itself monotonic, but everything else pretty fundamentally needs to go up, because the next \lambda will be lower (you’ve removed the highest-feerate subset, so the next subset you remove can only have a lower feerate).

I could help with the implementation of a min-cut algorithm from scratch. Given that clusters are expected to be small, I would stick to simpler implementations, so the bisection search would be my pick (the E.L. Lawler approach mentioned above). A tailored implementation could also exploit the problem’s specific characteristics, e.g. that all arcs besides those connecting s and t have unlimited capacity, and that the cluster doesn’t have cycles.
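To make the bisection idea concrete, here is a rough self-contained sketch (the field names and the brute-force inner step are purely illustrative; in a real implementation the inner step would be a single min-cut computation). The key observation is that for a candidate feerate \lambda, a strictly positive maximum closure weight means some topologically valid subset has feerate above \lambda, so one can bisect on \lambda:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative only: Lawler-style bisection over the chunk feerate lambda.
// For tiny clusters the "maximum-weight closure" step is brute-forced over all
// topologically valid subsets; a real implementation would use a min-cut here.
struct Tx { int64_t fee; int64_t size; uint32_t parents; }; // parents = bitmask of direct parents

// Best weight of a non-empty closed subset, where tx i contributes fee_i - lambda * size_i.
static double BestClosureWeight(const std::vector<Tx>& txs, double lambda, uint32_t& best_set)
{
    const uint32_t n = (uint32_t)txs.size();
    double best = -1e300;
    best_set = 0;
    for (uint32_t mask = 1; mask < (1u << n); ++mask) {
        bool closed = true;
        double weight = 0.0;
        for (uint32_t i = 0; i < n; ++i) {
            if (!(mask >> i & 1)) continue;
            if ((txs[i].parents & mask) != txs[i].parents) { closed = false; break; }
            weight += txs[i].fee - lambda * txs[i].size;
        }
        if (closed && weight > best) { best = weight; best_set = mask; }
    }
    return best;
}

int main()
{
    // Toy cluster: tx1 pays for its parent tx0 (CPFP); tx2 is an unrelated low-feerate tx.
    std::vector<Tx> txs{{1000, 400, 0b000}, {5000, 200, 0b001}, {500, 300, 0b000}};
    double lo = 0.0, hi = 50.0; // bracketing interval for the best chunk feerate
    for (int it = 0; it < 50; ++it) {
        double mid = (lo + hi) / 2;
        uint32_t ignored;
        // A strictly positive weight means some topological subset has feerate above mid.
        if (BestClosureWeight(txs, mid, ignored) > 0) lo = mid; else hi = mid;
    }
    uint32_t best;
    BestClosureWeight(txs, lo, best);
    std::printf("best chunk feerate ~= %.3f, subset mask = %#x\n", lo, best); // ~10, mask 0x3
}
```

Because fees and sizes are integers, only finitely many distinct feerates are possible, so in principle the bisection can be terminated exactly rather than after a fixed iteration count.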

It would be nice to have some test cases, like the typical worst case clusters one might expect and such, for benchmarks.

I’ve looked into the problem some more and it turns out that its structure is really fascinating and helpful. The key property here is that the min-cuts are nested for a large class of parametric min-cut problems including the monotone source-sink ones. This means that as we look at the series of potential feerates \lambda from the highest to the lowest, the min-cuts are only ever increasing in the sense that the highest weight closure for a lower \lambda includes all the highest weight closures for higher \lambda. To be more precise, there always is a min-cut with this property and all the algorithms that we are looking at here are preserving it, for example by always looking for a sink-minimal min-cut. This also makes sense intuitively: Including a better chunk into your worse chunk only improves the feerate.

This property is very useful: Firstly, it means that there can only be at most O(n) different min-cuts/subsets/\lambda that we need to find. And secondly, it means we can solve the entire problem (not just finding the highest feerate subset, but partitioning the entire cluster into optimal chunks) in asymptotically the same time that it takes to find a single min-cut!

Because of the nested min-cuts, one can also simply continue on to the next lower \lambda to find the next best closure (which includes the previous ones, but of course it’s easy to strip these away to get at the chunks). All this trickery that we were thinking about isn’t really necessary. However, it turns out the canonical way of speeding up the remaining work after finding the highest-fee closure is to contract the entire source side (the closure) into the source, and this was already proposed in the GGT paper.

Another interesting thought occurred to me as I was thinking about this problem: One application where we might use these algorithms is in deciding whether a potential RBF tx improves the cluster diagram. But this is especially simple because we already know the feerates that we are comparing to. So calling any min-cut algorithm at these \lambda will at least let us know if the new cluster improves on the old one. But maybe this is premature optimization.

Let’s talk about implementation: This paper from 2007 compares an implementation of the full GGT algorithm (which includes calculating the flow forward and backwards from the sink to the source in order to get a better worst-case bound) to a simplified GGT algorithm (which doesn’t do the bidirectional flow calculation and “starts each maximum flow from scratch” but also seems to do the source contraction). The simple algorithm always wins in practice (they also compare it to an algorithm that is even faster in natural instances but only works on bipartite graphs, which doesn’t apply here).

This paper from 2012 is also really helpful, in particular because it supplies multiple implementations (including their own “simpler” algorithm) in the GitHub repo I’ve already linked above. They even have benchmark examples there, though I’m not sure how comparable they will be to our instances. However, their algorithm seems to perform best on all their instances, they are all MIT-licensed C++ implementations, and it seems to me they solve exactly our full problem: We just need to use -\lambda instead of \lambda everywhere (because the standard description of source-sink monotone has capacities that are increasing at the source and decreasing at the sink with increasing \lambda, and we use it the other way round), and instantiate the capacity functions in a slightly more involved way than they do (they use affine functions, but the only important property for their algorithm, or really for any algorithm that finds all min-cuts, is that one can easily find the roots of these functions, which should be just as easy for ours). It’s great that you want to help with the implementation, @Lagrang3, and I suggest starting by investigating this code. I’m not a C++ wizard and you guys will certainly have an easier time deciding how usable this is for us. For example, are its dependencies a problem, are there ways of speeding it up even more in our case, etc.
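For what it’s worth, in our setting the capacity functions would themselves be piecewise affine in \lambda, and their roots are trivial to find. Here is a minimal sketch of how they could be instantiated (names are made up, not taken from the linked repo):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Minimal sketch of the parametric capacities for the maximum-feerate-closure
// problem. Each transaction i has weight w_i(lambda) = fee_i - lambda * size_i;
// its positive part is the capacity of the arc s -> i, its negative part the
// capacity of i -> t, and dependency arcs get unlimited capacity. Note the
// monotonicity direction: source capacities are non-increasing and sink
// capacities non-decreasing in lambda, i.e. "the other way round" as described above.
struct ParamCapacity {
    int64_t fee;
    int64_t size;

    double SourceCap(double lambda) const { return std::max(0.0, fee - lambda * size); }
    double SinkCap(double lambda) const { return std::max(0.0, lambda * size - fee); }
    // Both capacity functions have their single root at the transaction's own feerate.
    double Root() const { return double(fee) / double(size); }
};

int main()
{
    ParamCapacity c{5000, 200}; // a 25 sat/vB transaction
    std::printf("root=%.2f src(10)=%.0f sink(10)=%.0f src(30)=%.0f sink(30)=%.0f\n",
                c.Root(), c.SourceCap(10), c.SinkCap(10), c.SourceCap(30), c.SinkCap(30));
}
```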

Now it might be that, because our graphs are so small, an even simpler approach is faster: really any standard min-cut/max-flow algorithm, combined with some logic that finds the correct series of \lambda (this process is described quite well in both papers). So it would probably indeed be helpful to have some test cases/benchmarks.


I have posted a topic about the algorithm we had been working on before these min-cut based approaches were discovered: Spanning-forest cluster linearization. Perhaps some of the insights there carry over still.


That’s an amazing insight, and with it, it indeed sounds plausible that with the same overall complexity all chunks can be found. My belief was that since there can exist O(2^n) different-feerate chunks, an algorithm like this needs extra work to remove previous chunks to avoid the blowup, but what you’re saying is that maximizing weight for a given minimum \lambda already accomplishes that?

Determining if an RBF is an improvement isn’t just a question of whether the first chunk is better, but whether the diagram is better everywhere (RBFs can conflict with transactions in other clusters even, and we care about the combined diagram in this case).

This is a very preliminary comment, as I haven’t looked at the actual implementation, but I’m skeptical that any existing implementation will do. We’re working with extremely tight timeframes (sub-millisecond, and preferably well below that), which in my experience means that even just the cost of converting the problem to a data structure an existing implementation accepts may already be non-trivial. If the approach is restricted to a background thread that re-linearizes any hard things that weren’t linearized optimally at relay time, such concerns are less relevant, but still, ideally we have just a single codebase that can handle all cases.

Unfortunately, because we work in an adversarial setting, the real question isn’t (just) about real-life clusters we see today, but also worst-case clusters that attackers can construct. Obviously it doesn’t matter that attacker clusters are optimally linearized, but attackers may in some scenarios be able to attach their transactions to honest users’ transactions, and ideally, even in these settings simple use cases keep working (e.g. CPFP).

As an example, imagine an honest user performing a CPFP, which causes a transaction to be fee-bumped slightly. Simultaneously, an attacker manages to attach to this simple CPFP cluster a bunch of transactions of their own, which involve significant potential gains from good linearization. A linearization algorithm that spends all its time optimizing the attacker side, but then runs out of time and is interrupted before it ever considers finding the honest users’ fee-bump, would break things.

In the currently-merged code, this is addressed by always including an ancestor-set finding in the result: each successive chunk has a feerate that is at least as high as the best ancestor set (a single transaction together with all its ancestors) among the transactions that remain. It may be possible to keep using that strategy here; please have a look at the LIMO algorithm, which may still apply. There is no guarantee that that is sufficient for any particular use case, but it’s probably not far from the best we can do within O(n^2) time algorithms, and anything worse than O(n^2) is probably infeasible in the worst case for the cluster sizes we want to support within our time limits.
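For reference, the ancestor-set computation itself is cheap. A rough sketch with made-up field names (bitmask parent sets, iterated transitive closure), not the actual merged code:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative sketch of the "best ancestor set" idea referred to above: for
// every transaction, take it together with all of its ancestors, and pick the
// set with the highest feerate.
struct Tx { int64_t fee; int64_t size; uint32_t parents; }; // parents = bitmask of direct parents

// Transitive closure of the parent relation for transaction i (including i itself).
static uint32_t AncestorSet(const std::vector<Tx>& txs, unsigned i)
{
    uint32_t anc = 1u << i;
    bool changed = true;
    while (changed) {
        changed = false;
        for (unsigned j = 0; j < txs.size(); ++j) {
            if ((anc >> j & 1) && (txs[j].parents & ~anc)) { anc |= txs[j].parents; changed = true; }
        }
    }
    return anc;
}

int main()
{
    std::vector<Tx> txs{{1000, 400, 0b000}, {5000, 200, 0b001}, {500, 300, 0b000}};
    double best_feerate = -1.0;
    uint32_t best_set = 0;
    for (unsigned i = 0; i < txs.size(); ++i) {
        uint32_t set = AncestorSet(txs, i);
        int64_t fee = 0, size = 0;
        for (unsigned j = 0; j < txs.size(); ++j) {
            if (set >> j & 1) { fee += txs[j].fee; size += txs[j].size; }
        }
        if (double(fee) / size > best_feerate) { best_feerate = double(fee) / size; best_set = set; }
    }
    std::printf("best ancestor set mask=%#x feerate=%.3f\n", best_set, best_feerate);
}
```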

But all of this means that what we’re actually aiming for isn’t all that well defined. Still:

  • We don’t necessarily care about the time it takes to find an optimal linearization, but more about how much improvement to the linearization is made per time unit.
  • It’s not necessarily actual clusters we see today that matter, but also worst cases attackers can produce.
  • We probably need an algorithm that can run with a time limit, and produce a valid (possibly non-optimal) linearization still.
  • It would be nice if the algorithm can incorporate “externally” found good topologically-valid sets, like LIMO can with ancestor sets, guaranteeing them as a minimum quality level.
  • When the algorithm runs with a time limit that results in the optimal not being found, ideally the work it did is spread out over all transactions. This may mean some degree of randomization is needed to prevent deterministic behavior that lets an attacker direct where work is performed. It may even be worth doing so if the randomization worsens the worst-case complexity.
  • Probably obvious, but still worth stating: for small problems like ours, constant factors matter a lot, and may matter more than asymptotic complexity. In particular, things like bitvectors to represent sets of transactions are possible, which in theory have O(n) cost set operations, but with very significantly lower constant factors than, say, an O(\log n) tree structure which a purely theoretical complexity analysis would likely suggest (see the sketch below this list).
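As a tiny illustration of that last point (a sketch, with an arbitrary cluster-size bound):

```cpp
#include <bitset>
#include <cstdio>

// With a fixed upper bound on cluster size, a set of transactions fits in a few
// machine words, and union/intersection/difference are a handful of word
// operations even though they are formally O(n).
constexpr unsigned MAX_CLUSTER = 64;
using TxSet = std::bitset<MAX_CLUSTER>;

int main()
{
    TxSet chunk, ancestors;
    chunk.set(0); chunk.set(3); chunk.set(7);
    ancestors.set(3); ancestors.set(5);

    TxSet both = chunk & ancestors;      // intersection: one AND per word
    TxSet merged = chunk | ancestors;    // union: one OR per word
    TxSet stripped = chunk & ~ancestors; // difference

    std::printf("|both|=%zu |merged|=%zu |stripped|=%zu\n",
                both.count(), merged.count(), stripped.count());
}
```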

That’s great! I do think we need time to experiment with these algorithms, because as stated above, actual performance will matter a lot. Once I understand the min-cut algorithms and this paper better I will probably try writing a from-scratch implementation too.

Worst cases really depend on the algorithm. This isn’t just theoretical: the currently-merged (exponential) linearization algorithm seems to have clusters with O(1) dependency per transaction as worst case, while the spanning-forest algorithm I had been working on (see above) seems to have clusters with O(n) dependencies per transaction as worst case.

It is entirely possible that a well-optimized min-cut based implementation works in pretty much negligible time for any real clusters we see today, which makes it hard to source benchmarks there.

That said, there are benchmarks for the currently-merged algorithm: bitcoin/src/bench/cluster_linearize.cpp at 94ca99ac51dddbee79d6409ebcc43b1119b0aca9 · bitcoin/bitcoin · GitHub


Thank you for the in-depth writeup. It’s certainly full of interesting observations. It appears you have independently rediscovered a lot of the theory that applies here.

I’m not quite sure I understand your question. What I am saying is that for any \lambda, a min-cut calculation gives us a chunk/closure of at least feerate \lambda (or possibly the empty set if there is no such chunk). For a very low \lambda (say the lowest feerate of any transaction), it might just give us the entire cluster. If we increase \lambda from there, we will get better chunks until we overshoot to a \lambda that is higher than the optimum feerate (after that we might just get the empty set). But the nested cut property means that in every step we get a subset of the earlier chunks. So if we search in the other direction, starting with a high \lambda, we get bigger and bigger chunks and can just remove the better chunks we have found before.

In particular, the min-cuts/optimal chunks we find this way form a sub-lattice of the lattice of closures/topological subsets, that is, they are closed under union and intersection. Even more astoundingly, if X_1 is a min-cut for \lambda_1 and X_2 a min-cut for \lambda_2 with \lambda_1 \geq \lambda_2, then X_1 \cap X_2 is a min-cut for \lambda_1 and X_1 \cup X_2 is a min-cut for \lambda_2. This is sometimes called the ascending property. And I believe it is exactly this structure that you have been rediscovering with the cluster diagrams and guides.

I know. But the structure that I have just tried to explain appears to guarantee that we can check if we can find a better diagram for the cluster including the new RBF tx by calculating a min-cut at the breakpoints of the combined conflicting diagram. In this way we might not find the optimal chunking, but we can decide if it is better than what we already have.

Yes, that is why I think it’s very important to find good worst-case bounds.

The CPFP example you give is very interesting. I’ll have to think more about the implications.

I think the minimum unit of time in which this kind of algorithm can be useful is the time it takes to calculate a min-cut. But it is always easy to get a linearization of any cluster by simply sorting it topologically. From there, with every min-cut we calculate, we can find an optimal chunk to put at the beginning of a linearization. It will be really interesting to find out how fast we can calculate these min-cuts in practice. As I said, in theory this can be done linearly in the number of edges, but that’s not how it will be done in practice. Also in theory, it doesn’t help for our graphs to be DAGs, because any flow network can be transformed into an equivalent DAG in linear time, but in practice it might be very fast. Unless an attacker can construct a bad graph … so designing how to deploy these algorithms in a secure way might be another challenge.
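To be concrete about that fallback: any topological order is a valid linearization, and Kahn’s algorithm produces one in O(n + m). A minimal sketch with an illustrative data layout:

```cpp
#include <cstdio>
#include <queue>
#include <vector>

// Sketch of the cheap fallback mentioned above: any topological order of the
// cluster is a valid (if possibly poor) linearization.
int main()
{
    // parents[i] = list of direct parents of transaction i.
    std::vector<std::vector<int>> parents{{}, {0}, {0}, {1, 2}};
    const int n = (int)parents.size();

    // Build child lists and count unsatisfied parents per transaction.
    std::vector<std::vector<int>> children(n);
    std::vector<int> missing(n);
    for (int i = 0; i < n; ++i) {
        missing[i] = (int)parents[i].size();
        for (int p : parents[i]) children[p].push_back(i);
    }

    // Repeatedly emit a transaction whose parents have all been emitted.
    std::queue<int> ready;
    for (int i = 0; i < n; ++i) if (missing[i] == 0) ready.push(i);
    std::vector<int> linearization;
    while (!ready.empty()) {
        int tx = ready.front(); ready.pop();
        linearization.push_back(tx);
        for (int c : children[tx]) if (--missing[c] == 0) ready.push(c);
    }

    for (int tx : linearization) std::printf("%d ", tx); // e.g. "0 1 2 3"
    std::printf("\n");
}
```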

In a way, this helps, because like in the RBF case mentioned above, it gives us feerates that can be a good starting point. Unfortunately, the nice ascending property from above doesn’t apply to just any chunking we get from anywhere. So if we take the union or intersection of, say, an ancestor set and an optimal min-cut, we will get a closure, but it could have arbitrarily bad feerate. Your LIMO algorithm might help here, but I haven’t fully understood that yet.


Just in case I confused you here: I was trying to explain why this ascending property you mention later wasn’t obvious to me earlier. It doesn’t matter, it makes sense to me now.

My question was whether I had understood things correctly; given your responses, I believe that was indeed the case.

I see now. That may work, but it’s probably premature optimization. Just evaluating the new diagram at all the old diagram breakpoints may be just as hard as just computing the new diagram?

That may be somewhat unfortunate. I was hoping it would be possible to spend some time finding some cut (not necessarily the minimal one), and then later revisit and find a better cut. Because even just finding a single min-cut is O(n^3) as I understand it (if m = O(n^2)), which is probably too much.

Right, but doing this in a “time-restricted” setting means you might end up with a linearization where the beginning is optimal, but the end is terrible, which might be undesirable.

You can think of LIMO as repeatedly doing:

  • Given a cluster with an existing linearization L for it
    • Loop:
      • Find a topological subset S with good feerate to move to the front.
      • Compute L’, which is L with S moved to the front (leaving the order of transactions within S, and outside of S, unchanged from L); see the sketch after this list.
      • Compute L’’ as a merge of L and L’, which has a diagram that’s at least as good as the best of both everywhere.
      • Output the first chunk of L’’, and continue with L as what remains of L’’.
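As a small illustration of the “move S to the front” step above (only that step; the subset finding and the diagram merge are the interesting parts and are not shown here), with made-up data:

```cpp
#include <cstdio>
#include <vector>

// Given a linearization L and a topological subset S, produce L' that lists the
// members of S first, keeping the relative order inside and outside S unchanged.
int main()
{
    std::vector<int> L{3, 0, 2, 1, 4};                        // existing linearization (tx indices)
    std::vector<bool> in_S{true, false, true, false, false};  // S = {0, 2}

    std::vector<int> L_prime;
    for (int tx : L) if (in_S[tx]) L_prime.push_back(tx);   // members of S, in L's order
    for (int tx : L) if (!in_S[tx]) L_prime.push_back(tx);  // the rest, in L's order

    for (int tx : L_prime) std::printf("%d ", tx); // prints: 0 2 3 1 4
    std::printf("\n");
}
```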

I suspect this can be done in combination with the GGT approach, but the more interesting combination is if it can feed back too, i.e. the S set finding algorithm can be bootstrapped by using the first chunk of L at that point. This may be harder.

I’m currently polishing up an implementation of the spanning-forest algorithm, so that I don’t forget the ideas I got while creating the writeup, and also to have a benchmark to target for future things (I think that comparing with the current exponential algorithm will be hard, as the style of graphs which are hard may differ wildly between exponential and GGT, but the difference between spanning-forest and GGT is probably smaller). After that, I plan to dig deeper into minimal-cut and GGT.

Woah, 🤯.

The GGT / min-cut based algorithms are effectively finding a subset x which maximizes

\operatorname{fee}_x - \lambda \operatorname{size}_x

for a given existing solution S with feerate \lambda. In other words, it maximizes

\operatorname{fee}_x - \frac{\operatorname{fee}_S}{\operatorname{size}_S} \operatorname{size}_x

which, given that \operatorname{size}_S is a constant in this context, is the same as maximizing

\operatorname{fee}_x\operatorname{size}_S - \operatorname{fee}_S\operatorname{size}_x

which is what I’ve called q(x, S) in the spanning-forest writeup, the quantity being maximized when performing chunk splits…


That, I think, is not accurate to say. For a fixed \lambda, finding a subset that maximizes \mathrm{fee}_x - \lambda \mathrm{size}_x simply transforms to the “maximum weight closure problem”, which can be solved with any min-cut finding algorithm. The “max. weight closure problem” is a subproblem of your original “maximum feerate closure problem”. The novelty of GGT is that you don’t need to do bisection to solve the “maximum feerate closure” problem, and the computational complexity of the feerate problem stays the same as the computational complexity of a single “max. weight closure”.
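For readers following along, this is the standard reduction for the fixed-\lambda subproblem. Only the construction of the flow network is sketched here (with illustrative data); any max-flow/min-cut solver can then be run on it, and the source side of a minimum s-t cut is a maximum-weight closure:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Closure-to-min-cut construction for a fixed lambda. Each transaction i gets
// weight w_i = fee_i - lambda * size_i. Positive-weight transactions are
// attached to the source, negative-weight ones to the sink, and each dependency
// becomes an uncuttable (infinite-capacity) arc from child to parent. The
// source side of a minimum s-t cut is then a topologically valid subset
// maximizing fee_x - lambda * size_x.
struct Tx { int64_t fee; int64_t size; std::vector<int> parents; };
struct Arc { int from, to; double cap; };

int main()
{
    std::vector<Tx> txs{{1000, 400, {}}, {5000, 200, {0}}, {500, 300, {}}};
    const double lambda = 10.0; // candidate chunk feerate
    const double INF = 1e18;    // stands in for "unlimited capacity"
    const int n = (int)txs.size(), s = n, t = n + 1;

    std::vector<Arc> arcs;
    for (int i = 0; i < n; ++i) {
        const double w = txs[i].fee - lambda * txs[i].size;
        if (w > 0) arcs.push_back({s, i, w});   // source -> positive-weight tx
        if (w < 0) arcs.push_back({i, t, -w});  // negative-weight tx -> sink
        for (int p : txs[i].parents) arcs.push_back({i, p, INF}); // child -> parent
    }
    for (const Arc& a : arcs) std::printf("%d -> %d cap %.0f\n", a.from, a.to, a.cap);
}
```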


Forget that I mentioned GGT here; I understand. I was using it as a shorthand for “min-cut based approaches”, which is indeed not the novelty of GGT at all; that came earlier.


There are several options for implementing preflow-push (the Goldberg-Tarjan maxflow/min-cut algorithm). The fastest theoretical bound, O(nm \log(n^2/m)), is obtained using a dynamic tree data structure; alternatively, one can simply use a queue to process active nodes in FIFO order, leading to a provable O(n^3) complexity. The FIFO preflow-push is actually pretty simple to implement. See for example:
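For concreteness, here is a rough from-scratch sketch of a FIFO preflow-push max-flow, written for readability rather than speed and returning only the flow value (the min-cut side would be read off from residual reachability). This is just an illustration of the structure, not the referenced implementation:

```cpp
#include <algorithm>
#include <cstdio>
#include <queue>
#include <vector>

// FIFO preflow-push (Goldberg-Tarjan): active nodes (positive excess, not s or t)
// are processed in FIFO order, which gives the O(n^3) bound.
struct Edge { int to; double cap; int rev; };

struct MaxFlow {
    int n;
    std::vector<std::vector<Edge>> g;
    std::vector<double> excess;
    std::vector<int> height;

    explicit MaxFlow(int n) : n(n), g(n), excess(n), height(n) {}

    void AddEdge(int u, int v, double cap) {
        g[u].push_back({v, cap, (int)g[v].size()});
        g[v].push_back({u, 0.0, (int)g[u].size() - 1});
    }

    double Solve(int s, int t) {
        std::queue<int> active;
        // Push as much as possible from u along edge e; activate the receiving node.
        auto push = [&](int u, Edge& e) {
            double amount = std::min(excess[u], e.cap);
            e.cap -= amount;
            g[e.to][e.rev].cap += amount;
            excess[u] -= amount;
            if (amount > 0 && excess[e.to] == 0 && e.to != s && e.to != t) active.push(e.to);
            excess[e.to] += amount;
        };
        // Initialization: the source sits at height n and saturates all its outgoing arcs.
        height[s] = n;
        for (Edge& e : g[s]) { excess[s] += e.cap; push(s, e); }
        // Main loop: discharge active nodes in FIFO order.
        while (!active.empty()) {
            int u = active.front();
            active.pop();
            while (excess[u] > 0) {
                bool pushed = false;
                for (Edge& e : g[u]) {
                    if (e.cap > 0 && height[u] == height[e.to] + 1) {
                        push(u, e);
                        pushed = true;
                        if (excess[u] == 0) break;
                    }
                }
                if (!pushed) {
                    // Relabel: lift u just above its lowest residual neighbour.
                    int h = 2 * n;
                    for (const Edge& e : g[u])
                        if (e.cap > 0) h = std::min(h, height[e.to] + 1);
                    height[u] = h;
                }
            }
        }
        return excess[t]; // remaining excess has either reached the sink or returned to the source
    }
};

int main()
{
    // Tiny example network: 0 = source, 3 = sink.
    MaxFlow mf(4);
    mf.AddEdge(0, 1, 10);
    mf.AddEdge(0, 2, 5);
    mf.AddEdge(1, 2, 15);
    mf.AddEdge(1, 3, 5);
    mf.AddEdge(2, 3, 10);
    std::printf("max flow = %.0f\n", mf.Solve(0, 3)); // prints 15
}
```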


The other takeaway point from GGT is that the same preflow-push algorithm can be extended to parametric problems (like the maximum-feerate-closure problem that we are interested in) and still keep the same runtime complexity of O(n^3) for our FIFO-preflow-push case.

There is another interesting point to remember, which is the fact that we are interested in the min-cut set and not in the maxflow itself. So we may stop the algorithm execution earlier, and possibly save half of the running time by doing so.


Or, from what I understand so far, one can use maximum-label as a selection strategy (O(n^2 \sqrt{m})), which also means a provable O(n^3) worst-case complexity, but only O(n^{2.5}) when the number of dependencies scales linearly with the number of transactions (which is probably somewhat true in practice, as dependencies cost input vsize).

That’s good to know. My guess would be that the dynamic trees version (which would also be O(n^3) in the worst case, but O(n^2 \log n) for a linear number of dependencies) might not be worth it for us at the small problem sizes.

That’s surprising to me, because if you computed the min-cut, in our setting, you know the closure, whose total fee and size you can determine in O(n) time (just sum them up), and the max flow of the problem you just solved will be \operatorname{fee} - \lambda \operatorname{size}, with \lambda the parameter value you just solved for. So it seems to me that not computing the actual max flow can at best save you O(n) work.


I also had the following realization (see diagram posted below). Solving for the closure with maximal \operatorname{fee} - \lambda \operatorname{size} can be visualized on the feerate diagram (diagram with cumulative size on X axis, cumulative fee on Y axis, dots for all optimal chunk boundaries, straight lines between them).

In the initial step, where you set \lambda to the feerate of the entire cluster, draw a line L from the origin to the end point of the diagram (whose slope will be \lambda). The min-cut found will be the point on the diagram that lies furthest above L, in vertical distance. And from this it is clear that this point must be on a chunk boundary (if it weren’t, and were on or below a line segment of the diagram, then depending on the slope of L, one of the two segment end points would lie higher above L).
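To spell out why vertical distance is the right measure here: a topological subset x sits at the point (\operatorname{size}_x, \operatorname{fee}_x), and the line L through the origin with slope \lambda has height \lambda \operatorname{size}_x at that x-coordinate, so the vertical distance of the point above L is

\operatorname{fee}_x - \lambda \operatorname{size}_x

which is exactly the closure weight that the min-cut maximizes.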

In a second iteration (which in GGT does not require starting over from scratch), one does the same thing, but with \lambda now set to the feerate of the previously-found solution, and L a line from the origin to the point found there. The next min-cut will now find us the point that lies furthest above that L, etc.

From this it is clear that there can be at most N steps, because there can be at most N chunks, and each min-cut step is sort of a bisection search, cutting off one or more bad chunks of the solution.

It also means that the breakpoints GGT finds each correspond with chunk boundaries of the diagram already, but not all of them. To find all of them, one needs to rerun the search in the “other half” of the bisections cut off as well?


The scalar value of the maximum flow (the sum of all incoming flows to the sink) is one thing; the flow function that achieves that value (a real-valued function over the set of edges) is another. In the context of the Goldberg-Tarjan algorithm, when you arrive at a state in which every node with positive excess is unable to reach the sink, you can already construct the min-cut and the maxflow value in O(m). But this may be an incomplete state as far as the flow function is concerned, because the balance conditions are not guaranteed to be satisfied yet: you are still at a preflow state, and every excess flow has to go back to the source before a proper flow function is obtained.


To demonstrate what I mean above:

  • D is the feerate diagram of the optimal linearization, so each black dot on it corresponds to some topologically-valid subset, and D is the convex hull through them. The vertices of the convex hull (and the transaction sets they correspond to) are what we are trying to find.
  • L_1 is the line whose slope is the \lambda_1 for the first min-cut iteration, corresponding with the feerate of the entire cluster.
  • Then we run the min-cut algorithm, to find the closure C_1 whose weight Q_1 = \operatorname{fee}_{C_1} - \lambda_1 \operatorname{size}_{C_1} is maximal.
  • Then we set \lambda_2 to the feerate of the previous solution, i.e., \lambda_2 = \operatorname{fee}_{C_1} /\operatorname{size}_{C_1}.
  • L_2 is the line whose slope is \lambda_2.
  • Then we run the min-cut algorithm again, to find the closure C_2 whose weight Q_2 = \operatorname{fee}_{C_2} - \lambda_2 \operatorname{size}_{C_2} is maximal.

In every iteration, one or more chunks are removed, but the solutions each correspond to a prefix of chunks of the optimal linearization. So by the time we find the first chunk C_2, we have found the next chunk already too (C_1 \setminus C_2), but not the one after that because the first iteration skipped that one already. I’m guessing that’s where the contraction comes in: replace the source s with C_1 \cup \{s\}, and start over, and somehow that is still possible without fully resetting the algorithm state?


Wow, this is a brilliant visualization of what’s happening here.

This sounds as if LIMO is a good solution for improving an existing linearization in combination with a min-cut approach. The existing best chunk gives a great starting \lambda, and from there, a min-cut will find a better chunk if there is one, which yields a better linearization. Every min-cut calculation will then either improve a chunk or show that it cannot be improved.

Yes, and that is why the GGT algorithm for finding all breakpoints (3.3) is so involved. In order to get the same runtime bound, it needs to contract the s- or t- components. Because the sizes of the s- and t-sides might be badly distributed, it needs to run the flow calculation on the reverse graph from t to s in parallel, and only use the result of the calculation that finishes first. This is obviously wasteful and complicated, and in practice, a simpler algorithm is always better. The PBST-algorithm avoids this waste and seems even faster in reality (and just as good in the worst case), and it finds all the breakpoints in ascending order (descending \lambda for our case), which might be a desirable property. However, it is also pretty involved. But, we already have code for it.

That does sound at least conceptually simpler.

You mean PBFS from https://arxiv.org/pdf/2410.15920, right? Its complexity bound seems somewhat worse than GGT.

Summarizing the different algorithms I see, for sparse (m = \mathcal{O}(n)) and dense (m = \mathcal{O}(n^2)) graphs; FP is just the idea of solving a new min-cut from scratch for each breakpoint:

| Algorithm | Complexity | Sparse | Dense |
| --- | --- | --- | --- |
| FP (generic) | \mathcal{O}(n^3 m) | \mathcal{O}(n^4) | \mathcal{O}(n^5) |
| FP (FIFO) | \mathcal{O}(n^4) | \mathcal{O}(n^4) | \mathcal{O}(n^4) |
| FP (max label) | \mathcal{O}(n^3 \sqrt{m}) | \mathcal{O}(n^{3.5}) | \mathcal{O}(n^4) |
| FP (dynamic trees) | \mathcal{O}(n^2 m \log(n^2/m)) | \mathcal{O}(n^3 \log n) | \mathcal{O}(n^4) |
| PBFS | \mathcal{O}(n^2 m) | \mathcal{O}(n^3) | \mathcal{O}(n^4) |
| GGT (generic) | \mathcal{O}(n^2 m) | \mathcal{O}(n^3) | \mathcal{O}(n^4) |
| GGT (FIFO) | \mathcal{O}(n^3) | \mathcal{O}(n^3) | \mathcal{O}(n^3) |
| GGT (max label) | \mathcal{O}(n^2 \sqrt{m}) | \mathcal{O}(n^{2.5}) | \mathcal{O}(n^3) |
| GGT (dynamic trees) | \mathcal{O}(nm \log(n^2/m)) | \mathcal{O}(n^2 \log n) | \mathcal{O}(n^3) |

We’ll need to experiment with specialized implementations though, because seeing papers talk about problems with millions of nodes means that what they consider “practical problems” may be very different from what we have in mind (we’ll probably prefer simpler algorithms over ones with better complexity), but also that many data-structure optimizations may not apply in their settings. However, we also care more about worst-case performance than they do, presumably.

Yes, that’s the one I mean. In fact, it seems conceptually more complex to me. And you are right, the asymptotic worst-case complexity is potentially worse. But in experiments, FP (what they call DS in the PBFS paper) seems to always beat GGT (the other experimental paper), and PBFS seems to always beat FP/DS. And this is already on instances that are probably way bigger than what we need: Their smallest example has about 5k nodes and is done in 11 ms. Assuming cubic runtime, we might expect 500 nodes to run in less than 50 \mu s.
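Spelling that estimate out: scaling cubically from 5000 nodes down to 500 gives

11\text{ ms} \times \left(\frac{500}{5000}\right)^3 = 11\text{ ms} \times 10^{-3} = 11\ \mu\text{s}

which is comfortably below 50 \mu s.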

So it’s all about the constants, but they are also very dependent on implementation details. Your running-time table is a great starting point. But it might turn out that for the instances we are interested in, the story looks exactly the other way round from what the asymptotics suggest. In fact, our instances are small enough that simple DS/FP with a nicely tuned min-cut algorithm might be the best solution.

I think that would be an incorrect assumption; the benchmarks in the paper seem to exhibit behavior very far from the proven complexity bound. I ran a regression on its data set, and obtained runtimes that are approximated by t = 0.00098 \times n^{0.091} \times m^{0.77}.


That’s an interesting idea, but the result doesn’t appear very convincing. The algorithm should at least look at every node and edge once.

But of course my estimation may be far off, there are probably large additive constants in effect.

Another advantage that GGT (or DS) may have over PBFS is that the chunk splittings that each min-cut corresponds to are in a sense optimal: each is the best improvement to the diagram possible, when considering only steps that subdivide a single chunk in two. PBFS doesn’t have this, as it finds the breakpoints in order.

If the way this gets implemented is by aborting the computation as soon as a time/work threshold is exceeded, for GGT this probably means a better/fairer partial solution than PBFS (which may have spent all its time on the first chunk, and nothing further).
