If the “convex hull” part of Sipa’s description of optimality seems capricious to you, perhaps it would be helpful for me to share why I found it intuitive:
Consider the mining problem without the issue of dependencies. An obvious approximation algorithm for the problem is to sort transactions by feerate, and include transactions in that order until you can’t fit the next one.
One thing to observe about the algorithm is that if the last transaction you include exactly fills the block then this greedy algorithm is optimal, not approximate. With that in mind, you can see that there is an obvious bound on the worst case approximation error: You will at most miss out on fees for the unfilled bytes times the next best feerate. The relative error is also small so long as transactions are small compared to the block (since the loss is no more than the left out transaction). Of course, in practice its less than that if you fill in the remaining space as best you can (which is what Bitcoin Core has done for a long time).
A similar kind of analogy exists for the “chunk” approximation-- the constraint to chunks causes an approximation error in the otherwise elegant definition of optimality… but only when the available space doesn’t line up with chunks. And when it doesn’t the amount of error is bounded by the maximum size of a chunk.
And since any kind of optimized linearization is going to have some less attractive parent transactions ahead of child transactions that make them worthwhile, there will always be some approximation loss for any algorithim that isn’t figuring out the order just in time. I think that an argument can also be made that it doesn’t make sense to optimize for any particular fullness location because the distribution of the remaining space is going to look like a sum of arbitrarily many randomly sized chunks mod the typical chunk size, and will be pretty uniformly distributed (though admittedly that’s a total hand-wave, but it’s my intuition).
I think it’s also important to keep in mind that the linearization is usually unimportant:
The vast majority of all clusters have only one possible linearization because there is only a single txn or the txn form a straight chain. And the linearization is irrelevant during both mining and eviction if either none of the transactions would be selected or if all of them would be selected. – which is again usually the case.