Disclosure: LND Excessive Failback Exploit

The following disclosure is copied verbatim from a blog post on morehouse.github.io, reproduced here to facilitate discussion.

LND 0.17.5 and below contain a bug in the on-chain resolution logic that can be exploited to steal funds. For the attack to be practical the attacker must be able to force a restart of the victim node, perhaps via an unpatched DoS vector. Update to at least LND 0.18.0 to protect your node.

Background

Whenever a new payment is routed through a lightning channel, or whenever an existing payment is settled on the channel, the parties in that channel need to update their commitment transactions to match the new set of active HTLCs. During the course of these regular commitment updates, there is always a brief moment where one of the parties holds two valid commitment transactions. Normally that party immediately revokes the older commitment transaction after it receives a signature for the new one, bringing their number of valid commitment transactions back down to one. But for that brief moment, the other party in the channel must be able to handle the case where either of the valid commitments confirms on chain.
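As a rough sketch of the state a node tracks for its counterparty during this window (illustrative types only, not LND’s actual data structures):

type CounterpartyState struct {
  // currentCommit is the counterparty's latest fully signed commitment.
  currentCommit Commitment

  // pendingCommit is non-nil only between our commitment_signed and the
  // counterparty's revoke_and_ack, when they briefly hold two valid
  // commitments. Either one could confirm on chain during that window.
  pendingCommit *Commitment
}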

As part of this handling, nodes need to detect when any currently outstanding HTLCs are missing from the confirmed commitment transaction so that those HTLCs can be failed backward on the upstream channel.

The Excessive Failback Bug

Prior to v0.18.0, LND’s logic to detect and fail back missing HTLCs works like this:

func failBackMissingHtlcs(confirmedCommit Commitment) {
  currentCommit, pendingCommit := getValidCounterpartyCommitments()

  var danglingHtlcs HtlcSet
  if confirmedCommit == pendingCommit {
    danglingHtlcs = currentCommit.Htlcs()
  } else {
    danglingHtlcs = pendingCommit.Htlcs()
  }

  confirmedHtlcs := confirmedCommit.Htlcs()
  missingHtlcs := danglingHtlcs.SetDifference(confirmedHtlcs)
  for _, htlc := range missingHtlcs {
    failBackHtlc(htlc)
  }
}

LND compares the HTLCs present on the confirmed commitment transaction against the HTLCs present on the counterparty’s other valid commitment (if there is one) and fails back any HTLCs that are missing from the confirmed commitment. This logic is mostly correct, but it does the wrong thing in one particular scenario:

  1. LND forwards an HTLC H to the counterparty, signing commitment C0 with H added as an output. The previous commitment is revoked.
  2. The counterparty claims H by revealing the preimage to LND.
  3. LND forwards the preimage upstream to start the process of claiming the incoming HTLC.
  4. LND signs a new counterparty commitment C1 with H removed and its value added to the counterparty’s balance.
  5. The counterparty refuses to revoke C0.
  6. The counterparty broadcasts and confirms C1.

In this case, LND compares the confirmed commitment C1 against the other valid commitment C0 and determines that H is missing from the confirmed commitment. As a result, LND incorrectly determines that H needs to be failed back upstream, and executes the following logic:

func failBackHtlc(htlc Htlc) {
  markFailedInDatabase(htlc)
  
  incomingHtlc, ok := incomingHtlcMap[htlc]
  if !ok {
    log("Incoming HTLC has already been resolved")
    return
  }
  failHtlc(incomingHtlc)
  delete(incomingHtlcMap, htlc)
}

In this case, the preimage for the incoming HTLC was already sent upstream (step 3), so the corresponding entry in incomingHtlcMap has already been removed. Thus LND catches the “double resolution” and returns from failBackHtlc without sending the incorrect failure message upstream. Unfortunately, LND only catches the double resolution after H is marked as failed in the database. As a result, when LND next restarts it will reconstruct its state from the database and determine that H still needs to be failed back. If the incoming HTLC hasn’t been fully resolved with the upstream node, the reconstructed incomingHtlcMap will have an entry for H this time, and LND will incorrectly send a failure message upstream.
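To make the restart behavior concrete, here is a rough sketch of the replay path, using hypothetical function names in the style of the pseudocode above rather than LND’s actual API:

func replayFailbacksOnRestart() {
  // After a restart, in-memory state is rebuilt from the database.
  incomingHtlcMap = loadIncomingHtlcMapFromDatabase()

  // H was marked failed before the double resolution was caught, so the
  // database still says it needs to be failed back.
  for _, htlc := range loadHtlcsMarkedFailed() {
    incomingHtlc, ok := incomingHtlcMap[htlc]
    if !ok {
      continue  // Incoming HTLC already fully resolved upstream.
    }
    // If the incoming HTLC is not yet fully resolved, this sends an
    // incorrect failure upstream even though the preimage was already
    // forwarded in step 3.
    failHtlc(incomingHtlc)
  }
}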

At that point, the downstream node will have claimed H via preimage while the upstream node will have had the HTLC refunded to them, causing LND to lose the full value of H.

Stealing HTLCs

Consider the following topology, where B is the victim and M0 and M1 are controlled by the attacker.

M0 -- B -- M1

The attacker can steal funds as follows:

  1. M0 routes a large HTLC along the path M0 -> B -> M1.
  2. M0 goes offline.
  3. M1 claims the HTLC from B by revealing the preimage, receives a new commitment signature from B, and then refuses to revoke the previous commitment.
  4. B attempts to claim the upstream HTLC from M0 but can’t because M0 is offline.
  5. M1 force closes the B-M1 channel using their new commitment, thus triggering the excessive failback bug.
  6. The attacker crashes B using an unpatched DoS vector.
  7. M0 comes back online.
  8. B restarts, loads HTLC resolution data from the database, and incorrectly fails the HTLC with M0.

At this point, the attacker has succeeded in stealing the HTLC from B. M0 got the HTLC refunded, while M1 got the value of the HTLC added to their balance on the confirmed commitment.

The Fix

The excessive failback bug was fixed by a small change to prevent failback of HTLCs for which the preimage is already known. The updated logic now explicitly checks for preimage availability before failing back each HTLC:

func failBackMissingHtlcs(confirmedCommit Commitment) {
  currentCommit, pendingCommit := getValidCounterpartyCommitments()

  var danglingHtlcs HtlcSet
  if confirmedCommit == pendingCommit {
    danglingHtlcs = currentCommit.Htlcs()
  } else {
    danglingHtlcs = pendingCommit.Htlcs()
  }

  confirmedHtlcs := confirmedCommit.Htlcs()
  missingHtlcs := danglingHtlcs.SetDifference(confirmedHtlcs)
  for _, htlc := range missingHtlcs {
    if preimageIsKnown(htlc.PaymentHash()) {
      continue  // Don't fail back HTLCs we can claim.
    }
    failBackHtlc(htlc)
  }
}

The preimageIsKnown check prevents failBackHtlc from being called when the preimage is known, so such HTLCs are never failed backward or marked as failed in the database. On restart, the incorrect failback behavior no longer occurs.

The patch was hidden in a massive rewrite of LND’s sweeper system and was released in LND 0.18.0.

Discovery

This vulnerability was discovered during an audit of LND’s contractcourt package, which handles on-chain resolution of force closures.

Timeline

  • 2024-03-20: Vulnerability reported to the LND security mailing list.
  • 2024-04-19: Fix merged.
  • 2024-05-30: LND 0.18.0 released containing the fix.
  • 2025-02-17: Gijs gives the OK to disclose publicly in March.
  • 2025-03-04: Public disclosure.

Prevention

It appears all other lightning implementations have independently discovered and handled the corner case that LND mishandled:

  • CLN added a preimage check to the failback logic in 2018.
  • eclair introduced failback logic in 2023 that filtered upstream HTLCs by preimage availability.
  • LDK added a preimage check to the failback logic in 2023.

Yet the BOLT specification has not been updated to describe this corner case. In fact, by a strict interpretation the specification actually requires the incorrect behavior that LND implemented:

## HTLC Output Handling: Remote Commitment, Local Offers

### Requirements

A local node:
  - for any committed HTLC that does NOT have an output in this commitment transaction:
    - once the commitment transaction has reached reasonable depth:
      - MUST fail the corresponding incoming HTLC (if any).

It is quite unfortunate that all implementations had to independently discover and correct this bug. If any single implementation had contributed a small patch to the specification after discovering the issue, it would have at least sparked some discussion about whether the other implementations had considered this corner case. And if CLN had recognized that the specification needed updating back in 2018, there’s a good chance all other implementations would have handled this case correctly from the start.

Takeaways

  • Keeping specifications up-to-date can improve security for all implementations.
  • Update to at least LND 0.18.0 to protect your funds.

While the on-chain logic of an LN state machine is notoriously hard to implement correctly and BOLT5 is infamously known not to be exhaustive, there is one remark to be made on the proposed specification patch.

If I’m understanding correctly, the new requirement would apply to the situation where a local node has offered an HTLC output, with said output included in a remote commitment transaction. If this offered HTLC is not included in the latest commitment transaction, it is proposed to fail the corresponding received HTLC on the incoming pair of commitment transactions.

If my memory is correct, this mechanism has already been discussed in the past with a few LN maintainers and devs while considering some other classes of attacks. However, implementing it means that an LN node may now force-close the incoming channel even when there is an economic disproportion between the HTLC at risk and the on-chain fee cost of force-closing the channel, and there is no guarantee the backward peer will be interactive. E.g. the HTLC might be worth 1000 sats while the force-close absolute fee cost might be 10_000 sats. As far as I remember, there was no convergence among the LN maintainers of each implementation at the time on what the best behavior should be in this situation, or on whether the fallback should be an implementation policy or a node settings detail.
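For illustration (a made-up helper, not from any implementation), the economic comparison in question is roughly:

func worthForceClosing(htlcValueSat, commitWeightWU, feerateSatPerVByte int64) bool {
  // Approximate absolute fee to confirm the commitment transaction.
  forceCloseCostSat := (commitWeightWU / 4) * feerateSatPerVByte

  // E.g. a 1000 sat HTLC against a 10_000 sat force-close cost fails this
  // check, so force-closing just for that HTLC is uneconomical.
  return htlcValueSat > forceCloseCostSat
}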

That said, I don’t disagree that it would still be worthwhile to have a description in BOLT5 of what correct behavior can be.

Oof. This really needs to be fixed.

My proposed spec change is to require that nodes should not fail back HTLCs for which a preimage is known.

That is an orthogonal problem. The decision of whether to claim an HTLC on chain or not (because it would be uneconomical) is independent of the decision to fail back off-chain.

So currently, if I’m reading correctly, this is a MAY and not a SHOULD NOT:

    - MAY fail the corresponding incoming HTLC sooner.

This could be more precise, as the incoming HTLC might not be included in the commitment transaction. If it’s not included, failing it backward is not a problem. If it is included, this is a different problem for a routing LN node, as it all depends on the mempool feerates at the time of failing backward. With anchor outputs, the fee cost to confirm the commitment_tx on the incoming channel could be higher than the received HTLC’s amount_msat.

That is an orthogonal problem. The decision of whether to claim an HTLC on chain or not (because it would be uneconomical) is independent of the decision to fail back off-chain.

Sure, however if the decision to fail back is not materialized on-chain (see comments above), this only works if the LN channel counterparty is cooperative, and I think the implicit assumption of BOLT5 (“Recommendations for On-chain Transaction Handling”) is that an LN node cannot rely on interactivity with the counterparty when deciding whether to go on-chain or not.

I’m not sure I understand or agree that this is actually a problem in BOLT5. But if you think it is, please open a PR to fix it so we can discuss there.

The problem is with the “MUST fulfill the corresponding incoming HTLC (if any)” requirement.

Per BOLT3 Appendix A, the overall weight is given by the following equation: 900 + 172 * num-htlc-outputs + 224 WU. If you assume a max_accepted_htlcs of 50 on our side (i.e. the default value for LDK’s our_max_accepted_htlcs) and the protocol maximum of 483 on the counterparty’s side, for 533 HTLC outputs in total, the maximum weight of a commitment tx is 900 + 172 * 533 + 224 = 92800 WU. To simplify the demonstration, assume 1 sat / vbyte = 1 sat / 4 weight units.

Let’s assume the commitment transaction is signed at 1 sat / vbyte under the update_fee mechanism, and the fee burden is on the counterparty as the channel opener.

Any commitment transaction feerate beyond that 1 sat / vbyte will be paid by the routing LN node via an anchor output.

Let’s assume the “missing” HTLC’s amount_msat is 50_000 satoshis.

At 5 sat / vbyte, the additional fee for the commitment transaction is (92800 / 4) * 4 = 92800 satoshis, paid by the routing LN node. If it goes on-chain for this single HTLC, there is a loss of 42_800 satoshis.

At 10 sat / vbyte, the additional fee is (92800 / 4) * 9 = 208800 satoshis, for a loss of 158_800 satoshis.

At 15 sat / vbyte, the additional fee is (92800 / 4) * 14 = 324800 satoshis, for a loss of 274_800 satoshis.
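The arithmetic above can be reproduced with a small sketch (illustrative only; it assumes the 1 sat / vbyte base feerate is borne by the counterparty as the opener):

func excessFeeLoss(commitWeightWU, feerateSatPerVB, baseSatPerVB, htlcValueSat int64) int64 {
  // Fee the routing node pays on top of what the opener already covers.
  extraFeeSat := (commitWeightWU / 4) * (feerateSatPerVB - baseSatPerVB)
  return extraFeeSat - htlcValueSat
}

// excessFeeLoss(92800, 5, 1, 50_000)  == 42_800
// excessFeeLoss(92800, 10, 1, 50_000) == 158_800
// excessFeeLoss(92800, 15, 1, 50_000) == 274_800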

Before this change to the BOLT5 specification, an LN node would have gone on-chain on the downstream link when the commitment transaction on the upstream link had reached sufficient depth (e.g. for LDK the value is ANTI_REORG_DELAY=6). For an upstream counterparty, confirming a commitment transaction on-chain has an absolute fee cost (either effective, if they have no hashrate capabilities, or potential, if they do).

After this change to the BOLT5 specification, an LN node can go on-chain regardless of the status of the counterparty’s commitment transaction on the upstream link and regardless of the absolute fee cost paid by the LN node.

If implemented, this opens the door to the following kind of exploitation: with a topology Mallory ↔ Alice ↔ Mallet, Mallory routes the maximum number of HTLCs (482) to Mallet through Alice to inflate the commitment transaction.

Mallory routes a single HTLC worth 10_000 sats through Alice to Mallet. Mallet releases the preimage for the HTLC and Alice and Mallet do a revoke-and-commit dance, however Mallet withholds the latest revoke_and_ack, so from Alice’s PoV, Mallet has two valid commitment transactions, one of them with the HTLC settled via the preimage. Alice has a single valid commitment transaction, with no HTLC output for the 10k sats HTLC.

Applying the “if the payment preimage is known: MUST fulfill the corresponding incoming HTLC” requirement, if Mallet does not cooperate to update the downstream link, Alice should go on-chain to claim the 10k sats with the preimage. If Alice’s commitment transaction has a weight of 84200, at 10 sat / vbyte, Alice’s loss is 240600 satoshis.

This on-chain fee can be sniped by a miner collaborating with Mallet or Mallory. The same behavior can also be used by Mallet or Mallory purely as fee griefing against LSPs they don’t like.

Now enters the devil: what if Alice does not go on-chain with her commitment transaction on the downstream link? As Mallet still has a valid commitment transaction on the upstream link, Mallory can go on-chain with 482 HTLC-timeouts on the downstream link and Mallet can go on-chain with 482 HTLC-successes on the upstream link, double-spending Alice for 482 HTLCs.

The timelocks for the single 10k sats HTLC and the remaining 482 HTLCs can be adjusted by Mallet and Mallory accordingly, e.g. the 10k sats HTLC expires at 144 and the 482 low-value HTLCs at 1008. At T+144, either Alice goes on-chain on both the downstream and upstream links, or Mallet and Mallory start to gain adversarial optionality.

To be checked, and correct me if I’m wrong, but I think the main subtlety is in the revoke_and_ack dance between Alice and Mallet:

← update_fulfill_htlc

← commitment_signed

→ revoke_and_ack

→ commitment_signed

* Mallet withholds the final revoke_and_ack

This is hard to follow because it looks like you’re confusing downstream and upstream in several of your paragraphs… I think I got the gist of it, but I don’t see how it applies to the change you’re referring to.

The BOLTs change only says that the preimage must be relayed upstream as soon as it’s obtained downstream. At that point we only specify that nodes should correctly extract preimages from downstream and send update_fulfill_htlc upstream. This doesn’t change the requirements of whether the node should force-close upstream (if its update_fulfill_htlc is not acked) or not.

1 Like