Disclosure: LND Excessive Failback Exploit

The following disclosure is copied verbatim from a blog post on morehouse.github.io, reproduced here to facilitate discussion.

LND 0.17.5 and below contain a bug in the on-chain resolution logic that can be exploited to steal funds. For the attack to be practical the attacker must be able to force a restart of the victim node, perhaps via an unpatched DoS vector. Update to at least LND 0.18.0 to protect your node.

Background

Whenever a new payment is routed through a lightning channel, or whenever an existing payment is settled on the channel, the parties in that channel need to update their commitment transactions to match the new set of active HTLCs. During the course of these regular commitment updates, there is always a brief moment where one of the parties holds two valid commitment transactions. Normally that party immediately revokes the older commitment transaction after it receives a signature for the new one, bringing their number of valid commitment transactions back down to one. But for that brief moment, the other party in the channel must be able to handle the case where either of the valid commitments confirms on chain.

As part of this handling, nodes need to detect when any currently outstanding HTLCs are missing from the confirmed commitment transaction so that those HTLCs can be failed backward on the upstream channel.

The Excessive Failback Bug

Prior to v0.18.0, LND’s logic to detect and fail back missing HTLCs works like this:

func failBackMissingHtlcs(confirmedCommit Commitment) {
  currentCommit, pendingCommit := getValidCounterpartyCommitments()

  var danglingHtlcs HtlcSet
  if confirmedCommit == pendingCommit {
    danglingHtlcs = currentCommit.Htlcs()
  } else {
    danglingHtlcs = pendingCommit.Htlcs()
  }

  confirmedHtlcs := confirmedCommit.Htlcs()
  missingHtlcs := danglingHtlcs.SetDifference(confirmedHtlcs)
  for _, htlc := range missingHtlcs {
    failBackHtlc(htlc)
  }
}

LND compares the HTLCs present on the confirmed commitment transaction against the HTLCs present on the counterparty’s other valid commitment (if there is one) and fails back any HTLCs that are missing from the confirmed commitment. This logic is mostly correct, but it does the wrong thing in one particular scenario:

  1. LND forwards an HTLC H to the counterparty, signing commitment C0 with H added as an output. The previous commitment is revoked.
  2. The counterparty claims H by revealing the preimage to LND.
  3. LND forwards the preimage upstream to start the process of claiming the incoming HTLC.
  4. LND signs a new counterparty commitment C1 with H removed and its value added to the counterparty’s balance.
  5. The counterparty refuses to revoke C0.
  6. The counterparty broadcasts and confirms C1.

In this case, LND compares the confirmed commitment C1 against the other valid commitment C0 and determines that H is missing from the confirmed commitment. As a result, LND incorrectly determines that H needs to be failed back upstream, and executes the following logic:

func failBackHtlc(htlc Htlc) {
  markFailedInDatabase(htlc)
  
  incomingHtlc, ok := incomingHtlcMap[htlc]
  if !ok {
    log("Incoming HTLC has already been resolved")
    return
  }
  failHtlc(incomingHtlc)
  delete(incomingHtlcMap, htlc)
}

In this case, the preimage for the incoming HTLC was already sent upstream (step 3), so the corresponding entry in incomingHtlcMap has already been removed. Thus LND catches the “double resolution” and returns from failBackHtlc without sending the incorrect failure message upstream. Unfortunately, LND only catches the double resolution after H is marked as failed in the database. As a result, when LND next restarts it will reconstruct its state from the database and determine that H still needs to be failed back. If the incoming HTLC hasn’t been fully resolved with the upstream node, the reconstructed incomingHtlcMap will have an entry for H this time, and LND will incorrectly send a failure message upstream.

At that point, the downstream node will have claimed H via preimage while the upstream node will have had the HTLC refunded to them, causing LND to lose the full value of H.

Stealing HTLCs

Consider the following topology, where B is the victim and M0 and M1 are controlled by the attacker.

M0 -- B -- M1

The attacker can steal funds as follows:

  1. M0 routes a large HTLC along the path M0 -> B -> M1.
  2. M0 goes offline.
  3. M1 claims the HTLC from B by revealing the preimage, receives a new commitment signature from B, and then refuses to revoke the previous commitment.
  4. B attempts to claim the upstream HTLC from M0 but can’t because M0 is offline.
  5. M1 force closes the B-M1 channel using their new commitment, thus triggering the excessive failback bug.
  6. The attacker crashes B using an unpatched DoS vector.
  7. M0 comes back online.
  8. B restarts, loads HTLC resolution data from the database, and incorrectly fails the HTLC with M0.

At this point, the attacker has succeeded in stealing the HTLC from B. M0 got the HTLC refunded, while M1 got the value of the HTLC added to their balance on the confirmed commitment.

The Fix

The excessive failback bug was fixed by a small change to prevent failback of HTLCs for which the preimage is already known. The updated logic now explicitly checks for preimage availability before failing back each HTLC:

func failBackMissingHtlcs(confirmedCommit Commitment) {
  currentCommit, pendingCommit := getValidCounterpartyCommitments()

  var danglingHtlcs HtlcSet
  if confirmedCommit == pendingCommit {
    danglingHtlcs = currentCommit.Htlcs()
  } else {
    danglingHtlcs = pendingCommit.Htlcs()
  }

  confirmedHtlcs := confirmedCommit.Htlcs()
  missingHtlcs := danglingHtlcs.SetDifference(confirmedHtlcs)
  for _, htlc := range missingHtlcs {
    if preimageIsKnown(htlc.PaymentHash()) {
      continue  // Don't fail back HTLCs we can claim.
    }
    failBackHtlc(htlc)
  }
}

The preimageIsKnown check prevents failBackHtlc from being called when the preimage is known, so such HTLCs are never failed backward or marked as failed in the database. On restart, the incorrect failback behavior no longer occurs.

The patch was hidden in a massive rewrite of LND’s sweeper system and was released in LND 0.18.0.

Discovery

This vulnerability was discovered during an audit of LND’s contractcourt package, which handles on-chain resolution of force closures.

Timeline

  • 2024-03-20: Vulnerability reported to the LND security mailing list.
  • 2024-04-19: Fix merged.
  • 2024-05-30: LND 0.18.0 released containing the fix.
  • 2025-02-17: Gijs gives the OK to disclose publicly in March.
  • 2025-03-04: Public disclosure.

Prevention

It appears all other lightning implementations have independently discovered and handled the corner case that LND mishandled:

  • CLN added a preimage check to the failback logic in 2018.
  • eclair introduced failback logic in 2023 that filtered upstream HTLCs by preimage availability.
  • LDK added a preimage check to the failback logic in 2023.

Yet the BOLT specification has not been updated to describe this corner case. In fact, by a strict interpretation the specification actually requires the incorrect behavior that LND implemented:

## HTLC Output Handling: Remote Commitment, Local Offers

### Requirements

A local node:
  - for any committed HTLC that does NOT have an output in this commitment transaction:
    - once the commitment transaction has reached reasonable depth:
      - MUST fail the corresponding incoming HTLC (if any).

It is quite unfortunate that all implementations had to independently discover and correct this bug. If any single implementation had contributed a small patch to the specification after discovering the issue, it would have at least sparked some discussion about whether the other implementations had considered this corner case. And if CLN had recognized that the specification needed updating back in 2018, there’s a good chance all other implementations would have handled this case correctly from the start.

Takeaways

  • Keeping specifications up-to-date can improve security for all implementations.
  • Update to at least LND 0.18.0 to protect your funds.
4 Likes

While the on-chain logic of a LN state machine is notoriously hard to implement correctly and BOLT5 is infamously known to not be exhaustive, there is one remark to be made on the proposed specification patch.

If I’m understanding correctly, the new requirement would apply to the situation where a local node has offered an HTLC output, said output included in a remote commitment transaction. If this offered HTLC is not included in the latest commitment transaction, it is proposed to fail the corresponding received HTLC on the incoming pair of commitment transactions.

If my memory is correct, this new mechanism has been already discussed in the past with few LN maintainers and devs in considering some other class of attacks. However, implementing this mechanism comes with the effect that now a LN node force-closes of the incoming channel, even if there is an economic disproportion between the HTLC at risk and the on-chain fee cost to force-close the channel. There is no guarantee on the interactivity of the backward peer. E.g the HTLC might be worth 1000 sats and the force-close absolute fee cost might be 10_000 sats. As far as I remember, there was no convergence among the LN maintainers of each implementation at the time, what should be the best in this situation and if the fall-back shouldn’t be an implementation policy or a node settings detail.

That it can be still worthy to have a description in BOLT5 of what can be a correct behavior, that I don’t disagree.

Oof. This really needs to be fixed.

My proposed spec change is to require that nodes should not fail back HTLCs for which a preimage is known.

That is an orthogonal problem. The decision of whether to claim an HTLC on chain or not (because it would be uneconomical) is independent of the decision to fail back off-chain.