Introducing UltrafastSecp256k1: A Multi-Architecture Exploration of Secp256k1 Optimizations

Introduction

Hello everyone. I've been developing a high-throughput secp256k1 implementation called UltrafastSecp256k1. The project, open-sourced on February 11th, 2026, started as an exploration of how modern hardware features (SHA-NI, AVX2, ARM64 assembly) can be leveraged to push the limits of ECC performance across diverse platforms, from high-end x86 servers to resource-constrained IoT devices like the ESP32-S3 and RISC-V boards.

The goal is to create a highly portable, constant-time, and branchless library that is accessible through multiple language bindings (12+ languages including Rust, Go, Swift, and Dart). I am reaching out to this community for a technical audit, feedback on the cryptographic primitives, and suggestions on our constant-time implementation.

Architecture & Core Optimizations

The library is built on a “Zero-Allocation” hot-path contract, ensuring no heap overhead during critical operations. Key technical pillars include:

  • Field Representation: Point internals use a packed 52-bit-limb field representation (cf. the jac52_* routines), enabling lazy reduction with __int128 accumulators across constant-time (CT) operations.
  • Constant-Time Field Inversion: Implemented with the SafeGCD (divsteps) algorithm, with variants tuned per architecture.
  • Scalar Multiplication: Leverages the GLV endomorphism via λ-decomposition combined with interleaved double-and-add, significantly reducing the cycle count of scalar multiplication.
  • Hardware Acceleration: SHA-NI (Intel/AMD SHA Extensions) for high-speed hashing via runtime dispatch, plus AVX2 constant-time table lookups for secure scans.
  • I-Cache Efficiency: We utilize noinline on large functions like jac52_add_mixed_inplace to prevent instruction cache pollution, resulting in a ~59% reduction in I-cache misses.
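To make the SafeGCD inversion bullet concrete, here is a minimal sketch of the Bernstein–Yang divstep iteration it is based on. This is an illustration on small signed integers, not the library's multi-limb field code; the fixed iteration count mirrors the constant-time discipline (the real bound depends on the input width).

```cpp
#include <cassert>
#include <cstdint>

// One Bernstein–Yang "divstep" transition. delta is the state counter;
// f stays odd throughout; g shrinks toward 0 while gcd(f, g) is preserved.
static void divstep(int64_t& delta, int64_t& f, int64_t& g) {
    if (delta > 0 && (g & 1)) {
        // Swap roles and subtract: g - f is even here, so the shift is exact.
        int64_t t = g;
        g = (g - f) >> 1;
        f = t;
        delta = 1 - delta;
    } else {
        // Keep f; clear g's low bit (add f only when g is odd).
        g = (g + (g & 1) * f) >> 1;
        delta = 1 + delta;
    }
}

// Run a fixed number of divsteps (a fixed count is what makes the real
// implementation constant-time); |f| converges to gcd(f, g). 128 steps
// are ample for the small demo inputs used here. Requires odd f.
static int64_t safegcd(int64_t f, int64_t g) {
    assert(f & 1);
    int64_t delta = 1;
    for (int i = 0; i < 128; ++i) divstep(delta, f, g);
    return f < 0 ? -f : f;
}
```

In the library itself the same iteration additionally accumulates a transition matrix, which is what turns the gcd computation into a modular inverse.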

Platform-Specific Implementation & Benchmarks

We have focused on making the library performant where it’s needed most:

  • x86_64: Comb precomputation tables (teeth=6, blocks=43) to accelerate fixed-base scalar multiplication, achieving significant speedups over generic implementations.
  • ARM64 (Android/Linux): Hand-tuned multiply/square paths that call assembly directly, optimized for Cortex-A76 and newer cores.
  • Embedded & Emerging: Current support for ESP32-S3 and upcoming optimizations for RISC-V (Milk-V Mars).
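As a quick sanity check on the comb parameters above: teeth × blocks must cover the full 256-bit scalar, and the per-block table size follows from the tooth count. The table-size line assumes a plain unsigned comb (a signed-digit variant would roughly halve it); the arithmetic itself is just the coverage condition.

```cpp
#include <cassert>

// Comb coverage check: each pass reads kTeeth bits from each of kBlocks
// blocks, so kTeeth * kBlocks must be >= the 256-bit scalar width.
constexpr int kTeeth      = 6;
constexpr int kBlocks     = 43;
constexpr int kScalarBits = 256;

constexpr int comb_coverage() { return kTeeth * kBlocks; }          // 258 bits
constexpr int table_entries_per_block() { return 1 << kTeeth; }     // 64 entries (unsigned comb)

static_assert(comb_coverage() >= kScalarBits, "comb must cover the scalar");
```

With teeth=6, 43 blocks is the smallest count that covers 256 bits (42 × 6 = 252 falls short), which is presumably why that pair was chosen.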

Current State (v3.10.x): The library currently passes over 12,000 consistency tests across x86 and ARM64 platforms. The ecosystem includes full bindings for NPM (Node.js/React Native) and NuGet (.NET), making it ready for high-level integration.

Request for Review & Technical Discussion

I am specifically looking for feedback on:

  1. Constant-Time Integrity: Review of our assembly bypasses for potential side-channel leaks.
  2. Algorithm Selection: Evaluation of our H-Product Serial Inversion and SafeGCD implementation details.
  3. Branchless Logic: Suggestions for further removing branches in the point-normalization and signing flows to improve security.
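For context on item 3, the standard branchless building blocks are mask-based selects, conditional negations, and full-table scans. A minimal sketch follows; these are illustrative stand-ins written for this post, not the library's actual routines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Constant-time select: returns a if flag==1, b if flag==0.
// The mask is all-ones or all-zeros, so no secret-dependent branch is taken.
static uint64_t ct_select(uint64_t flag, uint64_t a, uint64_t b) {
    uint64_t mask = (uint64_t)0 - flag;      // 0x00...00 or 0xFF...FF
    return (a & mask) | (b & ~mask);
}

// Branchless conditional negation mod m (a toy stand-in for field negation
// in point normalization): if flag, return (m - x) mod m, else x.
static uint64_t ct_cneg_mod(uint64_t flag, uint64_t x, uint64_t m) {
    uint64_t neg = m - x;
    neg &= (uint64_t)0 - (uint64_t)(x != 0); // x == 0 stays 0, not m
    return ct_select(flag, neg, x);
}

// Constant-time table lookup: touch every entry, keep only the match.
// This is the scalar analogue of the AVX2 CT table scans mentioned above.
static uint64_t ct_lookup(const uint64_t* table, size_t n, size_t secret_idx) {
    uint64_t r = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t eq = (uint64_t)0 - (uint64_t)(i == secret_idx);
        r |= table[i] & eq;
    }
    return r;
}
```

The scan reads the whole table unconditionally, so memory access patterns are independent of the secret index; auditing that compilers preserve these patterns is exactly the kind of review being requested.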

The project is fully open-source, and I believe that peer review from the Delving Bitcoin community is vital to ensure this tool remains both fast and secure for the wider ecosystem.

GitHub Repository: https://github.com/shrec/UltrafastSecp256k1

Technical Changelog: https://github.com/shrec/UltrafastSecp256k1/blob/c649f6dfd80b1611b17f606206b156e3c2e6a058/CHANGELOG.md


Just finished the RISC-V optimization sprint for the Milk-V Mars (SiFive U74). U74-specific in-order scheduling gave us a 34% boost in verification speed. This is part of the v3.11 roadmap to make UltrafastSecp256k1 the go-to library for resource-constrained IoT devices. Cycles don't lie! 🚀


I wonder what your expectation is. If it is that someone here will make the effort of reading and reasoning about more than 150 000 lines of cryptographic code, then I deem that the probability that this happens is negligible.

My main piece of feedback is that the license is not a good fit for the Bitcoin ecosystem. Almost everything in the ecosystem uses the MIT license. Picking the AGPL means that essentially no projects will be able to use your code, even if they wanted to.


Thank you for the candid feedback — I appreciate it.

You’re absolutely right regarding the license friction. After reflecting on your comment and the broader ecosystem norms, I’ve decided to switch the project to the MIT license to better align with Bitcoin Core and related projects.

My intention was never to create adoption barriers. The goal is to build a portable, zero-dependency secp256k1 engine that can be evaluated and integrated freely.

I understand that a full manual review of a large cryptographic codebase is unrealistic without structured audit scope. I’m currently working on:

• A clear threat-model document
• A minimized audit-surface breakdown
• A reproducible, apples-to-apples benchmark harness
• A cross-implementation comparison vs libsecp256k1

Any targeted feedback on specific subsystems (e.g., scalar arithmetic, field layer, constant-time strategy) would already be extremely valuable.

Thanks again for taking the time to respond.


Cumulative release: v3.14.0 → v3.21.0

• 120+ commits
• ABI compatible
• No breaking changes: drop-in upgrade from v3.14.x

Highlights:

• Bernstein‑Yang SafeGCD constant‑time scalar inverse
• 6.4× faster ct::scalar_inverse
• ~43% faster constant‑time ECDSA signing
• RISC‑V constant‑time timing-leak fixes
• Strict BIP‑340 parsing
• Expanded audit infrastructure
• Reproducible Docker CI
• Cross‑platform benchmarks on x86‑64, ARM64, RISC‑V, and ESP32


More benchmarks can be found here:


UltrafastSecp256k1 and BIP-352 with an i5 CPU and an Nvidia 5060 Ti


Benchmark repo: GitHub - shrec/bench_bip352 (BIP-352 Standalone Benchmark)


Hi all,

I’ve been experimenting with BIP324 v2 encrypted transport and wanted to share some measurements around its performance characteristics, focusing on throughput, latency, and batching effects.

The goal was not to propose changes, but to better understand where the actual costs are and how they scale under different execution models.


Setup

  • Full BIP324 v2 stack implemented (ChaCha20-Poly1305 AEAD, HKDF-SHA256, ElligatorSwift, session management)

  • CPU: x86-64, clang-19, -O3

  • GPU: RTX 5060 Ti (CUDA), batch-oriented execution model

  • Measurements use the median of multiple runs (RDTSCP timing)


CPU baseline (single-thread)

Mixed traffic profile:

  • ~715K packets/sec

  • ~221 MB/s goodput

  • ~5.5% protocol overhead
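The ~5.5% figure is consistent with a back-of-the-envelope check. BIP324 v2 framing adds 20 bytes per packet (3-byte encrypted length, 1-byte header, 16-byte Poly1305 tag), and the average payload implied by 221 MB/s at ~715K packets/sec is about 309 bytes, which puts the framing overhead near 6%; the exact figure depends on the traffic mix. A sketch of the arithmetic:

```cpp
#include <cassert>

// BIP324 v2 per-packet framing: 3-byte encrypted length field,
// 1-byte header, and a 16-byte Poly1305 tag = 20 bytes on the wire
// beyond the payload itself.
constexpr double kFramingBytes = 3 + 1 + 16;

// Overhead fraction of total wire bytes for a given payload size.
constexpr double overhead(double payload_bytes) {
    return kFramingBytes / (payload_bytes + kFramingBytes);
}

// Average payload implied by the measured goodput and packet rate.
constexpr double kGoodputBps = 221e6;   // ~221 MB/s
constexpr double kPktPerSec  = 715e3;   // ~715K packets/sec
constexpr double kAvgPayload = kGoodputBps / kPktPerSec;  // ~309 bytes
```

The slight gap between ~6% computed and ~5.5% measured suggests the mixed-traffic profile skews a bit larger than the average implies.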

Selected primitives:

  • ChaCha20: ~780–840 MB/s

  • Poly1305: ~1.5–2.2 GB/s

  • AEAD encrypt: ~265–580 MB/s

  • AEAD decrypt: ~232–587 MB/s

One-time operations:

  • HKDF (extract+expand): ~286 ns

  • ElligatorSwift create: ~53 µs

  • ElligatorSwift XDH: ~30 µs

  • Full handshake (both sides): ~172 µs


GPU offload (batch processing)

With batching (128K packets):

  • ~12.78M packets/sec

  • ~3.9 GB/s goodput

  • ~17–18x throughput increase vs CPU

After optimizations (state reuse, instruction-level tuning, memory layout):

  • ~21.37M packets/sec

  • ~6.6 GB/s goodput

  • ~30x throughput vs CPU

Overhead remains roughly the same (~5.5–5.6%).


Latency vs batching

A key observation is the strong dependence on batch size:

  • 1 packet: ~17.6 µs (launch + transfer dominated)

  • 64 packets: ~0.5 µs/packet

  • 1024+ packets: ~63 ns/packet

This suggests:

GPU behaves as a throughput engine, not a latency engine.

Small workloads are dominated by launch and transfer overhead, while large batches amortize these costs effectively.
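The batching behaviour above fits a simple fixed-overhead model: per-packet time t(n) = T0/n + c, where T0 is the one-off launch-plus-transfer cost and c the steady-state per-packet cost. Reading T0 ≈ 17.5 µs and c ≈ 63 ns off the measured endpoints (both assumptions derived from the figures above, not separately measured), the model reproduces the single-packet and large-batch numbers; the measured 64-packet point (~0.5 µs) sits somewhat above it, hinting at extra per-batch costs at small sizes.

```cpp
#include <cassert>

// Fixed-overhead amortization model for batched GPU dispatch:
// one-off launch/transfer cost T0 spread over n packets, plus a
// steady-state per-packet cost c.
constexpr double kT0_ns = 17500.0;  // launch + transfer overhead (assumed)
constexpr double kC_ns  = 63.0;     // steady-state cost per packet (assumed)

constexpr double per_packet_ns(double n) { return kT0_ns / n + kC_ns; }
```

In this model the per-packet cost converges toward c only once n ≫ T0/c ≈ 280 packets, which matches the observation that the GPU pays off as a throughput engine rather than a latency engine.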


PCIe / data movement effects

End-to-end profiling shows:

  • Kernel execution: ~55–58% of total time

  • PCIe transfer: ~42–45%

Effective end-to-end throughput stabilizes around:

  • ~3.2–3.6 GB/s

This indicates that once crypto is sufficiently optimized, data movement becomes the dominant bottleneck, not the cryptographic primitives themselves.


Additional observations

  • Decoy traffic overhead is relatively small on GPU: ~20% decoy rate results in only ~1.4% throughput drop

  • Multi-stream execution (overlapping copy + compute): ~1.37x improvement vs single stream

  • Optimal batch size appears to be in the 4K–16K packet range for this setup


Takeaways

  1. BIP324 cryptographic overhead on CPU is measurable but not extreme (~5–6%)

  2. Throughput can scale significantly with parallel execution (30x in this setup)

  3. Latency and throughput behave very differently depending on batching

  4. Once crypto is fast enough, transport becomes memory/IO bound

  5. Batch size and execution model are critical factors in performance


Open questions

  • Are there realistic node-level scenarios where large batch sizes naturally occur?

  • Would transport-level batching be compatible with current peer/message handling models?

  • How relevant are throughput optimizations vs latency in real-world node deployments?


I’m happy to share more details or run additional measurements if useful.