Building a compile-time SIMD optimized smoothing filter

By Miguel Raz Guzmán Macedo
Scientific Computing in Rust 2024

The problem

  • You want to smooth your time series finance/particle physics/econometrics data
  • You want to do it fast, like, really fast
  • You want to make sure the data still holds on to some of its statistical properties

The solution

  • What you want is a Savitzky-Golay filter
  • Or a weighted moving average, rolling window average, convolution...
  • It's basically a fancy dot product with one vector being precomputed.
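
A minimal sketch of that shape (not the talk's implementation, just the idea): each output point is the dot product of a precomputed coefficient vector with a sliding window of the data.

// Sketch only: `coeffs` stands in for the precomputed Savitzky-Golay weights.
fn smooth(data: &[f32], coeffs: &[f32], out: &mut [f32]) {
    for (o, window) in out.iter_mut().zip(data.windows(coeffs.len())) {
        *o = window.iter().zip(coeffs).map(|(x, c)| x * c).sum();
    }
}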

Why it's juicy

  • A dot product has perf opportunities at many levels
  • Wide set of applications
  • The coefficient precomputation lends itself to compile-time shenanigans and speedups

Dot products galore

It's just a map-reduce with a binary_op and a plus reduction!

  • In Julia, mapreduce(*, +, a, b)
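
In Rust, the same map-reduce could read as (a sketch):

// The Rust analogue of Julia's mapreduce(*, +, a, b).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}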

But it's also useful for much more than smoothing.

Benchmarks

Scipy

   ...: import numpy as np
   ...: from scipy.signal import savgol_filter as f  # assumption: f is scipy's savgol_filter
   ...: N = 100_000_000
#          👆
   ...: dataf32 = np.arange(1, N, dtype=np.float32)
   ...: dataf64 = np.arange(1, N, dtype=np.float64)
   ...: %timeit f(dataf32, 5, 2, mode = "wrap")
   ...: %timeit f(dataf64, 5, 2, mode = "wrap")
f32 savgol (5,2) for 100000000 elements
1.38 s ± 31.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 👆
f64 savgol (5,2) for 100000000 elements
1.69 s ± 121 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 👆
  1. Time isn't scaling with bitwidth (f64 takes nowhere near 2x the f32 time) -> compute bound, not memory bound!
  2. All benchmarking issues are mine, please let me know if I've goofed

Julia

julia> using BenchmarkTools
julia> using StagedFilters
julia> N = 100_000_000;
julia> data = rand(Float32, N); smoothed = similar(data);
julia> @btime smooth!($SavitzkyGolayFilter{2,2}, $data, $smoothed);
  61.839 ms (0 allocations: 0 bytes)
  # 👆 ~22x faster than Scipy
julia> dataf64 = rand(Float64, N); smoothedf64 = similar(dataf64);
julia> @btime smooth!($SavitzkyGolayFilter{2,2}, $dataf64, $smoothedf64);
  125.020 ms (0 allocations: 0 bytes)
  # 👆 ~13x faster than Scipy

Rust

use divan::black_box as bb; // assumption: `bb` aliases divan::black_box, keeping the work from being optimized away

#[divan::bench(sample_size = 3, sample_count = 3)]
fn savgol_f32() -> f32 {
    let n = 100_000_000;
    let v = vec![10.0f32; n];
    let mut buf = vec![0.0f32; n];
    sav_gol_f32::<2, 2>(bb(&mut buf), bb(&v));
    buf[0]
}

Whelp...

mrg:~/staged-sg-filter$ RUSTFLAGS="-C target-feature=+fma,+avx2 -C target-cpu=native" cargo bench
  #                                                   👆 ~2x faster than without
     Running benches/divan.rs (target/release/deps/divan-a75fca219433bc49)
Timer precision: 20 ns
divan          fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ savgol_f32  521.4 ms      │ 530.7 ms      │ 526.2 ms      │ 526.1 ms      │ 3       │ 9
  #             👆 ~2.5x faster than Scipy
╰─ savgol_f64  1.046 s       │ 1.059 s       │ 1.048 s       │ 1.051 s       │ 3       │ 9
  #             👆 ~1.6x faster than Scipy

Counting cycles

Processing 100M elements on a computer with 1e11 peak FLOP/s:

Language   f32 time [s]   Elements/ns   ns/Element
Python     1.38           0.072         13.8
Julia      0.06           1.617         0.618
Rust       0.52           0.19          5.21
  • Surprising result - I don't know where this difference between Julia and Rust is coming from!

The tight loop

vfnmadd132ss  xmm0, xmm8, dword ptr [rcx + 4*r10 - 12] # xmm0 = -(xmm0 * mem) + xmm8
vfmadd231ss   xmm0, xmm3, dword ptr [rcx + 4*r10 - 8]  # xmm0 = (xmm3 * mem) + xmm0
vfmadd231ss   xmm0, xmm4, dword ptr [rcx + 4*r10 - 4]  # xmm0 = (xmm4 * mem) + xmm0
vfmadd231ss   xmm0, xmm5, dword ptr [rcx]              # xmm0 = (xmm5 * mem) + xmm0
vfnmadd231ss  xmm0, xmm13, dword ptr [rcx + 4*rdi + 8] # xmm0 = -(xmm13 * mem) + xmm0

Tight assembly! Entering 👩‍🍳💋 territory

  • Dot product is the most intensive part of the code -> expect series of vfmadd
  • Exercise left to reader - exploit AVX512 for even more FLOP/s!
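
A hypothetical starting point for that exercise; note that at the time of the talk the AVX-512 intrinsics were nightly-only behind the stdarch_x86_avx512 feature, so this is a sketch, not drop-in code:

// Sketch: 16-wide f32 dot product with AVX-512 FMA (names from std::arch).
#[target_feature(enable = "avx512f")]
unsafe fn dot16(a: &[f32; 16], b: &[f32; 16]) -> f32 {
    use std::arch::x86_64::*;
    let va = _mm512_loadu_ps(a.as_ptr());
    let vb = _mm512_loadu_ps(b.as_ptr());
    // fused multiply-add into a zero accumulator, then a horizontal sum
    _mm512_reduce_add_ps(_mm512_fmadd_ps(va, vb, _mm512_setzero_ps()))
}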

Ingredients for an enviable tight loop 🍅🥦🥕

  1. Move all uncertainty to compile time: const generics, const coefficients, const fn's
  2. Use iterator idioms with mechanical sympathy (avoid mutation)
  3. Use fused multiply-adds and don't forget to set RUSTFLAGS="-C target-cpu=native -C target-feature=+avx2,+fma" cargo run --release
  4. Use tools to take it one small step at a time
    1. battery of unit tests and cargo asm invocations

Rusty implementation

pub fn sav_gol_f32<const WINDOW: usize, const M: usize>(buf: &mut [f32], data: &[f32]) {
  //                👆 1. Const generics
    let coeffs = get_coeffs_f32::<WINDOW, M>();
  //                               👆 1. Const generics
    let window_size = 2 * WINDOW + 1;
    let body_size = data.len() - (window_size - 1);
    buf.iter_mut()
        .skip(window_size / 2)
        .zip(data.windows(window_size))
                  //👆 2. Use iterator idioms (avoid mutation)
        .take(body_size)
        .for_each(|(buf, data)| {
            dot_prod_update_f32(buf, data, coeffs);
        });
}
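
For completeness, a usage sketch (hypothetical data; assumes the coefficient table covers WINDOW = 2, M = 2):

fn main() {
    let data: Vec<f32> = (0..32).map(|i| i as f32).collect();
    let mut smoothed = vec![0.0f32; data.len()];
    sav_gol_f32::<2, 2>(&mut smoothed, &data);
    println!("{:?}", &smoothed[..8]);
}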

Const all the things

    let coeffs = get_coeffs_f32::<WINDOW, M>();
  • Unfortunately floating-point ops are not yet allowed in const fn on stable, but will be Soon™ (?)
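
A tiny repro of the limitation as of the talk (the function name is made up; on stable this needed the const_fn_floating_point_arithmetic feature to compile):

// Errored on stable at the time, along the lines of
// "floating point arithmetic is not allowed in constant functions".
const fn half_of(x: f32) -> f32 {
    x / 2.0
}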

Experiments:

  • How much can you stretch compile time shenanigans?
  • nalgebra and friends were not helpful
  • Easiest to just precompute many cases in Julia and copy/paste them with vim into an unholy table:

pub const COEFFS_F32: [[&[f32]; 25]; 10] = ...;
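
A sketch of how the lookup could then stay fully const-driven (the indexing scheme here is an assumption, not the talk's actual layout):

// Hypothetical indexing: outer index = polynomial order M,
// inner index = full window width 2*WINDOW + 1.
pub fn get_coeffs_f32<const WINDOW: usize, const M: usize>() -> &'static [f32] {
    COEFFS_F32[M][2 * WINDOW + 1]
}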

Pass the const ball ⚽ to your iterators

Prefer

    buf.iter_mut()
        .skip(window_size / 2)
        .zip(data.windows(window_size))
        .take(body_size)
        .for_each(|(buf, data)| {
            dot_prod_update_f32(buf, data, coeffs);
        });

to an indexing approach: tracking the mutability of the inner index can wreck the compiler's analysis and make it bail out of vectorization early.
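
For contrast, a sketch of the discouraged indexing style (same hypothetical helper as above); the mutable index plus the slice bounds checks are what give the optimizer more to prove:

// Discouraged: explicit indices force bounds checks the compiler must reason away.
for i in 0..body_size {
    dot_prod_update_f32(&mut buf[i + window_size / 2], &data[i..i + window_size], coeffs);
}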

  • Try using cargo-remark
  • Try figuring out all the places where Rust can take advantage of known trip counts, or even multiplying by a known constant

Small bites at a time

Separating out

#[inline]
pub fn dot_prod_update(buf: &mut f64, data: &[f64], coeffs: &[f64]) {
    *buf = data
        .iter()
        .zip(coeffs.iter())
        .fold(0.0, |acc, (a, b)| a.mul_add(*b, acc));
}

meant I could iterate faster with unit tests and swap out fold for sum, a.mul_add(*b, acc) for other idioms, etc.
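
One of those swaps, as a sketch (hypothetical name), handy for A/B-ing the codegen with cargo asm:

// Same kernel with `sum` and a plain multiply instead of `fold` + `mul_add`.
#[inline]
pub fn dot_prod_update_sum(buf: &mut f64, data: &[f64], coeffs: &[f64]) {
    *buf = data.iter().zip(coeffs.iter()).map(|(a, b)| a * b).sum();
}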

  • Homework: Try and use an exact-sized iterator, deal with the remainder separately
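
A possible shape for that homework, assuming a chunk width of 4 (hypothetical function, untuned):

pub fn dot_chunked(data: &[f64], coeffs: &[f64]) -> f64 {
    // Fixed-width chunks give the compiler a known inner trip count.
    let mut acc = [0.0f64; 4];
    for (d, c) in data.chunks_exact(4).zip(coeffs.chunks_exact(4)) {
        for i in 0..4 {
            acc[i] = d[i].mul_add(c[i], acc[i]);
        }
    }
    // Handle the leftover elements (the remainder) separately.
    let tail: f64 = data
        .chunks_exact(4)
        .remainder()
        .iter()
        .zip(coeffs.chunks_exact(4).remainder())
        .map(|(a, b)| a * b)
        .sum();
    acc.iter().sum::<f64>() + tail
}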

Low hanging fruit


pub fn div_runtime_loop(x: f32) -> f32 {
    let mut res = x;
    // 1.0 / 2.0 is constant-folded to 0.5; the trip count of 148 is known at compile time
    for _ in 0..148 {
        res += 1.0 / 2.0;
    }
    res
}

Gives...

.LCPI0_0:
        .long   0x3f000000
example::div_runtime_loop::haf9d2398cf2f6940:
        mov     eax, 144
        movss   xmm1, dword ptr [rip + .LCPI0_0]
.LBB0_1:
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        test    eax, eax
        je      .LBB0_3
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        add     eax, -8
        jmp     .LBB0_1
.LBB0_3:
        ret

But if the trip count is 149...

.LCPI0_0:
        .long   0x3f000000
example::div_runtime_loop::haf9d2398cf2f6940:
        movss   xmm1, dword ptr [rip + .LCPI0_0]
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        addss   xmm0, xmm1
        ; ... 149 times, very, very bad!
  • My takeaway: be like a 5-year-old and "lick"/test everything!

Compilers are hard

Poke at the link yourself:

steve cannon

Thanks to Jubilee