I have a function to compute string edit distance in C++:
#include <string>
#include <vector>
#include <algorithm>

size_t sed_diff(const std::string & a, const std::string & b) {
    // Two-row Levenshtein DP: cp = previous row, cc = current row.
    std::vector<size_t> cp(b.length() + 1);
    std::vector<size_t> cc(b.length() + 1);
    for (int j = 0; j <= b.length(); ++j) {
        cp[j] = j;
    }
    for (int i = 1; i <= a.length(); ++i) {
        cc[0] = i;
        for (int j = 1; j <= b.length(); ++j) {
            unsigned int c1 = a[i - 1];
            unsigned int c2 = b[j - 1];
            cc[j] = std::min({
                cp[j - 1] + ((c1 == c2) ? 0 : 1),
                cc[j - 1] + 1,
                cp[j] + 1
            });
        }
        std::swap(cp, cc);
    }
    return cp[b.length()];
}

int main() {
    // example call
    sed_diff(std::string("banana"), std::string("mango"));
    return 0;
}
I attempted to rewrite the same function in Rust:
fn sed(s1: &[u8], s2: &[u8]) -> u32 {
    let mut cp = vec![0u32; s2.len() + 1];
    let mut cc = vec![0u32; s2.len() + 1];
    for i in 0..=s2.len() {
        cp[i] = i as u32;
    }
    for i in 1..=s1.len() {
        cc[0] = i as u32;
        for j in 1..=s2.len() {
            let c1 = s1[i - 1];
            let c2 = s2[j - 1];
            cc[j] = std::cmp::min(
                std::cmp::min(cp[j - 1] + if c1 == c2 { 0 } else { 1 }, cc[j - 1] + 1),
                cp[j] + 1,
            );
        }
        std::mem::swap(&mut cp, &mut cc);
    }
    cp[s2.len()]
}

fn main() {
    sed(
        String::from("banana").as_bytes(),
        String::from("mango").as_bytes(),
    );
}
When benchmarking the code on the same dataset, the C++ code averages around 15 seconds, while the compiled Rust code averages around 30 seconds of total runtime.
Both tests are run in nested for loops, using 20 000 input strings and 50 test strings.
But I can't understand why the Rust version is so much slower.
EDIT:
The CPU is an AMD Ryzen 5 4500U.
I compiled the C++ code using g++ (v13.2.1) + CMake in Release mode, so AFAIK only the -O3 flag was used.
Rustc version is v1.75.0.
As for the tested inputs, as I mentioned in the comments, the strings are generated DNA sequences with lengths of 95 up to 105 chars. E.g.:
ATGATGGTCAGGTAGTGGAGAGGCTTACATATGGCCACATATCGGTCAAAAGCCATGGCTATGAGCAGCACCATCTCAGTTCCCCCAACTGCATGGATAA
I used 20 000 input strings and 50 query strings in nested for loops (query string in the outer loop).
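In outline, the benchmark loop is essentially the following (simplified: timing and dataset loading are omitted, and the names here are just illustrative):

fn run_benchmark(queries: &[String], inputs: &[String]) -> u64 {
    // 50 query strings in the outer loop, 20 000 input strings in the inner loop.
    let mut total = 0u64;
    for q in queries {
        for s in inputs {
            total += sed(q.as_bytes(), s.as_bytes()) as u64;
        }
    }
    // Sum the distances so the calls can't be optimized away.
    total
}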
Rustc fails to keep `cc[j] = min(...)` in a register to become next iteration's `cc[j - 1]`, instead it actually stores + reloads, creating a longer latency bottleneck. This is on top of the branchless 2x `cmp`/`cmov` and `add` (+1) that's probably about 5 cycles to get the `min` of 3 inputs. Store-forwarding latency is usually about 5 cycles on typical modern x86, so that explains the 2x speed difference. https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ / https://agner.org/optimize/. On some CPUs like Zen 2 and Zen 4, and Ice Lake, zero-latency store forwarding is possible, but that usually requires the same addressing mode, not just the same final address. (https://www.agner.org/forum/viewtopic.php?t=41)
Or maybe there's a throughput bottleneck as well for the GCC version. GCC does `min(cp[j] + 1, cp[j - 1] + (c1 != c2))` first because that's independent of last iteration's `min` result that's now in `cc[j - 1]`, so the actual critical path latency is only `add`/`cmp`/`cmov`, 3 cycles on Broadwell and later, or AMD.

Without compiler versions/options and CPU model, and details on your microbenchmark to see if the function inlined and compiled differently, the precise explanation for the exact speed difference will have to wait, but the bottom line is a missed optimization by the Rust compiler. You could probably help it out by using a local variable instead of re-reading an array element, with the bounds-checking that entails in Rust.
The inner loop has a dependency on the previous iteration via `cc[j - 1] + 1`, so compilers can only vectorize the `cp[j] = j;` init loop.

It's this loop-carried dependency that appears to be a throughput bottleneck in the Rust `-C opt-level=3` code-gen, because Rustc / LLVM made asm that reloads it from memory instead of keeping the result of last iteration's `min` in a register. Perhaps it has a harder time optimizing than in C++ because of the bounds checks (there are some extra `cmp`/`jcc` conditional branches in the inner loop, but I don't think that would account for a factor of 2 in throughput). But LittleBoxOfSunshine's answer tested with `get_unchecked` and found negligible speedups, so probably just LLVM misses this.
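(For reference, a `get_unchecked` variant would look something along these lines; this is a hypothetical reconstruction, not necessarily the exact code from that answer.)

fn sed_unchecked(s1: &[u8], s2: &[u8]) -> u32 {
    let mut cp = vec![0u32; s2.len() + 1];
    let mut cc = vec![0u32; s2.len() + 1];
    for j in 0..=s2.len() {
        cp[j] = j as u32;
    }
    for i in 1..=s1.len() {
        cc[0] = i as u32;
        let c1 = s1[i - 1];
        for j in 1..=s2.len() {
            // Safety: i and j are always within 1..=len, so every index is in range.
            unsafe {
                let c2 = *s2.get_unchecked(j - 1);
                let sub = *cp.get_unchecked(j - 1) + (c1 != c2) as u32;
                let ins = *cc.get_unchecked(j - 1) + 1;
                let del = *cp.get_unchecked(j) + 1;
                *cc.get_unchecked_mut(j) = sub.min(ins).min(del);
            }
        }
        std::mem::swap(&mut cp, &mut cc);
    }
    cp[s2.len()]
}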
GCC's code-gen for the inner loop (after changing the element type to `unsigned` to match your use of `u32` in Rust) is on Godbolt. I know you actually benchmarked a version that used 64-bit `size_t` elements for the vectors, assuming you compiled for x86-64, but that's not significant; for your small strings, these easily fit in L1d cache.

Note that there are only two DWORD loads plus the BYTE load (of the char from the string), and none of the loads use `r14` in the addressing mode like the store does. So we can see there are no loads from `cc[]`, just a store, so `cc[j - 1] + 1` must be computed from last iteration's store. Another possible optimization would have been to keep `cp[j]` around for use as `cp[j-1]` next iteration, but looking at the addressing modes is enough to figure out that it didn't do that.

Clang is weird, with what looks like two versions of the loop, one with setcc/cmovcc and another with branches (starting at `.LBB0_38:` I think). But Rustc (Godbolt) has one inner loop, where the main work is still done branchlessly like GCC. Unlike GCC, though, it's actually storing/reloading `cc[j - 1]`, so store-forwarding latency becomes part of the loop-carried dependency chain critical path, which is the bottleneck.

Optimized Rust version
Seeing these problems in the asm, we can change the source to make it easier for the compiler to avoid them. Use a local `mut lastccj` variable to hold `cc[j]` across iterations, and use `lastccj` only in the outer `min`, not as an input to the inner one, to keep the critical path latency down to `inc`/`cmp`/`cmov`, with the first `cmp`/`cmov` being independent (Godbolt).
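The exact version is on the Godbolt link; reconstructed from the description above, it looks roughly like this (a sketch, not guaranteed to be identical to the linked code):

fn sed(s1: &[u8], s2: &[u8]) -> u32 {
    let mut cp = vec![0u32; s2.len() + 1];
    let mut cc = vec![0u32; s2.len() + 1];
    for j in 0..=s2.len() {
        cp[j] = j as u32;
    }
    for i in 1..=s1.len() {
        let c1 = s1[i - 1];
        // lastccj holds cc[j - 1] in a register instead of reloading it from memory,
        // so the loop-carried dependency is just lastccj + 1 feeding one min.
        let mut lastccj = i as u32;
        cc[0] = lastccj;
        for j in 1..=s2.len() {
            let c2 = s2[j - 1];
            // This min depends only on the previous row (cp), so it's off the
            // critical path and can execute as soon as the loads complete.
            let indep = std::cmp::min(cp[j - 1] + (c1 != c2) as u32, cp[j] + 1);
            // Only the outer min uses lastccj: inc/cmp/cmov on the critical path.
            lastccj = std::cmp::min(indep, lastccj + 1);
            cc[j] = lastccj;
        }
        std::mem::swap(&mut cp, &mut cc);
    }
    cp[s2.len()]
}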
This is obviously somewhat less readable and harder to follow, so it would be nice if compilers did a better job. There might also be something to gain from `cp[j-1]` vs. `cp[j]`, using a temporary variable for that, but that's just a throughput issue, so we'd have to look at whether it saves instructions. Modern x86 can do 2 or 3 loads per clock cycle, so the back-end execution units aren't the limit, just getting all the instructions through the front-end and issued into the back-end. Especially with the bounds-check overhead, this loop is far from tiny.
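If we did try that, the inner loop of the sketch above might become something like the following (reusing `c1` and `lastccj` from that version; whether it actually saves instructions would need checking in the asm):

// Hypothetical variant: also carry cp[j] over to the next iteration as cp[j - 1],
// saving one load + bounds check per iteration. This is a throughput tweak only;
// the latency-critical chain through lastccj is unchanged.
let mut cpjm1 = cp[0];
for j in 1..=s2.len() {
    let cpj = cp[j];
    let c2 = s2[j - 1];
    let indep = std::cmp::min(cpjm1 + (c1 != c2) as u32, cpj + 1);
    lastccj = std::cmp::min(indep, lastccj + 1);
    cc[j] = lastccj;
    cpjm1 = cpj;
}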
Perhaps some SIMD to prepare a buffer of `std::cmp::min(cp[j - 1] + (c1 != c2) as u32, cp[j] + 1)` results would help. That could maybe even auto-vectorize. It could be very good with AVX-512 merge-masking, but could still go 8 elements at a time with AVX2 with a few instructions. But the final loop is going to have a 3-cycle loop-carried dependency and the bounds-check overhead, so there's a limit to how much speedup we can get.
https://uica.uops.info/ predicts that the inner loop from this optimized version bottlenecks at one iteration per 5.36 cycles on Skylake, one per 4.7 cycles on Ice Lake / Rocket Lake. (Or 4.2 cycles with perfect scheduling). Those are both worse than the 3-cycle latency bottleneck. So the bounds-check branching becomes part of the throughput bottleneck after fixing the latency bottleneck. CPUs with 6-wide pipelines like Alder Lake P-cores or Zen 3 or 4 could run it even faster, in theory one per 3.5 cycles.
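Going back to the buffer idea: a scalar two-pass version of it might be structured like this (sketch only; whether LLVM actually auto-vectorizes the first pass would need checking):

fn sed_two_pass(s1: &[u8], s2: &[u8]) -> u32 {
    let mut cp = vec![0u32; s2.len() + 1];
    let mut cc = vec![0u32; s2.len() + 1];
    let mut indep = vec![0u32; s2.len() + 1]; // scratch buffer, allocated once
    for j in 0..=s2.len() {
        cp[j] = j as u32;
    }
    for i in 1..=s1.len() {
        let c1 = s1[i - 1];
        // Pass 1: depends only on the previous row, no loop-carried dependency,
        // so it's at least a candidate for auto-vectorization (or manual SIMD).
        for j in 1..=s2.len() {
            let c2 = s2[j - 1];
            indep[j] = std::cmp::min(cp[j - 1] + (c1 != c2) as u32, cp[j] + 1);
        }
        // Pass 2: the sequential min chain, still a 3-cycle critical path.
        let mut lastccj = i as u32;
        cc[0] = lastccj;
        for j in 1..=s2.len() {
            lastccj = std::cmp::min(indep[j], lastccj + 1);
            cc[j] = lastccj;
        }
        std::mem::swap(&mut cp, &mut cc);
    }
    cp[s2.len()]
}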
Would it work to manually vectorize with SIMD prefix-scan techniques for the chain of `min` operations? Maybe not, because I'm not sure it's separable into a prefix-min of `cc[j] + 1`; the min with elements from `cp[j]` can affect an older `cc[j]` in ways that we can't correct for later. So the mix of `+1` and `min` makes it not associative, I think.