Hacker News | camel-cdr's comments

bitfield insert/extract was also looked at by the scalar efficiency SIG: https://lists.riscv.org/g/sig-scalar-efficiency/topic/115060...

IIRC it didn't go anywhere, because it wasn't worth the encoding space.

But a rlwimi sounds like a good candidate for >32b encoding.


Both the PowerPC and Arm64 instructions do grab a lot of encoding space.

rlwimi uses 26 bits of opcode space (i.e. 2^26 = 64M code points). In a RISC-V context you can drop the Rc (set status flags) bit, but for RV64 you need to expand the shift/start/end fields from 5 to 6 bits, so you end up needing 28 bits of encoding space: 18 for the field spec and 5 each for Rs1 and Rd/Rs2.
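
For reference, the field breakdown behind that 26-bit figure (the standard PowerPC M-form layout; my summary, not from the comment above):

    rlwimi RA,RS,SH,MB,ME (M-form):
      primary opcode (6) | RS (5) | RA (5) | SH (5) | MB (5) | ME (5) | Rc (1)
      => 5+5+5+5+5+1 = 26 bits of encoding space beyond the 6-bit primary opcode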

A RISC-V major opcode, such as OP-IMM (which this effectively is, but with a read/write Rd/Rs2), only has 25 bits of encoding space (2^25 code points) for all instructions in total!

PPC64's rldimi expands shift and size to 6 bits each, but drops the ability to take the source field from an arbitrary position (it can only come from the LSBs), and so uses 23 encoding bits, i.e. exactly my proposed RISC-V instruction (except for the set-flags bit, so 22 bits).

Arm64's BFM/SBFM effectively uses 24 bits to provide both 32 bit and 64 bit operations: there are 25 bits, but `sf` and `N` must be the same, potentially allowing the other half of the code points (plus the ones for 32 bit with the MSBs of `immr` and `imms` set) to be used for something else in future. Note that BFM leaves all other bits in the dst unchanged, while SBFM both sign-extends into the higher bits of dst AND zeros the lower bits of dst.

So BFM/SBFM *could* be fit into RISC-V, taking up half of a major opcode, of which there aren't many left. That is a pretty huge amount — the enormous V extension takes 1 1/2 major opcodes, for far more functionality. It would free up various immediate shifts and sign/zero extension instructions, but those don't take much encoding space, no more than 16 bits each.

As nice as they are, it's hard to avoid the conclusion that both (32 bit) PowerPC and Arm64 spend too much opcode space on these.

I think PPC64's `rldimi`, M88K's `mak` (extended to 64 bits), and my last RISC-V suggestion (which are all effectively the same thing) hit the right tradeoff: not using excessive encoding space, but allowing a 2-instruction sequence for that bit field move:

    srli   a3, a1, 21
    maki   a2, a3, (1<<6) | 10   # decoder expands to `maki a2, a2, a3, (1<<6) | 10`
That's 22 bits of opcode space, the same as any one of `addi`, `andi`, `ori`, `xori`, `slti`, `sltiu` (OP-IMM) or `addiw` (OP-IMM-32).

The original RV64GC has 5/8 funct3 encodings in OP-IMM-32 unused, which `maki` (or call it `bfi` or whatever) could have used one of. It has a combined `Rd`/`Rs2` field which is unusual in full size 4-byte RISC-V instructions, but not unprecedented: the V extension does that for multiply-add instructions.

I don't immediately see any ratified or currently-proposed extension using this space.


What would justify using this significant space for them these days? Video encoding/decoding in software seems like the most likely candidate, since there's a lot of bitfield packing and high data volume.

(Thanks for your elaboration on various architectures. It's an interesting glimpse into what goes into allocating opcode space on fixed-length instruction machines.)


My example is applicable to compilers / assemblers / JITs / emulators.

The performance of conventional compilers and assemblers is not important to anyone but developers, but everyone uses JavaScript / WebAsm all the time. And QEMU can be important too (e.g. in docker for non-native ISAs, using binfmt_misc).

I guess I should point out that in the proposed RISC-V example it's 6 bytes of code, as the initial shift can be a 2-byte "C" extension instruction. So that's slightly smaller code than everything except 32 bit PowerPC, which is another important aspect. Arm64 and M68k use 8 bytes of code.

Oh! I just realised standard RISC-V can be improved in this case (but not by so much in the general case).

    srli   x12, x10, 20          # shift field down to correct position
    andi   x12, x12, 0x7FE       # mask to 10 bits
    andi   x11, x11, ~0x7FE      # clear space in the destination
    or     x11, x11, x12         # insert the field
That's just 12 bytes of code.
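
In C terms, a minimal sketch (my illustration) of what this sequence, and the `maki` sequence above, compute, assuming the field is source bits [30:21] moved into destination bits [10:1] (the 0x7FE mask):

    #include <stdint.h>

    /* clear bits [10:1] of dst and insert bits [30:21] of src there */
    static inline uint64_t bitfield_move(uint64_t dst, uint64_t src)
    {
        uint64_t field = (src >> 20) & 0x7FE;      /* srli + andi */
        return (dst & ~(uint64_t)0x7FE) | field;   /* andi + or   */
    }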

In the more general case you need a `lui` or `lui;addi` pair to load the mask into a register, and then register-to-register ops, for 14 bytes total.

Note that x86_64 needs four instructions and 14 bytes of code, so no better than RISC-V.


pext/pdep are incredible, I'm hoping to see them in more SIMD ISAs in the future.

But my favorite is the 8x8 bit matrix transpose SIMD instruction (gf2p8affine, which does a bit more, but I care about the transpose). Combined with SIMD byte permutes it allows you to do things like: arbitrarily permute bits in SIMD elements, find the inverse of a permutation, very fast histogramming/binning
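
To make the transpose concrete, here's a hedged sketch (mine, not from the comment) for a single 8x8 bit matrix packed into a uint64_t (row r in byte r, column c in bit c). It uses two GF2P8AFFINEQB operations; the same pattern applies per 64-bit lane of wider vectors:

    #include <stdint.h>
    #include <immintrin.h>   // GFNI intrinsics, compile with -mgfni

    static inline uint64_t transpose8x8(uint64_t m)
    {
        const __m128i I = _mm_set1_epi64x((long long)0x8040201008040201ull); // diagonal bit matrix
        __m128i v = _mm_set1_epi64x((long long)m);
        // Pass 1 (m as the matrix operand): rows and columns swap, but the
        // bit order within each result byte comes out reversed.
        v = _mm_gf2p8affine_epi64_epi8(I, v, 0);
        // Pass 2 (diagonal as the matrix operand): reverse the bits within
        // each byte, which yields the true transpose.
        v = _mm_gf2p8affine_epi64_epi8(v, I, 0);
        return (uint64_t)_mm_cvtsi128_si64(v);
    }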


Thanks, I've been doing dumb `sudo sh -c ...` stuff before.

The lines

    __m512i vx  = _mm512_set1_epi64(static_cast<long long>(x));
    __m512i vk  = _mm512_load_si512(reinterpret_cast<const __m512i*>(base));
    __mmask8 m  = _mm512_cmp_epu64_mask(vx, vk, _MM_CMPINT_GE);
    return static_cast<std::uint32_t>(__builtin_popcount(m));
would be replaced with:

    return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m1(base, FANOUT), x, FANOUT), FANOUT);
and you set FANOUT to __riscv_vsetvlmax_e64m1() at runtime.

Alternatively, if you don't want a dynamic FANOUT you keep the FANOUT=8 (or another constant) and do a stripmining loop

    size_t cnt = 0;
    for (size_t vl, n = 8; n > 0; n -= vl, base += vl) {
     vl = __riscv_vsetvl_e64m1(n);
     cnt += __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m1(base, vl), x, vl), vl);
    }
    return cnt;
This will take FANOUT/(VLEN/64) iterations and the branches will be essentially perfectly predicted.

If you know FANOUT is always 8 and you'll never want to change it, you could alternatively select the optimal LMUL:

    size_t vl = __riscv_vsetvlmax_e64m1();
    if (vl == 2) return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m4(base, 8), x, 8), 8);
    if (vl == 4) return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m2(base, 8), x, 8), 8);
    return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m1(base, 8), x, 8), 8);

> good luck parsing through 100 different "performance optimization manuals" from 100 different companies

This would be a problem for any ISA with multiple/many vendors.


Idk, it seems to me like the Rivos people are still doing their RISC-V CPU work.


Sadly still on quite old hardware, with no RVV. Hopefully Scaleway will have some newer servers in the future and this can simply be updated to the new devices.


You can get RVV instances from Scaleway.


Oh, cool, I didn't see them on the website. (https://labs.scaleway.com/en/em-rv1/)


K&R syntax is -1 char, if you are in C:

    double solve(double a,double b,double c,double d){return a+b+c+d;}
    double solve(double a...){return a+1[&a]+2[&a]+3[&a];}
    double solve(a,b,c,d)double a,c,b,d;{return a+b+c+d;}


> For example, should we use vrgather (with what LMUL), or interesting workarounds such as widening+slide1, to implement a basic operation such as interleaving two vectors?

Use Zvzip; in the meantime:

zip: vwmaccu.vx(vwaddu.vv(a, b), -1, b), or segmented load/store when you are touching memory anyways

unzip: vnsrl

trn1/trn2: masked vslide1up/vslide1down with even/odd mask

The only thing base RVV does badly among these is register-to-register zip, which takes twice as many instructions as on other ISAs. Zvzip gives you dedicated instructions for the above.
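
Spelled out in intrinsics, a hedged sketch of that zip trick (mine; assumes 32-bit elements at LMUL=1, adjust SEW/LMUL as needed):

    #include <stdint.h>
    #include <riscv_vector.h>

    // Interleave a and b: result element 2i = a[i], element 2i+1 = b[i].
    vuint32m2_t zip_u32(vuint32m1_t a, vuint32m1_t b, size_t vl)
    {
        vuint64m2_t w = __riscv_vwaddu_vv_u64m2(a, b, vl);    // widened a[i] + b[i]
        w = __riscv_vwmaccu_vx_u64m2(w, UINT32_MAX, b, vl);   // += (2^32 - 1) * b[i]
        // each 64-bit element is now a[i] + (b[i] << 32), which viewed as
        // 32-bit elements is the interleaving of a and b
        return __riscv_vreinterpret_v_u64m2_u32m2(w);
    }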


Looks like the ratification plan for Zvzip is November. So maybe 3y until HW is actually usable? That's a neat trick with wmacc, congrats. But still, half the speed for quite a fundamental operation that has been heavily used in other ISAs for 20+ years :(

Great that you did a gap analysis [1]. I'm curious if one of the inputs for that was the list of Highway ops [2]?

[1]: https://gist.github.com/camel-cdr/99a41367d6529f390d25e36ca3... [2]: https://github.com/google/highway/blob/master/g3doc/quick_re...


OK, look.

Since my previous attempt to measure the impact of trapping on signed overflow didn't seem to have moved your position one bit, I thought I'd give it a go in the most representative way I could think of:

I built the same version of clang on an x86, aarch64 and RISC-V system using clang. Then I built another version with the `-ftrapv` flag enabled and compared the compile times of compiling programs with these clang builds running on real hardware:

    runtime:         x86         | aarch64                    | RISC-V (RVA23)
                     Zen1        |  A78          A55*         |  X100         A100
    clang A:         3.609±0.078 |  4.209±0.050   9.390±0.029 |  5.465±0.070  11.559±0.020
    clang-ftrapv A:  3.613±0.118 |  4.290±0.050   9.418±0.056 |  5.448±0.060  11.579±0.030
    clang B:         8.948±0.100 | 10.983±0.188  22.827±0.016 | 13.556±0.016  28.682±0.023
    clang-ftrapv B:  8.960±0.125 | 11.099±0.294  22.802±0.039 | 13.511±0.018  28.741±0.050
    (all cores clocked to about 2.2GHz; Zen1 can reach almost 4GHz)


As you can see, once again the overhead of -ftrapv is quite low.

Surprisingly, the -ftrapv overhead seems the highest on the Cortex-A78. My guess is that this is because clang generates a separate brk with a unique immediate for every overflow check, while on RISC-V it always branches to one unimp per function.
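
For reference, a minimal sketch (my illustration, not the benchmark code) of what each guarded signed operation roughly turns into under -ftrapv:

    // conceptually, -ftrapv turns every signed add/sub/mul into something like:
    int add_checked(int a, int b)
    {
        int r;
        if (__builtin_add_overflow(a, b, &r))  // per-operation overflow check
            __builtin_trap();                  // brk on arm64, branch to unimp on RISC-V
        return r;
    }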

Please tell me if you have a better suggestion for measuring the real world impact.

Or heck, give me some artificial worst case code. That would also be an interesting data point.

Notes:

* The format is mean±variance

* Spacemit X100 is a Cortex-A76 like OoO RISC-V core and A100 an in-order RISC-V core.

* I tried to clock all of the cores to the same frequency of about 2.2GHz. *Except for the A55, which ran at 1.8GHz, but I linearly scaled the results.

* Program A was the chibicc (8K loc) compiler and program B microjs (30K loc).

    binary size:
                  x86        aarch64    RISC-V
    clang:        212807768  216633784  195231816
    clang-ftrapv: 212859280  216737608  195419512
    increase:     0.024%     0.047%     0.09%


I suspect that LLVM is optimized for compiling with `-ftrapv`, perhaps for cheap sanitizing or maybe just due to design decisions like using unsigned integers everywhere (please correct me if I'm wrong). I'm personally interested in how RISC-V behaves on computational tasks where computing carry is a known bottleneck, like long addition. Maybe looking at libgmp could be interesting, though I suspect absolute numbers will not be meaningful, and there's no baseline to compare them to.


LLVM mostly uses size_t like most C/C++ programs, which either use size_t or int for everything, both of which are handled well by RISC-V.

> Maybe looking at libgmp could be interesting, though I suspect absolute numbers will not be meaningful, and there's no baseline to compare them to.

Realistically, nobody cares about BigInt addition performance, considering there is no GMP implementation using SIMD, or even any using dependency breaking to get beyond 64 bits per cycle.

I whipped up a quick AVX-512 implementation that was 2x faster than libgmp on Zen4 (which has 256-bit SIMD ALUs). On RISC-V you'd just use RVV to do BigInt stuff.
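
For context, a hedged sketch (mine, not the implementation mentioned above) of one common way to break the carry dependency for a single 8-limb block with AVX-512; carries between blocks still have to be handled separately:

    #include <stdint.h>
    #include <immintrin.h>

    // Add two 512-bit numbers stored as 8 x 64-bit limbs (least significant first),
    // ignoring any incoming carry. Returns the carry out of the top limb.
    static inline unsigned add8x64(uint64_t dst[8], const uint64_t a[8], const uint64_t b[8])
    {
        __m512i va = _mm512_loadu_si512(a);
        __m512i vb = _mm512_loadu_si512(b);
        __m512i s  = _mm512_add_epi64(va, vb);

        uint32_t g = _mm512_cmplt_epu64_mask(s, va);                    // limbs that generate a carry
        uint32_t p = _mm512_cmpeq_epi64_mask(s, _mm512_set1_epi64(-1)); // limbs that would propagate one

        // resolve the ripple across all limbs with one scalar addition
        // instead of a serial add-with-carry chain
        uint32_t carry_in = p ^ (p + (g << 1));

        s = _mm512_mask_add_epi64(s, (__mmask8)carry_in, s, _mm512_set1_epi64(1));
        _mm512_storeu_si512(dst, s);
        return (carry_in >> 8) & 1;
    }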


"nobody cares about BigInt addition performance" is an odd claim to make when half of the world's cryptography is based on ECC.

