Re: Popcount optimization for the slow-path lookups - Mailing list pgsql-hackers

From Andrew Pogrebnoi
Subject Re: Popcount optimization for the slow-path lookups
Date
Msg-id CA+aWR11v_s_3wjJRvFPHNbZTv-+dJVWaKc_i4gxH-3jYw_TmFw@mail.gmail.com
Whole thread Raw
In response to Re: Popcount optimization for the slow-path lookups  (David Geier <geidav.pg@gmail.com>)
List pgsql-hackers
Hi David,

Thanks for looking at it!

> I would like to test if I can reproduce your results. Could you share
> your test program?

Here you go: https://gist.github.com/dAdAbird/1480ff15764f5a6301174806d8512a3a

> You also don't specify an optimization level. That means the default
> level -O0 is used. Does it make it a difference if you run e.g. with -O3?

Yes, good point. I got carried away with experiments and totally forgot that..

I've re-run tests with -O3 and it seems like the Summing of Adjacent Bits might make sense for uint64:

X86
```

% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -O3 -o mybenchmark && ./mybenchmark

2025-12-05T16:22:43+02:00

Running ./mybenchmark

Run on (12 X 3200 MHz CPU s)

CPU Caches:

  L1 Data 32 KiB

  L1 Instruction 32 KiB

  L2 Unified 256 KiB (x6)

  L3 Unified 12288 KiB

Load Average: 2.67, 4.72, 2.90

--------------------------------------------------------------------

Benchmark                          Time             CPU   Iterations

--------------------------------------------------------------------

Popcnt64_BranchlessLookup       5.92 ns         5.90 ns    113871130

Popcnt64_AdjacentBits           5.30 ns         5.27 ns    115084258

Popcnt64_PG                     8.34 ns         8.32 ns     80622869

Popcnt32_BranchlessLookup       3.62 ns         3.61 ns    197816110

Popcnt32_AdjacentBits           4.56 ns         4.55 ns    154932383

Popcnt32_PG                     5.32 ns         5.31 ns    121845083

```

M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include   -L../benchmark/build/src -lbenchmark -lpthread -O3 -o mybenchmark && ./mybenchmark
2025-12-05T14:29:18+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
  L1 Data 128 KiB (x10)
  L1 Instruction 192 KiB (x10)
  L2 Unified 12288 KiB (x10)
Load Average: 0.01, 0.01, 0.00
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       2.13 ns         2.13 ns    342101937
Popcnt64_AdjacentBits           1.89 ns         1.89 ns    370316571
Popcnt64_PG                     3.83 ns         3.83 ns    190990088
Popcnt32_BranchlessLookup       1.19 ns         1.19 ns    591797267
Popcnt32_AdjacentBits           1.73 ns         1.73 ns    409381266
Popcnt32_PG                     2.01 ns         2.01 ns    344969625
```

---
Cheers,
Andy

pgsql-hackers by date:

Previous
From: Bertrand Drouvot
Date:
Subject: Re: More const-marking cleanup
Next
From: "Euler Taveira"
Date:
Subject: Re: making tid and HOTness of UPDATE available to logical decoding plugins