Popcount optimization for the slow-path lookups - Mailing list pgsql-hackers
| From | Andrew Pogrebnoi |
|---|---|
| Subject | Popcount optimization for the slow-path lookups |
| Date | |
| Msg-id | CA+aWR110MiXRWEhRawNrJvVO7CcDrNb5XkRdaa=m1dAVeXM02A@mail.gmail.com |
| Responses | Re: Popcount optimization for the slow-path lookups; Re: Popcount optimization for the slow-path lookups |
| List | pgsql-hackers |
Hello hackers,
I want to propose an optimization for pg_popcount32_slow() and pg_popcount64_slow() that makes the lookups into pg_number_of_ones[] branchless. Compared to the current looped version, it shows a speedup of around 58% for uint64 words and 35% for uint32 words on x86, and the gain is much larger on Arm64 (Apple M1): ~2.6x for uint64 and ~2.25x for uint32 words.
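Roughly, the idea is to replace the data-dependent loop with a fixed sum of table lookups. A minimal sketch of the uint32 case (using the existing pg_number_of_ones[] table; the uint64 version is the same with eight lookups, and the actual change may differ in details):
```
/*
 * Sketch only: branchless per-byte lookups into pg_number_of_ones[].
 * The uint64 variant does the same with eight lookups.
 */
static inline int
pg_popcount32_branchless(uint32 word)
{
	return pg_number_of_ones[word & 0xFF] +
		pg_number_of_ones[(word >> 8) & 0xFF] +
		pg_number_of_ones[(word >> 16) & 0xFF] +
		pg_number_of_ones[(word >> 24) & 0xFF];
}
```
The loop's conditional branch and the serial dependency on `word >>= 8` go away; the four (or eight) loads are independent of each other.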
I probably have to guard the lookup in pg_popcount64_slow() with `#if SIZEOF_LONG == 8` like it's done for __builtin_popcountl() in the same function. What do you think?
For the benchmarks, I copied the lookup loop from pg_popcount32/64_slow() into standalone functions and measured them against their branchless counterparts. In C++, I pre-generated two static vectors of random uint64 and uint32 values, 5M elements each, and ran the lookups with google/benchmark [0] on an M1 Mac and an x86 Mac.
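In outline, the harness looked roughly like this (a sketch with stand-in names, not the exact pop.cc/popcnt.cc; only the uint64 branchless case is shown, the other variants follow the same pattern):
```
// Sketch of the google/benchmark harness: a local copy of the 8-bit popcount
// table, the branchless lookup under test, and one benchmark over 5M words.
#include <cstdint>
#include <random>
#include <vector>

#include <benchmark/benchmark.h>

// Local stand-in for pg_number_of_ones[], filled once at startup.
static unsigned char number_of_ones[256];
static const bool table_ready = [] {
	for (int i = 0; i < 256; i++)
		number_of_ones[i] = (unsigned char) __builtin_popcount(i);
	return true;
}();

// Branchless per-byte lookups (the uint64 variant under test).
static inline int
popcount64_branchless(uint64_t word)
{
	return number_of_ones[word & 0xFF] +
		number_of_ones[(word >> 8) & 0xFF] +
		number_of_ones[(word >> 16) & 0xFF] +
		number_of_ones[(word >> 24) & 0xFF] +
		number_of_ones[(word >> 32) & 0xFF] +
		number_of_ones[(word >> 40) & 0xFF] +
		number_of_ones[(word >> 48) & 0xFF] +
		number_of_ones[(word >> 56) & 0xFF];
}

// 5M random inputs, generated once before the timed loop.
static const std::vector<uint64_t> inputs64 = [] {
	std::mt19937_64 rng(20251205);
	std::vector<uint64_t> v(5000000);
	for (auto &w : v)
		w = rng();
	return v;
}();

static void Popcnt64_BranchlessLookup(benchmark::State &state)
{
	size_t	i = 0;

	for (auto _ : state)
	{
		benchmark::DoNotOptimize(popcount64_branchless(inputs64[i]));
		if (++i == inputs64.size())
			i = 0;
	}
}
BENCHMARK(Popcnt64_BranchlessLookup);

BENCHMARK_MAIN();
```
The Popcnt*_PG variants wrap a copy of the existing loop from pg_popcount32/64_slow() in the same kind of benchmark function.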
Results:
X86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T10:13:34+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB
L1 Instruction 32 KiB
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB
Load Average: 3.23, 4.56, 2.92
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 17.2 ns 17.2 ns 39247568
Popcnt64_PG 27.2 ns 27.0 ns 25415636
Popcnt32_BranchlessLookup 14.5 ns 14.4 ns 48441566
Popcnt32_PG 19.5 ns 19.4 ns 36125676
```
Apple M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T08:59:12+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
L1 Data 128 KiB (x10)
L1 Instruction 192 KiB (x10)
L2 Unified 12288 KiB (x10)
Load Average: 0.28, 0.07, 0.02
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 6.52 ns 6.52 ns 113974186
Popcnt64_PG 17.1 ns 17.1 ns 43015334
Popcnt32_BranchlessLookup 4.34 ns 4.34 ns 145674483
Popcnt32_PG 9.79 ns 9.79 ns 75893374
```
I also tried parallel summing of adjacent bits (see Hacker's Delight [1], Chapter 5).
This is the code for uint64:
```
#define POPCOUNT_M0 0x5555555555555555	// 01010101 ...
#define POPCOUNT_M1 0x3333333333333333	// 00110011 ...
#define POPCOUNT_M2 0x0F0F0F0F0F0F0F0F	// 00001111 ...

static inline int
pg_popcount_word64(uint64 word)
{
	word = word - ((word >> 1) & POPCOUNT_M0);					// 2-bit partial counts
	word = ((word >> 2) & POPCOUNT_M1) + (word & POPCOUNT_M1);	// 4-bit partial counts
	word = ((word >> 4) + word) & POPCOUNT_M2;					// 8-bit partial counts
	word += word >> 8;
	word += word >> 16;
	word += word >> 32;		// this step is omitted for the uint32 version
	return word & 0x7F;		// the count can reach 64, so mask 7 bits (not 0x3F)
}
```
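For reference, the uint32 variant follows from the comment above: the same steps with 32-bit masks and without the final 32-bit shift (a sketch, not verbatim from my benchmark file):
```
static inline int
pg_popcount_word32(uint32 word)
{
	word = word - ((word >> 1) & 0x55555555);
	word = ((word >> 2) & 0x33333333) + (word & 0x33333333);
	word = ((word >> 4) + word) & 0x0F0F0F0F;
	word += word >> 8;
	word += word >> 16;
	return word & 0x3F;		// the count is at most 32, so 6 bits suffice
}
```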
However, it doesn't show any improvement over the branchless lookup on M1, and only a slight gain for uint64 on x86.
The same tests, with the adjacent-bits summing included:
x86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T10:13:34+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB
L1 Instruction 32 KiB
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB
Load Average: 3.23, 4.56, 2.92
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 17.2 ns 17.2 ns 39247568
Popcnt64_AdjacentBits 16.8 ns 16.7 ns 41481236
Popcnt64_PG 27.2 ns 27.0 ns 25415636
Popcnt32_BranchlessLookup 14.5 ns 14.4 ns 48441566
Popcnt32_AdjacentBits 15.5 ns 15.4 ns 44454323
Popcnt32_PG 19.5 ns 19.4 ns 36125676
```
M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T08:59:12+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
L1 Data 128 KiB (x10)
L1 Instruction 192 KiB (x10)
L2 Unified 12288 KiB (x10)
Load Average: 0.28, 0.07, 0.02
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 6.52 ns 6.52 ns 113974186
Popcnt64_AdjacentBits 8.05 ns 8.05 ns 86988490
Popcnt64_PG 17.1 ns 17.1 ns 43015334
Popcnt32_BranchlessLookup 4.34 ns 4.34 ns 145674483
Popcnt32_AdjacentBits 7.27 ns 7.27 ns 96050714
Popcnt32_PG 9.79 ns 9.79 ns 75893374
```
[0] https://github.com/google/benchmark
[1] https://dl.acm.org/doi/book/10.5555/2462741
-----
Cheers,
Andy