From: Andrew Pogrebnoi
Subject: Popcount optimization for the slow-path lookups
Msg-id: CA+aWR110MiXRWEhRawNrJvVO7CcDrNb5XkRdaa=m1dAVeXM02A@mail.gmail.com
Hello hackers,

I want to propose an optimization for pg_popcount32_slow() and pg_popcount64_slow() that makes the lookups into pg_number_of_ones[] branchless. On x86 it shows a speedup of around 58% for uint64 words and 35% for uint32 words compared to the current, looped version. The gain is much more significant on Arm64 (Apple M1): ~2.6x for uint64 and ~2.25x for uint32 words.
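
To make this concrete, here is a minimal sketch of the branchless variant for the 64-bit case (the function name is illustrative; pg_number_of_ones[] is the existing 256-entry byte table in src/port/pg_bitutils.c):
```
// Unrolled byte-table lookups: no loop, no data-dependent branch.
static inline int
pg_popcount64_branchless(uint64 word)
{
    return pg_number_of_ones[word & 0xFF] +
           pg_number_of_ones[(word >> 8) & 0xFF] +
           pg_number_of_ones[(word >> 16) & 0xFF] +
           pg_number_of_ones[(word >> 24) & 0xFF] +
           pg_number_of_ones[(word >> 32) & 0xFF] +
           pg_number_of_ones[(word >> 40) & 0xFF] +
           pg_number_of_ones[(word >> 48) & 0xFF] +
           pg_number_of_ones[(word >> 56) & 0xFF];
}
```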

I probably have to guard the lookup in pg_popcount64_slow() with `#if SIZEOF_LONG == 8`, as is done for __builtin_popcountl() in the same function. What do you think?

For benchmarks, I copied the lookup loops from pg_popcount32/64_slow() into standalone functions and measured them against their branchless counterparts. In C++, I pre-generated two static vectors of random uint64's and uint32's, 5M elements each, and ran the lookups over them with google/benchmark [0]. I ran this on M1 and x86 Macs.
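
Roughly, the harness has this shape (a simplified sketch; the seed and helper names are just for illustration, and pg_popcount64_branchless() refers to the sketch above):
```
#include <benchmark/benchmark.h>
#include <cstdint>
#include <random>
#include <vector>

// Pre-generate 5M random 64-bit words once; all iterations index into it.
static std::vector<uint64_t>
make_words64()
{
    std::mt19937_64 rng(12345);     // fixed seed, illustrative only
    std::vector<uint64_t> v(5000000);
    for (auto &w : v)
        w = rng();
    return v;
}

static void
Popcnt64_BranchlessLookup(benchmark::State &state)
{
    static const std::vector<uint64_t> words = make_words64();
    size_t i = 0;

    for (auto _ : state)
    {
        // DoNotOptimize() keeps the compiler from eliding the call.
        benchmark::DoNotOptimize(pg_popcount64_branchless(words[i]));
        i = (i + 1) % words.size();
    }
}
BENCHMARK(Popcnt64_BranchlessLookup);

// Popcnt64_PG and the Popcnt32_* variants follow the same pattern,
// using the looped version and the uint32 vector, respectively.

BENCHMARK_MAIN();
```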

Results:

x86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T10:13:34+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB
Load Average: 3.23, 4.56, 2.92
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       17.2 ns         17.2 ns     39247568
Popcnt64_PG                     27.2 ns         27.0 ns     25415636
Popcnt32_BranchlessLookup       14.5 ns         14.4 ns     48441566
Popcnt32_PG                     19.5 ns         19.4 ns     36125676
```

Apple M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include   -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T08:59:12+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
  L1 Data 128 KiB (x10)
  L1 Instruction 192 KiB (x10)
  L2 Unified 12288 KiB (x10)
Load Average: 0.28, 0.07, 0.02
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       6.52 ns         6.52 ns    113974186
Popcnt64_PG                     17.1 ns         17.1 ns     43015334
Popcnt32_BranchlessLookup       4.34 ns         4.34 ns    145674483
Popcnt32_PG                     9.79 ns         9.79 ns     75893374
```


I also tried the parallel summing of adjacent bits (see Hacker's Delight [1], Chapter 5). This is the code for uint64:
```
#define POPCOUNT_M0 0x5555555555555555 // 01010101 ...
#define POPCOUNT_M1 0x3333333333333333 // 00110011 ...
#define POPCOUNT_M2 0x0F0F0F0F0F0F0F0F // 00001111 ...

static inline int
pg_popcount_word64(uint64 word)
{
    word = word - ((word >> 1) & POPCOUNT_M0);
    word = ((word >> 2) & POPCOUNT_M1) + (word & POPCOUNT_M1);
    word = ((word >> 4) + word) & POPCOUNT_M2;
    word += word >> 8;
    word += word >> 16;
    word += word >> 32; // this step is omitted for the uint32 version

    return word & 0x7F; // the count can be 64 (0x40), which needs 7 bits, not 6
}
```
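
For reference, the uint32 counterpart under the same scheme (a sketch; the macro names are just to avoid clashing with the 64-bit ones above; note that here the final 6-bit mask suffices, since the count can be at most 32):
```
#define POPCOUNT32_M0 0x55555555 // 01010101 ...
#define POPCOUNT32_M1 0x33333333 // 00110011 ...
#define POPCOUNT32_M2 0x0F0F0F0F // 00001111 ...

static inline int
pg_popcount_word32(uint32 word)
{
    word = word - ((word >> 1) & POPCOUNT32_M0);
    word = ((word >> 2) & POPCOUNT32_M1) + (word & POPCOUNT32_M1);
    word = ((word >> 4) + word) & POPCOUNT32_M2;
    word += word >> 8;
    word += word >> 16;

    return word & 0x3F; // max count is 32 (0x20), which fits in 6 bits
}
```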

However, it doesn't show any improvement over the branchless lookup on M1, and only a slight gain for uint64 on x86.

The same benchmarks, with the summing of adjacent bits included:
x86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T10:13:34+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB
Load Average: 3.23, 4.56, 2.92
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       17.2 ns         17.2 ns     39247568
Popcnt64_AdjacentBits           16.8 ns         16.7 ns     41481236
Popcnt64_PG                     27.2 ns         27.0 ns     25415636
Popcnt32_BranchlessLookup       14.5 ns         14.4 ns     48441566
Popcnt32_AdjacentBits           15.5 ns         15.4 ns     44454323
Popcnt32_PG                     19.5 ns         19.4 ns     36125676
```

M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include   -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T08:59:12+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
  L1 Data 128 KiB (x10)
  L1 Instruction 192 KiB (x10)
  L2 Unified 12288 KiB (x10)
Load Average: 0.28, 0.07, 0.02
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       6.52 ns         6.52 ns    113974186
Popcnt64_AdjacentBits           8.05 ns         8.05 ns     86988490
Popcnt64_PG                     17.1 ns         17.1 ns     43015334
Popcnt32_BranchlessLookup       4.34 ns         4.34 ns    145674483
Popcnt32_AdjacentBits           7.27 ns         7.27 ns     96050714
Popcnt32_PG                     9.79 ns         9.79 ns     75893374
```



[0] https://github.com/google/benchmark
[1] https://dl.acm.org/doi/book/10.5555/2462741

-----
Cheers,
Andy