Using POPCNT and other advanced bit manipulation instructions - Mailing list pgsql-hackers

From David Rowley
Subject Using POPCNT and other advanced bit manipulation instructions
Date
Msg-id CAKJS1f9WTAGG1tPeJnD18hiQW5gAk59fQ6WK-vfdAKEHyRg2RA@mail.gmail.com
Whole thread Raw
Responses Re: Using POPCNT and other advanced bit manipulation instructions
Re: Using POPCNT and other advanced bit manipulation instructions
Re: Using POPCNT and other advanced bit manipulation instructions
List pgsql-hackers
Back in 2016 [1] there was some discussion about using the POPCNT
instruction to improve the performance of counting the number of bits
set in a word.  Improving this helps various cases, such as
bms_num_members and also things like counting the allvisible and
frozen pages in the visibility map.

Thomas Munro did some work to make this happen but didn't go as far as
adding the required run-time test to allow builds which were built on
a machine with these instructions to work on a machine without them.
We've now got other places in the code which have similar run-time
tests (for example CRC calculation), so I think we should be able to
do the same for the ABM instructions.

Thomas mentions in [1], to get the GCC to use the POPCNT instruction,
we must pass -mpopcnt in the build flags. After doing a bit of
research, I found [2] which mentions that some compilers have some
pattern matching code to allow the popcnt instruction to be used even
without a macro such as __builtin_popcount().  I believe I've
correctly written the run-time test to skip using the new popcnt
function, but if there's any code around that might cause the compiler
to use the popcnt instruction from pattern matching, then that might
cause problems.  Remember, that's not limited to core code since
pg_config will cause extensions to be compiled with -mpopcnt too.

I've put together a very rough patch to implement using POPCNT and the
leading and trailing 0-bit instructions to improve the performance of
bms_next_member() and bms_prev_member().  The correct function should
be determined on the first call to each function by way of setting a
function pointer variable to the most suitable supported
implementation.   I've not yet gone through and removed all the
number_of_ones[] arrays to replace with a pg_popcount*() call. That
seems to have mostly been done in Thomas' patch [3], part of which
I've used for the visibilitymap.c code changes.  If this patch proves
to be possible, then I'll look at including the other changes Thomas
made in his patch too.

What I'm really looking for by posting now are reasons why we can't do
this. I'm also interested in getting some testing done on older
machines, particularly machines with processors that are from before
2007, both AMD and Intel. 2007-2008 seems to be around the time both
AMD and Intel added support for POPCNT and LZCNT, going by [4].

I'm also a little uncertain of my cpuid bit tests. POPCNT appears to
have use bit 5 in EAX=80000001h, but also bit 23 in EAX=1 [5]. This
appears to be a variation between Intel and AMD. AMD always implement
either both POPCNT and LZCNT or neither. Where Intel use AMDs cpuid
bit flag just for LZCNT and have reserved their own flag for POPCNT
(they didn't implement both at once, as AMD did). I'm a bit uncertain
if AMD will set the Intel POPCNT flag or not, and if they do now, then
I'm not sure if they always did. Intel were 2nd in that race, so I
assume at least the earliest AMD processors would just set only the
AMD flag.  Testing might help reveal if I have this right.

I am able to measure performance gains from the patch.  In a 3.4GB
table containing a single column with just 10 statistics targets, I
got the following times after running ANALYZE on the table.

Patched:

postgres=# analyze t1;
Time: 680.833 ms
Time: 699.976 ms
Time: 695.608 ms
Time: 676.007 ms
Time: 693.487 ms
Time: 726.982 ms
Time: 677.835 ms
Time: 688.426 ms

Master:

postgres=# analyze t1;
Time: 721.837 ms
Time: 756.035 ms
Time: 734.545 ms
Time: 751.969 ms
Time: 730.140 ms
Time: 724.266 ms
Time: 713.625 ms

(+3.66% avg)

This should be down to the improved performance of
visibilitymap_count(), but it may not be entirely just from faster bit
counter as I also couldn't resist tightening up the inner-most loop.

[1]
https://www.postgresql.org/message-id/flat/CAEepm%3D3k%2B%2BYtf2LNQCvpP6m1%3DgY9zZHP_cfnn47%3DWTsoCrLCvA%40mail.gmail.com
[2] https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
[3] https://www.postgresql.org/message-id/CAEepm%3D3g1_fjJGp38QGv%2B38BC2HHVkzUq6s69nk3mWLgPHqC3A%40mail.gmail.com
[4] https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT
[5] https://en.wikipedia.org/wiki/CPUID#CPUID_usage_from_high-level_languages

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [PATCH] Improve tab completion for CREATE TABLE
Next
From: Amit Kapila
Date:
Subject: Re: [HACKERS] Block level parallel vacuum