Using POPCNT and other advanced bit manipulation instructions - Mailing list pgsql-hackers
From | David Rowley |
---|---|
Subject | Using POPCNT and other advanced bit manipulation instructions |
Date | |
Msg-id | CAKJS1f9WTAGG1tPeJnD18hiQW5gAk59fQ6WK-vfdAKEHyRg2RA@mail.gmail.com Whole thread Raw |
Responses |
Re: Using POPCNT and other advanced bit manipulation instructions
Re: Using POPCNT and other advanced bit manipulation instructions Re: Using POPCNT and other advanced bit manipulation instructions |
List | pgsql-hackers |
Back in 2016 [1] there was some discussion about using the POPCNT instruction to improve the performance of counting the number of bits set in a word. Improving this helps various cases, such as bms_num_members and also things like counting the allvisible and frozen pages in the visibility map. Thomas Munro did some work to make this happen but didn't go as far as adding the required run-time test to allow builds which were built on a machine with these instructions to work on a machine without them. We've now got other places in the code which have similar run-time tests (for example CRC calculation), so I think we should be able to do the same for the ABM instructions. Thomas mentions in [1], to get the GCC to use the POPCNT instruction, we must pass -mpopcnt in the build flags. After doing a bit of research, I found [2] which mentions that some compilers have some pattern matching code to allow the popcnt instruction to be used even without a macro such as __builtin_popcount(). I believe I've correctly written the run-time test to skip using the new popcnt function, but if there's any code around that might cause the compiler to use the popcnt instruction from pattern matching, then that might cause problems. Remember, that's not limited to core code since pg_config will cause extensions to be compiled with -mpopcnt too. I've put together a very rough patch to implement using POPCNT and the leading and trailing 0-bit instructions to improve the performance of bms_next_member() and bms_prev_member(). The correct function should be determined on the first call to each function by way of setting a function pointer variable to the most suitable supported implementation. I've not yet gone through and removed all the number_of_ones[] arrays to replace with a pg_popcount*() call. That seems to have mostly been done in Thomas' patch [3], part of which I've used for the visibilitymap.c code changes. If this patch proves to be possible, then I'll look at including the other changes Thomas made in his patch too. What I'm really looking for by posting now are reasons why we can't do this. I'm also interested in getting some testing done on older machines, particularly machines with processors that are from before 2007, both AMD and Intel. 2007-2008 seems to be around the time both AMD and Intel added support for POPCNT and LZCNT, going by [4]. I'm also a little uncertain of my cpuid bit tests. POPCNT appears to have use bit 5 in EAX=80000001h, but also bit 23 in EAX=1 [5]. This appears to be a variation between Intel and AMD. AMD always implement either both POPCNT and LZCNT or neither. Where Intel use AMDs cpuid bit flag just for LZCNT and have reserved their own flag for POPCNT (they didn't implement both at once, as AMD did). I'm a bit uncertain if AMD will set the Intel POPCNT flag or not, and if they do now, then I'm not sure if they always did. Intel were 2nd in that race, so I assume at least the earliest AMD processors would just set only the AMD flag. Testing might help reveal if I have this right. I am able to measure performance gains from the patch. In a 3.4GB table containing a single column with just 10 statistics targets, I got the following times after running ANALYZE on the table. Patched: postgres=# analyze t1; Time: 680.833 ms Time: 699.976 ms Time: 695.608 ms Time: 676.007 ms Time: 693.487 ms Time: 726.982 ms Time: 677.835 ms Time: 688.426 ms Master: postgres=# analyze t1; Time: 721.837 ms Time: 756.035 ms Time: 734.545 ms Time: 751.969 ms Time: 730.140 ms Time: 724.266 ms Time: 713.625 ms (+3.66% avg) This should be down to the improved performance of visibilitymap_count(), but it may not be entirely just from faster bit counter as I also couldn't resist tightening up the inner-most loop. [1] https://www.postgresql.org/message-id/flat/CAEepm%3D3k%2B%2BYtf2LNQCvpP6m1%3DgY9zZHP_cfnn47%3DWTsoCrLCvA%40mail.gmail.com [2] https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/ [3] https://www.postgresql.org/message-id/CAEepm%3D3g1_fjJGp38QGv%2B38BC2HHVkzUq6s69nk3mWLgPHqC3A%40mail.gmail.com [4] https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT [5] https://en.wikipedia.org/wiki/CPUID#CPUID_usage_from_high-level_languages -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
pgsql-hackers by date: