Thread: Using POPCNT and other advanced bit manipulation instructions
Back in 2016 [1] there was some discussion about using the POPCNT instruction to improve the performance of counting the number of bits set in a word. Improving this helps various cases, such as bms_num_members and also things like counting the allvisible and frozen pages in the visibility map. Thomas Munro did some work to make this happen but didn't go as far as adding the required run-time test to allow builds which were built on a machine with these instructions to work on a machine without them. We've now got other places in the code which have similar run-time tests (for example CRC calculation), so I think we should be able to do the same for the ABM instructions. Thomas mentions in [1], to get the GCC to use the POPCNT instruction, we must pass -mpopcnt in the build flags. After doing a bit of research, I found [2] which mentions that some compilers have some pattern matching code to allow the popcnt instruction to be used even without a macro such as __builtin_popcount(). I believe I've correctly written the run-time test to skip using the new popcnt function, but if there's any code around that might cause the compiler to use the popcnt instruction from pattern matching, then that might cause problems. Remember, that's not limited to core code since pg_config will cause extensions to be compiled with -mpopcnt too. I've put together a very rough patch to implement using POPCNT and the leading and trailing 0-bit instructions to improve the performance of bms_next_member() and bms_prev_member(). The correct function should be determined on the first call to each function by way of setting a function pointer variable to the most suitable supported implementation. I've not yet gone through and removed all the number_of_ones[] arrays to replace with a pg_popcount*() call. That seems to have mostly been done in Thomas' patch [3], part of which I've used for the visibilitymap.c code changes. If this patch proves to be possible, then I'll look at including the other changes Thomas made in his patch too. What I'm really looking for by posting now are reasons why we can't do this. I'm also interested in getting some testing done on older machines, particularly machines with processors that are from before 2007, both AMD and Intel. 2007-2008 seems to be around the time both AMD and Intel added support for POPCNT and LZCNT, going by [4]. I'm also a little uncertain of my cpuid bit tests. POPCNT appears to have use bit 5 in EAX=80000001h, but also bit 23 in EAX=1 [5]. This appears to be a variation between Intel and AMD. AMD always implement either both POPCNT and LZCNT or neither. Where Intel use AMDs cpuid bit flag just for LZCNT and have reserved their own flag for POPCNT (they didn't implement both at once, as AMD did). I'm a bit uncertain if AMD will set the Intel POPCNT flag or not, and if they do now, then I'm not sure if they always did. Intel were 2nd in that race, so I assume at least the earliest AMD processors would just set only the AMD flag. Testing might help reveal if I have this right. I am able to measure performance gains from the patch. In a 3.4GB table containing a single column with just 10 statistics targets, I got the following times after running ANALYZE on the table. Patched: postgres=# analyze t1; Time: 680.833 ms Time: 699.976 ms Time: 695.608 ms Time: 676.007 ms Time: 693.487 ms Time: 726.982 ms Time: 677.835 ms Time: 688.426 ms Master: postgres=# analyze t1; Time: 721.837 ms Time: 756.035 ms Time: 734.545 ms Time: 751.969 ms Time: 730.140 ms Time: 724.266 ms Time: 713.625 ms (+3.66% avg) This should be down to the improved performance of visibilitymap_count(), but it may not be entirely just from faster bit counter as I also couldn't resist tightening up the inner-most loop. [1] https://www.postgresql.org/message-id/flat/CAEepm%3D3k%2B%2BYtf2LNQCvpP6m1%3DgY9zZHP_cfnn47%3DWTsoCrLCvA%40mail.gmail.com [2] https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/ [3] https://www.postgresql.org/message-id/CAEepm%3D3g1_fjJGp38QGv%2B38BC2HHVkzUq6s69nk3mWLgPHqC3A%40mail.gmail.com [4] https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT [5] https://en.wikipedia.org/wiki/CPUID#CPUID_usage_from_high-level_languages -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 20/12/2018 18:53, David Rowley wrote [...] > Patched: > > postgres=# analyze t1; > Time: 680.833 ms > Time: 699.976 ms > Time: 695.608 ms > Time: 676.007 ms > Time: 693.487 ms > Time: 726.982 ms > Time: 677.835 ms > Time: 688.426 ms > > Master: > > postgres=# analyze t1; > Time: 721.837 ms > Time: 756.035 ms > Time: 734.545 ms > Time: 751.969 ms > Time: 730.140 ms > Time: 724.266 ms > Time: 713.625 ms > > (+3.66% avg) [...] Looking at the normalized standard deviations, the patched results have a higher than 5% chance of being better simply by chance. I suspect that you have made an improvement, but the statistics are not convincing. I can supply detailed working if you want. Cheers, Gavin
On Thu, 20 Dec 2018 at 20:17, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote: > Looking at the normalized standard deviations, the patched results have > a higher than 5% chance of being better simply by chance. I suspect > that you have made an improvement, but the statistics are not convincing. Yeah, I'd hoped that I could have gotten a better signal to noise ratio by running the test many times, but you're right. That was on my laptop. I've run the test again on an AWS instance and the results seem to be a bit more stable. Same table with 1 int column and 100m rows. statistics set to 10. Unpatched postgres=# analyze a; Time: 38.248 ms Time: 35.185 ms Time: 35.067 ms Time: 34.879 ms Time: 34.816 ms Time: 34.558 ms Time: 34.722 ms Time: 34.427 ms Time: 34.214 ms Time: 34.301 ms Time: 35.751 ms Time: 33.993 ms Time: 33.880 ms Time: 33.617 ms Time: 33.381 ms Time: 33.326 ms Patched: postgres=# analyze a; Time: 34.421 ms Time: 33.523 ms Time: 33.230 ms Time: 33.678 ms Time: 32.987 ms Time: 32.914 ms Time: 33.165 ms Time: 32.707 ms Time: 32.645 ms Time: 32.814 ms Time: 32.082 ms Time: 32.143 ms Time: 32.310 ms Time: 31.966 ms Time: 31.702 ms Time: 32.089 ms Avg +5.72%, Median +5.29% -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> On Thu, Dec 20, 2018 at 6:53 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > > Thomas mentions in [1], to get the GCC to use the POPCNT instruction, > we must pass -mpopcnt in the build flags. After doing a bit of > research, I found [2] which mentions that some compilers have some > pattern matching code to allow the popcnt instruction to be used even > without a macro such as __builtin_popcount(). I believe I've > correctly written the run-time test to skip using the new popcnt > function, but if there's any code around that might cause the compiler > to use the popcnt instruction from pattern matching, then that might > cause problems. I've checked for Clang 6, it turns out that indeed it generates popcnt without any macro, but only in one place for bloom_prop_bits_set. After looking at this function it seems that it would be benefitial to actually use popcnt there too. > I am able to measure performance gains from the patch. In a 3.4GB > table containing a single column with just 10 statistics targets, I > got the following times after running ANALYZE on the table. I've tested it too a bit, and got similar results when the patched version is slightly faster. But then I wonder if popcnt is the best solution here, since after some short research I found a paper [1], where authors claim that: Maybe surprisingly, we show that a vectorized approach using SIMD instructions can be twice as fast as using the dedicated instructions on recent Intel processors. [1]: https://arxiv.org/pdf/1611.07612.pdf
On 20/12/18 6:53, David Rowley wrote: > Back in 2016 [1] there was some discussion about using the POPCNT > instruction to improve the performance of counting the number of bits > set in a word. Improving this helps various cases, such as > bms_num_members and also things like counting the allvisible and > frozen pages in the visibility map. > > [snip] > > I've put together a very rough patch to implement using POPCNT and the > leading and trailing 0-bit instructions to improve the performance of > bms_next_member() and bms_prev_member(). The correct function should > be determined on the first call to each function by way of setting a > function pointer variable to the most suitable supported > implementation. I've not yet gone through and removed all the > number_of_ones[] arrays to replace with a pg_popcount*() call. IMVHO: Please do not disregard potential optimization by the compiler around those calls.. o_0 That might explain the reduced performance improvement observed. Not that I can see any obvious alternative to your implementation right away ... > That > seems to have mostly been done in Thomas' patch [3], part of which > I've used for the visibilitymap.c code changes. If this patch proves > to be possible, then I'll look at including the other changes Thomas > made in his patch too. > > What I'm really looking for by posting now are reasons why we can't do > this. I'm also interested in getting some testing done on older > machines, particularly machines with processors that are from before > 2007, both AMD and Intel. I can offer a 2005-vintage Opteron 2216 rev3 (bought late 2007) to test on. Feel free to toss me some test code. cpuinfo flags: fpu de tsc msr pae mce cx8 apic mca cmov pat clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good nopl extd_apicid eagerfpu pni cx16 hypervisor lahf_lm cmp_legacy 3dnowprefetch vmmcall > 2007-2008 seems to be around the time both > AMD and Intel added support for POPCNT and LZCNT, going by [4]. Thanks
Thanks for looking at this. On Thu, 20 Dec 2018 at 23:56, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > I've checked for Clang 6, it turns out that indeed it generates popcnt without > any macro, but only in one place for bloom_prop_bits_set. After looking at this > function it seems that it would be benefitial to actually use popcnt there too. Yeah, that's the pattern that's mentioned in https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/ It would need to be changed to call the popcount function. This existing makes me a bit more worried that some extension could be using a similar pattern and end up being compiled with -mpopcnt due to pg_config having that CFLAG. That's all fine until the binary makes it's way over to a machine without that instruction. > > I am able to measure performance gains from the patch. In a 3.4GB > > table containing a single column with just 10 statistics targets, I > > got the following times after running ANALYZE on the table. > > I've tested it too a bit, and got similar results when the patched version is > slightly faster. But then I wonder if popcnt is the best solution here, since > after some short research I found a paper [1], where authors claim that: > > Maybe surprisingly, we show that a vectorized approach using SIMD > instructions can be twice as fast as using the dedicated instructions on > recent Intel processors. > > > [1]: https://arxiv.org/pdf/1611.07612.pdf I can't imagine that using the number_of_ones[] array processing 8-bits at a time would be slower than POPCNT though. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, 20 Dec 2018 at 23:59, Jose Luis Tallon <jltallon@adv-solutions.net> wrote: > IMVHO: Please do not disregard potential optimization by the compiler > around those calls.. o_0 That might explain the reduced performance > improvement observed. It was a speedup that I measured. Did you see something else? > > What I'm really looking for by posting now are reasons why we can't do > > this. I'm also interested in getting some testing done on older > > machines, particularly machines with processors that are from before > > 2007, both AMD and Intel. > > I can offer a 2005-vintage Opteron 2216 rev3 (bought late 2007) to test > on. Feel free to toss me some test code. > > cpuinfo flags: fpu de tsc msr pae mce cx8 apic mca cmov pat clflush > mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow > rep_good nopl extd_apicid eagerfpu pni cx16 hypervisor lahf_lm > cmp_legacy 3dnowprefetch vmmcall > > > 2007-2008 seems to be around the time both > > AMD and Intel added support for POPCNT and LZCNT, going by [4]. It would be really good if you could git clone a copy of master and patch it with the patch from earlier in the thread and see if you encounter any issues running make check-world. I'm a bit uncertain if passing -mpopcnt to a recent gcc would result in the popcnt instruction being compiled in if the machine doing the compiling had no support for that. Likely it would be simple to test that with: echo "int main(char **argv, int argc) { return __builtin_popcount(argc); }" > popcnt.c && gcc popcnt.c -S -mpopcnt && cat popcnt.s | grep pop I see a "popcntl" in there on my machine. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> On Fri, Jan 4, 2019 at 1:38 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Thu, 20 Dec 2018 at 23:56, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > I've checked for Clang 6, it turns out that indeed it generates popcnt without > > any macro, but only in one place for bloom_prop_bits_set. After looking at this > > function it seems that it would be benefitial to actually use popcnt there too. > > Yeah, that's the pattern that's mentioned in > https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/ > It would need to be changed to call the popcount function. This > existing makes me a bit more worried that some extension could be > using a similar pattern and end up being compiled with -mpopcnt due to > pg_config having that CFLAG. That's all fine until the binary makes > it's way over to a machine without that instruction. It surprises me, that it's not that obvious how to disable this feature for clang. I guess one should be able to turn it off by invoking opt manually: clang -S -mpopcnt -emit-llvm *.c opt -S -mattr=+popcnt <all the options without -loop-idiom> *.ll llc -mattr=+popcnt *.optimized.ll clang -mpopcnt *optimized.s But for some reason this doesn't work for me (popcnt is not appearing in the first place). > > > I am able to measure performance gains from the patch. In a 3.4GB > > > table containing a single column with just 10 statistics targets, I > > > got the following times after running ANALYZE on the table. > > > > I've tested it too a bit, and got similar results when the patched version is > > slightly faster. But then I wonder if popcnt is the best solution here, since > > after some short research I found a paper [1], where authors claim that: > > > > Maybe surprisingly, we show that a vectorized approach using SIMD > > instructions can be twice as fast as using the dedicated instructions on > > recent Intel processors. > > > > > > [1]: https://arxiv.org/pdf/1611.07612.pdf > > I can't imagine that using the number_of_ones[] array processing > 8-bits at a time would be slower than POPCNT though. Yeah, probably you're right. If I understand correctly even with the lookup table in the cache the access would be a bit slower than a POPCNT instruction.
I only have cosmetic suggestions for this patch. For one thing, I think the .c file should be in src/port and its header should be in src/include/port/, right beside the likes of pg_bswap.h and pg_crc32c.h. For another, I think the arrangement of all those "ifdef HAVE_THIS_OR_THAT" in the bitutils.c file is a bit hard to read. I'd lay them out like this: #ifdef HAVE__BUILTIN_CTZ int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_choose; #else int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_slow; #endif #ifdef HAVE__BUILTIN_CTZ /* * This gets called on the first call. It replaces the function pointer * so that subsequent calls are routed directly to the chosen implementation. */ static int pg_rightmost_one32_choose(uint32 word) { ... (You need declarations for the "choose" variants at the top of the file, but that seems okay.) Finally, the part in bitmapset.c is repetitive on the #ifdefs; I'd just put at the top of the file something like #if bms are 32 bits #define pg_rightmost_one(x) pg_rightmost_one32(x) #define pg_popcount(x) pg_popcount32(x) #elif they are 64 bits #define ... #else #error ... #endif This way, each place that uses the functions does not need the ifdefs. Other than those minor changes, I think we should just get this pushed and see what the buildfarm thinks. In the words of a famous PG hacker: if a platform ain't in the buildfarm, we don't support it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 31, 2019 at 07:45:02PM -0300, Alvaro Herrera wrote: > Other than those minor changes, I think we should just get this pushed > and see what the buildfarm thinks. In the words of a famous PG hacker: > if a platform ain't in the buildfarm, we don't support it. Moved to next CF, waiting on author. I think that this needs more reviews. -- Michael
Attachment
Thanks for looking at this. On Fri, 1 Feb 2019 at 11:45, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > I only have cosmetic suggestions for this patch. For one thing, I think > the .c file should be in src/port and its header should be in > src/include/port/, right beside the likes of pg_bswap.h and pg_crc32c.h. I've moved the code into src/port and renamed the file to pg_bitutils.c > For another, I think the arrangement of all those "ifdef > HAVE_THIS_OR_THAT" in the bitutils.c file is a bit hard to read. I'd > lay them out like this: I've made this change too, although when doing it I realised that I had forgotten to include the check for CPUID. It's possible that does not exist but POPCNT does, I guess. This has made the #ifs a bit more complex. > Finally, the part in bitmapset.c is repetitive on the #ifdefs; I'd just > put at the top of the file something like Yeah, agreed. Much neater that way. > Other than those minor changes, I think we should just get this pushed > and see what the buildfarm thinks. In the words of a famous PG hacker: > if a platform ain't in the buildfarm, we don't support it. I also made a number of other changes to the patch. 1. The patch now only uses the -mpopcnt CFLAG for pg_bitutils.c. I thought this was important so we don't expose that flag in pg_config and possibly end up building extension with popcnt instructions, which might not be portable to other older hardware. 2. Wrote a new pg_popcnt function that accepts an array of bytes and a size variable. This seems useful for the bloomfilter use. There are still various number_of_ones[] arrays around the codebase. These exist in tsgistidx.c, _intbig_gist.c and _ltree_gist.c. It would be nice to get rid of those too, but one of the usages in each of those 3 files requires XORing with another bit array before counting the bits. I thought about maybe writing a pop_count_xor() function that accepts 2 byte arrays and a length parameter, but it seems a bit special case, so I didn't. Another thing I wasn't sure of was if I should just have bms_num_members() just call pg_popcount(). It might be worth benchmarking to see what's faster. My thinking is that pg_popcount will inline the pg_popcount64() call so it would mean a single function call rather than one for each bitmapword in the set. I've compiled and run make check-world on Linux with GCC7.3 and clang6.0. I've also tested on MSVC to ensure I didn't break windows. It would be good to get a few more people to compile it and run the tests. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2019-Feb-04, David Rowley wrote: > On Fri, 1 Feb 2019 at 11:45, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > I only have cosmetic suggestions for this patch. For one thing, I think > > the .c file should be in src/port and its header should be in > > src/include/port/, right beside the likes of pg_bswap.h and pg_crc32c.h. > > I've moved the code into src/port and renamed the file to pg_bitutils.c I've pushed this now. Let's see what the buildfarm has to say about it. > I've compiled and run make check-world on Linux with GCC7.3 and > clang6.0. I've also tested on MSVC to ensure I didn't break windows. > It would be good to get a few more people to compile it and run the > tests. That's what the buildfarm is for ... -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > I've pushed this now. Let's see what the buildfarm has to say about it. It's likely to be hard to tell, given the amount of pink from the Ryu patch. If Andrew is not planning to clean that up PDQ, I'd suggest reverting that patch pending having some repairs for it. regards, tom lane
Hi, On February 13, 2019 8:40:14 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> I've pushed this now. Let's see what the buildfarm has to say about >it. > >It's likely to be hard to tell, given the amount of pink from the Ryu >patch. If Andrew is not planning to clean that up PDQ, I'd suggest >reverting that patch pending having some repairs for it. I'd assume that breaking bit counting would cause distinct enough damage (compile time or crashes). That's not to say thatreverting ryu shouldn't be considered (although I'm not that bothered by cross version, ia64 and cygwin failures, especiallybecause the latter two might be hard to come by outside the bf). Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes: > I'd assume that breaking bit counting would cause distinct enough damage (compile time or crashes). That's not to saythat reverting ryu shouldn't be considered (although I'm not that bothered by cross version, ia64 and cygwin failures,especially because the latter two might be hard to come by outside the bf). The pink doesn't appear to be limited to non-mainstream platforms, see eg lapwing, fulmar. However, I see Andrew just pushed some fixes, so this argument is moot pending how much that helps. regards, tom lane
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes: >> It's likely to be hard to tell, given the amount of pink from the >> Ryu patch. If Andrew is not planning to clean that up PDQ, Besides crake (x-version), fulmar (icc) and lorikeet (cygwin), I hope the rest of the known failures should pass on the next cycle; the mac/ppc failures are because we redefine "bool" in a way that broke the upstream code's c99 assumptions, and the rest are numerical instability in ts_rank. >> I'd suggest reverting that patch pending having some repairs for it. Andres> I'd assume that breaking bit counting would cause distinct Andres> enough damage (compile time or crashes). That's not to say that Andres> reverting ryu shouldn't be considered (although I'm not that Andres> bothered by cross version, ia64 and cygwin failures, especially Andres> because the latter two might be hard to come by outside the Andres> bf). IA64 is working fine as far as I can see (specifically, anole is passing); it's ICC on x86_64 that broke (fulmar). -- Andrew (irc:RhodiumToad)
On 2019-Feb-13, Andres Freund wrote: > Hi, > > On February 13, 2019 8:40:14 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >Alvaro Herrera <alvherre@2ndquadrant.com> writes: > >> I've pushed this now. Let's see what the buildfarm has to say about > >it. > > > >It's likely to be hard to tell, given the amount of pink from the Ryu > >patch. If Andrew is not planning to clean that up PDQ, I'd suggest > >reverting that patch pending having some repairs for it. > > I'd assume that breaking bit counting would cause distinct enough > damage (compile time or crashes). I was a bit surprised to find out that the assembly generated by compiling the code in test for __builtin_foo() does not actually include the calls being tested ... (they're only used to generate the value for a static variable, and that gets optimized away); but then the comment for the test does say that we're only testing that the compiler understands the construct, so I suppose that's fine. Also, we already do that for bswap. This "compiler explorer" tool is nice: https://gcc.godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(j:1,source:'int+main(int+argc+,+char**+argv)%0A%7B%0A%0A++return++__builtin_popcountll((unsigned)argc)%3B%0A%0A%7D'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:45.01119572686435,l:'4',m:100,n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:g447,filters:(b:'0',commentOnly:'0',directives:'0',intel:'0'),libs:!(),options:'-Wall+-O3+-msse4.2',source:1),l:'5',n:'0',o:'x86-64+gcc+4.4.7+(Editor+%231,+Compiler+%231)',t:'0')),k:54.98880427313565,l:'4',m:100,n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4 -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes: Andrew> IA64 is working fine as far as I can see (specifically, anole Andrew> is passing); it's ICC on x86_64 that broke (fulmar). And I know what's wrong on fulmar now, so that'll be fixed shortly. -- Andrew (irc:RhodiumToad)
>>>>> "Alvaro" == Alvaro Herrera <alvherre@2ndquadrant.com> writes: Alvaro> I've pushed this now. Let's see what the buildfarm has to say Alvaro> about it. Lapwing's latest failure looks like yours rather than mine now? (the previous two were mine) -- Andrew (irc:RhodiumToad)
On 2019-Feb-13, Andrew Gierth wrote: > >>>>> "Alvaro" == Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > Alvaro> I've pushed this now. Let's see what the buildfarm has to say > Alvaro> about it. > > Lapwing's latest failure looks like yours rather than mine now? (the > previous two were mine) It definitely is ... plans have changed from using IndexOnly scans to Seqscans, which is likely fallout from the visibilitymap_count() change. Looking. (I patched the Makefile to add -mpopcnt to all the compile lines rather than just the frontend one, but I can't see that explaining the failure.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>>>> "Alvaro" == Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> Lapwing's latest failure looks like yours rather than mine now? (the >> previous two were mine) Alvaro> It definitely is ... plans have changed from using IndexOnly Alvaro> scans to Seqscans, which is likely fallout from the Alvaro> visibilitymap_count() change. Looking. As for the rest, crake's "configure" failure was due to Andrew aborting a run presumably to tweak the config, and fulmar finished a run just before I committed the fix that should turn it green again. I'm obviously going to keep watching, but to my knowledge only crake (x-version test) and lorikeet (cygwin) should still be broken from my stuff. -- Andrew (irc:RhodiumToad)
On 2019-Feb-13, Alvaro Herrera wrote: > It definitely is ... plans have changed from using IndexOnly scans to > Seqscans, which is likely fallout from the visibilitymap_count() change. I think the problem here is that "unsigned long" is 32 bits in this machine: checking whether long int is 64 bits... no and we have defined pg_popcount64() like this: static int pg_popcount64_sse42(uint64 word) { return __builtin_popcountl(word); } so it's counting bits in the lower half of the uint64. If that's correct, then I think we need something like this patch. But it makes me wonder whether we need a configure test for __builtin_popcountll() and friends. I wonder if there's any compiler that implements __builtin_popcountl() but not __builtin_popcountll() ... and if not, then the test for __builtin_popcountl() should be removed, and have everything rely on the one for __builtin_popcount(). -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > and we have defined pg_popcount64() like this: > static int > pg_popcount64_sse42(uint64 word) > { > return __builtin_popcountl(word); > } That is clearly completely broken. > If that's correct, then I think we need something like this patch. But > it makes me wonder whether we need a configure test for > __builtin_popcountll() and friends. I wonder if there's any compiler > that implements __builtin_popcountl() but not __builtin_popcountll() ... > and if not, then the test for __builtin_popcountl() should be removed, > and have everything rely on the one for __builtin_popcount(). AFAICS, this is a gcc-ism, and it looks like they've probably had all width variants for the same amount of time. I'd take out the test for __builtin_popcountl(), and assume that testing for __builtin_popcount() is sufficient until proven differently. regards, tom lane
... btw, why is pg_popcount casting away the const from its pointer argument? regards, tom lane
On 2019-Feb-13, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > and we have defined pg_popcount64() like this: > > > static int > > pg_popcount64_sse42(uint64 word) > > { > > return __builtin_popcountl(word); > > } > > That is clearly completely broken. Pushed my proposed fix, which includes removing the configure tests for builtins of varying widths. I couldn't resist sorting entries alphabetically in configure.in. (I also used autoheader to produce the new pg_config.h, which showed me that David had not used it to generate his diffs there.) For pg_config.h.win32 I used the compiler explorer tool I just learned about, and came to the conclusion that MSVC's compiler does not implement these builtins. I didn't do anything about the const-cast-away in pg_popcount() yet. I think that should use PointerIsAligned() instead of what it's doing now. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
And, while I'm complaining: why the devil is use of the compiler builtins gated by HAVE__GET_CPUID? This is unbelievably Intel-centric, because it prevents use of the builtins on other architectures. If the builtin exists, we should use it, full stop. There's no reason to expect that it would be slower than hand-rolled code, regardless of the architecture. I'd be inclined to rip out all of the run-time-detection logic here; I doubt any of it is buying anything that's worth the price of an indirect call. regards, tom lane
On Thu, Feb 14, 2019 at 4:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > And, while I'm complaining: why the devil is use of the compiler builtins > gated by HAVE__GET_CPUID? This is unbelievably Intel-centric, because > it prevents use of the builtins on other architectures. If the builtin > exists, we should use it, full stop. There's no reason to expect that it > would be slower than hand-rolled code, regardless of the architecture. FWIW a quick test of __builtin_popcount(n) compiles as CNT on a Debian ARM system, without any special compiler flags. > I'd be inclined to rip out all of the run-time-detection logic here; > I doubt any of it is buying anything that's worth the price of an > indirect call. No view on that but apparently there were Intel Atom and AMD C chips sold in the early part of this decade that lack POPCNT so I suspect the distros can't ship software that requires it with no fallback. -- Thomas Munro http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes: > On Thu, Feb 14, 2019 at 4:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'd be inclined to rip out all of the run-time-detection logic here; >> I doubt any of it is buying anything that's worth the price of an >> indirect call. > No view on that but apparently there were Intel Atom and AMD C chips > sold in the early part of this decade that lack POPCNT so I suspect > the distros can't ship software that requires it with no fallback. Ah, I was not looking at the business with the optional -mpopcnt compiler flag. I agree that we probably should not assume that code compiled with that will run anywhere. But it's silly to build all this infrastructure and then throw away the opportunity to optimize for anything but late-model Intel. A survey of the buildfarm results so far says that __builtin_clz and __builtin_ctz exist just about everywhere, and even __builtin_popcount is available on some non-Intel architectures. It is reasonable to assume that those builtins are faster than the C equivalents if they exist. It's reasonable to assume that even on old-school Intel hardware. The way this should have been done is to have a separate file that's compiled with -mpopcnt if the compiler has that (and has the builtins), and for the mainline file to have "slow" versions that use the less-optimized builtins if available, and only fall back to raw C code if not HAVE__BUILTIN_WHATEVER. Also, in #if defined(HAVE__GET_CPUID) && defined(HAVE__BUILTIN_POPCOUNT) static bool pg_popcount_available(void) { unsigned int exx[4] = { 0, 0, 0, 0 }; #if defined(HAVE__GET_CPUID) __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); #elif defined(HAVE__CPUID) __cpuid(exx, 1); #else #error cpuid instruction not available #endif return (exx[2] & (1 << 23)) != 0; /* POPCNT */ } #endif it's obvious to the naked eye that the __cpuid() and #error branches are unreachable because of the outer #if. I don't think that was the design intention. regards, tom lane
Some further thoughts here ... Does the "lzcnt" runtime probe actually do anything useful? On the x86_64 compilers I tried (gcc 8.2.1 and 4.4.7), __builtin_clz and __builtin_ctz compile to sequences involving bsrq and bsfq regardless of -mpopcnt. It's fairly hard to see how lzcnt would buy anything over those sequences even if there were zero overhead involved in using it. Alvaro noted that the test programs used by c-compiler.m4 fail to produce any actual code involving the builtin, because of compile-time constant folding. This seems pretty unwise. I see that on my x86_64 compilers, without -mpopcnt, __builtin_popcnt compiles to a call of some libgcc function or other. It's conceivable that on an (arguably misconfigured) platform, these c-compiler.m4 tests would pass yet the build fails at link because libgcc lacks the needed infrastructure. These tests should be coded in a way that doesn't allow the call to be optimized away -- cf comments for PGAC_C_BUILTIN_OP_OVERFLOW. Also, it's starting to seem like we have enough probes for compiler builtins that we should fold them to use one set of infrastructure. There are some like __builtin_constant_p that probably do need their own custom tests, but these ones that just verify that a call compiles seem pretty duplicative ... regards, tom lane
On 14/02/2019 11:17, Alvaro Herrera wrote: > On 2019-Feb-13, Alvaro Herrera wrote: > >> It definitely is ... plans have changed from using IndexOnly scans to >> Seqscans, which is likely fallout from the visibilitymap_count() change. > I think the problem here is that "unsigned long" is 32 bits in this > machine: [...] From my memory of reading of K&R many moons ago, it said that C only guarantees that the lengths are such that byte <= half word <= word <= long But I don't recall ever seeing a long less than 32 bits! Cheers, Gavin
Gavin Flower <GavinFlower@archidevsys.co.nz> writes: > From my memory of reading of K&R many moons ago, it said that C only > guarantees that the lengths are such that > byte <= half word <= word <= long Indeed. > But I don't recall ever seeing a long less than 32 bits! I'm not sure offhand what C89 said, but C99 requires "short" to be at least 16 bits, "long" to be at least 32 bits, and "long long" to be at least 64; see the minimum allowed values for SHRT_MAX etc. C99 does permit "int" to be only 16 bits, but Postgres doesn't pretend to work on such an architecture, and nobody's made one since the (early?) 90s. regards, tom lane
On 2019-Feb-14, Tom Lane wrote: > Some further thoughts here ... > > Does the "lzcnt" runtime probe actually do anything useful? > On the x86_64 compilers I tried (gcc 8.2.1 and 4.4.7), __builtin_clz > and __builtin_ctz compile to sequences involving bsrq and bsfq > regardless of -mpopcnt. It's fairly hard to see how lzcnt would > buy anything over those sequences even if there were zero overhead > involved in using it. Hah, I just realized you have to add -mlzcnt in order for these builtins to use the lzcnt instructions. It goes from something like bsrq %rax, %rax xorq $63, %rax to lzcntq %rax, %rax Significant? I have this patch now, written before I realized the above; I'll augment it to cater to this (adding -mlzcnt and a new set of functions -- perhaps a new file "lzcnt.c" or maybe put the lot in pg_popcount.c and rename it?) and resubmit after an errand I have to run now. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Hah, I just realized you have to add -mlzcnt in order for these builtins > to use the lzcnt instructions. It goes from something like > bsrq %rax, %rax > xorq $63, %rax > to > lzcntq %rax, %rax > Significant? I'd bet a fair amount of money that we'd be better off *not* using lzcnt, even if available, because then we could just expose things along this line: static inline int pg_clz(...) { #ifdef HAVE__BUILTIN_CLZ return __builtin_clz(x); #else handwritten implementation; #endif } Avoiding a function call (that has to indirect through a pointer) probably saves much more than the difference between lzcnt and the other way. The tradeoff might be different for popcount, though, especially since it looks like __builtin_popcount() is not nearly as widely available as the clz/ctz builtins. regards, tom lane
Hi, On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote: > On 2019-Feb-14, Tom Lane wrote: > > > Some further thoughts here ... > > > > Does the "lzcnt" runtime probe actually do anything useful? > > On the x86_64 compilers I tried (gcc 8.2.1 and 4.4.7), __builtin_clz > > and __builtin_ctz compile to sequences involving bsrq and bsfq > > regardless of -mpopcnt. It's fairly hard to see how lzcnt would > > buy anything over those sequences even if there were zero overhead > > involved in using it. > > Hah, I just realized you have to add -mlzcnt in order for these builtins > to use the lzcnt instructions. It goes from something like > > bsrq %rax, %rax > xorq $63, %rax I'm confused how this is a general count leading zero operation? Did you use constants or something that allowed ot infer a range in the test? If so the compiler probably did some optimizations allowing it to do the above. > to > lzcntq %rax, %rax > > Significant? If I understand Agner's tables correctly, then no, this isn't faster than the two instructions above. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote: >> Hah, I just realized you have to add -mlzcnt in order for these builtins >> to use the lzcnt instructions. It goes from something like >> >> bsrq %rax, %rax >> xorq $63, %rax > I'm confused how this is a general count leading zero operation? Did you > use constants or something that allowed ot infer a range in the test? If > so the compiler probably did some optimizations allowing it to do the > above. No. If you compile int myclz(unsigned long long x) { return __builtin_clzll(x); } at -O2, on just about any x86_64 gcc, you will get myclz: .LFB1: .cfi_startproc bsrq %rdi, %rax xorq $63, %rax ret .cfi_endproc regards, tom lane
On 2019-Feb-14, Tom Lane wrote: > I'd bet a fair amount of money that we'd be better off *not* using > lzcnt, even if available, because then we could just expose things > along this line: > > static inline int > pg_clz(...) > { > #ifdef HAVE__BUILTIN_CLZ > return __builtin_clz(x); > #else > handwritten implementation; > #endif > } > > Avoiding a function call (that has to indirect through a pointer) probably > saves much more than the difference between lzcnt and the other way. I see ... makes sense. That leads me to the attached patch. It creates a new file pg_popcount.c which is the only one compiled with -mpopcnt (if available); if there's no compiler switch to enable POPCNT, we just don't compile the file. I'm not sure that's kosher -- in particular I'm not sure if it can fail when POPCNT is enabled by other flags and -mpopcnt is not needed at all. I think our c-compiler.m4 stuff is a bit too simplistic there: it just assumes that -mpopcnt is always required. But what if the user passes it in CFLAGS? I left CPUID alone for the CLZ/CTZ builtins. So we either use the table, or the builtins. We never try the instructions. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2019-Feb-14, Tom Lane wrote: > static inline int > pg_clz(...) Hmm, I missed this bit. So we put all these functions in the header, as in the attached. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > That leads me to the attached patch. It creates a new file > pg_popcount.c which is the only one compiled with -mpopcnt (if > available); if there's no compiler switch to enable POPCNT, we just > don't compile the file. I'm not sure that's kosher -- in particular I'm > not sure if it can fail when POPCNT is enabled by other flags and > -mpopcnt is not needed at all. I think our c-compiler.m4 stuff is a bit > too simplistic there: it just assumes that -mpopcnt is always required. Yes, the configure test for this stuff is really pretty broken. It's conflating two nearly independent questions: (1) does the compiler have __builtin_popcount(), and (2) does the compiler accept -mpopcnt. It is certainly the case that (1) may hold without (2); in fact, every recent non-x86_64 gcc is a counterexample to how that's done in HEAD. I think we need a clean test for __builtin_popcount(), and to be willing to use it if available, independently of -mpopcnt. Then separately we should test to see if -mpopcnt works, probably with the same infrastructure we use for checking for other compiler flags, viz # Optimization flags for specific files that benefit from vectorization PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-funroll-loops]) PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-ftree-vectorize]) + # Optimization flags for bit-twiddling + PGAC_PROG_CC_VAR_OPT(CFLAGS_POPCNT, [-mpopcnt]) # We want to suppress clang's unhelpful unused-command-line-argument warnings Then the correct test to see if we want to build pg_popcount.c (BTW, please pick a less generic name for that) and the choose function is whether we have *both* HAVE__BUILTIN_POPCOUNT and nonempty CFLAGS_POPCNT. I don't think this'd be fooled by user-specified CFLAGS. The worst possible outcome is that it builds a function that we intended would use POPCNT but it's falling back to some other implementation, in case the compiler has a switch named -mpopcnt but it doesn't do what we think it does, or the user overrode things by adding -mno-popcnt. That would really be nearly cost-free, other than the overhead of the choose function the first time through: both of the execution functions would be reducing to __builtin_popcount(), for whatever version of that the compiler is giving us, so the choice wouldn't matter. regards, tom lane
On 2019-Feb-14, Tom Lane wrote: > I think we need a clean test for __builtin_popcount(), and to be willing > to use it if available, independently of -mpopcnt. Then separately we > should test to see if -mpopcnt works, probably with the same > infrastructure we use for checking for other compiler flags, viz > > # Optimization flags for specific files that benefit from vectorization > PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-funroll-loops]) > PGAC_PROG_CC_VAR_OPT(CFLAGS_VECTOR, [-ftree-vectorize]) > + # Optimization flags for bit-twiddling > + PGAC_PROG_CC_VAR_OPT(CFLAGS_POPCNT, [-mpopcnt]) > # We want to suppress clang's unhelpful unused-command-line-argument warnings > > Then the correct test to see if we want to build pg_popcount.c (BTW, > please pick a less generic name for that) and the choose function > is whether we have *both* HAVE__BUILTIN_POPCOUNT and nonempty > CFLAGS_POPCNT. Yeah, this works. I'll post the patch tomorrow. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Thu, 14 Feb 2019 16:45:38 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <822.1550180738@sss.pgh.pa.us> > Andres Freund <andres@anarazel.de> writes: > > On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote: > >> Hah, I just realized you have to add -mlzcnt in order for these builtins > >> to use the lzcnt instructions. It goes from something like > >> > >> bsrq %rax, %rax > >> xorq $63, %rax > > > I'm confused how this is a general count leading zero operation? Did you > > use constants or something that allowed ot infer a range in the test? If > > so the compiler probably did some optimizations allowing it to do the > > above. > > No. If you compile > > int myclz(unsigned long long x) > { > return __builtin_clzll(x); > } > > at -O2, on just about any x86_64 gcc, you will get > > myclz: > .LFB1: > .cfi_startproc > bsrq %rdi, %rax > xorq $63, %rax > ret > .cfi_endproc > I understand that the behavior of __builtin_c[tl]z(0) is undefined from the reason, they convert to bs[rf]. So if we use these builtins, additional check is required. https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html > Built-in Function: int __builtin_clz (unsigned int x) > Returns the number of leading 0-bits in x, starting at the most > significant bit position. If x is 0, the result is undefined. > > Built-in Function: int __builtin_ctz (unsigned int x) > Returns the number of trailing 0-bits in x, starting at the > least significant bit position. If x is 0, the result is > undefined. And even worse lzcntx is accidentially "fallback"s to bsrx on unsupported CPUs, which leads to bogus results. __builtin_clzll(8) = 3, which should be 60. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2019-Feb-15, Kyotaro HORIGUCHI wrote: > I understand that the behavior of __builtin_c[tl]z(0) is > undefined from the reason, they convert to bs[rf]. So if we use > these builtins, additional check is required. > > https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html Okay -- the functions check for a 0 argument: +static inline int +pg_rightmost_one32(uint32 word) +{ + int result = 0; + + Assert(word != 0); + +#ifdef HAVE__BUILTIN_CTZ + result = __builtin_ctz(word); +#else + while ((word & 255) == 0) + { + word >>= 8; + result += 8; + } + result += rightmost_one_pos[word & 255]; +#endif /* HAVE__BUILTIN_CTZ */ + + return result; +} so we're fine. > And even worse lzcntx is accidentially "fallback"s to bsrx on > unsupported CPUs, which leads to bogus results. > __builtin_clzll(8) = 3, which should be 60. I'm not sure I understand this point. Are you saying that if we use -mlzcnt to compile, then the compiler builtin is broken in platforms that don't support the lzcnt instruction? That's horrible. Let's stay away from -mlzcnt then. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Feb-15, Alvaro Herrera wrote: > On 2019-Feb-15, Kyotaro HORIGUCHI wrote: > > And even worse lzcntx is accidentially "fallback"s to bsrx on > > unsupported CPUs, which leads to bogus results. > > __builtin_clzll(8) = 3, which should be 60. > > I'm not sure I understand this point. Are you saying that if we use > -mlzcnt to compile, then the compiler builtin is broken in platforms > that don't support the lzcnt instruction? That's horrible. Let's stay > away from -mlzcnt then. Ah, I understand it now: https://stackoverflow.com/questions/25683690/confusion-about-bsr-and-lzcnt/43443701#43443701 if you call LZCNT/TZCNT on a CPU that doesn't support it, it won't raise SIGILL or anything ... it'll just silently compute the wrong result. That's certainly not what I call a fallback! I think David's code was correct because it was testing CPUID for instruction support before using the specialized code (well, except that he forgot to add the right compiler option to *enable* the LZCNT/TZCNT instructions in the first place); but given subsequent discussion that the instruction is not worth it anyway, we might as well ignore it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Feb-14, Tom Lane wrote: > Then the correct test to see if we want to build pg_popcount.c (BTW, > please pick a less generic name for that) and the choose function > is whether we have *both* HAVE__BUILTIN_POPCOUNT and nonempty > CFLAGS_POPCNT. I used pg_bitutils_sse42.c to host the specially-compiled functions. The name is not entirely correct, but seems clear enough. I noticed in Compiler Explorer that some (ancient?) Power cpus implement instruction "popcntb", and GCC support for those uses -mpopcntb switch enabling __builtin_popcount() to use it. I added the switch to configure.in but I'm not sure how well that will work ... I don't know if this is represented in buildfarm. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Ah, I understand it now: > https://stackoverflow.com/questions/25683690/confusion-about-bsr-and-lzcnt/43443701#43443701 > if you call LZCNT/TZCNT on a CPU that doesn't support it, it won't raise > SIGILL or anything ... it'll just silently compute the wrong result. > That's certainly not what I call a fallback! Yeah, that's pretty nasty; it means there's no backstop for whether your choose function gets it right :-( Is POPCNT any better in this respect? regards, tom lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > I noticed in Compiler Explorer that some (ancient?) Power cpus > implement instruction "popcntb", and GCC support for those uses > -mpopcntb switch enabling __builtin_popcount() to use it. I added the > switch to configure.in but I'm not sure how well that will work ... I > don't know if this is represented in buildfarm. I experimented a bit with this on an old Apple laptop. Apple's compiler rejects -mpopcntb altogether. FreeBSD's compiler (gcc 4.2.1) recognizes the switch, but I could not get it to emit the instruction, even when specifying -mcpu=power5, which ought to enable it according to the gcc docs: ... The `-mpopcntb' option allows GCC to generate the popcount and double precision FP reciprocal estimate instruction implemented on the POWER5 processor and other processors that support the PowerPC V2.02 architecture. A more recent gcc info file also mentions The `-mpopcntd' option allows GCC to generate the popcount instruction implemented on the POWER7 processor and other processors that support the PowerPC V2.06 architecture. but the gcc version I have on this laptop doesn't know that switch. In any case, I'm pretty sure Apple never shipped a CPU that could run either instruction. I suspect that probing for either option may not be worth the configure cycles it'd consume :-( ... there are just way too few of those specific POWER variants out there anymore, even granting that you have a compiler that will play along. Moreover, you can't turn on -mpopcntb without having some POWER equivalent to the CPUID test. However, if you want to leave the option for this open in future, it really makes the file name pg_bitutils_sse42.c quite inappropriate. How about pg_bitutils_hwpopcnt.c or something like that? regards, tom lane
On 2019-Feb-15, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > Ah, I understand it now: > > https://stackoverflow.com/questions/25683690/confusion-about-bsr-and-lzcnt/43443701#43443701 > > if you call LZCNT/TZCNT on a CPU that doesn't support it, it won't raise > > SIGILL or anything ... it'll just silently compute the wrong result. > > That's certainly not what I call a fallback! > > Yeah, that's pretty nasty; it means there's no backstop for whether > your choose function gets it right :-( Hopefully other tests will fail in some visible way, though. My fear is whether we have such systems in buildfarm. > Is POPCNT any better in this respect? I couldn't find how is POPCNT encoded. https://stackoverflow.com/a/28803917/242383 I did find these articles: http://danluu.com/assembly-intrinsics/ https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-counter-with-64-bit-introduces-crazy-performance-deviati This suggests that this all a largely pointless exercise at least on Intel and GCC/Clang. It may be better on AMD ... but to get really better performance we'd need to be coding the popcnt calls in assembly rather than using the compiler intrinsics, even with -mpopcnt, because the intrinsics suck. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Here's a final version that I intend to push shortly, to have time before EOB today to handle any fallout. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Here's a final version that I intend to push shortly, to have time > before EOB today to handle any fallout. I think this is likely to result in a lot of complaints about rightmost_one_pos[] being unreferenced, in non-HAVE__BUILTIN_CTZ builds. Probably that has to be an extern rather than static in the header. leftmost_one_pos[] likewise. I might have a go at improving the configure tests later --- I still don't like that they're compile-time-optimizable. But that can wait. regards, tom lane
Hi, On 2019-02-14 16:45:38 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2019-02-14 15:47:13 -0300, Alvaro Herrera wrote: > >> Hah, I just realized you have to add -mlzcnt in order for these builtins > >> to use the lzcnt instructions. It goes from something like > >> > >> bsrq %rax, %rax > >> xorq $63, %rax > > > I'm confused how this is a general count leading zero operation? Did you > > use constants or something that allowed ot infer a range in the test? If > > so the compiler probably did some optimizations allowing it to do the > > above. > > No. If you compile > > int myclz(unsigned long long x) > { > return __builtin_clzll(x); > } > > at -O2, on just about any x86_64 gcc, you will get > > myclz: > .LFB1: > .cfi_startproc > bsrq %rdi, %rax > xorq $63, %rax > ret > .cfi_endproc Yea, sorry for the noise. I misremembered the bsrq mnemonic. bsr has a latency of three cycles, xor of one. lzcnt a latency of three. So it's mildly faster to use lzcnt (it uses fewer ports, and has a shorter latency). But I doubt we have code where that's noticable. Greetings, Andres Freund