Thread: call popcount32/64 directly on non-x86 platforms
Currently, all platforms must indirect through a function pointer to call popcount on a word-sized input, even though we don't arrange for a fast implementation on non-x86 to make it worthwhile.
0001 moves some declarations around so that "slow" popcount functions are called directly on non-x86 platforms.
0002 was an idea to simplify and unify the coding for the slow functions.
Also attached is a test module for building microbenchmarks.
On a Power8 machine using gcc 4.8, and running
time ./inst/bin/psql -c 'select drive_popcount(100000, 1024)'
I get
master: 647ms
0001: 183ms
0002: 228ms
So 0001 is a clear winner on that platform. 0002 is still good, but slower than 0001 for some reason, and it turns out that on master, gcc does emit a popcnt instruction from the intrinsic:
0000000000000000 <pg_popcount32_slow>:
0: f4 02 63 7c popcntw r3,r3
4: b4 07 63 7c extsw r3,r3
8: 20 00 80 4e blr
...
The gcc docs mention a flag for this, but I'm not sure why it seems not to need it:
https://gcc.gnu.org/onlinedocs/gcc/RS_002f6000-and-PowerPC-Options.html#RS_002f6000-and-PowerPC-Options
Maybe that's because the machine I used was ppc64le, but I'm not sure a ppc binary built like this is portable to other hardware. For that reason, maybe 0002 is a good idea.
--
0001 moves some declarations around so that "slow" popcount functions are called directly on non-x86 platforms.
0002 was an idea to simplify and unify the coding for the slow functions.
Also attached is a test module for building microbenchmarks.
On a Power8 machine using gcc 4.8, and running
time ./inst/bin/psql -c 'select drive_popcount(100000, 1024)'
I get
master: 647ms
0001: 183ms
0002: 228ms
So 0001 is a clear winner on that platform. 0002 is still good, but slower than 0001 for some reason, and it turns out that on master, gcc does emit a popcnt instruction from the intrinsic:
0000000000000000 <pg_popcount32_slow>:
0: f4 02 63 7c popcntw r3,r3
4: b4 07 63 7c extsw r3,r3
8: 20 00 80 4e blr
...
The gcc docs mention a flag for this, but I'm not sure why it seems not to need it:
https://gcc.gnu.org/onlinedocs/gcc/RS_002f6000-and-PowerPC-Options.html#RS_002f6000-and-PowerPC-Options
Maybe that's because the machine I used was ppc64le, but I'm not sure a ppc binary built like this is portable to other hardware. For that reason, maybe 0002 is a good idea.
Attachment
On Thu, 12 Aug 2021 at 05:11, John Naylor <john.naylor@enterprisedb.com> wrote: > 0001 moves some declarations around so that "slow" popcount functions are called directly on non-x86 platforms. I was wondering if there was a reason that you didn't implement this by just changing pg_popcount32 and pg_popcount64 to be actual functions rather than function pointers when TRY_POPCNT_FAST is not defined? These functions would then just return pg_popcountNN_slow(word); This would save from having to change all the current callers of the functions to use the macro instead. That might be nice for any extensions which are using these functions. David
On Wed, Aug 11, 2021 at 8:13 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 12 Aug 2021 at 05:11, John Naylor <john.naylor@enterprisedb.com> wrote:
> > 0001 moves some declarations around so that "slow" popcount functions are called directly on non-x86 platforms.
>
> I was wondering if there was a reason that you didn't implement this
> by just changing pg_popcount32 and pg_popcount64 to be actual
> functions rather than function pointers when TRY_POPCNT_FAST is not
> defined? These functions would then just return
> pg_popcountNN_slow(word);
>
> This would save from having to change all the current callers of the
> functions to use the macro instead. That might be nice for any
> extensions which are using these functions.
Hmm, it wasn't obvious to me that would work, but I tried it and came up with v2. Is this what you had in mind?
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On Thu, 12 Aug 2021 at 14:02, John Naylor <john.naylor@enterprisedb.com> wrote: > > > On Wed, Aug 11, 2021 at 8:13 PM David Rowley <dgrowleyml@gmail.com> wrote: > > > > On Thu, 12 Aug 2021 at 05:11, John Naylor <john.naylor@enterprisedb.com> wrote: > > > 0001 moves some declarations around so that "slow" popcount functions are called directly on non-x86 platforms. > > > > I was wondering if there was a reason that you didn't implement this > > by just changing pg_popcount32 and pg_popcount64 to be actual > > functions rather than function pointers when TRY_POPCNT_FAST is not > > defined? These functions would then just return > > pg_popcountNN_slow(word); > > > > This would save from having to change all the current callers of the > > functions to use the macro instead. That might be nice for any > > extensions which are using these functions. > > Hmm, it wasn't obvious to me that would work, but I tried it and came up with v2. Is this what you had in mind? Closer, but I don't see why there's any need to make the fast and slow functions external. It should be perfectly fine to keep them static. I didn't test the performance, but the attached works for me. Going by https://godbolt.org/z/ocv1Kj5K4 f2 seems to inline f1 David
Attachment
So when on MSVC, you don't have to check CPUID for support? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "La primera ley de las demostraciones en vivo es: no trate de usar el sistema. Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)
On Fri, 13 Aug 2021 at 01:11, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > So when on MSVC, you don't have to check CPUID for support? That still needs to be checked in MSVC and as far as I can see it is being properly checked. David
On Fri, 13 Aug 2021 at 01:28, David Rowley <dgrowleyml@gmail.com> wrote: > > On Fri, 13 Aug 2021 at 01:11, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > So when on MSVC, you don't have to check CPUID for support? > > That still needs to be checked in MSVC and as far as I can see it is > being properly checked. Something there that might cause confusion is we do a configure check to see if popcntq works and define HAVE_X86_64_POPCNTQ if it does. I'm still a bit confused at why we bother doing that. Surely it just means that if the build machine does not have popcntq that we'll always use pg_popcount64_slow, regardless if the machine that's actually running the code has popcntq. Maybe you saw that there's no such equivalent test when we set HAVE_X86_64_POPCNTQ for MSVC on x86_64. The reason for that is that we do the run-time test using cpuid. David
On Thu, Aug 12, 2021 at 9:33 AM David Rowley <dgrowleyml@gmail.com> wrote:
> Something there that might cause confusion is we do a configure check
> to see if popcntq works and define HAVE_X86_64_POPCNTQ if it does.
> I'm still a bit confused at why we bother doing that. Surely it just
> means that if the build machine does not have popcntq that we'll
> always use pg_popcount64_slow, regardless if the machine that's
> actually running the code has popcntq.
Yeah, it's a bit strange, a configure check makes more sense if we have a way to specify we can build with a direct call (like x86-64-v2), but we don't right now. Maybe short-term we could always do the runtime check on x86-64.
On 2021-Aug-13, David Rowley wrote: > Maybe you saw that there's no such equivalent test when we set > HAVE_X86_64_POPCNTQ for MSVC on x86_64. The reason for that is that > we do the run-time test using cpuid. Yeah, that and also I mistook the two independent "ifdef" blocks for one block with an "#else". Re-reading the ifdef maze, it looks reasonable. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
On Thu, Aug 12, 2021 at 1:26 AM David Rowley <dgrowleyml@gmail.com> wrote:
> Closer, but I don't see why there's any need to make the fast and slow
> functions external. It should be perfectly fine to keep them static.
>
> I didn't test the performance, but the attached works for me.
Thanks for that! I still get a big improvement to on Power8 / gcc 4.8, but it's not quite as fast as earlier versions, which were around 200ms:
master: 646ms
v3: 312ms
v3: 312ms
This machine does seem to be pickier about code layout than any other I've tried running benchmarks on, but that's still a bit much. In any case, your version is clearer and has the intended effect, so I plan to commit that, barring other comments.
I think I'll leave my v2-0002 aside for now, since it has wider implications, and I have bigger things to work on.
I wrote:
> On Thu, Aug 12, 2021 at 1:26 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > Closer, but I don't see why there's any need to make the fast and slow
> > functions external. It should be perfectly fine to keep them static.
> >
> > I didn't test the performance, but the attached works for me.
>
> Thanks for that! I still get a big improvement to on Power8 / gcc 4.8, but it's not quite as fast as earlier versions, which were around 200ms:
>
> master: 646ms
> v3: 312ms
>
> This machine does seem to be pickier about code layout than any other I've tried running benchmarks on, but that's still a bit much. In any case, your version is clearer and has the intended effect, so I plan to commit that, barring other comments.
Pushed, thanks for looking1
--
John Naylor
EDB: http://www.enterprisedb.com
> > Closer, but I don't see why there's any need to make the fast and slow
> > functions external. It should be perfectly fine to keep them static.
> >
> > I didn't test the performance, but the attached works for me.
>
> Thanks for that! I still get a big improvement to on Power8 / gcc 4.8, but it's not quite as fast as earlier versions, which were around 200ms:
>
> master: 646ms
> v3: 312ms
>
> This machine does seem to be pickier about code layout than any other I've tried running benchmarks on, but that's still a bit much. In any case, your version is clearer and has the intended effect, so I plan to commit that, barring other comments.
Pushed, thanks for looking1
--
John Naylor
EDB: http://www.enterprisedb.com