Thread: call popcount32/64 directly on non-x86 platforms

call popcount32/64 directly on non-x86 platforms

From
John Naylor
Date:
Currently, all platforms must indirect through a function pointer to call popcount on a word-sized input, even though we don't arrange for a fast implementation on non-x86 to make it worthwhile.

0001 moves some declarations around so that "slow" popcount functions are called directly on non-x86 platforms.

0002 was an idea to simplify and unify the coding for the slow functions.

Also attached is a test module for building microbenchmarks.

On a Power8 machine using gcc 4.8, and running
time ./inst/bin/psql -c 'select drive_popcount(100000, 1024)'

I get

master: 647ms
0001: 183ms
0002: 228ms

So 0001 is a clear winner on that platform. 0002 is still good, but slower than 0001 for some reason, and it turns out that on master, gcc does emit a popcnt instruction from the intrinsic:

0000000000000000 <pg_popcount32_slow>:
   0:   f4 02 63 7c     popcntw r3,r3
   4:   b4 07 63 7c     extsw   r3,r3
   8:   20 00 80 4e     blr
        ...

The gcc docs mention a flag for this, but I'm not sure why it seems not to need it:

https://gcc.gnu.org/onlinedocs/gcc/RS_002f6000-and-PowerPC-Options.html#RS_002f6000-and-PowerPC-Options

Maybe that's because the machine I used was ppc64le, but I'm not sure a ppc binary built like this is portable to other hardware. For that reason, maybe 0002 is a good idea. 

--
Attachment

Re: call popcount32/64 directly on non-x86 platforms

From
David Rowley
Date:
On Thu, 12 Aug 2021 at 05:11, John Naylor <john.naylor@enterprisedb.com> wrote:
> 0001 moves some declarations around so that "slow" popcount functions are called directly on non-x86 platforms.

I was wondering if there was a reason that you didn't implement this
by just changing pg_popcount32 and pg_popcount64 to be actual
functions rather than function pointers when TRY_POPCNT_FAST is not
defined? These functions would then just return
pg_popcountNN_slow(word);

This would save from having to change all the current callers of the
functions to use the macro instead.  That might be nice for any
extensions which are using these functions.

David



Re: call popcount32/64 directly on non-x86 platforms

From
John Naylor
Date:

On Wed, Aug 11, 2021 at 8:13 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 12 Aug 2021 at 05:11, John Naylor <john.naylor@enterprisedb.com> wrote:
> > 0001 moves some declarations around so that "slow" popcount functions are called directly on non-x86 platforms.
>
> I was wondering if there was a reason that you didn't implement this
> by just changing pg_popcount32 and pg_popcount64 to be actual
> functions rather than function pointers when TRY_POPCNT_FAST is not
> defined? These functions would then just return
> pg_popcountNN_slow(word);
>
> This would save from having to change all the current callers of the
> functions to use the macro instead.  That might be nice for any
> extensions which are using these functions.

Hmm, it wasn't obvious to me that would work, but I tried it and came up with v2. Is this what you had in mind?

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: call popcount32/64 directly on non-x86 platforms

From
David Rowley
Date:
On Thu, 12 Aug 2021 at 14:02, John Naylor <john.naylor@enterprisedb.com> wrote:
>
>
> On Wed, Aug 11, 2021 at 8:13 PM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Thu, 12 Aug 2021 at 05:11, John Naylor <john.naylor@enterprisedb.com> wrote:
> > > 0001 moves some declarations around so that "slow" popcount functions are called directly on non-x86 platforms.
> >
> > I was wondering if there was a reason that you didn't implement this
> > by just changing pg_popcount32 and pg_popcount64 to be actual
> > functions rather than function pointers when TRY_POPCNT_FAST is not
> > defined? These functions would then just return
> > pg_popcountNN_slow(word);
> >
> > This would save from having to change all the current callers of the
> > functions to use the macro instead.  That might be nice for any
> > extensions which are using these functions.
>
> Hmm, it wasn't obvious to me that would work, but I tried it and came up with v2. Is this what you had in mind?

Closer, but I don't see why there's any need to make the fast and slow
functions external.  It should be perfectly fine to keep them static.

I didn't test the performance, but the attached works for me.

Going by https://godbolt.org/z/ocv1Kj5K4 f2 seems to inline f1

David

Attachment

Re: call popcount32/64 directly on non-x86 platforms

From
Alvaro Herrera
Date:
So when on MSVC, you don't have to check CPUID for support?

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)



Re: call popcount32/64 directly on non-x86 platforms

From
David Rowley
Date:
On Fri, 13 Aug 2021 at 01:11, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> So when on MSVC, you don't have to check CPUID for support?

That still needs to be checked in MSVC and as far as I can see it is
being properly checked.

David



Re: call popcount32/64 directly on non-x86 platforms

From
David Rowley
Date:
On Fri, 13 Aug 2021 at 01:28, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 13 Aug 2021 at 01:11, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > So when on MSVC, you don't have to check CPUID for support?
>
> That still needs to be checked in MSVC and as far as I can see it is
> being properly checked.

Something there that might cause confusion is we do a configure check
to see if popcntq works and define HAVE_X86_64_POPCNTQ if it does.
I'm still a bit confused at why we bother doing that. Surely it just
means that if the build machine does not have popcntq that we'll
always use pg_popcount64_slow, regardless if the machine that's
actually running the code has popcntq.

Maybe you saw that there's no such equivalent test when we set
HAVE_X86_64_POPCNTQ for MSVC on x86_64.  The reason for that is that
we do the run-time test using cpuid.

David



Re: call popcount32/64 directly on non-x86 platforms

From
John Naylor
Date:

On Thu, Aug 12, 2021 at 9:33 AM David Rowley <dgrowleyml@gmail.com> wrote:

> Something there that might cause confusion is we do a configure check
> to see if popcntq works and define HAVE_X86_64_POPCNTQ if it does.
> I'm still a bit confused at why we bother doing that. Surely it just
> means that if the build machine does not have popcntq that we'll
> always use pg_popcount64_slow, regardless if the machine that's
> actually running the code has popcntq.

Yeah, it's a bit strange, a configure check makes more sense if we have a way to specify we can build with a direct call (like x86-64-v2), but we don't right now. Maybe short-term we could always do the runtime check on x86-64.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: call popcount32/64 directly on non-x86 platforms

From
Alvaro Herrera
Date:
On 2021-Aug-13, David Rowley wrote:

> Maybe you saw that there's no such equivalent test when we set
> HAVE_X86_64_POPCNTQ for MSVC on x86_64.  The reason for that is that
> we do the run-time test using cpuid.

Yeah, that and also I mistook the two independent "ifdef" blocks for one
block with an "#else".  Re-reading the ifdef maze, it looks reasonable.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/



Re: call popcount32/64 directly on non-x86 platforms

From
John Naylor
Date:

On Thu, Aug 12, 2021 at 1:26 AM David Rowley <dgrowleyml@gmail.com> wrote:
> Closer, but I don't see why there's any need to make the fast and slow
> functions external.  It should be perfectly fine to keep them static.
>
> I didn't test the performance, but the attached works for me.

Thanks for that! I still get a big improvement to on Power8 / gcc 4.8, but it's not quite as fast as earlier versions, which were around 200ms:

master: 646ms
v3:     312ms

This machine does seem to be pickier about code layout than any other I've tried running benchmarks on, but that's still a bit much. In any case, your version is clearer and has the intended effect, so I plan to commit that, barring other comments.

I think I'll leave my v2-0002 aside for now, since it has wider implications, and I have bigger things to work on.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: call popcount32/64 directly on non-x86 platforms

From
John Naylor
Date:

I wrote:
> On Thu, Aug 12, 2021 at 1:26 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > Closer, but I don't see why there's any need to make the fast and slow
> > functions external.  It should be perfectly fine to keep them static.
> >
> > I didn't test the performance, but the attached works for me.
>
> Thanks for that! I still get a big improvement to on Power8 / gcc 4.8, but it's not quite as fast as earlier versions, which were around 200ms:
>
> master: 646ms
> v3:     312ms
>
> This machine does seem to be pickier about code layout than any other I've tried running benchmarks on, but that's still a bit much. In any case, your version is clearer and has the intended effect, so I plan to commit that, barring other comments.

Pushed, thanks for looking1

--
John Naylor
EDB: http://www.enterprisedb.com