Thread: Use POPCNT on MSVC
Going by [1], it looks like we can use the __popcnt and __popcnt64 intrinsic functions on MSVC if the CPU supports POPCNT. We already have code to check for that, we just need to enable it on MSVC. The attached patch seems to be all that's needed. David [1] https://docs.microsoft.com/en-us/cpp/intrinsics/popcnt16-popcnt-popcnt64?view=msvc-140
Attachment
On Tue, Aug 3, 2021 at 5:03 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> Going by [1], it looks like we can use the __popcnt and __popcnt64
> intrinsic functions on MSVC if the CPU supports POPCNT. We already
> have code to check for that, we just need to enable it on MSVC.
>
> The attached patch seems to be all that's needed.
+1 for the concept, but the coding is a bit strange. Granted, the way we handle popcnt is a bit strange, but this makes it stranger:
1. the __popcnt64() intrinsic is put inside pg_popcount64_asm(), which is a bit of a misnomer since it's not assembly. Renaming s/_asm/_fast/ would help it look better. But then looking around, other platforms have intrinsics coded, but for some reason they're put in pg_popcount64_slow(), where the compiler will emit instructions we could easily write ourselves in C (and without #ifdefs) since without the right CFLAGS these intrinsics won't emit a popcnt instruction. Is MSVC different in that regard? If so, it might be worth a comment.
2. (defined(_MSC_VER) && defined(_WIN64) lead to a runtime check for the CPUID, which is fine, but now next to it HAVE_X86_64_POPCNTQ looks strange because the latter symbol comes from a specific configure test -- the two don't seem equivalent, but maybe they are because of what MSVC does? That would be nice to spell out here.
I know the present state of affairs works a bit different than what you proposed a couple years ago and that something had to change for portability reasons I didn't quite understand, but I think if we change things here we should at least try to have things fit together a bit more nicely.
(Side note, but sort of related to #1 above: non-x86 platforms have to indirect through a function pointer even though they have no fast implementation to make it worth their while. It would be better for them if the "slow" implementation was called static inline or at least a direct function call, but that's a separate thread.)
On Tue, Aug 3, 2021 at 10:43 PM John Naylor <john.naylor@enterprisedb.com> wrote: > (Side note, but sort of related to #1 above: non-x86 platforms have to indirect through a function pointer even thoughthey have no fast implementation to make it worth their while. It would be better for them if the "slow" implementationwas called static inline or at least a direct function call, but that's a separate thread.) +1 I haven't looked into whether we could benefit from it in real use cases, but it seems like it'd also be nice if pg_popcount() were a candidate for auto-vectorisation and inlining. For example, NEON has vector popcount, and for Intel/AMD there is a shuffle-based AVX2 trick that at least Clang produces automatically[1]. We're obstructing that by doing function dispatch at individual word level, and using inline assembler instead of builtins. [1] https://arxiv.org/abs/1611.07612
On Tue, 3 Aug 2021 at 22:43, John Naylor <john.naylor@enterprisedb.com> wrote: > 1. the __popcnt64() intrinsic is put inside pg_popcount64_asm(), which is a bit of a misnomer since it's not assembly.Renaming s/_asm/_fast/ would help it look better. But then looking around, other platforms have intrinsics coded,but for some reason they're put in pg_popcount64_slow(), where the compiler will emit instructions we could easilywrite ourselves in C (and without #ifdefs) since without the right CFLAGS these intrinsics won't emit a popcnt instruction.Is MSVC different in that regard? If so, it might be worth a comment. Yeah, the names no longer fit so well after stuffing the MSVC intrinsic into the asm function. The reason I did it that way is down to what I read in the docs. Namely: "If you run code that uses these intrinsics on hardware that doesn't support the popcnt instruction, the results are unpredictable." So, it has to go somewhere where we're sure the CPU supports POPCNT and that seemed like the correct place. From what I understand of GCC and __builtin_popcountl(), the code it'll output will depend on the -mpopcnt flag. So having __builtin_popcountl() does not mean the popcnt instruction will be used. The Microsoft documentation indicates that's not the case for __popcnt(). > 2. (defined(_MSC_VER) && defined(_WIN64) lead to a runtime check for the CPUID, which is fine, but now next to it HAVE_X86_64_POPCNTQlooks strange because the latter symbol comes from a specific configure test -- the two don't seem equivalent,but maybe they are because of what MSVC does? That would be nice to spell out here. The problem there is that we define HAVE_X86_64_POPCNTQ based on the outcome of configure so it does not get set for MSVC. Maybe it's worth coming up with some other constant that can be defined or we could just do: #if defined(_MSC_VER) && defined(_WIN64) #define HAVE_X86_64_POPCNTQ #endif > I know the present state of affairs works a bit different than what you proposed a couple years ago and that somethinghad to change for portability reasons I didn't quite understand, but I think if we change things here we shouldat least try to have things fit together a bit more nicely. I think the reason for the asm is that __builtin_popcountl does not mean popcnt will be used. Maybe we could have done something like compile pg_bitutils.c with -mpopcnt, and have kept the run-time check, but the problem there is that means that the compiler might end up using that instruction in some other function in that file that we don't want it to. It looks like my patch in [1] did pass the -mpopcnt flag which Tom fixed. > (Side note, but sort of related to #1 above: non-x86 platforms have to indirect through a function pointer even thoughthey have no fast implementation to make it worth their while. It would be better for them if the "slow" implementationwas called static inline or at least a direct function call, but that's a separate thread.) hmm yeah. I see there's a few more usages of pg_popcount() than when I looked last. It would be fairly easy to get rid of the pg_popcount64/pg_popcount32 function call in that. A separate patch though. David [1] https://www.postgresql.org/message-id/CAKJS1f9WTAGG1tPeJnD18hiQW5gAk59fQ6WK-vfdAKEHyRg2RA@mail.gmail.com
On Tue, Aug 3, 2021 at 11:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 3 Aug 2021 at 22:43, John Naylor <john.naylor@enterprisedb.com> wrote:
> > 1. the __popcnt64() intrinsic is put inside pg_popcount64_asm(), which is a bit of a misnomer since it's not assembly. Renaming s/_asm/_fast/ would help it look better. But then looking around, other platforms have intrinsics coded, but for some reason they're put in pg_popcount64_slow(), where the compiler will emit instructions we could easily write ourselves in C (and without #ifdefs) since without the right CFLAGS these intrinsics won't emit a popcnt instruction. Is MSVC different in that regard? If so, it might be worth a comment.
>
> Yeah, the names no longer fit so well after stuffing the MSVC
> intrinsic into the asm function. The reason I did it that way is down
> to what I read in the docs. Namely:
>
> "If you run code that uses these intrinsics on hardware that doesn't
> support the popcnt instruction, the results are unpredictable."
>
> So, it has to go somewhere where we're sure the CPU supports POPCNT
> and that seemed like the correct place.
>
> From what I understand of GCC and __builtin_popcountl(), the code
> it'll output will depend on the -mpopcnt flag. So having
> __builtin_popcountl() does not mean the popcnt instruction will be
> used. The Microsoft documentation indicates that's not the case for
> __popcnt().
Okay, "unpredictable" sounds bad.
> > 2. (defined(_MSC_VER) && defined(_WIN64) lead to a runtime check for the CPUID, which is fine, but now next to it HAVE_X86_64_POPCNTQ looks strange because the latter symbol comes from a specific configure test -- the two don't seem equivalent, but maybe they are because of what MSVC does? That would be nice to spell out here.
>
> The problem there is that we define HAVE_X86_64_POPCNTQ based on the
> outcome of configure so it does not get set for MSVC. Maybe it's
> worth coming up with some other constant that can be defined or we
> could just do:
>
> #if defined(_MSC_VER) && defined(_WIN64)
> #define HAVE_X86_64_POPCNTQ
> #endif
That seems fine. I don't know PG can build with Arm on Windows, but for the cpuid to work, it seems safer to also check for __x86_64?
> I think the reason for the asm is that __builtin_popcountl does not
> mean popcnt will be used. Maybe we could have done something like
> compile pg_bitutils.c with -mpopcnt, and have kept the run-time check,
> but the problem there is that means that the compiler might end up
> using that instruction in some other function in that file that we
> don't want it to. It looks like my patch in [1] did pass the -mpopcnt
> flag which Tom fixed.
Ah, that makes sense. (If we someday offer a configure option for x86-64-v2, we can use intrinsics in the asm functions, and call them directly. Yet another different thread.)
--
John Naylor
EDB: http://www.enterprisedb.com
>
> On Tue, 3 Aug 2021 at 22:43, John Naylor <john.naylor@enterprisedb.com> wrote:
> > 1. the __popcnt64() intrinsic is put inside pg_popcount64_asm(), which is a bit of a misnomer since it's not assembly. Renaming s/_asm/_fast/ would help it look better. But then looking around, other platforms have intrinsics coded, but for some reason they're put in pg_popcount64_slow(), where the compiler will emit instructions we could easily write ourselves in C (and without #ifdefs) since without the right CFLAGS these intrinsics won't emit a popcnt instruction. Is MSVC different in that regard? If so, it might be worth a comment.
>
> Yeah, the names no longer fit so well after stuffing the MSVC
> intrinsic into the asm function. The reason I did it that way is down
> to what I read in the docs. Namely:
>
> "If you run code that uses these intrinsics on hardware that doesn't
> support the popcnt instruction, the results are unpredictable."
>
> So, it has to go somewhere where we're sure the CPU supports POPCNT
> and that seemed like the correct place.
>
> From what I understand of GCC and __builtin_popcountl(), the code
> it'll output will depend on the -mpopcnt flag. So having
> __builtin_popcountl() does not mean the popcnt instruction will be
> used. The Microsoft documentation indicates that's not the case for
> __popcnt().
Okay, "unpredictable" sounds bad.
> > 2. (defined(_MSC_VER) && defined(_WIN64) lead to a runtime check for the CPUID, which is fine, but now next to it HAVE_X86_64_POPCNTQ looks strange because the latter symbol comes from a specific configure test -- the two don't seem equivalent, but maybe they are because of what MSVC does? That would be nice to spell out here.
>
> The problem there is that we define HAVE_X86_64_POPCNTQ based on the
> outcome of configure so it does not get set for MSVC. Maybe it's
> worth coming up with some other constant that can be defined or we
> could just do:
>
> #if defined(_MSC_VER) && defined(_WIN64)
> #define HAVE_X86_64_POPCNTQ
> #endif
That seems fine. I don't know PG can build with Arm on Windows, but for the cpuid to work, it seems safer to also check for __x86_64?
> I think the reason for the asm is that __builtin_popcountl does not
> mean popcnt will be used. Maybe we could have done something like
> compile pg_bitutils.c with -mpopcnt, and have kept the run-time check,
> but the problem there is that means that the compiler might end up
> using that instruction in some other function in that file that we
> don't want it to. It looks like my patch in [1] did pass the -mpopcnt
> flag which Tom fixed.
Ah, that makes sense. (If we someday offer a configure option for x86-64-v2, we can use intrinsics in the asm functions, and call them directly. Yet another different thread.)
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, 5 Aug 2021 at 07:02, John Naylor <john.naylor@enterprisedb.com> wrote: > > #if defined(_MSC_VER) && defined(_WIN64) > > #define HAVE_X86_64_POPCNTQ > > #endif > > That seems fine. I don't know PG can build with Arm on Windows, but for the cpuid to work, it seems safer to also checkfor __x86_64? That's a good point. I've adjusted it to do #if defined(_MSC_VER) && defined(_M_AMD64). I've attached a v2 patch which I think is more along the lines of what you had in mind. David
Attachment
On Sun, Aug 8, 2021 at 8:31 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> I've attached a v2 patch which I think is more along the lines of what
> you had in mind.
LGTM
>
> I've attached a v2 patch which I think is more along the lines of what
> you had in mind.
LGTM
On Mon, 9 Aug 2021 at 12:58, John Naylor <john.naylor@enterprisedb.com> wrote: > > On Sun, Aug 8, 2021 at 8:31 PM David Rowley <dgrowleyml@gmail.com> wrote: > > > > I've attached a v2 patch which I think is more along the lines of what > > you had in mind. > > LGTM Thanks for the review. Pushed. David