On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote:
> The main issue I saw was that clang was able to peel off the first
> iteration of the loop and then eliminate the mask assignment and
> replace masked load with a memory operand for vpopcnt. I was not able
> to convince gcc to do that regardless of optimization options.
> Generated code for the inner loop:
> 
> clang:
> <L2>:
>       50:      add rdx, 64
>       54:      cmp rdx, rdi
>       57:      jae <L1>
>       59:      vpopcntq zmm1, zmmword ptr [rdx]
>       5f:      vpaddq zmm0, zmm1, zmm0
>       65:      jmp <L2>
> 
> gcc:
> <L1>:
>       38:      kmovq k1, rdx
>       3d:      vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax]
>       43:      add rax, 64
>       47:      mov rdx, -1
>       4e:      vpopcntq zmm0, zmm0
>       54:      vpaddq zmm0, zmm0, zmm1
>       5a:      vmovdqa64 zmm1, zmm0
>       60:      cmp rax, rsi
>       63:      jb <L1>
> 
> I'm not sure how much that matters in practice. Attached is a patch to
> do this manually giving essentially the same result in gcc. As most
> distro packages are built using gcc I think it would make sense to
> have the extra code if it gives a noticeable benefit for large cases.

Yeah, I did see this, but I also wasn't sure if it was worth further
complicating the code.  I can test with and without your fix and see if it
makes any difference in the benchmarks.
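
For reference, the manual peeling discussed above has roughly the shape
below.  This is only a scalar sketch and not the attached patch:
popcount_peeled is a hypothetical name, it works on 8-byte words rather
than 64-byte zmm registers, and it assumes the gcc/clang builtin
__builtin_popcountll.  The point is just that the partial (masked) work
is done once before the loop, so the loop body itself needs no mask.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical illustration of peeling the partial first iteration out
 * of a popcount loop.  Scalar stand-in for the AVX-512 version.
 */
uint64_t
popcount_peeled(const unsigned char *buf, size_t len)
{
	uint64_t	total = 0;
	uint64_t	word;

	/* Number of bytes before buf reaches 8-byte alignment. */
	size_t		head = (size_t) (-(uintptr_t) buf & 7);

	if (head > len)
		head = len;

	/* Peeled first iteration: partial word, the only "masked" load. */
	if (head > 0)
	{
		word = 0;
		memcpy(&word, buf, head);
		total += (uint64_t) __builtin_popcountll(word);
		buf += head;
		len -= head;
	}

	/* Main loop: full words only, so no mask handling is needed here. */
	while (len >= sizeof(word))
	{
		memcpy(&word, buf, sizeof(word));
		total += (uint64_t) __builtin_popcountll(word);
		buf += sizeof(word);
		len -= sizeof(word);
	}

	/* Trailing partial word, again handled outside the loop. */
	if (len > 0)
	{
		word = 0;
		memcpy(&word, buf, len);
		total += (uint64_t) __builtin_popcountll(word);
	}

	return total;
}
```

The cost is the duplicated load-and-count code outside the loop, which
is the extra complexity I was unsure about.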
-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com