Thread: Re: define pg_structiszero(addr, s, r)

Re: define pg_structiszero(addr, s, r)

From
Peter Eisentraut
Date:
On 18.09.24 06:16, Bertrand Drouvot wrote:
> +#define pg_structiszero(addr, s, r)                                    \
> +    do {                                                            \
> +        /* We assume this initializes to zeroes */                    \
> +        static const s all_zeroes;                                    \
> +        r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0);    \
> +    } while (0)

This assumption is kind of the problem, isn't it?  Because, you can't 
assume that.  And the existing code is arguably kind of wrong.  But 
moreover, this macro also assumes that the "addr" argument has no random 
padding bits.

In the existing code, you can maybe make a local analysis that the code 
is working correctly, although I'm not actually sure.  But if you are 
repackaging this as a general macro under a general-sounding name, then 
the requirements should be more stringent.



Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Wed, Sep 18, 2024 at 10:03:21AM +0200, Peter Eisentraut wrote:
> On 18.09.24 06:16, Bertrand Drouvot wrote:
> > +#define pg_structiszero(addr, s, r)                                    \
> > +    do {                                                            \
> > +        /* We assume this initializes to zeroes */                    \
> > +        static const s all_zeroes;                                    \
> > +        r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0);    \
> > +    } while (0)
>

Thanks for the feedback.
 
> This assumption is kind of the problem, isn't it?  Because, you can't assume
> that.  And the existing code is arguably kind of wrong.  But moreover, this
> macro also assumes that the "addr" argument has no random padding bits.
> 
> In the existing code, you can maybe make a local analysis that the code is
> working correctly, although I'm not actually sure.

I think it is but will give it a closer look.

>  But if you are
> repackaging this as a general macro under a general-sounding name, then the
> requirements should be more stringent.

Agree. That said in v2 ([1]), it has been renamed to pgstat_entry_all_zeros().

I think that I will:

1/ take a closer look regarding the existing assumption
2/ if 1/ outcome is fine, then add more detailed comments around
pgstat_entry_all_zeros() to make sure it's not used outside of the existing
context 

Sounds good to you?

[1]: https://www.postgresql.org/message-id/ZuqHLCdZXtEsbyb/%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Heikki Linnakangas
Date:
On 18/09/2024 21:57, Bertrand Drouvot wrote:
> On Wed, Sep 18, 2024 at 10:03:21AM +0200, Peter Eisentraut wrote:
>> On 18.09.24 06:16, Bertrand Drouvot wrote:
>>> +#define pg_structiszero(addr, s, r)                                    \
>>> +    do {                                                            \
>>> +        /* We assume this initializes to zeroes */                    \
>>> +        static const s all_zeroes;                                    \
>>> +        r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0);    \
>>> +    } while (0)

Not new with this patch, but do we guarantee padding bytes to be zeros?

How about this instead:

static inline bool
pg_is_all_zeros(const char *p, size_t len)
{
   for (size_t i = 0; i < len; i++)
   {
     if (p[i] != 0)
         return false;
   }
   return true;
}

Is there's a de facto standard name for that function? I was surprised 
that I couldn't find one with a quick google search. That seems like the 
kind of small utility function that every C program needs.

How performance sensitive is this? If it's not, then the above seems 
like the most straightforward way to do this, which is good. If it is 
performance sensitive, it's still good, because the compiler can 
optimize that well: https://godbolt.org/z/x9hPWjheq.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: define pg_structiszero(addr, s, r)

From
Ranier Vilela
Date:
Em seg., 28 de out. de 2024 às 11:33, Heikki Linnakangas <hlinnaka@iki.fi> escreveu:
On 18/09/2024 21:57, Bertrand Drouvot wrote:
> On Wed, Sep 18, 2024 at 10:03:21AM +0200, Peter Eisentraut wrote:
>> On 18.09.24 06:16, Bertrand Drouvot wrote:
>>> +#define pg_structiszero(addr, s, r)                                                                        \
>>> +   do {                                                                                                                    \
>>> +           /* We assume this initializes to zeroes */                                      \
>>> +           static const s all_zeroes;                                                                      \
>>> +           r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0);       \
>>> +   } while (0)

Not new with this patch, but do we guarantee padding bytes to be zeros?

How about this instead:

static inline bool
pg_is_all_zeros(const char *p, size_t len)
{
   for (size_t i = 0; i < len; i++)
   {
     if (p[i] != 0)
         return false;
   }
   return true;
}

It seems to me that this way is more optimized.

static inline bool
is_all_zeros(const char *p, size_t len)
{
  for (size_t i = len; i >= 0; i--)
  {
    if (p[i] != 0)
        return false;
  }
  return true;
}

main:
        sub     rsp, 24
        lea     rdx, [rsp + 12]
        lea     rcx, [rsp + 16]
        lea     rdi, [rip + .L.str]
        lea     rsi, [rsp + 8]
        xor     eax, eax
        call    __isoc99_scanf@PLT
        lea     rdi, [rip + .L.str.1]
        xor     esi, esi
        xor     eax, eax
        call    printf@PLT
        xor     eax, eax
        add     rsp, 24
        ret

best regards,
Ranier Vilela

Re: define pg_structiszero(addr, s, r)

From
Tom Lane
Date:
Ranier Vilela <ranier.vf@gmail.com> writes:
> It seems to me that [reversing the loop direction] is more optimized.

That's far from clear: you're ignoring the possibility that memory
access logic is better optimized for forward scanning than reverse
scanning.  I'd stick with the forward scan without some extremely
detailed testing.

            regards, tom lane



Re: define pg_structiszero(addr, s, r)

From
Ranier Vilela
Date:
Em seg., 28 de out. de 2024 às 12:08, Tom Lane <tgl@sss.pgh.pa.us> escreveu:
Ranier Vilela <ranier.vf@gmail.com> writes:
> It seems to me that [reversing the loop direction] is more optimized.

That's far from clear: you're ignoring the possibility that memory
access logic is better optimized for forward scanning than reverse
scanning.  I'd stick with the forward scan without some extremely
detailed testing.
I don't disagree.
After posting, I was wondering if the first possible is not zero, it should probably be at the beginning and not at the end.

best regards,
Ranier Vilela

Re: define pg_structiszero(addr, s, r)

From
Peter Smith
Date:
On Thu, Sep 19, 2024 at 4:57 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Wed, Sep 18, 2024 at 10:03:21AM +0200, Peter Eisentraut wrote:
> > On 18.09.24 06:16, Bertrand Drouvot wrote:
> > > +#define pg_structiszero(addr, s, r)                                                                        \
> > > +   do {
         \ 
> > > +           /* We assume this initializes to zeroes */                                      \
> > > +           static const s all_zeroes;                                                                      \
> > > +           r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0);       \
> > > +   } while (0)
> >
>
> Thanks for the feedback.
>
> > This assumption is kind of the problem, isn't it?  Because, you can't assume
> > that.  And the existing code is arguably kind of wrong.  But moreover, this
> > macro also assumes that the "addr" argument has no random padding bits.
> >
> > In the existing code, you can maybe make a local analysis that the code is
> > working correctly, although I'm not actually sure.
>
> I think it is but will give it a closer look.
>
> >  But if you are
> > repackaging this as a general macro under a general-sounding name, then the
> > requirements should be more stringent.
>
> Agree. That said in v2 ([1]), it has been renamed to pgstat_entry_all_zeros().
>
> I think that I will:
>
> 1/ take a closer look regarding the existing assumption
> 2/ if 1/ outcome is fine, then add more detailed comments around
> pgstat_entry_all_zeros() to make sure it's not used outside of the existing
> context
>
> Sounds good to you?
>
> [1]: https://www.postgresql.org/message-id/ZuqHLCdZXtEsbyb/%40ip-10-97-1-34.eu-west-3.compute.internal
>
> Regards,
>
> --
> Bertrand Drouvot
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>

Hi, if this is performance critical then FWIW my understanding is that
memcmp can outperform simple loop checking, and my hacky testing
seemed to confirm that.

See https://godbolt.org/z/GWY1ob9bE

static inline bool
is_all_zeros2(const char *p, size_t len)
{
    static size_t nz = 0;
    static const char *z = NULL;

    if (nz < len)
    {
        if (z) free((void *)z);
        nz = len;
        z = (const char *)calloc(1, nz);
    }

    return memcmp(p, z, len) == 0;
}

~~

Executor x86-64 gcc 14.2 (C, Editor #1)
Program stdout

Iterate 1000000 times...
check zeros using loop -- elapsed=0.041196s
check zeros using memcmp -- elapsed=0.016407s

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: define pg_structiszero(addr, s, r)

From
Peter Eisentraut
Date:
On 29.10.24 08:54, Bertrand Drouvot wrote:
> +static inline bool
> +pg_mem_is_all_zeros(const char *p, size_t len)

This should be a void * pointer please.




Re: define pg_structiszero(addr, s, r)

From
Heikki Linnakangas
Date:
On 29/10/2024 09:54, Bertrand Drouvot wrote:
>> https://godbolt.org/z/x9hPWjheq.
> 
> Yeah, I also think that's fine. Peter Smith did some testing in [1] comparing
> memcmp and simple loop checking (thanks Peter for the testing!):
> 
> "
> Iterate 1000000 times...
> check zeros using loop -- elapsed=0.041196s
> check zeros using memcmp -- elapsed=0.016407s
> "
> 
> So, in this test, the loop is 0.024789s longer means 0.024789s/1000000=24 Nanosecond
> slower per comparison (If my math is correct).

I believe that test program is bogus. Look at the assembly code; the 
compiler optimized away the loops.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Tue, Oct 29, 2024 at 11:39:03AM +0200, Heikki Linnakangas wrote:
> On 29/10/2024 09:54, Bertrand Drouvot wrote:
> > > https://godbolt.org/z/x9hPWjheq.
> > 
> > Yeah, I also think that's fine. Peter Smith did some testing in [1] comparing
> > memcmp and simple loop checking (thanks Peter for the testing!):
> > 
> > "
> > Iterate 1000000 times...
> > check zeros using loop -- elapsed=0.041196s
> > check zeros using memcmp -- elapsed=0.016407s
> > "
> > 
> > So, in this test, the loop is 0.024789s longer means 0.024789s/1000000=24 Nanosecond
> > slower per comparison (If my math is correct).
> 
> I believe that test program is bogus. Look at the assembly code; the
> compiler optimized away the loops.

Oh right. It looks like that moving the "scanf" within each loop "helps" and
that both give pretty comparable results.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Peter Eisentraut
Date:
On 29.10.24 14:58, Bertrand Drouvot wrote:
> hi,
> 
> On Tue, Oct 29, 2024 at 10:23:37AM +0100, Peter Eisentraut wrote:
>> On 29.10.24 08:54, Bertrand Drouvot wrote:
>>> +static inline bool
>>> +pg_mem_is_all_zeros(const char *p, size_t len)
>>
>> This should be a void * pointer please.
> 
> Thanks for looking at it!
> Yeah better, done in v4 attached.

Sorry for the confusion.  I didn't mean to discourage you from the const 
qualifier.  That should still be there.




Re: define pg_structiszero(addr, s, r)

From
Peter Smith
Date:
+/*
+ * Test if a memory region starting at p and of size len is full of zeroes.
+ */
+static inline bool
+pg_mem_is_all_zeros(const void *ptr, size_t len)

The function comment should say 'ptr' instead of 'p', right?

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: define pg_structiszero(addr, s, r)

From
David Rowley
Date:
On Thu, 31 Oct 2024 at 19:17, Michael Paquier <michael@paquier.xyz> wrote:
> Seems kind of OK seen from here.  I am wondering if others more
> comments about the name of the macro, its location, the fact that we
> still have pagebytes in bufpage.c, etc.

This change looks incorrect:

@@ -126,18 +124,9 @@ PageIsVerifiedExtended(Page page, BlockNumber
blkno, int flags)
    }

    /* Check all-zeroes case */
-   all_zeroes = true;
    pagebytes = (size_t *) page;
-   for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
-   {
-       if (pagebytes[i] != 0)
-       {
-           all_zeroes = false;
-           break;
-       }
-   }

-   if (all_zeroes)
+   if (pg_memory_is_all_zeros(pagebytes, (BLCKSZ / sizeof(size_t))))
        return true;

I think this should be passing BLCKSZ rather than (BLCKSZ /
sizeof(size_t)), otherwise, it'll just check the first 1 kilobyte is
zero rather than the entire page.

However, I've also performance concerns as if I look at the assembly
of PageIsVerifiedExtended, I see the zero checking is now done 1 byte
at a time:

(gcc 11.4)

leaq 1024(%rbx), %rdx <-- 1KB bug
.p2align 4,,10
.p2align 3
.L60:
cmpb $0, (%rax) <-- check 1 byte is zero.
jne .L59
addq $1, %rax <-- increment by 1 byte.
cmpq %rax, %rdx <-- check if we've done 1024 bytes yet.
jne .L60

Whereas previously it was doing:

movq %rbx, %rax
leaq 8192(%rbx), %rdx <-- 8KB
jmp .L60
.p2align 4,,10
.p2align 3
.L90:
addq $8, %rax <-- increment by 8 bytes (or sizeof(size_t))
cmpq %rax, %rdx
je .L61
.L60:
cmpq $0, (%rax)  <-- checks an entire 8 bytes are zero.

I didn't test how performance-critical this is, but the header comment
for this function does use the words "cheaply detect".

 * This is called when a page has just been read in from disk.  The idea is
 * to cheaply detect trashed pages before we go nuts following bogus line
 * pointers, testing invalid transaction identifiers, etc.

so it seems a bit dangerous to switch this loop to byte-at-a-time
rather than doing 8 bytes at once without testing the performance
isn't affected.

David



Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Fri, Nov 01, 2024 at 05:45:44PM +1300, David Rowley wrote:
> On Thu, 31 Oct 2024 at 19:17, Michael Paquier <michael@paquier.xyz> wrote:
> > Seems kind of OK seen from here.  I am wondering if others more
> > comments about the name of the macro, its location, the fact that we
> > still have pagebytes in bufpage.c, etc.
> 
> This change looks incorrect:
> 
> @@ -126,18 +124,9 @@ PageIsVerifiedExtended(Page page, BlockNumber
> blkno, int flags)
>     }
> 
>     /* Check all-zeroes case */
> -   all_zeroes = true;
>     pagebytes = (size_t *) page;
> -   for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
> -   {
> -       if (pagebytes[i] != 0)
> -       {
> -           all_zeroes = false;
> -           break;
> -       }
> -   }
> 
> -   if (all_zeroes)
> +   if (pg_memory_is_all_zeros(pagebytes, (BLCKSZ / sizeof(size_t))))
>         return true;
> 
> I think this should be passing BLCKSZ rather than (BLCKSZ /
> sizeof(size_t)), otherwise, it'll just check the first 1 kilobyte is
> zero rather than the entire page.

Thanks for looking at it!

Nice catch, indeed by using the new function we are changing the pointer arithmetic
here and then we should pass BLCKSZ instead.

> However, I've also performance concerns as if I look at the assembly
> of PageIsVerifiedExtended, I see the zero checking is now done 1 byte
> at a time:
> 
> (gcc 11.4)
> 
> leaq 1024(%rbx), %rdx <-- 1KB bug
> .p2align 4,,10
> .p2align 3
> .L60:
> cmpb $0, (%rax) <-- check 1 byte is zero.
> jne .L59
> addq $1, %rax <-- increment by 1 byte.
> cmpq %rax, %rdx <-- check if we've done 1024 bytes yet.
> jne .L60
> 
> Whereas previously it was doing:
> 
> movq %rbx, %rax
> leaq 8192(%rbx), %rdx <-- 8KB
> jmp .L60
> .p2align 4,,10
> .p2align 3
> .L90:
> addq $8, %rax <-- increment by 8 bytes (or sizeof(size_t))
> cmpq %rax, %rdx
> je .L61
> .L60:
> cmpq $0, (%rax)  <-- checks an entire 8 bytes are zero.
> 
> I didn't test how performance-critical this is, but the header comment
> for this function does use the words "cheaply detect".
> 
>  * This is called when a page has just been read in from disk.  The idea is
>  * to cheaply detect trashed pages before we go nuts following bogus line
>  * pointers, testing invalid transaction identifiers, etc.
> 
> so it seems a bit dangerous to switch this loop to byte-at-a-time
> rather than doing 8 bytes at once without testing the performance
> isn't affected.

Agree, I did a quick test (see [0]) and it looks like it's significantly slower
using the new inline function.

We could try to write a more elaborate version of pg_memory_is_all_zeros(), but
as it looks like there is only one use case, then it's probably better to not
implement (revert) this change here and "just" add a comment as to why pg_memory_is_all_zeros()
should not be used here, thoughts?

[0]: https://godbolt.org/z/xqnW4MPY5

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
David Rowley
Date:
On Fri, 1 Nov 2024 at 19:33, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Agree, I did a quick test (see [0]) and it looks like it's significantly slower
> using the new inline function.
>
> We could try to write a more elaborate version of pg_memory_is_all_zeros(), but
> as it looks like there is only one use case, then it's probably better to not
> implement (revert) this change here and "just" add a comment as to why pg_memory_is_all_zeros()
> should not be used here, thoughts?

My vote is to just revert this usage of the function. Anything more
elaborate would need to check pointer alignment before using any types
larger than char. The previous code does not need to do that because
the page is going to be at least MAXALIGNed.

David



Re: define pg_structiszero(addr, s, r)

From
David Rowley
Date:
On Fri, 1 Nov 2024 at 20:14, Michael Paquier <michael@paquier.xyz> wrote:
> 2) On HEAD at 49d6c7d8daba:
> .LVL299:
>    .loc 1 131 16 is_stmt 0 discriminator 1 view .LVU524
>    cmpq    $8192, %rbx
>    je      .L419
>
> 3) With the patch sent at [1]:
> .LVL306:
>    .loc 3 201 23 is_stmt 1 discriminator 1 view .LVU545
>    cmpq    $8192, %rbx
>    jne     .L417
>
> So it does not matter one way or another for 2) or 3), does it?

The patch in [1] will fix the bug. But I'm still concerned about the
performance implications of moving to byte-at-a-time processing. There
are about 8 times more instructions being expected to do the same
work.

David

> [1]: https://www.postgresql.org/message-id/ZyR02ofHiWG1HmLI@paquier.xyz



Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Fri, Nov 01, 2024 at 04:36:45PM +0900, Michael Paquier wrote:
> On Fri, Nov 01, 2024 at 08:21:50PM +1300, David Rowley wrote:
> > My vote is to just revert this usage of the function. Anything more
> > elaborate would need to check pointer alignment before using any types
> > larger than char. The previous code does not need to do that because
> > the page is going to be at least MAXALIGNed.
> 
> Fine, here you go.  The attached reverts back this part in bufpage.c
> to what it was in 49d6c7d8daba.

Thanks! Worth to add a comment as to why pg_memory_is_all_zeros() should not
be used here?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> On Fri, 1 Nov 2024 at 20:49, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Nov 01, 2024 at 07:44:22AM +0000, Bertrand Drouvot wrote:
> > > Worth to add a comment as to why pg_memory_is_all_zeros() should not
> > > be used here?
> >
> > I would not add one in bufpage.c, documenting that where
> > pg_memory_is_all_zeros() is defined may be more adapted.
> 
> The thought of having to write a comment to warn people not to use it
> for performance-critical things makes me think it might be better just
> to write a more optimal version of the function so we don't need to
> warn people.

Yeah, that's probably a good idea to write a more elaborate function.

> I looked around at the callers of the function I saw the
> following numbers of bytes being used for the length: 8192 (the one in
> question), 88, 32 and 112.
> 
> I don't know how performance-critical the final three of those are,
> but I imagine all apart from the 32-byte one might be better with a
> non-inlined and more optimised version of the function. The problem
> with inlining the optimised version is that it's more code to inline.

I agree that's more code to inline and contains multiple loops and branches.

For the last 3 callers, I think that non inline would still be "cheap" as compared
to what lead to those code paths (stats increments).

> I've attached what I thought a more optimal version might look like in
> case anyone thinks making it better is a good idea.
> 

Thanks for the proposal!

I like the idea, I think that's worth to add a few comments, something like:

1 ===

+  while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)

Add a comment like "Checks bytes, byte by byte, until the pointer is aligned"?

2 ===

+  for (; p < aligned_end; p += sizeof(size_t))

Add a comment like "Multiple bytes comparison(s) at once"?

3 ===

+  while (p < end)
+  {

Add a comment like "Compare remaining bytes, byte by byte"?

4 ===

Out of curiosity I did test your proposal and it performs well (see [0]) for
the PageIsVerifiedExtended() case.


[0]: https://godbolt.org/z/Mdaxz5W7c

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Fri, Nov 01, 2024 at 12:50:10PM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> > On Fri, 1 Nov 2024 at 20:49, Michael Paquier <michael@paquier.xyz> wrote:
> > I've attached what I thought a more optimal version might look like in
> > case anyone thinks making it better is a good idea.
> > 
> 
> Thanks for the proposal!
> 
> I like the idea, I think that's worth to add a few comments, something like:
> 
> 1 ===
> 
> +  while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
> 
> Add a comment like "Checks bytes, byte by byte, until the pointer is aligned"?
> 
> 2 ===
> 
> +  for (; p < aligned_end; p += sizeof(size_t))
> 
> Add a comment like "Multiple bytes comparison(s) at once"?
> 
> 3 ===
> 
> +  while (p < end)
> +  {
> 
> Add a comment like "Compare remaining bytes, byte by byte"?
> 
> 4 ===
> 
> Out of curiosity I did test your proposal and it performs well (see [0]) for
> the PageIsVerifiedExtended() case.

Also, 

5 ===

Shouldn't we handle the cases where ptr is NULL and/or len == 0? (should probably
have already been done in the current version of pg_memory_is_all_zeros() though).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
David Rowley
Date:
On Sat, 2 Nov 2024 at 01:50, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> > I've attached what I thought a more optimal version might look like in
> > case anyone thinks making it better is a good idea.
> >
>
> Thanks for the proposal!
>
> I like the idea, I think that's worth to add a few comments, something like:

I'm happy if you want to pick this up and continue working on it. I'm
mostly just keen to not leave the suboptimal version of the function
in core as it is.

One thing you might want to consider is if it's worth having a macro
to help decide if you want to inline the function for smaller sizes
and not inline for larger sizes. A macro that checks something like:
if (__builtin_constant_p(len) && len <= 32) could call an inline
version of the function for smaller sizes and do a function call for
lager sizes. Compilers seem to have heuristics that result in
behaviour like this for library functions such as memset and memcpy.
Maybe some experimentation with godbolt.org would yield the crossover
point in bytes where compilers switch tactics.

I just feel that at the rate we receive small code change suggestions
on the mailing lists, it's just a matter of time before someone will
come along and suggest we use pg_memory_is_all_zeros() in
PageIsVerifiedExtended() again. If we don't optimize that function,
then there's a chance a committer might re-commit what's just been
reverted.

David



Re: define pg_structiszero(addr, s, r)

From
Ranier Vilela
Date:
Hi.

Em seg., 4 de nov. de 2024 às 14:18, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> escreveu:
Hi,

On Tue, Nov 05, 2024 at 12:24:48AM +1300, David Rowley wrote:
> On Sat, 2 Nov 2024 at 01:50, Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> > > I've attached what I thought a more optimal version might look like in
> > > case anyone thinks making it better is a good idea.
> > >
> >
> > Thanks for the proposal!
> >
> > I like the idea, I think that's worth to add a few comments, something like:
>
> I'm happy if you want to pick this up and continue working on it.

Sure, please find attached v1, the changes are:

- switch from "const char" to "const unsigned char" (could have been done in the
current version of pg_memory_is_all_zeros() though)
- added some comments
- adding an Assert for ptr != 0
- re-introduce the function call in PageIsVerifiedExtended()
- propose a commit message

I think we can add a small optimization to this last patch [1].
The variable *aligned_end* is only needed in the second loop (for).
So, only before the for loop do we actually declare it.

Result before this change:
check zeros using BERTRAND 1 0.000031s

Result after this change:
check zeros using BERTRAND 1 0.000018s

+ const unsigned char *aligned_end;

+ /* Multiple bytes comparison(s) at once */
+ aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1)));
+ for (; p < aligned_end; p += sizeof(size_t))

best regards,
Ranier Vilela

Re: define pg_structiszero(addr, s, r)

From
David Rowley
Date:
On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier.vf@gmail.com> wrote:
> I think we can add a small optimization to this last patch [1].

I think if you want to make it faster, you could partially unroll the
inner-most loop, like:

// size_t * 4
for (; p < aligned_end - (sizeof(size_t) * 3); p += sizeof(size_t) * 4)
{
    if (((size_t *) p)[0] != 0 | ((size_t *) p)[1] != 0 | ((size_t *)
p)[2] != 0 | ((size_t *) p)[3] != 0)
        return false;
}

$ gcc allzeros.c -O2 -o allzeros && ./allzeros
char: done in 1595000 nanoseconds
size_t: done in 198300 nanoseconds (8.04337 times faster than char)
size_t * 4: done in 81500 nanoseconds (19.5706 times faster than char)
size_t * 8: done in 71000 nanoseconds (22.4648 times faster than char)

The final one above is 110GB/sec, so probably only going that fast
because the memory being checked is in L1. DDR5 is only 64GB/sec. So
it's probably overkill to unroll the loop that much.

Also, doing something like that means the final byte-at-a-time loop
might have more to do, which might cases with a long remainder slower.
To make up for that there's some incentive to introduce yet another
loop to process single size_t's up to aligned_end. Then you end up
with even more code.

I was happy enough with my patch with Bertrand's comments.  I'm not
sure why unsigned chars are better than chars. It doesn't seem to have
any effect on the compiled code.

David



Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Tue, Nov 05, 2024 at 05:08:41PM +1300, David Rowley wrote:
> I was happy enough with my patch with Bertrand's comments.  I'm not
> sure why unsigned chars are better than chars. It doesn't seem to have
> any effect on the compiled code.
> 

I think that unsigned chars is better than char for byte level memory
operations (no sign extension issues).

Though I agree that using char in this function is not an issue as p is only
compared with 0. This is more a matter of taste here.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Ranier Vilela
Date:


Em ter., 5 de nov. de 2024 às 00:23, David Rowley <dgrowleyml@gmail.com> escreveu:
On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier.vf@gmail.com> wrote:
> I think we can add a small optimization to this last patch [1].
> The variable *aligned_end* is only needed in the second loop (for).
> So, only before the for loop do we actually declare it.
>
> Result before this change:
> check zeros using BERTRAND 1 0.000031s
>
> Result after this change:
> check zeros using BERTRAND 1 0.000018s
>
> + const unsigned char *aligned_end;
>
> + /* Multiple bytes comparison(s) at once */
> + aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1)));
> + for (; p < aligned_end; p += sizeof(size_t))

I think we all need to stop using Godbolt's servers to run benchmarks
on. These servers are likely to be running various other workloads in
highly virtualised environments and are not going to be stable servers
that would give consistent benchmark results.

I tried your optimisation in the attached allzeros.c and here are my results:

# My version
$ gcc allzeros.c -O2 -o allzeros && for i in {1..3}; do ./allzeros; done
char: done in 1566400 nanoseconds
size_t: done in 195400 nanoseconds (8.01638 times faster than char)
char: done in 1537500 nanoseconds
size_t: done in 196300 nanoseconds (7.8324 times faster than char)
char: done in 1543600 nanoseconds
size_t: done in 196300 nanoseconds (7.86347 times faster than char)

# Ranier's optimization
$ gcc allzeros.c -O2 -D RANIERS_OPTIMIZATION -o allzeros && for i in
{1..3}; do ./allzeros; done
char: done in 1943100 nanoseconds
size_t: done in 531700 nanoseconds (3.6545 times faster than char)
char: done in 1957200 nanoseconds
size_t: done in 458400 nanoseconds (4.26963 times faster than char)
char: done in 1949500 nanoseconds
size_t: done in 469000 nanoseconds (4.15672 times faster than char)

Seems to be about half as fast with gcc on -O2
Thanks for test coding.

I've tried with msvc 2022 32bits
Here the results:

C:\usr\src\tests\allzeros>allzeros
char: done in 71431900 nanoseconds
size_t: done in 18010900 nanoseconds (3.96604 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 71070100 nanoseconds
size_t: done in 19654300 nanoseconds (3.61601 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 68682400 nanoseconds
size_t: done in 19841100 nanoseconds (3.46162 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 63215100 nanoseconds
size_t: done in 17920200 nanoseconds (3.52759 times faster than char)


C:\usr\src\tests\allzeros>c /DRANIERS_OPTIMIZATION

Microsoft (R) Program Maintenance Utility Versão 14.40.33813.0
Direitos autorais da Microsoft Corporation. Todos os direitos reservados.

C:\usr\src\tests\allzeros>allzeros
char: done in 67213800 nanoseconds
size_t: done in 15049200 nanoseconds (4.46627 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 51505900 nanoseconds
size_t: done in 13645700 nanoseconds (3.77452 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 62852600 nanoseconds
size_t: done in 17863800 nanoseconds (3.51843 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 51877200 nanoseconds
size_t: done in 13759900 nanoseconds (3.77017 times faster than char)

The function used to replace clock_getime is:
timespec_get(ts, TIME_UTC)

best regards,
Ranier Vilela

Re: define pg_structiszero(addr, s, r)

From
Ranier Vilela
Date:

Em ter., 5 de nov. de 2024 às 01:12, Michael Paquier <michael@paquier.xyz> escreveu:
On Tue, Nov 05, 2024 at 04:23:34PM +1300, David Rowley wrote:
> I tried your optimisation in the attached allzeros.c and here are my results:
>
> # My version
> $ gcc allzeros.c -O2 -o allzeros && for i in {1..3}; do ./allzeros; done
> char: done in 1543600 nanoseconds
> size_t: done in 196300 nanoseconds (7.86347 times faster than char)
>
> # Ranier's optimization
> $ gcc allzeros.c -O2 -D RANIERS_OPTIMIZATION -o allzeros && for i in
> size_t: done in 531700 nanoseconds (3.6545 times faster than char)
> char: done in 1957200 nanoseconds

I am not seeing numbers as good as yours, but the winner is clear as
well here:
Thanks for testing.
 

$ gcc allzeros.c -O2 -o allzeros && for i in {1..3}; do
./allzeros; done
char: done in 6578995 nanoseconds
size_t: done in 829916 nanoseconds (7.9273 times faster than char)
char: done in 6581465 nanoseconds
size_t: done in 829948 nanoseconds (7.92997 times faster than char)
char: done in 6585748 nanoseconds
size_t: done in 834929 nanoseconds (7.88779 times faster than char)

$ gcc allzeros.c -O2 -D RANIERS_OPTIMIZATION -o allzeros && for i in
{1..3}; do ./allzeros;
done char: done in 6591803 nanoseconds
size_t: done in 1236102 nanoseconds (5.33273 times faster than char)
char: done in 6606219 nanoseconds
size_t: done in 1235979 nanoseconds (5.34493 times faster than char)
char: done in 6594413 nanoseconds
size_t: done in 1238770 nanoseconds (5.32336 times faster than char)

I'm surprised to see that assigning aligned_end at these two different
locations has this much effect once the compiler optimizes the
surroundings, but well.
I think that's a plus point for the benefit of not touching the memory if it's not explicitly necessary.

best regards,
Ranier Vilela

Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Tue, Nov 05, 2024 at 05:08:41PM +1300, David Rowley wrote:
> On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier.vf@gmail.com> wrote:
> > I think we can add a small optimization to this last patch [1].
> 
> I think if you want to make it faster, you could partially unroll the
> inner-most loop, like:
> 
> // size_t * 4
> for (; p < aligned_end - (sizeof(size_t) * 3); p += sizeof(size_t) * 4)
> {
>     if (((size_t *) p)[0] != 0 | ((size_t *) p)[1] != 0 | ((size_t *)
> p)[2] != 0 | ((size_t *) p)[3] != 0)
>         return false;
> }

Another option could be to use SIMD instructions to check multiple bytes
is zero in a single operation. Maybe just an idea to keep in mind and experiment
if we feel the need later on.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
David Rowley
Date:
On Wed, 6 Nov 2024 at 04:03, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Another option could be to use SIMD instructions to check multiple bytes
> is zero in a single operation. Maybe just an idea to keep in mind and experiment
> if we feel the need later on.

Could do. I just wrote it that way to give the compiler flexibility to
do SIMD implicitly. That seemed easier than messing around with SIMD
intrinsics. I guess the compiler won't use SIMD with the single
size_t-at-a-time version as it can't be certain it's ok to access the
memory beyond the first zero word. Because I wrote the "if" condition
using bitwise-OR, there's no boolean short-circuiting, so the compiler
sees it must be safe to access all the memory for the loop iteration.

If I use -march=native or -march=znver2 on my Zen2 machine, gcc does
use SIMD operators.  Clang uses some 128-bit registers without
specifying -march:

drowley@amd3990x:~$ gcc -O2 allzeros.c -march=native -o allzeros &&
for i in {1..3}; do ./allzeros; done
char: done in 1940539 nanoseconds
size_t: done in 261731 nanoseconds (7.41425 times faster than char)
size_t * 4: done in 130415 nanoseconds (14.8797 times faster than char)
size_t * 8: done in 70031 nanoseconds (27.7097 times faster than char)
char: done in 3030132 nanoseconds
size_t: done in 477044 nanoseconds (6.35189 times faster than char)
size_t * 4: done in 123551 nanoseconds (24.5254 times faster than char)
size_t * 8: done in 68549 nanoseconds (44.2039 times faster than char)
char: done in 3214037 nanoseconds
size_t: done in 256901 nanoseconds (12.5108 times faster than char)
size_t * 4: done in 126017 nanoseconds (25.5048 times faster than char)
size_t * 8: done in 73167 nanoseconds (43.9274 times faster than char)

David



Re: define pg_structiszero(addr, s, r)

From
David Rowley
Date:
On Wed, 6 Nov 2024 at 13:20, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Nov 06, 2024 at 12:16:33PM +1300, David Rowley wrote:
> > On Wed, 6 Nov 2024 at 04:03, Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> >> Another option could be to use SIMD instructions to check multiple bytes
> >> is zero in a single operation. Maybe just an idea to keep in mind and experiment
> >> if we feel the need later on.
> >
> > Could do. I just wrote it that way to give the compiler flexibility to
> > do SIMD implicitly. That seemed easier than messing around with SIMD
> > intrinsics. I guess the compiler won't use SIMD with the single
> > size_t-at-a-time version as it can't be certain it's ok to access the
> > memory beyond the first zero word. Because I wrote the "if" condition
> > using bitwise-OR, there's no boolean short-circuiting, so the compiler
> > sees it must be safe to access all the memory for the loop iteration.
>
> How complex would that be compared to the latest patch proposed if
> done this way?  If you can force SIMD without having to know about
> these specific gcc switches (aka -march is not set by default in the
> tree except for some armv8 path), then the performance happens
> magically.  If that makes the code more readable, that's even better.

We could just write it that way and leave it up to the compiler to
decide whether to use SIMD or not. Going by [1], gcc with -O2 uses
SIMD instructions from 14.1 and clang with -O2 does it from version
8.0.0. gcc 14.1 was release in May 2024, so still quite new. It'll be
quite a bit older once PG18 is out.  Using the bitwise-OR method, more
and more people will benefit as gcc14.1 and beyond becomes more
mainstream.

Clang 8.0.0 is from March 2019, so quite old already.

David

[1] https://godbolt.org/z/MTqao8scW



Re: define pg_structiszero(addr, s, r)

From
David Rowley
Date:
On Wed, 6 Nov 2024 at 13:52, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Nov 06, 2024 at 01:38:50PM +1300, David Rowley wrote:
> > We could just write it that way and leave it up to the compiler to
> > decide whether to use SIMD or not. Going by [1], gcc with -O2 uses
> > SIMD instructions from 14.1 and clang with -O2 does it from version
> > 8.0.0. gcc 14.1 was release in May 2024, so still quite new. It'll be
> > quite a bit older once PG18 is out.  Using the bitwise-OR method, more
> > and more people will benefit as gcc14.1 and beyond becomes more
> > mainstream.
> >
> > Clang 8.0.0 is from March 2019, so quite old already.
>
> Okay, WFM to keep things the way they are in the patch.

I'm not sure if I'm clear on what works for you.  The latest patch I
saw did 1 size_t per iteration. Are you saying we should do just
size_t per loop? or we should form the code in a way that allows the
compiler to use SIMD instructions?

David



Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Wed, Nov 06, 2024 at 01:44:58PM +0900, Michael Paquier wrote:
> Should the last loop check only 1 byte at a time or should this stuff
> include one more step before the last one you wrote to do a couple of
> checks with size_t?  That may matter for areas small enough (len <
> sizeof(size_t) * 8) causing the second step to not be taken, but large
> enough (len > sizeof(size_t)) to apply a couple of size_t checks per
> loop. 

Do you mean add:

"
for (; p < aligned_end; p += sizeof(size_t))
{
   if (*(size_t *)p != 0)
       return false;
}
"

just before the last loop?

If so, I did a few tests and did not see any major improvements. So, I thought
it's simpler to not add more code in this inline function in v7 shared up-thread.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Peter Eisentraut
Date:
On 05.11.24 16:03, Bertrand Drouvot wrote:
> On Tue, Nov 05, 2024 at 05:08:41PM +1300, David Rowley wrote:
>> On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier.vf@gmail.com> wrote:
>>> I think we can add a small optimization to this last patch [1].
>>
>> I think if you want to make it faster, you could partially unroll the
>> inner-most loop, like:
>>
>> // size_t * 4
>> for (; p < aligned_end - (sizeof(size_t) * 3); p += sizeof(size_t) * 4)
>> {
>>      if (((size_t *) p)[0] != 0 | ((size_t *) p)[1] != 0 | ((size_t *)
>> p)[2] != 0 | ((size_t *) p)[3] != 0)
>>          return false;
>> }
> 
> Another option could be to use SIMD instructions to check multiple bytes
> is zero in a single operation. Maybe just an idea to keep in mind and experiment
> if we feel the need later on.

Speaking of which, couldn't you just use

     pg_popcount(ptr, len) == 0

?




Re: define pg_structiszero(addr, s, r)

From
David Rowley
Date:
On Thu, 7 Nov 2024 at 07:34, Peter Eisentraut <peter@eisentraut.org> wrote:
> Speaking of which, couldn't you just use
>
>      pg_popcount(ptr, len) == 0

That might be quite good for small lengths or for use cases where the
memory is always or almost always zero. The problem is there's no
early exit when you find the first non-zero which means, for larger
lengths, reading much more memory. That'll both take longer and
possibly evict cache lines which might be useful to have in the near
future.

David



Re: define pg_structiszero(addr, s, r)

From
David Rowley
Date:
On Thu, 7 Nov 2024 at 00:40, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Do you mean add:
>
> "
> for (; p < aligned_end; p += sizeof(size_t))
> {
>    if (*(size_t *)p != 0)
>        return false;
> }
> "
>
> just before the last loop?
>
> If so, I did a few tests and did not see any major improvements. So, I thought
> it's simpler to not add more code in this inline function in v7 shared up-thread.

Did you try with a size where there's a decent remainder, say 124
bytes? FWIW, one of the cases has 112 bytes, and I think that is
aligned memory meaning we'll do the first 64 in the SIMD loop and have
to do 48 bytes in the byte-at-a-time loop. If you had the loop Michael
mentioned, that would instead be 6 loops of size_t-at-a-time.

David



Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Thu, Nov 07, 2024 at 09:45:44AM +0900, Michael Paquier wrote:
> On Thu, Nov 07, 2024 at 08:05:10AM +1300, David Rowley wrote:
> > That might be quite good for small lengths or for use cases where the
> > memory is always or almost always zero. The problem is there's no
> > early exit when you find the first non-zero which means, for larger
> > lengths, reading much more memory. That'll both take longer and
> > possibly evict cache lines which might be useful to have in the near
> > future.
> 
> Didn't know this one either, thanks for the explanation.

+1, thanks!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
David Rowley
Date:
On Fri, 8 Nov 2024 at 18:34, Michael Paquier <michael@paquier.xyz> wrote:
> I've done a round of comment and term cleanup for the whole patch,
> while on it.

I don't think "intrinsics" is the correct word to use here:

+ * - 8 * sizeof(size_t) comparisons using bitwise OR, to encourage compilers
+ *   to use SIMD intrinsics if available, up to the last aligned location

and

+ * All comparisons are combined with a single OR operation, making it a
+ * good candidate for SIMD intrinsics, if available.

an intrinsic function is a function built into the compiler that
provides some lower-level functionality. e.g. __builtin_popcount().

I'm slightly worried due to the current rate we're receiving cleanup
suggestions that someone might come along and think they'd be doing us
a favour by submitting a patch to "fixup the inefficient bitwise-ORs
and use boolean-OR". Maybe a comment like the following might prevent
that from happening.

+ * For performance reasons, we manually unroll this loop and purposefully
+ * use bitwise-ORs to combine each comparison.  This prevents boolean
+ * short-circuiting and lets the compiler know that it's safe to access all 8
+ * elements regardless of the result of the other comparisons.  This seems
+ * to be enough to coax a few compilers into using SIMD instructions.

> Btw, gcc seems a bit smarter than clang when it comes to optimizing
> the code depending on the size of the structures.  gcc gives up on
> SIMD if it's sure that the structure on which we are going to use the
> allzero check won't need it at all, and clang keeps it even if it does
> not need it.  That was interesting to see, while going through the
> review..

Can you share your test case for this?  I tried with [1] and the
latest gcc does not seem to be smart enough to figure this out.  I
tried adding some additional len checks that the compiler can use as a
cue and won't need to emit code for the checks providing the function
does get inlined. That was enough to get the compiler to not emit the
loops when they'll not be used. See the -DCHECK_LEN flag I'm passing
in the 2nd compiler window. I just don't know if putting something
like that into the code is a good idea as if the function wasn't
inlined for some reason, the extra len checks would have to be
compiled into the function.

David

[1] https://godbolt.org/z/xa81ro8GK



Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Sat, Nov 09, 2024 at 08:00:35AM +0900, Michael Paquier wrote:
> On Fri, Nov 08, 2024 at 11:18:09PM +1300, David Rowley wrote:
> > I'm slightly worried due to the current rate we're receiving cleanup
> > suggestions that someone might come along and think they'd be doing us
> > a favour by submitting a patch to "fixup the inefficient bitwise-ORs
> > and use boolean-OR". Maybe a comment like the following might prevent
> > that from happening.
> 
> Not sure, but OK by me to tweak things more.
> 
> > Can you share your test case for this?  I tried with [1] and the
> > latest gcc does not seem to be smart enough to figure this out.  I
> > tried adding some additional len checks that the compiler can use as a
> > cue and won't need to emit code for the checks providing the function
> > does get inlined. That was enough to get the compiler to not emit the
> > loops when they'll not be used. See the -DCHECK_LEN flag I'm passing
> > in the 2nd compiler window. I just don't know if putting something
> > like that into the code is a good idea as if the function wasn't
> > inlined for some reason, the extra len checks would have to be
> > compiled into the function.
> 
> Feel free to use that (I hope it works), and see the difference once
> the aligned structure is 121 bytes or more:
> https://godbolt.org/z/94393nPGG
> 
> At least, I can see that the SIMD loop is ignored.

What I see (with the godbolt you shared) is that with BLCKSZ of 120:

gcc: then no SIMD instructions are used (I think that's because sizeof(AlignedBlock)
is 120 which is not a multiple of 16 (SIMD xmm register size)).

with BLCKSZ of 121:

gcc: then SIMD instructions are used (I think that's because sizeof(AlignedBlock)
is 128 which is a multiple of 16 (SIMD xmm register size)).

While clang uses SIMD instructions in both cases (more complex code with more
branches at least in the 120 case).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Fri, Nov 08, 2024 at 11:18:09PM +1300, David Rowley wrote:
> I tried with [1] and the
> latest gcc does not seem to be smart enough to figure this out.  I
> tried adding some additional len checks that the compiler can use as a
> cue and won't need to emit code for the checks providing the function
> does get inlined. That was enough to get the compiler to not emit the
> loops when they'll not be used. See the -DCHECK_LEN flag I'm passing
> in the 2nd compiler window. I just don't know if putting something
> like that into the code is a good idea as if the function wasn't
> inlined for some reason, the extra len checks would have to be
> compiled into the function.
> 
> David
> 
> [1] https://godbolt.org/z/xa81ro8GK

Looking at it, that looks like an issue.

I mean, without the -DCHECK_LEN flag then the SIMD code will read up to 48 bytes
beyond the struct's memory (which is 16 bytes):

This is fine:
"
movdqu  xmm0, XMMWORD PTR [rdi]
"

But I don't think it is:

"
movdqu  xmm2, XMMWORD PTR [rdi+16]
movdqu  xmm1, XMMWORD PTR [rdi+32]
movdqu  xmm3, XMMWORD PTR [rdi+48]
"

given that the struct size is only 16 bytes.

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Sat, Nov 09, 2024 at 04:15:04AM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Fri, Nov 08, 2024 at 11:18:09PM +1300, David Rowley wrote:
> > I tried with [1] and the
> > latest gcc does not seem to be smart enough to figure this out.  I
> > tried adding some additional len checks that the compiler can use as a
> > cue and won't need to emit code for the checks providing the function
> > does get inlined. That was enough to get the compiler to not emit the
> > loops when they'll not be used. See the -DCHECK_LEN flag I'm passing
> > in the 2nd compiler window. I just don't know if putting something
> > like that into the code is a good idea as if the function wasn't
> > inlined for some reason, the extra len checks would have to be
> > compiled into the function.
> > 
> > David
> > 
> > [1] https://godbolt.org/z/xa81ro8GK
> 
> Looking at it, that looks like an issue.
> 
> I mean, without the -DCHECK_LEN flag then the SIMD code will read up to 48 bytes
> beyond the struct's memory (which is 16 bytes):
> 
> This is fine:
> "
> movdqu  xmm0, XMMWORD PTR [rdi]
> "
> 
> But I don't think it is:
> 
> "
> movdqu  xmm2, XMMWORD PTR [rdi+16]
> movdqu  xmm1, XMMWORD PTR [rdi+32]
> movdqu  xmm3, XMMWORD PTR [rdi+48]
> "
> 
> given that the struct size is only 16 bytes.
> 
> Thoughts?

What about "simply" starting pg_memory_is_all_zeros() with?

"
    if (len < sizeof(size_t)*8) {
        while (p < end) {
            if (*p++ != 0)
                return false;
        }
        return true;
    }
"

That way:

- we prevents reading beyond the memory area in the SIMD section (if < 64 bytes)
- we make sure that aligned_end can not be after the real end (could be if the
len is < 8 bytes)
- there is no need for additional size checks later in the code
- len < 64 bytes will be read byte per byte but that's likely "enough" (if not
faster) for those "small" sizes

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Tue, Nov 12, 2024 at 12:28:53PM +0900, Michael Paquier wrote:
> On Mon, Nov 11, 2024 at 05:07:51PM +0000, Bertrand Drouvot wrote:
> > To handle the "what about the len check if the function is not inlined?", I
> > can't think about a good approach.
> 
> FWIW, my choice would be to not over-engineer things more than what's
> in v10 posted at [1], hence do something without the exception case
> where the size is less than 64b.

I think that the 64b len check done in v11 is mandatory for safety reasons.

1. First reason:

"
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
"

The loop above reads 64 bytes at once, so would read beyond the memory area bounds
if len < 64: That could cause crash or read invalid data.

It's observed in [1] (using the godbolt shared in [2]), where we can see:

"
movdqu  xmm2, XMMWORD PTR [rdi+16]
movdqu  xmm1, XMMWORD PTR [rdi+32]
movdqu  xmm3, XMMWORD PTR [rdi+48]
"

while the struct size is 16 bytes (so we are reading 48 bytes beyond it).

2. Second reason

"
const unsigned char *aligned_end = (const unsigned char *)
    ((uintptr_t) end & (~(sizeof(size_t) - 1)));
"

aligned_end could be beyond the end for len < 8, so that we could read 
invalid data or crash here:

"
for (; p < aligned_end; p += sizeof(size_t)) {
"

The len < 8 check is covered into the len < 64 check, so only the 64b check is
needed.

[1]: https://www.postgresql.org/message-id/Zy7hyG8JUMC5P2T3%40ip-10-97-1-34.eu-west-3.compute.internal
[2]: https://www.postgresql.org/message-id/CAApHDvp2jx_%3DpFbgj-O1_ZmzP9WOZKfwLzZrS_%3DZmbsqMQQ59g%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Ranier Vilela
Date:

Em ter., 12 de nov. de 2024 às 21:33, Michael Paquier <michael@paquier.xyz> escreveu:
On Tue, Nov 12, 2024 at 01:32:36PM -0300, Ranier Vilela wrote:
> See v1_allzeros_small.c attached.

In your pg_memory_is_all_zeros_v11:
    while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
    {
        if (p == end)
            return true;

        if (*p++ != 0)
            return false;
    }

    if (len > sizeof(size_t) * 8)
    {
      for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
      {
          if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
              (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
              (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
              (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
              return false;
      }
    }

If I'm reading that right, this could still read a couple of bytes
past the wanted memory area. 
Yeah, this is possible.
 
For example, imagine a case of 65 bytes
with a location a bit unaligned (more than 2 bytes).  You'd want to
check the remaining size after the first loop, not the initial one.

I'd be OK to have a quick loop for the less-than-64-byte case rather
than more checks depending on sizeof(size_t) spread, like Bertrand is
suggesting.
I'm ok too. Maybe we are trying to optimize early.
 
best regards,
Ranier Vilela

Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Tue, Nov 12, 2024 at 01:32:36PM -0300, Ranier Vilela wrote:
> It seems to me that it is enough to protect the SIMD loop when the size is
> smaller.
> 
>     if (len > sizeof(size_t) * 8)
>     {
>       for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) *
> 8)
>       {
>           if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
>               (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
>               (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
>               (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
>               return false;
>       }
>     }
> 
> See v1_allzeros_small.c attached.

Thanks for looking at it!

It's not enough, as that would not fix the second reason mentioned in [1].


[1]: https://www.postgresql.org/message-id/ZzLxAJuGzyqA7cUo%40ip-10-97-1-34.eu-west-3.compute.internal 


Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Ranier Vilela
Date:


Em qua., 13 de nov. de 2024 às 04:50, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> escreveu:
Hi,

On Wed, Nov 13, 2024 at 09:25:37AM +0900, Michael Paquier wrote:
> So that seems worth the addition, especially for
> smaller sizes where this is 6 times faster here.

So, something like v12 in pg_memory_is_all_zeros_v12() in allzeros_small.c
attached?
I ran the latest version (allzeros_small.c) with v12.
 

If so, that gives us:

== with BLCKSZ 32

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 22421 nanoseconds
size_t: done in 7269 nanoseconds (3.08447 times faster than byte per byte)
SIMD v10: done in 6349 nanoseconds (3.53142 times faster than byte per byte)
SIMD v11: done in 22080 nanoseconds (1.01544 times faster than byte per byte)
SIMD v12: done in 5595 nanoseconds (4.00733 times faster than byte per byte)
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 43882 nanoseconds
size_t: done in 8845 nanoseconds (4.96122 times faster than byte per byte)
SIMD v10: done in 10673 nanoseconds (4.1115 times faster than byte per byte)
SIMD v11: done in 29177 nanoseconds (1.50399 times faster than byte per byte)
SIMD v12: done in 9992 nanoseconds (4.39171 times faster than byte per byte)


== with BLCKSZ 63

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 29525 nanoseconds
size_t: done in 11232 nanoseconds (2.62865 times faster than byte per byte)
SIMD v10: done in 10828 nanoseconds (2.72673 times faster than byte per byte)
SIMD v11: done in 42056 nanoseconds (0.70204 times faster than byte per byte)
SIMD v12: done in 10468 nanoseconds (2.8205 times faster than byte per byte)
gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 68887 nanoseconds
size_t: done in 20147 nanoseconds (3.41922 times faster than byte per byte)
SIMD v10: done in 21410 nanoseconds (3.21752 times faster than byte per byte)
SIMD v11: done in 56987 nanoseconds (1.20882 times faster than byte per byte)
SIMD v12: done in 25102 nanoseconds (2.74428 times faster than byte per byte)
 

== with BLCKSZ 256

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 120483 nanoseconds
size_t: done in 23098 nanoseconds (5.21617 times faster than byte per byte)
SIMD v10: done in 6737 nanoseconds (17.8838 times faster than byte per byte)
SIMD v11: done in 6621 nanoseconds (18.1971 times faster than byte per byte)
SIMD v12: done in 6519 nanoseconds (18.4818 times faster than byte per byte)
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 211759 nanoseconds
size_t: done in 45879 nanoseconds (4.6156 times faster than byte per byte)
SIMD v10: done in 12262 nanoseconds (17.2695 times faster than byte per byte)
SIMD v11: done in 12018 nanoseconds (17.6202 times faster than byte per byte)
SIMD v12: done in 11993 nanoseconds (17.6569 times faster than byte per byte)
 

== with BLCKSZ 8192

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 3393459 nanoseconds
size_t: done in 707304 nanoseconds (4.79774 times faster than byte per byte)
SIMD v10: done in 233559 nanoseconds (14.5293 times faster than byte per byte)
SIMD v11: done in 225951 nanoseconds (15.0186 times faster than byte per byte)
SIMD v12: done in 225766 nanoseconds (15.0309 times faster than byte per byte)
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 12786295 nanoseconds
size_t: done in 1071590 nanoseconds (11.9321 times faster than byte per byte)
SIMD v10: done in 413219 nanoseconds (30.9431 times faster than byte per byte)
SIMD v11: done in 423469 nanoseconds (30.1942 times faster than byte per byte)
SIMD v12: done in 414106 nanoseconds (30.8769 times faster than byte per byte
 

That's better for small size but given the extra len checks that
has been added I think we're back to David's point in [1]: What if the function
is not inlined for some reason?

So, out of curiosity, let's see what happens if not inlined in [2] (see the
-O2 -DNOT_INLINE compiler window):

- if a[3]: it looks like gcc is smart enough to create an optimized version
for that size using constant propagation
- if a[63]: Same as above
- if a[256]: Same as above
- if a[8192]: Same as above

I did a quick check with clang and it looks like it is not as smart as gcc
for the non inline case.

Anyway it's not like we have the choice: we need (at least) one len check for
safety reason (to not crash or read invalid data).

So, I'd vote for pg_memory_is_all_zeros_v12() then, thoughts?
I think that's good enough.

best regards,
Ranier Vilela

Re: define pg_structiszero(addr, s, r)

From
Ranier Vilela
Date:

Em qui., 14 de nov. de 2024 às 07:09, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> escreveu:
Hi,

On Thu, Nov 14, 2024 at 09:27:06AM +0900, Michael Paquier wrote:
> Makes sense to me to just do that, with a first < 8 loop, and a second
> for the 8~63 range.

Thanks for looking at it!

> There is also a "cant'" in the last size_t check.  Simple typo.

Please find attached v12, with more comments and comments changes to explain
the multiple cases (for safety) and phases (for efficiency).
Is it worth mentioning that pg_memory_is_all_zeros does not work correctly on 32-bit systems?

(63 < (size_t) * 8) /* 63 - 32*/

Or do we adjust magic constants according to 32/64 bit?

best regards,
Ranier Vilela

Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Thu, Nov 14, 2024 at 08:22:23AM -0300, Ranier Vilela wrote:
> Em qui., 14 de nov. de 2024 às 07:09, Bertrand Drouvot <
> bertranddrouvot.pg@gmail.com> escreveu:
> 
> > Hi,
> >
> > On Thu, Nov 14, 2024 at 09:27:06AM +0900, Michael Paquier wrote:
> > > Makes sense to me to just do that, with a first < 8 loop, and a second
> > > for the 8~63 range.
> >
> > Thanks for looking at it!
> >
> > > There is also a "cant'" in the last size_t check.  Simple typo.
> >
> > Please find attached v12, with more comments and comments changes to
> > explain
> > the multiple cases (for safety) and phases (for efficiency).
> >
> Is it worth mentioning that pg_memory_is_all_zeros does not work correctly
> on 32-bit systems?
> 
> (63 < (size_t) * 8) /* 63 - 32*/

I think that the code is fully portable on 32-bit systems as it's using size_t
in all the places.

I agree that the comments are "64-bit" focused though, but I don't think that's
an issue (as I think it's already the case in multiple places in the core code).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Ranier Vilela
Date:


Em qui., 14 de nov. de 2024 às 08:58, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> escreveu:
Hi,

On Thu, Nov 14, 2024 at 08:22:23AM -0300, Ranier Vilela wrote:
> Em qui., 14 de nov. de 2024 ąs 07:09, Bertrand Drouvot <
> bertranddrouvot.pg@gmail.com> escreveu:
>
> > Hi,
> >
> > On Thu, Nov 14, 2024 at 09:27:06AM +0900, Michael Paquier wrote:
> > > Makes sense to me to just do that, with a first < 8 loop, and a second
> > > for the 8~63 range.
> >
> > Thanks for looking at it!
> >
> > > There is also a "cant'" in the last size_t check.  Simple typo.
> >
> > Please find attached v12, with more comments and comments changes to
> > explain
> > the multiple cases (for safety) and phases (for efficiency).
> >
> Is it worth mentioning that pg_memory_is_all_zeros does not work correctly
> on 32-bit systems?
>
> (63 < (size_t) * 8) /* 63 - 32*/

I think that the code is fully portable on 32-bit systems as it's using size_t
in all the places.
Maybe I'm doing something wrong.
But I'm testing in 32-bit, with the size set to 63, with v12 and I'm seeing the SIMD loop execute.
Because the test
if (len < sizeof(size_t) * 8) // 8-63 bytes
failed.

I expected that with size 63, it would be solved by case 2, or am I wrong?

best regards,
Ranier Vilela

PS. Windows 11 64 bits
msvc 32 bits 2022

Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Thu, Nov 14, 2024 at 09:13:19AM -0300, Ranier Vilela wrote:
> Em qui., 14 de nov. de 2024 às 08:58, Bertrand Drouvot <
> Maybe I'm doing something wrong.
> But I'm testing in 32-bit, with the size set to 63, with v12 and I'm seeing
> the SIMD loop execute.

Yeah, that's expected and safe as each iteration reads 32 bytes on 32-bit.

> if (len < sizeof(size_t) * 8) // 8-63 bytes
> failed.
> 
> I expected that with size 63, it would be solved by case 2, or am I wrong?

Case 2 should be read as "in the 4-31" bytes range on 32-bit system as all 
comparisons are done in size_t.

What would be unsafe on 32-bit would be to read up to 32 bytes while len < 32
and that can not happen.

As mentioned up-thread the comments are wrong on 32-bit, indeed they must be read
as:

Case 1: len < 4 bytes
Case 2: len in the 4-31 bytes range
Case 3: len >= 32 bytes

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Fri, Nov 15, 2024 at 09:54:33AM -0300, Ranier Vilela wrote:
> There is a tiny typo with V13.
> + /* "len" in the [sizeof(size_t) * 8, inf] range */

I think "[sizeof(size_t) * 8, inf[ range" is correct. Infinity can not be included
into a interval.

Thinking about it, actually, "[sizeof(size_t) * 8, inf)" (note the ')' at the end)
might be the proper notation from a mathematical point of view.

> But, I'm not sure if I'm still doing something wrong.
> If so, forgive me for the noise.
> 
> Of course I expected "not is_allzeros".

That's the test case which is "wrong" (not the function):

"
size_t pagebytes[BLCKSZ] = {0};
volatile bool result;

pagebytes[BLCKSZ-2] = 1;

result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ);
"

The pagebytes is an array of size_t (8 bytes each), so the actual array size
is 8192 * 8 = 65536 bytes.

So, pagebytes[BLCKSZ-2] = 1, sets byte 65528 ((8192-2)*8) to 1.

But the function is checking up to BLCKSZ bytes (8192), so the results you
observed (which are correct).

> Anyway, I made another attempt to optimize a bit more, I don't know if it's
> safe though.

There is an issue in your v14, it calls:

"
return pg_memory_is_all_zeros_simd(ptr, ptr + len);
"

but you defined it that way:

"
static inline bool
pg_memory_is_all_zeros_simd(const size_t *p, const size_t * end)

"

while that should be:

"
static inline bool
pg_memory_is_all_zeros_simd(const void *p, const void *end)
"

Doing so, I do not observe any improvments with v14.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: define pg_structiszero(addr, s, r)

From
Ranier Vilela
Date:

Em sex., 15 de nov. de 2024 às 11:43, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> escreveu:
Hi,

On Fri, Nov 15, 2024 at 09:54:33AM -0300, Ranier Vilela wrote:
> There is a tiny typo with V13.
> + /* "len" in the [sizeof(size_t) * 8, inf] range */

I think "[sizeof(size_t) * 8, inf[ range" is correct. Infinity can not be included
into a interval.

Thinking about it, actually, "[sizeof(size_t) * 8, inf)" (note the ')' at the end)
might be the proper notation from a mathematical point of view.
Thanks for clarifying.
 

> But, I'm not sure if I'm still doing something wrong.
> If so, forgive me for the noise.
>
> Of course I expected "not is_allzeros".

That's the test case which is "wrong" (not the function):

"
size_t pagebytes[BLCKSZ] = {0};
volatile bool result;

pagebytes[BLCKSZ-2] = 1;

result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ);
"

The pagebytes is an array of size_t (8 bytes each), so the actual array size
is 8192 * 8 = 65536 bytes.

So, pagebytes[BLCKSZ-2] = 1, sets byte 65528 ((8192-2)*8) to 1.

But the function is checking up to BLCKSZ bytes (8192), so the results you
observed (which are correct).
Thanks for pointing out my mistake.
 

> Anyway, I made another attempt to optimize a bit more, I don't know if it's
> safe though.

There is an issue in your v14, it calls:

"
return pg_memory_is_all_zeros_simd(ptr, ptr + len);
"

but you defined it that way:

"
static inline bool
pg_memory_is_all_zeros_simd(const size_t *p, const size_t * end)

"

while that should be:

"
static inline bool
pg_memory_is_all_zeros_simd(const void *p, const void *end)
What I'm trying here, obviously, is a hack.
If it works, and the compiler accepts it, it's ok for me.
 
"

Doing so, I do not observe any improvments with v14.
So.
Again new results from v4_allzeros_small.c attached:
Linux Ubuntu 22.04
gcc 13 64 bits
 
With BLCKSZ 32
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; ./v4_allzeros_small
byte per byte: done in 44092 nanoseconds
size_t: done in 13456 nanoseconds (3.27675 times faster than byte per byte)
SIMD v10: done in 14249 nanoseconds (3.09439 times faster than byte per byte)
SIMD v11: done in 32516 nanoseconds (1.35601 times faster than byte per byte)
SIMD v12: done in 14973 nanoseconds (2.94477 times faster than byte per byte)
SIMD v14: done in 13101 nanoseconds (3.36554 times faster than byte per byte)

With BLCKSZ 63
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; ./v4_allzeros_small
byte per byte: done in 67656 nanoseconds
size_t: done in 25768 nanoseconds (2.62558 times faster than byte per byte)
SIMD v10: done in 21446 nanoseconds (3.15471 times faster than byte per byte)
SIMD v11: done in 56887 nanoseconds (1.18931 times faster than byte per byte)
SIMD v12: done in 22863 nanoseconds (2.95919 times faster than byte per byte)
SIMD v14: done in 21273 nanoseconds (3.18037 times faster than byte per byte)

With BLCKSZ 256
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; ./v4_allzeros_small
byte per byte: done in 220064 nanoseconds
size_t: done in 45886 nanoseconds (4.79589 times faster than byte per byte)
SIMD v10: done in 12032 nanoseconds (18.2899 times faster than byte per byte)
SIMD v11: done in 11965 nanoseconds (18.3923 times faster than byte per byte)
SIMD v12: done in 12041 nanoseconds (18.2762 times faster than byte per byte)
SIMD v14: done in 12575 nanoseconds (17.5001 times faster than byte per byte)

With BLCKSZ 8192
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; ./v4_allzeros_small
byte per byte: done in 10365876 nanoseconds
size_t: done in 827654 nanoseconds (12.5244 times faster than byte per byte)
SIMD v10: done in 347755 nanoseconds (29.808 times faster than byte per byte)
SIMD v11: done in 342813 nanoseconds (30.2377 times faster than byte per byte)
SIMD v12: done in 341124 nanoseconds (30.3874 times faster than byte per byte)
SIMD v14: done in 50646 nanoseconds (204.673 times faster than byte per byte)

Results with v4_allzeros_check.c attached:
gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; ./v4_allzeros_check
sizeof(pagebytes)=32
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
SIMD v14: is_allzeros

gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; ./v4_allzeros_check
sizeof(pagebytes)=63
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
SIMD v14: is_allzeros

gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; ./v4_allzeros_check
sizeof(pagebytes)=256
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
p01=(0x7ffedb8ac430)
end=(0x7ffedb8ac530)
p02=(0x7ffedb8ac530)
SIMD v14: is_allzeros

gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; ./v4_allzeros_check
sizeof(pagebytes)=8192
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
p01=(0x7ffd8864c200)
end=(0x7ffd8864e200)
p02=(0x7ffd8864e200)
SIMD v14: is_allzeros

If this hack is safe and correct, I think that 204 times faster,
it is very good, for a block size 8192.

That said,
V13 is fine as is.
LGTM.

best regards,
Ranier Vilela

Re: define pg_structiszero(addr, s, r)

From
Bertrand Drouvot
Date:
Hi,

On Sat, Nov 16, 2024 at 11:42:54AM -0300, Ranier Vilela wrote:
> > Em sex., 15 de nov. de 2024 às 11:43, Bertrand Drouvot <
> > bertranddrouvot.pg@gmail.com> escreveu:
> >
> >> while that should be:
> >>
> >> "
> >> static inline bool
> >> pg_memory_is_all_zeros_simd(const void *p, const void *end)
> >>
> > What I'm trying here, obviously, is a hack.
> > If it works, and the compiler accepts it, it's ok for me.
> >
> > If this hack is safe and correct, I think that 204 times faster,
> > it is very good, for a block size 8192.

The "hack" is not correct, because it's doing:

"
static inline bool
all_zeros_simd(const size_t *p, const size_t * end)
{
    for (; p < (end - sizeof(size_t) * 7); p += sizeof(size_t) * 8)
    {
        if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
            (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
            (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
            (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
            return false;
    }
.
.
"

"p += sizeof(size_t) * 8" advances by 64 elements. But those elements are "size_t"
elements (since you're using size_t pointers as the function arguments).

Then instead of advancing by 64 bytes, you know advance by 512 bytes. But
you only check 64 bytes per iteration -> you're missing 448 bytes to check
per iteration.

We can "visualize" this by adding a few output messages like:

"
static inline bool
all_zeros_simd(const size_t *p, const size_t * end)
{
    for (; p < (end - sizeof(size_t) * 7); p += sizeof(size_t) * 8)
    {
        printf("Current p: %p\n", (void*)p);
        printf("Checking elements:\n");
        printf("[0]: %p = %zu\n", (void*)&((size_t *)p)[0], ((size_t *)p)[0]);
        printf("[1]: %p = %zu\n", (void*)&((size_t *)p)[1], ((size_t *)p)[1]);
        printf("[2]: %p = %zu\n", (void*)&((size_t *)p)[2], ((size_t *)p)[2]);
        printf("[3]: %p = %zu\n", (void*)&((size_t *)p)[3], ((size_t *)p)[3]);
        printf("[4]: %p = %zu\n", (void*)&((size_t *)p)[4], ((size_t *)p)[4]);
        printf("[5]: %p = %zu\n", (void*)&((size_t *)p)[5], ((size_t *)p)[5]);
        printf("[6]: %p = %zu\n", (void*)&((size_t *)p)[6], ((size_t *)p)[6]);
        printf("[7]: %p = %zu\n", (void*)&((size_t *)p)[7], ((size_t *)p)[7]);

        const size_t *next_p = p + sizeof(size_t) * 8;
        printf("Next p will be: %p (advance of %zu bytes)\n",
               (void*)next_p,
               (size_t)((char*)next_p - (char*)p));
.
.
.
"

Then we get things like:

"
Current p: 0x7fff2a93e500
Checking elements:
[0]: 0x7fff2a93e500 = 0
[1]: 0x7fff2a93e508 = 0
[2]: 0x7fff2a93e510 = 0
[3]: 0x7fff2a93e518 = 0
[4]: 0x7fff2a93e520 = 0
[5]: 0x7fff2a93e528 = 0
[6]: 0x7fff2a93e530 = 0
[7]: 0x7fff2a93e538 = 0
Next p will be: 0x7fff2a93e700 (advance of 512 bytes)
"

Meaning that you're checking 64 bytes per iteration (for example from 0x500 to
0x508 is 8 bytes) while advancing by 512 bytes: you're missing 448 bytes to check
per iteration.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com