Thread: Re: define pg_structiszero(addr, s, r)
On 18.09.24 06:16, Bertrand Drouvot wrote:
> +#define pg_structiszero(addr, s, r) \
> +    do { \
> +        /* We assume this initializes to zeroes */ \
> +        static const s all_zeroes; \
> +        r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0); \
> +    } while (0)

This assumption is kind of the problem, isn't it? Because, you can't
assume that. And the existing code is arguably kind of wrong. But
moreover, this macro also assumes that the "addr" argument has no random
padding bits.

In the existing code, you can maybe make a local analysis that the code
is working correctly, although I'm not actually sure. But if you are
repackaging this as a general macro under a general-sounding name, then
the requirements should be more stringent.
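For illustration, a minimal sketch (not from the thread) of the padding hazard Peter describes; the struct and its layout are hypothetical:

#include <string.h>
#include <stdbool.h>

/* Hypothetical struct: padding bytes typically follow 'flag'. */
typedef struct
{
    char    flag;       /* 1 byte */
    /* the compiler usually inserts 7 padding bytes here */
    long    counter;    /* 8 bytes, 8-byte aligned */
} Example;

/*
 * memcmp() against a static all-zeroes instance compares the padding
 * bytes too.  A struct whose members are all zero can still compare
 * unequal if its padding bytes hold garbage, e.g. after member-wise
 * assignment rather than memset().
 */
bool
example_is_zero(const Example *e)
{
    static const Example all_zeroes;    /* static => zero-initialized */

    return memcmp(e, &all_zeroes, sizeof(Example)) == 0;   /* unreliable */
}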
Hi,

On Wed, Sep 18, 2024 at 10:03:21AM +0200, Peter Eisentraut wrote:
> On 18.09.24 06:16, Bertrand Drouvot wrote:
> > +#define pg_structiszero(addr, s, r) \
> > +    do { \
> > +        /* We assume this initializes to zeroes */ \
> > +        static const s all_zeroes; \
> > +        r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0); \
> > +    } while (0)

Thanks for the feedback.

> This assumption is kind of the problem, isn't it? Because, you can't assume
> that. And the existing code is arguably kind of wrong. But moreover, this
> macro also assumes that the "addr" argument has no random padding bits.
>
> In the existing code, you can maybe make a local analysis that the code is
> working correctly, although I'm not actually sure.

I think it is but will give it a closer look.

> But if you are
> repackaging this as a general macro under a general-sounding name, then the
> requirements should be more stringent.

Agree. That said in v2 ([1]), it has been renamed to pgstat_entry_all_zeros().

I think that I will:

1/ take a closer look regarding the existing assumption
2/ if 1/ outcome is fine, then add more detailed comments around
pgstat_entry_all_zeros() to make sure it's not used outside of the existing
context

Sounds good to you?

[1]: https://www.postgresql.org/message-id/ZuqHLCdZXtEsbyb/%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 18/09/2024 21:57, Bertrand Drouvot wrote:
> On Wed, Sep 18, 2024 at 10:03:21AM +0200, Peter Eisentraut wrote:
>> On 18.09.24 06:16, Bertrand Drouvot wrote:
>>> +#define pg_structiszero(addr, s, r) \
>>> +    do { \
>>> +        /* We assume this initializes to zeroes */ \
>>> +        static const s all_zeroes; \
>>> +        r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0); \
>>> +    } while (0)

Not new with this patch, but do we guarantee padding bytes to be zeros?

How about this instead:

static inline bool
pg_is_all_zeros(const char *p, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        if (p[i] != 0)
            return false;
    }
    return true;
}

Is there a de facto standard name for that function? I was surprised
that I couldn't find one with a quick google search. That seems like the
kind of small utility function that every C program needs.

How performance sensitive is this? If it's not, then the above seems
like the most straightforward way to do this, which is good. If it is
performance sensitive, it's still good, because the compiler can
optimize that well: https://godbolt.org/z/x9hPWjheq.

--
Heikki Linnakangas
Neon (https://neon.tech)
On Mon, Oct 28, 2024 at 11:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 18/09/2024 21:57, Bertrand Drouvot wrote:
> > On Wed, Sep 18, 2024 at 10:03:21AM +0200, Peter Eisentraut wrote:
> >> On 18.09.24 06:16, Bertrand Drouvot wrote:
> >>> +#define pg_structiszero(addr, s, r) \
> >>> +    do { \
> >>> +        /* We assume this initializes to zeroes */ \
> >>> +        static const s all_zeroes; \
> >>> +        r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0); \
> >>> +    } while (0)
>
> Not new with this patch, but do we guarantee padding bytes to be zeros?
>
> How about this instead:
>
> static inline bool
> pg_is_all_zeros(const char *p, size_t len)
> {
>     for (size_t i = 0; i < len; i++)
>     {
>         if (p[i] != 0)
>             return false;
>     }
>     return true;
> }
It seems to me that this way, scanning backwards, is better optimized:

static inline bool
is_all_zeros(const char *p, size_t len)
{
    for (size_t i = len; i-- > 0;)
    {
        if (p[i] != 0)
            return false;
    }
    return true;
}
main:
sub rsp, 24
lea rdx, [rsp + 12]
lea rcx, [rsp + 16]
lea rdi, [rip + .L.str]
lea rsi, [rsp + 8]
xor eax, eax
call __isoc99_scanf@PLT
lea rdi, [rip + .L.str.1]
xor esi, esi
xor eax, eax
call printf@PLT
xor eax, eax
add rsp, 24
ret
best regards,
Ranier Vilela
Ranier Vilela <ranier.vf@gmail.com> writes:
> It seems to me that [reversing the loop direction] is more optimized.

That's far from clear: you're ignoring the possibility that memory
access logic is better optimized for forward scanning than reverse
scanning. I'd stick with the forward scan without some extremely
detailed testing.

            regards, tom lane
On Mon, Oct 28, 2024 at 12:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ranier Vilela <ranier.vf@gmail.com> writes:
> > It seems to me that [reversing the loop direction] is more optimized.
>
> That's far from clear: you're ignoring the possibility that memory
> access logic is better optimized for forward scanning than reverse
> scanning. I'd stick with the forward scan without some extremely
> detailed testing.
I don't disagree.
After posting, it occurred to me that if a non-zero byte exists, it is probably near the beginning of the region rather than at the end.
best regards,
Ranier Vilela
On Thu, Sep 19, 2024 at 4:57 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> On Wed, Sep 18, 2024 at 10:03:21AM +0200, Peter Eisentraut wrote:
> > But if you are
> > repackaging this as a general macro under a general-sounding name, then the
> > requirements should be more stringent.
>
> Agree. That said in v2 ([1]), it has been renamed to pgstat_entry_all_zeros().
> [...]
>
> Sounds good to you?

Hi, if this is performance critical then FWIW my understanding is that
memcmp can outperform simple loop checking, and my hacky testing seemed
to confirm that. See https://godbolt.org/z/GWY1ob9bE

static inline bool
is_all_zeros2(const char *p, size_t len)
{
    static size_t nz = 0;
    static const char *z = NULL;

    if (nz < len)
    {
        if (z)
            free((void *) z);
        nz = len;
        z = (const char *) calloc(1, nz);
    }
    return memcmp(p, z, len) == 0;
}

~~

Iterate 1000000 times...
check zeros using loop -- elapsed=0.041196s
check zeros using memcmp -- elapsed=0.016407s

======
Kind Regards,
Peter Smith.
Fujitsu Australia
On 29.10.24 08:54, Bertrand Drouvot wrote:
> +static inline bool
> +pg_mem_is_all_zeros(const char *p, size_t len)

This should be a void * pointer please.
On 29/10/2024 09:54, Bertrand Drouvot wrote:
> > https://godbolt.org/z/x9hPWjheq.
>
> Yeah, I also think that's fine. Peter Smith did some testing in [1] comparing
> memcmp and simple loop checking (thanks Peter for the testing!):
>
> "
> Iterate 1000000 times...
> check zeros using loop -- elapsed=0.041196s
> check zeros using memcmp -- elapsed=0.016407s
> "
>
> So, in this test, the loop is 0.024789s longer, which means
> 0.024789s/1000000 = 24 nanoseconds slower per comparison (if my math is
> correct).

I believe that test program is bogus. Look at the assembly code; the
compiler optimized away the loops.

--
Heikki Linnakangas
Neon (https://neon.tech)
Hi,

On Tue, Oct 29, 2024 at 11:39:03AM +0200, Heikki Linnakangas wrote:
> On 29/10/2024 09:54, Bertrand Drouvot wrote:
> > So, in this test, the loop is 0.024789s longer, which means
> > 0.024789s/1000000 = 24 nanoseconds slower per comparison (if my math is
> > correct).
>
> I believe that test program is bogus. Look at the assembly code; the
> compiler optimized away the loops.

Oh right. It looks like moving the "scanf" within each loop "helps" and
that both then give pretty comparable results.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
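For illustration, a minimal sketch (not from the thread) of one way to keep such a micro-benchmark honest without resorting to scanf; the empty asm statement is a GCC/Clang-specific optimization barrier, and the sizes are arbitrary:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

#define LEN   8192
#define ITERS 100000

static bool
is_all_zeros(const char *p, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        if (p[i] != 0)
            return false;
    }
    return true;
}

static volatile int sink;   /* forces the result of each call to be kept */

int
main(void)
{
    static char buf[LEN];   /* static, hence zero-initialized */
    struct timespec s, e;

    clock_gettime(CLOCK_MONOTONIC, &s);
    for (int i = 0; i < ITERS; i++)
    {
        /* optimization barrier: tell the compiler buf may have changed */
        __asm__ volatile("" : : "r"(buf) : "memory");
        sink = is_all_zeros(buf, LEN);
    }
    clock_gettime(CLOCK_MONOTONIC, &e);

    printf("elapsed=%fs\n",
           (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9);
    return 0;
}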
On 29.10.24 14:58, Bertrand Drouvot wrote:
> hi,
>
> On Tue, Oct 29, 2024 at 10:23:37AM +0100, Peter Eisentraut wrote:
> > On 29.10.24 08:54, Bertrand Drouvot wrote:
> > > +static inline bool
> > > +pg_mem_is_all_zeros(const char *p, size_t len)
> >
> > This should be a void * pointer please.
>
> Thanks for looking at it!
> Yeah better, done in v4 attached.

Sorry for the confusion. I didn't mean to discourage you from the const
qualifier. That should still be there.
+/*
+ * Test if a memory region starting at p and of size len is full of zeroes.
+ */
+static inline bool
+pg_mem_is_all_zeros(const void *ptr, size_t len)

The function comment should say 'ptr' instead of 'p', right?

======
Kind Regards,
Peter Smith.
Fujitsu Australia
On Thu, 31 Oct 2024 at 19:17, Michael Paquier <michael@paquier.xyz> wrote:
> Seems kind of OK seen from here. I am wondering if others have more
> comments about the name of the macro, its location, the fact that we
> still have pagebytes in bufpage.c, etc.

This change looks incorrect:

@@ -126,18 +124,9 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
 	}

 	/* Check all-zeroes case */
-	all_zeroes = true;
 	pagebytes = (size_t *) page;
-	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
-	{
-		if (pagebytes[i] != 0)
-		{
-			all_zeroes = false;
-			break;
-		}
-	}
-
-	if (all_zeroes)
+	if (pg_memory_is_all_zeros(pagebytes, (BLCKSZ / sizeof(size_t))))
 		return true;

I think this should be passing BLCKSZ rather than (BLCKSZ /
sizeof(size_t)), otherwise it'll just check the first 1 kilobyte is
zero rather than the entire page.

However, I also have performance concerns: if I look at the assembly of
PageIsVerifiedExtended, I see the zero checking is now done 1 byte at a
time:

(gcc 11.4)

	leaq	1024(%rbx), %rdx   <-- 1KB bug
	.p2align 4,,10
	.p2align 3
.L60:
	cmpb	$0, (%rax)         <-- check 1 byte is zero.
	jne	.L59
	addq	$1, %rax           <-- increment by 1 byte.
	cmpq	%rax, %rdx         <-- check if we've done 1024 bytes yet.
	jne	.L60

Whereas previously it was doing:

	movq	%rbx, %rax
	leaq	8192(%rbx), %rdx   <-- 8KB
	jmp	.L60
	.p2align 4,,10
	.p2align 3
.L90:
	addq	$8, %rax           <-- increment by 8 bytes (or sizeof(size_t))
	cmpq	%rax, %rdx
	je	.L61
.L60:
	cmpq	$0, (%rax)         <-- checks an entire 8 bytes are zero.

I didn't test how performance-critical this is, but the header comment
for this function does use the words "cheaply detect":

 * This is called when a page has just been read in from disk.  The idea is
 * to cheaply detect trashed pages before we go nuts following bogus line
 * pointers, testing invalid transaction identifiers, etc.

so it seems a bit dangerous to switch this loop to byte-at-a-time rather
than doing 8 bytes at once without testing that the performance isn't
affected.

David
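For clarity, the corrected call, which later versions of the patch adopt, passes the length in bytes:

	/* Check all-zeroes case; the length argument is in bytes */
	if (pg_memory_is_all_zeros(pagebytes, BLCKSZ))
		return true;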
Hi,

On Fri, Nov 01, 2024 at 05:45:44PM +1300, David Rowley wrote:
> On Thu, 31 Oct 2024 at 19:17, Michael Paquier <michael@paquier.xyz> wrote:
> > Seems kind of OK seen from here. I am wondering if others have more
> > comments about the name of the macro, its location, the fact that we
> > still have pagebytes in bufpage.c, etc.
>
> This change looks incorrect:
>
> [...]
>
> I think this should be passing BLCKSZ rather than (BLCKSZ /
> sizeof(size_t)), otherwise it'll just check the first 1 kilobyte is
> zero rather than the entire page.

Thanks for looking at it!

Nice catch. Indeed, by using the new function we are changing the pointer
arithmetic here, so we should pass BLCKSZ instead.

> However, I also have performance concerns: if I look at the assembly of
> PageIsVerifiedExtended, I see the zero checking is now done 1 byte at a
> time:
>
> [...]
>
> so it seems a bit dangerous to switch this loop to byte-at-a-time rather
> than doing 8 bytes at once without testing that the performance isn't
> affected.

Agree. I did a quick test (see [0]) and it looks like it's significantly
slower using the new inline function.

We could try to write a more elaborate version of pg_memory_is_all_zeros(),
but as it looks like there is only one such use case, it's probably better
to revert this change here and "just" add a comment as to why
pg_memory_is_all_zeros() should not be used here. Thoughts?

[0]: https://godbolt.org/z/xqnW4MPY5

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, 1 Nov 2024 at 19:33, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Agree, I did a quick test (see [0]) and it looks like it's significantly
> slower using the new inline function.
>
> We could try to write a more elaborate version of pg_memory_is_all_zeros(),
> but as it looks like there is only one such use case, it's probably better
> to revert this change here and "just" add a comment as to why
> pg_memory_is_all_zeros() should not be used here. Thoughts?

My vote is to just revert this usage of the function. Anything more
elaborate would need to check pointer alignment before using any types
larger than char. The previous code does not need to do that because the
page is going to be at least MAXALIGNed.

David
On Fri, 1 Nov 2024 at 20:14, Michael Paquier <michael@paquier.xyz> wrote:
> 2) On HEAD at 49d6c7d8daba:
> .LVL299:
>         .loc 1 131 16 is_stmt 0 discriminator 1 view .LVU524
>         cmpq    $8192, %rbx
>         je      .L419
>
> 3) With the patch sent at [1]:
> .LVL306:
>         .loc 3 201 23 is_stmt 1 discriminator 1 view .LVU545
>         cmpq    $8192, %rbx
>         jne     .L417
>
> So it does not matter one way or another for 2) or 3), does it?

The patch in [1] will fix the bug. But I'm still concerned about the
performance implications of moving to byte-at-a-time processing. There
are about 8 times more instructions being executed to do the same work.

David

> [1]: https://www.postgresql.org/message-id/ZyR02ofHiWG1HmLI@paquier.xyz
Hi,

On Fri, Nov 01, 2024 at 04:36:45PM +0900, Michael Paquier wrote:
> On Fri, Nov 01, 2024 at 08:21:50PM +1300, David Rowley wrote:
> > My vote is to just revert this usage of the function. Anything more
> > elaborate would need to check pointer alignment before using any types
> > larger than char. The previous code does not need to do that because
> > the page is going to be at least MAXALIGNed.
>
> Fine, here you go. The attached reverts back this part in bufpage.c
> to what it was in 49d6c7d8daba.

Thanks!

Worth adding a comment as to why pg_memory_is_all_zeros() should not be
used here?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> On Fri, 1 Nov 2024 at 20:49, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Nov 01, 2024 at 07:44:22AM +0000, Bertrand Drouvot wrote:
> > > Worth adding a comment as to why pg_memory_is_all_zeros() should not
> > > be used here?
> >
> > I would not add one in bufpage.c, documenting that where
> > pg_memory_is_all_zeros() is defined may be more adapted.
>
> The thought of having to write a comment to warn people not to use it
> for performance-critical things makes me think it might be better just
> to write a more optimal version of the function so we don't need to
> warn people.

Yeah, it's probably a good idea to write a more elaborate function.

> I looked around at the callers of the function and saw the
> following numbers of bytes being used for the length: 8192 (the one in
> question), 88, 32 and 112.
>
> I don't know how performance-critical the final three of those are,
> but I imagine all apart from the 32-byte one might be better with a
> non-inlined and more optimised version of the function. The problem
> with inlining the optimised version is that it's more code to inline.

I agree that it's more code to inline, and it contains multiple loops and
branches. For the last 3 callers, I think that a non-inline version would
still be "cheap" compared to what led to those code paths (stats
increments).

> I've attached what I thought a more optimal version might look like in
> case anyone thinks making it better is a good idea.

Thanks for the proposal!

I like the idea. I think it's worth adding a few comments, something like:

1 ===

+	while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)

Add a comment like "Check bytes, one by one, until the pointer is aligned"?

2 ===

+	for (; p < aligned_end; p += sizeof(size_t))

Add a comment like "Multiple-bytes comparisons at once"?

3 ===

+	while (p < end)
+	{

Add a comment like "Compare remaining bytes, byte by byte"?

4 ===

Out of curiosity I did test your proposal and it performs well (see [0])
for the PageIsVerifiedExtended() case.

[0]: https://godbolt.org/z/Mdaxz5W7c

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
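Putting those pieces together, a minimal sketch of the three-phase function with the suggested comments folded in; David's actual attachment is not reproduced in the thread, so this is a reconstruction rather than the patch itself:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static inline bool
pg_memory_is_all_zeros(const void *ptr, size_t len)
{
    const unsigned char *p = (const unsigned char *) ptr;
    const unsigned char *end = &p[len];
    const unsigned char *aligned_end = (const unsigned char *)
        ((uintptr_t) end & (~(sizeof(size_t) - 1)));

    /* Check bytes, one by one, until the pointer is aligned */
    while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
    {
        if (p == end)
            return true;
        if (*p++ != 0)
            return false;
    }

    /* Multiple-bytes comparisons at once */
    for (; p < aligned_end; p += sizeof(size_t))
    {
        if (*(size_t *) p != 0)
            return false;
    }

    /* Compare remaining bytes, byte by byte */
    while (p < end)
    {
        if (*p++ != 0)
            return false;
    }

    return true;
}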
Hi,

On Fri, Nov 01, 2024 at 12:50:10PM +0000, Bertrand Drouvot wrote:
> On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> > I've attached what I thought a more optimal version might look like in
> > case anyone thinks making it better is a good idea.
>
> Thanks for the proposal!
>
> I like the idea. I think it's worth adding a few comments [...]

Also,

5 ===

Shouldn't we handle the cases where ptr is NULL and/or len == 0?
(That should probably have already been done in the current version of
pg_memory_is_all_zeros() though.)

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Sat, 2 Nov 2024 at 01:50, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> > I've attached what I thought a more optimal version might look like in
> > case anyone thinks making it better is a good idea.
>
> Thanks for the proposal!
>
> I like the idea. I think it's worth adding a few comments, something like:

I'm happy if you want to pick this up and continue working on it. I'm
mostly just keen to not leave the suboptimal version of the function in
core as it is.

One thing you might want to consider is whether it's worth having a macro
to help decide if you want to inline the function for smaller sizes and
not inline it for larger sizes. A macro that checks something like:

if (__builtin_constant_p(len) && len <= 32)

could call an inline version of the function for smaller sizes and do a
function call for larger sizes. Compilers seem to have heuristics that
result in behaviour like this for library functions such as memset and
memcpy. Maybe some experimentation with godbolt.org would yield the
crossover point in bytes where compilers switch tactics.

I just feel that, at the rate we receive small code change suggestions on
the mailing lists, it's just a matter of time before someone will come
along and suggest we use pg_memory_is_all_zeros() in
PageIsVerifiedExtended() again. If we don't optimize that function, then
there's a chance a committer might re-commit what's just been reverted.

David
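A minimal sketch (not from the thread) of the kind of dispatch macro David describes; the names and the 32-byte threshold are illustrative, and __builtin_constant_p() is GCC/Clang-specific:

#include <stdbool.h>
#include <stddef.h>

/* out-of-line version, optimised for large regions (defined elsewhere) */
extern bool pg_memory_is_all_zeros_large(const void *ptr, size_t len);

/* small, always-inlined byte-wise version */
static inline bool
pg_memory_is_all_zeros_small(const void *ptr, size_t len)
{
    const unsigned char *p = (const unsigned char *) ptr;

    for (size_t i = 0; i < len; i++)
    {
        if (p[i] != 0)
            return false;
    }
    return true;
}

/*
 * Dispatch at compile time when possible: if 'len' is a compile-time
 * constant and small, the inline loop wins; otherwise call the optimised
 * out-of-line version.  This mirrors what compilers themselves do for
 * memset/memcpy.
 */
#define pg_memory_is_all_zeros(ptr, len) \
    ((__builtin_constant_p(len) && (len) <= 32) ? \
     pg_memory_is_all_zeros_small((ptr), (len)) : \
     pg_memory_is_all_zeros_large((ptr), (len)))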
Hi.
On Mon, Nov 4, 2024 at 2:18 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Hi,
>
> On Tue, Nov 05, 2024 at 12:24:48AM +1300, David Rowley wrote:
> > On Sat, 2 Nov 2024 at 01:50, Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> > >
> > > On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> > > > I've attached what I thought a more optimal version might look like in
> > > > case anyone thinks making it better is a good idea.
> > >
> > > Thanks for the proposal!
> > >
> > > I like the idea. I think it's worth adding a few comments, something like:
> >
> > I'm happy if you want to pick this up and continue working on it.
>
> Sure, please find attached v1. The changes are:
>
> - switched from "const char" to "const unsigned char" (could have been done
>   in the current version of pg_memory_is_all_zeros() though)
> - added some comments
> - added an Assert for ptr != NULL
> - re-introduced the function call in PageIsVerifiedExtended()
> - proposed a commit message
I think we can add a small optimization to this last patch [1].
The variable *aligned_end* is only needed in the second (for) loop,
so we can delay its declaration and assignment until just before that loop.
Result before this change:
check zeros using BERTRAND 1 0.000031s
Result after this change:
check zeros using BERTRAND 1 0.000018s
+ const unsigned char *aligned_end;
+ /* Multiple bytes comparison(s) at once */
+ aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1)));
+ for (; p < aligned_end; p += sizeof(size_t))
best regards,
Ranier Vilela
On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier.vf@gmail.com> wrote:
> I think we can add a small optimization to this last patch [1].

I think if you want to make it faster, you could partially unroll the
inner-most loop, like:

// size_t * 4
for (; p < aligned_end - (sizeof(size_t) * 3); p += sizeof(size_t) * 4)
{
    if (((size_t *) p)[0] != 0 | ((size_t *) p)[1] != 0 |
        ((size_t *) p)[2] != 0 | ((size_t *) p)[3] != 0)
        return false;
}

$ gcc allzeros.c -O2 -o allzeros && ./allzeros
char: done in 1595000 nanoseconds
size_t: done in 198300 nanoseconds (8.04337 times faster than char)
size_t * 4: done in 81500 nanoseconds (19.5706 times faster than char)
size_t * 8: done in 71000 nanoseconds (22.4648 times faster than char)

The final one above is 110GB/sec, so it's probably only going that fast
because the memory being checked is in L1. DDR5 is only 64GB/sec. So it's
probably overkill to unroll the loop that much. Also, doing something
like that means the final byte-at-a-time loop might have more to do,
which might make cases with a long remainder slower. To make up for that
there's some incentive to introduce yet another loop to process single
size_t's up to aligned_end. Then you end up with even more code.

I was happy enough with my patch with Bertrand's comments. I'm not sure
why unsigned chars are better than chars. It doesn't seem to have any
effect on the compiled code.

David
Hi,

On Tue, Nov 05, 2024 at 05:08:41PM +1300, David Rowley wrote:
> I was happy enough with my patch with Bertrand's comments. I'm not sure
> why unsigned chars are better than chars. It doesn't seem to have any
> effect on the compiled code.

I think that unsigned char is better than char for byte-level memory
operations (no sign-extension issues). Though I agree that using char in
this function is not an issue, as p is only compared with 0. This is more
a matter of taste here.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 5, 2024 at 12:23 AM David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier.vf@gmail.com> wrote:
> > I think we can add a small optimization to this last patch [1].
> > The variable *aligned_end* is only needed in the second (for) loop,
> > so we can delay its declaration and assignment until just before that loop.
> >
> > Result before this change:
> > check zeros using BERTRAND 1 0.000031s
> >
> > Result after this change:
> > check zeros using BERTRAND 1 0.000018s
> >
> > + const unsigned char *aligned_end;
> >
> > + /* Multiple bytes comparison(s) at once */
> > + aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1)));
> > + for (; p < aligned_end; p += sizeof(size_t))
>
> I think we all need to stop using Godbolt's servers to run benchmarks
> on. These servers are likely to be running various other workloads in
> highly virtualised environments and are not going to be stable servers
> that would give consistent benchmark results.
>
> I tried your optimisation in the attached allzeros.c and here are my results:
>
> # My version
> $ gcc allzeros.c -O2 -o allzeros && for i in {1..3}; do ./allzeros; done
> char: done in 1566400 nanoseconds
> size_t: done in 195400 nanoseconds (8.01638 times faster than char)
> char: done in 1537500 nanoseconds
> size_t: done in 196300 nanoseconds (7.8324 times faster than char)
> char: done in 1543600 nanoseconds
> size_t: done in 196300 nanoseconds (7.86347 times faster than char)
>
> # Ranier's optimization
> $ gcc allzeros.c -O2 -D RANIERS_OPTIMIZATION -o allzeros && for i in
> {1..3}; do ./allzeros; done
> char: done in 1943100 nanoseconds
> size_t: done in 531700 nanoseconds (3.6545 times faster than char)
> char: done in 1957200 nanoseconds
> size_t: done in 458400 nanoseconds (4.26963 times faster than char)
> char: done in 1949500 nanoseconds
> size_t: done in 469000 nanoseconds (4.15672 times faster than char)
>
> Seems to be about half as fast with gcc on -O2
Thanks for the test code.

I've tried with MSVC 2022, 32-bit. Here are the results:
C:\usr\src\tests\allzeros>allzeros
char: done in 71431900 nanoseconds
size_t: done in 18010900 nanoseconds (3.96604 times faster than char)
C:\usr\src\tests\allzeros>allzeros
char: done in 71070100 nanoseconds
size_t: done in 19654300 nanoseconds (3.61601 times faster than char)
C:\usr\src\tests\allzeros>allzeros
char: done in 68682400 nanoseconds
size_t: done in 19841100 nanoseconds (3.46162 times faster than char)
C:\usr\src\tests\allzeros>allzeros
char: done in 63215100 nanoseconds
size_t: done in 17920200 nanoseconds (3.52759 times faster than char)
C:\usr\src\tests\allzeros>c /DRANIERS_OPTIMIZATION
Microsoft (R) Program Maintenance Utility Version 14.40.33813.0
Copyright (C) Microsoft Corporation. All rights reserved.
C:\usr\src\tests\allzeros>allzeros
char: done in 67213800 nanoseconds
size_t: done in 15049200 nanoseconds (4.46627 times faster than char)
C:\usr\src\tests\allzeros>allzeros
char: done in 51505900 nanoseconds
size_t: done in 13645700 nanoseconds (3.77452 times faster than char)
C:\usr\src\tests\allzeros>allzeros
char: done in 62852600 nanoseconds
size_t: done in 17863800 nanoseconds (3.51843 times faster than char)
C:\usr\src\tests\allzeros>allzeros
char: done in 51877200 nanoseconds
size_t: done in 13759900 nanoseconds (3.77017 times faster than char)
The function used to replace clock_gettime is:
timespec_get(ts, TIME_UTC)
best regards,
Ranier Vilela
On Tue, Nov 5, 2024 at 1:12 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Nov 05, 2024 at 04:23:34PM +1300, David Rowley wrote:
> > I tried your optimisation in the attached allzeros.c and here are my results:
> >
> > # My version
> > $ gcc allzeros.c -O2 -o allzeros && for i in {1..3}; do ./allzeros; done
> > char: done in 1543600 nanoseconds
> > size_t: done in 196300 nanoseconds (7.86347 times faster than char)
> >
> > # Ranier's optimization
> > $ gcc allzeros.c -O2 -D RANIERS_OPTIMIZATION -o allzeros && for i in
> > size_t: done in 531700 nanoseconds (3.6545 times faster than char)
> > char: done in 1957200 nanoseconds
>
> I am not seeing numbers as good as yours, but the winner is clear as
> well here:
Thanks for testing.
> $ gcc allzeros.c -O2 -o allzeros && for i in {1..3}; do ./allzeros; done
> char: done in 6578995 nanoseconds
> size_t: done in 829916 nanoseconds (7.9273 times faster than char)
> char: done in 6581465 nanoseconds
> size_t: done in 829948 nanoseconds (7.92997 times faster than char)
> char: done in 6585748 nanoseconds
> size_t: done in 834929 nanoseconds (7.88779 times faster than char)
>
> $ gcc allzeros.c -O2 -D RANIERS_OPTIMIZATION -o allzeros && for i in
> {1..3}; do ./allzeros; done
> char: done in 6591803 nanoseconds
> size_t: done in 1236102 nanoseconds (5.33273 times faster than char)
> char: done in 6606219 nanoseconds
> size_t: done in 1235979 nanoseconds (5.34493 times faster than char)
> char: done in 6594413 nanoseconds
> size_t: done in 1238770 nanoseconds (5.32336 times faster than char)
>
> I'm surprised to see that assigning aligned_end at these two different
> locations has this much effect once the compiler optimizes the
> surroundings, but well.
I think that's a point in favor of not touching the memory unless it's strictly necessary.
best regards,
Ranier Vilela
Hi,

On Tue, Nov 05, 2024 at 05:08:41PM +1300, David Rowley wrote:
> On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier.vf@gmail.com> wrote:
> > I think we can add a small optimization to this last patch [1].
>
> I think if you want to make it faster, you could partially unroll the
> inner-most loop, like:
>
> // size_t * 4
> for (; p < aligned_end - (sizeof(size_t) * 3); p += sizeof(size_t) * 4)
> {
>     if (((size_t *) p)[0] != 0 | ((size_t *) p)[1] != 0 |
>         ((size_t *) p)[2] != 0 | ((size_t *) p)[3] != 0)
>         return false;
> }

Another option could be to use SIMD instructions to check whether
multiple bytes are zero in a single operation. Maybe just an idea to keep
in mind and experiment with if we feel the need later on.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
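For reference, an explicit-intrinsics variant would look roughly like the sketch below; this is an editorial illustration, not something proposed in the thread (which relies on compiler auto-vectorisation instead), and it assumes x86-64 SSE2 plus a length that is a multiple of 16:

#include <emmintrin.h>      /* SSE2 */
#include <stdbool.h>
#include <stddef.h>

/*
 * Assumes len is a multiple of 16; a real version would need scalar
 * head/tail handling like the patch's byte-wise loops.
 */
static bool
simd_is_all_zeros(const void *ptr, size_t len)
{
    const unsigned char *p = (const unsigned char *) ptr;
    const unsigned char *end = p + len;

    for (; p < end; p += 16)
    {
        /* unaligned 16-byte load */
        __m128i chunk = _mm_loadu_si128((const __m128i *) p);
        /* compare each byte against zero; equal bytes become 0xFF */
        __m128i cmp = _mm_cmpeq_epi8(chunk, _mm_setzero_si128());

        /* movemask gathers one bit per byte; 0xFFFF means all 16 were zero */
        if (_mm_movemask_epi8(cmp) != 0xFFFF)
            return false;
    }
    return true;
}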
On Wed, 6 Nov 2024 at 04:03, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Another option could be to use SIMD instructions to check whether
> multiple bytes are zero in a single operation. Maybe just an idea to keep
> in mind and experiment with if we feel the need later on.

Could do. I just wrote it that way to give the compiler flexibility to do
SIMD implicitly. That seemed easier than messing around with SIMD
intrinsics. I guess the compiler won't use SIMD with the single
size_t-at-a-time version as it can't be certain it's ok to access the
memory beyond the first zero word. Because I wrote the "if" condition
using bitwise-OR, there's no boolean short-circuiting, so the compiler
sees it must be safe to access all the memory for the loop iteration.

If I use -march=native or -march=znver2 on my Zen2 machine, gcc does use
SIMD operators. Clang uses some 128-bit registers without specifying
-march:

drowley@amd3990x:~$ gcc -O2 allzeros.c -march=native -o allzeros && for i in {1..3}; do ./allzeros; done
char: done in 1940539 nanoseconds
size_t: done in 261731 nanoseconds (7.41425 times faster than char)
size_t * 4: done in 130415 nanoseconds (14.8797 times faster than char)
size_t * 8: done in 70031 nanoseconds (27.7097 times faster than char)
char: done in 3030132 nanoseconds
size_t: done in 477044 nanoseconds (6.35189 times faster than char)
size_t * 4: done in 123551 nanoseconds (24.5254 times faster than char)
size_t * 8: done in 68549 nanoseconds (44.2039 times faster than char)
char: done in 3214037 nanoseconds
size_t: done in 256901 nanoseconds (12.5108 times faster than char)
size_t * 4: done in 126017 nanoseconds (25.5048 times faster than char)
size_t * 8: done in 73167 nanoseconds (43.9274 times faster than char)

David
On Wed, 6 Nov 2024 at 13:20, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Nov 06, 2024 at 12:16:33PM +1300, David Rowley wrote:
> > On Wed, 6 Nov 2024 at 04:03, Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> >> Another option could be to use SIMD instructions to check whether
> >> multiple bytes are zero in a single operation. Maybe just an idea to keep
> >> in mind and experiment with if we feel the need later on.
> >
> > Could do. I just wrote it that way to give the compiler flexibility to do
> > SIMD implicitly. That seemed easier than messing around with SIMD
> > intrinsics. I guess the compiler won't use SIMD with the single
> > size_t-at-a-time version as it can't be certain it's ok to access the
> > memory beyond the first zero word. Because I wrote the "if" condition
> > using bitwise-OR, there's no boolean short-circuiting, so the compiler
> > sees it must be safe to access all the memory for the loop iteration.
>
> How complex would that be compared to the latest patch proposed if
> done this way? If you can force SIMD without having to know about
> these specific gcc switches (aka -march is not set by default in the
> tree except for some armv8 path), then the performance happens
> magically. If that makes the code more readable, that's even better.

We could just write it that way and leave it up to the compiler to decide
whether to use SIMD or not. Going by [1], gcc with -O2 uses SIMD
instructions from 14.1 and clang with -O2 does it from version 8.0.0.
gcc 14.1 was released in May 2024, so still quite new. It'll be quite a
bit older once PG18 is out. Using the bitwise-OR method, more and more
people will benefit as gcc 14.1 and beyond becomes more mainstream.

Clang 8.0.0 is from March 2019, so quite old already.

David

[1] https://godbolt.org/z/MTqao8scW
On Wed, 6 Nov 2024 at 13:52, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Nov 06, 2024 at 01:38:50PM +1300, David Rowley wrote:
> > We could just write it that way and leave it up to the compiler to decide
> > whether to use SIMD or not. Going by [1], gcc with -O2 uses SIMD
> > instructions from 14.1 and clang with -O2 does it from version 8.0.0.
> > gcc 14.1 was released in May 2024, so still quite new. It'll be quite a
> > bit older once PG18 is out. Using the bitwise-OR method, more and more
> > people will benefit as gcc 14.1 and beyond becomes more mainstream.
> >
> > Clang 8.0.0 is from March 2019, so quite old already.
>
> Okay, WFM to keep things the way they are in the patch.

I'm not sure I'm clear on what works for you. The latest patch I saw did
1 size_t per iteration. Are you saying we should do just one size_t per
loop, or should we form the code in a way that allows the compiler to use
SIMD instructions?

David
Hi,

On Wed, Nov 06, 2024 at 01:44:58PM +0900, Michael Paquier wrote:
> Should the last loop check only 1 byte at a time or should this stuff
> include one more step before the last one you wrote to do a couple of
> checks with size_t? That may matter for areas small enough (len <
> sizeof(size_t) * 8) causing the second step to not be taken, but large
> enough (len > sizeof(size_t)) to apply a couple of size_t checks per
> loop.

Do you mean add:

"
	for (; p < aligned_end; p += sizeof(size_t))
	{
		if (*(size_t *) p != 0)
			return false;
	}
"

just before the last loop?

If so, I did a few tests and did not see any major improvements, so I
thought it simpler to not add more code to this inline function in the v7
shared up-thread.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 05.11.24 16:03, Bertrand Drouvot wrote:
> On Tue, Nov 05, 2024 at 05:08:41PM +1300, David Rowley wrote:
>> On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier.vf@gmail.com> wrote:
>>> I think we can add a small optimization to this last patch [1].
>>
>> I think if you want to make it faster, you could partially unroll the
>> inner-most loop, like:
>>
>> // size_t * 4
>> for (; p < aligned_end - (sizeof(size_t) * 3); p += sizeof(size_t) * 4)
>> {
>>     if (((size_t *) p)[0] != 0 | ((size_t *) p)[1] != 0 |
>>         ((size_t *) p)[2] != 0 | ((size_t *) p)[3] != 0)
>>         return false;
>> }
>
> Another option could be to use SIMD instructions to check whether
> multiple bytes are zero in a single operation. Maybe just an idea to keep
> in mind and experiment with if we feel the need later on.

Speaking of which, couldn't you just use

    pg_popcount(ptr, len) == 0

?
On Thu, 7 Nov 2024 at 07:34, Peter Eisentraut <peter@eisentraut.org> wrote:
> Speaking of which, couldn't you just use
>
>     pg_popcount(ptr, len) == 0

That might be quite good for small lengths or for use cases where the
memory is always or almost always zero. The problem is there's no early
exit when you find the first non-zero byte, which means, for larger
lengths, reading much more memory. That'll both take longer and possibly
evict cache lines which might be useful to have in the near future.

David
On Thu, 7 Nov 2024 at 00:40, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Do you mean add:
>
> "
>	for (; p < aligned_end; p += sizeof(size_t))
>	{
>		if (*(size_t *) p != 0)
>			return false;
>	}
> "
>
> just before the last loop?
>
> If so, I did a few tests and did not see any major improvements, so I
> thought it simpler to not add more code to this inline function in the v7
> shared up-thread.

Did you try with a size where there's a decent remainder, say 124 bytes?

FWIW, one of the cases has 112 bytes, and I think that is aligned memory,
meaning we'll do the first 64 in the SIMD loop and have to do 48 bytes in
the byte-at-a-time loop. If you had the loop Michael mentioned, that
would instead be 6 loops of size_t-at-a-time.

David
Hi,

On Thu, Nov 07, 2024 at 09:45:44AM +0900, Michael Paquier wrote:
> On Thu, Nov 07, 2024 at 08:05:10AM +1300, David Rowley wrote:
> > That might be quite good for small lengths or for use cases where the
> > memory is always or almost always zero. The problem is there's no early
> > exit when you find the first non-zero byte, which means, for larger
> > lengths, reading much more memory. That'll both take longer and possibly
> > evict cache lines which might be useful to have in the near future.
>
> Didn't know this one either, thanks for the explanation.

+1, thanks!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, 8 Nov 2024 at 18:34, Michael Paquier <michael@paquier.xyz> wrote:
> I've done a round of comment and term cleanup for the whole patch,
> while on it.

I don't think "intrinsics" is the correct word to use here:

+ * - 8 * sizeof(size_t) comparisons using bitwise OR, to encourage compilers
+ * to use SIMD intrinsics if available, up to the last aligned location and

+ * All comparisons are combined with a single OR operation, making it a
+ * good candidate for SIMD intrinsics, if available.

An intrinsic function is a function built into the compiler that provides
some lower-level functionality, e.g. __builtin_popcount().

I'm slightly worried, given the current rate we're receiving cleanup
suggestions, that someone might come along and think they'd be doing us a
favour by submitting a patch to "fix up the inefficient bitwise-ORs and
use boolean-OR". Maybe a comment like the following might prevent that
from happening:

+ * For performance reasons, we manually unroll this loop and purposefully
+ * use bitwise-ORs to combine each comparison. This prevents boolean
+ * short-circuiting and lets the compiler know that it's safe to access all 8
+ * elements regardless of the result of the other comparisons. This seems
+ * to be enough to coax a few compilers into using SIMD instructions.

> Btw, gcc seems a bit smarter than clang when it comes to optimizing
> the code depending on the size of the structures. gcc gives up on
> SIMD if it's sure that the structure on which we are going to use the
> allzero check won't need it at all, and clang keeps it even if it does
> not need it. That was interesting to see, while going through the
> review.

Can you share your test case for this? I tried with [1] and the latest
gcc does not seem to be smart enough to figure this out. I tried adding
some additional len checks that the compiler can use as a cue and won't
need to emit code for the checks, provided the function does get inlined.
That was enough to get the compiler to not emit the loops when they'll
not be used. See the -DCHECK_LEN flag I'm passing in the 2nd compiler
window. I just don't know if putting something like that into the code is
a good idea, as if the function wasn't inlined for some reason, the extra
len checks would have to be compiled into the function.

David

[1] https://godbolt.org/z/xa81ro8GK
Hi,

On Sat, Nov 09, 2024 at 08:00:35AM +0900, Michael Paquier wrote:
> On Fri, Nov 08, 2024 at 11:18:09PM +1300, David Rowley wrote:
> > I'm slightly worried, given the current rate we're receiving cleanup
> > suggestions, that someone might come along and think they'd be doing us a
> > favour by submitting a patch to "fix up the inefficient bitwise-ORs and
> > use boolean-OR". Maybe a comment like the following might prevent that
> > from happening.
>
> Not sure, but OK by me to tweak things more.
>
> > Can you share your test case for this? [...]
>
> Feel free to use that (I hope it works), and see the difference once
> the aligned structure is 121 bytes or more:
> https://godbolt.org/z/94393nPGG
>
> At least, I can see that the SIMD loop is ignored.

What I see (with the godbolt you shared) is:

With BLCKSZ of 120, gcc uses no SIMD instructions. I think that's because
sizeof(AlignedBlock) is 120, which is not a multiple of 16 (the SIMD xmm
register size).

With BLCKSZ of 121, gcc does use SIMD instructions. I think that's
because sizeof(AlignedBlock) is then 128, which is a multiple of 16.

Clang uses SIMD instructions in both cases (with more complex code and
more branches, at least in the 120-byte case).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On Fri, Nov 08, 2024 at 11:18:09PM +1300, David Rowley wrote:
> I tried with [1] and the latest gcc does not seem to be smart enough to
> figure this out. I tried adding some additional len checks that the
> compiler can use as a cue and won't need to emit code for the checks,
> provided the function does get inlined. That was enough to get the
> compiler to not emit the loops when they'll not be used. See the
> -DCHECK_LEN flag I'm passing in the 2nd compiler window. I just don't
> know if putting something like that into the code is a good idea, as if
> the function wasn't inlined for some reason, the extra len checks would
> have to be compiled into the function.
>
> [1] https://godbolt.org/z/xa81ro8GK

Looking at it, that looks like an issue.

I mean, without the -DCHECK_LEN flag, the SIMD code will read up to 48
bytes beyond the struct's memory (which is 16 bytes).

This is fine:

	movdqu	xmm0, XMMWORD PTR [rdi]

But I don't think this is:

	movdqu	xmm2, XMMWORD PTR [rdi+16]
	movdqu	xmm1, XMMWORD PTR [rdi+32]
	movdqu	xmm3, XMMWORD PTR [rdi+48]

given that the struct size is only 16 bytes.

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On Sat, Nov 09, 2024 at 04:15:04AM +0000, Bertrand Drouvot wrote:
> Looking at it, that looks like an issue.
>
> I mean, without the -DCHECK_LEN flag, the SIMD code will read up to 48
> bytes beyond the struct's memory (which is 16 bytes) [...]
>
> Thoughts?

What about "simply" starting pg_memory_is_all_zeros() with?

"
	if (len < sizeof(size_t) * 8)
	{
		while (p < end)
		{
			if (*p++ != 0)
				return false;
		}
		return true;
	}
"

That way:

- we prevent reading beyond the memory area in the SIMD section (if < 64
  bytes)
- we make sure that aligned_end cannot be after the real end (which could
  happen if len is < 8 bytes)
- there is no need for additional size checks later in the code
- len < 64 bytes will be read byte per byte, but that's likely "enough"
  (if not faster) for those "small" sizes

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On Tue, Nov 12, 2024 at 12:28:53PM +0900, Michael Paquier wrote:
> On Mon, Nov 11, 2024 at 05:07:51PM +0000, Bertrand Drouvot wrote:
> > To handle the "what about the len check if the function is not inlined?",
> > I can't think of a good approach.
>
> FWIW, my choice would be to not over-engineer things more than what's
> in v10 posted at [1], hence do something without the exception case
> where the size is less than 64b.

I think that the 64-byte len check done in v11 is mandatory for safety
reasons.

1. First reason:

	for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)

The loop above reads 64 bytes at once, so it would read beyond the memory
area bounds if len < 64: that could cause a crash or read invalid data.

It's observed in [1] (using the godbolt shared in [2]), where we can see:

	movdqu	xmm2, XMMWORD PTR [rdi+16]
	movdqu	xmm1, XMMWORD PTR [rdi+32]
	movdqu	xmm3, XMMWORD PTR [rdi+48]

while the struct size is 16 bytes (so we are reading 48 bytes beyond it).

2. Second reason:

	const unsigned char *aligned_end = (const unsigned char *)
		((uintptr_t) end & (~(sizeof(size_t) - 1)));

aligned_end could be beyond the end for len < 8, so that we could read
invalid data or crash here:

	for (; p < aligned_end; p += sizeof(size_t))
	{

The len < 8 check is covered by the len < 64 check, so only the 64-byte
check is needed.

[1]: https://www.postgresql.org/message-id/Zy7hyG8JUMC5P2T3%40ip-10-97-1-34.eu-west-3.compute.internal
[2]: https://www.postgresql.org/message-id/CAApHDvp2jx_%3DpFbgj-O1_ZmzP9WOZKfwLzZrS_%3DZmbsqMQQ59g%40mail.gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 12, 2024 at 9:33 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Nov 12, 2024 at 01:32:36PM -0300, Ranier Vilela wrote:
> > See v1_allzeros_small.c attached.
>
> In your pg_memory_is_all_zeros_v11:
>
>     while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
>     {
>         if (p == end)
>             return true;
>         if (*p++ != 0)
>             return false;
>     }
>
>     if (len > sizeof(size_t) * 8)
>     {
>         for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
>         {
>             if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
>                 (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
>                 (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
>                 (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
>                 return false;
>         }
>     }
>
> If I'm reading that right, this could still read a couple of bytes
> past the wanted memory area.

Yeah, this is possible.

> For example, imagine a case of 65 bytes
> with a location a bit unaligned (more than 2 bytes). You'd want to
> check the remaining size after the first loop, not the initial one.
>
> I'd be OK to have a quick loop for the less-than-64-byte case rather
> than more checks depending on sizeof(size_t) spread, like Bertrand is
> suggesting.

I'm OK too. Maybe we are optimizing prematurely.
best regards,
Ranier Vilela
Hi,

On Tue, Nov 12, 2024 at 01:32:36PM -0300, Ranier Vilela wrote:
> It seems to me that it is enough to protect the SIMD loop when the size is
> smaller.
>
> if (len > sizeof(size_t) * 8)
> {
>     for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
>     {
>         if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
>             (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
>             (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
>             (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
>             return false;
>     }
> }
>
> See v1_allzeros_small.c attached.

Thanks for looking at it!

It's not enough, as that would not fix the second reason mentioned in [1].

[1]: https://www.postgresql.org/message-id/ZzLxAJuGzyqA7cUo%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 13, 2024 at 4:50 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Hi,
>
> On Wed, Nov 13, 2024 at 09:25:37AM +0900, Michael Paquier wrote:
> > So that seems worth the addition, especially for
> > smaller sizes where this is 6 times faster here.
>
> So, something like v12 in pg_memory_is_all_zeros_v12() in allzeros_small.c
> attached?
I ran the latest version (allzeros_small.c) with v12.
> If so, that gives us:
>
> == with BLCKSZ 32
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
> byte per byte: done in 22421 nanoseconds
> size_t: done in 7269 nanoseconds (3.08447 times faster than byte per byte)
> SIMD v10: done in 6349 nanoseconds (3.53142 times faster than byte per byte)
> SIMD v11: done in 22080 nanoseconds (1.01544 times faster than byte per byte)
> SIMD v12: done in 5595 nanoseconds (4.00733 times faster than byte per byte)
>
> $ gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
> byte per byte: done in 43882 nanoseconds
> size_t: done in 8845 nanoseconds (4.96122 times faster than byte per byte)
> SIMD v10: done in 10673 nanoseconds (4.1115 times faster than byte per byte)
> SIMD v11: done in 29177 nanoseconds (1.50399 times faster than byte per byte)
> SIMD v12: done in 9992 nanoseconds (4.39171 times faster than byte per byte)
> == with BLCKSZ 63
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
> byte per byte: done in 29525 nanoseconds
> size_t: done in 11232 nanoseconds (2.62865 times faster than byte per byte)
> SIMD v10: done in 10828 nanoseconds (2.72673 times faster than byte per byte)
> SIMD v11: done in 42056 nanoseconds (0.70204 times faster than byte per byte)
> SIMD v12: done in 10468 nanoseconds (2.8205 times faster than byte per byte)
>
> $ gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
> byte per byte: done in 68887 nanoseconds
> size_t: done in 20147 nanoseconds (3.41922 times faster than byte per byte)
> SIMD v10: done in 21410 nanoseconds (3.21752 times faster than byte per byte)
> SIMD v11: done in 56987 nanoseconds (1.20882 times faster than byte per byte)
> SIMD v12: done in 25102 nanoseconds (2.74428 times faster than byte per byte)
> == with BLCKSZ 256
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
> byte per byte: done in 120483 nanoseconds
> size_t: done in 23098 nanoseconds (5.21617 times faster than byte per byte)
> SIMD v10: done in 6737 nanoseconds (17.8838 times faster than byte per byte)
> SIMD v11: done in 6621 nanoseconds (18.1971 times faster than byte per byte)
> SIMD v12: done in 6519 nanoseconds (18.4818 times faster than byte per byte)
>
> $ gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
> byte per byte: done in 211759 nanoseconds
> size_t: done in 45879 nanoseconds (4.6156 times faster than byte per byte)
> SIMD v10: done in 12262 nanoseconds (17.2695 times faster than byte per byte)
> SIMD v11: done in 12018 nanoseconds (17.6202 times faster than byte per byte)
> SIMD v12: done in 11993 nanoseconds (17.6569 times faster than byte per byte)
== with BLCKSZ 8192
$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 3393459 nanoseconds
size_t: done in 707304 nanoseconds (4.79774 times faster than byte per byte)
SIMD v10: done in 233559 nanoseconds (14.5293 times faster than byte per byte)
SIMD v11: done in 225951 nanoseconds (15.0186 times faster than byte per byte)
SIMD v12: done in 225766 nanoseconds (15.0309 times faster than byte per byte)
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 12786295 nanoseconds
size_t: done in 1071590 nanoseconds (11.9321 times faster than byte per byte)
SIMD v10: done in 413219 nanoseconds (30.9431 times faster than byte per byte)
SIMD v11: done in 423469 nanoseconds (30.1942 times faster than byte per byte)
SIMD v12: done in 414106 nanoseconds (30.8769 times faster than byte per byte)
That's better for small sizes, but given the extra len checks that have been
added, I think we're back to David's point in [1]: what if the function
is not inlined for some reason?
So, out of curiosity, let's see what happens if it's not inlined in [2] (see the
-O2 -DNOT_INLINE compiler window):
- if a[3]: it looks like gcc is smart enough to create an optimized version
for that size using constant propagation
- if a[63]: Same as above
- if a[256]: Same as above
- if a[8192]: Same as above
I did a quick check with clang and it looks like it is not as smart as gcc
for the non-inlined case.
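For what it's worth, the observation can be reproduced with a minimal harness
like the sketch below (an illustration only; the actual allzeros_small.c
differs, and the attribute syntax is gcc/clang-specific). Compiling with
-O2 -DNOT_INLINE and inspecting the assembly shows gcc emitting
size-specialized clones through constant propagation, while clang tends to
keep one generic copy:

"
#include <stdbool.h>
#include <stddef.h>

#ifdef NOT_INLINE
__attribute__((noinline))
#endif
static bool
is_all_zeros(const void *ptr, size_t len)
{
	const unsigned char *p = ptr;

	/* plain byte-per-byte check, enough to observe constant propagation */
	for (size_t i = 0; i < len; i++)
		if (p[i] != 0)
			return false;
	return true;
}

/* each call site passes a compile-time constant len */
bool check63(const char a[63])     { return is_all_zeros(a, 63); }
bool check8192(const char a[8192]) { return is_all_zeros(a, 8192); }

int
main(void)
{
	char		a[63] = {0};

	return check63(a) ? 0 : 1;
}
"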
Anyway, it's not like we have a choice: we need (at least) one len check for
safety reasons (to avoid crashing or reading invalid data).
So, I'd vote for pg_memory_is_all_zeros_v12() then, thoughts?
I think that's good enough.
best regards,
Ranier Vilela
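For readers without the attachment, here is a rough sketch of the three-phase
shape being discussed (an illustration only, not the actual v12 patch: the
name, the memcpy-based word loads, and the tail handling are assumptions made
to keep the sketch portable and alignment-safe):

"
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the phased approach: byte per byte for tiny inputs (case 1),
 * one size_t at a time for small inputs (case 2), and 8 size_t words per
 * iteration for everything else (case 3), with a byte-wise tail.
 * memcpy() is used for the word loads to stay alignment-safe; compilers
 * turn it into a plain load at -O2.
 */
static inline bool
pg_memory_is_all_zeros_sketch(const void *ptr, size_t len)
{
	const unsigned char *p = ptr;
	size_t		w;

	/* Case 1: len < sizeof(size_t) bytes, check byte per byte */
	if (len < sizeof(size_t))
	{
		for (size_t i = 0; i < len; i++)
			if (p[i] != 0)
				return false;
		return true;
	}

	/* Case 2: len in the [sizeof(size_t), sizeof(size_t) * 8) range */
	if (len < sizeof(size_t) * 8)
	{
		for (size_t i = 0; i + sizeof(w) <= len; i += sizeof(w))
		{
			memcpy(&w, p + i, sizeof(w));
			if (w != 0)
				return false;
		}
		/* overlapping read covers the last, possibly partial, word */
		memcpy(&w, p + len - sizeof(w), sizeof(w));
		return w == 0;
	}

	/* Case 3: len >= sizeof(size_t) * 8, check 8 words per iteration */
	size_t		i = 0;

	for (; i + sizeof(w) * 8 <= len; i += sizeof(w) * 8)
	{
		size_t		acc = 0;

		for (int k = 0; k < 8; k++)
		{
			memcpy(&w, p + i + k * sizeof(w), sizeof(w));
			acc |= w;
		}
		if (acc != 0)
			return false;
	}
	for (; i < len; i++)		/* byte-wise tail */
		if (p[i] != 0)
			return false;
	return true;
}

int
main(void)
{
	char		buf[100] = {0};

	return pg_memory_is_all_zeros_sketch(buf, sizeof(buf)) ? 0 : 1;
}
"

On 64-bit that makes the case boundaries 8 and 64 bytes; on 32-bit, 4 and 32
bytes, which is the point discussed below.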
Em qui., 14 de nov. de 2024 às 07:09, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> escreveu:
Hi,
On Thu, Nov 14, 2024 at 09:27:06AM +0900, Michael Paquier wrote:
> Makes sense to me to just do that, with a first < 8 loop, and a second
> for the 8~63 range.
Thanks for looking at it!
> There is also a "cant'" in the last size_t check. Simple typo.
Please find attached v12, with more comments and comments changes to explain
the multiple cases (for safety) and phases (for efficiency).
Is it worth mentioning that pg_memory_is_all_zeros does not work correctly on 32-bit systems?
(63 < (size_t) * 8) /* 63 - 32*/
Or do we adjust the magic constants according to 32/64-bit?
best regards,
Ranier Vilela
Hi,
On Thu, Nov 14, 2024 at 08:22:23AM -0300, Ranier Vilela wrote:
> Em qui., 14 de nov. de 2024 às 07:09, Bertrand Drouvot <
> bertranddrouvot.pg@gmail.com> escreveu:
>
> > Hi,
> >
> > On Thu, Nov 14, 2024 at 09:27:06AM +0900, Michael Paquier wrote:
> > > Makes sense to me to just do that, with a first < 8 loop, and a second
> > > for the 8~63 range.
> >
> > Thanks for looking at it!
> >
> > > There is also a "cant'" in the last size_t check. Simple typo.
> >
> > Please find attached v12, with more comments and comments changes to
> > explain
> > the multiple cases (for safety) and phases (for efficiency).
>
> Is it worth mentioning that pg_memory_is_all_zeros does not work correctly
> on 32-bit systems?
>
> (63 < (size_t) * 8) /* 63 - 32*/
I think that the code is fully portable on 32-bit systems as it's using size_t
in all the places.
I agree that the comments are "64-bit" focused though, but I don't think that's
an issue (as I think it's already the case in multiple places in the core code).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Em qui., 14 de nov. de 2024 às 08:58, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> escreveu:
Hi,
On Thu, Nov 14, 2024 at 08:22:23AM -0300, Ranier Vilela wrote:
> Em qui., 14 de nov. de 2024 às 07:09, Bertrand Drouvot <
> bertranddrouvot.pg@gmail.com> escreveu:
>
> > Hi,
> >
> > On Thu, Nov 14, 2024 at 09:27:06AM +0900, Michael Paquier wrote:
> > > Makes sense to me to just do that, with a first < 8 loop, and a second
> > > for the 8~63 range.
> >
> > Thanks for looking at it!
> >
> > > There is also a "cant'" in the last size_t check. Simple typo.
> >
> > Please find attached v12, with more comments and comments changes to
> > explain
> > the multiple cases (for safety) and phases (for efficiency).
> >
> Is it worth mentioning that pg_memory_is_all_zeros does not work correctly
> on 32-bit systems?
>
> (63 < (size_t) * 8) /* 63 - 32*/
I think that the code is fully portable on 32-bit systems as it's using size_t
in all the places.
Maybe I'm doing something wrong.
But I'm testing in 32-bit, with the size set to 63, with v12, and I'm seeing the SIMD loop execute.
Because the test
if (len < sizeof(size_t) * 8) // 8-63 bytes
failed.
I expected that with size 63 it would be handled by case 2, or am I wrong?
best regards,
Ranier Vilela
PS: Windows 11, 64-bit
MSVC 2022, 32-bit build
Hi,
On Thu, Nov 14, 2024 at 09:13:19AM -0300, Ranier Vilela wrote:
> Em qui., 14 de nov. de 2024 às 08:58, Bertrand Drouvot <
> Maybe I'm doing something wrong.
> But I'm testing in 32-bit, with the size set to 63, with v12 and I'm seeing
> the SIMD loop execute.
Yeah, that's expected and safe, as each iteration reads 32 bytes on 32-bit.
> if (len < sizeof(size_t) * 8) // 8-63 bytes
> failed.
>
> I expected that with size 63, it would be solved by case 2, or am I wrong?
Case 2 should be read as "in the 4-31 bytes range" on 32-bit systems, as all
comparisons are done in size_t.
What would be unsafe on 32-bit would be to read up to 32 bytes while len < 32,
and that cannot happen.
As mentioned up-thread, the comments are wrong on 32-bit; indeed, they must be
read as:
Case 1: len < 4 bytes
Case 2: len in the 4-31 bytes range
Case 3: len >= 32 bytes
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
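To make those platform-dependent ranges concrete, here is a tiny standalone
probe (hypothetical, just printing the boundaries that sizeof(size_t) yields
on the build at hand):

"
#include <stdio.h>
#include <stddef.h>

int
main(void)
{
	/* the case boundaries scale with sizeof(size_t):
	 * 8/64 bytes on 64-bit builds, 4/32 bytes on 32-bit builds */
	printf("case 1: len < %zu\n", sizeof(size_t));
	printf("case 2: %zu <= len < %zu\n", sizeof(size_t), sizeof(size_t) * 8);
	printf("case 3: len >= %zu\n", sizeof(size_t) * 8);
	return 0;
}
"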
Hi,
On Fri, Nov 15, 2024 at 09:54:33AM -0300, Ranier Vilela wrote:
> There is a tiny typo with V13.
> + /* "len" in the [sizeof(size_t) * 8, inf] range */
I think "[sizeof(size_t) * 8, inf[ range" is correct. Infinity cannot be
included in an interval.
Thinking about it, actually, "[sizeof(size_t) * 8, inf)" (note the ')' at the
end) might be the proper notation from a mathematical point of view.
> But, I'm not sure if I'm still doing something wrong.
> If so, forgive me for the noise.
>
> Of course I expected "not is_allzeros".
That's the test case which is "wrong" (not the function):
"
size_t pagebytes[BLCKSZ] = {0};
volatile bool result;
pagebytes[BLCKSZ-2] = 1;
result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ);
"
The pagebytes is an array of size_t (8 bytes each), so the actual array size
is 8192 * 8 = 65536 bytes.
So, pagebytes[BLCKSZ-2] = 1 sets byte 65528 ((8192-2)*8) to 1.
But the function is checking up to BLCKSZ bytes (8192), hence the results you
observed (which are correct).
> Anyway, I made another attempt to optimize a bit more, I don't know if it's
> safe though.
There is an issue in your v14, it calls:
"
return pg_memory_is_all_zeros_simd(ptr, ptr + len);
"
but you defined it that way:
"
static inline bool
pg_memory_is_all_zeros_simd(const size_t *p, const size_t * end)
"
while that should be:
"
static inline bool
pg_memory_is_all_zeros_simd(const void *p, const void *end)
"
Doing so, I do not observe any improvements with v14.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Em sex., 15 de nov. de 2024 às 11:43, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> escreveu:
Hi,
On Fri, Nov 15, 2024 at 09:54:33AM -0300, Ranier Vilela wrote:
> There is a tiny typo with V13.
> + /* "len" in the [sizeof(size_t) * 8, inf] range */
I think "[sizeof(size_t) * 8, inf[ range" is correct. Infinity can not be included
into a interval.
Thinking about it, actually, "[sizeof(size_t) * 8, inf)" (note the ')' at the end)
might be the proper notation from a mathematical point of view.
Thanks for clarifying.
> But, I'm not sure if I'm still doing something wrong.
> If so, forgive me for the noise.
>
> Of course I expected "not is_allzeros".
That's the test case which is "wrong" (not the function):
"
size_t pagebytes[BLCKSZ] = {0};
volatile bool result;
pagebytes[BLCKSZ-2] = 1;
result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ);
"
The pagebytes is an array of size_t (8 bytes each), so the actual array size
is 8192 * 8 = 65536 bytes.
So, pagebytes[BLCKSZ-2] = 1, sets byte 65528 ((8192-2)*8) to 1.
But the function is checking up to BLCKSZ bytes (8192), hence the results you
observed (which are correct).
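For reference, a corrected variant of that test case might look like the
sketch below (using a byte array so that the index and the length agree; the
trivial checker is a stand-in, not the benchmarked v12 function):

"
#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192

/* trivial stand-in checker; any correct implementation behaves the same */
static bool
is_all_zeros(const void *ptr, size_t len)
{
	const unsigned char *p = ptr;

	for (size_t i = 0; i < len; i++)
		if (p[i] != 0)
			return false;
	return true;
}

int
main(void)
{
	/* bytes, not size_t: the array is now exactly BLCKSZ bytes, so the
	 * modified byte (offset 8190) lies inside the checked range */
	unsigned char pagebytes[BLCKSZ] = {0};
	volatile bool result;

	pagebytes[BLCKSZ - 2] = 1;
	result = is_all_zeros(pagebytes, BLCKSZ);
	return result ? 1 : 0;		/* result is false, as expected */
}
"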
Thanks for pointing out my mistake.
> Anyway, I made another attempt to optimize a bit more, I don't know if it's
> safe though.
There is an issue in your v14, it calls:
"
return pg_memory_is_all_zeros_simd(ptr, ptr + len);
"
but you defined it that way:
"
static inline bool
pg_memory_is_all_zeros_simd(const size_t *p, const size_t * end)
"
while that should be:
"
static inline bool
pg_memory_is_all_zeros_simd(const void *p, const void *end)
"
What I'm trying here, obviously, is a hack.
If it works, and the compiler accepts it, it's ok for me.
Doing so, I do not observe any improvements with v14.
So.
Again new results from v4_allzeros_small.c attached:
Linux Ubuntu 22.04
gcc 13 64 bits
With BLCKSZ 32
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; ./v4_allzeros_small
byte per byte: done in 44092 nanoseconds
size_t: done in 13456 nanoseconds (3.27675 times faster than byte per byte)
SIMD v10: done in 14249 nanoseconds (3.09439 times faster than byte per byte)
SIMD v11: done in 32516 nanoseconds (1.35601 times faster than byte per byte)
SIMD v12: done in 14973 nanoseconds (2.94477 times faster than byte per byte)
SIMD v14: done in 13101 nanoseconds (3.36554 times faster than byte per byte)
With BLCKSZ 63
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; ./v4_allzeros_small
byte per byte: done in 67656 nanoseconds
size_t: done in 25768 nanoseconds (2.62558 times faster than byte per byte)
SIMD v10: done in 21446 nanoseconds (3.15471 times faster than byte per byte)
SIMD v11: done in 56887 nanoseconds (1.18931 times faster than byte per byte)
SIMD v12: done in 22863 nanoseconds (2.95919 times faster than byte per byte)
SIMD v14: done in 21273 nanoseconds (3.18037 times faster than byte per byte)
With BLCKSZ 256
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; ./v4_allzeros_small
byte per byte: done in 220064 nanoseconds
size_t: done in 45886 nanoseconds (4.79589 times faster than byte per byte)
SIMD v10: done in 12032 nanoseconds (18.2899 times faster than byte per byte)
SIMD v11: done in 11965 nanoseconds (18.3923 times faster than byte per byte)
SIMD v12: done in 12041 nanoseconds (18.2762 times faster than byte per byte)
SIMD v14: done in 12575 nanoseconds (17.5001 times faster than byte per byte)
With BLCKSZ 8192
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ; ./v4_allzeros_small
byte per byte: done in 10365876 nanoseconds
size_t: done in 827654 nanoseconds (12.5244 times faster than byte per byte)
SIMD v10: done in 347755 nanoseconds (29.808 times faster than byte per byte)
SIMD v11: done in 342813 nanoseconds (30.2377 times faster than byte per byte)
SIMD v12: done in 341124 nanoseconds (30.3874 times faster than byte per byte)
SIMD v14: done in 50646 nanoseconds (204.673 times faster than byte per byte)
Results with v4_allzeros_check.c attached:
gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; ./v4_allzeros_check
sizeof(pagebytes)=32
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
SIMD v14: is_allzeros
gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; ./v4_allzeros_check
sizeof(pagebytes)=63
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
SIMD v14: is_allzeros
gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; ./v4_allzeros_check
sizeof(pagebytes)=256
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
p01=(0x7ffedb8ac430)
end=(0x7ffedb8ac530)
p02=(0x7ffedb8ac530)
SIMD v14: is_allzeros
gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ; ./v4_allzeros_check
sizeof(pagebytes)=8192
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
p01=(0x7ffd8864c200)
end=(0x7ffd8864e200)
p02=(0x7ffd8864e200)
SIMD v14: is_allzeros
If this hack is safe and correct, I think that 204 times faster is very good
for a block size of 8192.
That said,
V13 is fine as is.
LGTM.
best regards,
Ranier Vilela
Hi,
On Sat, Nov 16, 2024 at 11:42:54AM -0300, Ranier Vilela wrote:
> > Em sex., 15 de nov. de 2024 às 11:43, Bertrand Drouvot <
> > bertranddrouvot.pg@gmail.com> escreveu:
> >
> >> while that should be:
> >>
> >> "
> >> static inline bool
> >> pg_memory_is_all_zeros_simd(const void *p, const void *end)
> >>
> > What I'm trying here, obviously, is a hack.
> > If it works, and the compiler accepts it, it's ok for me.
>
> If this hack is safe and correct, I think that 204 times faster,
> it is very good, for a block size 8192.
The "hack" is not correct, because it's doing:
"
static inline bool
all_zeros_simd(const size_t *p, const size_t * end)
{
	for (; p < (end - sizeof(size_t) * 7); p += sizeof(size_t) * 8)
	{
		if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
			(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
			(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
			(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
			return false;
	}
.
.
"
"p += sizeof(size_t) * 8" advances by 64 elements. But those elements are
"size_t" elements (since you're using size_t pointers as the function
arguments). Then, instead of advancing by 64 bytes, you now advance by 512
bytes. But you only check 64 bytes per iteration -> you're missing 448 bytes
to check per iteration.
We can "visualize" this by adding a few output messages like:
"
static inline bool
all_zeros_simd(const size_t *p, const size_t * end)
{
	for (; p < (end - sizeof(size_t) * 7); p += sizeof(size_t) * 8)
	{
		printf("Current p: %p\n", (void*)p);
		printf("Checking elements:\n");
		printf("[0]: %p = %zu\n", (void*)&((size_t *)p)[0], ((size_t *)p)[0]);
		printf("[1]: %p = %zu\n", (void*)&((size_t *)p)[1], ((size_t *)p)[1]);
		printf("[2]: %p = %zu\n", (void*)&((size_t *)p)[2], ((size_t *)p)[2]);
		printf("[3]: %p = %zu\n", (void*)&((size_t *)p)[3], ((size_t *)p)[3]);
		printf("[4]: %p = %zu\n", (void*)&((size_t *)p)[4], ((size_t *)p)[4]);
		printf("[5]: %p = %zu\n", (void*)&((size_t *)p)[5], ((size_t *)p)[5]);
		printf("[6]: %p = %zu\n", (void*)&((size_t *)p)[6], ((size_t *)p)[6]);
		printf("[7]: %p = %zu\n", (void*)&((size_t *)p)[7], ((size_t *)p)[7]);
		const size_t *next_p = p + sizeof(size_t) * 8;
		printf("Next p will be: %p (advance of %zu bytes)\n",
			   (void*)next_p, (size_t)((char*)next_p - (char*)p));
.
.
.
"
Then we get things like:
"
Current p: 0x7fff2a93e500
Checking elements:
[0]: 0x7fff2a93e500 = 0
[1]: 0x7fff2a93e508 = 0
[2]: 0x7fff2a93e510 = 0
[3]: 0x7fff2a93e518 = 0
[4]: 0x7fff2a93e520 = 0
[5]: 0x7fff2a93e528 = 0
[6]: 0x7fff2a93e530 = 0
[7]: 0x7fff2a93e538 = 0
Next p will be: 0x7fff2a93e700 (advance of 512 bytes)
"
Meaning that you're checking 64 bytes per iteration (for example, from 0x500
to 0x508 is 8 bytes) while advancing by 512 bytes: you're missing 448 bytes
to check per iteration.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
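For completeness, here is a sketch of the unrolled loop with the pointer
arithmetic done in elements rather than bytes (an illustration only, not a
reviewed patch; it assumes the region is a whole number of size_t words and
leaves out the alignment/head handling discussed up-thread):

"
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * With size_t pointers, p + 8 advances by 8 *elements* (64 bytes on
 * 64-bit), matching the 8 words checked per iteration.
 */
static inline bool
all_zeros_unrolled(const size_t *p, const size_t *end)
{
	for (; p + 8 <= end; p += 8)
	{
		if ((p[0] != 0) | (p[1] != 0) | (p[2] != 0) | (p[3] != 0) |
			(p[4] != 0) | (p[5] != 0) | (p[6] != 0) | (p[7] != 0))
			return false;
	}
	for (; p < end; p++)		/* leftover words */
		if (*p != 0)
			return false;
	return true;
}

int
main(void)
{
	size_t		buf[1024] = {0};

	/* this word would have been skipped by the 512-byte stride */
	buf[1020] = 1;
	printf("%d\n", all_zeros_unrolled(buf, buf + 1024));	/* prints 0 */
	return 0;
}
"

Checking every word exactly once also removes the misleading 204x figure:
v14 was effectively inspecting only one eighth of the buffer.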