Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c) - Mailing list pgsql-hackers

From Ranier Vilela
Subject Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)
Date
Msg-id CAEudQArJk9fOm0sSLL8UOoaRKn6y59aKiA_NOM9qef0AXRZFqw@mail.gmail.com
Whole thread Raw
In response to Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)
List pgsql-hackers
Em qua., 18 de mai. de 2022 às 19:57, David Rowley <dgrowleyml@gmail.com> escreveu:
On Thu, 19 May 2022 at 02:08, Ranier Vilela <ranier.vf@gmail.com> wrote:
> That would initialize the content at compilation and not at runtime, correct?

Your mental model of compilation and run-time might be flawed here.
Here's no such thing as zeroing memory at compile time. There's only
emitting instructions that perform those tasks at run-time.
https://godbolt.org/ might help your understanding.

> There are a lot of cases using MemSet (with struct variables) and at Windows 64 bits, long are 4 (four) bytes.
> So I believe that MemSet is less efficient on Windows than on Linux.
> "The size of the '_vstart' buffer is not a multiple of the element size of the type 'long'."
> message from PVS-Studio static analysis tool.

I've been wondering for a while if we really need to have the MemSet()
macro. I see it was added in 8cb415449 (1997).  I think compilers have
evolved quite a bit in the past 25 years, so it could be time to
revisit that.

Your comment on the sizeof(long) on win64 is certainly true.  I wrote
the attached C program to test the performance difference.

(windows 64-bit)
>cl memset.c /Ox
>memset 200000000
Running 200000000 loops
MemSet: size 8: 1.833000 seconds
MemSet: size 16: 1.841000 seconds
MemSet: size 32: 1.838000 seconds
MemSet: size 64: 1.851000 seconds
MemSet: size 128: 3.228000 seconds
MemSet: size 256: 5.278000 seconds
MemSet: size 512: 3.943000 seconds
memset: size 8: 0.065000 seconds
memset: size 16: 0.131000 seconds
memset: size 32: 0.262000 seconds
memset: size 64: 0.530000 seconds
memset: size 128: 1.169000 seconds
memset: size 256: 2.950000 seconds
memset: size 512: 3.191000 seconds

It seems like there's no cases there where MemSet is faster than
memset.  I was careful to only provide MemSet() with inputs that
result in it not using the memset fallback.  I also provided constants
so that the decision about which method to use was known at compile
time.

It's not clear to me why 512 is faster than 256. I saw the same on a repeat run.

Changing "long" to "long long" it looks like:

>memset 200000000
Running 200000000 loops
MemSet: size 8: 0.066000 seconds
MemSet: size 16: 1.978000 seconds
MemSet: size 32: 1.982000 seconds
MemSet: size 64: 1.973000 seconds
MemSet: size 128: 1.970000 seconds
MemSet: size 256: 3.225000 seconds
MemSet: size 512: 5.366000 seconds
memset: size 8: 0.069000 seconds
memset: size 16: 0.132000 seconds
memset: size 32: 0.265000 seconds
memset: size 64: 0.527000 seconds
memset: size 128: 1.161000 seconds
memset: size 256: 2.976000 seconds
memset: size 512: 3.179000 seconds

The situation is a little different on my Linux machine:

$ gcc memset.c -o memset -O2
$ ./memset 200000000
Running 200000000 loops
MemSet: size 8: 0.000002 seconds
MemSet: size 16: 0.000000 seconds
MemSet: size 32: 0.094041 seconds
MemSet: size 64: 0.184618 seconds
MemSet: size 128: 1.781503 seconds
MemSet: size 256: 2.547910 seconds
MemSet: size 512: 4.005173 seconds
memset: size 8: 0.046156 seconds
memset: size 16: 0.046123 seconds
memset: size 32: 0.092291 seconds
memset: size 64: 0.184509 seconds
memset: size 128: 1.781518 seconds
memset: size 256: 2.577104 seconds
memset: size 512: 4.004757 seconds

It looks like part of the work might be getting optimised away in the
8-16 MemSet() calls.

clang seems to have the opposite for size 8.

$ clang memset.c -o memset -O2
$ ./memset 200000000
Running 200000000 loops
MemSet: size 8: 0.007653 seconds
MemSet: size 16: 0.005771 seconds
MemSet: size 32: 0.011539 seconds
MemSet: size 64: 0.023095 seconds
MemSet: size 128: 0.046130 seconds
MemSet: size 256: 0.092269 seconds
MemSet: size 512: 0.968564 seconds
memset: size 8: 0.000000 seconds
memset: size 16: 0.005776 seconds
memset: size 32: 0.011559 seconds
memset: size 64: 0.023069 seconds
memset: size 128: 0.046129 seconds
memset: size 256: 0.092243 seconds
memset: size 512: 0.968534 seconds
The results from clang, only reinforce the argument in favor of native memset.
There is still room for gcc to improve with 8/16 bytes and for sure at some point they will.
Which will make memset faster on all platforms and compilers.

regards,
Ranier Vilela

pgsql-hackers by date:

Previous
From: Ranier Vilela
Date:
Subject: Re: Avoid unecessary MemSet call (src/backend/utils/cache/relcache.c)
Next
From: Thomas Munro
Date:
Subject: Re: PostgreSQL 15 Beta 1 release announcement draft