Thread: Simplify newNode()

Simplify newNode()

From

Heikki Linnakangas

Date:

14 December 2023, 00:48:36

The newNode() macro can be turned into a static inline function, which 
makes it a lot simpler. See attached. This was not possible when the 
macro was originally written, as we didn't require compiler to have 
static inline support, but nowadays we do.

This was last discussed in 2008, see discussion at 
https://www.postgresql.org/message-id/26133.1220037409%40sss.pgh.pa.us. 
In those tests, Tom observed that gcc refused to inline the static 
inline function. That was weird, the function is very small and doesn't 
do anything special. Whatever the problem was, I think we can dismiss it 
with modern compilers. It does get inlined on gcc 12 and clang 14 that I 
have installed.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

0001-Convert-newNode-into-a-static-inline-function.patch

Re: Simplify newNode()

From

Zhang Mingli

Date:

14 December 2023, 01:34:26

Hi,

LGTM.

+ Assert(size >= sizeof(Node)); /* need the tag, at least */
+ result = (Node *) palloc0fast(size);
+ result->type = tag;

+ return result;
+}

How about moving the comments /* need the tag, at least */ after result->type = tag; by the way?

Zhang Mingli

www.hashdata.xyz

Re: Simplify newNode()

From

Junwang Zhao

Date:

14 December 2023, 02:19:13

On Thu, Dec 14, 2023 at 9:34 AM Zhang Mingli <zmlpostgres@gmail.com> wrote:
>
> Hi,
>
> LGTM.
>
> + Assert(size >= sizeof(Node)); /* need the tag, at least */
> + result = (Node *) palloc0fast(size);
> + result->type = tag;
>
> + return result;
> +}
>
> How about moving the comments /* need the tag, at least */ after result->type = tag; by the way?

I don't think so, the comment has the meaning of the requested size
should at least the size
of Node, which contains just a NodeTag.

typedef struct Node
{
NodeTag type;
} Node;

>
>
>
> Zhang Mingli
> www.hashdata.xyz



--
Regards
Junwang Zhao

Re: Simplify newNode()

From

Peter Eisentraut

Date:

14 December 2023, 08:32:09

On 14.12.23 01:48, Heikki Linnakangas wrote:
> The newNode() macro can be turned into a static inline function, which 
> makes it a lot simpler. See attached. This was not possible when the 
> macro was originally written, as we didn't require compiler to have 
> static inline support, but nowadays we do.
> 
> This was last discussed in 2008, see discussion at 
> https://www.postgresql.org/message-id/26133.1220037409%40sss.pgh.pa.us. 
> In those tests, Tom observed that gcc refused to inline the static 
> inline function. That was weird, the function is very small and doesn't 
> do anything special. Whatever the problem was, I think we can dismiss it 
> with modern compilers. It does get inlined on gcc 12 and clang 14 that I 
> have installed.

I notice that the existing comments point out that the size argument 
should be a compile-time constant, but that is no longer the case for 
ExtensibleNode().  Also, newNode() is the only caller of palloc0fast(), 
which also points out that the size argument should be a compile-time 
constant, and palloc0fast() is the only caller of MemSetTest().  I can 
see how an older compiler might have gotten too confused by all that. 
But if we think that compilers are now smart enough, maybe we can unwind 
this whole stack a bit more?  Maybe we don't need MemSetTest() and/or 
palloc0fast() and/or newNode() at all?

Re: Simplify newNode()

From

Heikki Linnakangas

Date:

14 December 2023, 13:53:27

On 14/12/2023 10:32, Peter Eisentraut wrote:
> I notice that the existing comments point out that the size argument
> should be a compile-time constant, but that is no longer the case for
> ExtensibleNode().  Also, newNode() is the only caller of palloc0fast(),
> which also points out that the size argument should be a compile-time
> constant, and palloc0fast() is the only caller of MemSetTest().  I can
> see how an older compiler might have gotten too confused by all that.
> But if we think that compilers are now smart enough, maybe we can unwind
> this whole stack a bit more?  Maybe we don't need MemSetTest() and/or
> palloc0fast() and/or newNode() at all?

Good point. Looking closer, modern compilers will actually turn the 
MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset() 
anyway! Funny. That is true for recent versions of gcc, clang, and MSVC. 
Here's a link to a godbolt snippet to play with this: 
https://godbolt.org/z/9b89P3c8x (full link at [0]).

Yeah, +1 on removing all that, including MemoryContextAllocZeroAligned. 
It's not doing any good as it is, as it gets compiled to be identical to 
MemoryContextAllocZero. (There are small differences depending compiler 
and version, but e.g. on gcc 13.2, the code generated for 
MemoryContextAllocZero() is actually smaller even though both call memset())

Another approach would be to add more hints to 
MemoryContextAllocZeroAligned to dissuade the compiler from turning the 
loop into a memset() call. If you add an "if (size > 1024) abort" there, 
then gcc 13 doesn't optimize into a memset() call, but clang still does. 
Some micro-benchmarks on that would be nice.

But given that the compiler has been doing this optimization for a while 
and we haven't noticed, I think memset() should be considered the status 
quo, and converting it to a loop again should be considered a new 
optimization.

Also, replacing MemoryContextAllocZeroAligned(CurrentMemoryContext, 
size) with palloc0(size) has one fewer argument and the assembly code of 
the call has one fewer instruction. That's something too.

[0] 

https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxBIa0gAOqAqETgwe3r56icmOAkEh4SxRMVxxtpj2uQxCBEzEBOk%2Bflzllak1dQT5YZHRsdIKtfWNmS2Dnd2Fxf0AlLaoXsTI7BzmAMzByN5YANQma26D6BGongB0CPvYJhoAguub25h7B0e0eBEXVzf3ZhsMWy8u32hwI6CwVC%2Ba2udweAKeL1BxGCwChMPudwA9AAqHZUYioFg7ZAXHbYzE/B5UHYAfRpAHFQnI3HSXtc1gARHZrSl/CHBZ7vADWFQAnhBVDMbgBOOkRLx0RwMGmYVTxTAOKCSnZgMD7LkaUg7LhS2F8zA0EI7LwMYViiUzWk0%2BWK4IqtUaghax26/U7Q3%2B02/NYVJS8tb8q122jiyUy706vWcwPhyPPG3R2OOhO%2B5MaIPrVx4KiUs0Ri0CnYAWSYqlutFoqGQQjwAC9MPGIC3246NKo1lRB0OqI6ccadsA8MAmBFRQRngBacfk1MVq31xvNtuYACSCgAamI8OgIMke53u5hs2er4i/TW6w2m5eC3dBsQvA5q5gWCRRR5BFVAgOSYWp9isO4CFFdUIR2d9PwIb9f2If9ZCAkDajJKsfz/AD51UAhwJ%2BKCYItODtxpRDLyIyDoMwWD4K/bDkNQwCCI8G152IBQkNwtD2IWQDuJojFbhI%2BiyIAN1QY8dggbFmL/DoCAUZRkUEAAxG1kGzRSULwoDiX4ggjWk2TsXiJgFEGBAP1IUtpRlJyNEclznNcjydiMhhBmJBA6jJcYVJpd8UXs2E3Mijz3JilyvJOTwdnidSCEo1AQrBaJiClNYINE8SGIID8mJw/TjOwggEAwBRKQAdjymUzPQGUvLJeSxE3XTStY/DELQNiTJ2S9yPPXKHLHUQGx2Sq8B4/FMEwGl4h2YJiSs54FEJTABGedY0zADg5uIBaIEdFcIulJr3Pk%2BbFvibMmrJRJgi4nKGpcq64q87F5OOjqmwemT0CemShKNYabzehzPo837MCUAgupYgyCO83qoYumH42xLB6HnGl%2BvRuS9J6wzCaAjGMWlEmUcQ%2BTgEwVLkAQG0hQJ4zAfM56hMpmVqKi%2BnGYJlmGDZhRLOWTngYs0HXpEmUEtoa7sVmlUWHiKCkb4ga0Yp%2BWPqB5WgoULWyp18mCPCqnYuir6SeU1SUq0gEkpSqhtNMoGnqsmy7Icm33K8mnjI4oSeOxIhaloBQrYDqKWp2RXXZetKMqwYhsvlv5i1gqtsCrAB5AAlABNGk3AL0IABVsAADSr8uAAlsDcABpHdQnpaHDYF7FmY1IVTdJ1GLcRrOQwYfAS1hWquWDgaKqq9AarGiKxLowrisQ%2Bfeowpg6ryrzMUxJLgBpBgMEW0CiogGd3yYBxTQ8sdKuiZ46meC%2BdgvrAeNQakZo8XEmSCka8j6gOlKES%2BVcmDAHcuJGittpRjmPIwRwNB4Y7FVA/RCQpgjA3/rrVG50qYvwQPDZ4BAADuqA8R4AqMvHYH8koGGWMDchx1pq0JYMEPAPD2xMPeMABgbBBA7CoVZWoDMQAgIcordys0i4UMIqvZ%2BuIq4vC5F/cWD935PjYeRAEgpJE7GOgjWRF15FRX%2BlQncaRkTCE9KkcCOwxw2KSv9ZAK0GDEgcXgSacEnECAsVTfmHk2AsBpJ40C9FEEv2IA/IUOwIl/kEZuGJwN%2BDEGmggWaRDEIkJlP1XyO8gKL2qmSNglVqpxNxJJPA9QvBiDxNpKo00Zz0BCTKUpqNLLHUEIgmUY4mQBACCtakX8%2BloLkkQeI9BJIVHyWdUBVMemIRoNxAgzM6DNVUWOchTACHUneKLeiOx3i%2BUIds2g6B%2BldJcms12mBJLXN2RBZBuJkrPKcF4HirydiEIUKwZ4UyxGFIed1Wm38gKvNqdCkeuSbkAupECtgHj%2BkFJWUUgQvlmYBWxMwNggyooTWMt/YFcltC/PWSQHYWB5TAEnEYZZDlil9X8tklWWABmqPcqSnWO4uTFiYQwUUlLqV4lpfSrwjKUQsruOAtZbgOoRESWSMxQtkARBXu8scFzEKEI1QQTEuNGbPEmrQVVyAhThyxbPXiZtd6gX3qvUSq5LTPGYkIRmARTjxFPEMQakkxBGnoAwR0JgACsbgHLoFoVGmNF0TD1T2NG/2LlGxGDJLScY9RNFyUzcAMk15A1vVTYm62GaBBFtxBlVAy0/QQELcWuSEA8WcsdBlToexLByRfAWxgMwy0Jv9mmpNLkqGIueBALteaQQ5tmRGsdlaPLYlnYRSw1h80aBcSOpN9rJ10GnfmUsolHrYh%2BGsjcTYABa0RUD1inCEE8jzR7g23CNK8B9u7HgTuqxm%2BtSGYluNZaIXo1l7kPO8E8o8h362QcB0D9QoEEDsW4Pxk1vUOFSG2jmgGZTCogLqa9W52yQaPCeSGr5pReRnCQL0vMLqjwXFcRRyj81UDEGGV1DljqIT9Mxq4VSl4KBY9CTxuGBpGkhvBwjGY8AihjBAPjmi/QjICHB/d70aNMJOPUU68Hk0clPUMzE%2B5bgBHpEXDuHIaR5yrMoAuBcAg0gswEAubhJO9SNHx6T25GNAa9T6v1ynGZGgDDJ11rVkk/gRqFwaEX/P4ZcnxxYPi%2BPy1nj8DgcxaCcEjbwPwHAtCkFQJwGNm7e2bUWMsHtaweCkAIJoHLcwhQgEjYaPLHBJCFea6VzgvAFAgENE14rOXSBwFgEgNA6sj1kAoLh2b9AYjAC4GsMwfBFTRCGxACIfWIjBDqKKTgDWDvMBQgXCI2hPQnd4DN0RBAC62mO2N0g0rgDKobEN7gvAsAsEMMAcQr38DHWwws77JXVQai8POW75BAJdZK%2B8CICSypYD60VPht25hUAMMAA89CqEF3VEVhr/BBAiDEOwKQMhBCKBUOoV7ugWgGCMCgaw1h9AfCG5AOY9aqjfYXAXMwvBUALIzqgnnp1WhBL8BAVwIxmikECAKKYfQWjZBSAIRXWQkha4YJMXoJQZfYYEB0YYngmh6DsLL83XRVdG%2Bt4GnXYxA2G6KOruY1WljU9y/l3rr2yscB2KoAAHAANgXOHyQE5kBeLW2cMwclcCEFpesE0vBRtaCHaQNrHX9CcB66QIrJWg%2BDeG415rOeusi%2BL31svlexs5/F8kZwkggA%3D%3D

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

v2-0001-Simplify-newNode-by-removing-now-unnecessary-fast.patch

Re: Simplify newNode()

From

Tom Lane

Date:

14 December 2023, 22:44:14

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 14/12/2023 10:32, Peter Eisentraut wrote:
>> But if we think that compilers are now smart enough, maybe we can unwind
>> this whole stack a bit more?  Maybe we don't need MemSetTest() and/or
>> palloc0fast() and/or newNode() at all?

> Good point. Looking closer, modern compilers will actually turn the 
> MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset() 
> anyway! Funny. That is true for recent versions of gcc, clang, and MSVC. 

I experimented with the same planner-intensive test case that I used
in the last discussion back in 2008.  I got these results:

HEAD:
tps = 144.974195 (without initial connection time)

v1 patch:
tps = 146.302004 (without initial connection time)

v2 patch:
tps = 144.882208 (without initial connection time)

While there's not much daylight between these numbers, the times are
quite reproducible for me.  This is with RHEL8's gcc 8.5.0 on x86_64.
That's probably a bit trailing-edge in terms of what people might be
using with v17, but I don't think it's discountable.

I also looked at the backend's overall code size per size(1):

HEAD:
   text    data     bss     dec     hex filename
8613007  100192  220176 8933375  884fff testversion.stock/bin/postgres

v1 patch:
   text    data     bss     dec     hex filename
8615126  100192  220144 8935462  885826 testversion.v1/bin/postgres

v2 patch:
   text    data     bss     dec     hex filename
8595322  100192  220144 8915658  880aca testversion.v2/bin/postgres

I did check that the v1 patch successfully inlines newNode() and
reduces it to just a MemoryContextAllocZeroAligned call, so it's
correct that modern compilers do that better than whatever I tested
in 2008.  But I wonder what is happening in v2 to reduce the code
size so much.  MemoryContextAllocZeroAligned is not 20kB by itself.

> Good point. Looking closer, modern compilers will actually turn the 
> MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset() 
> anyway! Funny. That is true for recent versions of gcc, clang, and MSVC. 

Not here ...

> Yeah, +1 on removing all that, including MemoryContextAllocZeroAligned. 
> It's not doing any good as it is, as it gets compiled to be identical to 
> MemoryContextAllocZero.

Also not so here.  Admittedly, my results don't make much of a case
for keeping the two code paths, even on compilers where it matters.

            regards, tom lane

Re: Simplify newNode()

From

Thomas Munro

Date:

14 December 2023, 22:56:25

On Fri, Dec 15, 2023 at 11:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > Yeah, +1 on removing all that, including MemoryContextAllocZeroAligned.
> > It's not doing any good as it is, as it gets compiled to be identical to
> > MemoryContextAllocZero.
>
> Also not so here.  Admittedly, my results don't make much of a case
> for keeping the two code paths, even on compilers where it matters.

FWIW here is what I figured out once about why it gets compiled the
same these days:

https://www.postgresql.org/message-id/CA+hUKGLfa6ANa0vs7Lf0op0XBH05HE8SyX8NFhDyT7k2CHYLXw@mail.gmail.com

Re: Simplify newNode()

From

John Naylor

Date:

18 December 2023, 09:55:18

On Fri, Dec 15, 2023 at 5:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I did check that the v1 patch successfully inlines newNode() and
> reduces it to just a MemoryContextAllocZeroAligned call, so it's
> correct that modern compilers do that better than whatever I tested
> in 2008.  But I wonder what is happening in v2 to reduce the code
> size so much.  MemoryContextAllocZeroAligned is not 20kB by itself.

I poked at this a bit and it seems to come from what Heikki said
upthread about fewer instructions before the calls: Running objdump on
v1 and v2 copyfuncs.o and diff'ing shows there are fewer MOV
instructions (some extraneous stuff removed):

        e9 da 5f 00 00          jmp    <_copyReindexStmt>
-       48 8b 05 00 00 00 00    mov    rax,QWORD PTR [rip+0x0]
-       be 18 00 00 00          mov    esi,0x18
-       48 8b 38                mov    rdi,QWORD PTR [rax]
-       e8 00 00 00 00          call   MemoryContextAllocZeroAligned-0x4
+       bf 18 00 00 00          mov    edi,0x18
+       e8 00 00 00 00          call   palloc0-0x4

That's 10 bytes savings.

-       48 8b 05 00 00 00 00    mov    rax,QWORD PTR [rip+0x0]
-       48 8b 38                mov    rdi,QWORD PTR [rax]
-       e8 00 00 00 00          call   MemoryContextAllocZeroAligned-0x4
+       e8 00 00 00 00          call   palloc0-0x4

...another 10 bytes. Over and over again.

Because of the size differences, the compiler is inlining more: e.g.
in v1 _copyFieldStore has 4 call sites, but in v2 it got inlined.

About the patch, I'm wondering if this whitespace is intentional, but
it's otherwise straightforward:

--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -132,6 +132,7 @@ typedef struct Node

 #define nodeTag(nodeptr) (((const Node*)(nodeptr))->type)

+
 /*

Re: Simplify newNode()

From

Heikki Linnakangas

Date:

18 December 2023, 14:28:31

On 15/12/2023 00:44, Tom Lane wrote:
>> Good point. Looking closer, modern compilers will actually turn the
>> MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset()
>> anyway! Funny. That is true for recent versions of gcc, clang, and MSVC.
> Not here ...

Hmm, according to godbolt, the change happened in GCC version 10.1. 
Starting with gcc 10.1, it is turned into a memset(). On clang, the same 
change happened in version 3.4.1.

I think we have consensus on patch v2. It's simpler and not less 
performant than what we have now, at least on modern compilers. Barring 
objections, I'll commit that.

I'm not planning to spend more time on this, but there might be some 
room for further optimization if someone is interested to do the 
micro-benchmarking. The obvious thing would be to persuade modern 
compilers to not switch to memset() in MemoryContextAllocZeroAligned 
(*), making the old macro logic work the same way it used to on old 
compilers.

Also, instead of palloc0, it might be better for newNode() to call 
palloc followed by memset. That would allow the compiler to partially 
optimize away the memset. Most callers fill at least some of the fields 
after calling makeNode(), so the compiler could generate code that 
clears only the uninitialized fields and padding bytes.

(*) or rather, a new function like MemoryContextAllocZeroAligned but 
without the 'context' argument. We want to keep the savings in the 
callers from eliminating the extra argument.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Re: Simplify newNode()

From

Heikki Linnakangas

Date:

19 December 2023, 10:13:12

On 18/12/2023 16:28, Heikki Linnakangas wrote:
> I think we have consensus on patch v2. It's simpler and not less
> performant than what we have now, at least on modern compilers. Barring
> objections, I'll commit that.

Committed that.

-- 
Heikki Linnakangas
Neon (https://neon.tech)