Thread: Simplify newNode()
The newNode() macro can be turned into a static inline function, which makes it a lot simpler. See attached. This was not possible when the macro was originally written, as we didn't require compiler to have static inline support, but nowadays we do. This was last discussed in 2008, see discussion at https://www.postgresql.org/message-id/26133.1220037409%40sss.pgh.pa.us. In those tests, Tom observed that gcc refused to inline the static inline function. That was weird, the function is very small and doesn't do anything special. Whatever the problem was, I think we can dismiss it with modern compilers. It does get inlined on gcc 12 and clang 14 that I have installed. -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
Hi,
LGTM.
+ Assert(size >= sizeof(Node)); /* need the tag, at least */
+ result = (Node *) palloc0fast(size);
+ result->type = tag;
+ return result;
+}
How about moving the comments /* need the tag, at least */ after result->type = tag; by the way?
LGTM.
+ Assert(size >= sizeof(Node)); /* need the tag, at least */
+ result = (Node *) palloc0fast(size);
+ result->type = tag;
+ return result;
+}
How about moving the comments /* need the tag, at least */ after result->type = tag; by the way?
www.hashdata.xyz
On Thu, Dec 14, 2023 at 9:34 AM Zhang Mingli <zmlpostgres@gmail.com> wrote: > > Hi, > > LGTM. > > + Assert(size >= sizeof(Node)); /* need the tag, at least */ > + result = (Node *) palloc0fast(size); > + result->type = tag; > > + return result; > +} > > How about moving the comments /* need the tag, at least */ after result->type = tag; by the way? I don't think so, the comment has the meaning of the requested size should at least the size of Node, which contains just a NodeTag. typedef struct Node { NodeTag type; } Node; > > > > Zhang Mingli > www.hashdata.xyz -- Regards Junwang Zhao
On 14.12.23 01:48, Heikki Linnakangas wrote: > The newNode() macro can be turned into a static inline function, which > makes it a lot simpler. See attached. This was not possible when the > macro was originally written, as we didn't require compiler to have > static inline support, but nowadays we do. > > This was last discussed in 2008, see discussion at > https://www.postgresql.org/message-id/26133.1220037409%40sss.pgh.pa.us. > In those tests, Tom observed that gcc refused to inline the static > inline function. That was weird, the function is very small and doesn't > do anything special. Whatever the problem was, I think we can dismiss it > with modern compilers. It does get inlined on gcc 12 and clang 14 that I > have installed. I notice that the existing comments point out that the size argument should be a compile-time constant, but that is no longer the case for ExtensibleNode(). Also, newNode() is the only caller of palloc0fast(), which also points out that the size argument should be a compile-time constant, and palloc0fast() is the only caller of MemSetTest(). I can see how an older compiler might have gotten too confused by all that. But if we think that compilers are now smart enough, maybe we can unwind this whole stack a bit more? Maybe we don't need MemSetTest() and/or palloc0fast() and/or newNode() at all?
On 14/12/2023 10:32, Peter Eisentraut wrote: > I notice that the existing comments point out that the size argument > should be a compile-time constant, but that is no longer the case for > ExtensibleNode(). Also, newNode() is the only caller of palloc0fast(), > which also points out that the size argument should be a compile-time > constant, and palloc0fast() is the only caller of MemSetTest(). I can > see how an older compiler might have gotten too confused by all that. > But if we think that compilers are now smart enough, maybe we can unwind > this whole stack a bit more? Maybe we don't need MemSetTest() and/or > palloc0fast() and/or newNode() at all? Good point. Looking closer, modern compilers will actually turn the MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset() anyway! Funny. That is true for recent versions of gcc, clang, and MSVC. Here's a link to a godbolt snippet to play with this: https://godbolt.org/z/9b89P3c8x (full link at [0]). Yeah, +1 on removing all that, including MemoryContextAllocZeroAligned. It's not doing any good as it is, as it gets compiled to be identical to MemoryContextAllocZero. (There are small differences depending compiler and version, but e.g. on gcc 13.2, the code generated for MemoryContextAllocZero() is actually smaller even though both call memset()) Another approach would be to add more hints to MemoryContextAllocZeroAligned to dissuade the compiler from turning the loop into a memset() call. If you add an "if (size > 1024) abort" there, then gcc 13 doesn't optimize into a memset() call, but clang still does. Some micro-benchmarks on that would be nice. But given that the compiler has been doing this optimization for a while and we haven't noticed, I think memset() should be considered the status quo, and converting it to a loop again should be considered a new optimization. Also, replacing MemoryContextAllocZeroAligned(CurrentMemoryContext, size) with palloc0(size) has one fewer argument and the assembly code of the call has one fewer instruction. That's something too. [0] https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGe1wAyeAyYAHI%2BAEaYxBIa0gAOqAqETgwe3r56icmOAkEh4SxRMVxxtpj2uQxCBEzEBOk%2Bflzllak1dQT5YZHRsdIKtfWNmS2Dnd2Fxf0AlLaoXsTI7BzmAMzByN5YANQma26D6BGongB0CPvYJhoAguub25h7B0e0eBEXVzf3ZhsMWy8u32hwI6CwVC%2Ba2udweAKeL1BxGCwChMPudwA9AAqHZUYioFg7ZAXHbYzE/B5UHYAfRpAHFQnI3HSXtc1gARHZrSl/CHBZ7vADWFQAnhBVDMbgBOOkRLx0RwMGmYVTxTAOKCSnZgMD7LkaUg7LhS2F8zA0EI7LwMYViiUzWk0%2BWK4IqtUaghax26/U7Q3%2B02/NYVJS8tb8q122jiyUy706vWcwPhyPPG3R2OOhO%2B5MaIPrVx4KiUs0Ri0CnYAWSYqlutFoqGQQjwAC9MPGIC3246NKo1lRB0OqI6ccadsA8MAmBFRQRngBacfk1MVq31xvNtuYACSCgAamI8OgIMke53u5hs2er4i/TW6w2m5eC3dBsQvA5q5gWCRRR5BFVAgOSYWp9isO4CFFdUIR2d9PwIb9f2If9ZCAkDajJKsfz/AD51UAhwJ%2BKCYItODtxpRDLyIyDoMwWD4K/bDkNQwCCI8G152IBQkNwtD2IWQDuJojFbhI%2BiyIAN1QY8dggbFmL/DoCAUZRkUEAAxG1kGzRSULwoDiX4ggjWk2TsXiJgFEGBAP1IUtpRlJyNEclznNcjydiMhhBmJBA6jJcYVJpd8UXs2E3Mijz3JilyvJOTwdnidSCEo1AQrBaJiClNYINE8SGIID8mJw/TjOwggEAwBRKQAdjymUzPQGUvLJeSxE3XTStY/DELQNiTJ2S9yPPXKHLHUQGx2Sq8B4/FMEwGl4h2YJiSs54FEJTABGedY0zADg5uIBaIEdFcIulJr3Pk%2BbFvibMmrJRJgi4nKGpcq64q87F5OOjqmwemT0CemShKNYabzehzPo837MCUAgupYgyCO83qoYumH42xLB6HnGl%2BvRuS9J6wzCaAjGMWlEmUcQ%2BTgEwVLkAQG0hQJ4zAfM56hMpmVqKi%2BnGYJlmGDZhRLOWTngYs0HXpEmUEtoa7sVmlUWHiKCkb4ga0Yp%2BWPqB5WgoULWyp18mCPCqnYuir6SeU1SUq0gEkpSqhtNMoGnqsmy7Icm33K8mnjI4oSeOxIhaloBQrYDqKWp2RXXZetKMqwYhsvlv5i1gqtsCrAB5AAlABNGk3AL0IABVsAADSr8uAAlsDcABpHdQnpaHDYF7FmY1IVTdJ1GLcRrOQwYfAS1hWquWDgaKqq9AarGiKxLowrisQ%2Bfeowpg6ryrzMUxJLgBpBgMEW0CiogGd3yYBxTQ8sdKuiZ46meC%2BdgvrAeNQakZo8XEmSCka8j6gOlKES%2BVcmDAHcuJGittpRjmPIwRwNB4Y7FVA/RCQpgjA3/rrVG50qYvwQPDZ4BAADuqA8R4AqMvHYH8koGGWMDchx1pq0JYMEPAPD2xMPeMABgbBBA7CoVZWoDMQAgIcordys0i4UMIqvZ%2BuIq4vC5F/cWD935PjYeRAEgpJE7GOgjWRF15FRX%2BlQncaRkTCE9KkcCOwxw2KSv9ZAK0GDEgcXgSacEnECAsVTfmHk2AsBpJ40C9FEEv2IA/IUOwIl/kEZuGJwN%2BDEGmggWaRDEIkJlP1XyO8gKL2qmSNglVqpxNxJJPA9QvBiDxNpKo00Zz0BCTKUpqNLLHUEIgmUY4mQBACCtakX8%2BloLkkQeI9BJIVHyWdUBVMemIRoNxAgzM6DNVUWOchTACHUneKLeiOx3i%2BUIds2g6B%2BldJcms12mBJLXN2RBZBuJkrPKcF4HirydiEIUKwZ4UyxGFIed1Wm38gKvNqdCkeuSbkAupECtgHj%2BkFJWUUgQvlmYBWxMwNggyooTWMt/YFcltC/PWSQHYWB5TAEnEYZZDlil9X8tklWWABmqPcqSnWO4uTFiYQwUUlLqV4lpfSrwjKUQsruOAtZbgOoRESWSMxQtkARBXu8scFzEKEI1QQTEuNGbPEmrQVVyAhThyxbPXiZtd6gX3qvUSq5LTPGYkIRmARTjxFPEMQakkxBGnoAwR0JgACsbgHLoFoVGmNF0TD1T2NG/2LlGxGDJLScY9RNFyUzcAMk15A1vVTYm62GaBBFtxBlVAy0/QQELcWuSEA8WcsdBlToexLByRfAWxgMwy0Jv9mmpNLkqGIueBALteaQQ5tmRGsdlaPLYlnYRSw1h80aBcSOpN9rJ10GnfmUsolHrYh%2BGsjcTYABa0RUD1inCEE8jzR7g23CNK8B9u7HgTuqxm%2BtSGYluNZaIXo1l7kPO8E8o8h362QcB0D9QoEEDsW4Pxk1vUOFSG2jmgGZTCogLqa9W52yQaPCeSGr5pReRnCQL0vMLqjwXFcRRyj81UDEGGV1DljqIT9Mxq4VSl4KBY9CTxuGBpGkhvBwjGY8AihjBAPjmi/QjICHB/d70aNMJOPUU68Hk0clPUMzE%2B5bgBHpEXDuHIaR5yrMoAuBcAg0gswEAubhJO9SNHx6T25GNAa9T6v1ynGZGgDDJ11rVkk/gRqFwaEX/P4ZcnxxYPi%2BPy1nj8DgcxaCcEjbwPwHAtCkFQJwGNm7e2bUWMsHtaweCkAIJoHLcwhQgEjYaPLHBJCFea6VzgvAFAgENE14rOXSBwFgEgNA6sj1kAoLh2b9AYjAC4GsMwfBFTRCGxACIfWIjBDqKKTgDWDvMBQgXCI2hPQnd4DN0RBAC62mO2N0g0rgDKobEN7gvAsAsEMMAcQr38DHWwws77JXVQai8POW75BAJdZK%2B8CICSypYD60VPht25hUAMMAA89CqEF3VEVhr/BBAiDEOwKQMhBCKBUOoV7ugWgGCMCgaw1h9AfCG5AOY9aqjfYXAXMwvBUALIzqgnnp1WhBL8BAVwIxmikECAKKYfQWjZBSAIRXWQkha4YJMXoJQZfYYEB0YYngmh6DsLL83XRVdG%2Bt4GnXYxA2G6KOruY1WljU9y/l3rr2yscB2KoAAHAANgXOHyQE5kBeLW2cMwclcCEFpesE0vBRtaCHaQNrHX9CcB66QIrJWg%2BDeG415rOeusi%2BL31svlexs5/F8kZwkggA%3D%3D -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
Heikki Linnakangas <hlinnaka@iki.fi> writes: > On 14/12/2023 10:32, Peter Eisentraut wrote: >> But if we think that compilers are now smart enough, maybe we can unwind >> this whole stack a bit more? Maybe we don't need MemSetTest() and/or >> palloc0fast() and/or newNode() at all? > Good point. Looking closer, modern compilers will actually turn the > MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset() > anyway! Funny. That is true for recent versions of gcc, clang, and MSVC. I experimented with the same planner-intensive test case that I used in the last discussion back in 2008. I got these results: HEAD: tps = 144.974195 (without initial connection time) v1 patch: tps = 146.302004 (without initial connection time) v2 patch: tps = 144.882208 (without initial connection time) While there's not much daylight between these numbers, the times are quite reproducible for me. This is with RHEL8's gcc 8.5.0 on x86_64. That's probably a bit trailing-edge in terms of what people might be using with v17, but I don't think it's discountable. I also looked at the backend's overall code size per size(1): HEAD: text data bss dec hex filename 8613007 100192 220176 8933375 884fff testversion.stock/bin/postgres v1 patch: text data bss dec hex filename 8615126 100192 220144 8935462 885826 testversion.v1/bin/postgres v2 patch: text data bss dec hex filename 8595322 100192 220144 8915658 880aca testversion.v2/bin/postgres I did check that the v1 patch successfully inlines newNode() and reduces it to just a MemoryContextAllocZeroAligned call, so it's correct that modern compilers do that better than whatever I tested in 2008. But I wonder what is happening in v2 to reduce the code size so much. MemoryContextAllocZeroAligned is not 20kB by itself. > Good point. Looking closer, modern compilers will actually turn the > MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset() > anyway! Funny. That is true for recent versions of gcc, clang, and MSVC. Not here ... > Yeah, +1 on removing all that, including MemoryContextAllocZeroAligned. > It's not doing any good as it is, as it gets compiled to be identical to > MemoryContextAllocZero. Also not so here. Admittedly, my results don't make much of a case for keeping the two code paths, even on compilers where it matters. regards, tom lane
On Fri, Dec 15, 2023 at 11:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: > > Yeah, +1 on removing all that, including MemoryContextAllocZeroAligned. > > It's not doing any good as it is, as it gets compiled to be identical to > > MemoryContextAllocZero. > > Also not so here. Admittedly, my results don't make much of a case > for keeping the two code paths, even on compilers where it matters. FWIW here is what I figured out once about why it gets compiled the same these days: https://www.postgresql.org/message-id/CA+hUKGLfa6ANa0vs7Lf0op0XBH05HE8SyX8NFhDyT7k2CHYLXw@mail.gmail.com
On Fri, Dec 15, 2023 at 5:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I did check that the v1 patch successfully inlines newNode() and > reduces it to just a MemoryContextAllocZeroAligned call, so it's > correct that modern compilers do that better than whatever I tested > in 2008. But I wonder what is happening in v2 to reduce the code > size so much. MemoryContextAllocZeroAligned is not 20kB by itself. I poked at this a bit and it seems to come from what Heikki said upthread about fewer instructions before the calls: Running objdump on v1 and v2 copyfuncs.o and diff'ing shows there are fewer MOV instructions (some extraneous stuff removed): e9 da 5f 00 00 jmp <_copyReindexStmt> - 48 8b 05 00 00 00 00 mov rax,QWORD PTR [rip+0x0] - be 18 00 00 00 mov esi,0x18 - 48 8b 38 mov rdi,QWORD PTR [rax] - e8 00 00 00 00 call MemoryContextAllocZeroAligned-0x4 + bf 18 00 00 00 mov edi,0x18 + e8 00 00 00 00 call palloc0-0x4 That's 10 bytes savings. - 48 8b 05 00 00 00 00 mov rax,QWORD PTR [rip+0x0] - 48 8b 38 mov rdi,QWORD PTR [rax] - e8 00 00 00 00 call MemoryContextAllocZeroAligned-0x4 + e8 00 00 00 00 call palloc0-0x4 ...another 10 bytes. Over and over again. Because of the size differences, the compiler is inlining more: e.g. in v1 _copyFieldStore has 4 call sites, but in v2 it got inlined. About the patch, I'm wondering if this whitespace is intentional, but it's otherwise straightforward: --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -132,6 +132,7 @@ typedef struct Node #define nodeTag(nodeptr) (((const Node*)(nodeptr))->type) + /*
On 15/12/2023 00:44, Tom Lane wrote: >> Good point. Looking closer, modern compilers will actually turn the >> MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset() >> anyway! Funny. That is true for recent versions of gcc, clang, and MSVC. > Not here ... Hmm, according to godbolt, the change happened in GCC version 10.1. Starting with gcc 10.1, it is turned into a memset(). On clang, the same change happened in version 3.4.1. I think we have consensus on patch v2. It's simpler and not less performant than what we have now, at least on modern compilers. Barring objections, I'll commit that. I'm not planning to spend more time on this, but there might be some room for further optimization if someone is interested to do the micro-benchmarking. The obvious thing would be to persuade modern compilers to not switch to memset() in MemoryContextAllocZeroAligned (*), making the old macro logic work the same way it used to on old compilers. Also, instead of palloc0, it might be better for newNode() to call palloc followed by memset. That would allow the compiler to partially optimize away the memset. Most callers fill at least some of the fields after calling makeNode(), so the compiler could generate code that clears only the uninitialized fields and padding bytes. (*) or rather, a new function like MemoryContextAllocZeroAligned but without the 'context' argument. We want to keep the savings in the callers from eliminating the extra argument. -- Heikki Linnakangas Neon (https://neon.tech)
On 18/12/2023 16:28, Heikki Linnakangas wrote: > I think we have consensus on patch v2. It's simpler and not less > performant than what we have now, at least on modern compilers. Barring > objections, I'll commit that. Committed that. -- Heikki Linnakangas Neon (https://neon.tech)