Thread: inline newNode()
I've been taking a look at improving the performance of the GEQO code. Before looking at algorithmic improvements, I noticed some "low-hanging fruit": gprof revealed that newNode() was a hotspot. When I altered it to be an inline function, the overall time to plan a 12-table table join (using the default GEQO settings) dropped by about 9%. I haven't taken a look at how the patch effects the other places that use newNode(), but it stands to reason that they'd see a performance improvement as well (although probably less noticeable). However, I'm not sure if I used the correct syntax for inlining the function (since it was originally declared in a header and defined elsewhere, I couldn't just add 'inline'). The method I used (declaring the function 'extern inline' and defining it in the header file) works for me with GCC 3.2, but I'm open to suggestions for improvement. BTW, after applying the patch, the GEQO profile looks like: % cumulative self self total time seconds seconds calls s/call s/call name 16.07 0.18 0.18 4618735 0.00 0.00 compare_path_costs 9.82 0.29 0.11 2149666 0.00 0.00 AllocSetAlloc 8.04 0.38 0.09 396333 0.00 0.00 add_path 4.46 0.43 0.05 2149661 0.00 0.00 MemoryContextAlloc 3.57 0.47 0.04 1150953 0.00 0.00 compare_pathkeys (Yes, gprof on my machine is still a bit fubared...) Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Attachment
Neil Conway <neilc@samurai.com> writes: > I've been taking a look at improving the performance of the GEQO > code. Before looking at algorithmic improvements, I noticed some > "low-hanging fruit": gprof revealed that newNode() was a hotspot. When > I altered it to be an inline function, the overall time to plan a > 12-table table join (using the default GEQO settings) dropped by about > 9%. How much did you bloat the code? There are an awful lot of calls to newNode(), so even though it's not all that large, I'd think the multiplier would be nasty. It might be better to offer two versions, standard out-of-line and a macro-or-inline called something like newNodeInline(), and change just the hottest call sites to use the inline. You could probably get most of the speedup from inlining at just a few call sites. (I've been intending to do something similar with the TransactionId comparison functions.) > However, I'm not sure if I used the correct syntax for inlining the > function (since it was originally declared in a header and defined > elsewhere, I couldn't just add 'inline'). The method I used (declaring > the function 'extern inline' and defining it in the header file) works > for me with GCC 3.2, but I'm open to suggestions for improvement. This isn't portable at all, AFAIK :-(. Unfortunately I can't think of a portable way to do it with a macro, either. However, if you're willing to go the route involving changing call sites, then you could change foo = newNode(typename); to newNodeMacro(foo, typename); whereupon it becomes pretty easy to make a suitable macro. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > How much did you bloat the code? There are an awful lot of calls to > newNode(), so even though it's not all that large, I'd think the > multiplier would be nasty. The patch increases the executable from 12844452 to 13005244 bytes, when compiled with '-pg -g -O2' and without being stripped. > This isn't portable at all, AFAIK :-(. Unfortunately I can't think > of a portable way to do it with a macro, either. Well, one alternative might be to provide 2 definitions of the function -- one an extern inline in the header file, and one using the current method (in a separate file, non-inline). Then wrap the header file in an #ifdef __GNUC__ block, and the non-inlined version in #ifndef __GNUC__. The downside is that it means maintaining two versions of the same function -- but given that newNode() is pretty trivial, that might be acceptable. BTW, the GCC docs on inline functions are here: http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Inline.html#Inline According to that page, using 'static inline' instead of 'extern inline' is recommended for future compatability with C99, so that's what we should probably use (in the __GNUC__ version). Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Neil Conway <neilc@samurai.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> How much did you bloat the code? There are an awful lot of calls to >> newNode(), so even though it's not all that large, I'd think the >> multiplier would be nasty. > The patch increases the executable from 12844452 to 13005244 bytes, > when compiled with '-pg -g -O2' and without being stripped. Okay, not as bad as I feared, but still kinda high. I believe that most of the bloat comes from the MemSet macro; there's just not much else in newNode(). Now, the reason MemSet expands to a fair amount of code is its if-then-else case to decide whether to call memset() or do an inline loop. I've looked at the assembler code for it on a couple of machines, and the loop proper is only about a third of the code that gets generated. Ideally, we'd like to eliminate the if-test for inlined newNode calls. That would buy back a lot of the bloat and speed things up still further. Now the tests on _val == 0 and _len <= MEMSET_LOOP_LIMIT and _len being a multiple of 4 are no problem, since _val and _len are compile-time constants; these will be optimized away. What is not optimized away (on the compilers I've looked at) is the check for _start being int-aligned. A brute-force approach is to say "we know _start is word-aligned because we just got it from palloc, which guarantees MAXALIGNment". We could make a variant version of MemSet that omits the alignment check, and use it here and anywhere else we're sure it's safe. A nicer approach would be to somehow make use of the datatype of the first argument to MemSet. If we could determine at compile time that it's supposed to point at a type with at least int alignment, then it'd be possible for the compiler to optimize away this check in a reasonably safe fashion. I'm not sure if there's a portable way to do this, though. There's no "alignof()" construct in C :-(. Any ideas? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: [ snip interesting analysis ] > A nicer approach would be to somehow make use of the datatype of the > first argument to MemSet. If we could determine at compile time > that it's supposed to point at a type with at least int alignment, > then it'd be possible for the compiler to optimize away this check > in a reasonably safe fashion. I'm not sure if there's a portable > way to do this, though. There's no "alignof()" construct in C > :-(. Any ideas? Well, we could make use of (yet another) GCC-ism: the __alignof__ keyword, which is described here: http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Alignment.html#Alignment I don't like making the code GCC-specific any more than anyone else does, but given that the code-bloat is specific to the inline version of newNode (which in the scheme I described earlier would be GCC-only) -- so introducing a GCC-specific fix for a GCC-specific problem isn't too bad, IMHO. Or we could just use your other suggestion: define a variant of MemSet() and use it when we know it's safe. Not sure which is the better solution: any comments? Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Neil Conway <neilc@samurai.com> writes: > I don't like making the code GCC-specific any more than anyone else > does, but given that the code-bloat is specific to the inline version > of newNode (which in the scheme I described earlier would be > GCC-only) -- so introducing a GCC-specific fix for a GCC-specific > problem isn't too bad, IMHO. > Or we could just use your other suggestion: define a variant of > MemSet() and use it when we know it's safe. Not sure which is the > better solution: any comments? If we're going with a GCC-only approach to inlining newNode then it seems like a tossup to me too. Any other thoughts out there? regards, tom lane
Tom Lane wrote: > Neil Conway <neilc@samurai.com> writes: > > I don't like making the code GCC-specific any more than anyone else > > does, but given that the code-bloat is specific to the inline version > > of newNode (which in the scheme I described earlier would be > > GCC-only) -- so introducing a GCC-specific fix for a GCC-specific > > problem isn't too bad, IMHO. > > > Or we could just use your other suggestion: define a variant of > > MemSet() and use it when we know it's safe. Not sure which is the > > better solution: any comments? > > If we're going with a GCC-only approach to inlining newNode then it > seems like a tossup to me too. Any other thoughts out there? Seems newNode can easily be made into a macro. I can do the coding, and if you tell me that newNode will always be int-aligned, I can make an assume-aligned version of MemSet. Is that what people want to try? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Seems newNode can easily be made into a macro. If you can see how to do that, show us ... that would eliminate the need for a GCC-only inline function, which would sway my vote away from depending on __alignof__ for the other thing. regards, tom lane
Tom Lane writes: > A brute-force approach is to say "we know _start is word-aligned because > we just got it from palloc, which guarantees MAXALIGNment". We could > make a variant version of MemSet that omits the alignment check, and use > it here and anywhere else we're sure it's safe. Or make a version of palloc that zeroes the memory. -- Peter Eisentraut peter_e@gmx.net
Neil Conway writes: > Well, one alternative might be to provide 2 definitions of the > function -- one an extern inline in the header file, and one using the > current method (in a separate file, non-inline). Then wrap the header > file in an #ifdef __GNUC__ block, and the non-inlined version in > #ifndef __GNUC__. External inline functions aren't even portable across different versions of GCC. It's best not to go there at all. -- Peter Eisentraut peter_e@gmx.net
Peter Eisentraut <peter_e@gmx.net> writes: > Or make a version of palloc that zeroes the memory. Well, we might lose a bit of speed that way -- with the current MemSet() macro, most of the tests that MemSet() does can be eliminated at compile-time, since MemSet() is usually called with some constant parameters. If we moved this code inside palloc(), it wouldn't be possible to do this constant propagation. Not sure it would make a big performance difference, though -- and I think palloc() is a clean solution. In fact, it's probably worth having anyway: a lot of the call sites of palloc() immediately zero the memory after they allocate it... Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Peter Eisentraut wrote: > Tom Lane writes: > > > A brute-force approach is to say "we know _start is word-aligned because > > we just got it from palloc, which guarantees MAXALIGNment". We could > > make a variant version of MemSet that omits the alignment check, and use > > it here and anywhere else we're sure it's safe. > > Or make a version of palloc that zeroes the memory. OK, here is a version of newNode that is a macro. However, there are problems: First, I can't get the code to assign the tag (dereference the pointer) _and_ return the pointer, in a macro. I can convert it to take a third pointer argument, and it is only called <10 times, so that is easy, but it is used by makeNode, which is called ~1000 times, so that seems like a bad idea. My solution was to declare a global variable in the header file to be used by the macro. Because the macro isn't called recursively, that should be fine. Now, the other problem is MemSet. MemSet isn't designed to be called inside a macro. In fact, the 'if' tests are easy to do in a macro with "? :", but the "while" loop seems to be impossible in a macro. I tried seeing if 'goto' would work in a macro, but of course it doesn't, and even if it did, I doubt it would be portable. The patch merely calls memset() rather than MemSet. Not sure if this is a win or not. It makes makeNode a macro, but removes the use of the MemSet macro in favor of memset(). Does anyone have additional suggestions? The only thing I can suggest is to make a clear-memory version of palloc because palloc always calls MemoryContextAlloc() so I can put it in there. How does that sound? However, now that I look at it, MemoryContextAlloc() could be made a macro easier than newNode() and that would save function call in more cases than newNode. In fact, the comment at the top of MemoryContextAlloc() says: * This could be turned into a macro, but we'd have to import * nodes/memnodes.h into postgres.h which seems a bad idea. Ideas? The regression tests do pass with this patch, so functionally it works fine. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 Index: src/backend/nodes/nodes.c =================================================================== RCS file: /cvsroot/pgsql-server/src/backend/nodes/nodes.c,v retrieving revision 1.15 diff -c -c -r1.15 nodes.c *** src/backend/nodes/nodes.c 20 Jun 2002 20:29:29 -0000 1.15 --- src/backend/nodes/nodes.c 8 Oct 2002 23:52:21 -0000 *************** *** 28,42 **** * macro makeNode. eg. to create a Resdom node, use makeNode(Resdom) * */ ! Node * ! newNode(Size size, NodeTag tag) ! { ! Node *newNode; - Assert(size >= sizeof(Node)); /* need the tag, at least */ - - newNode = (Node *) palloc(size); - MemSet((char *) newNode, 0, size); - newNode->type = tag; - return newNode; - } --- 28,32 ---- * macro makeNode. eg. to create a Resdom node, use makeNode(Resdom) * */ ! Node *newNodeMacroHolder; Index: src/include/nodes/nodes.h =================================================================== RCS file: /cvsroot/pgsql-server/src/include/nodes/nodes.h,v retrieving revision 1.118 diff -c -c -r1.118 nodes.h *** src/include/nodes/nodes.h 31 Aug 2002 22:10:47 -0000 1.118 --- src/include/nodes/nodes.h 8 Oct 2002 23:52:25 -0000 *************** *** 261,266 **** --- 261,285 ---- #define nodeTag(nodeptr) (((Node*)(nodeptr))->type) + /* + * There is no way to deferernce the palloc'ed pointer to assign the + * tag, and return the pointer itself, so we need a holder variable. + * Fortunately, this function isn't recursive so we just define + * a global variable for this purpose. + */ + extern Node *newNodeMacroHolder; + + #define newNode(size, tag) \ + ( \ + AssertMacro((size) >= sizeof(Node)), /* need the tag, at least */ \ + \ + newNodeMacroHolder = (Node *) palloc(size), \ + memset((char *)newNodeMacroHolder, 0, (size)), \ + newNodeMacroHolder->type = (tag), \ + newNodeMacroHolder \ + ) + + #define makeNode(_type_) ((_type_ *) newNode(sizeof(_type_),T_##_type_)) #define NodeSetTag(nodeptr,t) (((Node*)(nodeptr))->type = (t)) *************** *** 281,291 **** * extern declarations follow * ---------------------------------------------------------------- */ - - /* - * nodes/nodes.c - */ - extern Node *newNode(Size size, NodeTag tag); /* * nodes/{outfuncs.c,print.c} --- 300,305 ----
Bruce Momjian <pgman@candle.pha.pa.us> writes: > OK, here is a version of newNode that is a macro. If you use memset() instead of MemSet(), I'm afraid you're going to blow off most of the performance gain this was supposed to achieve. > Does anyone have additional suggestions? The only thing I can suggest > is to make a clear-memory version of palloc because palloc always calls > MemoryContextAlloc() so I can put it in there. How does that sound? I do not think palloc should auto-zero memory. Hard to explain why, but it just feels like a bad decision. One point is that the MemSet has to be inlined or it cannot compile-out the tests on _len. palloc can't treat the length as a compile-time constant. > The regression tests do pass with this patch, so functionally it works > fine. Speed is the issue here, not functionality... regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > OK, here is a version of newNode that is a macro. > > If you use memset() instead of MemSet(), I'm afraid you're going to blow > off most of the performance gain this was supposed to achieve. Yep. > > Does anyone have additional suggestions? The only thing I can suggest > > is to make a clear-memory version of palloc because palloc always calls > > MemoryContextAlloc() so I can put it in there. How does that sound? > > I do not think palloc should auto-zero memory. Hard to explain why, > but it just feels like a bad decision. One point is that the MemSet > has to be inlined or it cannot compile-out the tests on _len. palloc > can't treat the length as a compile-time constant. Right, palloc shouldn't. I was thinking of having another version of palloc that _does_ clear out memory, and calling that from a newNode() macro. We already know palloc is going to call MemoryContextAlloc, so we could have a pallocC() that calls a new MemoryContextAllocC() that would call the underlying memory allocation function, then do the loop like MemSet to clear it. It would allow newNode to be a macro and prevent the bloat of having MemSet appear everywhere newNode appears; it would all be in the new function MemoryContextAllocC(). > > The regression tests do pass with this patch, so functionally it works > > fine. > > Speed is the issue here, not functionality... I was just proving it works, that's all. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Right, palloc shouldn't. I was thinking of having another version of > palloc that _does_ clear out memory, and calling that from a newNode() > macro. We already know palloc is going to call MemoryContextAlloc, so > we could have a pallocC() that calls a new MemoryContextAllocC() that > would call the underlying memory allocation function, then do the loop > like MemSet to clear it. But if the MemSet is inside the called function then it cannot reduce the if-tests to a compile-time decision to invoke the word-zeroing loop. We want the MemSet to be expanded at the newNode call site, where the size of the allocated memory is a compile-time constant. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Right, palloc shouldn't. I was thinking of having another version of > > palloc that _does_ clear out memory, and calling that from a newNode() > > macro. We already know palloc is going to call MemoryContextAlloc, so > > we could have a pallocC() that calls a new MemoryContextAllocC() that > > would call the underlying memory allocation function, then do the loop > > like MemSet to clear it. > > But if the MemSet is inside the called function then it cannot reduce > the if-tests to a compile-time decision to invoke the word-zeroing loop. > We want the MemSet to be expanded at the newNode call site, where the > size of the allocated memory is a compile-time constant. I can easily do the tests in the MemSet macro, but I can't do a loop in a macro that has to return a value; I need while(). Though a loop in a new fuction will not be as fast as a MemSet macro, I think it will be better than what we have now with newNode only because newNode will be a macro and not a function anymore, i.e. the MemSet will happen in the function called by pallocC and not in newNode anymore, and there will be zero code bloat. I wish I saw another way. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote: > Tom Lane wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Right, palloc shouldn't. I was thinking of having another version of > > > palloc that _does_ clear out memory, and calling that from a newNode() > > > macro. We already know palloc is going to call MemoryContextAlloc, so > > > we could have a pallocC() that calls a new MemoryContextAllocC() that > > > would call the underlying memory allocation function, then do the loop > > > like MemSet to clear it. > > > > But if the MemSet is inside the called function then it cannot reduce > > the if-tests to a compile-time decision to invoke the word-zeroing loop. > > We want the MemSet to be expanded at the newNode call site, where the > > size of the allocated memory is a compile-time constant. > > I can easily do the tests in the MemSet macro, but I can't do a loop in > a macro that has to return a value; I need while(). Though a loop in a > new fuction will not be as fast as a MemSet macro, I think it will be > better than what we have now with newNode only because newNode will be a > macro and not a function anymore, i.e. the MemSet will happen in the > function called by pallocC and not in newNode anymore, and there will be > zero code bloat. I wish I saw another way. OK, here's a patch for testing. It needs cleanup because the final version would remove the nodes/nodes.c file. The net effect of the patch is to make newNode a macro with little code bloat. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 Index: src/backend/nodes/nodes.c =================================================================== RCS file: /cvsroot/pgsql-server/src/backend/nodes/nodes.c,v retrieving revision 1.15 diff -c -c -r1.15 nodes.c *** src/backend/nodes/nodes.c 20 Jun 2002 20:29:29 -0000 1.15 --- src/backend/nodes/nodes.c 9 Oct 2002 05:17:13 -0000 *************** *** 28,42 **** * macro makeNode. eg. to create a Resdom node, use makeNode(Resdom) * */ ! Node * ! newNode(Size size, NodeTag tag) ! { ! Node *newNode; - Assert(size >= sizeof(Node)); /* need the tag, at least */ - - newNode = (Node *) palloc(size); - MemSet((char *) newNode, 0, size); - newNode->type = tag; - return newNode; - } --- 28,32 ---- * macro makeNode. eg. to create a Resdom node, use makeNode(Resdom) * */ ! Node *newNodeMacroHolder; Index: src/backend/utils/mmgr/mcxt.c =================================================================== RCS file: /cvsroot/pgsql-server/src/backend/utils/mmgr/mcxt.c,v retrieving revision 1.32 diff -c -c -r1.32 mcxt.c *** src/backend/utils/mmgr/mcxt.c 12 Aug 2002 00:36:12 -0000 1.32 --- src/backend/utils/mmgr/mcxt.c 9 Oct 2002 05:17:17 -0000 *************** *** 453,458 **** --- 453,481 ---- } /* + * MemoryContextAllocC + * Like MemoryContextAlloc, but clears allocated memory + * + * We could just call MemoryContextAlloc then clear the memory, but this + * function is called too many times, so we have a separate version. + */ + void * + MemoryContextAllocC(MemoryContext context, Size size) + { + void *ret; + + AssertArg(MemoryContextIsValid(context)); + + if (!AllocSizeIsValid(size)) + elog(ERROR, "MemoryContextAlloc: invalid request size %lu", + (unsigned long) size); + + ret = (*context->methods->alloc) (context, size); + MemSet(ret, 0, size); + return ret; + } + + /* * pfree * Release an allocated chunk. */ Index: src/include/nodes/nodes.h =================================================================== RCS file: /cvsroot/pgsql-server/src/include/nodes/nodes.h,v retrieving revision 1.118 diff -c -c -r1.118 nodes.h *** src/include/nodes/nodes.h 31 Aug 2002 22:10:47 -0000 1.118 --- src/include/nodes/nodes.h 9 Oct 2002 05:17:20 -0000 *************** *** 261,266 **** --- 261,284 ---- #define nodeTag(nodeptr) (((Node*)(nodeptr))->type) + /* + * There is no way to dereference the palloc'ed pointer to assign the + * tag, and return the pointer itself, so we need a holder variable. + * Fortunately, this function isn't recursive so we just define + * a global variable for this purpose. + */ + extern Node *newNodeMacroHolder; + + #define newNode(size, tag) \ + ( \ + AssertMacro((size) >= sizeof(Node)), /* need the tag, at least */ \ + \ + newNodeMacroHolder = (Node *) pallocC(size), \ + newNodeMacroHolder->type = (tag), \ + newNodeMacroHolder \ + ) + + #define makeNode(_type_) ((_type_ *) newNode(sizeof(_type_),T_##_type_)) #define NodeSetTag(nodeptr,t) (((Node*)(nodeptr))->type = (t)) *************** *** 281,291 **** * extern declarations follow * ---------------------------------------------------------------- */ - - /* - * nodes/nodes.c - */ - extern Node *newNode(Size size, NodeTag tag); /* * nodes/{outfuncs.c,print.c} --- 299,304 ---- Index: src/include/utils/palloc.h =================================================================== RCS file: /cvsroot/pgsql-server/src/include/utils/palloc.h,v retrieving revision 1.19 diff -c -c -r1.19 palloc.h *** src/include/utils/palloc.h 20 Jun 2002 20:29:53 -0000 1.19 --- src/include/utils/palloc.h 9 Oct 2002 05:17:21 -0000 *************** *** 46,53 **** --- 46,56 ---- * Fundamental memory-allocation operations (more are in utils/memutils.h) */ extern void *MemoryContextAlloc(MemoryContext context, Size size); + extern void *MemoryContextAllocC(MemoryContext context, Size size); #define palloc(sz) MemoryContextAlloc(CurrentMemoryContext, (sz)) + + #define pallocC(sz) MemoryContextAllocC(CurrentMemoryContext, (sz)) extern void pfree(void *pointer);
Bruce Momjian <pgman@candle.pha.pa.us> writes: > ... I wish I saw another way. I liked Neil's proposal better. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > ... I wish I saw another way. > > I liked Neil's proposal better. What is it you liked, specifically? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > ... I wish I saw another way. > > I liked Neil's proposal better. I think Neil is going to test my patch. I think he will find it yields identical performance to his patch without the bloat, and it will work on all platforms. I don't think the MemSet() constant calls are that big a performance win. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > OK, here's a patch for testing. It needs cleanup because the final > version would remove the nodes/nodes.c file. The net effect of the > patch is to make newNode a macro with little code bloat. Ok, I think this is the better version. The performance seems to be about the same as my original patch, and the code bloat is a lot less: 12857787 with the patch versus 12845168 without it. The implemementation (with a global variable) is pretty ugly, but I don't really see a way around that... One minor quibble with the patch though: MemoryContextAllocC() and pallocC() aren't very good function names, IMHO. Perhaps MemoryContextAllocZero() and palloc0() would be better? This isn't specific to your patch, but it occurs to me that we could save a few bits of code bloat if we removed the '_len' variable declaration from the MemSet() macro -- it isn't needed, AFAICT. That would mean we we'd evaluate the 'len' argument multiple times, so I'm not sure if that's a win overall... Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Neil Conway wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > OK, here's a patch for testing. It needs cleanup because the final > > version would remove the nodes/nodes.c file. The net effect of the > > patch is to make newNode a macro with little code bloat. > > Ok, I think this is the better version. The performance seems to be > about the same as my original patch, and the code bloat is a lot less: > 12857787 with the patch versus 12845168 without it. The > implemementation (with a global variable) is pretty ugly, but I don't > really see a way around that... > > One minor quibble with the patch though: MemoryContextAllocC() and > pallocC() aren't very good function names, IMHO. Perhaps > MemoryContextAllocZero() and palloc0() would be better? Sure, I like those names. > This isn't specific to your patch, but it occurs to me that we could > save a few bits of code bloat if we removed the '_len' variable > declaration from the MemSet() macro -- it isn't needed, AFAICT. That > would mean we we'd evaluate the 'len' argument multiple times, so I'm > not sure if that's a win overall... I think that was actually the goal, to not have them evaluated multiple times, and perhaps bloat if the length itself was a macro. New patch attached with your suggested renaming. Now that I look at the code, there are about 55 places where we call palloc then right after it MemSet(0), so we could merge those to use palloc0 and reduce existing MemSet code bloat some more. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 Index: src/backend/nodes/nodes.c =================================================================== RCS file: /cvsroot/pgsql-server/src/backend/nodes/nodes.c,v retrieving revision 1.15 diff -c -c -r1.15 nodes.c *** src/backend/nodes/nodes.c 20 Jun 2002 20:29:29 -0000 1.15 --- src/backend/nodes/nodes.c 9 Oct 2002 05:17:13 -0000 *************** *** 28,42 **** * macro makeNode. eg. to create a Resdom node, use makeNode(Resdom) * */ ! Node * ! newNode(Size size, NodeTag tag) ! { ! Node *newNode; - Assert(size >= sizeof(Node)); /* need the tag, at least */ - - newNode = (Node *) palloc(size); - MemSet((char *) newNode, 0, size); - newNode->type = tag; - return newNode; - } --- 28,32 ---- * macro makeNode. eg. to create a Resdom node, use makeNode(Resdom) * */ ! Node *newNodeMacroHolder; Index: src/backend/utils/mmgr/mcxt.c =================================================================== RCS file: /cvsroot/pgsql-server/src/backend/utils/mmgr/mcxt.c,v retrieving revision 1.32 diff -c -c -r1.32 mcxt.c *** src/backend/utils/mmgr/mcxt.c 12 Aug 2002 00:36:12 -0000 1.32 --- src/backend/utils/mmgr/mcxt.c 9 Oct 2002 05:17:17 -0000 *************** *** 453,458 **** --- 453,481 ---- } /* + * MemoryContextAllocZero + * Like MemoryContextAlloc, but clears allocated memory + * + * We could just call MemoryContextAlloc then clear the memory, but this + * function is called too many times, so we have a separate version. + */ + void * + MemoryContextAllocZero(MemoryContext context, Size size) + { + void *ret; + + AssertArg(MemoryContextIsValid(context)); + + if (!AllocSizeIsValid(size)) + elog(ERROR, "MemoryContextAllocZero: invalid request size %lu", + (unsigned long) size); + + ret = (*context->methods->alloc) (context, size); + MemSet(ret, 0, size); + return ret; + } + + /* * pfree * Release an allocated chunk. */ Index: src/include/nodes/nodes.h =================================================================== RCS file: /cvsroot/pgsql-server/src/include/nodes/nodes.h,v retrieving revision 1.118 diff -c -c -r1.118 nodes.h *** src/include/nodes/nodes.h 31 Aug 2002 22:10:47 -0000 1.118 --- src/include/nodes/nodes.h 9 Oct 2002 05:17:20 -0000 *************** *** 261,266 **** --- 261,284 ---- #define nodeTag(nodeptr) (((Node*)(nodeptr))->type) + /* + * There is no way to dereference the palloc'ed pointer to assign the + * tag, and return the pointer itself, so we need a holder variable. + * Fortunately, this function isn't recursive so we just define + * a global variable for this purpose. + */ + extern Node *newNodeMacroHolder; + + #define newNode(size, tag) \ + ( \ + AssertMacro((size) >= sizeof(Node)), /* need the tag, at least */ \ + \ + newNodeMacroHolder = (Node *) palloc0(size), \ + newNodeMacroHolder->type = (tag), \ + newNodeMacroHolder \ + ) + + #define makeNode(_type_) ((_type_ *) newNode(sizeof(_type_),T_##_type_)) #define NodeSetTag(nodeptr,t) (((Node*)(nodeptr))->type = (t)) *************** *** 281,291 **** * extern declarations follow * ---------------------------------------------------------------- */ - - /* - * nodes/nodes.c - */ - extern Node *newNode(Size size, NodeTag tag); /* * nodes/{outfuncs.c,print.c} --- 299,304 ---- Index: src/include/utils/palloc.h =================================================================== RCS file: /cvsroot/pgsql-server/src/include/utils/palloc.h,v retrieving revision 1.19 diff -c -c -r1.19 palloc.h *** src/include/utils/palloc.h 20 Jun 2002 20:29:53 -0000 1.19 --- src/include/utils/palloc.h 9 Oct 2002 05:17:21 -0000 *************** *** 46,53 **** --- 46,56 ---- * Fundamental memory-allocation operations (more are in utils/memutils.h) */ extern void *MemoryContextAlloc(MemoryContext context, Size size); + extern void *MemoryContextAllocZero(MemoryContext context, Size size); #define palloc(sz) MemoryContextAlloc(CurrentMemoryContext, (sz)) + + #define palloc0(sz) MemoryContextAllocZero(CurrentMemoryContext, (sz)) extern void pfree(void *pointer);
Tom Lane writes: > If you use memset() instead of MemSet(), I'm afraid you're going to blow > off most of the performance gain this was supposed to achieve. Can someone explain to me why memset() would ever be better than MemSet()? Also, shouldn't GCC (at least 3.0 or later) inline memset() automatically? What's the result of using -finline (or your favorite compiler's inlining flag)? And has someone wondered why the GEQO code needs so many new nodes? Perhaps a more lightweight data representation for internal use could be appropriate? -- Peter Eisentraut peter_e@gmx.net
Peter Eisentraut wrote: > Tom Lane writes: > > > If you use memset() instead of MemSet(), I'm afraid you're going to blow > > off most of the performance gain this was supposed to achieve. > > Can someone explain to me why memset() would ever be better than MemSet()? I am surprised MemSet was ever faster than memset(). Remember, MemSet was done only to prevent excessive function call overhead to memset(). I never anticipated that a simple while() loop would be faster than the libc version, especially ones that have assembler memset versions, but, for example on Sparc, this is true. I looked at the Sparc assembler code and I can't see why MemSet would be faster. Perhaps someone can send over the Sparc assembler output of MemSet on their platform and I can compare it to the Solaris assembler I see here. In fact, they can probably disassemble a memset() routine there and see exactly what I see. > Also, shouldn't GCC (at least 3.0 or later) inline memset() automatically? Not sure, but yes, that may be true. I think it requires a high optimizer level, perhaps higher than our default. > What's the result of using -finline (or your favorite compiler's > inlining flag)? Yes, that would be it. > And has someone wondered why the GEQO code needs so many new nodes? > Perhaps a more lightweight data representation for internal use could be > appropriate? I assume the GEQO results he is seeing is only for a tests, and that the macro version of newNode will help in all cases. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Peter Eisentraut <peter_e@gmx.net> writes: > Can someone explain to me why memset() would ever be better than MemSet()? memset() should *always* be faster than any C-coded implementation thereof. Any competent assembly-language writer can beat C-level locutions, or at least equal them, if he's willing to expend the effort. I've frankly been astonished at the number of platforms where it seems memset() has not been coded with an appropriate degree of tenseness. The fact that we've found it useful to invent MemSet() is a pretty damning indictment of the competence of modern C-library authors. Or am I just stuck in the obsolete notion that vendors should provide some amount of platform-specific tuning, and not a generic library? regards, tom lane
Tom Lane wrote: > Peter Eisentraut <peter_e@gmx.net> writes: > > Can someone explain to me why memset() would ever be better than MemSet()? > > memset() should *always* be faster than any C-coded implementation > thereof. Any competent assembly-language writer can beat C-level > locutions, or at least equal them, if he's willing to expend the > effort. You would think so. I am surprised too. > I've frankly been astonished at the number of platforms where it > seems memset() has not been coded with an appropriate degree of > tenseness. The fact that we've found it useful to invent MemSet() > is a pretty damning indictment of the competence of modern C-library > authors. Remember, MemSet was invented only to prevent function call overhead, and on my BSD/OS system, len >= 256 is faster with the libc memset(). It is a fluke we found out that MemSet is faster that memset for all lengths on some platforms. > Or am I just stuck in the obsolete notion that vendors should provide > some amount of platform-specific tuning, and not a generic library? What really surprised me is that MemSet won on Sparc, where they have an assembler language version that looks very similar to the MemSet loop. In fact, I was the one who encouraged BSD/OS to write assembler language verions of many of their str* and mem* functions. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Remember, MemSet was invented only to prevent function call overhead, > and on my BSD/OS system, len >= 256 is faster with the libc > memset(). Yes, I remember finding that when testing MemSet() versus memset() for various values of MEMSET_LOOP_LIMIT earlier. > What really surprised me is that MemSet won on Sparc, where they have an > assembler language version that looks very similar to the MemSet > loop. Well, I'd assume any C library / compiler of half-decent quality on any platform would provide assembly optimized versions of common stdlib functions like memset(). While playing around with memset() on my machine (P4 running Linux, glibc 2.2.5, GCC 3.2.1pre3), I found the following interesting result. I used this simple benchmark (the same one I posted for the earlier MemSet() thread on -hackers): #include <string.h> #include "postgres.h" #undef MEMSET_LOOP_LIMIT #define MEMSET_LOOP_LIMIT BUFFER_SIZE int main(void) { char buffer[BUFFER_SIZE]; long long i; for (i = 0; i < 99000000; i++) { memset(buffer, 0, sizeof(buffer)); } return 0; } Compiled with '-DBUFFER_SIZE=256 -O2', I get the following results in seconds: MemSet(): ~9.6 memset(): ~19.5 __builtin_memset(): ~10.00 So it seems there is a reasonably optimized version of memset() provided by glibc/GCC (not sure which :-) ), it's just a matter of persuading the compiler to let us use it. It's still depressing that it doesn't beat MemSet(), but perhaps __builtin_memset() has better average-case performane over a wider spectrum of memory size?[1] BTW, regarding the newNode() stuff: so is it agreed that Bruce's patch is a performance win without too high of a code bloat / uglification penalty? If so, is it 7.3 or 7.4 material? Cheers, Neil [1] Not that I really buy that -- for one thing, if the length is constant, as it is in this case, the compiler can substitute an optimized version of the function for the appropriate memory size. I'm having a little difficulty explaining GCC/glibc's poor performance... -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
On Wed, Oct 09, 2002 at 12:12:12AM -0400, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > OK, here is a version of newNode that is a macro. > > If you use memset() instead of MemSet(), I'm afraid you're going to blow > off most of the performance gain this was supposed to achieve. > > > Does anyone have additional suggestions? The only thing I can suggest > > is to make a clear-memory version of palloc because palloc always calls > > MemoryContextAlloc() so I can put it in there. How does that sound? > > I do not think palloc should auto-zero memory. Hard to explain why, > but it just feels like a bad decision. One point is that the MemSet Agree. The memory-management routine knows nothing about real memory usage - maybe is zeroize memory wanted in palloc caller, but maybe not.. The palloc() caller knows it better than plalloc(). If I good remember same discussion was long time ago in linux-kernel list and result was non-zeroize-memory. Karel -- Karel Zak <zakkr@zf.jcu.cz> http://home.zf.jcu.cz/~zakkr/ C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
On 10 Oct 2002, Neil Conway wrote: > Well, I'd assume any C library / compiler of half-decent quality on > any platform would provide assembly optimized versions of common > stdlib functions like memset(). > > While playing around with memset() on my machine (P4 running Linux, > glibc 2.2.5, GCC 3.2.1pre3), I found the following interesting > result. I used this simple benchmark (the same one I posted for the > earlier MemSet() thread on -hackers): > [snip] > Compiled with '-DBUFFER_SIZE=256 -O2', I get the following results in > seconds: > > MemSet(): ~9.6 > memset(): ~19.5 > __builtin_memset(): ~10.00 I ran the same code. I do not understand the results you go. Here are mine, on an AMD Duron with GCC 3.2 and glibc-2.2.5. Results are: MemSet(): 14.758 sec memset(): 11.597 sec __buildin_memset(): 9.000 sec Who else wants to test? Gavin
Neil Conway wrote: > BTW, regarding the newNode() stuff: so is it agreed that Bruce's patch > is a performance win without too high of a code bloat / uglification > penalty? If so, is it 7.3 or 7.4 material? Not sure. It is a small patch but we normally don't do performance fixes during beta unless they are major. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Gavin Sherry <swm@linuxworld.com.au> writes: > On 10 Oct 2002, Neil Conway wrote: > > Compiled with '-DBUFFER_SIZE=256 -O2', I get the following results in > > seconds: > > > > MemSet(): ~9.6 > > memset(): ~19.5 > > __builtin_memset(): ~10.00 > > I ran the same code. I do not understand the results you go. Interesting -- it may be the case that the optimizations performed by GCC are architecture specific. I searched for some threads about this on the GCC list. One interesting result was this: http://gcc.gnu.org/ml/gcc/2002-04/msg01114.html One possible explanation for the different performance you saw is explained by Jan Hubicka: http://gcc.gnu.org/ml/gcc/2002-04/msg01146.html One thing that confuses me is that GCC decides *not* to use __builtin_memset() for some reason, even though it appears to be superior to normal memset() on both of our systems. Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Neil Conway writes: > MemSet(): ~9.6 > memset(): ~19.5 > __builtin_memset(): ~10.00 I did my own tests with this code and the results vary wildly between platforms. (I do not list __builtin_memset() because the results were invariably equal to memset().) Platform buffer memset() MemSet() freebsd 32 5.3 4.9 freebsd 256 23.3 24.2 mingw 32 0.5 2.0 mingw 256 6.6 10.5 unixware 256 15.2 10.3 unixware 1024 29.2 34.1 cygwin 256 6.7 15.8 "freebsd" is i386-unknown-freebsd4.7 with GCC 2.95.4. "mingw" is i686-pc-mingw32 with GCC 3.2. "unixware" is i586-sco-sysv5uw7.1.3 with vendor compiler version 4.1. "cygwin" is i686-pc-cygwin with GCC 2.95.3. GCC was run as 'gcc -O2 -Wall'. (I also tried 'gcc -O3 -finline' but that gave only minimally better results.) The SCO compiler was run as 'cc -O -Kinine -v'. Make of those results what you will, but the current cross-over point of 1024 seems very wrong. -- Peter Eisentraut peter_e@gmx.net
Bruce Momjian writes: > I assume the GEQO results he is seeing is only for a tests, and that the > macro version of newNode will help in all cases. Well are we just assuming here or are we fixing actual problems based on real analyses? -- Peter Eisentraut peter_e@gmx.net
Peter Eisentraut wrote: > Bruce Momjian writes: > > > I assume the GEQO results he is seeing is only for a tests, and that the > > macro version of newNode will help in all cases. > > Well are we just assuming here or are we fixing actual problems based on > real analyses? What is your point? He is testing GEQO, but I assume other places would see a speedup from inlining newNode. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian writes: > Peter Eisentraut wrote: > > Bruce Momjian writes: > > > > > I assume the GEQO results he is seeing is only for a tests, and that the > > > macro version of newNode will help in all cases. > > > > Well are we just assuming here or are we fixing actual problems based on > > real analyses? > > What is your point? He is testing GEQO, but I assume other places would > see a speedup from inlining newNode. That's exactly my point: You're assuming. Of course everything would be faster if we inline it, but IMHO that's the compiler's job. -- Peter Eisentraut peter_e@gmx.net
Peter Eisentraut wrote: > Bruce Momjian writes: > > > Peter Eisentraut wrote: > > > Bruce Momjian writes: > > > > > > > I assume the GEQO results he is seeing is only for a tests, and that the > > > > macro version of newNode will help in all cases. > > > > > > Well are we just assuming here or are we fixing actual problems based on > > > real analyses? > > > > What is your point? He is testing GEQO, but I assume other places would > > see a speedup from inlining newNode. > > That's exactly my point: You're assuming. We know GEQO is faster, and the 8% is enough to justify the inlining. Whether it speeds up other operations is assumed but we don't really care; the GEQO numbers are enough. The code bloat is minimal. > Of course everything would be faster if we inline it, but IMHO that's the > compiler's job. Yes, but we have to help it in certain cases, heap_getattr() being the most dramatic. Are you suggesting we remove all the macros in our header file and replace them with function calls? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote: > Neil Conway wrote: > > BTW, regarding the newNode() stuff: so is it agreed that Bruce's patch > > is a performance win without too high of a code bloat / uglification > > penalty? If so, is it 7.3 or 7.4 material? > > Not sure. It is a small patch but we normally don't do performance > fixes during beta unless they are major. This performance improvement isn't related to a specific user problem report, so I am going to hold it for 7.4. At that time we can also research whether palloc0 should be used in other cases, like palloc followed by MemSet. The only issue there is that palloc0 can't optimize away constants used in the macro so it may be better _not_ to make that change. Neil, do you have any performance numbers comparing MemSet with constant vs. variable parameters, e.g: MemSet(ptr, 0, 256) vs. i = 256; MemSet(ptr, 0, i) -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Peter Eisentraut wrote: > Neil Conway writes: > > > MemSet(): ~9.6 > > memset(): ~19.5 > > __builtin_memset(): ~10.00 > > I did my own tests with this code and the results vary wildly between > platforms. (I do not list __builtin_memset() because the results were > invariably equal to memset().) > > Platform buffer memset() MemSet() > > freebsd 32 5.3 4.9 > freebsd 256 23.3 24.2 > mingw 32 0.5 2.0 > mingw 256 6.6 10.5 > unixware 256 15.2 10.3 > unixware 1024 29.2 34.1 > cygwin 256 6.7 15.8 > > "freebsd" is i386-unknown-freebsd4.7 with GCC 2.95.4. > "mingw" is i686-pc-mingw32 with GCC 3.2. > "unixware" is i586-sco-sysv5uw7.1.3 with vendor compiler version 4.1. > "cygwin" is i686-pc-cygwin with GCC 2.95.3. > > GCC was run as 'gcc -O2 -Wall'. (I also tried 'gcc -O3 -finline' but > that gave only minimally better results.) The SCO compiler was run as > 'cc -O -Kinine -v'. > > Make of those results what you will, but the current cross-over point of > 1024 seems very wrong. No question 1024 looks wrong on a lot of platforms. On Sparc, we were seeing a crossover even higher, and on many like BSD/OS, the crossover is much lower. Without some platform-specific test, I don't know how we are going to set this correctly. Should we develop such a test? Do we have enough MemSet usage in the 256-4k range that people would see a difference between different MEMSET_LOOP_LIMIT values? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073