Thread: inline newNode()

inline newNode()

From
Neil Conway
Date:
I've been taking a look at improving the performance of the GEQO
code. Before looking at algorithmic improvements, I noticed some
"low-hanging fruit": gprof revealed that newNode() was a hotspot. When
I altered it to be an inline function, the overall time to plan a
12-table table join (using the default GEQO settings) dropped by about
9%. I haven't taken a look at how the patch effects the other places
that use newNode(), but it stands to reason that they'd see a
performance improvement as well (although probably less noticeable).

However, I'm not sure if I used the correct syntax for inlining the
function (since it was originally declared in a header and defined
elsewhere, I couldn't just add 'inline'). The method I used (declaring
the function 'extern inline' and defining it in the header file) works
for me with GCC 3.2, but I'm open to suggestions for improvement.

BTW, after applying the patch, the GEQO profile looks like:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 16.07      0.18     0.18  4618735     0.00     0.00  compare_path_costs
  9.82      0.29     0.11  2149666     0.00     0.00  AllocSetAlloc
  8.04      0.38     0.09   396333     0.00     0.00  add_path
  4.46      0.43     0.05  2149661     0.00     0.00  MemoryContextAlloc
  3.57      0.47     0.04  1150953     0.00     0.00  compare_pathkeys

(Yes, gprof on my machine is still a bit fubared...)

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

Attachment

Re: inline newNode()

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> I've been taking a look at improving the performance of the GEQO
> code. Before looking at algorithmic improvements, I noticed some
> "low-hanging fruit": gprof revealed that newNode() was a hotspot. When
> I altered it to be an inline function, the overall time to plan a
> 12-table table join (using the default GEQO settings) dropped by about
> 9%.

How much did you bloat the code?  There are an awful lot of calls to
newNode(), so even though it's not all that large, I'd think the
multiplier would be nasty.

It might be better to offer two versions, standard out-of-line and a
macro-or-inline called something like newNodeInline(), and change just
the hottest call sites to use the inline.  You could probably get most
of the speedup from inlining at just a few call sites.  (I've been
intending to do something similar with the TransactionId comparison
functions.)

> However, I'm not sure if I used the correct syntax for inlining the
> function (since it was originally declared in a header and defined
> elsewhere, I couldn't just add 'inline'). The method I used (declaring
> the function 'extern inline' and defining it in the header file) works
> for me with GCC 3.2, but I'm open to suggestions for improvement.

This isn't portable at all, AFAIK :-(.  Unfortunately I can't think
of a portable way to do it with a macro, either.

However, if you're willing to go the route involving changing call
sites, then you could change
    foo = newNode(typename);
to
    newNodeMacro(foo, typename);
whereupon it becomes pretty easy to make a suitable macro.

            regards, tom lane

Re: inline newNode()

From
Neil Conway
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> How much did you bloat the code?  There are an awful lot of calls to
> newNode(), so even though it's not all that large, I'd think the
> multiplier would be nasty.

The patch increases the executable from 12844452 to 13005244 bytes,
when compiled with '-pg -g -O2' and without being stripped.

> This isn't portable at all, AFAIK :-(.  Unfortunately I can't think
> of a portable way to do it with a macro, either.

Well, one alternative might be to provide 2 definitions of the
function -- one an extern inline in the header file, and one using the
current method (in a separate file, non-inline). Then wrap the header
file in an #ifdef __GNUC__ block, and the non-inlined version in
#ifndef __GNUC__. The downside is that it means maintaining two
versions of the same function -- but given that newNode() is pretty
trivial, that might be acceptable.

BTW, the GCC docs on inline functions are here:

        http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Inline.html#Inline

According to that page, using 'static inline' instead of 'extern
inline' is recommended for future compatability with C99, so that's
what we should probably use (in the __GNUC__ version).

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

Re: inline newNode()

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> How much did you bloat the code?  There are an awful lot of calls to
>> newNode(), so even though it's not all that large, I'd think the
>> multiplier would be nasty.

> The patch increases the executable from 12844452 to 13005244 bytes,
> when compiled with '-pg -g -O2' and without being stripped.

Okay, not as bad as I feared, but still kinda high.

I believe that most of the bloat comes from the MemSet macro; there's
just not much else in newNode().  Now, the reason MemSet expands to
a fair amount of code is its if-then-else case to decide whether to
call memset() or do an inline loop.  I've looked at the assembler code
for it on a couple of machines, and the loop proper is only about a
third of the code that gets generated.

Ideally, we'd like to eliminate the if-test for inlined newNode calls.
That would buy back a lot of the bloat and speed things up still
further.

Now the tests on _val == 0 and _len <= MEMSET_LOOP_LIMIT and _len being
a multiple of 4 are no problem, since _val and _len are compile-time
constants; these will be optimized away.  What is not optimized away
(on the compilers I've looked at) is the check for _start being
int-aligned.

A brute-force approach is to say "we know _start is word-aligned because
we just got it from palloc, which guarantees MAXALIGNment".  We could
make a variant version of MemSet that omits the alignment check, and use
it here and anywhere else we're sure it's safe.

A nicer approach would be to somehow make use of the datatype of the
first argument to MemSet.  If we could determine at compile time that
it's supposed to point at a type with at least int alignment, then
it'd be possible for the compiler to optimize away this check in a
reasonably safe fashion.  I'm not sure if there's a portable way to
do this, though.  There's no "alignof()" construct in C :-(.
Any ideas?

            regards, tom lane

Re: inline newNode()

From
Neil Conway
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
[ snip interesting analysis ]
> A nicer approach would be to somehow make use of the datatype of the
> first argument to MemSet.  If we could determine at compile time
> that it's supposed to point at a type with at least int alignment,
> then it'd be possible for the compiler to optimize away this check
> in a reasonably safe fashion.  I'm not sure if there's a portable
> way to do this, though.  There's no "alignof()" construct in C
> :-(. Any ideas?

Well, we could make use of (yet another) GCC-ism: the __alignof__
keyword, which is described here:

    http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Alignment.html#Alignment

I don't like making the code GCC-specific any more than anyone else
does, but given that the code-bloat is specific to the inline version
of newNode (which in the scheme I described earlier would be
GCC-only) -- so introducing a GCC-specific fix for a GCC-specific
problem isn't too bad, IMHO.

Or we could just use your other suggestion: define a variant of
MemSet() and use it when we know it's safe. Not sure which is the
better solution: any comments?

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

Re: inline newNode()

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> I don't like making the code GCC-specific any more than anyone else
> does, but given that the code-bloat is specific to the inline version
> of newNode (which in the scheme I described earlier would be
> GCC-only) -- so introducing a GCC-specific fix for a GCC-specific
> problem isn't too bad, IMHO.

> Or we could just use your other suggestion: define a variant of
> MemSet() and use it when we know it's safe. Not sure which is the
> better solution: any comments?

If we're going with a GCC-only approach to inlining newNode then it
seems like a tossup to me too.  Any other thoughts out there?

            regards, tom lane

Re: inline newNode()

From
Bruce Momjian
Date:
Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > I don't like making the code GCC-specific any more than anyone else
> > does, but given that the code-bloat is specific to the inline version
> > of newNode (which in the scheme I described earlier would be
> > GCC-only) -- so introducing a GCC-specific fix for a GCC-specific
> > problem isn't too bad, IMHO.
>
> > Or we could just use your other suggestion: define a variant of
> > MemSet() and use it when we know it's safe. Not sure which is the
> > better solution: any comments?
>
> If we're going with a GCC-only approach to inlining newNode then it
> seems like a tossup to me too.  Any other thoughts out there?

Seems newNode can easily be made into a macro.  I can do the coding, and
if you tell me that newNode will always be int-aligned, I can make an
assume-aligned version of MemSet.

Is that what people want to try?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: inline newNode()

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Seems newNode can easily be made into a macro.

If you can see how to do that, show us ... that would eliminate the
need for a GCC-only inline function, which would sway my vote away
from depending on __alignof__ for the other thing.

            regards, tom lane

Re: inline newNode()

From
Peter Eisentraut
Date:
Tom Lane writes:

> A brute-force approach is to say "we know _start is word-aligned because
> we just got it from palloc, which guarantees MAXALIGNment".  We could
> make a variant version of MemSet that omits the alignment check, and use
> it here and anywhere else we're sure it's safe.

Or make a version of palloc that zeroes the memory.

--
Peter Eisentraut   peter_e@gmx.net


Re: inline newNode()

From
Peter Eisentraut
Date:
Neil Conway writes:

> Well, one alternative might be to provide 2 definitions of the
> function -- one an extern inline in the header file, and one using the
> current method (in a separate file, non-inline). Then wrap the header
> file in an #ifdef __GNUC__ block, and the non-inlined version in
> #ifndef __GNUC__.

External inline functions aren't even portable across different versions
of GCC.  It's best not to go there at all.

--
Peter Eisentraut   peter_e@gmx.net


Re: inline newNode()

From
Neil Conway
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Or make a version of palloc that zeroes the memory.

Well, we might lose a bit of speed that way -- with the current
MemSet() macro, most of the tests that MemSet() does can be eliminated
at compile-time, since MemSet() is usually called with some constant
parameters. If we moved this code inside palloc(), it wouldn't be
possible to do this constant propagation.

Not sure it would make a big performance difference, though -- and I
think palloc() is a clean solution. In fact, it's probably worth
having anyway: a lot of the call sites of palloc() immediately zero
the memory after they allocate it...

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

Re: inline newNode()

From
Bruce Momjian
Date:
Peter Eisentraut wrote:
> Tom Lane writes:
>
> > A brute-force approach is to say "we know _start is word-aligned because
> > we just got it from palloc, which guarantees MAXALIGNment".  We could
> > make a variant version of MemSet that omits the alignment check, and use
> > it here and anywhere else we're sure it's safe.
>
> Or make a version of palloc that zeroes the memory.

OK, here is a version of newNode that is a macro.  However, there are
problems:

First, I can't get the code to assign the tag (dereference the pointer)
_and_ return the pointer, in a macro.  I can convert it to take a third
pointer argument, and it is only called <10 times, so that is easy, but
it is used by makeNode, which is called ~1000 times, so that seems like
a bad idea.  My solution was to declare a global variable in the header
file to be used by the macro.  Because the macro isn't called
recursively, that should be fine.

Now, the other problem is MemSet.  MemSet isn't designed to be called
inside a macro.  In fact, the 'if' tests are easy to do in a macro with
"? :", but the "while" loop seems to be impossible in a macro.  I tried
seeing if 'goto' would work in a macro, but of course it doesn't, and
even if it did, I doubt it would be portable.  The patch merely calls
memset() rather than MemSet.

Not sure if this is a win or not.  It makes makeNode a macro, but
removes the use of the MemSet macro in favor of memset().

Does anyone have additional suggestions?  The only thing I can suggest
is to make a clear-memory version of palloc because palloc always calls
MemoryContextAlloc() so I can put it in there.  How does that sound?

However, now that I look at it, MemoryContextAlloc() could be made a
macro easier than newNode() and that would save function call in more
cases than newNode.  In fact, the comment at the top of
MemoryContextAlloc() says:

 * This could be turned into a macro, but we'd have to import
 * nodes/memnodes.h into postgres.h which seems a bad idea.

Ideas?

The regression tests do pass with this patch, so functionally it works
fine.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Index: src/backend/nodes/nodes.c
===================================================================
RCS file: /cvsroot/pgsql-server/src/backend/nodes/nodes.c,v
retrieving revision 1.15
diff -c -c -r1.15 nodes.c
*** src/backend/nodes/nodes.c    20 Jun 2002 20:29:29 -0000    1.15
--- src/backend/nodes/nodes.c    8 Oct 2002 23:52:21 -0000
***************
*** 28,42 ****
   *      macro makeNode. eg. to create a Resdom node, use makeNode(Resdom)
   *
   */
! Node *
! newNode(Size size, NodeTag tag)
! {
!     Node       *newNode;

-     Assert(size >= sizeof(Node));        /* need the tag, at least */
-
-     newNode = (Node *) palloc(size);
-     MemSet((char *) newNode, 0, size);
-     newNode->type = tag;
-     return newNode;
- }
--- 28,32 ----
   *      macro makeNode. eg. to create a Resdom node, use makeNode(Resdom)
   *
   */
! Node *newNodeMacroHolder;

Index: src/include/nodes/nodes.h
===================================================================
RCS file: /cvsroot/pgsql-server/src/include/nodes/nodes.h,v
retrieving revision 1.118
diff -c -c -r1.118 nodes.h
*** src/include/nodes/nodes.h    31 Aug 2002 22:10:47 -0000    1.118
--- src/include/nodes/nodes.h    8 Oct 2002 23:52:25 -0000
***************
*** 261,266 ****
--- 261,285 ----

  #define nodeTag(nodeptr)        (((Node*)(nodeptr))->type)

+ /*
+  *    There is no way to deferernce the palloc'ed pointer to assign the
+  *    tag, and return the pointer itself, so we need a holder variable.
+  *    Fortunately, this function isn't recursive so we just define
+  *    a global variable for this purpose.
+  */
+ extern Node *newNodeMacroHolder;
+
+ #define newNode(size, tag) \
+ ( \
+     AssertMacro((size) >= sizeof(Node)),        /* need the tag, at least */ \
+ \
+     newNodeMacroHolder = (Node *) palloc(size), \
+     memset((char *)newNodeMacroHolder, 0, (size)), \
+     newNodeMacroHolder->type = (tag), \
+     newNodeMacroHolder \
+ )
+
+
  #define makeNode(_type_)        ((_type_ *) newNode(sizeof(_type_),T_##_type_))
  #define NodeSetTag(nodeptr,t)    (((Node*)(nodeptr))->type = (t))

***************
*** 281,291 ****
   *                      extern declarations follow
   * ----------------------------------------------------------------
   */
-
- /*
-  * nodes/nodes.c
-  */
- extern Node *newNode(Size size, NodeTag tag);

  /*
   * nodes/{outfuncs.c,print.c}
--- 300,305 ----

Re: inline newNode()

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> OK, here is a version of newNode that is a macro.

If you use memset() instead of MemSet(), I'm afraid you're going to blow
off most of the performance gain this was supposed to achieve.

> Does anyone have additional suggestions?  The only thing I can suggest
> is to make a clear-memory version of palloc because palloc always calls
> MemoryContextAlloc() so I can put it in there.  How does that sound?

I do not think palloc should auto-zero memory.  Hard to explain why,
but it just feels like a bad decision.  One point is that the MemSet
has to be inlined or it cannot compile-out the tests on _len.  palloc
can't treat the length as a compile-time constant.

> The regression tests do pass with this patch, so functionally it works
> fine.

Speed is the issue here, not functionality...

            regards, tom lane

Re: inline newNode()

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > OK, here is a version of newNode that is a macro.
>
> If you use memset() instead of MemSet(), I'm afraid you're going to blow
> off most of the performance gain this was supposed to achieve.

Yep.

> > Does anyone have additional suggestions?  The only thing I can suggest
> > is to make a clear-memory version of palloc because palloc always calls
> > MemoryContextAlloc() so I can put it in there.  How does that sound?
>
> I do not think palloc should auto-zero memory.  Hard to explain why,
> but it just feels like a bad decision.  One point is that the MemSet
> has to be inlined or it cannot compile-out the tests on _len.  palloc
> can't treat the length as a compile-time constant.

Right, palloc shouldn't.  I was thinking of having another version of
palloc that _does_ clear out memory, and calling that from a newNode()
macro.  We already know palloc is going to call MemoryContextAlloc, so
we could have a pallocC() that calls a new MemoryContextAllocC() that
would call the underlying memory allocation function, then do the loop
like MemSet to clear it.

It would allow newNode to be a macro and prevent the bloat of having
MemSet appear everywhere newNode appears;  it would all be in the new
function MemoryContextAllocC().

> > The regression tests do pass with this patch, so functionally it works
> > fine.
>
> Speed is the issue here, not functionality...

I was just proving it works, that's all.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: inline newNode()

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Right, palloc shouldn't.  I was thinking of having another version of
> palloc that _does_ clear out memory, and calling that from a newNode()
> macro.  We already know palloc is going to call MemoryContextAlloc, so
> we could have a pallocC() that calls a new MemoryContextAllocC() that
> would call the underlying memory allocation function, then do the loop
> like MemSet to clear it.

But if the MemSet is inside the called function then it cannot reduce
the if-tests to a compile-time decision to invoke the word-zeroing loop.
We want the MemSet to be expanded at the newNode call site, where the
size of the allocated memory is a compile-time constant.

            regards, tom lane

Re: inline newNode()

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Right, palloc shouldn't.  I was thinking of having another version of
> > palloc that _does_ clear out memory, and calling that from a newNode()
> > macro.  We already know palloc is going to call MemoryContextAlloc, so
> > we could have a pallocC() that calls a new MemoryContextAllocC() that
> > would call the underlying memory allocation function, then do the loop
> > like MemSet to clear it.
>
> But if the MemSet is inside the called function then it cannot reduce
> the if-tests to a compile-time decision to invoke the word-zeroing loop.
> We want the MemSet to be expanded at the newNode call site, where the
> size of the allocated memory is a compile-time constant.

I can easily do the tests in the MemSet macro, but I can't do a loop in
a macro that has to return a value;  I need while().  Though a loop in a
new fuction will not be as fast as a MemSet macro, I think it will be
better than what we have now with newNode only because newNode will be a
macro and not a function anymore, i.e. the MemSet will happen in the
function called by pallocC and not in newNode anymore, and there will be
zero code bloat.  I wish I saw another way.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: inline newNode()

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Right, palloc shouldn't.  I was thinking of having another version of
> > > palloc that _does_ clear out memory, and calling that from a newNode()
> > > macro.  We already know palloc is going to call MemoryContextAlloc, so
> > > we could have a pallocC() that calls a new MemoryContextAllocC() that
> > > would call the underlying memory allocation function, then do the loop
> > > like MemSet to clear it.
> >
> > But if the MemSet is inside the called function then it cannot reduce
> > the if-tests to a compile-time decision to invoke the word-zeroing loop.
> > We want the MemSet to be expanded at the newNode call site, where the
> > size of the allocated memory is a compile-time constant.
>
> I can easily do the tests in the MemSet macro, but I can't do a loop in
> a macro that has to return a value;  I need while().  Though a loop in a
> new fuction will not be as fast as a MemSet macro, I think it will be
> better than what we have now with newNode only because newNode will be a
> macro and not a function anymore, i.e. the MemSet will happen in the
> function called by pallocC and not in newNode anymore, and there will be
> zero code bloat.  I wish I saw another way.

OK, here's a patch for testing.  It needs cleanup because the final
version would remove the nodes/nodes.c file.  The net effect of the
patch is to make newNode a macro with little code bloat.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Index: src/backend/nodes/nodes.c
===================================================================
RCS file: /cvsroot/pgsql-server/src/backend/nodes/nodes.c,v
retrieving revision 1.15
diff -c -c -r1.15 nodes.c
*** src/backend/nodes/nodes.c    20 Jun 2002 20:29:29 -0000    1.15
--- src/backend/nodes/nodes.c    9 Oct 2002 05:17:13 -0000
***************
*** 28,42 ****
   *      macro makeNode. eg. to create a Resdom node, use makeNode(Resdom)
   *
   */
! Node *
! newNode(Size size, NodeTag tag)
! {
!     Node       *newNode;

-     Assert(size >= sizeof(Node));        /* need the tag, at least */
-
-     newNode = (Node *) palloc(size);
-     MemSet((char *) newNode, 0, size);
-     newNode->type = tag;
-     return newNode;
- }
--- 28,32 ----
   *      macro makeNode. eg. to create a Resdom node, use makeNode(Resdom)
   *
   */
! Node *newNodeMacroHolder;

Index: src/backend/utils/mmgr/mcxt.c
===================================================================
RCS file: /cvsroot/pgsql-server/src/backend/utils/mmgr/mcxt.c,v
retrieving revision 1.32
diff -c -c -r1.32 mcxt.c
*** src/backend/utils/mmgr/mcxt.c    12 Aug 2002 00:36:12 -0000    1.32
--- src/backend/utils/mmgr/mcxt.c    9 Oct 2002 05:17:17 -0000
***************
*** 453,458 ****
--- 453,481 ----
  }

  /*
+  * MemoryContextAllocC
+  *        Like MemoryContextAlloc, but clears allocated memory
+  *
+  *    We could just call MemoryContextAlloc then clear the memory, but this
+  *    function is called too many times, so we have a separate version.
+  */
+ void *
+ MemoryContextAllocC(MemoryContext context, Size size)
+ {
+     void *ret;
+
+     AssertArg(MemoryContextIsValid(context));
+
+     if (!AllocSizeIsValid(size))
+         elog(ERROR, "MemoryContextAlloc: invalid request size %lu",
+              (unsigned long) size);
+
+     ret = (*context->methods->alloc) (context, size);
+     MemSet(ret, 0, size);
+     return ret;
+ }
+
+ /*
   * pfree
   *        Release an allocated chunk.
   */
Index: src/include/nodes/nodes.h
===================================================================
RCS file: /cvsroot/pgsql-server/src/include/nodes/nodes.h,v
retrieving revision 1.118
diff -c -c -r1.118 nodes.h
*** src/include/nodes/nodes.h    31 Aug 2002 22:10:47 -0000    1.118
--- src/include/nodes/nodes.h    9 Oct 2002 05:17:20 -0000
***************
*** 261,266 ****
--- 261,284 ----

  #define nodeTag(nodeptr)        (((Node*)(nodeptr))->type)

+ /*
+  *    There is no way to dereference the palloc'ed pointer to assign the
+  *    tag, and return the pointer itself, so we need a holder variable.
+  *    Fortunately, this function isn't recursive so we just define
+  *    a global variable for this purpose.
+  */
+ extern Node *newNodeMacroHolder;
+
+ #define newNode(size, tag) \
+ ( \
+     AssertMacro((size) >= sizeof(Node)),        /* need the tag, at least */ \
+ \
+     newNodeMacroHolder = (Node *) pallocC(size), \
+     newNodeMacroHolder->type = (tag), \
+     newNodeMacroHolder \
+ )
+
+
  #define makeNode(_type_)        ((_type_ *) newNode(sizeof(_type_),T_##_type_))
  #define NodeSetTag(nodeptr,t)    (((Node*)(nodeptr))->type = (t))

***************
*** 281,291 ****
   *                      extern declarations follow
   * ----------------------------------------------------------------
   */
-
- /*
-  * nodes/nodes.c
-  */
- extern Node *newNode(Size size, NodeTag tag);

  /*
   * nodes/{outfuncs.c,print.c}
--- 299,304 ----
Index: src/include/utils/palloc.h
===================================================================
RCS file: /cvsroot/pgsql-server/src/include/utils/palloc.h,v
retrieving revision 1.19
diff -c -c -r1.19 palloc.h
*** src/include/utils/palloc.h    20 Jun 2002 20:29:53 -0000    1.19
--- src/include/utils/palloc.h    9 Oct 2002 05:17:21 -0000
***************
*** 46,53 ****
--- 46,56 ----
   * Fundamental memory-allocation operations (more are in utils/memutils.h)
   */
  extern void *MemoryContextAlloc(MemoryContext context, Size size);
+ extern void *MemoryContextAllocC(MemoryContext context, Size size);

  #define palloc(sz)    MemoryContextAlloc(CurrentMemoryContext, (sz))
+
+ #define pallocC(sz)    MemoryContextAllocC(CurrentMemoryContext, (sz))

  extern void pfree(void *pointer);


Re: inline newNode()

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> ...  I wish I saw another way.

I liked Neil's proposal better.

            regards, tom lane

Re: inline newNode()

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > ...  I wish I saw another way.
>
> I liked Neil's proposal better.

What is it you liked, specifically?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: inline newNode()

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > ...  I wish I saw another way.
>
> I liked Neil's proposal better.

I think Neil is going to test my patch. I think he will find it yields
identical performance to his patch without the bloat, and it will work
on all platforms.  I don't think the MemSet() constant calls are that
big a performance win.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: inline newNode()

From
Neil Conway
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> OK, here's a patch for testing.  It needs cleanup because the final
> version would remove the nodes/nodes.c file.  The net effect of the
> patch is to make newNode a macro with little code bloat.

Ok, I think this is the better version. The performance seems to be
about the same as my original patch, and the code bloat is a lot less:
12857787 with the patch versus 12845168 without it. The
implemementation (with a global variable) is pretty ugly, but I don't
really see a way around that...

One minor quibble with the patch though: MemoryContextAllocC() and
pallocC() aren't very good function names, IMHO. Perhaps
MemoryContextAllocZero() and palloc0() would be better?

This isn't specific to your patch, but it occurs to me that we could
save a few bits of code bloat if we removed the '_len' variable
declaration from the MemSet() macro -- it isn't needed, AFAICT. That
would mean we we'd evaluate the 'len' argument multiple times, so I'm
not sure if that's a win overall...

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

Re: inline newNode()

From
Bruce Momjian
Date:
Neil Conway wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > OK, here's a patch for testing.  It needs cleanup because the final
> > version would remove the nodes/nodes.c file.  The net effect of the
> > patch is to make newNode a macro with little code bloat.
>
> Ok, I think this is the better version. The performance seems to be
> about the same as my original patch, and the code bloat is a lot less:
> 12857787 with the patch versus 12845168 without it. The
> implemementation (with a global variable) is pretty ugly, but I don't
> really see a way around that...
>
> One minor quibble with the patch though: MemoryContextAllocC() and
> pallocC() aren't very good function names, IMHO. Perhaps
> MemoryContextAllocZero() and palloc0() would be better?

Sure, I like those names.

> This isn't specific to your patch, but it occurs to me that we could
> save a few bits of code bloat if we removed the '_len' variable
> declaration from the MemSet() macro -- it isn't needed, AFAICT. That
> would mean we we'd evaluate the 'len' argument multiple times, so I'm
> not sure if that's a win overall...

I think that was actually the goal, to not have them evaluated multiple
times, and perhaps bloat if the length itself was a macro.

New patch attached with your suggested renaming.  Now that I look at the
code, there are about 55 places where we call palloc then right after it
MemSet(0), so we could merge those to use palloc0 and reduce existing
MemSet code bloat some more.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Index: src/backend/nodes/nodes.c
===================================================================
RCS file: /cvsroot/pgsql-server/src/backend/nodes/nodes.c,v
retrieving revision 1.15
diff -c -c -r1.15 nodes.c
*** src/backend/nodes/nodes.c    20 Jun 2002 20:29:29 -0000    1.15
--- src/backend/nodes/nodes.c    9 Oct 2002 05:17:13 -0000
***************
*** 28,42 ****
   *      macro makeNode. eg. to create a Resdom node, use makeNode(Resdom)
   *
   */
! Node *
! newNode(Size size, NodeTag tag)
! {
!     Node       *newNode;

-     Assert(size >= sizeof(Node));        /* need the tag, at least */
-
-     newNode = (Node *) palloc(size);
-     MemSet((char *) newNode, 0, size);
-     newNode->type = tag;
-     return newNode;
- }
--- 28,32 ----
   *      macro makeNode. eg. to create a Resdom node, use makeNode(Resdom)
   *
   */
! Node *newNodeMacroHolder;

Index: src/backend/utils/mmgr/mcxt.c
===================================================================
RCS file: /cvsroot/pgsql-server/src/backend/utils/mmgr/mcxt.c,v
retrieving revision 1.32
diff -c -c -r1.32 mcxt.c
*** src/backend/utils/mmgr/mcxt.c    12 Aug 2002 00:36:12 -0000    1.32
--- src/backend/utils/mmgr/mcxt.c    9 Oct 2002 05:17:17 -0000
***************
*** 453,458 ****
--- 453,481 ----
  }

  /*
+  * MemoryContextAllocZero
+  *        Like MemoryContextAlloc, but clears allocated memory
+  *
+  *    We could just call MemoryContextAlloc then clear the memory, but this
+  *    function is called too many times, so we have a separate version.
+  */
+ void *
+ MemoryContextAllocZero(MemoryContext context, Size size)
+ {
+     void *ret;
+
+     AssertArg(MemoryContextIsValid(context));
+
+     if (!AllocSizeIsValid(size))
+         elog(ERROR, "MemoryContextAllocZero: invalid request size %lu",
+              (unsigned long) size);
+
+     ret = (*context->methods->alloc) (context, size);
+     MemSet(ret, 0, size);
+     return ret;
+ }
+
+ /*
   * pfree
   *        Release an allocated chunk.
   */
Index: src/include/nodes/nodes.h
===================================================================
RCS file: /cvsroot/pgsql-server/src/include/nodes/nodes.h,v
retrieving revision 1.118
diff -c -c -r1.118 nodes.h
*** src/include/nodes/nodes.h    31 Aug 2002 22:10:47 -0000    1.118
--- src/include/nodes/nodes.h    9 Oct 2002 05:17:20 -0000
***************
*** 261,266 ****
--- 261,284 ----

  #define nodeTag(nodeptr)        (((Node*)(nodeptr))->type)

+ /*
+  *    There is no way to dereference the palloc'ed pointer to assign the
+  *    tag, and return the pointer itself, so we need a holder variable.
+  *    Fortunately, this function isn't recursive so we just define
+  *    a global variable for this purpose.
+  */
+ extern Node *newNodeMacroHolder;
+
+ #define newNode(size, tag) \
+ ( \
+     AssertMacro((size) >= sizeof(Node)),        /* need the tag, at least */ \
+ \
+     newNodeMacroHolder = (Node *) palloc0(size), \
+     newNodeMacroHolder->type = (tag), \
+     newNodeMacroHolder \
+ )
+
+
  #define makeNode(_type_)        ((_type_ *) newNode(sizeof(_type_),T_##_type_))
  #define NodeSetTag(nodeptr,t)    (((Node*)(nodeptr))->type = (t))

***************
*** 281,291 ****
   *                      extern declarations follow
   * ----------------------------------------------------------------
   */
-
- /*
-  * nodes/nodes.c
-  */
- extern Node *newNode(Size size, NodeTag tag);

  /*
   * nodes/{outfuncs.c,print.c}
--- 299,304 ----
Index: src/include/utils/palloc.h
===================================================================
RCS file: /cvsroot/pgsql-server/src/include/utils/palloc.h,v
retrieving revision 1.19
diff -c -c -r1.19 palloc.h
*** src/include/utils/palloc.h    20 Jun 2002 20:29:53 -0000    1.19
--- src/include/utils/palloc.h    9 Oct 2002 05:17:21 -0000
***************
*** 46,53 ****
--- 46,56 ----
   * Fundamental memory-allocation operations (more are in utils/memutils.h)
   */
  extern void *MemoryContextAlloc(MemoryContext context, Size size);
+ extern void *MemoryContextAllocZero(MemoryContext context, Size size);

  #define palloc(sz)    MemoryContextAlloc(CurrentMemoryContext, (sz))
+
+ #define palloc0(sz)    MemoryContextAllocZero(CurrentMemoryContext, (sz))

  extern void pfree(void *pointer);


Re: inline newNode()

From
Peter Eisentraut
Date:
Tom Lane writes:

> If you use memset() instead of MemSet(), I'm afraid you're going to blow
> off most of the performance gain this was supposed to achieve.

Can someone explain to me why memset() would ever be better than MemSet()?

Also, shouldn't GCC (at least 3.0 or later) inline memset() automatically?

What's the result of using -finline (or your favorite compiler's
inlining flag)?

And has someone wondered why the GEQO code needs so many new nodes?
Perhaps a more lightweight data representation for internal use could be
appropriate?

--
Peter Eisentraut   peter_e@gmx.net


Re: inline newNode()

From
Bruce Momjian
Date:
Peter Eisentraut wrote:
> Tom Lane writes:
>
> > If you use memset() instead of MemSet(), I'm afraid you're going to blow
> > off most of the performance gain this was supposed to achieve.
>
> Can someone explain to me why memset() would ever be better than MemSet()?

I am surprised MemSet was ever faster than memset().  Remember, MemSet
was done only to prevent excessive function call overhead to memset().
I never anticipated that a simple while() loop would be faster than the
libc version, especially ones that have assembler memset versions, but,
for example on Sparc, this is true.

I looked at the Sparc assembler code and I can't see why MemSet would be
faster.  Perhaps someone can send over the Sparc assembler output of
MemSet on their platform and I can compare it to the Solaris assembler I
see here.  In fact, they can probably disassemble a memset() routine there
and see exactly what I see.

> Also, shouldn't GCC (at least 3.0 or later) inline memset() automatically?

Not sure, but yes, that may be true.  I think it requires a high
optimizer level, perhaps higher than our default.

> What's the result of using -finline (or your favorite compiler's
> inlining flag)?

Yes, that would be it.

> And has someone wondered why the GEQO code needs so many new nodes?
> Perhaps a more lightweight data representation for internal use could be
> appropriate?

I assume the GEQO results he is seeing is only for a tests, and that the
macro version of newNode will help in all cases.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: inline newNode()

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Can someone explain to me why memset() would ever be better than MemSet()?

memset() should *always* be faster than any C-coded implementation
thereof.  Any competent assembly-language writer can beat C-level
locutions, or at least equal them, if he's willing to expend the
effort.

I've frankly been astonished at the number of platforms where it
seems memset() has not been coded with an appropriate degree of
tenseness.  The fact that we've found it useful to invent MemSet()
is a pretty damning indictment of the competence of modern C-library
authors.

Or am I just stuck in the obsolete notion that vendors should provide
some amount of platform-specific tuning, and not a generic library?

            regards, tom lane

Re: inline newNode()

From
Bruce Momjian
Date:
Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > Can someone explain to me why memset() would ever be better than MemSet()?
>
> memset() should *always* be faster than any C-coded implementation
> thereof.  Any competent assembly-language writer can beat C-level
> locutions, or at least equal them, if he's willing to expend the
> effort.

You would think so.  I am surprised too.

> I've frankly been astonished at the number of platforms where it
> seems memset() has not been coded with an appropriate degree of
> tenseness.  The fact that we've found it useful to invent MemSet()
> is a pretty damning indictment of the competence of modern C-library
> authors.

Remember, MemSet was invented only to prevent function call overhead,
and on my BSD/OS system, len >= 256 is faster with the libc memset().
It is a fluke we found out that MemSet is faster that memset for all
lengths on some platforms.

> Or am I just stuck in the obsolete notion that vendors should provide
> some amount of platform-specific tuning, and not a generic library?

What really surprised me is that MemSet won on Sparc, where they have an
assembler language version that looks very similar to the MemSet loop.

In fact, I was the one who encouraged BSD/OS to write assembler language
verions of many of their str* and mem* functions.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: inline newNode()

From
Neil Conway
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Remember, MemSet was invented only to prevent function call overhead,
> and on my BSD/OS system, len >= 256 is faster with the libc
> memset().

Yes, I remember finding that when testing MemSet() versus memset() for
various values of MEMSET_LOOP_LIMIT earlier.

> What really surprised me is that MemSet won on Sparc, where they have an
> assembler language version that looks very similar to the MemSet
> loop.

Well, I'd assume any C library / compiler of half-decent quality on
any platform would provide assembly optimized versions of common
stdlib functions like memset().

While playing around with memset() on my machine (P4 running Linux,
glibc 2.2.5, GCC 3.2.1pre3), I found the following interesting
result. I used this simple benchmark (the same one I posted for the
earlier MemSet() thread on -hackers):

#include <string.h>
#include "postgres.h"

#undef MEMSET_LOOP_LIMIT
#define MEMSET_LOOP_LIMIT BUFFER_SIZE

int
main(void)
{
    char buffer[BUFFER_SIZE];
    long long i;

    for (i = 0; i < 99000000; i++)
    {
        memset(buffer, 0, sizeof(buffer));
    }

    return 0;
}

Compiled with '-DBUFFER_SIZE=256 -O2', I get the following results in
seconds:

MemSet(): ~9.6
memset(): ~19.5
__builtin_memset(): ~10.00

So it seems there is a reasonably optimized version of memset()
provided by glibc/GCC (not sure which :-) ), it's just a matter of
persuading the compiler to let us use it. It's still depressing that
it doesn't beat MemSet(), but perhaps __builtin_memset() has better
average-case performane over a wider spectrum of memory size?[1]

BTW, regarding the newNode() stuff: so is it agreed that Bruce's patch
is a performance win without too high of a code bloat / uglification
penalty? If so, is it 7.3 or 7.4 material?

Cheers,

Neil

[1] Not that I really buy that -- for one thing, if the length is
constant, as it is in this case, the compiler can substitute an
optimized version of the function for the appropriate memory size. I'm
having a little difficulty explaining GCC/glibc's poor performance...

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

Re: inline newNode()

From
Karel Zak
Date:
On Wed, Oct 09, 2002 at 12:12:12AM -0400, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > OK, here is a version of newNode that is a macro.
>
> If you use memset() instead of MemSet(), I'm afraid you're going to blow
> off most of the performance gain this was supposed to achieve.
>
> > Does anyone have additional suggestions?  The only thing I can suggest
> > is to make a clear-memory version of palloc because palloc always calls
> > MemoryContextAlloc() so I can put it in there.  How does that sound?
>
> I do not think palloc should auto-zero memory.  Hard to explain why,
> but it just feels like a bad decision.  One point is that the MemSet

 Agree. The memory-management routine knows nothing about real memory
 usage - maybe is zeroize memory wanted in palloc caller, but maybe
 not.. The palloc() caller knows it better than plalloc(). If I good
 remember same discussion was long time ago in linux-kernel list and
 result was non-zeroize-memory.


    Karel


--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

Re: inline newNode()

From
Gavin Sherry
Date:
On 10 Oct 2002, Neil Conway wrote:

> Well, I'd assume any C library / compiler of half-decent quality on
> any platform would provide assembly optimized versions of common
> stdlib functions like memset().
>
> While playing around with memset() on my machine (P4 running Linux,
> glibc 2.2.5, GCC 3.2.1pre3), I found the following interesting
> result. I used this simple benchmark (the same one I posted for the
> earlier MemSet() thread on -hackers):
>

[snip]

> Compiled with '-DBUFFER_SIZE=256 -O2', I get the following results in
> seconds:
>
> MemSet(): ~9.6
> memset(): ~19.5
> __builtin_memset(): ~10.00

I ran the same code. I do not understand the results you go. Here are
mine, on an AMD Duron with GCC 3.2 and glibc-2.2.5. Results are:

MemSet(): 14.758 sec
memset(): 11.597 sec
__buildin_memset(): 9.000 sec

Who else wants to test?

Gavin


Re: inline newNode()

From
Bruce Momjian
Date:
Neil Conway wrote:
> BTW, regarding the newNode() stuff: so is it agreed that Bruce's patch
> is a performance win without too high of a code bloat / uglification
> penalty? If so, is it 7.3 or 7.4 material?

Not sure.  It is a small patch but we normally don't do performance
fixes during beta unless they are major.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: inline newNode()

From
Neil Conway
Date:
Gavin Sherry <swm@linuxworld.com.au> writes:
> On 10 Oct 2002, Neil Conway wrote:
> > Compiled with '-DBUFFER_SIZE=256 -O2', I get the following results in
> > seconds:
> >
> > MemSet(): ~9.6
> > memset(): ~19.5
> > __builtin_memset(): ~10.00
>
> I ran the same code. I do not understand the results you go.

Interesting -- it may be the case that the optimizations performed by
GCC are architecture specific.

I searched for some threads about this on the GCC list. One
interesting result was this:

        http://gcc.gnu.org/ml/gcc/2002-04/msg01114.html

One possible explanation for the different performance you saw is
explained by Jan Hubicka:

        http://gcc.gnu.org/ml/gcc/2002-04/msg01146.html

One thing that confuses me is that GCC decides *not* to use
__builtin_memset() for some reason, even though it appears to be
superior to normal memset() on both of our systems.

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

Re: inline newNode()

From
Peter Eisentraut
Date:
Neil Conway writes:

> MemSet(): ~9.6
> memset(): ~19.5
> __builtin_memset(): ~10.00

I did my own tests with this code and the results vary wildly between
platforms.  (I do not list __builtin_memset() because the results were
invariably equal to memset().)

Platform    buffer       memset()    MemSet()

freebsd        32        5.3         4.9
freebsd        256       23.3        24.2
mingw        32        0.5         2.0
mingw        256        6.6        10.5
unixware    256       15.2        10.3
unixware    1024       29.2        34.1
cygwin        256        6.7        15.8

"freebsd" is i386-unknown-freebsd4.7 with GCC 2.95.4.
"mingw" is i686-pc-mingw32 with GCC 3.2.
"unixware" is i586-sco-sysv5uw7.1.3 with vendor compiler version 4.1.
"cygwin" is i686-pc-cygwin with GCC 2.95.3.

GCC was run as 'gcc -O2 -Wall'.  (I also tried 'gcc -O3 -finline' but
that gave only minimally better results.)  The SCO compiler was run as
'cc -O -Kinine -v'.

Make of those results what you will, but the current cross-over point of
1024 seems very wrong.

--
Peter Eisentraut   peter_e@gmx.net


Re: inline newNode()

From
Peter Eisentraut
Date:
Bruce Momjian writes:

> I assume the GEQO results he is seeing is only for a tests, and that the
> macro version of newNode will help in all cases.

Well are we just assuming here or are we fixing actual problems based on
real analyses?

--
Peter Eisentraut   peter_e@gmx.net



Re: inline newNode()

From
Bruce Momjian
Date:
Peter Eisentraut wrote:
> Bruce Momjian writes:
>
> > I assume the GEQO results he is seeing is only for a tests, and that the
> > macro version of newNode will help in all cases.
>
> Well are we just assuming here or are we fixing actual problems based on
> real analyses?

What is your point?  He is testing GEQO, but I assume other places would
see a speedup from inlining newNode.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: inline newNode()

From
Peter Eisentraut
Date:
Bruce Momjian writes:

> Peter Eisentraut wrote:
> > Bruce Momjian writes:
> >
> > > I assume the GEQO results he is seeing is only for a tests, and that the
> > > macro version of newNode will help in all cases.
> >
> > Well are we just assuming here or are we fixing actual problems based on
> > real analyses?
>
> What is your point?  He is testing GEQO, but I assume other places would
> see a speedup from inlining newNode.

That's exactly my point: You're assuming.

Of course everything would be faster if we inline it, but IMHO that's the
compiler's job.

--
Peter Eisentraut   peter_e@gmx.net


Re: inline newNode()

From
Bruce Momjian
Date:
Peter Eisentraut wrote:
> Bruce Momjian writes:
>
> > Peter Eisentraut wrote:
> > > Bruce Momjian writes:
> > >
> > > > I assume the GEQO results he is seeing is only for a tests, and that the
> > > > macro version of newNode will help in all cases.
> > >
> > > Well are we just assuming here or are we fixing actual problems based on
> > > real analyses?
> >
> > What is your point?  He is testing GEQO, but I assume other places would
> > see a speedup from inlining newNode.
>
> That's exactly my point: You're assuming.

We know GEQO is faster, and the 8% is enough to justify the inlining.
Whether it speeds up other operations is assumed but we don't really
care;  the GEQO numbers are enough.  The code bloat is minimal.

> Of course everything would be faster if we inline it, but IMHO that's the
> compiler's job.

Yes, but we have to help it in certain cases, heap_getattr() being the
most dramatic.   Are you suggesting we remove all the macros in our
header file and replace them with function calls?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: inline newNode()

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> Neil Conway wrote:
> > BTW, regarding the newNode() stuff: so is it agreed that Bruce's patch
> > is a performance win without too high of a code bloat / uglification
> > penalty? If so, is it 7.3 or 7.4 material?
>
> Not sure.  It is a small patch but we normally don't do performance
> fixes during beta unless they are major.

This performance improvement isn't related to a specific user problem
report, so I am going to hold it for 7.4.  At that time we can also
research whether palloc0 should be used in other cases, like palloc
followed by MemSet.

The only issue there is that palloc0 can't optimize away constants used
in the macro so it may be better _not_ to make that change.  Neil, do
you have any performance numbers comparing MemSet with constant vs.
variable parameters, e.g:

    MemSet(ptr, 0, 256)

vs.

    i = 256;
    MemSet(ptr, 0, i)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: inline newNode()

From
Bruce Momjian
Date:
Peter Eisentraut wrote:
> Neil Conway writes:
>
> > MemSet(): ~9.6
> > memset(): ~19.5
> > __builtin_memset(): ~10.00
>
> I did my own tests with this code and the results vary wildly between
> platforms.  (I do not list __builtin_memset() because the results were
> invariably equal to memset().)
>
> Platform    buffer       memset()    MemSet()
>
> freebsd        32        5.3         4.9
> freebsd        256       23.3        24.2
> mingw        32        0.5         2.0
> mingw        256        6.6        10.5
> unixware    256       15.2        10.3
> unixware    1024       29.2        34.1
> cygwin        256        6.7        15.8
>
> "freebsd" is i386-unknown-freebsd4.7 with GCC 2.95.4.
> "mingw" is i686-pc-mingw32 with GCC 3.2.
> "unixware" is i586-sco-sysv5uw7.1.3 with vendor compiler version 4.1.
> "cygwin" is i686-pc-cygwin with GCC 2.95.3.
>
> GCC was run as 'gcc -O2 -Wall'.  (I also tried 'gcc -O3 -finline' but
> that gave only minimally better results.)  The SCO compiler was run as
> 'cc -O -Kinine -v'.
>
> Make of those results what you will, but the current cross-over point of
> 1024 seems very wrong.

No question 1024 looks wrong on a lot of platforms.  On Sparc, we were
seeing a crossover even higher, and on many like BSD/OS, the crossover
is much lower.  Without some platform-specific test, I don't know how we
are going to set this correctly.  Should we develop such a test?  Do we
have enough MemSet usage in the 256-4k range that people would see a
difference between different MEMSET_LOOP_LIMIT values?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073