Thread: Re: [PATCHES] Try again: S_LOCK reduced contention

Re: [PATCHES] Try again: S_LOCK reduced contention

From: Bruce Momjian
>
> >
> > OK, here is my argument for inlining tas().
>
> I am going out of town till Monday, so don't have time to give this
> the thoughtful response it deserves. I will get back to you on this as
> I am not quite convinced, but obviously in the face of this I need to
> explain my reasoning.

Sure.  I just know that the reduction from 0.28 to 0.08 was performed
one 0.01 at a time.  See the MemSet() macro.  I am sure you will hate it
too, but it did reduce the number of calls to memset() and reduced
wallclock execution time as measured from the client.
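
Roughly, the idea is this (a sketch only, not the actual MemSet() macro;
the name and limit here are invented): small, aligned, zero-fill requests
get expanded inline as a word loop, and everything else falls back to
memset(), so the common case avoids one function call per use.

    /* Illustrative sketch only -- not the real MemSet() macro. */
    #include <string.h>

    #define MEMSET_LOOP_LIMIT 64

    #define MemSetSketch(start, val, len) \
        do { \
            if ((val) == 0 && (len) <= MEMSET_LOOP_LIMIT && \
                ((long) (start)) % sizeof(long) == 0 && \
                (len) % sizeof(long) == 0) \
            { \
                long *_p = (long *) (start); \
                long *_stop = (long *) ((char *) (start) + (len)); \
                while (_p < _stop)      /* inline word-at-a-time clear */ \
                    *_p++ = 0; \
            } \
            else \
                memset((start), (val), (len));  /* general case */ \
        } while (0)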

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [PATCHES] Try again: S_LOCK reduced contention

From: dg@illustra.com (David Gould)
Bruce Momjian:
> > > OK, here is my argument for inlining tas().
> David Gould:
> > I am going out of town till Monday, so don't have time to give this
> > the thoughtful response it deserves. I will get back to you on this as
> > I am not quite convinced, but obviously in the face of this I need to
> > explain my reasoning.
>
> Sure.  I just know that the reduction from 0.28 to 0.08 was performed
> one 0.01 at a time.  See the MemSet() macro.  I am sure you will hate it
> too, but it did reduce the number of calls to memset() and reduced
> wallclock execution time as measured from the client.

This is always how micro-optimization goes, 1% and 2% gains here and there.
I am very familiar with it.

Anyhow, I am just back, and it occurred to me to ask you for an exact
reproducible test setup, so that I can run the same thing and play around
with it a bit. I am not convinced by your result for a number of reasons,
but rather than just make assertions or speculations, I think I should do a
little data collection.

Also, if you have any other cases that you believe useful to test with, I
would be very happy to try those too.

Have you done call graph profile for this, or just flat profiling? I think
you may find the call graph (gprof) output revealing, although perhaps not
on this particular topic...

One last item, apropos startup time. Illustra uses what we call "Server
Caching". Basically when a connect is terminated, the backend instead of
exiting goes into a pool of idle servers. When next a connection comes in for
the same database, instead of forking and initing a new server, we merely
reuse the old one. This saves lots of startup time. However, there are some
problems in practice so we might want to do something "like this only
different". The idea that occurred to me is to have the postmaster
"pre-spawn" some servers in each (configurable) database. These would run
all the initialization and then just wait for a socket to be handed to them.
The postmaster would during idle time replenish the pool of ready servers.
I think this might have a lot more impact on startup time than turning things
into macros...

Thoughts?

-dg


David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong,
 and someone who knows less will correct me if I'm right."
               --David Palmer (palmer@tybalt.caltech.edu)

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From: Tom Lane
dg@illustra.com (David Gould) writes:
> The idea that occurred to me is to have the postmaster
> "pre-spawn" some servers in each (configurable) database. These would run
> all the initialization and then just wait for a socket to be handed to them.
> The postmaster would during idle time replenish the pool of ready servers.

Cool idea ... but how to get the socket passed off from postmaster to
back end, other than through a fork?

I think there is a facility in SYSV messaging to transmit a file
descriptor from one process to another, but that's not going to be a
portable answer.
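
For reference, the BSD-flavored mechanism is sendmsg() with SCM_RIGHTS over
a Unix-domain socket; a rough sketch follows (illustration only, receiving
side and error handling omitted, and portability across our platforms is
exactly the open question).

    /* Sketch only: hand an open descriptor to another process over a
     * Unix-domain socket using sendmsg()/SCM_RIGHTS. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static int
    send_fd(int channel, int fd_to_pass)
    {
        struct msghdr   msg;
        struct iovec    iov;
        struct cmsghdr *cmsg;
        char            cbuf[CMSG_SPACE(sizeof(int))];
        char            byte = 'F';

        memset(&msg, 0, sizeof(msg));
        iov.iov_base = &byte;           /* must carry at least one data byte */
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;         /* ancillary data holds the fd */
        msg.msg_controllen = sizeof(cbuf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        return sendmsg(channel, &msg, 0) < 0 ? -1 : 0;
    }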

            regards, tom lane

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From: Brett McCormick
same way that the current network socket is passed -- through an execv
argument.  hopefully, however, the non-execv()ing fork will be in 6.4.

does anyone have any suggestions for postmaster->backend variable
passing?  Should it just pass an argv array for compatibility reasons?
There will have to be some sort of arg parsing in any case,
considering that you can pass configurable arguments to the backend..

On Mon, 11 May 1998, at 10:49:11, Tom Lane wrote:

> Cool idea ... but how to get the socket passed off from postmaster to
> back end, other than through a fork?
>
> I think there is a facility in SYSV messaging to transmit a file
> descriptor from one process to another, but that's not going to be a
> portable answer.
>
>             regards, tom lane

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From: Tom Lane
Brett McCormick <brett@work.chicken.org> writes:
> same way that the current network socket is passed -- through an execv
> argument.  hopefully, however, the non-execv()ing fork will be in 6.4.

Um, you missed the point, Brett.  David was hoping to transfer a client
connection from the postmaster to an *already existing* backend process.
Fork, with or without exec, solves the problem for a backend that's
started after the postmaster has accepted the client socket.

This does lead to a different line of thought, however.  Pre-started
backends would have access to the "master" connection socket on which
the postmaster listens for client connections, right?  Suppose that we
fire the postmaster as postmaster, and demote it to being simply a
manufacturer of new backend processes as old ones get used up.  Have
one of the idle backend processes be the one doing the accept() on the
master socket.  Once it has a client connection, it performs the
authentication handshake and then starts serving the client (or just
quits if authentication fails).  Meanwhile the next idle backend process
has executed accept() on the master socket and is waiting for the next
client; and shortly the postmaster/factory/whateverwecallitnow notices
that it needs to start another backend to add to the idle-backend pool.

This'd probably need some interlocking among the backends.  I have no
idea whether it'd be safe to have all the idle backends trying to
do accept() on the master socket simultaneously, but it sounds risky.
Better to use a mutex so that only one gets to do it while the others
sleep.
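
Roughly like this (a sketch only; the lock routines are invented names, and
the lock itself would live in shared memory):

    /* Sketch only: idle backends take turns calling accept() on the master
     * socket, serialized by a lock in shared memory.  The lock primitives
     * here are invented names, not existing routines. */
    #include <stddef.h>
    #include <sys/socket.h>

    extern void IdleBackendLockAcquire(void);   /* hypothetical shared lock */
    extern void IdleBackendLockRelease(void);

    static int
    wait_for_client(int master_sock)
    {
        int client_sock;

        IdleBackendLockAcquire();       /* only one idle backend may accept */
        client_sock = accept(master_sock, NULL, NULL);
        IdleBackendLockRelease();       /* wake the next idle backend */

        return client_sock;             /* -1 on error, else connected socket */
    }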

            regards, tom lane

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From: Tom Lane
Meanwhile, *I* missed the point about Brett's second comment :-(

Brett McCormick <brett@work.chicken.org> writes:
> There will have to be some sort of arg parsing in any case,
> considering that you can pass configurable arguments to the backend..

If we do the sort of change David and I were just discussing, then the
pre-spawned backend would become responsible for parsing and dealing
with the PGOPTIONS portion of the client's connection request message.
That's just part of shifting the authentication handshake code from
postmaster to backend, so it shouldn't be too hard.

BUT: the whole point is to be able to initialize the backend before it
is connected to a client.  How much of the expensive backend startup
work depends on having the client connection options available?
Any work that needs to know the options will have to wait until after
the client connects.  If that means most of the startup work can't
happen in advance anyway, then we're out of luck; a pre-started backend
won't save enough time to be worth the effort.  (Unless we are willing
to eliminate or redefine the troublesome options...)

            regards, tom lane

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From: Brett McCormick
On Mon, 11 May 1998, at 11:14:43, Tom Lane wrote:

> Brett McCormick <brett@work.chicken.org> writes:
> > same way that the current network socket is passed -- through an execv
> > argument.  hopefully, however, the non-execv()ing fork will be in 6.4.
>
> Um, you missed the point, Brett.  David was hoping to transfer a client
> connection from the postmaster to an *already existing* backend process.
> Fork, with or without exec, solves the problem for a backend that's
> started after the postmaster has accepted the client socket.

That's what I get for jumping in on a thread I wasn't paying much
attention to begin with.

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From: dg@illustra.com (David Gould)
Tom Lane:
> Meanwhile, *I* missed the point about Brett's second comment :-(
>
> Brett McCormick <brett@work.chicken.org> writes:
> > There will have to be some sort of arg parsing in any case,
> > considering that you can pass configurable arguments to the backend..
>
> If we do the sort of change David and I were just discussing, then the
> pre-spawned backend would become responsible for parsing and dealing
> with the PGOPTIONS portion of the client's connection request message.
> That's just part of shifting the authentication handshake code from
> postmaster to backend, so it shouldn't be too hard.
>
> BUT: the whole point is to be able to initialize the backend before it
> is connected to a client.  How much of the expensive backend startup
> work depends on having the client connection options available?
> Any work that needs to know the options will have to wait until after
> the client connects.  If that means most of the startup work can't
> happen in advance anyway, then we're out of luck; a pre-started backend
> won't save enough time to be worth the effort.  (Unless we are willing
> to eliminate or redefine the troublesome options...)

I was thinking that we would have a pool of ready servers _per_database_.
That is, we would be able to configure say 8 servers in a particular DB, and
say 4 in another DB etc. These servers could run most of the way through
initialization (open catalogs, read in syscache etc). Then they would wait
until a connection for the desired DB was handed to them by the postmaster.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong,
 and someone who knows less will correct me if I'm right."
               --David Palmer (palmer@tybalt.caltech.edu)


Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From: Bruce Momjian
> I was thinking that we would have a pool of ready servers _per_database_.
> That is, we would be able to configure say 8 servers in a particular DB, and
> say 4 in another DB etc. These servers could run most of the way through
> initialization (open catalogs, read in syscache etc). Then they would wait
> until a connection for the desired DB was handed to them by the postmaster.
>

OK, but how do you invalidate the catalog items that have changed from
the startup to the time it gets the client connection?

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From: dg@illustra.com (David Gould)
Bruce Momjian:
> > I was thinking that we would have a pool of ready servers _per_database_.
> > That is, we would be able to configure say 8 servers in a particular DB, and
> > say 4 in another DB etc. These servers could run most of the way through
> > initialization (open catalogs, read in syscache etc). Then they would wait
> > until a connection for the desired DB was handed to them by the postmaster.
> >
>
> OK, but how do you invalidate the catalog items that have changed from
> the startup to the time it gets the client connection?

Same way we always do?

Is there any reason the "ready" servers can't track the Shared Invalidate
cache like any other backend? Maybe they have to time out their wait for a
socket every few seconds and process SI updates, but it should be possible
to make this work. Perhaps not as easy as all that, but certainly doable I
would guess.
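
Roughly like this (a sketch only; the SI-processing routine here is an
invented name for illustration):

    /* Sketch only: a ready server waits for a connection to be handed to
     * it, but times out every couple of seconds to absorb shared-
     * invalidation messages so its caches stay current. */
    #include <stddef.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    extern void ProcessSharedInvalidMessages(void);    /* hypothetical */

    static int
    wait_for_handoff(int handoff_sock)
    {
        fd_set          rfds;
        struct timeval  tv;

        for (;;)
        {
            FD_ZERO(&rfds);
            FD_SET(handoff_sock, &rfds);
            tv.tv_sec = 2;              /* wake every couple of seconds */
            tv.tv_usec = 0;

            if (select(handoff_sock + 1, &rfds, NULL, NULL, &tv) > 0)
                return 0;               /* a connection is ready for us */

            ProcessSharedInvalidMessages();     /* timed out: catch up on SI */
        }
    }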

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong,
 and someone who knows less will correct me if I'm right."
               --David Palmer (palmer@tybalt.caltech.edu)

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From: Tom Lane
>>>> I was thinking that we would have a pool of ready servers
>>>> _per_database_.  That is, we would be able to configure say 8
>>>> servers in a particular DB, and say 4 in another DB etc. These
>>>> servers could run most of the way through initialization (open
>>>> catalogs, read in syscache etc). Then they would wait until a
>>>> connection for the desired DB was handed to them by the postmaster.

What I'm wondering is just how much work will actually be saved by the
additional complexity.

We are already planning to get rid of the exec() of the backend, right,
and use only a fork() to spawn off the background process?  How much of
the startup work consists only of recreating state that is lost by exec?

In particular, I'd imagine that the postmaster process already has open
(or could have open) all the necessary files, shared memory, etc.
This state will be inherited automatically across the fork.

Taking this a little further, one could imagine the postmaster
maintaining the same shared state as any backend (tracking SI cache,
for example).  Then a forked copy should be Ready To Go with very
little work except processing the client option string.

However I can see a downside to this: bugs in the backend interaction
stuff would become likely to take down the postmaster along with the
backends.  The only thing that makes the postmaster more robust than
the backends is that it *isn't* doing as much as they do.

So probably the Apache-style solution (pre-started backends listen for
client connection requests) is the way to go if there is enough bang
for the buck to justify restructuring the postmaster/backend division
of labor.  Question is, how much will that buy that just getting rid
of exec() won't?

            regards, tom lane

Re: [PATCHES] Try again: S_LOCK reduced contention

From: Bruce Momjian
>
> Bruce Momjian:
> > > > OK, here is my argument for inlining tas().
> > David Gould:
> > > I am going out of town till Monday, so don't have time to give this
> > > the thoughtful response it deserves. I will get back to you on this as
> > > I am not quite convinced, but obviously in the face of this I need to
> > > explain my reasoning.
> >
> > Sure.  I just know that the reduction from 0.28 to 0.08 was performed
> > one 0.01 at a time.  See the MemSet() macro.  I am sure you will hate it
> > too, but it did reduce the number of calls to memset() and reduced
> > wallclock execution time as measured from the client.
>
> This is always how micro-optimization goes, 1% and 2% gains here and there.
> I am very familiar with it.

I said 0.01 seconds at a time, not 1% at a time.  At this point, a 0.01
seconds savings is 12%, because the total test takes 0.08 seconds.

>
> Anyhow, I am just back, and it occurred to me to ask you for an exact
> reproducible test setup, so that I can run the same thing and play around
> with it a bit. I am not convinced by your result for a number of reasons,
> but rather than just make assertions or speculations, I think I should do a
> little data collection.
>
> Also, if you have any other cases that you believe useful to test with, I
> would be very happy to try those too.
>
> Have you done call graph profile for this, or just flat profiling? I think
> you may find the call graph (gprof) output revealing, although perhaps not
> on this particular topic...

I ran gprof.  I did not look at the call graph, just the total number of
calls.  We have a very modular system, and the call overhead can get
excessive.  gprof shows tas() getting called far more than any other
function.  It shows it as 0.01 seconds, on a 0.08 second test!  Now, I
realize that gprof measurement is not perfect, but it certainly shows
tas as being called a lot.

The test is easy.  Execute this from psql:

    select * from pg_type where oid = 234234;

Compile with profiling, run this from psql, and run gprof on the
gmon.out file in pgsql/data/base/testdb.

I don't understand your hesitation.  The code WAS inlined.  It was
inlined because gprof showed it as being called a lot.  Most of them are
ASM anyway, so what does it matter if it sits in a *.c or *.h file; an
asm() call looks the same in a macro or in a function.
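
For example (a sketch only, i386/gcc flavor; the real s_lock.h covers many
more ports, and the names here are invented):

    /* Sketch only: an xchg-based test-and-set with gcc inline asm, once as
     * a small function and once as a macro body.  The instruction is the
     * same either way; inlining just drops the call/return overhead. */
    static int
    tas_fn(volatile char *lock)
    {
        char res = 1;

        __asm__ __volatile__("xchgb %0, %1"
                             : "=q" (res), "+m" (*lock)
                             : "0" (res)
                             : "memory");
        return (int) res;           /* nonzero => lock was already held */
    }

    /* Same asm wrapped in a macro; the caller supplies the result variable. */
    #define TAS_SKETCH(lock, res) \
        __asm__ __volatile__("xchgb %0, %1" \
                             : "=q" (res), "+m" (*(lock)) \
                             : "0" (res) \
                             : "memory")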

If it makes you feel better, put it in something called tas.h, and add
it as an include in all the files that include s_lock.h, or have
s_lock.h include tas.h.

I am not looking around for 1% optimization.  I am using gprof output to
improve things that gprof shows need improving.

>
> One last item, apropos startup time. Illustra uses what we call "Server
> Caching". Basically when a connect is terminated, the backend instead of
> exiting goes into a pool of idle servers. When next a connection comes in for
> the same database, instead of forking and initing a new server, we merely
> reuse the old one. This saves lots of startup time. However, there are some
> problems in practice so we might want to do something "like this only
> different". The idea that occurred to me is to have the postmaster
> "pre-spawn" some servers in each (configurable) database. These would run
> all the initialization and then just wait for a socket to be handed to them.
> The postmaster would during idle time replenish the pool of ready servers.
> I think this might have a lot more impact on startup time than turning things
> into macros...


Sure, it will have a lot more impact than making things into macros, and
I am all for it, but inlining does improve things, and it was a macro
that worked on all platforms before it was changed.  (Except
linux/alpha, which has to be a function.)

We have tons of macros already from Berkeley.  ctags makes the macros
just as easy for me to reference as functions.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From: Bruce Momjian
>
> >>>> I was thinking that we would have a pool of ready servers
> >>>> _per_database_.  That is, we would be able to configure say 8
> >>>> servers in a particular DB, and say 4 in another DB etc. These
> >>>> servers could run most of the way through initialization (open
> >>>> catalogs, read in syscache etc). Then they would wait until a
> >>>> connection for the desired DB was handed to them by the postmaster.
>
> What I'm wondering is just how much work will actually be saved by the
> additional complexity.
>
> We are already planning to get rid of the exec() of the backend, right,
> and use only a fork() to spawn off the background process?  How much of
> the startup work consists only of recreating state that is lost by exec?

Yes, exec() should be gone by 6.4.

>
> In particular, I'd imagine that the postmaster process already has open
> (or could have open) all the necessary files, shared memory, etc.
> This state will be inherited automatically across the fork.

Yes, removal of exec() will allow us some further optimization.  We can't
inherit the open() across fork() because the file descriptor contains an
offset, and if we shared them, then one fseek() would be seen by all
backends.
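
A tiny demonstration of that sharing (illustration only; any readable file
will do):

    /* Small demonstration that a descriptor opened before fork() shares its
     * file offset with the child: the child's read() moves the offset the
     * parent sees. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        char buf[4];
        int  fd = open("/etc/hosts", O_RDONLY);     /* any readable file */

        if (fork() == 0)
        {
            read(fd, buf, sizeof(buf));     /* child consumes 4 bytes ... */
            _exit(0);
        }
        wait(NULL);
        /* ... and the parent's offset has moved: prints 4, not 0. */
        printf("parent offset is now %ld\n", (long) lseek(fd, 0, SEEK_CUR));
        return 0;
    }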

>
> Taking this a little further, one could imagine the postmaster
> maintaining the same shared state as any backend (tracking SI cache,
> for example).  Then a forked copy should be Ready To Go with very
> little work except processing the client option string.
>
> However I can see a downside to this: bugs in the backend interaction
> stuff would become likely to take down the postmaster along with the
> backends.  The only thing that makes the postmaster more robust than
> the backends is that it *isn't* doing as much as they do.
>
> So probably the Apache-style solution (pre-started backends listen for
> client connection requests) is the way to go if there is enough bang
> for the buck to justify restructuring the postmaster/backend division
> of labor.  Question is, how much will that buy that just getting rid
> of exec() won't?

Not sure.  Removal of exec() takes 0.01 seconds off a 0.08 second test.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From: dg@illustra.com (David Gould)
Bruce Momjian:
> David Gould:
> > This is always how micro-optimization goes, 1% and 2% gains here and there.
> > I am very familiar with it.
>
> I said 0.01 seconds at a time, not 1% at a time.  At this point, a 0.01
> seconds savings is 12%, because the total test takes 0.08 seconds.

I think I may have been unclear here. I was merely attempting to agree that
optimization is a process of accumulating small gains.

> > Have you done call graph profile for this, or just flat profiling? I think
> > you may find the call graph (gprof) output revealing, although perhaps not
> > on this particular topic...
>
> I ran gprof.  I did not look at the call graph, just the total number of
> calls.  We have a very modular system, and the call overhead can get
> exessive.  gprof shows tas() getting called far more than any other
> function.  It shows it as 0.01 seconds, on a 0.08 second test!  Now, I
> realize that gprof measurement is not perfect, but it certainly shows
> tas as being called a lot.

I agree the system is sometimes excessively layered resulting in many
trivial calls.

I agree tas() is called a lot. I am trying to understand if the overhead seen
below is in the call itself, or in the actual work of synchronization.

>  %   cumulative   self              self     total
> time   seconds   seconds    calls  ms/call  ms/call  name
> 20.0       0.02     0.02                             mcount (463)
> 10.0       0.03     0.01     5288     0.00     0.00  _tas [31]
> 10.0       0.04     0.01     2368     0.00     0.00  _hash_search [30]
> 10.0       0.05     0.01     1631     0.01     0.02  _malloc [11]
> 10.0       0.06     0.01      101     0.10     0.10  _sbrk [35]
> 10.0       0.07     0.01       56     0.18     0.20  _heapgettup [25]
> 10.0       0.08     0.01        4     2.50     2.50  _write [32]
> 10.0       0.09     0.01        2     5.00     5.00  ___sysctl [33]
> 10.0       0.10     0.01        1    10.00    10.41  _cnfify [28]
>  0.0       0.10     0.00     1774     0.00     0.00  _call_hash [468]
>  0.0       0.10     0.00     1604     0.00     0.00  _tag_hash [469]
>  0.0       0.10     0.00     1380     0.00     0.00  _OrderedElemPush [470]
>  0.0       0.10     0.00     1380     0.00     0.00  _OrderedElemPushHead [471]
>  0.0       0.10     0.00     1380     0.00     0.00  _OrderedElemPushInto [472]
>  0.0       0.10     0.00     1375     0.00     0.02  _AllocSetAlloc [12]
>  0.0       0.10     0.00     1375     0.00     0.02  _MemoryContextAlloc [13]
>  0.0       0.10     0.00     1353     0.00     0.02  _palloc [14]
>  0.0       0.10     0.00     1322     0.00     0.01  _SpinAcquire [45]

I asked about the call graph for two reasons (the first of which is part
of another thread):

1) I would expect that the 1353 calls to palloc() are also responsible for:

     1375 _MemoryContextAlloc
     1375 _AllocSetAlloc
     1380 _OrderedElemPushInto
     1380 _OrderedElemPush
  for a total of (1353 + 1375 + 1375 + 1380 + 1380) = 6863 calls.
  (not including the further 1631 _malloc and 101 _sbrk calls).

  I am curious why these calls do not seem to show up on the cumulative time.

2) I wonder how fine the resolution of the profile is. I am assuming that all
   the overhead of tas comes from either:
    - the call overhead
    - or the actual work done in tas().
   Given that, I wonder if the call overhead is the major part, it could be
   that the bus/cache synchronization is the real overhead. As a function,
   it is easy to identify tas(). As a macro it does not show up on the
   profile and its contribution to the overhead is distributed among all the
   callers, which makes it less obvious on the profile. I was hoping the call
   graph would help identify which was the case.

In any case, I will test with it as a macro and as a function. It may also
be instructive to make a dummy tas() that does nothing and see whether that
shows the overhead to be in the actual synchronization or in the calling. I will
test this too.

My intent here is not to be argumentative. My current mental model is that
excess calls are unfortunate but not especially harmful and not usually
worth changing (sometimes of course an inline sequence allows the optimizer
to do something clever and makes more difference than expected). If this view
is incorrect, I would like to know and understand that so that I can adjust
my theory accordingly.

> I don't understand your hesitation.  The code WAS inlined.  It was
> inlined because gprof showed it as being called a lot.  Most of them are
> ASM anyway, so what does it matter if it sits in a *.c or *.h file; an
> asm() call looks the same in a macro or in a function.

I do not feel strongly about this. I prefer the function on the grounds of
clarity and ease of maintenance and porting. But I would be happy to make it
a macro even if the performance difference is not significant.

> If it makes you feel better, put it in something called tas.h, and add
> it as an include in all the files that include s_lock.h, or have
> s_lock.h include tas.h.

I am fine with folding it all into s_lock.h. If you wish, I will do so.

> I am not looking around for 1% optimization.  I am using gprof output to
> improve things that gprof shows need improving.

Please, I am not intending _any_ criticism. I have no disagreement with what
you are doing. I am glad to see us working on performance.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"My life has been full of wonderful moments -
  it's only later that they become embarrassing."
                                            -- Gerhard Berger

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From: Bruce Momjian
>
> Bruce Momjian:
> > David Gould:
> > > This is always how micro-optimization goes, 1% and 2% gains here and there.
> > > I am very familiar with it.
> >
> > I said 0.01 seconds at a time, not 1% at a time.  At this point, a 0.01
> > seconds savings is 12%, because the total test takes 0.08 seconds.
>
> I think I may have been unclear here. I was merely attempting to agree that
> optimization is a process of accumulating small gains.

True.

>
> > > Have you done call graph profile for this, or just flat profiling? I think
> > > you may find the call graph (gprof) output revealing, although perhaps not
> > > on this particular topic...
> >
> > I ran gprof.  I did not look at the call graph, just the total number of
> > calls.  We have a very modular system, and the call overhead can get
> > excessive.  gprof shows tas() getting called far more than any other
> > function.  It shows it as 0.01 seconds, on a 0.08 second test!  Now, I
> > realize that gprof measurement is not perfect, but it certainly shows
> > tas as being called a lot.
>
> I agree the system is sometimes excessively layered resulting in many
> trivial calls.
>
> I agree tas() is called a lot. I am trying to understand if the overhead seen
> below is in the call itself, or in the actual work of synchronization.

Yep.  Much of the actual call time is locked up in the mcount line.  That
is the profiler's own per-call bookkeeping at sampling time, I think.  So,
many times, inlining something that showed NO CPU time caused an
improvement because the mcount time went down.

My first attack was to reduce functions called for each column.  When
those were gone, I went after ones that were called for each row.  I am
going to post timing on sequential scans that I think you will find
interesting.


>
> >  %   cumulative   self              self     total
> > time   seconds   seconds    calls  ms/call  ms/call  name
> > 20.0       0.02     0.02                             mcount (463)
> > 10.0       0.03     0.01     5288     0.00     0.00  _tas [31]
> > 10.0       0.04     0.01     2368     0.00     0.00  _hash_search [30]
> > 10.0       0.05     0.01     1631     0.01     0.02  _malloc [11]
> > 10.0       0.06     0.01      101     0.10     0.10  _sbrk [35]
> > 10.0       0.07     0.01       56     0.18     0.20  _heapgettup [25]
> > 10.0       0.08     0.01        4     2.50     2.50  _write [32]
> > 10.0       0.09     0.01        2     5.00     5.00  ___sysctl [33]
> > 10.0       0.10     0.01        1    10.00    10.41  _cnfify [28]
> >  0.0       0.10     0.00     1774     0.00     0.00  _call_hash [468]
> >  0.0       0.10     0.00     1604     0.00     0.00  _tag_hash [469]
> >  0.0       0.10     0.00     1380     0.00     0.00  _OrderedElemPush [470]
> >  0.0       0.10     0.00     1380     0.00     0.00  _OrderedElemPushHead [471]
> >  0.0       0.10     0.00     1380     0.00     0.00  _OrderedElemPushInto [472]
> >  0.0       0.10     0.00     1375     0.00     0.02  _AllocSetAlloc [12]
> >  0.0       0.10     0.00     1375     0.00     0.02  _MemoryContextAlloc [13]
> >  0.0       0.10     0.00     1353     0.00     0.02  _palloc [14]
> >  0.0       0.10     0.00     1322     0.00     0.01  _SpinAcquire [45]
>
> I asked about the call graph for two reasons (the first of which is part
> of another thread):
>
> 1) I would expect that the 1353 calls to palloc() are also responsible for:
>
>      1375 _MemoryContextAlloc
>      1375 _AllocSetAlloc
>      1380 _OrderedElemPushInto
>      1380 _OrderedElemPush
>   for a total of (1353 + 1375 + 1375 + 1380 + 1380) = 6863 calls.
>   (not including the further 1631 _malloc and 101 _sbrk calls).
>
>   I am curious why these calls do not seem to show up on the cumulative time.


Not sure, but with such a quick test, the times are not significant.  It
is the number of calls that can get very large for a large table scan.
I look more for a pattern of calls, and for when certain handling causes a
lot of function call overhead.

For example, if the table has 180k rows and there are 180k calls to
the function, it is called once per row.  If there are 360k calls, it is
called twice per row.  I believe tas is called multiple times per row.


>
> 2) I wonder how fine the resolution of the profile is. I am assuming that all
>    the overhead of tas comes from either:
>     - the call overhead
>     - or the actual work done in tas().
>    Given that, I wonder if the call overhead is the major part, it could be
>    that the bus/cache synchronization is the real overhead. As a function,
>    it is easy to identify tas(). As a macro it does not show up on the
>    profile and its contribution to the overhead is distributed among all the
>    callers, which makes it less obvious on the profile. I was hoping the call
>    graph would help identify which was the case.

This is true.

>
> In any case, I will test with it as a macro and as a function. It may also
> be instructive to make a dummy tas() that does nothing and see whether that
> shows the overhead to be in the actual synchronization or in the calling. I will
> test this too.

Interesting, but again, with this type of test, I am only looking for
areas of slowness, not actual function duration times.  They are going
to be meaningless in a small test.

>
> My intent here is not to be argumentative. My current mental model is that
> excess calls are unfortunate but not especially harmful and not usually
> worth changing (sometimes of course an inline sequence allows the optimizer
> to do something clever and makes more difference than expected). If this view
> is incorrect, I would like to know and understand that so that I can adjust
> my theory accordingly.

I understand.  You want to know WHY it is improving performance.  Not
sure I can answer that.  I will say that because SQL databases are so
complicated, certain queries can generate very different call profiles,
so I have tried to find cases where the call path is generating call
traffic, and inline it if it is a very simple function.

I have inlined very complex functions, but only when they are called FOR
EVERY COLUMN.  These cases really cause a big win on call overhead.  See
include/access/heapam.h.  I am not proud of that code, but it makes a
large difference.
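
For flavor, a hypothetical per-column helper (not the actual heapam.h code;
names invented): the null test is a single bit operation, so as a function
almost all of its cost is the call itself.

    /* Hypothetical example, not the actual heapam.h code. */
    static int
    att_isnull_fn(int attnum, const unsigned char *nullbitmap)
    {
        /* bit attnum clear in the null bitmap => the column is null */
        return !(nullbitmap[attnum >> 3] & (1 << (attnum & 0x07)));
    }

    /* Same test as a macro: expanded inline at every per-column use. */
    #define ATT_ISNULL_SKETCH(attnum, bits) \
        (!((bits)[(attnum) >> 3] & (1 << ((attnum) & 0x07))))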

I could clearly see improvements in client timing by inlining functions
that were called a lot, so I continued to decrease the number of calls,
even when the timing did not show a decrease because the times had
become so small.

>
> > I don't understand your hesitation.  The code WAS inlined.  It was
> > inlined because gprof showed it as being called a lot.  Most of them are
> > ASM anyway, so what does it matter if it sits in a *.c or *.h file; an
> > asm() call looks the same in a macro or in a function.
>
> I do not feel strongly about this. I prefer the function on the grounds of
> clarity and ease of maintenance and porting. But I would be happy to make it
> a macro even if the performance difference is not significant.

I do not want to end up with most of our code in macros, nor to make the
code one big file.  But, for items called a lot, macros seem to make
sense, especially if the functions are small.

>
> > If it makes you feel better, put it in something called tas.h, and add
> > it as an include in all the files that include s_lock.h, or have
> > s_lock.h include tas.h.
>
> I am fine with folding it all into s_lock.h. If you wish, I will do so.
>
> > I am not looking around for 1% optimization.  I am using gprof output to
> > improve things that gprof shows need improving.
>
> Please, I am not intending _any_ criticism. I have no disagreement with what
> you are doing. I am glad to see us working on performance.

Let me know what you find in your testing.


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)