Thread: Re: [PATCHES] Try again: S_LOCK reduced contention
> > OK, here is my argument for inlining tas().
>
> I am going out of town till Monday, so don't have time to give this
> the thoughtful response it deserves. I will get back to you on this as
> I am not quite convinced, but obviously in the face of this I need to
> explain my reasoning.

Sure. I just know that the reduction from 0.28 to 0.08 was performed
one 0.01 at a time. See the MemSet() macro. I am sure you will hate it
too, but it did reduce the number of calls to memset() and reduced
wallclock execution time as measured from the client.

--
Bruce Momjian                      |  830 Blythe Avenue
maillist@candle.pha.pa.us          |  Drexel Hill, Pennsylvania 19026
  + If your life is a hard drive,  |  (610) 353-9879(w)
  + Christ can be your backup.     |  (610) 853-3000(h)
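For readers who have not seen it, the MemSet() macro Bruce mentions replaces a
memset() library call with an inline word-at-a-time loop when the region is
small, word-aligned, and being zeroed. A minimal sketch of the idea, assuming
the real macro in the PostgreSQL headers differs in its threshold and details:

    #include <stddef.h>
    #include <string.h>

    /*
     * Sketch of a MemSet()-style macro: for small, int-aligned regions being
     * zeroed, store word-at-a-time inline instead of paying for a memset()
     * call; otherwise fall back to the library routine.  The threshold and
     * the exact conditions are illustrative, not the real macro's.
     */
    #define MEMSET_LOOP_LIMIT 64

    #define MemSet(start, val, len) \
        do { \
            int    *_start = (int *) (start); \
            size_t  _len   = (len); \
            if ((val) == 0 && \
                _len <= MEMSET_LOOP_LIMIT && \
                ((size_t) _start) % sizeof(int) == 0 && \
                _len % sizeof(int) == 0) \
            { \
                size_t _i; \
                for (_i = 0; _i < _len / sizeof(int); _i++) \
                    _start[_i] = 0;     /* inline clear, no function call */ \
            } \
            else \
                memset((void *) (start), (val), _len); \
        } while (0)

The point is simply that for the many small, constant-size clears the backend
does, the short store loop is cheaper than a call into libc.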
Bruce Momjian:
> > > OK, here is my argument for inlining tas().
David Gould:
> > I am going out of town till Monday, so don't have time to give this
> > the thoughtful response it deserves. I will get back to you on this as
> > I am not quite convinced, but obviously in the face of this I need to
> > explain my reasoning.
>
> Sure. I just know that the reduction from 0.28 to 0.08 was performed
> one 0.01 at a time. See the MemSet() macro. I am sure you will hate it
> too, but it did reduce the number of calls to memset() and reduced
> wallclock execution time as measured from the client.

This is always how micro-optimization goes, 1% and 2% gains here and there.
I am very familiar with it.

Anyhow, I am just back, and it occurred to me to ask you for an exact
reproducible test setup, so that I can run the same thing and play around
with it a bit. I am not convinced by your result for a number of reasons,
but rather than just make assertions or speculations, I think I should do a
little data collection.

Also, if you have any other cases that you believe useful to test with, I
would be very happy to try those too.

Have you done a call graph profile for this, or just flat profiling? I think
you may find the call graph (gprof) output revealing, although perhaps not
on this particular topic...

One last item, apropos of startup time. Illustra uses what we call "Server
Caching". Basically, when a connection is terminated, the backend, instead
of exiting, goes into a pool of idle servers. When next a connection comes
in for the same database, instead of forking and initing a new server, we
merely reuse the old one. This saves lots of startup time. However, there
are some problems in practice, so we might want to do something "like this
only different". The idea that occurred to me is to have the postmaster
"pre-spawn" some servers in each (configurable) database. These would run
all the initialization and then just wait for a socket to be handed to them.
The postmaster would, during idle time, replenish the pool of ready servers.
I think this might have a lot more impact on startup time than turning
things into macros...

Thoughts?

-dg

David Gould            dg@illustra.com            510.628.3783 or 510.305.9468
Informix Software  (No, really)          300 Lakeside Drive  Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong,
 and someone who knows less will correct me if I'm right."
                          --David Palmer (palmer@tybalt.caltech.edu)
dg@illustra.com (David Gould) writes:
> The idea that occurred to me is to have the postmaster
> "pre-spawn" some servers in each (configurable) database. These would run
> all the initialization and then just wait for a socket to be handed to them.
> The postmaster would during idle time replenish the pool of ready servers.

Cool idea ... but how to get the socket passed off from postmaster to
back end, other than through a fork?

I think there is a facility in SYSV messaging to transmit a file
descriptor from one process to another, but that's not going to be a
portable answer.

			regards, tom lane
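One widely available mechanism for what Tom describes (though not universal,
which is his point) is sendmsg() with SCM_RIGHTS ancillary data over a
Unix-domain socket rather than the SYSV facility. A hedged sketch; send_fd()
and the "channel" descriptor are illustrative names, not anything in the
postmaster:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /*
     * Hand an open descriptor to another process over an AF_UNIX socket
     * (e.g. one half of a socketpair() shared with an idle backend).  The
     * receiver ends up with a new descriptor referring to the same client
     * connection.  Illustration only, not postmaster code.
     */
    int
    send_fd(int channel, int fd_to_pass)
    {
        struct msghdr   msg;
        struct iovec    iov;
        char            byte = 0;
        char            cbuf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr *cmsg;

        memset(&msg, 0, sizeof(msg));
        memset(cbuf, 0, sizeof(cbuf));

        iov.iov_base = &byte;           /* must carry at least one data byte */
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;   /* ancillary payload: a descriptor */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        return sendmsg(channel, &msg, 0) < 0 ? -1 : 0;
    }

The receiving process does the symmetric recvmsg() and finds a fresh
descriptor for the same connection in the ancillary data.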
Same way that the current network socket is passed -- through an execv
argument. Hopefully, however, the non-execv()ing fork will be in 6.4.

Does anyone have any suggestions for postmaster->backend variable passing?
Should it just pass an argv array for compatibility reasons? There will
have to be some sort of arg parsing in any case, considering that you can
pass configurable arguments to the backend..

On Mon, 11 May 1998, at 10:49:11, Tom Lane wrote:

> Cool idea ... but how to get the socket passed off from postmaster to
> back end, other than through a fork?
>
> I think there is a facility in SYSV messaging to transmit a file
> descriptor from one process to another, but that's not going to be a
> portable answer.
>
> 			regards, tom lane
Brett McCormick <brett@work.chicken.org> writes:
> same way that the current network socket is passed -- through an execv
> argument. hopefully, however, the non-execv()ing fork will be in 6.4.

Um, you missed the point, Brett. David was hoping to transfer a client
connection from the postmaster to an *already existing* backend process.
Fork, with or without exec, solves the problem for a backend that's
started after the postmaster has accepted the client socket.

This does lead to a different line of thought, however. Pre-started
backends would have access to the "master" connection socket on which
the postmaster listens for client connections, right? Suppose that we
fire the postmaster as postmaster, and demote it to being simply a
manufacturer of new backend processes as old ones get used up. Have one
of the idle backend processes be the one doing the accept() on the
master socket. Once it has a client connection, it performs the
authentication handshake and then starts serving the client (or just
quits if authentication fails). Meanwhile the next idle backend process
has executed accept() on the master socket and is waiting for the next
client; and shortly the postmaster/factory/whateverwecallitnow notices
that it needs to start another backend to add to the idle-backend pool.

This'd probably need some interlocking among the backends. I have no
idea whether it'd be safe to have all the idle backends trying to do
accept() on the master socket simultaneously, but it sounds risky.
Better to use a mutex so that only one gets to do it while the others
sleep.

			regards, tom lane
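A minimal sketch of the serialized accept() Tom suggests, in the style Apache
used: one lock around accept() so only one idle backend waits on the master
socket at a time. The fcntl() lock here is purely illustrative; a real
implementation would presumably reuse whatever lock primitive the backends
already share, and all names are hypothetical:

    #include <errno.h>
    #include <fcntl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int lock_fd;                 /* lock file, opened before fork() */

    static void
    accept_mutex_on(void)
    {
        struct flock lk;

        lk.l_type = F_WRLCK;
        lk.l_whence = SEEK_SET;
        lk.l_start = 0;
        lk.l_len = 0;
        while (fcntl(lock_fd, F_SETLKW, &lk) < 0 && errno == EINTR)
            continue;                   /* retry if interrupted by a signal */
    }

    static void
    accept_mutex_off(void)
    {
        struct flock lk;

        lk.l_type = F_UNLCK;
        lk.l_whence = SEEK_SET;
        lk.l_start = 0;
        lk.l_len = 0;
        fcntl(lock_fd, F_SETLK, &lk);
    }

    /*
     * Idle-backend loop: only the lock holder sits in accept(); the other
     * idle backends sleep in accept_mutex_on() until it is their turn.
     */
    static void
    idle_backend_loop(int master_sock)
    {
        for (;;)
        {
            int client;

            accept_mutex_on();
            client = accept(master_sock, NULL, NULL);
            accept_mutex_off();         /* let the next idle backend in */

            if (client < 0)
                continue;

            /* authentication handshake, then serve this client ... */
            return;                     /* this backend is no longer idle */
        }
    }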
Meanwhile, *I* missed the point about Brett's second comment :-(

Brett McCormick <brett@work.chicken.org> writes:
> There will have to be some sort of arg parsing in any case,
> considering that you can pass configurable arguments to the backend..

If we do the sort of change David and I were just discussing, then the
pre-spawned backend would become responsible for parsing and dealing
with the PGOPTIONS portion of the client's connection request message.
That's just part of shifting the authentication handshake code from
postmaster to backend, so it shouldn't be too hard.

BUT: the whole point is to be able to initialize the backend before it
is connected to a client. How much of the expensive backend startup
work depends on having the client connection options available?
Any work that needs to know the options will have to wait until after
the client connects. If that means most of the startup work can't
happen in advance anyway, then we're out of luck; a pre-started backend
won't save enough time to be worth the effort. (Unless we are willing
to eliminate or redefine the troublesome options...)

			regards, tom lane
On Mon, 11 May 1998, at 11:14:43, Tom Lane wrote:

> Brett McCormick <brett@work.chicken.org> writes:
> > same way that the current network socket is passed -- through an execv
> > argument. hopefully, however, the non-execv()ing fork will be in 6.4.
>
> Um, you missed the point, Brett. David was hoping to transfer a client
> connection from the postmaster to an *already existing* backend process.
> Fork, with or without exec, solves the problem for a backend that's
> started after the postmaster has accepted the client socket.

That's what I get for jumping in on a thread I wasn't paying much
attention to begin with.
Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention
From: dg@illustra.com (David Gould)
Tom Lane:
> Meanwhile, *I* missed the point about Brett's second comment :-(
>
> Brett McCormick <brett@work.chicken.org> writes:
> > There will have to be some sort of arg parsing in any case,
> > considering that you can pass configurable arguments to the backend..
>
> If we do the sort of change David and I were just discussing, then the
> pre-spawned backend would become responsible for parsing and dealing
> with the PGOPTIONS portion of the client's connection request message.
> That's just part of shifting the authentication handshake code from
> postmaster to backend, so it shouldn't be too hard.
>
> BUT: the whole point is to be able to initialize the backend before it
> is connected to a client. How much of the expensive backend startup
> work depends on having the client connection options available?
> Any work that needs to know the options will have to wait until after
> the client connects. If that means most of the startup work can't
> happen in advance anyway, then we're out of luck; a pre-started backend
> won't save enough time to be worth the effort. (Unless we are willing
> to eliminate or redefine the troublesome options...)

I was thinking that we would have a pool of ready servers _per_database_.
That is, we would be able to configure say 8 servers in a particular DB, and
say 4 in another DB etc. These servers could run most of the way through
initialization (open catalogs, read in syscache etc). Then they would wait
until a connection for the desired DB was handed to them by the postmaster.

-dg

David Gould            dg@illustra.com            510.628.3783 or 510.305.9468
Informix Software  (No, really)          300 Lakeside Drive  Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong,
 and someone who knows less will correct me if I'm right."
                          --David Palmer (palmer@tybalt.caltech.edu)
> I was thinking that we would have a pool of ready servers _per_database_.
> That is, we would be able to configure say 8 servers in a particular DB, and
> say 4 in another DB etc. These servers could run most of the way through
> initialization (open catalogs, read in syscache etc). Then they would wait
> until a connection for the desired DB was handed to them by the postmaster.

OK, but how do you invalidate the catalog items that have changed from
the startup to the time it gets the client connection?

--
Bruce Momjian                      |  830 Blythe Avenue
maillist@candle.pha.pa.us          |  Drexel Hill, Pennsylvania 19026
  + If your life is a hard drive,  |  (610) 353-9879(w)
  + Christ can be your backup.     |  (610) 853-3000(h)
Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention
From: dg@illustra.com (David Gould)
Bruce Momjian:
> > I was thinking that we would have a pool of ready servers _per_database_.
> > That is, we would be able to configure say 8 servers in a particular DB, and
> > say 4 in another DB etc. These servers could run most of the way through
> > initialization (open catalogs, read in syscache etc). Then they would wait
> > until a connection for the desired DB was handed to them by the postmaster.
>
> OK, but how do you invalidate the catalog items that have changed from
> the startup to the time it gets the client connection?

Same way we always do? Is there any reason the "ready" servers can't track
the Shared Invalidate cache like any other backend? Maybe they have to time
out their wait for a socket every few seconds and process SI updates, but it
should be possible to make this work.

Perhaps not as easy as all that, but certainly doable I would guess.

-dg

David Gould            dg@illustra.com            510.628.3783 or 510.305.9468
Informix Software  (No, really)          300 Lakeside Drive  Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong,
 and someone who knows less will correct me if I'm right."
                          --David Palmer (palmer@tybalt.caltech.edu)
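A sketch of the "ready server" loop David describes: block waiting for a
handed-off connection, but wake every couple of seconds to absorb shared-
invalidation messages so the preloaded caches stay current. The helper
functions here are hypothetical stand-ins, not actual backend routines:

    #include <sys/select.h>
    #include <sys/time.h>

    /* Hypothetical stand-ins for the real backend routines. */
    static void ProcessSharedInvalidation(void) { /* absorb the SI queue */ }
    static void ServeClient(int sock) { (void) sock; /* run the query loop */ }
    static int  ReceiveHandedOffSocket(int channel) { (void) channel; return -1; }

    void
    ReadyServerMain(int handoff_channel)
    {
        for (;;)
        {
            fd_set          rfds;
            struct timeval  tv;

            FD_ZERO(&rfds);
            FD_SET(handoff_channel, &rfds);
            tv.tv_sec = 2;              /* wake up every couple of seconds */
            tv.tv_usec = 0;

            if (select(handoff_channel + 1, &rfds, NULL, NULL, &tv) > 0)
            {
                int client = ReceiveHandedOffSocket(handoff_channel);

                if (client >= 0)
                {
                    ServeClient(client);    /* caches are already warm */
                    return;
                }
            }

            /* timed out: process pending SI messages before waiting again */
            ProcessSharedInvalidation();
        }
    }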
>>>> I was thinking that we would have a pool of ready servers
>>>> _per_database_. That is, we would be able to configure say 8
>>>> servers in a particular DB, and say 4 in another DB etc. These
>>>> servers could run most of the way through initialization (open
>>>> catalogs, read in syscache etc). Then they would wait until a
>>>> connection for the desired DB was handed to them by the postmaster.

What I'm wondering is just how much work will actually be saved by the
additional complexity.

We are already planning to get rid of the exec() of the backend, right,
and use only a fork() to spawn off the background process? How much of
the startup work consists only of recreating state that is lost by exec?

In particular, I'd imagine that the postmaster process already has open
(or could have open) all the necessary files, shared memory, etc.
This state will be inherited automatically across the fork.

Taking this a little further, one could imagine the postmaster
maintaining the same shared state as any backend (tracking SI cache,
for example). Then a forked copy should be Ready To Go with very
little work except processing the client option string.

However I can see a downside to this: bugs in the backend interaction
stuff would become likely to take down the postmaster along with the
backends. The only thing that makes the postmaster more robust than
the backends is that it *isn't* doing as much as they do.

So probably the Apache-style solution (pre-started backends listen for
client connection requests) is the way to go if there is enough bang
for the buck to justify restructuring the postmaster/backend division
of labor. Question is, how much will that buy that just getting rid
of exec() won't?

			regards, tom lane
> Bruce Momjian:
> > > > OK, here is my argument for inlining tas().
> David Gould:
> > > I am going out of town till Monday, so don't have time to give this
> > > the thoughtful response it deserves. I will get back to you on this as
> > > I am not quite convinced, but obviously in the face of this I need to
> > > explain my reasoning.
> >
> > Sure. I just know that the reduction from 0.28 to 0.08 was performed
> > one 0.01 at a time. See the MemSet() macro. I am sure you will hate it
> > too, but it did reduce the number of calls to memset() and reduced
> > wallclock execution time as measured from the client.
>
> This is always how micro-optimization goes, 1% and 2% gains here and there.
> I am very familiar with it.

I said 0.01 seconds at a time, not 1% at a time. At this point, a 0.01
seconds savings is 12%, because the total test takes 0.08 seconds.

> Anyhow, I am just back, and it occurred to me to ask you for an exact
> reproducible test setup, so that I can run the same thing and play around
> with it a bit. I am not convinced by your result for a number of reasons,
> but rather than just make assertions or speculations, I think I should do a
> little data collection.
>
> Also, if you have any other cases that you believe useful to test with, I
> would be very happy to try those too.
>
> Have you done a call graph profile for this, or just flat profiling? I think
> you may find the call graph (gprof) output revealing, although perhaps not
> on this particular topic...

I ran gprof. I did not look at the call graph, just the total number of
calls. We have a very modular system, and the call overhead can get
excessive. gprof shows tas() getting called far more than any other
function. It shows it as 0.01 seconds, on a 0.08 second test! Now, I
realize that gprof measurement is not perfect, but it certainly shows
tas as being called a lot.

The test is easy. Execute this from psql:

	select * from pg_type where oid = 234234;

Compile with profiling, run this from psql, and run gprof on the gmon.out
file in pgsql/data/base/testdb.

I don't understand your hesitation. The code WAS inlined. It was
inlined because gprof showed it as being called a lot. Most of them are
ASM anyway, so what does it matter if it sits in a *.c or *.h file; an
asm() call looks the same in a macro or in a function.

If it makes you feel better, put it in something called tas.h, and add
it as an include in all the files that include s_lock.h, or have
s_lock.h include tas.h.

I am not looking around for 1% optimization. I am using gprof output to
improve things that gprof shows need improving.

> One last item, apropos of startup time. Illustra uses what we call "Server
> Caching". Basically, when a connection is terminated, the backend, instead
> of exiting, goes into a pool of idle servers. When next a connection comes
> in for the same database, instead of forking and initing a new server, we
> merely reuse the old one. This saves lots of startup time. However, there
> are some problems in practice, so we might want to do something "like this
> only different". The idea that occurred to me is to have the postmaster
> "pre-spawn" some servers in each (configurable) database. These would run
> all the initialization and then just wait for a socket to be handed to them.
> The postmaster would, during idle time, replenish the pool of ready servers.
> I think this might have a lot more impact on startup time than turning
> things into macros...
Sure, it will have a lot more impact than making things into macros, and I
am all for it, but inlining does improve things, and it was a macro that
worked on all platforms before it was changed. (Except linux/alpha, which
has to be a function.) We have tons of macros already from Berkeley.
ctags makes the macros just as easy for me to reference as functions.

--
Bruce Momjian                      |  830 Blythe Avenue
maillist@candle.pha.pa.us          |  Drexel Hill, Pennsylvania 19026
  + If your life is a hard drive,  |  (610) 353-9879(w)
  + Christ can be your backup.     |  (610) 853-3000(h)
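For context, this is roughly what "inlining tas()" means in the i386/GCC
case; a sketch only, not the actual s_lock.h contents (the real code has many
per-platform variants, and linux/alpha indeed needs a function):

    /*
     * Test-and-set sketch for i386/GCC.  Returns nonzero if the lock was
     * already held.  The macro and the function emit the same instruction;
     * the function additionally pays call/return overhead and an mcount
     * hit when compiled with -pg.
     */
    typedef unsigned char slock_t;

    #define TAS(lock) \
        ({ \
            slock_t _res = 1; \
            __asm__ __volatile__("lock; xchgb %0, %1" \
                                 : "+q" (_res), "+m" (*(lock)) \
                                 : \
                                 : "memory", "cc"); \
            (int) _res; \
        })

    static int
    tas(volatile slock_t *lock)
    {
        slock_t res = 1;

        __asm__ __volatile__("lock; xchgb %0, %1"
                             : "+q" (res), "+m" (*lock)
                             :
                             : "memory", "cc");
        return (int) res;
    }

With the macro form, callers pay no call/return overhead and no mcount hit
under profiling; the locked exchange itself is identical either way.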
> >>>> I was thinking that we would have a pool of ready servers
> >>>> _per_database_. That is, we would be able to configure say 8
> >>>> servers in a particular DB, and say 4 in another DB etc. These
> >>>> servers could run most of the way through initialization (open
> >>>> catalogs, read in syscache etc). Then they would wait until a
> >>>> connection for the desired DB was handed to them by the postmaster.
>
> What I'm wondering is just how much work will actually be saved by the
> additional complexity.
>
> We are already planning to get rid of the exec() of the backend, right,
> and use only a fork() to spawn off the background process? How much of
> the startup work consists only of recreating state that is lost by exec?

Yes, exec() should be gone by 6.4.

> In particular, I'd imagine that the postmaster process already has open
> (or could have open) all the necessary files, shared memory, etc.
> This state will be inherited automatically across the fork.

Yes, removal of exec() will allow us some further optimization. We can't
inherit the open() across fork() because the file descriptor contains an
offset, and if we shared them, then one fseek() would be seen by all
backends.

> Taking this a little further, one could imagine the postmaster
> maintaining the same shared state as any backend (tracking SI cache,
> for example). Then a forked copy should be Ready To Go with very
> little work except processing the client option string.
>
> However I can see a downside to this: bugs in the backend interaction
> stuff would become likely to take down the postmaster along with the
> backends. The only thing that makes the postmaster more robust than
> the backends is that it *isn't* doing as much as they do.
>
> So probably the Apache-style solution (pre-started backends listen for
> client connection requests) is the way to go if there is enough bang
> for the buck to justify restructuring the postmaster/backend division
> of labor. Question is, how much will that buy that just getting rid
> of exec() won't?

Not sure. Removal of exec() takes 0.01 seconds off a 0.08 second test.

--
Bruce Momjian                      |  830 Blythe Avenue
maillist@candle.pha.pa.us          |  Drexel Hill, Pennsylvania 19026
  + If your life is a hard drive,  |  (610) 353-9879(w)
  + Christ can be your backup.     |  (610) 853-3000(h)
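Bruce's point about descriptors: a descriptor inherited across fork() shares
its file offset with the parent, so a seek in one backend would move the read
position for every other process holding the same inherited descriptor. A
small stand-alone demonstration (the file name here is arbitrary):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fd = open("/etc/hosts", O_RDONLY);  /* any existing file will do */

        if (fd < 0)
            return 1;

        if (fork() == 0)
        {
            lseek(fd, 100, SEEK_SET);   /* child moves the *shared* offset */
            _exit(0);
        }
        wait(NULL);

        /*
         * The parent now sees offset 100, not 0: the file position lives in
         * the shared open-file entry, not in the process.  This is why each
         * backend must open its own descriptors for the relation files.
         */
        printf("offset in parent after child's lseek: %ld\n",
               (long) lseek(fd, 0, SEEK_CUR));
        close(fd);
        return 0;
    }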
Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention
From: dg@illustra.com (David Gould)
Bruce Momjian:
> David Gould:
> > This is always how micro-optimization goes, 1% and 2% gains here and there.
> > I am very familiar with it.
>
> I said 0.01 seconds at a time, not 1% at a time. At this point, a 0.01
> seconds savings is 12%, because the total test takes 0.08 seconds.

I think I may have been unclear here. I was merely attempting to agree that
optimization is a process of accumulating small gains.

> > Have you done a call graph profile for this, or just flat profiling? I think
> > you may find the call graph (gprof) output revealing, although perhaps not
> > on this particular topic...
>
> I ran gprof. I did not look at the call graph, just the total number of
> calls. We have a very modular system, and the call overhead can get
> excessive. gprof shows tas() getting called far more than any other
> function. It shows it as 0.01 seconds, on a 0.08 second test! Now, I
> realize that gprof measurement is not perfect, but it certainly shows
> tas as being called a lot.

I agree the system is sometimes excessively layered resulting in many
trivial calls.

I agree tas() is called a lot. I am trying to understand if the overhead
seen below is in the call itself, or in the actual work of synchronization.

>   %   cumulative    self               self     total
>  time    seconds  seconds     calls  ms/call  ms/call  name
>  20.0       0.02     0.02                              mcount (463)
>  10.0       0.03     0.01      5288     0.00     0.00  _tas [31]
>  10.0       0.04     0.01      2368     0.00     0.00  _hash_search [30]
>  10.0       0.05     0.01      1631     0.01     0.02  _malloc [11]
>  10.0       0.06     0.01       101     0.10     0.10  _sbrk [35]
>  10.0       0.07     0.01        56     0.18     0.20  _heapgettup [25]
>  10.0       0.08     0.01         4     2.50     2.50  _write [32]
>  10.0       0.09     0.01         2     5.00     5.00  ___sysctl [33]
>  10.0       0.10     0.01         1    10.00    10.41  _cnfify [28]
>   0.0       0.10     0.00      1774     0.00     0.00  _call_hash [468]
>   0.0       0.10     0.00      1604     0.00     0.00  _tag_hash [469]
>   0.0       0.10     0.00      1380     0.00     0.00  _OrderedElemPush [470]
>   0.0       0.10     0.00      1380     0.00     0.00  _OrderedElemPushHead [471]
>   0.0       0.10     0.00      1380     0.00     0.00  _OrderedElemPushInto [472]
>   0.0       0.10     0.00      1375     0.00     0.02  _AllocSetAlloc [12]
>   0.0       0.10     0.00      1375     0.00     0.02  _MemoryContextAlloc [13]
>   0.0       0.10     0.00      1353     0.00     0.02  _palloc [14]
>   0.0       0.10     0.00      1322     0.00     0.01  _SpinAcquire [45]

I asked about the call graph for two reasons (the first of which is part
of another thread):

1) I would expect that the 1353 calls to palloc() are also responsible for:

      1375  _MemoryContextAlloc
      1375  _AllocSetAlloc
      1380  _OrderedElemPushInto
      1380  _OrderedElemPush

   for a total of (1353 + 1375 + 1375 + 1380 + 1380) = 6863 calls
   (not including the further 1631 _malloc and 101 _sbrk calls).

   I am curious why these calls do not seem to show up on the cumulative
   time.

2) I wonder how fine the resolution of the profile is. I am assuming that
   all the overhead of tas comes from either:
     - the call overhead
     - or the actual work done in tas().
   Given that, I wonder if the call overhead is the major part; it could be
   that the bus/cache synchronization is the real overhead. As a function,
   it is easy to identify tas(). As a macro it does not show up on the
   profile, and its contribution to the overhead is distributed among all
   the callers, which makes it less obvious on the profile. I was hoping
   the call graph would help identify which was the case.

In any case, I will test with it as a macro and as a function. It may also
be instructive to make a dummy tas() that does nothing, to see whether the
overhead is in the actual synchronization or in the calling. I will test
this too.

My intent here is not to be argumentative. My current mental model is that
excess calls are unfortunate but not especially harmful and not usually
worth changing (sometimes of course an inline sequence allows the optimizer
to do something clever and makes more difference than expected). If this
view is incorrect, I would like to know and understand that so that I can
adjust my theory accordingly.

> I don't understand your hesitation. The code WAS inlined. It was
> inlined because gprof showed it as being called a lot. Most of them are
> ASM anyway, so what does it matter if it sits in a *.c or *.h file; an
> asm() call looks the same in a macro or in a function.

I do not feel strongly about this. I prefer the function on the grounds of
clarity and ease of maintenance and porting. But I would be happy to make it
a macro even if the performance difference is not significant.

> If it makes you feel better, put it in something called tas.h, and add
> it as an include in all the files that include s_lock.h, or have
> s_lock.h include tas.h.

I am fine with folding it all into s_lock.h. If you wish, I will do so.

> I am not looking around for 1% optimization. I am using gprof output to
> improve things that gprof shows need improving.

Please, I am not intending _any_ criticism. I have no disagreement with what
you are doing. I am glad to see us working on performance.

-dg

David Gould            dg@illustra.com            510.628.3783 or 510.305.9468
Informix Software  (No, really)          300 Lakeside Drive  Oakland, CA 94612
"My life has been full of wonderful moments -
 it's only later that they become embarrassing."
                                             -- Gerhard Berger
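A sketch of the experiment David proposes: time a dummy tas() that does
nothing against one doing the real locked exchange, to separate plain call
overhead from the cost of the bus/cache synchronization itself. Compile
without optimization (or with -pg) so the calls are not optimized away;
counts and names are arbitrary:

    #include <stdio.h>
    #include <time.h>

    typedef unsigned char slock_t;

    static int
    tas_dummy(volatile slock_t *lock)
    {
        (void) lock;
        return 0;                   /* call overhead only, no locked xchg */
    }

    static int
    tas_real(volatile slock_t *lock)
    {
        slock_t res = 1;

        __asm__ __volatile__("lock; xchgb %0, %1"
                             : "+q" (res), "+m" (*lock)
                             :
                             : "memory", "cc");
        return (int) res;
    }

    int
    main(void)
    {
        volatile slock_t lock = 0;
        const long N = 10 * 1000 * 1000;
        clock_t t0, t1, t2;
        long    i;

        t0 = clock();
        for (i = 0; i < N; i++)
            tas_dummy(&lock);
        t1 = clock();
        for (i = 0; i < N; i++)
        {
            tas_real(&lock);
            lock = 0;               /* release so the next xchg "succeeds" */
        }
        t2 = clock();

        printf("dummy: %.3fs  real: %.3fs\n",
               (double) (t1 - t0) / CLOCKS_PER_SEC,
               (double) (t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }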
> Bruce Momjian:
> > David Gould:
> > > This is always how micro-optimization goes, 1% and 2% gains here and there.
> > > I am very familiar with it.
> >
> > I said 0.01 seconds at a time, not 1% at a time. At this point, a 0.01
> > seconds savings is 12%, because the total test takes 0.08 seconds.
>
> I think I may have been unclear here. I was merely attempting to agree that
> optimization is a process of accumulating small gains.

True.

> > > Have you done a call graph profile for this, or just flat profiling? I think
> > > you may find the call graph (gprof) output revealing, although perhaps not
> > > on this particular topic...
> >
> > I ran gprof. I did not look at the call graph, just the total number of
> > calls. We have a very modular system, and the call overhead can get
> > excessive. gprof shows tas() getting called far more than any other
> > function. It shows it as 0.01 seconds, on a 0.08 second test! Now, I
> > realize that gprof measurement is not perfect, but it certainly shows
> > tas as being called a lot.
>
> I agree the system is sometimes excessively layered resulting in many
> trivial calls.
>
> I agree tas() is called a lot. I am trying to understand if the overhead
> seen below is in the call itself, or in the actual work of synchronization.

Yep. Much of the actual call time is locked up in the mcount line. It says
it was counting functions at the time of sampling, I think. So, many times,
inlining something that showed NO cpu time caused an improvement because
the mcount time went down.

My first attack was to reduce functions called for each column. When those
were gone, I went after ones that were called for each row. I am going to
post timings on sequential scans that I think you will find interesting.

> > [flat gprof profile snipped -- quoted in full above]
>
> I asked about the call graph for two reasons (the first of which is part
> of another thread):
>
> 1) I would expect that the 1353 calls to palloc() are also responsible for:
>
>       1375  _MemoryContextAlloc
>       1375  _AllocSetAlloc
>       1380  _OrderedElemPushInto
>       1380  _OrderedElemPush
>
>    for a total of (1353 + 1375 + 1375 + 1380 + 1380) = 6863 calls
>    (not including the further 1631 _malloc and 101 _sbrk calls).
>
>    I am curious why these calls do not seem to show up on the cumulative
>    time.

Not sure, but with such a quick test, the times are not significant. It is
the number of calls that can get very large for a large table scan. I look
more for a pattern of calls, and when certain handling causes a lot of
function call overhead.

For example, if the table has 180k rows, and there are 180k calls to the
function, it is called once per row. If there are 360k calls, it is called
twice per row. I believe tas is called multiple times per row.

> 2) I wonder how fine the resolution of the profile is. I am assuming that
>    all the overhead of tas comes from either:
>      - the call overhead
>      - or the actual work done in tas().
>    Given that, I wonder if the call overhead is the major part; it could be
>    that the bus/cache synchronization is the real overhead. As a function,
>    it is easy to identify tas(). As a macro it does not show up on the
>    profile, and its contribution to the overhead is distributed among all
>    the callers, which makes it less obvious on the profile. I was hoping
>    the call graph would help identify which was the case.

This is true.

> In any case, I will test with it as a macro and as a function. It may also
> be instructive to make a dummy tas() that does nothing, to see whether the
> overhead is in the actual synchronization or in the calling. I will test
> this too.

Interesting, but again, with this type of test, I am only looking for areas
of slowness, not actual function duration times. They are going to be
meaningless in a small test.

> My intent here is not to be argumentative. My current mental model is that
> excess calls are unfortunate but not especially harmful and not usually
> worth changing (sometimes of course an inline sequence allows the optimizer
> to do something clever and makes more difference than expected). If this
> view is incorrect, I would like to know and understand that so that I can
> adjust my theory accordingly.

I understand. You want to know WHY it is improving performance. Not sure I
can answer that. I will say that because SQL databases are so complicated,
certain queries can generate very different call profiles, so I have tried
to find cases where the call path is generating call traffic, and inline it
if it is a very simple function.

I have inlined very complex functions, but only when they are called FOR
EVERY COLUMN. These cases really cause a big win on call overhead. See
include/access/heapam.h. I am not proud of that code, but it makes a large
difference.

I could clearly see improvements in client timing by inlining functions
that were called a lot, so I continued to decrease the number of calls,
even when the timing did not show a decrease because the times had become
so small.

> > I don't understand your hesitation. The code WAS inlined. It was
> > inlined because gprof showed it as being called a lot. Most of them are
> > ASM anyway, so what does it matter if it sits in a *.c or *.h file; an
> > asm() call looks the same in a macro or in a function.
>
> I do not feel strongly about this. I prefer the function on the grounds of
> clarity and ease of maintenance and porting. But I would be happy to make
> it a macro even if the performance difference is not significant.

I do not want to end up with most of our code in macros, nor to make the
code one big file. But, for items called a lot, macros seem to make sense,
especially if the functions are small.

> > If it makes you feel better, put it in something called tas.h, and add
> > it as an include in all the files that include s_lock.h, or have
> > s_lock.h include tas.h.
>
> I am fine with folding it all into s_lock.h. If you wish, I will do so.
>
> > I am not looking around for 1% optimization. I am using gprof output to
> > improve things that gprof shows need improving.
>
> Please, I am not intending _any_ criticism. I have no disagreement with
> what you are doing. I am glad to see us working on performance.

Let me know what you find in your testing.

--
Bruce Momjian                      |  830 Blythe Avenue
maillist@candle.pha.pa.us          |  Drexel Hill, Pennsylvania 19026
  + If your life is a hard drive,  |  (610) 353-9879(w)
  + Christ can be your backup.     |  (610) 853-3000(h)
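The heapam.h trick Bruce alludes to is the "fast path in a macro, slow path
in a function" pattern: the per-column common case is handled inline and only
the rare cases pay for a call. A generic, hypothetical sketch of the shape;
this is not the actual heapam.h code, and all names here are made up:

    /*
     * Per-column fetch with an inlined fast path.  The macro handles the
     * common case (no nulls, valid column number) with a direct array
     * access; everything else falls through to an out-of-line helper.
     */
    typedef struct Tuple
    {
        int     natts;          /* number of columns */
        char   *nullmap;        /* NULL => no null values in this tuple */
        long   *values;         /* pretend every column is a long */
    } Tuple;

    /* Out-of-line slow path (stub for illustration): the real thing would
     * handle nulls, variable-length columns, cache misses, and so on. */
    static long
    getattr_slow(Tuple *tup, int attnum)
    {
        (void) tup;
        (void) attnum;
        return 0;
    }

    #define getattr_fast(tup, attnum) \
        (((tup)->nullmap == NULL && (attnum) > 0 && (attnum) <= (tup)->natts) ? \
            (tup)->values[(attnum) - 1] : \
            getattr_slow((tup), (attnum)))

On a scan that touches every column of every row, the saved call per column
is exactly the kind of per-column overhead Bruce describes eliminating.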