Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh] - Mailing list pgsql-hackers
From: Bruce Momjian
Subject: Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh]
Msg-id: 199805122116.RAA14555@candle.pha.pa.us
In response to: Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh] (dg@illustra.com (David Gould))
List: pgsql-hackers
> > Bruce Momjian:
> > David Gould:
> > > This is always how micro-optimization goes, 1% and 2% gains here and there.
> > > I am very familiar with it.
> >
> > I said 0.01 seconds at a time, not 1% at a time.  At this point, a 0.01
> > seconds savings is 12%, because the total test takes 0.08 seconds.
>
> I think I may have been unclear here. I was merely attempting to agree that
> optimization is a process of accumulating small gains.

True.

> > > Have you done call graph profile for this, or just flat profiling? I think
> > > you may find the call graph (gprof) output revealing, although perhaps not
> > > on this particular topic...
> >
> > I ran gprof.  I did not look at the call graph, just the total number of
> > calls.  We have a very modular system, and the call overhead can get
> > excessive.  gprof shows tas() getting called far more than any other
> > function.  It shows it as 0.01 seconds, on a 0.08 second test!  Now, I
> > realize that gprof measurement is not perfect, but it certainly shows
> > tas as being called a lot.
>
> I agree the system is sometimes excessively layered resulting in many
> trivial calls.
>
> I agree tas() is called a lot. I am trying to understand if the overhead seen
> below is in the call itself, or in the actual work of synchronization.

Yep.  Much of the actual call time is locked up in the mcount line.  It
says it was counting functions at the time of sampling, I think.  So,
many times, inlining something that showed NO cpu time caused an
improvement, because the mcount time went down.

My first attack was to reduce functions called for each column.  When
those were gone, I went after ones that were called for each row.  I am
going to post timing on sequential scans that I think you will find
interesting.
> >    %  cumulative    self               self    total
> >  time    seconds  seconds    calls  ms/call  ms/call  name
> >  20.0       0.02     0.02                             mcount (463)
> >  10.0       0.03     0.01     5288     0.00     0.00  _tas [31]
> >  10.0       0.04     0.01     2368     0.00     0.00  _hash_search [30]
> >  10.0       0.05     0.01     1631     0.01     0.02  _malloc [11]
> >  10.0       0.06     0.01      101     0.10     0.10  _sbrk [35]
> >  10.0       0.07     0.01       56     0.18     0.20  _heapgettup [25]
> >  10.0       0.08     0.01        4     2.50     2.50  _write [32]
> >  10.0       0.09     0.01        2     5.00     5.00  ___sysctl [33]
> >  10.0       0.10     0.01        1    10.00    10.41  _cnfify [28]
> >   0.0       0.10     0.00     1774     0.00     0.00  _call_hash [468]
> >   0.0       0.10     0.00     1604     0.00     0.00  _tag_hash [469]
> >   0.0       0.10     0.00     1380     0.00     0.00  _OrderedElemPush [470]
> >   0.0       0.10     0.00     1380     0.00     0.00  _OrderedElemPushHead [471]
> >   0.0       0.10     0.00     1380     0.00     0.00  _OrderedElemPushInto [472]
> >   0.0       0.10     0.00     1375     0.00     0.02  _AllocSetAlloc [12]
> >   0.0       0.10     0.00     1375     0.00     0.02  _MemoryContextAlloc [13]
> >   0.0       0.10     0.00     1353     0.00     0.02  _palloc [14]
> >   0.0       0.10     0.00     1322     0.00     0.01  _SpinAcquire [45]
>
> I asked about the call graph for two reasons (the first of which is part
> of another thread):
>
> 1) I would expect that the 1353 calls to palloc() are also responsible for:
>
>       1375  _MemoryContextAlloc
>       1375  _AllocSetAlloc
>       1380  _OrderedElemPushInto
>       1380  _OrderedElemPush
>
>    for a total of (1353 + 1375 + 1375 + 1380 + 1380) = 6863 calls
>    (not including the further 1631 _malloc and 101 _sbrk calls).
>
> I am curious why these calls do not seem to show up on the cumulative time.

Not sure, but with such a quick test, the times are not significant.
It is the number of calls that can get very large for a large table
scan.  I look more for a pattern of calls, and for cases where certain
handling causes a lot of function call overhead.  For example, if the
table has 180k rows and there are 180k calls to the function, it is
called once per row.  If there are 360k calls, it is called twice per
row.  I believe tas is called multiple times per row.

> 2) I wonder how fine the resolution of the profile is.
> I am assuming that all the overhead of tas comes from either:
>   - the call overhead
>   - or the actual work done in tas().
> Given that, I wonder whether the call overhead is the major part; it could be
> that the bus/cache synchronization is the real overhead. As a function,
> it is easy to identify tas(). As a macro it does not show up on the
> profile, and its contribution to the overhead is distributed among all the
> callers, which makes it less obvious on the profile. I was hoping the call
> graph would help identify which was the case.

This is true.

> In any case, I will test with it as a macro and as a function. It may also
> be instructive to make a dummy tas() that does nothing, to see whether that
> shows the overhead to be in the actual synchronization, or in the calling.
> I will test this too.

Interesting, but again, with this type of test, I am only looking for
areas of slowness, not actual function duration times.  They are going
to be meaningless in a small test.

> My intent here is not to be argumentative. My current mental model is that
> excess calls are unfortunate but not especially harmful and not usually
> worth changing (sometimes of course an inline sequence allows the optimizer
> to do something clever and makes more difference than expected). If this view
> is incorrect, I would like to know and understand that so that I can adjust
> my theory accordingly.

I understand.  You want to know WHY it is improving performance.  Not
sure I can answer that.  I will say that because SQL databases are so
complicated, certain queries can generate very different call profiles,
so I have tried to find cases where the call path is generating call
traffic, and inline it if it is a very simple function.  I have inlined
very complex functions, but only when they are called FOR EVERY COLUMN.
These cases really cause a big win on call overhead.  See
include/access/heapam.h.  I am not proud of that code, but it makes a
large difference.
I could clearly see improvements in client timing by inlining functions
that were called a lot, so I continued to decrease the number of calls,
even when the timing did not show a decrease because the times had
become so small.

> > I don't understand your hesitation.  The code WAS inlined.  It was
> > inlined because gprof showed it as being called a lot.  Most of them
> > are ASM anyway, so what does it matter if it sits in a *.c or *.h
> > file?  An asm() call looks the same in a macro or in a function.
>
> I do not feel strongly about this. I prefer the function on the grounds of
> clarity and ease of maintenance and porting. But I would be happy to make it
> a macro even if the performance difference is not significant.

I do not want to end up with most of our code in macros, nor to make
the code one big file.  But for items called a lot, macros seem to make
sense, especially if the functions are small.

> > If it makes you feel better, put it in something called tas.h, and add
> > it as an include in all the files that include s_lock.h, or have
> > s_lock.h include tas.h.
>
> I am fine with folding it all into s_lock.h. If you wish, I will do so.
>
> > I am not looking around for 1% optimization.  I am using gprof output
> > to improve things that gprof shows need improving.
>
> Please, I am not intending _any_ criticism. I have no disagreement with what
> you are doing. I am glad to see us working on performance.

Let me know what you find in your testing.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)