Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh] - Mailing list pgsql-hackers
From: dg@illustra.com (David Gould)
Subject: Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh]
Msg-id: 9805122021.AA16043@hawk.illustra.com
In response to: Re: [PATCHES] Try again: S_LOCK reduced contentionh] (Bruce Momjian <maillist@candle.pha.pa.us>)
Responses: Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh]
List: pgsql-hackers
Bruce Momjian:
> David Gould:
> > This is always how micro-optimization goes, 1% and 2% gains here and there.
> > I am very familiar with it.
>
> I said 0.01 seconds at a time, not 1% at a time. At this point, a 0.01
> seconds savings is 12%, because the total test takes 0.08 seconds.

I think I may have been unclear here. I was merely attempting to agree that
optimization is a process of accumulating small gains.

> > Have you done call graph profile for this, or just flat profiling? I think
> > you may find the call graph (gprof) output revealing, although perhaps not
> > on this particular topic...
>
> I ran gprof. I did not look at the call graph, just the total number of
> calls. We have a very modular system, and the call overhead can get
> excessive. gprof shows tas() getting called far more than any other
> function. It shows it as 0.01 seconds, on a 0.08 second test! Now, I
> realize that gprof measurement is not perfect, but it certainly shows
> tas as being called a lot.

I agree the system is sometimes excessively layered, resulting in many
trivial calls. I agree tas() is called a lot. I am trying to understand
whether the overhead seen below is in the call itself, or in the actual
work of synchronization.
>    %  cumulative    self               self    total
>  time    seconds  seconds    calls  ms/call  ms/call  name
>  20.0       0.02     0.02                             mcount (463)
>  10.0       0.03     0.01     5288     0.00     0.00  _tas [31]
>  10.0       0.04     0.01     2368     0.00     0.00  _hash_search [30]
>  10.0       0.05     0.01     1631     0.01     0.02  _malloc [11]
>  10.0       0.06     0.01      101     0.10     0.10  _sbrk [35]
>  10.0       0.07     0.01       56     0.18     0.20  _heapgettup [25]
>  10.0       0.08     0.01        4     2.50     2.50  _write [32]
>  10.0       0.09     0.01        2     5.00     5.00  ___sysctl [33]
>  10.0       0.10     0.01        1    10.00    10.41  _cnfify [28]
>   0.0       0.10     0.00     1774     0.00     0.00  _call_hash [468]
>   0.0       0.10     0.00     1604     0.00     0.00  _tag_hash [469]
>   0.0       0.10     0.00     1380     0.00     0.00  _OrderedElemPush [470]
>   0.0       0.10     0.00     1380     0.00     0.00  _OrderedElemPushHead [471]
>   0.0       0.10     0.00     1380     0.00     0.00  _OrderedElemPushInto [472]
>   0.0       0.10     0.00     1375     0.00     0.02  _AllocSetAlloc [12]
>   0.0       0.10     0.00     1375     0.00     0.02  _MemoryContextAlloc [13]
>   0.0       0.10     0.00     1353     0.00     0.02  _palloc [14]
>   0.0       0.10     0.00     1322     0.00     0.01  _SpinAcquire [45]

I asked about the call graph for two reasons (the first of which is part of
another thread):

1) I would expect that the 1353 calls to palloc() are also responsible for:

   1375  _MemoryContextAlloc
   1375  _AllocSetAlloc
   1380  _OrderedElemPushInto
   1380  _OrderedElemPush

for a total of (1353 + 1375 + 1375 + 1380 + 1380) = 6863 calls (not including
the further 1631 _malloc and 101 _sbrk calls). I am curious why these calls
do not seem to show up in the cumulative time.

2) I wonder how fine the resolution of the profile is. I am assuming that all
the overhead of tas() comes from either:

 - the call overhead, or
 - the actual work done in tas().

Given that, I wonder whether the call overhead really is the major part; if
not, it could be that the bus/cache synchronization is the real overhead. As
a function, it is easy to identify tas(). As a macro it does not show up on
the profile, and its contribution to the overhead is distributed among all
the callers, which makes it less obvious on the profile. I was hoping the
call graph would help identify which was the case.
In any case, I will test with it as a macro and as a function. It may also be
instructive to make a dummy tas() that does nothing, to see whether the
overhead is in the actual synchronization or in the calling. I will test
this too.

My intent here is not to be argumentative. My current mental model is that
excess calls are unfortunate but not especially harmful and not usually worth
changing (sometimes, of course, an inline sequence allows the optimizer to do
something clever and makes more difference than expected). If this view is
incorrect, I would like to know and understand that so that I can adjust my
theory accordingly.

> I don't understand your hesitation. The code WAS inlined. It was
> inlined because gprof showed it as being called a lot. Most of them are
> ASM anyway, so what does it matter if it sits in a *.c or *.h file; an
> asm() call looks the same in a macro or in a function.

I do not feel strongly about this. I prefer the function on the grounds of
clarity and ease of maintenance and porting. But I would be happy to make it
a macro even if the performance difference is not significant.

> If it makes you feel better, put it in something called tas.h, and add
> it as an include in all the files that include s_lock.h, or have
> s_lock.h include tas.h.

I am fine with folding it all into s_lock.h. If you wish, I will do so.

> I am not looking around for 1% optimization. I am using gprof output to
> improve things that gprof shows need improving.

Please, I am not intending _any_ criticism. I have no disagreement with what
you are doing. I am glad to see us working on performance.

-dg

David Gould            dg@illustra.com            510.628.3783 or 510.305.9468
Informix Software  (No, really)          300 Lakeside Drive  Oakland, CA 94612
"My life has been full of wonderful moments -
 it's only later that they become embarrassing."   -- Gerhard Berger