Thread: RE: [HACKERS] Hi
Okay, here's my idea of how I will get into it:

1) Try to get my apps going. That is, I'll try to port my ORACLE
   applications to postgresql to have some programs to do testing.
   Therefore I need:
2) Add Linus' ecpg precompiler for embedded SQL to the dist.
3) Try to implement some speed-ups I encounter while playing with my apps.

I take it some of you have already run the backend against a profiler.
Could anyone please tell me which functions need work most?

Michael
--
Dr. Michael Meskes, Project-Manager    | topsystem Systemhaus GmbH
meskes@topsystem.de                    | Europark A2, Adenauerstr. 20
meskes@debian.org                      | 52146 Wuerselen
Go SF49ers! Go Rhein Fire!             | Tel: (+49) 2405/4670-44
Use Debian GNU/Linux!                  | Fax: (+49) 2405/4670-10

> -----Original Message-----
> From: Bruce Momjian [SMTP:maillist@candle.pha.pa.us]
> Sent: Wednesday, February 04, 1998 6:10 PM
> To: meskes@topsystem.de
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Hi
>
> > Hi,
> >
> > despite being chronically short of time I'd like to be involved with
> > some database development in free software. It seems to me that
> > postgresql is the most complete package out there and it also has
> > some nice features like object identity. I consider myself fairly
> > knowledgeable when it comes to databases (at least it brought me my
> > Ph.D.) so I think I could do some. But before I do so I'd like to ask
> > some questions about the status of development:
> >
> > 3) How would you describe the general goal of this project: Offer as
> > good a database system as you can create as free software? That is,
> > should it be able to handle huge amounts of data, or is it seen more
> > as a tool for a private person/small business?
>
> About a year ago, our goal was to make it more reliable. At this point,
> our goal is SQL compliance, speed, and features. See the TODO list in
> the distribution or on our web page for an idea of our open items.
>
> > 4) Are there areas/Which areas need help?
>
> What we really need are more people who can get a handle on the entire
> source tree, so we can implement some of the more complex SQL features,
> or improve some of the older code.
>
> --
> Bruce Momjian
> maillist@candle.pha.pa.us
> Okay, here's my idea of how I will get into it:
>
> 1) Try to get my apps going. That is, I'll try to port my ORACLE
>    applications to postgresql to have some programs to do testing.
>    Therefore I need:
> 2) Add Linus' ecpg precompiler for embedded SQL to the dist.
> 3) Try to implement some speed-ups I encounter while playing with my
>    apps.
>
> I take it some of you have already run the backend against a profiler.
> Could anyone please tell me which functions need work most?

I have already hit many of the speedups people have reported in the
profiler by changing the logic, or inlining part/all of the function.
I recommend running your own profile and seeing what it shows.

--
Bruce Momjian
maillist@candle.pha.pa.us
Meskes, Michael wrote:
>
> I take it some of you have already run the backend against a profiler.
> Could anyone please tell me which functions need work most?

SYSTEM: Linux 2.0.29, gcc 2.7.2
-------------------------------
[make distclean]
[CVSup'ed current source 1998-02-05]
[./configure]
[make]
[CUSTOM_LDFLAGS not settable in Makefile.custom ...]
[... setting LDFLAGS=-pg in Makefile.global]

--> fmgr.c: In function `fmgr_pl':
--> fmgr.c:46: `fmgr_pl_finfo' undeclared (first use this function)
[fix] cp backend/utils/fmgr.h backend/utils/fmgr/fmgr.h

[make install]
[Run initdb]
[Regression tests]
    strings      .. failed  [ok]
    lseg         .. failed  [new operator <= on lseg]
    horology     .. failed  [daylight savings error?]
    triggers     .. failed
    select_views .. failed

PROFILE OF REGRESSION TEST:
---------------------------
   %   cumulative    self               self    total
 time    seconds    seconds    calls  ms/call  ms/call  name
35.16      7.97       7.97                              mcount (profiler overhead)
 5.91      9.31       1.34      9924     0.14     0.20  heapgettup
 4.54     10.34       1.03    234597     0.00     0.01  hash_search
 2.12     10.82       0.48    151781     0.00     0.00  SpinAcquire
 2.03     11.28       0.46     46635     0.01     0.02  SearchSysCache
 1.63     11.65       0.37    171345     0.00     0.00  tag_hash
 1.46     11.98       0.33     20511     0.02     0.02  yylex
 1.41     12.30       0.32     31311     0.01     0.03  LockAcquire
 1.28     12.59       0.29      1522     0.19     0.61  yyparse
 1.24     12.87       0.28     41176     0.01     0.01  fmgr_isbuiltin
 1.01     13.10       0.23    201572     0.00     0.00  AllocSetAlloc
 0.84     13.29       0.19    186214     0.00     0.00  OrderedElemPop
 0.75     13.46       0.17     42896     0.00     0.00  nocachegetattr
 0.71     13.62       0.16     29869     0.01     0.04  LockRelease
 0.66     13.77       0.15    151781     0.00     0.00  SpinRelease
 0.66     13.92       0.15     88766     0.00     0.01  newNode
 0.66     14.07       0.15     52248     0.00     0.00  _bt_compare
 ... around 1850 more functions follow (0.66 - 0.0%)

heapgettup (called by, calling):
--------------------------------
index  % time    self  children    called           name
                 0.01      0.01        96/9924       heap_markpos [448]
                 0.21      0.11      1566/9924       heap_restrpos [102]
                 1.12      0.58      8262/9924       heap_getnext [20]
[14]     13.8    1.34      0.69      9924            heapgettup [14]
                 0.04      0.14      9924/9924       RelationGetBufferWithBuffer [148]
                 0.03      0.15      5642/5702       ReleaseAndReadBuffer [145]
                 0.10      0.00     26276/42896      nocachegetattr [158]
                 0.01      0.08      7111/9607       HeapTupleSatisfiesVisibility [185]
                 0.04      0.00    117785/126582     char16eq [339]
                 0.02      0.00    111941/111994     int4eq [427]
                 0.02      0.00    109647/112329     nameeq [426]
                 0.00      0.01      1770/31585      ReleaseBuffer [134]
                 0.01      0.00     14000/50173      PageGetMaxOffsetNumber [341]
                 0.01      0.00      5185/5433       chareq [537]
                 0.01      0.00     15566/90147      BufferGetBlock [311]
                 0.01      0.00     17336/240911     BufferIsValid [207]
                 0.00      0.00         4/1973       int2eq [575]
                 0.00      0.00      7412/7412       nextpage [1186]
                 0.00      0.00       951/1358       SetBufferCommitInfoNeedsSave [1253]
--------------------------------

The *complete* gprof output is here:
ftp://postgresql.org/pub/incoming/regression-profile-980205.gz   156k

/* m */
Interesting.  Nothing is jumping out at me.  Looks like we could try to
clean up heapgettup() to see if there is anything in there that can be
sped up.

None of the calls looks like it should be inlined.  Do you see any that
look good for inlining?

> Meskes, Michael wrote:
> >
> > I take it some of you have already run the backend against a profiler.
> > Could anyone please tell me which functions need work most?
>
> SYSTEM: Linux 2.0.29, gcc 2.7.2
>
> [... full regression-test profile quoted; see the previous message ...]
>
> The *complete* gprof output is here:
> ftp://postgresql.org/pub/incoming/regression-profile-980205.gz   156k
>
> /* m */

--
Bruce Momjian
maillist@candle.pha.pa.us
Bruce Momjian writes:
> Interesting.  Nothing is jumping out at me.  Looks like we could try to
> clean up heapgettup() to see if there is anything in there that can be
> sped up.
>
> None of the calls looks like it should be inlined.  Do you see any that
> look good for inlining?

Not exactly.  But to get my application running we have to do something
to speed it up.  This morning I started my program on a database with
15165 records in one table, something like 100 in a second table, plus 2
in a third.  Then my software tries to add records to the first table,
and for each record added it updates a record in the other two.  This
doesn't count the selects etc.

Anyway, the data isn't processed fast enough.  I tried adding about 600
records, which should be done in 5 minutes (the time in which the data
was sent), but our system needed almost 8 minutes to insert the data.
And this doesn't cause Oracle too much trouble.

So I guess there's need for some speed-up. :-)

Michael
--
Dr. Michael Meskes, Project-Manager    | topsystem Systemhaus GmbH
meskes@topsystem.de                    | Europark A2, Adenauerstr. 20
meskes@debian.org                      | 52146 Wuerselen
Go SF49ers! Go Rhein Fire!             | Tel: (+49) 2405/4670-44
Use Debian GNU/Linux!                  | Fax: (+49) 2405/4670-10
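For concreteness, the per-record pattern described above looks roughly like
the fragment below, in the same embedded-SQL style as the test program later
in the thread.  The table and column names are invented (the real schema is
not shown anywhere in the thread), and connection setup and host-variable
declarations are omitted:

    /* Hypothetical sketch of the workload: for every record inserted
     * into the main table, one row in each of two auxiliary tables is
     * updated.  With auto-commit behaviour, every statement below is
     * its own transaction and is forced to disk individually. */
    for (rec_id = 0; rec_id < 600; rec_id++)
    {
        sprintf(payload, "%ld", rec_id);

        exec sql insert into maintable(id, data) values (:rec_id, :payload);
        exec sql update counters set cnt = cnt + 1 where id = 1;
        exec sql update lastseen set last_id = :rec_id where id = 1;

        exec sql commit;    /* one commit (and one sync) per record */
    }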
Bruce Momjian wrote:
>
> Interesting.  Nothing is jumping out at me.  Looks like we could try to
> clean up heapgettup() to see if there is anything in there that can be
> sped up.
>
> None of the calls looks like it should be inlined.  Do you see any that
> look good for inlining?

ExecScan() seems to be the only func which calls SeqNext(), which in
turn accounts for 60% of the calls to heap_getnext(), which does 80% of
the calls to heapgettup().

- Put SeqNext() into ExecScan() to lower function call overhead?
  [minimal optim.]

- In heapgettup(), 50% is the func itself and 50% is called funcs.
  Top four CPU consumers:
      0.04    0.14     9924/9924      RelationGetBufferWithBuffer [148]
      0.03    0.15     5642/5702      ReleaseAndReadBuffer [145]
      0.10    0.00    26276/42896     nocachegetattr [158]
      0.01    0.08     7111/9607      HeapTupleSatisfiesVisibility [185]

  RelationGetBufferWithBuffer() seems to be called from here only.
  If so, inline.

- Looking at RelationGetBufferWithBuffer():
      0.00    0.10     4603/32354     ReadBuffer [55]
  ReadBuffer() is the biggest CPU consumer called by
  RelationGetBufferWithBuffer(). (55%)

  -> *** 97% of ReadBuffer() CPU time is in calling
     ReadBufferWithBufferLock().
  -> 85% of ReadBufferWithBufferLock() CPU time is in calling
     BufferAlloc().
  -> ReadBufferWithBufferLock() is the only func calling BufferAlloc().
  -> Conclusion: INLINE BufferAlloc().

- Looking at BufferAlloc():
      0.04    0.25    37974/37974     BufTableLookup [114]
      0.10    0.00    32340/151781    SpinAcquire [81]
      0.10    0.00    37470/40585     PinBuffer [209]
      0.08    0.00    38478/43799     RelationGetLRelId [234]
      0.04    0.00    37974/151781    SpinRelease [175]

  -> 40% of BufferAlloc() CPU time is in calling BufTableLookup().
  -> BufferAlloc() is the only func calling BufTableLookup().
  -> Conclusion: INLINE BufTableLookup().

- Looking at BufTableLookup():
  86% of CPU time is in calling hash_search().  The rest is own time.

- Looking at hash_search():
      0.13    0.41   179189/179189    call_hash [69]
      0.00    0.00        6/6         bucket_alloc [1084]
  -> Conclusion: INLINE call_hash() [and bucket_alloc()] into
     hash_search().

- Looking at call_hash():
      0.37    0.00   171345/171345    tag_hash [94]
      0.04    0.00     7844/7844      string_hash [348]
  -> Conclusion: INLINE tag_hash() [and string_hash()] into call_hash().
  -> Perhaps disk_hash() could be used in some way?  It is currently
     #ifdef'd away.
  -> Could we use a lookup table instead of doing hash calculations?
     Wouldn't that be much faster?

It looks to me as if there are too many levels of function calls.
Perhaps all functions which are called by only one other func should be
inlined?

Guesstimate:
This would speed up heapgettup() by 10% ???
Other functions would speed up too.

/* m */
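The kind of inlining proposed above is mechanical: a small function with
exactly one caller can be folded into a macro (or, with compilers that
support it, marked as a static inline function in the caller's file) so the
call disappears at compile time.  A minimal self-contained sketch of the
idea, using an invented hash-table struct rather than the backend's real
HTAB code:

    #include <stddef.h>

    /* Illustration only: an invented hash-table struct, not the backend's. */
    typedef struct HashTable
    {
        size_t        nbuckets;
        size_t        keysize;
        unsigned long (*hashfn)(const void *key, size_t keysize);
    } HashTable;

    /* Before: a helper with a single caller costs one extra function
     * call for every probe. */
    static unsigned long
    call_hash_fn(const HashTable *ht, const void *key)
    {
        return ht->hashfn(key, ht->keysize) % ht->nbuckets;
    }

    /* After: the same computation as a macro; the call overhead is gone.
     * Arguments are parenthesized to keep the expansion safe. */
    #define CALL_HASH_FN(ht, key) \
        ((ht)->hashfn((key), (ht)->keysize) % (ht)->nbuckets)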
> Not exactly.  But to get my application running we have to do something
> to speed it up.  This morning I started my program on a database with
> 15165 records in one table, something like 100 in a second table, plus
> 2 in a third.  Then my software tries to add records to the first
> table, and for each record added it updates a record in the other two.
> This doesn't count the selects etc.
>
> Anyway, the data isn't processed fast enough.  I tried adding about 600
> records, which should be done in 5 minutes (the time in which the data
> was sent), but our system needed almost 8 minutes to insert the data.
> And this doesn't cause Oracle too much trouble.
>
> So I guess there's need for some speed-up. :-)

I (and others) had done some benchmarking on simple inserts (6 months
ago?) and had concluded that the speed was similar to other commercial
systems (I was comparing against Ingres).  I recall getting ~50 TPS.

This was all before Bruce did his work on startup and runtime speeds.
You really think your performance is that far off?  You are doing
selects on the big table before inserting?  Do you have indices set up?
Our results were for inserts on a heap table, which has the least
overhead...

                                          - Tom
On Fri, 6 Feb 1998, Thomas G. Lockhart wrote:

> > Anyway, the data isn't processed fast enough.  I tried adding about
> > 600 records, which should be done in 5 minutes (the time in which the
> > data was sent), but our system needed almost 8 minutes to insert the
> > data.  And this doesn't cause Oracle too much trouble.
> >
> > So I guess there's need for some speed-up. :-)
>
> I (and others) had done some benchmarking on simple inserts (6 months
> ago?) and had concluded that the speed was similar to other commercial
> systems (I was comparing against Ingres).  I recall getting ~50 TPS.
>
> This was all before Bruce did his work on startup and runtime speeds.
> You really think your performance is that far off?  You are doing
> selects on the big table before inserting?  Do you have indices set up?
> Our results were for inserts on a heap table, which has the least
> overhead...

	Just curious, but do you have -F set to disable fsync()?  We
really really should disable that by default :(
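The -F flag matters because, without it, the backend forces each committed
change out to disk before acknowledging it.  The cost difference is easy to
see outside the backend with a small standalone test; the sketch below is
not backend code, just the system calls involved, with an invented file
name:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Append one "record" to a file.  With do_fsync set, every write
     * waits for the disk (roughly what each commit does without -F);
     * without it, the kernel is free to batch the writes. */
    static int
    append_record(const char *path, const char *rec, int do_fsync)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);

        if (fd < 0)
            return -1;
        if (write(fd, rec, strlen(rec)) < 0 ||
            (do_fsync && fsync(fd) < 0))   /* the expensive part */
        {
            close(fd);
            return -1;
        }
        return close(fd);
    }

    int
    main(void)
    {
        char rec[32];
        int  i;

        for (i = 0; i < 600; i++)
        {
            sprintf(rec, "record %d\n", i);
            append_record("/tmp/fsync-demo.dat", rec, 1);  /* time 1 vs 0 */
        }
        return 0;
    }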
The Hermit Hacker wrote:
>
> On Fri, 6 Feb 1998, Thomas G. Lockhart wrote:
>
> > I (and others) had done some benchmarking on simple inserts (6 months
> > ago?) and had concluded that the speed was similar to other commercial
> > systems (I was comparing against Ingres).  I recall getting ~50 TPS.
> >
> > This was all before Bruce did his work on startup and runtime speeds.
> > You really think your performance is that far off?  You are doing
> > selects on the big table before inserting?  Do you have indices set
> > up?  Our results were for inserts on a heap table, which has the
> > least overhead...

And did you use BEGIN/END?  Auto-commit is ON in postgres...

> 	Just curious, but do you have -F set to disable fsync()?  We
> really really should disable that by default :(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Oh no!

Vadim
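Vadim's point is that with auto-commit every INSERT is its own synced
transaction; wrapping the whole batch in one BEGIN/END pays that cost once.
A sketch of the change in the same embedded-SQL style as the test program
below, reusing its perftest table and db_error() convention; whether the
1998 precompiler already passed a bare BEGIN/COMMIT pair through like this
is an assumption here:

    /* One transaction around the whole batch instead of one per row. */
    exec sql begin;                /* assumption: BEGIN accepted as-is */
    if (SQLCODE) db_error("begin");

    for (i = 0; i < 600; i++)
    {
        sprintf(text, "%ld", i);
        exec sql insert into perftest(number, ascii) values (:i, :text);
        if (SQLCODE) db_error("insert");
    }

    exec sql commit;               /* one commit (one sync) for 600 rows */
    if (SQLCODE) db_error("commit");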
On Fri, 6 Feb 1998, Vadim B. Mikheev wrote:

> > Just curious, but do you have -F set to disable fsync()?  We
> > really really should disable that by default :(
>           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Oh no!

	Isn't that our standard recommended operating parameter anyway?
Oops, that's just our performance-improving recommendation... sorry... :)
> Bruce Momjian wrote:
> >
> > Interesting.  Nothing is jumping out at me.  Looks like we could try
> > to clean up heapgettup() to see if there is anything in there that
> > can be sped up.
> >
> > None of the calls looks like it should be inlined.  Do you see any
> > that look good for inlining?
>
> ExecScan() seems to be the only func which calls SeqNext(), which in
> turn accounts for 60% of the calls to heap_getnext(), which does 80% of
> the calls to heapgettup().

This is certainly the type of analysis that will help us.  I disagree on
inlining everything, because the code maintenance becomes a nightmare,
but we certainly could improve things.

Currently, if a function is declared static, it is only called from
within that file.  If there is only one call to that function in the
file, it could be inlined.  I suggest we only attack things like the
below, where we have a large amount of function call overhead.

Also, there are cases where inlining the 'top' of the function can be
really good.  For heap_getattr(), many cases were handled in the top
that just did a quick return.  In those cases, I took the top and made a
macro out of it, and only called the function if I needed real
processing.

This is a great help, and any patches would be good.  The other nice
thing about this is that it is pretty compartmentalized.  You just need
to check that the new code behaves like the old code, and you are safe.
This is great analysis of the problem, too.

> - Put SeqNext() into ExecScan() to lower function call overhead?
>   [minimal optim.]

Sure.  Just comment it so it is clear for readers, so it is not pulled
out into a separate function some day.

> - In heapgettup(), 50% is the func itself and 50% is called funcs.
>   Top four CPU consumers:
>       0.04    0.14     9924/9924      RelationGetBufferWithBuffer [148]
>       0.03    0.15     5642/5702      ReleaseAndReadBuffer [145]
>       0.10    0.00    26276/42896     nocachegetattr [158]
>       0.01    0.08     7111/9607      HeapTupleSatisfiesVisibility [185]
>
>   RelationGetBufferWithBuffer() seems to be called from here only.
>   If so, inline.

OK.

> - Looking at RelationGetBufferWithBuffer():
>       0.00    0.10     4603/32354     ReadBuffer [55]
>   ReadBuffer() is the biggest CPU consumer called by
>   RelationGetBufferWithBuffer(). (55%)
>
>   -> *** 97% of ReadBuffer() CPU time is in calling
>      ReadBufferWithBufferLock().
>   -> 85% of ReadBufferWithBufferLock() CPU time is in calling
>      BufferAlloc().
>   -> ReadBufferWithBufferLock() is the only func calling BufferAlloc().
>   -> Conclusion: INLINE BufferAlloc().

My only warning is that I have never inlined a function AND left another
copy of the function non-inlined.  The reason is that you are left with
two copies of the function, and that can get really confusing.  I don't
know how others feel about that, but the code is already so complicated,
I hesitate to start duplicating blocks of code.  Of course, if you make
it a macro, you have it in one place, and the preprocessor duplicates it
for you.  I have converted a function to a macro, and I have inlined
frequently-called functions.

> - Looking at BufferAlloc():
>       0.04    0.25    37974/37974     BufTableLookup [114]
>       0.10    0.00    32340/151781    SpinAcquire [81]
>       0.10    0.00    37470/40585     PinBuffer [209]
>       0.08    0.00    38478/43799     RelationGetLRelId [234]
>       0.04    0.00    37974/151781    SpinRelease [175]
>
>   -> 40% of BufferAlloc() CPU time is in calling BufTableLookup().
>   -> BufferAlloc() is the only func calling BufTableLookup().
>   -> Conclusion: INLINE BufTableLookup().
>
> - Looking at BufTableLookup():
>   86% of CPU time is in calling hash_search().  The rest is own time.
>
> - Looking at hash_search():
>       0.13    0.41   179189/179189    call_hash [69]
>       0.00    0.00        6/6         bucket_alloc [1084]
>   -> Conclusion: INLINE call_hash() [and bucket_alloc()] into
>      hash_search().
>
> - Looking at call_hash():
>       0.37    0.00   171345/171345    tag_hash [94]
>       0.04    0.00     7844/7844      string_hash [348]
>   -> Conclusion: INLINE tag_hash() [and string_hash()] into call_hash().
>   -> Perhaps disk_hash() could be used in some way?  It is currently
>      #ifdef'd away.
>   -> Could we use a lookup table instead of doing hash calculations?
>      Wouldn't that be much faster?
>
> It looks to me as if there are too many levels of function calls.
> Perhaps all functions which are called by only one other func should be
> inlined?
>
> Guesstimate:
> This would speed up heapgettup() by 10% ???
> Other functions would speed up too.

Makes sense.  If the functions are in the same source file, and are
declared static, doesn't the compiler inline some of them?  See my
heap_getattr() macro for a way, perhaps, to macro-ize some of these.
The macro looks much clearer than the old-style macros.

--
Bruce Momjian
maillist@candle.pha.pa.us
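The heap_getattr() approach described above — handle the cheap, common cases
in a macro and fall through to the function only when real work is needed —
follows the pattern sketched below.  The record type, field names, and
attribute-fetch example are invented for illustration; this is not the real
heap_getattr() code:

    #include <stddef.h>

    /* Illustration of the "fast path in a macro, slow path in a
     * function" pattern.  The layout and names are invented. */
    typedef struct Record
    {
        int         natts;
        const char *cached[8];     /* attributes already unpacked */
    } Record;

    /* Slow path: only reached when the value is not already cached. */
    static const char *
    record_getattr_slow(const Record *rec, int attno)
    {
        (void) rec;
        (void) attno;
        return "unpacked";         /* stand-in for the expensive work */
    }

    /* Fast path: bad attribute numbers and cached values never pay a
     * function call.  Note the argument is evaluated more than once,
     * so callers must not pass expressions with side effects. */
    #define record_getattr(rec, attno) \
        ((attno) < 0 || (attno) >= (rec)->natts ? NULL : \
         (rec)->cached[(attno)] != NULL ? (rec)->cached[(attno)] : \
         record_getattr_slow((rec), (attno)))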
> > I (and others) had done some benchmarking on simple inserts (6 months
> > ago?) and had concluded that the speed was similar to other commercial
> > systems (I was comparing against Ingres).  I recall getting ~50 TPS.
> >
> > This was all before Bruce did his work on startup and runtime speeds.
> > You really think your performance is that far off?  You are doing
> > selects on the big table before inserting?  Do you have indices set
> > up?  Our results were for inserts on a heap table, which has the
> > least overhead...
>
> Just curious, but do you have -F set to disable fsync()?  We
> really really should disable that by default :(

Agreed.  But maybe we should wait until we get pg_log syncing, so we
have 30-second reliability.  Don't know where that fits on Vadim's list.

--
Bruce Momjian
maillist@candle.pha.pa.us
The Hermit Hacker writes:
> Just curious, but do you have -F set to disable fsync()?  We
> really really should disable that by default :(

I tried with -F and it runs nicely.  No difference to see between
PostgreSQL and Oracle.  I just ran another test which includes table
creation, inserts and drop (source follows).  Here's the result:

Oracle 7.3.3.4.0:
    I needed 21 seconds and -345682 microseconds for this test

PostgreSQL without -F:
    I needed 152 seconds and -623545 microseconds for this test

PostgreSQL with -F:
    I needed 5 seconds and 84411 microseconds for this test

Wow!

Here's the source (yes, our precompiler can handle this kind of program
already :-)):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

exec sql include sqlca;

#define SQLCODE sqlca.sqlcode

void
db_error(char *msg)
{
	sqlca.sqlerrm.sqlerrmc[sqlca.sqlerrm.sqlerrml] = '\0';
	printf("%s: db error %s\n", msg, sqlca.sqlerrm.sqlerrmc);
	exit(1);
}

int
main()
{
	exec sql begin declare section;
	long	i;
	exec sql end declare section;
	struct timeval tvs, tve;

	gettimeofday(&tvs, NULL);

	exec sql connect 'mm';
	if (SQLCODE) db_error("connect");

	exec sql create table perftest(number int4, ascii char16);
	if (SQLCODE) db_error("create t");

	exec sql create unique index number on perftest(number);
	if (SQLCODE) db_error("create i");

	for (i = 0; i < 1407; i++)
	{
		exec sql begin declare section;
		char	text[16];
		exec sql end declare section;

		sprintf(text, "%ld", i);

		exec sql insert into perftest(number, ascii) values (:i, :text);
		if (SQLCODE) db_error("insert");

		exec sql commit;
		if (SQLCODE) db_error("commit");
	}

	exec sql drop index number;
	if (SQLCODE) db_error("drop i");

	exec sql drop table perftest;
	if (SQLCODE) db_error("drop t");

	exec sql commit;
	if (SQLCODE) db_error("commit");

	gettimeofday(&tve, NULL);
	printf("I needed %ld seconds and %ld microseconds for this test\n",
		   tve.tv_sec - tvs.tv_sec, tve.tv_usec - tvs.tv_usec);

	return 0;
}

Michael
--
Dr. Michael Meskes, Project-Manager    | topsystem Systemhaus GmbH
meskes@topsystem.de                    | Europark A2, Adenauerstr. 20
meskes@debian.org                      | 52146 Wuerselen
Go SF49ers! Go Rhein Fire!             | Tel: (+49) 2405/4670-44
Use Debian GNU/Linux!                  | Fax: (+49) 2405/4670-10
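The negative microsecond figures above ("21 seconds and -345682
microseconds") come from printing the tv_sec and tv_usec differences
separately without normalizing the borrow; the elapsed time is really
20.654318 s.  A small helper that folds the two fields into one normalized
duration, which a test program like this could use, might look as follows
(a standalone sketch, not part of the posted program):

    #include <stdio.h>
    #include <sys/time.h>

    /* Subtract two timevals and normalize, so the microsecond part is
     * always in [0, 999999]. */
    static void
    timeval_diff(const struct timeval *end, const struct timeval *start,
                 struct timeval *out)
    {
        out->tv_sec = end->tv_sec - start->tv_sec;
        out->tv_usec = end->tv_usec - start->tv_usec;
        if (out->tv_usec < 0)
        {
            out->tv_sec -= 1;
            out->tv_usec += 1000000;
        }
    }

    int
    main(void)
    {
        struct timeval tvs, tve, d;

        gettimeofday(&tvs, NULL);
        /* ... work being timed ... */
        gettimeofday(&tve, NULL);

        timeval_diff(&tve, &tvs, &d);
        printf("I needed %ld.%06ld seconds for this test\n",
               (long) d.tv_sec, (long) d.tv_usec);
        return 0;
    }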
> Here's the source (yes, our precompiler can handle this kind of program
> already :-)):

Looks nice!

                                          - Tom
> The Hermit Hacker writes:
> > Just curious, but do you have -F set to disable fsync()?  We
> > really really should disable that by default :(
>
> I tried with -F and it runs nicely.  No difference to see between
> PostgreSQL and Oracle.  I just ran another test which includes table
> creation, inserts and drop (source follows).  Here's the result:
>
> Oracle 7.3.3.4.0:
>     I needed 21 seconds and -345682 microseconds for this test
>
> PostgreSQL without -F:
>     I needed 152 seconds and -623545 microseconds for this test
>
> PostgreSQL with -F:
>     I needed 5 seconds and 84411 microseconds for this test
>
> Wow!

This is good news.  Is this with 6.3 or 6.2.1?

--
Bruce Momjian
maillist@candle.pha.pa.us
Can we go anywhere with this?

> ExecScan() seems to be the only func which calls SeqNext(), which in
> turn accounts for 60% of the calls to heap_getnext(), which does 80% of
> the calls to heapgettup().
>
> [... call-level analysis of heapgettup(), ReadBuffer(), BufferAlloc(),
>      BufTableLookup(), hash_search(), and call_hash() quoted in full in
>      the earlier message ...]
>
> It looks to me as if there are too many levels of function calls.
> Perhaps all functions which are called by only one other func should be
> inlined?
>
> Guesstimate:
> This would speed up heapgettup() by 10% ???
> Other functions would speed up too.
>
> /* m */

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
In src/test/regress:

It is time to remove the "PST8PDT" stuff from the files README and
regress.sh.

"set timezone to 'PST8PDT'" should be added to sql/datetime.sql.

/* m */
> It is time to remove the "PST8PDT" stuff from the files README and
> regress.sh.
>
> "set timezone to 'PST8PDT'" should be added to sql/datetime.sql.

Hmm.  It would have to be added in several regression tests, not just
the datetime one.  Also, having it isolated to the client configuration
in regress.sh gives one the opportunity to fix it or adjust it for your
platform.  I'm not sure what the README says about it; that may be
obsolete.

Is the PGTZ setting in regress.sh causing you trouble?

                                          - Tom
Bruce Momjian writes:
> Can we go anywhere with this?

Did anyone do anything with this already?

> > ExecScan() seems to be the only func which calls SeqNext(), which in
> > turn accounts for 60% of the calls to heap_getnext(), which does 80%
> > of the calls to heapgettup().
> >
> > [... detailed call-level analysis quoted in full earlier ...]
> >
> > It looks to me as if there are too many levels of function calls.
> > Perhaps all functions which are called by only one other func should
> > be inlined?

Isn't this a good solution?  A function only called by one other
function has its right to exist only for readability.  And this
optimization could be done automatically.

Michael
--
Dr. Michael Meskes, Project-Manager    | topsystem Systemhaus GmbH
meskes@topsystem.de                    | Europark A2, Adenauerstr. 20
meskes@debian.org                      | 52146 Wuerselen
Go SF49ers! Go Rhein Fire!             | Tel: (+49) 2405/4670-44
Use Debian GNU/Linux!                  | Fax: (+49) 2405/4670-10
> Bruce Momjian writes:
> > Can we go anywhere with this?
>
> Did anyone do anything with this already?

No one yet.

> > > It looks to me as if there are too many levels of function calls.
> > > Perhaps all functions which are called by only one other func
> > > should be inlined?
>
> Isn't this a good solution?  A function only called by one other
> function has its right to exist only for readability.  And this
> optimization could be done automatically.

It wouldn't be such a big deal if they were not called so many times.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
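The "called so many times" point is easy to quantify: the flat profiles in
this thread show small predicates like BufferIsValid() hit hundreds of
thousands of times per query, so even a few dozen nanoseconds of call
overhead each adds up.  A throwaway sketch for measuring that overhead on a
given machine (an invented check, not backend code; compile without
optimization or the compiler may inline the function by itself):

    #include <stdio.h>
    #include <sys/time.h>

    /* The same trivial predicate as a function and as a macro. */
    static int
    check_valid_fn(int x)
    {
        return x >= 0 && x < 1000;
    }

    #define CHECK_VALID(x) ((x) >= 0 && (x) < 1000)

    static double
    now_seconds(void)
    {
        struct timeval tv;

        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

    int
    main(void)
    {
        volatile int sink = 0;   /* keep the loops from being dropped */
        double       t0, t1, t2;
        long         i;

        t0 = now_seconds();
        for (i = 0; i < 10000000; i++)
            sink += check_valid_fn(i % 2000);
        t1 = now_seconds();
        for (i = 0; i < 10000000; i++)
            sink += CHECK_VALID(i % 2000);
        t2 = now_seconds();

        printf("function: %.3f s, macro: %.3f s (sink=%d)\n",
               t1 - t0, t2 - t1, sink);
        return 0;
    }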
> Bruce Momjian wrote:
> >
> > Interesting.  Nothing is jumping out at me.  Looks like we could try
> > to clean up heapgettup() to see if there is anything in there that
> > can be sped up.
>
> ExecScan() seems to be the only func which calls SeqNext(), which in
> turn accounts for 60% of the calls to heap_getnext(), which does 80% of
> the calls to heapgettup().
>
> [... detailed call-level analysis quoted in full earlier ...]
>
> It looks to me as if there are too many levels of function calls.
> Perhaps all functions which are called by only one other func should be
> inlined?
>
> Guesstimate:
> This would speed up heapgettup() by 10% ???
> Other functions would speed up too.

I have reviewed your findings, and have inlined 9 small functions that
are called for every row.  Unfortunately, several of the ones you
mention above are called via function pointers, so it was not possible
to inline them.

In my testing, I created a table with 160k rows, and did a SELECT that
would do a sequential scan of the table, and looked at the functions
that were called the most.

---------------------------------------------------------------------------
   %   cumulative    self               self    total
 time    seconds    seconds    calls  ms/call  ms/call  name
 54.3      2.65       2.65                              mcount (472)
 16.2      3.44       0.79      2999     0.26     0.26  _read [18]
  3.5      3.61       0.17    160148     0.00     0.01  _heapgettup [15]
  2.5      3.73       0.12    174007     0.00     0.00  _SpinAcquire [36]
  2.3      3.84       0.11    981732     0.00     0.00  _BufferIsValid [37]
  2.3      3.95       0.11    160148     0.00     0.01  _heap_getnext [14]
  2.0      4.05       0.10    160096     0.00     0.00  _ExecEvalVar [38]
  2.0      4.15       0.10    160096     0.00     0.00  _ExecMakeFunctionResult <cycle 3> [26]
  1.4      4.22       0.07    174007     0.00     0.00  _SpinRelease [41]
  1.2      4.28       0.06     18176     0.00     0.00  _hash_search [40]
  1.0      4.33       0.05    480288     0.00     0.00  _ExecEvalExpr <cycle 3> [34]
  1.0      4.38       0.05    167313     0.00     0.00  _RelationGetLRelId [47]
  1.0      4.43       0.05    160096     0.00     0.00  _ExecEvalFuncArgs <cycle 3> [48]
  0.8      4.47       0.04    160097     0.00     0.01  _SeqNext [13]
  0.8      4.51       0.04    160096     0.00     0.00  _ExecQual [22]
  0.6      4.54       0.03    160096     0.00     0.00  _ExecEvalOper <cycle 3> [51]
  0.4      4.56       0.02    163015     0.00     0.00  _IncrBufferRefCount [61]
  0.4      4.58       0.02    161342     0.00     0.00  _fmgr_c <cycle 5> [29]
  0.4      4.60       0.02    160451     0.00     0.00  _TransactionIdEquals [62]
  0.4      4.62       0.02    160148     0.00     0.00  _RelationGetBufferWithBuffer [28]
  0.4      4.64       0.02    160098     0.00     0.00  _TupIsNull [63]
  0.4      4.66       0.02    160097     0.00     0.00  _ExecIncrSlotBufferRefcnt [45]
  0.4      4.68       0.02    160096     0.00     0.00  _ExecQualClause [23]
  0.4      4.70       0.02      6161     0.00     0.00  _strcpy [64]
  0.4      4.72       0.02        31     0.65     0.65  _open [65]
  0.2      4.73       0.01    164288     0.00     0.00  _ReleaseBuffer [81]
  0.2      4.74       0.01    164048     0.00     0.00  _PageGetMaxOffsetNumber [97]
  0.2      4.75       0.01    160450     0.00     0.00  _HeapTupleSatisfiesVisibility [50]
  0.2      4.76       0.01    160449     0.00     0.00  _HeapTupleSatisfiesNow [98]
  0.2      4.77       0.01    160100     0.00     0.00  _ExecClearTuple [49]
  0.2      4.78       0.01    160098     0.00     0.00  _ExecStoreTuple [46]
  0.2      4.79       0.01    160096     0.00     0.00  _bpchareq [102]
  0.2      4.80       0.01     15696     0.00     0.00  _call_hash [99]
  0.2      4.81       0.01     12187     0.00     0.00  _malloc [96]
  0.2      4.82       0.01      9182     0.00     0.00  _PortalHeapMemoryAlloc [73]
  0.2      4.83       0.01      7257     0.00     0.00  _newNode [54]
  0.2      4.84       0.01      2973     0.00     0.20  _ReleaseAndReadBuffer [25]
  0.2      4.85       0.01      1241     0.01     0.02  _LockAcquire [58]
  0.2      4.86       0.01       189     0.05     0.05  _IsBootstrapProcessingMode [103]
  0.2      4.87       0.01         3     3.33     3.33  _shmdt [101]
  0.2      4.88       0.01         1    10.00  2172.84  _ExecScan [11]
  0.0      4.88       0.00    166214     0.00     0.00  _BufferGetBlock [72]
  0.0      4.88       0.00    160449     0.00     0.00  _heapisoverride [477]
  0.0      4.88       0.00     15303     0.00     0.00  _tag_hash [478]
  0.0      4.88       0.00     11718     0.00     0.00  _AllocSetContains [479]
  0.0      4.88       0.00     11718     0.00     0.00  _OrderedSetContains [480]
  0.0      4.88       0.00     11366     0.00     0.00  _OrderedElemPush [481]
  0.0      4.88       0.00     11366     0.00     0.00  _OrderedElemPushHead [482]
  0.0      4.88       0.00     11366     0.00     0.00  _OrderedElemPushInto [483]
  0.0      4.88       0.00     11361     0.00     0.00  _AllocSetAlloc [104]
  0.0      4.88       0.00     11361     0.00     0.00  _MemoryContextAlloc [67]
  0.0      4.88       0.00     11339     0.00     0.00  _palloc [68]
  0.0      4.88       0.00     10515     0.00     0.00  _free [484]
  0.0      4.88       0.00     10488     0.00     0.00  _AllocSetFree [485]
  0.0      4.88       0.00     10488     0.00     0.00  _OrderedElemPop [486]
  0.0      4.88       0.00      8176     0.00     0.00  _get_expr [487]
  0.0      4.88       0.00      7953     0.00     0.00  _AllocSetGetFirst [488]
  0.0      4.88       0.00      7953     0.00     0.00  _OrderedSetGetHead [489]
  0.0      4.88       0.00      7952     0.00     0.00  _OrderedElemGetBase [490]
  0.0      4.88       0.00      7424     0.00     0.00  _memmove [491]
---------------------------------------------------------------------------

Of these, I inlined:

    #define TransactionIdEquals(id1, id2) \
    #define TupIsNull(slot) \
    #define ExecIncrSlotBufferRefcnt(slot) \
    #define BufferIsValid(bufnum) \
    #define IncrBufferRefCount(buffer) \
    #define BufferGetBlock(buffer) \
    #define PageGetMaxOffsetNumber(page) \
    #define HeapTupleSatisfiesVisibility(tuple, seeself) \
    #define heapisoverride() \

When I say inlined, I mean I made macros of them.  I did not inline the
code into other functions, because you lose the modularity of the
function name, and many are called multiple times.  These were all small
functions, only a few lines of actual code.  BufferIsValid was called
several times per row.

With that said, the execution times did not change at all.  What I did
see is that the mcount row in the gprof output:

    54.3      2.65       2.65                            mcount (472)

decreased from 2.65 to 1.79, which means that less time was spent in
function calls.  Not sure why that time did not translate to better
overall performance, but the inlining is a clear win for these small
functions.  mcount is artificially high anyway because it is doing the
profiling.

The resulting gprof output shows a much smaller list of functions called
160k times.  A patch will be posted to the patches list, for any of you
who subscribe to that.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
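For reference, the kind of conversion described in that message — a tiny,
frequently-called predicate function turned into a macro — follows the
shape below.  The body is illustrative only; the stand-in types and the
test itself are invented, not the buffer manager's real BufferIsValid()
definition or the actual patch:

    /* Before: a small validity check called several times per row, so
     * the call overhead rivals the work done.  Buffer, InvalidBuffer,
     * and NBuffers are stand-ins for the real definitions. */
    typedef int Buffer;
    #define InvalidBuffer 0
    static int NBuffers = 64;

    static int
    BufferIsValidFn(Buffer bufnum)
    {
        return bufnum != InvalidBuffer && bufnum >= 1 && bufnum <= NBuffers;
    }

    /* After: the same test as a macro, so no call is made.  The argument
     * is referenced more than once, so callers must not pass expressions
     * with side effects. */
    #define BufferIsValid(bufnum) \
        ((bufnum) != InvalidBuffer && (bufnum) >= 1 && (bufnum) <= NBuffers)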