Thread: profiling pgbench
I did some profiling of pgbench -j 36 -c 36 -T 500 banging on my
two-core desktop box - with synchronous_commit turned off to keep the
fsyncs from dominating the profile - and got these results:

29634     4.7124  postgres         base_yyparse
27906     4.4377  postgres         AllocSetAlloc
17751     2.8228  postgres         SearchCatCache
13029     2.0719  postgres         hash_search_with_hash_value
11253     1.7895  postgres         core_yylex
 9957     1.5834  libc-2.11.2.so   memcmp
 9237     1.4689  libc-2.11.2.so   __strcmp_sse2
 8628     1.3720  postgres         ScanKeywordLookup
 7984     1.2696  postgres         GetSnapshotData
 7144     1.1361  postgres         MemoryContextAllocZeroAligned
 6898     1.0969  postgres         XLogInsert
 6440     1.0241  postgres         LWLockAcquire

Full report and call graph attached.  The opannotate results are 2MB
bzip'd, so I'm not attaching those; send me an email off-list if you
want them.

I'd like to get access to a box with (a lot) more cores, to see
whether the lock stuff moves up in the profile.  A big chunk of that
hash_search_with_hash_value overhead is coming from
LockAcquireExtended.  The __strcmp_sse2 is almost entirely parsing
overhead.  In general, I'm not sure there's much hope for reducing the
parsing overhead, although ScanKeywordLookup() can certainly be done
better.  XLogInsert() is spending a lot of time doing CRCs.
LWLockAcquire() is dropping cycles in many different places.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I did some profiling of pgbench -j 36 -c 36 -T 500 banging on my
> two-core desktop box - with synchronous_commit turned off to keep the
> fsyncs from dominating the profile - and got these results:
> 29634 4.7124 postgres base_yyparse

Seems like pgbench is a poster child for the value of prepared
statements.  Have you tried it with "-M prepared"?

I'd take this with a grain of salt as to whether it's representative
of real applications, of course.

			regards, tom lane
On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
> I'd like to get access to a box with (a lot) more cores, to see
> whether the lock stuff moves up in the profile.  A big chunk of that
> hash_search_with_hash_value overhead is coming from
> LockAcquireExtended.  The __strcmp_sse2 is almost entirely parsing
> overhead.  In general, I'm not sure there's much hope for reducing the
> parsing overhead, although ScanKeywordLookup() can certainly be done
> better.  XLogInsert() is spending a lot of time doing CRCs.
> LWLockAcquire() is dropping cycles in many different places.

I can get you profiles of machines with up to 24 real cores;
unfortunately, I can't give access away.

Regarding CRCs:

I spent some time optimizing these, as you might remember.  The wall I
hit, benefit-wise, is that the individual CRC calls (four for a
non-indexed single-row insert on a table with one column inside a
transaction) are just too damn small to get more efficient.  They're
causing pipeline stalls all over... (21, 5, 1, 28 bytes).

I have a very preliminary patch calculating the CRC over the whole
thing in one go if it can do so (no switch, no xlog buffer
wraparound), but it's highly ugly, as it needs to read from the xlog
insert buffers and then reinsert the CRC at the correct position.
While it shows a noticeable improvement, that doesn't seem to be a
good way to go.  It could be made to work properly, though.

I played around with some ideas to do that more nicely, but none were
gratifying.

Regarding LWLockAcquire costs:

Yes, it's pretty noticeable - on lots of different workloads.  On a
bunch of production machines it's the second most expensive function
(behind XLogInsert); on some, the most expensive.  Most of the time

All of those machines are Nehalems, though, so the picture may be a
bit distorted.

Andres
On Wednesday 24 November 2010 22:14:04 Andres Freund wrote:
> On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
> > I'd like to get access to a box with (a lot) more cores, to see
> > whether the lock stuff moves up in the profile.  A big chunk of that
> > hash_search_with_hash_value overhead is coming from
> > LockAcquireExtended.  The __strcmp_sse2 is almost entirely parsing
> > overhead.  In general, I'm not sure there's much hope for reducing
> > the parsing overhead, although ScanKeywordLookup() can certainly be
> > done better.  XLogInsert() is spending a lot of time doing CRCs.
> > LWLockAcquire() is dropping cycles in many different places.
>
> I can get you profiles of machines with up to 24 real cores;
> unfortunately, I can't give access away.
>
> Regarding CRCs:
>
> I spent some time optimizing these, as you might remember.  The wall I
> hit, benefit-wise, is that the individual CRC calls (four for a
> non-indexed single-row insert on a table with one column inside a
> transaction) are just too damn small to get more efficient.  They're
> causing pipeline stalls all over... (21, 5, 1, 28 bytes).
>
> I have a very preliminary patch calculating the CRC over the whole
> thing in one go if it can do so (no switch, no xlog buffer
> wraparound), but it's highly ugly, as it needs to read from the xlog
> insert buffers and then reinsert the CRC at the correct position.
> While it shows a noticeable improvement, that doesn't seem to be a
> good way to go.  It could be made to work properly, though.
>
> I played around with some ideas to do that more nicely, but none were
> gratifying.
>
> Regarding LWLockAcquire costs:
>
> Yes, it's pretty noticeable - on lots of different workloads.  On a
> bunch of production machines it's the second most expensive function
> (behind XLogInsert); on some, the most expensive.  Most of the time

AllocSetAlloc is the third, battling with hash_search_with_hash_value.
To complete that sentence...

Andres
On Wed, Nov 24, 2010 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:
> On Wednesday 24 November 2010 22:14:04 Andres Freund wrote:
>> On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
>>
>> Regarding LWLockAcquire costs:
>> Yes, it's pretty noticeable - on lots of different workloads.  On a
>> bunch of production machines it's the second most expensive function
>> (behind XLogInsert); on some, the most expensive.  Most of the time
> AllocSetAlloc is the third, battling with hash_search_with_hash_value.
> To complete that sentence...

I've played a bit with hash_search_with_hash_value and found that most
of the time is spent on shared hash tables, not private ones.  And the
time attributed to it for the shared hash tables mostly seems to be
due to the time it takes to fight cache lines away from other CPUs.  I
suspect the same thing is true of LWLockAcquire.

Cheers,

Jeff
On Wed, Nov 24, 2010 at 5:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Nov 24, 2010 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:
>> On Wednesday 24 November 2010 22:14:04 Andres Freund wrote:
>>> On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
>>>
>>> Regarding LWLockAcquire costs:
>>> Yes, it's pretty noticeable - on lots of different workloads.  On a
>>> bunch of production machines it's the second most expensive function
>>> (behind XLogInsert); on some, the most expensive.  Most of the time
>>
>> AllocSetAlloc is the third, battling with hash_search_with_hash_value.
>> To complete that sentence...
>
> I've played a bit with hash_search_with_hash_value and found that most
> of the time is spent on shared hash tables, not private ones.  And the
> time attributed to it for the shared hash tables mostly seems to be
> due to the time it takes to fight cache lines away from other CPUs.  I
> suspect the same thing is true of LWLockAcquire.

How do you get stats on that?  How big is a typical cache line on
modern CPUs?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wednesday 24 November 2010 23:34:46 Robert Haas wrote:
> On Wed, Nov 24, 2010 at 5:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > I've played a bit with hash_search_with_hash_value and found that
> > most of the time is spent on shared hash tables, not private ones.
> > And the time attributed to it for the shared hash tables mostly
> > seems to be due to the time it takes to fight cache lines away from
> > other CPUs.  I suspect the same thing is true of LWLockAcquire.
>
> How do you get stats on that?

Btw, if you have some recent kernel I would try to use perf - I find
the event mappings easier to understand there, but maybe that's just
me jumping onto bandwagons.

> How big is a typical cache line on modern CPUs?

For the line size:

cat /sys/devices/system/cpu/cpu0/cache/index*/coherency_line_size

For the overall size:

cat /sys/devices/system/cpu/cpu0/cache/index*/size

Andres
Jeff Janes <jeff.janes@gmail.com> writes:
> I've played a bit with hash_search_with_hash_value and found that most
> of the time is spent on shared hash tables, not private ones.  And the
> time attributed to it for the shared hash tables mostly seems to be
> due to the time it takes to fight cache lines away from other CPUs.  I
> suspect the same thing is true of LWLockAcquire.

That squares with some behavior I've seen.  If you run opannotate you
often see ridiculously high time percentages attributed to extremely
trivial C statements.  The explanation seems to be that those places
are where chunks of memory are first touched, and have to be pulled
into the CPU's cache (and, if in shared memory, pulled away from some
other CPU).

			regards, tom lane
On Wed, Nov 24, 2010 at 2:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 24, 2010 at 5:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> I've played a bit with hash_search_with_hash_value and found that most
>> of the time is spent on shared hash tables, not private ones.  And the
>> time attributed to it for the shared hash tables mostly seems to be
>> due to the time it takes to fight cache lines away from other CPUs.  I
>> suspect the same thing is true of LWLockAcquire.
>
> How do you get stats on that?

I replicated the hash_search_with_hash_value function definition many
times under different names, and changed different parts of the code
to invoke different copies.  (I don't trust profilers to correctly
distribute times over call graphs.  I think most of them just assume
all function calls take the same time, and don't separately measure
the time of calls coming from different parts of the code.)

Then I took one of the copies that is heavily used and leaves the hash
table unchanged, and had its caller make the same call several times
in a row.  The profiling confirmed that it was called 3 times more
often, but the time spent in it increased by far less than 3 times; I
think the increase in time in that function was only 10% or so (and in
non-profiling code, the total run time did not increase by a
noticeable amount).  The only way I can explain this, other than the
redundant calls being optimized away (which the profiler's call counts
dispute), is caching effects.
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -95,7 +95,21 @@ BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
 	BufferLookupEnt *result;
 
 	result = (BufferLookupEnt *)
-		hash_search_with_hash_value(SharedBufHash,
+		hash_search_with_hash_value5(SharedBufHash,
+									(void *) tagPtr,
+									hashcode,
+									HASH_FIND,
+									NULL);
+
+	result = (BufferLookupEnt *)
+		hash_search_with_hash_value5(SharedBufHash,
+									(void *) tagPtr,
+									hashcode,
+									HASH_FIND,
+									NULL);
+
+	result = (BufferLookupEnt *)
+		hash_search_with_hash_value5(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
 									HASH_FIND,

> How big is a typical cache line on modern CPUs?

That I don't know.  I'm more of an experimentalist.

Cheers,

Jeff
On Nov 24, 2010, at 5:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> I've played a bit with hash_search_with_hash_value and found that most
>> of the time is spent on shared hash tables, not private ones.  And the
>> time attributed to it for the shared hash tables mostly seems to be
>> due to the time it takes to fight cache lines away from other CPUs.  I
>> suspect the same thing is true of LWLockAcquire.
>
> That squares with some behavior I've seen.  If you run opannotate
> you often see ridiculously high time percentages attributed to
> extremely trivial C statements.  The explanation seems to be that
> those places are where chunks of memory are first touched, and have
> to be pulled into the CPU's cache (and, if in shared memory, pulled
> away from some other CPU).

Does it hurt that, for example, the BufMappingLocks are consecutive in
memory?  They appear to be among the more heavily contended locks even
on my 2-core box.

...Robert