Thread: profiling pgbench
I did some profiling of pgbench -j 36 -c 36 -T 500 banging on my
two-core desktop box - with synchronous_commit turned off to keep the
fsyncs from dominating the profile - and got these results:

29634     4.7124  postgres         base_yyparse
27906     4.4377  postgres         AllocSetAlloc
17751     2.8228  postgres         SearchCatCache
13029     2.0719  postgres         hash_search_with_hash_value
11253     1.7895  postgres         core_yylex
 9957     1.5834  libc-2.11.2.so   memcmp
 9237     1.4689  libc-2.11.2.so   __strcmp_sse2
 8628     1.3720  postgres         ScanKeywordLookup
 7984     1.2696  postgres         GetSnapshotData
 7144     1.1361  postgres         MemoryContextAllocZeroAligned
 6898     1.0969  postgres         XLogInsert
 6440     1.0241  postgres         LWLockAcquire

Full report and call graph attached.  The opannotate results are 2MB
bzip'd, so I'm not attaching those; send me an email off-list if you
want them.

I'd like to get access to a box with (a lot) more cores, to see
whether the lock stuff moves up in the profile.  A big chunk of that
hash_search_with_hash_value overhead is coming from
LockAcquireExtended.  The __strcmp_sse2 is almost entirely parsing
overhead.  In general, I'm not sure there's much hope for reducing the
parsing overhead, although ScanKeywordLookup() can certainly be done
better.  XLogInsert() is spending a lot of time doing CRCs.
LWLockAcquire() is dropping cycles in many different places.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I did some profiling of pgbench -j 36 -c 36 -T 500 banging on my
> two-core desktop box - with synchronous_commit turned off to keep the
> fsyncs from dominating the profile - and got these results:
> 29634 4.7124 postgres base_yyparse

Seems like pgbench is a poster child for the value of prepared
statements.  Have you tried it with "-M prepared"?

I'd take this with a grain of salt as to whether it's representative
of real applications, of course.

			regards, tom lane
On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
> I'd like to get access to a box with (a lot) more cores, to see
> whether the lock stuff moves up in the profile.  A big chunk of that
> hash_search_with_hash_value overhead is coming from
> LockAcquireExtended.  The __strcmp_sse2 is almost entirely parsing
> overhead.  In general, I'm not sure there's much hope for reducing the
> parsing overhead, although ScanKeywordLookup() can certainly be done
> better.  XLogInsert() is spending a lot of time doing CRCs.
> LWLockAcquire() is dropping cycles in many different places.

I can get you profiles of machines with up to 24 real cores;
unfortunately, I can't give access away.

Regarding CRCs:

I spent some time optimizing these, as you might remember.  The wall I
hit, benefit-wise, is that the individual CRC calls (four for a
non-indexed single-row insert on a table with one column inside a
transaction) are just too damn small to get more efficient.  They're
causing pipeline stalls all over... (21, 5, 1, 28 bytes).

I have a very preliminary patch calculating the CRC over the whole
thing in one go if it can do so (no switch, no xlog buffer
wraparound), but it's highly ugly, as it needs to read from the xlog
insert buffers and then reinsert the CRC at the correct position.
While it shows a noticeable improvement, that doesn't seem to be a
good way to go.  It could be made to work properly, though.

I played around with some ideas to do that more nicely, but none were
gratifying.

Regarding LWLockAcquire costs:

Yes, it's pretty noticeable - on lots of different workloads.  On a
bunch of production machines it's the second most expensive function
(behind XLogInsert); on some, the most expensive.  Most of the time

All of those machines are Nehalems, though, so the picture may be a
bit distorted.

Andres
On Wednesday 24 November 2010 22:14:04 Andres Freund wrote:
> On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
> > I'd like to get access to a box with (a lot) more cores, to see
> > whether the lock stuff moves up in the profile.  A big chunk of that
> > hash_search_with_hash_value overhead is coming from
> > LockAcquireExtended.  The __strcmp_sse2 is almost entirely parsing
> > overhead.  In general, I'm not sure there's much hope for reducing
> > the parsing overhead, although ScanKeywordLookup() can certainly be
> > done better.  XLogInsert() is spending a lot of time doing CRCs.
> > LWLockAcquire() is dropping cycles in many different places.
>
> I can get you profiles of machines with up to 24 real cores;
> unfortunately, I can't give access away.
>
> Regarding CRCs:
>
> I spent some time optimizing these, as you might remember.  The wall I
> hit, benefit-wise, is that the individual CRC calls (four for a
> non-indexed single-row insert on a table with one column inside a
> transaction) are just too damn small to get more efficient.  They're
> causing pipeline stalls all over... (21, 5, 1, 28 bytes).
>
> I have a very preliminary patch calculating the CRC over the whole
> thing in one go if it can do so (no switch, no xlog buffer
> wraparound), but it's highly ugly, as it needs to read from the xlog
> insert buffers and then reinsert the CRC at the correct position.
> While it shows a noticeable improvement, that doesn't seem to be a
> good way to go.  It could be made to work properly, though.
>
> I played around with some ideas to do that more nicely, but none were
> gratifying.
>
> Regarding LWLockAcquire costs:
>
> Yes, it's pretty noticeable - on lots of different workloads.  On a
> bunch of production machines it's the second most expensive function
> (behind XLogInsert); on some, the most expensive.  Most of the time

AllocSetAlloc is the third, battling with hash_search_with_hash_value.
To complete that sentence...

Andres
On Wed, Nov 24, 2010 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:
> On Wednesday 24 November 2010 22:14:04 Andres Freund wrote:
>> On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
>>
>> Regarding LWLockAcquire costs:
>> Yes, it's pretty noticeable - on lots of different workloads.  On a
>> bunch of production machines it's the second most expensive function
>> (behind XLogInsert); on some, the most expensive.  Most of the time
> AllocSetAlloc is the third, battling with hash_search_with_hash_value.
> To complete that sentence...

I've played a bit with hash_search_with_hash_value and found that most
of the time is spent on shared hash tables, not private ones.  And the
time attributed to it for the shared hash tables mostly seems to be
due to the time it takes to fight cache lines away from other CPUs.  I
suspect the same thing is true of LWLockAcquire.

Cheers,

Jeff
On Wed, Nov 24, 2010 at 5:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Nov 24, 2010 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:
>> On Wednesday 24 November 2010 22:14:04 Andres Freund wrote:
>>> On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
>>>
>>> Regarding LWLockAcquire costs:
>>> Yes, it's pretty noticeable - on lots of different workloads.  On a
>>> bunch of production machines it's the second most expensive function
>>> (behind XLogInsert); on some, the most expensive.  Most of the time
>>
>> AllocSetAlloc is the third, battling with hash_search_with_hash_value.
>> To complete that sentence...
>
> I've played a bit with hash_search_with_hash_value and found that most
> of the time is spent on shared hash tables, not private ones.  And the
> time attributed to it for the shared hash tables mostly seems to be
> due to the time it takes to fight cache lines away from other CPUs.  I
> suspect the same thing is true of LWLockAcquire.

How do you get stats on that?  How big is a typical cache line on
modern CPUs?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wednesday 24 November 2010 23:34:46 Robert Haas wrote:
> On Wed, Nov 24, 2010 at 5:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > I've played a bit with hash_search_with_hash_value and found that
> > most of the time is spent on shared hash tables, not private ones.
> > And the time attributed to it for the shared hash tables mostly
> > seems to be due to the time it takes to fight cache lines away from
> > other CPUs.  I suspect the same thing is true of LWLockAcquire.
>
> How do you get stats on that?

Btw, if you have some recent kernel I would try to use perf - I find
the event mappings easier to understand there, but maybe that's just
me jumping onto bandwagons.

> How big is a typical cache line on modern CPUs?

For the line size:

cat /sys/devices/system/cpu/cpu0/cache/index*/coherency_line_size

For the overall size:

cat /sys/devices/system/cpu/cpu0/cache/index*/size

Andres
Jeff Janes <jeff.janes@gmail.com> writes:
> I've played a bit with hash_search_with_hash_value and found that most
> of the time is spent on shared hash tables, not private ones.  And the
> time attributed to it for the shared hash tables mostly seems to be
> due to the time it takes to fight cache lines away from other CPUs.  I
> suspect the same thing is true of LWLockAcquire.

That squares with some behavior I've seen.  If you run opannotate you
often see ridiculously high time percentages attributed to extremely
trivial C statements.  The explanation seems to be that those places
are where chunks of memory are first touched, and have to be pulled
into the CPU's cache (and, if in shared memory, pulled away from some
other CPU).

			regards, tom lane
On Wed, Nov 24, 2010 at 2:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 24, 2010 at 5:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> I've played a bit with hash_search_with_hash_value and found that most
>> of the time is spent on shared hash tables, not private ones.  And the
>> time attributed to it for the shared hash tables mostly seems to be
>> due to the time it takes to fight cache lines away from other CPUs.  I
>> suspect the same thing is true of LWLockAcquire.
>
> How do you get stats on that?

I replicated the hash_search_with_hash_value function definition many
times under different names, and changed different parts of the code
to invoke different copies.  (I don't trust profilers to correctly
distribute times over call graphs.  I think most of them just assume
all function calls take the same time, and don't separately measure
the time of calls coming from different parts of the code.)

Then I took one of the copies that is heavily used and leaves the hash
table unchanged, and had its caller make the same call several times
in a row.  The profiling confirmed that it was called 3 times more
often, but the time spent in it increased by far less than 3 times; I
think the increase in time in that function was only 10% or so (and in
non-profiling code, the total run time did not increase by a
noticeable amount).  The only way I can explain this, other than the
redundant calls being optimized away (which the profiler's call counts
dispute), is caching effects.
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -95,7 +95,21 @@ BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
 	BufferLookupEnt *result;
 
 	result = (BufferLookupEnt *)
-		hash_search_with_hash_value(SharedBufHash,
+		hash_search_with_hash_value5(SharedBufHash,
+									(void *) tagPtr,
+									hashcode,
+									HASH_FIND,
+									NULL);
+
+	result = (BufferLookupEnt *)
+		hash_search_with_hash_value5(SharedBufHash,
+									(void *) tagPtr,
+									hashcode,
+									HASH_FIND,
+									NULL);
+
+	result = (BufferLookupEnt *)
+		hash_search_with_hash_value5(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
 									HASH_FIND,

> How big is a typical cache line on modern CPUs?

That I don't know.  I'm more of an experimentalist.

Cheers,

Jeff
On Nov 24, 2010, at 5:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> I've played a bit with hash_search_with_hash_value and found that most
>> of the time is spent on shared hash tables, not private ones.  And the
>> time attributed to it for the shared hash tables mostly seems to be
>> due to the time it takes to fight cache lines away from other CPUs.  I
>> suspect the same thing is true of LWLockAcquire.
>
> That squares with some behavior I've seen.  If you run opannotate
> you often see ridiculously high time percentages attributed to
> extremely trivial C statements.  The explanation seems to be that
> those places are where chunks of memory are first touched, and have
> to be pulled into the CPU's cache (and, if in shared memory, pulled
> away from some other CPU).

Does it hurt that, for example, the BufMappingLocks are consecutive in
memory?  They appear to be among the more heavily contended locks even
on my 2-core box.

...Robert