Thread: tracking context switches with perf record

tracking context switches with perf record

From
Robert Haas
Date:
I've been playing around with "perf record" on the IBM POWER7 machine,
and it's pretty cool.  One of the things I don't like is that some of
the tools don't produce very nice text reports that you can cut and
paste into an email - it's kind of a text-based GUI where you can zoom
around and look at things.  Still, it's nice.
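
(The exact invocation isn't given in the thread; a context-switch profile with
call graphs, like the one discussed below, can be recorded with something along
these lines, where the PID and duration are placeholders rather than what was
actually used:)

    # record context-switch events with call graphs for one backend; use -a
    # instead of -p to capture every process on the machine
    perf record -e context-switches -g -p <postgres_pid> -- sleep 600
    # then browse the result (interactive TUI by default)
    perf report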

One of the neat things I've just discovered is that you can use a
call-graph report on context-switches to find out what user code is
resulting in context-switching.  For example, I ran a 32-client
pgbench test at scale factor 300 for half an hour and captured data
for 10 minutes relatively near the beginning.  The partially expanded
call tree for postgres looks like this:
     - 87.79% .schedule_timeout
        + 99.99% .unix_stream_recvmsg
     - 11.63% .SyS_semtimedop
          .SyS_ipc
          syscall_exit
          semop
        - PGSemaphoreLock
           - 98.48% LWLockAcquire
              - 53.11% XLogInsert
                 + 48.39% heap_update
                 + 18.15% heap_insert
                 + 16.36% RecordTransactionCommit
                 + 15.67% log_heap_clean
                 + 0.98% _bt_insertonpg
              + 18.71% ProcArrayEndTransaction
              - 14.77% SimpleLruReadPage_ReadOnly
                   TransactionIdGetStatus
                   TransactionLogFetch
                 - TransactionIdDidCommit
                    + 53.52% HeapTupleSatisfiesVacuum
                    + 45.00% HeapTupleSatisfiesMVCC
                    + 0.82% HeapTupleHeaderAdvanceLatestRemovedXid
              + 6.08% TransactionIdSetPageStatus
              + 2.63% LockBuffer
              + 1.92% GetSnapshotData
              + 1.51% SimpleLruReadPage
           + 1.52% ProcSleep
     + 0.50% do_work

So what you can see here is that all the context switches that
happened in postgres were the result of either a .schedule_timeout or
a .SyS_semtimedop.  87.79% of the context switches came from
.schedule_timeout, and 99.99% of those came from unix_stream_recvmsg.
If you expand that branch of the call tree, you find that all of them
are coming eventually from secure_read; the server is waiting for a
new query from the client.  This is, obviously, overhead we can't
eliminate from this test; waiting for the client is part of the job.
The other 11.63% of the context switches result from PGSemaphoreLock
calling semop.  Of those, 98.48% come from LWLockAcquire and 1.52%
come from ProcSleep.  The ProcSleep calls are all from blocking while
attempting to acquire a heavyweight lock; and nearly all of them are
from conflicting tuple locks.  However, they account for such a small
portion of the overall context switches that they're probably not
worth worrying about.

The more interesting waits, in my view anyway, are the ones that come
from LWLockAcquire, which account for nearly all of the semaphore
sleeps.  As you can see, XLogInsert accounts for over half of those,
and ProcArrayEndTransaction and SimpleLruReadPage_ReadOnly account for
most of the rest.  Just out of curiosity, I expanded a couple of those
another level, and you can see what's going on there, too.  I was
slightly surprised by the number of calls to log_heap_clean, since I
wouldn't have expected vacuuming to generate that much work; it turns
out that 99.9% of those are HOT pruning events happening during query
execution.
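
(That split wasn't derived with the commands shown here; one crude way to check
it after the fact is to dump the recorded call chains with perf script and count
how many stacks go through the HOT-pruning entry point rather than the vacuum
path. The function names below come from the PostgreSQL source and are used only
for illustration:)

    # stacks that include heap_page_prune are HOT-pruning events;
    # stacks that include lazy_vacuum_* come from vacuum
    perf script | grep -c heap_page_prune
    perf script | grep -c lazy_vacuum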

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: tracking context switches with perf record

From
Greg Stark
Date:
On Fri, Mar 30, 2012 at 5:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> If you expand that branch of the call tree, you find that all of them
> are coming eventually from secure_read; the server is waiting for a
> new query from the client.  This is, obviously, overhead we can't
> eliminate from this test; waiting for the client is part of the job.

Fwiw this isn't necessarily true. How does the absolute number of
these events compare with the number of pg_bench operations done? If
it's significantly more the server could be reading on sockets while
there are partial commands there and it might be more efficient to
wait until the whole command is ready before reading. It may be that
this indicates that pg_bench is written in an inefficient way and it
should pipeline more commands but of course optimizing pg_bench is
kind of missing the point.

Also incidentally context switches is one of the things getrusage
shows so I'm still hoping to have that per-plan-node though that's
orthogonal to what this tool gives you with the call graph.
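
(An aside that isn't in the thread: perf record only samples these events, but
the absolute count asked about above can be taken exactly with perf stat, and
the voluntary/involuntary split that getrusage reports as ru_nvcsw/ru_nivcsw is
also visible per process with pidstat; the PID below is a placeholder:)

    # exact context-switch count for one backend over 60 seconds
    perf stat -e context-switches -p <postgres_pid> -- sleep 60
    # voluntary (cswch/s) and involuntary (nvcswch/s) switch rates, every 5s
    pidstat -w -p <postgres_pid> 5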


--
greg


Re: tracking context switches with perf record

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Fri, Mar 30, 2012 at 5:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> If you expand that branch of the call tree, you find that all of them
>> are coming eventually from secure_read; the server is waiting for a
>> new query from the client.  This is, obviously, overhead we can't
>> eliminate from this test; waiting for the client is part of the job.

> Fwiw this isn't necessarily true. How does the absolute number of
> these events compare with the number of pg_bench operations done? If
> it's significantly more the server could be reading on sockets while
> there are partial commands there and it might be more efficient to
> wait until the whole command is ready before reading. It may be that
> this indicates that pg_bench is written in an inefficient way and it
> should pipeline more commands but of course optimizing pg_bench is
> kind of missing the point.

Well, that would be on libpq's head if it were true, but I believe we're
fairly careful to not flush the output buffer until we're sending a
complete message.
        regards, tom lane


Re: tracking context switches with perf record

From
Robert Haas
Date:
On Fri, Mar 30, 2012 at 1:30 PM, Greg Stark <stark@mit.edu> wrote:
> On Fri, Mar 30, 2012 at 5:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> If you expand that branch of the call tree, you find that all of them
>> are coming eventually from secure_read; the server is waiting for a
>> new query from the client.  This is, obviously, overhead we can't
>> eliminate from this test; waiting for the client is part of the job.
>
> Fwiw this isn't necessarily true. How does the absolute number of
> these events compare with the number of pg_bench operations done?

Since perf is a stochastic profiler, I don't think there's any way to
measure that using this tool.

> If
> it's significantly more the server could be reading on sockets while
> there are partial commands there and it might be more efficient to
> wait until the whole command is ready before reading. It may be that
> this indicates that pg_bench is written in an inefficient way and it
> should pipeline more commands but of course optimizing pg_bench is
> kind of missing the point.

Well, to some degree.  I think most clients are going to send commands
one at a time and wait for the responses.  OTOH, if for example libpq
is sending the data inefficiently in some way, that would be worth
fixing; lots of people use libpq.

> Also incidentally context switches is one of the things getrusage
> shows so I'm still hoping to have that per-plan-node though that's
> orthogonal to what this tool gives you with the call graph.

Yeah.

Actually, what is really bugging me is that I cannot find any way of
getting a profile that reflects the *time* spent waiting rather than
merely the *number* of waits.  That seems like an obvious thing to
want, and I cannot find a way to get it.
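
(Not something covered in the thread, but for the record: later perf versions
can attribute actual sleep time to call chains by merging the sched_stat_* and
sched_switch tracepoints, roughly as sketched below, assuming the kernel has
schedstats enabled and exposes those tracepoints:)

    # record how long tasks slept, not just how often they switched
    perf record -e sched:sched_stat_sleep -e sched:sched_switch \
        -e sched:sched_process_exit -a -g -o perf.data.raw -- sleep 60
    # fold the sleep durations into the switch events
    perf inject -s -i perf.data.raw -o perf.data
    # the report is now weighted by time spent waiting
    perf report --stdio -i perf.data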

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: tracking context switches with perf record

From
Andres Freund
Date:
On Friday, March 30, 2012 06:27:36 PM Robert Haas wrote:
> I've been playing around
> with "perf record" on the IBM POWER7 machine, and it's pretty cool.  One
> of the things I don't like is that some of the tools don't produce very
> nice text reports that you can cut and paste into an email - it's kind of
> a text-based GUI where you can zoom around and look at things.  Still,
> it's nice.
You can use --stdio to force a plain text interface if you want (you need to
run perf annotate manually if you want a line-by-line view); you get the same
effect if you pipe the output through cat or the like.
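
(For example, something like this; the redirect is illustrative:)

    # plain text report, suitable for pasting into an email
    perf report --stdio > report.txt
    # piping has the same effect, since perf falls back to the plain text
    # output when stdout is not a terminal
    perf report | cat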

Andres


Re: tracking context switches with perf record

From
Jeff Janes
Date:
On Fri, Mar 30, 2012 at 9:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> The more interesting waits, in my view anyway, are the ones that come
> from LWLockAcquire, which account for nearly all of the semaphore
> sleeps.  As you can see, XLogInsert accounts for over half of those,
> and ProcArrayEndTransaction and SimpleLruReadPage_ReadOnly account for
> most of the rest.  Just out of curiosity, I expanded a couple of those
> another level, and you can see what's going on there, too.  I was
> slightly surprised by the number of calls to log_heap_clean, since I
> wouldn't have expected vacuuming to generate that much work; it turns
> out that 99.9% of those are HOT pruning events happening during query
> execution.

I've generally seen a one to one correspondence between log_heap_clean
and updates to pgbench_accounts (not by sampling waits, but by
wal_debug=1).

I figured this was because the pgbench_accounts blocks are packed to
the gills by default, so the only way to make room for the next HOT
update is to prune the dead tuple left over from the previous HOT
update in that same block.

It might be interesting to see if those go away if run with pgbench -i
-F90 or some other value less than 100.

Which I've now done, and it actually goes the other way.  With -F50,
not only do almost all pgbench_accounts updates still lead to a clean,
but a good chunk of the updates to pgbench_tellers lead to a clean as
well.  If cleans are going to happen so often, would it make sense to
write a combined record for clean+hot_update?

Cheers,

Jeff


Re: tracking context switches with perf record

From
Robert Haas
Date:
On Fri, Mar 30, 2012 at 10:07 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Which I've now done, and it actually goes the other way.  with -F50,
> not only do almost all pgbench_accounts updates still lead to a clean,
> but a good chunk of the updates to pgbench_tellers lead to a clean as
> well.  If cleans are going to happen so often, would it make sense to
> write a combined record for clean+hot_update?

I think that'd be hard to arrange.  You might think that the prune
happens when we go to update and find that there isn't room on the
page, but that's actually not how it works at all:

# Ideally we would do defragmenting only when we are about to attempt
# heap_update on a HOT-safe tuple.  The difficulty with this approach
# is that the update query has certainly got a pin on the old tuple, and
# therefore our attempt to acquire a buffer cleanup lock will always fail.
# (This corresponds to the idea that we don't want to move the old tuple
# out from under where the query's HeapTuple pointer points.  It might
# be possible to finesse that, but it seems fragile.)

So what we do instead is:

# The currently planned heuristic is to prune and defrag when first accessing
# a page that potentially has prunable tuples (as flagged by the pd_prune_xid
# page hint field) and that either has free space less than MAX(fillfactor
# target free space, BLCKSZ/10) *or* has recently had an UPDATE fail to
# find enough free space to store an updated tuple version.  (These rules
# are subject to change.)

So, in other words, the prune happens before we have any idea that
we're going to attempt an update.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company