measuring lwlock-related latency spikes - Mailing list pgsql-hackers

From: Robert Haas
Subject: measuring lwlock-related latency spikes
Date:
Msg-id: CA+TgmoZRNeD+FJswnagms967Gy-EVrYGVPY1aJr1GyL+1CnSnQ@mail.gmail.com
Responses: Re: measuring lwlock-related latency spikes (Simon Riggs <simon@2ndQuadrant.com>)
           Re: measuring lwlock-related latency spikes (Greg Stark <stark@mit.edu>)
List: pgsql-hackers
On Fri, Mar 30, 2012 at 1:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Actually, what is really bugging me is that I cannot find any way of
> getting a profile that reflects the *time* spent waiting rather than
> merely the *number* of waits.  That seems like an obvious thing to
> want, and I cannot find a way to get it.

I ended up taking a stab at writing some userspace code for this - see
attached.  It measures the amount of time taken for each contended
lwlock acquisition and prints a crude histogram of the results at
postmaster-shutdown time.  At least on the IBM POWER7 machine, running
pgbench with 32 threads, where pg_test_timing shows that getting the
time takes less than a microsecond 96%+ of the time, this seemed to
have no real impact on the tps numbers - perhaps because the workload
is I/O bound.  Risking the possible ire of people who object to large
attachments, I'm attaching the results this generated on a 30-minute,
32-thread pgbench run at scale factor 300.  To minimize said ire, I've
run the file through bzip2.
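
In case it's easier to see the shape of the thing without reading the
patch, here's a boiled-down standalone sketch of the approach -
illustrative only, with made-up names and constants; the actual patch
instruments contended lwlock acquisitions inside the server and dumps
the per-lock histogram at postmaster shutdown rather than from main():

/*
 * Standalone sketch of the measurement approach (not the attached patch;
 * names and constants here are made up).  Each contended acquisition is
 * timed with gettimeofday(), and the elapsed microseconds are dropped
 * into a histogram bucket indexed by fls(elapsed_us), i.e. the position
 * of the highest set bit.
 */
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>
#include <unistd.h>

#define NUM_LOCKS    16         /* stand-in for the real lwlock count */
#define NUM_BUCKETS  33         /* fls() of a 32-bit value is 0..32 */

static uint64_t wait_histogram[NUM_LOCKS][NUM_BUCKETS];

/* fls(): index of the highest set bit, 1-based; fls(0) == 0 */
static int
fls32(uint32_t x)
{
    int bit = 0;

    while (x != 0)
    {
        bit++;
        x >>= 1;
    }
    return bit;
}

/* Call this around the blocking path of a contended lock acquisition. */
static void
record_contended_wait(int lockid, const struct timeval *start,
                      const struct timeval *end)
{
    uint32_t elapsed_us;

    elapsed_us = (uint32_t) ((end->tv_sec - start->tv_sec) * 1000000L
                             + (end->tv_usec - start->tv_usec));
    wait_histogram[lockid][fls32(elapsed_us)]++;
}

/* At shutdown, print one "lock N: bucket:count, ..." line per lock. */
static void
print_histograms(void)
{
    int lock, bucket;

    for (lock = 0; lock < NUM_LOCKS; lock++)
    {
        int printed = 0;

        for (bucket = 0; bucket < NUM_BUCKETS; bucket++)
        {
            if (wait_histogram[lock][bucket] == 0)
                continue;
            if (!printed)
                printf("lock %d:", lock);
            printf("%s %d:%llu", printed ? "," : "", bucket,
                   (unsigned long long) wait_histogram[lock][bucket]);
            printed = 1;
        }
        if (printed)
            putchar('\n');
    }
}

int
main(void)
{
    struct timeval start, end;

    /* Simulate one contended acquisition of "lock 7" taking ~3ms. */
    gettimeofday(&start, NULL);
    usleep(3000);
    gettimeofday(&end, NULL);
    record_contended_wait(7, &start, &end);

    print_histograms();
    return 0;
}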

What's interesting about this is that, while there is plenty of
waiting for the usual suspects - ProcArrayLock (4), WALInsertLock (7),
and CLogControlLock (11) - the waits are all pretty short:

2012-03-31 02:33:25 UTC [50305] LOG:  lock 0: 2:520, 3:2222
2012-03-31 02:33:25 UTC [50305] LOG:  lock 3: 1:9, 2:36838
2012-03-31 02:33:25 UTC [50305] LOG:  lock 4: 1:33, 2:216964
2012-03-31 02:33:25 UTC [50305] LOG:  lock 7: 1:39, 2:406249
2012-03-31 02:33:25 UTC [50305] LOG:  lock 8: 1:4, 2:34
2012-03-31 02:33:25 UTC [50305] LOG:  lock 11: 1:99, 2:374559
2012-03-31 02:33:25 UTC [50305] LOG:  lock 17: 2:24, 3:24

That's saying that there were over 400,000 contended acquisitions of
WALInsertLock, but the longest one had fls(wait_time_in_us) = 2, or in
other words it took less than 4us to get the lock.  So what happens if
we grep the log file for the biggest offenders?

2012-03-31 02:33:25 UTC [50305] LOG:  lock 204610: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 272958: 23:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 325412: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 325784: 21:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 360016: 5:1, 21:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 444886: 23:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 499890: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 533418: 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 610484: 6:1, 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 897798: 22:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1027128: 7:1, 20:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1074256: 5:1, 21:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1132586: 5:1, 23:1
2012-03-31 02:33:25 UTC [50305] LOG:  lock 1312178: 16:1, 22:1
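
Just to spell out the bucket arithmetic I'm relying on here: a
histogram bucket b covers waits in the range [2^(b-1), 2^b)
microseconds, i.e. something like this (continuing the sketch above):

/* Illustrative only: the range of wait times (in us) covered by
 * histogram bucket b, given that buckets are indexed by
 * fls(wait_time_in_us).  Bucket 2 is 2-3us, bucket 20 is roughly
 * 0.5-1 second, and bucket 23 is roughly 4-8 seconds. */
static void
bucket_range_us(int b, uint64_t *min_us, uint64_t *max_us)
{
    *min_us = (b == 0) ? 0 : (uint64_t) 1 << (b - 1);
    *max_us = (b == 0) ? 0 : (((uint64_t) 1 << b) - 1);
}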

If I'm reading this right, fls(wait_time_in_us) = 20 means a delay of
somewhere between half a second and a second, which I think means that
a few of those waits (the fls = 23 entries) lasted somewhere between
*4 and 8 seconds*.  gdb reveals that in the test configuration, all
odd-numbered locks between 169 and 2097319 are some buffer's
io_in_progress_lock, and all even-numbered locks between 170 and
2097320 are some buffer's content_lock, which means, if I'm not
confused here, that every single one of these long stalls happened
while waiting for a buffer content lock.  Moreover, each event
affected a different buffer.  I find this result so surprising that I
have a hard time believing that I haven't screwed something up, so if
anybody can check over the patch and this analysis and suggest what
that thing might be, I would appreciate it.  I would be a lot less
surprised if the waits revolved around the IO-in-progress locks, since
it's not that hard to imagine an I/O taking a really long time on a
busy system.  But I didn't think we were ever supposed to hold content
locks for that long.
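
For what it's worth, assuming the buffer lwlocks are handed out
pairwise per buffer (io_in_progress_lock and then content_lock), which
is what the numbering above suggests but which I've only spot-checked
in gdb, a lock number from the log can be mapped back to a buffer with
something like:

/* Illustrative sketch under the pairwise-assignment assumption for this
 * particular configuration: odd lock ids starting at 169 are
 * io_in_progress locks, even ids starting at 170 are content locks.
 * Returns the buffer index, or -1 if the lock id isn't a buffer lock. */
static int
lock_to_buffer(int lockid, int *is_content_lock)
{
    if (lockid < 169 || lockid > 2097320)
        return -1;                      /* not a buffer lock */
    if (lockid % 2 == 1)
    {
        *is_content_lock = 0;           /* io_in_progress_lock */
        return (lockid - 169) / 2;
    }
    *is_content_lock = 1;               /* content_lock */
    return (lockid - 170) / 2;
}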

The other thing that baffles me about these numbers is that they don't
provide any fodder for explaining the periodic drops in throughput
that happen when the system checkpoints.  I had assumed they would
show up as long lwlock waits, like somebody hanging on to
WALInsertLock while everybody else in the system piles up behind them.
That's not reflected in these numbers - each of the few very long
waits shows just ONE guy waiting a really long time for the lock.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

