pgbench vs. wait events - Mailing list pgsql-hackers

From Robert Haas
Subject pgbench vs. wait events
Date
Msg-id CA+Tgmoav9Q5v5ZGT3+wP_1tQjT6TGYXrwrDcTRrWimC+ZY7RRA@mail.gmail.com
Responses Re: pgbench vs. wait events  (Jeff Janes <jeff.janes@gmail.com>)
Re: pgbench vs. wait events  (Michael Paquier <michael.paquier@gmail.com>)
Re: pgbench vs. wait events  (Alfred Perlstein <alfred@freebsd.org>)
Re: pgbench vs. wait events  (Alfred Perlstein <bright@mu.org>)
Re: pgbench vs. wait events  (Jeff Janes <jeff.janes@gmail.com>)
Re: pgbench vs. wait events  (Andres Freund <andres@anarazel.de>)
Re: pgbench vs. wait events  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
Hi,

I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench.  I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on.  During each run, I ran this psql script in another window and
captured the output:

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5
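
For reference, the pgbench side of each run amounts to something like the
following (the scale factor, client count, and 30-minute duration are the ones
described in this message; the thread count, database name, and any other
flags are just illustrative):

pgbench -i -s 3000 pgbench
pgbench -c 32 -j 16 -T 1800 pgbench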

Then, I used a little shell-scripting to count up the number of times
each wait event occurred in the output (a sketch of that counting step
appears just after the first set of results below).  First, I tried scale
factor 3000 with 32 clients and got these results:
     1  LWLockTranche   | buffer_mapping
     9  LWLockNamed     | CLogControlLock
    14  LWLockNamed     | ProcArrayLock
    16  Lock            | tuple
    25  LWLockNamed     | CheckpointerCommLock
    49  LWLockNamed     | WALBufMappingLock
   122  LWLockTranche   | clog
   182  Lock            | transactionid
   287  LWLockNamed     | XidGenLock
  1300  Client          | ClientRead
  1375  LWLockTranche   | buffer_content
  3990  Lock            | extend
 21014  LWLockNamed     | WALWriteLock
 28497                  |
 58279  LWLockTranche   | wal_insert

tps = 1150.803133 (including connections establishing)
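
The counting mentioned above is nothing fancy; with the \watch output captured
to a file (call it waits.txt, a made-up name), a pipeline along these lines
produces lists like the one just shown:

grep '|' waits.txt | sort | uniq -c | sort -n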

What jumps out here, at least to me, is that there is furious
contention on both the wal_insert locks and on WALWriteLock.
Apparently, the system simply can't get WAL on disk fast enough to
keep up with this workload.  Relation extension locks and
buffer_content locks are also pretty common, both ahead of ClientRead,
which is a relatively uncommon wait event on this test.  The load
average on the system was only about 3 during this test, indicating
that most processes are in fact spending most of their time off-CPU.
The first thing I tried was switching to unlogged tables, which
produces these results:
     1  BufferPin       | BufferPin
     1  LWLockTranche   | lock_manager
     2  LWLockTranche   | buffer_mapping
     8  LWLockNamed     | ProcArrayLock
     9  LWLockNamed     | CheckpointerCommLock
     9  LWLockNamed     | CLogControlLock
    11  LWLockTranche   | buffer_content
    37  LWLockTranche   | clog
   153  Lock            | tuple
   388  LWLockNamed     | XidGenLock
   827  Lock            | transactionid
  1267  Client          | ClientRead
 20631  Lock            | extend
 91767                  |

tps = 1223.239416 (including connections establishing)
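
For what it's worth, pgbench can create its tables unlogged at initialization
time, so the switch amounts to re-initializing along these lines (database
name made up, as before):

pgbench -i -s 3000 --unlogged-tables pgbench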

If you don't look at the TPS number, these results look like a vast
improvement.  The overall amount of time spent not waiting for
anything is now much higher, and the problematic locks have largely
disappeared from the picture.  However, the load average now shoots up
to about 30, because most of the time that the backends are "not
waiting for anything" they are in fact in kernel wait state D; that
is, they're stuck doing I/O.  This suggests that we might want to
consider advertising a wait state when a backend is doing I/O, so we
can measure this sort of thing.
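
That kernel-level observation is easy to check from another terminal; counting
backend process states with something like this shows most of them sitting in
state D (uninterruptible sleep, i.e. waiting on I/O) during a run like this
one:

ps -C postgres -o state= | sort | uniq -c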

Next, I tried lowering the scale factor to something that fits in
shared buffers.  Here are the results at scale factor 300:
    14  Lock            | tuple
    22  LWLockTranche   | lock_manager
    39  LWLockNamed     | WALBufMappingLock
   331  LWLockNamed     | CLogControlLock
   461  LWLockNamed     | ProcArrayLock
   536  Lock            | transactionid
   552  Lock            | extend
   716  LWLockTranche   | buffer_content
   763  LWLockNamed     | XidGenLock
  2113  LWLockNamed     | WALWriteLock
  6190  LWLockTranche   | wal_insert
 25002  Client          | ClientRead
 78466                  |

tps = 27651.562835 (including connections establishing)

Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work.  ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare.  Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples.  It's not
clear how much of the WAL-related stuff is artificial lock contention
and how much of it is the limited speed at which the disk can rotate.

However, I was curious about what's going on with CLogControlLock,
pursuant to the other thread on that topic, so I reran this with
unlogged tables.  At scale factor 300, 32 clients, unlogged tables, I
get these results:
     6  LWLockTranche   | wal_insert
    12  Lock            | tuple
    26  LWLockTranche   | lock_manager
    39  LWLockTranche   | buffer_content
   353  LWLockNamed     | CLogControlLock
   377  Lock            | extend
   491  Lock            | transactionid
   500  LWLockNamed     | ProcArrayLock
   696  LWLockNamed     | XidGenLock
 27685  Client          | ClientRead
 84930                  |

WAL contention is eliminated, and CLog contention doesn't increase.
Next, I tried ramping up the client count to see what effect that had.
These next three results are all at scale factor 300 with unlogged
tables.

64 clients:
     1  BufferPin       | BufferPin
     2  LWLockTranche   | buffer_mapping
    51  LWLockTranche   | wal_insert
    52  Lock            | relation
   132  LWLockTranche   | lock_manager
   154  LWLockTranche   | buffer_content
   209  Lock            | tuple
   940  Lock            | extend
  1412  LWLockNamed     | XidGenLock
  2337  LWLockNamed     | CLogControlLock
  2813  LWLockNamed     | ProcArrayLock
  3859  Lock            | transactionid
 83760  Client          | ClientRead
133840                  |

tps = 36672.484524 (including connections establishing)

96 clients:
     1  LWLockTranche   | proc
    53  LWLockTranche   | wal_insert
   305  LWLockTranche   | buffer_content
   391  LWLockTranche   | lock_manager
   797  Lock            | tuple
  1074  Lock            | extend
  2492  LWLockNamed     | XidGenLock
  4185  LWLockNamed     | ProcArrayLock
  6003  LWLockNamed     | CLogControlLock
 11348  Lock            | transactionid
142881  Client          | ClientRead
174011                  |

tps = 37214.257661 (including connections establishing)

192 clients:
     1  LWLockTranche   | buffer_mapping
     2  BufferPin       | BufferPin
     4  LWLockTranche   | clog
     8  LWLockTranche   | proc
    87  LWLockTranche   | wal_insert
   708  LWLockTranche   | buffer_content
  1892  LWLockTranche   | lock_manager
  2225  Lock            | extend
  4796  LWLockNamed     | ProcArrayLock
  5105  LWLockNamed     | XidGenLock
  7982  Lock            | tuple
 11882  LWLockNamed     | CLogControlLock
 64091  Lock            | transactionid
275481                  |
308627  Client          | ClientRead

tps = 35018.906863 (including connections establishing)

There are a couple of interesting things here.  First, CLogControlLock
contention becomes more significant as the number of clients goes
up.  At 32 clients, it is the 6th-most frequent wait event; at 192
clients, it's moved up to 4th, and it's clearly growing at a faster
rate than some of the others.  As we go from 32 to 64 to 96 to 192
clients, ProcArrayLock is captured in 500, 2813, 4185, 4796 samples,
but CLogControlLock is captured in 353, 2337, 6003, 11882 samples, so
it goes from being only 2/3 of ProcArrayLock to being more than twice
ProcArrayLock. However, in absolute terms, it's still not very
significant.  Even at 192 clients, where it's the fourth-most-common
wait event, it appears in only 1.7% of samples, which means backends
are only waiting for it about 1.7% of the time.  It appears that, on
this system, this workload is NEVER going to get meaningfully faster
by improving CLogControlLock.
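
That 1.7% is just the CLogControlLock sample count divided by the total
number of samples for the run; given counted output like the lists above, the
share of any particular wait event falls out of something like this
(counts-192.txt being a made-up name for a file holding one run's counts):

awk -v ev=CLogControlLock '{tot += $1; if ($NF == ev) n = $1}
    END {printf "%.1f%%\n", 100 * n / tot}' counts-192.txt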

Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources.  Third,
Lock/transactionid and Lock/tuple become more and more common as the
number of clients increases; these happen when two different pgbench
threads decide to hit the same row at the same time.  Due to the
birthday paradox this increases pretty quickly as the number of
clients ramps up, but it's not really a server issue: there's nothing
the server can do about the possibility that two or more clients pick
the same random number at the same time.
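
To put a rough number on that last point: at scale factor 300, each TPC-B-ish
transaction updates one of only 300 pgbench_branches rows, so if 32 clients
are each in the middle of a transaction against a uniformly chosen branch,
the chance that at least two of them have picked the same branch is already
over 80%.  (The uniform, everyone-in-flight-at-once assumption is a
simplification, and this ignores tellers and accounts entirely.)  A quick way
to check the arithmetic:

psql -Atc "SELECT 1 - exp(sum(ln(1 - i/300.0))) FROM generate_series(0, 31) AS i;"

That comes out to roughly 0.82, and with 192 clients it is essentially 1,
which is consistent with how fast Lock/transactionid and Lock/tuple climb in
the numbers above.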

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


