Re: pgbench vs. wait events - Mailing list pgsql-hackers

From Alfred Perlstein
Subject Re: pgbench vs. wait events
Date
Msg-id daf9117e-d08f-6d22-4381-d273b5a788b3@freebsd.org
In response to pgbench vs. wait events  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Robert,

This contention on WAL reminds me of a similar scenario I've heard about 
elsewhere.

To fix things, the process holding the lock was made responsible for 
writing out the buffers of anyone blocked behind "him".

For example, say you have a number of threads: A, B, C, and D.

While A is writing to WAL and holding the lock, B arrives and of course 
blocks; then C comes along and blocks as well, and then D.

Finally A finishes its write and then ....

Now you have two options for resolving this:

1) A drops its lock, B picks up the lock... B writes its buffer and then 
drops the lock.  Then C gets the lock, writes its buffer, drops the 
lock, then finally D gets the lock, writes its buffer and then drops the 
lock.

2) A then writes out B's, C's, and D's buffers as well, then drops the 
lock; B, C, and D wake up, notice that their respective buffers have 
already been written, and just return.  This greatly speeds up the 
system.  (Just be careful that A doesn't do "too much work", otherwise 
you can get a sort of livelock if too many threads keep piling up behind 
it; generally, issue only one additional flush on behalf of the other 
threads rather than looping until the queue is empty.)

I'm not sure if this is actually possible with the way WAL is 
implemented (or whether this strategy is already implemented), but if 
it isn't done already it's definitely worth looking into, as it can 
speed things up enormously.
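
To make this concrete, here is a minimal toy sketch of the pattern in plain 
C (pthreads plus C11 atomics).  All the names are made up and this has 
nothing to do with the actual WAL code; it only illustrates the "whoever 
holds the lock flushes for everyone queued up so far" idea:

/*
 * group_flush.c: toy model of combining flushes behind a single lock.
 * Names are invented for illustration; this is not PostgreSQL code.
 *
 * Each worker wants the "log" flushed up to some position.  A worker first
 * publishes its request, then takes the flush lock.  Whoever holds the lock
 * flushes up to the highest position requested so far, so workers queued
 * behind it usually find their request already satisfied and return without
 * doing any I/O of their own.
 *
 * Build with: cc -pthread group_flush.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_long requested_upto = 0;  /* highest position anyone wants flushed */
static atomic_long flushed_upto = 0;    /* highest position actually flushed */

/* Stand-in for the expensive write + fsync. */
static void
do_flush(long upto)
{
    usleep(10000);                      /* pretend the I/O takes a while */
    atomic_store(&flushed_upto, upto);
}

static void
flush_to(long target)
{
    long seen = atomic_load(&requested_upto);

    /* Publish our request *before* queueing on the lock, so the current
     * lock holder (or whoever gets the lock next) can absorb it. */
    while (seen < target &&
           !atomic_compare_exchange_weak(&requested_upto, &seen, target))
        ;

    /* Fast path: somebody already flushed past our target. */
    if (atomic_load(&flushed_upto) >= target)
        return;

    pthread_mutex_lock(&flush_lock);
    if (atomic_load(&flushed_upto) < target)
    {
        /* One combined flush covering everything requested so far;
         * deliberately not a "loop until the queue is empty". */
        do_flush(atomic_load(&requested_upto));
    }
    /* else: a previous lock holder already covered us, nothing to do. */
    pthread_mutex_unlock(&flush_lock);
}

static void *
worker(void *arg)
{
    long target = (long) (intptr_t) arg;

    flush_to(target);
    printf("worker wanting %ld done, flushed_upto is now %ld\n",
           target, atomic_load(&flushed_upto));
    return NULL;
}

int
main(void)
{
    pthread_t threads[4];

    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker,
                       (void *) (intptr_t) (100 * (i + 1)));
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

The important bit is that a process publishes its request before queueing 
on the lock, so whoever ends up holding the lock can satisfy several 
waiters with a single combined flush, and those waiters then return 
without issuing any I/O of their own.
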


On 10/6/16 11:38 AM, Robert Haas wrote:
> Hi,
>
> I decided to do some testing on hydra (IBM-provided community
> resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
> the newly-enhanced wait event stuff to try to get an idea of what
> we're waiting for during pgbench.  I did 30-minute pgbench runs with
> various configurations, but all had max_connections = 200,
> shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
> off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
> log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
> on.  During each run, I ran this psql script in another window and
> captured the output:
>
> \t
> select wait_event_type, wait_event from pg_stat_activity where pid !=
> pg_backend_pid()
> \watch 0.5
>
> Then, I used a little shell-scripting to count up the number of times
> each wait event occurred in the output.  First, I tried scale factor
> 3000 with 32 clients and got these results:
>
>        1  LWLockTranche   | buffer_mapping
>        9  LWLockNamed     | CLogControlLock
>       14  LWLockNamed     | ProcArrayLock
>       16  Lock            | tuple
>       25  LWLockNamed     | CheckpointerCommLock
>       49  LWLockNamed     | WALBufMappingLock
>      122  LWLockTranche   | clog
>      182  Lock            | transactionid
>      287  LWLockNamed     | XidGenLock
>     1300  Client          | ClientRead
>     1375  LWLockTranche   | buffer_content
>     3990  Lock            | extend
>    21014  LWLockNamed     | WALWriteLock
>    28497                  |
>    58279  LWLockTranche   | wal_insert
>
> tps = 1150.803133 (including connections establishing)
>
> What jumps out here, at least to me, is that there is furious
> contention on both the wal_insert locks and on WALWriteLock.
> Apparently, the system simply can't get WAL on disk fast enough to
> keep up with this workload.  Relation extension locks and
> buffer_content locks are also pretty common, both ahead of
> ClientRead, a relatively uncommon wait event on this test.  The load
> average on the system was only about 3 during this test, indicating
> that most processes are in fact spending most of their time off-CPU.
> The first thing I tried was switching to unlogged tables, which
> produces these results:
>
>        1  BufferPin       | BufferPin
>        1  LWLockTranche   | lock_manager
>        2  LWLockTranche   | buffer_mapping
>        8  LWLockNamed     | ProcArrayLock
>        9  LWLockNamed     | CheckpointerCommLock
>        9  LWLockNamed     | CLogControlLock
>       11  LWLockTranche   | buffer_content
>       37  LWLockTranche   | clog
>      153  Lock            | tuple
>      388  LWLockNamed     | XidGenLock
>      827  Lock            | transactionid
>     1267  Client          | ClientRead
>    20631  Lock            | extend
>    91767                  |
>
> tps = 1223.239416 (including connections establishing)
>
> If you don't look at the TPS number, these results look like a vast
> improvement.  The overall amount of time spent not waiting for
> anything is now much higher, and the problematic locks have largely
> disappeared from the picture.  However, the load average now shoots up
> to about 30, because most of the time that the backends are "not
> waiting for anything" they are in fact in kernel wait state D; that
> is, they're stuck doing I/O.  This suggests that we might want to
> consider advertising a wait state when a backend is doing I/O, so we
> can measure this sort of thing.
>
> Next, I tried lowering the scale factor to something that fits in
> shared buffers.  Here are the results at scale factor 300:
>
>       14  Lock            | tuple
>       22  LWLockTranche   | lock_manager
>       39  LWLockNamed     | WALBufMappingLock
>      331  LWLockNamed     | CLogControlLock
>      461  LWLockNamed     | ProcArrayLock
>      536  Lock            | transactionid
>      552  Lock            | extend
>      716  LWLockTranche   | buffer_content
>      763  LWLockNamed     | XidGenLock
>     2113  LWLockNamed     | WALWriteLock
>     6190  LWLockTranche   | wal_insert
>    25002  Client          | ClientRead
>    78466                  |
>
> tps = 27651.562835 (including connections establishing)
>
> Obviously, there's a vast increase in TPS, and the backends seem to
> spend most of their time actually doing work.  ClientRead is now the
> overwhelmingly dominant wait event, although wal_insert and
> WALWriteLock contention is clearly still a significant problem.
> Contention on other locks is apparently quite rare.  Notice that
> client reads are really significant here - more than 20% of the time
> we sample what a backend is doing, it's waiting for the next query.
> It seems unlikely that performance on this workload can be improved
> very much by optimizing anything other than WAL writing, because no
> other wait event appears in more than 1% of the samples.  It's not
> clear how much of the WAL-related stuff is artificial lock contention
> and how much of it is the limited speed at which the disk can rotate.
>
> However, I was curious about what's going on with CLogControlLock,
> pursuant to the other thread on that topic, so I reran this with
> unlogged tables.  At scale factor 300, 32 clients, unlogged tables, I
> get these results:
>
>        6  LWLockTranche   | wal_insert
>       12  Lock            | tuple
>       26  LWLockTranche   | lock_manager
>       39  LWLockTranche   | buffer_content
>      353  LWLockNamed     | CLogControlLock
>      377  Lock            | extend
>      491  Lock            | transactionid
>      500  LWLockNamed     | ProcArrayLock
>      696  LWLockNamed     | XidGenLock
>    27685  Client          | ClientRead
>    84930                  |
>
> WAL contention is eliminated, and CLog contention doesn't increase.
> Next, I tried ramping up the client count to see what effect that had.
> These next three results are all at scale factor 300 with unlogged
> tables.
>
> 64 clients:
>
>        1  BufferPin       | BufferPin
>        2  LWLockTranche   | buffer_mapping
>       51  LWLockTranche   | wal_insert
>       52  Lock            | relation
>      132  LWLockTranche   | lock_manager
>      154  LWLockTranche   | buffer_content
>      209  Lock            | tuple
>      940  Lock            | extend
>     1412  LWLockNamed     | XidGenLock
>     2337  LWLockNamed     | CLogControlLock
>     2813  LWLockNamed     | ProcArrayLock
>     3859  Lock            | transactionid
>    83760  Client          | ClientRead
>   133840                  |
>
> tps = 36672.484524 (including connections establishing)
>
> 96 clients:
>
>        1  LWLockTranche   | proc
>       53  LWLockTranche   | wal_insert
>      305  LWLockTranche   | buffer_content
>      391  LWLockTranche   | lock_manager
>      797  Lock            | tuple
>     1074  Lock            | extend
>     2492  LWLockNamed     | XidGenLock
>     4185  LWLockNamed     | ProcArrayLock
>     6003  LWLockNamed     | CLogControlLock
>    11348  Lock            | transactionid
>   142881  Client          | ClientRead
>   174011                  |
>
> tps = 37214.257661 (including connections establishing)
>
> 192 clients:
>
>        1  LWLockTranche   | buffer_mapping
>        2  BufferPin       | BufferPin
>        4  LWLockTranche   | clog
>        8  LWLockTranche   | proc
>       87  LWLockTranche   | wal_insert
>      708  LWLockTranche   | buffer_content
>     1892  LWLockTranche   | lock_manager
>     2225  Lock            | extend
>     4796  LWLockNamed     | ProcArrayLock
>     5105  LWLockNamed     | XidGenLock
>     7982  Lock            | tuple
>    11882  LWLockNamed     | CLogControlLock
>    64091  Lock            | transactionid
>   275481                  |
>   308627  Client          | ClientRead
>
> tps = 35018.906863 (including connections establishing)
>
> There are a couple of interesting things here.  First, CLogControlLock
> contention becomes more significant as the number of clients goes
> up. At 32 clients, it is the 6th-most frequent wait event; at 192
> clients, it's moved up to 4th, and it's clearly growing at a faster
> rate than some of the others.  As we go from 32 to 64 to 96 to 192
> clients, ProcArrayLock is captured in 500, 2813, 4185, 4796 samples,
> but CLogControlLock is captured in 353, 2337, 6003, 11882 samples, so
> it goes from being only 2/3 of ProcArrayLock to being more than twice
> ProcArrayLock. However, in absolute terms, it's still not very
> significant.  Even at 192 clients, where it's the fourth-most-common
> wait event, it appears in only 1.7% of samples, which means backends
> are only waiting for it about 1.7% of the time.  It appears that, on
> this system, this workload is NEVER going to get meaningfully faster
> by improving CLogControlLock.
>
> Second, ClientRead becomes a bigger and bigger issue as the number of
> clients increases; by 192 clients, it appears in 45% of the samples.
> That basically means that pgbench is increasingly unable to keep up
> with the server; for whatever reason, it suffers more than the server
> does from the increasing lack of CPU resources.  Third,
> Lock/transactionid and Lock/tuple become more and more common as the
> number of clients increases; these happen when two different pgbench
> threads decide to hit the same row at the same time.  Due to the
> birthday paradox this increases pretty quickly as the number of
> clients ramps up, but it's not really a server issue: there's nothing
> the server can do about the possibility that two or more clients pick
> the same random number at the same time.
>



