Re: pgbench vs. wait events - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: pgbench vs. wait events
Msg-id CAMkU=1yW_SnwxUQVYGroSbtKr058NXYMeQs5cZa2ivYvGyhWiQ@mail.gmail.com
In response to pgbench vs. wait events  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: pgbench vs. wait events  (Robert Haas <robertmhaas@gmail.com>)
Re: pgbench vs. wait events  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On Thu, Oct 6, 2016 at 11:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Next, I tried lowering the scale factor to something that fits in
shared buffers.  Here are the results at scale factor 300:

     14  Lock            | tuple
     22  LWLockTranche   | lock_manager
     39  LWLockNamed     | WALBufMappingLock
    331  LWLockNamed     | CLogControlLock
    461  LWLockNamed     | ProcArrayLock
    536  Lock            | transactionid
    552  Lock            | extend
    716  LWLockTranche   | buffer_content
    763  LWLockNamed     | XidGenLock
   2113  LWLockNamed     | WALWriteLock
   6190  LWLockTranche   | wal_insert
  25002  Client          | ClientRead
  78466                  |

tps = 27651.562835 (including connections establishing)

Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work.  ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare.  Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples.  It's not
clear how much of the WAL-related stuff is artificial lock contention
and how much of it is the limited speed at which the disk can rotate.

What happens if you turn fsync off?  Once an xlog file is fully written, it is immediately fsynced, even if the backend is holding WALWriteLock or wal_insert (or both) at the time, and even if synchronous_commit is off.  Assuming this machine has a BBU so that it doesn't have to wait for disk rotation, the fsyncs are still expensive, because the kernel has to find all the data and get it sent over to the BBU while those locks are held.
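For a throwaway run to separate the fsync cost from the lock contention, the relevant postgresql.conf change is just:

```
# Benchmarking-only setting -- never use on data you care about,
# since a crash with fsync off can corrupt the cluster.
fsync = off
```

If WALWriteLock waits drop sharply with this set, the bottleneck is the fsyncs themselves rather than contention on the lock per se.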
 

....

Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources. 

I would be careful about that interpretation.  If you asked pgbench, it would probably have the opposite opinion.

The backend tosses its response at the kernel (which will never block, because the pgbench responses are all small and the kernel will buffer them) and then goes into ClientRead.  After the backend goes into ClientRead, the kernel needs to find and wake up pgbench and deliver the response, and pgbench has to receive and process it.  Only then does pgbench create a new query (I've toyed before with having pgbench construct the next query while it is waiting for a response on the previous one, but that didn't seem promising, and much of pgbench has been rewritten since then) and pass it back to the kernel.  Then the kernel has to find and wake up the backend and deliver the new query.  So for a reasonable chunk of the time that the server thinks it is waiting for the client, the client also thinks it is waiting for the server.

I think we need to come up with some benchmarking queries which get more work done per round-trip to the database, and build them into the binary, because otherwise people won't use them as much as they should if they have to pass "-f" files around mailing lists and blog postings.  For example, we could enclose the 5 data-access statements of the TPC-B-like transaction in a single function which takes aid, bid, tid, and delta as arguments.  And presumably we could drop the other two statements (BEGIN and COMMIT) as well, and rely on autocommit to get that job done.  So we could go from 7 statements to 1.
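As a sketch of what I mean (the function name and shape are my illustration, not anything built into pgbench today; the five statements are the ones the standard TPC-B-like script issues against the pgbench_* tables):

```
-- Hypothetical server-side function bundling the five data-access
-- statements of the TPC-B-like transaction into one round trip.
CREATE FUNCTION tpcb_like(p_aid int, p_bid int, p_tid int, p_delta int)
RETURNS int AS $$
DECLARE
    v_abalance int;
BEGIN
    UPDATE pgbench_accounts SET abalance = abalance + p_delta
        WHERE aid = p_aid;
    SELECT abalance INTO v_abalance
        FROM pgbench_accounts WHERE aid = p_aid;
    UPDATE pgbench_tellers SET tbalance = tbalance + p_delta
        WHERE tid = p_tid;
    UPDATE pgbench_branches SET bbalance = bbalance + p_delta
        WHERE bid = p_bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
        VALUES (p_tid, p_bid, p_aid, p_delta, CURRENT_TIMESTAMP);
    RETURN v_abalance;
END;
$$ LANGUAGE plpgsql;
```

A custom script would then be the single line "SELECT tpcb_like(:aid, :bid, :tid, :delta);", and autocommit wraps it in its own transaction.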

 
Third,
Lock/transactionid and Lock/tuple become more and more common as the
number of clients increases; these happen when two different pgbench
threads deciding to hit the same row at the same time.  Due to the
birthday paradox this increases pretty quickly as the number of
clients ramps up, but it's not really a server issue: there's nothing
the server can do about the possibility that two or more clients pick
the same random number at the same time.


What I have done in the past is chop a zero off from:

#define naccounts   100000

and recompile pgbench.  Then you can increase the scale factor so that you have less contention on pgbench_branches while still fitting the data in shared_buffers, or in RAM.
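To put rough numbers on the birthday-paradox effect (my back-of-the-envelope figures, not measurements from this thread), the chance that at least two of N concurrent clients pick the same row out of B equally likely rows is approximately 1 - exp(-N(N-1)/(2B)):

```python
import math

def collision_prob(clients: int, rows: int) -> float:
    """Standard birthday-paradox approximation for the probability that
    at least two of `clients` concurrent transactions pick the same row
    out of `rows` equally likely rows."""
    return 1.0 - math.exp(-clients * (clients - 1) / (2.0 * rows))

# At scale factor 300 there are only 300 pgbench_branches rows, so with
# 192 clients a branch-row collision is essentially certain:
print(round(collision_prob(192, 300), 3))    # -> 1.0
# With naccounts reduced 10x, scale 30000 fits in the same space as a
# standard scale 3000, and collisions become much rarer:
print(round(collision_prob(192, 30000), 3))  # -> 0.457
```

This is why raising the scale factor (once the data still fits in shared_buffers) directly cuts the Lock/tuple and Lock/transactionid waits.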

Cheers,

Jeff
