Re: pgbench vs. wait events - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: pgbench vs. wait events
Msg-id CAMkU=1yW_SnwxUQVYGroSbtKr058NXYMeQs5cZa2ivYvGyhWiQ@mail.gmail.com
In response to pgbench vs. wait events  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: pgbench vs. wait events  (Robert Haas <robertmhaas@gmail.com>)
Re: pgbench vs. wait events  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On Thu, Oct 6, 2016 at 11:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Next, I tried lowering the scale factor to something that fits in
shared buffers.  Here are the results at scale factor 300:

     14  Lock            | tuple
     22  LWLockTranche   | lock_manager
     39  LWLockNamed     | WALBufMappingLock
    331  LWLockNamed     | CLogControlLock
    461  LWLockNamed     | ProcArrayLock
    536  Lock            | transactionid
    552  Lock            | extend
    716  LWLockTranche   | buffer_content
    763  LWLockNamed     | XidGenLock
   2113  LWLockNamed     | WALWriteLock
   6190  LWLockTranche   | wal_insert
  25002  Client          | ClientRead
  78466                  |

tps = 27651.562835 (including connections establishing)

Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work.  ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare.  Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples.  It's not
clear how much of the WAL-related stuff is artificial lock contention
and how much of it is the limited speed at which the disk can rotate.

What happens if you turn fsync off?  Once an xlog file is fully written, it is immediately fsynced, even if the backend is holding WALWriteLock or wal_insert (or both) at the time, and even if synchronous_commit is off.  Assuming this machine has a BBU so that it doesn't have to wait for disk rotation, the fsyncs are still expensive, because the kernel has to find all the data and get it sent over to the BBU while those locks are held.
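For a throwaway run to separate the fsync cost from the lock contention, the relevant postgresql.conf change is just:

```
# Benchmarking-only setting -- never use on data you care about,
# since a crash with fsync off can corrupt the cluster.
fsync = off
```

If WALWriteLock waits drop sharply with this set, the bottleneck is the fsyncs themselves rather than contention on the lock per se.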
 

....

Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources. 

I would be careful about that interpretation.  If you asked pgbench, it would probably have the opposite opinion.

The backend tosses its response at the kernel (which will never block, because the pgbench responses are all small and the kernel will buffer them) and then goes into ClientRead.  After the backend goes into ClientRead, the kernel needs to find and wake up pgbench and deliver the response, and pgbench has to receive and process it.  Only then does pgbench create a new query (I've toyed before with having pgbench construct the next query while it is waiting for a response on the previous one, but that didn't seem promising, and much of pgbench has been rewritten since then) and pass it back to the kernel.  Then the kernel has to find and wake up the backend and deliver the new query.  So for a reasonable chunk of the time that the server thinks it is waiting for the client, the client also thinks it is waiting for the server.

I think we need to come up with some benchmarking queries which get more work done per round-trip to the database, and build them into the binary, because otherwise people won't use them as much as they should if they have to pass "-f" files around mailing lists and blog postings.  For example, we could enclose the 5 data-access statements of the TPC-B-like transaction in a single function which takes aid, bid, tid, and delta as arguments.  And presumably we could drop the other two statements (BEGIN and COMMIT) as well, and rely on autocommit to get that job done.  So we could go from 7 statements to 1.
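As a sketch of what I mean (the function name and shape are my illustration, not anything built into pgbench today; the five statements are the ones the standard TPC-B-like script issues against the pgbench_* tables):

```
-- Hypothetical server-side function bundling the five data-access
-- statements of the TPC-B-like transaction into one round trip.
CREATE FUNCTION tpcb_like(p_aid int, p_bid int, p_tid int, p_delta int)
RETURNS int AS $$
DECLARE
    v_abalance int;
BEGIN
    UPDATE pgbench_accounts SET abalance = abalance + p_delta
        WHERE aid = p_aid;
    SELECT abalance INTO v_abalance
        FROM pgbench_accounts WHERE aid = p_aid;
    UPDATE pgbench_tellers SET tbalance = tbalance + p_delta
        WHERE tid = p_tid;
    UPDATE pgbench_branches SET bbalance = bbalance + p_delta
        WHERE bid = p_bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
        VALUES (p_tid, p_bid, p_aid, p_delta, CURRENT_TIMESTAMP);
    RETURN v_abalance;
END;
$$ LANGUAGE plpgsql;
```

A custom script would then be the single line "SELECT tpcb_like(:aid, :bid, :tid, :delta);", and autocommit wraps it in its own transaction.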

 
Third,
Lock/transactionid and Lock/tuple become more and more common as the
number of clients increases; these happen when two different pgbench
threads deciding to hit the same row at the same time.  Due to the
birthday paradox this increases pretty quickly as the number of
clients ramps up, but it's not really a server issue: there's nothing
the server can do about the possibility that two or more clients pick
the same random number at the same time.


What I have done in the past is chop a zero off from:

#define naccounts   100000

and recompile pgbench.  Then you can increase the scale factor so that you have less contention on pgbench_branches while still fitting the data in shared_buffers, or in RAM.
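To put rough numbers on the birthday-paradox effect (my back-of-the-envelope figures, not measurements from this thread), the chance that at least two of N concurrent clients pick the same row out of B equally likely rows is approximately 1 - exp(-N(N-1)/(2B)):

```python
import math

def collision_prob(clients: int, rows: int) -> float:
    """Standard birthday-paradox approximation for the probability that
    at least two of `clients` concurrent transactions pick the same row
    out of `rows` equally likely rows."""
    return 1.0 - math.exp(-clients * (clients - 1) / (2.0 * rows))

# At scale factor 300 there are only 300 pgbench_branches rows, so with
# 192 clients a branch-row collision is essentially certain:
print(round(collision_prob(192, 300), 3))    # -> 1.0
# With naccounts reduced 10x, scale 30000 fits in the same space as a
# standard scale 3000, and collisions become much rarer:
print(round(collision_prob(192, 30000), 3))  # -> 0.457
```

This is why raising the scale factor (once the data still fits in shared_buffers) directly cuts the Lock/tuple and Lock/transactionid waits.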

Cheers,

Jeff
