From: Fabien COELHO
Subject: Re: pgbench - implement strict TPC-B benchmark
Msg-id: alpine.DEB.2.21.1908012320430.32558@lancre
In response to: Re: pgbench - implement strict TPC-B benchmark (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

Hello Andres,

Thanks a lot for this feedback and these comments.

> Using pgbench -Mprepared -n -c 8 -j 8 -S pgbench_100 -T 10 -r -P1
> e.g. shows pgbench to use 189% CPU in my 4/8 core/thread laptop. That's
> a pretty significant share.

Fine, but what is the corresponding server load? 211%? 611%? And what 
actual time is spent in pgbench itself, vs libpq and syscalls?

Figures and discussion below.

> And before you argue that that's just about a read-only workload:

I'm fine with worst-case scenarios:-) Let's do the worst on my 2-core 
laptop running at 2.2 GHz:


(0) we can run a script that does nearly nothing:

   sh> cat nope.sql
   \sleep 0
   # do not sleep, so stay awake…

   sh> time pgbench -f nope.sql -T 10 -r
   latency average = 0.000 ms
   tps = 12569499.226367 (excluding connections establishing) # 12.6M
   statement latencies in milliseconds:
          0.000  \sleep 0
   real 0m10.072s, user 0m10.027s, sys 0m0.012s

Unsurprisingly, pgbench is at about 100% cpu load, and the transaction 
cost (transaction loop and stat collection) is 0.080 µs (1/12.6M) per 
script execution (one client on one thread).
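
For the record, the per-script cost is just the inverse of the reported 
tps; a quick sanity check (the bc invocation is mine, any calculator would 
do):

   sh> echo "scale=3; 1 / 12.569499" | bc
   .079   # µs per script, i.e. 1 / (tps in millions)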


(1) a pgbench complex-commands-only script:

   sh> cat set.sql
   \set x random_exponential(1, :scale * 10, 2.5) + 2.1
   \set y random(1, 9) + 17.1 * :x
   \set z case when :x > 7 then 1.0 / ln(:y) else 2.0 / sqrt(:y) end

   sh> time pgbench -f set.sql -T 10 -r
   latency average = 0.001 ms
   tps = 1304989.729560 (excluding connections establishing) # 1.3M
   statement latencies in milliseconds:
     0.000  \set x random_exponential(1, :scale * 10, 2.5) + 2.1
     0.000  \set y random(1, 9) + 17.1 * :x
     0.000  \set z case when :x > 7 then 1.0 / ln(:y) else 2.0 / sqrt(:y) end
   real 0m10.038s, user 0m10.003s, sys 0m0.000s

Again pgbench load is near 100%, with pgbench-only stuff (thread loop, 
expression evaluation, variables, stat collection) costing about 0.766 µs 
of cpu per script execution. This is about 10 times the previous case; 90% 
of the pgbench cpu cost is in expressions and variables, which is no 
surprise.
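
To make the 90% figure explicit: subtract the bare-loop cost measured in 
(0) from the per-script cost measured here (again checked with bc, the 
computation is mine):

   sh> echo "scale=3; (0.766 - 0.080) / 0.766" | bc
   .895   # ~90% of the per-script cpu goes to expressions and variables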

This sub-µs cost could probably be reduced… but what overall improvement 
would it provide? The last test gives an answer:


(2) a ridiculously small SQL query, tested through a local unix socket:

   sh> cat empty.sql
   ;
   # yep, an empty query!

   sh> time pgbench -f empty.sql -T 10 -r
   latency average = 0.016 ms
   tps = 62206.501709 (excluding connections establishing) # 62.2K
   statement latencies in milliseconds:
          0.016  ;
   real 0m10.038s, user 0m1.754s, sys 0m3.867s

This adds minimal libpq and underlying system-call costs on top of 
pgbench's internal cpu costs, with the most favorable (or worst:-) SQL 
query over the most favorable postgres connection.

Apparent load is about (1.754+3.867)/10.038 = 56%, so the cpu cost per 
script is 0.56 / 62206.5 = 9 µs, over 100 times the cost of a do-nothing 
script (0), and over 10 times the cost of a complex expression command 
script (1).

Conclusion: pgbench-specific overheads are typically (much) below 10% of 
the total client-side cpu cost of a transaction, while over 90% of that 
cpu cost is spent in libpq and the system, even for the worst-case 
do-nothing query.

A perfect benchmark driver with zero overhead would reduce the client-side 
cpu cost by at most 10%, because you still have to talk to the database 
through the system. If the pgbench cost were divided by two, which would 
be a reasonable achievement, the benchmark client cost would be reduced by 
about 5%.
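
To spell out the arithmetic behind this 5%, using the figures from tests 
(1) and (2) above (the bc checks are mine):

   sh> echo "scale=3; 0.766 / 9" | bc
   .085   # pgbench's own share of the 9 µs per-script cost, under 10%
   sh> echo "scale=3; 0.766 / 2 / 9" | bc
   .042   # halving pgbench's internal cost would save ~4-5% overall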

Wow?

I have already given some thought in the past to optimizing pgbench, 
especially to avoid long switches (e.g. in expression evaluation) and 
maybe to improve variable management, but as shown above I would not 
expect a gain worth the effort, and I assume that such a patch would 
probably be justly rejected, because for a realistic benchmark script 
these costs are already much smaller than the inevitable libpq/syscall 
costs.

That does not mean that nothing needs to be done, but the situation is 
currently quite good.

In conclusion, ISTM that current pgbench makes it possible to saturate a 
postgres server from a client significantly smaller than the server, which 
seems like a reasonable benchmarking situation. Any other driver, in any 
other language, would necessarily incur the same kind of costs.


> [...] And the largest part of the overhead is in pgbench's interpreter 
> loop:

Indeed, the figures below are very interesting! Thanks for collecting 
them.
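
For reference, a profile like the one quoted below can be collected along 
these lines; this is my guess at the invocation, Andres did not say 
exactly how it was gathered:

   sh> perf record -g -- pgbench -Mprepared -n -c 8 -j 8 -S pgbench_100 -T 10
   sh> perf report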

> +   12.35%  pgbench  pgbench                [.] threadRun
> +    3.54%  pgbench  pgbench                [.] dopr.constprop.0
> +    3.30%  pgbench  pgbench                [.] fmtint
> +    1.93%  pgbench  pgbench                [.] getVariable

~ 21%; probably some inlining has been performed, because I would have 
expected to see significant time in "advanceConnectionState".

> +    2.95%  pgbench  libpq.so.5.13          [.] PQsendQueryPrepared
> +    2.15%  pgbench  libpq.so.5.13          [.] pqPutInt
> +    4.47%  pgbench  libpq.so.5.13          [.] pqParseInput3
> +    1.66%  pgbench  libpq.so.5.13          [.] pqPutMsgStart
> +    1.63%  pgbench  libpq.so.5.13          [.] pqGetInt

~ 13%

> +    3.16%  pgbench  libc-2.28.so           [.] __strcmp_avx2
> +    2.95%  pgbench  libc-2.28.so           [.] malloc
> +    1.85%  pgbench  libc-2.28.so           [.] ppoll
> +    1.85%  pgbench  libc-2.28.so           [.] __strlen_avx2
> +    1.85%  pgbench  libpthread-2.28.so     [.] __libc_recv

~ 11%; string handling is a pain… Not sure who is calling, though: pgbench 
or libpq.

On the sample provided, this is basically 47% pgbench and 53% lib*. I'm 
unclear about where system time is measured.
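
If one wanted to separate user from kernel time in such a profile, perf 
event modifiers can restrict the sampled cycles; the commands below are a 
sketch, not what was actually run:

   sh> perf record -e cycles:u -- pgbench ...   # user-space cycles only
   sh> perf record -e cycles:k -- pgbench ...   # kernel cycles only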

> And that's the just the standard pgbench read/write case, without
> additional script commands or anything.

> Well, duh, that's because you're completely IO bound. You're doing
> 400tps. That's *nothing*. All you're measuring is how fast the WAL can
> be fdatasync()ed to disk.  Of *course* pgbench isn't a relevant overhead
> in that case.  I really don't understand how this can be an argument.

Sure. My interest in running it was to show that the \set stuff is 
ridiculously cheap compared to processing an actual SQL query, but that 
test does not allow analyzing all overheads. I hope that the 3 examples 
above make my point more understandable.

>> Also, pgbench overheads must be compared to an actual client application,
>> which deals with a database through some language (PHP, Python, JS, Java…)
>> the interpreter of which would be written in C/C++ just like pgbench, and
>> some library (ORM, DBI, JDBC…), possibly written in the initial language and
>> relying on libpq under the hood. Ok, there could be some JIT involved, but
>> it will not change that there are costs there too, and it would have to do
>> pretty much the same things that pgbench is doing, plus what the application
>> has to do with the data.
>
> Uh, but those clients aren't all running on a single machine.

Sure.

The cumulative power of those clients is probably much larger than the 
postgres server itself, and ISTM that pgbench allows simulating such loads 
with much smaller client-side requirements, and that no other tool could 
do much better.

-- 
Fabien.
