Re: pgbench - implement strict TPC-B benchmark - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: pgbench - implement strict TPC-B benchmark
Date
Msg-id alpine.DEB.2.21.1908030845550.24235@lancre
In response to Re: pgbench - implement strict TPC-B benchmark  (Andres Freund <andres@anarazel.de>)
Responses Re: pgbench - implement strict TPC-B benchmark
List pgsql-hackers
Hello Andres,

>>> Using pgbench -Mprepared -n -c 8 -j 8 -S pgbench_100 -T 10 -r -P1
>>> e.g. shows pgbench to use 189% CPU in my 4/8 core/thread laptop. That's
>>> a pretty significant share.
>>
>> Fine, but what is the corresponding server load? 211%? 611%? And what actual
>> time is spent in pgbench itself, vs libpq and syscalls?
>
> System wide pgbench, including libpq, is about 22% of the whole system.

Hmmm. I guess that 189% CPU on 4 cores/8 threads and 22% overall load 
are consistent, since 189/800 = 23.6% ~ 22%.

Given the simplicity of the select-only transaction the stuff is CPU 
bound, so postgres 8 server processes should saturate the 4 core CPU, and 
pgbench & postgres are competing for CPU time. The overall load is 
probably 100%, i.e. 22% pgbench vs 78% postgres (assuming system is 
included), 78/22 = 3.5, i.e. pgbench on one core would saturate postgres 
on 3.5 cores on a CPU bound load.
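
To make the arithmetic above explicit, here is a small Python sketch. It is 
only my own illustration, not pgbench code: the function names are made up, 
and the saturated machine (100% overall load, shared only by pgbench and 
postgres) is the assumption stated above.

```python
def share_of_system(process_cpu_percent, hw_threads):
    """Convert a top-style CPU figure (100% = one hardware thread)
    into a fraction of the whole machine."""
    return process_cpu_percent / (hw_threads * 100.0)


def server_cores_per_client_core(client_share, total_load=1.0):
    """Assuming a saturated machine shared only by client and server,
    return how many cores of server work one core of client work drives."""
    server_share = total_load - client_share
    return server_share / client_share


# 189% CPU for pgbench on a machine with 8 hardware threads
pgbench_share = share_of_system(189, 8)
# 22% pgbench vs 78% postgres on a fully loaded machine
ratio = server_cores_per_client_core(0.22)
print(f"pgbench share: {pgbench_share:.1%}, server/client ratio: {ratio:.2f}")
```

The first figure is the 23.6% above, the second is the 78/22 ratio before 
rounding down to 3.5.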

I'm not shocked by these results for near worst-case conditions (i.e. the 
server side has very little to do).

It seems quite consistent with the really worst-case example I reported 
(empty query, cannot do less). Looking at the same empty-sql-query load 
through "htop", I have 95% postgres and 75% pgbench. This is not fully 
consistent with "time" which reports 55% pgbench overall, over 2/3 of 
which is in system and under 1/3 in pgbench, which must itself be divided 
between actual pgbench code and external libpq/lib* code.

Yet again, pgbench code is not the issue from my point of view, because 
time is spent mostly elsewhere and any other driver would have to do the 
same.

> As far as I can tell there's a number of things that are wrong:

Sure, I agree that things could be improved.

> - prepared statement names are recomputed for every query execution

I'm not sure it is a bug, but it should be precomputed somewhere.

> - variable name lookup is done for every command, rather than once, when
>  parsing commands

Hmmm. The names of variables are not all known in advance, e.g. with \gset. 
Possibly it does not matter, because the names of the variables actually 
used are known. Each used variable could get a number so that using a 
variable would be accessing an array at the corresponding index.
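
A minimal sketch of that numbering idea, in Python for brevity. This is not 
pgbench's actual C code: VariableTable, compile_command and run_command are 
hypothetical names, and only the simple ":name" reference form is handled.

```python
import re

# Simplified ":name" reference syntax, assumed for this illustration only.
VAR_REF = re.compile(r":([A-Za-z_][A-Za-z0-9_]*)")


class VariableTable:
    """Maps variable names to stable array slots (hypothetical helper)."""

    def __init__(self):
        self._slot_of = {}  # name -> index, consulted at parse time
        self.values = []    # variable values, indexed by slot

    def slot(self, name):
        """Return the slot for name, allocating one on first use."""
        if name not in self._slot_of:
            self._slot_of[name] = len(self.values)
            self.values.append(None)
        return self._slot_of[name]


def compile_command(text, table):
    """Resolve variable names to slots once, when the command is parsed."""
    compiled = []
    # With one capture group, re.split alternates literal, name, literal, ...
    for i, part in enumerate(VAR_REF.split(text)):
        if i % 2 == 0:
            compiled.append(("lit", part))
        else:
            compiled.append(("var", table.slot(part)))
    return compiled


def run_command(compiled, values):
    """Substitute values at execution time by index, with no name lookup."""
    return "".join(
        part if kind == "lit" else str(values[part])
        for kind, part in compiled
    )


table = VariableTable()
cmd = compile_command(
    "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;", table)
table.values[table.slot("aid")] = 42
print(run_command(cmd, table.values))
```

The name lookup happens once in compile_command; each execution then only 
indexes the values array.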

> - a lot of string->int->string type back and forths

Yep, that is a pain. ISTM that strings are exchanged at the protocol 
level, but this is libpq design, not pgbench.

As far as variable values are concerned, AFAICR conversions are performed 
on demand only, and just once.

Overall, my point is that even if all pgbench-specific costs were wiped 
out it would not change the final result (pgbench load) much because most 
of the time is spent in libpq and system. Any other test driver would 
incur the same cost.

>> Conclusion: pgbench-specific overheads are typically (much) below 10% of the
>> total client-side cpu cost of a transaction, while over 90% of the cpu cost
>> is spent in libpq and system, for the worst case do-nothing query.
>
> I don't buy that that's the actual worst case, or even remotely close to 
> it.

Hmmm. I'm not sure I can do much worse than 3 complex expressions against 
one empty sql query. Ok, I could put 27 complex expressions to reach 
50-50, but the 3-to-1 complex-expression-to-empty-sql ratio already seems 
ok for a realistic worst-case test script.

> I e.g. see higher pgbench overhead for the *modify* case than for
> the pgbench's readonly case. And that's because some of the meta
> commands are slow, in particular everything related to variables. And
> the modify case just has more variables.

Hmmm. WRT \set and expressions, the two main costs seem to be the large 
switch and the variable management. Yet again, I still interpret the 
figures I collected as showing that these costs are small compared to 
libpq/system overheads, and the overall result is below postgres' own CPU 
costs (on a per client basis).

>>> +   12.35%  pgbench  pgbench                [.] threadRun
>>> +    3.54%  pgbench  pgbench                [.] dopr.constprop.0
>>
>> ~ 21%, probably some inlining has been performed, because I would have
>> expected to see significant time in "advanceConnectionState".
>
> Yea, there's plenty inlining.  Note dopr() is string processing.

Which is a pain, no doubt about that. Some of it has been taken out of 
pgbench already, e.g. comparing commands vs using an enum.

>>> +    2.95%  pgbench  libpq.so.5.13          [.] PQsendQueryPrepared
>>> +    2.15%  pgbench  libpq.so.5.13          [.] pqPutInt
>>> +    4.47%  pgbench  libpq.so.5.13          [.] pqParseInput3
>>> +    1.66%  pgbench  libpq.so.5.13          [.] pqPutMsgStart
>>> +    1.63%  pgbench  libpq.so.5.13          [.] pqGetInt
>>
>> ~ 13%
>
> A lot of that is really stupid. We need to improve libpq. 
> PqsendQueryGuts (attributed to PQsendQueryPrepared here), builds the 
> command in many separate pqPut* commands, which reside in another 
> translation unit, is pretty sad.

Indeed, I'm definitely convinced that libpq costs are high and should be 
reduced where possible. Now, yet again, they are much smaller than the 
time spent in the system to send and receive the data on a local socket, 
so somehow they could be interpreted as good enough, even if not that 
good.

>>> +    3.16%  pgbench  libc-2.28.so           [.] __strcmp_avx2
>>> +    2.95%  pgbench  libc-2.28.so           [.] malloc
>>> +    1.85%  pgbench  libc-2.28.so           [.] ppoll
>>> +    1.85%  pgbench  libc-2.28.so           [.] __strlen_avx2
>>> +    1.85%  pgbench  libpthread-2.28.so     [.] __libc_recv
>>
>> ~ 11%, str is a pain… Not sure who is calling though, pgbench or
>> libpq.
>
> Both. Most of the strcmp is from getQueryParams()/getVariable(). The
> dopr() is from pg_*printf, which is mostly attributable to
> preparedStatementName() and getVariable().

Hmmm. Frankly I can optimize pgbench code pretty easily, but I'm not sure 
about maintainability, nor, as I said many times, about the real effect it 
would have, because these costs are a minor part of the client side of 
the benchmark.

>> This is basically 47% pgbench, 53% lib*, on the sample provided. I'm unclear
>> about where system time is measured.
>
> It was excluded in this profile, both to reduce profiling costs, and to
> focus on pgbench.

Ok.

If we take my other figures and round up, for a running pgbench we have 
1/6 actual pgbench, 1/6 libpq, 2/3 system.

If I get a factor of 10 speedup in actual pgbench (let us assume I'm that 
good:-), then the overall gain is 1/6 - 1/6/10 = 15%. I can do it, and 
it would be some fun, but the code would get ugly (not too bad, but 
nevertheless probably less maintainable, with a partial typing phase and 
expression compilation), and my bet is that, however good, the patch would 
be rejected.
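
For the record, the gain computation is the usual Amdahl-style estimate. A 
tiny Python sketch, where overall_gain is just a name I made up and the 1/6 
share is my rounded figure from above:

```python
def overall_gain(fraction, speedup):
    """Share of total client-side time saved when a component that takes
    `fraction` of the time becomes `speedup` times faster."""
    return fraction - fraction / speedup


gain = overall_gain(1 / 6, 10)
# Upper bound: the pgbench-specific part costs nothing at all.
best_case = overall_gain(1 / 6, float("inf"))
print(f"10x speedup: {gain:.0%}, upper bound: {best_case:.1%}")
```

Even an infinite speedup of the pgbench-specific part cannot save more than 
its 1/6 share.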

Do you see an error in my evaluation of pgbench actual costs and its 
contribution to the overall performance of running a benchmark?

If yes, which is it?

If not, do you think it advisable to spend time improving the evaluator & 
variable stuff and possibly other places for an overall 15% gain?

Also, what would be the likelihood of such an optimization patch passing?

I could eventually do a limited variable management improvement patch. I 
have funny ideas to speed up the thing, some of which are outlined above, 
some others even more terrible.

-- 
Fabien.
