Re: Slowness of extended protocol - Mailing list pgsql-hackers

From Vladimir Sitnikov
Subject Re: Slowness of extended protocol
Msg-id CAB=Je-FnD8QLr4++Z+ADCYbsFxBEmxrQsyMq=2M5GpTW3UVciw@mail.gmail.com
In response to Re: Slowness of extended protocol  (Shay Rojansky <roji@roji.org>)
List pgsql-hackers
Shay> What? I really didn't understand your point here. All the doc is saying is
Shay> that if the driver doesn't support prepared statements, then using them

Please read again. PreparedStatement is the only way to execute statements
in the JDBC API. There is no API that allows the user to specify "use
server-prepared here".
Well, there is a non-prepared API in JDBC (plain Statement), but it lacks
"bind variables" support, so whenever bind variables are required, the
developer ends up using PreparedStatement.

Java's PreparedStatement does not have an option to distinguish which statements
should be server-prepared and which should not.
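
To illustrate (the table, query, and connection details below are made up
for this example): the only JDBC interface that takes bind variables is
PreparedStatement, and nothing in the call chain says "server-prepare this
one":

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class BindVariableExample {
        public static void main(String[] args) throws Exception {
            // connection URL and credentials are placeholders
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/test", "app", "secret")) {
                // The only JDBC way to pass bind variables. Note there is no
                // flag anywhere that says "server-prepare this statement".
                try (PreparedStatement ps = con.prepareStatement(
                        "SELECT name FROM users WHERE id = ?")) {
                    ps.setInt(1, 42);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getString(1));
                        }
                    }
                }
            }
        }
    }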

Vladimir>> My experience shows, that people are very bad at predicting where the
Vladimir>> performance problem would be.
Vladimir>> For 80% (or even 99%) of the cases, they just do not care thinking if a
Vladimir>> particular statement should be server-prepared or not.
Vladimir>> They have absolutely no idea how much resources it would take and so on.
Shay> Maybe you're even right saying these things, I don't know. But that doesn't
Shay> mean I as a driver should solve their problems for them. And I also get
Shay> that you have an additional argument here besides programmer
Shay> laziness/stupidity - the ORM argument - which makes more sense.

Suppose the backend can handle at most 20 server-prepared statements (with
more, it would run out of memory).
Suppose an application has 100 statements with a ".prepare()" call.
I think it is reasonable for the DB driver to figure out which statements
are most important, server-prepare just the "20 most important ones", and
leave the remaining 80 as regular non-prepared statements.

Do you think the DB driver should just follow the developer's advice and
server-prepare all 100 statements, crashing the backend?
Do you think the application developer should maintain a single list of
every statement ever used in the application and make sure there are no
more than 20 queries in it?
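
If it helps, here is a rough sketch of the kind of bounded cache a driver
can keep. This is NOT pgjdbc's actual code, and the names are made up;
"most important" is approximated here simply by "most recently used":

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ServerPreparedCache {
        private static final int MAX_SERVER_PREPARED = 20;

        // access-ordered LinkedHashMap gives LRU eviction for free;
        // the value would be a handle to a named server-side statement
        private final Map<String, Object> cache =
                new LinkedHashMap<String, Object>(32, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                        if (size() > MAX_SERVER_PREPARED) {
                            deallocateOnServer(eldest.getValue()); // drop the 21st statement
                            return true;
                        }
                        return false;
                    }
                };

        public Object getOrPrepare(String sql) {
            return cache.computeIfAbsent(sql, this::prepareOnServer);
        }

        private Object prepareOnServer(String sql) {
            // a real driver would send a Parse message for a named statement here
            return new Object();
        }

        private void deallocateOnServer(Object handle) {
            // a real driver would send a Close message here
        }
    }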

My main point is not "developers are stupid", but "people often guess wrong
when it comes to performance". There are too many moving parts, so it is
hard to predict the performance implications.

Often it is much easier to execute a series of benchmarks that validate a
particular hypothesis.
For instance, as Tatsuo says, the savepoint overhead for DML is higher than
the savepoint overhead for SELECT, so I plan to add that benchmark as well.


Shay> First, there's nothing stopping an ORM from optimizing multiple inserts
Shay> into a single multivalue insert. I do admit I'm not aware of any who do
Shay> this, but it's a good idea for an optimization - I happen to maintain the
Shay> Entity Framework Core provider for Npgsql, I might take a look at this
Shay> optimization (so again thanks for the idea).

Nothing stops them, but M frameworks times N database drivers results in
M*N effort for each feature.
As you say: the application should just use the batch API, and it is the
driver's job to convert that into the sequence of bytes suitable for the
database.
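
For reference, this is roughly what "just use the batch API" looks like on
the application side (table name and connection details are made up); the
driver is then free to turn the queued rows into whatever wire traffic it
likes:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BatchInsertExample {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/test", "app", "secret")) {
                con.setAutoCommit(false);
                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO items (id, name) VALUES (?, ?)")) {
                    for (int i = 0; i < 1000; i++) {
                        ps.setInt(1, i);
                        ps.setString(2, "item-" + i);
                        ps.addBatch();            // queue the row; nothing is sent yet
                    }
                    int[] counts = ps.executeBatch(); // driver picks the wire format
                    System.out.println("batch entries: " + counts.length);
                    con.commit();
                }
            }
        }
    }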

Same for Npgsql: if you implement the rewrite at the Npgsql level, that
would automatically improve all the frameworks/applications running on top
of Npgsql.


Shay> I'm going to repeat what I said
Shay> before and it would be good to get some reaction to this. Every software
Shay> component in the stack has a role, and maintaining those separations is
Shay> what keeps things simple and sane

You might be missing my comments on CPU, x86, etc.
My reaction is: almost every existing component is extremely hard to
reason about.

For instance: a CPU has a certain number of registers, a certain amount of
L1/L2/... cache, and so on.
Do you mean each and every developer should explicitly specify which
program variable should live in a register and which one should go into
the L2 cache?

This is a counter-example to your "sane" "separation". The CPU is free to
reorder the instruction stream as long as the net result complies with the
specification. In the same way, the CPU is free to use the L1/L2 caches in
whatever way it thinks is best.
Note: a typical DB driver developer does not try to maintain a set of
"optimal assembly instructions". The driver developer relies on the
compiler and the CPU to optimize the driver's code into the best machine
code.

Of course a driver might contain inline assembly, but that is not how
mainstream drivers are written.

Another example: the TCP stack. When the DB driver sends some data, the
kernel is free to reorder packets, to interleave or delay them, or even to
use multiple network cards to send a single TCP stream.
Windows 10 includes several performance improvements to the TCP stack, and
it is nowhere near "the kernel does exactly what the application/driver
coded".
Once again: the application/driver developer does not optimize for
specific hardware (e.g. a network card). Developers just use a common API,
and it is the kernel's job to apply the best optimizations for the
particular HW.

The same goes for the ORM-DB combo. The ORM uses the DB driver's API, and
it is the driver's job to use the optimal command sequence for the specific
database.
Note: of course an ORM might use vendor-specific features for a good
reason, however simple cases like "batch insert" are both common and not a
top priority from the "ORM developer" perspective.

The same goes for the application-driver-DB combination. The application
just uses the driver's API, and it is not the application's business which
bytes are sent over the wire. Nor should the application specify which
particular packets to use.

Note: I do not suggest rewriting SELECTs at the driver level yet; I do not
know a common pattern there.
For "insert multivalues" the pattern is simple: the user sends insert
statements via the batch API. It happens quite often, as it is the standard
way of inserting data in bulk, and it is recommended in all the books and
articles on the JDBC API. Not all databases support multivalues, so a
cross-database application would likely just use "insert single values
through the batch API" and rely on the DB driver and the database to do the
right thing.


Of course, having that fix at the network stack level would automatically
improve all the PostgreSQL drivers, however:
1) It is NOT possible, since multivalues have different SEMANTICS. When
sending individual inserts, each individual response carries a "number of
rows inserted" field. When using a multivalues insert, there will be just
one insert with the total number of rows inserted.

If the application logic expected separate row counts, then the kernel
stack is in trouble: it cannot recover the individual counts from a
combined response.

That is why a multivalues rewrite at the TCP level is impossible.

So, the PostgreSQL network protocol v3 prohibits the kernel from doing an
automatic multivalues optimization.

Let me show you how the JDBC API enables PgJDBC to do that optimization:
when an application uses the JDBC batch API, it does receive a "per
statement row count" as the batch result. However, there is a special
result code that says "the statement did succeed, but the row count is not
known". Developers are prepared for the DB driver returning "not known" for
all the rows, so PgJDBC is free to use multivalues, or COPY, or whatever
makes most sense for a particular batch.
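
In code, that contract looks roughly like the sketch below.
Statement.SUCCESS_NO_INFO is the standard JDBC constant for "succeeded, row
count unknown"; the helper method and its return convention are made up for
illustration:

    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class BatchResultHandling {
        // A caller that handles SUCCESS_NO_INFO leaves the driver free to
        // rewrite the batch (multivalues, COPY, ...) under the hood.
        static long totalRows(PreparedStatement ps) throws SQLException {
            long total = 0;
            boolean exact = true;
            for (int count : ps.executeBatch()) {
                if (count == Statement.SUCCESS_NO_INFO) {
                    exact = false;           // succeeded, per-statement count unknown
                } else if (count >= 0) {
                    total += count;          // exact per-statement count available
                }
            }
            return exact ? total : -1;       // -1: "all rows OK, exact total unknown"
        }
    }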

2) There are other implications of doing that optimization at the kernel
level (e.g. overhead for MySQL users), but since "the optimization is not
possible due to semantics", the other reasons are not important.

Shay> Or if the magic is implemented at the
Shay> driver level, it should be opt-in, or at least easy to disable entirely.

"Optimizations in the CPU should be opt-in or at least easy to
disable". Does that sound right to you?
For debugging purposes, it could make sense. However, for regular
production usage, optimizations
should be enabled by default. Regular Intel CPUs come with
optimizations enabled by default.

The multivalues rewrite does comply with the JDBC batch API, so it causes
no surprise even if the optimization is enabled by default.

Shay> So the
Shay> several-microsecond difference translates to a 6% performance degradation.
Shay> This is probably due to the slowness of processing the extra Bind/Execute
Shay> pair for the savepoint, which maybe can be improved in PostgreSQL.
Shay> Regardless, you may say there's no business case or that SELECT 1 against
Shay> localhost means nothing, or that posting data is somehow wrong,
but I don't
Shay> find this very acceptable.

In this particular case the purpose is to measure the overhead. In your
case a savepoint seems to cost 15us; it costs 3us for me, but that does not
matter much. If each and every query is prefixed with a savepoint, that
adds something like 15ms of overhead per 1000 queries.

I think if an application issues 1000 queries, it would anyway spend much
more time processing the results and figuring out "which query to issue /
which values to pass".

However, that voltmeter is broken, as Tatsuo has pointed out. He chimed in
on the pgjdbc issue tracker and pointed out that a
"savepoint-DML-savepoint-DML-..." scenario would allocate an extra xid for
each savepoint. The higher the xid allocation rate, the harder the case is
for vacuum.

So we need to measure:
1) "savepoint-DML-savepoint-DML-..." in order to find the savepoint
overhead in the DML case (a rough JDBC sketch of that loop is below).
Note: pgjdbc's autosave=conservative would not issue a savepoint before
regular DMLs, as a DML cannot fail with a "cached plan ... result type"
error, so a "DML-select-DML-select" scenario is required there.
2) "real life applications" in order to find how the extra savepoints
affect the xid allocation rate.
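
A rough sketch of scenario 1) as a plain JDBC loop (table name, iteration
count, and connection details are made up; a real benchmark should use a
proper harness and also watch xid consumption on the server side):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Savepoint;

    public class SavepointDmlLoop {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/test", "app", "secret")) {
                con.setAutoCommit(false);
                try (PreparedStatement ps = con.prepareStatement(
                        "UPDATE counters SET value = value + 1 WHERE id = ?")) {
                    long start = System.nanoTime();
                    for (int i = 0; i < 1000; i++) {
                        Savepoint sp = con.setSavepoint(); // savepoint before each DML
                        ps.setInt(1, i % 10);
                        ps.executeUpdate();
                        con.releaseSavepoint(sp);
                    }
                    con.commit();
                    // total ns / 1000 statements / 1000 = microseconds per statement
                    System.out.printf("avg per statement: %d us%n",
                            (System.nanoTime() - start) / 1000 / 1000);
                }
            }
        }
    }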

Vladimir


