Re: Slowness of extended protocol - Mailing list pgsql-hackers

From Shay Rojansky
Subject Re: Slowness of extended protocol
Date
Msg-id CADT4RqBtP0Yzhdhc2MB5YRSP3k4+rT1KX7yNV4rTV4xq3xp-sA@mail.gmail.com
In response to Re: Slowness of extended protocol  (Vladimir Sitnikov <sitnikov.vladimir@gmail.com>)
Responses Re: Slowness of extended protocol  (Vladimir Sitnikov <sitnikov.vladimir@gmail.com>)
List pgsql-hackers
Apologies, I accidentally replied off-list; here's the response I sent. Vladimir, I suggest you reply to this message with your own response...

On Sat, Aug 13, 2016 at 6:32 PM, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
Shay>To be honest, the mere idea of having an SQL parser inside my driver makes me shiver.

Same for me.
However, I cannot wait for a PostgreSQL 18 that does not need client-side parsing.

I'm glad we agree on something. For me, the problem has nothing to do with PostgreSQL or performance - it has to do with database APIs that impose a uniform parameter placeholder format and thereby force drivers to rewrite users' SQL queries. AFAIK, rewriting a user's SQL for optimization is entirely outside the scope of a driver's work.
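To make the placeholder problem concrete, here's a toy sketch - nothing like any real driver's parser - of the kind of rewrite a driver is forced into just because the client API mandates its own placeholder syntax (JDBC's '?' has to become PostgreSQL's '$n' before the query can go into an extended-protocol Parse message):

// Toy illustration only (not pgjdbc's or Npgsql's actual parser). A real driver
// must also skip string literals, quoted identifiers, dollar quoting and comments,
// which this naive sketch ignores.
public final class PlaceholderRewrite {
    static String rewrite(String jdbcSql) {
        StringBuilder out = new StringBuilder(jdbcSql.length() + 8);
        int paramIndex = 0;
        for (int i = 0; i < jdbcSql.length(); i++) {
            char c = jdbcSql.charAt(i);
            if (c == '?') {
                out.append('$').append(++paramIndex); // ? -> $1, $2, ...
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: SELECT name FROM users WHERE id = $1 AND group_id = $2
        System.out.println(rewrite("SELECT name FROM users WHERE id = ? AND group_id = ?"));
    }
}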

Shay>There's nothing your driver is doing that the application developer can't do themselves -
Shay>so your driver isn't faster than other drivers. It's faster only when used by lazy programmers.

I'm afraid you do not get the point.
ORMs like Hibernate, EclipseLink, etc. send regular "insert ... values" statements via the batch API.
For the developer, the only way to make use of "multivalues" is to implement either an "ORM fix", a "driver fix", or a "postgresql fix".
So the feature has very little to do with the laziness of programmers. The application developer simply does not have full control over each SQL statement when working through an ORM.
Do you suggest "stop using ORMs"? Do you suggest fixing all the ORMs so they use the optimal insert statement for each DB?
Do you suggest fixing postgresql?

Once again, the "multivalues rewrite at the pgjdbc level" enables the feature transparently for all users. If PostgreSQL 10/11 improves bind/exec performance, we could even drop that rewrite at the pgjdbc level and revert to the regular flow. That would again be transparent to the application.


I do get the point, and in fact I myself mentioned the ORM case above as an advantage of implicit query preparation.

First, there's nothing stopping an ORM from optimizing multiple inserts into a single multivalue insert. I admit I'm not aware of any that do this, but it's a good idea for an optimization - I happen to maintain the Entity Framework Core provider for Npgsql, so I may take a look at this optimization myself (so again, thanks for the idea). A rough sketch of what such a rewrite amounts to follows below.
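Purely for illustration - this is not pgjdbc's implementation, it assumes placeholders are already in '$n' form, and it ignores parameter-count and row-count limits - the rewrite boils down to something like:

// Hedged sketch of the "multivalues" rewrite being discussed: a batch of identical
// single-row INSERTs, as an ORM would emit through the JDBC batch API, collapsed
// into one multi-row VALUES statement.
public final class MultiValuesRewrite {
    // e.g. rewrite("INSERT INTO t (a, b) VALUES", 2, 3)
    //   => "INSERT INTO t (a, b) VALUES ($1,$2),($3,$4),($5,$6)"
    static String rewrite(String insertHead, int paramsPerRow, int rows) {
        StringBuilder sb = new StringBuilder(insertHead).append(' ');
        int p = 0;
        for (int r = 0; r < rows; r++) {
            sb.append(r == 0 ? "(" : ",(");
            for (int c = 0; c < paramsPerRow; c++) {
                if (c > 0) sb.append(',');
                sb.append('$').append(++p);
            }
            sb.append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("INSERT INTO t (a, b) VALUES", 2, 3));
    }
}

The point being that all the row parameters then travel in a single Bind/Execute pair instead of one pair per row.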

Second, it's well known that using an ORM almost always implies a performance sacrifice - it's a tradeoff that's accepted when going with an ORM. It's great that you optimize multiple inserts, but there are a myriad of other cases where an ORM generates less efficient SQL than what would otherwise be possible - and I don't think it makes sense for the driver to contain an SQL optimizer. Slightly worse performance isn't in itself a reason to drop ORMs: it's common practice to drop down to raw SQL for performance-critical operations, etc.

But all that isn't really important - I'm going to repeat what I said before, and it would be good to get some reaction to it. Every software component in the stack has a role, and maintaining those separations is what keeps things simple and sane. Just for fun, we could imagine a kernel network-stack feature that analyzes outgoing messages and optimizes them; we could even implement your multiple insert -> multivalue insert optimization there. This would have the advantage of working out of the box for every driver and every language out there (just like your driver provides it transparently for all ORMs). But nobody in their right mind would think of doing something like this, and for good reason.

The programmer's task is to write SQL, the driver's task is to communicate that SQL via the database-specific protocol, the kernel networking stack's task is to transmit that protocol over TCP, etc. If an ORM is used, the programmer effectively outsources the task of writing SQL to another component, which is supposed to do a good job of it. Once you start blurring all the lines here, everything becomes more complicated, brittle and hard to debug.

For what it's worth, I can definitely imagine your kind of optimizations living in some additional layer that the user chooses to use - an intermediate SQL optimizer between the application code (or ORM) and the driver. This "SQL optimizer" layer would keep the driver itself lean and simple (for users who *don't* want all the magic), while still allowing transparent optimizations for ORMs. Or, if the magic is implemented at the driver level, it should be opt-in, or at least easy to disable entirely.
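As a very rough sketch of what I mean - the SqlOptimizer type below is hypothetical, purely for illustration, and not an existing pgjdbc or Npgsql API:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical "optimizer layer" sitting between the application/ORM and the JDBC
// driver. The application opts into specific rewrites; the driver underneath stays
// lean and never touches the SQL it is given.
final class SqlOptimizer {
    private final List<UnaryOperator<String>> rewrites = new ArrayList<>();

    SqlOptimizer addRewrite(UnaryOperator<String> rewrite) {
        rewrites.add(rewrite);                 // e.g. a batch -> multi-values rule
        return this;
    }

    String optimize(String sql) {
        for (UnaryOperator<String> r : rewrites) {
            sql = r.apply(sql);
        }
        return sql;
    }

    PreparedStatement prepare(Connection conn, String sql) throws SQLException {
        return conn.prepareStatement(optimize(sql));
    }
}

The application (or an ORM integration) registers only the rewrites it wants; everything else goes to the driver untouched.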
 
Shay>are you sure there aren't "hidden" costs on the PostgreSQL side for generating so many implicit savepoints?

Technically speaking, I use the same savepoint name through bind/exec messages.

Out of curiosity, I whipped up a quick benchmark of the impact of adding a savepoint before every command. Each iteration creates a transaction, sends one Bind/Execute for "SELECT 1" (which was prepared during setup), then sends one Bind/Execute for "SAVEPOINT x; SELECT 1", then commits. The baseline simply sends "SELECT 1" twice (two Bind/Execute roundtrips) in the transaction. Here are the results against localhost (PG 9.5):

  Method |      Median |    StdDev | Scaled |    Op/s |
-------- |------------ |---------- |------- |-------- |
    With | 268.0387 us | 9.4800 us |   1.06 | 3678.32 |
 Without | 252.9685 us | 8.3573 us |   1.00 |  3896.9 |

The benchmark source code is at https://gist.github.com/roji/f5bf39bfc759503392e50585ececcfc0. The roughly 15-microsecond difference translates to a 6% performance degradation, probably due to the cost of processing the extra Bind/Execute pair for the savepoint, which perhaps could be improved in PostgreSQL. Regardless, you may say there's no business case, or that SELECT 1 against localhost means nothing, or that posting data like this is somehow wrong, but I don't find a 6% degradation very acceptable.
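For JDBC readers, a rough analogue of the same measurement would look something like the sketch below. To be clear, this is not the benchmark that produced the numbers above (that's the C# gist linked above); the connection string is made up, and a proper harness should be used for real numbers:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Rough JDBC analogue: time N short transactions, each running two prepared
// "SELECT 1"s, with and without an extra savepoint command in between.
public final class SavepointOverhead {
    static long nanosPerTxn(Connection conn, boolean withSavepoint, int iterations) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement select = conn.prepareStatement("SELECT 1")) {
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                select.executeQuery().close();
                if (withSavepoint) {
                    conn.setSavepoint("x"); // the extra per-transaction command
                }
                select.executeQuery().close();
                conn.commit();
            }
            return (System.nanoTime() - start) / iterations;
        } finally {
            conn.setAutoCommit(true);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical local database and credentials, purely for illustration.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "postgres", "postgres")) {
            System.out.println("without savepoint: " + nanosPerTxn(conn, false, 10_000) + " ns/txn");
            System.out.println("with savepoint:    " + nanosPerTxn(conn, true, 10_000) + " ns/txn");
        }
    }
}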

Shay>What you're doing is optimizing developer code, with the assumption that developers can't be trusted to code efficiently - they're going to write bad SQL and forget to prepare their statements

Please be careful. "You are completely wrong here", he-he. Well, you list the wrong assumption. Why do you think my main assumption is "developers can't be trusted"?

Because you said so above (Wed, Aug 10, 2016 at 8:37 PM):

> My experience shows, that people are very bad at predicting where the performance problem would be.
> For 80% (or even 99%) of the cases, they just do not care thinking if a particular statement should be server-prepared or not.
> They have absolutely no idea how much resources it would take and so on.

Maybe you're even right in saying these things, I don't know. But that doesn't mean that I, as a driver, should solve their problems for them. And I also get that you have an additional argument here besides programmer laziness/stupidity - the ORM argument - which makes more sense.

The proper assumption is: I follow the Java database API specification, and I optimize pgjdbc for the common use case (e.g. ORM or ORM-like).

Do you have any evidence that ORM or ORM-like is the common use case?
 
For instance, if a Java application wants to use bind values (e.g. to prevent security issues), then the only way is to go through java.sql.PreparedStatement.


Here's a quote:
Javadoc> Note: This method is optimized for handling parametric SQL statements that benefit from precompilation. If the driver supports precompilation, the method prepareStatement will send the statement to the database for precompilation. Some drivers may not support precompilation. In this case, the statement may not be sent to the database until the PreparedStatement object is executed. This has no direct effect on users; however, it does affect which methods throw certain SQLException objects.

The most important part is "if the driver supports precompilation..."
There's no API to enable/disable precompilation at all.
So, when using Java, there is no such thing as "statement.enableServerPrepare=true".

It is expected that the "driver" would "optimize" the handling somehow in the best possible way.

What? I really didn't understand your point here. All the doc is saying is that if the driver doesn't support prepared statements, then using them won't provide any benefit, etc. The very fact that the Java API *has* a PreparedStatement class means that statement preparation is exposed to the developer - the API provides methods that allow developers to choose *themselves* what to prepare. You chose to implicitly prepare statements even if PreparedStatement *isn't* used; you really can't claim that this is expected behavior (and indeed, most drivers I know don't do it).
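To spell out what I mean by developers choosing for themselves, here's a minimal sketch of perfectly ordinary JDBC code (the table and queries are made up for illustration); whether "prepare" maps to a server-prepared statement is of course up to the driver:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// The JDBC API itself lets the developer say which statements are prepared and
// reused, and which are one-off.
final class ExplicitChoice {
    static void queries(Connection conn) throws java.sql.SQLException {
        // One-off query: the developer deliberately uses a plain Statement.
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT count(*) FROM users")) {
            rs.next();
        }

        // Hot-path query: the developer deliberately prepares once and reuses it.
        try (PreparedStatement ps = conn.prepareStatement("SELECT name FROM users WHERE id = ?")) {
            for (int id = 1; id <= 100; id++) {
                ps.setInt(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) { /* ... */ }
                }
            }
        }
    }
}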
 
It is the Java API specification that enables me (as a driver developer) to be flexible and leverage database features so the end user gets the best experience.

In my opinion, the API neither enables nor prohibits what you're doing. But again, the very fact that the API includes PreparedStatement shows that JDBC considers preparation a programmer concern/decision, not an implicit driver decision.
