Re: Slowness of extended protocol - Mailing list pgsql-hackers

From Shay Rojansky
Subject Re: Slowness of extended protocol
Date
Msg-id CADT4RqAH-X4Nq9jG5qj+frGK5-2T3XKULwG0yZMJu6aAzm2YtQ@mail.gmail.com
In response to Re: Slowness of extended protocol  (Vladimir Sitnikov <sitnikov.vladimir@gmail.com>)
Responses Re: Slowness of extended protocol
List pgsql-hackers
On Sat, Aug 13, 2016 at 11:20 PM, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
Tatsuo>Interesting. What would happen if a user changes some of GUC
parameters? Subsequent session accidentally inherits the changed GUC
parameter?

Yes, that is what happens.
The idea is not to mess with gucs.

Wow... That is... insane...
 
Tatsuo>There's nothing wrong with DISCARD ALL
Tatsuo>"DISCARD ALL" is perfect for this goal.

It looks like you mean: "let's reset the connection state just in case".
I see where it might help: e.g. when providing database access to random people who might screw a connection in every possible way.

It's not about random people screwing up connections, it's about maintaining isolation between different connections... When using an out-of-process pool (e.g. pgbouncer/pgpool), it's entirely reasonable for more than one app to use the same pool. It's entirely possible for one app to change a GUC and then the other app goes boom. It's especially hard to debug too, because it depends on who got the physical connection before you.
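A minimal sketch of that failure mode at the SQL level (the table name and timeout value are hypothetical, but the error message is PostgreSQL's real one):

```sql
-- App 1, through the pool:
SET statement_timeout = '50ms';   -- tuned for its own short OLTP queries
-- ... connection goes back to the pool without any reset ...

-- App 2 is later handed the same physical connection:
SELECT count(*) FROM big_table;   -- a legitimately long-running report
-- ERROR:  canceling statement due to statement timeout
```

App 2 never touched statement_timeout, and whether it fails depends entirely on which app used the connection last.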

Even with in-process pools it's standard practice (and a good idea) to reset state. The idea is that pooling, much like a cache (see below!), is supposed to speed up your application without having any other visible effects. In other words, running your app with or without a pool should be identical except for performance aspects. If some part of a big complex application modified state, and then some other part happens to get that physical connection, it's extremely hard to make sense of things.

One note - in Npgsql's implementation of persistent prepared statements, instead of sending DISCARD ALL Npgsql sends the commands listed in https://www.postgresql.org/docs/current/static/sql-discard.html, except for DEALLOCATE ALL. This cleans up state changes but leaves prepared statements in place.
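For reference, the docs page linked above defines DISCARD ALL as equivalent to the following sequence, so the idea is to send everything except the DEALLOCATE ALL line:

```sql
CLOSE ALL;
SET SESSION AUTHORIZATION DEFAULT;
RESET ALL;
DEALLOCATE ALL;                   -- the one command Npgsql skips,
                                  -- so prepared statements survive
UNLISTEN *;
SELECT pg_advisory_unlock_all();
DISCARD PLANS;
DISCARD SEQUENCES;
DISCARD TEMP;
```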

Just in case: do you think one should reset CPU caches, OS filesystem caches, DNS caches, bounce the application, and bounce the OS in addition to "discard all"?
Why reset only "connection properties" when we can reset everything to its pristine state?

Um, because all the caches you mention are, well, caches - by definition they're not supposed to have any visible impact on any application, except for making it faster. This is somewhat similar to the CPU reordering you keep coming back to - it's totally invisible. GUCs are the exact opposite - you use them to modify how PostgreSQL behaves. So it makes perfect sense to reset them.

Just in case: PostgreSQL does not execute "discard all" on its own.

Of course it doesn't - it doesn't know anything about connection pooling, it only knows about physical connections. When would it execute "discard all" on its own?
 
If you mean "pgpool is exactly like reconnect to postgresql but faster since connection is already established", then OK, that might work in some cases (e.g. when response times/throughput are not important), however why force a "you must always start from scratch" execution model?

To enforce isolation, which is perhaps the most important property for keeping programs reliable - but this is a general theme you don't seem to agree with. Regardless, resetting state doesn't necessarily hurt response times or throughput.
 
Developers are not that dumb, and they can understand that "changing gucs at random is bad".

This has nothing to do with randomness - developers may have a legitimate reason to modify a GUC. In fact, GUCs are there precisely so that developers can change them...

Also, session state is not only about GUCs. DISCARD ALL also releases locks, resets authorization, resets sequence state (currval/nextval), etc. etc. All these things should not leak across connections.
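The sequence case is a nice concrete example of non-GUC state leaking (the sequence name here is hypothetical):

```sql
-- App 1, on a pooled connection:
SELECT nextval('order_id_seq');   -- say it returns 42

-- App 2 later gets the same physical connection, having never
-- called nextval itself:
SELECT currval('order_id_seq');   -- silently returns 42, instead of failing with
                                  -- "currval of sequence ... is not yet defined in this session"
```

With DISCARD SEQUENCES (or DISCARD ALL) in between, App 2 would get the error it should get.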

> Please read again. PreparedStatement is the only way to execute statements
> in JDBC API. There's no API that allows user to specify "use
> server-prepared here".
> Well, there's non-prepared API in JDBC, however it misses "bind
> variables" support,
> so if bind variables required, developer would use PreparedStatement.

> Java's PreparedStatement does not have an option to distinguish which statements
> should be server-prepared and which should not.

This is something new - maybe it's part of the misunderstanding here. To me, the term "prepared statements" always means "server-prepared statements"; this seems to be supported by the documentation you quote: "If the driver supports precompilation, the method prepareStatement will send the statement to the database for precompilation". I don't understand the concept of a prepared statement which isn't server-prepared - do you have some sort of driver-only prepared statement, and if so, what does it actually do?

Regardless of any optimizations you may be doing, in every database driver I've ever seen in my life, "prepared" simply means "server-prepared". And in every driver I've ever seen, there's an explicit API for that. Therefore server-prepare is something that's exposed to the developer, rather than some internal detail left to the driver.
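At the SQL level, "server-prepared" corresponds to the server-side PREPARE/EXECUTE cycle below (drivers typically do the same thing via the extended protocol's Parse/Bind/Execute messages rather than literal SQL; the statement and table names here are made up):

```sql
PREPARE get_user (int) AS         -- parsed and analyzed once, on the server
    SELECT * FROM users WHERE id = $1;

EXECUTE get_user(42);             -- reuses the server-side statement
EXECUTE get_user(43);

DEALLOCATE get_user;              -- frees the server-side resources
```

The point being that the prepared statement lives in the backend and consumes backend memory until it's deallocated - which is why it's state worth exposing to the developer.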

> Suppose backend can handle 20 server-prepared statements at most (if using more it would run out of memory).
> Suppose an application has 100 statements with ".prepare()" call.
> I think it is reasonable for the DB driver to figure out which statements are most important and server-prepare just the "20 most important ones",
> and leave the remaining 80 as regular non-prepared statements.

I seriously don't find it reasonable at all. LRU caching sometimes works well, and sometimes works very badly - that depends on the application and on the domain. Here's an example. Imagine two applications running against the same pool (or two large components within the same application). Each one has its own set of "hot" statements, but unfortunately not all of them fit within the prepared statement quota (20 in your example). Each component fills the cache, ejecting the other component's prepared statements from it. As we switch between the two components, statements constantly have to be reprepared.
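This thrashing is easy to demonstrate. Below is a tiny simulation (all numbers hypothetical: a 20-slot cache, two components with 15 hot statements each, alternating) using an access-ordered LinkedHashMap as the LRU cache. Since the combined working set of 30 distinct statements cycles through only 20 slots, LRU evicts every statement before its next use:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruThrash {
    // Returns how many times a statement had to be (re)prepared on the server.
    static int simulate(final int capacity, int hotPerComponent, int rounds) {
        // Access-ordered LinkedHashMap doubles as a minimal LRU cache.
        Map<String, Boolean> cache =
            new LinkedHashMap<String, Boolean>(capacity, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                    return size() > capacity;   // evict least-recently-used
                }
            };
        int prepares = 0;
        for (int round = 0; round < rounds; round++) {
            String component = (round % 2 == 0) ? "A" : "B"; // components alternate
            for (int i = 0; i < hotPerComponent; i++) {
                String sql = component + "-stmt-" + i;
                if (cache.get(sql) == null) {   // miss: must (re)prepare server-side
                    prepares++;
                    cache.put(sql, Boolean.TRUE);
                }
            }
        }
        return prepares;
    }

    public static void main(String[] args) {
        // 100 rounds x 15 statements = 1500 executions...
        System.out.println("prepares=" + simulate(20, 15, 100));
        // ...and every single one misses: the cache is pure overhead here.
    }
}
```

By contrast, if each component manually prepared only its 10 genuinely hottest statements (20 total, fitting the quota exactly), there would be just 20 prepares in total - which is the "programmer knowledge" point below.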

Manual work here would allow picking the truly hottest statements from each app/component and preparing those - allowing the hottest statements across both components to always remain live. This is exactly the "programmer knowledge" which the driver doesn't have - it only has its recently-used logic, which sometimes breaks down.

> Shay> I'm going to repeat what I said
> Shay> before and it would be good to get some reaction to this. Every software
> Shay> component in the stack has a role, and maintaining those separations is
> Shay> what keeps things simple and sane
>
> You might be missing my comments on CPU, x86, etc.
> My reaction is: almost every existing component is extremely hard to
> reason about.

> For instance: CPU has certain number of registers, it has certain
> amount of L1/L2/...
> caches and so on.
> Do you mean each and every developer should explicitly specify which
> program variable should use register and which one should go into L2 cache?

You keep making this mistaken analogy... Caches (and CPU reordering) are invisible, prepared statements aren't - they can trigger bugs (because of DDL), and much more importantly, they take up resources on the PostgreSQL side. Period. The L2 cache is already there and takes no additional resources whether you use it or not. It's invisible.
