Thread: Cached/global query plans, autopreparation

Cached/global query plans, autopreparation

From
Shay Rojansky
Date:
Hi all.

Various versions of having PostgreSQL caching and/or autopreparing statement plans have been discussed (https://www.postgresql.org/message-id/op.t9ggb3wacigqcu%40apollo13.peufeu.com, https://www.postgresql.org/message-id/8e76d8fc-8b8c-14bd-d4d1-e9cf193a74f5%40postgrespro.ru), without clear conclusions or even an agreement on what might be worthwhile to implement. I wanted to bring this up again from a PostgreSQL driver maintainer's perspective (I'm the owner of Npgsql, the open source .NET driver); apologies in advance if I'm repeating things or I've missed crucial information. Below I'll describe three relevant issues and what I've done to deal with them.

When the same statement is rerun, preparing it yields a very significant performance boost. However, in short-lived connection scenarios it's frequently not possible to benefit from this - think of a typical webapp which allocates a connection from a pool, runs a query and then returns the connection. To make sure prepared statements are used, Npgsql's connection pool doesn't send DISCARD ALL when a connection is returned (to avoid wiping out the prepared statements), and maintains an internal table mapping SQL (and parameter types) to a PostgreSQL statement name. The next time the application attempts to prepare the same SQL, the prepared statement is found in the table and no preparation needs to occur. This means that prepared statements persist across pooled connection open/close, and are never discarded unless the user uses a specific API. While this works (a rough sketch follows the list below), the disadvantages are that:
1. This kind of mechanism needs to be implemented again and again, in each driver.
2. It relies on Npgsql's internal pooling, which can track persistent prepared statements on physical connections. If an external pool is used (e.g. pgpool), this isn't really possible.
3. It complicates resetting the session state (instead of DISCARD ALL, a combination of all the other reset commands except DEALLOCATE ALL needs to be sent). This is minor.
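
Purely as an illustration, that internal table amounts to something like the following Java-flavored sketch (names and structure are hypothetical, not Npgsql's actual internals):

    import java.util.HashMap;
    import java.util.Map;

    // One instance per physical connection; the connection is used
    // single-threaded, so no synchronization is needed.
    final class PersistentPreparedStatements {
        private final Map<String, String> statementNames = new HashMap<>();
        private int counter = 0;

        // The key combines the SQL text with the parameter type signature,
        // since the same SQL prepared with different parameter types needs
        // a distinct server-side statement.
        private static String key(String sql, String[] paramTypes) {
            return sql + "\u0000" + String.join(",", paramTypes);
        }

        // Returns the existing server-side statement name, or null if the
        // caller must send a Parse message and then call register().
        String lookup(String sql, String[] paramTypes) {
            return statementNames.get(key(sql, paramTypes));
        }

        String register(String sql, String[] paramTypes) {
            String name = "_ps" + (++counter);
            statementNames.put(key(sql, paramTypes), name);
            return name;
        }
    }

On reset, instead of DISCARD ALL the pool then sends the individual commands DISCARD ALL expands to, minus DEALLOCATE ALL - something like CLOSE ALL; SET SESSION AUTHORIZATION DEFAULT; RESET ALL; UNLISTEN *; SELECT pg_advisory_unlock_all(); DISCARD SEQUENCES; DISCARD TEMP (the exact list depends on the server version).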

The second issue is that many applications don't work directly against the database API (ADO.NET in .NET, JDBC in Java). If any sort of O/RM or additional layer is used, there's a good chance that that layer doesn't prepare in any way, and indeed hides your access to the database API's preparation method. Two examples from the .NET world are Dapper (a very popular micro-O/RM) and Entity Framework. In order to provide the best possible performance in these scenarios, Npgsql has an opt-in feature whereby it tracks how many times a given statement was executed, and once it passes a certain threshold automatically prepares it. An LRU cache is then used to decide which prepared statements to discard, to avoid unbounded growth. In effect, statement auto-preparation is implemented in the driver (a rough sketch follows the list below). I know that the JDBC driver also implements such a mechanism (it was actually the inspiration for the Npgsql feature). The issues with this are:

1. As above, this has to be implemented by every driver (and is quite complex to do well).
2. There's a possible missed opportunity in having a single plan on the server, as each connection has its own (the "global plan" option). Many apps out there send the same statements across many connections, so this seems relevant - but I don't know if the gains outweigh the contention impact in PostgreSQL.
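
To make the mechanism concrete, here is a rough Java sketch of the threshold-plus-LRU policy described above (hypothetical names, not the actual Npgsql or pgjdbc code):

    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Counts executions per SQL text; once a statement crosses the
    // threshold it is prepared server-side, and an LRU bound caps how
    // many prepared statements a connection may accumulate.
    final class AutoPrepareCache {
        private static final int PREPARE_THRESHOLD = 5; // executions before preparing
        private static final int MAX_PREPARED = 64;     // LRU capacity

        private final Map<String, Integer> useCounts = new HashMap<>();

        // Access-ordered LinkedHashMap: the eldest entry is the least
        // recently used prepared statement.
        private final Map<String, String> prepared =
            new LinkedHashMap<String, String>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                    if (size() > MAX_PREPARED) {
                        deallocateOnServer(eldest.getValue());
                        return true;
                    }
                    return false;
                }
            };

        // Returns the server-side statement name to execute with, or null
        // to run the statement unprepared this time.
        String onExecute(String sql) {
            String name = prepared.get(sql); // get() also refreshes LRU order
            if (name != null)
                return name;
            if (useCounts.merge(sql, 1, Integer::sum) >= PREPARE_THRESHOLD) {
                name = prepareOnServer(sql);
                prepared.put(sql, name);
                useCounts.remove(sql);
            }
            return name;
        }

        // Stand-ins for the driver's wire-protocol internals.
        private String prepareOnServer(String sql) { return "_auto" + sql.hashCode(); }
        private void deallocateOnServer(String name) { /* send Close message */ }
    }

(pgjdbc exposes a similar knob as its prepareThreshold connection parameter, which also defaults to 5.)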

Finally, since quite a few (most?) other databases include autopreparation (SQL Server, Oracle...), users porting their applications - which don't explicitly prepare - experience a big performance drop. It can rightly be said that porting an application across databases isn't a trivial task and that adjustments need to be made, but from experience I can say that PostgreSQL is losing quite a few users to this.

The above issues could be addressed by having PostgreSQL cache plans on its side (especially the second issue, which is the most important). Ideally, any adopted solution would be transparent and not require any modification to applications. It would also not impact explicitly-prepared statements in any way.

Note that I'm not arguing for any specific implementation on the PostgreSQL side (e.g. global or not), but just describing a need and hoping to restart a conversation that will lead somewhere. 

(and thanks for reading this overly long message!)

Shay

Re: Cached/global query plans, autopreparation

From
Shay Rojansky
Date:
Hi all,

Was wondering if anyone has a reaction to my email below about statement preparation, was it too long? :)

(and sorry for top-posting)


Re: Cached/global query plans, autopreparation

From
"henry@visionlink.org"
Date:
Coming from a PHP application I have several of the same concerns and wishes. Given that PHP cannot share any resources between requests, it would be impossible to accomplish what you have in .NET. We still prepare statements, though, for use in result sets and other loops (ORM-driven). I'm wondering if it is possible to solve this as an extension to Postgres? If Citus could refactor their code into an extension, surely this is simple in comparison. I would be willing to help, but PHP has made me soft, so it will take a bit longer.

Thanks,

Chet Henry



Re: Cached/global query plans, autopreparation

From
Andres Freund
Date:
Hi,

On 2018-02-13 09:13:09 -0800, Shay Rojansky wrote:
> Was wondering if anyone has a reaction to my email below about statement
> preparation, was it too long? :)

Well, the issue is that implementing this is a major piece of work. This
post offers neither resources nor a simpler way to do so. There's
no huge debate about the benefit of having a global plan cache, so I'm
not that surprised there's not a huge debate about a post arguing that
we should have one.

- Andres


Re: Cached/global query plans, autopreparation

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2018-02-13 09:13:09 -0800, Shay Rojansky wrote:
>> Was wondering if anyone has a reaction to my email below about statement
>> preparation, was it too long? :)

> Well, the issue is that implementing this is a major piece of work. This
> post offers neither resources nor a simpler way to do so. There's
> no huge debate about the benefit of having a global plan cache, so I'm
> not that surprised there's not a huge debate about a post arguing that
> we should have one.

Actually, I'm pretty darn skeptical about the value of such a cache for
most use-cases.  But as long as it can be turned off and doesn't leave
residual overhead nor massive added code cruft, I won't stand in the way
of someone else investing their time in it.

In any case, as you say, it's moot until somebody steps up to do it.

            regards, tom lane


Re: Cached/global query plans, autopreparation

From
"henry@visionlink.org"
Date:
Any idea on how feasible it would be as an extension, or is the work too central to abstract that way?

Chet Henry
Senior Software Developer - Dev Ops Liaison
VisionLink, Inc.
3101 Iris Ave, Ste 240
Boulder, CO 80301
 henry@visionlink.org

      



Re: Cached/global query plans, autopreparation

From
Konstantin Knizhnik
Date:




I am the author of one of the proposals (autoprepare, which is in the commitfest now), but I think that sooner or later Postgres has to come to a solution with shared DB caches/prepared plans.
Please correct me if I am wrong, but it seems to me that most other top DBMSes have something like this.
Such a solution can provide a lot of different advantages:
1. Better memory utilization: no need to store the same data N times, where N is the number of backends, or to spend time warming the cache.
2. The optimizer can spend more time choosing a better plan, which can then be used by all clients. Even now, the compilation time of some queries exceeds their execution time several times over.
3. It is simpler to add facilities for query plan tuning and maintenance (storing, comparing, ...).
4. It makes it possible to control the amount of memory used by caches. Right now, the catalog cache for a database with hundreds of thousands of tables and indexes, multiplied by hundreds of backends, can consume terabytes of memory.
5. Shared caches can simplify the invalidation mechanism.
6. Almost all enterprise systems working with Postgres have to use some kind of connection pooling (pgbouncer, pgpool, ...), which almost excludes the possibility of using prepared statements. This can slow down performance by up to two times.

There is just one (but very important) problem which needs to be solved: access to the shared cache has to be synchronized.
But there are a lot of other shared resources in Postgres (procarray, shared buffers, ...). So I do not think that this is an unsolvable problem or that it has to degrade performance.

So it seems to be obvious that shared caches/plans can provide a lot of advantages. But the value of these advantages for real customers is still not clear to me.
Using the -M prepared protocol in a pgbench workload can improve speed by up to two times. But I have asked real Postgres users at Avito, Yandex, and MyOffice, and they told me that on their workloads the advantage of prepared statements is about 10%. A 10% performance improvement is definitely not good compensation for rewriting a substantial part of the Postgres core...

Another aspect is that Java, .NET and other languages have their own mechanisms for preparing statements. I still do not think that the question of whether to prepare a statement or not can be solved just at the API level, without interaction with the database engine. A language binding can compare the costs of generic and specialized plans, share prepared statements between all database connections, ... But I think that it is more natural and efficient to implement this logic in one place, rather than try to reimplement it several times for different APIs.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Cached/global query plans, autopreparation

From
Shay Rojansky
Date:
>> Well, the issue is that implementing this is a major piece of work. This
>> post offers neither resources nor a simpler way to do so. There's
>> no huge debate about the benefit of having a global plan cache, so I'm
>> not that surprised there's not a huge debate about a post arguing that
>> we should have one.
>
> Actually, I'm pretty darn skeptical about the value of such a cache for
> most use-cases.  But as long as it can be turned off and doesn't leave
> residual overhead nor massive added code cruft, I won't stand in the way
> of someone else investing their time in it.
>
> In any case, as you say, it's moot until somebody steps up to do it.

Well, looking at previous conversations and also at the comments above, it doesn't seem like there's a consensus on whether this feature would even be beneficial... The point of my email above was to have that conversation before looking into implementation. Tom, I'm especially interested in understanding why you think this cache wouldn't help most use-cases: I see many applications which don't prepare (e.g. because they use data access layers/O/RMs which neither prepare nor expose access to preparation), and implementing this in the driver seems like the wrong way (although Npgsql and JDBC do it, at least some other languages' drivers don't).

In addition, there are various options and possibilities here, and there seems to be no consensus about those either:

* How should statement plan caching be done for unprepared statements - what strategy should be used? Should a threshold number of unprepared executions be required before PostgreSQL decides to prepare? Should there be a maximum number of autoprepared statements, ejecting the least recently used one to make room for a new one? Or something else?
* Should the cached plans be shared across connections ("global" cached statements)? Are the savings from global caching greater than the cost of the contention? The savings include both (a) not having to re-prepare the same statement N times on different connections (typically just a one-time application warm-up cost), and (b) not having the memory duplication of N identical statements across connections (a constant cost, not just warm-up - but I'm not sure how significant this is). Note that the global/shared discussion is somewhat orthogonal to the general autopreparation conversation - the latter has value with or without the former.

Essentially I think it's a good idea to have a conversation about all this before anyone jumps into implementation.

Re: Cached/global query plans, autopreparation

From
Shay Rojansky
Date:
> I am the author of one of the proposals (autoprepare, which is in the commitfest now), but I think that sooner or later Postgres has to come to a solution with shared DB caches/prepared plans. [...]
> 6. Almost all enterprise systems working with Postgres have to use some kind of connection pooling (pgbouncer, pgpool, ...), which almost excludes the possibility of using prepared statements. This can slow down performance by up to two times.

Just wanted to say I didn't see this email before my previous response, but I agree with all of the above. The last point is particularly important, especially for short-lived connection scenarios, the most typical of which is web.
 
> There is just one (but very important) problem which needs to be solved: access to the shared cache has to be synchronized. [...]
> So it seems to be obvious that shared caches/plans can provide a lot of advantages. But the value of these advantages for real customers is still not clear to me. [...] I have asked real Postgres users at Avito, Yandex, and MyOffice, and they told me that on their workloads the advantage of prepared statements is about 10%. A 10% performance improvement is definitely not good compensation for rewriting a substantial part of the Postgres core...

Just wanted to say that I've seen more than 10% improvement in some real-world applications when preparation was done properly. Also, I'm assuming that implementing this wouldn't involve "rewriting a substantial part of the Postgres core", and even 10% is quite a big gain, especially if it's a transparent/free one as far as the user is concerned (no application changes).

Re: Cached/global query plans, autopreparation

From
Jorge Solórzano
Date:

On Thu, Feb 15, 2018 at 8:00 AM, Shay Rojansky <roji@roji.org> wrote:
> Just wanted to say that I've seen more than 10% improvement in some real-world applications when preparation was done properly. [...]

 
A 10% improvement in the real world can be pretty significant. I don't know how complicated it would be to implement this in the Postgres core, but how about adding it to the GSoC 2018 ideas[1]?

[1] https://wiki.postgresql.org/wiki/GSoC_2018

Re: Cached/global query plans, autopreparation

From
Bruce Momjian
Date:
On Thu, Feb 15, 2018 at 03:00:17PM +0100, Shay Rojansky wrote:
> Just wanted to say that I've seen more than 10% improvement in some real-world
> application when preparation was done properly. Also, I'm assuming that
> implementing this wouldn't involve "rewriting substantial part of Postgres
> core", and that even 10% is quite a big gain, especially if it's a transparent/
> free one as far as the user is concerned (no application changes).

I would like to step back on this issue.  Ideally, every query would get
re-optimized because we can only be sure the plan is optimal when we use
supplied constants to generate the plan.  But, of course, parsing and
planning take time, so there ideally would be a way to avoid it.  The
question is always how much time will be saved by avoiding
parsing/planning, and what risk is there of suboptimal plans.

Postgres uses a conservative method for reusing plans with previous
constants, as described in the PREPARE manual page:

    https://www.postgresql.org/docs/10/static/sql-prepare.html
    Prepared statements can use generic plans rather than re-planning with
    each set of supplied EXECUTE values. This occurs immediately for prepared
    statements with no parameters; otherwise it occurs only after five or more
    executions produce plans whose estimated cost average (including planning
    overhead) is more expensive than the generic plan cost estimate. Once
    a generic plan is chosen, it is used for the remaining lifetime of the
    prepared statement. Using EXECUTE values which are rare in columns with
    many duplicates can generate custom plans that are so much cheaper than
    the generic plan, even after adding planning overhead, that the generic
    plan might never be used.

While I have heard people complain about how other databases cache
prepare plans, I have heard few complaints about the Postgres approach,
and I haven't even heard of people asking to control the documented "five
or more" behavior.
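
For what it's worth, the documented behavior is easy to observe from
application code.  Here is a minimal sketch (assuming a pgbench database
and JDBC; EXPLAIN EXECUTE plans the statement the same way the next
execution would):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Executes a SQL-level prepared statement five times (custom plans),
    // then prints the plan chosen for the sixth execution, which may be
    // the generic plan if its estimated cost compares favorably.
    public final class GenericPlanDemo {
        public static void main(String[] args) throws Exception {
            try (Connection c = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/postgres");
                 Statement s = c.createStatement()) {
                s.execute("PREPARE q(int) AS " +
                          "SELECT abalance FROM pgbench_accounts WHERE aid = $1");
                for (int i = 1; i <= 5; i++)
                    s.execute("EXECUTE q(" + i + ")");
                try (ResultSet rs = s.executeQuery("EXPLAIN EXECUTE q(6)")) {
                    while (rs.next())
                        System.out.println(rs.getString(1));
                }
            }
        }
    }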

I also know that other database products have more sophisticated prepare
usage, but they might have higher parse/plan overhead, or they might be
more flexible in handling specialized workloads, which Postgres might
not want to handle, given the costs/complexity/overhead.

So, the directions for improvement are:

1  Improve the existing "five or more" behavior
2  Automatically prepare queries that are not sent as prepared queries
3  Share plans among sessions

While #1 would be nice, #2 increases the number of applications that can
silently benefit from prepared queries, and #3 improves the number of
cases that query plans can be reused.  The issue with #3 is that the
constants used are no longer local to the session (which is the same
issue with connection poolers reusing prepared plans).  When different
sessions with potentially more varied constants reuse plans, the
probability of suboptimal plans increases.

I think the fact that pgbench shows a 2x improvement for prepared
statements, and real world reports are a 10% improvement means we need
to have a better understanding of exactly what workloads can benefit
from this, and a comprehensive analysis of all three areas of
improvement.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: Cached/global query plans, autopreparation

From
Andres Freund
Date:
On 2018-03-02 15:29:09 -0500, Bruce Momjian wrote:
> Postgres uses a conservative method for reusing plans with previous
> constants, as described in the PREPARE manual page:
> 
>     https://www.postgresql.org/docs/10/static/sql-prepare.html
>     Prepared statements can use generic plans rather than re-planning with
>     each set of supplied EXECUTE values. This occurs immediately for prepared
>     statements with no parameters; otherwise it occurs only after five or more
>     executions produce plans whose estimated cost average (including planning
>     overhead) is more expensive than the generic plan cost estimate. Once
>     a generic plan is chosen, it is used for the remaining lifetime of the
>     prepared statement. Using EXECUTE values which are rare in columns with
>     many duplicates can generate custom plans that are so much cheaper than
>     the generic plan, even after adding planning overhead, that the generic
>     plan might never be used.
> 
> While I have heard people complain about how other databases cache
> prepare plans, I have heard few complaints about the Postgres approach,
> and I haven't even heard of people asking to control the documented "five
> or more" behavior.

This *constantly* is a problem.


Greetings,

Andres Freund


Re: Cached/global query plans, autopreparation

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2018-03-02 15:29:09 -0500, Bruce Momjian wrote:
>> While I have heard people complain about how other databases cache
>> prepare plans, I have heard few complaints about the Postgres approach,
>> and I haven't even heard of people asking to control the documented "five
>> or more" behavior.

> This *constantly* is a problem.

Yeah, I've certainly heard complaints about it.  I do agree with
Bruce's conclusion that we should try to improve that behavior;
but it's not entirely clear how.  (A user-frobbable knob isn't
necessarily the best answer.)

            regards, tom lane


Re: Cached/global query plans, autopreparation

From
Pavel Stehule
Date:


2018-03-02 21:51 GMT+01:00 Tom Lane <tgl@sss.pgh.pa.us>:
Andres Freund <andres@anarazel.de> writes:
> On 2018-03-02 15:29:09 -0500, Bruce Momjian wrote:
>> While I have heard people complain about how other databases cache
>> prepare plans, I have heard few complaints about the Postgres approach,
>> and I haven't even heard of people asking to control the documented "five
>> or more" behavior.

> This *constantly* is a problem.

Yeah, I've certainly heard complaints about it.  I do agree with
Bruce's conclusion that we should try to improve that behavior;
but it's not entirely clear how.  (A user-frobbable knob isn't
necessarily the best answer.)

Can this problem be reduced if we can count the number of possible paths?

Maybe it can work for some simple queries, which are the majority in pgbench.

When I migrated from Oracle, there was an issue with slow planning of very complex views - probably optimization based on the most common values could work well.

I still have an idea about an optimization based not on searching for the best plan for one parameter vector, but on searching for the best plan for all possible vectors - or the best worst-case plan.

I don't think this issue is solvable without changing the optimization method.

Or don't waste time on probably useless work and move forward to dynamic execution - for example, dynamically switching from nested loop to hash join to merge join...





Re: Cached/global query plans, autopreparation

From
konstantin knizhnik
Date:
On Mar 2, 2018, at 11:29 PM, Bruce Momjian wrote:

> On Thu, Feb 15, 2018 at 03:00:17PM +0100, Shay Rojansky wrote:
>> Just wanted to say that I've seen more than 10% improvement in some real-world
>> application when preparation was done properly. Also, I'm assuming that
>> implementing this wouldn't involve "rewriting substantial part of Postgres
>> core", and that even 10% is quite a big gain, especially if it's a transparent/
>> free one as far as the user is concerned (no application changes).
>
> I would like to step back on this issue.  Ideally, every query would get
> re-optimized because we can only be sure the plan is optimal when we use
> supplied constants to generate the plan.  But, of course, parsing and
> planning take time, so there ideally would be an way to avoid it.  The
> question is always how much time will be saved by avoiding
> parsing/planning, and what risk is there of suboptimal plans.
>
> Postgres uses a conservative method for reusing plans with previous
> constants, as described in the PREPARE manual page:
>
>     https://www.postgresql.org/docs/10/static/sql-prepare.html
>     Prepared statements can use generic plans rather than re-planning with
>     each set of supplied EXECUTE values. This occurs immediately for prepared
>     statements with no parameters; otherwise it occurs only after five or more
>     executions produce plans whose estimated cost average (including planning
>     overhead) is more expensive than the generic plan cost estimate. Once
>     a generic plan is chosen, it is used for the remaining lifetime of the
>     prepared statement. Using EXECUTE values which are rare in columns with
>     many duplicates can generate custom plans that are so much cheaper than
>     the generic plan, even after adding planning overhead, that the generic
>     plan might never be used.
>
> While I have heard people complain about how other databases cache
> prepare plans, I have heard few complaints about the Postgres approach,
> and I haven't even heard of people asking to control the documented "five
> or more" behavior.

I am sorry, but I think that the lack of complaints in this case just means that people are not using prepared statements in Postgres.
Almost any production system based on Postgres has to use some kind of connection pooling (pgbouncer, ...), and it prevents usage of prepared statements in most cases.
I personally was involved in the development of several such systems where it was prohibited to use prepared statements because all connections to the database were expected to go through pgbouncer.

I completely agree with you that a generic plan can be less efficient than a specialized plan. That can be caused by data skew and non-uniform data distribution.
But what is the percentage of such queries? I do not have much experience with investigating the behavior of Postgres in real production systems.
So I can only guess. For the simplest queries like "select * from T where key=?" it is very unlikely that a specialized plan will be more efficient, even in the case of a non-uniform distribution of "key" values. And for complex plans involving multiple joins, the Postgres optimizer will in any case not be able to calculate a more or less precise estimation after a few joins. So in this case, too, a specialized plan with literals instead of placeholders will not be more efficient.

This rule of five attempts before switching to the generic plan works well for partitioning, where the generic plan causes query execution at all partitions and so is several times less efficient than a specialized plan which is able to restrict query execution to just one partition. But I think that in this case the right solution is runtime partition pruning, which will allow the use of prepared statements for tables with inheritance.


>
> I also know that other database products have more sophisticated prepare
> usage, but they might have higher parse/plan overhead, or they might be
> more flexible in handling specialized workloads, which Postgres might
> not want to handle, given the costs/complexity/overhead.
>
> So, the directions for improvement are:
>
> 1  Improve the existing "five or more" behavior
> 2  Automatically prepare queries that are not sent as prepared queries
> 3  Share plans among sessions
>
> While #1 would be nice, #2 increases the number of applications that can
> silently benefit from prepared queries, and #3 improves the number of
> cases that query plans can be reused.  The issue with #3 is that the
> constants used are no longer local to the session (which is the same
> issue with connection poolers reusing prepared plans).  When different
> sessions with potentially more varied constants reuse plans, the
> probability of suboptimal plans increases.
>
> I think the fact that pgbench shows a 2x improvement for prepared
> statements, and real world reports are a 10% improvement means we need
> to have a better understanding of exactly what workloads can benefit
> from this, and a comprehensive analysis of all three areas of
> improvement.

I completely agree with you.
If we have autoprepared statements or shared plans, then more sophisticated checks of generic plan efficiency are definitely needed.
Just want to add a few words about the estimated complexity of all these approaches.

The autoprepare patch already exists and can improve performance on OLTP workloads by up to two times.
It would be nice to perform more experiments on real systems to estimate the real effect of autoprepare. But until this patch is included in Postgres, it will be hard to perform such experiments. It is very difficult to try the effect of prepared statements on a real system working through pgbouncer.
With autoprepare, it would be enough just to toggle one GUC variable to perform this experiment.

Implementing a shared cache is definitely the most universal and efficient solution. But it will require a lot of effort, and even if we decide to move in this direction, I do not think that it will be ready for release 11 or even 12 of Postgres.

It's a pity that we have to lose up to a factor of two in possible Postgres performance just because of the lack of any suitable mechanism for preparing queries.





Re: Cached/global query plans, autopreparation

From
Jim Finnerty
Date:
The heuristic for choosing the generic plan by comparing the estimated cost
of the generic plan to the estimated costs of the specific plans is flawed.
Consider this:  what is the selectivity of a predicate such as 'x > $1'? 
The planner can only make a wild guess, and the default wild guess for range
selectivity is 1/3.

Suppose that the true predicate selectivity is 2/3.  After executing good
specific plans 5 times, we compare the estimated cost of the generic plan to
the average estimated cost of the specific plans.  We conclude that we
should switch to the generic plan because the (badly) estimated cost is
less.  You may get the same plan.  If it's your lucky day, you might even
get a better plan, but in this situation the generic plan should be worse,
on average.  We consider the accuracy of estimates to be the same in both
cases, which is wrong.
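
To make that concrete with invented numbers:

    Table t has 1,000,000 rows; the true selectivity of "x > $1" is 2/3.

    generic plan estimate:  1,000,000 * 1/3 = 333,333 rows (default guess)
    specific plan estimate: 1,000,000 * 2/3 = 666,667 rows (constant seen)

    The generic plan's cost is estimated over roughly half as many rows,
    so it wins the comparison against the average estimated cost of the
    five specific plans - yet at execution time both plans must process
    the same 666,667 rows.  The comparison rewards the plan with the
    worse estimate, not the better plan.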

So the decision to use the generic plan or not by comparing the estimated
cost of the generic plan to the average estimated cost of k specific plans
uses flawed logic.  If the planner is not called after the plan is cached,
then it becomes more difficult to tune it (e.g. by experimenting with
different plans using pg_hint_plan).  You'd probably have to deallocate and
re-prepare to get another 5 tries.  That sounds annoying.

Is there a way to EXPLAIN the generic plan? 



