Thread: Re: psycopg3, prepared statements
On 12/21/20 6:24 AM, Daniele Varrazzo wrote:
> The one thing, the most requested thing in psycopg, is support for
> prepared statements.
>
> In psycopg3 for the moment there is:
>
> - very low level support for prepared statements, i.e. wrapping of
> libpq functions such as PQsendPrepare/PQsendQueryPrepared
> (https://www.postgresql.org/docs/current/libpq-async.html#LIBPQ-PQSENDPREPARE)
> - automatic use of prepared statements in `cursor.executemany()`,
> which might eventually stop sucking.
>
> Gathering some ideas:
>
> Prepared statements in the server are per session, so any form of
> cache is better connected to the connection than to the cursor, although
> the cursors are the obvious interface to give commands.
>
> In the past [1] I thought about exposing explicit prepare/deallocate
> on the cursor, and it was a single prepared query per cursor. A
> `cursor.prepare(query)` with no args doesn't have type information
> though: if anything, it should take an optional array of parameters to
> get the types from.
>
> What I'm thinking about is to prepare queries automatically with a
> scheme such as:
>
> - decisions are made after the query is transformed to postgres format
> (i.e. it is reduced to bytes, all the client-side manipulations have
> been done, placeholders have been transformed to $ format). There is
> an object in psycopg3 that takes care of this transformation [2]
> - the number of times a query is seen is stored in an LRU cache on the
> connection
> - if a query is seen more than `connection.prepare_threshold` times
> (proposed default: 5) then it is prepared with the name
> f'pg3_{hash(query)}' and the following executions use the prepared
> statement.
> - if more than `connection.prepared_number` queries are prepared, the
> one used least recently is deallocated and evicted from the cache
> (proposed default: 100).
> - Parameters may be fudged on the connection: prepare_threshold=0
> would prepare all queries, prepare_threshold=None would disable
> preparing.
> - For the control freak, cursor.execute(query, params, prepare=True)
> would prepare the query immediately, if it isn't already;
> prepare=False would avoid preparation. The default None would enable
> the automatic choice.

So your plan from [1] was like that in plpythonu, where the process is
broken down into two parts. What I'm not following is whether that is
still the plan, or whether prepare/execute is going to happen without
cursor.prepare() and just be automatic on cursor.execute()? With the
provision to override per cursor.

> [1] https://gist.github.com/dvarrazzo/3797445
> [2] https://github.com/psycopg/psycopg3/blob/c790a832/psycopg3/psycopg3/_queries.py#L27
>
> What do you think?
>
> Cheers
>
> -- Daniele

--
Adrian Klaver
adrian.klaver@aklaver.com
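As a rough illustration of the counting/LRU scheme proposed above, the
per-connection bookkeeping could look something like the sketch below.
This is only an illustration with made-up names (PrepareManager,
maybe_prepare), not the actual psycopg3 code.

from collections import OrderedDict

class PrepareManager:
    """Sketch of the proposed per-connection bookkeeping (illustrative only)."""

    def __init__(self, prepare_threshold=5, prepared_number=100):
        self.prepare_threshold = prepare_threshold  # None disables preparation
        self.prepared_number = prepared_number      # max statements kept prepared
        self._counts = OrderedDict()   # query bytes -> times seen so far
        self._names = OrderedDict()    # query bytes -> server-side statement name

    def maybe_prepare(self, query: bytes):
        """Return a statement name to execute with, or None to run unprepared."""
        if self.prepare_threshold is None:
            return None
        if query in self._names:
            self._names.move_to_end(query)      # refresh the LRU position
            return self._names[query]
        seen = self._counts.get(query, 0) + 1
        self._counts[query] = seen
        if seen <= self.prepare_threshold:
            return None                         # not popular enough yet
        # Seen often enough: PREPARE/PQsendPrepare would happen here.
        name = f"_pg3_{len(self._names)}"
        del self._counts[query]
        self._names[query] = name
        if len(self._names) > self.prepared_number:
            ...  # deallocate and evict the least recently used statement
        return name

With the proposed defaults, the sixth execution of the same query would be
the first one to run against a prepared statement.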
On Mon, 21 Dec 2020 at 16:02, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

> So your plan from [1] was like that in plpythonu where the process is
> broken down into two parts.

Is there anything useful to learn from the plpythonu experience? I'm
not very familiar with it.

> What I'm not following is whether that is still the plan or whether
> prepare/execute is going to happen without the cursor.prepare() and
> just be automatic on cursor.execute()? With the provision to override
> per cursor.

I would do without `cursor.prepare(query)`. To start with, it should be
`cursor.prepare(query, args)` in order to sniff the types, so you would
also need to pass it a sample of the data (or the types). So now 1) as
an interface it's very similar to `cursor.execute(query, args)`, except
that it would prepare and not run the query, and 2) it's a moment in
your program's lifetime where you know the query to run and the
arguments: most likely now you want to run the query too, not only to
prepare it. So the lack of a function "prepare the query, with these
types, but don't run it" doesn't seem a very pressing gap.

I am also assuming that, if the assumption is that "preparing queries
is good", it would be better to let it happen automatically instead of
asking people to do it on their own.

Does it make sense?

----

I mistakenly replied to Adrian privately. Following, his reply.

On Mon, 21 Dec 2020 at 16:45, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

> Is there anything useful to learn from the plpythonu experience? I'm
> not very familiar with it.

The user visible portion can be found here:
https://www.postgresql.org/docs/current/plpython-database.html

The most notable part is that it requires type definition(s) for the
parameters in the prepare portion. That, and a prepared query can be
passed around to different execute calls and different function calls.

> I would do without `cursor.prepare(query)`. [...]

If I read the above correctly this boils down to: cursor.execute(query,
args) will always be prepared once connection.prepare_threshold
(default 5) is passed, and evicted after connection.prepared_number
(default 100) other queries are run.

> Does it make sense?

I think so. Personally, I prefer the two step approach as I am becoming
less and less enthusiastic about hidden 'magic'. To that end a
global (maybe connection) setting that would disable prepare would be nice.

----

To which: 1) thank you very much, Adrian, for the plpython pointer,
I'll take a look at it.

2) About disabling the automatic prepare: the mechanism I have in mind
is to set prepare_threshold to None on the connection instance; however
we could make sure to have the default attribute defined on the class:
this way whoever really hates the idea of prepared statements can be
cheeky and set `psycopg3.Connection.prepare_threshold = None` instead
of `myconn.prepare_threshold`... More seriously, if there is a large
base of people who think that something can go wrong with prepared
statements we can either provide a better interface to control it
globally or have the feature opt-in.

-- Daniele
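For comparison, the two-step PL/Python interface Adrian points to looks
roughly like this (per the linked documentation; the table name and
values are invented, and plpy is only available inside a PL/Python
function body):

# Step 1: prepare the plan, declaring the parameter types explicitly.
plan = plpy.prepare(
    "SELECT last_name FROM employees WHERE first_name = $1",
    ["text"],
)
# Step 2: execute the plan, possibly many times and from different functions.
rv = plpy.execute(plan, ["Alice"], 5)   # values, plus an optional row limit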
On 12/21/20 12:26 PM, Daniele Varrazzo wrote:
> On Mon, 21 Dec 2020 at 16:02, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>
> I mistakenly replied to Adrian privately. Following, his reply.
>
> I think so. Personally, I prefer the two step approach as I am becoming
> less and less enthusiastic about hidden 'magic'. To that end a
> global (maybe connection) setting that would disable prepare would be nice.
>
> ----
>
> To which: 1) thank you very much, Adrian, for the plpython pointer,
> I'll take a look at it.
>
> 2) About disabling the automatic prepare: the mechanism I have in mind
> is to set prepare_threshold to None on the connection instance;

So the above is something the user would have to do on each connection?

> however we could make sure to have the default attribute defined on
> the class: this way whoever really hates the idea of prepared statements
> can be cheeky and set `psycopg3.Connection.prepare_threshold = None`

I could get behind that. This all may be premature optimization on my
part. As long as there is some way to turn it off at some level I could
live with it.

> instead of `myconn.prepare_threshold`... More seriously, if there is a
> large base of people who think that something can go wrong with
> prepared statements we can either provide a better interface to control
> it globally or have the feature opt-in.
>
> -- Daniele

--
Adrian Klaver
adrian.klaver@aklaver.com
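For what it's worth, the two opt-out levels being discussed would look
something like this, assuming the proposed prepare_threshold attribute
ends up on the Connection class (a sketch of the proposal, not a final
API):

import psycopg3

# Per-connection: has to be repeated for every connection that is opened.
conn = psycopg3.connect()
conn.prepare_threshold = None   # no automatic preparation on this connection

# Class-wide: the "cheeky" process-wide default for those who dislike the magic.
psycopg3.Connection.prepare_threshold = None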
> it's a moment in your
> program's lifetime where you know the query to run and the arguments:
> most likely now you want to run the query too, not only to prepare it.
I am just thinking about type issues in some edge cases...
What if the data is such that in the first row it has some small integers,
but on subsequent rows it has larger numbers that do not fit into "integer"?
There are probably other cases similar to this, including "None" values
at the moment you want to capture types... Maybe it is more reliable to
oblige the user, who knows their data better, to explicitly supply the types
for preparation... Will any type inference work in case of a type mismatch?
If yes, to what extent?
Generally I think that automatic preparation is a great idea, but if it is achieved
by sacrificing reliability to any extent, I would prefer to have such a feature
switched off by default.
> - if more than `connection.prepared_number` queries are prepared, the
> one used least recently is deallocated and evicted from the cache
> (proposed default: 100).
Why do you need such logic? Why not just keep some limited number of
prepared statements? Is it a problem if a PS is in cache but rarely used?
--
Vlad
>> - if more than `connection.prepared_number` queries are prepared, the
>> one used least recently is deallocated and evicted from the cache
>> (proposed default: 100).
>
> Why do you need such logic? Why not just keep some limited number of
> prepared statements? Is it a problem if a PS is in cache but rarely used?

In resource-constrained environments one wants to be able to balance
use vs disuse.

Here I agree with Adrian: automagic, yes, but only if one can opt out
by default.

Karsten
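The resource angle is the point of the eviction step: when the cap is
exceeded, the least recently used statement is DEALLOCATEd so it stops
holding a plan on the server. A minimal sketch, with hypothetical names
(evict_excess, names) rather than the psycopg3 code:

from collections import OrderedDict

def evict_excess(cursor, names: OrderedDict, prepared_number: int = 100) -> None:
    """`names` maps query bytes to statement names, least recently used first."""
    while len(names) > prepared_number:
        _query, name = names.popitem(last=False)   # drop the oldest entry
        cursor.execute(f"DEALLOCATE {name}")       # free the server-side plan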
On Tue, 22 Dec 2020 at 05:39, Vladimir Ryabtsev <greatvovan@gmail.com> wrote:
>
> > it's a moment in your
> > program's lifetime where you know the query to run and the arguments:
> > most likely now you want to run the query too, not only to prepare it.
>
> I am just thinking about type issues in some edge cases...
> What if the data is such that in the first row it has some small integers,
> but on subsequent rows it has larger numbers that do not fit into "integer"?
> There are probably other cases similar to this, including "None" values
> at the moment you want to capture types... Maybe it is more reliable to
> oblige the user, who knows their data better, to explicitly supply the types
> for preparation... Will any type inference work in case of a type mismatch?
> If yes, to what extent?

The only case I have thought of where this can happen is with None vs.
not None, in which case the oid info would be missing. Oid selection is
based only on the Python type, not on the specific values (for instance
a Python datetime is always passed as timestamptz, never timestamp,
even if the tzinfo is missing, and it relies on the Postgres cast): so
far I have avoided making choices based on the values precisely to
avoid finding ourselves in this type of situation.

I don't have in mind a case where a missing value would make a
difference, but that doesn't mean such cases don't exist. What I think
is that if in a certain context a NULL is acceptable and Python can
pass either a None or e.g. a date, preparing with the unknown or the
date oid shouldn't make a difference. Of course it is possible to
trigger ambiguities: trivially, the query `SELECT %s` could be prepared
with a date and then you can throw it a string that will make it fail.
But if a placeholder is part of an expression or the target of an
insert I believe (but will test too) that preparing with unknown types
will be equivalent.

I am interested in real cases in which a prepared statement is
definitely undesirable. My current idea is to make the feature
automatic, but with the possibility to disable it at global,
connection, and statement level. If there is evidence that the feature
is dangerous we can leave it disabled by default and leave the
possibility to enable it in the same contexts.

-- Daniele
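To illustrate the point about oid selection: the oid attached to a
parameter depends only on its Python type, never on its value. A
made-up sketch (param_oid is not a real psycopg3 function; the numbers
are the standard Postgres oids):

from datetime import datetime

def param_oid(value) -> int:
    if value is None:
        return 0        # "unknown": let Postgres infer the type from context
    if isinstance(value, bool):
        return 16       # bool (checked before int, as bool is a subclass of int)
    if isinstance(value, int):
        return 20       # int8, however small or large the value is
    if isinstance(value, datetime):
        return 1184     # always timestamptz, even when tzinfo is missing
    return 0            # anything else here: fall back to unknown + Postgres cast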
On Tue, 22 Dec 2020 at 22:36, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
>
> On Tue, 22 Dec 2020 at 05:39, Vladimir Ryabtsev <greatvovan@gmail.com> wrote:
>
> > I am just thinking about type issues in some edge cases... [...]
>
> The only case I have thought of where this can happen is with None vs.
> not None, in which case the oid info would be missing. [...] Of course
> it is possible to trigger ambiguities: trivially, the query `SELECT %s`
> could be prepared with a date and then you can throw it a string that
> will make it fail. But if a placeholder is part of an expression or the
> target of an insert I believe (but will test too) that preparing with
> unknown types will be equivalent.

Heads up about this: it's better than I thought!

I wrote a first implementation of the prepared statements cache using
the query as a key, but it's actually enough to use the (query, types)
tuple in order to tell apart statements that are executed with
different types. This way even the "SELECT %s" case won't be a problem.
Of course a statement executed with a mix of types will be prepared
later than `prepare_threshold`, but I think it's perfectly acceptable:
the case doesn't happen often, and having the query prepared after 10
times instead of 5 doesn't change much if it will be executed hundreds
of times or more.

What seems a feature-complete branch is available in [1]. The tests
[2] illustrate the main behaviour of the prepared statements system.

[1]: https://github.com/psycopg/psycopg3/tree/prepared-statements
[2]: https://github.com/psycopg/psycopg3/blob/prepared-statements/tests/test_prepared.py

Off to do some benchmarks now...

-- Daniele
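In other words, the cache key becomes the (query, parameter oids) pair,
so "SELECT %s" executed with a date and with a string is counted, and
prepared, separately. Roughly (hypothetical names, not the code in the
linked branch):

from collections import OrderedDict

prepare_threshold = 5
_prepared = OrderedDict()   # (query bytes, oid tuple) -> count, or statement name

def should_prepare(query: bytes, oids: tuple) -> bool:
    key = (query, oids)                 # e.g. (b'select 1 + $1', (20,))
    value = _prepared.get(key, 0)
    if isinstance(value, str):          # already prepared under that name
        return True
    _prepared[key] = value + 1
    return value + 1 > prepare_threshold   # mixed-type calls split the tally per key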
On 12/23/20 2:53 PM, Daniele Varrazzo wrote:
> On Tue, 22 Dec 2020 at 22:36, Daniele Varrazzo
> <daniele.varrazzo@gmail.com> wrote:
>>
>> On Tue, 22 Dec 2020 at 05:39, Vladimir Ryabtsev <greatvovan@gmail.com> wrote:
>
> Heads up about this: it's better than I thought!
>
> I wrote a first implementation of the prepared statements cache using
> the query as a key, but it's actually enough to use the (query, types)
> tuple in order to tell apart statements that are executed with
> different types. This way even the "SELECT %s" case won't be a
> problem. Of course a statement executed with a mix of types will be
> prepared later than `prepare_threshold`, but I think it's perfectly

Alright, I was following you until you got to the above. I'm not
following why it would overshoot prepare_threshold?

> acceptable: the case doesn't happen often, and having the query
> prepared after 10 times instead of 5 doesn't change much if it will be
> executed hundreds of times or more.
>
> What seems a feature-complete branch is available in [1]. The tests
> [2] illustrate the main behaviour of the prepared statements system.
>
> [1]: https://github.com/psycopg/psycopg3/tree/prepared-statements
> [2]: https://github.com/psycopg/psycopg3/blob/prepared-statements/tests/test_prepared.py
>
> Off to do some benchmarks now...
>
> -- Daniele

--
Adrian Klaver
adrian.klaver@aklaver.com
Cause (query, types) can give more combinations than (query,)?
On Wed, 23 Dec 2020 at 23:23, Vladimir Ryabtsev <greatvovan@gmail.com> wrote:
>
> Cause (query, types) can give more combinations than (query,)?

Yes, that's the reason.

In [1]: import psycopg3
In [2]: cnn = psycopg3.connect()
In [3]: cnn.prepare_threshold = 2

In [4]: cnn.execute("select 1 + %s", [1]).fetchone()
Out[4]: (2,)
In [5]: cnn.execute("select 1 + %s", [None]).fetchone()
Out[5]: (None,)
In [7]: cnn.execute("select 1 + %s", [2]).fetchone()
Out[7]: (3,)

After 3 times the expression should have been prepared, but the tally
has been spread over two keys (0 is the unknown oid, 20 is the int oid).

In [8]: cnn._prepared_statements
Out[8]: OrderedDict([((b'select 1 + $1', (0,)), 1),
                     ((b'select 1 + $1', (20,)), 2)])

In [9]: cnn.execute("select 1 + %s", [3]).fetchone()
Out[9]: (4,)

However, when either key passes the threshold, preparation eventually
happens.

In [10]: cnn._prepared_statements
Out[10]: OrderedDict([((b'select 1 + $1', (0,)), 1),
                      ((b'select 1 + $1', (20,)), b'_pg3_0')])

_pg3_0 is the name under which that combination of query and types is
now prepared (it is local to the session).

-- Daniele
On 12/23/20 4:14 PM, Daniele Varrazzo wrote:
> On Wed, 23 Dec 2020 at 23:23, Vladimir Ryabtsev <greatvovan@gmail.com> wrote:
>>
>> Cause (query, types) can give more combinations than (query,)?
>
> Yes, that's the reason.
>
> In [1]: import psycopg3
> In [2]: cnn = psycopg3.connect()
> In [3]: cnn.prepare_threshold = 2
>
> In [4]: cnn.execute("select 1 + %s", [1]).fetchone()
> Out[4]: (2,)
> In [5]: cnn.execute("select 1 + %s", [None]).fetchone()
> Out[5]: (None,)
> In [7]: cnn.execute("select 1 + %s", [2]).fetchone()
> Out[7]: (3,)

Alright, I was misunderstanding. I thought you were referring to
something like:

cur.execute("select * from some_table where id > %s and user_name = %s
and current = %s", [2, 'aklaver', True])

> After 3 times the expression should have been prepared, but the tally
> has been spread over two keys (0 is the unknown oid, 20 is the int oid).
>
> In [8]: cnn._prepared_statements
> Out[8]: OrderedDict([((b'select 1 + $1', (0,)), 1),
>                      ((b'select 1 + $1', (20,)), 2)])
>
> In [9]: cnn.execute("select 1 + %s", [3]).fetchone()
> Out[9]: (4,)
>
> However, when either key passes the threshold, preparation eventually
> happens.
>
> In [10]: cnn._prepared_statements
> Out[10]: OrderedDict([((b'select 1 + $1', (0,)), 1),
>                       ((b'select 1 + $1', (20,)), b'_pg3_0')])
>
> _pg3_0 is the name under which that combination of query and types is
> now prepared (it is local to the session).
>
> -- Daniele

--
Adrian Klaver
adrian.klaver@aklaver.com