Thread: Latest developments in psycopg3

Latest developments in psycopg3

From
Daniele Varrazzo
Date:
Hello,

(message is getting so long as to require titles, so here they are)


# Latest adaptation woes

I have finally broken the adaptation model where a Python class gets
adapted to a Postgres type. The model is what psycopg2 is based on but
for psycopg3 it was proving unsatisfactory.

The problems were some of the ones discussed in a previous thread,
revolving around the different behaviour of server-side binding
(psycopg3) vs. client-side binding (psycopg2). With client side
binding psycopg2 could get away with a lot that is no longer allowed
by the server; trying to use binary parameters makes it even more
tricky.

The area of the numbers is the most prominent: "select current_date +
%s" only works if the parameter is specified as either int2 or int4.
If it's int8 or numeric, Postgres will complain that there is no cast.
If it's unknown, Postgres will complain that it cannot choose if %s is
a number or an interval.

Other thorny problems are with arrays: you know that in Python there
are lists of any types (we hope they are uniform here) and in Postgres
there are arrays of a specific type. Just the type list doesn't say
enough, and empty lists are an even more special case: you cannot have
an array of unknowns, but without an item in it you cannot really know
its oid. You can specify the whole array to be unknown, with the '{}'
syntax and unknown oid, but only in text, not in binary.

Another funny area is with ranges, as postgres can cast between
timestamp and timestamptz transparently but there's no cast between
tsrange and tstzrange, and Python has a single datetime object which
you have to inspect to know if it's tz aware or naive...


# Introducing two-steps adaptation

For some time, in order to accommodate some of these quirks, there
were special cases in the adaptation functions, but now there is a
generic mechanism which can be used for these Python types whose
Postgres type cannot be determined only by the class but the value
must be observed. Choosing what adapter class to use for the object
'obj' goes this way [1]:

1) start from a key = type(object)

2) look for such key in the adapters map. If you don't find it it's an
error (cannot adapt object 'obj'). If found, you have an object dumper
of a subclass of Dumper.

3) ask the dumper if it needs an upgrade by calling
dumper.get_key(obj). If it returns the same key as in 1) (which is the
base implementation) then there's nothing to do

4) if it returns a different key, call dumper.upgrade(obj) which
returns a new dumper instance.

Details omitted from this implementation are caching (the reason for
the twins get_key/upgrade) and managing the binary/text format: they
can be seen in the implementation in [1].

[1]:
https://github.com/psycopg/psycopg3/blob/68547b8c16d3b3b093fba8d6df1f56cf52b37f08/psycopg3/psycopg3/_transform.py#L131

This schema allows all the pretty things: a Python int can be dumped
as int2, int4, int8, numeric according to its value; empty lists can
be adapted as unknown '{}' if the query is willing to accept text
params, there is a single Range object, instead of DateRange, IntRange
etc...

I have to go back on the project of implementing the Django db backend
for psycopg3, but a lot of what I had done so far in order to
accommodate its differences (often consisting in finding the right
place to add a ::integer cast...) can be probably simplified. I am
absolutely delighted because it goes in the direction of making a
migration from psycopg2 to psycopg3 smoother for many users: the
amount of things I had to do for the Django backend was telling me
that the thing wasn't trivial yet but I think it's improving now.


# Automatic use of binary parameters

Another change happened, related, to the above, is the automatic
selection of text/binary format for the query arguments. Previously
parameters passed to %s placeholders would have been adapted in text,
and people would have opted in to binary parameters by specifying a %b
placeholder. This rigidity didn't allow for the required adaptation
fluidity of the schema above, so now %s allows for automatic
selection: if a type has a binary dumper it will be used, otherwise a
text dumper will be looked up, and adaptation will fail if neither is
available. If people have reasons to use a specific format they can
specify %b or %t as placeholders for the binary or the text format.


# Connecting oid catalog and adaptation contexts

The next changes in the adaptation area will be around unifying the
oids map (mapping type names to oids) with the adaptation context
(mapping types to adaptation objects). At the moment the former is
global and supposed for builtins only, the latter is attached to a
connection or a cursor. This causes a few problems: one is around
looking up custom types in the catalog and registering adapters for
them, because the thing happens in two different places and the global
one is not supposed to be modified. Another is in certain areas where
the types are not dictated by the Python types of the arguments, like
in normal queries, but by what Postgres is willing to accept: it
happens with casting composite types in binary and in binary COPY
FROM, where Postgres doesn't perform any cast. In these places the
array of dumpers to use should be configured in other ways, for
example specifying the list of type names to pass, which may contain
custom data types too.


# jsonb binary loading

The thing I've worked on in the last few days is something entirely
new and exciting instead: finding a way to transfer jsonb in binary
(at the moment jsonb is only passed as text to the client).
Preliminary results seem to suggest a lot of performance to be gained.
(https://github.com/dvarrazzo/jsonb_parser) and it's something I will
explore further with the help of some of the PostgresPro people, who
are keeping on developing jsonb in the backend.


# single-row mode

The libpq allows a single-row mode, but given the batching nature of
the fetch*() methods it's not something of obvious utility. However a
few alternative databases are emerging (Materialise, CockroachDB)
which use the query protocol as a stream: pretty much a never-ending
query. I have added a `stream()` method [2] to the cursor, allowing to
consume these results too in single-row mode.

https://www.psycopg.org/psycopg3/docs/cursor.html#psycopg3.Cursor.stream


# Thank you for the support

Thank you very much to the companies and the individuals sponsoring
this project: https://www.psycopg.org/sponsors/

If anyone would like to contribute, in ways other than funding, there
is also fun code to write (strictly typed, async, Cython if you wish)
and you are welcome to get in touch: I am sure there are practical
ways to contribute!

Cheers

-- Daniele



Re: Latest developments in psycopg3

From
Denis Laxalde
Date:
Hello Daniele,

Daniele Varrazzo a écrit :
> If anyone would like to contribute, in ways other than funding, there
> is also fun code to write (strictly typed, async, Cython if you wish)
> and you are welcome to get in touch: I am sure there are practical
> ways to contribute!

I spent a bit of time looking around this last week.

The first thing I did was to try out psycopg3 on a (small) project I 
maintain (pg_activity). Porting from psycopg2 wasn't too hard:

- first, I needed the conn-info developments (in progress at the moment) 
along with recent works on the adaptation system so I did a local merge 
of master in conn-info branch.

- second, I was missing the DictCursor cursor factory that's not (yet?) 
available in psycopg3, so I had to rewrite part of the code.

- after that, tests pass and the application works fine without any 
further changes (the DictCursor ones are non-negligible though).

So, based on that:

- It seems to me that the conn-info developments work fine (also looked 
at the commits, which look fine as well to me): what needs to be done 
here? Any help needed?

- There appears to be no card about dictcursor on the project board. 
What's the plan (and priority) for that (if any)?


Then, following your idea to implement a django backend, I was wondering 
if doing the same for sqlalchemy would be a good thing to work on now?


I'm also curious about the "strictly typed" thing you wrote above. What 
do you have in mind more specifically?


Denis



Re: Latest developments in psycopg3

From
Daniele Varrazzo
Date:
Hello Denis,

On Mon, 8 Feb 2021 at 12:14, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
> I spent a bit of time looking around this last week.

Thank you very much, that's very appreciated :)


> The first thing I did was to try out psycopg3 on a (small) project I
> maintain (pg_activity). Porting from psycopg2 wasn't too hard:

Nice to know!

> - It seems to me that the conn-info developments work fine (also looked
> at the commits, which look fine as well to me): what needs to be done
> here? Any help needed?

I can't remember what is missing: probably the interface is complete,
it is missing test coverage and documentation. I will review if there
are methods missing: its purpose is to expose pretty much all the info
about the connection that the libpq can give, correctly decoded from
bytes. Complete tests and docs are definitely to be added.


> - There appears to be no card about dictcursor on the project board.
> What's the plan (and priority) for that (if any)?

I forgot to add a point about them, sorry: I added one now. There is
no firm plan for me, but probably a conversation to have, because I
have my own usage patterns for psycopg2, when I am its user, and I
want to be mindful of other use cases too.

In short: yes, I would like to provide alternative record factories.
We can review if the best way is to create cursor subclasses
out-of-the-box or mixins. In psycopg2 there are:

- DictCursor (returns a hybrid object between a sequence and a dictionary)
- RealDictCursor (returns a straight subclass of a dictionary, so it's
not a legit dbapi cursor as it doesn't expose a sequence)
- NamedTupleCursor (what it says on the tin).

Personally I use the NamedTupeCursor most often, as I prefer the dot
notation to access attribute. Named tuples also have a _dict() method
when a dict is needed. If psycopg3 was only my pet project, I would
personally have NamedTuples as the default and only thing supported...
But that's hardly the case.

1) We can provide a feature to select the type of cursor that is not
much dissimilar from psycopg2
2) Do we need DictCursor/RealDictCursor? ISTM that one of the two
would be sufficient? Possibly an object inheriting from a Python dict
instead of emulating it so that e.g. json.dumps()
3) Or, conversely, we could have a Row object providing both attribute
and getattr access, both as index and name:

   row.myattr
   row["myattr"]
   row[1]

4) There are reports of the *DictCursor being slow: the objects in
psycopg2 have hardly been profiled. I would look well at the
performance of the objects created.

I am curious about why people use DictCursor: what are its advantages?
Should we provide such objects only to guarantee a smoother upgrade
path or there are use cases where they are genuinely better than
NamedTuples or a would-be Row object (JSON does come to mind).


> Then, following your idea to implement a django backend, I was wondering
> if doing the same for sqlalchemy would be a good thing to work on now?

Yes, why not. I don't know SQLAlchemy but I assume it would be way
less work to create a SA backend than a Django one.

While on this topic, in case anyone takes a look at the Django backend
and thinks about using or hacking on it: the last commits were done
before I implemented the two-steps dumers; as such there are *way*
more changes than there should be now that Python int are dumped as
int2/int4 if they are smaller than their respective MAX. So the
current state is not representative of how it can look and I expect it
will have many less differences than the psycopg2 backend.


> I'm also curious about the "strictly typed" thing you wrote above. What
> do you have in mind more specifically?

The codebase uses mypy --strict, so it is extensively type-annotated
and of course all the public interfaces are type-annotated too, to
collaborate with projects using static typing. I am also curious about
whether it would be possible to provide types out of the query
results, but I doubt it is.

Which makes me think... 5) PydanticCursor? Or, as a more general case,
implement all the cursor flavours not as subclasses but using a
rowfactory function, and allow people to use their own (e.g. asking a
cursor to create a Pydantic model provided by the user, so that they
can have psycopg3 collaborate with e.g. FastAPI strictly typed
models). Or a Python dataclass...


Thank you for the feedback so far!

-- Daniele



Re: Latest developments in psycopg3

From
Denis Laxalde
Date:
Daniele Varrazzo a écrit :
> On Mon, 8 Feb 2021 at 12:14, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
>> - It seems to me that the conn-info developments work fine (also looked
>> at the commits, which look fine as well to me): what needs to be done
>> here? Any help needed?
> 
> I can't remember what is missing: probably the interface is complete,
> it is missing test coverage and documentation. I will review if there
> are methods missing: its purpose is to expose pretty much all the info
> about the connection that the libpq can give, correctly decoded from
> bytes. Complete tests and docs are definitely to be added.

Ok, good to know. (Let me know if help is needed.)

>> - There appears to be no card about dictcursor on the project board.
>> What's the plan (and priority) for that (if any)?
> 
> I forgot to add a point about them, sorry: I added one now. There is
> no firm plan for me, but probably a conversation to have, because I
> have my own usage patterns for psycopg2, when I am its user, and I
> want to be mindful of other use cases too.
> 
> In short: yes, I would like to provide alternative record factories.
> We can review if the best way is to create cursor subclasses
> out-of-the-box or mixins. In psycopg2 there are:
> 
> - DictCursor (returns a hybrid object between a sequence and a dictionary)
> - RealDictCursor (returns a straight subclass of a dictionary, so it's
> not a legit dbapi cursor as it doesn't expose a sequence)
> - NamedTupleCursor (what it says on the tin).
> 
> Personally I use the NamedTupeCursor most often, as I prefer the dot
> notation to access attribute. Named tuples also have a _dict() method
> when a dict is needed. If psycopg3 was only my pet project, I would
> personally have NamedTuples as the default and only thing supported...
> But that's hardly the case.

Personally, I use DictCursor because it provides access to rows without 
depending on SELECT order and then pass items to a custom dataclass as 
**kwargs.


> 1) We can provide a feature to select the type of cursor that is not
> much dissimilar from psycopg2
> 2) Do we need DictCursor/RealDictCursor? ISTM that one of the two
> would be sufficient? Possibly an object inheriting from a Python dict
> instead of emulating it so that e.g. json.dumps()

Inheriting from dict is strange because we only need read access. So 
rather collections.abc.Mapping, I think.

> 3) Or, conversely, we could have a Row object providing both attribute
> and getattr access, both as index and name:
> 
>     row.myattr
>     row["myattr"]
>     row[1]

+1 (this would also fit with collections.abc.Mapping)

> 
> 4) There are reports of the *DictCursor being slow: the objects in
> psycopg2 have hardly been profiled. I would look well at the
> performance of the objects created.
> 
> I am curious about why people use DictCursor: what are its advantages?
> Should we provide such objects only to guarantee a smoother upgrade
> path or there are use cases where they are genuinely better than
> NamedTuples or a would-be Row object (JSON does come to mind).

I seem to recall that namedtuple was not good for performances as either 
(see this article https://lwn.net/Articles/731423/ for instance). I tend 
to avoid using them and often write custom classes instead (which is 
very handy with py3.7's dataclass).

So, for me, no use case that wouldn't be covered by "would-be Row object".


>> I'm also curious about the "strictly typed" thing you wrote above. What
>> do you have in mind more specifically?
> 
> The codebase uses mypy --strict, so it is extensively type-annotated
> and of course all the public interfaces are type-annotated too, to
> collaborate with projects using static typing. I am also curious about
> whether it would be possible to provide types out of the query
> results, but I doubt it is.

That does not sounds feasible/easy. It's the price to pay where 
interfacing with external data sources (e.g. the same is true when one 
retrieves data from a JSON API) and type checking is probably only 
possible when "decoding" results.


> Which makes me think... 5) PydanticCursor? Or, as a more general case,
> implement all the cursor flavours not as subclasses but using a
> rowfactory function, and allow people to use their own (e.g. asking a
> cursor to create a Pydantic model provided by the user, so that they
> can have psycopg3 collaborate with e.g. FastAPI strictly typed
> models). Or a Python dataclass...

Providing a plugable API for that would be really nice, indeed.



Re: Latest developments in psycopg3

From
Karsten Hilbert
Date:
Am Mon, Feb 08, 2021 at 01:16:43PM +0100 schrieb Daniele Varrazzo:

> 2) Do we need DictCursor/RealDictCursor?

For what it is worth, GNUmed does use DictConnection for all
its database needs.

Karsten
--
GPG  40BE 5B0E C98E 1713 AFA6  5BC0 3BEA AC80 7D4F C89B



Re: Latest developments in psycopg3

From
Rory Campbell-Lange
Date:
On 08/02/21, Daniele Varrazzo (daniele.varrazzo@gmail.com) wrote:
> > - There appears to be no card about dictcursor on the project board.
> > What's the plan (and priority) for that (if any)?

...

> In psycopg2 there are:
> 
> - DictCursor (returns a hybrid object between a sequence and a dictionary)
> - RealDictCursor (returns a straight subclass of a dictionary, so it's
> not a legit dbapi cursor as it doesn't expose a sequence)
> - NamedTupleCursor (what it says on the tin).
> 
> Personally I use the NamedTupeCursor most often, as I prefer the dot
> notation to access attribute. 

...

> 1) We can provide a feature to select the type of cursor that is not
> much dissimilar from psycopg2
> 2) Do we need DictCursor/RealDictCursor? ISTM that one of the two
> would be sufficient? Possibly an object inheriting from a Python dict
> instead of emulating it so that e.g. json.dumps()
> 3) Or, conversely, we could have a Row object providing both attribute
> and getattr access, both as index and name:
> 
>    row.myattr
>    row["myattr"]
>    row[1]

A comment on this point.

We use NamedTupleCursor a great deal as our system domain is modelled
largely in the database in plpgsql. The only issue we've had with using
NamedTupleCursor is when someone uses a space in column heading (the
space translates to a "_") or trips up and uses the same column name for
two columns. We class both cases as a mistake.

Instead of making the NamedTupleCursor the default, I imagine many
people accessing "raw" sql from python would benefit from the third
option, of permitting list, dict or getattr type access to values. This
would, presumably, be backwards compatible with current usage which I
imagine is typically list oriented.

A major benefit of getattr type access is that interactive debugging in
iPython allows dot expansion, so introspection of the the row's
attributes is wonderfully easy.

Regards
Rory



Re: Latest developments in psycopg3

From
Daniele Varrazzo
Date:
On Mon, 8 Feb 2021 at 15:16, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
>
> Daniele Varrazzo a écrit :
> > On Mon, 8 Feb 2021 at 12:14, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
> >> - It seems to me that the conn-info developments work fine (also looked
> >> at the commits, which look fine as well to me): what needs to be done
> >> here? Any help needed?
> >
> > I can't remember what is missing: probably the interface is complete,
> > it is missing test coverage and documentation. I will review if there
> > are methods missing: its purpose is to expose pretty much all the info
> > about the connection that the libpq can give, correctly decoded from
> > bytes. Complete tests and docs are definitely to be added.
>
> Ok, good to know. (Let me know if help is needed.)

That object should have more or less the same attributes of psycopg2
ConnectionInfo. [1] I looked again at the state of the branch and I
remember spending some time on get_parameters() (involving finding a
bug in the libpq). I haven't checked one by one the attributes but I
think a few are still missing. I would like to have at least one test
per attribute, just to catch regressions, and documentation, which
would largely be based on one line of docstrings and Sphinx
automethod.

So, not a difficult task, if a bit tedious. If you would like to help
on it I don't have work on that branch in mind in the next few days.


[1] https://www.psycopg.org/docs/extensions.html#psycopg2.extensions.ConnectionInfo


> > Which makes me think... 5) PydanticCursor? Or, as a more general case,
> > implement all the cursor flavours not as subclasses but using a
> > rowfactory function, and allow people to use their own (e.g. asking a
> > cursor to create a Pydantic model provided by the user, so that they
> > can have psycopg3 collaborate with e.g. FastAPI strictly typed
> > models). Or a Python dataclass...
>
> Providing a plugable API for that would be really nice, indeed.

For something more creative you could maybe take a look at this: maybe
we could give both the connection and the cursor a row_factory
attribute. I am not sure if just a callable would be enough: named
tuples, and in lesser measure dictionaries, would benefit from being
warned "a new result is coming: this is its shape". A clean interface
might be:

    class RowFactory:
        def set_result(self, cursor: BaseCursor):
            ...
        def make_row(self, values: Tuple[...]):
            ...

The library could provide some basic ones, for instance the equivalent
of DictCursor (real dicts) might be:

    class DictRowFactory:
        def set_result(self, cursor):
            self.titles = [d.name for d in cursor.description]
        def make_row(self, values):
            return dict(zip(self.titles, values))

This feature is definitely to be done before the first library release
but I can totally find something else to do and leave space to others.

Cheers

-- Daniele



Re: Latest developments in psycopg3

From
Adrian Klaver
Date:
On 2/8/21 4:16 AM, Daniele Varrazzo wrote:
> Hello Denis,
> 

> In short: yes, I would like to provide alternative record factories.
> We can review if the best way is to create cursor subclasses
> out-of-the-box or mixins. In psycopg2 there are:
> 
> - DictCursor (returns a hybrid object between a sequence and a dictionary)
> - RealDictCursor (returns a straight subclass of a dictionary, so it's
> not a legit dbapi cursor as it doesn't expose a sequence)
> - NamedTupleCursor (what it says on the tin).
> 
> Personally I use the NamedTupeCursor most often, as I prefer the dot
> notation to access attribute. Named tuples also have a _dict() method
> when a dict is needed. If psycopg3 was only my pet project, I would
> personally have NamedTuples as the default and only thing supported...
> But that's hardly the case.
> 
> 1) We can provide a feature to select the type of cursor that is not
> much dissimilar from psycopg2
> 2) Do we need DictCursor/RealDictCursor? ISTM that one of the two
> would be sufficient? Possibly an object inheriting from a Python dict
> instead of emulating it so that e.g. json.dumps()

RealDictCursor is what I use primarily as it plays well with json.*. I 
have used NamedTupleCursor for those things that play better with dot 
notation. I very rarely use DictCursor anymore as it's dual nature 
seemed to get in the way more then it helped.

> 3) Or, conversely, we could have a Row object providing both attribute
> and getattr access, both as index and name:
> 
>     row.myattr
>     row["myattr"]
>     row[1]
> 
> 4) There are reports of the *DictCursor being slow: the objects in
> psycopg2 have hardly been profiled. I would look well at the
> performance of the objects created.
> 


> -- Daniele
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Latest developments in psycopg3

From
Rory Campbell-Lange
Date:
On 08/02/21, Adrian Klaver (adrian.klaver@aklaver.com) wrote:
> On 2/8/21 4:16 AM, Daniele Varrazzo wrote:

> > 1) We can provide a feature to select the type of cursor that is not
> > much dissimilar from psycopg2
> > 2) Do we need DictCursor/RealDictCursor? ISTM that one of the two
> > would be sufficient? Possibly an object inheriting from a Python dict
> > instead of emulating it so that e.g. json.dumps()
> 
> RealDictCursor is what I use primarily as it plays well with json.*. I have
> used NamedTupleCursor for those things that play better with dot notation. I
> very rarely use DictCursor anymore as it's dual nature seemed to get in the
> way more then it helped.

Your comments also suggests the possibility of having a convenience
method, something like ._asjson() perhaps to allow rows or entire result
sets json encoded.





Re: Latest developments in psycopg3

From
Daniel Fortunov
Date:
> 1) We can provide a feature to select the type of cursor that is not
> much dissimilar from psycopg2
> 2) Do we need DictCursor/RealDictCursor? ISTM that one of the two
> would be sufficient? Possibly an object inheriting from a Python dict
> instead of emulating it so that e.g. json.dumps()

Inheriting from dict is strange because we only need read access. So
rather collections.abc.Mapping, I think.
 
One advantage of using a real dict is that it leaves the door open for the caller to mutate the results as they consume them. A little impure, I know, but one practical example I can think of would be where you want to augment your results with additional data (coming from a different source, outside the database).

RealDictCursor gives you the flexibility to do this, whereas collections.abc.Mapping would force you to construct a new dict instance for every row in order to get the same result.