Thread: Latest developments in psycopg3
Hello, (message is getting so long as to require titles, so here they are) # Latest adaptation woes I have finally broken the adaptation model where a Python class gets adapted to a Postgres type. The model is what psycopg2 is based on but for psycopg3 it was proving unsatisfactory. The problems were some of the ones discussed in a previous thread, revolving around the different behaviour of server-side binding (psycopg3) vs. client-side binding (psycopg2). With client side binding psycopg2 could get away with a lot that is no longer allowed by the server; trying to use binary parameters makes it even more tricky. The area of the numbers is the most prominent: "select current_date + %s" only works if the parameter is specified as either int2 or int4. If it's int8 or numeric, Postgres will complain that there is no cast. If it's unknown, Postgres will complain that it cannot choose if %s is a number or an interval. Other thorny problems are with arrays: you know that in Python there are lists of any types (we hope they are uniform here) and in Postgres there are arrays of a specific type. Just the type list doesn't say enough, and empty lists are an even more special case: you cannot have an array of unknowns, but without an item in it you cannot really know its oid. You can specify the whole array to be unknown, with the '{}' syntax and unknown oid, but only in text, not in binary. Another funny area is with ranges, as postgres can cast between timestamp and timestamptz transparently but there's no cast between tsrange and tstzrange, and Python has a single datetime object which you have to inspect to know if it's tz aware or naive... # Introducing two-steps adaptation For some time, in order to accommodate some of these quirks, there were special cases in the adaptation functions, but now there is a generic mechanism which can be used for these Python types whose Postgres type cannot be determined only by the class but the value must be observed. Choosing what adapter class to use for the object 'obj' goes this way [1]: 1) start from a key = type(object) 2) look for such key in the adapters map. If you don't find it it's an error (cannot adapt object 'obj'). If found, you have an object dumper of a subclass of Dumper. 3) ask the dumper if it needs an upgrade by calling dumper.get_key(obj). If it returns the same key as in 1) (which is the base implementation) then there's nothing to do 4) if it returns a different key, call dumper.upgrade(obj) which returns a new dumper instance. Details omitted from this implementation are caching (the reason for the twins get_key/upgrade) and managing the binary/text format: they can be seen in the implementation in [1]. [1]: https://github.com/psycopg/psycopg3/blob/68547b8c16d3b3b093fba8d6df1f56cf52b37f08/psycopg3/psycopg3/_transform.py#L131 This schema allows all the pretty things: a Python int can be dumped as int2, int4, int8, numeric according to its value; empty lists can be adapted as unknown '{}' if the query is willing to accept text params, there is a single Range object, instead of DateRange, IntRange etc... I have to go back on the project of implementing the Django db backend for psycopg3, but a lot of what I had done so far in order to accommodate its differences (often consisting in finding the right place to add a ::integer cast...) can be probably simplified. I am absolutely delighted because it goes in the direction of making a migration from psycopg2 to psycopg3 smoother for many users: the amount of things I had to do for the Django backend was telling me that the thing wasn't trivial yet but I think it's improving now. # Automatic use of binary parameters Another change happened, related, to the above, is the automatic selection of text/binary format for the query arguments. Previously parameters passed to %s placeholders would have been adapted in text, and people would have opted in to binary parameters by specifying a %b placeholder. This rigidity didn't allow for the required adaptation fluidity of the schema above, so now %s allows for automatic selection: if a type has a binary dumper it will be used, otherwise a text dumper will be looked up, and adaptation will fail if neither is available. If people have reasons to use a specific format they can specify %b or %t as placeholders for the binary or the text format. # Connecting oid catalog and adaptation contexts The next changes in the adaptation area will be around unifying the oids map (mapping type names to oids) with the adaptation context (mapping types to adaptation objects). At the moment the former is global and supposed for builtins only, the latter is attached to a connection or a cursor. This causes a few problems: one is around looking up custom types in the catalog and registering adapters for them, because the thing happens in two different places and the global one is not supposed to be modified. Another is in certain areas where the types are not dictated by the Python types of the arguments, like in normal queries, but by what Postgres is willing to accept: it happens with casting composite types in binary and in binary COPY FROM, where Postgres doesn't perform any cast. In these places the array of dumpers to use should be configured in other ways, for example specifying the list of type names to pass, which may contain custom data types too. # jsonb binary loading The thing I've worked on in the last few days is something entirely new and exciting instead: finding a way to transfer jsonb in binary (at the moment jsonb is only passed as text to the client). Preliminary results seem to suggest a lot of performance to be gained. (https://github.com/dvarrazzo/jsonb_parser) and it's something I will explore further with the help of some of the PostgresPro people, who are keeping on developing jsonb in the backend. # single-row mode The libpq allows a single-row mode, but given the batching nature of the fetch*() methods it's not something of obvious utility. However a few alternative databases are emerging (Materialise, CockroachDB) which use the query protocol as a stream: pretty much a never-ending query. I have added a `stream()` method [2] to the cursor, allowing to consume these results too in single-row mode. https://www.psycopg.org/psycopg3/docs/cursor.html#psycopg3.Cursor.stream # Thank you for the support Thank you very much to the companies and the individuals sponsoring this project: https://www.psycopg.org/sponsors/ If anyone would like to contribute, in ways other than funding, there is also fun code to write (strictly typed, async, Cython if you wish) and you are welcome to get in touch: I am sure there are practical ways to contribute! Cheers -- Daniele
Hello Daniele, Daniele Varrazzo a écrit : > If anyone would like to contribute, in ways other than funding, there > is also fun code to write (strictly typed, async, Cython if you wish) > and you are welcome to get in touch: I am sure there are practical > ways to contribute! I spent a bit of time looking around this last week. The first thing I did was to try out psycopg3 on a (small) project I maintain (pg_activity). Porting from psycopg2 wasn't too hard: - first, I needed the conn-info developments (in progress at the moment) along with recent works on the adaptation system so I did a local merge of master in conn-info branch. - second, I was missing the DictCursor cursor factory that's not (yet?) available in psycopg3, so I had to rewrite part of the code. - after that, tests pass and the application works fine without any further changes (the DictCursor ones are non-negligible though). So, based on that: - It seems to me that the conn-info developments work fine (also looked at the commits, which look fine as well to me): what needs to be done here? Any help needed? - There appears to be no card about dictcursor on the project board. What's the plan (and priority) for that (if any)? Then, following your idea to implement a django backend, I was wondering if doing the same for sqlalchemy would be a good thing to work on now? I'm also curious about the "strictly typed" thing you wrote above. What do you have in mind more specifically? Denis
Hello Denis, On Mon, 8 Feb 2021 at 12:14, Denis Laxalde <denis.laxalde@dalibo.com> wrote: > I spent a bit of time looking around this last week. Thank you very much, that's very appreciated :) > The first thing I did was to try out psycopg3 on a (small) project I > maintain (pg_activity). Porting from psycopg2 wasn't too hard: Nice to know! > - It seems to me that the conn-info developments work fine (also looked > at the commits, which look fine as well to me): what needs to be done > here? Any help needed? I can't remember what is missing: probably the interface is complete, it is missing test coverage and documentation. I will review if there are methods missing: its purpose is to expose pretty much all the info about the connection that the libpq can give, correctly decoded from bytes. Complete tests and docs are definitely to be added. > - There appears to be no card about dictcursor on the project board. > What's the plan (and priority) for that (if any)? I forgot to add a point about them, sorry: I added one now. There is no firm plan for me, but probably a conversation to have, because I have my own usage patterns for psycopg2, when I am its user, and I want to be mindful of other use cases too. In short: yes, I would like to provide alternative record factories. We can review if the best way is to create cursor subclasses out-of-the-box or mixins. In psycopg2 there are: - DictCursor (returns a hybrid object between a sequence and a dictionary) - RealDictCursor (returns a straight subclass of a dictionary, so it's not a legit dbapi cursor as it doesn't expose a sequence) - NamedTupleCursor (what it says on the tin). Personally I use the NamedTupeCursor most often, as I prefer the dot notation to access attribute. Named tuples also have a _dict() method when a dict is needed. If psycopg3 was only my pet project, I would personally have NamedTuples as the default and only thing supported... But that's hardly the case. 1) We can provide a feature to select the type of cursor that is not much dissimilar from psycopg2 2) Do we need DictCursor/RealDictCursor? ISTM that one of the two would be sufficient? Possibly an object inheriting from a Python dict instead of emulating it so that e.g. json.dumps() 3) Or, conversely, we could have a Row object providing both attribute and getattr access, both as index and name: row.myattr row["myattr"] row[1] 4) There are reports of the *DictCursor being slow: the objects in psycopg2 have hardly been profiled. I would look well at the performance of the objects created. I am curious about why people use DictCursor: what are its advantages? Should we provide such objects only to guarantee a smoother upgrade path or there are use cases where they are genuinely better than NamedTuples or a would-be Row object (JSON does come to mind). > Then, following your idea to implement a django backend, I was wondering > if doing the same for sqlalchemy would be a good thing to work on now? Yes, why not. I don't know SQLAlchemy but I assume it would be way less work to create a SA backend than a Django one. While on this topic, in case anyone takes a look at the Django backend and thinks about using or hacking on it: the last commits were done before I implemented the two-steps dumers; as such there are *way* more changes than there should be now that Python int are dumped as int2/int4 if they are smaller than their respective MAX. So the current state is not representative of how it can look and I expect it will have many less differences than the psycopg2 backend. > I'm also curious about the "strictly typed" thing you wrote above. What > do you have in mind more specifically? The codebase uses mypy --strict, so it is extensively type-annotated and of course all the public interfaces are type-annotated too, to collaborate with projects using static typing. I am also curious about whether it would be possible to provide types out of the query results, but I doubt it is. Which makes me think... 5) PydanticCursor? Or, as a more general case, implement all the cursor flavours not as subclasses but using a rowfactory function, and allow people to use their own (e.g. asking a cursor to create a Pydantic model provided by the user, so that they can have psycopg3 collaborate with e.g. FastAPI strictly typed models). Or a Python dataclass... Thank you for the feedback so far! -- Daniele
Daniele Varrazzo a écrit : > On Mon, 8 Feb 2021 at 12:14, Denis Laxalde <denis.laxalde@dalibo.com> wrote: >> - It seems to me that the conn-info developments work fine (also looked >> at the commits, which look fine as well to me): what needs to be done >> here? Any help needed? > > I can't remember what is missing: probably the interface is complete, > it is missing test coverage and documentation. I will review if there > are methods missing: its purpose is to expose pretty much all the info > about the connection that the libpq can give, correctly decoded from > bytes. Complete tests and docs are definitely to be added. Ok, good to know. (Let me know if help is needed.) >> - There appears to be no card about dictcursor on the project board. >> What's the plan (and priority) for that (if any)? > > I forgot to add a point about them, sorry: I added one now. There is > no firm plan for me, but probably a conversation to have, because I > have my own usage patterns for psycopg2, when I am its user, and I > want to be mindful of other use cases too. > > In short: yes, I would like to provide alternative record factories. > We can review if the best way is to create cursor subclasses > out-of-the-box or mixins. In psycopg2 there are: > > - DictCursor (returns a hybrid object between a sequence and a dictionary) > - RealDictCursor (returns a straight subclass of a dictionary, so it's > not a legit dbapi cursor as it doesn't expose a sequence) > - NamedTupleCursor (what it says on the tin). > > Personally I use the NamedTupeCursor most often, as I prefer the dot > notation to access attribute. Named tuples also have a _dict() method > when a dict is needed. If psycopg3 was only my pet project, I would > personally have NamedTuples as the default and only thing supported... > But that's hardly the case. Personally, I use DictCursor because it provides access to rows without depending on SELECT order and then pass items to a custom dataclass as **kwargs. > 1) We can provide a feature to select the type of cursor that is not > much dissimilar from psycopg2 > 2) Do we need DictCursor/RealDictCursor? ISTM that one of the two > would be sufficient? Possibly an object inheriting from a Python dict > instead of emulating it so that e.g. json.dumps() Inheriting from dict is strange because we only need read access. So rather collections.abc.Mapping, I think. > 3) Or, conversely, we could have a Row object providing both attribute > and getattr access, both as index and name: > > row.myattr > row["myattr"] > row[1] +1 (this would also fit with collections.abc.Mapping) > > 4) There are reports of the *DictCursor being slow: the objects in > psycopg2 have hardly been profiled. I would look well at the > performance of the objects created. > > I am curious about why people use DictCursor: what are its advantages? > Should we provide such objects only to guarantee a smoother upgrade > path or there are use cases where they are genuinely better than > NamedTuples or a would-be Row object (JSON does come to mind). I seem to recall that namedtuple was not good for performances as either (see this article https://lwn.net/Articles/731423/ for instance). I tend to avoid using them and often write custom classes instead (which is very handy with py3.7's dataclass). So, for me, no use case that wouldn't be covered by "would-be Row object". >> I'm also curious about the "strictly typed" thing you wrote above. What >> do you have in mind more specifically? > > The codebase uses mypy --strict, so it is extensively type-annotated > and of course all the public interfaces are type-annotated too, to > collaborate with projects using static typing. I am also curious about > whether it would be possible to provide types out of the query > results, but I doubt it is. That does not sounds feasible/easy. It's the price to pay where interfacing with external data sources (e.g. the same is true when one retrieves data from a JSON API) and type checking is probably only possible when "decoding" results. > Which makes me think... 5) PydanticCursor? Or, as a more general case, > implement all the cursor flavours not as subclasses but using a > rowfactory function, and allow people to use their own (e.g. asking a > cursor to create a Pydantic model provided by the user, so that they > can have psycopg3 collaborate with e.g. FastAPI strictly typed > models). Or a Python dataclass... Providing a plugable API for that would be really nice, indeed.
Am Mon, Feb 08, 2021 at 01:16:43PM +0100 schrieb Daniele Varrazzo: > 2) Do we need DictCursor/RealDictCursor? For what it is worth, GNUmed does use DictConnection for all its database needs. Karsten -- GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B
On 08/02/21, Daniele Varrazzo (daniele.varrazzo@gmail.com) wrote: > > - There appears to be no card about dictcursor on the project board. > > What's the plan (and priority) for that (if any)? ... > In psycopg2 there are: > > - DictCursor (returns a hybrid object between a sequence and a dictionary) > - RealDictCursor (returns a straight subclass of a dictionary, so it's > not a legit dbapi cursor as it doesn't expose a sequence) > - NamedTupleCursor (what it says on the tin). > > Personally I use the NamedTupeCursor most often, as I prefer the dot > notation to access attribute. ... > 1) We can provide a feature to select the type of cursor that is not > much dissimilar from psycopg2 > 2) Do we need DictCursor/RealDictCursor? ISTM that one of the two > would be sufficient? Possibly an object inheriting from a Python dict > instead of emulating it so that e.g. json.dumps() > 3) Or, conversely, we could have a Row object providing both attribute > and getattr access, both as index and name: > > row.myattr > row["myattr"] > row[1] A comment on this point. We use NamedTupleCursor a great deal as our system domain is modelled largely in the database in plpgsql. The only issue we've had with using NamedTupleCursor is when someone uses a space in column heading (the space translates to a "_") or trips up and uses the same column name for two columns. We class both cases as a mistake. Instead of making the NamedTupleCursor the default, I imagine many people accessing "raw" sql from python would benefit from the third option, of permitting list, dict or getattr type access to values. This would, presumably, be backwards compatible with current usage which I imagine is typically list oriented. A major benefit of getattr type access is that interactive debugging in iPython allows dot expansion, so introspection of the the row's attributes is wonderfully easy. Regards Rory
On Mon, 8 Feb 2021 at 15:16, Denis Laxalde <denis.laxalde@dalibo.com> wrote: > > Daniele Varrazzo a écrit : > > On Mon, 8 Feb 2021 at 12:14, Denis Laxalde <denis.laxalde@dalibo.com> wrote: > >> - It seems to me that the conn-info developments work fine (also looked > >> at the commits, which look fine as well to me): what needs to be done > >> here? Any help needed? > > > > I can't remember what is missing: probably the interface is complete, > > it is missing test coverage and documentation. I will review if there > > are methods missing: its purpose is to expose pretty much all the info > > about the connection that the libpq can give, correctly decoded from > > bytes. Complete tests and docs are definitely to be added. > > Ok, good to know. (Let me know if help is needed.) That object should have more or less the same attributes of psycopg2 ConnectionInfo. [1] I looked again at the state of the branch and I remember spending some time on get_parameters() (involving finding a bug in the libpq). I haven't checked one by one the attributes but I think a few are still missing. I would like to have at least one test per attribute, just to catch regressions, and documentation, which would largely be based on one line of docstrings and Sphinx automethod. So, not a difficult task, if a bit tedious. If you would like to help on it I don't have work on that branch in mind in the next few days. [1] https://www.psycopg.org/docs/extensions.html#psycopg2.extensions.ConnectionInfo > > Which makes me think... 5) PydanticCursor? Or, as a more general case, > > implement all the cursor flavours not as subclasses but using a > > rowfactory function, and allow people to use their own (e.g. asking a > > cursor to create a Pydantic model provided by the user, so that they > > can have psycopg3 collaborate with e.g. FastAPI strictly typed > > models). Or a Python dataclass... > > Providing a plugable API for that would be really nice, indeed. For something more creative you could maybe take a look at this: maybe we could give both the connection and the cursor a row_factory attribute. I am not sure if just a callable would be enough: named tuples, and in lesser measure dictionaries, would benefit from being warned "a new result is coming: this is its shape". A clean interface might be: class RowFactory: def set_result(self, cursor: BaseCursor): ... def make_row(self, values: Tuple[...]): ... The library could provide some basic ones, for instance the equivalent of DictCursor (real dicts) might be: class DictRowFactory: def set_result(self, cursor): self.titles = [d.name for d in cursor.description] def make_row(self, values): return dict(zip(self.titles, values)) This feature is definitely to be done before the first library release but I can totally find something else to do and leave space to others. Cheers -- Daniele
On 2/8/21 4:16 AM, Daniele Varrazzo wrote: > Hello Denis, > > In short: yes, I would like to provide alternative record factories. > We can review if the best way is to create cursor subclasses > out-of-the-box or mixins. In psycopg2 there are: > > - DictCursor (returns a hybrid object between a sequence and a dictionary) > - RealDictCursor (returns a straight subclass of a dictionary, so it's > not a legit dbapi cursor as it doesn't expose a sequence) > - NamedTupleCursor (what it says on the tin). > > Personally I use the NamedTupeCursor most often, as I prefer the dot > notation to access attribute. Named tuples also have a _dict() method > when a dict is needed. If psycopg3 was only my pet project, I would > personally have NamedTuples as the default and only thing supported... > But that's hardly the case. > > 1) We can provide a feature to select the type of cursor that is not > much dissimilar from psycopg2 > 2) Do we need DictCursor/RealDictCursor? ISTM that one of the two > would be sufficient? Possibly an object inheriting from a Python dict > instead of emulating it so that e.g. json.dumps() RealDictCursor is what I use primarily as it plays well with json.*. I have used NamedTupleCursor for those things that play better with dot notation. I very rarely use DictCursor anymore as it's dual nature seemed to get in the way more then it helped. > 3) Or, conversely, we could have a Row object providing both attribute > and getattr access, both as index and name: > > row.myattr > row["myattr"] > row[1] > > 4) There are reports of the *DictCursor being slow: the objects in > psycopg2 have hardly been profiled. I would look well at the > performance of the objects created. > > -- Daniele > > -- Adrian Klaver adrian.klaver@aklaver.com
On 08/02/21, Adrian Klaver (adrian.klaver@aklaver.com) wrote: > On 2/8/21 4:16 AM, Daniele Varrazzo wrote: > > 1) We can provide a feature to select the type of cursor that is not > > much dissimilar from psycopg2 > > 2) Do we need DictCursor/RealDictCursor? ISTM that one of the two > > would be sufficient? Possibly an object inheriting from a Python dict > > instead of emulating it so that e.g. json.dumps() > > RealDictCursor is what I use primarily as it plays well with json.*. I have > used NamedTupleCursor for those things that play better with dot notation. I > very rarely use DictCursor anymore as it's dual nature seemed to get in the > way more then it helped. Your comments also suggests the possibility of having a convenience method, something like ._asjson() perhaps to allow rows or entire result sets json encoded.
> 1) We can provide a feature to select the type of cursor that is not
> much dissimilar from psycopg2
> 2) Do we need DictCursor/RealDictCursor? ISTM that one of the two
> would be sufficient? Possibly an object inheriting from a Python dict
> instead of emulating it so that e.g. json.dumps()
Inheriting from dict is strange because we only need read access. So
rather collections.abc.Mapping, I think.
One advantage of using a real dict is that it leaves the door open for the caller to mutate the results as they consume them. A little impure, I know, but one practical example I can think of would be where you want to augment your results with additional data (coming from a different source, outside the database).
RealDictCursor gives you the flexibility to do this, whereas collections.abc.Mapping would force you to construct a new dict instance for every row in order to get the same result.