Re: about client-side cursors - Mailing list psycopg

From Denis Laxalde
Subject Re: about client-side cursors
Date
Msg-id 20210204111647.4s3qie5cnassksa5@dalibo.com
Whole thread Raw
In response to Re: about client-side cursors  (Daniele Varrazzo <daniele.varrazzo@gmail.com>)
Responses Re: about client-side cursors  (Christophe Pettus <xof@thebuild.com>)
Re: about client-side cursors  (Daniele Varrazzo <daniele.varrazzo@gmail.com>)
List psycopg
Hi,

Daniele Varrazzo a écrit :
> On Wed, 3 Feb 2021 at 17:16, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
> 
> > I'd like to discuss about the concept of client-side cursor as defined
> > in psycopg.
> 
> Welcome Denis!
> 
> 
> > This is a concept I've always been quite uncomfortable with;
> > I'll try to explain why.
> >
> > First, from a practical point of view:
> >
> >    cur = connection.cursor()
> >    cur.execute(...)
> >    result = cur.fetchxxx()
> >
> > This typical pattern seems awkward to me. Where do IO operations
> > actually take place? (on execute()? on fetch()? on both?)
> 
> For a client-side cursor, it happens on execute, but for a server-side
> cursor it happens in both (execute creates the cursor, I have no idea
> as to when the server decides to hit the disks, but for the client
> every fetch is a roundtrip with the server).

Then, having fetch*() methods on client-side cursors is also misleading
because nothing gets fetched actually, it's just result unpacking.

> > Then, and besides the terminology, the fact that named cursors are very
> > different from client-side cursors while being constructed by the same
> > API is confusing. I understand that this may come from the DB-API:
> >
> >   https://www.python.org/dev/peps/pep-0249/#cursor-objects
> >   > [Cursor] objects represent a database cursor. [...]
> >
> > But this does not seem quite right to me since a client-side cursor is
> > actually not a PostgreSQL cursor. (The documention on psycopg2 is clear
> > on the difference, though.)
> 
> Yes, that's correct: the name comes from the dbapi. The dbapi cursor
> does two different things indeed: it runs queries and manages results.
> I understand that the overlap is sort of weird and everyone who has
> said "psycopg sucks: let me write my own driver" have used the mental
> model - much more familiar - of "I ask a query to a connection which
> gives me a result". So usually what is created is something which can
> be used as:
> 
>     result = conn.execute(query, vars)
>     for record in result:
>         ...
> 
> This is understandable, so much that in psycopg3 I've added
> conn.execute, which is implemented as:
> 
>     def execute(self, query, vars)
>         cur = self.cursor()
>         cur.execute(query, vars)
>         return cur
> 
> and allow you to use the cursor only to iterate on the data returned,
> ignoring the execute*() methods.

I am aware of that and that's indeed a nice addition. As I mentioned in
the thread introducing this feature, I still think that returning a
(client-side) cursor here is confusing.

It seems to me that there something wrong in either (or both) the DB-API
or the way it got implemented in psycopg:

* PEP-249 states that "cursor objects represent a database cursor",

* it also states that "if the database does not provide a direct cursor
  concept, the module will have to emulate cursors using other means to
  the extent needed by this specification".

But, unless I missed it, the PEP does not state how to implement query
execution and result fetch operations if the database does not need a
cursor.

It seems to me that connection.execute() provides an implementation to
that case, except that it'd better if it did not return a misnamed
cursor but some kind of "result set" or "rows" object. As you mentioned,
some other Python drivers already do this.

Alternatively (or in complement), how about renaming connection.cursor()
as connection.client() for client-side cursors (while keep cursor() for
server-side ones)? That would fulfill the second statement above, I
think. (Or is it something closer to a "portal", as defined in the
Extended Query protocol? Maybe not...)

Looking around in other programming languages, the situation is similar:

* Golang's libpq https://godocs.io/github.com/lib/pq follows
  https://github.com/golang/go/wiki/SQLInterface : queries are executed
  on the connection object and Rows object are returned.

* In Rust, there is https://docs.rs/crate/postgres/0.19.0 (sync) which
  is quite similar and https://docs.rs/tokio-postgres/0.7.0/tokio_postgres/
  (async), which exposes a Client type to execute queries.

With psycopg3 coming up, I believe it'd be a good opportunity to improve
that.


> > In asyncio context, while playing around with psycopg3, the situation is
> > even worse because of the additional async/await keywords; for instance:
> >
> >    conn = await psycopg3.AsyncConnection.connect()
> >    async with conn:
> >        cur = await conn.execute("SELECT * FROM test")
> >        r = await cur.fetchall()
> >
> > Why do we need two 'await' (i.e. two IO operations) on conn.execute()
> > and cur.fetchall() if 'cur' is a client-side cursor? (Back to my first
> > point about where IO happen, this appears to depend on whether the
> > connection has 'autocommit' or not...)
> 
> I am going a bit back and forth on this mentally, although not in the
> code. The forces pulling in different directions are:
> 
> 1) if a method may possibly do I/O activity, then it should be async.
> I don't know where I have read it and how strong is this guideline,
> however it makes sense: it's better to have a function to call with
> await where the await is actually not needed than a function which
> should have been await but now you are stuck with a blocking
> interface.

I gently disagree here. This is essentially the same issue that having
cursor.fetch*() not fetching anything, but it's worse than "just" a
naming problem because it's on the control flow.

For instance, consider the following examples with and without
autocommit on the connection to demonstrate where IO takes place:

    conn = await psycopg3.AsyncConnection.connect(autocommit=False)
    async with conn:
        cur = await conn.execute("SELECT 1")
        rows = await cur.fetchall()  # IO happens here

    conn = await psycopg3.AsyncConnection.connect(autocommit=True)
    async with conn:
        cur = await conn.execute("SELECT 1")  # IO happens here
        rows = await cur.fetchall()

This is quite unexpected, and behaves differently on sync cursors (IO
always happens on execute()).

> 2) if we want to have client and server cursor, it would be better if
> they had the same interface

(Unless we drop the client-side cursor, or rename it.)

> The point 2 is pretty true in psycopg2 because there is a single
> cursor class, and every method is if'ed like:
> 
>     def fetchmany(self, nrecs):
>         if not self.name:
>             # client-side cursor
>             res = self.result[self.current, self.current + nrecs]
>             self.current += nrecs
>             return res
>         else:
>             # server-side cursor
>             result = self.connection.execute(f"FETCH {nrecs} FROM {self.name}")
>             return result
> 
> I don't think this is a good implementation, so in psycopg3 I would
> like to have a Cursor interface with a ClientCursor and a ServerCursor
> implementation, without the if for each method. In async mode,
> AsyncClientCursor.fetchmany() doesn't actually need to be async, but
> AsyncServerCursor.fetchmany() does I/O activity. However, would two
> different implementation classes demand two different interfaces?

These are not just two different implementations, these are two
different things. So, yes, different interfaces (and names) would be
better, IMHO.

> > All in all, my main point is: (why) do we need client-side cursors?
> 
> I think client-side cursors are the bread and butter of querying the
> database. They are a lightweight and stateless way to receive a
> result: you ask a question, you receive an answer, and all is over.

Ok. But then, they are just a proxy to the actual results already
fetched upon execute() with no added value, aren't they?

> You can put caches between client server, connection pools...

How is this related to client-side cursors?

> Server-side cursors impy a state persisted across calls and shared
> between clients and server, which is much more heavyweight, scales
> less for the server, and makes the identity of the connection
> precious.
> 
> If any I would do without the server-side cursor, but it would be
> abdicating a responsibility that the driver should handle IMO, so
> although I'm not a fan I will definitely implement them from psycopg3.
> 
> 
> There are choices to be made, and it's one of those points for which
> I'd be happy to hear other experiences; I haven't asked around yet
> because I first wanted to implement a sketch of server-side cursors
> and then look at the 4 class side to side (to side to side) and decide
> what to do. Let's call them:
> 
> - SyncClientCursor
> - AsyncClientCursor
> - SyncServerCursor
> - AsyncServerCursor
> 
> The Server versions have I/O done on fetch*() and on close(). So the
> first three objects might have non-async fetch(). I was preferring to
> give all them an uniform interface so the documentation would be "on
> async communication, always use async when fetching", as opposite as
> "on async use "await execute()", no await on fetch() for client-side
> cursor, but "await fetch()` for server-side ones". In practice, I am
> envisioning a SyncCursor object interface, with SyncClient and
> SyncServer being two different implementations, and an AsyncCursor
> interface, with AsyncClient and AsyncServer being two different
> implementations:
> 
>     @abstract
>     class SyncCursor:
>         def execute()
>         def fetch()
> 
>     class SyncClientCursor(SyncCursor):
>         def execute()
>         def fetch()
> 
>     class SyncClientCursor(SyncCursor):
>         def execute()
>         def fetch()
> 
>     @abstract
>     class AsyncCursor:
>         async def execute()
>         async def fetch()
> 
>     class AsyncClientCursor(AsyncCursor):
>         await def execute()
>         await def fetch()
> 
>     class AsyncClientCursor(AsyncCursor):
>         await def execute()
>         await def fetch()
> 
> Doing without await on AsyncClientCursor.fetch() creates an asymmetry:
> 
>     class AsyncClientCursor:
>         await def execute()
>         def fetch()
> 
>     class AsyncClientCursor:
>         await def execute()
>         await def fetch()
> 
> so you couldn't write a function "processResults(acur)" independent of
> whether the cursor is client or server, whereas you can do it no
> problem for sync cursors.
> 
> That's my brain dump! @Denis: it's just more to explain current
> implementation and mental processes, however in pg3 there is still
> space to make changes and I'm happy to hear yours and others input. I
> think that using await uniformly makes things easier to think about
> and code to compose but I'm happy to hear if there are different
> opinions.

So my humble opinion is that naming and control flow should be explicit.
For instance, we could drop the fetch*() method from client-side cursors
(to be renamed) and add result access methods:

    class AsyncClient:
        await def execute()
        def first()  # return first row
        def all()  # return all rows

while keeping the cursor term for real cursors:

    class AsyncCursor:
        await def execute()
        await def fetch()


Thanks for you detailed and kind reply, Daniele. Also, besides
discussing here (kind of a self-introduction), I'd be happy to help out
code-wise on this particular aspect or another.

-- 
Denis



psycopg by date:

Previous
From: Daniele Varrazzo
Date:
Subject: Re: about client-side cursors
Next
From: Christophe Pettus
Date:
Subject: Re: about client-side cursors