Thread: Resdhift's lack of cursors and PQsetSingleRowMode

Resdhift's lack of cursors and PQsetSingleRowMode

From

Ryan Kelly

Date:

24 December 2013, 16:45:49

Hi:

We're using psycopg2 to connect to Amazon's Redshift, and one problem
Redshift has is that it lacks server-side cursors. What I'd like is to
be able to use PQsetSingleRowMode to avoid consuming all the data at
once (in specific circumstances, of course). I'm not entirely sure what
in psycopg2 would need to be modified, but it seems like the changes
would mostly be restricted to the `async == 0` portion of `pq_execute`,
but I'm not familiar enough with the source to be certain.

-Ryan Kelly

Re: Resdhift's lack of cursors and PQsetSingleRowMode

From

Daniele Varrazzo

Date:

24 December 2013, 19:24:30

On Tue, Dec 24, 2013 at 4:45 PM, Ryan Kelly <rpkelly22@gmail.com> wrote:
> Hi:
>
> We're using psycopg2 to connect to Amazon's Redshift, and one problem
> Redshift has is that it lacks server-side cursors. What I'd like is to
> be able to use PQsetSingleRowMode to avoid consuming all the data at
> once (in specific circumstances, of course). I'm not entirely sure what
> in psycopg2 would need to be modified, but it seems like the changes
> would mostly be restricted to the `async == 0` portion of `pq_execute`,
> but I'm not familiar enough with the source to be certain.

Hello Ryan,

supporting SingleRowMode doesn't look like a trivial change. I have
personally no plan to work on it but of course I'm not against
contributions. Here's a bunch of random thoughts.

General implementation of psycopg with SingleRowMode seems relatively
hard: for instance implementing fetchmany(10) would require 10 calls
and handling the result from 10 PQresult structures into a temporary
storage we currently don't need and don't have: currently we have a
single result both with client-side and server-side cursor. It would
be easier to implement a subset of the interface, e.g. forward only
iteration with no bookkeeping (no scroll(), no rownumber...).

ISTM that in single row mode you can only get a single row per network
operation. If you have a large dataset (likely if you want this
feature) you may end up in a large overhead, the sort of problem we
have solved with cursor.itersize but whose solution doesn't seem to
apply for SingleRowMode.

The async codepath is already lacking several features: I wouldn't
bother for SingleRowMode with it, at least in a first iteration (I
haven't analyzed if it's doable or not).

If you want to work on the feature you may also start hacking on the
pure python implementation of the driver
(https://github.com/chtd/psycopg2cffi is the most up to date): it is
quite a straightforward porting so reimplementing the changeset in C
wouldn't be too hard.

I may be wrong, but in my opinion the one-roundtrip-per-row would be a
performance killer and you may end up better paginating manually via
LIMIT/OFFSET, although the server wouldn't love you for that (the
server still computes the rows before the offset, only to discard
them, so querying a large dataset in pages is a quadratic operation).

-- Daniele

Re: Resdhift's lack of cursors and PQsetSingleRowMode

From

Marko Kreen

Date:

26 December 2013, 13:37:59

On Tue, Dec 24, 2013 at 07:23:58PM +0000, Daniele Varrazzo wrote:
> On Tue, Dec 24, 2013 at 4:45 PM, Ryan Kelly <rpkelly22@gmail.com> wrote:
> > We're using psycopg2 to connect to Amazon's Redshift, and one problem
> > Redshift has is that it lacks server-side cursors. What I'd like is to
> > be able to use PQsetSingleRowMode to avoid consuming all the data at
> > once (in specific circumstances, of course). I'm not entirely sure what
> > in psycopg2 would need to be modified, but it seems like the changes
> > would mostly be restricted to the `async == 0` portion of `pq_execute`,
> > but I'm not familiar enough with the source to be certain.
>
> Hello Ryan,
>
> supporting SingleRowMode doesn't look like a trivial change. I have
> personally no plan to work on it but of course I'm not against
> contributions. Here's a bunch of random thoughts.
>
> General implementation of psycopg with SingleRowMode seems relatively
> hard: for instance implementing fetchmany(10) would require 10 calls
> and handling the result from 10 PQresult structures into a temporary
> storage we currently don't need and don't have: currently we have a
> single result both with client-side and server-side cursor. It would
> be easier to implement a subset of the interface, e.g. forward only
> iteration with no bookkeeping (no scroll(), no rownumber...).

The single row mode is designed for following high-level API:

    curs = db.single-row-mode-cursor()
    curs.execute(sql)
    for row in curs.fetchall():
    process(row)

IOW, it is meant to implement efficient iterators.

In case of psycopg, it might be best to implement as alternative
cursor object?  Integraring with existing cursor implementation
might create too much complexity perhaps?

> ISTM that in single row mode you can only get a single row per network
> operation. If you have a large dataset (likely if you want this
> feature) you may end up in a large overhead, the sort of problem we
> have solved with cursor.itersize but whose solution doesn't seem to
> apply for SingleRowMode.

No, libpq will still works with 8KB buffers over network.  Single
row mode will return rows one-by-one from that buffer.  So it's only
slightly inefficient as it creates PGresult object for each row,
instead one for full resultset.  This inefficiency is offset by
much better cache usage for larger resultsets.

In any case, there is no difference on how it operates over TCP
compared to plain PQexec().

--
marko

Re: Resdhift's lack of cursors and PQsetSingleRowMode

From

Daniele Varrazzo

Date:

27 December 2013, 13:00:20

On Thu, Dec 26, 2013 at 1:37 PM, Marko Kreen <markokr@gmail.com> wrote:

> The single row mode is designed for following high-level API:
>
>     curs = db.single-row-mode-cursor()
>     curs.execute(sql)
>     for row in curs.fetchall():
>         process(row)

Because it's neither easy nor necessary to have a full-fledged cursor
object we could just have a new method on the cursor, returning an
iterable object responsible of all the state during the iteration:
something like:

    for r in cur.execute_iter(query [, args]):    # better name?
        process(row)

Because in the DBAPI querying and retrieving is done with different
set of methods, I would have preferred something like:

    cur.execute(query)
    for row in cur.iter_single():
        process(row)

but usually psycopg calls both PQexec and PQgetResult during execute()
so this interface wouldn't be straightforward to implement.

>> ISTM that in single row mode you can only get a single row per network

> No, libpq will still works with 8KB buffers over network.

Cool, no performance objection then, thank you for the clarification.

-- Daniele

Re: Resdhift's lack of cursors and PQsetSingleRowMode

From

Marko Kreen

Date:

27 December 2013, 14:39:00

On Fri, Dec 27, 2013 at 12:59:52PM +0000, Daniele Varrazzo wrote:
> On Thu, Dec 26, 2013 at 1:37 PM, Marko Kreen <markokr@gmail.com> wrote:
> > The single row mode is designed for following high-level API:
> >
> >     curs = db.single-row-mode-cursor()
> >     curs.execute(sql)
> >     for row in curs.fetchall():
> >         process(row)
>
> Because it's neither easy nor necessary to have a full-fledged cursor
> object we could just have a new method on the cursor, returning an
> iterable object responsible of all the state during the iteration:
> something like:
>
>     for r in cur.execute_iter(query [, args]):    # better name?
>         process(row)

Because DB-API 2.0 has already standardized iterator usage, I would
prefer DB-API compatible iterator API.  But I also understand that
adding it has hack to current C code will be messy.

In the long-term psycopg would need C code that handles iterating as
main case, but perhaps short-term we can implement _execute_iter()
in C and then have IterConnection/IterCursor classes that emulate
proper cursor on top of that?

> Because in the DBAPI querying and retrieving is done with different
> set of methods, I would have preferred something like:
>
>     cur.execute(query)
>     for row in cur.iter_single():
>         process(row)
>
> but usually psycopg calls both PQexec and PQgetResult during execute()
> so this interface wouldn't be straightforward to implement.

In any case the .execute() needs to know about iterating, so it can
do PQsend() + PQsetSingleRowMode() + PQgetResult().  PQgetResult to
get the column headers from first row.  So it must be either cursor
default or flag to execute().

--
marko