Thread: libpq custom row processing

libpq custom row processing

From
Marko Kreen
Date:
There is an item in the current commitfest that might be interesting
to Psycopg - a way to provide a custom callback for result
row processing.  By default, libpq takes row data from the
network buffer and puts it into a PGresult.  This can now
be overridden - column data can be converted immediately
to its final format and even processed in flight:

  https://commitfest.postgresql.org/action/patch_view?id=769

Attached is the current state of the doc.

Any feedback is appreciated, either about the API, about
documentation clarity/typos, or about whether all interesting
scenarios are handled.

--
marko


Re: libpq custom row processing

From
Federico Di Gregorio
Date:
On 14/02/12 13:11, Marko Kreen wrote:
> There is an item in the current commitfest that might be interesting
> to Psycopg - a way to provide a custom callback for result
> row processing.  By default, libpq takes row data from the
> network buffer and puts it into a PGresult.  This can now
> be overridden - column data can be converted immediately
> to its final format and even processed in flight:
>
>    https://commitfest.postgresql.org/action/patch_view?id=769
>
> Attached is the current state of the doc.
>
> Any feedback is appreciated, either about the API, about
> documentation clarity/typos, or about whether all interesting
> scenarios are handled.

Extremely interesting. Thank you very much for this link.

federico


Mac OS 10.7x ??

From
Brian Hamlin
Date:
Hi All -

   I am configuring a Mac laptop to use for coding, and of course I
wanted psycopg2 right away..
But the more I look into this, the more questions are raised..

   I do not want to use the hack to turn off 64-bit python.. why?
If I am not using Python 3 then at least I should be able to keep
64-bit, no?

   The system Postgres is well hidden.. I installed PgAdmin 3 via .dmg
and I have no problems there.. there is a libpq private to PgAdmin.

   But the pg_config that is found in /usr/bin appears to be for the
hidden Postgres. I have stopped my installs at this point, other than
a few KyngChaos Frameworks for geo processing.. like QGis..

   What is the story?  Is Mac OS X an orphan now?
   (I believe I am similarly stuck with no ipython, also)


--
Brian Hamlin
GeoCal
OSGeo California Chapter
415-717-4462 cell



Re: libpq custom row processing

From
Marko Kreen
Date:
On Tue, Feb 14, 2012 at 3:28 PM, Federico Di Gregorio <fog@dndg.it> wrote:
> On 14/02/12 13:11, Marko Kreen wrote:
>> There is an item in the current commitfest that might be interesting
>> to Psycopg - a way to provide a custom callback for result
>> row processing.  By default, libpq takes row data from the
>> network buffer and puts it into a PGresult.  This can now
>> be overridden - column data can be converted immediately
>> to its final format and even processed in flight:

The callback API has been rolled back and the final 9.2 will have
single-row mode instead:

  http://www.postgresql.org/docs/devel/static/libpq-single-row-mode.html

The callback API was hard to use and did not support iterative
result processing.  The new API should fix both problems.
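
For reference, a rough sketch of what the retrieval loop could look
like in single-row mode, going by the doc page above (the table and
column names are made up, and error handling is trimmed):

#include <stdio.h>
#include <libpq-fe.h>

/* Fetch a query result row by row using libpq single-row mode. */
static void
fetch_single_rows(PGconn *conn)
{
    PGresult *res;

    /* Send the query without waiting for the result... */
    if (!PQsendQuery(conn, "SELECT id, name FROM items"))
    {
        fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
        return;
    }

    /* ...and switch to single-row mode before the first PQgetResult(). */
    if (!PQsetSingleRowMode(conn))
        fprintf(stderr, "could not enable single-row mode\n");

    /* Each PGresult now carries at most one row. */
    while ((res = PQgetResult(conn)) != NULL)
    {
        switch (PQresultStatus(res))
        {
            case PGRES_SINGLE_TUPLE:
                /* Process one row; only this row is buffered in memory. */
                printf("%s\t%s\n",
                       PQgetvalue(res, 0, 0), PQgetvalue(res, 0, 1));
                break;

            case PGRES_TUPLES_OK:
                /* Zero-row result marking the successful end of the set. */
                break;

            default:
                /* E.g. PGRES_FATAL_ERROR: some rows may already have been
                   processed above - see the warning in the docs. */
                fprintf(stderr, "query failed: %s",
                        PQresultErrorMessage(res));
                break;
        }
        PQclear(res);
    }
}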

--
marko

Re: libpq custom row processing

From
Federico Di Gregorio
Date:
On 07/08/12 12:51, Marko Kreen wrote:
> The callback API has been rolled back and the final 9.2 will have
> single-row mode instead:
>
>   http://www.postgresql.org/docs/devel/static/libpq-single-row-mode.html
>
> The callback API was hard to use and did not support iterative
> result processing.  The new API should fix both problems.

Interesting, but this is quite intimidating:

While processing a query, the server may return some rows and then
encounter an error, causing the query to be aborted. Ordinarily, libpq
discards any such rows and reports only the error. But in single-row
mode, those rows will have already been returned to the application.
Hence, the application will see some PGRES_SINGLE_TUPLE PGresult objects
followed by a PGRES_FATAL_ERROR object. For proper transactional
behavior, the application must be designed to discard or undo whatever
has been done with the previously-processed rows, if the query
ultimately fails.

Making sure old code continues to work and doesn't process the initial
tuples before an exception is thrown will be challenging.

federico

--
Federico Di Gregorio                         federico.digregorio@dndg.it
Studio Associato Di Nunzio e Di Gregorio                  http://dndg.it
 Everything will be OK at the end. If it's not OK, it's not the end.
                                                              -- Unknown

Re: libpq custom row processing

From
Marko Kreen
Date:
On Tue, Aug 7, 2012 at 2:15 PM, Federico Di Gregorio <fog@dndg.it> wrote:
> On 07/08/12 12:51, Marko Kreen wrote:
>> The callback API has been rolled back and the final 9.2 will have
>> single-row mode instead:
>>
>>   http://www.postgresql.org/docs/devel/static/libpq-single-row-mode.html
>>
>> The callback API was hard to use and did not support iterative
>> result processing.  The new API should fix both problems.
>
> Interesting, but this is quite intimidating:
>
> While processing a query, the server may return some rows and then
> encounter an error, causing the query to be aborted. Ordinarily, libpq
> discards any such rows and reports only the error. But in single-row
> mode, those rows will have already been returned to the application.
> Hence, the application will see some PGRES_SINGLE_TUPLE PGresult objects
> followed by a PGRES_FATAL_ERROR object. For proper transactional
> behavior, the application must be designed to discard or undo whatever
> has been done with the previously-processed rows, if the query
> ultimately fails.
>
> Making sure old code continues to work and doesn't process the initial
> tuples before an exception is thrown will be challenging.

The same thing happens when you fetch a result in a transaction
and a later query fails with an error, thus invalidating the earlier
processing.  So this is nothing new.

Or how about FETCH 100 from a cursor in a transaction,
where the first few fetches succeed and a later one fails.

It's up to user code to handle such cases correctly,
and "correct" here depends on the actual business logic
of the transaction.

The warning is there because there is now a new
failure scenario, not because the failure
needs any kind of special handling.
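
To make the cursor analogy concrete, here is a rough libpq sketch
(table name made up, error checking mostly omitted): the application
may already have handled several FETCH batches by the time a later
one fails, which is exactly the situation single-row mode exposes.

#include <stdio.h>
#include <libpq-fe.h>

/* Illustration only: batch-wise FETCH from a cursor inside a transaction.
   Earlier batches may already have been processed when a later one fails. */
static void
fetch_in_batches(PGconn *conn)
{
    PGresult *res;

    PQclear(PQexec(conn, "BEGIN"));
    PQclear(PQexec(conn, "DECLARE c NO SCROLL CURSOR FOR SELECT * FROM items"));

    for (;;)
    {
        res = PQexec(conn, "FETCH 100 FROM c");
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            /* A later batch failed: rows from the earlier batches have
               already been seen by the application. */
            fprintf(stderr, "FETCH failed: %s", PQresultErrorMessage(res));
            PQclear(res);
            break;
        }
        if (PQntuples(res) == 0)
        {
            PQclear(res);
            break;              /* cursor exhausted */
        }
        /* ... process this batch of up to 100 rows ... */
        PQclear(res);
    }

    PQclear(PQexec(conn, "ROLLBACK"));
}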

--
marko

Re: libpq custom row processing

From
Federico Di Gregorio
Date:
On 07/08/12 13:41, Marko Kreen wrote:
> The same thing happens when you fetch a result in a transaction
> and a later query fails with an error, thus invalidating the earlier
> processing.  So this is nothing new.
>
> Or how about FETCH 100 from a cursor in a transaction,
> where the first few fetches succeed and a later one fails.
>
> It's up to user code to handle such cases correctly,
> and "correct" here depends on the actual business logic
> of the transaction.
>
> The warning is there because there is now a new
> failure scenario, not because the failure
> needs any kind of special handling.

I don't agree. Simple code like:

curs.execute("SELECT * FROM xs")
for x in curs.fetchall():
    # do something like writing to the file system with x

will have a different effect if row-by-row processing is enabled. Before,
nothing would be changed on the file system in case of error: the
fetchall() is "atomic"; now you write to the file system up to the
row that causes the error.

federico

--
Federico Di Gregorio                         federico.digregorio@dndg.it
Studio Associato Di Nunzio e Di Gregorio                  http://dndg.it
 There's no greys, only white that's got grubby. I'm surprised you
  don't know that. And sin, young man, is when you treat people as
  things. Including yourself.                       -- Granny Weatherwax

Re: libpq custom row processing

From
Marko Kreen
Date:
On Tue, Aug 7, 2012 at 2:46 PM, Federico Di Gregorio <fog@dndg.it> wrote:
> On 07/08/12 13:41, Marko Kreen wrote:
>> The same thing happens when you fetch a result in a transaction
>> and a later query fails with an error, thus invalidating the earlier
>> processing.  So this is nothing new.
>>
>> Or how about FETCH 100 from a cursor in a transaction,
>> where the first few fetches succeed and a later one fails.
>>
>> It's up to user code to handle such cases correctly,
>> and "correct" here depends on the actual business logic
>> of the transaction.
>>
>> The warning is there because there is now a new
>> failure scenario, not because the failure
>> needs any kind of special handling.
>
> I don't agree. Simple code like:
>
> curs.execute("SELECT * FROM xs")
> for x in curs.fetchall():
>     # do something like writing to the file system with x
>
> will have a different effect if row-by-row processing is enabled. Before,
> nothing would be changed on the file system in case of error: the
> fetchall() is "atomic"; now you write to the file system up to the
> row that causes the error.

When in a transaction, the best analogy is reading
from a cursor with FETCH.

When outside, in autocommit mode, it's COPY (SELECT ..)
which you then process line by line.

My point is that the behavior is not something completely new
that no one has seen before.

But it is indeed different from the libpq default, so it's not something
psycopg can convert to using unconditionally.  But as an optional feature
it should be quite useful.


Note - we are talking about the libpq world here; Npgsql uses
such a mode by default, and maybe pgjdbc does too.

--
marko

Re: libpq custom row processing

From
Federico Di Gregorio
Date:
On 07/08/12 15:14, Marko Kreen wrote:
> My point is that the behavior is not something completely new
> that no one has seen before.
>
> But it is indeed different from the libpq default, so it's not something
> psycopg can convert to using unconditionally.  But as an optional feature
> it should be quite useful.

I agree. As an opt-in feature it would be quite useful for large datasets,
but then named cursors already cover that ground. Not that I am against
it, I'd just like to see why:

curs = conn.cursor(row_by_row=True)

would be better than:

curs = conn.cursor("row_by_row")

Is row by row faster than fetching from a named cursor? Does it add less
overhead? If that's the case then it would be nice to have it as a feature
for optimizing queries returning large datasets.

federico

--
Federico Di Gregorio                         federico.digregorio@dndg.it
Studio Associato Di Nunzio e Di Gregorio                  http://dndg.it
 And anyone who yells "fork" deserves to get one stuck in them.
                                                          -- Dan Winship

Re: libpq custom row processing

From
Magnus Hagander
Date:
On Tue, Aug 7, 2012 at 3:25 PM, Federico Di Gregorio <fog@dndg.it> wrote:
> On 07/08/12 15:14, Marko Kreen wrote:
>> My point is that the behavior is not something completely new
>> that no one has seen before.
>>
>> But it is indeed different from the libpq default, so it's not something
>> psycopg can convert to using unconditionally.  But as an optional feature
>> it should be quite useful.
>
> I agree. As an opt-in feature it would be quite useful for large datasets,
> but then named cursors already cover that ground. Not that I am against
> it, I'd just like to see why:
>
> curs = conn.cursor(row_by_row=True)
>
> would be better than:
>
> curs = conn.cursor("row_by_row")
>
> Is row by row faster than fetching from a named cursor? Does it add less
> overhead? If that's the case then it would be nice to have it as a feature
> for optimizing queries returning large datasets.

A big win would be that you don't need to keep the whole dataset in
memory, wouldn't it? As you're looping through it, you can throw away
the old results...

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: libpq custom row processing

From
Marko Kreen
Date:
On Tue, Aug 7, 2012 at 4:25 PM, Federico Di Gregorio <fog@dndg.it> wrote:
> On 07/08/12 15:14, Marko Kreen wrote:
>> My point is that the behavior is not something completely new
>> that no one has seen before.
>>
>> But it is indeed different from the libpq default, so it's not something
>> psycopg can convert to using unconditionally.  But as an optional feature
>> it should be quite useful.
>
> I agree. As an opt-in feature it would be quite useful for large datasets,
> but then named cursors already cover that ground. Not that I am against
> it, I'd just like to see why:
>
> curs = conn.cursor(row_by_row=True)
>
> would be better than:
>
> curs = conn.cursor("row_by_row")
>
> Is row by row faster than fetching from a named cursor? Does it add less
> overhead? If that's the case then it would be nice to have it as a feature
> for optimizing queries returning large datasets.

It avoids network roundtrips while buffering a minimal amount
of data.  Roundtrips may not be noticeable within a single colo,
but are definitely troublesome when working between different colos.

It also saves CPU and memory on both server and client
(less cache usage, fewer context switches),
but that gets into micro-optimization territory, so it is harder
to measure.

--
marko