Re: [PERFORM] Correct use of cursors for very large result sets in Postgres - Mailing list pgsql-performance

From: Tom Lane
Subject: Re: [PERFORM] Correct use of cursors for very large result sets in Postgres
Date:
Msg-id: 21063.1487349580@sss.pgh.pa.us
In response to: Re: [PERFORM] Correct use of cursors for very large result sets in Postgres (Mike Beaton <mjsbeaton@gmail.com>)
List: pgsql-performance
Mike Beaton <mjsbeaton@gmail.com> writes:
> [ generally accurate information ]

> **WARNINGS**

> It would *never* make sense to do `FETCH ALL FROM cursor` for
> astronomically large data, if your client-side code (including your data
> access layer) has any bottleneck at which all the data from a command is
> fetched before any processing can be done. Many data access layers (and
> especially data access wrappers) are like this. So beware. But it is also
> true that not all client-side code is made this way.

It would probably be good to point out that most client-side libraries
will do it that way, including libpq, because then they can make success
or failure of the query look atomic to the application.  If you use an
API that lets you see rows as they come off the wire, it's up to you
to recover properly from a query failure that occurs after some/many rows
have already been returned.
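
With libpq, the row-at-a-time API in question is single-row mode.  A
minimal sketch of the pattern (the connection string and query here are
only placeholders, and error handling is abbreviated):

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("dbname=test");
        PGresult   *res;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        /* single-row mode requires the asynchronous query API */
        if (!PQsendQuery(conn, "SELECT generate_series(1, 100000000)"))
        {
            fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
            return 1;
        }
        if (!PQsetSingleRowMode(conn))
            fprintf(stderr, "could not enter single-row mode\n");

        while ((res = PQgetResult(conn)) != NULL)
        {
            switch (PQresultStatus(res))
            {
                case PGRES_SINGLE_TUPLE:
                    /* process one row as it comes off the wire */
                    break;
                case PGRES_TUPLES_OK:
                    /* zero-row terminal result: query finished cleanly */
                    break;
                default:
                    /* the query failed, perhaps after many rows were
                     * already handed out; undoing partial work is the
                     * application's problem */
                    fprintf(stderr, "query failed: %s",
                            PQresultErrorMessage(res));
                    break;
            }
            PQclear(res);
        }

        PQfinish(conn);
        return 0;
    }

Note that a failure result can show up in that loop only after millions
of PGRES_SINGLE_TUPLE results have already been processed, which is
exactly the recovery problem mentioned above.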

> Returning huge data using a `TABLE` or `SETOF` return type from within a
> PostgreSQL function will *always* be broken (i.e. will create a huge buffer
> and take a very long time to start). This will be so whether the function
> is called from SQL to SQL or called over the wire.

I believe this is false in general.  I think it's probably true for all
the standard PL languages, because they don't want to bother with
suspending/resuming execution, so they make "RETURN NEXT" add the row to
a tuplestore rather than return it immediately.  But it's definitely possible to
write a C function that returns a row at a time, and depending on what the
calling SQL statement looks like, that could get streamed back to the
client live rather than being buffered first.
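
For reference, the value-per-call skeleton looks about like this (the
function name is made up; it would be declared as RETURNS SETOF
integer):

    #include "postgres.h"
    #include "funcapi.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(my_series);

    /* Returns one row per call; no tuplestore is built anywhere. */
    Datum
    my_series(PG_FUNCTION_ARGS)
    {
        FuncCallContext *funcctx;

        if (SRF_IS_FIRSTCALL())
        {
            funcctx = SRF_FIRSTCALL_INIT();
            /* remember how many rows to emit */
            funcctx->max_calls = PG_GETARG_INT32(0);
        }

        funcctx = SRF_PERCALL_SETUP();

        if (funcctx->call_cntr < funcctx->max_calls)
            SRF_RETURN_NEXT(funcctx,
                            Int32GetDatum((int32) funcctx->call_cntr + 1));
        else
            SRF_RETURN_DONE(funcctx);
    }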

As a trivial example, if you do
    select generate_series(1,100000000);
in psql and watch what's happening with "top", you'll see psql's memory
usage going through the roof (because libpq tries to buffer the result)
but the connected backend's memory usage is steady as a rock --- nor
does it dump the data into a temporary file.  On the other hand,
    select * from generate_series(1,100000000);
does dump the data into a temp file, something we ought to work on
improving.
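
Since this thread is about cursors anyway: the usual client-side answer
for result sets that size is an explicit cursor fetched in batches, so
that libpq never buffers more than one batch.  A hypothetical sketch
(table name, cursor name, and batch size are all placeholders):

    #include <stdio.h>
    #include <libpq-fe.h>

    static void
    stream_with_cursor(PGconn *conn)
    {
        PGresult   *res;

        PQclear(PQexec(conn, "BEGIN"));  /* cursors need a transaction */
        PQclear(PQexec(conn,
                       "DECLARE c NO SCROLL CURSOR FOR "
                       "SELECT * FROM big_table"));

        for (;;)
        {
            res = PQexec(conn, "FETCH 10000 FROM c");
            if (PQresultStatus(res) != PGRES_TUPLES_OK)
            {
                /* transaction is aborted now; clean up and bail */
                fprintf(stderr, "FETCH failed: %s", PQerrorMessage(conn));
                PQclear(res);
                PQclear(PQexec(conn, "ROLLBACK"));
                return;
            }
            if (PQntuples(res) == 0)
            {
                PQclear(res);   /* cursor exhausted */
                break;
            }
            /* process this batch of up to 10000 rows here */
            PQclear(res);
        }

        PQclear(PQexec(conn, "CLOSE c"));
        PQclear(PQexec(conn, "COMMIT"));
    }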

            regards, tom lane

