Thread: [PERFORM] Correct use of cursors for very large result sets in Postgres
Mike Beaton <mjsbeaton@gmail.com> writes:
> [ generally accurate information ]
> **WARNINGS**
> It would *never* make sense to do `FETCH ALL FROM cursor` for
> astronomically large data, if your client side code (including your data
> access layer) has any bottleneck at all at which all the data from a
> command is fetched before any processing can be done. Many data access
> layers (and especially data access wrappers) are like this. So beware.
> But it is also true that not all client side code is made this way.

It would probably be good to point out that most client-side libraries
will do it that way, including libpq, because then they can make success
or failure of the query look atomic to the application.  If you use an
API that lets you see rows as they come off the wire, it's up to you to
recover properly from a query failure that occurs after some/many rows
have already been returned.

> Returning huge data using a `TABLE` or `SETOF` return type from within a
> PostgreSQL function will *always* be broken (i.e. will create a huge
> buffer and take a very long time to start). This will be so whether the
> function is called from SQL to SQL or called over the wire.

I believe this is false in general.  I think it's probably true for all
the standard PL languages, because they don't want to bother with
suspending/resuming execution, so they make "RETURN NEXT" add the row to
a tuplestore rather than return it immediately.  But it's definitely
possible to write a C function that returns a row at a time, and depending
on what the calling SQL statement looks like, that could get streamed back
to the client live rather than being buffered first.

As a trivial example, if you do

    select generate_series(1,100000000);

in psql and watch what's happening with "top", you'll see psql's memory
usage going through the roof (because libpq tries to buffer the result)
but the connected backend's memory usage is steady as a rock --- nor does
it dump the data into a temporary file.  On the other hand,

    select * from generate_series(1,100000000);

does dump the data into a temp file, something we ought to work on
improving.

			regards, tom lane
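[To make the row-at-a-time point concrete on the client side, here is a minimal JDBC sketch, not taken from the thread; the table name and connection details are placeholders. With the PostgreSQL JDBC driver, rows only arrive incrementally when autocommit is off and a fetch size is set, and because an error can still surface after many rows have been delivered, the application has to deal with that non-atomicity itself.]

```java
import java.sql.*;

public class StreamingRead {
    public static void main(String[] args) {
        // Placeholder connection details.
        String url = "jdbc:postgresql://localhost/test";
        try (Connection conn = DriverManager.getConnection(url, "user", "secret")) {
            conn.setAutoCommit(false);          // required for incremental fetching
            long processed = 0;
            try (Statement st = conn.createStatement()) {
                st.setFetchSize(1000);          // ask for 1000 rows per round trip
                try (ResultSet rs = st.executeQuery("SELECT * FROM big_table")) {
                    while (rs.next()) {
                        // ... process one row ...
                        processed++;
                    }
                }
                conn.commit();
            } catch (SQLException e) {
                // The failure can arrive mid-stream: 'processed' rows were already
                // handled, so success/failure is not atomic from the client's view.
                conn.rollback();
                System.err.println("Query failed after " + processed + " rows: " + e.getMessage());
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
```

[A plain createStatement() already gives a forward-only result set, which the driver also requires before it will stream.]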
Mike Beaton <mjsbeaton@gmail.com> writes:
> One outstanding question I have. Based on a lot of helpful responses given
> to the SO question I can now test and see what disk buffers are generated
> (by setting `log_temp_files` to `0` and then `tail -f log`), as well as how
> long it takes for results to start arriving.

> With a large (10,000,000 row) test table, if I do `SELECT * FROM table` on
> psql it starts to return results immediately with no disk buffer. If I do
> `FETCH ALL FROM cursortotable` on psql it takes about 7.5 seconds to start
> returning results, and generates a 14MB buffer. If I do `SELECT * FROM
> table` on a correctly coded streaming client, it also starts to return
> results immediately with no disk buffer. But if I do `FETCH ALL FROM
> cursortotable` from my streaming client, it takes about 1.5 seconds for
> results to start coming... but again with no disk buffer, as hoped.

Seems odd.  Is your cursor just on "SELECT * FROM table", or is there
some processing in there you're not mentioning?  Maybe it's a cursor
WITH HOLD and you're exiting the source transaction?

			regards, tom lane
This is regardless of how many rows I actually stream from the connection before closing the cursor.
Mike Beaton <mjsbeaton@gmail.com> writes:
> New TL;DR (I'm afraid): PostgreSQL is always generating a huge buffer file
> on `FETCH ALL FROM CursorToHuge`.

I poked into this and determined that it's happening because pquery.c
executes FETCH statements the same as it does with any other
tuple-returning utility statement, ie "run it to completion and put the
results in a tuplestore, then send the tuplestore contents to the client".
I think the main reason nobody worried about that being non-optimal was
that we weren't expecting people to FETCH very large amounts of data in
one go --- if you want the whole query result at once, why are you
bothering with a cursor?

This could probably be improved, but it would (I think) require inventing
an additional PortalStrategy specifically for FETCH, and writing
associated code paths in pquery.c.  Don't know when/if someone might get
excited enough about it to do that.

			regards, tom lane
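[In the meantime, the tuplestore cost can at least be bounded by not asking for everything in one FETCH. A sketch in JDBC, with made-up cursor and table names, that replaces `FETCH ALL` with repeated `FETCH FORWARD n`, so the server only materializes n rows per FETCH:]

```java
import java.sql.*;

// Sketch: avoid `FETCH ALL` by fetching in modest batches, so each FETCH only
// materializes batchSize rows server-side. Cursor/table names are hypothetical.
public class BatchedCursorRead {
    static void readAll(Connection conn) throws SQLException {
        conn.setAutoCommit(false);  // the cursor must live inside a transaction
        final int batchSize = 10_000;
        try (Statement st = conn.createStatement()) {
            st.execute("DECLARE huge_cur NO SCROLL CURSOR FOR SELECT * FROM huge_table");
            while (true) {
                int rows = 0;
                try (ResultSet rs = st.executeQuery(
                        "FETCH FORWARD " + batchSize + " FROM huge_cur")) {
                    while (rs.next()) {
                        // ... process one row ...
                        rows++;
                    }
                }
                if (rows < batchSize) break;   // last (possibly empty) batch
            }
            st.execute("CLOSE huge_cur");
            conn.commit();
        }
    }
}
```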
My experience with cursors in PostgreSQL with Java has been to stay away from them. We support two databases with our product, PostgreSQL (the default) and SQL Server. While re-encrypting data in a database, the application used cursors with a fetch size of 1000.
It worked perfectly on SQL Server, and on PostgreSQL until we got to a PostgreSQL table with more than 11 million rows. After spending weeks trying to figure out what was happening, I realized that when it got to a table with more than 10 million rows, for some reason the cursor functionality just silently stopped working and it was reading the entire table. I asked another very senior architect to look at it and he came to the same conclusion. Because of limited time, I ended up working around it using limit/offset.
Again we are using Java, so the problem could just be in the PostgreSQL JDBC driver. Also we were on 9.1 at the time.
Regards
John
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Mike Beaton
Sent: Tuesday, February 21, 2017 6:49 AM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Correct use of cursors for very large result sets in Postgres
Thanks, Tom.
Wouldn't this mean that cursors are noticeably non-optimal even for normal data sizes, since the entire data to be streamed from the table is always duplicated into another buffer and then streamed?
> if you want the whole query result at once, why are you bothering with a cursor?
The PostgreSQL docs (https://www.postgresql.org/docs/9.6/static/plpgsql-cursors.html#AEN66551) clearly recommend cursors as a way to return a reference to a large result set from a function (as I understood, this is recommended precisely as a way to avoid tuple-based buffering of the data).
So following that advice, it's not unreasonable that I would actually have a cursor to a large dataset.
Then, I would ideally want to be able to fetch the data from that cursor without the entire data getting duplicated (even if only a bit at a time instead of all at once, which seems to be the best case behaviour) as I go.
Additionally, I thought that if I had a streaming use-case (which I do), and a streaming data-access layer (which I do), then since `SELECT * FROM large` is absolutely fine, end-to-end, in that situation, then by symmetry and the principle of least astonishment `FETCH ALL FROM cursor` might be fine too.
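[If the function follows the docs' recommended pattern and returns a refcursor, the client doesn't have to `FETCH ALL` from it either; it can fetch from the returned portal in pieces. A sketch in JDBC, where `get_huge_cursor()` is a hypothetical function that OPENs a refcursor over the large result and RETURNs it:]

```java
import java.sql.*;

// Sketch: consume a refcursor returned by a function without FETCH ALL.
// get_huge_cursor() is a hypothetical refcursor-returning function.
public class RefcursorRead {
    static void readAll(Connection conn) throws SQLException {
        conn.setAutoCommit(false);  // the returned portal only exists inside this transaction
        String portal;
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT get_huge_cursor()")) {
            if (!rs.next()) throw new SQLException("no cursor returned");
            portal = rs.getString(1);           // the refcursor's portal name
        }
        try (Statement st = conn.createStatement()) {
            while (true) {
                int rows = 0;
                try (ResultSet rs = st.executeQuery(
                        "FETCH FORWARD 10000 FROM \"" + portal + "\"")) {
                    while (rs.next()) {
                        // ... process one row ...
                        rows++;
                    }
                }
                if (rows == 0) break;
            }
            st.execute("CLOSE \"" + portal + "\"");
            conn.commit();
        }
    }
}
```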
Yes, of course that’s all verified and taken into account during code initialization.
From: Vitalii Tymchyshyn [mailto:vit@tym.im]
Sent: Wednesday, February 22, 2017 8:14 PM
To: John Gorman; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Correct use of cursors for very large result sets in Postgres
For JDBC there are certain prerequisites for setFetchSize to work, e.g. using forward only result sets and transactions.
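[Spelled out, those prerequisites look roughly like this; a sketch only, with a placeholder table name, and TYPE_FORWARD_ONLY is already the createStatement() default:]

```java
import java.sql.*;

// Checklist of the setFetchSize prerequisites for the PostgreSQL JDBC driver.
public class FetchSizePrereqs {
    static ResultSet openStreamingQuery(Connection conn) throws SQLException {
        conn.setAutoCommit(false);                          // 1. run inside a transaction
        Statement st = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY,                // 2. forward-only result set
                ResultSet.CONCUR_READ_ONLY);
        st.setFetchSize(1000);                              // 3. non-zero fetch size
        return st.executeQuery("SELECT * FROM big_table");  // rows now arrive in batches
    }
}
```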