Re: [PERFORM] Correct use of cursors for very large result sets in Postgres - Mailing list pgsql-performance

From Mike Beaton
Subject Re: [PERFORM] Correct use of cursors for very large result sets in Postgres
Date
Msg-id CAHzAAWQhQKsE2isFr6bGL-Q06CTQSg6Sjd_+Nt-4h9Gqu0aVHg@mail.gmail.com
In response to Re: [PERFORM] Correct use of cursors for very large result sets in Postgres  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [PERFORM] Correct use of cursors for very large result sets in Postgres
Re: [PERFORM] Correct use of cursors for very large result sets in Postgres
List pgsql-performance
Dear Tom,

This is very helpful, thank you.

You make a very useful point that the limitation is basically on PL/pgSQL and the other PL languages. And someone on SO already pointed out that an inline SQL function returning an enormously large TABLE result doesn't have any buffering problems either. So that's a very convenient option, whenever SQL alone is powerful enough.

You make the further very helpful point that any library written using `libpq` won't work as desired on `FETCH ALL FROM HugeCursor`. But I don't know whether that covers 'most' libraries; I think that depends on your programming milieu! I'm working in the world of ADO.NET (and the same seems to apply to JDBC), where most low-level drivers are not written using `libpq` but instead speak the wire protocol directly over sockets - which makes sense, because a streaming data reader is part of the contract those drivers have to implement.

It's definitely worth noting that the `FETCH 100000 FROM cursor`-until-exhausted pattern will *always* be safe. But most fundamentally I did, very specifically, want to know whether the `FETCH ALL FROM CursorToAstronomicallyLargeData` pattern can *ever* work sensibly. It seems it clearly can and does, if certain assumptions are met - assumptions which I know *are* met in the case where I potentially wanted to use it!

One outstanding question remains. Thanks to a lot of helpful responses to the SO question, I can now test and see which disk buffers are generated (by setting `log_temp_files` to `0` and then running `tail -f` on the server log), as well as how long it takes for results to start arriving.

With a large (10,000,000 row) test table, `SELECT * FROM table` in psql starts to return results immediately, with no disk buffer. `FETCH ALL FROM cursortotable` in psql takes about 7.5 seconds to start returning results, and generates a 14MB buffer. `SELECT * FROM table` on a correctly coded streaming client also starts to return results immediately, with no disk buffer. But `FETCH ALL FROM cursortotable` from my streaming client takes about 1.5 seconds before results start coming... though again with no disk buffer, as hoped.

I was rather hoping that the 'it creates a buffer' and the 'it takes a while to start' issues would be directly aligned, but it's clearly not as simple as that! Can you offer any more insight on this last aspect?

Many thanks,

Mike
