Thread: libpq - lack of support to set the fetch size
On 03/09/2014 06:43 AM, matshyeq wrote:
> Hello,
>
> I've found an issue when tried to implement fetching rows from big table
> (2mln rows) in my app.
> Basically I don't find an elegant and easy way (other than always use
> cursors) to limit the number of rows returned.
> This causes my application to break due to the excessive memory consumption.

LIMIT does not work?

> I'm using Perl and DBD::Pg library but contacted maintainer who actually
> pointed out this is an issue that goes much deeper (libpq):
>
> "Unfortunately, this is a limitation in the underlying driver (libpq)
> rather than DBD::Pg itself. There have been talks over the years of
> supporting this, but nothing concrete yet. Your best bet would be to ask
> about this on the Postgres lists"
>
> Would you consider putting this on the roadmap, so one day it gets improved?
>
> Re the details of the issue, I believe this has been well described at:
> http://stackoverflow.com/questions/21960121/perl-dbdpg-script-fails-when-selecting-data-from-big-table
>
> Kind Regards
> ~Msciwoj

--
Adrian Klaver
adrian.klaver@aklaver.com
matshyeq wrote:

> "Unfortunately, this is a limitation in the underlying driver (libpq) rather
> than DBD::Pg itself. There have been talks over the years of supporting
> this, but nothing concrete yet. Your best bet would be to ask about this on
> the Postgres lists"
>
> Would you consider putting this on the roadmap, so one day it gets improved?

This improvement seems to have actually been made since 9.2 with
the PQsetSingleRowMode() function:

http://postgresql.org/docs/current/static/libpq-single-row-mode.html

Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org
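For reference, a minimal sketch of what single-row mode looks like at the libpq level. The table and column names are made up for illustration, the connection is assumed to be already established, and error handling is abbreviated:

/* Minimal sketch of libpq single-row mode (PostgreSQL 9.2+).
 * Assumes "conn" is an open PGconn; names are illustrative only. */
#include <stdio.h>
#include <libpq-fe.h>

static void stream_rows(PGconn *conn)
{
    if (!PQsendQuery(conn, "SELECT id, payload FROM bigtable"))
    {
        fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
        return;
    }
    /* Must be called right after PQsendQuery, before the first PQgetResult. */
    PQsetSingleRowMode(conn);

    PGresult *res;
    while ((res = PQgetResult(conn)) != NULL)
    {
        if (PQresultStatus(res) == PGRES_SINGLE_TUPLE)
        {
            /* Exactly one row per result; process it and free it right away. */
            printf("id=%s\n", PQgetvalue(res, 0, 0));
        }
        else if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        }
        PQclear(res);
    }
}

Each PGRES_SINGLE_TUPLE result carries exactly one row, so the client's memory use stays flat regardless of how many rows the query produces.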
Daniel Verite wrote:
> matshyeq wrote:
[ runs out of memory on the client because all results from a large query are retrieved at once ]
>> "Unfortunately, this is a limitation in the underlying driver (libpq) rather
>> than DBD::Pg itself. There have been talks over the years of supporting
>> this, but nothing concrete yet. Your best bet would be to ask about this on
>> the Postgres lists"
>>
>> Would you consider putting this on the roadmap, so one day it gets improved?
>
> This improvement seems to have actually been made since 9.2 with
> the PQsetSingleRowMode() function:
>
> http://postgresql.org/docs/current/static/libpq-single-row-mode.html

Yes, DBD::Pg could be improved to make use of that; the problem is probably
that the code would have to differentiate between PostgreSQL versions.

Your solution of using

  SELECT ... OFFSET ? LIMIT 1

in a loop is bound to suck.

First of all, there is no guarantee that the rows will be returned in
the same order each time, see for example
http://www.postgresql.org/docs/current/static/runtime-config-compatible.html#GUC-SYNCHRONIZE-SEQSCANS
Also, unless you operate with an isolation level higher than READ COMMITTED,
the various SELECTs could operate on different data sets.
So you are likely to end up with incorrect results sooner or later
if you use OFFSET and LIMIT without an ORDER BY clause.

Then you will have really bad performance, especially with a large table,
because each SELECT statement will have to start scanning the table again.
The complexity will rise from O(n) to O(n^2).
You can improve on this by using ORDER BY with an index and remembering
the last returned row (get and read http://sql-performance-explained.com/).

Finally, you will have a client-server round trip for each row returned.
This is a problem you would also have when using PQsetSingleRowMode().

Yours,
Laurenz Albe
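The "ORDER BY an index and remember the last returned row" approach Laurenz describes (often called keyset pagination) could look roughly like this with libpq; the table, key column and batch size are purely illustrative and error handling is abbreviated:

/* Hypothetical keyset-pagination sketch: fetch a large table in batches
 * ordered by an indexed key, remembering the last key already seen. */
#include <stdio.h>
#include <libpq-fe.h>

static void fetch_in_batches(PGconn *conn)
{
    char last_id[32] = "0";   /* key of the last row already processed */

    for (;;)
    {
        const char *params[1] = { last_id };
        PGresult *res = PQexecParams(conn,
            "SELECT id, payload FROM bigtable WHERE id > $1 "
            "ORDER BY id LIMIT 1000",
            1, NULL, params, NULL, NULL, 0);

        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
            PQclear(res);
            break;
        }

        int nrows = PQntuples(res);
        if (nrows == 0)       /* no more rows */
        {
            PQclear(res);
            break;
        }

        for (int i = 0; i < nrows; i++)
        {
            /* process row i ... */
        }

        /* remember the highest key of this batch for the next iteration */
        snprintf(last_id, sizeof(last_id), "%s", PQgetvalue(res, nrows - 1, 0));
        PQclear(res);
    }
}

Because each batch continues from the last key seen, the server does not rescan rows it has already returned, unlike the OFFSET variant.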
matshyeq wrote:
> Postgresql is there for a good while perceived as one of the best (or just simply the best!?)
> available open source DB solution, so I'm really surprised this functionality is not yet supported...

You can retrieve the full result set,
you can retrieve it row by row,
you can use a LIMIT clause to retrieve it in batches.

Can you explain how exactly the functionality would look that
you are missing?

Yours,
Laurenz Albe
Hello,

I've found an issue when I tried to implement fetching rows from a big table
(2 million rows) in my app.
Basically I don't find an elegant and easy way (other than always using
cursors) to limit the number of rows returned.
This causes my application to break due to excessive memory consumption.

I'm using Perl and the DBD::Pg library, but I contacted the maintainer, who
pointed out that this is an issue that goes much deeper (libpq):

"Unfortunately, this is a limitation in the underlying driver (libpq) rather
than DBD::Pg itself. There have been talks over the years of supporting this,
but nothing concrete yet. Your best bet would be to ask about this on the
Postgres lists"
In addition to what the others suggested, you can use a CURSOR to read through results in arbitrarily sized blocks.
--
john r pierce                                      37N 122W
somewhere on the middle of the left coast
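At the libpq level, a cursor-based loop of that kind might look like the following sketch; the cursor name, table and batch size are made up and error checking is mostly omitted:

/* Hypothetical sketch of reading a large result in fixed-size blocks
 * through a cursor. Assumes an open PGconn "conn". */
#include <stdio.h>
#include <libpq-fe.h>

static void read_with_cursor(PGconn *conn)
{
    PGresult *res;

    /* Cursors live inside a transaction (unless declared WITH HOLD). */
    PQclear(PQexec(conn, "BEGIN"));
    PQclear(PQexec(conn, "DECLARE big_cur CURSOR FOR SELECT * FROM bigtable"));

    for (;;)
    {
        res = PQexec(conn, "FETCH 1000 FROM big_cur");
        if (PQresultStatus(res) != PGRES_TUPLES_OK || PQntuples(res) == 0)
        {
            PQclear(res);           /* error, or no rows left */
            break;
        }
        for (int i = 0; i < PQntuples(res); i++)
        {
            /* process row i, e.g. PQgetvalue(res, i, 0) ... */
        }
        PQclear(res);
    }

    PQclear(PQexec(conn, "CLOSE big_cur"));
    PQclear(PQexec(conn, "COMMIT"));
}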
On Mon, Mar 10, 2014 at 06:58:26AM +0000, Albe Laurenz wrote:
> Daniel Verite wrote:
> > matshyeq wrote:
>
> [ runs out of memory on the client because all results from a large query are retrieved at once ]
>
> >> "Unfortunately, this is a limitation in the underlying driver (libpq) rather
> >> than DBD::Pg itself. There have been talks over the years of supporting
> >> this, but nothing concrete yet. Your best bet would be to ask about this on
> >> the Postgres lists"
> >>
> >> Would you consider putting this on the roadmap, so one day it gets improved?
> >
> > This improvement seems to have actually been made since 9.2 with
> > the PQsetSingleRowMode() function:
> >
> > http://postgresql.org/docs/current/static/libpq-single-row-mode.html
>
> Finally, you will have a client-server round trip for each row returned.
> This is a problem you would also have when using PQsetSingleRowMode().

PQsetSingleRowMode() does not do additional roundtrips, it loads rows
from libpq's internal buffer.

--
marko
On 03/10/2014 04:51 AM, matshyeq wrote:
>> Albe Laurenz wrote:
>
> I would believe the stackoverflow question referred to
> (http://stackoverflow.com/questions/21960121/perl-script-fails-when-selecting-data-from-big-postgresql-table)
> explains the issue well.
>
>> You can retrieve the full result set,
> not an option because of client memory limitations (in this case it's
> poor client spec but there always are some, especially when you want to
> pull 1e7 rows)
>
>> you can retrieve it row by row,
> not an option because of performance (db calls/network roundtrips)
>
>> you can use a LIMIT clause to retrieve it in batches.
> you pointed out best why it's not a feasible option (complexity,
> isolation levels, not always possible i.e. with a custom query, and last
> but not least: far from being elegant)
>
>> CURSOR option
> As already explained at stackoverflow - I'm using it as a workaround. My
> general point is it forces developers to use lower level communication
> with the DB (cursors), therefore not as elegant as just setting the
> RowCacheSize parameter as specified by DBI. According to the DBD::Pg
> maintainer this hasn't been and can't be implemented for PostgreSQL due
> to the lack of support in its own libpq library.
> So again, I'm really surprised this functionality is not yet supported
> in PostgreSQL. Does that mean everybody has been implementing this
> through cursors?
>
> To recap what's on stackoverflow - the functionality I'm talking about
> would be an equivalent of the JDBC setFetchSize() function
> <http://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#setFetchSize(int)>
> to optimize the load from (any) database in batches, like in the example
> below:
>
> Statement st = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
>                                     java.sql.ResultSet.CONCUR_READ_ONLY);
>
> // Set the fetch size to 1000.
> st.setFetchSize(1000);
>
> // Execute the given sql query
> String sql = "select * from bigtable";
> ResultSet rs = st.executeQuery(sql);
>
> while (rs.next()) {
>     ⋮
> }
>
> where underneath ResultSet.next() doesn't actually fetch one row at a
> time from the RESULT-SET. It returns it from the (local) ROW-SET and
> fetches the ROW-SET (transparently) whenever it becomes exhausted on the
> local client.
>
> Actually, curious now if this functionality has been implemented in
> PostgreSQL JDBC drivers...?

Yes, using a cursor.

http://jdbc.postgresql.org/documentation/92/query.html

By default the driver collects all the results for the query at once. This
can be inconvenient for large data sets so the JDBC driver provides a means
of basing a ResultSet on a database cursor and only fetching a small number
of rows.
.....

> Anyway, according to one of the DBD::Pg developers it's impossible to
> bring this functionality as the problem lies deeper, within the libpq
> library:
>
> "Unfortunately, this is a limitation in the underlying driver (libpq)
> rather than DBD::Pg itself. There have been talks over the years of
> supporting this, but nothing concrete yet."
>
> So probably the best is to ask Greg to speak to details if still unclear.
>
> Kind Regards,
> Maciek
>
> On Mon, Mar 10, 2014 at 9:42 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
>
> > matshyeq wrote:
> > > Postgresql is there for a good while perceived as one of the best
> > > (or just simply the best!?)
> > > available open source DB solution, so I'm really surprised this
> > > functionality is not yet supported...
> >
> > You can retrieve the full result set,
> > you can retrieve it row by row,
> > you can use a LIMIT clause to retrieve it in batches.
> >
> > Can you explain how exactly the functionality would look that
> > you are missing?
> >
> > Yours,
> > Laurenz Albe
>
> --
> Thank you,
> Kind Regards
> ~Maciek

--
Adrian Klaver
adrian.klaver@aklaver.com
matshyeq <matshyeq@gmail.com> writes:
> If 'SingleRowMode' goes row-by-row then again it's not a solution,
> especially given that this particular issue applies to rather large row
> sets.

Perhaps you should actually experiment with that solution instead of
rejecting it out of hand.  Or at least RTFM about it.

It does have limitations: you can't interleave fetching of different
large query results.  But I don't have a problem telling people they
ought to use cursors for such cases.

			regards, tom lane
matshyeq wrote:

> The only solution is CURSOR based which I find an awkward low level hack
> comparing to elegant option supported by native library.

That's not the only solution, even with the current DBD::Pg you could do:

  $dbh->do("COPY (sql-query) TO STDOUT");
  my $data;
  while ($dbh->pg_getcopydata($data) >= 0) {
      # process $data
  }

The results would be streamed as opposed to being accumulated in memory.
Also COPY is optimized for high performance.

The drawback is you'd have to parse $data according to the specific rules of
the COPY format, which may be easy or not-so-easy depending on the actual
data: numeric or text or other, whether it has NULLs, backslashes and so on.

Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org
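For completeness, a hypothetical libpq-level version of the same idea uses PQgetCopyData(); the query is illustrative and error handling is abbreviated:

/* Stream a query result through COPY ... TO STDOUT instead of a SELECT.
 * Assumes an open PGconn "conn"; the query is illustrative only. */
#include <stdio.h>
#include <libpq-fe.h>

static void copy_out(PGconn *conn)
{
    PGresult *res = PQexec(conn, "COPY (SELECT * FROM bigtable) TO STDOUT");
    if (PQresultStatus(res) != PGRES_COPY_OUT)
    {
        fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
        PQclear(res);
        return;
    }
    PQclear(res);

    char *line;
    int   len;

    /* Third argument 0 = blocking mode; each call returns one data row. */
    while ((len = PQgetCopyData(conn, &line, 0)) > 0)
    {
        /* "line" holds one row in COPY text format; parse and process it. */
        PQfreemem(line);
    }
    /* len == -1 means the COPY finished normally, -2 means an error. */

    res = PQgetResult(conn);        /* collect the final command status */
    PQclear(res);
}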
Marko, Tom, Adrian, Jeff, Daniel - thank you all for valuable feedback!
Two general questions:
- when using the PQsetSingleRowMode() function - does it give an option to define how many rows to cache on the client's side (like JDBC setFetchSize() does), or does it leave that at libpq's discretion?
- is it/would it be possible to add a corresponding option to pgAdmin to limit the rows returned initially (and on each subsequent fetch) in the Query Tool by a custom-defined max value?
@Tom
>Perhaps you should actually experiment with that solution instead of rejecting it out of hand. Or at least RTFM about it.
As I'm not using libpq directly, I'm unable to leverage the PQsetSingleRowMode() call (or at least I'm not aware how to do this from DBD::Pg).
I simply passed on the feedback they gave me.
@Adrian
The example in the documentation you refer to actually demonstrates this has been properly implemented in JDBC.
By properly I mean call to:
setFetchSize()
works; whatever it actually does behind the scenes (cursors?), it doesn't require the developer to declare and use cursors explicitly, like
st.execute("DECLARE csr CURSOR FOR SELECT * FROM myBigTable;");
ResultSet rs = conn.prepareStatement("fetch 1000 from csr").executeQuery();
⋮
@Jeff
I'll make a suggestion to DBD::Pg development
@Daniel
Very interesting alternative. You're welcome to contribute to this stackoverflow question!
Does it mean $data is a reference to a single returned row of data (`COPY formatted`)?
Kind Regards
Msciwoj
On Tue, Mar 11, 2014 at 12:39:12PM +0000, matshyeq wrote:
> - when using PQsetSingleRowMode() function - does it give an option to
> define how many rows to cache on client's side (like JDBC setFetchSize()
> does) or leaves it at pqlib's discretion?

This option would not make sense as you are not "fetching" anything,
the full result set is being streamed from the server over the TCP connection.

> - is it/would it be possible to add corresponding option to pgAdmin to
> limit initially (and each subsequently) returned rows in Query Tool by
> custom defined max value?

It could close the connection in the middle of the result set, but that
seems like a bad idea.  LIMIT N or FETCH N are better for such a task.

--
marko
On Wed, Mar 12, 2014 at 10:57:03AM +0000, matshyeq wrote:
> On Wed, Mar 12, 2014 at 9:30 AM, Marko Kreen <markokr@gmail.com> wrote:
> > This option would not make sense as you are not "fetching" anything,
> > full resultset is being streamed from server over TCP connection.
>
> Well, I don't know what "streamed" exactly means here.
> If server pushes sequentially all the data not asking client if ready to
> receive then that's what the issue is about.

This problem is handled in the kernel's TCP stack - it will slow down
the connection if userspace does not read fast enough.  IOW, don't worry
about it.

> If client asks server for another chunk each time it has received previous
> one then to me it's implicit 'fetching' scenario where user/developer
> doesn't have an option to define fetch size.

This is how it is usually implemented, and a configurable fetch size is
indeed useful in such a situation.  But it requires a separate round-trip
for each chunk, so single-row mode is the superior method for processing
large queries without huge buffers.

> > > - is it/would it be possible to add corresponding option to pgAdmin to
> > > limit initially (and each subsequently) returned rows in Query Tool by
> > > custom defined max value?
> >
> > It could close connection in the middle of resultset but that seems like
> > bad idea. LIMIT N or FETCH N are better for such task.
>
> I don't see why? I can't think of any single SQL tool I've been working
> with that didn't have this functionality, really.

Yeah, I see no reason pgAdmin cannot implement it.  Implementing such a
feature by transparently adding LIMIT or FETCH is better.
PQsetSingleRowMode() is not meant for partial result sets, but it can
be used if you accept the downsides.

--
marko
> On Tue, Mar 11, 2014 at 12:39:12PM +0000, matshyeq wrote:
> > - when using PQsetSingleRowMode() function - does it give an option to
> > define how many rows to cache on client's side (like JDBC setFetchSize()
> > does) or leaves it at pqlib's discretion?
>
> This option would not make sense as you are not "fetching" anything,
> full resultset is being streamed from server over TCP connection.

> > Well, I don't know what "streamed" exactly means here.
> > If server pushes sequentially all the data not asking client if ready to
> > receive then that's what the issue is about.
>
> This problem is handled in kernel's TCP stack - it will slow down
> the connection if userspace does not read fast enough.
> IOW, don't worry about it.

Again, this is something 'at the discretion' of the libpq library, so I'm not
sure how the 'slowing down' really works in different scenarios, e.g. server
and app client on the same machine?
Still, I see quite a value in specifying this explicitly (like JDBC does),
which BTW I did by using a CURSOR as a workaround.

> > If client asks server for another chunk each time it has received previous
> > one then to me it's implicit 'fetching' scenario where user/developer
> > doesn't have an option to define fetch size.
>
> This is how it is usually implemented and configurable fetch size
> is indeed useful in such situation. But it requires separate round-trip
> for each chunk so single-row-mode is superior method for processing
> large queries without huge buffers.

I wouldn't worry about this (round trips) myself - at the end of the day the
client decides when and how much to pull.
Also, in the data warehouse area, we consciously tune this parameter for each
DB source individually, which is considered very common (and good!) practice.

> > > - is it/would it be possible to add corresponding option to pgAdmin to
> > > limit initially (and each subsequently) returned rows in Query Tool by
> > > custom defined max value?
> >
> > It could close connection in the middle of resultset but that seems like
> > bad idea. LIMIT N or FETCH N are better for such task.
>
> > I don't see why? I can't think of any single SQL tool I've been working
> > with that didn't have this functionality, really.
>
> Yeah, I see no reason pgAdmin cannot implement it. Implementing
> such feature with transparently adding LIMIT or FETCH is better.

Don't fully agree with the last statement. FETCH is WAY better and more
functional than LIMIT.

> PQsetSingleRowMode() is not meant for partial resultsets, but it can
> be used if you accept the downsides.

Given the 'discretion mode' I'm still not sure if that's a proper and elegant
way to meet the requirement.
Maybe it is? Anyway, this is just me, an end user, giving you, the PostgreSQL
maintaining developers, a chance to hear 'the voice of the customer'.

Regards,
~Msciwoj
On 03/12/2014 06:05 AM, matshyeq wrote:
> Don't fully agree with last statement. FETCH is WAY better and more
> functional than LIMIT.
>
>> PQsetSingleRowMode() is not meant for partial resultsets, but it can
>> be used if you accept the downsides.
>
> Given the 'discretion mode' I'm still not sure if that's a proper and
> elegant way to meet the requirement.

Following this discussion it would seem that Postgres meets the requirement,
but that what you want is that the interfaces you use make use of the
capabilities. As mentioned previously, the JDBC driver and the Python driver
(psycopg2) already do what you want. The next step, to me at least, is to
take what you have learned and contact the projects (DBD::Pg, pgAdmin) in
question.

> Maybe it is? Anyway, this is just me, an end user giving you, the
> PostgreSQL maintaining developers a chance to hear 'the voice of the
> customer'
>
> Regards,
> ~Msciwoj

--
Adrian Klaver
adrian.klaver@aklaver.com
On Mar 12, 2014, at 5:57 AM, matshyeq <matshyeq@gmail.com> wrote:

> I don't see why? I can't think of any single SQL tool I've been working
> with that didn't have this functionality, really.
> The principle I find very simple and useful.
> There is a defined "fetch row size" parameter (each tool calls it by its
> own name); after submitting ANY query, the client fetches result set rows
> but not more than that.
> Some programs even automatically define this value based on the result
> grid size displayed on the screen.
> The user then usually has two buttons, fetch another batch/screen or fetch
> all - he decides.
> If he decides way too late (break for coffee) then he simply resubmits the
> query (and potentially changes the parameter first)...
>
> I don't find value in auto-fetching millions of rows for the user to
> present on the screen.
> Also I don't think it's particularly useful when you need to know and
> apply database specific SQL syntax to limit the rows.
> If you join multiple tables that may be even more tricky (which table to
> apply the limit to? or use subquerying instead?).

Using the extended query protocol, Postgres has a built-in way to limit the
number of rows returned from any select without any textual manipulation of
the query. I'm not sure if libpq exposes this capability in the API, but it
should not be too difficult to implement. See:

http://www.postgresql.org/docs/current/static/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY

> Once a portal exists, it can be executed using an Execute message. The
> Execute message specifies the portal name (empty string denotes the
> unnamed portal) and a maximum result-row count (zero meaning "fetch all
> rows"). The result-row count is only meaningful for portals containing
> commands that return row sets; in other cases the command is always
> executed to completion, and the row count is ignored. The possible
> responses to Execute are the same as those described above for queries
> issued via simple query protocol, except that Execute doesn't cause
> ReadyForQuery or RowDescription to be issued.

John DeSoi, Ph.D.
> Second, the feature needed to do this without even
> using a cursor was added 1.5 years ago (PQsetSingleRowMode). The DBD::Pg
> was just not taught how to use it yet.

True. And we were hoping for something better, so we can make one request
for 10,000 rows to libpq rather than call PQgetResult 10,000 times, but
we'll move ahead with implementing RowCacheSize via PQsetSingleRowMode.

--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 201404021428
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
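A rough idea of how a driver could build such a row cache on top of single-row mode - purely a sketch, not DBD::Pg's actual implementation - is to collect up to RowCacheSize single-row results per refill:

/* Illustrative only: emulate a RowCacheSize on top of single-row mode by
 * collecting up to "cache_size" PGRES_SINGLE_TUPLE results per call.
 * Assumes PQsendQuery() and PQsetSingleRowMode() were already called. */
#include <libpq-fe.h>

static int fill_row_cache(PGconn *conn, PGresult **cache, int cache_size)
{
    int n = 0;
    PGresult *res;

    while (n < cache_size && (res = PQgetResult(conn)) != NULL)
    {
        if (PQresultStatus(res) == PGRES_SINGLE_TUPLE)
        {
            cache[n++] = res;       /* keep the row; the caller PQclears it */
        }
        else
        {
            /* PGRES_TUPLES_OK marks the end of the result set (or an error
             * occurred); drain remaining results so the connection is free. */
            PQclear(res);
            while ((res = PQgetResult(conn)) != NULL)
                PQclear(res);
            break;
        }
    }
    return n;                       /* number of rows cached; 0 means done */
}

The driver's fetch call would then serve rows out of the cache and refill it when empty, so the application sees batches even though libpq still delivers one row per PQgetResult().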