Thread: Proposal: efficient iter on named cursors
Hello, you may know psycopg offers server-side cursors, known in postgres/psycopg as named cursors. If you don't, well, this is good news :) Named cursors are useful with datasets too big to be handled by the client, as regular cursors transfer all the data to the client during the execute() method. Named cursors on the other hand only transfer the required amount of records to the client (one record with cur.fetchone(), n with cursor.fetchmany(n)). There is a shortcoming though: iter(cursor) will fetch the records one at a time, with a noticeable time overhead in case of large recordsets (exactly the ones you may want to retrieve with a named cursors...) Currently the most efficient way to iterate on a named cursor is something like: nrecs = 100 # or some other reasonable number while 1: recs = cur.fetchmany(nrecs) if not recs: break for rec in recs: # do something This would use only the memory used by nrecs record on the client and require just 1/nrecs of the roundtrips required by a naive operation. But it make the named cursors harder to use and not a drop-in replacement for regular cursors that can be idiomatically used with: for rec in cur: # do something So, I'd like to modify the cursor so that in case of __iter__, a certain number of record is fetched and iteration is performed on them. The cursor already has the state to keep the dataset so probably only the code would require change, not so much the data structures. How do we make the users choose their nrecs? I think the cursor should have an attribute with a sensible default: 100? 1000? 1024? What attribute name? It shouldn't be hard to implement. Does anybody want to try it? I've open the ticket #33 <http://psycopg.lighthouseapp.com/projects/62710/tickets/33> for any update. Comments? Cheers, -- Daniele
On 13/01/2011 17:37, Daniele Varrazzo wrote: [snip] > There is a shortcoming though: iter(cursor) will fetch the records one > at a time, with a noticeable time overhead in case of large recordsets > (exactly the ones you may want to retrieve with a named cursors...) > Currently the most efficient way to iterate on a named cursor is > something like: > > nrecs = 100 # or some other reasonable number > while 1: > recs = cur.fetchmany(nrecs) > if not recs: > break > for rec in recs: > # do something > > This would use only the memory used by nrecs record on the client and > require just 1/nrecs of the roundtrips required by a naive operation. > But it make the named cursors harder to use and not a drop-in > replacement for regular cursors that can be idiomatically used with: > > for rec in cur: > # do something > > So, I'd like to modify the cursor so that in case of __iter__, a > certain number of record is fetched and iteration is performed on > them. The cursor already has the state to keep the dataset so probably > only the code would require change, not so much the data structures. > > How do we make the users choose their nrecs? I think the cursor should > have an attribute with a sensible default: 100? 1000? 1024? What > attribute name? [snip] > Comments? I mostly agree. I'd like to see a .fetchsize parameter both on the module, on the connection and on the cursor (usual psycopg cascade initialization). A good default value, supposing rows of 10-20 columns with a mix of textual and numeric data, is ~2000 because it will probably fetch less than 100KB of data and that seems a good compromise between the number of rows transferred and data you'll throw away if you decide to leave the loop early. federico -- Federico Di Gregorio federico.digregorio@dndg.it Studio Associato Di Nunzio e Di Gregorio http://dndg.it If nobody understand you, that doesn't mean you're an artist. -- anonymous
On Thu, Jan 13, 2011 at 4:46 PM, Federico Di Gregorio <federico.digregorio@dndg.it> wrote: > On 13/01/2011 17:37, Daniele Varrazzo wrote: >> So, I'd like to modify the cursor so that in case of __iter__, a >> certain number of record is fetched and iteration is performed on >> them. The cursor already has the state to keep the dataset so probably >> only the code would require change, not so much the data structures. > I mostly agree. I'd like to see a .fetchsize parameter both on the > module, on the connection and on the cursor (usual psycopg cascade > initialization). A good default value, supposing rows of 10-20 columns > with a mix of textual and numeric data, is ~2000 because it will > probably fetch less than 100KB of data and that seems a good compromise > between the number of rows transferred and data you'll throw away if you > decide to leave the loop early. The feature discussed here is included in the current betas. In particular, in the released implementation, we are using the already existing 'arraysize' attribute (used as default for 'fetchmany()') as the number of records to fetch per roundtrip. But because the default for arraysize is 1 (per DB-API), and this value basically disables the feature, 'iter()' currently uses the proposed value of 2000 as default. However the ticket #41 <http://psycopg.lighthouseapp.com/projects/62710/tickets/41>, while invalid for me as I don't see "for record in cur" as a synonym for "fetch one record at time", has made me wonder if we are too aggressive with the default: maybe discarding arraysize=1 is not the best option. Albeit somebody can still use the "while True + / fetchone() + break" pattern to force record-per-record fetching, I think if she is careful enough to use named cursors for its task she may also care to set an appropriate value > 1 for arraysize. Named cursors are still easier to use, but I don't want to make people think they can be a replacement for *all* cursors - they still require more resources on the server, so are better used only when required. On this reasoning, I've committed this patch <https://github.com/dvarrazzo/psycopg/commit/1dd71947bff415f1c79a5e6b81b6ba89d717ecb9> in a separate branch: it makes iteration respect arraysize in its default value 1 too, and improves the documentation explaining the complete picture. If ok, the patch will be merged in devel. Comments? -- Daniele
On 15/02/11 13:22, Daniele Varrazzo wrote: > However the ticket #41 > <http://psycopg.lighthouseapp.com/projects/62710/tickets/41>, while > invalid for me as I don't see "for record in cur" as a synonym for > "fetch one record at time", has made me wonder if we are too > aggressive with the default: maybe discarding arraysize=1 is not the > best option. Albeit somebody can still use the "while True + / > fetchone() + break" pattern to force record-per-record fetching, I > think if she is careful enough to use named cursors for its task she > may also care to set an appropriate value > 1 for arraysize. Named > cursors are still easier to use, but I don't want to make people think > they can be a replacement for *all* cursors - they still require more > resources on the server, so are better used only when required. > > On this reasoning, I've committed this patch > <https://github.com/dvarrazzo/psycopg/commit/1dd71947bff415f1c79a5e6b81b6ba89d717ecb9> > in a separate branch: it makes iteration respect arraysize in its > default value 1 too, and improves the documentation explaining the > complete picture. If ok, the patch will be merged in devel. > > Comments? I think the original implementation was right because "foreach ..." doesn't mean fetch one record at a time. IMHO, 1) .fetchone() should _always_ fetch one record 2) iter(cursor) should fetch as many records as we feel right But we can do a little trick here and make iter(cursor) respect .arraysize if arraysize was explicitly set so that if one really wants to fetch one record at a time can just set .arraysize to 1. Good or bad? federico -- Federico Di Gregorio federico.digregorio@dndg.it Studio Associato Di Nunzio e Di Gregorio http://dndg.it I did appreciate the irony that I was whining about encoding issues on a mailing list that was unable to show those chars, too. -- Antti S. Lankila to mono-devel-list@
On Thu, Feb 17, 2011 at 10:44 AM, Federico Di Gregorio <federico.digregorio@dndg.it> wrote: > On 15/02/11 13:22, Daniele Varrazzo wrote: >> On this reasoning, I've committed this patch >> <https://github.com/dvarrazzo/psycopg/commit/1dd71947bff415f1c79a5e6b81b6ba89d717ecb9> >> in a separate branch: it makes iteration respect arraysize in its >> default value 1 too, and improves the documentation explaining the >> complete picture. If ok, the patch will be merged in devel. >> >> Comments? > > I think the original implementation was right because "foreach ..." > doesn't mean fetch one record at a time. IMHO, > > 1) .fetchone() should _always_ fetch one record > 2) iter(cursor) should fetch as many records as we feel right Yes, this is what I think too. It is consistent with what happens with iter(file) vs. file.readline(). The only hitch is that the DBAPI asks for a default of 1 for arraysize. > But we can do a little trick here and make iter(cursor) respect > .arraysize if arraysize was explicitly set so that if one really wants > to fetch one record at a time can just set .arraysize to 1. > > Good or bad? Quite tricky as arraysize is currently a simple property. Even if we could do it with some property trickery, it would be surprising if "print cur.arraysize" would return 1 and iter(cur) was efficient; then, after "cur.arraysize = 1", iter(cur) would switch to fetch one record at time, while "print cur.arraysize" would still report 1. I feel it violates the principle of least astonishment, as much as being difficult for the user to predict what the library would do. -- Daniele
On 17/02/11 11:57, Daniele Varrazzo wrote: >> I think the original implementation was right because "foreach ..." >> > doesn't mean fetch one record at a time. IMHO, >> > >> > 1) .fetchone() should _always_ fetch one record >> > 2) iter(cursor) should fetch as many records as we feel right > Yes, this is what I think too. It is consistent with what happens with > iter(file) vs. file.readline(). The only hitch is that the DBAPI asks > for a default of 1 for arraysize. > > >> > But we can do a little trick here and make iter(cursor) respect >> > .arraysize if arraysize was explicitly set so that if one really wants >> > to fetch one record at a time can just set .arraysize to 1. >> > >> > Good or bad? > Quite tricky as arraysize is currently a simple property. Even if we > could do it with some property trickery, it would be surprising if > "print cur.arraysize" would return 1 and iter(cur) was efficient; > then, after "cur.arraysize = 1", iter(cur) would switch to fetch one > record at time, while "print cur.arraysize" would still report 1. I > feel it violates the principle of least astonishment, as much as being > difficult for the user to predict what the library would do. Then we need a different property: itersize? federico -- Federico Di Gregorio federico.digregorio@dndg.it Studio Associato Di Nunzio e Di Gregorio http://dndg.it Io non sono romantica. La candelina sul tavolo mi vede e si spegne. -- sisterconfusion
On Thu, Feb 17, 2011 at 11:09 AM, Federico Di Gregorio <federico.digregorio@dndg.it> wrote: > On 17/02/11 11:57, Daniele Varrazzo wrote: >>> I think the original implementation was right because "foreach ..." >>> > doesn't mean fetch one record at a time. IMHO, >>> > >>> > 1) .fetchone() should _always_ fetch one record >>> > 2) iter(cursor) should fetch as many records as we feel right >> Yes, this is what I think too. It is consistent with what happens with >> iter(file) vs. file.readline(). The only hitch is that the DBAPI asks >> for a default of 1 for arraysize. >> >> >>> > But we can do a little trick here and make iter(cursor) respect >>> > .arraysize if arraysize was explicitly set so that if one really wants >>> > to fetch one record at a time can just set .arraysize to 1. >>> > >>> > Good or bad? >> Quite tricky as arraysize is currently a simple property. Even if we >> could do it with some property trickery, it would be surprising if >> "print cur.arraysize" would return 1 and iter(cur) was efficient; >> then, after "cur.arraysize = 1", iter(cur) would switch to fetch one >> record at time, while "print cur.arraysize" would still report 1. I >> feel it violates the principle of least astonishment, as much as being >> difficult for the user to predict what the library would do. > > Then we need a different property: itersize? While I don't like the multiplication of attributes and extensions, this sounds like the cleaner option. -- Daniele
On Thu, Feb 17, 2011 at 11:12 AM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote: > On Thu, Feb 17, 2011 at 11:09 AM, Federico Di Gregorio >> Then we need a different property: itersize? > > While I don't like the multiplication of attributes and extensions, > this sounds like the cleaner option. It's in my devel. -- Daniele
On 17/02/11 13:38, Daniele Varrazzo wrote: > On Thu, Feb 17, 2011 at 11:12 AM, Daniele Varrazzo > <daniele.varrazzo@gmail.com> wrote: >> > On Thu, Feb 17, 2011 at 11:09 AM, Federico Di Gregorio >>> >> Then we need a different property: itersize? >> > >> > While I don't like the multiplication of attributes and extensions, >> > this sounds like the cleaner option. > It's in my devel. I just reviewed and merged your devel. federico -- Federico Di Gregorio federico.digregorio@dndg.it Studio Associato Di Nunzio e Di Gregorio http://dndg.it Qu'est ce que la folie? Juste un sentiment de liberté si fort qu'on en oublie ce qui nous rattache au monde... -- J. de Loctra