Thread: Protocol 3, Execute, maxrows to return, impact?
I was wondering if there is any real advantage to actually specifying, say, 64 for the maxrows parameter to the Execute message in the PostgreSQL network protocol. I.e.:

- Is it easier on the server, because it somehow uses less total memory when batching the Executes with 64 rows at a time?
- Is it better for latency when expecting notices/notifies in between?
- Is it simply slowing down the protocol, because every 64 rows I introduce a round-trip delay for the next PortalSuspended/Execute combo?

Please note that we're only running the query once; I'm just batching the results with the Execute maxrows parameter.

--
Sincerely,
Stephen R. van den Berg.

Limited offer: Pacemakers, with a unique life-time warranty
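For reference, the extended-protocol exchange being asked about looks roughly like this on the wire (client messages on the left, server responses on the right; the Flush messages are what force the buffered responses out):

  Parse                ->
  Bind                 ->
  Execute (maxrows=64) ->
  Flush                ->
                       <- DataRow x 64
                       <- PortalSuspended
  Execute (maxrows=64) ->    (one fresh round trip per chunk)
  Flush                ->
                       <- DataRow ...
                       <- CommandComplete    (portal exhausted)
  Sync                 ->
                       <- ReadyForQuery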
"Stephen R. van den Berg" <srb@cuci.nl> writes: > I was wondering, if there is any real advantage to actually specify say > 64 for the maxrows parameter to the Execute message in the PostgreSQL > network protocol? There's no benefit in it from the server's perspective, if that's what you meant. The point of the parameter is to allow the client to avoid running out of memory to store all of a huge query result --- it can pull it in sections, instead. (Think of it as a built-in cursor FETCH facility.) > - Is it simply slowing down the protocol because every 64 rows I > introduce a round-trip delay for the next PortalSuspended/Execute combo? Yup; there's no free lunch ... regards, tom lane
On Thu, Jul 10, 2008 at 05:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Stephen R. van den Berg" <srb@cuci.nl> writes:
>> I was wondering if there is any real advantage to actually specify say
>> 64 for the maxrows parameter to the Execute message in the PostgreSQL
>> network protocol?
> There's no benefit in it from the server's perspective, if that's what
> you meant. The point of the parameter is to allow the client to avoid
> running out of memory to store all of a huge query result --- it can
> pull it in sections, instead. (Think of it as a built-in cursor
> FETCH facility.)

Then, from a client perspective, there is no use at all, because the client can actually pause reading the results at any time it wants, when it wants to avoid storing all of the result rows. The network will perform the cursor/fetch facility for it.

--
Sincerely,
Stephen R. van den Berg.
"Stephen R. van den Berg" <srb@cuci.nl> writes: > Then, from a client perspective, there is no use at all, because the > client can actually pause reading the results at any time it wants, > when it wants to avoid storing all of the result rows. The network > will perform the cursor/fetch facility for it. [ shrug... ] In principle you could write a client library that would act that way, but I think you'll find that none of the extant ones will hand back an incomplete query result to the application. A possibly more convincing argument is that with that approach, the connection is completely tied up --- you cannot issue additional database commands based on what you just read, nor pull rows from multiple portals in an interleaved fashion. regards, tom lane
Tom Lane wrote:
> "Stephen R. van den Berg" <srb@cuci.nl> writes:
>> Then, from a client perspective, there is no use at all, because the
>> client can actually pause reading the results at any time it wants,
>> when it wants to avoid storing all of the result rows. The network
>> will perform the cursor/fetch facility for it.
>
> [ shrug... ] In principle you could write a client library that would
> act that way, but I think you'll find that none of the extant ones
> will hand back an incomplete query result to the application.
>
> A possibly more convincing argument is that with that approach, the
> connection is completely tied up --- you cannot issue additional
> database commands based on what you just read, nor pull rows from
> multiple portals in an interleaved fashion.

I really think we need to get something like this into libpq. It's on my TODO list after notification payloads and libpq support for arrays and composites. We'll need to come up with an API before we do much else.

cheers

andrew
Tom Lane wrote:
> "Stephen R. van den Berg" <srb@cuci.nl> writes:
>> Then, from a client perspective, there is no use at all, because the
>> client can actually pause reading the results at any time it wants,
>> when it wants to avoid storing all of the result rows. The network
>> will perform the cursor/fetch facility for it.

> [ shrug... ] In principle you could write a client library that would
> act that way, but I think you'll find that none of the extant ones
> will hand back an incomplete query result to the application.

True. But I have written one just now. The language is called Pike; it's a C/C++/Java lookalike. And I start returning rows as they arrive, and pause reading from the network when the application wants to pause.

> A possibly more convincing argument is that with that approach, the
> connection is completely tied up --- you cannot issue additional
> database commands based on what you just read, nor pull rows from
> multiple portals in an interleaved fashion.

Interleaved retrieval using multiple portals is not what most libraries support, I'd guess. It can be supported at the application layer using multiple cursors, but that works with my approach as well. In practice, most applications that need that open multiple connections to the same database (I'd think).

The only thing I could imagine is that *if*, at the server end, the notifications that arrive during the retrieval of one long-running Execute are queued *after* all the data, instead of inserted into the datastream, then it might be worth doing it differently.

Incidentally, the nice thing about my library is that it automatically uses binary transfer for datatypes which are easily processed in binary (TEXT/BYTEA and all those others I mentioned earlier). It transmits those in binary for *both* arguments and row results; i.e. in one row I can have both text and binary columns, without the application needing to specify which is which.

--
Sincerely,
Stephen R. van den Berg.
"If you can't explain it to an 8-year-old, you don't understand it."
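For comparison, libpq exposes binary transfer too, but only coarsely: one format code per parameter, and a single all-or-nothing format flag for all result columns (the protocol itself allows per-column result formats, which is what makes the mixed text/binary rows described above possible). A sketch, with table t and its columns made up for illustration:

  #include <stdio.h>
  #include <stdint.h>
  #include <arpa/inet.h>   /* htonl */
  #include <libpq-fe.h>

  static void binary_demo(PGconn *conn)
  {
      uint32_t    netid      = htonl(42);      /* int4, network byte order */
      Oid         types[1]   = { 23 };         /* 23 = int4 */
      const char *values[1]  = { (const char *) &netid };
      int         lengths[1] = { sizeof netid };
      int         formats[1] = { 1 };          /* binary parameter */

      /* last argument: 1 = *all* result columns in binary ---
         libpq cannot mix text and binary columns in one result */
      PGresult *res = PQexecParams(conn,
          "SELECT name FROM t WHERE id = $1",
          1, types, values, lengths, formats, 1);

      if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0)
          /* the value is raw bytes; PQgetlength gives its size */
          fwrite(PQgetvalue(res, 0, 0), 1, PQgetlength(res, 0, 0), stdout);
      PQclear(res);
  }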
(I don't really have much to add to the discussion here; I'm just posting for the record on the question of client behaviour, since I also wrote and maintain a client library in C++.)

At 2008-07-10 18:40:03 +0200, srb@cuci.nl wrote:
>
> I start returning rows as they arrive, and pause reading from the
> network when the application wants to pause.

My library also starts returning rows as they arrive, and in fact my application makes heavy use of that feature. The data rows are read from a non-blocking socket, and the caller either does something for each one, or waits until they've all arrived before proceeding.

> Interleaved retrieval using multiple portals is not what most
> libraries support, I'd guess.

My code did support that mode of operation in theory, but in practice, in the few situations where I have needed to use something like it, I found it more convenient to open explicit cursors and FETCH from them (but I usually needed this inside a transaction, and so did not open multiple connections). Thus my code always sets maxrows to 0 at the moment, and so...

> The only thing I could imagine is that *if* at the server end, the
> notifications that arrive during the retrieval of one long running
> Execute, are queued *after* all the data, instead of inserted into
> the datastream, then it might be worth doing it differently.

...I can't comment on this interesting observation.

> i.e. in one row I can have both text and binary columns, without the
> application needing to specify which is which.

Yes, that's nice. My first attempt to define an API for bind variables set the data format to text by default and allowed it to be overridden, but that was much too troublesome. Now the code decides by itself what format is best to use for a given query. (Again, though my library certainly supports mixing text and binary format columns, my application has not needed to use this feature.)

-- ams
"Stephen R. van den Berg" <srb@cuci.nl> writes: >>A possibly more convincing argument is that with that approach, the >>connection is completely tied up --- you cannot issue additional >>database commands based on what you just read, nor pull rows from >>multiple portals in an interleaved fashion. > > Interleaved retrieval using multiple portals is not what most libraries > support, I'd guess. It can be supported at the application layer using > multiple cursors, but that works with my approach as well. > > In practice, most applications that need that, open multiple > connections to the same database (I'd think). Er? There's nothing particularly unusual about application logic like: $sth->execute('huge select'); while ($sth->fetch('foreign_key')) { ... do some processing which is hard to do in server-side language ... $sth->execute('insertresulting data'); } Most drivers do support this kind of interface but they may be reading the entire result set for "huge select" in advance. However if ti's large enough then this is only going to really work if you can start a new portal while the outer portal is actually running on the backend. If the driver tries to cache the whole result set the programmer will be sad. Back when I was doing PHP programming and I discovered that PHP's Postgres driver didn't support this I thought it was an outrageous bug. (It didn't help that the behaviour was to misbehave randomly rather than throw a meaningful error.) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark <stark@enterprisedb.com> writes:
> "Stephen R. van den Berg" <srb@cuci.nl> writes:
>> In practice, most applications that need that, open multiple
>> connections to the same database (I'd think).

> Er? There's nothing particularly unusual about application logic like:
>   $sth->execute('huge select');
>   while ($sth->fetch('foreign_key')) {
>     ... do some processing which is hard to do in server-side language ...
>     $sth->execute('insert resulting data');
>   }

Moreover, there's often good reasons to do it all within one transaction, which is impossible if you rely on a separate connection to issue the inserts on.

			regards, tom lane
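A hedged libpq sketch of that single-connection, single-transaction pattern, using an explicit cursor (other commands can be issued on the same connection while the cursor is open; the table names src and dst are made up for illustration):

  #include <libpq-fe.h>

  static void process_and_insert(PGconn *conn)
  {
      PGresult *res;
      int       n;

      PQclear(PQexec(conn, "BEGIN"));
      PQclear(PQexec(conn, "DECLARE hs CURSOR FOR SELECT id FROM src"));

      for (;;)
      {
          res = PQexec(conn, "FETCH 64 FROM hs");
          n = PQntuples(res);
          for (int i = 0; i < n; i++)
          {
              const char *params[1] = { PQgetvalue(res, i, 0) };

              /* a second command on the same connection, inside the
                 same transaction, while the cursor is still open */
              PQclear(PQexecParams(conn,
                                   "INSERT INTO dst VALUES ($1)",
                                   1, NULL, params, NULL, NULL, 0));
          }
          PQclear(res);
          if (n == 0)
              break;
      }

      PQclear(PQexec(conn, "CLOSE hs"));
      PQclear(PQexec(conn, "COMMIT"));
  }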
"Abhijit Menon-Sen" <ams@oryx.com> writes: >> Interleaved retrieval using multiple portals is not what most >> libraries support, I'd guess. > > My code did support that mode of operation in theory, but in practice > in the few situations where I have needed to use something like it, I > found it more convenient to open explicit cursors and FETCH from them Note that using FETCH for each record means a round trip to the server for each record. If you're dealing with a lot of records that could be a lot slower than streaming them to the client as quickly as it can consume them. Now I'm not sure anyone's actually done any experiments to optimize libpq or other drivers to stream data efficiently, so I'm not sure how much you would really lose in practice today. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
Gregory Stark wrote:
> "Abhijit Menon-Sen" <ams@oryx.com> writes:
>>> Interleaved retrieval using multiple portals is not what most
>>> libraries support, I'd guess.
>> My code did support that mode of operation in theory, but in practice
>> in the few situations where I have needed to use something like it, I
>> found it more convenient to open explicit cursors and FETCH from them
> Note that using FETCH for each record means a round trip to the server for
> each record. If you're dealing with a lot of records that could be a lot
> slower than streaming them to the client as quickly as it can consume them.

My Pike drivers now support multiple simultaneous portals and automatic streaming by presending overlapping Execute statements with a dynamically adapted fetchlimit calculated per select as the query progresses.

The only support still lacking is COPY.

--
Sincerely,
Stephen R. van den Berg.
In this signature, the concluding three words `were left out'.
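Pre-sending works because v3 protocol messages can be pipelined: the client may queue the next Execute before it has read the previous chunk's PortalSuspended, so the server never sits idle waiting for the next request. A sketch of how such a pipelined pair of Execute messages could be encoded, following the documented wire format ('E', int32 length, portal name, int32 max rows); the buffer handling is illustrative only:

  #include <stdint.h>
  #include <string.h>

  /* Encode one v3 Execute message into buf and return its total size.
     Wire format: 'E' | int32 length | portal name (cstring) | int32 maxrows;
     the length field counts itself but not the leading type byte. */
  static size_t put_execute(unsigned char *buf, const char *portal,
                            uint32_t maxrows)
  {
      size_t   nlen = strlen(portal) + 1;        /* include the NUL */
      uint32_t len  = (uint32_t) (4 + nlen + 4);
      size_t   off  = 0;

      buf[off++] = 'E';
      buf[off++] = (unsigned char) (len >> 24);
      buf[off++] = (unsigned char) (len >> 16);
      buf[off++] = (unsigned char) (len >> 8);
      buf[off++] = (unsigned char) (len);
      memcpy(buf + off, portal, nlen);
      off += nlen;
      buf[off++] = (unsigned char) (maxrows >> 24);
      buf[off++] = (unsigned char) (maxrows >> 16);
      buf[off++] = (unsigned char) (maxrows >> 8);
      buf[off++] = (unsigned char) (maxrows);
      return off;
  }

  /* Queue two Execute messages back to back, then a Flush ('H'), so the
     server can start on the second chunk the moment the first suspends. */
  static size_t queue_two_chunks(unsigned char *buf, const char *portal,
                                 uint32_t limit)
  {
      size_t off = 0;

      off += put_execute(buf + off, portal, limit);
      off += put_execute(buf + off, portal, limit);
      buf[off++] = 'H';                          /* Flush */
      buf[off++] = 0; buf[off++] = 0; buf[off++] = 0; buf[off++] = 4;
      return off;                                /* write() this to the socket */
  }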
Stephen R. van den Berg wrote:
> My Pike drivers now support multiple simultaneous portals and
> automatic streaming by presending overlapping Execute statements with
> a dynamically adapted fetchlimit calculated per select as the query
> progresses.

They also support COPY now.

The driver beats libpq in speed by about 62%. The memory consumption is on demand, by row, and not the whole result set. Transport to and from the query is in binary and dynamically determined per datatype, no quoting necessary.

Anyone interested in taking a peek at the (GPL copyright) driver, I temporarily put up a small package which contains the working driver in Pike at:

  http://admin.cuci.nl/psgsql.pike.tar.gz

Pike is a C/C++/Java-like interpreted language. The production driver uses a PGsql assist class which is written in C to accelerate an (amazingly) small number of core functions (not included, but the driver works fully without the PGsql assist class).

--
Sincerely,
Stephen R. van den Berg.
"There are 10 types of people in the world. Those who understand binary and those who do not."
I replied to this post yesterday. Yet I don't see my reply appear; could it have been caught in a spamfilter or something?

--
Sincerely,
Stephen R. van den Berg.
"Even if man could understand women, he still wouldn't believe it."
On 27-Jul-08, at 3:00 PM, Stephen R. van den Berg wrote:
> Stephen R. van den Berg wrote:
>> My Pike drivers now support multiple simultaneous portals and
>> automatic streaming by presending overlapping Execute statements with
>> a dynamically adapted fetchlimit calculated per select as the query
>> progresses.
>
> They also support COPY now.
>
> The driver beats libpq in speed by about 62%.
> The memory consumption is on demand, by row, and not the whole result set.
> Transport to and from the query is in binary and dynamically determined
> per datatype, no quoting necessary.
>
> Anyone interested in taking a peek at the (GPL copyright) driver, I
> temporarily put up a small package which contains the working driver
> in Pike at:
>
> http://admin.cuci.nl/psgsql.pike.tar.gz

This is very exciting news. I'd love to look at it; is there any way it could be re-licensed so that it can be incorporated into, say, the JDBC driver?

Dave
Dave Cramer wrote:
> On 27-Jul-08, at 3:00 PM, Stephen R. van den Berg wrote:
>> The driver beats libpq in speed by about 62%.
>> Anyone interested in taking a peek at the (GPL copyright) driver, I
>> temporarily put up a small package which contains the working driver
>> in Pike at:
>> http://admin.cuci.nl/psgsql.pike.tar.gz
> This is very exciting news, I'd love to look at it, is there any way
> it could be re-licensed so that it can be incorporated into say the
> jdbc driver?

Since I wrote it, I can relicense it any which way I want. What kind of license would you like to have?

--
Sincerely,
Stephen R. van den Berg.
"Even if man could understand women, he still wouldn't believe it."
On Mon, 2008-07-28 at 18:45 +0200, Stephen R. van den Berg wrote:
> Dave Cramer wrote:
>> This is very exciting news, I'd love to look at it, is there any way
>> it could be re-licensed so that it can be incorporated into say the
>> jdbc driver?
>
> Since I wrote it, I can relicense it any which way I want.
> What kind of license would you like to have?

The JDBC driver and PostgreSQL itself are BSD licensed:

http://jdbc.postgresql.org/license.html
http://www.postgresql.org/about/licence

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
On 28-Jul-08, at 12:45 PM, Stephen R. van den Berg wrote:
> Since I wrote it, I can relicense it any which way I want.
> What kind of license would you like to have?

As Joshua mentioned, BSD is the preferred PostgreSQL license. As I understand it, I can't even look at your code and subsequently use anything in the JDBC driver.

Dave
On Jul 28, 2008, at 1:54 PM, Dave Cramer wrote:
> As Joshua mentioned, BSD is the preferred PostgreSQL license. As I
> understand it, I can't even look at your code and subsequently use
> anything in the JDBC driver.

The GPL does not cover implementation ideas, not to mention that the author just described the implementation. Furthermore, one could not "take" anything from the Pike driver for the JDBC driver, because it is written in a completely different language. It seems like you have confused the GPL with an NDA.

Anyway, what does "The driver beats libpq in speed by about 62%" mean?

Cheers,
M
Dave Cramer wrote:
>> Since I wrote it, I can relicense it any which way I want.
>> What kind of license would you like to have?
> As Joshua mentioned, BSD is the preferred PostgreSQL license. As I
> understand it, I can't even look at your code and subsequently use
> anything in the JDBC driver.

I'll relicense it under a BSD license, so you can use it. Watch this space.

--
Sincerely,
Stephen R. van den Berg.
"Even if man could understand women, he still wouldn't believe it."
On 7/27/08, Stephen R. van den Berg <srb@cuci.nl> wrote:
> Stephen R. van den Berg wrote:
>> My Pike drivers now support multiple simultaneous portals and
>> automatic streaming by presending overlapping Execute statements with
>> a dynamically adapted fetchlimit calculated per select as the query
>> progresses.
>
> They also support COPY now.
>
> The driver beats libpq in speed by about 62%.
> The memory consumption is on demand, by row, and not the whole result set.
> Transport to and from the query is in binary and dynamically determined
> per datatype, no quoting necessary.
>
> Anyone interested in taking a peek at the (GPL copyright) driver, I
> temporarily put up a small package which contains the working driver
> in Pike at:
>
> http://admin.cuci.nl/psgsql.pike.tar.gz
I'd love to take a look at the code, but the link seems to be 404'd at the moment.
On Mon, Jul 28, 2008 at 3:36 PM, Andrew Garner <andrew.b.garner@gmail.com> wrote:
> I'd love to take a look at the code, but the link seems to be 404'd at the
> moment.

Hi, I'd love to take a look at the code too. I'm the lead developer of Npgsql, a .NET data provider for PostgreSQL which implements protocol versions 2 and 3. I have been following this thread to get more info about protocol implementation internals which we could use in our .NET implementation. Currently we don't support multiple resultsets yet.

Also, I'd like to get some ideas about how to optimize throughput. You said you got a huge improvement compared to libpq; I think you may have implemented some nice ideas which we could use to optimize Npgsql.

I'm also getting a 404 error when trying to download the code.

--
Regards,
Francisco Figueiredo Jr.
http://fxjr.blogspot.com
http://www.npgsql.org
A.M. wrote:
> Anyway, what does "The driver beats libpq in speed by about 62%" mean?

It means that if I open up a connection, my lib only takes half the time libpq uses to open up the connection; and it also means that when I run the following query 10 times in a row:

  SELECT * FROM pg_catalog.pg_type AS a, pg_catalog.pg_type AS b LIMIT 1000

and the resultset is subsequently split, parsed and copied *per* row and column into memory, and then discarded, it takes 483 units of time using my driver, and 762 units of time using the libpq driver. I.e. the speed comparison is 762/483 ≈ 1.58 -> about 58% faster.

The test environment is: an unloaded Linux 2.6.26 client using libpq5 8.3.3, and an unloaded Linux 2.6.26 server running PostgreSQL 8.3.3, with 100 Mb/s ethernet in between. The test is I/O bound, and therefore tests the client-lib efficiency.

--
Sincerely,
Stephen R. van den Berg.
"Even if man could understand women, he still wouldn't believe it."
On Mon, Jul 28, 2008 at 02:24:22PM -0400, A.M. wrote:
> On Jul 28, 2008, at 1:54 PM, Dave Cramer wrote:
>> As Joshua mentioned, BSD is the preferred PostgreSQL license. As I
>> understand it, I can't even look at your code and subsequently use
>> anything in the JDBC driver.
>
> The GPL does not cover implementation ideas [...]

Exactly. What you can't do is copy code verbatim, or link to the code, without being bound by the GPL. Taking ideas is free.

(Disclaimer: I'm clearly biased towards the GPL, but not so much as to not understand that BSD would make more sense in the context of a BSD project.)

Regards
-- tomás
Relicensed and downloadable (Re: Protocol 3, Execute, maxrows to return, impact?)
From: "Stephen R. van den Berg"
Limited offer, get it while it's hot:

  http://admin.cuci.nl/pgsql.pike.driver.tar.gz.bin

I relicensed it under BSD now, put up proper copyright notices, and included the accelerator class for amusement (the module works without the accelerator class, but it's about 10 times slower than libpq then).

For anyone trying to understand Pike, the important points probably are that:
- ({ 1, 2, 3, 4 }) is an array.
- m = (["key1":13, "key2":14, "key3":15]) is a mapping, i.e. m->key1 and m["key1"] both evaluate to 13.

The rest is rather similar to C++/Java. See http://pike.roxen.com/ for more documentation and references, try git://git.cuci.nl/pike as the source repository, and http://pike.ida.liu.se/docs/ for documentation and tutorials.

The fetch_row() function actually returns an array for each row, where the individual columns are already parsed and converted into the native Pike formats for handling.

--
Sincerely,
Stephen R. van den Berg.
"People who think they know everything are annoying to those of us who do."
Re: Relicensed and downloadable (Re: Protocol 3, Execute, maxrows to return, impact?)
From: "Stephen R. van den Berg"
Stephen R. van den Berg wrote:
> The fetch_row() function actually returns an array for each row, where
> the individual columns are already parsed and converted into the native
> Pike formats for handling.

A typical (low level) usage sequence in Pike looks like this:

  object db;
  object q;
  array row;

  db = Sql.pgsql("host", "database", "user", "password");
  q = db->query("SELECT * FROM a");
  while (row = q->fetch_row())
    write("Processing row %O\n", row);
  q = db->query("SELECT * FROM b");

etc.

--
Sincerely,
Stephen R. van den Berg.
"People who think they know everything are annoying to those of us who do."
Stephen R. van den Berg wrote:
> A.M. wrote:
>> Anyway, what does "The driver beats libpq in speed by about 62%" mean?
> I.e. the speed comparison is 762/483 ≈ 1.58 -> about 58% faster.

After some tuning on the libpq side, I can say that: if libpq is called using a cursor and fetches in 64-row chunks, the speed difference with my driver above is correct. If libpq is called *not* using a cursor and fetches all (1000) rows in one go, the new comparison becomes 583/483 = 1.21, which still makes my driver 21% faster than native libpq.

Please note that my driver automatically chunks up the query using multiple chained Execute messages (with a dynamically calculated limit) and allows fetching the rows on demand and interleaving portals. My driver uses a 32 KByte buffer per active portal.

--
Sincerely,
Stephen R. van den Berg.
How many weeks are there in a lightyear?
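The thread doesn't show the actual limit calculation, but here is one plausible sketch of a dynamically adapted fetch limit, assuming the goal is to size each chunk to roughly fill the 32 KByte per-portal buffer just mentioned (the function name and heuristic are made up for illustration):

  #include <stdio.h>

  /* Hypothetical adaptation of the Execute row limit: size the next
     chunk so that, at the average row width observed so far, it roughly
     fills a fixed per-portal byte budget.  Purely illustrative --- the
     real driver's heuristic is not shown in the thread. */
  static long next_fetch_limit(long bytes_seen, long rows_seen,
                               long budget_bytes)
  {
      long avg_row, limit;

      if (rows_seen == 0)
          return 64;                    /* arbitrary initial guess */
      avg_row = bytes_seen / rows_seen;
      if (avg_row < 1)
          avg_row = 1;
      limit = budget_bytes / avg_row;   /* rows that fit the buffer */
      if (limit < 1)
          limit = 1;
      return limit;
  }

  int main(void)
  {
      /* e.g. after 640 rows totalling 20480 bytes, with a 32 KB buffer: */
      printf("%ld\n", next_fetch_limit(20480, 640, 32 * 1024));  /* 1024 */
      return 0;
  }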