Thread: general PG network slowness (possible cure) (repost)
I set up pg to replace a plain gdbm database for my application. But even
running to the same machine, via a unix socket

* the pg database ran 100 times slower

Across the net it was

* about 500 to 1000 times slower than local gdbm

with no cpu use to speak of.

I'd heard that networked databases are slow. I might have left it at that
if curiosity hadn't led me to write a network server for gdbm databases,
and talk to _that_ just to get a comparison.

Lo and behold and smack me with a corncob if it wasn't _slower_ than pg.

On a whim I mapped the network bandwidth per packet size with the NPtcp
suite, and got surprising answers .. at 1500B, naturally, the bandwidth
was the full 10Mb/s (minus overheads, say 8.5Mb/s) of my pathetic little
local net. At 100B the bandwidth available was only 25Kb/s. At 10B, you
might as well use tin cans and taut string instead.

I also mapped the network flows using ntop, and yes, the average packet
size for both gdbm and pg in one direction was only about 100B or so.
That's it! Clearly there are a lot of short queries going out and the
answers were none too big either (I had a LIMIT 1 in all my PG queries).

About 75% of traffic was in the 64-128B range while my application was
running, with the peak bandwidth in that range being about 75-125Kb/s
(and I do mean bits, not bytes).

Soooo ... I took a look at my implementation of remote gdbm, and did a
very little work to aggregate outgoing transmissions together into lumps.
Three lines added in two places. At the level of the protocol where I
could tell how long the immediate conversation segment would be, I
"corked" the tcp socket before starting the segment and "uncorked" it
after the segment (for "cork", see tcp(7), setsockopt(2) and TCP_CORK in
linux).

Surprise, ... I got a speed up of hundreds of times. The same application
that crawled under my original rgdbm implementation and under PG now
maxed out the network bandwidth at close to a full 10Mb/s and 1200
pkts/s, at 10% CPU on my 700MHz client, and a bit less on the 1GHz
server.

So

* Is that what is holding up postgres over the net too? Lots of tiny
  packets?

And if so

* can one fix it the way I fixed it for remote gdbm?

The speedup was hundreds of times. Can someone point me at the relevant
bits of pg code? A quick look seems to say that fe-*.c is interesting. I
need to find where the actual read and write on the conn->sock is done.

Very illuminating gnuplot outputs available on request.

Peter
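For reference, a minimal sketch of the cork/uncork trick described above,
assuming a connected Linux TCP socket fd. The two-part message layout and
the helper names are illustrative; setsockopt(2) with TCP_CORK is as
documented in tcp(7). Return values are unchecked to keep the sketch short.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void cork(int fd, int on)
    {
        /* While corked, the kernel holds back partial frames; clearing
         * the option pushes out whatever is queued. */
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
    }

    /* Send one protocol segment (header + body) as a single burst. */
    static void send_segment(int fd, const void *hdr, size_t hlen,
                             const void *body, size_t blen)
    {
        cork(fd, 1);            /* hold everything back ...          */
        write(fd, hdr, hlen);   /* ... while the pieces are queued   */
        write(fd, body, blen);
        cork(fd, 0);            /* now it all goes out in one packet */
    }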
Peter T. Breuer wrote:
> I set up pg to replace a plain gdbm database for my application. But
> even running to the same machine, via a unix socket
>
> * the pg database ran 100 times slower

For what operations? Bulk reads? 19-way joins?

> Across the net it was
>
> * about 500 to 1000 times slower than local gdbm
>
> with no cpu use to speak of.

Disk-intensive or memory intensive?

> I'd heard that networked databases are slow. I might have left it at
> that if curiosity hadn't led me to write a network server for gdbm
> databases, and talk to _that_ just to get a comparison.
>
> Lo and behold and smack me with a corncob if it wasn't _slower_ than pg.
>
> On a whim I mapped the network bandwidth per packet size with the NPtcp
> suite, and got surprising answers .. at 1500B, naturally, the bandwidth
> was the full 10Mb/s (minus overheads, say 8.5Mb/s) of my pathetic little
> local net. At 100B the bandwidth available was only 25Kb/s. At 10B,
> you might as well use tin cans and taut string instead.

This sounds like you're testing a single connection. You would expect
"dead time" to dominate in that scenario. What happens when you have 50
simultaneous connections? Or do you think it's just packet overhead?

> I also mapped the network flows using ntop, and yes, the average packet
> size for both gdbm and pg in one direction was only about 100B or
> so. That's it! Clearly there are a lot of short queries going out and
> the answers were none too big either (I had a LIMIT 1 in all my PG
> queries).

I'm not sure that 100B query-results are usually the bottleneck. Why
would you have LIMIT 1 on all your queries?

> About 75% of traffic was in the 64-128B range while my application was
> running, with the peak bandwidth in that range being about 75-125Kb/s
> (and I do mean bits, not bytes).

None of this sounds like typical database traffic to me. Yes, there are
lots of small result-sets, but there are also typically larger (several
kilobytes) to much larger (10s-100s KB).

> Soooo ... I took a look at my implementation of remote gdbm, and did
> a very little work to aggregate outgoing transmissions together into
> lumps. Three lines added in two places. At the level of the protocol
> where I could tell how long the immediate conversation segment would be,
> I "corked" the tcp socket before starting the segment and "uncorked" it
> after the segment (for "cork", see tcp(7), setsockopt(2) and TCP_CORK in
> linux).

I'm a bit puzzled, because I'd have thought the standard Nagle algorithm
would manage this gracefully enough for short-query cases. There's no
way (that I know of) for a backend to handle more than one query at a
time.

> Surprise, ... I got a speed up of hundreds of times. The same application
> that crawled under my original rgdbm implementation and under PG now
> maxed out the network bandwidth at close to a full 10Mb/s and 1200
> pkts/s, at 10% CPU on my 700MHz client, and a bit less on the 1GHz
> server.
>
> So
>
> * Is that what is holding up postgres over the net too? Lots of tiny
>   packets?

I'm not sure your setup is typical, interesting though the figures are.
Google a bit for pg_bench perhaps and see if you can reproduce the
effect with a more typical load. I'd be interested in being proved wrong.

> And if so
>
> * can one fix it the way I fixed it for remote gdbm?
>
> The speedup was hundreds of times. Can someone point me at the relevant
> bits of pg code? A quick look seems to say that fe-*.c is
> interesting.
> I need to find where the actual read and write on the
> conn->sock is done.

You'll want to look in backend/libpq and interfaces/libpq I think
(although I'm not a developer).

--
Richard Huxton
Archonet Ltd
On Fri, May 25, 2007 at 10:50:58AM +0200, Peter T. Breuer wrote:
> I set up pg to replace a plain gdbm database for my application.

Postgres and gdbm are completely different. You want to rethink your
queries so each does more work, instead of running a zillion of them
over the network.

/* Steinar */
--
Homepage: http://www.sesse.net/
"Also sprach Kenneth Marshall:" > > Surprise, ... I got a speed up of hundreds of times. The same application > > that crawled under my original rgdbm implementation and under PG now > > maxed out the network bandwidth at close to a full 10Mb/s and 1200 > > pkts/s, at 10% CPU on my 700MHz client, and a bit less on the 1GHz > > server. > > > > So > > > > * Is that what is holding up postgres over the net too? Lots of tiny > > packets? > > > This effect is very common, but you are in effect altering the query/ I imagined so, but no, I am not changing the behaviour - I believe you are imagining something different here. Let me explain. It is usually the case that drivers and the network layer conspire to emit packets when they are otherwise idle, since they have nothing better to do. That is, if the transmission unit is the normal 1500B and there is 200B in the transmission buffer and nothing else is frisking them about the chops, something along the line will shrug and say, OK, I'll just send out a 200B fragment now, apologize, and send out another fragment later if anything else comes along for me to chunter out. It is also the case that drivers do the opposite .. that is, they do NOT send out packets when the transmission buffer is full, even if they have 1500B worth. Why? Well, on Ge for sure, and on 100BT most of the time, it doesn't pay to send out individual packets because the space required between packets is relatively too great to permit the network to work at that speed given the speed of light as it is, and the spacing it implies between packets (I remember when I advised the networking protocol people that Ge was a coming thing about 6 years ago, they all protested and said it was _physically_ impossible. It is. If you send packets one by one!). An ethernet line is fundamentally only electrical and only signals up or down (relative) and needs time to quiesce. And then there's the busmastering .. a PCI bus is only about 33MHz, and 32 bits wide (well, or 16 on portables, or even 64, but you're getting into heavy server equipment then). That's 128MB/s in one direction, and any time one releases the bus there's a re-setup time that costs the earth and will easily lower bandwidth by 75%. So drivers like to take the bus for a good few packets at a time. Even a single packet (1500B) will take 400 multi-step bus cycles to get to the card, and then it's a question of how much onboard memory it has or whether one has to drive it synchronously. Most cards have something like a 32-unit ring buffer, and I think each unit is considerable. Now, if a driver KNOWS what's coming then it can alter its behavior in order to mesh properly with the higher level layers. What I did was _tell_ the driver and the protocol not to send any data until I well and truly tell it to, and then told it to, when I was ready. The result is that a full communication unit (start, header, following data, and stop codon) was sent in one blast. That meant that there were NO tiny fragments blocking up the net, being sent wily-nily. And it also meant that the driver was NOT waiting for more info to come in before getting bored and sending out what it had. It did as I told it to. The evidence from monitoring the PG network thruput is that 75% of its packets are in the 64-128B range, including tcp header. That's hitting the 100Kb/s (10KB/s) bandwidth regime on my network at the lower end. It will be even _worse_ on a faster net, I think (feel free to send me a faster net to compare with :). 
I also graphed latency, but I haven't taken the results into account,
as the bandwidth measurements were so striking.

> response behavior of the database. Most applications expect an answer
> from the database after every query.

Well of course. Nothing else would work! (I imagine you have some kind
of async scheme, but I haven't investigated). I ask, the db replies. I
ask, the db replies. What I did was

1) made the ASK go out as one lump.
2) made the REPLY go out as one lump
3) STOPPED the card waiting for several replies or asks to accumulate
   before sending out anything at all.

> If it could manage retrying failed
> queries later, you could use the typical sliding window/delayed ack
> that is so useful in improving the bandwidth utilization of many network

That is not what is going on (though that's not a bad idea). See above
for the explanation. One has to take into account the physical hardware
involved and its limitations, and arrange the communications
accordingly. All I did was send EACH query and EACH response as a
single unit, at the hardware level. One could do better still by
managing _several_ threads' communications at once.

> programs. Maybe an option in libpq to tell it to use delayed "acks". I
> do not know what would be involved.

Nothing spectacular is required to see a considerable improvement, I
think, apart from a little direction from the high level protocol down
to the driver about where the communication boundaries are. 1000%
speedup in my case.

Now, where is the actual socket send done in the pg code? I'd like to
check what's happening in there.

Peter
"Also sprach Richard Huxton:" [Charset ISO-8859-1 unsupported, filtering to ASCII...] > Peter T. Breuer wrote: > > I set up pg to replace a plain gdbm database for my application. But > > even running to the same machine, via a unix socket > > > > * the pg database ran 100 times slower > > For what operations? Bulk reads? 19-way joins? The only operations being done are simple "find the row with this key", or "update the row with this key". That's all. The queries are not an issue (though why the PG thread choose to max out cpu when it gets the chance to do so through a unix socket, I don't know). > > Across the net it was > > > > * about 500 to 1000 times slower than local gdbm > > > > with no cpu use to speak of. > > Disk-intensive or memory intensive? There is no disk as such... it's running on a ramdisk at the server end. But assuming you mean i/o, i/o was completely stalled. Everything was idle, all waiting on the net. > > On a whim I mapped the network bandwidth per packet size with the NPtcp > > suite, and got surprising answers .. at 1500B, naturally, the bandwidth > > was the full 10Mb/s (minus overheads, say 8.5Mb/s) of my pathetic little > > local net. At 100B the bandwidth available was only 25Kb/s. At 10B, > > you might as well use tin cans and taut string instead. > > This sounds like you're testing a single connection. You would expect > "dead time" to dominate in that scenario. What happens when you have 50 Indeed, it is single, because that's my application. I don't have 50 simultaneous connections. The use of the database is as a permanent storage area for the results of previous analyses (static analysis of the linux kernel codes) from a single client. Multiple threads accessing at the same time might help keep the network drivers busier, which would help. They would always see their buffers filling at an even rate and be able to send out groups of packets at once. > simultaneous connections? Or do you think it's just packet overhead? It's not quite overhead in the sense of the logical layer. It's a physical layer thing. I replied in another mail on this thread, but in summary, tcp behaves badly with small packets on ethernet, even on a dedicated line (as this was). One needs to keep it on a tight rein. > > I also mapped the network flows using ntop, and yes, the average packet > > size for both gdbm and pg in one direction was only about 100B or > > so. That's it! Clearly there are a lot of short queries going out and > > the answers were none too big either ( I had a LIMIT 1 in all my PG > > queries). > > I'm not sure that 100B query-results are usually the bottleneck. > Why would you have LIMIT 1 on all your queries? Because there is always only one answer to the query, according to the logic. So I can always tell the database manager to stop looking after one, which will always help it. > > About 75% of traffic was in the 64-128B range while my application was > > running, with the peak bandwidth in that range being about 75-125Kb/s > > (and I do mean bits, not bytes). > > None of this sounds like typical database traffic to me. Yes, there are > lots of small result-sets, but there are also typically larger (several > kilobytes) to much larger (10s-100s KB). There's none here. > > Soooo ... I took a look at my implementation of remote gdbm, and did > > a very little work to aggregate outgoing transmissions together into > > lumps. Three lines added in two places. 
> > At the level of the protocol
> > where I could tell how long the immediate conversation segment would be,
> > I "corked" the tcp socket before starting the segment and "uncorked" it
> > after the segment (for "cork", see tcp(7), setsockopt(2) and TCP_CORK in
> > linux).
>
> I'm a bit puzzled, because I'd have thought the standard Nagle algorithm
> would manage this gracefully enough for short-query cases. There's no

On the contrary, Nagle is also often wrong here because it will delay
sending in order to accumulate more data into buffers when only a
little has arrived, then give up when no more data arrives to be sent
out, then send out the (short) packet anyway, late. There's no other
traffic apart from my (single thread) application.

What we want is to direct the sending exactly: in this situation,
saying when not to send, and when to send. Disable Nagle for a start,
use async read (noblock), and sync write, with sends from the socket
blocked from initiation of a message until the whole message is ready
to be sent out. Sending the message piecemeal just hurts too.

> way (that I know of) for a backend to handle more than one query at a time.

That's not the scenario.

> > Surprise, ... I got a speed up of hundreds of times. The same application
> > that crawled under my original rgdbm implementation and under PG now
> > maxed out the network bandwidth at close to a full 10Mb/s and 1200
> > pkts/s, at 10% CPU on my 700MHz client, and a bit less on the 1GHz
> > server.
> >
> > So
> >
> > * Is that what is holding up postgres over the net too? Lots of tiny
> >   packets?
>
> I'm not sure your setup is typical, interesting though the figures are.
> Google a bit for pg_bench perhaps and see if you can reproduce the
> effect with a more typical load. I'd be interested in being proved wrong.

But the load is typical HERE. The application works well against gdbm
and I was hoping to see a speedup from using a _real_ full-fledged DB
instead. Well, at least it's very helpful for debugging.

> > And if so
> >
> > * can one fix it the way I fixed it for remote gdbm?
> >
> > The speedup was hundreds of times. Can someone point me at the relevant
> > bits of pg code? A quick look seems to say that fe-*.c is
> > interesting. I need to find where the actual read and write on the
> > conn->sock is done.
>
> You'll want to look in backend/libpq and interfaces/libpq I think
> (although I'm not a developer).

I'll look around there. Specific directions are greatly appreciated.

Thanks.

Peter
Peter T. Breuer wrote:
>
> The only operations being done are simple "find the row with this key",
> or "update the row with this key". That's all. The queries are not an
> issue (though why the PG thread chooses to max out cpu when it gets the
> chance to do so through a unix socket, I don't know).

> There is no disk as such... it's running on a ramdisk at the server
> end. But assuming you mean i/o, i/o was completely stalled. Everything
> was idle, all waiting on the net.

> Indeed, it is single, because that's my application. I don't have
> 50 simultaneous connections. The use of the database is as a permanent
> storage area for the results of previous analyses (static analysis of
> the linux kernel codes) from a single client.

>> I'm not sure your setup is typical, interesting though the figures are.
>> Google a bit for pg_bench perhaps and see if you can reproduce the
>> effect with a more typical load. I'd be interested in being proved wrong.
>
> But the load is typical HERE. The application works well against gdbm
> and I was hoping to see a speedup from using a _real_ full-fledged DB
> instead.

I'm not sure you really want a full RDBMS. If you only have a single
connection and are making basic key-lookup queries then 90% of
PostgreSQL's code is just getting in your way. Sounds to me like gdbm
(or one of its alternatives) is a good match for you. Failing that,
sqlite is probably the next lowest-overhead solution.

Of course, if you want to have multiple clients interacting and
performing complex 19-way joins on gigabyte-sized tables with full-text
indexing and full transaction control then you *do* want a RDBMS.

--
Richard Huxton
Archonet Ltd
"Also sprach Richard Huxton:" > I'm not sure you really want a full RDBMS. If you only have a single > connection and are making basic key-lookup queries then 90% of > PostgreSQL's code is just getting in your way. Sounds to me like gdbm Yep - I could happily tell it not to try and compile a special lookup scheme each time, for example! (how that?). I could presumably also help it by preloading the commands I will run and sending over the params only with a "do a no. 17 now!". > (or one of its alternatives) is a good match for you. Failing that, > sqlite is probably the next lowest-overhead solution. Not a bad idea. but PG _will_ be useful when folk come to analyse the result of the analyses being done. What is slow is getting the data into the database now via simple store, fetch and update. > Of course, if you want to have multiple clients interacting and > performing complex 19-way joins on gigabyte-sized tables with full-text Well, the dbs are in the tens of MB from a single run over a single file (i.e analysis of a single 30KLOC source). The complete analysis space is something like 4000 times that, for 4300 C files in the linux kernel source. And then there is all the linux kernel versions. Then there is godzilla and apache source .. > indexing and full transaction control then you *do* want a RDBMS. We want one anyway. The problem is filling the data and the simple fetch and update queries on it. I really think it would be worthwhile getting some developer to tell me where the network send is done in PG. Peter
"Peter T. Breuer" <ptb@inv.it.uc3m.es> writes: > Soooo ... I took a look at my implementation of remote gdbm, and did > a very little work to aggregate outgoing transmissions together into > lumps. We do that already --- for a simple query/response such as you are describing, each query cycle will involve one physical client->server message followed by one physical server->client message. The only way to aggregate more is for the application code to merge queries together. Migrating a dbm-style application to a SQL database is often a real pain, precisely because the application is designed to a mindset of "fetch one record, manipulate it, update it", where "fetch" and "update" are assumed to be too stupid to do any of the work for you. The way to get high performance with a SQL engine is to push as much of the work as you can to the database side, and let the engine process multiple records per query; and that can easily mean rewriting the app from the ground up :-( regards, tom lane
Peter T. Breuer wrote:
> "Also sprach Richard Huxton:"
>> I'm not sure you really want a full RDBMS. If you only have a single
>> connection and are making basic key-lookup queries then 90% of
>> PostgreSQL's code is just getting in your way. Sounds to me like gdbm
>
> Yep - I could happily tell it not to try and compile a special lookup
> scheme each time, for example! (how does one do that?). I could
> presumably also help it by preloading the commands I will run and
> sending over the params only with a "do a no. 17 now!".

PREPARE/EXECUTE (or the equivalent libpq functions).

Also - if you can have multiple connections to the DB you should be
able to have several queries running at once.

>> (or one of its alternatives) is a good match for you. Failing that,
>> sqlite is probably the next lowest-overhead solution.
>
> Not a bad idea, but PG _will_ be useful when folk come to analyse the
> results of the analyses being done. What is slow is getting the data
> into the database now via simple store, fetch and update.

I'd have an hourly/daily bulk-load running from the simple system into
PG. If you have to search all the data from your app that's not
practical of course.

>> Of course, if you want to have multiple clients interacting and
>> performing complex 19-way joins on gigabyte-sized tables with full-text
>
> Well, the dbs are in the tens of MB from a single run over a single
> file (i.e. analysis of a single 30KLOC source). The complete analysis
> space is something like 4000 times that, for 4300 C files in the linux
> kernel source. And then there are all the linux kernel versions. Then
> there is godzilla and apache source ..

If you're doing some sort of token analysis on source-code you probably
want to look into how tsearch2 / trigram / GiST+GIN indexes work. It
might be that you're doing work in your app that the DB can handle for
you.

>> indexing and full transaction control then you *do* want a RDBMS.
>
> We want one anyway. The problem is filling the data and the simple
> fetch and update queries on it.

OK

> I really think it would be worthwhile getting some developer to tell me
> where the network send is done in PG.

--
Richard Huxton
Archonet Ltd
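As a sketch of the libpq route Richard mentions: plan the lookup once,
then ship only the parameter on each execution. The statement name and
table kv(key, val) are hypothetical; PQprepare and PQexecPrepared are
the real libpq entry points (PQprepare itself appeared in 8.0 libpq,
but the SQL-level PREPARE statement works on older versions too).

    #include <libpq-fe.h>

    /* Plan the query once ... */
    static void prepare_lookup(PGconn *conn)
    {
        PGresult *res = PQprepare(conn, "lookup",
                                  "SELECT val FROM kv WHERE key = $1",
                                  1, NULL);
        PQclear(res);           /* error checking elided */
    }

    /* ... then each execution sends only the parameter over the wire. */
    static PGresult *lookup(PGconn *conn, const char *key)
    {
        const char *params[1];

        params[0] = key;
        return PQexecPrepared(conn, "lookup", 1, params,
                              NULL, NULL, 0);
    }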
Peter T. Breuer escribió:

> I really think it would be worthwhile getting some developer to tell me
> where the network send is done in PG.

See src/backend/libpq/pqcomm.c (particularly internal_flush()).

--
Alvaro Herrera                          http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
"Also sprach Alvaro Herrera:" > > I really think it would be worthwhile getting some developer to tell me > > where the network send is done in PG. > > See src/backend/libpq/pqcomm.c (particularly internal_flush()). Yes. Thanks. That looks like it. It calls secure_write continually until the buffer is empty. Secure_write is located ibe-secure.c, but I'm not using ssl, so the call reduces to just n = send(port->sock, ptr, len, 0); And definitely all those could be grouped if there are several to do. But under normal circumstances the send will be pushing against a lttle resistance (the copy to the driver/protocol stack buffer is faster than the physical network send, by a ratio of GB/s to MB/s, or 1000 to 1), and thus all these sends will probably complete as a single unit once they have been started. It's worth a try. I thought first this may be too low level, but it looks as though internal_flush is only triggered when some other buffer is full, or deliberately, so it may be useful to block until it fires. I'll try it. Peter
"Peter T. Breuer" <ptb@inv.it.uc3m.es> writes: > And definitely all those could be grouped if there are several to do. Except that in the situation you're describing, there's only a hundred or two bytes of response to each query, which means that only one send() will occur anyway. (The flush call comes only when we are done responding to the current client query.) It's possible that for bulk data transmission situations we could optimize things a bit better --- in particular I've wondered whether we can reliably find out the MTU of the connection and use that as the output buffer size, instead of trusting the kernel to choose the best message boundaries --- but for the situation you're worried about there will be only one send. regards, tom lane
"Also sprach Tom Lane:" > "Peter T. Breuer" <ptb@inv.it.uc3m.es> writes: > > And definitely all those could be grouped if there are several to do. > > Except that in the situation you're describing, there's only a hundred > or two bytes of response to each query, which means that only one send() > will occur anyway. (The flush call comes only when we are done > responding to the current client query.) It may still be useful. The kernel won't necessarily send data as you push it down to the network protocols and driver. The driver may decide to wait for more data to accumulate, particularly if it only has a couple of hundred bytes to send so far and the medium is high speed and medium latency (fast ethernet). It'll get fed up with waiting for more data eventually, and send it out, but it is essentially waiting on _itself_ in that case, since the outgoing data is required at the other side of the net as a response to be processed before another query can be sent out, only then prompting the postmaster to start stuffing the output buffer with more bytes. Waiting on oneself is bad for us procrastinators. We need some whips. I'll try and really force a send, and try some more tricks. Unfortunately this isn't really quite the right level, so I have to use some heuristics. Can you guarantee that internal_flush is not called until (a) the internal buffer is full, OR (b) we have finished composing a reply, AND (c) there is no other way to send out data? I also need to find where we begin to compose a reply. That's somewhere well before internal flush ever gets called. I want to block output at that point. As it is, I can either unblock just before internal_flush and block after, or block just before internal_flush and unblock after (:-) that's not quite as daft as it sounds, but needs care). Really I want to do query received *block output process query create response *unblock output send Instead, I have here to do query received process query create response *unblock output send *block output Which is not quite the same. It may work though, because the driver will know nothing is going to go out while it is listening for the next query, and it will not have sent anything prematurely or kept it back inopportunely. > It's possible that for bulk data transmission situations we could > optimize things a bit better --- in particular I've wondered whether we > can reliably find out the MTU of the connection and use that as the > output buffer size, instead of trusting the kernel to choose the best > message boundaries --- but for the situation you're worried about Don't bother, I think. MTU is often effectively only notional these days at the hardware level in many media. OTOH, on my little net, MTU really does mean something because it's 10BT. > there will be only one send. True. Peter
"Peter T. Breuer" <ptb@inv.it.uc3m.es> writes: > "Also sprach Tom Lane:" >> Except that in the situation you're describing, there's only a hundred >> or two bytes of response to each query, which means that only one send() >> will occur anyway. (The flush call comes only when we are done >> responding to the current client query.) > It may still be useful. The kernel won't necessarily send data as you > push it down to the network protocols and driver. The driver may decide > to wait for more data to accumulate, No, because we set TCP_NODELAY. Once we've flushed a message to the kernel, we don't want the kernel sitting on it --- any delay there adds directly to the elapsed query time. At least this is the case for the final response to a query. I'm not too clear on whether this means we need to be careful about intermediate message boundaries when there's a lot of data being sent. regards, tom lane
"Also sprach Tom Lane:" > > It may still be useful. The kernel won't necessarily send data as you > > push it down to the network protocols and driver. The driver may decide > > to wait for more data to accumulate, > > No, because we set TCP_NODELAY. Once we've flushed a message to the That just means "disable Nagle", which is indeed more or less the correct thing to do .. you don't want to sit around waiting for more data when we're sure there will be none, as you say. Yet you also don't want to send short data out prematurely, which disabling Nagle can cause. And disabling Nagle doesn't actually force data out immediately you want it to be sent ... it just disables extra waits imposed by the Nagle algorithm/protocol. It doesn't stop the driver from waiting around because it feels taking the bus might be a bit premature right now, for example. > kernel, we don't want the kernel sitting on it --- any delay there adds > directly to the elapsed query time. At least this is the case for the > final response to a query. I'm not too clear on whether this means we > need to be careful about intermediate message boundaries when there's a > lot of data being sent. It's unclear. But not my situation. If I clear TCP_CORK all data is sent at that point. If I set TCP_CORK data is held until I clear TCP_CORK, or 200ms have passed with no send. Peter
"Also sprach Kenneth Marshall:" > improvement from coalescing the packets. Good luck in your investigations. While I am recompiling stuff, just some stats. Typical network traffic analysis during the PG runs: Total Packets Processed 493,499 Unicast 100.0% 493,417 Broadcast 0.0% 82 Multicast 0.0% 0 pktCast distribution chart Shortest 42 bytes Average Size 192 bytes Longest 1,514 bytes <= 64 bytes 0.0% 158 64 to 128 bytes 77.3% 381,532 129 to 256 bytes 6.8% 33,362 257 to 512 bytes 8.6% 42,535 513 to 1024 bytes 4.0% 19,577 1025 to 1518 bytes 3.3% 16,335 Typical application rusage stats: time ./c -timeout 12000 -database postgresql://pebbles/d /tmp/tty_io..c user system elapsed cpu 7.866u 6.038s 5:49.13 3.9% 0+0k 0+0io 0pf+0w Those stats show the system lost in i/o. It's neither in kernel nor in userspace. Presumably the other side plus networking was the holdup. For comparison, against localhost via loopback ("fake" networking): time ./c -timeout 12000 -database postgresql://localhost/d /tmp/tty_io..c user system elapsed cpu 9.483u 5.321s 2:41.78 9.1% 0+0k 0+0io 0pf+0w but in that case postmaster was doing about 54% cpu, so the overall cpu for server + client is 63%. I moved to a unix domain socket and postmaster alone went to 68%. time ./c -timeout 12000 -database postgresql://unix/var/run/postgresql/d /tmp/tty_io..c user system elapsed cpu 9.569u 3.698s 2:52.41 7.6% 0+0k 0+0io 0pf+0w The elapsed time is not much different between unix and localhost. One can see that there is some i/o holdup because the two threads ought to do 100% between them if handover of info were costless. The difference (the system was queiscent o/w apart from the monitoring software, which shows only a fraction of a percent loading). There were no memory shortages and swap was disabled for the test (both sides) For comparison, running against gdbm straignt to disk time ./c -timeout 12000 /tmp/tty_io..c user system elapsed cpu 2.637u 0.735s 0:05.34 62.9% 0+0k 0+0io 0pf+0w Through localhost: time ./c -timeout 12000 -database gdbm://localhost/ptb/c /tmp/tty_io..c user system elapsed cpu 2.746u 3.699s 0:16.00 40.1% 0+0k 0+0io 0pf+0w (the server process was at 35% cpu, for 75% total). Across the net: time ./c -timeout 12000 -database gdbm://pebbles/ptb/c /tmp/tty_io..c user system elapsed cpu 2.982u 4.430s 1:03.44 7.9% 0+0k 0+0io 0pf+0w (the server was at 7% cpu) Have to go shopping .... Peter
"Also sprach Richard Huxton:" > > scheme each time, for example! (how that?). I could presumably also > > help it by preloading the commands I will run and sending over the > > params only with a "do a no. 17 now!". > > PREPARE/EXECUTE (or the equivalent libpq functions). Yes, thank you. It seems to speed things up by a factor of 2. But can I prepare a DECLARE x BINARY CURSOR FOR SELECT ... statement? The manual seems to say no. Peter
"Peter T. Breuer" <ptb@inv.it.uc3m.es> writes: > But can I prepare a DECLARE x BINARY CURSOR FOR SELECT ... statement? > The manual seems to say no. No, you just prepare the SELECT. At the protocol level, DECLARE CURSOR is a tad useless. You can still fetch the data in binary if you want... regards, tom lane
"Also sprach Tom Lane:" > "Peter T. Breuer" <ptb@inv.it.uc3m.es> writes: > > But can I prepare a DECLARE x BINARY CURSOR FOR SELECT ... statement? > > The manual seems to say no. > > No, you just prepare the SELECT. At the protocol level, DECLARE CURSOR > is a tad useless. You can still fetch the data in binary if you want... How? It's a 7.4 server (or may be, more generally) and declare binary cursor is the only way I know to get binary data off it. AFAIR the only other way works only for an 8.* server and consists of sending the query with an annotation that a binary reply is expected. Peter
"Peter T. Breuer" <ptb@inv.it.uc3m.es> writes: > "Also sprach Tom Lane:" >> No, you just prepare the SELECT. At the protocol level, DECLARE CURSOR >> is a tad useless. You can still fetch the data in binary if you want... > How? It's a 7.4 server (or may be, more generally) and declare binary > cursor is the only way I know to get binary data off it. AFAIR the only > other way works only for an 8.* server and consists of sending the query > with an annotation that a binary reply is expected. No, that works for a 7.4 server too; we haven't changed the protocol since then. (I forget though to what extent 7.4 libpq exposes the capability.) regards, tom lane
Please let us know if there is something we should change in the
PostgreSQL source code.

---------------------------------------------------------------------------

Peter T. Breuer wrote:
> "Also sprach Tom Lane:"
> > > It may still be useful. The kernel won't necessarily send data as you
> > > push it down to the network protocols and driver. The driver may decide
> > > to wait for more data to accumulate,
> >
> > No, because we set TCP_NODELAY. Once we've flushed a message to the
>
> That just means "disable Nagle", which is indeed more or less the
> correct thing to do .. you don't want to sit around waiting for more
> data when we're sure there will be none, as you say. Yet you also don't
> want to send short data out prematurely, which disabling Nagle can
> cause.
>
> And disabling Nagle doesn't actually force data out immediately you want
> it to be sent ... it just disables extra waits imposed by the Nagle
> algorithm/protocol. It doesn't stop the driver from waiting around
> because it feels taking the bus might be a bit premature right now,
> for example.
>
> > kernel, we don't want the kernel sitting on it --- any delay there adds
> > directly to the elapsed query time. At least this is the case for the
> > final response to a query. I'm not too clear on whether this means we
> > need to be careful about intermediate message boundaries when there's a
> > lot of data being sent.
>
> It's unclear. But not my situation.
>
> If I clear TCP_CORK all data is sent at that point. If I set TCP_CORK
> data is held until I clear TCP_CORK, or 200ms have passed with no send.
>
> Peter

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
On 5/26/07, Peter T. Breuer <ptb@inv.it.uc3m.es> wrote:
> "Also sprach Tom Lane:"
> > "Peter T. Breuer" <ptb@inv.it.uc3m.es> writes:
> > > But can I prepare a DECLARE x BINARY CURSOR FOR SELECT ... statement?
> > > The manual seems to say no.
> >
> > No, you just prepare the SELECT. At the protocol level, DECLARE CURSOR
> > is a tad useless. You can still fetch the data in binary if you want...
>
> How? It's a 7.4 server (or may be, more generally) and declare binary
> cursor is the only way I know to get binary data off it. AFAIR the only
> other way works only for an 8.* server and consists of sending the query
> with an annotation that a binary reply is expected.

You want to be calling PQexecPrepared and flip the resultFormat.

http://www.postgresql.org/docs/7.4/interactive/libpq-exec.html#LIBPQ-EXEC-MAIN

IMO, it's usually not worth bothering with binary unless you are
dealing with bytea objects.

merlin
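Concretely, the resultFormat flip Merlin refers to is the last argument
of PQexecPrepared. A sketch, reusing the hypothetical "lookup" statement
from the earlier prepared-statement example:

    #include <libpq-fe.h>

    static PGresult *lookup_binary(PGconn *conn, const char *key)
    {
        const char *params[1];

        params[0] = key;
        return PQexecPrepared(conn, "lookup", 1, params,
                              NULL,   /* parameter lengths: text  */
                              NULL,   /* parameter formats: text  */
                              1);     /* resultFormat 1 = binary  */
    }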