Thread: Why is the backend send buffer size hardcoded at 8KB?
Why does the backend send buffer use exactly 8KB?
(https://github.com/postgres/postgres/blob/249d64999615802752940e017ee5166e726bc7cd/src/backend/libpq/pqcomm.c#L134)

I had this question when I tried to measure the speed of reading data. The
bottleneck was the read syscall. With strace I found that in most cases read
returns 8192 bytes (https://pastebin.com/LU10BdBJ). With tcpdump we can
confirm that the network packets have size 8192
(https://pastebin.com/FD8abbiA).

So, with a well-tuned networking stack, the limit is 8KB. The reason is the
hardcoded size of the Postgres write buffer.

I found a discussion where Tom Lane says that the reason for this limit is
the size of pipe buffers on Unix machines:
https://www.postgresql.org/message-id/9426.1388761242%40sss.pgh.pa.us

> Traditionally, at least, that was the size of pipe buffers in Unix
> machines, so in principle this is the most optimal chunk size for sending
> data across a Unix socket. I have no idea though if that's still true in
> kernels in common use today. For TCP communication it might be marginally
> better to find out the MTU size and use that; but it's unclear that it's
> worth the trouble, or indeed that we can know the end-to-end MTU size
> with any reliability.

Does it make sense to make this parameter configurable?

-- 
Artemiy Ryabinkov
getlag(at)ya(dot)ru
Artemiy Ryabinkov <getlag@ya.ru> writes:
> Does it make sense to make this parameter configurable?

Not without some proof that it makes a performance difference on common
setups (which you've not provided). Even with some proof, I'm not sure I'd
bother with exposing a user-tunable knob, as opposed to just making the
buffer bigger. We have far too many GUCs already.

			regards, tom lane
On 2019-07-27 14:43:54 +0300, Artemiy Ryabinkov wrote:
> Why does the backend send buffer use exactly 8KB?
> (https://github.com/postgres/postgres/blob/249d64999615802752940e017ee5166e726bc7cd/src/backend/libpq/pqcomm.c#L134)
>
> I had this question when I tried to measure the speed of reading data.
> The bottleneck was the read syscall. With strace I found that in most
> cases read returns 8192 bytes (https://pastebin.com/LU10BdBJ). With
> tcpdump we can confirm that the network packets have size 8192
> (https://pastebin.com/FD8abbiA)

Well, in most setups you can't have frames that large. The most common
limit is 1500 bytes, plus or minus some overhead. Using jumbo frames isn't
that uncommon, but it has enough problems that I don't think it's that
widely used with postgres.

> So, with a well-tuned networking stack, the limit is 8KB. The reason is
> the hardcoded size of the Postgres write buffer.

Well, jumbo frames are limited to 9000 bytes. But the reason you're seeing
8192-byte packets isn't just that we have an 8kB buffer; I think it's also
that we unconditionally set TCP_NODELAY:

#ifdef TCP_NODELAY
	on = 1;
	if (setsockopt(port->sock, IPPROTO_TCP, TCP_NODELAY,
				   (char *) &on, sizeof(on)) < 0)
	{
		elog(LOG, "setsockopt(%s) failed: %m", "TCP_NODELAY");
		return STATUS_ERROR;
	}
#endif

With an 8KB send size, we'll often unnecessarily send some smaller packets
(for both 1500 and 9000 byte MTUs), because 8kB doesn't divide neatly into
the MTU. Here are, e.g., the IP packet sizes for a query returning maybe
18kB:

1500 1500 1500 1500 1500 1004 1500 1500 1500 1500 1500 1004 1500 414

The dips are because that's where our 8KB buffer plus disabling Nagle
implies a packet boundary.

I wonder if we ought to pass MSG_MORE (which overrides TCP_NODELAY by
basically having TCP_CORK behaviour for that call) in cases where we know
there's more data to send. Which we pretty much know, although we'd need
to pass that knowledge from pqcomm.c to be-secure.c.

It might be better to just use larger send sizes, however.
I think most kernels are going to be better than us at knowing how to chop
up the send size. We're using much larger limits when sending data from
the client (no limit for !win32, 65k for windows), and I don't recall
seeing any problem reports about that.

OTOH, I'm not quite convinced that you're going to see much of a
performance difference in most scenarios. As soon as the connection is
actually congested, the kernel will coalesce packets regardless of the
send() size.

> Does it make sense to make this parameter configurable?

I'd much rather not. It's going to be too hard to tune, and I don't see
any tradeoffs actually requiring that.

Greetings,

Andres Freund
Hi,

On 2019-07-27 11:09:06 -0400, Tom Lane wrote:
> Artemiy Ryabinkov <getlag@ya.ru> writes:
> > Does it make sense to make this parameter configurable?
>
> Not without some proof that it makes a performance difference on
> common setups (which you've not provided).

I think the fact that we unnecessarily fragment into smaller packets every
time we send a full 8kB buffer, unless there's already network congestion,
is kind of evidence enough? The combination of a relatively small send
buffer + TCP_NODELAY isn't great.

I'm not quite sure what the smaller buffer is supposed to achieve, at
least these days. In blocking mode (emulated in PG code, using latches, so
we can accept interrupts) we'll always just loop back to another send() in
internal_flush(). In non-blocking mode, we'll fall out of the loop as soon
as the kernel didn't send any data.

Isn't the outcome of using such a small send buffer that we end up
performing a) more syscalls, which has gotten a lot worse in the last two
years due to all the CPU vulnerability mitigations making syscalls a *lot*
more expensive, and b) unnecessary fragmentation?

The situation for receiving data is a bit different. For one, we don't
cause unnecessary fragmentation by using a buffer of a relatively limited
size. But more importantly, copying data into the buffer takes time, and
we could actually be responding to queries earlier in the data. In
contrast to the send case, we don't loop around recv() until all the data
has been received.

I suspect we could still do with a bigger buffer, just to reduce the
number of syscalls in bulk loading cases, however.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> It might be better to just use larger send sizes however. I think most
> kernels are going to be better than us knowing how to chop up the send
> size.

Yeah. The existing commentary about that is basically justifying 8K as
being large enough to avoid performance issues; if somebody can show that
that's not true, I wouldn't have any hesitation about kicking it up.

(Might be worth malloc'ing it rather than having it as part of the static
process image if we do so, but that's a trivial change.)

			regards, tom lane
Hi,

On 2019-07-27 18:34:50 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > It might be better to just use larger send sizes however. I think most
> > kernels are going to be better than us knowing how to chop up the send
> > size.
>
> Yeah. The existing commentary about that is basically justifying 8K
> as being large enough to avoid performance issues; if somebody can
> show that that's not true, I wouldn't have any hesitation about
> kicking it up.

You think that unnecessary fragmentation, which I did show, isn't good
enough? That does have a cost at the network level, even if it possibly
doesn't show up that much in timing.

I wonder if we ought to just query SO_SNDBUF/SO_RCVBUF or such, and use
those (although that's not quite perfect, because there's some added
overhead before data ends up in SNDBUF). Probably with some clamping, to
defend against a crazy sysadmin setting it extremely high.

Additionally we perhaps ought to just not use the send buffer when
internal_putbytes() is called with more data than can fit in the buffer.
We should fill it with as much data as fits in it (so the pending data
like the message header, or smaller previous messages, are flushed out in
the largest size), and then just call secure_write() directly on the rest.
It's not free to memcpy all that data around when we already have a
buffer.

> (Might be worth malloc'ing it rather than having it as part of the
> static process image if we do so, but that's a trivial change.)

We already do for the send buffer, because we repalloc it in
socket_putmessage_noblock(). Oddly enough, we never reduce its size after
that... While the receive side is statically allocated, I don't think it
ends up in the process image as-is - as the contents aren't initialized,
it ends up in .bss.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2019-07-27 18:34:50 -0400, Tom Lane wrote:
>> Yeah. The existing commentary about that is basically justifying 8K
>> as being large enough to avoid performance issues; if somebody can
>> show that that's not true, I wouldn't have any hesitation about
>> kicking it up.

> You think that unnecessary fragmentation, which I did show, isn't good
> enough? That does have cost on the network level, even if it possibly
> doesn't show up that much in timing.

I think it is worth doing some testing, rather than just blindly changing
the buffer size, because we don't know how much we'd have to change it to
have any useful effect.

> Additionally we perhaps ought to just not use the send buffer when
> internal_putbytes() is called with more data than can fit in the
> buffer. We should fill it with as much data as fits in it (so the
> pending data like the message header, or smaller previous messages, are
> flushed out in the largest size), and then just call secure_write()
> directly on the rest. It's not free to memcpy all that data around, when
> we already have a buffer.

Maybe, but how often does a single putbytes call transfer more than 16K?
(If you fill the existing buffer, but don't have a full bufferload left to
transfer, I doubt you want to shove the fractional bufferload directly to
the kernel.) Perhaps this added complexity will pay for itself, but I
don't think we should just assume that.

> While the receive side is statically allocated, I don't think it ends up
> in the process image as-is - as the contents aren't initialized, it ends
> up in .bss.

Right, but then we pay for COW when a child process first touches it, no?
Maybe the kernel is smart about pages that started as BSS, but I wouldn't
bet on it.

			regards, tom lane
Hi,

On 2019-07-27 19:10:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Additionally we perhaps ought to just not use the send buffer when
> > internal_putbytes() is called with more data than can fit in the
> > buffer. We should fill it with as much data as fits in it (so the
> > pending data like the message header, or smaller previous messages, are
> > flushed out in the largest size), and then just call secure_write()
> > directly on the rest. It's not free to memcpy all that data around, when
> > we already have a buffer.
>
> Maybe, but how often does a single putbytes call transfer more than
> 16K?

I don't think it's that rare. COPY produces entire rows and sends them at
once, printtup also does, and walsender can send pretty large chunks. I
think with several columns, after text conversion it's pretty easy to
exceed 16k, not even taking large toasted columns into account.

> (If you fill the existing buffer, but don't have a full bufferload
> left to transfer, I doubt you want to shove the fractional bufferload
> directly to the kernel.) Perhaps this added complexity will pay for
> itself, but I don't think we should just assume that.

Yea, I'm not certain either. One way to deal with the partially filled
buffer issue would be to use sendmsg() - and have two iovs (one pointing
to the filled buffer, one to the actual data). Wonder if it'd be
worthwhile to do that in more scenarios, to avoid unnecessarily copying
memory around.

> > While the receive side is statically allocated, I don't think it ends up
> > in the process image as-is - as the contents aren't initialized, it ends
> > up in .bss.
>
> Right, but then we pay for COW when a child process first touches it,
> no? Maybe the kernel is smart about pages that started as BSS, but
> I wouldn't bet on it.

Well, they'll not exist as pages at that point, because postmaster won't
have used the send buffer to a meaningful degree? And I think that's the
same for >4k/pagesize blocks with malloc.
I think there could be a benefit if we started the buffer pretty small
with malloc, and only grew it as needed.

Greetings,

Andres Freund
On 2019-07-27 19:10:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-07-27 18:34:50 -0400, Tom Lane wrote:
> >> Yeah. The existing commentary about that is basically justifying 8K
> >> as being large enough to avoid performance issues; if somebody can
> >> show that that's not true, I wouldn't have any hesitation about
> >> kicking it up.
>
> > You think that unnecessary fragmentation, which I did show, isn't good
> > enough? That does have cost on the network level, even if it possibly
> > doesn't show up that much in timing.
>
> I think it is worth doing some testing, rather than just blindly
> changing buffer size, because we don't know how much we'd have to change
> it to have any useful effect.

I did a little test with nttcp between two of our servers (1 Gbit to
different switches, switches connected by 10 Gbit). The difference between
a 1024 byte buffer and a 1460 byte buffer is small but measurable.
Anything larger doesn't make a difference. So increasing the buffer beyond
8 kB probably doesn't improve performance on a 1 Gbit LAN. I didn't test
10 Gbit LAN or WAN - those might be different.

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ |  -- Ross Anderson <https://www.edge.org/>