
Why does backend send buffer size hardcoded at 8KB?

From
Artemiy Ryabinkov
Date:
Why does the backend send buffer use exactly 8KB? 
(https://github.com/postgres/postgres/blob/249d64999615802752940e017ee5166e726bc7cd/src/backend/libpq/pqcomm.c#L134) 


I ran into this question when I tried to measure the speed of reading data. 
The bottleneck was the read syscall. With strace I found that in most cases 
read returns 8192 bytes (https://pastebin.com/LU10BdBJ). With tcpdump we 
can confirm that the network packets have a size of 8192 bytes 
(https://pastebin.com/FD8abbiA).

So, with a well-tuned networking stack, the limit is 8KB. The reason is 
the hardcoded size of the Postgres write buffer.

I found a discussion where Tom Lane says that the reason for this limit is 
the size of pipe buffers on Unix machines: 
https://www.postgresql.org/message-id/9426.1388761242%40sss.pgh.pa.us

> Traditionally, at least, that was the size of pipe buffers in Unix
> machines, so in principle this is the most optimal chunk size for
> sending data across a Unix socket.  I have no idea though if that's
> still true in kernels in common use today. For TCP communication it
> might be marginally better to find out the MTU size and use that; but
> it's unclear that it's worth the trouble, or indeed that we can
> know the end-to-end MTU size with any reliability.

Does it make sense to make this parameter configurable?

-- 
Artemiy Ryabinkov
getlag(at)ya(dot)ru



Re: Why does backend send buffer size hardcoded at 8KB?

From
Tom Lane
Date:
Artemiy Ryabinkov <getlag@ya.ru> writes:
> Does it make sense to make this parameter configurable?

Not without some proof that it makes a performance difference on
common setups (which you've not provided).

Even with some proof, I'm not sure I'd bother with exposing a
user-tunable knob, as opposed to just making the buffer bigger.
We have far too many GUCs already.

            regards, tom lane



Re: Why does backend send buffer size hardcoded at 8KB?

From
Andres Freund
Date:
On 2019-07-27 14:43:54 +0300, Artemiy Ryabinkov wrote:
> Why does the backend send buffer use exactly 8KB?
> (https://github.com/postgres/postgres/blob/249d64999615802752940e017ee5166e726bc7cd/src/backend/libpq/pqcomm.c#L134)
> 
> 
> I ran into this question when I tried to measure the speed of reading data. The
> bottleneck was the read syscall. With strace I found that in most cases read
> returns 8192 bytes (https://pastebin.com/LU10BdBJ). With tcpdump we can
> confirm that the network packets have a size of 8192 bytes (https://pastebin.com/FD8abbiA).

Well, in most setups, you can't have frames that large. The most common
limit is 1500 bytes, +/- some overhead. Using jumbo frames isn't that
uncommon, but it has enough problems that I don't think it's widely used
with postgres.


> So, with a well-tuned networking stack, the limit is 8KB. The reason is the
> hardcoded size of the Postgres write buffer.

Well, jumbo frames are limited to 9000 bytes.



But the reason you're seeing 8192-byte packets isn't just that we have
an 8kB buffer; I think it's also that we unconditionally set
TCP_NODELAY:

#ifdef    TCP_NODELAY
        on = 1;
        if (setsockopt(port->sock, IPPROTO_TCP, TCP_NODELAY,
                       (char *) &on, sizeof(on)) < 0)
        {
            elog(LOG, "setsockopt(%s) failed: %m", "TCP_NODELAY");
            return STATUS_ERROR;
        }
#endif

With an 8kB send size, we'll often unnecessarily send some smaller packets
(both for 1500 and 9000 byte MTUs), because 8kB isn't a multiple of the
per-packet payload size. Here are, for example, the IP packet sizes for a
query returning maybe 18kB:

1500
1500
1500
1500
1500
1004
1500
1500
1500
1500
1500
1004
1500
414

The dips are where our 8kB buffer, combined with Nagle being disabled,
forces a packet boundary.
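
To spell out the arithmetic behind those dips, here's a tiny standalone
program (assuming a 1448-byte MSS, i.e. a 1500-byte MTU minus 20 bytes of
IPv4 header and 32 bytes of TCP header with timestamps; the exact numbers
depend on the options in use):

#include <stdio.h>

int
main(void)
{
    int         remaining = 8192;   /* one full send buffer */
    const int   mss = 1448;         /* 1500 MTU - 20 (IPv4) - 32 (TCP w/ timestamps) */
    const int   hdr = 52;           /* per-packet IP + TCP header overhead */

    /* with TCP_NODELAY, each 8kB send() gets segmented and flushed on its own */
    while (remaining > 0)
    {
        int         payload = remaining > mss ? mss : remaining;

        printf("%d\n", payload + hdr);  /* on-wire IP packet size */
        remaining -= payload;
    }
    return 0;
}

which prints 1500 five times followed by 1004 - exactly the pattern above.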


I wonder if we ought to pass MSG_MORE (which overrides TCP_NODELAY by
basically giving TCP_CORK behaviour for that call) in cases where we know
there's more data to send. Which we pretty much do know, although we'd need
to pass that knowledge from pqcomm.c to be-secure.c.
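
Something like this, purely as a sketch (the function name and the
more_data_pending flag are invented; the real change would need to pass
that flag from pqcomm.c down to the actual send in be-secure.c):

#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Sketch only: set MSG_MORE when the caller knows more data follows, so the
 * kernel keeps filling MSS-sized segments despite TCP_NODELAY.
 */
static ssize_t
raw_write_sketch(int sock, const void *ptr, size_t len, bool more_data_pending)
{
    int         flags = 0;

#ifdef MSG_MORE
    if (more_data_pending)
        flags |= MSG_MORE;
#endif

    return send(sock, ptr, len, flags);
}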


It might be better to just use larger send sizes, however. I think most
kernels are going to be better than us at knowing how to chop up the send
size. We're using much larger limits when sending data from the client
(no limit for !win32, 65k for Windows), and I don't recall seeing any
problem reports about that.


OTOH, I'm not quite convinced that you're going to see much of a
performance difference in most scenarios. As soon as the connection is
actually congested, the kernel will coalesce packets regardless of the
send() size.


> Does it make sense to make this parameter configurable?

I'd much rather not. It's going to be too hard to tune, and I don't see
any tradeoffs actually requiring that.

Greetings,

Andres Freund



Re: Why does backend send buffer size hardcoded at 8KB?

From
Andres Freund
Date:
Hi,

On 2019-07-27 11:09:06 -0400, Tom Lane wrote:
> Artemiy Ryabinkov <getlag@ya.ru> writes:
> > Does it make sense to make this parameter configurable?
>
> Not without some proof that it makes a performance difference on
> common setups (which you've not provided).

I think us unnecessarily fragmenting into smaller packets every time we
send a full 8kB buffer, unless there's already network congestion, is
kind of evidence enough? The combination of a relatively small send
buffer + TCP_NODELAY isn't great.

I'm not quite sure what the smaller buffer is supposed to achieve, at
least these days. In blocking mode (emulated in PG code, using latches,
so we can accept interrupts) we'll always just loop back to another
send() in internal_flush(). In non-blocking mode, we'll fall out of the
loop as soon as the kernel doesn't accept any data. Isn't the outcome of
using such a small send buffer that we end up performing a) more
syscalls, which has gotten a lot worse in the last two years due to all
the CPU vulnerability mitigations making syscalls a *lot* more expensive,
and b) unnecessary fragmentation?
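
(For readers following along, a heavily simplified sketch of that shape -
hypothetical code, not what pqcomm.c actually looks like, with interrupt
handling elided:)

#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * In blocking mode we keep looping until the whole buffer is gone, so a
 * small buffer just means more send() calls; in non-blocking mode we bail
 * out as soon as the kernel accepts nothing.
 */
static int
flush_sketch(int sock, const char *buf, size_t len, bool nonblocking)
{
    size_t      sent = 0;

    while (sent < len)
    {
        ssize_t     n = send(sock, buf + sent, len - sent, 0);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted, just retry */
            if (nonblocking && (errno == EAGAIN || errno == EWOULDBLOCK))
                break;          /* kernel took nothing; caller retries later */
            return -1;          /* hard error */
        }
        sent += n;
    }
    return (int) sent;
}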

The situation for receiving data is a bit different. For one, we don't
cause unnecessary fragmentation by using a buffer of a relatively
limited size. But more importantly, copying data into the buffer takes
time, and we could actually be responding to queries that arrived earlier
in the data. In contrast to the send case, we don't loop around recv()
until all the data has been received.

I suspect we could still do with a bigger buffer, just to reduce the
number of syscalls in bulk loading cases, however.

Greetings,

Andres Freund



Re: Why does backend send buffer size hardcoded at 8KB?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> It might be better to just use larger send sizes, however. I think most
> kernels are going to be better than us at knowing how to chop up the send
> size.

Yeah.  The existing commentary about that is basically justifying 8K
as being large enough to avoid performance issues; if somebody can
show that that's not true, I wouldn't have any hesitation about
kicking it up.

(Might be worth malloc'ing it rather than having it as part of the
static process image if we do so, but that's a trivial change.)

            regards, tom lane



Re: Why does backend send buffer size hardcoded at 8KB?

From
Andres Freund
Date:
Hi,

On 2019-07-27 18:34:50 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > It might be better to just use larger send sizes, however. I think most
> > kernels are going to be better than us at knowing how to chop up the send
> > size.

> Yeah.  The existing commentary about that is basically justifying 8K
> as being large enough to avoid performance issues; if somebody can
> show that that's not true, I wouldn't have any hesitation about
> kicking it up.

You think that the unnecessary fragmentation, which I did show, isn't good
enough? That does have a cost at the network level, even if it possibly
doesn't show up that much in timing.


I wonder if we ought to just query SO_SNDBUF/SO_RCVBUF or such, and use
those (although that's not quite perfect, because there's some added
overhead before data ends up in SNDBUF). Probably with some clamping, to
defend against a crazy sysadmin setting it extremely high.
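
For example, something along these lines (just a sketch; the function name
and the 8kB/256kB clamp values are made up, not a proposal for specific
numbers):

#include <sys/socket.h>

/*
 * Sketch: size the send buffer from the socket's actual SO_SNDBUF, clamped
 * to a sane range.
 */
static int
choose_send_buffer_size(int sock)
{
    int         sndbuf = 0;
    socklen_t   optlen = sizeof(sndbuf);

    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &optlen) < 0)
        return 8192;            /* fall back to the current default */

    if (sndbuf < 8192)
        sndbuf = 8192;
    if (sndbuf > 256 * 1024)
        sndbuf = 256 * 1024;

    return sndbuf;
}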


Additionally we perhaps ought to just not use the send buffer when
internal_putbytes() is called with more data than can fit in the
buffer. We should fill it with as much data as fits in it (so the
pending data like the message header, or smaller previous messages, are
flushed out in the largest size), and then just call secure_write()
directly on the rest. It's not free to memcpy all that data around, when
we already have a buffer.
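
Roughly like this, as a sketch against the pqcomm.c internals (it glosses
over the PqSendStart bookkeeping, and secure_write_all() is a made-up
helper that would loop over secure_write() until everything is written):

/*
 * Sketch of the idea, not a patch: top up the existing buffer once, so
 * pending bytes (message header, earlier small messages) go out in one
 * large write, then push the remainder to the socket directly instead of
 * copying it through the buffer in 8kB slices.
 */
static int
internal_putbytes_sketch(const char *s, size_t len)
{
    size_t      avail = PqSendBufferSize - PqSendPointer;

    if (len > avail)
    {
        memcpy(PqSendBuffer + PqSendPointer, s, avail);
        PqSendPointer += avail;
        if (internal_flush())
            return EOF;         /* buffered data + header flushed first */
        /* write the rest straight out, no extra copy */
        return secure_write_all(s + avail, len - avail);
    }

    memcpy(PqSendBuffer + PqSendPointer, s, len);
    PqSendPointer += len;
    return 0;
}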


> (Might be worth malloc'ing it rather than having it as part of the
> static process image if we do so, but that's a trivial change.)

We already do for the send buffer, because we repalloc it in
socket_putmessage_noblock(). Oddly enough, we never reduce its size
after that...

While the receive side is statically allocated, I don't think it ends up
in the process image as-is - as the contents aren't initialized, it ends
up in .bss.

Greetings,

Andres Freund



Re: Why does backend send buffer size hardcoded at 8KB?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2019-07-27 18:34:50 -0400, Tom Lane wrote:
>> Yeah.  The existing commentary about that is basically justifying 8K
>> as being large enough to avoid performance issues; if somebody can
>> show that that's not true, I wouldn't have any hesitation about
>> kicking it up.

> You think that the unnecessary fragmentation, which I did show, isn't good
> enough? That does have a cost at the network level, even if it possibly
> doesn't show up that much in timing.

I think it is worth doing some testing, rather than just blindly changing
buffer size, because we don't know how much we'd have to change it to
have any useful effect.

> Additionally we perhaps ought to just not use the send buffer when
> internal_putbytes() is called with more data than can fit in the
> buffer. We should fill it with as much data as fits in it (so the
> pending data like the message header, or smaller previous messages, are
> flushed out in the largest size), and then just call secure_write()
> directly on the rest. It's not free to memcpy all that data around, when
> we already have a buffer.

Maybe, but how often does a single putbytes call transfer more than 16K?
(If you fill the existing buffer, but don't have a full bufferload
left to transfer, I doubt you want to shove the fractional bufferload
directly to the kernel.)  Perhaps this added complexity will pay for
itself, but I don't think we should just assume that.

> While the receive side is statically allocated, I don't think it ends up
> in the process image as-is - as the contents aren't initialized, it ends
> up in .bss.

Right, but then we pay for COW when a child process first touches it,
no?  Maybe the kernel is smart about pages that started as BSS, but
I wouldn't bet on it.

            regards, tom lane



Re: Why does backend send buffer size hardcoded at 8KB?

From
Andres Freund
Date:
Hi,

On 2019-07-27 19:10:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Additionally we perhaps ought to just not use the send buffer when
> > internal_putbytes() is called with more data than can fit in the
> > buffer. We should fill it with as much data as fits in it (so the
> > pending data like the message header, or smaller previous messages, are
> > flushed out in the largest size), and then just call secure_write()
> > directly on the rest. It's not free to memcpy all that data around, when
> > we already have a buffer.
> 
> Maybe, but how often does a single putbytes call transfer more than
> 16K?

I don't think it's that rare. COPY produces entire rows and sends them
at once, printtup also does, and walsender can send pretty large chunks? I
think with several columns, after text conversion it's pretty easy to
exceed 16kB, not even taking large toasted columns into account.


> (If you fill the existing buffer, but don't have a full bufferload
> left to transfer, I doubt you want to shove the fractional bufferload
> directly to the kernel.)  Perhaps this added complexity will pay for
> itself, but I don't think we should just assume that.

Yea, I'm not certain either. One way to deal with the partially filled
buffer issue would be to use sendmsg() - and have two iovs (one pointing
to the filled buffer, one to the actual data). I wonder if it'd be
worthwhile to do that in more scenarios, to avoid unnecessarily copying
memory around.
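
A sketch of that gather write (all names made up; a real version would
also need to handle short writes, and it only applies to the non-TLS path):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Sketch: flush the partially filled send buffer and a large payload with
 * a single gather write, instead of first copying the payload into the
 * buffer.
 */
static ssize_t
send_buffered_plus_payload(int sock,
                           const char *buffered, size_t buffered_len,
                           const char *payload, size_t payload_len)
{
    struct iovec iov[2];
    struct msghdr msg;

    iov[0].iov_base = (void *) buffered;
    iov[0].iov_len = buffered_len;
    iov[1].iov_base = (void *) payload;
    iov[1].iov_len = payload_len;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    return sendmsg(sock, &msg, 0);  /* may send less than requested */
}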


> > While the receive side is statically allocated, I don't think it ends up
> > in the process image as-is - as the contents aren't initialized, it ends
> > up in .bss.
> 
> Right, but then we pay for COW when a child process first touches it,
> no?  Maybe the kernel is smart about pages that started as BSS, but
> I wouldn't bet on it.

Well, they won't exist as pages at that point, because postmaster won't
have used the send buffer to any meaningful degree. And I think the same
holds for malloc'd blocks larger than the 4kB page size.  I think there
could be a benefit if we started the buffer pretty small with malloc, and
only grew it as needed.
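
(Hypothetically, something like this - malloc/realloc standing in for
palloc/repalloc, error handling elided, all names made up:)

#include <stdlib.h>

/*
 * Sketch of a start-small, grow-as-needed send buffer.  Pages that are
 * never touched never get faulted in, so starting small keeps the
 * per-backend footprint down.
 */
static char *send_buf = NULL;
static size_t send_buf_size = 0;

static void
ensure_send_buf(size_t needed)
{
    if (send_buf == NULL)
    {
        send_buf_size = 8192;       /* modest initial size */
        send_buf = malloc(send_buf_size);
    }
    while (send_buf_size < needed)
    {
        send_buf_size *= 2;         /* grow geometrically */
        send_buf = realloc(send_buf, send_buf_size);
    }
}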

Greetings,

Andres Freund



Re: Why does backend send buffer size hardcoded at 8KB?

From
"Peter J. Holzer"
Date:
On 2019-07-27 19:10:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-07-27 18:34:50 -0400, Tom Lane wrote:
> >> Yeah.  The existing commentary about that is basically justifying 8K
> >> as being large enough to avoid performance issues; if somebody can
> >> show that that's not true, I wouldn't have any hesitation about
> >> kicking it up.
>
> > You think that the unnecessary fragmentation, which I did show, isn't good
> > enough? That does have a cost at the network level, even if it possibly
> > doesn't show up that much in timing.
>
> I think it is worth doing some testing, rather than just blindly changing
> buffer size, because we don't know how much we'd have to change it to
> have any useful effect.

I did a little test with nttcp between two of our servers (1 Gbit to
different switches, switches connected by 10 Gbit). The difference
between a 1024 byte buffer and a 1460 byte buffer is small but
measurable. Anything larger doesn't make a difference. So increasing the
buffer beyond 8 kB probably doesn't improve performance on a 1 Gbit LAN.

I didn't test 10 Gbit LAN or WAN - those might be different.

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
