Re: Flushing large data immediately in pqcomm - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Flushing large data immediately in pqcomm
Date
Msg-id 20240201032442.v3vd52kzu3hynamf@awork3.anarazel.de
In response to Re: Flushing large data immediately in pqcomm  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Flushing large data immediately in pqcomm
List pgsql-hackers
Hi,

On 2024-01-31 14:57:35 -0500, Robert Haas wrote:
> > You're right and I'm open to doing more legwork. I'd also appreciate any
> > suggestion about how to test this properly and/or useful scenarios to
> > test. That would be really helpful.
>
> I think experimenting to see whether the long-short-long-short
> behavior that Heikki postulated emerges in practice would be a really
> good start.
>
> Another experiment that I think would be interesting is: suppose you
> create a patch that sends EVERY message without buffering and compare
> that to master. My naive expectation would be that this will lose if
> you pump short messages through that connection and win if you pump
> long messages through that connection. Is that true? If yes, at what
> point do we break even on performance? Does it depend on whether the
> connection is local or over a network? Does it depend on whether it's
> with or without SSL? Does it depend on Linux vs. Windows vs.
> whateverBSD? What happens if you twiddle the 8kB buffer size up or,
> say, down to just below the Ethernet frame size?

I feel like you're setting too high a bar for something that can be a pretty
clear improvement on its own, without a downside. The current behaviour is
pretty absurd; doing all this research across all platforms isn't going to
disprove that, and it's a lot of work.  ISTM we can analyze this easily enough
without taking concrete hardware into account.


One thing I haven't seen mentioned here that's relevant when using small
buffers: Postgres uses TCP_NODELAY, and has to. That means doing tiny sends
can hurt substantially.
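To make that concrete, here's a minimal Python sketch (illustrative only, not
Postgres code) of what enabling TCP_NODELAY looks like at the socket level:
with Nagle's algorithm disabled, the kernel no longer holds back small writes
waiting for ACKs, so each tiny send() can become its own small packet on the
wire.

```python
import socket

# Illustrative sketch: enable TCP_NODELAY (disable Nagle's algorithm),
# as Postgres does on its connections. With Nagle disabled, the kernel
# won't coalesce small writes while waiting for ACKs, so each tiny
# send() can go out as its own small packet.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# getsockopt returns a non-zero value once the option is set.
assert sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```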


> I think that what we really want to understand here is under what
> circumstances the extra layer of buffering is a win vs. being a loss.

It's quite easy to see that doing no buffering isn't viable - we end up with
tiny tiny TCP packets, one for each send(). And then there's the syscall
overhead.
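As a rough illustration of the syscall side of that argument, the sketch below
(hypothetical helper names, over a local socket pair standing in for a TCP
link) contrasts one send() per tiny message with a single flush of the same
bytes from a local buffer:

```python
import socket

def send_unbuffered(sock, messages):
    # One syscall per message; with TCP_NODELAY this can also mean
    # one tiny packet per message on a real network.
    for m in messages:
        sock.sendall(m)
    return len(messages)          # number of send() calls made

def send_buffered(sock, messages):
    # Copy the messages into a local buffer and flush with one syscall.
    sock.sendall(b"".join(messages))
    return 1

a, b = socket.socketpair()        # local stream pair in place of a TCP link
msgs = [b"x" * 8] * 100           # 100 eight-byte messages

assert send_unbuffered(a, msgs) == 100   # 100 syscalls
assert send_buffered(a, msgs) == 1       # 1 syscall for the same 800 bytes

received = 0
while received < 1600:            # drain both transmissions (2 * 800 bytes)
    received += len(b.recv(4096))
a.close(); b.close()
```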


Here's a quickly thrown together benchmark using netperf. First with -D, which
instructs it to use TCP_NODELAY, as we do.

10gbit network, remote host:

$ (fields="request_size,throughput"; echo "$fields"; \
   for i in $(seq 0 16); do \
     s=$((2**$i)); \
     netperf -P0 -t TCP_STREAM -l1 -H alap5-10gbe -- -r $s,$s -D 1 -o "$fields"; \
   done) | column -t -s,
 

request_size  throughput
1             22.73
2             45.77
4             108.64
8             225.78
16            560.32
32            1035.61
64            2177.91
128           3604.71
256           5878.93
512           9334.70
1024          9031.13
2048          9405.35
4096          9334.60
8192          9275.33
16384         9406.29
32768         9385.52
65536         9399.40


localhost:
request_size  throughput
1             2.76
2             5.10
4             9.89
8             20.51
16            43.42
32            87.13
64            173.72
128           343.70
256           647.89
512           1328.79
1024          2550.14
2048          4998.06
4096          9482.06
8192          17130.76
16384         29048.02
32768         42106.33
65536         48579.95

I'm slightly baffled by the poor performance of localhost with tiny packet
sizes. Ah, I see - it's TCP_NODELAY. Without that:

localhost:
request_size  throughput
1             32.02
2             60.58
4             114.32
8             262.71
16            558.42
32            1053.66
64            2099.39
128           3815.60
256           6566.19
512           11751.79
1024          18976.11
2048          27222.99
4096          33838.07
8192          38219.60
16384         39146.37
32768         44784.98
65536         44214.70


NODELAY triggers many more context switches, because data is immediately
available to the receiving side. Whereas with a real network the interrupts
get coalesced.


I think that's pretty clear evidence that we need buffering.  But I think we
can probably be smarter than we are right now, and than what's been proposed
in the patch. Because of TCP_NODELAY we shouldn't send a tiny buffer on its
own: it may go out as a small TCP packet, which is quite inefficient.


While not perfect (e.g. because networks might use jumbo packets / large MTUs,
and we don't know how many bytes are outstanding locally), I think a decent
heuristic could be to always try to send at least one packet's worth of data
at once (something like ~1400 bytes), even if that requires copying some of
the input data. It might not go out on its own, but it should make it
reasonably unlikely that we end up with tiny packets.


Greetings,

Andres Freund


