Re: Flushing large data immediately in pqcomm - Mailing list pgsql-hackers
From | Andres Freund
---|---
Subject | Re: Flushing large data immediately in pqcomm
Date |
Msg-id | 20240201032442.v3vd52kzu3hynamf@awork3.anarazel.de
In response to | Re: Flushing large data immediately in pqcomm (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: Flushing large data immediately in pqcomm
List | pgsql-hackers
Hi,

On 2024-01-31 14:57:35 -0500, Robert Haas wrote:
> > You're right and I'm open to doing more legwork. I'd also appreciate any
> > suggestion about how to test this properly and/or useful scenarios to
> > test. That would be really helpful.
>
> I think experimenting to see whether the long-short-long-short
> behavior that Heikki postulated emerges in practice would be a really
> good start.
>
> Another experiment that I think would be interesting is: suppose you
> create a patch that sends EVERY message without buffering and compare
> that to master. My naive expectation would be that this will lose if
> you pump short messages through that connection and win if you pump
> long messages through that connection. Is that true? If yes, at what
> point do we break even on performance? Does it depend on whether the
> connection is local or over a network? Does it depend on whether it's
> with or without SSL? Does it depend on Linux vs. Windows vs.
> whateverBSD? What happens if you twiddle the 8kB buffer size up or,
> say, down to just below the Ethernet frame size?

I feel like you're setting too high a bar for something that can be a
pretty clear improvement on its own, without a downside. The current
behaviour is pretty absurd; doing all this research across all platforms
isn't going to disprove that, and it's a lot of work. ISTM we can
analyze this easily enough without taking concrete hardware into
account.

One thing that I haven't seen mentioned here that's relevant around
using small buffers: Postgres uses TCP_NODELAY, and has to do so. That
means doing tiny sends can hurt substantially.

> I think that what we really want to understand here is under what
> circumstances the extra layer of buffering is a win vs. being a loss.

It's quite easy to see that doing no buffering isn't viable - we end up
with tiny tiny TCP packets, one for each send(). And then there's the
syscall overhead.

Here's a quickly thrown together benchmark using netperf. First with -D,
which instructs it to use TCP_NODELAY, as we do.

10gbit network, remote host:

$ (fields="request_size,throughput"; echo "$fields"; for i in $(seq 0 16); do s=$((2**$i)); netperf -P0 -t TCP_STREAM -l1 -Halap5-10gbe -- -r $s,$s -D 1 -o "$fields"; done) | column -t -s,

request_size  throughput
1             22.73
2             45.77
4             108.64
8             225.78
16            560.32
32            1035.61
64            2177.91
128           3604.71
256           5878.93
512           9334.70
1024          9031.13
2048          9405.35
4096          9334.60
8192          9275.33
16384         9406.29
32768         9385.52
65536         9399.40

localhost:

request_size  throughput
1             2.76
2             5.10
4             9.89
8             20.51
16            43.42
32            87.13
64            173.72
128           343.70
256           647.89
512           1328.79
1024          2550.14
2048          4998.06
4096          9482.06
8192          17130.76
16384         29048.02
32768         42106.33
65536         48579.95

I'm slightly baffled by the poor performance of localhost with tiny
packet sizes. Ah, I see - it's the NODELAY. Without that:

localhost:

request_size  throughput
1             32.02
2             60.58
4             114.32
8             262.71
16            558.42
32            1053.66
64            2099.39
128           3815.60
256           6566.19
512           11751.79
1024          18976.11
2048          27222.99
4096          33838.07
8192          38219.60
16384         39146.37
32768         44784.98
65536         44214.70

NODELAY triggers many more context switches, because data is immediately
available to the receiving side, whereas with a real network the
interrupts get coalesced.

I think that's pretty clear evidence that we need buffering. But I think
we can probably be smarter than we are right now, and than what's been
proposed in the patch. Because of TCP_NODELAY we shouldn't send a tiny
buffer on its own; it may trigger sending a small TCP packet, which is
quite inefficient.

While not perfect - e.g. because networks might use jumbo packets /
large MTUs, and we don't know how many outstanding bytes there are
locally - I think a decent heuristic could be to always try to send at
least one packet's worth of data at once (something like ~1400 bytes),
even if that requires copying some of the input data. It might not be
sent on its own, but it should make it reasonably unlikely to end up
with tiny tiny packets.
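To make that concrete, here's a minimal sketch of such a coalescing send
path. It is illustrative only: PqSendState, PQ_COALESCE_THRESHOLD and
pq_put_bytes() are names invented for this sketch, not actual pqcomm.c
code, and it assumes TCP_NODELAY has already been enabled on the socket
(as postgres does, via setsockopt() with IPPROTO_TCP/TCP_NODELAY).

/*
 * Sketch of a send path that tries to hand the kernel at least
 * ~one packet worth of payload per send().  Hypothetical names;
 * real code would also need to handle EINTR/EAGAIN, SSL, etc.
 */
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

#define PQ_BUF_SIZE            8192
#define PQ_COALESCE_THRESHOLD  1400    /* roughly one ethernet frame's payload */

typedef struct PqSendState
{
    int     sock;               /* TCP_NODELAY already set on this socket */
    char    buf[PQ_BUF_SIZE];
    size_t  used;
} PqSendState;

static int
pq_flush_buf(PqSendState *st)
{
    size_t  off = 0;

    while (off < st->used)
    {
        ssize_t n = send(st->sock, st->buf + off, st->used - off, 0);

        if (n < 0)
            return -1;
        off += (size_t) n;
    }
    st->used = 0;
    return 0;
}

static int
pq_put_bytes(PqSendState *st, const char *data, size_t len)
{
    /* Small writes always go through the buffer. */
    if (st->used + len <= PQ_BUF_SIZE)
    {
        memcpy(st->buf + st->used, data, len);
        st->used += len;
        return 0;
    }

    /*
     * Large write.  If the buffered tail is below the threshold, top it
     * up from the input first, so the flush below doesn't emit a tiny
     * packet on its own - this is the "copy some of the input" case.
     */
    if (st->used > 0 && st->used < PQ_COALESCE_THRESHOLD)
    {
        size_t  fill = PQ_COALESCE_THRESHOLD - st->used;

        if (fill > len)
            fill = len;
        memcpy(st->buf + st->used, data, fill);
        st->used += fill;
        data += fill;
        len -= fill;
    }

    if (pq_flush_buf(st) < 0)
        return -1;

    /* Re-buffer a small remainder rather than sending a tiny packet. */
    if (len < PQ_COALESCE_THRESHOLD)
    {
        memcpy(st->buf, data, len);
        st->used = len;
        return 0;
    }

    /* Remainder is large; send it directly, bypassing the buffer. */
    while (len > 0)
    {
        ssize_t n = send(st->sock, data, len, 0);

        if (n < 0)
            return -1;
        data += n;
        len -= (size_t) n;
    }
    return 0;
}

Whether ~1400 is the right threshold obviously depends on the MTU; the
point is just to never hand the kernel a sub-frame-sized trailing send().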
Greetings,

Andres Freund