Re: Flushing large data immediately in pqcomm - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Flushing large data immediately in pqcomm
Msg-id: 20240202223827.5odxkvnh4x6coe7y@awork3.anarazel.de
In response to: Re: Flushing large data immediately in pqcomm (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Flushing large data immediately in pqcomm
List: pgsql-hackers
Hi,

On 2024-02-01 15:02:57 -0500, Robert Haas wrote:
> On Thu, Feb 1, 2024 at 10:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
> There was probably a better way to phrase this email ... the sentiment
> is sincere, but there was almost certainly a way of writing it that
> didn't sound like I'm super-annoyed.

NP - I could have phrased mine better as well...

> > On Wed, Jan 31, 2024 at 10:24 PM Andres Freund <andres@anarazel.de> wrote:
> > > While not perfect - e.g. because networks might use jumbo packets / large MTUs
> > > and we don't know how many outstanding bytes there are locally, I think a
> > > decent heuristic could be to always try to send at least one packet worth of
> > > data at once (something like ~1400 bytes), even if that requires copying some
> > > of the input data. It might not be sent on its own, but it should make it
> > > reasonably unlikely to end up with tiny tiny packets.
> >
> > I think that COULD be a decent heuristic but I think it should be
> > TESTED, including against the ~3 or so other heuristics proposed on
> > this thread, before we make a decision.
> >
> > I literally mentioned the Ethernet frame size as one of the things
> > that we should test whether it's relevant in the exact email to which
> > you're replying, and you replied by proposing that as a heuristic, but
> > also criticizing me for wanting more research before we settle on
> > something.

I mentioned the frame size thing because afaict nobody in the thread had
mentioned our use of TCP_NODELAY (which basically forces the kernel to send
out data immediately instead of waiting for further data to be sent). Without
that it'd be a lot less problematic to occasionally send data in small
increments in between larger sends. Nor would packet sizes be as relevant.

> > Are we just supposed to assume that your heuristic is better than the
> > others proposed here without testing anything, or, like, what? I don't
> > think this needs to be a completely exhaustive or exhausting process, but
> > I think trying a few different things out and seeing what happens is
> > smart.

I wasn't trying to say that my heuristic necessarily is better. What I was
trying to get at - and expressed badly - was that I doubt that testing can
get us all that far here. It's not too hard to test the effects of our
buffering with regard to syscall overhead, but once you actually take network
effects into account it gets quite hard. Bandwidth, latency, the specific
network hardware and operating systems involved all play a significant role.
Given how, uh, naive our current approach is, I think analyzing the situation
from first principles and then doing some basic validation of the results
makes more sense.

Separately, I think we shouldn't aim for perfect here. It's obviously
extremely inefficient to send a larger amount of data by memcpy()ing and
send()ing it in 8kB chunks. As mentioned by several folks upthread, we can
improve upon that without having worse behaviour than today.

Medium to long term I suspect we're going to want to use asynchronous network
interfaces, in combination with zero-copy sending, which requires larger
changes. That's not that relevant for things like query results, but quite
relevant for base backups etc.

It's perhaps also worth mentioning that the small send buffer isn't great for
SSL performance, as the encryption overhead increases when sending in small
chunks.
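To make the coalescing idea concrete, here is a rough sketch - not Melih's
actual patch and not the real pqcomm code; send_large(), the buffer names,
and the 1400-byte constant are made-up placeholders, and error handling /
partial writes are omitted:

  /*
   * Sketch: when a large payload arrives and some data is already buffered,
   * copy just enough of the payload's start into the buffer so the flush is
   * at least roughly one packet worth of bytes, then send the rest of the
   * payload directly without another copy.  With TCP_NODELAY set, each
   * send() can become its own packet, which is why tiny trailing sends are
   * worth avoiding.
   */
  #include <string.h>
  #include <sys/types.h>
  #include <sys/socket.h>

  #define PQ_SEND_BUFFER_SIZE 8192
  #define MIN_PACKET_PAYLOAD  1400

  static char   PqSendBuffer[PQ_SEND_BUFFER_SIZE];
  static size_t PqSendPointer;      /* bytes currently buffered */

  static ssize_t
  send_large(int sock, const char *data, size_t len)
  {
      /* top up the pending data with a prefix of the new payload */
      if (PqSendPointer > 0 && PqSendPointer < MIN_PACKET_PAYLOAD)
      {
          size_t copy = MIN_PACKET_PAYLOAD - PqSendPointer;

          if (copy > len)
              copy = len;
          memcpy(PqSendBuffer + PqSendPointer, data, copy);
          PqSendPointer += copy;
          data += copy;
          len -= copy;
      }

      /* flush pending data plus the copied prefix in one send() */
      if (PqSendPointer > 0)
      {
          if (send(sock, PqSendBuffer, PqSendPointer, 0) < 0)
              return -1;
          PqSendPointer = 0;
      }

      /* send the remainder of the large payload directly, no extra copy */
      if (len > 0)
          return send(sock, data, len, 0);
      return 0;
  }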
I hacked up Melih's patch to send the pending data together with the first
bit of the large "to be sent" data and also added a patch to increase
SINK_BUFFER_LENGTH by 16x. With a 12GB database I tested the time for

  pg_basebackup -c fast -Ft --compress=none -Xnone -D - -d "$conn" > /dev/null

                        time via
  test                  unix      tcp       tcp+ssl
  master                6.305s    9.436s    15.596s
  master-larger-buffer  6.535s    9.453s    15.208s
  patch                 5.900s    7.465s    13.634s
  patch-larger-buffer   5.233s    5.439s    11.730s

The increase when using tcp is pretty darn impressive. If I had remembered in
time to disable manifest checksums, the win would have been even bigger.

The bottleneck for SSL is that it still ends up with ~16kB sends, not sure
why.

Greetings,

Andres Freund