Re: Syncrep and improving latency due to WAL throttling - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Syncrep and improving latency due to WAL throttling
Msg-id: 20231108171131.uwcerw5vpxsbj4ea@awork3.anarazel.de
In response to: Re: Syncrep and improving latency due to WAL throttling (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Syncrep and improving latency due to WAL throttling
List: pgsql-hackers
Hi,

On 2023-11-08 13:59:55 +0100, Tomas Vondra wrote:
> > I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit
> > network (albeit with a crappy external card for my laptop), to put some
> > numbers to this. I used -r $s,100 to test sending variable sized data to the
> > other side, with the other side always responding with 100 bytes (assuming
> > that'd more than fit a feedback response).
> >
> > Command:
> > fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields;for s in 10 100 1000 10000 100000 1000000;do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields";done
> >
> > 10gbe:
> >
> > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
> > 10           100           43          64.30        390         96          15526.084
> > 100          100           57          75.12        428         122         13286.602
> > 1000         100           47          74.41        270         108         13412.125
> > 10000        100           89          114.63       712         152         8700.643
> > 100000       100           167         255.90       584         312         3903.516
> > 1000000      100           891         1015.99      2470        1143        983.708
> >
> >
> > Same hosts, but with my workstation forced to use a 1gbit connection:
> >
> > request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
> > 10           100           78          131.18       2425        257         7613.416
> > 100          100           81          129.25       425         255         7727.473
> > 1000         100           100         162.12       1444        266         6161.388
> > 10000        100           310         686.19       1797        927         1456.204
> > 100000       100           1006        1114.20      1472        1199        896.770
> > 1000000      100           8338        8420.96      8827        8498        118.410

Looks like the 1gbit numbers were somewhat bogus-ified due to having configured
jumbo frames and some network component doing something odd with that (handling
them in software, maybe?).

10gbe:

request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
10           100           56          68.56        483         87          14562.476
100          100           57          75.68        353         123         13185.485
1000         100           60          71.97        391         94          13870.659
10000        100           58          92.42        489         140         10798.444
100000       100           184         260.48       1141        338         3834.504
1000000      100           926         1071.46      2012        1466        933.009

1gbe:

request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
10           100           77          132.19       1097        257         7555.420
100          100           79          127.85       534         249         7810.862
1000         100           98          155.91       966         265         6406.818
10000        100           176         235.37       1451        314         4245.304
100000       100           944         1022.00      1380        1148        977.930
1000000      100           8649        8768.42      9018        8895        113.703

> > I haven't checked, but I'd assume that 100 bytes back and forth should easily
> > fit a new message to update LSNs and the existing feedback response. Even just
> > the difference between sending 100 bytes and sending 10k (a bit more than a
> > single WAL page) is pretty significant on a 1gbit network.
> >
>
> I'm on decaf so I may be a bit slow, but it's not very clear to me what
> conclusion to draw from these numbers. What is the takeaway?
>
> My understanding is that in both cases the latency is initially fairly
> stable, independent of the request size. This applies to requests up to
> ~1000B. And then the latency starts increasing fairly quickly, even
> though it shouldn't hit the bandwidth (except maybe the 1MB requests).

Except for the smallest end, these are bandwidth related, I think. Converting
1gbit/s to bytes/us gives 125 bytes/us - before tcp/ip overhead. Even leaving
the overhead aside, 10kB/100kB outstanding take ~80us/800us to send on 1gbit.
If you subtract the minimum latency of about 130us, that's nearly all of the
latency.

The reason this matters is that the numbers show that the latency of having to
send a small message with updated positions is far smaller than having to send
all the outstanding data.
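Spelled out, that transmission-time arithmetic looks like the following quick
sketch (assuming raw line rate and ignoring TCP/IP framing overhead):

    # serialization delay for n outstanding bytes, assuming raw line rate
    # and ignoring TCP/IP framing overhead
    BYTES_PER_US_PER_GBIT = 125   # 1 gbit/s = 125 bytes per microsecond

    def serialization_us(nbytes, gbits):
        return nbytes / (BYTES_PER_US_PER_GBIT * gbits)

    for nbytes in (100, 10_000, 100_000, 1_000_000):
        print(nbytes, serialization_us(nbytes, 1), serialization_us(nbytes, 10))

    # 10kB -> ~80us and 100kB -> ~800us on 1gbit; adding the ~130us latency
    # floor accounts for nearly all of the measured mean latencies above.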
Even having to send a single WAL page over the network ~doubles the latency of
the response on 1gbit! Of course the impact is smaller on 10gbit, but even
there latency substantially increases around 100kB of outstanding data.

In a local pgbench with 32 clients I see WAL write sizes between 8kB and
~220kB. Being able to stream those out before the local flush completes
therefore seems likely to reduce synchronous_commit overhead substantially.

> I don't think it says we should be replicating WAL in tiny chunks,
> because if you need to send a chunk of data it's always more efficient
> to send it at once (compared to sending multiple smaller pieces).

I don't think that's a very large factor for network data, once your minimal
data size is ~8kB (or ~4kB if we lower wal_block_size). TCP messages will get
chunked into something smaller anyway, and small messages don't need to get
acknowledged individually. Sending more data at once is good for CPU efficiency
(reducing syscall and network device overhead), but doesn't do much for
throughput. Sending 4kB of data in each send() in a bandwidth oriented test
already gets to ~9.3gbit/s in my network. That's close to the maximum
attainable with normal framing. If I change the mtu back to 9000 I get
9.89 gbit/s, again very close to the theoretical max.
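For reference, the "theoretical max with normal framing" works out roughly as
in the following sketch (assuming standard Ethernet per-frame overhead plus a
20-byte IPv4 header and a 32-byte TCP header with timestamps):

    # rough TCP goodput over 10GbE for a given MTU, assuming standard
    # Ethernet overhead (14B header + 4B FCS + 8B preamble + 12B interframe
    # gap), a 20B IPv4 header and a 32B TCP header with timestamps
    def goodput_gbit(mtu, line_rate_gbit=10):
        wire_bytes = mtu + 14 + 4 + 8 + 12   # bytes on the wire per frame
        payload = mtu - 20 - 32              # TCP payload per frame
        return line_rate_gbit * payload / wire_bytes

    print(goodput_gbit(1500))   # ~9.41 gbit/s with normal framing
    print(goodput_gbit(9000))   # ~9.90 gbit/s with an mtu of 9000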
Greetings,

Andres Freund