Re: Syncrep and improving latency due to WAL throttling - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Syncrep and improving latency due to WAL throttling
Date
Msg-id b1bacfeb-cfec-22f9-aae3-47c82ffde5e8@enterprisedb.com
In response to Re: Syncrep and improving latency due to WAL throttling  (Andres Freund <andres@anarazel.de>)
Responses Re: Syncrep and improving latency due to WAL throttling
List pgsql-hackers

On 11/8/23 18:11, Andres Freund wrote:
> Hi,
> 
> On 2023-11-08 13:59:55 +0100, Tomas Vondra wrote:
>>> I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit
>>> network (albeit with a crappy external card for my laptop), to put some
>>> numbers to this. I used -r $s,100 to test sending variable-sized data to the
>>> other side, with the other side always responding with 100 bytes (assuming
>>> that'd more than fit a feedback response).
>>>
>>> Command:
>>> fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields;
>>> for s in 10 100 1000 10000 100000 1000000; do
>>>   netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields"
>>> done
>>>
>>> 10gbe:
>>>
>>> request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
>>> 10              100             43              64.30           390             96              15526.084
>>> 100             100             57              75.12           428             122             13286.602
>>> 1000            100             47              74.41           270             108             13412.125
>>> 10000           100             89              114.63          712             152             8700.643
>>> 100000          100             167             255.90          584             312             3903.516
>>> 1000000         100             891             1015.99         2470            1143            983.708
>>>
>>>
>>> Same hosts, but with my workstation forced to use a 1gbit connection:
>>>
>>> request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
>>> 10              100             78              131.18          2425            257             7613.416
>>> 100             100             81              129.25          425             255             7727.473
>>> 1000            100             100             162.12          1444            266             6161.388
>>> 10000           100             310             686.19          1797            927             1456.204
>>> 100000          100             1006            1114.20         1472            1199            896.770
>>> 1000000         100             8338            8420.96         8827            8498            118.410
> 
> Looks like the 1gbit numbers were somewhat bogus-ified due to having
> configured jumbo frames and some network component doing something odd with
> that (handling them in software, maybe?).
> 
> 10gbe:
> request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
> 10              100             56              68.56           483             87              14562.476
> 100             100             57              75.68           353             123             13185.485
> 1000            100             60              71.97           391             94              13870.659
> 10000           100             58              92.42           489             140             10798.444
> 100000          100             184             260.48          1141            338             3834.504
> 1000000         100             926             1071.46         2012            1466            933.009
> 
> 1gbe
> request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
> 10              100             77              132.19          1097            257             7555.420
> 100             100             79              127.85          534             249             7810.862
> 1000            100             98              155.91          966             265             6406.818
> 10000           100             176             235.37          1451            314             4245.304
> 100000          100             944             1022.00         1380            1148            977.930
> 1000000         100             8649            8768.42         9018            8895            113.703
> 
> 
>>> I haven't checked, but I'd assume that 100 bytes back and forth should easily
>>> fit a new message to update LSNs and the existing feedback response. Even just
>>> the difference between sending 100 bytes and sending 10k (a bit more than a
>>> single WAL page) is pretty significant on a 1gbit network.
>>>
>>
>> I'm on decaf so I may be a bit slow, but it's not very clear to me what
>> conclusion to draw from these numbers. What is the takeaway?
>>
>> My understanding is that in both cases the latency is initially fairly
>> stable, independent of the request size. This applies to request up to
>> ~1000B. And then the latency starts increasing fairly quickly, even
>> though it shouldn't hit the bandwidth (except maybe the 1MB requests).
> 
> Except for the smallest end, these are bandwidth related, I think. Converting
> 1gbit/s to bytes/us is 125 bytes / us - before tcp/ip overhead. Even leaving
> the overhead aside, 10kB/100kB outstanding take ~80us/800us to send on
> 1gbit. If you subtract the minimum latency of about 130us, that's nearly all of
> the latency.
> 

Maybe I don't understand what you mean by "bandwidth related", but surely
the smaller requests are not limited by bandwidth. I mean, 100B, 1kB (and
even 10kB) requests have almost the same transaction rate, yet there's an
order of magnitude difference in the amount of data transferred (sure,
there's overhead, but surely not that much?).

On the higher end, sure, that seems bandwidth related. But for 100kB,
it's still just ~50% of the 1Gbps.
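As a back-of-envelope sanity check of the serialization arithmetic (a sketch using the quoted 1gbe numbers; it ignores framing and protocol overhead):

```python
# Rough check: how much of the measured 1gbe latency is raw serialization
# time? 1 Gbit/s is ~125 bytes/us, ignoring framing overhead.
BYTES_PER_US = 125  # 1 Gbit/s expressed in bytes per microsecond

# (request_size, measured mean latency in us) from the quoted 1gbe run
measured = [(100, 127.85), (10_000, 235.37),
            (100_000, 1022.00), (1_000_000, 8768.42)]

base = measured[0][1]  # small-request latency, ~fixed round-trip cost
for size, lat in measured:
    wire_us = size / BYTES_PER_US  # time to push the payload onto the wire
    print(f"{size:>9} B: measured {lat:8.2f} us, base + wire = {base + wire_us:8.2f} us")
```

On these numbers, base latency plus wire time accounts for most of the measured latency at 100kB and above, while at 100B and 1kB the wire time is negligible, which is consistent with the smaller requests being latency-bound rather than bandwidth-bound.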

> The reason this matters is that the numbers show that the latency of having to
> send a small message with updated positions is far smaller than having to send
> all the outstanding data. Even having to send a single WAL page over the
> network ~doubles the latency of the response on 1gbit!  Of course the impact
> is smaller on 10gbit, but even there latency substantially increases around
> 100kB of outstanding data.

Understood. I wonder if this is one of the things we'd need to measure to
adjust the write size (i.e. how eagerly to write WAL to disk / over the
network). Essentially, we'd determine the size at which latency starts
increasing much faster, and try to keep WAL writes below that.

I wonder if storage (not network) has a similar pattern.

> In a local pgbench with 32 clients I see WAL write sizes between 8kB and
> ~220kB. Being able to stream those out before the local flush completed
> therefore seems likely to reduce synchronous_commit overhead substantially.
> 

Yeah, those writes are certainly too large. If we can stream them out
earlier, and then send only small messages with the remaining bit and the
updated positions, that'd help a lot.

> 
>> I don't think it says we should be replicating WAL in tiny chunks,
>> because if you need to send a chunk of data it's always more efficient
>> to send it at once (compared to sending multiple smaller pieces).
> 
> I don't think that's a very large factor for network data, once your minimal
> data size is ~8kB (or ~4kB if we lower wal_block_size). TCP messages will
> get chunked into something smaller anyway, and small messages don't need to get
> acknowledged individually. Sending more data at once is good for CPU
> efficiency (reducing syscall and network device overhead), but doesn't do much
> for throughput.
> 
> Sending 4kB of data in each send() in a bandwidth oriented test already gets
> to ~9.3gbit/s in my network. That's close to the maximum attainable with normal
> framing. If I change the mtu back to 9000 I get 9.89 gbit/s, again very close
> to the theoretical max.
> 

Got it.
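For reference, those throughput figures line up with standard TCP-over-Ethernet framing overhead (a back-of-envelope sketch; assumes IPv4, TCP timestamps enabled, and the usual per-frame costs of preamble, header, FCS, and interframe gap):

```python
# Back-of-envelope theoretical TCP goodput over Ethernet. Per-frame wire
# overhead: preamble 8 + Ethernet header 14 + FCS 4 + interframe gap 12
# = 38 bytes; inside the frame: IP header 20 + TCP header 20 (+12 with
# the timestamps option).
def max_goodput_gbit(link_gbit, mtu, tcp_opts=12):
    payload = mtu - 20 - 20 - tcp_opts  # TCP payload bytes per frame
    on_wire = mtu + 38                  # bytes the wire actually carries
    return link_gbit * payload / on_wire

print(f"{max_goodput_gbit(10, 1500):.2f}")  # ~9.41 gbit/s, standard frames
print(f"{max_goodput_gbit(10, 9000):.2f}")  # ~9.90 gbit/s, jumbo frames
```

That matches the quoted measurements well: ~9.3 gbit/s observed with standard framing against a ~9.41 theoretical ceiling, and 9.89 gbit/s with MTU 9000 against ~9.90.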


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


