Re: Syncrep and improving latency due to WAL throttling - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Syncrep and improving latency due to WAL throttling
Msg-id: 523faff5-f265-b704-85df-4835a02373b1@enterprisedb.com
In response to: Re: Syncrep and improving latency due to WAL throttling (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses: Re: Syncrep and improving latency due to WAL throttling
List: pgsql-hackers
Hi,

I keep getting occasional complaints about the impact of large/bulk transactions on the latency of small OLTP transactions, so I'd like to revive this thread a bit and move it forward. Attached is a rebased v3, followed by a 0002 patch with some review comments, missing comments and minor tweaks. More about that later ...

It's been a couple of months, and there was a fair amount of discussion and changes earlier, so I guess it makes sense to post a summary, stating the purpose (and scope), and then go through the various open questions etc.

goals
-----

The goal is to limit the impact of large transactions (producing a lot of WAL) on small OLTP transactions, in a cluster with a sync replica.

Imagine a backend executing single-row inserts, or something like that. The commit will wait for the replica to confirm the WAL, which may be expensive, but it's well determined by the network roundtrip.

But then a large transaction comes and inserts a lot of WAL (imagine a COPY that inserts 100MB of data, a VACUUM, a CREATE INDEX and so on). A small transaction may insert its COMMIT record right after this WAL chunk, and locally that's (mostly) fine. But with the sync replica it's much worse - we don't send WAL until it's flushed locally, and then we need to wait for the WAL to be sent, applied and confirmed by the replica. This takes time (depending on the bandwidth), and it may not happen until the small transaction does COMMIT (because we may not flush WAL from an in-progress transaction very often).

Jakub Wartak presented some examples of the impact when he started this thread, and it can be rather bad, particularly for latency-sensitive applications. I plan to do more experiments with the current patch, but I don't have the results yet.

scope
-----

Now, let's talk about scope - what the patch does not aim to do. The patch is explicitly intended for syncrep clusters, not async ones. There have been proposals to also support throttling for async replicas, logical replication etc.
I suppose all of that could be implemented, and I do see the benefit of defining some sort of maximum lag even for async replicas. But the agreement was to focus on the syncrep case, where it's particularly painful, and perhaps extend it in the future.

I believe adding throttling for physical async replication should not be difficult - in principle we need to determine how far the replica got, and compare it to the local LSN. But there's likely complexity in defining which async replicas to look at, inventing a sensible way to configure this, etc. It'd be helpful if people interested in that feature took a look at this patch and tried extending it.

It's not clear to me what to do about disconnected replicas, though. We may not even know about them if there's no slot (and how would we know what the slot is for?). So maybe this would need a new GUC listing the interesting replicas, and all of them would need to be connected. But that's an availability issue, because then all replicas need to be connected.

I'm not sure about logical replication, but I guess we could treat it similarly to async. What I think would need to be different is the handling of small transactions. For syncrep we automatically wait for those at commit, which means automatic throttling. But for async (and logical) replication, it's trivial to cause ever-increasing lag with only tiny transactions, thanks to the single-process replay, so maybe we'd need to throttle those too. (The recovery prefetching improved this for async quite a bit, ofc.)

implementation
--------------

The implementation is fairly straightforward, and happens in two places. XLogInsertRecord() decides whether throttling might be needed for this backend, and then HandleXLogDelayPending() does the wait.

XLogInsertRecord() checks if the backend has produced a certain amount of WAL (might be 1MB, for example). We do this because we don't want to do the expensive stuff in HandleXLogDelayPending() too often (e.g. after every XLOG record).
HandleXLogDelayPending() picks a suitable LSN, flushes it, and then also waits for the sync replica, as if it were a commit. This limits the lag, i.e. the amount of WAL that a small transaction will need to wait for to be replicated and confirmed by the replica.

There was a fair amount of discussion about how to pick the LSN. I think the agreement is we certainly can't pick the current LSN (because that would lead to write amplification for the partially filled page), and we probably even want to back off a bit more, to make it more likely the LSN is already flushed. So for example with the threshold set to 8MB we might go back 1MB, or something like that. That'd still limit the lag.

problems
--------

Now let's talk about some problems - both conceptual and technical (essentially review comments on the patch).

1) The goal of the patch is to limit the impact on latency, but the relationship between WAL amounts and latency may not be linear. But we don't have a good way to predict latency, and WAL lag is the only thing we have, so there's that. Ultimately, it's a best effort.

2) The throttling is per backend. That makes it simple, but it means it's hard to enforce a global lag limit. Imagine the limit is 8MB, and with a single backend that works fine - the lag should not exceed the 8MB value. But if there are N backends, the lag could be up to N times 8MB, I believe. That's a bit annoying, but I guess the only solution would be some autovacuum-like cost balancing, with all backends (or at least those running large stuff) doing the checks more often. I'm not sure we want to do that.

3) The actual throttling (flush and wait for syncrep) happens in ProcessInterrupts(), which mostly works, but it has two drawbacks:

* It may not happen "early enough" if the backend inserts a lot of XLOG records without processing interrupts in between.

* It may happen "too early" if the backend inserts enough WAL to need throttling (i.e. sets XLogDelayPending), but then after processing interrupts it would be busy with other stuff, not inserting more WAL.

I think ideally we'd do the throttling right before inserting the next XLOG record, but there's no convenient place for that, I think. We'd need to annotate a lot of places, etc. So maybe ProcessInterrupts() is a reasonable approximation. We may need to add CHECK_FOR_INTERRUPTS() to a couple more places, but that seems reasonable.

4) I'm not sure I understand why we need XactLastThrottledRecEnd. Why can't we just use XLogRecEnd?

5) I think the way XLogFlush() resets backendWalInserted is a bit wrong. Imagine a backend generates a fair amount of WAL, and then calls XLogFlush(lsn). Why is it OK to set backendWalInserted=0 when we don't know whether the generated WAL was before the "lsn"? I suppose we don't use very old lsn values for flushing, but I don't know if this drift could accumulate over time, or cause some other issues.

6) Why did XLogInsertRecord() skip SYNCHRONOUS_COMMIT_REMOTE_FLUSH?

7) I find the "synchronous_commit_wal_throttle_threshold" name annoyingly long, so I renamed it to just "wal_throttle_threshold". I've also renamed the GUC to "wal_throttle_after", and I wonder if maybe it should be in GUC_UNIT_BLOCKS, just like the other _after options? But those changes are more a matter of taste, feel free to ignore this.

missing pieces
--------------

The thing that's missing is that some processes (like aggressive anti-wraparound autovacuum) should not be throttled. If people set the GUC in postgresql.conf, I guess that'll affect those processes too, so we should probably explicitly reset the GUC for them. I wonder if there are other cases that should not be throttled.

tangents
--------

While discussing this with Andres a while ago, he mentioned a somewhat orthogonal idea - sending unflushed data to the replica.
We currently never send unflushed data to the replica, which makes sense because this data is not durable and would disappear if the primary crashes/restarts. But it also means there may be a fairly large chunk of WAL data that we need to send at COMMIT and wait for the confirmation of.

He suggested we might actually send the data to the replica, but the replica would know this data is not flushed yet and so would not run recovery on it, etc. At commit we could then just send a request to flush, without having to transfer the data at that moment.

I don't have a very good intuition about how large the effect would be, i.e. how much unflushed WAL data could accumulate on the primary (kilobytes/megabytes?), and how big the difference is between sending a couple of kilobytes and just a request to flush.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company