Re: Syncrep and improving latency due to WAL throttling - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Syncrep and improving latency due to WAL throttling
Msg-id: 523faff5-f265-b704-85df-4835a02373b1@enterprisedb.com
In response to: Re: Syncrep and improving latency due to WAL throttling (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses: Re: Syncrep and improving latency due to WAL throttling
List: pgsql-hackers
Hi,

I keep getting occasional complaints about the impact of large/bulk transactions on the latency of small OLTP transactions, so I'd like to revive this thread a bit and move it forward. Attached is a rebased v3, followed by a 0002 patch with some review comments, missing comments and minor tweaks. More about that later ...

It's been a couple of months, and there was a fair amount of discussion and changes earlier, so I guess it makes sense to post a summary, stating the purpose (and scope), and then go through the various open questions etc.

goals
-----

The goal is to limit the impact of large transactions (producing a lot of WAL) on small OLTP transactions, in a cluster with a sync replica.

Imagine a backend executing single-row inserts, or something like that. The commit will wait for the replica to confirm the WAL, which may be expensive, but it's well determined by the network roundtrip.

But then a large transaction comes and inserts a lot of WAL (imagine a COPY that inserts 100MB of data, a VACUUM, a CREATE INDEX and so on). A small transaction may insert its COMMIT record right after this WAL chunk, and locally that's (mostly) fine. But with the sync replica it's much worse - we don't send WAL until it's flushed locally, and then we need to wait for the WAL to be sent, applied and confirmed by the replica. This takes time (depending on the bandwidth), and it may not happen until the small transaction does COMMIT (because we may not flush WAL from an in-progress transaction very often).

Jakub Wartak presented some examples of the impact when he started this thread, and it can be rather bad, particularly for latency-sensitive applications. I plan to do more experiments with the current patch, but I don't have the results yet.

scope
-----

Now, let's talk about scope - what the patch does not aim to do. The patch is explicitly intended for syncrep clusters, not async ones. There have been proposals to also support throttling for async replicas, logical replication etc.
I suppose all of that could be implemented, and I do see the benefit of defining some sort of maximum lag even for async replicas. But the agreement was to focus on the syncrep case, where it's particularly painful, and perhaps extend it in the future.

I believe adding throttling for physical async replication should not be difficult - in principle we need to determine how far the replica got, and compare it to the local LSN. But there's likely complexity in defining which async replicas to look at, inventing a sensible way to configure this, etc. It'd be helpful if people interested in that feature took a look at this patch and tried extending it.

It's not clear to me what to do about disconnected replicas, though. We may not even know about them if there's no slot (and how would we know what the slot is for?). So maybe this would need a new GUC listing the interesting replicas, and all of them would need to be connected. But that's an availability issue, because then all replicas need to be connected.

I'm not sure about logical replication, but I guess we could treat it similarly to async. What I think would need to be different is the handling of small transactions. For syncrep we automatically wait for those at commit, which means automatic throttling. But for async (and logical) replication, it's trivial to cause ever-increasing lag with only tiny transactions, thanks to the single-process replay, so maybe we'd need to throttle those too. (The recovery prefetching improved this for async quite a bit, ofc.)

implementation
--------------

The implementation is fairly straightforward, and happens in two places. XLogInsertRecord() decides whether throttling might be needed for this backend, and then HandleXLogDelayPending() does the wait.

XLogInsertRecord() checks if the backend has produced a certain amount of WAL (might be 1MB, for example). We do this because we don't want to do the expensive stuff in HandleXLogDelayPending() too often (e.g. after every XLOG record).
HandleXLogDelayPending() picks a suitable LSN, flushes it, and then also waits for the sync replica, as if it were a commit. This limits the lag, i.e. the amount of WAL that a small transaction will need to wait for to be replicated and confirmed by the replica.

There was a fair amount of discussion about how to pick the LSN. I think the agreement is we certainly can't pick the current LSN (because that would lead to write amplification for the partially filled page), and we probably even want to back off a bit more, to make it more likely the LSN is already flushed. So for example with the threshold set to 8MB we might go back 1MB, or something like that. That'd still limit the lag.

problems
--------

Now let's talk about some problems - both conceptual and technical (essentially review comments on the patch).

1) The goal of the patch is to limit the impact on latency, but the relationship between WAL amounts and latency may not be linear. But we don't have a good way to predict latency, and WAL lag is the only thing we have, so there's that. Ultimately, it's a best effort.

2) The throttling is per backend. That makes it simple, but it means it's hard to enforce a global lag limit. Imagine the limit is 8MB, and with a single backend that works fine - the lag should not exceed the 8MB value. But if there are N backends, the lag could be up to N times 8MB, I believe. That's a bit annoying, but I guess the only solution would be some autovacuum-like cost balancing, with all backends (or at least those running large stuff) doing the checks more often. I'm not sure we want to do that.

3) The actual throttling (flush and wait for syncrep) happens in ProcessInterrupts(), which mostly works, but it has two drawbacks:

* It may not happen "early enough" if the backend inserts a lot of XLOG records without processing interrupts in between.

* It may happen "too early" if the backend inserts enough WAL to need throttling (i.e. sets XLogDelayPending), but then after processing interrupts it would be busy with other stuff, not inserting more WAL.

I think ideally we'd do the throttling right before inserting the next XLOG record, but there's no convenient place for that, I think. We'd need to annotate a lot of places, etc. So maybe ProcessInterrupts() is a reasonable approximation. We may need to add CHECK_FOR_INTERRUPTS() to a couple more places, but that seems reasonable.

4) I'm not sure I understand why we need XactLastThrottledRecEnd. Why can't we just use XLogRecEnd?

5) I think the way XLogFlush() resets backendWalInserted is a bit wrong. Imagine a backend generates a fair amount of WAL, and then calls XLogFlush(lsn). Why is it OK to set backendWalInserted=0 when we don't know whether the generated WAL was before the "lsn"? I suppose we don't use very old lsn values for flushing, but I don't know if this drift could accumulate over time, or cause some other issues.

6) Why did XLogInsertRecord() skip SYNCHRONOUS_COMMIT_REMOTE_FLUSH?

7) I find the "synchronous_commit_wal_throttle_threshold" name annoyingly long, so I renamed it to just "wal_throttle_threshold". I've also renamed the GUC to "wal_throttle_after", and I wonder if maybe it should be in GUC_UNIT_BLOCKS, just like the other _after options? But those changes are more a matter of taste, feel free to ignore this.

missing pieces
--------------

The thing that's missing is that some processes (like aggressive anti-wraparound autovacuum) should not be throttled. If people set the GUC in postgresql.conf, I guess that'll affect those processes too, so we should probably explicitly reset the GUC for them. I wonder if there are other cases that should not be throttled.

tangents
--------

While discussing this with Andres a while ago, he mentioned a somewhat orthogonal idea - sending unflushed data to the replica.
We currently never send unflushed data to the replica, which makes sense because this data is not durable and would disappear if the primary crashes/restarts. But it also means there may be a fairly large chunk of WAL data that we need to send at COMMIT and wait for the confirmation of.

He suggested we might actually send the data to the replica, but the replica would know this data is not flushed yet and so would not run recovery on it, etc. At commit we could then just send a request to flush, without having to transfer the data at that moment.

I don't have a very good intuition about how large the effect would be, i.e. how much unflushed WAL data could accumulate on the primary (kilobytes/megabytes?), and how big the difference is between sending a couple of kilobytes and just a request to flush.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company