Re: Proposal: "Causal reads" mode for load balancing reads without stale data - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Msg-id: CAEepm=0eMpTy1a8b=FVtGZBFxnQkFoO1dU0ib=UrbKMwkUMKqg@mail.gmail.com
In response to: Re: Proposal: "Causal reads" mode for load balancing reads without stale data (Michael Paquier <michael.paquier@gmail.com>)
Responses: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
List: pgsql-hackers
On Thu, Mar 3, 2016 at 7:34 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Finally I got my eyes on this patch.

Thanks, I really appreciate your time and interest. Your feedback gave me plenty to chew on. See below for replies to both your recent emails.

> To put it short, this patch introduces two different concepts:
> - addition of a new value, remote_apply, for synchronous_commit, which
> is actually where things overlap a bit with the N-sync patch, because
> by combining the addition of remote_apply with the N-sync patch, it is
> possible to ensure that an LSN is applied to multiple targets instead
> of one now. In this case, as the master will wait for the LSN to be
> applied on the sync standby, there is no need for the application to
> have error handling in case a read transaction is happening on the
> standby as the change is already visible on the standby when
> committing it on master. However the cost here is that if the standby
> node lags behind, it puts some extra wait as well on the master side.
> - causal reads, which makes the master able to drop standbys that are
> lagging too much behind and let callers of standbys know that it is
> lagging too much behind and cannot satisfy causal reads. In this case
> error handling is needed by the application in case a standby is
> lagging too much behind.
>
> That's one of my concerns about this patch now: it is trying to do too
> much. I think that there is definitely a need for both things:
> applications may be fine to pay the lagging price when remote_apply is
> used by not having an extra error handling layer, or they cannot
> accept a perhaps large amount of lag and are ready to have an extra error
> handling layer to manage read failures on standbys. So I would
> recommend a split to begin with:
> 1) Addition of remote_apply in synchronous_commit, which would be
> quite a simple patch, and I am convinced that there are benefits in
> having that. Compared to the previous patch sent, a standby message is
> sent back to the master once COMMIT has been applied, accelerating
> things a bit.
> 2) Patch for causal reads, with all its extra parametrization logic
> and stuff to select standby candidates.

Agreed. I have split this work up into four patches which apply on top of each other and provide something (hopefully) useful at each stage. Namely:

0001-remote-apply.patch:

This adds synchronous_commit = remote_apply. It works by setting a new bit in commit records to say that the primary requests feedback when the record is applied. When the recovery process sees this bit, it wakes the walreceiver process and asks it to send a reply to the primary ASAP.

0002-replay-lag.patch:

This adds a time-based estimate of the current apply lag on each standby. The value is visible in pg_stat_replication. It uses a combination of commit timestamps and periodic LSN->time samples from keepalive messages, so that replay lag is computed even when there are no commits, for example during bulk data loads. This builds on the previous patch, because it makes use of the mechanism whereby the recovery process can ask the walreceiver to send replies whenever it applies WAL records for which it has timestamps.

0003-refactor-syncrep-exit.patch:

This moves the code that handles syncrep wait cancellation into a function, as you suggested. It is conceptually independent of the previous two patches, but is presented as a patch on top of them and separately from causal-reads.patch for ease of review.
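To make the mechanism introduced by 0001 (and reused by 0002) a bit more concrete, here is a rough, pseudocode-level sketch of the flow; apart from XACT_COMPLETION_SYNC_APPLY_FEEDBACK and XLogRequestWalReceiverReply(), which come up again below, the names here are illustrative rather than the patch's actual code:

    /* Primary, while assembling a commit record: request apply feedback. */
    if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY)
        xinfo |= XACT_COMPLETION_SYNC_APPLY_FEEDBACK;   /* new xinfo bit */

    /* Standby, recovery process, after redoing a commit record: */
    if (xinfo & XACT_COMPLETION_SYNC_APPLY_FEEDBACK)
        XLogRequestWalReceiverReply();   /* wake the walreceiver with a signal */

    /*
     * Standby, walreceiver, when woken: send an unsolicited reply reporting
     * the current apply position, so the primary's walsender can release
     * syncrep waiters as soon as the commit has been applied here.
     */
    send_apply_position_reply();         /* illustrative name only */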
0004-causal-reads.patch:

This adds the causal reads feature, building on top of the other three patches. From the remote-apply patch it uses the mechanism for asking for apply feedback. From the replay-lag patch it uses the mechanism for tracking apply lag. It also uses the syncrep wait cancellation code.

> Here is as well a more detailed review...
>
> Regression tests are failing, rules.sql is generating diffs because
> pg_stat_replication is changed.

Fixed, thanks.

> CausalReadsWaitForLSN() should be called for 2PC I think, for PREPARE,
> COMMIT PREPARED and ROLLBACK PREPARED. WAL replay for 2PC should also
> call XLogRequestWalReceiverReply() when needed.

Ah yes, agreed for COMMIT PREPARED. Fixed. But why for PREPARE TRANSACTION or ROLLBACK PREPARED? Those don't generate visible effects that others might expect to see on a standby, so I don't think that is necessary.

> The new parameters are missing from postgresql.conf.sample.

Fixed, thanks.

> +static bool
> +SyncRepCheckEarlyExit(void)
> +{
> Doing this refactoring would actually make sense as a separate patch I
> think, and that would simplify the core patch for causal reads.

Agreed -- see 0003-refactor-syncrep-exit.patch.

> +For this reason we say that the causal reads guarantee only holds as
> +long as the absolute difference between the system clocks of the
> +machines is no more than max_clock_skew. The theory is that NTP makes
> +it possible to reason about the maximum possible clock difference
> +between machines and choose a value that allows for a much larger
> +difference. However, we do make a best effort attempt to detect
> +misconfigured systems as described above, to catch the case of servers
> +not running a correctly configured ntp daemon, or with a clock so
> +far out of whack that ntp refuses to fix it.
>
> Just wondering how this reacts when standby and master are on
> different timezones. I know of two ways to measure WAL lag: one is
> what you are doing, by using a timestamp and relying on the two servers
> to be in sync at clock level. The second is in bytes, as a WAL
> quantity, say max_wal_lag or similar, expressed as the number of WAL
> segments the standby is lagging behind; it is less user-friendly to
> set up, but the concerns regarding clock sync across nodes go away.
> To put it short, I find the timestamp approach easy to set up and
> understand for the user, but not really solid as it depends on the
> state of the clocks on different hosts, while a difference between
> flush and apply LSN positions is a quite independent concept. So
> basically what would be used as a base comparison is not the
> timestamp of the transaction commit but the flush LSN at the moment
> the commit has been replayed.

I'm not actually using either of those approaches. Postgres already periodically captures (time, WAL end) pairs on the primary and feeds them to the standby. With this patch set, the standby records these samples in a circular buffer, and when the recovery process eventually applies sampled WAL positions, it feeds the associated timestamps back to the primary in unsolicited keepalive reply messages. The primary can then judge how long it took the standby to apply the sampled WAL position. This doesn't suffer from clock skew problems, because it compares two timestamps obtained from the system clock on the same box. On the other hand, the time reported includes an extra network latency term, because the reply message has to reach the primary.
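To illustrate the bookkeeping, here is a standalone toy model; it is not the patch's actual data structures or names, and in the patch the two ends run in different processes (walreceiver and recovery), so plain static variables like these wouldn't do:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t XLogRecPtrSketch;      /* stand-in for XLogRecPtr */
    typedef int64_t  TimestampTzSketch;     /* stand-in for TimestampTz */

    #define LAG_SAMPLES 1024                /* arbitrary size for the sketch */

    typedef struct
    {
        XLogRecPtrSketch  lsn;
        TimestampTzSketch time;
    } LagSample;

    static LagSample samples[LAG_SAMPLES];
    static int write_pos = 0;               /* next slot to write */
    static int read_pos = 0;                /* oldest sample not yet applied */
    static int nsamples = 0;

    /* Walreceiver side: record a (time, WAL end) sample from the primary. */
    static void
    lag_sample_insert(XLogRecPtrSketch lsn, TimestampTzSketch time)
    {
        samples[write_pos].lsn = lsn;
        samples[write_pos].time = time;
        write_pos = (write_pos + 1) % LAG_SAMPLES;
        if (nsamples < LAG_SAMPLES)
            nsamples++;
        else
            read_pos = write_pos;           /* buffer full: oldest sample was overwritten */
    }

    /*
     * Recovery side: after applying WAL up to 'applied', retire any samples at
     * or below that point.  Returns true and sets *time if a newly retired
     * timestamp should be reported back to the primary.
     */
    static bool
    lag_sample_retire(XLogRecPtrSketch applied, TimestampTzSketch *time)
    {
        bool found = false;

        while (nsamples > 0 && samples[read_pos].lsn <= applied)
        {
            *time = samples[read_pos].time;
            read_pos = (read_pos + 1) % LAG_SAMPLES;
            nsamples--;
            found = true;
        }
        return found;
    }

    int
    main(void)
    {
        TimestampTzSketch t;

        lag_sample_insert(1000, 500);       /* primary saw WAL end 1000 at time 500 */
        lag_sample_insert(2000, 900);       /* ... and WAL end 2000 at time 900 */

        if (lag_sample_retire(1500, &t))    /* recovery has applied up to LSN 1500 */
            printf("feed back timestamp %lld\n", (long long) t);   /* prints 500 */

        return 0;
    }

The primary then subtracts the fed-back timestamp from its own current clock reading to get the apply lag estimate shown in pg_stat_replication.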
Including that latency might be a good thing, depending on your purpose, since the reported number then reflects the time you'd actually be waiting for syncrep/causal reads on that standby.

I had to implement this because, in an early prototype of the causal reads feature, the system would choke on large data load transactions even though the standbys were able to keep up. Lag could only be measured when commit records came along, so a very big data load was indistinguishable from a system not applying fast enough, and the standby would drop to the causal reads "unavailable" state spuriously. I considered interpolating WAL positions between commit records, but couldn't see how to make it work. I considered injecting extra timestamps into the WAL periodically, but I didn't need anything persistent. I just needed a way to collect occasional (time, LSN) samples in memory and feed the times back at the appropriate moments as the WAL is replayed.

> + /*
> +  * If a causal_reads_timeout is configured, it is used instead of
> +  * wal_sender_timeout.  Ideally we'd use causal_reads_timeout / 2 +
> +  * allowance for network latency, but since walreceiver can become quite
> +  * bogged down fsyncing WAL we allow more tolerance.  (This could be
> +  * tightened up once standbys hand writing off to the WAL writer).
> +  */
> + if (causal_reads_timeout != 0)
> +     allowed_time = causal_reads_timeout;
> + else
> +     allowed_time = wal_sender_timeout;
> I find that surprising, for two reasons:
> 1) it seems to me that causal_reads_timeout has in concept no relation
> to WAL sender process control.
> 2) A standby should still be able to receive WAL even if it cannot
> satisfy causal reads, to give it a chance to catch up faster on the
> amount it is late.

This is only used to detect dead/disconnected/unreachable standbys, not to handle standbys that are merely not keeping up with replication traffic. To get disconnected this way, a standby has to stop responding to keepalive pings.

When dealing with a causal reads standby, we want to cap the time we're prepared to hang around waiting for a zombie standby before nuking it. Without that change, we would stop waiting for slow standbys (not applying fast enough) in a timely fashion, but crashed or disconnected standbys would make our commits hang for the potentially much longer wal_sender_timeout, which seems silly. Zombie servers shouldn't be able to interfere with your commits for longer than lagging servers can.

> -    SYNCHRONOUS_COMMIT_REMOTE_FLUSH     /* wait for local and remote flush */
> +    SYNCHRONOUS_COMMIT_REMOTE_FLUSH,    /* wait for local and remote flush */
> +    SYNCHRONOUS_COMMIT_REMOTE_APPLY,    /* wait for local flush and remote
> +                                         * apply */
> +    SYNCHRONOUS_COMMIT_CONSISTENT_APPLY /* wait for local flusha and remote
> +                                         apply with causal consistency */
> SYNCHRONOUS_COMMIT_CONSISTENT_APPLY is used nowhere, and there is a
> typo s/flusha/flush a/.
>
> I am still digging into the patch, the available/joining/unavailable
> logic being quite interesting.

I hope the new patch set makes that job easier...

On Thu, Mar 3, 2016 at 8:00 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Hm. Looking now at
> http://www.postgresql.org/message-id/CANP8+j+jCpNoOjc-KQLtt4PDyOX2Sq6wYWqCSy6aaHWkvNa0hw@mail.gmail.com
> it would be nice to get a clear solution for it first, though the use
> of signals to wake up the WAL receiver and enforce it to send a new
> LSN apply position back to the master to unlock it asap does not look
> very appealing.
> Seeing that no patch has been sent for 9.6 regarding
> that, it would be better to simply drop this code from the causal-read
> patch perhaps...

The signalling mechanism is still present. Since that discussion I had the idea of using a different (non-overloaded) signal and unblocking it only while actually waiting for the network, so there should be no signal delivery interrupting blocking disk IO operations. But I certainly agree that the disk IO should be moved out of the walreceiver process and into the WAL writer process, for several reasons. This patch series doesn't need that in order to work, though.

Here's a recap of how the signal is used, in the context of the communication introduced by each patch:

0001-remote-apply.patch:

* the primary sets a new bit, XACT_COMPLETION_SYNC_APPLY_FEEDBACK, in commit records' xinfo when synchronous_commit = remote_apply
* the WAL record is transported walsender->walreceiver as usual
* the recovery process eventually applies the commit record and, when it sees that bit, wakes the walreceiver with SIGUSR2
* when the walreceiver receives the signal, it generates a keepalive reply to report the latest apply position
* the walsender uses that information to release syncrep waiters (just as it always did)

0002-replay-lag.patch:

* the walreceiver records (time, LSN) pairs into a circular buffer as they arrive
* the recovery process, when it discovers that it has replayed an LSN associated with a time in the circular buffer, uses the same signal as 0001-remote-apply.patch to ask the walreceiver to report this progress
* the walreceiver includes the last replay timestamp in a new field added to the reply message

0004-causal-reads.patch:

* reuses the above to implement causal_reads mode: XACT_COMPLETION_SYNC_APPLY_FEEDBACK is now also set if causal_reads is on, because it too needs to know when commit records are applied
* the rest of the patch implements the new state machine, lease control and waiting machinery

Some alternatives to that signal could include: a pipe between recovery and walreceiver that could be multiplexed with network IO (which may be tricky to implement cross-platform, I don't know, but I don't see anything similar in Postgres), or a Postgres latch that could be multiplexed with network IO using some future primitive as suggested elsewhere (but that would ultimately still involve a signal). There would be plenty of scope for evolution, for example when integrating with the standby WAL writer work, but the signal approach seemed simple and effective for now, and it certainly isn't the first case of backends signalling each other.

Thanks,

--
Thomas Munro
http://www.enterprisedb.com