Re: Proposal: "Causal reads" mode for load balancing reads without stale data - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: Proposal: "Causal reads" mode for load balancing reads without stale data |
Date | |
Msg-id | CAB7nPqSbCHF-fZjMxFZ3uO22V4Vp8NqAQ37YxmTRuExDDwTjkg@mail.gmail.com |
In response to | Re: Proposal: "Causal reads" mode for load balancing reads without stale data (Thomas Munro <thomas.munro@enterprisedb.com>) |
Responses | Re: Proposal: "Causal reads" mode for load balancing reads without stale data |
List | pgsql-hackers |
On Tue, Mar 1, 2016 at 11:53 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Tue, Mar 1, 2016 at 2:46 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> Hi,
>>
>> On 2016/02/29 18:05, Thomas Munro wrote:
>>> On Mon, Feb 29, 2016 at 9:05 PM, Amit Langote wrote:
>>>> +    servers.  A transaction that is run with <varname>causal_reads</> set
>>>> +    to <literal>on</> is guaranteed either to see the effects of all
>>>> +    completed transactions run on the primary with the setting on, or to
>>>> +    receive an error "standby is not available for causal reads".
>>>>
>>>> "A transaction that is run" means "A transaction that is run on a
>>>> standby", right?
>>>
>>> Well, it could be any server, standby or primary.  Of course standbys
>>> are the interesting case since it was already true that if you run
>>> two sequential transactions on the primary, the second can see the
>>> effect of the first, but I like the idea of a general rule that
>>> applies anywhere, allowing you not to care which server it is.
>>
>> I meant actually in context of that sentence only.
>
> Ok, here's a new version that includes that change, fixes a conflict
> with recent commit 10b48522 and removes an accidental duplicate copy
> of the README file.

Finally I got my eyes on this patch. In short, this patch introduces two different concepts:

- The addition of a new value, remote_apply, for synchronous_commit. This is where things overlap a bit with the N-sync patch: by combining remote_apply with the N-sync patch, it becomes possible to ensure that an LSN is applied on multiple targets instead of just one. In this case, as the master waits for the LSN to be applied on the sync standby, the application needs no error handling for a read transaction running on the standby: the change is already visible on the standby once the commit on the master returns. The cost, however, is that if the standby node lags behind, it adds extra wait on the master side as well.
- Causal reads, which lets the master drop standbys that lag too far behind, and lets callers on a standby know that it lags too far behind and cannot satisfy causal reads. In this case the application does need error handling for the case of a standby lagging too far behind.

That's one of my concerns about this patch now: it is trying to do too much. I think that there is definitely a need for both things: applications may be fine to pay the price of the lag when remote_apply is used, in exchange for not needing an extra error handling layer, or they cannot accept a potentially large lag and are ready to add an extra error handling layer to manage read failures on standbys. So I would recommend a split to begin with:
1) Addition of remote_apply in synchronous_commit, which would be quite a simple patch, and I am convinced that there are benefits in having that. Compared to the previous patch sent, a standby message is sent back to the master once COMMIT has been applied, accelerating things a bit.
2) The causal reads patch, with all its extra parametrization logic and the machinery to select standby candidates.

Here is as well a more detailed review...

Regression tests are failing: rules.sql generates diffs because pg_stat_replication is changed.

CausalReadsWaitForLSN() should be called for 2PC as well I think, for PREPARE, COMMIT PREPARED and ROLLBACK PREPARED. WAL replay for 2PC should also call XLogRequestWalReceiverReply() when needed.
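To illustrate the shape of what I have in mind (a rough, uncompiled sketch, assuming that CausalReadsWaitForLSN() takes the record end LSN the same way SyncRepWaitForLSN() does), the end of RecordTransactionCommitPrepared() in twophase.c would gain something like this:

     SyncRepWaitForLSN(recptr);
+    /*
+     * Wait as well for the causal reads guarantee, the way the patch does
+     * it for plain commits in RecordTransactionCommit().  PREPARE and
+     * ROLLBACK PREPARED would need the same treatment.
+     */
+    CausalReadsWaitForLSN(recptr);

And on the standby side, the xact_redo() path for XLOG_XACT_COMMIT_PREPARED (plus the PREPARE and ROLLBACK PREPARED records) would mirror what the patch does when replaying a plain commit record:

+    /*
+     * Let the walreceiver report the new apply position to the master
+     * without waiting for the next status interval.
+     */
+    XLogRequestWalReceiverReply();

That's only to show the shape of the change; the details depend on how the patch wires this for plain commit records.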
The new parameters are missing from postgresql.conf.sample.

+static bool
+SyncRepCheckEarlyExit(void)
+{
Doing this refactoring would actually make sense as a separate patch I think, and it would simplify the core patch for causal reads.

+For this reason we say that the causal reads guarantee only holds as
+long as the absolute difference between the system clocks of the
+machines is no more than max_clock_skew.  The theory is that NTP makes
+it possible to reason about the maximum possible clock difference
+between machines and choose a value that allows for a much larger
+difference.  However, we do make a best effort attempt to detect
+misconfigured systems as described above, to catch the case of servers
+not running ntp a correctly configured ntp daemon, or with a clock so
+far out of whack that ntp refuses to fix it.
Just wondering how this reacts when standby and master are in different timezones. I know of two ways to measure WAL lag. One is what you are doing: using a timestamp and relying on the two servers being in sync at clock level. The second is in bytes, with a WAL quantity: say max_wal_lag or similar, expressed as the number of WAL segments the standby is allowed to lag behind. It is less user-friendly to set up, but the concerns regarding clock synchronization across nodes go away. In short, I find the timestamp approach easy to set up and to understand for the user, but not really solid, as it depends on the clock state of different hosts, whereas a difference between flush and apply LSN positions is independent of that. So basically what would be used as the base of comparison is not the timestamp of the transaction commit, but the flush LSN at the moment the commit has been replayed.

+   /*
+    * If a causal_reads_timeout is configured, it is used instead of
+    * wal_sender_timeout.  Ideally we'd use causal_reads_timeout / 2 +
+    * allowance for network latency, but since walreceiver can become quite
+    * bogged down fsyncing WAL we allow more tolerance.  (This could be
+    * tightened up once standbys hand writing off to the WAL writer).
+    */
+   if (causal_reads_timeout != 0)
+       allowed_time = causal_reads_timeout;
+   else
+       allowed_time = wal_sender_timeout;
I find that surprising, for two reasons:
1) It seems to me that causal_reads_timeout has, conceptually, no relation with WAL sender process control.
2) A standby should still be able to receive WAL even if it cannot satisfy causal reads, to give it a chance to catch up more quickly on the amount by which it is late.

-   SYNCHRONOUS_COMMIT_REMOTE_FLUSH /* wait for local and remote flush */
+   SYNCHRONOUS_COMMIT_REMOTE_FLUSH,    /* wait for local and remote flush */
+   SYNCHRONOUS_COMMIT_REMOTE_APPLY,    /* wait for local flush and remote
+                                        * apply */
+   SYNCHRONOUS_COMMIT_CONSISTENT_APPLY /* wait for local flusha and remote
+                                          apply with causal consistency */
SYNCHRONOUS_COMMIT_CONSISTENT_APPLY is used nowhere, and there is a typo: s/flusha/flush a/.

I am still digging into the patch, the available/joining/unavailable logic being quite interesting.
--
Michael