Re: [HACKERS] Causal reads take II - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: [HACKERS] Causal reads take II
Msg-id: CAEepm=3z_4CbhdWG8v-jC4nrfiPjXz5_Z--9RQEjPSykNmEyeA@mail.gmail.com
In response to: Re: [HACKERS] Causal reads take II (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses: Re: [HACKERS] Causal reads take II
List: pgsql-hackers
On Wed, May 24, 2017 at 3:58 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>> On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>> I'm wondering about status of this patch and how can I try it out?
>
> I ran into a problem while doing this, and it may take a couple more
> days to fix it since I am at pgcon this week.  More soon.

Apologies for the extended delay.  Here is the rebased patch, now with
a couple of improvements (see below).

To recap, this is the third part of the original patch series[1],
which had these components:

1.  synchronous_commit = remote_apply, committed in PostgreSQL 9.6
2.  replication lag tracking, committed in PostgreSQL 10
3.  causal_reads, the remaining part, hereby proposed for PostgreSQL 11

The goal is to allow applications to move arbitrary read-only
transactions to physical replica databases and still know that they
can see all preceding write transactions, or get an error.  It's
something like regular synchronous replication with
synchronous_commit = remote_apply, except that it limits the impact on
the primary and handles failure transitions with defined semantics.

The inspiration for this kind of distributed read-follows-write
consistency using read leases was a system called Comdb2[2][3], whose
designer encouraged me to try to extend Postgres's streaming
replication to do something similar.  Read leases can also be found in
some consensus systems like Google Megastore, albeit in more ambitious
form IIUC.  The name is inspired by a MySQL Galera feature
(approximately the same feature, but the approach is completely
different: Galera adds read latency, whereas this patch does not).
Maybe it needs a better name.

Is this a feature that people want to see in PostgreSQL?

IMPROVEMENTS IN V17

The GUC to enable the feature is now called
"causal_reads_max_replay_lag".
Standbys listed in causal_reads_standby_names whose
pg_stat_replication.replay_lag doesn't exceed that time are
"available" for causal reads and will be waited for by the primary
when committing.  When they exceed that threshold they are briefly in
"revoking" state and then "unavailable", and when they return to an
acceptable level they are briefly in "joining" state before reaching
"available" again.  CR states appear in pg_stat_replication, and
transitions are logged at LOG level.

A new GUC called "causal_reads_lease_time" controls the lifetime of
read leases sent from the primary to the standby.  This affects the
frequency of lease replacement messages, and more importantly the
worst-case commit stall that can be introduced if connectivity to a
standby is lost and we have to wait for the last sent lease to expire.

In the previous version, a single GUC controlled both the maximum
tolerated replay lag and the lease lifetime, which was good from the
point of view that fewer GUCs are better, but bad because it had to be
set fairly high when doing both jobs, to be conservative about clock
skew.  The lease lifetime must be at least 4 x the maximum tolerable
clock skew.  After the recent botching of a leap-second transition on
a popular public NTP network (TL;DR OpenNTP is not a good choice of
implementation to add to a public time server pool) I came to the
conclusion that I wouldn't want to recommend a default max clock skew
under 1.25s, to allow for some servers to be confused about leap
seconds for a while or to be running different smearing algorithms.  A
reasonable causal_reads_lease_time recommendation for people who don't
know much about the quality of their time source might therefore be
5s.  I think it's reasonable to want to set the maximum tolerable
replay lag to a lower time than that, or in fact as low as you like,
depending on your workload and hardware.
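To spell out the lease-time arithmetic, here is a trivial sketch
(plain Python, purely illustrative; the function names are mine, not
the patch's):

```python
# Illustrative arithmetic only (not from the patch): the relationship
# between lease lifetime and tolerable clock skew described above.

def max_tolerable_skew(lease_time):
    # A lease of lease_time seconds tolerates at most a quarter of
    # that in clock skew between primary and standby.
    return lease_time / 4.0

def min_lease_time(worst_case_skew):
    # Conversely, the lease lifetime must be at least 4 x the worst
    # case clock skew.
    return 4.0 * worst_case_skew

# Allowing 1.25s of skew (leap-second confusion, mixed smearing
# algorithms) implies a 5s lease, the recommendation above.
print(min_lease_time(1.25))     # 5.0
print(max_tolerable_skew(5.0))  # 1.25
```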
Therefore I decided to split the old "causal_reads_timeout" GUC into
"causal_reads_max_replay_lag" and "causal_reads_lease_time".

This new version also introduces fast lease revocation.  Whenever the
primary decides that a standby is not keeping up, it kicks it out of
the set of CR-available standbys and revokes its lease, so that anyone
trying to run causal reads transactions there will start receiving a
new error.  In the previous version, it always did that by blocking
commits while waiting for the most recently sent lease to expire,
which I now call "slow revocation" because it could take several
seconds.  Now it blocks commits only until the standby acknowledges
that it is no longer available for causal reads OR the lease expires;
ideally that takes about the time of a network round trip.  Slow
revocation is still needed in various failure cases such as lost
connectivity.

TESTING

Apply the patch after first applying a small bug fix for replication
lag tracking[4].  Then:

1.  Set up some streaming replicas.
2.  Stick causal_reads_max_replay_lag = 2s (or any time you like) in
    the primary's postgresql.conf.
3.  Set causal_reads = on in some transactions on various nodes.
4.  Try to break it!

As long as your system clocks don't disagree by more than 1.25s
(causal_reads_lease_time / 4), the causal reads guarantee will be
upheld: standbys will either see transactions that have completed on
the primary or raise an error to indicate that they are not available
for causal reads transactions.  You should not be able to break this
guarantee, no matter what you do: unplug the network, kill arbitrary
processes, etc.
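Concretely, step 2 might look something like this (GUC names are from
the patch; the values are just the examples discussed in this email,
and the wildcard for causal_reads_standby_names is my assumption about
a permissive setting, not a documented default):

```
# primary's postgresql.conf -- illustrative values only
causal_reads_max_replay_lag = 2s    # standbys lagging more than this
                                    # become unavailable for causal reads
causal_reads_lease_time = 5s        # tolerates up to 5s/4 = 1.25s clock skew
causal_reads_standby_names = '*'    # assumption: make all standbys eligible
```

Then, per step 3, run SET causal_reads = on in sessions on various
nodes; a causal-reads-available standby will either see every write
transaction that completed on the primary beforehand, or raise an
error.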
If you mess with your system clocks so they differ by more than
causal_reads_lease_time / 4, you should see that a reasonable effort
is made to detect that, so it's still very unlikely you can break it
(you'd need the clocks to differ by more than
causal_reads_lease_time / 4 but less than causal_reads_lease_time / 4
+ network latency, so that the excessive skew is not detected, and
then you'd need a very well timed pair of transactions and loss of
connectivity).

[1] https://www.postgresql.org/message-id/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
[2] https://github.com/bloomberg/comdb2
[3] http://www.vldb.org/pvldb/vol9/p1377-scotti.pdf
[4] https://www.postgresql.org/message-id/CAEepm%3D3tJX_0kSeDi8OYTMp8NogrqPxgP1%2B2uzsdePz9i0-V0Q%40mail.gmail.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers