Thread: [HACKERS] Causal reads take II
Hi hackers,

Here is a new version of my "causal reads" patch (see the earlier thread from the 9.6 development cycle[1]), which provides a way to avoid stale reads when load balancing with streaming replication.

To try it out: Set up a primary and some standbys, and put "causal_reads_timeout = 4s" in the primary's postgresql.conf. Then SET causal_reads = on and experiment with various workloads, watching your master's log and looking at pg_stat_replication. For example you could try out test-causal-reads.c with --causal-reads --check (from the earlier thread) or write something similar, and verify the behaviour while killing, pausing, overwhelming servers etc.

Here's a brief restatement of the problem I'm trying to address and how this patch works:

In 9.6 we got a new synchronous_commit level "remote_apply", which causes committing transactions to block until the commit record has been applied on the current synchronous standby server. In 10devel that can now be servers, plural. That's useful because it means that a client can run tx1 on the primary and then run tx2 on an appropriate standby, or cause some other client to do so, and be sure that tx2 can see tx1. Tx2 can be said to be "causally dependent" on tx1 because clients expect tx2 to see tx1, because they know that tx1 happened before tx2. In practice there are complications relating to failure and transitions. How should you find an appropriate standby?

Suppose you have a primary and N standbys, you set synchronous_standby_names to wait for all N standbys, and you set synchronous_commit to remote_apply. Then the above guarantee of visibility of tx1 by tx2 works, no matter which server you run tx2 on. Unfortunately, if one of your standby servers fails or there is a network partition, all commits will block until you fix that. So you probably want to set synchronous_standby_names to wait for a subset of your set of standbys. Now you can lose some number of standby servers without holding up commits on the primary, but the visibility guarantee for causal dependencies is lost! How can a client know for certain whether tx2 run on any given standby can see a transaction tx1 that it has heard about? If you're using the new "ANY n" mode then the subset of standbys that have definitely applied tx1 is not known to any client; if you're using the traditional FIRST mode it's complicated during transitions (you might be talking to a standby that has recently lost its link to the primary, and the primary could have decided to wait for the next highest priority standby instead and then returned from COMMIT successfully).

This patch provides the following guarantee: if causal_reads is on for both tx1 and tx2, then after tx1 returns, tx2 will either see tx1 or fail with an error indicating that the server is currently unavailable for causal reads. This guarantee is upheld even if there is a network partition and the standby running tx2 is unable to communicate with the primary server, but it requires the system clocks of all standbys to differ from the primary's by less than a certain amount of allowable skew that is accounted for in the algorithm (causal_reads_timeout / 4, see README.causal_reads for gory details).

It works by sending a stream of "leases" to standbys that are applying fast enough. These leases grant the standby the right to assume that all transactions that were run with causal_reads = on and have returned control have already been applied locally, without doing any communication or waiting, for a limited time.
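To make that guarantee concrete, here is a rough sketch of the intended usage pattern from a client's point of view (the table and the exact error text are illustrative assumptions, not taken verbatim from the patch):

```
-- Session 1, connected to the primary:
SET causal_reads = on;
BEGIN;
INSERT INTO accounts (id, balance) VALUES (42, 100);   -- tx1
COMMIT;  -- returns only after available standbys have applied tx1 (or been dropped)

-- Session 2, connected to any standby, after tx1 has returned:
SET causal_reads = on;
SELECT balance FROM accounts WHERE id = 42;            -- tx2
-- Either sees tx1's row, or fails with an error along the lines of
-- "standby is not available for causal reads".
```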
Leases are promises made by the primary that it will wait for all such transactions to be applied on each 'available' standby, or for available standbys' leases to be revoked because they're lagging too much, and for any revoked leases to expire.

As discussed in the earlier thread, there are other ways that tx2 run on a standby could get a useful guarantee about the visibility of an earlier transaction tx1 that the client knows about.

(1) User-managed "causality tokens": Clients could somehow obtain the LSN of commit tx1 (or later), and then tx2 could explicitly wait for that LSN to be applied, as proposed by Ivan Kartyshov[2] and others. If you aren't also using sync rep for data loss avoidance, then tx1 will return from committing without waiting for standbys, and by the time tx2 starts on a standby it may find that the LSN has already been applied and not have to wait at all. That is definitely good. Unfortunately it also transfers the problem of tracking causal dependencies between transactions to client code, which is a burden on the application developer and difficult to retrofit.

(2) Middleware-managed causality tokens: Something like pgpool or pgbouncer or some other proxy could sit in front of all of your PostgreSQL servers and watch all transactions and do the LSN tracking for you, inserting waits where appropriate so that no standby query ever sees a snapshot that doesn't include any commit that any client has heard about. That requires tx2 to wait for transactions that may be later than tx1 to be applied, potentially slowing down every read query, and it requires pushing all transactions through a single central process, thereby introducing its own failover problem with an associated transition failure mode that could break our guarantee if somehow two of these proxies are ever active at once.

Don't get me wrong, I think those are good ideas: let's do those too. I guess that people working on logical multi-master replication might eventually want a general concept of causality tokens which could include some kind of vector clock. But I don't see this proposal as conflicting with any of that. It's a set of trade-offs that provides a simple solution for users who want to be able to talk directly to any PostgreSQL standby server out of the box without pushing everything through a central observer, and who want to be able to enable this for existing applications without having to rewrite them to insert complicated code to track and communicate LSNs.

Some assorted thoughts and things I'd love to hear your ideas on:

I admit that it has a potentially confusing relationship with synchronous replication. It is presented as a separate feature, and you can use both features together or use them independently: synchronous_standby_names and synchronous_commit are for controlling your data loss risk, while causal_reads_standby_names and causal_reads are for controlling distributed read consistency. Perhaps the causal_reads GUC should support different levels rather than using on/off; the mode described above could be enabled with something = 'causal_read_lease', leaving room for other modes. Maybe the whole feature needs a better name: I borrowed "causal reads" from Galera's wsrep_causal_reads/wsrep_sync_wait.
That system makes readers (think standbys) wait for the global end of WAL to be applied locally at the start of every transaction, which could also be a potential future mode for us, but I thought it was much more interesting to have wait-free reads on standbys, especially if you already happen to be waiting on the primary because you want to avoid data loss with syncrep. To achieve that I added system-clock-based leases. I suspect some people will dislike that part: the guarantee includes the caveat about the maximum difference between system clocks, and the patch doesn't do anything as clever as Google's Spanner/TrueTime system or come with a free built-in atomic clock, so it relies on setting the max clock skew conservatively and making sure you have NTP set up correctly (for example, things reportedly got a bit messy for a short time after the recent leap second if you happened to have only one server from pool.ntp.org in your ntpd.conf and were unlucky). I considered ways to make causal reads an extension, but it'd need fairly invasive hooks including the ability to change replication wire protocol messages.

Long term, I think it would be pretty cool if we could develop a set of features that give you distributed sequential consistency on top of streaming replication. Something like (this | causality-tokens) + SERIALIZABLE-DEFERRABLE-on-standbys[3] + distributed-dirty-read-prevention[4].

The patch: The replay lag tracking patch this depends on is in the current commitfest[1] and is presented as an independent useful feature. Please find two patches to implement causal reads for the open CF attached. First apply replay-lag-v16.patch, then refactor-syncrep-exit-v16.patch, then causal-reads-v16.patch.

Thanks for reading!

[1] https://www.postgresql.org/message-id/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/0240c26c-9f84-30ea-fca9-93ab2df5f305@postgrespro.ru#0240c26c-9f84-30ea-fca9-93ab2df5f305@postgrespro.ru
[3] https://www.postgresql.org/message-id/flat/CAEepm%3D2b9TV%2BvJ4UeSBixDrW7VUiTjxPwWq8K3QwFSWx0pTXHQ%40mail.gmail.com
[4] https://www.postgresql.org/message-id/flat/CAEepm%3D1GNCriNvWhPkWCqrsbXWGtWEEpvA-KnovMbht5ryzbmg%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Here is a new version of my "causal reads" patch (see the earlier thread from the 9.6 development cycle[1]), which provides a way to avoid stale reads when load balancing with streaming replication.

Thanks for working on this. It will let us do something a lot of people have been asking for.

> Long term, I think it would be pretty cool if we could develop a set of features that give you distributed sequential consistency on top of streaming replication. Something like (this | causality-tokens) + SERIALIZABLE-DEFERRABLE-on-standbys[3] + distributed-dirty-read-prevention[4].

Is it necessary that causal writes wait for replication before making the transaction visible on the master? I'm asking because the per tx variable wait time between logging commit record and making transaction visible makes it really hard to provide matching visibility order on master and standby. In CSN based snapshot discussions we came to the conclusion that to make standby visibility order match master while still allowing for async transactions to become visible before they are durable we need to make the commit sequence a vector clock and transmit extra visibility ordering information to standby's. Having one more level of delay between wal logging of commit and making it visible would make the problem even worse.

One other thing that might be an issue for some users is that this patch only ensures that clients observe forwards progress of database state after a writing transaction. With two consecutive read only transactions that go to different servers a client could still observe database state going backwards. It seems that fixing that would require either keeping some per client state or a global agreement on what snapshots are safe to provide, both of which you tried to avoid for this feature.

Regards,
Ants Aasma
On Thu, Jan 19, 2017 at 8:11 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> Long term, I think it would be pretty cool if we could develop a set of features that give you distributed sequential consistency on top of streaming replication. Something like (this | causality-tokens) + SERIALIZABLE-DEFERRABLE-on-standbys[3] + distributed-dirty-read-prevention[4].
>
> Is it necessary that causal writes wait for replication before making the transaction visible on the master? I'm asking because the per tx variable wait time between logging commit record and making transaction visible makes it really hard to provide matching visibility order on master and standby.

Yeah, that does seem problematic. Even with async replication or no replication, isn't there already a race in CommitTransaction() where two backends could reach RecordTransactionCommit() in one order but ProcArrayEndTransaction() in the other order? AFAICS using synchronous replication in one of the transactions just makes it more likely you'll experience such a visibility difference between the DO and REDO histories (!), by making RecordTransactionCommit() wait. Nothing prevents you getting a snapshot that can see t2 but not t1 in the DO history, while someone doing PITR or querying an asynchronous standby gets a snapshot that can see t1 but not t2 because those replay the REDO history.

> In CSN based snapshot discussions we came to the conclusion that to make standby visibility order match master while still allowing for async transactions to become visible before they are durable we need to make the commit sequence a vector clock and transmit extra visibility ordering information to standby's. Having one more level of delay between wal logging of commit and making it visible would make the problem even worse.

I'd like to read that... could you please point me at the right bit of that discussion?

> One other thing that might be an issue for some users is that this patch only ensures that clients observe forwards progress of database state after a writing transaction. With two consecutive read only transactions that go to different servers a client could still observe database state going backwards.

True. This patch is about "read your writes", not, erm, "read your reads". That may indeed be problematic for some users. It's not a very satisfying answer, but I guess you could run a dummy write query on the primary every time you switch between standbys, or before telling any other client to run read-only queries after you have done so, in order to convert your "r r" sequence into a "r w r" sequence...

> It seems that fixing that would require either keeping some per client state or a global agreement on what snapshots are safe to provide, both of which you tried to avoid for this feature.

Agreed. You briefly mentioned this problem in the context of pairs of read-only transactions a while ago[1]. As you said then, it does seem plausible to do that with a token system that gives clients the last commit LSN from the snapshot used by a read only query, so that you can ask another standby to make sure that LSN has been replayed before running another read-only transaction. This could be handled explicitly by a motivated client that is talking to multiple nodes. A more general problem is client A telling client B to go and run queries and expecting B to see all transactions that A has seen; it now has to pass the LSN along with that communication, or rely on some kind of magic proxy that sees all transactions, or a radically different system with a GTM.

[1] https://www.postgresql.org/message-id/CA%2BCSw_u4Vy5FSbjVc7qms6PuZL7QV90%2BonBEtK9PFqOsNj0Uhw@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
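For what it's worth, the dummy-write trick mentioned above could be as simple as the following sketch, assuming the causal_reads GUC from this patch and a hypothetical single-row table kept just for this purpose:

```
-- On the primary, after reading from standby A and before reading from standby B:
SET causal_reads = on;
BEGIN;
UPDATE causal_reads_bump SET touched = now();   -- hypothetical dummy table
COMMIT;  -- waits until every causal-reads-available standby has applied this

-- A later causal_reads = on query on standby B now either sees at least this
-- state (and hence everything previously read on standby A) or raises the
-- "unavailable for causal reads" error.
```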
On Thu, Jan 19, 2017 at 2:22 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Thu, Jan 19, 2017 at 8:11 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
>> On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>> Long term, I think it would be pretty cool if we could develop a set of features that give you distributed sequential consistency on top of streaming replication. Something like (this | causality-tokens) + SERIALIZABLE-DEFERRABLE-on-standbys[3] + distributed-dirty-read-prevention[4].
>>
>> Is it necessary that causal writes wait for replication before making the transaction visible on the master? I'm asking because the per tx variable wait time between logging commit record and making transaction visible makes it really hard to provide matching visibility order on master and standby.
>
> Yeah, that does seem problematic. Even with async replication or no replication, isn't there already a race in CommitTransaction() where two backends could reach RecordTransactionCommit() in one order but ProcArrayEndTransaction() in the other order? AFAICS using synchronous replication in one of the transactions just makes it more likely you'll experience such a visibility difference between the DO and REDO histories (!), by making RecordTransactionCommit() wait. Nothing prevents you getting a snapshot that can see t2 but not t1 in the DO history, while someone doing PITR or querying an asynchronous standby gets a snapshot that can see t1 but not t2 because those replay the REDO history.

Yes there is a race even with all transactions having the same synchronization level. But nobody will mind if we some day fix that race. :)

With different synchronization levels it is much trickier to fix as either async commits must wait behind sync commits before becoming visible, return without becoming visible or visibility order must differ from commit record LSN order. The first makes the async commit feature useless, second seems a no-go for semantic reasons, third requires extra information sent to standby's so they know the actual commit order.

>> In CSN based snapshot discussions we came to the conclusion that to make standby visibility order match master while still allowing for async transactions to become visible before they are durable we need to make the commit sequence a vector clock and transmit extra visibility ordering information to standby's. Having one more level of delay between wal logging of commit and making it visible would make the problem even worse.
>
> I'd like to read that... could you please point me at the right bit of that discussion?

Some of that discussion was face to face at pgconf.eu, some of it is here:
https://www.postgresql.org/message-id/CA+CSw_vbt=CwLuOgR7gXdpnho_Y4Cz7X97+o_bH-RFo7keNO8Q@mail.gmail.com

Let me know if you have any questions.

>> It seems that fixing that would require either keeping some per client state or a global agreement on what snapshots are safe to provide, both of which you tried to avoid for this feature.
>
> Agreed. You briefly mentioned this problem in the context of pairs of read-only transactions a while ago[1]. As you said then, it does seem plausible to do that with a token system that gives clients the last commit LSN from the snapshot used by a read only query, so that you can ask another standby to make sure that LSN has been replayed before running another read-only transaction. This could be handled explicitly by a motivated client that is talking to multiple nodes. A more general problem is client A telling client B to go and run queries and expecting B to see all transactions that A has seen; it now has to pass the LSN along with that communication, or rely on some kind of magic proxy that sees all transactions, or a radically different system with a GTM.

If/when we do CSN based snapshots, adding a GTM could be relatively straightforward. It's basically not all that far from what Spanner is doing by using a timestamp as the snapshot. But this is all relatively independent of this patch.

Regards,
Ants Aasma
On Fri, Jan 20, 2017 at 3:01 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> Yes there is a race even with all transactions having the same synchronization level. But nobody will mind if we some day fix that race. :)

We really should fix that!

> With different synchronization levels it is much trickier to fix as either async commits must wait behind sync commits before becoming visible, return without becoming visible or visibility order must differ from commit record LSN order. The first makes the async commit feature useless, second seems a no-go for semantic reasons, third requires extra information sent to standby's so they know the actual commit order.

Thought experiment:

1. Log commit and make visible atomically (so DO and REDO agree on visibility order).
2. Introduce flag 'not yet visible' to commit record for sync rep commits.
3. Introduce a new log record 'make all invisible commits up to LSN X visible', which is inserted when enough sync rep standbys reply. Log this + make visible on primary atomically (again, so DO and REDO agree on visibility order).
4. Teach GetSnapshotData to deal with this using <insert magic here>.

Now standby and primary agree on visibility order of async and sync transactions, and no standby will allow you to see transactions that the primary doesn't yet consider to be durable (ie flushed on a quorum of standbys etc). But... sync rep has to flush xlog twice on primary, and standby has to wait to make things visible, and remote_apply would either need to be changed or supplemented with a new level remote_apply_and_visible, and it's not obvious how to actually do atomic visibility + logging (I heard ProcArrayLock is kinda hot...). Hmm. Doesn't sound too popular...

>>> In CSN based snapshot discussions we came to the conclusion that to make standby visibility order match master while still allowing for async transactions to become visible before they are durable we need to make the commit sequence a vector clock and transmit extra visibility ordering information to standby's. Having one more level of delay between wal logging of commit and making it visible would make the problem even worse.
>>
>> I'd like to read that... could you please point me at the right bit of that discussion?
>
> Some of that discussion was face to face at pgconf.eu, some of it is here:
> https://www.postgresql.org/message-id/CA+CSw_vbt=CwLuOgR7gXdpnho_Y4Cz7X97+o_bH-RFo7keNO8Q@mail.gmail.com
>
> Let me know if you have any questions.

Thanks! That may take me some time...

--
Thomas Munro
http://www.enterprisedb.com
Hi
I'm wondering about the status of this patch and how I can try it out?
> On 3 January 2017 at 02:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> The replay lag tracking patch this depends on is in the current commitfest
I assume you're talking about this patch [1] (at least it's the only thread
where I could find a `replay-lag-v16.patch`)? But `replay lag tracking` was
returned with feedback, so what's the status of this one (`causal reads`)?
> First apply replay-lag-v16.patch, then refactor-syncrep-exit-v16.patch, then
> causal-reads-v16.patch.
It would be nice to have all three of them attached (for some reason I see only
last two of them in this thread). But anyway there are a lot of failed hunks
when I'm trying to apply `replay-lag-v16.patch`, `refactor-syncrep-exit-v16.patch` and
`causal-reads-v16.patch` (or the last two of them separately).
On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I'm wondering about the status of this patch and how I can try it out?

Hi Dmitry, thanks for your interest.

>> On 3 January 2017 at 02:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> The replay lag tracking patch this depends on is in the current commitfest
>
> I assume you're talking about this patch [1] (at least it's the only thread
> where I could find a `replay-lag-v16.patch`)? But `replay lag tracking` was
> returned with feedback, so what's the status of this one (`causal reads`)?

Right, replay lag tracking was committed. I'll post a rebased causal reads patch later today.

--
Thomas Munro
http://www.enterprisedb.com
On Mon, May 22, 2017 at 6:32 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> I'm wondering about the status of this patch and how I can try it out?
>
> Hi Dmitry, thanks for your interest.
>
>>> On 3 January 2017 at 02:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>> The replay lag tracking patch this depends on is in the current commitfest
>>
>> I assume you're talking about this patch [1] (at least it's the only thread
>> where I could find a `replay-lag-v16.patch`)? But `replay lag tracking` was
>> returned with feedback, so what's the status of this one (`causal reads`)?
>
> Right, replay lag tracking was committed. I'll post a rebased causal reads patch later today.

I ran into a problem while doing this, and it may take a couple more days to fix it since I am at pgcon this week. More soon.

--
Thomas Munro
http://www.enterprisedb.com
On Wed, May 24, 2017 at 3:58 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>> I'm wondering about the status of this patch and how I can try it out?
>
> I ran into a problem while doing this, and it may take a couple more days to fix it since I am at pgcon this week. More soon.

Apologies for the extended delay. Here is the rebased patch, now with a couple of improvements (see below).

To recap, this is the third part of the original patch series[1], which had these components:

1. synchronous_commit = remote_apply, committed in PostgreSQL 9.6
2. replication lag tracking, committed in PostgreSQL 10
3. causal_reads, the remaining part, hereby proposed for PostgreSQL 11

The goal is to allow applications to move arbitrary read-only transactions to physical replica databases and still know that they can see all preceding write transactions or get an error. It's something like regular synchronous replication with synchronous_commit = remote_apply, except that it limits the impact on the primary and handles failure transitions with defined semantics.

The inspiration for this kind of distributed read-follows-write consistency using read leases was a system called Comdb2[2][3], whose designer encouraged me to try to extend Postgres's streaming replication to do something similar. Read leases can also be found in some consensus systems like Google Megastore, albeit in more ambitious form IIUC. The name is inspired by a MySQL Galera feature (approximately the same feature, but the approach is completely different; Galera adds read latency, whereas this patch does not). Maybe it needs a better name.

Is this a feature that people want to see in PostgreSQL?

IMPROVEMENTS IN V17

The GUC to enable the feature is now called "causal_reads_max_replay_lag". Standbys listed in causal_reads_standby_names whose pg_stat_replication.replay_lag doesn't exceed that time are "available" for causal reads and will be waited for by the primary when committing. When they exceed that threshold they are briefly in "revoking" state and then "unavailable", and when they return to an acceptable level they are briefly in "joining" state before reaching "available". CR states appear in pg_stat_replication and transitions are logged at LOG level.

A new GUC called "causal_reads_lease_time" controls the lifetime of read leases sent from the primary to the standby. This affects the frequency of lease replacement messages, and more importantly affects the worst case of commit stall that can be introduced if connectivity to a standby is lost and we have to wait for the last sent lease to expire. In the previous version, one single GUC controlled both the maximum tolerated replay lag and the lease lifetime, which was good from the point of view that fewer GUCs are better, but bad because it had to be set fairly high when doing both jobs to be conservative about clock skew.

The lease lifetime must be at least 4 x the maximum tolerable clock skew. After the recent botching of a leap-second transition on a popular public NTP network (TL;DR OpenNTPD is not a good choice of implementation to add to a public time server pool) I came to the conclusion that I wouldn't want to recommend a default max clock skew under 1.25s, to allow for some servers to be confused about leap seconds for a while or to be running different smearing algorithms.
A reasonable causal_reads_lease_time recommendation for people who don't know much about the quality of their time source might therefore be 5s. I think it's reasonable to want to set the maximum tolerable replay lag to a lower time than that, or in fact as low as you like, depending on your workload and hardware. Therefore I decided to split the old "causal_reads_timeout" GUC into "causal_reads_max_replay_lag" and "causal_reads_lease_time".

This new version introduces fast lease revocation. Whenever the primary decides that a standby is not keeping up, it kicks it out of the set of CR-available standbys and revokes its lease, so that anyone trying to run causal reads transactions there will start receiving a new error. In the previous version, it always did that by blocking commits while waiting for the most recently sent lease to expire, which I now call "slow revocation" because it could take several seconds. Now it blocks commits only until the standby acknowledges that it is no longer available for causal reads OR the lease expires: ideally that takes the time of a network round trip. Slow revocation is still needed in various failure cases such as lost connectivity.

TESTING

Apply the patch after first applying a small bug fix for replication lag tracking[4]. Then:

1. Set up some streaming replicas.
2. Stick causal_reads_max_replay_lag = 2s (or any time you like) in the primary's postgresql.conf.
3. Set causal_reads = on in some transactions on various nodes.
4. Try to break it!

As long as your system clocks don't disagree by more than 1.25s (causal_reads_lease_time / 4), the causal reads guarantee will be upheld: standbys will either see transactions that have completed on the primary or raise an error to indicate that they are not available for causal reads transactions. You should not be able to break this guarantee, no matter what you do: unplug the network, kill arbitrary processes, etc.

If you mess with your system clocks so they differ by more than causal_reads_lease_time / 4, you should see that a reasonable effort is made to detect that, so it's still very unlikely you can break it (you'd need clocks to differ by more than causal_reads_lease_time / 4 but less than causal_reads_lease_time / 4 + network latency so that the excessive skew is not detected, and then you'd need a very well timed pair of transactions and loss of connectivity).

[1] https://www.postgresql.org/message-id/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
[2] https://github.com/bloomberg/comdb2
[3] http://www.vldb.org/pvldb/vol9/p1377-scotti.pdf
[4] https://www.postgresql.org/message-id/CAEepm%3D3tJX_0kSeDi8OYTMp8NogrqPxgP1%2B2uzsdePz9i0-V0Q%40mail.gmail.com
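Restating that testing recipe as a rough psql sketch (GUC and column names as in this version of the patch; whether these settings can be applied with ALTER SYSTEM plus a reload, rather than a restart, is an assumption):

```
-- On the primary (values are just examples):
ALTER SYSTEM SET causal_reads_max_replay_lag = '2s';
ALTER SYSTEM SET causal_reads_lease_time = '5s';
SELECT pg_reload_conf();

-- Watch which standbys are currently available for causal reads:
SELECT application_name, replay_lag, causal_reads_state
FROM pg_stat_replication;

-- In any session, on the primary or a standby:
SET causal_reads = on;
-- ... run transactions and try to break the guarantee ...
```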
On Fri, Jun 23, 2017 at 11:48 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Apply the patch after first applying a small bug fix for replication lag tracking[4]. Then:

That bug fix was committed, so now causal-reads-v17.patch can be applied directly on top of master.

> 1. Set up some streaming replicas.
> 2. Stick causal_reads_max_replay_lag = 2s (or any time you like) in the primary's postgresql.conf.
> 3. Set causal_reads = on in some transactions on various nodes.
> 4. Try to break it!

Someone asked me off-list how to set this up quickly and easily for testing. Here is a shell script that will start up a primary server (port 5432) and 3 replicas (ports 5441 to 5443). Set the two paths at the top of the file before running it. Log in with psql postgres [-p <port>], then SET causal_reads = on to test its effect. causal_reads_max_replay_lag is set to 2s and depending on your hardware you might find that stuff like CREATE TABLE big_table AS SELECT generate_series(1, 10000000) or a large COPY data load causes replicas to be kicked out of the set after a while; you can also pause replay on the replicas with SELECT pg_wal_replay_pause() and pg_wal_replay_resume(), kill -STOP/-CONT or -9 the walreceiver processes to simulate various failure modes, or run the replicas remotely and unplug the network. SELECT application_name, replay_lag, causal_reads_state FROM pg_stat_replication to see the current situation, and also monitor the primary's LOG messages about transitions.

You should find that the "read-your-writes-or-fail-explicitly" guarantee is upheld, no matter what you do, and furthermore that failing or lagging replicas don't hold the primary up very long: in the worst case causal_reads_lease_time for lost contact, and in the best case the time to exchange a couple of messages with the standby to tell it its lease is revoked and it should start raising an error. You might find test-causal-reads.c[1] useful for testing.

> Maybe it needs a better name.

Ok, how about this: the feature could be called "synchronous replay". The new column in pg_stat_replication could be called sync_replay (like the other sync_XXX columns). The GUCs could be called synchronous_replay, synchronous_replay_max_lag and synchronous_replay_lease_time. The language in log messages could refer to standbys "joining the synchronous replay set".

Restating the purpose of the feature with that terminology: If synchronous_replay is set to on, then you see the effects of all synchronous_replay = on transactions that committed before your transaction began, or an error is raised if that is not possible on the current node. This allows applications to direct read-only queries to read-only replicas for load balancing without seeing stale data. Is that clearer?

Restating the relationship with synchronous replication with that terminology: while synchronous_commit and synchronous_standby_names are concerned with distributed durability, synchronous_replay is concerned with distributed visibility. While the former prevents commits from returning if the configured level of durability isn't met (for example "must be flushed on master + any 2 standbys"), the latter will simply drop any standbys from the synchronous replay set if they fail or lag more than synchronous_replay_max_lag.
It is reasonable to want to use both features at once: my policy on distributed durability might be that I want all transactions to be flushed to disk on master + any of three servers before I report information to users, and my policy on distributed visibility might be that I want to be able to run read-only queries on any of my six read-only replicas, but don't want to wait for any that lag by more than 1 second.

Thoughts?

[1] https://www.postgresql.org/message-id/CAEepm%3D3NF%3D7eLkVR2fefVF9bg6RxpZXoQFmOP3RWE4r4iuO7vg%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
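As an illustration of using both features at once, here is a hedged configuration sketch (standby names and values are invented, and the synchronous_replay_* settings are the GUCs proposed by this patch, not existing server options):

```
-- Distributed durability: commits wait until flushed on the master plus any
-- one of three named standbys.
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (s1, s2, s3)';
ALTER SYSTEM SET synchronous_commit = 'on';

-- Distributed visibility (proposed): any standby may serve synchronous replay
-- transactions, but standbys lagging by more than 1 second are dropped from
-- the synchronous replay set.
ALTER SYSTEM SET synchronous_replay_standby_names = '*';
ALTER SYSTEM SET synchronous_replay_max_lag = '1s';
SELECT pg_reload_conf();
```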
On 3 January 2017 at 01:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Here is a new version of my "causal reads" patch (see the earlier thread from the 9.6 development cycle[1]), which provides a way to avoid stale reads when load balancing with streaming replication.

I'm very happy that you are addressing this topic.

I noticed you didn't put in links my earlier doubts about this specific scheme, though I can see doubts from myself and Heikki at least in the URLs. I maintain those doubts as to whether this is the right way forwards.

This patch presumes we will load balance writes to a master and reads to a pool of standbys. How will we achieve that?

1. We decorate the application with additional info to indicate routing/write concerns.
2. We get middleware to do routing for us, e.g. pgpool style read/write routing

The explicit premise of the patch is that neither of the above options are practical, so I'm unclear how this makes sense. Is there some use case that you have in mind that has not been fully described? If so, let's get it on the table.

What I think we need is a joined up plan for load balancing, so that we can understand how it will work. i.e. explain the whole use case and how the solution works.

I'm especially uncomfortable with any approaches that treat all sessions as one pool. For me, a server should support multiple pools. Causality seems to be a property of a particular set of pools. e.g. PoolS1 supports causal reads against writes to PoolM1 but not PoolM2, yet PoolS2 does not provide causal reads against PoolM1 or PoolM2.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'm very happy that you are addressing this topic.
>
> I noticed you didn't put in links my earlier doubts about this specific scheme, though I can see doubts from myself and Heikki at least in the URLs. I maintain those doubts as to whether this is the right way forwards.

One technical problem was raised in the earlier thread by Ants Aasma that I concede may be fatal to this design (see the note about read-follows-read below), but I'm not sure. All the other discussion seemed to be about trade-offs between writer-waits and reader-waits schemes, both of which I still view as reasonable options for an end user to have in the toolbox. Your initial reaction was:

> What we want is to isolate the wait only to people performing a write-read sequence, so I think it should be readers that wait.

I agree with you 100%, up to the comma. The difficulty is identifying which transactions are part of a write-read sequence. An application-managed LSN tracking system allows for waits to occur strictly in reads that are part of a write-read sequence because the application links them explicitly, and nobody is arguing that we shouldn't support that for hard-working expert users. But to support applications that don't deal with LSNs (or some kind of "causality tokens") explicitly, I think we'll finish up having to make readers wait for incidental later transactions too, not just the write that your read is dependent on, as I'll show below. When you can't identify write-read sequences perfectly, it comes down to a choice between taxing writers or taxing readers, and I'm not sure that one approach is universally better.

Let me summarise my understanding of that trade-off. I'm going to use this terminology:

synchronous replay = my proposal: ask the primary server to wait until standbys have applied tx1, a bit like 9.6 synchronous_commit = remote_apply, but with a system of leases for graceful failure.

causality tokens = the approach Heikki mentioned: provide a way for tx1 to report its commit LSN to the client, then provide a way for a client to wait for that LSN to be replayed before taking a snapshot for tx2.

tx1, tx2 = a pair of transactions with a causal dependency; we want tx2 to see tx1 because tx1 caused tx2 in some sense so must be seen to precede it.

A poor man's causality token system can be cobbled together today with pg_current_wal_lsn() and a polling loop that checks pg_last_wal_replay_lsn() (a rough sketch follows below). It's a fairly obvious thing to want to do. Several people including Heikki Linnakangas, Craig Ringer, Ants Aasma, Ivan Kartyshov and probably many others have discussed better ways to do that[1], and a patch for the wait-for-LSN piece appeared in a recent commitfest[2]. I reviewed Ivan's patch and voted -1 only because it didn't work for higher isolation levels. If he continues to develop that I will be happy to review and help get it into committable shape, and if he doesn't I may try to develop it myself. In short, I think this is a good tool to have in the toolbox and PostgreSQL 11 should have it! But I don't think it necessarily invalidates my synchronous replay proposal: they represent different sets of trade-offs and might appeal to different users.
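For reference, the poor man's approach mentioned above might look roughly like this (the LSN value is illustrative, and on 9.6 the functions are spelled pg_current_xlog_location() and pg_last_xlog_replay_location()):

```
-- On the primary, immediately after tx1 commits: capture a causality token.
SELECT pg_current_wal_lsn();   -- suppose this returns '0/16B3BD8'

-- On the standby, before starting tx2: poll until that LSN has been replayed.
DO $$
BEGIN
  WHILE pg_last_wal_replay_lsn() < '0/16B3BD8'::pg_lsn LOOP
    PERFORM pg_sleep(0.01);    -- a real implementation would also time out
  END LOOP;
END
$$;
```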
Here's why they represent different trade-offs. To actually use a causality token system you either need a carefully designed application that keeps track of causal dependencies and tokens, in which case the developer works harder but can benefit from an asynchronous pipelining effect (by the time tx2 runs we hope that tx1 has been applied, so neither transaction had to wait). Let's call that "application-managed causality tokens". That's great for those users -- let's make that possible -- but most users don't want to write code like that. So I see approximately three choices for transparent middleware (or support built into standbys), which I'll name and describe as follows:

1. "Panoptic" middleware: Sees all queries so that it can observe commit LSNs and inject wait directives into all following read-only transactions. Since all queries now need to pass through a single process, you have a new bottleneck, an extra network hop, and a single point of failure, so you'll probably want a failover system with split-brain defences.

2. "Hybrid" middleware: The middleware (or standby itself) listens to the replication stream so that it knows about commit LSNs (rather than spying on commits). The primary server waits until all connected middleware instances acknowledge commit LSNs before it releases commits, and then the middleware inserts wait-for-LSN directives into read-only transactions. Now there is no single point problem, but writers are impacted too. I mention this design because I believe this is conceptually similar to how Galera wsrep_sync_wait (AKA wsrep_causal_reads) works. (I call this "hybrid" because it splits the waiting between tx1 and tx2. Since it's synchronous, dealing with failure gracefully is tricky, probably needing a lease system like SR. I acknowledge that comparisons between our streaming replication and Galera are slightly bogus because Galera is a synchronous multi-master system.)

3. "Forward-only" middleware (insert better name): The middleware (or standby itself) asks the primary server for the latest committed LSN at the start of every transaction, and then tells the standby to wait for that LSN to be applied.

There are probably some other schemes involving communicating middleware instances, but I don't think they'll be better in ways that matter -- am I wrong? Here's a trade-off table:

                                    SR    AT    PT    HT    FT
tx1 commit waits?                   yes   no    no    yes   no
tx2 snapshot waits?                 no    yes   yes   yes   yes
tx2 waits for incidental
  transactions?                     no    no    yes   yes   yes
tx2 has round-trip to primary?      no    no    no    no    yes
can tx2's wait be pipelined?              yes   no*   no*   no*

SR = synchronous replay
AT = application-managed causality tokens
PT = panoptic middleware-managed causality tokens
HT = hybrid middleware-managed or standby-managed causality tokens
FT = forward-only middleware-managed causality tokens

*Note that only synchronous replay and application-managed causality tokens track the actual causal dependency tx2->tx1. SR does it by making tx1 wait for replay so that tx2 doesn't have to wait at all, and AT does it by making tx2 wait specifically for tx1 to be applied. PT, HT and FT don't actually know anything about tx1, so they make every read query wait until *all known transactions* are applied ("incidental transactions" above), throwing away the pipelining benefits of causality token systems (hence "no*" above).
I haven't used it myself but I have heard that that is why read latency is a problem on Galera with causal reads mode enabled: due to lack of better information you have to wait for the replication system to drain its current apply queue before every query is processed, even if tx1 in your causally dependent transaction pair was already visible on the current node.

So far I think that SR and AT are sweet spots. AT for people who are prepared to juggle causality tokens in their applications, and SR for people who want to remain oblivious to all this stuff and who can tolerate a reduction in single-client write TPS. I also think AT's pipelining advantage over SR and SR's single-client TPS impact are diminished if you also choose to enable syncrep for durability, which isn't a crazy thing to want to do if you're doing anything important with your data. The other models where all readers wait for incidental transactions don't seem terribly attractive to me, especially if the motivating premise of load balancing with read-only replicas is (to steal a line) "ye [readers] are many, they are few".

One significant blow my proposal received in the last thread was a comment from Ants about read-follows-read[3]. What do you think? I suspect the same problem applies to causality token based systems as discussed so far (except perhaps FT, the slowest and probably least acceptable to anyone). On the other hand, I think it's at least possible to fix that problem with causality tokens. You'd have to expose and capture the last-commit-LSN for every snapshot used in every single read query, and wait for it at the moment every following read query takes a new snapshot. This would make AT even harder work for developers, make PT even slower, and make HT unworkable (it only knows about commits, not reads). I also suspect that a sizeable class of applications cares a lot less about read-follows-read than read-follows-write, but I could be wrong about that.

> This patch presumes we will load balance writes to a master and reads to a pool of standbys. How will we achieve that?
>
> 1. We decorate the application with additional info to indicate routing/write concerns.
> 2. We get middleware to do routing for us, e.g. pgpool style read/write routing
>
> The explicit premise of the patch is that neither of the above options are practical, so I'm unclear how this makes sense. Is there some use case that you have in mind that has not been fully described? If so, let's get it on the table.

I don't think that pgpool routing is impractical, just that it's not a great place to put transparent causality token tracking for the reasons I've explained above -- you'll introduce a ton of latency for all readers because you can't tell which earlier transactions they might be causally dependent on. I think it's also nice to be able to support the in-process connection pooling and routing that many application developers use to avoid extra hops, so it'd be nice to avoid making pooling/routing/proxy servers strictly necessary.

> What I think we need is a joined up plan for load balancing, so that we can understand how it will work. i.e. explain the whole use case and how the solution works.

Here are some ways you could set a system up:

1. Use middleware like pgpool or pgbouncer-rr to route queries automatically; this is probably limited to single-statement queries, since multi-statement queries can't be judged by their first statement alone. (Those types of systems could be taught to understand a request for a connection with causal reads enabled, and look at the current set of usable standbys by looking at the pg_stat_replication table.)

2. Use the connection pooling inside your application server or application framework/library: for example Hibernate[4], Django[5] and many other libraries offer ways to configure multiple database connection pools and route queries appropriately at a fairly high level. Such systems could probably be improved to handle 'synchronous replay not available' errors by throwing away the connection and retrying automatically on another connection, much as they do for serialization failures and deadlocks.

3. Modify your application to deal with separate connection pools directly wherever it runs database transactions.

Perhaps I'm not thinking big enough: I tried to come up with an incremental improvement to PostgreSQL that would fix a problem that I know people have with their current hot standby deployment. I deliberately avoided proposing radical architectural projects such as moving cluster management, discovery, proxying, pooling and routing responsibilities into PostgreSQL. Perhaps those working on GTM type systems which effectively present a seamless single system find this whole discussion to be aiming too low and dealing with the wrong problems.

> I'm especially uncomfortable with any approaches that treat all sessions as one pool. For me, a server should support multiple pools. Causality seems to be a property of a particular set of pools. e.g. PoolS1 supports causal reads against writes to PoolM1 but not PoolM2, yet PoolS2 does not provide causal reads against PoolM1 or PoolM2.

Interesting, but I don't immediately see any fundamental difficulty for any of the designs discussed. For example, maybe tx1 should be able to set synchronous_replay = <group name>, rather than just 'on', to refer to a group of standbys defined in some GUC.

Just by the way, while looking for references I found PinningMasterSlaveRouter, which provides a cute example of demand for causal reads (however implemented) in the Django community: https://github.com/jbalogh/django-multidb-router. It usually sends read-only transactions to standbys, but keeps your web session temporarily pinned to the primary database to give you 15 seconds' worth of read-your-writes after each write transaction.

[1] https://www.postgresql.org/message-id/flat/53E2D346.9030806%402ndquadrant.com
[2] https://www.postgresql.org/message-id/flat/0240c26c-9f84-30ea-fca9-93ab2df5f305%40postgrespro.ru
[3] https://www.postgresql.org/message-id/CAEepm%3D15WC7A9Zdj2Qbw3CUDXWHe69d%3DnBpf%2BjXui7OYXXq11w%40mail.gmail.com
[4] https://stackoverflow.com/questions/25911359/read-write-splitting-hibernate
[5] https://github.com/yandex/django_replicated (for example; several similar extensions exist)

--
Thomas Munro
http://www.enterprisedb.com
On Sat, Jun 24, 2017 at 4:05 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Fri, Jun 23, 2017 at 11:48 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> Maybe it needs a better name.
>
> Ok, how about this: the feature could be called "synchronous replay". The new column in pg_stat_replication could be called sync_replay (like the other sync_XXX columns). The GUCs could be called synchronous_replay, synchronous_replay_max_lag and synchronous_replay_lease_time. The language in log messages could refer to standbys "joining the synchronous replay set".

Feature hereby renamed that way. It seems a lot more self-explanatory. Please see attached.

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Jun 27, 2017 at 12:20 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> What I think we need is a joined up plan for load balancing, so that we can understand how it will work. i.e. explain the whole use case and how the solution works.

Here's a proof-of-concept hack of the sort of routing and retry logic that I think should be feasible with various modern application stacks (given the right extensions):

https://github.com/macdice/py-pgsync/blob/master/DemoSyncPool.py

--
Thomas Munro
http://www.enterprisedb.com
> On 23 June 2017 at 13:48, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> Apologies for the extended delay. Here is the rebased patch, now with a
> couple of improvements (see below).
Thank you. I started to play with it a little bit, since I think it's an
interesting idea. And there are already a few notes:
* I don't see a CF item for that, where is it?
* Looks like there is a sort of sensitive typo in `postgresql.conf.sample`:
```
+#causal_reads_standy_names = '*' # standby servers that can potentially become
+ # available for causal reads; '*' = all
+
```
it should be `causal_reads_standby_names`. Also I hope in the near future I
can provide a full review.
On Thu, Jul 13, 2017 at 2:51 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Thank you. I started to play with it a little bit, since I think it's an
> interesting idea. And there are already a few notes:

Thanks Dmitry.

> * I don't see a CF item for that, where is it?

https://commitfest.postgresql.org/14/951/

The latest version of the patch is here:

https://www.postgresql.org/message-id/CAEepm%3D0YigNQczAF-%3Dx_SxT6cJv77Yb0EO%2BcAFnqRyVu4%2BbKFw%40mail.gmail.com

I renamed it to "synchronous replay", because "causal reads" seemed a bit too arcane.

> * Looks like there is a sort of sensitive typo in `postgresql.conf.sample`:
>
> ```
> +#causal_reads_standy_names = '*' # standby servers that can potentially become
> +                                 # available for causal reads; '*' = all
> +
> ```
>
> it should be `causal_reads_standby_names`.

Fixed in the latest version (while renaming).

> Also I hope in the near future I can provide a full review.

Great news, thanks!

--
Thomas Munro
http://www.enterprisedb.com
On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 3 January 2017 at 01:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
>> Here is a new version of my "causal reads" patch (see the earlier thread from the 9.6 development cycle[1]), which provides a way to avoid stale reads when load balancing with streaming replication.
>
> I'm very happy that you are addressing this topic.
>
> I noticed you didn't put in links my earlier doubts about this specific scheme, though I can see doubts from myself and Heikki at least in the URLs. I maintain those doubts as to whether this is the right way forwards.
>
> This patch presumes we will load balance writes to a master and reads to a pool of standbys. How will we achieve that?
>
> 1. We decorate the application with additional info to indicate routing/write concerns.
> 2. We get middleware to do routing for us, e.g. pgpool style read/write routing
>
> The explicit premise of the patch is that neither of the above options are practical, so I'm unclear how this makes sense. Is there some use case that you have in mind that has not been fully described? If so, let's get it on the table.
>
> What I think we need is a joined up plan for load balancing, so that we can understand how it will work. i.e. explain the whole use case and how the solution works.

Simon,

Here's a simple proof-of-concept Java web service using Spring Boot that demonstrates how load balancing could be done with this patch. It shows two different techniques for routing: an "adaptive" one that learns which transactional methods need to run on the primary server by intercepting errors, and a "declarative" one that respects Spring's @Transactional(readOnly=true) annotations (inspired by the way people use MySQL Connector/J with Spring to do load balancing). Whole transactions are automatically retried at the service request level on transient failures using existing techniques (Spring Retry, as used for handling deadlocks and serialisation failures etc), and the "TransactionRouter" avoids servers that have recently raised the "synchronous_replay not available" error. Aside from the optional annotations, the application code in KeyValueController.java is unaware of any of this.

https://github.com/macdice/syncreplay-spring-demo

I suspect you could find ways to do similar things with basically any application development stack that supports some kind of container managed transactions.

--
Thomas Munro
http://www.enterprisedb.com
> On 12 July 2017 at 23:45, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I renamed it to "synchronous replay", because "causal reads" seemed a bit too
> arcane.
Hi
I looked through the code of `synchronous-replay-v1.patch` a bit and ran a few
tests. I didn't manage to break anything, except one mysterious error that I've
got only once on one of my replicas, but I couldn't reproduce it yet.
Interesting thing is that this error did not affect another replica or primary.
Just in case here is the log for this error (maybe you can see something
obvious, that I've not noticed):
```
LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732": Directory not empty
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211": Directory not empty
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732": Directory not empty
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211": Directory not empty
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG: directories for tablespace 47733 could not be removed
HINT: You can remove the directories manually if necessary.
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
FATAL: could not create directory "pg_tblspc/47734/PG_10_201707211/47732": File exists
CONTEXT: WAL redo at 0/125F5768 for Storage/CREATE: pg_tblspc/47734/PG_10_201707211/47732/47736
LOG: startup process (PID 8034) exited with exit code 1
LOG: terminating any other active server processes
LOG: database system is shut down
```
And speaking about the code, so far I have just a few notes (some of them
merely questions):
* In general the idea behind this patch sounds interesting to me, but it
relies heavily on time synchronization. As mentioned in the documentation:
"Current hardware clocks, NTP implementations and public time servers are
unlikely to allow the system clocks to differ more than tens or hundreds of
milliseconds, and systems synchronized with dedicated local time servers may
be considerably more accurate." But as far as I remember from my own
experience, sometimes it may not be that trivial on something like AWS because
of virtualization. Maybe it's an unreasonable fear, but is it possible to
address this problem somehow?
* Also I noticed that some time-related values are hardcoded (e.g. 50%/25%
time shift when we're dealing with leases). Does it make sense to move them
out and make them configurable?
* Judging from the `SyncReplayPotentialStandby` function, it's possible to have
`synchronous_replay_standby_names = "*, server_name"`, which is basically
equivalent to just `*`, but it looks confusing (see the sketch at the end of
this email). Is it worth it to prevent this behaviour?
* In the same function `SyncReplayPotentialStandby` there is this code:
```
if (!SplitIdentifierString(rawstring, ',', &elemlist))
{
/* syntax error in list */
pfree(rawstring);
list_free(elemlist);
/* GUC machinery will have already complained - no need to do again */
return false;
}
```
Am I right that ideally this (a situation where `synchronous_replay_standby_names`
has an incorrect value at this point in the code) should not happen, because the
GUC machinery will prevent it? If so, it seems to me that it still makes sense to
put a log message here, just to give more information in a potentially weird
situation.
* In the function `SyncRepReleaseWaiters` there is a commentary:
```
/*
* If the number of sync standbys is less than requested or we aren't
* managing a sync standby or a standby in synchronous replay state that
* blocks then just leave.
*/
if ((!got_recptr || !am_sync) && !walsender_sr_blocker)
```
Is this commentary correct? If I understand everything right, `!got_recptr` means
the number of sync standbys is less than requested (a), `!am_sync` means we aren't
managing a sync standby (b), and `walsender_sr_blocker` means a standby in
synchronous replay state that blocks (c). So the condition looks like `(a or b) and not c`.
* In the function `ProcessStandbyReplyMessage` there is code that implements
this:
```
* If the standby reports that it has fully replayed the WAL for at least
* 10 seconds, then let's clear the lag times that were measured when it
* last wrote/flushed/applied a WAL record. This way we avoid displaying
* stale lag data until more WAL traffic arrives.
```
but I never found any mention of this 10 seconds in the documentation. Is it
not that important? Also, question 2 is related to this one.
* In the function `WalSndGetSyncReplayStateString` all the states are in lower
case except `UNKNOWN`, is there any particular reason for that?
There are also a few more, less important notes (mostly about some typos and a
few confusing names), but I'm going to do another round of review and testing
anyway, so I'll just send them all next time.
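PS: About the `synchronous_replay_standby_names` point above: as far as I can tell, the matching boils down to something like this (my own paraphrase, not the patch's code), which is why `"*, server_name"` behaves exactly like `"*"`:

```
#include "postgres.h"
#include "nodes/pg_list.h"

/*
 * My paraphrase of the name matching (not the patch's code): each element of
 * the parsed synchronous_replay_standby_names list is compared against the
 * walsender's application_name, and "*" matches anything, so any list that
 * contains "*" is equivalent to just "*".
 */
static bool
standby_name_matches(List *elemlist, const char *application_name)
{
	ListCell   *lc;

	foreach(lc, elemlist)
	{
		const char *standby_name = (const char *) lfirst(lc);

		if (strcmp(standby_name, "*") == 0 ||
			pg_strcasecmp(standby_name, application_name) == 0)
			return true;
	}
	return false;
}
```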
On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I looked through the code of `synchronous-replay-v1.patch` a bit and ran a few
> tests. I didn't manage to break anything, except one mysterious error that I've
> got only once on one of my replicas, but I couldn't reproduce it yet.
> Interesting thing is that this error did not affect another replica or primary.
> Just in case here is the log for this error (maybe you can see something
> obvious, that I've not noticed):
>
> ```
> LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732": Directory not empty
> CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211": Directory not empty
> CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732": Directory not empty
> CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211": Directory not empty
> CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG: directories for tablespace 47733 could not be removed
> HINT: You can remove the directories manually if necessary.
> CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> FATAL: could not create directory "pg_tblspc/47734/PG_10_201707211/47732": File exists
> CONTEXT: WAL redo at 0/125F5768 for Storage/CREATE: pg_tblspc/47734/PG_10_201707211/47732/47736
> LOG: startup process (PID 8034) exited with exit code 1
> LOG: terminating any other active server processes
> LOG: database system is shut down
> ```

Hmm. The first error ("could not remove directory") could perhaps be explained by temporary files from concurrent backends, leaked files from earlier crashes, or copying a pgdata directory over the top of an existing one as a way to set it up, leaving behind some files from an earlier test?

The second error ("could not create directory") is a bit stranger though... I think this must come from TablespaceCreateDbspace(): it must have stat()'d the file and got ENOENT, decided to create the directory, acquired TablespaceCreateLock, stat()'d the file again and found it still absent, then ran mkdir() on the parents and got EEXIST, and finally on the directory to be created, and surprisingly got EEXIST. That means that someone must have concurrently created the directory. Perhaps in your testing you accidentally copied a pgdata directory over the top of it while it was running? In any case I'm struggling to see how anything in this patch would affect anything at the REDO level.

> And speaking about the code, so far I have just a few notes (some of them
> merely questions):
>
> * In general the idea behind this patch sounds interesting to me, but it
> relies heavily on time synchronization. As mentioned in the documentation:
> "Current hardware clocks, NTP implementations and public time servers are
> unlikely to allow the system clocks to differ more than tens or hundreds of
> milliseconds, and systems synchronized with dedicated local time servers may
> be considerably more accurate." But as far as I remember from my own
> experience, sometimes it may not be that trivial on something like AWS because
> of virtualization. Maybe it's an unreasonable fear, but is it possible to
> address this problem somehow?
Oops, I had managed to lose an important hunk that deals with detecting excessive clock drift (i.e. badly configured servers) while rebasing a couple of versions back. Here is a version to put it back. With that change, if you disable NTP and manually set your standby's clock to be more than 1.25s (assuming synchronous_replay_lease_time is set to the default of 5s) behind the primary, synchronous_replay should be unavailable and you should see this error in the standby's log:

  ereport(LOG,
          (errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
           errhint("Check your servers' NTP configuration or equivalent.")));

One way to test this without messing with your NTP setting or involving two different computers is to modify this code temporarily in WalSndKeepalive:

  now = GetCurrentTimestamp() + 1250100;

This is a best effort intended to detect a system not running ntpd at all or talking to an insane time server. Fundamentally this proposal is based on the assumption that you can get your system clocks into sync within a tolerance that we feel confident estimating an upper bound for. It does appear that some Amazon OS images come with NTP disabled; that's a problem if you want to use this feature, but if you're running a virtual server without an ntpd you'll pretty soon drift seconds to minutes off UTC time and get "unavailable for synchronous replay" errors from this patch (and possibly the LOG message above, depending on the direction of drift).

> * Also I noticed that some time-related values are hardcoded (e.g. 50%/25%
> time shift when we're dealing with leases). Does it make sense to move them
> out and make them configurable?

These numbers are interrelated, and I think they're best fixed in that ratio. You could make it more adjustable, but I think it's better to keep it simple with just a single knob. Let me restate that logic to explain how I came up with those ratios. There are two goals:

1. The primary needs to be able to wait for a lease to expire if connectivity is lost, even if the standby's clock is behind the primary's clock by max_clock_skew. (Failure to do so could allow stale query results from a zombie standby that is somehow still handling queries after connectivity with the primary is lost.)

2. The primary needs to be able to replace leases often enough so that there are no gaps between them, even if the standby's clock is ahead of the primary's clock by max_clock_skew. (Failure to replace leases fast enough could cause spurious "unavailable" errors, but not incorrect query results. Technically it's max_clock_skew - network latency, since it takes time for new leases to arrive, but that's a minor detail.)

A solution that maximises tolerable clock skew, and as an added bonus doesn't require the standby to have access to the primary's GUCs, is to tell the standby that the expiry time is 25% earlier than the time the primary will really wait until, and to replace leases when they still have 50% of their time to go.

To illustrate using fixed-width ASCII-art, here's how the primary perceives the stream of leases it sends. In this diagram, '+' marks halfway, '!' marks the time the primary will send to the standby as the expiry time, and the final '|' marks the time the primary will really wait until if it has to, just in case the standby's clock is 'slow' (behind). The '!' is 25% earlier, and represents the maximum tolerable clock skew.

  |---+-!-| |---+-!-| |---+-!-|

Here's how a standby with a clock that is 'slow' (behind) by max_clock_skew = 25% perceives this stream of leases:

  |-------! |-------! |-------!

You can see that the primary server is able to wait just long enough for the lease to expire and the error to begin to be raised on this standby server, if it needs to. Here's how a standby with a clock that is 'fast' (ahead) by max_clock_skew = 25% perceives this stream of leases:

  |---! |---! |---!

If it's ahead by more than that, we'll get gaps where the error may be raised spuriously in between leases.
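To restate the same arithmetic as code (only a sketch with illustrative names, not the patch's actual code; I'm assuming the lease GUC is in milliseconds and timestamps are in microseconds):

```
#include "postgres.h"
#include "utils/timestamp.h"

/* Illustrative only; in the patch this is the synchronous_replay_lease_time GUC. */
static int	synchronous_replay_lease_time = 5000;	/* milliseconds */

static void
compute_lease_times(TimestampTz now,
					TimestampTz *advertised_expiry,	/* sent to the standby */
					TimestampTz *real_expiry,		/* primary waits until this */
					TimestampTz *replace_at)		/* send the next lease here */
{
	int64		lease_us = (int64) synchronous_replay_lease_time * INT64CONST(1000);
	int64		max_clock_skew_us = lease_us / 4;	/* the 25% */

	/*
	 * Goal 1: even if the standby's clock is max_clock_skew behind, the
	 * primary can out-wait the lease, because the expiry time we advertise
	 * is 25% earlier than the time we will really wait until.
	 */
	*advertised_expiry = now + lease_us - max_clock_skew_us;
	*real_expiry = now + lease_us;

	/*
	 * Goal 2: even if the standby's clock is max_clock_skew ahead, there is
	 * no gap between leases, because we replace each lease while it still
	 * has 50% of its time to go.
	 */
	*replace_at = now + lease_us / 2;
}
```

With the default of 5s, the tolerated skew works out to 1.25s, which is where the 1.25s in the test recipe above comes from.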
> * Judging from the `SyncReplayPotentialStandby` function, it's possible to have
> `synchronous_replay_standby_names = "*, server_name"`, which is basically
> equivalent to just `*`, but it looks confusing. Is it worth it to prevent
> this behaviour?

Hmm. Seems harmless to me!

> * In the same function `SyncReplayPotentialStandby` there is this code:
>
> ```
> if (!SplitIdentifierString(rawstring, ',', &elemlist))
> {
> /* syntax error in list */
> pfree(rawstring);
> list_free(elemlist);
> /* GUC machinery will have already complained - no need to do again */
> return false;
> }
> ```
>
> Am I right that ideally this (a situation where `synchronous_replay_standby_names`
> has an incorrect value at this point in the code) should not happen, because the
> GUC machinery will prevent it? If so, it seems to me that it still makes sense to
> put a log message here, just to give more information in a potentially weird
> situation.

Yes. That's exactly the coding that was used for synchronous_commit, before it was upgraded to support a new fancy syntax. I was trying to do things the established way.

> * In the function `SyncRepReleaseWaiters` there is a commentary:
>
> ```
> /*
> * If the number of sync standbys is less than requested or we aren't
> * managing a sync standby or a standby in synchronous replay state that
> * blocks then just leave.
> */
> if ((!got_recptr || !am_sync) && !walsender_sr_blocker)
> ```
>
> Is this commentary correct? If I understand everything right, `!got_recptr` means
> the number of sync standbys is less than requested (a), `!am_sync` means we aren't
> managing a sync standby (b), and `walsender_sr_blocker` means a standby in
> synchronous replay state that blocks (c). So the condition looks like `(a or b) and not c`.

This code is trying to decide whether to leave early, rather than potentially blocking. The change in my patch is:

  - if (!got_recptr || !am_sync)
  + if ((!got_recptr || !am_sync) && !walsender_sr_blocker)

The old coding said "if there aren't enough sync commit standbys, or I'm not a sync standby, then I can leave now". The coding with my patch is the same, except that in any case it won't leave early if this walsender is managing a standby that potentially blocks commit. That said, it's a terribly named and documented function argument, so I have fixed that in the attached version; I hope that's better!

To put it another way, with this patch there are two different reasons a transaction might need to wait: because of synchronous_commit and because of synchronous_replay. They're both forms of 'synchronous replication', I suppose. That if statement is saying 'if I don't need to wait for synchronous_commit, and I don't need to wait for synchronous_replay, then we can return early'.
> * In the function `ProcessStandbyReplyMessage` there is code that implements
> this:
>
> ```
> * If the standby reports that it has fully replayed the WAL for at least
> * 10 seconds, then let's clear the lag times that were measured when it
> * last wrote/flushed/applied a WAL record. This way we avoid displaying
> * stale lag data until more WAL traffic arrives.
> ```
>
> but I never found any mention of this 10 seconds in the documentation. Is it
> not that important? Also, question 2 is related to this one.

Hmm. Yeah, that does seem a bit arbitrary. The documentation in master does already say that it's cleared, without saying exactly when:

  [...] If the standby server has entirely caught up with the sending server and there is no more WAL activity, the most recently measured lag times will continue to be displayed for a short time and then show NULL.

The v1 patch changed it from being based on wal_receiver_status_interval (sort of implicitly) to being 10 seconds, and yeah that is not a good change. In this v2 it's using wal_receiver_status_interval (though it has to do that explicitly, since this patch increases the amount of chit-chat between the servers due to lease replacement).

> * In the function `WalSndGetSyncReplayStateString` all the states are in lower
> case except `UNKNOWN`, is there any particular reason for that?

This state should never exist, so no user will ever see it; I used "UNKNOWN" following the convention established by the function WalSndGetStateString() immediately above.

> There are also a few more, less important notes (mostly about some typos and a
> few confusing names), but I'm going to do another round of review and testing
> anyway, so I'll just send them all next time.

Thanks for the review!

-- Thomas Munro http://www.enterprisedb.com
On Mon, Jul 31, 2017 at 5:49 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Here is a version to put it back.

Rebased after conflicting commit 030273b7. Now using format-patch with a commit message to keep track of review/discussion history.

-- Thomas Munro http://www.enterprisedb.com
On Thu, Aug 10, 2017 at 2:02 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Rebased after conflicting commit 030273b7. Now using format-patch
> with a commit message to keep track of review/discussion history.

TAP test 006_logical_decoding.pl failed with that version. I had missed some places that know how to decode the wire protocol messages I modified. Fixed in the attached version. It might be a good idea to consolidate the message encoding/decoding logic into reusable routines, independently of this work.

-- Thomas Munro http://www.enterprisedb.com
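PS: To sketch what I mean by consolidating (invented function names; the layout shown is just the existing 'k' keepalive: end-of-WAL position, send timestamp, reply-requested flag), something along these lines, shared by the sender and all receivers:

```
#include "postgres.h"
#include "access/xlogdefs.h"
#include "libpq/pqformat.h"
#include "utils/timestamp.h"

/* Build a primary keepalive message into buf. */
static void
encode_keepalive_message(StringInfo buf, XLogRecPtr sentPtr,
						 TimestampTz sendTime, bool requestReply)
{
	pq_sendbyte(buf, 'k');
	pq_sendint64(buf, sentPtr);
	pq_sendint64(buf, sendTime);
	pq_sendbyte(buf, requestReply ? 1 : 0);
}

/* Parse the same message; assumes the leading 'k' has already been consumed. */
static void
decode_keepalive_message(StringInfo msg, XLogRecPtr *sentPtr,
						 TimestampTz *sendTime, bool *requestReply)
{
	*sentPtr = pq_getmsgint64(msg);
	*sendTime = pq_getmsgint64(msg);
	*requestReply = (pq_getmsgbyte(msg) != 0);
}
```

That way walsender, walreceiver and the logical decoding paths would share one definition of the message layout, and a change like the one in this patch would only have to be made in one place.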
> On 31 July 2017 at 07:49, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>
>> I looked through the code of `synchronous-replay-v1.patch` a bit and ran a few
>> tests. I didn't manage to break anything, except one mysterious error that I've
>> got only once on one of my replicas, but I couldn't reproduce it yet.
>> Interesting thing is that this error did not affect another replica or primary.
>> Just in case here is the log for this error (maybe you can see something
>> obvious, that I've not noticed):
>>
>> LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
>> Directory not empty
>> ...
>
> Hmm. The first error ("could not remove directory") could perhaps be
> explained by temporary files from concurrent backends.
> ...
> Perhaps in your testing you accidentally copied a pgdata directory over the
> top of it while it was running? In any case I'm struggling to see how
> anything in this patch would affect anything at the REDO level.
Hmm... no, I don't think so. Basically what I was doing was just running
`installcheck` against a primary instance (I assume there is nothing wrong with
this approach, am I right?). This particular error was caused by the `tablespace`
test, which failed in this case:
```
INSERT INTO testschema.foo VALUES(1);
ERROR: could not open file "pg_tblspc/16388/PG_11_201709191/16386/16390": No such file or directory
```
I tried a few more times, and I got it in two out of four attempts on a fresh
installation (when all instances were on the same machine). But anyway I'll try
to investigate; maybe it has something to do with my environment.
> > * Also I noticed that some time-related values are hardcoded (e.g. 50%/25%
> > time shift when we're dealing with leases). Does it make sense to move
> > them out and make them configurable?
>
> These numbers are interrelated, and I think they're best fixed in that
> ratio. You could make it more adjustable, but I think it's better to
> keep it simple with just a single knob.
Ok, but what do you think about converting them to constants to make them more
self-explanatory? Something like:
```
/*
+ * Since this timestamp is being sent to the standby where it will be
+ * compared against a time generated by the standby's system clock, we
+ * must consider clock skew. We use 25% of the lease time as max
+ * clock skew, and we subtract that from the time we send with the
+ * following reasoning:
+ */
+int max_clock_skew = synchronous_replay_lease_time * MAX_CLOCK_SKEW_PORTION;
```
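where the portion itself would be defined once near the top, e.g. (the names are only a suggestion):

```
/* Suggested names only -- the point is to make the 25%/50% ratios self-explanatory. */
#define MAX_CLOCK_SKEW_PORTION		0.25	/* tolerate skew up to 25% of the lease time */
#define LEASE_REPLACEMENT_PORTION	0.5		/* send a replacement lease at 50% remaining */
```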
Also I have another question. I tried to test this patch a little bit more, and
I got some strange behaviour after running pgbench (here is the full output [1]):
```
# primary
$ ./bin/pgbench -s 100 -i test
NOTICE: table "pgbench_history" does not exist, skipping
NOTICE: table "pgbench_tellers" does not exist, skipping
NOTICE: table "pgbench_accounts" does not exist, skipping
NOTICE: table "pgbench_branches" does not exist, skipping
creating tables...
100000 of 10000000 tuples (1%) done (elapsed 0.11 s, remaining 10.50 s)
200000 of 10000000 tuples (2%) done (elapsed 1.06 s, remaining 52.00 s)
300000 of 10000000 tuples (3%) done (elapsed 1.88 s, remaining 60.87 s)
2017-09-30 15:47:26.884 CEST [6035] LOG: revoking synchronous replay lease for standby "walreceiver"...
2017-09-30 15:47:26.900 CEST [6035] LOG: standby "walreceiver" is no longer available for synchronous replay
2017-09-30 15:47:26.903 CEST [6197] LOG: revoking synchronous replay lease for standby "walreceiver"...
400000 of 10000000 tuples (4%) done (elapsed 2.44 s, remaining 58.62 s)
2017-09-30 15:47:27.979 CEST [6197] LOG: standby "walreceiver" is no longer available for synchronous replay
```
```
# replica
2017-09-30 15:47:51.802 CEST [6034] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2017-09-30 15:47:55.154 CEST [6030] LOG: invalid magic number 0000 in log segment 000000010000000000000020, offset 10092544
2017-09-30 15:47:55.257 CEST [10508] LOG: started streaming WAL from primary at 0/20000000 on timeline 1
2017-09-30 15:48:09.622 CEST [10508] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```
Is this something well known, or is it unrelated to the patch itself?
On Sun, Oct 1, 2017 at 9:05 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>> LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
>>> Directory not empty
>>> ...
>>
>> Hmm. The first error ("could not remove directory") could perhaps be
>> explained by temporary files from concurrent backends.
>> ...
>> Perhaps in your testing you accidentally copied a pgdata directory over the
>> top of it while it was running? In any case I'm struggling to see how
>> anything in this patch would affect anything at the REDO level.
>
> Hmm... no, I don't think so. Basically what I was doing was just running
> `installcheck` against a primary instance (I assume there is nothing wrong with
> this approach, am I right?). This particular error was caused by the `tablespace`
> test, which failed in this case:
>
> ```
> INSERT INTO testschema.foo VALUES(1);
> ERROR: could not open file "pg_tblspc/16388/PG_11_201709191/16386/16390": No such file or directory
> ```
>
> I tried a few more times, and I got it in two out of four attempts on a fresh
> installation (when all instances were on the same machine). But anyway I'll try
> to investigate; maybe it has something to do with my environment.
>
> ...
>
> 2017-09-30 15:47:55.154 CEST [6030] LOG: invalid magic number 0000 in log
> segment 000000010000000000000020, offset 10092544

Hi Dmitry,

Thanks for testing. Yeah, it looks like the patch may be corrupting the WAL stream in some case that I didn't hit in my own testing procedure. I will try to reproduce these failures.

-- Thomas Munro http://www.enterprisedb.com
On Sun, Oct 1, 2017 at 10:03 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> I tried a few more times, and I got it in two out of four attempts on a fresh
>> installation (when all instances were on the same machine). But anyway I'll try
>> to investigate; maybe it has something to do with my environment.
>>
>> ...
>>
>> 2017-09-30 15:47:55.154 CEST [6030] LOG: invalid magic number 0000 in log
>> segment 000000010000000000000020, offset 10092544
>
> Hi Dmitry,
>
> Thanks for testing. Yeah, it looks like the patch may be corrupting
> the WAL stream in some case that I didn't hit in my own testing
> procedure. I will try to reproduce these failures.

Hi Dmitry,

I managed to reproduce something like this on one of my home lab machines running a different OS. Not sure why yet, and it doesn't happen on my primary development box, which is why I hadn't noticed it. I will investigate and aim to get a fix posted in time for the Commitfest. I'm also hoping to corner Simon at PGDay Australia in a couple of weeks to discuss this proposal...

-- Thomas Munro http://www.enterprisedb.com
On Sat, Oct 28, 2017 at 6:24 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I managed to reproduce something like this on one of my home lab
> machines running a different OS. Not sure why yet, and it doesn't
> happen on my primary development box, which is why I hadn't noticed it.
> I will investigate and aim to get a fix posted in time for the
> Commitfest. I'm also hoping to corner Simon at PGDay Australia in a
> couple of weeks to discuss this proposal...

This leads me to think that "returned with feedback" is appropriate for now, so I have done it that way. Feel free to correct things if you think this is not appropriate, of course.

-- Michael
On Wed, Nov 29, 2017 at 2:58 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Oct 28, 2017 at 6:24 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> I managed to reproduce something like this on one of my home lab
>> machines running a different OS. Not sure why yet, and it doesn't
>> happen on my primary development box, which is why I hadn't noticed it.
>> I will investigate and aim to get a fix posted in time for the
>> Commitfest. I'm also hoping to corner Simon at PGDay Australia in a
>> couple of weeks to discuss this proposal...
>
> This leads me to think that "returned with feedback" is appropriate for now,
> so I have done it that way. Feel free to correct things if you think this is
> not appropriate, of course.

Thanks. I'll be back.

-- Thomas Munro http://www.enterprisedb.com
On Wed, Nov 29, 2017 at 11:04 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Nov 29, 2017 at 2:58 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Sat, Oct 28, 2017 at 6:24 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> I managed to reproduce something like this on one of my home lab
>>> machines running a different OS. Not sure why yet, and it doesn't
>>> happen on my primary development box, which is why I hadn't noticed it.
>>> I will investigate and aim to get a fix posted in time for the
>>> Commitfest. I'm also hoping to corner Simon at PGDay Australia in a
>>> couple of weeks to discuss this proposal...
>>
>> This leads me to think that "returned with feedback" is appropriate for now,
>> so I have done it that way. Feel free to correct things if you think this is
>> not appropriate, of course.
>
> Thanks. I'll be back.

Please do not target Sarah Connor, then. (Sorry, too many patches.)

-- Michael