Thread: Master-slave visibility order
I'm currently implementing commit sequence number (CSN) based snapshots and I hit a design decision that I would like to resolve before I have too much code to rewrite. The issue is commit visibility ordering on slaves.

As a couple of threads on hackers have already noted, currently commit order on slaves can differ from what is seen on master. This arises from the fact that on master commit visibility is determined by the order of ProcArrayLock acquisition by ProcArrayEndTransaction(). On slaves commit visibility is exactly the order of commit records in WAL. Because XLogInsert() in RecordTransactionCommit() is not interlocked with ProcArrayEndTransaction() these orders can differ. In case of mixed sync and async transactions they in fact are quite likely to differ due to the durability wait in RecordTransactionCommit().

It's not possible to change master commit order to match WAL order because then either async transactions must wait behind sync transactions before returning, losing the point of async; or async transactions must return without becoming visible, changing user visible semantics; or sync transactions must become visible before they become durable, again changing user visible semantics.

As it's not possible to change master commit order, the slave visibility order must change for the orders to be consistent. WAL currently doesn't have the information to reconstruct master commit order. Either we need to add a new WAL record for the commit order (only necessary when wal_level=hot_standby) or add a side channel to replication connections to communicate commit order information.

One more consideration here is the wish expressed by several hackers that commit record LSNs could be used as CSNs. One of the most interesting benefits of this is the property of LSNs being the same over the whole cluster, meaning that it would be relatively simple to create cluster wide consistent snapshots.

I currently see the following courses of action:

1. Do nothing about the inconsistency, use a transient global counter for master commit order and commit record LSN for slaves.
    Pro: doesn't change any semantics
    Con: we are not making any progress towards cluster wide snapshots or even serializable transactions on slaves.

2. Create a new WAL record type that is inserted when a transaction becomes visible. LSN of this record determines transaction visibility order. Async transactions can be optimized to skip this record. This record does not need to be flushed.
    Pro: cluster wide consistency, replication method agnostic
    Con: one extra WAL record insertion per writing transaction. (32 bytes of WAL per tx)

3. Use a transient global counter on master, send xid-csn pairs to slave via a side channel on the replication connection.
    Pro: Less overhead than WAL records
    Con: replication protocol needs (possibly invasive) changes, WAL shipping based replication can't use this mechanism, lots of extra code required.

4. Make the choice between 1 and 2 user configurable (it seems to me that it could even be changed without a restart).

Thoughts?

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
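The ordering mismatch described in the first message can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `Tx` class, `master_orders()` and the tick-based `flush_delay` are hypothetical names invented for the illustration and do not correspond to PostgreSQL code. It assumes commit records enter WAL one tick apart and that only synchronous transactions wait for durability before becoming visible.

```python
# Toy model only: none of these names exist in PostgreSQL.
from dataclasses import dataclass


@dataclass
class Tx:
    xid: int
    synchronous: bool  # True models a transaction that waits for durability


def master_orders(txs, flush_delay=2):
    """Return (wal_order, visibility_order) for commits issued one tick apart.

    Transaction i inserts its commit record at tick i. A synchronous
    transaction becomes visible only after `flush_delay` further ticks
    (the durability wait); an asynchronous one becomes visible at once.
    """
    wal_order = [tx.xid for tx in txs]
    visible_at = []
    for tick, tx in enumerate(txs):
        delay = flush_delay if tx.synchronous else 0
        # Sort key: time of visibility, then WAL position as a tie-breaker.
        visible_at.append((tick + delay, tick, tx.xid))
    visibility_order = [xid for _, _, xid in sorted(visible_at)]
    return wal_order, visibility_order


# A sync transaction commits first, an async one right after it.
txs = [Tx(100, True), Tx(101, False)]
wal_order, master_visible = master_orders(txs)
slave_visible = wal_order  # the slave makes commits visible in WAL order
print(master_visible, slave_visible)
```

In this model the master sees 101 before 100 while the slave sees 100 before 101, which is the inconsistency the options above try to address.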
On Wed, Aug 28, 2013 at 10:58 AM, Ants Aasma <ants@cybertec.at> wrote:
> I currently see the following courses of action:
>
> 1. Do nothing about the inconsistency, use a transient global counter
> for master commit order and commit record LSN for slaves.
>     Pro: doesn't change any semantics
>     Con: we are not making any progress towards cluster wide snapshots
> or even serializable transactions on slaves.
>
> 2. Create a new WAL record type that is inserted when a transaction
> becomes visible. LSN of this record determines transaction visibility
> order. Async transactions can be optimized to skip this record. This
> record does not need to be flushed.
>     Pro: cluster wide consistency, replication method agnostic
>     Con: one extra WAL record insertion per writing transaction. (32
> bytes of WAL per tx)
>
> 3. Use a transient global counter on master, send xid-csn pairs to
> slave via a side channel on the replication connection.
>     Pro: Less overhead than WAL records
>     Con: replication protocol needs (possibly invasive) changes, WAL
> shipping based replication can't use this mechanism, lots of extra
> code required.
>
> 4. Make the choice between 1 and 2 user configurable (it seems to me
> that it could even be changed without a restart).
>
> Thoughts?

I think approach #2 is dead on arrival, at least as a default policy. It essentially amounts to requiring two commit records per transaction rather than one, and I think that has no chance of being acceptable. It's not just or even primarily the *volume* of WAL that I'm concerned about so much as the feeling that hitting WAL twice rather than once at the end of a transaction that may have only written one or two WAL records to begin with is going to slow things down pretty substantially, especially in high-concurrency scenarios.

I wouldn't entirely dismiss the idea of changing the user-visible semantics. In addition to a WAL insertion pointer and a WAL flush pointer, you'd have a WAL snapshot pointer, which could run ahead of the flush pointer if the transactions were all asynchronous, but which for synchronous transactions could not advance faster than the flush pointer. Only users running a mix of synchronous_commit=on and synchronous_commit=off would be harmed, and maybe we could convince ourselves that's OK.

Still, there's no doubt that there is a downside there. Therefore, I'm inclined to suggest that you implement #1. If, at a later time, we want to make progress on the issue of cluster-wide snapshot consistency, you could implement #2 or #3 as an optional feature that can be turned on via some flag. However, I would recommend against trying to do that in the initial patch; I think that doing either #2 or #3 is really a separate feature, and I think if you try to incorporate all of that code into the main CSN patch it's just going to be a distraction from what figures to be a very complicated patch even in minimal form.

If you did choose to implement #2 as an option at some point, it would probably be worth optimizing for the case where commit ordering and visibility ordering match, and try to find a design where you only need the extra WAL record when the orderings don't match. I'm not sure exactly how to do that, but it might be worth investigating. I don't think that's enough to save #2 as a default behavior, but it might make it more palatable as an option.

I agree with what others have said insofar as it would be nifty if we could use the commit LSN as the commit sequence number. But I think you've put your finger on why that's not likely to work out well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
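One possible reading of the "WAL snapshot pointer" idea above, sketched as a toy function. This is an interpretation, not a design taken from the thread: `snapshot_pointer()` and the `(lsn, synchronous)` tuples are hypothetical, and LSNs are plain integers. The assumption is that the pointer advances over commit records in WAL order and stops in front of the first synchronous commit that has not been flushed yet.

```python
# Hypothetical sketch; no such function exists in PostgreSQL.
def snapshot_pointer(commits, flush_ptr):
    """Return the highest LSN whose commits may be visible to snapshots.

    commits:   list of (lsn, synchronous) tuples sorted by lsn, describing
               commit records that have already been inserted into WAL.
    flush_ptr: LSN up to which WAL is known to be durable.
    """
    pointer = flush_ptr  # anything already flushed may be visible
    for lsn, synchronous in commits:
        if lsn <= flush_ptr:
            continue  # already covered by the flush pointer
        if synchronous:
            break  # a sync commit must not become visible before it is durable
        pointer = lsn  # async commits may run ahead of the flush pointer
    return pointer


all_async = [(10, False), (20, False)]
mixed = [(10, False), (20, True), (30, False)]

print(snapshot_pointer(all_async, flush_ptr=0))
print(snapshot_pointer(mixed, flush_ptr=0))
print(snapshot_pointer(mixed, flush_ptr=20))
```

With only asynchronous commits the pointer runs ahead of the flush pointer. In the mixed case the asynchronous commit at LSN 30 stays invisible until the synchronous commit at LSN 20 has been flushed, which is the cost to mixed workloads mentioned above.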
Hi, thanks for your reply.

On Thu, Aug 29, 2013 at 6:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think approach #2 is dead on arrival, at least as a default policy.
> It essentially amounts to requiring two commit records per transaction
> rather than one, and I think that has no chance of being acceptable.
> It's not just or even primarily the *volume* of WAL that I'm concerned
> about so much as the feeling that hitting WAL twice rather than once
> at the end of a transaction that may have only written one or two WAL
> records to begin with is going to slow things down pretty
> substantially, especially in high-concurrency scenarios.

Heikki's excellent work on WAL insert scaling improves this so the hit might not be all that big, considering that the visibility record only needs to be inserted - relatively cheap compared to a WAL sync. But it's still not likely to be free. I guess the only way to know for sure would be to build it and bench it.

> I wouldn't entirely dismiss the idea of changing the user-visible
> semantics. In addition to a WAL insertion pointer and a WAL flush
> pointer, you'd have a WAL snapshot pointer, which could run ahead of
> the flush pointer if the transactions were all asynchronous, but which
> for synchronous transactions could not advance faster than the flush
> pointer. Only users running a mix of synchronous_commit=on and
> synchronous_commit=off would be harmed, and maybe we could convince
> ourselves that's OK.

Do you mean that mixed durability workloads with replication would make async transactions wait or delay the visibility? We have the additional complication of different synchronous_commit levels, so this decision also affects different levels of synchronous commits.

> Still, there's no doubt that there is a downside there. Therefore,
> I'm inclined to suggest that you implement #1. If, at a later time,
> we want to make progress on the issue of cluster-wide snapshot
> consistency, you could implement #2 or #3 as an optional feature that
> can be turned on via some flag. However, I would recommend against
> trying to do that in the initial patch; I think that doing either #2
> or #3 is really a separate feature, and I think if you try to
> incorporate all of that code into the main CSN patch it's just going
> to be a distraction from what figures to be a very complicated patch
> even in minimal form.

I'll go with #1. I agree that snapshot consistency is a separate feature that is mostly orthogonal to CSN snapshots. I wanted to get this decision out of the way, so when it's time to discuss the actual patch we don't have the distraction of discussing why LSNs are not workable for determining visibility order.

> If you did choose to implement #2 as an option at some point, it would
> probably be worth optimizing for the case where commit ordering and
> visibility ordering match, and try to find a design where you only
> need the extra WAL record when the orderings don't match. I'm not
> sure exactly how to do that, but it might be worth investigating. I
> don't think that's enough to save #2 as a default behavior, but it
> might make it more palatable as an option.

Without a side channel the extra WAL record is necessary. Suppose that we want to determine the ordering with a single commit record. The slave must be able to deduce from the single record whether it can make the commit immediately visible or should wait for additional information. If it waits for additional information, that may never come as the master could have committed and then gone idle. If it doesn't wait, then an async transaction could arrive on master, commit and would want to become visible, but the master can't make it visible without either violating the visibility order or letting the async transaction wait behind the sync. In other words, without an oracle (in the computer science sense :) ) the master can't determine at the time of commit record generation if the orderings can differ, and as WAL is the only communication channel, neither can the slave. Timeouts won't help either as that would need clock synchronization between servers, similarly to Google's F1 system.

Speaking of F1, they solve the same problem by having clients be aware of how fresh they want their snapshot to be. If we add this capability then clients aware of this functionality could shift the visibility wait from commit to the start of the next transaction that needs to see the changes.

Regards,
Ants Aasma
On 2013-08-30 00:22:49 +0300, Ants Aasma wrote:
> Hi, thanks for your reply.
>
> On Thu, Aug 29, 2013 at 6:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > I think approach #2 is dead on arrival, at least as a default policy.
> > It essentially amounts to requiring two commit records per transaction
> > rather than one, and I think that has no chance of being acceptable.
> > It's not just or even primarily the *volume* of WAL that I'm concerned
> > about so much as the feeling that hitting WAL twice rather than once
> > at the end of a transaction that may have only written one or two WAL
> > records to begin with is going to slow things down pretty
> > substantially, especially in high-concurrency scenarios.
>
> Heikki's excellent work on WAL insert scaling improves this so the hit
> might not be all that big, considering that the visibility record only
> needs to be inserted - relatively cheap compared to a WAL sync. But
> it's still not likely to be free. I guess the only way to know for
> sure would be to build it and bench it.

FWIW, WAL is still the major bottleneck for INSERT heavy workloads. The per CPU overhead actually minimally increased (at least in my tests), it just scales noticeably better than before.

But I think that actually coordinating a consistent visibility order between commit, wal insertion and the procarray would have bigger scalability impact than the second record. I might be missing some clever tricks here though.

> > If you did choose to implement #2 as an option at some point, it would
> > probably be worth optimizing for the case where commit ordering and
> > visibility ordering match, and try to find a design where you only
> > need the extra WAL record when the orderings don't match. I'm not
> > sure exactly how to do that, but it might be worth investigating. I
> > don't think that's enough to save #2 as a default behavior, but it
> > might make it more palatable as an option.
>
> Without a side channel the extra WAL record is necessary. Suppose that
> we want to determine the ordering with a single commit record. The
> slave must be able to deduce from the single record if it can make the
> commit immediately visible or should it wait for additional
> information. If it waits for additional information, that may never
> come as the master could have committed and then went idle.

Well, we relatively easily could offload the task of sending such information to the bgwriter or similar. I don't think that's a particularly good idea, but it certainly is a possibility.

Andres
Andres Freund <andres@2ndquadrant.com> writes:
> But I think that actually coordinating a consistent visibility order
> between commit, wal insertion and the procarray would have bigger
> scalability impact than the second record. I might be missing some
> clever tricks here though.

Yeah. ISTM the only way to really guarantee that the visible commit order is the same would be for transactions to hold the ProcArrayLock while they're inserting that WAL record. Needless to say, that would be absolutely disastrous performance-wise.

Or at least, that's true as long as we rely on the current procarray-based mechanism for noting that a transaction is still in progress. Maybe there's some other approach altogether.

regards, tom lane
On Fri, Aug 30, 2013 at 12:33 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> FWIW, WAL is still the major bottleneck for INSERT heavy workloads. The
> per CPU overhead actually minimally increased (at least in my tests), it
> just scales noticeably better than before.

Interesting. Do you have any insight into what is behind the CPU overhead? Maybe the solution is to make WAL insertion cheap enough to not matter. That won't be easy, but neither are the alternatives.

Regards,
Ants Aasma
On Fri, Aug 30, 2013 at 12:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> But I think that actually coordinating a consistent visibility order
>> between commit, wal insertion and the procarray would have bigger
>> scalability impact than the second record. I might be missing some
>> clever tricks here though.
>
> Yeah. ISTM the only way to really guarantee that the visible commit
> order is the same would be for transactions to hold the ProcArrayLock
> while they're inserting that WAL record. Needless to say, that would
> be absolutely disastrous performance-wise.
>
> Or at least, that's true as long as we rely on the current procarray-based
> mechanism for noting that a transaction is still in progress. Maybe
> there's some other approach altogether.

This is exactly what I'm working on. Under my scheme snapshots can be taken completely lock free, without consulting the procarray at all, and commits only need to exclude other commits from the moment that visibility order is determined to when it's safe to become visible. If we don't have any constraints on visibility order this is only a matter of looking up the transaction's slot in a shared memory structure and writing the next commit sequence number there. I described the approach in a lot more detail a couple of months ago. [1]

For now I'm going to leave the semantics as is and be content that we will have a better foundation to do something about it later.

[1] http://www.postgresql.org/message-id/CA+CSw_tEpJ=md1zgxPkjH6CWDnTDft4gBi=+P9SnoC+Wy3pKdA@mail.gmail.com

Regards,
Ants Aasma
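The CSN idea in the last message can be sketched roughly as follows. This is a single-threaded toy, not the design from the linked message: `CsnLog` and its methods are hypothetical names, a Python dict stands in for the shared memory structure, and the locking and memory-ordering questions that the real scheme has to solve are left out entirely.

```python
# Hypothetical single-threaded sketch; not PostgreSQL code.
class CsnLog:
    def __init__(self):
        self.next_csn = 1
        self.csn_of = {}  # xid -> CSN; a missing entry means "not committed"

    def commit(self, xid):
        """Record the next commit sequence number in the transaction's slot."""
        csn = self.next_csn
        self.csn_of[xid] = csn
        self.next_csn += 1
        return csn

    def take_snapshot(self):
        """A snapshot is just the current counter value; no procarray scan."""
        return self.next_csn

    def is_visible(self, xid, snapshot):
        """A transaction is visible if it committed before the snapshot."""
        csn = self.csn_of.get(xid)
        return csn is not None and csn < snapshot


log = CsnLog()
log.commit(100)
snap = log.take_snapshot()
log.commit(101)  # commits after the snapshot was taken

print(log.is_visible(100, snap), log.is_visible(101, snap), log.is_visible(102, snap))
```

Taking a snapshot here is a single read of the counter, which is why the scheme does not need to consult the list of running transactions. Visibility order is whatever order the counter values were handed out in, so any constraint on that order has to be enforced in `commit()`.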