Thread: SSI and Hot Standby
Here's an issue for feedback from the community -- do we want to support truly serializable transactions on hot standby machines?

The best way Dan and I have been able to think of to do this is to build on the SERIALIZABLE READ ONLY DEFERRABLE behavior. We are able to obtain a snapshot and then check whether it was taken at a place in the transaction processing where it would be guaranteed to be serializable without participating in predicate locking, rw-conflict detection, etc. If it's not, we block until a READ WRITE transaction completes, and then check again. Repeat. We may reach a point where we determine that the snapshot can't work, in which case we get a new one and start over. Due to the somewhat complex rules for this, you are likely to see a safe snapshot fairly quickly even in a mix which always has short-lived READ WRITE transactions running, although a single long-running READ WRITE transaction can block things until it completes.

The idea is that whenever we see a valid snapshot which would yield a truly serializable view of the data for a READ ONLY transaction, we add a WAL record with that snapshot information. Of course, we might want some limit on how often they are sent, to avoid WAL bloat. A hot standby could just keep the most recently received of these and use it when a SERIALIZABLE transaction is requested. Perhaps DEFERRABLE in this context could mean that it waits for the *next* one and uses it, to assure "freshness".

Actually, we could try to get tricky to avoid sending a complete snapshot by having two WAL messages with no payload -- one would mean "the snapshot you would get now is being tested for serializability". If it failed to reach that state we would send another when we started working on a new snapshot. The other type of message would mean "the snapshot you built when we last told you we were starting to test one is good." I *think* that can work, and it may require less WAL space.

If we don't do something like this, do we just provide REPEATABLE READ on the standby as the strictest level of transaction isolation? If so, do we generate an error on a request for SERIALIZABLE, warn and provide degraded behavior, or just quietly give them REPEATABLE READ behavior?

Thoughts?

-Kevin
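For readers unfamiliar with it, the SERIALIZABLE READ ONLY DEFERRABLE behavior referred to above can be requested on the master roughly as follows; this is only a minimal sketch of the mode the proposal would build on, and the table name is illustrative:

-- A deferrable serializable read only transaction on the master: when the
-- snapshot is first taken, the transaction waits until a snapshot known to be
-- safe is available, after which it runs without predicate locking and with
-- no risk of serialization failure.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE;
SELECT count(*) FROM some_large_table;  -- e.g. a long-running report
COMMIT;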
On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote: > Here's an issue for feedback from the community -- do we want to > support truly serializable transactions on hot standby machines? In this release? Maybe? In later releases? Yes. If it blocks your excellent contribution in this release, then from me, "no". If you can achieve this in this release, yes. However, if this is difficult or complex, then I would rather say "not yet" quickly now, than spend months working out the weirdnesses and possibly still get them wrong. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs <simon@2ndQuadrant.com> wrote: > In this release? Maybe? In later releases? Yes. > > If it blocks your excellent contribution in this release, then > from me, "no". If you can achieve this in this release, yes. > However, if this is difficult or complex, then I would rather say > "not yet" quickly now, than spend months working out the > weirdnesses and possibly still get them wrong. We already have a mechanism for generating a good snapshot, the hard part (for me at least) would be to get that snapshot over to the hot standby and have it use the latest one on a request for a serializable transaction. I have no experience with WAL file output, and don't know what it would take for hot standby to use it as I describe. I agree it's pretty late in the cycle, but I'm going through all the loose ends and found this one -- which has been hanging out on the Wiki page as an R&D item for over a full year without discussion. :-( If we provide the snapshots (which we can safely and easily do), can someone else who knows what they're doing with WAL and HS get the rest of it safely into the release? That seems to me to be the only way it can still happen for 9.1. If not, I agree this can be 9.2 material. We just have to decide how to document it and answer the questions near the bottom of my initial post of the thread. -Kevin
On Wed, Jan 19, 2011 at 8:34 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > I agree it's pretty late in the cycle, but I'm going through all the > loose ends and found this one -- which has been hanging out on the > Wiki page as an R&D item for over a full year without discussion. > :-( If we provide the snapshots (which we can safely and easily > do), can someone else who knows what they're doing with WAL and HS > get the rest of it safely into the release? That seems to me to be > the only way it can still happen for 9.1. I think it's way too late to be embarking on what will probably turn out to be a reasonably complex and possibly controversial new development arc. I don't have a strong position on what we should do instead, but let's NOT do that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin's suggestion seems eminently reasonable to me and probably the best one can do for SSI and hot standby. Pulling it off in time for 9.1 would be a stretch; 9.2 seems quite doable. It's worth noting that one way or another, the semantics of SERIALIZABLE transactions on hot standby replicas could be surprising to some. There's no getting around this; serializability in distributed systems is just a hard problem in general. Either we go with Kevin's suggestion of treating SERIALIZABLE transactions as DEFERRABLE (whether now or for 9.2), causing them to have to use an older snapshot or block until an acceptable snapshot becomes available; or we require them to be downgraded to REPEATABLE READ either implicitly or explicitly. Now, neither of these is as alarming as they might sound, given that replication lag is a fact of life for hot standby systems and REPEATABLE READ is exactly the same as the current (9.0) SERIALIZABLE behavior. But it's definitely something that should be addressed in documentation. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
On Wed, 2011-01-19 at 19:34 -0600, Kevin Grittner wrote: > I agree it's pretty late in the cycle, but I'm going through all the > loose ends and found this one -- which has been hanging out on the > Wiki page as an R&D item for over a full year without discussion. > :-( If we provide the snapshots (which we can safely and easily > do), can someone else who knows what they're doing with WAL and HS > get the rest of it safely into the release? That seems to me to be > the only way it can still happen for 9.1. I gave you a quick response to let you know that HS need not be a blocker, for this release. If you are saying you have knowingly ignored a requirement for a whole year, then I am shocked. How exactly did you think this would ever be committed? -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
Robert Haas wrote: > Kevin Grittner wrote: >> I agree it's pretty late in the cycle, but I'm going through all >> the loose ends and found this one -- which has been hanging out on >> the Wiki page as an R&D item for over a full year without >> discussion. :-( If we provide the snapshots (which we can safely >> and easily do), can someone else who knows what they're doing with >> WAL and HS get the rest of it safely into the release? That seems >> to me to be the only way it can still happen for 9.1. > > I think it's way too late to be embarking on what will probably > turn out to be a reasonably complex and possibly controversial new > development arc. I don't have a strong position on what we should > do instead, but let's NOT do that. If that can't reasonably be done for 9.1, well, my next sentence was: >> If not, I agree this can be 9.2 material. It'd be sweet if it could still happen in 9.1, but hardly a shock if it can't. I didn't want to presume to make the call. Like I said at the start, the alternative is to decide how noisy we want to be about providing snapshot isolation on hot standbys when SERIALIZABLE is requested, and figuring out where to document it. -Kevin
* Simon Riggs (simon@2ndQuadrant.com) wrote: > I gave you a quick response to let you know that HS need not be a > blocker, for this release. If you are saying you have knowingly ignored > a requirement for a whole year, then I am shocked. How exactly did you > think this would ever be committed? Erm, to be perfectly honest, I think the answer is probably "I was busy.", and "no one provided any feedback on *how* to deal with it." Given the amount of work that Kevin's put into this patch (which has been beyond impressive, imv), I have a hard time finding fault with him not getting time to implement a solution for Hot Standby for this. As you say, it's not a blocker, I agree completely with that, regardless of when it was identified as an issue. What we're talking about is right now, and right now is too late to fix it for HS, and to be perfectly frank, fixing it for HS isn't required or even a terribly important factor in if it should be committed. I'll refrain from casting stones about issues brought up nearly a year ago on certain other patches which are apparently not going to include what I, at least, consider extremely important to PG acceptance by others. Thanks, Stephen
Simon Riggs wrote: > I gave you a quick response to let you know that HS need not be a > blocker, for this release. If you are saying you have knowingly > ignored a requirement for a whole year, then I am shocked. How > exactly did you think this would ever be committed? I was asked not to discuss this effort on list for most of that time, and while it was on the Wiki page, I just lost track of it -- not maliciously or intentionally. I really apologize. By the time the 9.0 release was out and it was deemed OK for me to discuss things, I started getting feedback on problems which needed response, and I got into the mode of reacting to that rather than ticking through my issues list. -Kevin
On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote: > If we don't do something like this, do we just provide REPEATABLE > READ on the standby as the strictest level of transaction isolation? > If so, do we generate an error on a request for SERIALIZABLE, warn > and provide degraded behavior, or just quietly give them REPEATABLE > READ behavior? > > Thoughts? Hopefully there is a better option available. We don't want to silently give wrong results. Maybe we should bring back the compatibility GUC? It could throw an error unless the user sets the compatibility GUC to turn "serializable" into "repeatable read". Regards, Jeff Davis
On 20.01.2011 03:05, Kevin Grittner wrote: > If we don't do something like this, do we just provide REPEATABLE > READ on the standby as the strictest level of transaction isolation? > If so, do we generate an error on a request for SERIALIZABLE, warn > and provide degraded behavior, or just quietly give them REPEATABLE > READ behavior? +1 for generating an error. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
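To make the option being voted on here concrete, a request for SERIALIZABLE on a standby might then fail roughly like this; the error wording and the exact point at which it would be raised are only guesses, and the table name is illustrative:

-- On a hot standby, with the error-generating behavior:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM accounts;
-- ERROR:  cannot use serializable mode in a hot standby
-- HINT:  You can use REPEATABLE READ instead.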
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 20.01.2011 03:05, Kevin Grittner wrote: >> If we don't do something like this, do we just provide REPEATABLE >> READ on the standby as the strictest level of transaction >> isolation? If so, do we generate an error on a request for >> SERIALIZABLE, warn and provide degraded behavior, or just quietly >> give them REPEATABLE READ behavior? > > +1 for generating an error.

Before I go do that, I want to be sure everyone is clear about the state of things.

If SSI is used to provide data integrity on the master, it will prevent any serialization anomalies from being persisted on any hot standby *long term*. For example, whenever the standby is at a point in the transaction stream where no read write transactions were active, no anomalies can be observed. (That isn't the *only* time; it's just the simplest one to describe as an example.) Queries on the standby can, however, see *transient* anomalies when they run queries which would cause a serialization failure if run on the master at the same point in the transaction stream. This can only occur when, of two concurrent transactions, the one which *appears* to run second (because the other can't read what it wrote) *commits* first.

The most common and alarming situation where this occurs, in my opinion, is batch processing. This is extremely common in financial applications, and tends to show up in a lot of other places, too. (The receipting query set is an instance of this type of problem, but I'm going to keep it more general in hopes that people can see where it impacts them.) Imagine an application which has some small control record in a table, and inserts to some other table are assigned to a batch based on the control record. The batches are normally identified by ascending dates or serial numbers. Periodically a new batch is opened and the old batch is closed by updating a "current batch id" column in the control table. If the batch ID is updated and the transaction in which that update was executed commits while a transaction which read the old batch ID is still in flight, a read of the database will show that the batch is closed, but if you look at the detail of the batch, it will not yet be complete.

Under SSI, one of these transactions will be canceled to prevent this. Our implementation will always allow the update which closes the batch to complete, and either the insert or the select of the detail will be rolled back with a serialization failure, depending on the timing of the actions inside those transactions. If the insert fails, it can be retried, and will land in the new batch -- making the listing of the closed batch, which omits it, correct. If the listing of the batch details is canceled, it will be because the insert into the old batch committed before it recognized the problem, so an immediate retry of the select will see the complete batch contents.

A hot standby can't really take part in the predicate locking and transaction cancellation on the master. Dan and I have both come to the conclusion that the only reasonable way to allow hot standby to work with SSI is for the WAL (when wal_level = hot_standby) to carry information about which snapshots are guaranteed not to see such a state.
In the above example, barring some throttling mechanism skipping these particular snapshots, or other problematic conflicts around the same time, the master would tell the standby that the snapshot before either of the two problem transactions was OK, and then it would tell them that the snapshot after both had committed was OK. It would not suggest using the snapshot available between the commit of the control record update and the commit of the insert into the batch.

This seems to me to be not completely unrelated to the snapshot synchronization patch. It is clearly closely related to the READ ONLY DEFERRABLE mode, which also looks for a snapshot which is immune to serialization anomalies without predicate locking, conflict detection, transaction cancellation, etc. Melding these two things with hot standby seems to be beyond what can reasonably happen for 9.1 without delaying the release.

If someone is using one feature and not the other, they really don't have a problem. Like anyone else, if a hot standby user has been using SERIALIZABLE mode under 9.0 or earlier, they will need to switch to REPEATABLE READ. A SERIALIZABLE user who doesn't set up hot standby has no issue.

Opinions so far seem to be in favor of reporting an error on the standby if SERIALIZABLE is requested, so that people don't silently get less protection than they expect. The most annoying thing about that is that if the user would *like* to use truly serializable transactions on the standby, and will do so when they get it in 9.2, they must switch to REPEATABLE READ now, and switch back to SERIALIZABLE with the next release.

So, based on a more complete description of the issues, any more opinions on whether to generate the error, as suggested by Heikki? Does anyone think this justifies the compatibility GUC as suggested by Jeff?

It seems to me that this deserves documentation in the MVCC chapter under both the "Serializable Isolation Level" and "Enforcing Consistency With Serializable Transactions" sections. I think it probably deserves a note in the SET TRANSACTION reference page, too. Agreed? Anywhere else?

-Kevin
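To make the batch-processing scenario described above concrete, here is a rough SQL sketch; the schema, names, and values are purely illustrative:

-- Illustrative schema: a control record holding the open batch, plus a detail table.
CREATE TABLE control (current_batch_id int NOT NULL);
CREATE TABLE receipt (batch_id int NOT NULL, amount numeric NOT NULL);

-- Transaction A: add a receipt to the currently open batch.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT current_batch_id FROM control;                -- reads batch 1
--   ... meanwhile transaction B closes the batch and commits first:
--   BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
--   UPDATE control SET current_batch_id = current_batch_id + 1;
--   COMMIT;
INSERT INTO receipt (batch_id, amount) VALUES (1, 100.00);
COMMIT;

-- Transaction C: list the batch that was just closed.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT current_batch_id FROM control;                -- sees that batch 1 is closed
SELECT * FROM receipt WHERE batch_id = 1;            -- expects the complete batch
COMMIT;

-- On the master, SSI cancels either A or C rather than let C see batch 1 as
-- closed but incomplete; a standby running C has no such protection, which is
-- why it should only use snapshots the master has marked as safe.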
Kevin, > So, based on a more complete description of the issues, any more > opinions on whether to generate the error, as suggested by Heikki? If it's a choice between generating an error and letting users see inconsistent data, I'll take the former. > Does anyone think this justifies the compatibility GUC as suggested > by Jeff? I think it might, yes. Since someone could simply turn on the backwards compatibility flag for 9.1 and turn it off for 9.2, rather than trying to mess with transaction states which might be set in application code. Unfortunately, people have not responded to our survey :-( http://www.postgresql.org/community/survey.77 -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote: > The idea is that whenever we see a valid snapshot which would yield > a truly serializable view of the data for a READ ONLY transaction, > we add a WAL record with that snapshot information. You haven't explained why this approach is the way forwards. What other options have been ruled out, and why. The above approach doesn't sound particularly viable to me. It's not clear to me what the reason is that this doesn't just work on HS already. If you started there it might help. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs <simon@2ndQuadrant.com> wrote: > On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote: > >> The idea is that whenever we see a valid snapshot which would >> yield a truly serializable view of the data for a READ ONLY >> transaction, we add a WAL record with that snapshot information. > > You haven't explained why this approach is the way forwards. What > other options have been ruled out, and why. The above approach > doesn't sound particularly viable to me.

Why not? We already generate appropriate snapshots for this in SSI, so is the problem in getting the appropriate information into the WAL stream or in having a request for a snapshot within a serializable transaction while running in hot standby the problem?

> It's not clear to me what the reason is that this doesn't just > work on HS already. If you started there it might help.

Because the standby would need to bombard the server with a stream of predicate lock information, we would need to allow transactions on the master to be canceled due in part to activity on the standby, and I don't even know how you would begin to track read/write conflicts between transactions on two different clusters.

If any of that didn't make sense, it would probably be more efficient for everyone involved if those interested browsed the Overview section of the Wiki page than to have me duplicate its contents in a post. http://wiki.postgresql.org/wiki/Serializable

-Kevin
On Jan21, 2011, at 00:11 , Simon Riggs wrote: > It's not clear to me what the reason is that this doesn't just work on > HS already. If you started there it might help. The problem is that snapshots taken on the master sometimes represent a state of the database which cannot occur under any (valid) serial schedule. Hence, if you use that snapshot to read the *whole* database, you've surely violated serializability. If you read only parts of the database, things may or may not be fine, depending on the parts you read. To have the same stringent guarantees that SERIALIZABLE provides on the master also for queries run against the slave, you somehow need to prevent this. The easiest way is to only use snapshots on the slave which *cannot* produce such anomalies. We already know how to generate such snapshots - SERIALIZABLE READ ONLY DEFERRABLE does exactly that. So the open question is mainly how to transfer such snapshots to the slave, and how often we transmit a new one. best regards, Florian Pflug
I wrote: > Why not? We already generate appropriate snapshots for this in > SSI, so is the problem in getting the appropriate information into > the WAL stream or in having a request for a snapshot within a > serializable transaction while running in hot standby the problem? I dropped a few words. That was supposed to ask whether the problem was in getting hot standby to *use such a snapshot*. I'm open to other suggestions on how else we might do this. I don't see any alternatives, but maybe you're seeing some possibility that eludes me. -Kevin
Simon Riggs <simon@2ndQuadrant.com> writes: > On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote: >> The idea is that whenever we see a valid snapshot which would yield >> a truly serializable view of the data for a READ ONLY transaction, >> we add a WAL record with that snapshot information. > You haven't explained why this approach is the way forwards. What other > options have been ruled out, and why. The above approach doesn't sound > particularly viable to me. I'm pretty concerned about the performance implications, too. In particular that sounds like you could get an unbounded amount of WAL emitted from a *purely read only* transaction flow. Which is not going to fly. regards, tom lane
On Fri, 2011-01-21 at 00:26 +0100, Florian Pflug wrote: > On Jan21, 2011, at 00:11 , Simon Riggs wrote: > > It's not clear to me what the reason is that this doesn't just work on > > HS already. If you started there it might help. > > > The problem is that snapshots taken on the master sometimes represent a > state of the database which cannot occur under any (valid) serial schedule. > Hence, if you use that snapshot to read the *whole* database, you've > surely violated serializability. If you read only parts of the database, > things may or may not be fine, depending on the parts you read. > > To have the same stringent guarantees that SERIALIZABLE provides on the > master also for queries run against the slave, you somehow need to prevent > this. The easiest way is to only use snapshots on the slave which *cannot* > produce such anomalies. We already know how to generate such snapshots - > SERIALIZABLE READ ONLY DEFERRABLE does exactly that. So the open question > is mainly how to transfer such snapshots to the slave, and how often we > transmit a new one. Thank you for explaining a little more. What I'm still not clear on is why it is that HS is different. Whatever rules apply on the master must also apply on the standby, immutably. Why is it we need to pass explicit snapshot information from master to standby? We don't do that, except at startup for normal HS. Why do we need that? I hear, but do not yet understand, that the SSI transaction sequence on the master may differ from the WAL transaction sequence. Is it important that the ordering on the master would differ from the standby? -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
Tom Lane wrote: > I'm pretty concerned about the performance implications, too. In > particular that sounds like you could get an unbounded amount of > WAL emitted from a *purely read only* transaction flow. No. Read only transactions wouldn't create any flow at all. And I suggested that we might want some kind of throttle on how often we generate snapshots even from the read write transactions. I'm not at all clear on how you got to the concerns you have. Is there something in particular I could clear up for you that isn't already mentioned in the previous emails? -Kevin
> What I'm still not clear on is why it is that HS is different. Whatever rules > apply on the master must also apply on the standby, immutably. Why is it > we need to pass explicit snapshot information from master to standby? We > don't do that, except at startup for normal HS. Why do we need that? > > I hear, but do not yet understand, that the SSI transaction sequence on > the master may differ from the WAL transaction sequence. Is it important > that the ordering on the master would differ from the standby?

The logical serializable ordering of transactions in SSI doesn't necessarily match the commit time ordering (i.e. the WAL sequence). For example, with two concurrent transactions, T1 might commit after T2, even though it didn't see the changes made by T2 and thus has to be considered "earlier".

It doesn't matter whether T1 committed before T2 or the other way around, as long as no other transaction can tell the difference. If someone saw the changes made by T2 but not those made by T1, they'd see T2 as happening before T1, violating serializability. Our SSI code ensures that doesn't happen by tracking read dependencies. If it detects that such a read is happening, it rolls back one of the transactions involved.

Now, if we extend this to hot standby, if T2 commits before T1 on the master, it obviously will on the slave too. A transaction run on the slave at the right time might be able to see that T2 has happened but not T1, which is unserializable. If that transaction had been run on the master, then it would have been detected and something would have been rolled back, but the master has no way to know what data is being read on the slave.

What Kevin is suggesting is that we already have a mechanism for identifying snapshots where serialization failures like these will never happen. If we pass that information to the slave and allow it to run transactions only on those snapshots, serializability is safe.

Hopefully that made some more sense...

Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
On Jan21, 2011, at 01:28 , Simon Riggs wrote: > What I'm still not clear on is why it is that HS is different. Whatever rules > apply on the master must also apply on the standby, immutably. Why is it > we need to pass explicit snapshot information from master to standby? We > don't do that, except at startup for normal HS. Why do we need that? > I hear, but do not yet understand, that the SSI transaction sequence on > the master may differ from the WAL transaction sequence. Is it important > that the ordering on the master would differ from the standby?

The COMMIT order in the actual, concurrent, schedule doesn't necessarily represent the order of the transactions in an equivalent serial schedule. Here's an example

T1: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
T1: UPDATE D1 ... ;
T2: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
T2: SELECT * FROM D1 ... ;
T2: UPDATE D2 ... ;
T1: COMMIT;
T3: SELECT * FROM D1, D2;
T2: COMMIT;

Now, the COMMIT order is T1, T3, T2. Let's check if there is an equivalent serial schedule. In any such schedule

T2 must run before T1 because T2 didn't see T1's changes to D1
T3 must run after T1 because T3 did see T1's changes to D1
T3 must run before T2 because T3 didn't see T2's changes to D2

This is obviously impossible - if T3 runs before T2 and T2 runs before T1 then T3 runs before T1, contradicting the second requirement. There is thus no equivalent serial schedule and we must abort one of these transactions with a serialization error.

Note that aborting T3 is sufficient, even though T3 is READ ONLY! With T3 gone, an equivalent serial schedule is T2,T1!

On the master, these "run before" requirements are tracked by remembering which transaction read which parts of the data via the SIREAD-lock mechanism (these are more flags than locks, since nobody ever blocks on them). Since we do not want to report SIREAD locks back to the master, the slave has to prevent this another way. Kevin's proposed solution does that by only using those snapshots on the slave for which reading the *whole* database is safe. The downside is that whether or not a snapshot is safe can only be decided after all concurrent transactions have finished. The snapshot is thus always a bit outdated, but shows a state that is known to be possible in some serial schedule. The very same mechanism can be used on the master also by setting the isolation level to SERIALIZABLE READ ONLY DEFERRABLE.

best regards, Florian Pflug
Tom Lane wrote: > Simon Riggs writes: >> On Wed, 2011-01-19 at 19:05 -0600, Kevin Grittner wrote: >>> The idea is that whenever we see a valid snapshot which would >>> yield a truly serializable view of the data for a READ ONLY >>> transaction, we add a WAL record with that snapshot information. > >> You haven't explained why this approach is the way forwards. What >> other options have been ruled out, and why. The above approach >> doesn't sound particularly viable to me. > > I'm pretty concerned about the performance implications, too. In > particular that sounds like you could get an unbounded amount of > WAL emitted from a *purely read only* transaction flow. Which is > not going to fly. Ah, coming back to this and re-reading, I think I see the point of confusion. The technique we're suggesting is based on the fact that the *standby* is read only. The flow of information about snapshots (which might be done as actual snapshots with xid values, or possibly as marker records saying when a candidate snapshot is being considered and when the last one has been found acceptable) would be from the master *to* the standby, based on *read write transactions on the master*. They would be informing the slave of what would be a snapshot guaranteed not to see a serialization anomaly to a read only transaction. As I mentioned in another email, we might want to throttle this. My thinking was that we could start a timer on capturing a snapshot, and continue to gather new ones as they become available. When you hit the timer limit (maybe 100ms?) you send the latest snapshot, if you have a new one; otherwise you keep trying and send one as soon as you get it. -Kevin
On Thu, Jan 20, 2011 at 8:54 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > As I mentioned in another email, we might want to throttle this. My > thinking was that we could start a timer on capturing a snapshot, and > continue to gather new ones as they become available. When you hit > the timer limit (maybe 100ms?) you send the latest snapshot, if you > have a new one; otherwise you keep trying and send one as soon as you > get it. I think this is likely to suck. That's introducing 10 not-small XLOG records per second just in case someone happens to try to start a serializable transaction on a standby server. A possibly-viable alternative would be to build something into the SR protocol to allow the standby to request a workable snapshot from the master, and the master to send it (out-of-band with respect to the WAL stream) when so requested. Then it wouldn't work if you lose the connection to the master, but maybe that's OK. Even with that, it seems like there could be starvation problems - is there an upper bound on the length of time it would take the master to generate a safe snapshot for the standby to use? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > Kevin Grittner wrote: >> As I mentioned in another email, we might want to throttle this. >> My thinking was that we could start a timer on capturing a >> snapshot, and continue to gather new ones as they become >> available. When you hit the timer limit (maybe 100ms?) you send >> the latest snapshot, if you have a new one; otherwise you keep >> trying and send one as soon as you get it. > > I think this is likely to suck. That's introducing 10 not-small > XLOG records per second just in case someone happens to try to > start a serializable transaction on a standby server. That depends on whether we can pull off the idea for not sending the snapshot itself which I mentioned. But that idea is pretty sketchy at the moment. I can't swear we can make that work, but if we can, it should use a lot less WAL space. > A possibly-viable alternative would be to build something into the > SR protocol to allow the standby to request a workable snapshot > from the master, and the master to send it (out-of-band with > respect to the WAL stream) when so requested. If we can make that work, that has advantages. > it seems like there could be starvation problems - is there an > upper bound on the length of time it would take the master to > generate a safe snapshot for the standby to use? Unfortunately, to get a safe snapshot you need to grab a candidate snapshot and wait for all serializable read write transactions which were active at the time to complete. At a minimum. If any of them develop the wrong pattern of rw-dependencies you have to discard it, grab a new snapshot, and try again. I suspect that most of the time you will succeed on the first snapshot, and so will be able to call it safe when the last concurrent serializable read write transaction completes, but that *could* be a long time, and there is no upper bound. That's why I was suggesting that we try to keep a fairly current safe snapshot sitting on the standby and use it when a serializable transaction is requested. Unless you request DEFERRABLE, in which case you would wait for the *next* one to arrive. I keep looking for another angle on this, but I'm not finding it. I've *thought* I had something a couple times this evening while tossing it around in my head, but the ideas fall apart on closer inspection. :-( -Kevin
On 21.01.2011 03:19, Dan Ports wrote: >> What I'm still not clear on is why it is that HS is different. Whatever rules >> apply on the master must also apply on the standby, immutably. Why is it >> we need to pass explicit snapshot information from master to standby? We >> don't do that, except at startup for normal HS. Why do we need that? >> >> I hear, but do not yet understand, that the SSI transaction sequence on >> the master may differ from the WAL transaction sequence. Is it important >> that the ordering on the master would differ from the standby? > > The logical serializable ordering of transactions in SSI doesn't > necessarily match the commit time ordering (i.e. the WAL sequence). For > example, with two concurrent transactions, T1 might commit after T2, > even though it didn't see the changes made by T2 and thus has to be > considered "earlier". > > It doesn't matter whether T1 committed before T2 or the other way > around, as long as no other transaction can tell the difference. If > someone saw the changes made by T2 but not those made by T1, they'd see > T2 as happening before T1, violating serializability. Our SSI code > ensures that doesn't happen by tracking read dependencies. If it > detects that such a read is happening, it rolls back one of the > transactions involved. > > Now, if we extend this to hot standby, if T2 commits before T1 on the > master, it obviously will on the slave too. A transaction run on the > slave at the right time might be able to see that T2 has happened but > not T1, which is unserializable. If that transaction had been run on the > master, then it would have been detected and something would have been > rolled back, but the master has no way to know what data is being read > on the slave. We have enough information in the standby to reconstruct all writes done in the master. I gather that's not enough, in order to roll back read-only transaction T3 on the standby which would see an anomaly, we'd also need to know what reads T1 and T2 did in the master. Is that correct? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Jan 21, 2011 at 08:44:59AM +0200, Heikki Linnakangas wrote: > We have enough information in the standby to reconstruct all writes done > in the master. I gather that's not enough, in order to roll back > read-only transaction T3 on the standby which would see an anomaly, we'd > also need to know what reads T1 and T2 did in the master. Is that correct?

That's some of the information we need, but it's not enough...

The problem is that the conflict might not be discovered until after T3 (the reader) commits. In that case, it's too late to abort T3, so you'd need to roll back T2 instead. But that means a read-only transaction on the slave has to be able to cause a concurrent read-write transaction on the master to abort, which brings with it no end of problems.

To make that a little more concrete, let me borrow Kevin's favorite batch processing example...

[master] T2: BEGIN
[master] T2: SELECT FROM control
[master] T1: BEGIN
[master] T1: UPDATE control
[master] T1: COMMIT
[slave]  T3: BEGIN
[slave]  T3: SELECT FROM control, receipt
[slave]  T3: COMMIT
[master] T2: INSERT INTO receipt
[master] T2: COMMIT

If this all happened at the master, T2 would get rolled back when it tries to do its INSERT. (I just tried it.) But if T3 happened on the slave, the master doesn't know that it read both tables, nor does the slave know at the time it's executing T3 that it's going to conflict with T2.

Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
On Fri, 2011-01-21 at 02:32 -0500, Dan Ports wrote: > On Fri, Jan 21, 2011 at 08:44:59AM +0200, Heikki Linnakangas wrote: > > We have enough information in the standby to reconstruct all writes done > > in the master. I gather that's not enough, in order to roll back > > read-only transaction T3 on the standby which would see an anomaly, we'd > > also need to know what reads T1 and T2 did in the master. Is that correct? > > That's some of the information we need, but it's not enough... > > The problem is that the conflict might not be discovered until after T3 > (the reader) commits. In that case, it's too late to abort T3, so you'd > need to roll back T2 instead. But that means a read-only transaction on > the slave has to be able to cause a concurrent read-write transaction > on the master to abort, which brings with it no end of problems. So T1 and T2 are already potentially unserialized and the presence of T3 causes the sequence to be "caught out" from which we must then abort T2. The witness does not create the offence, they just report the crime. So any xid that commits in a different sequence to the order in which the xid was assigned creates a potential for unserialization? Or? On HS we know the order of arrival of xids, and we know the order of commits, so we should be able to work out which are the potentially unserializable snapshots. That would allow us to make the standby independent of the master, thereby avoiding all this messy information flow. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On 21.01.2011 11:10, Simon Riggs wrote: > So any xid that commits in a different sequence to the order in which > the xid was assigned creates a potential for unserialization? Or? It's not the order in which the xid was assigned that matters, but the order the transactions started and got their snapshots. The xids might be assigned a lot later, after the transactions have already read data. > On HS we know the order of arrival of xids, and we know the order of > commits, so we should be able to work out which are the potentially > unserializable snapshots. That would allow us to make the standby > independent of the master, thereby avoiding all this messy information > flow. Unfortunately although we know the order of arrival of xids, it doesn't tell us the order the transactions started. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2011-01-21 at 11:19 +0200, Heikki Linnakangas wrote: > On 21.01.2011 11:10, Simon Riggs wrote: > > So any xid that commits in a different sequence to the order in which > > the xid was assigned creates a potential for unserialization? Or? > > It's not the order in which the xid was assigned that matters, but the > order the transactions started and got their snapshots. The xids might > be assigned a lot later, after the transactions have already read data. So if a read-write transaction assigns an xid before it takes a snapshot then we'll be OK? That seems much easier to arrange than passing chunks of snapshot data backwards and forwards. Optionally. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On 01/21/2011 03:25 AM, Florian Pflug wrote:
> The COMMIT order in the actual, concurrent, schedule doesn't necessarily
> represent the order of the transactions in an equivalent serial schedule. Here's
> an example
>
> T1: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
> T1: UPDATE D1 ... ;
> T2: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
> T2: SELECT * FROM D1 ... ;
> T2: UPDATE D2 ... ;
> T1: COMMIT;
> T3: SELECT * FROM D1, D2;
> T2: COMMIT;
>
> Now, the COMMIT order is T1, T3, T2. Let's check if there is an equivalent
> serial schedule. In any such schedule
>
> T2 must run before T1 because T2 didn't see T1's changes to D1
> T3 must run after T1 because T3 did see T1's changes to D1
> T3 must run before T2 because T3 didn't see T2's changes to D2
>
> This is obviously impossible - if T3 runs before T2 and T2 runs before T1
> then T3 runs before T1, contradicting the second requirement. There is thus
> no equivalent serial schedule and we must abort one of these transactions with
> a serialization error.
>
> Note that aborting T3 is sufficient, even though T3 is READ ONLY! With T3 gone,
> an equivalent serial schedule is T2,T1!

Sorry for bothering all of you, but I just don't get this. What if T2 rolls back instead of committing? Then the snapshot of T3 would have been valid, right? Now, for the snapshot of T3 it doesn't matter if T2 commits or if it doesn't, because it can't see the changes of T2 in any case. Thus, it would seem that the snapshot is valid. On the other hand I can't see anything wrong in the logic in your post. What am I missing? I am feeling stupid...

At least for dumps I don't see how T2 can matter (assuming T3 is the pg_dump's snapshot). Because if you reload from the dump, T2 never happened in that dump. In the reloaded database it just did not exist at all.

- Anssi
On Jan21, 2011, at 12:55 , Anssi Kääriäinen wrote:
> On 01/21/2011 03:25 AM, Florian Pflug wrote:
>> The COMMIT order in the actual, concurrent, schedule doesn't necessarily
>> represent the order of the transactions in an equivalent serial schedule. Here's
>> an example
>>
>> T1: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
>> T1: UPDATE D1 ... ;
>> T2: BEGIN SERIALIZABLE; -- (Assume snapshot is set here)
>> T2: SELECT * FROM D1 ... ;
>> T2: UPDATE D2 ... ;
>> T1: COMMIT;
>> T3: SELECT * FROM D1, D2;
>> T2: COMMIT;
>>
>> Now, the COMMIT order is T1, T3, T2. Let's check if there is an equivalent
>> serial schedule. In any such schedule
>>
>> T2 must run before T1 because T2 didn't see T1's changes to D1
>> T3 must run after T1 because T3 did see T1's changes to D1
>> T3 must run before T2 because T3 didn't see T2's changes to D2
>>
>> This is obviously impossible - if T3 runs before T2 and T2 runs before T1
>> then T3 runs before T1, contradicting the second requirement. There is thus
>> no equivalent serial schedule and we must abort one of these transactions with
>> a serialization error.
>>
>> Note that aborting T3 is sufficient, even though T3 is READ ONLY! With T3 gone,
>> an equivalent serial schedule is T2,T1!
> Sorry for bothering all of you, but I just don't get this. What if T2 rolls back instead of committing? Then the snapshot of T3 would have been valid, right?

Yeah. If T2 is removed from the picture, the only remaining ordering constraint is "T3 must run after T1 because T3 did see T1's changes to D1", and thus T1,T3 is an equivalent serial schedule.

> Now, for the snapshot of T3 it doesn't matter if T2 commits or if it doesn't, because it can't see the changes of T2 in any case. Thus, it would seem that the snapshot is valid. On the other hand I can't see anything wrong in the logic in your post. What am I missing? I am feeling stupid...

The problem is that T3 sees the effects of T1 but not those of T2. Since T2 must run *before* T1 in any equivalent serial schedule, that is impossible. In other words, if you look at an equivalent serial schedule of the *writing* transactions T1 and T2, you won't find a place to insert T3 such that it gives the same answer as in the concurrent schedule.

It isn't really T3's snapshot that is invalid, it's the interleaving of T1,T2,T3 because there is no equivalent serial schedule (a serial schedule that produces the same results). If, for example, T3 reads only *one* of the tables D1,D2 then the whole thing suddenly *is* serializable! If T3 reads only D1 an equivalent serial schedule must run T3 after T1, and if it reads only D2 then it must run before T2.

That "validity" of snapshots comes into play if you attempt to distinguish safe and unsafe interleaved schedules *without* taking the dataset inspected by T3 into account. So you simply assume that T3 reads the *whole* database (since that's the worst case), and that it must thus run *after* all transactions it saw as COMMITTED in any serial schedule.

The root of the whole issue is that this might not be possible! Some not-yet-committed transaction (T2 in the example) may have to be placed *before* some transaction seen as COMMITTED by T3 (T1 in our example). Since T3 needs to run *after* T1 (since it saw it as committed), it'd thus also have to see T2 in any serial schedule. But it didn't see T2 in the interleaved schedule, so we're hosed.

> At least for dumps I don't see how T2 can matter (assuming T3 is the pg_dump's snapshot). Because if you reload from the dump, T2 never happened in that dump. In the reloaded database it just did not exist at all.

Still, the dump would reflect a database state that *logically* never existed (i.e. not in any serial schedule). If you dump for disaster recovery, you might not care. If you dump to copy the data onto some reporting server you might.

best regards, Florian Pflug
On Jan21, 2011, at 10:19 , Heikki Linnakangas wrote: > On 21.01.2011 11:10, Simon Riggs wrote: >> So any xid that commits in a different sequence to the order in which >> the xid was assigned creates a potential for unserialization? Or? > > It's not the order in which the xid was assigned that matters, but the order the transactions started and got their snapshots. The xids might be assigned a lot later, after the transactions have already read data.

Any pair of concurrent transactions on the master between which r/w-dependencies exist are a potential risk. If their order in all equivalent serial schedules doesn't match their commit order, any snapshot taken between the two commits doesn't represent a fully consistent view of the database.

Hm, wait a minute... Thinks...

Since transactions on the slave don't write, they really don't need to be aware of any SIREAD locks taken on the master, right? We'd still need to detect conflicts between SIREAD locks taken on the slaves and writes by transactions on the master, but that could be handled by the recovery process without having to report anything back to the master, and without logging SIREAD lock acquisitions.

So, how about the following

A) We log r/w-dependencies between transactions committed on the master in the WAL, probably in the COMMIT record

B) SERIALIZABLE queries on the slave use the SIREAD lock machinery like they'd do on the master. The recovery process does the necessary conflict flagging in case the write happens (in wall clock time) after the read on the slave, mimicking what the writing transaction had done on the master had it seen the SIREAD lock

C) By using the r/w-dependency information from the WAL plus the r/w-dependency information generated on the slave we can detect dangerous situations on the slave, and abort the offending query on the slave.

(A) and (B) seem quite straight-forward. Regarding (C), I'm not familiar enough with the inner workings of the SSI patch to judge that.

best regards, Florian Pflug
2011/1/21 Anssi Kääriäinen <anssi.kaariainen@thl.fi>: > Sorry for bothering all of you, but I just don't get this. What if T2 rolls > back instead of committing? Then the snapshot of T3 would have been valid, > right? Now, for the snapshot of T3 it doesn't matter if T2 commits or if it > doesn't, because it can't see the changes of T2 in any case. Thus, it would > seem that the snapshot is valid. On the other hand I can't see anything > wrong in the logic in your post. What am I missing? I am feeling stupid... > > At least for dumps I don't see how T2 can matter (assuming T3 is the > pg_dump's snapshot). Because if you reload from the dump, T2 never happened > in that dump. In the reloaded database it just did not exist at all. This has been discussed before; in [1] I summarized: "IOW, one could say that the backup is consistent only if it were never compared against the system as it continued running after the dump took place." Nicolas [1] <URL:http://archives.postgresql.org/pgsql-hackers/2010-09/msg01763.php>
Heikki Linnakangas wrote: > It's not the order in which the xid was assigned that matters, but > the order the transactions started and got their snapshots. The > xids might be assigned a lot later, after the transactions have > already read data.

From the "Apparent Serial Order of Execution" sub-section of the Overview section of the Wiki page: http://wiki.postgresql.org/wiki/Serializable#Apparent_Serial_Order_of_Execution

| The order in which transactions appear to have executed is
| determined by something more subtle than in S2PL: read/write
| dependencies. If a transaction attempts to read data which is not
| visible to it because the transaction which wrote it (or will later
| write it) is concurrent (one of them was running when the other
| acquired its snapshot), then the reading transaction appears to
| have executed first, regardless of the actual sequence of
| transaction starts or commits (since it sees a database state prior
| to that in which the other transaction leaves it).

It is the fact that it generates results consistent with some serial order of execution which makes this truly serializable. With SSI that order isn't tied to transaction start or transaction commit when transactions overlap -- the one which *appears* to have run first might have neither started first *nor* committed first. And as Dan pointed out with a repeatable example, a read only transaction can contribute to a cycle in the apparent order of execution, even if it commits ahead of one of the writers.

There really are only two ways a read only transaction can be guaranteed to see consistent data under SSI:

(1) A read write transaction might need to be cancelled to prevent the view of the data a committed read only transaction has already seen from becoming inconsistent. (Dan's example)

(2) The read only transaction needs to get a snapshot which the master has determined is safe.

There really aren't any other alternatives under SSI. The question is which we support on hot standbys and how we achieve it. Or whether we just don't, I guess; but that doesn't seem very satisfactory as a long term solution.

-Kevin
On Fri, Jan 21, 2011 at 8:05 AM, Nicolas Barbier <nicolas.barbier@gmail.com> wrote: > 2011/1/21 Anssi Kääriäinen <anssi.kaariainen@thl.fi>: > >> Sorry for bothering all of you, but I just don't get this. What if T2 rolls >> back instead of committing? Then the snapshot of T3 would have been valid, >> right? Now, for the snapshot of T3 it doesn't matter if T2 commits or if it >> doesn't, because it can't see the changes of T2 in any case. Thus, it would >> seem that the snapshot is valid. On the other hand I can't see anything >> wrong in the logic in your post. What am I missing? I am feeling stupid... >> >> At least for dumps I don't see how T2 can matter (assuming T3 is the >> pg_dump's snapshot). Because if you reload from the dump, T2 never happened >> in that dump. In the reloaded database it just did not exist at all. > > This has been discussed before; in [1] I summarized: > > "IOW, one could say that the backup is consistent only if it were > never compared against the system as it continued running after the > dump took place." But that's a pretty fair way to look at it, isn't it? I mean, I guess it's a question of what you plan to use that backup for, but if it's disaster recovery, everything that happened after the dump is gone, so no such comparison will occur. And that's probably the most common reason for taking a dump. It occurs to me that focusing on how this is going to work on Hot Standby might be looking at the question too narrowly. The general issue is - does this technique generalize to a distributed computing environment, with distributed transactions across multiple PostgreSQL databases? For example, what if the control record in Kevin's example is stored in another database, or on another server. Or what if some tables are replicated via Slony? I realize this is all outside the scope of the patch, but that's exactly the point: making this stuff work across multiple databases (even if they are replicas of each other) is much more complex than getting it to work on just one machine. Even if we could agree on how to do it, coming up with some hack that can only ever possibly work in the Hot Standby case might not be the best thing to do. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > (1) A read write transaction might need to be canceled to > prevent the view of the data a committed read only transaction has > already seen from becoming inconsistent. (Dan's example) And this one seems entirely a theoretical possibility. I spent a little time looking it over, and I don't see how it could be made to work from hot standbys without an unbounded flow of predicate lock information from all standbys to the master *plus* blocking commits on the master for the duration of the longest round trip latency to any standby. I think we can call this one dead on arrival. -Kevin
On 01/21/2011 02:21 PM, Florian Pflug wrote: > Still, the dump would reflect a database state that *logically* never existed (i.e. not in any serial schedule). If you dump for disaster recovery, you might not care. If you dump to copy the data onto some reporting server you might. > > best regards, > Florian Pflug

I am beginning to understand the problem. If you don't mind, here is a complete example if somebody else is having trouble understanding this.

Let's say we have tables D1 and D2. Both contain a single column, id, and a single row. The data in the beginning is as follows:

D1: id = 1
D2: id = 1

The constraints: D1.id can only be incremented. Whenever D2.id is updated, it must be updated to D1.id + 1.

The transactions:

T1: begin; update D1 set id = id + 1;
T2: begin; update D2 set id = (select id+1 from D1);
T1: commit;
T3: begin; select id from D1; select id from D2; commit;
    Data seen: (2, 1) -- this is a possible state
T2: commit;
T4: begin; select id from D1; select id from D2; commit;
    Data seen: (2, 2)

This is again a possible state. But if we compare this to the state seen by T3 this is not valid. From state (2, 1) we can not get to state (2, 2) without breaking one of the constraints. Thus, the state of T3 is not valid in the database.

So, I finally got it! :-) I hope this example will help somebody else understand the problem.

The problem I had understanding this was that the state in T3 is in fact perfectly valid. I thought that there must be some problem with that state alone. There isn't, unless you compare it to the state after T2 has committed.

Thanks to all for explaining this to me,

- Anssi
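Anssi's example can be written out as a runnable sequence across three sessions; the table names and the exact point at which the failure would be raised on the master are illustrative:

-- Setup: single-row tables as in the example above.
CREATE TABLE d1 (id int);  INSERT INTO d1 VALUES (1);
CREATE TABLE d2 (id int);  INSERT INTO d2 VALUES (1);

-- Session 1 (T1)
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE d1 SET id = id + 1;

-- Session 2 (T2)
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE d2 SET id = (SELECT id + 1 FROM d1);   -- still sees d1.id = 1, so writes 2

-- Session 1 (T1)
COMMIT;

-- Session 3 (T3, read only)
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT id FROM d1;   -- 2
SELECT id FROM d2;   -- 1
COMMIT;

-- Session 2 (T2)
COMMIT;

-- On the master, SSI detects the dangerous structure and rolls back either T2
-- or T3 with a serialization failure, so no reader ever observes the invalid
-- succession of states (2, 1) then (2, 2); a hot standby running T3 would
-- have no way to know.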
Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 21, 2011 at 8:05 AM, Nicolas Barbier >> This has been discussed before; in [1] I summarized: >> >> "IOW, one could say that the backup is consistent only if it were >> never compared against the system as it continued running after >> the dump took place." > > But that's a pretty fair way to look at it, isn't it? I mean, I > guess it's a question of what you plan to use that backup for, but > if it's disaster recovery, everything that happened after the dump > is gone, so no such comparison will occur. And that's probably > the most common reason for taking a dump.

It's not, however, a reason for having a hot standby (versus a warm standby or PITR backup).

> It occurs to me that focusing on how this is going to work on Hot > Standby might be looking at the question too narrowly. The > general issue is - does this technique generalize to a distributed > computing environment, with distributed transactions across > multiple PostgreSQL databases?

No, and I can pretty much guarantee that you can't have such a solution without blocking on all masters at commit time. What you're suggesting goes *way* beyond two phase commit, which just guarantees the integrity rules of each database are honored and that all transactions either commit or don't. You're talking about sharing lock information across high-latency communication links which in SSI are communicated via LW locking. Expect any such generalized "and world peace!" solution to be rather slow.

> For example, what if the control record in Kevin's example is > stored in another database, or on another server. Or what if some > tables are replicated via Slony? I realize this is all outside > the scope of the patch

Yep. Again, the patch achieves true serializability with minimal cost and *no blocking*. Spend a few minutes thinking about how you might coordinate what you propose, and you'll see it's going to involve blocking based on waiting for messages from across the wire.

> but that's exactly the point: making this stuff work across > multiple databases (even if they are replicas of each other) is > much more complex than getting it to work on just one machine. > Even if we could agree on how to do it, coming up with some hack > that can only ever possibly work in the Hot Standby case might not > be the best thing to do.

I don't see it as a hack. It's the logical extension of SSI onto read only replicas. If you're looking for something more than that (as the above suggests), it's not a good fit; but I suspect that there are people besides me who would want to use hot standby for reporting and read only web access who would want a serializable view.

What this proposal does is to say that there are two time streams to look at on the standby -- how far along you are for purposes of recovery, and how far along you are for purposes of seeing a view of the data sure to be consistent with the later state of the master. With SSI they can't be the same. If someone wants them to be, they could implement a traditional S2PL serializable mode, complete with blocking and deadlocks, and then you'd have it automatically on the replicas, because with S2PL the apparent order of execution matches the commit order.

-Kevin
2011/1/21 Robert Haas <robertmhaas@gmail.com>: > On Fri, Jan 21, 2011 at 8:05 AM, Nicolas Barbier > <nicolas.barbier@gmail.com> wrote: > >> 2011/1/21 Anssi Kääriäinen <anssi.kaariainen@thl.fi>: >> >>> Sorry for bothering all of you, but I just don't get this. What if T2 rolls >>> back instead of committing? Then the snapshot of T3 would have been valid, >>> right? Now, for the snapshot of T3 it doesn't matter if T2 commits or if it >>> doesn't, because it can't see the changes of T2 in any case. Thus, it would >>> seem that the snapshot is valid. On the other hand I can't see anything >>> wrong in the logic in your post. What am I missing? I am feeling stupid... >>> >>> At least for dumps I don't see how T2 can matter (assuming T3 is the >>> pg_dump's snapshot). Because if you reload from the dump, T2 never happened >>> in that dump. In the reloaded database it just did not exist at all. >> >> This has been discussed before; in [1] I summarized: >> >> "IOW, one could say that the backup is consistent only if it were >> never compared against the system as it continued running after the >> dump took place." > > But that's a pretty fair way to look at it, isn't it? Indeed, I just wanted to point Anssi to the previous discussion. > It occurs to me that focusing on how this is going to work on Hot > Standby might be looking at the question too narrowly. The general > issue is - does this technique generalize to a distributed computing > environment, with distributed transactions across multiple PostgreSQL > databases? For example, what if the control record in Kevin's example > is stored in another database, or on another server. Or what if some > tables are replicated via Slony? I realize this is all outside the > scope of the patch, but that's exactly the point: making this stuff > work across multiple databases (even if they are replicas of each > other) is much more complex than getting it to work on just one > machine. Even if we could agree on how to do it, coming up with some > hack that can only ever possibly work in the Hot Standby case might > not be the best thing to do. You seem to be questioning whether the normal (?) way of using 2PC on multiple DBs (just start transactions that all use SERIALIZABLE isolation mode, and let 2PC coordinate the commits) always results in globally serializable behavior. I must say that I don't immediately see the answer (my gut feeling says "nope"), but it sure is an interesting question. In the special case where all databases use SS2PL as their concurrency mechanism, the answer is "yes". (Because any "possibly conflicting" local transactions of any global transactions that touch the same tables ("conflict") will necessarily commit in some globally consistent order, which in the case of SS2PL is consistent with the "must-have-happened-before-in-any-serialized-order" relation, i.e., using the terminology from the literature: "because SS2PL is CO".) I don't know whether it is generally considered a requirement for a 2PC environment to guarantee global serializability. Nicolas
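To make the setup in question concrete, here is a minimal sketch of that "normal" pattern: two independent PostgreSQL databases, each local transaction at SERIALIZABLE, with an external coordinator driving two-phase commit. This assumes max_prepared_transactions > 0; the database names, table names, and the GID are hypothetical. Nothing in this schedule coordinates the serialization order *between* the two databases, which is exactly the open question:

    -- on database "inventory" (one local branch of the global transaction)
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    UPDATE stock SET qty = qty - 1 WHERE item = 42;
    PREPARE TRANSACTION 'gtx-1001';          -- phase one

    -- on database "orders" (the other local branch)
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    INSERT INTO orders (item, qty) VALUES (42, 1);
    PREPARE TRANSACTION 'gtx-1001';          -- phase one

    -- the coordinator decides the global outcome, then runs phase two
    -- on database "inventory":
    COMMIT PREPARED 'gtx-1001';
    -- on database "orders":
    COMMIT PREPARED 'gtx-1001';

Each database enforces serializability for its own local transactions; whether the combination yields a single global serial order is what Nicolas is asking about.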
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2011-01-21 at 11:19 +0200, Heikki Linnakangas wrote: >> It's not the order in which the xid was assigned that matters, but the >> order the transactions started and got their snapshots. The xids might >> be assigned a lot later, after the transactions have already read data. > So if a read-write transaction assigns an xid before it takes a snapshot > then we'll be OK? That seems much easier to arrange than passing chunks > of snapshot data backwards and forwards. Optionally. No, that idea is DOA from a performance standpoint. We sweated blood to avoid having to assign XIDs to read-only transactions, and we're not going back. If SSI requires that, SSI is not getting committed. regards, tom lane
On Fri, Jan 21, 2011 at 10:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Fri, 2011-01-21 at 11:19 +0200, Heikki Linnakangas wrote: >>> It's not the order in which the xid was assigned that matters, but the >>> order the transactions started and got their snapshots. The xids might >>> be assigned a lot later, after the transactions have already read data. > >> So if a read-write transaction assigns an xid before it takes a snapshot >> then we'll be OK? That seems much easier to arrange than passing chunks >> of snapshot data backwards and forwards. Optionally. > > No, that idea is DOA from a performance standpoint. We sweated blood to > avoid having to assign XIDs to read-only transactions, and we're not > going back. If SSI requires that, SSI is not getting committed. So far I think all of the ideas proposed for generalizing this across the master-standby connection seem likely to be DOA from a performance perspective. But I think we have a pretty broad consensus that it's OK to punt this issue for 9.1. We can always add this in 9.2 if it can be demonstrated to work well, but it's all vapor-ware right now anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Fri, 2011-01-21 at 11:19 +0200, Heikki Linnakangas wrote: >>> It's not the order in which the xid was assigned that matters, >>> but the order the transactions started and got their snapshots. >>> The xids might be assigned a lot later, after the transactions >>> have already read data. > >> So if a read-write transaction assigns an xid before it takes a >> snapshot then we'll be OK? That seems much easier to arrange than >> passing chunks of snapshot data backwards and forwards. We're not talking about passing them backwards. I'm suggesting that we probably don't even need to pass them forward, but that suggestion has been pretty handwavy so far. I guess I should fill it out, because everyone's been ignoring it so far. > No, that idea is DOA from a performance standpoint. We sweated > blood to avoid having to assign XIDs to read-only transactions, > and we're not going back. If SSI requires that, SSI is not > getting committed. SSI doesn't require that. The suggestion that it would in *any* way help with the interaction with hot standby is off-base. -Kevin
Anssi Kääriäinen <anssi.kaariainen@thl.fi> wrote: > I am beginning to understand the problem. If you don't mind, here > is a complete example if somebody else is having troubles > understanding this. > > Let's say we have tables D1 and D2. Both contain a single column, > id, and a single row. The data in the beginning is as follows: > > D1: id = 1 > D2: id = 1 > > The constraints: D1.id can only be incremented. Whenever D2.id is > updated, it must be updated to D1.id + 1. > > The transactions: > T1: begin; update D1 set id = id + 1; > T2: begin; update D2 set id = (select id+1 from D1); > T1: commit; > T3: begin; select id from D1; select id from D2; commit; Data > seen: (2, 1) -- this is a possible state > T2: commit; > T4: begin; select id from D1; select id from D2; commit; Data > seen: (2, 2) > This is again a possible state. But if we compare this to the > state seen > by T3 this is not valid. From state (2, 1) we can not get to state > (2, 2) without breaking one of the constraints. Thus, the state of > T3 is not valid in the database. > > So, I finally got it! :-) I hope this example will help somebody > else understand the problem. Yeah, interesting example. Under SSI, once T3 selects from D2 you have a dangerous structure, and either T2 or T3 must fail to prevent the possibility of the sort of anomaly your example demonstrates. We would prefer to see T2 fail, because if T3 fails it will continue to fail on retry until T2 completes. We're trying to avoid that kind of thrashing. If T2 fails and is retried, it will immediately succeed and generate results consistent with what T3 saw. When I test your example, though, I'm getting the serialization failure on T3 rather than T2, so I'd call that a bug. Will investigate. Thanks again for your tests! You seem to be able to shake out issues better than anyone else! Once found, fixing them is not usually very hard, it's coming up with that creative usage pattern to *find* the problem which is the hard part. OK if I add this one to our dcheck test suite, too? -Kevin
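For concreteness, here is a sketch of the intended behavior once the victim selection is fixed (see the follow-up below), using the d1/d2 tables from the reproduction above and running all of the transactions at SERIALIZABLE on the master; the exact wording of the error and the point at which it is raised may vary:

    -- session B (T2) tries to commit after T3 has committed,
    -- T3 having seen (2, 1)
    COMMIT;
    -- ERROR:  could not serialize access due to read/write dependencies
    --         among transactions

    -- session B simply retries T2 from the top and succeeds
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    UPDATE d2 SET id = (SELECT id + 1 FROM d1);   -- now sees d1.id = 2
    COMMIT;                                       -- d2.id ends up 3

    -- any later reader sees (2, 3), which is consistent with the
    -- serial order T1, T3, T2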
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > When I test your example, though, I'm getting the serialization > failure on T3 rather than T2, so I'd call that a bug. Will > investigate. Thanks again for your tests! Fixed with this: http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=b91460812396b68362c812d6e4fb67799fc6147e Thanks again! -Kevin
I wrote:
> We're not talking about passing them backwards. I'm suggesting that we probably don't even need to pass them forward, but that suggestion has been pretty handwavy so far. I guess I should fill it out, because everyone's been ignoring it so far.

It's been too hectic today to flesh this out very well, but I can at least do a better brain dump -- you know, wave my hands a little less vaguely. The idea of communicating regarding a safe snapshot through the WAL without actually *sending* snapshot XIDs through the WAL might work something like this:

(1) We communicate when we are starting to consider a snapshot. This would always be related to the commit or rollback of a serializable read-write transaction, so perhaps we could include the information in an existing WAL record. We would need to find one free bit somewhere, or make room for it. Alternatively, we could send a new WAL record type to communicate this. At the point that a standby processes such a WAL record, it would grab a snapshot effective just after the commit, and save it as the "latest candidate", releasing the previous candidate, if any.

(2) If a snapshot fails to make it to a safe status on the master, it will pick a new candidate and repeat (1) -- there's no need to explicitly quash a failed candidate.

(3) We communicate when we find that the last candidate made it to "safe" status. Again, this would be related to the commit or rollback of a serializable read-write transaction. Same issues about needing (another) bit or using a new record type. When a standby receives this, it promotes the latest candidate to the new "safe snapshot" to be used when a serializable transaction asks for a snapshot, replacing the previous value, if any. Any transactions waiting for a snapshot (either because there previously wasn't a safe snapshot on record or because they requested DEFERRABLE) could be provided the new snapshot and turned loose.

(4) It's not inconceivable that we might want to send both (1) and (3) with the same commit.

(5) Obviously, we can pick our heuristics for how often we try to refresh this, limiting it to avoid too much overhead, at the cost of less frequent snapshot updates for serializable transactions on the standbys.

My assumption is that when we have a safe snapshot (which should be pretty close to all the time), we immediately provide it to any serializable transaction requesting a snapshot, except it seems to make sense to use the new DEFERRABLE mode to mean that you want to use the *next* one to arrive. This would effectively cause the point in time which was visible to serializable transactions to lag behind what is visible to other transactions by a variable amount, but would ensure that a serializable transaction couldn't see any serialization anomalies. It would also be immune to serialization failures from SSI logic; but obviously, standby-related cancellations would be in play. I don't know whether the older snapshots would tend to increase the standby-related cancellations, but it wouldn't surprise me.

Hopefully this is enough for people to make something of it.

-Kevin
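From a client's perspective, the interface on the standby would presumably mirror what the proposal describes; a sketch, assuming the scheme above were implemented (none of this standby-side behavior exists today, and some_table is just a placeholder):

    -- on the hot standby, under this proposal:

    -- uses the most recent safe snapshot the standby has received;
    -- the data visible may lag what non-serializable transactions see
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY;
    SELECT count(*) FROM some_table;
    COMMIT;

    -- DEFERRABLE would mean: wait for the *next* safe snapshot to
    -- arrive before running, to guarantee freshness
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;
    SELECT count(*) FROM some_table;
    COMMIT;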
On Fri, 2011-01-21 at 18:52 -0600, Kevin Grittner wrote:
> My assumption is that when we have a safe snapshot (which should be pretty close to all the time), we immediately provide it to any serializable transaction requesting a snapshot, except it seems to make sense to use the new DEFERRABLE mode to mean that you want to use the *next* one to arrive.

How would it handle this situation:

1. Standby has safe snapshot S1
2. Primary does a VACUUM which removes some stuff visible in S1
3. Standby can't replay the VACUUM because it still has S1, but also can't get a new S2 because the WAL needed for that is behind the VACUUM

So, S1 needs to be discarded. What do we do on the standby while there is no safe snapshot? I suppose throw errors -- I can't think of anything else.

> This would effectively cause the point in time which was visible to serializable transactions to lag behind what is visible to other transactions by a variable amount, but would ensure that a serializable transaction couldn't see any serialization anomalies. It would also be immune to serialization failures from SSI logic; but obviously, standby-related cancellations would be in play. I don't know whether the older snapshots would tend to increase the standby-related cancellations, but it wouldn't surprise me.

I'm also a little concerned about the user-understandability here. Is it possible to make the following guarantees in this approach:

1. If transactions are completing on the primary, new snapshots will be taken on the standby; and
2. If no write transactions are in progress on the primary, then the standby will get a snapshot that represents the exact same data as on the primary?

That would be fairly easy to explain to users. If there is a visibility lag, then we just say "finish the write transactions, and progress will be made". And if the system is idle, they should see identical data.

Regards,
Jeff Davis
> Jeff Davis wrote: > On Fri, 2011-01-21 at 18:52 -0600, Kevin Grittner wrote: >> My assumption is that when we have a safe snapshot (which should >> be pretty close to all the time), we immediately provide it to any >> serializable transaction requesting a snapshot, except it seems to >> make sense to use the new DEFERRABLE mode to mean that you want to >> use the *next* one to arrive. > > How would it handle this situation: > 1. Standby has safe snapshot S1 > 2. Primary does a VACUUM which removes some stuff visible in S1 > 3. Standby can't replay the VACUUM because it still has S1, but > also can't get a new S2 because the WAL needed for that is behind > the VACUUM > > So, S1 needs to be discarded. What do we do on the standby while > there is no safe snapshot? I suppose throw errors -- I can't think > of anything else. We could wait for the next safe snapshot to arrive. I don't know how often that combination would occur, particularly in a situation where there were long-running serializable read write transactions on the master which would prevent a new safe snapshot from being generated. It seems as though a long-running transaction on the master would also block vacuum activity. I'm not sure how we can *really* know the frequency without field experience. >> This would effectively cause the point in time which was visible >> to serializable transactions to lag behind what is visible to >> other transactions by a variable amount, but would ensure that a >> serializable transaction couldn't see any serialization anomalies. >> It would also be immune to serialization failures from SSI logic; >> but obviously, standby-related cancellations would be in play. I >> don't know whether the older snapshots would tend to increase the >> standby-related cancellations, but it wouldn't surprise me. > > I'm also a little concerned about the user-understandability here. > Is it possible to make the following guarantees in this approach: > > 1. If transactions are completing on the primary, new snapshots > will be taken on the standby; and The rules there are rather complicated. Based on previous responses to posts where I've gotten into that detail, I fear that specifying it with complete accuracy would cause so many eyes to glaze over that nobody would get to the end of the description. I will do it if anybody asks, but without that I'll just say that the conditions for blocking a safe snapshot in a mix of short-lived read-write transactions are so esoteric that I expect that they would be uncommon in practical use. On the other hand, one long-running read-write transaction could block generation of a new safe snapshot indefinitely. Transactions declared as read-only or running at an isolation level other than serializable would have no impact on generation of a safe snapshot. > 2. If no write transactions are in progress on the primary, then > the standby will get a snapshot that represents the exact same data > as on the primary? A snapshot taken while there are no serializable read write transactions active can immediately be declared safe. Whether such a snapshot is always available on the standby depends on what sort of throttling, if any, is used. > That would be fairly easy to explain to users. If there is a > visibility lag, then we just say "finish the write transactions, > and progress will be made". And if the system is idle, they should > see identical data. Well, unless it's sync rep, you'll always have some latency between the master and the standby.
And any throttling to control resource utilization could also cause the view seen by serializable transactions to lag behind what other transactions see. But other than that, you're exactly on target. -Kevin
"When I test your example, though, I'm getting the serialization failure on T3 rather than T2, so I'd call that a bug. Will investigate. Thanks again for your tests! You seem to be able to shake out issues better than anyone else! Once found, fixing them is not usually very hard, it's coming up with that creative usage pattern to *find* the problem which is the hard part." Thank you very much, but I do not deserve this honor. I was just constructing an example for myself so that I could understand why a read only transaction might pose a problem. I posted it to help other people see a concrete example of the problem. I had no idea this would show an actual bug in the code. "OK if I add this one to our dcheck test suite, too?" It is of course OK. And if you want to add this as an example in the documentation, it would be great. This is a simple, but concrete example of why a read only serializable transaction might cause an anomaly. If I am not mistaken, there isn't any complete example in the documentation. It was hard for me to grasp why there might be a problem and I don't think I am alone. - Anssi