Thread: Low hanging fruit in lazy-XID-assignment patch?
Simon was complaining a bit ago that we still have problems with excessive contention for the ProcArrayLock, and that much of this stems from the need for transaction exit to take that lock exclusively. The lazy-XID patch, as committed, doesn't help that situation at all, saying

    /*
     * Lock ProcArrayLock because that's what GetSnapshotData uses.
     * You might assume that we can skip this step if we had no
     * transaction id assigned, because the failure case outlined
     * in GetSnapshotData cannot happen in that case.  This is true,
     * but we *still* need the lock guarantee that two concurrent
     * computations of the *oldest* xmin will get the same result.
     */

On reflection though this seems wrong: I believe that we could skip taking the lock when exiting a transaction with no XID. The actions being guarded with the lock are

    MyProc->xid = InvalidTransactionId;
    MyProc->lxid = InvalidLocalTransactionId;
    MyProc->xmin = InvalidTransactionId;
    MyProc->inVacuum = false;   /* must be cleared with xid/xmin */

    /* Clear the subtransaction-XID cache too while holding the lock */
    MyProc->subxids.nxids = 0;
    MyProc->subxids.overflowed = false;

Clearing xid is obviously a no-op if we had no xid, and if we had no xid we have no subxids either, so the last two lines are also no-ops. I cannot see any reason why we need a guard on clearing lxid, either. inVacuum is only interesting if xmin is, since if there's no xid assigned then it's effectively just a filter on whether other backends pay attention to this one's xmin.

That leaves xmin, which AFAICS is only interesting for the computations of GetOldestXmin() and RecentGlobalXmin. And I assert it doesn't matter if those numbers advance asynchronously, so long as they never go backward.

Comments?

			regards, tom lane
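For concreteness, a minimal sketch of the conditional exit path being proposed, keyed on whether an XID was ever assigned; the field names follow the snippet above, and the exact placement in xact.c is left open:

    if (TransactionIdIsValid(MyProc->xid))
    {
        /* We advertised an XID, so concurrent snapshots depend on us */
        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

        MyProc->xid = InvalidTransactionId;
        MyProc->lxid = InvalidLocalTransactionId;
        MyProc->xmin = InvalidTransactionId;
        MyProc->inVacuum = false;

        MyProc->subxids.nxids = 0;
        MyProc->subxids.overflowed = false;

        LWLockRelease(ProcArrayLock);
    }
    else
    {
        /* No XID was assigned: per the reasoning above, clearing these
         * fields without the lock cannot break anyone else's snapshot */
        MyProc->lxid = InvalidLocalTransactionId;
        MyProc->xmin = InvalidTransactionId;
        MyProc->inVacuum = false;
    }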
Tom Lane wrote:
> Simon was complaining a bit ago that we still have problems with
> excessive contention for the ProcArrayLock, and that much of this stems
> from the need for transaction exit to take that lock exclusively.
> The lazy-XID patch, as committed, doesn't help that situation at all,
> saying
>
>     /*
>      * Lock ProcArrayLock because that's what GetSnapshotData uses.
>      * You might assume that we can skip this step if we had no
>      * transaction id assigned, because the failure case outlined
>      * in GetSnapshotData cannot happen in that case.  This is true,
>      * but we *still* need the lock guarantee that two concurrent
>      * computations of the *oldest* xmin will get the same result.
>      */

I think the comment is correct in principle - if we remove the oldest xmin without locking, then two concurrent OldestXmin calculations will get two different results. The question is whether that has any negative effects, though.

> That leaves xmin, which AFAICS is
> only interesting for the computations of GetOldestXmin() and
> RecentGlobalXmin. And I assert it doesn't matter if those numbers
> advance asynchronously, so long as they never go backward.

Yes, the xmin is surely the only field that might need the locking. It was this comment in GetSnapshotData that made me keep the locking in the first place:

    * It is sufficient to get shared lock on ProcArrayLock, even if we are
    * computing a serializable snapshot and therefore will be setting
    * MyProc->xmin. This is because any two backends that have overlapping
    * shared holds on ProcArrayLock will certainly compute the same xmin
    * (since no xact, in particular not the oldest, can exit the set of
    * running transactions while we hold ProcArrayLock --- see further
    * discussion just below). So it doesn't matter whether another backend
    * concurrently doing GetSnapshotData or GetOldestXmin sees our xmin as
    * set or not; he'd compute the same xmin for himself either way.
    * (We are assuming here that xmin can be set and read atomically,
    * just like xid.)

But now that I read this again, I think that comment is just misleading - especially the part "So it doesn't matter whether another backend concurrently doing GetSnapshotData or GetOldestXmin sees our xmin as set or not; he'd compute the same xmin for himself either way." This sounds as if the Proc->xmin that *one* backend announces had influence over the Proc->xmin that *another* backend might compute. Which isn't true - it only influences the GlobalXmin that another backend might compute.

So I believe you're right, and we can skip taking the lock in the no-xid case - I actually think that with just a little bit more work, we can go even further, and get rid of the ReadNewTransactionId() call completely during snapshotting.

There are two things we must ensure when it comes to snapshots, commits and xid assignment.

1) A transaction must either be not in progress, be in our snapshot, or have an xid >= xmax.

2) If transaction A sees B as committed, and B sees C as committed, then A must see C as committed.

ad 1): We guarantee that by storing the xid in the proc array before releasing the XidGenLock. Therefore, when we later obtain our xmax value, we can be sure that we see all xacts in the proc array that have an xid < xmax and are in progress.

ad 2): We guarantee that by serializing snapshotting against committing. Since we use ReadNewTransactionId() as the snapshot's xmax, this implies that we take the ProcArrayLock *before* reading our xmax value.

Now, ReadNewTransactionId() is actually larger than necessary as an xmax.
The minimal xmax that we can set is "largest committed xid" + 1. We can easily track that value during commit, when we hold the ProcArrayLock (if we have no xid, and therefore don't have to hold the lock, we also don't need to update that value). If we used this "LatestCommittedXid" as xmax, we'd still guarantee (2), but without having to hold the XidGenLock during GetSnapshotData().

I wouldn't have dared to suggest this for 8.3, but since you came up with locking improvements in the first place... ;-)

greetings, Florian Pflug
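A sketch of the commit-side bookkeeping this describes; the largestCommittedXid member of ShmemVariableCache is the proposal, not an existing field, and xid stands for the committing backend's top-level XID:

    /* Already holding ProcArrayLock exclusively to clear MyProc->xid */
    if (TransactionIdPrecedes(ShmemVariableCache->largestCommittedXid, xid))
        ShmemVariableCache->largestCommittedXid = xid; /* never goes backward */

Since every committing transaction that has an XID already takes the exclusive lock, the update costs essentially nothing extra, and XID-less transactions never need to touch it.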
"Florian G. Pflug" <fgp@phlo.org> writes: > Tom Lane wrote: >> The lazy-XID patch, as committed, doesn't help that situation at all, > I think the comment is correct in principle - If we remove the oldest > xmin without locking, then two concurrent OldestXmin calculations > will get two different results. The question is if that has any > negative effects, though. My point is that there's a difference between what you compute (and publish) as your own xmin, and what you compute as the RecentGlobalXmin. I don't think there's any need for a guarantee that two concurrent processes get the same estimate of RecentGlobalXmin, as long as they do not get an estimate less than reality, ie, that someone cannot later compute and publish a smaller xmin. There are reasons why we want two concurrent GetSnapshotDatas to compute the same xmin, but I think in the end it just comes down to being a prerequisite for the above constraint --- without that you're not sure that someone might not be about to publish an xmin less than what you obtained as RecentGlobalXmin. Dropping a live xid is a whole different issue. There, you have the problem that you need everyone to see a consistent commit order, which is what the example in GetSnapshotData is about. But I don't think that xmin enters into that. xmin is only about "is it safe to drop this tuple because no one can see it?". There, we don't have to be exactly correct, we only have to err in the conservative direction. > It was this comment in GetSnapshotData that made me keep the locking > in the first place: > * It is sufficient to get shared lock on ProcArrayLock, even if we are > * computing a serializable snapshot and therefore will be setting > * MyProc->xmin. This is because any two backends that have overlapping > * shared holds on ProcArrayLock will certainly compute the same xmin If I recall correctly, that text was written to justify downgrading GetSnapshotData's hold on ProcArrayLock from exclusive to shared --- it was merely arguing that the results wouldn't change if we did that. I don't see an argument there that this condition is really *necessary*. We do have to think carefully about whether GetOldestXmin can compute a value that's too large, but right at the moment I see no problem there. > So I believe you're right, and we can skip taking the lock in the no > xid case - I actually think with just a little bit of more work, we > can go even further, and get rid of the ReadNewTransactionId() call > completely during snapshotting. [ squint... ] This goes a bit far for me. In particular, I think this will fail in the edge case when there are no live XIDs visible in ProcArray. You cannot go back and do ReadNewTransactionId afterward, at least not without re-scanning the ProcArray a second time, which makes it at best a questionable win. regards, tom lane
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> So I believe you're right, and we can skip taking the lock in the no
>> xid case - I actually think with just a little bit more work, we
>> can go even further, and get rid of the ReadNewTransactionId() call
>> completely during snapshotting.
>
> [ squint... ] This goes a bit far for me. In particular, I think this
> will fail in the edge case when there are no live XIDs visible in
> ProcArray. You cannot go back and do ReadNewTransactionId afterward,
> at least not without re-scanning the ProcArray a second time, which
> makes it at best a questionable win.

Why would it? The idea was to remember the largest committed xid, and that won't go away just because the proc array is rather empty xid-wise. Actually, in that case "largest committed xid" + 1 will (nearly) be what ReadNewTransactionId() returns. (Nearly, because the transaction with the xid ReadNewTransactionId()-1 might have aborted, so largestCommittedXid might be a bit further behind ReadNewTransactionId().)

(That slight lagging of largestCommittedXid might cause some tuples not to be VACUUMed though, so we might want to update largestCommittedXid for ABORTs too, and probably rename it to largestNonRunningXid or whatever ;-) )

I would go as far as saying that largestCommittedXid+1 is the natural choice for xmax - after all, xmax is the cutoff point after which an xid *cannot* be seen as committed, and largestCommittedXid+1 is the smallest xmax that guarantees that we see xacts committed before the snapshot as committed. The xmin computation won't change - apart from using some other initial value.

This would rid us of the rather complicated entanglement of XidGenLock and the ProcArrayLock, lessen the lock contention, and reduce the average snapshot size a bit.

greetings, Florian Pflug
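As a sketch of what the snapshot side would then look like: largestCommittedXid is still only the proposed shared counter, snapshot stands for the SnapshotData being filled in, and the ProcArray scan itself is elided:

    TransactionId xmax;

    LWLockAcquire(ProcArrayLock, LW_SHARED);

    xmax = ShmemVariableCache->largestCommittedXid;
    TransactionIdAdvance(xmax);          /* xmax = largest committed xid + 1 */

    /* ... scan the ProcArray for running XIDs and compute xmin as today ... */

    LWLockRelease(ProcArrayLock);

    snapshot->xmax = xmax;

Note that XidGenLock is never taken here, which is the whole point of the proposal.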
On Fri, 2007-09-07 at 06:36 +0200, Florian G. Pflug wrote:
> Tom Lane wrote:
> > "Florian G. Pflug" <fgp@phlo.org> writes:
> >> So I believe you're right, and we can skip taking the lock in the no
> >> xid case

Sounds good.

> >> - I actually think with just a little bit more work, we
> >> can go even further, and get rid of the ReadNewTransactionId() call
> >> completely during snapshotting.
> >
> > [ squint... ] This goes a bit far for me. In particular, I think this
> > will fail in the edge case when there are no live XIDs visible in
> > ProcArray. You cannot go back and do ReadNewTransactionId afterward,
> > at least not without re-scanning the ProcArray a second time, which
> > makes it at best a questionable win.
>
> Why would it?

I think the additional suggestion goes a bit too far. You may be right, but I don't want to change the transaction system in advanced ways this close to the next release. We may have difficulty spotting bugs in that thinking during beta.

Lazy XID assignment will reduce the number of times GetNewTransactionId() is called, and if we also avoid taking the ProcArrayLock for CommitTransaction() then we will have significantly reduced contention for mixed workloads (i.e. most real ones).

-- 
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Simon Riggs wrote:
> On Fri, 2007-09-07 at 06:36 +0200, Florian G. Pflug wrote:
>> Tom Lane wrote:
>>> "Florian G. Pflug" <fgp@phlo.org> writes:
>>>> - I actually think with just a little bit more work, we
>>>> can go even further, and get rid of the ReadNewTransactionId() call
>>>> completely during snapshotting.
>>> [ squint... ] This goes a bit far for me. In particular, I think this
>>> will fail in the edge case when there are no live XIDs visible in
>>> ProcArray. You cannot go back and do ReadNewTransactionId afterward,
>>> at least not without re-scanning the ProcArray a second time, which
>>> makes it at best a questionable win.
>> Why would it?
>
> I think the additional suggestion goes a bit too far. You may be right,
> but I don't want to change the transaction system in advanced ways this
> close to the next release. We may have difficulty spotting bugs in that
> thinking during beta.

Ok, those were two clear votes against doing this, so I'll stop arguing ;-). I do think that we should have another look at this when 8.4 opens, though.

greetings, Florian Pflug
"Florian G. Pflug" <fgp@phlo.org> writes: > Why would it? The idea was to remember the largest committed xid, and that > won't go away just because the proc array is rather empty xid-wise. I hadn't fully absorbed this idea last night, but now that I have, I'm starting to think it's a good one. > (That slightly lagging of largestCommittedXid might cause some tuples not to > be VACUUMED though, so we might want to update largestCommittedXid for > ABORTS too, and probably rename it to largestNonRunningXid or whatever ;-) ). LargestCompletedXid, perhaps? > This would rid us of the rather complicated entanglement of XidGenLock and > the ProcArrayLock, lessen the lock contention, and reduce the average snapshot > size a bit. In view of Simon's earlier comments at http://archives.postgresql.org/pgsql-hackers/2007-07/msg00948.php it seems like not having to take the XidGenLock during GetSnapshotData ought to be a pretty serious win performance-wise, and it might open up some other possibilities that are now foreclosed by the risk of deadlock between the two locks. I've spent the past hour or so trying to consolidate the comments in GetSnapshotData and related places into a single chunk of text to be added to src/backend/access/transam/README. Attached is what I have so far --- this incorporates the idea of not taking ProcArrayLock to exit an XID-less transaction, but not yet Florian's idea. I think it'd get simpler if we changed to that, but am posting this for comments. regards, tom lane Interlocking transaction begin, transaction end, and snapshots -------------------------------------------------------------- We try hard to minimize the amount of overhead and lock contention involved in the frequent activities of beginning/ending a transaction and taking a snapshot. Unfortunately, we must have some interlocking for this, because it is critical that all backends agree on the commit order of transactions. For example, suppose an UPDATE in xact A is blocked by xact B's prior update of the same row, and xact B is doing commit while xact C gets a snapshot. Xact A can complete and commit as soon as B releases its locks. If xact C's GetSnapshotData sees xact B as still running, then it had better see xact A as still running as well, or it will be able to see two tuple versions - one deleted by xact B and one inserted by xact A. That is, if A thinks it committed after B, C had better think the same. We enforce this by not allowing any transaction to exit the set of running transactions while a snapshot is being taken. (This rule is probably stronger than necessary, but see the next problem.) The implementation of this is that GetSnapshotData takes the ProcArrayLock in shared mode (thereby allowing multiple backends to take snapshots in parallel), but xact.c must take the ProcArrayLock in exclusive mode while clearing MyProc->xid at transaction end (either commit or abort). Here is another variant of the risk scenario: 1. Xact A is running (in Read Committed mode). 2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is swapped out before it can acquire ProcArrayLock. 3. Xact B gets new XID (>= C's xmax), makes changes and commits. 4. Xact A changes some row R changed by xact B and commits. 5. Xact C finishes getting its snapshot data. It sees xact A as done, but sees xact B as still running (since B >= xmax). Now C will see R changed by xact B and then xact A, *but* does not see other changes made by xact B. If C is supposed to be in Serializable mode, this is wrong. 
To prevent this it is necessary that GetSnapshotData acquire ProcArrayLock before it calls ReadNewTransactionId. This prevents xact A from exiting the set of running transactions seen by xact C. Therefore both A and B will be seen as still running => no inconsistency.

In short, then, the rule is that no transactions may exit the set of currently-running transactions between the time we fetch xmax and the time we finish building our snapshot. However, this restriction only applies to transactions that have an XID --- read-only transactions can end without acquiring ProcArrayLock, since they don't affect anyone else's snapshot.

We must also require GetNewTransactionId to store the new XID into the shared ProcArray before releasing XidGenLock. This ensures that when GetSnapshotData calls ReadNewTransactionId (which also takes XidGenLock), all active XIDs before the returned value of nextXid are already present in the ProcArray and can't be missed by GetSnapshotData. Unfortunately, we can't have GetNewTransactionId take ProcArrayLock to do this, else it could deadlock against GetSnapshotData. Therefore, we simply let GetNewTransactionId store into MyProc->xid without any lock. We are thereby relying on fetch/store of an XID to be atomic, else other backends might see a partially-set XID. (NOTE: for multiprocessors that need explicit memory access fence instructions, this means that acquiring/releasing XidGenLock is just as necessary as acquiring/releasing ProcArrayLock for GetSnapshotData to ensure it sees up-to-date xid fields.) This also means that readers of the ProcArray xid fields must be careful to fetch a value only once, rather than assume they can read it multiple times and get the same answer each time.

Another important activity that uses the shared ProcArray is GetOldestXmin, which must determine a lower bound for the oldest xmin of any active MVCC snapshot, system-wide. Each individual backend advertises the smallest xmin of its own snapshots in MyProc->xmin, or zero if it currently has no live snapshots (eg, if it's between transactions or hasn't yet set a snapshot for a new transaction). GetOldestXmin takes the MIN() of the valid xmin fields. It does this with only shared lock on ProcArrayLock, which means there is a potential race condition against other backends doing GetSnapshotData concurrently: we must be certain that a concurrent backend that is about to set its xmin does not compute an xmin less than what GetOldestXmin returns. We ensure that by including all the active XIDs into the MIN() calculation, along with the valid xmins. The rule that transactions can't exit without taking exclusive ProcArrayLock ensures that concurrent holders of shared ProcArrayLock will compute the same minimum of currently-active XIDs: no xact, in particular not the oldest, can exit while we hold shared ProcArrayLock. So GetOldestXmin's view of the minimum active XID will be the same as that of any concurrent GetSnapshotData, and so it can't produce an overestimate. If there is no active transaction at all, GetOldestXmin returns the result of ReadNewTransactionId. Note that two concurrent executions of GetOldestXmin might not see the same result from ReadNewTransactionId --- but if there is a difference, the intervening execution(s) of GetNewTransactionId must have stored their XIDs into the ProcArray, so the later execution of GetOldestXmin will see them and compute the same global xmin anyway.
GetSnapshotData also performs an oldest-xmin calculation (which had better match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used for some tuple age cutoff checks where a fresh call of GetOldestXmin seems too expensive. Note that while it is certain that two concurrent executions of GetSnapshotData will compute the same xmin for their own snapshots, as argued above, it is not certain that they will arrive at the same estimate of RecentGlobalXmin. This is because we allow XID-less transactions to clear their MyProc->xmin asynchronously (without taking ProcArrayLock), so one execution might see what had been the oldest xmin, and another not. This is OK since RecentGlobalXmin need only be a valid lower bound. As noted above, we are already assuming that fetch/store of the xid fields is atomic, so assuming it for xmin as well is no extra risk.
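To make the GetOldestXmin rule above concrete, here is a simplified sketch; the real function also filters by database and handles a few more cases, and procArray stands for the shared-array pointer maintained in procarray.c:

    TransactionId result;
    int           index;

    result = ReadNewTransactionId();    /* used if no xacts are active at all */

    LWLockAcquire(ProcArrayLock, LW_SHARED);

    for (index = 0; index < procArray->numProcs; index++)
    {
        PGPROC       *proc = procArray->procs[index];
        TransactionId xid = proc->xid;      /* fetch each field only once */

        if (TransactionIdIsNormal(xid) && TransactionIdPrecedes(xid, result))
            result = xid;

        xid = proc->xmin;
        if (TransactionIdIsNormal(xid) && TransactionIdPrecedes(xid, result))
            result = xid;
    }

    LWLockRelease(ProcArrayLock);

    /* 'result' is now a lower bound on any xmin that may yet be published */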
Tom Lane wrote:
> I've spent the past hour or so trying to consolidate the comments in
> GetSnapshotData and related places into a single chunk of text to be
> added to src/backend/access/transam/README. Attached is what I have so
> far --- this incorporates the idea of not taking ProcArrayLock to exit
> an XID-less transaction, but not yet Florian's idea. I think it'd get
> simpler if we changed to that, but am posting this for comments.

> Interlocking transaction begin, transaction end, and snapshots
> --------------------------------------------------------------
>
> We try hard to minimize the amount of overhead and lock contention involved
> in the frequent activities of beginning/ending a transaction and taking a
> snapshot. Unfortunately, we must have some interlocking for this, because
> it is critical that all backends agree on the commit order of transactions.

This is actually still a slightly stronger requirement than what we really need, I think. To simplify the following discussion, "A -> B" shall mean that transaction A saw B as committed. Conversely, "A !> B" shall mean that A treats B as in-progress. If A was in read-committed mode, the visibility refers to the latest snapshot that A used.

Now assume A and B commit at nearly the same time, and for two other transactions C and D the following holds: C -> A, C !> B but D !> A, D -> B. This would violate the requirement that the commit order is globally agreed upon, yet as long as both A !> B and B !> A hold, there is no conflict. (Note that if A and B are serializable, A !> B && B !> A implies that A and B cannot have touched the same record and have both committed - one would have been aborted due to a SerializationError.)

I must admit, though, that this is a quite academic case, since the prerequisite A !> B and B !> A is something we have no control over for read-committed transactions - who knows when they might have taken their last snapshot... Still, I wanted to mention this because I believe that the minimal requirement that we actually *need* to enforce is

    A -> B and B -> C imply A -> C.   (T1)

The actual implementation will probably always have to enforce something slightly stronger, but it's still nice to know the minimal guarantee needed to be able to judge correctness.

> For example, suppose an UPDATE in xact A is blocked by xact B's prior
> update of the same row, and xact B is doing commit while xact C gets a
> snapshot. Xact A can complete and commit as soon as B releases its locks.
> If xact C's GetSnapshotData sees xact B as still running, then it had
> better see xact A as still running as well, or it will be able to see two
> tuple versions - one deleted by xact B and one inserted by xact A.

In my notation this becomes: A -> B and C !> B imply C !> A. This then follows from (T1) - assume that A -> B, C !> B but C -> A; then with (T1), C -> B follows from C -> A and A -> B, which contradicts C !> B.

> We
> enforce this by not allowing any transaction to exit the set of running
> transactions while a snapshot is being taken. (This rule is probably
> stronger than necessary, but see the next problem.) The implementation
> of this is that GetSnapshotData takes the ProcArrayLock in shared mode
> (thereby allowing multiple backends to take snapshots in parallel), but
> xact.c must take the ProcArrayLock in exclusive mode while clearing
> MyProc->xid at transaction end (either commit or abort).

Agreed. We *do* enforce a strict global ordering of commits and snapshots.
This then guarantees (T1), because if A -> B and B -> C, then A *must* have taken its snapshot after B committed, and B in turn *must* have taken its snapshot after C committed, so surely A -> C will hold too.

> Here is another variant of the risk scenario:
>
> 1. Xact A is running (in Read Committed mode).
> 2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is
>    swapped out before it can acquire ProcArrayLock.
> 3. Xact B gets new XID (>= C's xmax), makes changes and commits.
> 4. Xact A changes some row R changed by xact B and commits.
> 5. Xact C finishes getting its snapshot data. It sees xact A as done,
>    but sees xact B as still running (since B >= xmax).
>
> Now C will see R changed by xact B and then xact A, *but* does not see
> other changes made by xact B. If C is supposed to be in Serializable mode,
> this is wrong.

I never really grasped why we need to assume serializable here - this seems wrong if C is read-committed too. Seeing only half of a transaction's changes can never be right, can it?

> To prevent this it is necessary that GetSnapshotData acquire ProcArrayLock
> before it calls ReadNewTransactionId. This prevents xact A from exiting
> the set of running transactions seen by xact C. Therefore both A and B
> will be seen as still running => no inconsistency.

Another point of view is that determining the xmax of a snapshot really *is* part of taking the snapshot. Since we already established that we need to serialize snapshotting and committing, it follows that we must not allow commits to happen between the time we take the xmax and the time we complete the snapshot.

> In short, then, the rule is that no transactions may exit the set of
> currently-running transactions between the time we fetch xmax and the time
> we finish building our snapshot. However, this restriction only applies
> to transactions that have an XID --- read-only transactions can end without
> acquiring ProcArrayLock, since they don't affect anyone else's snapshot.

Yes, they completely fall out of the picture, because nobody ever cares whether they even committed.

> We must also require GetNewTransactionId to store the new XID into the
> shared ProcArray before releasing XidGenLock. This ensures that when
> GetSnapshotData calls ReadNewTransactionId (which also takes XidGenLock),
> all active XIDs before the returned value of nextXid are already present in
> the ProcArray and can't be missed by GetSnapshotData. Unfortunately, we
> can't have GetNewTransactionId take ProcArrayLock to do this, else it could
> deadlock against GetSnapshotData. Therefore, we simply let
> GetNewTransactionId store into MyProc->xid without any lock. We are
> thereby relying on fetch/store of an XID to be atomic, else other backends
> might see a partially-set XID. (NOTE: for multiprocessors that need
> explicit memory access fence instructions, this means that
> acquiring/releasing XidGenLock is just as necessary as acquiring/releasing
> ProcArrayLock for GetSnapshotData to ensure it sees up-to-date xid fields.)
> This also means that readers of the ProcArray xid fields must be careful to
> fetch a value only once, rather than assume they can read it multiple times
> and get the same answer each time.

In other words: it won't do for a transaction to acquire an xid, then go into hiding, and later suddenly reappear and start storing its rather old xid somewhere. Or, again, differently put: when we take a snapshot we must be sure that any xid < our xmax is either not in use anymore, or shows up in the proc array.
By storing the xid in the proc array while holding the XidGenLock, we ensure that there is always at most one xid that is assigned but not yet showing up in the proc array. This xid is always equal to either nextXid or nextXid-1. It is the "nextXid-1" case that forces us to take the XidGenLock - doing so ensures that *no* xid is assigned but not showing up in the proc array. If we used any older value for the xmax, the nextXid-1 case wouldn't hurt us - the "hidden" xid would still be >= our xmax.

The last paragraph is in essence what guarantees the correctness of using LargestCompletedXid instead of ReadNewTransactionId() as the xmax.

> Another important activity that uses the shared ProcArray is GetOldestXmin,
> which must determine a lower bound for the oldest xmin of any active MVCC
> snapshot, system-wide. Each individual backend advertises the smallest
> xmin of its own snapshots in MyProc->xmin, or zero if it currently has no
> live snapshots (eg, if it's between transactions or hasn't yet set a
> snapshot for a new transaction). GetOldestXmin takes the MIN() of the
> valid xmin fields. It does this with only shared lock on ProcArrayLock,
> which means there is a potential race condition against other backends
> doing GetSnapshotData concurrently: we must be certain that a concurrent
> backend that is about to set its xmin does not compute an xmin less than
> what GetOldestXmin returns. We ensure that by including all the active
> XIDs into the MIN() calculation, along with the valid xmins. The rule that
> transactions can't exit without taking exclusive ProcArrayLock ensures that
> concurrent holders of shared ProcArrayLock will compute the same minimum of
> currently-active XIDs: no xact, in particular not the oldest, can exit
> while we hold shared ProcArrayLock. So GetOldestXmin's view of the minimum
> active XID will be the same as that of any concurrent GetSnapshotData, and
> so it can't produce an overestimate. If there is no active transaction at
> all, GetOldestXmin returns the result of ReadNewTransactionId. Note that
> two concurrent executions of GetOldestXmin might not see the same result
> from ReadNewTransactionId --- but if there is a difference, the intervening
> execution(s) of GetNewTransactionId must have stored their XIDs into the
> ProcArray, so the later execution of GetOldestXmin will see them and
> compute the same global xmin anyway.

Agreed. A shorter reasoning would maybe be: we need to ensure that no snapshot that is still in use has an xmin smaller than what GetOldestXmin() returns. We have to check two cases, snapshot creation and transaction exit.

During snapshot creation, we compute the xmin as the MIN() of all xids in our snapshot. A concurrent GetOldestXmin() or GlobalXmin computation cannot derive a larger value, since it takes the published xids into account, as well as the published xmins. This, together with the rule that no transaction can remove its xid from the set of running transactions while we either take a snapshot or do GetOldestXmin(), guarantees correctness.

During transaction exit, we don't need to take any lock to clear our xmin. The xmin won't influence the xmin calculations of other transactions, only GlobalXmin and GetOldestXmin() computations. Since we won't use our snapshot anymore, we don't care if they take our xmin into account for that or not.
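To make the XidGenLock publication rule referred to above concrete, here is a heavily simplified sketch; the real GetNewTransactionId also extends pg_clog, checks for wraparound, and handles subtransactions:

    TransactionId xid;

    LWLockAcquire(XidGenLock, LW_EXCLUSIVE);

    xid = ShmemVariableCache->nextXid;
    TransactionIdAdvance(ShmemVariableCache->nextXid);

    /*
     * Publish the XID before releasing XidGenLock, so that by the time any
     * backend can read a newer nextXid, our XID is already visible in the
     * ProcArray.  The store is assumed atomic, so no ProcArrayLock is taken.
     */
    MyProc->xid = xid;

    LWLockRelease(XidGenLock);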
> GetSnapshotData also performs an oldest-xmin calculation (which had better
> match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
> for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
> too expensive. Note that while it is certain that two concurrent
> executions of GetSnapshotData will compute the same xmin for their own
> snapshots, as argued above, it is not certain that they will arrive at the
> same estimate of RecentGlobalXmin. This is because we allow XID-less
> transactions to clear their MyProc->xmin asynchronously (without taking
> ProcArrayLock), so one execution might see what had been the oldest xmin,
> and another not. This is OK since RecentGlobalXmin need only be a valid
> lower bound. As noted above, we are already assuming that fetch/store
> of the xid fields is atomic, so assuming it for xmin as well is no extra
> risk.

Agreed - see above for an equivalent description.

I had a rather hard time understanding these things initially - at least for me, the enlightenment came when I realized that most of the locking in there actually ensures that (T1) holds - or in other words, that the relation "A sees B as committed" *has* to be a transitive one. So I'd like to see this reasoning in that README - though maybe I'm the only one to whom this is the clearest wording...

greetings, Florian Pflug
"Florian G. Pflug" <fgp@phlo.org> writes: > Tom Lane wrote: >> Here is another variant of the risk scenario: >> >> 1. Xact A is running (in Read Committed mode). >> 2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is >> swapped out before it can acquire ProcArrayLock. >> 3. Xact B gets new XID (>= C's xmax), makes changes and commits. >> 4. Xact A changes some row R changed by xact B and commits. >> 5. Xact C finishes getting its snapshot data. It sees xact A as done, >> but sees xact B as still running (since B >= xmax). >> >> Now C will see R changed by xact B and then xact A, *but* does not see >> other changes made by xact B. If C is supposed to be in Serializable mode, >> this is wrong. > I never really grasped why we need to assume serializable here - this seems > wrong if C is read-committed too. Seeing only half of a transaction's changes > can never be right, can it? You'd be surprised ;-) ... Read Committed is not totally consistent, even considering a single statement using a single snapshot, because of the rules about UPDATEs starting from the latest committed version. But let's not get into that today. Anyway, one thing I realized while making this text is that the example cases mentioned in CommitTransaction and GetSnapshotData really were basically the same case. The difference is that in this case, the failure to see B as committed is closely related to the way that xmax is computed. We could get rid of this example if we switched over to your LatestCompletedXid proposal. > I had a rather hard time understanding these things initially - at least > for me, the enlightenment came when I realized that most of the locking > in there actually ensures that (T1) holds - or in other words, that > the relation "A sees B as committed" *has* to be a transitive one. > So I'd like to see this reasoning in that README - though maybe I'm the > only one to whom this is the clearest wording... I'll put some of this into the README. regards, tom lane
Here's some revised text for the README file, based on using Florian's idea of a global latestCompletedXid variable. As I worked through it I realized that in this design, XidGenLock gates entry of new XIDs into the ProcArray while ProcArrayLock gates their removal. Which is an interesting sort of symmetry property. It also turns out that the reason we need to gate entry with XidGenLock is to keep from breaking GetOldestXmin, rather than to ensure correctness of snapshots per se.

(Note: I refer in the text to ProcArrayEndTransaction(), which is a function I'm thinking of putting into procarray.c to replace the current inline-in-xact.c code that clears xid and related fields.)

Comments?

			regards, tom lane


Interlocking transaction begin, transaction end, and snapshots
--------------------------------------------------------------

We try hard to minimize the amount of overhead and lock contention involved in the frequent activities of beginning/ending a transaction and taking a snapshot. Unfortunately, we must have some interlocking for this, because we must ensure consistency about the commit order of transactions. For example, suppose an UPDATE in xact A is blocked by xact B's prior update of the same row, and xact B is doing commit while xact C gets a snapshot. Xact A can complete and commit as soon as B releases its locks. If xact C's GetSnapshotData sees xact B as still running, then it had better see xact A as still running as well, or it will be able to see two tuple versions - one deleted by xact B and one inserted by xact A. Another reason why this would be bad is that C would see (in the row inserted by A) earlier changes by B, and it would be inconsistent for C not to see any of B's changes elsewhere in the database.

Formally, the correctness requirement is "if a snapshot A considers transaction X as committed, and any of transaction X's snapshots considered transaction Y as committed, then snapshot A must consider transaction Y as committed".

What we actually enforce is strict serialization of commits and rollbacks with snapshot-taking: we do not allow any transaction to exit the set of running transactions while a snapshot is being taken. (This rule is stronger than necessary for consistency, but is relatively simple to enforce, and it assists with some other issues as explained below.) The implementation of this is that GetSnapshotData takes the ProcArrayLock in shared mode (so that multiple backends can take snapshots in parallel), but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode while clearing MyProc->xid at transaction end (either commit or abort).

ProcArrayEndTransaction also holds the lock while advancing the shared latestCompletedXid variable. This allows GetSnapshotData to use latestCompletedXid + 1 as xmax for its snapshot: there can be no transaction >= this xid value that the snapshot needs to consider as completed.

In short, then, the rule is that no transaction may exit the set of currently-running transactions between the time we fetch latestCompletedXid and the time we finish building our snapshot. However, this restriction only applies to transactions that have an XID --- read-only transactions can end without acquiring ProcArrayLock, since they don't affect anyone else's snapshot nor latestCompletedXid.

Transaction start, per se, doesn't have any interlocking with these considerations, since we no longer assign an XID immediately at transaction start.
But when we do decide to allocate an XID, GetNewTransactionId must store the new XID into the shared ProcArray before releasing XidGenLock. This ensures that all top-level XIDs <= latestCompletedXid are either present in the ProcArray, or not running anymore. (This guarantee doesn't apply to subtransaction XIDs, because of the possibility that there's not room for them in the subxid array; instead we guarantee that they are present or the overflow flag is set.) If a backend released XidGenLock before storing its XID into MyProc, then it would be possible for another backend to allocate and commit a later XID, causing latestCompletedXid to pass the first backend's XID, before that value became visible in the ProcArray. That would break GetOldestXmin, as discussed below.

We allow GetNewTransactionId to store the XID into MyProc->xid (or the subxid array) without taking ProcArrayLock. This was once necessary to avoid deadlock; while that is no longer the case, it's still beneficial for performance. We are thereby relying on fetch/store of an XID to be atomic, else other backends might see a partially-set XID. This also means that readers of the ProcArray xid fields must be careful to fetch a value only once, rather than assume they can read it multiple times and get the same answer each time.

Another important activity that uses the shared ProcArray is GetOldestXmin, which must determine a lower bound for the oldest xmin of any active MVCC snapshot, system-wide. Each individual backend advertises the smallest xmin of its own snapshots in MyProc->xmin, or zero if it currently has no live snapshots (eg, if it's between transactions or hasn't yet set a snapshot for a new transaction). GetOldestXmin takes the MIN() of the valid xmin fields. It does this with only shared lock on ProcArrayLock, which means there is a potential race condition against other backends doing GetSnapshotData concurrently: we must be certain that a concurrent backend that is about to set its xmin does not compute an xmin less than what GetOldestXmin returns. We ensure that by including all the active XIDs into the MIN() calculation, along with the valid xmins. The rule that transactions can't exit without taking exclusive ProcArrayLock ensures that concurrent holders of shared ProcArrayLock will compute the same minimum of currently-active XIDs: no xact, in particular not the oldest, can exit while we hold shared ProcArrayLock. So GetOldestXmin's view of the minimum active XID will be the same as that of any concurrent GetSnapshotData, and so it can't produce an overestimate. If there is no active transaction at all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound for the xmin that might be computed by concurrent or later GetSnapshotData calls. (We know that no XID less than this could be about to appear in the ProcArray, because of the XidGenLock interlock discussed above.)

GetSnapshotData also performs an oldest-xmin calculation (which had better match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used for some tuple age cutoff checks where a fresh call of GetOldestXmin seems too expensive. Note that while it is certain that two concurrent executions of GetSnapshotData will compute the same xmin for their own snapshots, as argued above, it is not certain that they will arrive at the same estimate of RecentGlobalXmin.
This is because we allow XID-less transactions to clear their MyProc->xmin asynchronously (without taking ProcArrayLock), so one execution might see what had been the oldest xmin, and another not. This is OK since RecentGlobalXmin need only be a valid lower bound. As noted above, we are already assuming that fetch/store of the xid fields is atomic, so assuming it for xmin as well is no extra risk.
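As a rough sketch of what the ProcArrayEndTransaction() mentioned above might look like under this design; the signature and the latestCompletedXid member of ShmemVariableCache are assumptions of this sketch, not settled code:

    void
    ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
    {
        if (TransactionIdIsValid(latestXid))
        {
            /* We had an XID: exclude concurrent snapshot-takers while we exit */
            LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

            proc->xid = InvalidTransactionId;
            proc->lxid = InvalidLocalTransactionId;
            proc->xmin = InvalidTransactionId;
            proc->inVacuum = false;
            proc->subxids.nxids = 0;
            proc->subxids.overflowed = false;

            /* Advance the shared counter, never letting it go backward */
            if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
                                      latestXid))
                ShmemVariableCache->latestCompletedXid = latestXid;

            LWLockRelease(ProcArrayLock);
        }
        else
        {
            /* No XID was assigned, so no one else's snapshot can depend on
             * us: clear our fields without taking ProcArrayLock at all */
            proc->lxid = InvalidLocalTransactionId;
            proc->xmin = InvalidTransactionId;
            proc->inVacuum = false;
        }
    }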
Tom Lane wrote:
> Here's some revised text for the README file, based on using Florian's idea
> of a global latestCompletedXid variable. As I worked through it I realized
> that in this design, XidGenLock gates entry of new XIDs into the ProcArray
> while ProcArrayLock gates their removal. Which is an interesting sort of
> symmetry property. It also turns out that the reason we need to gate entry
> with XidGenLock is to keep from breaking GetOldestXmin, rather than to ensure
> correctness of snapshots per se.

I believe it would break both, no? If an xid <= latestCompletedXid is not included in the snapshot, but later used for updates, the snapshot will see those changes as committed when they really are not.

But other than that, it really sounds fine. It certainly explains things much better than the comments in the existing code.

I noticed two rather cosmetic issues:

.) latestCompletedXid sounds as if it might refer to the *last* completed xid, but it actually refers to the largest / highest completed xid. So maybe we should call it highestCompletedXid or largestCompletedXid.

.) Since you mention that we assume reading and writing int4s are atomic operations, maybe we should mention that, for safety's sake, we mark the corresponding pointers with volatile?

greetings, Florian Pflug
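For illustration, the convention being suggested looks roughly like this (the local names are just for the example, with arrayP standing for the shared ProcArray pointer): read each shared field exactly once, through a volatile-qualified pointer, into a local copy.

    volatile PGPROC *proc = arrayP->procs[index];
    TransactionId    xid = proc->xid;       /* one atomic fetch */

    if (TransactionIdIsNormal(xid))
    {
        /* work only with the local copy 'xid'; never re-read proc->xid
         * here, since it may change underneath us at any moment */
    }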
"Florian G. Pflug" <fgp@phlo.org> writes: > I noticed two rather cosmetic issues > .) latestCompletedXid sounds as it might refer to the *last* completed xid, > but it actually refers to the largest / highest completed xid. So maybe we > should call it highestCompletedXid or largestCompletedXid. Actually that was an intentional choice: because of the wraparound behavior of XIDs, the "latest" value is not necessarily numerically largest. I'm not wedded to it though. > .) Since you mention that we assume reading and writing int4s are atomic > operations, maybe we should mention that for safety's sake we mark the > corresponding pointers with volatile? Couldn't hurt. I have a draft patch that I'll post shortly. regards, tom lane
As a fallout of this work that I haven't seen made explicit, a session opening a transaction and then sitting around doing nothing will not cause as many problems as it used to -- for example it won't cause VACUUM to be unable to clean up dead rows. Is this correct?

Nowadays this is no longer much of a problem, but it used to be, back when drivers were more broken than they are now (when they had the commit method send "COMMIT; BEGIN"); however, it still seems to me like a big step forwards.

-- 
Alvaro Herrera                  http://www.amazon.com/gp/registry/5ZYLFMCVHXC
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)
Alvaro Herrera <alvherre@commandprompt.com> writes:
> As a fallout of this work that I haven't seen made explicit, a session
> opening a transaction and then sitting around doing nothing will not
> cause as many problems as it used to -- for example it won't cause
> VACUUM to be unable to clean up dead rows. Is this correct?

Yeah, if you just issue BEGIN and then sit, you won't have acquired either an xid or an xmin, so you don't create a VACUUM problem anymore.

If you issue BEGIN, then SELECT, then sit, you'll be publishing an xmin but not an xid, so at that point you become a problem for VACUUM. However, internally you don't have any live snapshots (if you're in READ COMMITTED mode), so eventually we could have you stop publishing an xmin too. That's something for 8.4 though.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > If you issue BEGIN, then SELECT, then sit, you'll be publishing an xmin > but not an xid, so at that point you become a problem for VACUUM. > However, internally you don't have any live snapshots (if you're in READ > COMMITTED mode), so eventually we could have you stop publishing an xmin > too. That's something for 8.4 though. Aren't there some things that depend on the idea that even READ COMMITTED transactions still have a serializable snapshot lying around for them to use? This is actually one of the rough spots in HOT that I'm afraid you'll have problems with when you review it. If there were any HOT chains which are broken according to a new index definition then a any transaction considering using that index needs to know whether there's any possibility the plan will be used with a snapshot which can see those old tuples. It currently does this by checking if the transaction which created the index is in its serializable snapshot. It seems weird to have the planner using snapshots in any way, but I don't see any direct problems which arise. Essentially it's using the serializable snapshot as a proxy for better live snapshot bookkeeping. If there's no serializable snapshot then any future snapshot taken will surely not be able to see the old tuples, and if there is then it should be fine (though not as good as possible if we did more bookkeeping) to use it to determine whether the old tuples could possibly be visible when the plan is executed. I'm a bit afraid the plan will stay cached longer than would be ideal. If the plan doesn't use the new index then ideally we would want to invalidate it from the cache at the end of this transaction I suppose. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> If you issue BEGIN, then SELECT, then sit, you'll be publishing an xmin
>> but not an xid, so at that point you become a problem for VACUUM.
>> However, internally you don't have any live snapshots (if you're in READ
>> COMMITTED mode), so eventually we could have you stop publishing an xmin
>> too. That's something for 8.4 though.

> Aren't there some things that depend on the idea that even READ COMMITTED
> transactions still have a serializable snapshot lying around for them to use?

I don't see why. A READ COMMITTED transaction that is between statements does still have a serializable snapshot sitting around, but it won't ever reference it again. (AFAIR you can't switch an active transaction into serializable mode.) So with a bit of work on snapshot management we could recognize that we have no live snapshots and clear the published xmin. This has been discussed before, eg this thread:
http://archives.postgresql.org/pgsql-patches/2007-03/msg00381.php

> This is actually one of the rough spots in HOT that I'm afraid you'll have
> problems with when you review it. If there were any HOT chains which are
> broken according to a new index definition, then any transaction considering
> using that index needs to know whether there's any possibility the plan will
> be used with a snapshot which can see those old tuples. It currently does
> this by checking if the transaction which created the index is in its
> serializable snapshot.

This seems to be a proxy for "what's the oldest live snapshot in this backend", which is exactly what we'd have to track to adjust xmin anyway.

			regards, tom lane
Greg,

> Aren't there some things that depend on the idea that even READ COMMITTED
> transactions still have a serializable snapshot lying around for them to
> use?

No, that would be REPEATABLE READ. That's one of the areas where we need to test HOT; do RR and SERIALIZABLE still work correctly with HOT?

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco
"Josh Berkus" <josh@agliodbs.com> writes: > No, that would be REPEATABLE READ. That's one of the areas where we need to > test HOT; does RR and SERIALIZABLE still work correctly with HOT? I'm a little confused by your question. REPEATABLE READ is an isolation level we don't directly support in Postgres. If you set the isolation level to REPEATABLE READ you get SERIALIZABLE which is a higher level. HOT works fine with SERIALIZABLE, It uses the same criteria as vacuum for determining when a tuple is dead, namely it compares against RecentGlobalXmin. The check for whether a transaction can see any old tuples in "broken" chains uses the serializable snapshot as a conservative proxy for the oldest snapshot which might be in use. That will work for both serializable and read committed. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Added to TODO:

* Expire published xmin for read-only and idle transactions

  http://archives.postgresql.org/pgsql-hackers/2007-09/msg00343.php

---------------------------------------------------------------------------

Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > As a fallout of this work that I haven't seen made explicit, a session
> > opening a transaction and then sitting around doing nothing will not
> > cause as many problems as it used to -- for example it won't cause
> > VACUUM to be unable to clean up dead rows. Is this correct?
>
> Yeah, if you just issue BEGIN and then sit, you won't have acquired
> either an xid or an xmin, so you don't create a VACUUM problem anymore.
>
> If you issue BEGIN, then SELECT, then sit, you'll be publishing an xmin
> but not an xid, so at that point you become a problem for VACUUM.
> However, internally you don't have any live snapshots (if you're in READ
> COMMITTED mode), so eventually we could have you stop publishing an xmin
> too. That's something for 8.4 though.
>
> 			regards, tom lane

-- 
Bruce Momjian  <bruce@momjian.us>          http://momjian.us
EnterpriseDB                               http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +