Thread: Re: serializable lock consistency
On Wed, Dec 15, 2010 at 9:20 AM, Florian Pflug <fgp@phlo.org> wrote:
> On Dec14, 2010, at 15:01, Robert Haas wrote:
>> On Tue, Dec 14, 2010 at 7:51 AM, Florian Pflug <fgp@phlo.org> wrote:
>>>> - serializable lock consistency - I am fairly certain this needs rebasing. I don't have time to deal with it right away. That sucks, because I think this is a really important change.
>>> I can try to find some time to update the patch if it suffers from bit-rot. Would that help?
>>
>> Yes!
>
> I've rebased the patch to the current HEAD, and re-run my FK concurrency test suite, available from https://github.com/fgp/fk_concurrency, to verify that things still work.
>
> I've also added asserts to the callers of heap_{update,delete,lock_tuple} to verify (and document) that update_xmax may only be InvalidTransactionId if a lockcheck_snapshot is passed to heap_{update,delete,lock_tuple}.
>
> Finally, I've improved the explanation in src/backend/executor/README of how row locks and REPEATABLE READ transactions interact, and tried to state the guarantees provided by FOR SHARE and FOR UPDATE locks more precisely.
>
> I've published my work to https://github.com/fgp/postgres/tree/serializable_lock_consistency, and attached an updated patch. I'd be happy to give write access to that GIT repository to anyone who wants to help getting this committed.

I am having a hard time understanding this patch. I decided to start with the README, and I'm still lost. :-(

My understanding of the problem is as follows. Acquiring a lock on a tuple prevents the tuple from being concurrently updated. You might take such a lock on a tuple T, make some other modification to the database M, and commit, in the hope that your lock will prevent a concurrent transaction from updating T without seeing M. However, it doesn't work. When your lock on T is released, a concurrent serializable transaction will still be looking at the database using its original snapshot, and therefore won't see M. In effect, you might as well not have bothered obtaining a lock at all. (You'd get the same wrong answer that way, without unnecessary blocking!)

To fix this, one must ensure that if a tuple T is locked - either for update or for share - by transaction A, and if a serializable transaction B whose snapshot doesn't permit it to see the effects of A subsequently attempts to update T, it must roll back.

Questions:

1. Is my understanding of the problem correct?
2. Is my understanding of the fix correct?
3. Is that what this patch implements?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
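To make the failure concrete, here is one way it can play out with today's SERIALIZABLE (which is snapshot isolation); the parent/child schema is hypothetical and not taken from the patch or its test suite. The FOR SHARE lock is meant to say "don't delete this parent without seeing my new child row", but it does not achieve that:

    -- Session B (SERIALIZABLE) takes its snapshot before A commits:
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    SELECT count(*) FROM child WHERE parent_id = 1;   -- sees no child rows yet

    -- Session A locks the parent, inserts a child, and commits:
    BEGIN;
    SELECT 1 FROM parent WHERE id = 1 FOR SHARE;
    INSERT INTO child (parent_id) VALUES (1);
    COMMIT;

    -- Back in session B: the DELETE blocks until A commits, then proceeds,
    -- since the tuple was merely locked, not updated.  B's snapshot still
    -- predates A, so B never sees the new child row; the lock bought nothing.
    DELETE FROM parent WHERE id = 1;
    COMMIT;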
> My understanding of the problem is as follows. Acquiring a lock on a > tuple prevents the tuple from being concurrently updated. You might > take such a lock on a tuple T, make some other modification to the > database M, and commit, in the hope that your lock will prevent a > concurrent transaction from updating T without seeing M. However, it > doesn't work. When your lock on T is released, a concurrent > serializable transaction will still be looking at the database using > its original snapshot, and therefore won't see M. Exactly. > In effect, you > might as well not have bothered obtaining a lock at all. (You'd get > the same wrong answer that way, without unnecessary blocking!) If only serializable transactions are involved and they all commit, then yes. The only reason for serializable transactions A to wait for a lock taken by another transaction B is that B might abort, in which case A may proceed. > To fix this, one must ensure that if a tuple T is locked - either for > update or for share - by transaction A, and if a serializable > transaction B whose snapshot doesn't permit it to see the effects of A > subsequently attempts to update T, it must roll back. Yes. Otherwise, B cannot verify that the database is consistent. Note that it's sufficient to check if B can see the effects of the *latest* locker of T. If it can see those, it must also see the effects of any previous locker. But because of this, B cannot distinguish different lock strengths on T - even if A locked T in exclusive mode, some transaction A2 may lock T in shared mode after A has committed but before B inspects T. > Questions: > 1. Is my understanding of the problem correct? > 2. Is my understanding of the fix correct? Seems fine to me. > 3. Is that what this patch implements? It is. There is a small additional twist, however. For serializable transactions, everything is as you explained it. Taking a lock amounts to saying "If you can't see what I did, leave this tuple alone". A read-committed transactions, though, sees different things at different times since it takes a new snapshot for every statement. Since we cannot raise a serialization error in a read-committed transaction, an UPDATE or DELETE statement within a read-committed transaction may very well modify a previously locked row, even if *doesn't* see the effects of some concurrent locker. Any snapshot taken after the UDPATE or DELETE hit the locked row, however, *will* see those changes. This includes the snapshot taken within any AFTER trigger fired on updating the locked row. Thus, things for one fine for RI constraints enforced by triggers. best regards, Florian Pflug
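Following the same hypothetical parent/child example, the READ COMMITTED side of this can be sketched as follows (an RI-style AFTER DELETE trigger on parent is assumed):

    -- Session B, default READ COMMITTED, after A has share-locked the parent
    -- and inserted a child row as before:
    BEGIN;
    DELETE FROM parent WHERE id = 1;   -- waits for A, then proceeds; no error is raised
    -- In READ COMMITTED every query takes a fresh snapshot, including queries
    -- run inside an AFTER trigger fired by the DELETE above, so an RI trigger
    -- looking for remaining child rows does see A's insert and can abort.
    COMMIT;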
On Sun, Dec 19, 2010 at 4:02 AM, Florian Pflug <fgp@phlo.org> wrote:
> Yes. Otherwise, B cannot verify that the database is consistent.
>
> Note that it's sufficient to check if B can see the effects of the *latest* locker of T. If it can see those, it must also see the effects of any previous locker. But because of this, B cannot distinguish different lock strengths on T - even if A locked T in exclusive mode, some transaction A2 may lock T in shared mode after A has committed but before B inspects T.

This seems to point to a rather serious problem, though. If B sees that the last locker A1 aborted, correctness requires that it roll back B, because there might have been some other transaction A2 which locked T and committed before A1 touched it. Implementing that behavior could lead to a lot of spurious rollbacks, but NOT implementing it isn't fully correct.

> For serializable transactions, everything is as you explained it. Taking a lock amounts to saying "If you can't see what I did, leave this tuple alone". A read-committed transactions, though, sees different things at different times since it takes a new snapshot for every statement. Since we cannot raise a serialization error in a read-committed transaction, an UPDATE or DELETE statement within a read-committed transaction may very well modify a previously locked row, even if *doesn't* see the effects of some concurrent locker. Any snapshot taken after the UDPATE or DELETE hit the locked row, however, *will* see those changes. This includes the snapshot taken within any AFTER trigger fired on updating the locked row. Thus, things for one fine for RI constraints enforced by triggers.

I can't parse the last sentence of this paragraph. I think there is a word or two wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
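Spelled out against the same hypothetical table, the sequence Robert describes would be roughly:

    -- Session B takes its snapshot first:
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    SELECT * FROM parent WHERE id = 1;               -- snapshot predates A2 and A1

    -- Session A2 locks the row and commits:
    BEGIN; SELECT 1 FROM parent WHERE id = 1 FOR UPDATE; COMMIT;

    -- Session A1 locks the row and aborts, overwriting the tuple's xmax:
    BEGIN; SELECT 1 FROM parent WHERE id = 1 FOR UPDATE; ROLLBACK;

    -- Session B now modifies the row.  The tuple's xmax names only the aborted
    -- A1; the committed locker A2 has left no trace, so a check based on the
    -- current xmax alone cannot tell that B ought to be rolled back.
    DELETE FROM parent WHERE id = 1;
    COMMIT;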
> On Sun, Dec 19, 2010 at 4:02 AM, Florian Pflug <fgp@phlo.org> wrote:
>> Note that it's sufficient to check if B can see the effects of the *latest* locker of T. If it can see those, it must also see the effects of any previous locker. But because of this, B cannot distinguish different lock strengths on T - even if A locked T in exclusive mode, some transaction A2 may lock T in shared mode after A has committed but before B inspects T.
>
> This seems to point to a rather serious problem, though. If B sees that the last locker A1 aborted, correctness requires that it roll back B, because there might have been some other transaction A2 which locked locked T and committed before A1 touched it. Implementing that behavior could lead to a lot of spurious rollbacks, but NOT implementing it isn't fully correct.

Certainly seems serious. How on earth did I manage to miss that one, I wonder :-(

If only shared locks are involved, the patch probably works correctly, though, since they don't remove all traces of previous lockers, they merely add themselves to the multi-xid holding the lock. That'd also explain why my concurrency test-suite didn't trip over this. So we may only need to do what you propose for exclusive locks.

I'll look at this later today in more detail - I'm not currently at home, so I don't have access to the sources...

>> For serializable transactions, everything is as you explained it. Taking a lock amounts to saying "If you can't see what I did, leave this tuple alone". A read-committed transactions, though, sees different things at different times since it takes a new snapshot for every statement. Since we cannot raise a serialization error in a read-committed transaction, an UPDATE or DELETE statement within a read-committed transaction may very well modify a previously locked row, even if *doesn't* see the effects of some concurrent locker. Any snapshot taken after the UDPATE or DELETE hit the locked row, however, *will* see those changes. This includes the snapshot taken within any AFTER trigger fired on updating the locked row. Thus, things for one fine for RI constraints enforced by triggers.
>
> I can't parse the last sentence of this paragraph. I think there is a word or two wrong.

Oops, sorry for that. s/for one/are - the sentence should have read "Thus, things are fine for RI constraints enforced by triggers".

best regards,
Florian Pflug
On Sun, Dec 19, 2010 at 9:12 AM, Florian Pflug <fgp@phlo.org> wrote: >> On Sun, Dec 19, 2010 at 4:02 AM, Florian Pflug <fgp@phlo.org> wrote: >>> Note that it's sufficient to check if B can see the effects of the >>> *latest* locker of T. If it can see those, it must also see the >>> effects of any previous locker. But because of this, B cannot >>> distinguish different lock strengths on T - even if A locked T >>> in exclusive mode, some transaction A2 may lock T in shared mode >>> after A has committed but before B inspects T. >> >> This seems to point to a rather serious problem, though. If B sees >> that the last locker A1 aborted, correctness requires that it roll >> back B, because there might have been some other transaction A2 which >> locked locked T and committed before A1 touched it. Implementing that >> behavior could lead to a lot of spurious rollbacks, but NOT >> implementing it isn't fully correct. > > Certainly seems serios. How on earth did I manage to miss that one, I > wonder :-( > > If only shared locks are invovled, the patch probably works correctly, > though, since they don't remove all traces of previous lockers, they > merely add themselves to the multi-xid holding the lock. That'd also > explain why my concurrency test-suite didn't trip over this. So we may > only need to do what you propose for exclusive locks. I'd be willing to bet, without checking, that if the previous shared locker is no longer running by the time the next one comes along, we don't create a multixact in the first place. And even if it does, what about this: exclusive lock, commit, shared lock, abort. As unhappy as I am with the present behavior, this cure sounds like it might be worse than the disease. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Dec19, 2010, at 17:01, Robert Haas wrote:
> On Sun, Dec 19, 2010 at 9:12 AM, Florian Pflug <fgp@phlo.org> wrote:
>>> On Sun, Dec 19, 2010 at 4:02 AM, Florian Pflug <fgp@phlo.org> wrote:
>>>> Note that it's sufficient to check if B can see the effects of the *latest* locker of T. If it can see those, it must also see the effects of any previous locker. But because of this, B cannot distinguish different lock strengths on T - even if A locked T in exclusive mode, some transaction A2 may lock T in shared mode after A has committed but before B inspects T.
>>>
>>> This seems to point to a rather serious problem, though. If B sees that the last locker A1 aborted, correctness requires that it roll back B, because there might have been some other transaction A2 which locked locked T and committed before A1 touched it. Implementing that behavior could lead to a lot of spurious rollbacks, but NOT implementing it isn't fully correct.
>>
>> Certainly seems serios. How on earth did I manage to miss that one, I wonder :-(
>>
>> If only shared locks are invovled, the patch probably works correctly, though, since they don't remove all traces of previous lockers, they merely add themselves to the multi-xid holding the lock. That'd also explain why my concurrency test-suite didn't trip over this. So we may only need to do what you propose for exclusive locks.
>
> I'd be willing to bet, without checking, that if the previous shared locker is no longer running by the time the next one comes along, we don't create a multixact in the first place.

And you'd have won that bet. I just checked, and this is exactly what we do.

> And even if it does, what about this: exclusive lock, commit, shared lock, abort.

I figured we could solve that by adding the exclusive locker's xid to the multi-xid if it was >= GlobalXmin. But... Playing games with multi-xid lifetimes helps nothing in case T1 locks, T1 commits, T2 updates, T2 aborts, all after T0 took its snapshot but before T0 attempts to delete. :-(

Actually, wait a second... What we're interested in here is the last non-aborted xmax of the tuple. If that is invisible to some serializable transaction which tries to modify the tuple, we raise a serialization error. In the case of updates, deletes and exclusive locks, we always wait for the xid(s) in xmax (more than one if the tuple was share-locked) to either commit or abort, and thus know their fate. We can therefore track the latest non-aborted xmax in a single additional field "xlast" by simply saving the old xmax there before overwriting it, *if* that xmax did actually commit. Otherwise we leave xlast as it was.

The same works for share locks. If the tuple was previously locked exclusively, updated or deleted, we proceed as above. If the tuple was previously share-locked, some other lockers may still be in progress. If at least one of them has committed, we again store its xid in xlast. Otherwise, we again leave xlast as it was. Note that xlast is never a multi-xid!

When a serializable transaction updates, deletes or exclusively locks a tuple, we check if either xmax *or* xlast is invisible. If so, we raise a serialization error.

If we reuse the legacy field xvac to store xlast, we don't get into trouble with binary upgrades either. We'd need to find a way to deal with tuples where HEAP_MOVED_IN or HEAP_MOVED_OUT is set, but that seems manageable.

Does that sound viable?
> As unhappy as I am with the present behavior, this cure sounds like it might be worse than the disease.

I agree. Aborting due to conflicts with aborted transactions really seems like a bad idea - there are probably cases where that would lead to one spurious abort causing another causing another... But the solution sketched may be a way out...

best regards,
Florian Pflug

PS: Thanks to anyone who put work into this! I'm extremely sorry that I didn't catch this earlier, and thus caused unnecessary work on your end :-(
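For illustration, the proposed bookkeeping applied to the earlier A2/A1/B timeline; the xlast field is purely hypothetical at this point:

    -- initial state:                xmax = none,           xlast = none
    -- A2 locks FOR UPDATE:          xmax = A2
    -- A2 commits
    -- A1 locks FOR UPDATE:          xmax = A1,             xlast = A2  (old xmax committed)
    -- A1 aborts:                    xmax = A1 (aborted),   xlast = A2
    -- B (snapshot older than A2) attempts its DELETE:
    --   xmax is aborted, but xlast = A2 is invisible to B's snapshot,
    --   so B gets a serialization error, as intended.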
On 19.12.2010 20:57, Florian Pflug wrote: > If we reuse the legacy field xvac to store xlast, we don't get into > trouble with binary upgrades either. We' need to find a way to deal > with tuples where HEAP_MOVED_IN or HEAP_MOVED_OUT is set, but that > seems manageable.. xvac shares the field with command id, and cid is in use while the tuple is being updated. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Dec20, 2010, at 07:16 , Heikki Linnakangas wrote: > On 19.12.2010 20:57, Florian Pflug wrote: >> If we reuse the legacy field xvac to store xlast, we don't get into >> trouble with binary upgrades either. We' need to find a way to deal >> with tuples where HEAP_MOVED_IN or HEAP_MOVED_OUT is set, but that >> seems manageable.. > > xvac shares the field with command id, and cid is in use while the tuple is being updated. Right :-( Well, that nails this coffin shut pretty tightly, unless we were willing to add another field to heap tuples. best regards, Florian Pflug
On 20.12.2010 13:52, Florian Pflug wrote: > On Dec20, 2010, at 07:16 , Heikki Linnakangas wrote: >> On 19.12.2010 20:57, Florian Pflug wrote: >>> If we reuse the legacy field xvac to store xlast, we don't get into >>> trouble with binary upgrades either. We' need to find a way to deal >>> with tuples where HEAP_MOVED_IN or HEAP_MOVED_OUT is set, but that >>> seems manageable.. >> >> xvac shares the field with command id, and cid is in use while the tuple is being updated. > > Right :-( > > Well, that nails this coffin shut pretty tightly, unless we were willing to add > another field to heap tuples. One way to look at this is that the problem arises because SELECT FOR UPDATE doesn't create a new tuple like UPDATE does. The problematic case was: > T1 locks, T1 commits, T2 updates, T2 aborts, all after T0 > took its snapshot but before T0 attempts to delete. :-( If T1 does a regular UPDATE, T2 doesn't overwrite the xmax on the original tuple, but on the tuple that T1 created. So one way to handle FOR UPDATE would be to lazily turn the lock operation by T1 into a dummy update, when T2 updates the tuple. You can't retroactively make a regular update on behalf of the locking transaction that committed already, or concurrent selects would see the same row twice, but it might work with some kind of a magic tuple that's only followed through the ctid from the original one, and only for the purpose of visibility checks. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Dec20, 2010, at 13:13, Heikki Linnakangas wrote:
> One way to look at this is that the problem arises because SELECT FOR UPDATE doesn't create a new tuple like UPDATE does. The problematic case was:
>
>> T1 locks, T1 commits, T2 updates, T2 aborts, all after T0 took its snapshot but before T0 attempts to delete. :-(
>
> If T1 does a regular UPDATE, T2 doesn't overwrite the xmax on the original tuple, but on the tuple that T1 created.
>
> So one way to handle FOR UPDATE would be to lazily turn the lock operation by T1 into a dummy update, when T2 updates the tuple. You can't retroactively make a regular update on behalf of the locking transaction that committed already, or concurrent selects would see the same row twice, but it might work with some kind of a magic tuple that's only followed through the ctid from the original one, and only for the purpose of visibility checks.

In the case of an UPDATE of a recently locked tuple, we could avoid having to insert a dummy tuple by storing the old tuple's xmax in the new tuple's xmax. We'd flag the old tuple, and attempt to restore the xmax of any flagged tuple with an aborted xmax and a ctid != t_self during scanning and vacuuming.

For DELETEs, that won't work. However, could we maybe abuse the ctid to store the old xmax? It currently contains t_self, but do we actually depend on that?

FOR-SHARE and FOR-UPDATE locks could preserve information about the latest committed locker by creating a multi-xid. For FOR-SHARE locks, we'd just need to ensure that we only remove all but one finished transactions. For FOR-UPDATE locks, we'd need to create a multi-xid if the old xmax is >= GlobalXmin, but I guess that's tolerable.

best regards,
Florian Pflug
On Mon, Dec 20, 2010 at 9:11 AM, Florian Pflug <fgp@phlo.org> wrote:
> On Dec20, 2010, at 13:13, Heikki Linnakangas wrote:
>> One way to look at this is that the problem arises because SELECT FOR UPDATE doesn't create a new tuple like UPDATE does. The problematic case was:
>>
>>> T1 locks, T1 commits, T2 updates, T2 aborts, all after T0 took its snapshot but before T0 attempts to delete. :-(
>>
>> If T1 does a regular UPDATE, T2 doesn't overwrite the xmax on the original tuple, but on the tuple that T1 created.
>>
>> So one way to handle FOR UPDATE would be to lazily turn the lock operation by T1 into a dummy update, when T2 updates the tuple. You can't retroactively make a regular update on behalf of the locking transaction that committed already, or concurrent selects would see the same row twice, but it might work with some kind of a magic tuple that's only followed through the ctid from the original one, and only for the purpose of visibility checks.
>
> In the case of an UPDATE of a recently locked tuple, we could avoid having to insert a dummy tuple by storing the old tuple's xmax in the new tuple's xmax. We'd flag the old tuple, and attempt to restore the xmax of any flagged tuple with an aborted xmax and a ctid != t_self during scanning and vacuuming.
>
> For DELETEs, that won't work. However, could we maybe abuse the ctid to store the old xmax? It currently contains t_self, but do we actually depend on that?

My first reaction to all of this is that it sounds awfully grotty.

> FOR-SHARE and FOR-UPDATE locks could preserve information about the latest committed locker by creating a multi-xid. For FOR-SHARE locks, we'd just need to ensure that we only remove all but one finished transactions. For FOR-UPDATE locks, we'd need to create a multi-xid if the old xmax is >= GlobalXmin, but I guess that's tolerable.

Even in the original version of this patch, there's a non-trivial overhead here when a multi-xid exists that doesn't exist today: a serializable transaction has to grovel through the XIDs in the multi-xact and figure out whether any of them are new enough to be a problem.

I fear that this whole approach is a case of trying to jam a square peg through a round hole. We're trying to force the on-disk format that we have to meet a requirement it wasn't designed for, and it's looking pretty ugly. Kevin Grittner's work is a whole different approach to this problem, and while that's obviously not fully debugged and committed yet either, it's often easier to design a new tool to solve a particular problem than to make an existing tool that was really meant for something else do some new thing in addition.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > Kevin Grittner's work is a whole different approach to this > problem, and while that's obviously not fully debugged and > committed yet either, it's often easier to design a new tool to > solve a particular problem than to make an existing tool that was > really meant for something else do some new thing in addition. I see Florian's patch meeting a real need though, even though it is an orthogonal solution to the same problem. For those converting to PostgreSQL from Oracle, there are shops with a lot of code which counts on the behavior which Florian is trying to implement. Without Florian's patch, if they do a minimal-effort conversion to PostgreSQL (without removing SELECT FOR SHARE/UPDATE clauses or explicit locks) and count on SSI behavior, they will be paying for the cost of integrity enforcement *twice*. It also may be useful for those who can't easily structure their code to automatically retry transactions which experience a serialization failure. For those shops doing "green field" development or converting from products with true serializability (DB2, Sybase ASE, InnoDB, etc.) it will probably be easier for them to just use SSI; and I have reason to believe that SSI will generally perform better, so Florian's patch doesn't obviate the need for SSI. -Kevin
On Mon, Dec 20, 2010 at 12:39 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > I see Florian's patch meeting a real need though, I agree, but that whole approach seems to be foundering on the rocks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Dec20, 2010, at 17:54, Robert Haas wrote:
> On Mon, Dec 20, 2010 at 9:11 AM, Florian Pflug <fgp@phlo.org> wrote:
>> On Dec20, 2010, at 13:13, Heikki Linnakangas wrote:
>>> One way to look at this is that the problem arises because SELECT FOR UPDATE doesn't create a new tuple like UPDATE does. The problematic case was:
>>>
>>>> T1 locks, T1 commits, T2 updates, T2 aborts, all after T0 took its snapshot but before T0 attempts to delete. :-(
>>>
>>> If T1 does a regular UPDATE, T2 doesn't overwrite the xmax on the original tuple, but on the tuple that T1 created.
>>>
>>> So one way to handle FOR UPDATE would be to lazily turn the lock operation by T1 into a dummy update, when T2 updates the tuple. You can't retroactively make a regular update on behalf of the locking transaction that committed already, or concurrent selects would see the same row twice, but it might work with some kind of a magic tuple that's only followed through the ctid from the original one, and only for the purpose of visibility checks.
>>
>> In the case of an UPDATE of a recently locked tuple, we could avoid having to insert a dummy tuple by storing the old tuple's xmax in the new tuple's xmax. We'd flag the old tuple, and attempt to restore the xmax of any flagged tuple with an aborted xmax and a ctid != t_self during scanning and vacuuming.
>>
>> For DELETEs, that won't work. However, could we maybe abuse the ctid to store the old xmax? It currently contains t_self, but do we actually depend on that?
>
> My first reaction to all of this is that it sounds awfully grotty.

Well, there's precedent for overlaying fields in the tuple header for space efficiency...

>> FOR-SHARE and FOR-UPDATE locks could preserve information about the latest committed locker by creating a multi-xid. For FOR-SHARE locks, we'd just need to ensure that we only remove all but one finished transactions. For FOR-UPDATE locks, we'd need to create a multi-xid if the old xmax is >= GlobalXmin, but I guess that's tolerable.
>
> Even in the original version of this patch, there's a non-trivial overhead here when a multi-xid exists that doesn't exist today: a serializable transaction has to grovel through the XIDs in the multi-xact and figure out whether any of them are new enough to be a problem.

It has to grovel through them anyway to see if any one of them is still running (in the UPDATE/DELETE/FOR-UPDATE case), or needs to copy them into a new multi-xid (in the FOR-SHARE case). Scanning it a second time thus won't cause any further IO. If the second scan is really a concern here, it could probably be folded into the first, but I doubt very much that it'd be worth the added complexity.

Having to create a multi-xid for FOR-UPDATE locks is a real concern, though. But if overlaying/abusing ctid proves to be possible in the case of DELETEs, the same could be done for FOR-SHARE and FOR-UPDATE locks. Only the UPDATE case would then remain...

> I fear that this whole approach is a case of trying to jam a square peg through a round hole. We're trying to force the on-disk format that we have to meet a requirement it wasn't designed for, and it's looking pretty ugly.

Certainly true. But most of the grotty-ness isn't inherently part of a solution to the problem at hand. Rather, it's imposed on us by the requirement for binary upgradability. IMHO, it's the price we pay for having that.
> Kevin Grittner's work is a whole different approach to this problem, and while that's obviously not fully debugged and committed yet either, it's often easier to design a new tool to solve a particular problem than to make an existing tool that was really meant for something else do some new thing in addition.

Kevin's work only solves the problem at hand if we removed support for what we call SERIALIZABLE now completely. As it stands, however, what we now call SERIALIZABLE will still be available as isolation level REPEATABLE READ, just as it is now, and will retain its quirky behaviour regarding row-level locks. For some workloads at least, Kevin's work also introduces many more reasons for spurious serialization failures.

Don't get me wrong here - I think what Kevin is doing is very important, and will be a huge leap for postgres, putting us ahead of all commercial and open-source competitors that I know of. But having to pay the price for that to be able to simply enforce RI constraints slightly outside of what FK constraints can do still seems wrong. For driving a screw into a piece of wood, a screwdriver still beats even the most powerful hammer, and leaves fewer traces on the wood...

BTW, I realized that preventing UPDATEs, DELETEs and FOR-UPDATE locks from removing all traces of previous lock owners would also help to solve the bug cautioned about here:
http://www.postgresql.org/docs/9.0/interactive/sql-select.html#SQL-FOR-UPDATE-SHARE

Briefly, the problem is if you do

  BEGIN;
  SELECT * FROM mytable WHERE key = 1 FOR UPDATE;
  SAVEPOINT s;
  UPDATE mytable SET ... WHERE key = 1;
  ROLLBACK TO s;

the row ends up unlocked, instead of locked FOR UPDATE.

While the current behaviour of SERIALIZABLE transactions regarding row locks may be categorized not as a bug but as a lack of features, this problem is IMHO clearly a bug. A bug that it's possible to live with, yes, but still a bug.

For me, this is another very good reason to explore this further. Plus, it improves the ratio of grotty-ness vs. number-of-problems-solved ;-)

best regards,
Florian Pflug
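The observable symptom of that bug, sketched from a second session while the transaction above is still open after the ROLLBACK TO (the column name is hypothetical):

    -- Session 2:
    UPDATE mytable SET val = val + 1 WHERE key = 1;
    -- Expected: this blocks, because session 1 still holds its FOR UPDATE lock.
    -- Actual:   it proceeds immediately; the aborted subtransaction took the
    --           row lock with it.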
On Mon, Dec 20, 2010 at 12:49 PM, Florian Pflug <fgp@phlo.org> wrote: > For me, this is another very good reason to explore this further. Plus, it > improves the ratio of grotty-ness vs. number-of-problems-soved ;-) By all means, look into it further. I fear the boat is filling up with water, but if you manage to come up with a workable solution I'll be as happy as anyone, promise! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Dec20, 2010, at 18:54, Robert Haas wrote:
> On Mon, Dec 20, 2010 at 12:49 PM, Florian Pflug <fgp@phlo.org> wrote:
>> For me, this is another very good reason to explore this further. Plus, it improves the ratio of grotty-ness vs. number-of-problems-soved ;-)
>
> By all means, look into it further. I fear the boat is filling up with water, but if you manage to come up with a workable solution I'll be as happy as anyone, promise!

I'll try to create a detailed proposal. To do that, however, I'll require some guidance on what's acceptable and what's not.

Here's a summary of the preceding discussion.

To deal with aborted transactions correctly, we need to track the last locker of a particular tuple that actually committed. If we also want to fix the bug that causes a row lock to be lost upon doing lock;savepoint;update;restore, that "latest committed locker" will sometimes need to be a set, since it'll need to store the outer transaction's xid as well as the latest actually committed locker.

As long as no transaction aborts are involved, the tuple's xmax contains all the information we need. If a transaction updates, deletes or locks a row, the previous xmax is overwritten. If the transaction later aborts, we cannot decide whether it has previously been locked or not.

And these ideas have come up:

A) Transactions who merely lock a row could put the previous locker's xid (if >= GlobalXmin) *and* their own xid into a multi-xid, and store that in xmax. For shared locks, this merely means cleaning out the existing multi-xid a bit less aggressively. There's no risk of bloat there, since we only need to keep one committed xid, not all of them. For exclusive locks, we currently never create a multi-xid. That'd change, we'd need to create one if we find a previous locker with an xid >= GlobalXmin. This doesn't solve the UPDATE and DELETE cases. For SELECT-FOR-SHARE this is probably the best option, since it comes very close to what we do currently.

B) A transaction who UPDATEs or DELETEs a tuple could create an intermediate lock-only tuple which'd contain the necessary information about previous lock holders. We'd only need to do that if there actually is one with xid >= GlobalXmin. We could then choose whether to do the same for SELECT-FOR-UPDATE, or whether we'd prefer to go with (A).

C) The ctid field is only necessary for updated tuples. We could thus overlay it with a field which stores the last committed locker after a DELETE. UPDATEs could be handled either as in (B), or by storing the information in the ctid-overlay in the *new* tuple. SELECT-FOR-UPDATE could again either also use the ctid overlay or use (A).

D) We could add a new tuple header field xlatest. To support binary upgrade, we'd need to be able to read tuples without that field also. We could then either create a new tuple version upon the first lock request to such a tuple (which would then include the new header), or we could simply raise a serialization error if a serializable transaction tried to update a tuple without the field whose xmax was aborted and >= GlobalXmin.

I have the nagging feeling that (D) will meet quite some resistance. (C) wasn't too well received either, though I wonder if that'd change if the grotty-ness was hidden behind an API, much like the xvac/cmin/cmax overlay is. (B) seems like a lot of overhead, but maybe cleaner. More research is needed though to check how it'd interact with HOT and how to get the locking right. (A) is IMHO the best solution for the SELECT-FOR-SHARE since it's very close to what we do today.
Any comments? Especially of the "don't you dare" kind? best regards, Florian Pflug
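As a sketch of what option (A) would change for SELECT FOR SHARE relative to current behaviour (this merely restates the proposal above; it is not implemented anywhere):

    -- Tuple is share-locked by S1, which then commits; afterwards S2
    -- share-locks the same tuple while S1 is still >= GlobalXmin:
    --   today:       xmax = S2                   (S1's lock leaves no trace)
    --   option (A):  xmax = MultiXact(S1, S2)    (one committed locker is kept)
    -- For exclusive locks the analogous change would mean creating a
    -- multi-xid where none is created today.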
On Mon, Dec 20, 2010 at 5:32 PM, Florian Pflug <fgp@phlo.org> wrote: > On Dec20, 2010, at 18:54 , Robert Haas wrote: >> On Mon, Dec 20, 2010 at 12:49 PM, Florian Pflug <fgp@phlo.org> wrote: >>> For me, this is another very good reason to explore this further. Plus, it >>> improves the ratio of grotty-ness vs. number-of-problems-soved ;-) >> >> By all means, look into it further. I fear the boat is filling up >> with water, but if you manage to come up with a workable solution I'll >> be as happy as anyone, promise! > > I'll try to create a details proposal. To do that, however, I'll require > some guidance on whats acceptable and whats not. > > Here's a summary of the preceding discussion > > To deal with aborted transactions correctly, we need to track the last > locker of a particular tuple that actually committed. If we also want > to fix the bug that causes a row lock to be lost upon doing > lock;savepoint;update;restore that "latest committed locker" will > sometimes need to be a set, since it'll need to store the outer > transaction's xid as well as the latest actually committed locker. > > As long as no transaction aborts are involved, the tuple's xmax > contains all the information we need. If a transaction updates, > deletes or locks a row, the previous xmax is overwritten. If the > transaction later aborts, we cannot decide whether it has previously > been locked or not. > > And these ideas have come up > > A) Transactions who merely lock a row could put the previous > locker's xid (if >= GlobalXmin) *and* their own xid into a multi-xid, > and store that in xmax. For shared locks, this merely means cleaning > out the existing multi-xid a bit less aggressively. There's > no risk of bloat there, since we only need to keep one committed > xid, not all of them. For exclusive locks, we currently never > create a multi-xid. That'd change, we'd need to create one > if we find a previous locker with an xid >= GlobalXmin. This doesn't > solve the UPDATE and DELETE cases. For SELECT-FOR-SHARE this > is probably the best option, since it comes very close to what > we do currently. > > B) A transaction who UPDATEs or DELETEs a tuple could create an > intermediate lock-only tuple which'd contain the necessary > information about previous lock holders. We'd only need to do > that if there actually is one with xid >= GlobalXmin. We could > then choose whether to do the same for SELECT-FOR-UPDATE, or > whether we'd prefer to go with (A) > > C) The ctid field is only necessary for updated tuples. We could thus > overlay it with a field which stores the last committed locker after > a DELETE. UPDATEs could be handled either as in (B), or by storing the > information in the ctid-overlay in the *new* tuple. SELECT-FOR-UPDATE > could again either also use the ctid overlay or use (A). > > D) We could add a new tuple header field xlatest. To support binary > upgrade, we'd need to be able to read tuples without that field > also. We could then either create a new tuple version upon the > first lock request to such a tuple (which would then include the > new header), or we could simply raise a serialization error if > a serializable transaction tried to update a tuple without the > field whose xmax was aborted and >= GlobalXmin. > > I have the nagging feeling that (D) will meet quite some resistance. (C) was > too well received either, though I wonder if that'd change if the grotty-ness > was hidden behind a API, much xvac/cmin/cmax overlay is. (B) seems like a > lot of overhead, but maybe cleaner. 
> More research is needed though to check how it'd interact with HOT and how to get the locking right. (A) is IMHO the best solution for the SELECT-FOR-SHARE since it's very close to what we do today.
>
> Any comments? Especially of the "don't you dare" kind?

I think any solution based on (D) has zero chance of being accepted, and it wouldn't be that high except that probabilities can't be negative. Unfortunately, I don't understand (A) or (B) well enough to comment intelligently.

My previously expressed concern about (C) wasn't based on ugliness, but rather on my belief that there is likely a whole lot of code which relies on the CTID being a self-link when no UPDATE has occurred. We'd have to be confident that all such cases had been found and fixed, which might not be easy to be confident about.

I have a more "meta" concern, too. Let's suppose (just for example) that you write some absolutely brilliant code which makes it possible to overlay CTID and which has the further advantage of being stunningly obvious, so that we have absolute confidence that it is correct. The obvious question is - if we can overlay CTID in some situations, is there a better use for that than this? Just to throw out something that might be totally impractical, maybe we could get rid of XMAX. If the tuple hasn't been updated, the CTID field stores the XMAX; if it HAS been updated, the CTID points to the successor tuple, whose XMIN is our XMAX. I'm sure there are a bunch of reasons why that doesn't actually work - I can think of some of them myself - but the point is that there's an opportunity cost to stealing those bits. Once they're dedicated to this purpose, they can't ever be dedicated to any other purpose. Space in the tuple header is precious, and I am not at all convinced that we should be willing to surrender any for this. We have to believe not only that this change is good, but also that it's more good than some other purpose to which that bit space could potentially be put. Now obviously, overlaying onto space that's already reserved is better than allocating new space. But it's not free.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Dec21, 2010, at 00:08, Robert Haas wrote:
> My previously expressed concern about (C) wasn't based on ugliness, but rather on my believe that there is likely a whole lot of code which relies on the CTID being a self-link when no UPDATE has occurred. We'd have to be confident that all such cases had been found and fixed, which might not be easy to be confident about.

That's certainly a valid concern, and one that a credible proposal will have to address.

> The obvious question is - if we can overlay CTID in some situations, is there a better use for that than this? Just to throw out something that might be totally impractical, maybe we could get rid of XMAX. If the tuple hasn't been updated, the CTID field stores the XMAX; if it HAS been updated, the CTID points to the successor tuple, whose XMIN is our XMAX. I'm sure there are a bunch of reasons why that doesn't actually work - I can think of some of them myself - but the point is that there's an opportunity cost to stealing those bits. Once they're dedicated to this purpose, they can't ever be dedicated to any other purpose.

Yes, there is an opportunity cost there, I can't argue with that.

> Space in the tuple header is precious, and I am not at all convinced that we should be willing to surrender any for this.

That's a pretty tight space to maneuver in, though. So tight, in fact, that I may as well give up, barring some absolutely genius idea, which I don't even know where to look for at the moment.

Every feature has its price, and if giving up on completely hypothetical future savings is too costly, then surely anything else I might suggest is too :-(

> We have to believe not only that this change is good, but also that it's more good than some other purpose to which that bit space could potentially be put.

Unfortunately, I'm running out of arguments for why this *is* important and *is* worth paying a price. I've been forced to simply give up on making some database-side constraint enforcement 100% waterproof in the past. That bugged me greatly each time, and each time weakened my case when I tried to explain to clients why enforcing such things in the database is a Good Thing. I therefore have a hard time trying to understand why people mostly seem to regard this as a non-issue or merely a nice-to-have.

Regarding the sub-transaction vs. locking issue - I haven't been bitten by this personally, since I tend to avoid using sub-transactions at all. But if I did, I'd feel even stronger about this, since it's clearly a bug.

best regards,
Florian Pflug
On Mon, Dec 20, 2010 at 7:19 PM, Florian Pflug <fgp@phlo.org> wrote: >> Space in the tuple header is >> precious, and I am not at all convinced that we should be willing to >> surrender any for this. > > Thats a pretty tight space to maneuver in, though. So tight, in fact, > that I may as well give up, barring some absolutely genius idea, which > I don't even know where to look for at the moment. > > Every feature has its price, and if giving up on completely hypothetical > future savings is too costly, then surely anything else I might suggest > is too :-( Of the ideas proposed so far, the idea of somehow making use of the existing multi-xid machinery to do this seems the most promising to me. But I haven't yet understood exactly what you're proposing, or fully thought through the performance implications, which is obviously something that needs to happen, and if that doesn't pan out, then, as I said upthread and you said here, yeah, we may need a new idea. It would be useful if some other folks weighed in on this, too. I understand what behavior we're trying to get here and why we want that, but I don't necessarily have the greatest understanding of all the details of the on-disk format. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company