Thread: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
The attached patch implements INSERT...ON DUPLICATE KEY LOCK FOR UPDATE. This is similar to INSERT...ON DUPLICATE KEY IGNORE (which is also proposed as part of this new revision of the patch), but additionally acquires an exclusive row lock on the row that prevents insertion from proceeding in respect of some tuple proposed for insertion. This feature offers something that I believe could be reasonably described as upsert. Consider:

postgres=# create table foo(a int4 primary key, b text);
CREATE TABLE
postgres=# with r as (
  insert into foo(a,b)
  values (5, '!'), (6, '@')
  on duplicate key lock for update
  returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;
UPDATE 0

Here there are 0 rows affected by the update, because all work was done in the insert. If I do it again, 2 rows are affected by the update:

postgres=# with r as (
  insert into foo(a,b)
  values (5, '!'), (6, '@')
  on duplicate key lock for update
  returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;
UPDATE 2

Obviously, rejects were now projected into the wCTE, and the underlying rows were locked. The idea is that we can update the rows, confident that each rejection-causing row will be updated in a race-free fashion.

I personally prefer this to something like MySQL's INSERT...ON DUPLICATE KEY UPDATE, because it's more flexible. For example, we could have deleted the locked rows instead, if that happened to make sense. Making this kind of usage idiomatic feels to me like the Postgres way to do upsert. Others may differ here. I will however concede that it'll be unfortunate not to have some MySQL compatibility, for the benefit of people porting widely used web frameworks.

I'm not really sure if I should have done something brighter here than lock the first duplicate found, or if it's okay that that's all I do. That's another discussion entirely. Though previously Andres and I did cover the question of prioritizing unique indexes, so that the most sensible duplicate for the particular situation was returned, according to some criteria.

As previously covered, I felt that including a row locking component was essential to reasoning about our requirements for what I've termed "speculative insertion" -- the basic implementation of value locking that is needed to make all this work. As I said in that earlier thread, there are many opinions about this, and it isn't obvious which one is right. Any approach needs to have its interactions with row locking considered right up front. Those who consider this a new patch with new functionality, or even a premature expansion on what I've already posted, should carefully consider that. Do we really want to assume that these two things are orthogonal? I think that they're probably not, but even if that happens to turn out to have been not the case, it's an unnecessary risk to take.

Row locking
===========

Row locking is implemented with calls to a new function above ExecInsert. We don't bother with the usual EvalPlanQual looping pattern for now, preferring to just re-check from scratch if there is a concurrent update from another session (see comments in ExecLockHeapTupleForUpdateSpec() for details). We might do better here. I haven't considered the row locking functionality in too much detail since the last revision, preferring to focus on value locking.

Buffer locking/value locking
============================

Andres raised concerns about the previous patch's use of exclusive buffer locks for extended periods (i.e. during a single heap tuple insertion).
These locks served as extended value locks. With this revision, we don't hold exclusive buffer locks for the duration of heap insertion - we hold shared buffer locks instead. I believe that Andres' principal concern was the impact on concurrent index scans by readers, so I think that all of this will go some way towards alleviating his concerns generally.

This necessitated inventing entirely new LWLock semantics around "weakening" (from exclusive to shared) and "strengthening" (from shared to exclusive) of locks already held. Of course, as you'd expect, there are some tricky race hazards surrounding these new functions that clients need to be mindful of. These have been documented within lwlock.c.

I looked for a precedent for these semantics, and found a few. Perhaps the most prominent was Boost, a highly regarded, peer-reviewed C++ library. Boost implements exactly these semantics for some of its thread synchronization/mutex primitives:

http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable

They have a concept of upgradable ownership, which is just like shared ownership, except, I gather, that the owner reserves the exclusive right to upgrade to an exclusive lock (for them it's not quite an exclusive lock; it's an upgradeable/downgradable exclusive lock). My solution is to push that responsibility onto the client - I admonish something along the lines of "don't let more than one shared locker do this at a time per LWLock". I am of course mindful of this caveat in my modifications to the btree code, where I "weaken" and then later "strengthen" an exclusive lock - the trick here is that before I weaken I get a regular exclusive lock, and I only actually weaken after that when going ahead with insertion. I suspect that this may not be the only place where this trick is helpful. This intended usage is described in the relevant comments added to lwlock.c.

Testing
=======

This time around, in order to build confidence in the new LWLock infrastructure for buffer locking, on debug builds we re-verify that the value proposed for insertion on the locked page is in fact not on that page as expected during the second phase, and that our previous insertion point calculation is still considered correct. This is kind of like the way we re-verify the wait-queue-is-in-lsn-order invariant in syncrep.c on debug builds. It's really a fancier assertion - it doesn't just test the state of scalar variables. This was invaluable during development of the new LWLock infrastructure.

Just as before, but this time with just shared buffer locks held during heap tuple insertion, the patch has resisted considerable brute-force efforts to break it (e.g. using pgbench to get many sessions speculatively inserting values into a table. Many different INSERT... ON DUPLICATE KEY LOCK FOR UPDATE statements, interspersed with UPDATE, DELETE and SELECT statements. Seeing if spurious duplicate tuple insertions occur, or deadlocks, or assertion failures). As always, isolation tests are included.

Bugs
====

I fixed the bug that Andres reported in relation to multiple unique indexes' interaction with waits for another transaction's end during speculative insertion. I did not get around to fixing the broken ecpg regression tests, as reported by Peter Eisentraut. I was a little puzzled by the problem there. I'll return to it in a while, or perhaps someone else can propose a solution.

Thoughts?

-- 
Peter Geoghegan
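For illustration, the delete-instead-of-update variant mentioned above might be written like this under the patch's proposed syntax (a sketch only, reusing the foo table from the example; it locks the rejection-causing rows and then deletes them rather than updating them):

with r as (
  insert into foo(a,b)
  values (5, '!'), (6, '@')
  on duplicate key lock for update
  returning rejects *
)
delete from foo using r where foo.a = r.a;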
On Sun, Sep 8, 2013 at 10:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
> This necessitated inventing entirely new LWLock semantics around "weakening" (from exclusive to shared) and "strengthening" (from shared to exclusive) of locks already held. Of course, as you'd expect, there are some tricky race hazards surrounding these new functions that clients need to be mindful of. These have been documented within lwlock.c.

I've since found that I can fairly reliably get this to deadlock at high client counts (say, 95, which will do it on my 4 core laptop with a little patience). To get this to happen, I used pgbench with a single INSERT...ON DUPLICATE KEY IGNORE transaction script. The more varied workload that I tested this most recent revision (v2) with, with a transaction consisting of a mixture of different statements (UPDATEs, DELETEs, INSERT...ON DUPLICATE KEY LOCK FOR UPDATE), did not show the problem.

What I've been doing to recreate this is pgbench runs in an infinite loop from a bash script, with a new table created for each iteration. Each iteration has 95 clients "speculatively insert" a total of 1500 possible tuples for 15 seconds. After this period, the table has exactly 1500 tuples, with primary key values 1 - 1500. Usually, after about 5 - 20 minutes, deadlock occurs. This was never a problem with the exclusive lock coding (v1), unsurprisingly - after all, as far as buffer locks are concerned, it did much the same thing as the existing code.

I've made some adjustments to LWLockWeaken, LWLockStrengthen and LWLockRelease that made the deadlocks go away. Or at least, no deadlocks or other problems manifested themselves using the same test case for over two hours. Attached revision includes these changes, as well as a few minor comment tweaks here and there.

I am working on an analysis of the broader deadlock hazards - the implications of simultaneously holding multiple shared buffer locks (that is, one for every unique index btree leaf page participating in value locking) for the duration of each heap tuple insertion (each heap_insert() call). I'm particularly looking for unexpected ways in which this locking could interact with other parts of the code that also acquire buffer locks, for example vacuumlazy.c. I'll also try and estimate how much of a maintainability burden unexpected locking interactions with these other subsystems might be.

In case it isn't obvious, the deadlocking issue addressed by this revision is not inherent to my design or anything like that - the bugs fixed by this revision are entirely confined to lwlock.c.

-- 
Peter Geoghegan
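For reference, the single-statement pgbench script described here might look something like the following (a sketch only - the foo table from the earlier example stands in for the per-iteration test table, the variable name and fill value are arbitrary, and \setrandom is the pgbench meta-command of that era for picking a key in the 1 - 1500 range):

\setrandom k 1 1500
insert into foo(a, b) values (:k, 'x') on duplicate key ignore;

Run with something like "pgbench -n -c 95 -T 15 -f script.sql" to approximate the 95-client, 15-second iterations described above.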
Hi Peter,

Nice to see the next version, won't have time to look at it in any detail in the next few days tho.

On 2013-09-10 22:25:34 -0700, Peter Geoghegan wrote:
> I am working on an analysis of the broader deadlock hazards - the implications of simultaneously holding multiple shared buffer locks (that is, one for every unique index btree leaf page participating in value locking) for the duration of each heap tuple insertion (each heap_insert() call). I'm particularly looking for unexpected ways in which this locking could interact with other parts of the code that also acquire buffer locks, for example vacuumlazy.c. I'll also try and estimate how much of a maintainability burden unexpected locking interactions with these other subsystems might be.

I think for this approach to be workable you also need to explain how we can deal with stuff like toast insertion that may need to write hundreds of megabytes all the while leaving an entire value-range of the unique key share locked.

I still think that even doing a plain heap insertion is longer than acceptable to hold even a share lock over a btree page, but as long as stuff like toast insertions can happen while doing so, that's peanuts by comparison. The easiest answer is doing the toasting before doing the index locking, but that will result in bloat, the avoidance of which seems to be the primary advantage of your approach.

Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Sep 11, 2013 at 2:28 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Nice to see the next version, won't have time to look in any details in > the next few days tho. Thanks Andres! > I think for this approach to be workable you also need to explain how we > can deal with stuff like toast insertion that may need to write hundreds > of megabytes all the while leaving an entire value-range of the unique > key share locked. Right. That is a question that needs to be addressed in a future revision. > I still think that even doing a plain heap insertion is longer than > acceptable to hold even a share lock over a btree page Well, there is really only one way of judging something like that, and that's to do a benchmark. I still haven't taken the time to "pick the low hanging fruit" here that I'd mentioned - there are some fairly obvious ways to shorten the window in which value locks are held. Furthermore, I'm sort of at a loss as to what a fair benchmark would look like - what is actually representative here? Also, what's the baseline? It's not as if someone has an alternative, competing patch. We can only hypothesize what additional costs those other approaches introduce, unless someone has a suggestion as to how they can be simulated without writing the full patch, which is something I'd entertain. As I've already pointed out, all page splits occur with the same buffer exclusive lock held. Only, in our case, we're weakening that lock to a shared lock. So I don't think that the heap insertion is going to be that big of a deal, particularly in the average case. Having said that, it's a question that surely must be closely examined before proceeding much further. And yes, the worst case could be pretty bad, and that surely matters too. > The easiest answer is doing the toasting before doing the index locking, > but that will result in bloat, the avoidance of which seems to be the > primary advantage of your approach. I would say that the primary advantage of my approach is that it's much simpler than any other approach that has been considered by others in the past. The approach is easier to reason about because it's really just an extension of how btrees already do value locking. Granted, I haven't adequately demonstrated that things really are so rosy, but I think I'll be able to. The key point is that with trivial exception, all other parts of the code, like VACUUM, don't consider themselves to directly have license to acquire locks on btree buffers - they go through the AM interface instead. What do they know about what makes sense for a particular AM? The surface area actually turns out to be fairly manageable. With the promise tuple approach, it's more the maintainability overhead of new *classes* of bloat that I'm concerned about than the bloat itself, and all the edge cases that are likely to be introduced. But yes, the overhead of doing all that extra writing (including WAL-logging twice), and the fact that it all has to happen with an exclusive lock on the leaf page buffer is also a concern of mine. With v3 of my patch, we still only have to do all the preliminary work like finding the right page and verifying that there are no duplicates once. So with recent revisions, the amount of time spent exclusive locking with my proposed approach is now approximately half the time of alternative proposals (assuming no page split is necessary). 
In the worst case, the number of values locked on the leaf page is quite localized and manageable, as a natural consequence of the fact that it's a btree leaf page. I haven't run any numbers, but for an int4 btree (which really is the worst case here), 200 or so read-locked values would be quite close to as bad as things got. Plus, if there isn't a second phase of locking, which is on average a strong possibility, those locks would be hardly held at all - contrast that with having to do lots of exclusive locking for all that clean-up. I might experiment with weakening the exclusive lock even earlier in my next revision, and/or strengthening later. Off hand, I can't see a reason for not weakening after we find the first leaf page that the key might be on (granted, I haven't thought about it that much) - _bt_check_unique() does not have license to alter the buffer already proposed for insertion. Come to think of it, all of this new buffer lock weakening/strengthening stuff might independently justify itself as an optimization to regular btree index tuple insertion. That's a whole other patch, though -- it's a big ambition to have as a sort of incidental adjunct to what is already a big, complex patch. In practice the vast majority of insertions don't involve TOASTing. That's not an excuse for allowing the worst case to be really bad in terms of its impact on query response time, but it may well justify having whatever ameliorating measures we take result in bloat. It's at least the kind of bloat we're more or less used to dealing with, and have already invested a lot in controlling. Plus bloat-wise it can't be any worse than just inserting the tuple and having the transaction abort on a duplicate, since that already happens after toasting has done its work with regular insertion. -- Peter Geoghegan
On Wed, Sep 11, 2013 at 8:47 PM, Peter Geoghegan <pg@heroku.com> wrote:
> In practice the vast majority of insertions don't involve TOASTing. That's not an excuse for allowing the worst case to be really bad in terms of its impact on query response time, but it may well justify having whatever ameliorating measures we take result in bloat. It's at least the kind of bloat we're more or less used to dealing with, and have already invested a lot in controlling. Plus bloat-wise it can't be any worse than just inserting the tuple and having the transaction abort on a duplicate, since that already happens after toasting has done its work with regular insertion.

Andres is being very polite here, but the reality is that this approach has zero chance of being accepted. You can't hold buffer locks for a long period of time across complex operations. Full stop. It's a violation of the rules that are clearly documented in src/backend/storage/buffer/README, which have been in place for a very long time, and this patch is nowhere near important enough to warrant a revision of those rules. We are not going to risk breaking every bit of code anywhere in the backend or in third-party code that takes a buffer lock. You are never going to convince me, or Tom, that the benefit of doing that is worth the risk; in fact, I have a hard time believing that you'll find ANY committer who thinks this approach is worth considering.

Even if you get the code to run without apparent deadlocks, that doesn't mean there aren't any; it just means that you haven't found them all yet. And even if you managed to squash every such hazard that exists today, so what? Fundamentally, locking protocols that don't include deadlock detection don't scale. You can use such locks in limited contexts where proofs of correctness are straightforward, but trying to stretch them beyond that point results not only in bugs, but also in bad performance and unmaintainable code. With a much more complex locking regimen, even if your code is absolutely bug-free, you've put a burden on the next guy who wants to change anything; how will he avoid breaking things? Our buffer locking regimen suffers from painful complexity and serious maintenance difficulties as is.

Moreover, we've already got performance and scalability problems that are attributable to every backend in the system piling up waiting on a single lwlock, or a group of simultaneously-held lwlocks. Dramatically broadening the scope of where lwlocks are used and for how long they're held is going to make that a whole lot worse. What's worse, the problems will be subtle, restricted to the people using this feature, and very difficult to measure on production systems, and I have no confidence they'd ever get fixed.

A further problem is that a backend which holds even one lwlock can't be interrupted. We've had this argument before and it seems that you don't think that non-interruptibility is a problem, but it is project policy to allow for timely interrupts in all parts of the backend and we're not going to change that policy for this patch.

Heavyweight locks are heavyweight precisely because they provide services - like deadlock detection, satisfactory interrupt handling, and, also importantly, FIFO queuing behavior - that are *important* for locks that are held over an extended period of time. We're not going to go put those services into the lightweight lock mechanism because then it would no longer be lightweight, and we're not going to ignore the importance of them, either.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 13, 2013 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Andres is being very polite here, but the reality is that this approach has zero chance of being accepted.

I quite like Andres, but I have yet to see him behave as you describe in a situation where someone proposed what was fundamentally a bad idea. Maybe you should let him speak for himself?

> You can't hold buffer locks for a long period of time across complex operations. Full stop. It's a violation of the rules that are clearly documented in src/backend/storage/buffer/README, which have been in place for a very long time, and this patch is nowhere near important enough to warrant a revision of those rules.

The importance of this patch is a value judgement. Our users have been screaming for this for over ten years, so to my mind it has a fairly high importance. Also, every other database system of every stripe worth mentioning has something approximately equivalent to this, including ones with much less functionality generally. The fact that we don't is a really unfortunate omission.

As to the rules you refer to, you must mean "These locks are intended to be short-term: they should not be held for long". I don't think that they will ever be held for long. At least, not once I've better managed the amount of work that a heap_insert() can do. I expect to produce a revision where toasting doesn't happen with the locks held soon. Actually, I've already written the code, I just need to do some testing.

> We are not going to risk breaking every bit of code anywhere in the backend or in third-party code that takes a buffer lock. You are never going to convince me, or Tom, that the benefit of doing that is worth the risk; in fact, I have a hard time believing that you'll find ANY committer who thinks this approach is worth considering.

I would suggest letting those other individuals speak for themselves too. Particularly if you're going to name someone who is on vacation like that.

> Even if you get the code to run without apparent deadlocks, that doesn't mean there aren't any;

Of course it doesn't. Who said otherwise?

> Our buffer locking regimen suffers from painful complexity and serious maintenance difficulties as is.

That's true to a point, but it has more to do with things like how VACUUM interacts with hio.c. Things like this:

    /*
     * Release the file-extension lock; it's now OK for someone else to extend
     * the relation some more.  Note that we cannot release this lock before
     * we have buffer lock on the new page, or we risk a race condition
     * against vacuumlazy.c --- see comments therein.
     */
    if (needLock)
        UnlockRelationForExtension(relation, ExclusiveLock);

The btree code is different, though: It implements a well-defined interface, with much clearer separation of concerns. As I've said already, with trivial exception (think contrib), no external code considers itself to have license to obtain locks of any sort on btree buffers. No external code of ours - without exception - does anything with multiple locks, or exclusive locks on btree buffers. I'll remind you that I'm only holding shared locks when control is outside of the btree code. Even within the btree code, the number of access method functions that could conflict with what I do here (that acquire exclusive locks) is very small when you exclude things that only exclusive lock the meta-page (there are also very few of those). So the surface area is quite small.
I'm not denying that there is a cost, or that I haven't expanded things in a direction I'd prefer not to. I just think that it may well be worth it, particularly when you consider the alternatives - this may well be the least worst thing. I mean, if we do the promise tuple thing, and there are multiple unique indexes, what happens when an inserter needs to block pending the outcome of another transaction? They had better go clean up the promise tuples from the other unique indexes that they're trying to insert into, because they cannot afford to hold value locks for a long time, no matter how they're implemented. That could take much longer than just releasing a shared buffer lock, since for each unique index the promise tuple must be re-found from scratch. There are huge issues with additional complexity and bloat. Oh, and now your lightweight locks aren't so lightweight any more. If the value locks were made interruptible through some method, such as the promise tuples approach, does that really make deadlocking acceptable? So at least your system didn't seize up. But on the other hand, the user randomly had a deadlock error through no fault of their own. The former may be worse, but the latter is also inexcusable. In general, the best solution is just to not have deadlock hazards. I wouldn't be surprised if reasoning about deadlocking was harder with that alternative approach to value locking, not easier. > Moreover, we've already got performance and scalability problems that > are attributable to every backend in the system piling up waiting on a > single lwlock, or a group of simultaneously-held lwlocks. > Dramatically broadening the scope of where lwlocks are used and for > how long they're held is going to make that a whole lot worse. You can hardly compare a buffer's LWLock with a system one that protects critical shared memory structures. We're talking about a shared lock on a single btree leaf page per unique index involved in upserting. > A further problem is that a backend which holds even one lwlock can't > be interrupted. We've had this argument before and it seems that you > don't think that non-interruptibility is a problem, but it project > policy to allow for timely interrupts in all parts of the backend and > we're not going to change that policy for this patch. I don't think non-interruptibility is a problem? Really, do you think that this kind of inflammatory rhetoric helps anybody? I said nothing of the sort. I recall saying something about an engineering trade-off. Of course I value interruptibility. If you're concerned about non-interruptibility, consider XLogFlush(). That does rather a lot of work with WALWriteLock exclusive locked. On a busy system, some backend is very frequently going to experience a non-interruptible wait for the duration of however long it takes to write and flush perhaps a whole segment. All other flushing backends are stuck in non-interruptible waits waiting for that backend to finish. I think that the group commit stuff might have regressed worst-case interruptibility for flushers by quite a bit; should we have never committed that, or do you agree with my view that it's worth it? In contrast, what I've proposed here is in general quite unlikely to result in any I/O for the duration of the time the locks are held. Only writers will be blocked. And only those inserting into a narrow range of values around the btree leaf page. 
Much of the work that even those writers need to do will be unimpeded anyway; they'll just block on attempting to acquire an exclusive lock on the first btree leaf page that the value they're inserting could be on. And the additional non-interruptible wait of those inserters won't be terribly much more than the wait of the backend where heap tuple insertion takes a long time anyway - that guy already has to do close to 100% of that work with a non-interruptible wait today (once we eliminate heap_prepare_insert() and toasting). The UnlockReleaseBuffer() call is right at the end of heap_insert, and the buffer is pinned and locked very close to the start. -- Peter Geoghegan
* Peter Geoghegan (pg@heroku.com) wrote:
> I would suggest letting those other individuals speak for themselves too. Particularly if you're going to name someone who is on vacation like that.

It was my first concern regarding this patch.

Thanks,

Stephen
On Fri, Sep 13, 2013 at 12:14 PM, Stephen Frost <sfrost@snowman.net> wrote:
> It was my first concern regarding this patch.

It was my first concern too.

-- 
Peter Geoghegan
On 2013-09-13 11:59:54 -0700, Peter Geoghegan wrote:
> On Fri, Sep 13, 2013 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Andres is being very polite here, but the reality is that this approach has zero chance of being accepted.
>
> I quite like Andres, but I have yet to see him behave as you describe in a situation where someone proposed what was fundamentally a bad idea. Maybe you should let him speak for himself?

Unfortunately I have to agree with Robert here, I think it's a complete no-go to do what you propose so far and I've several times now presented arguments why I think so. The reason I wasn't saying "this will never get accepted" are twofold: a) I don't want to stifle alternative ideas to the "promises" idea, just because I think it's the way to go. That might stop a better idea from being articulated. b) I am not actually in the position to say it's not going to be accepted.

*I* think that unless you make some fundamental and very, very clever modifications to your algorithm that end up *not holding a lock over other operations at all*, it's not going to get committed. And I'll chip in with my -1. And clever modification doesn't mean slightly restructuring heapam.c's operations.

> The importance of this patch is a value judgement. Our users have been screaming for this for over ten years, so to my mind it has a fairly high importance. Also, every other database system of every stripe worth mentioning has something approximately equivalent to this, including ones with much less functionality generally. The fact that we don't is a really unfortunate omission.

I agree it's quite important but that doesn't mean we have to do stuff that we think is unacceptable, especially as there *are* other ways to do it.

> As to the rules you refer to, you must mean "These locks are intended to be short-term: they should not be held for long". I don't think that they will ever be held for long. At least, not once I've better managed the amount of work that a heap_insert() can do. I expect to produce a revision where toasting doesn't happen with the locks held soon. Actually, I've already written the code, I just need to do some testing.

I personally think - and have stated so before - that doing a heap_insert() while holding the btree lock is unacceptable.

> The btree code is different, though: It implements a well-defined interface, with much clearer separation of concerns.

Which you're completely violating by linking the btree buffer locking with the heap locking. It's not about the btree code alone.

At this point I am a bit confused why you are asking for review.

> I mean, if we do the promise tuple thing, and there are multiple unique indexes, what happens when an inserter needs to block pending the outcome of another transaction? They had better go clean up the promise tuples from the other unique indexes that they're trying to insert into, because they cannot afford to hold value locks for a long time, no matter how they're implemented.

Why? We're using normal transaction visibility rules here. We don't stop *other* values on the same index getting updated or similar.

And anyway. It doesn't matter which problem the "promises" idea has. We're discussing your proposal here.

Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 13, 2013 at 12:23 PM, Andres Freund <andres@2ndquadrant.com> wrote: > The reason I wasn't saying "this will never get accepted" are twofold: > a) I don't want to stiffle alternative ideas to the "promises" idea, > just because I think it's the way to go. That might stop a better idea > from being articulated. b) I am not actually in the position to say it's > not going to be accepted. Well, the reality is that the promises idea hasn't been described in remotely enough detail to compare it to what I have here. I've pointed out plenty of problems with it. After all, it was the first thing that I considered, and I'm on the record talking about it in the 2012 dev meeting. I didn't take that approach for many good reasons. The reason I ended up here is not because I didn't get the memo about holding buffer locks across complex operations being a bad thing. At least grant me that. I'm here because in all these years no one has come up with a suggestion that doesn't have some very major downsides. Like, even worse than this. >> As to the rules you refer to, you must mean "These locks are intended >> to be short-term: they should not be held for long". I don't think >> that they will ever be held for long. At least, when I've managed the >> amount of work that a heap_insert() can do better. I expect to produce >> a revision where toasting doesn't happen with the locks held soon. >> Actually, I've already written the code, I just need to do some >> testing. > > I personally think - and have stated so before - that doing a > heap_insert() while holding the btree lock is unacceptable. Presumably your reason is essentially that we exclusive lock a heap buffer (exactly one heap buffer) while holding shared locks on btree index buffers. Is that really so different to holding an exclusive lock on a btree buffer while holding a shared lock on a heap buffer? Because that's what _bt_check_unique() does today. Now, I'll grant you that there is one appreciable difference, which is that multiple unique indexes may be involved. But limiting ourselves to the primary key or something like that remains an option. And I'm not sure that it's really any worse anyway. >> The btree code is different, though: It implements a well-defined >> interface, with much clearer separation of concerns. > > Which you're completely violating by linking the btree buffer locking > with the heap locking. It's not about the btree code alone. You're right that it isn't about just the btree code. In order for a deadlock to occur, there must be a mutual dependency. What code could feel entitled to hold buffer locks on btree buffers and heap buffers at the same time except the btree code itself? It already does so. But no one else does the same thing. If anyone did anything with a heap buffer lock held that could result in a call into one of the btree access method functions (I'm not contemplating the possibility of this other code locking the btree buffer *directly*), I'm quite sure that that would be rejected outright today, because that causes deadlocks. Certainly, vacuumlazy.c doesn't do it, for example. Why would anyone ever want to do that anyway? I cannot think of any reason. I suppose that that does still leave "transitive dependencies", but now you're stretching things. After all, you're not supposed to hold buffer locks for long! The dependency would have to transit through, say, one of the system LWLocks used for WAL Logging. 
Seems pretty close to impossible that it'd be an issue - index stuff is only WAL-logged as index tuples are inserted (that is, as the locks are finally released). Everyone automatically does that kind of thing in a consistent order of locking, unlocking in the opposite order anyway.

But what of the btree code deadlocking with itself? There are only a few functions (2 or 3) where that's possible even in principle. I think that they're going to be not too hard to analyze. For example, with insertion, the trick is to always lock in a consistent order and unlock/insert in the opposite order. The heap shared lock(s) needed in the btree code cannot deadlock with another upserter because once the other upserter has that exclusive heap buffer lock, it's *inevitable* that it will release all of its shared buffer locks. Granted, I could stand to think about this case more, but you get the idea - it *is* possible to clamp down on the code that needs to care about this stuff to a large degree. It's subtle, but btrees are generally considered pretty complicated, and the btree code already cares about some odd cases like these (it takes special precautions for catalog indexes, for example).

The really weird thing about my patch is that the btree code trusts the executor to call the heapam code to do the right thing in the right way - it now knows more than I'd prefer. Would you be happier if the btree code took more direct responsibility for the heap tuple insertion instead? Before you say "that's ridiculous", consider the big modularity violation that has always existed. It may be no more ridiculous than that. And that existing state of affairs may be no less ridiculous than living with what I've already done.

> At this point I am a bit confused why you are asking for review.

I am asking for us, collectively, through consensus, to resolve the basic approach to doing this. That was something I stated right up front, pointing out details of where the discussion had gone in the past. That was my explicit goal. There has been plenty of discussing on this down through the years, but nothing ever came from it.

Why is this an intractable problem for over a decade for us alone? Why isn't this a problem for other database systems? I'm not implying that it's because they do this. It's something that I am earnestly interested in, though. A number of people have asked me that, and I don't have a good answer for them.

>> I mean, if we do the promise tuple thing, and there are multiple unique indexes, what happens when an inserter needs to block pending the outcome of another transaction? They had better go clean up the promise tuples from the other unique indexes that they're trying to insert into, because they cannot afford to hold value locks for a long time, no matter how they're implemented.
>
> Why? We're using normal transaction visibility rules here. We don't stop *other* values on the same index getting updated or similar.

Because you're locking a value in some other, earlier unique index, all the while waiting *indefinitely* on some other value in a second or subsequent one. That isn't acceptable. A bunch of backends would back up just because one backend had this contention on the second unique index value that the others didn't actually have themselves. My design allows those other backends to immediately go through and finish.

Value locks have these kinds of hazards no matter how you implement them.
Deadlocks, and unreasonable stalling as described here, are always unacceptable - whether or not the problems are detected at runtime is ultimately of marginal interest. Either way, it's a bug.

I think that the details of how this approach compares to others are totally pertinent. For me, that's the whole point - getting towards something that will balance all of these concerns and be acceptable. Yes, it's entirely possible that that could look quite different to what I have here. I do not want to reduce all this to a question of "is this one design acceptable or not?". Am I not allowed to propose a design to drive discussion? That's how the most important features get implemented around here.

-- 
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> wrote:
> we exclusive lock a heap buffer (exactly one heap buffer) while holding shared locks on btree index buffers. Is that really so different to holding an exclusive lock on a btree buffer while holding a shared lock on a heap buffer? Because that's what _bt_check_unique() does today.

Is it possible to get a deadlock doing only one of those two things? Is it possible to avoid a deadlock doing both of them?

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 13, 2013 at 2:59 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I would suggest letting those other individuals speak for themselves too. Particularly if you're going to name someone who is on vacation like that.

You seem to be under the impression that I'm mentioning Tom's name, or Andres's, because I need to win some kind of an argument. I don't. We're not going to accept a patch that uses lwlocks in the way that you are proposing.

> I mean, if we do the promise tuple thing, and there are multiple unique indexes, what happens when an inserter needs to block pending the outcome of another transaction? They had better go clean up the promise tuples from the other unique indexes that they're trying to insert into, because they cannot afford to hold value locks for a long time, no matter how they're implemented.

As Andres already pointed out, this is not correct. Just to add to what he said, we already have long-lasting value locks in the form of SIREAD locks. SIREAD locks can exist at different levels of granularity, but one of those levels is index-page-level granularity, where they have the function of guarding against concurrent insertions of values that would fall within that page, which just so happens to be the same thing you want to do here. The difference between those locks and what you're proposing here is that they are implemented differently. That is why those were acceptable and this is not.

> That could take much longer than just releasing a shared buffer lock, since for each unique index the promise tuple must be re-found from scratch. There are huge issues with additional complexity and bloat. Oh, and now your lightweight locks aren't so lightweight any more.

Yep, totally agreed. If you simply lock the buffer, or take some other action which freezes out all concurrent modifications to the page, then re-finding the lock is much simpler. On the other hand, it's much simpler precisely because you've reduced concurrency to the degree necessary to make it simple. And reducing concurrency is bad. Similarly, complexity and bloat are not great things taken in isolation, but many of our existing locking schemes are already very complex. Tuple locks result in a complex jig that involves locking the tuple via the heavyweight lock manager, performing a WAL-logged modification to the page, and then releasing the lock in the heavyweight lock manager. As here, that is way more expensive than simply grabbing and holding a share-lock on the page. But we get a number of important benefits out of it. The backend remains interruptible while the tuple is locked, the protocol for granting locks is FIFO to prevent starvation, we don't suppress page eviction while the lock is held, we can simultaneously lock arbitrarily large numbers of tuples, and deadlocks are detected and handled cleanly. If those requirements were negotiable, we would surely have negotiated them away already, because the performance benefits would be immense.

> If the value locks were made interruptible through some method, such as the promise tuples approach, does that really make deadlocking acceptable?

Yes. It's not possible to prevent all deadlocks. It IS possible to make sure that they are properly detected and that precisely one of the transactions involved is rolled back to resolve the deadlock.

> You can hardly compare a buffer's LWLock with a system one that protects critical shared memory structures. We're talking about a shared lock on a single btree leaf page per unique index involved in upserting.

Actually, I can and I am. Buffers ARE critical shared memory structures.

>> A further problem is that a backend which holds even one lwlock can't be interrupted. We've had this argument before and it seems that you don't think that non-interruptibility is a problem, but it is project policy to allow for timely interrupts in all parts of the backend and we're not going to change that policy for this patch.
>
> I don't think non-interruptibility is a problem? Really, do you think that this kind of inflammatory rhetoric helps anybody? I said nothing of the sort. I recall saying something about an engineering trade-off. Of course I value interruptibility.

I don't see what's inflammatory about that statement. The point is that this isn't the first time you've proposed a change which would harm interruptibility and it isn't the first time I've objected on precisely that basis. Interruptibility is not a nice-to-have that we can trade away from time to time; it's essential and non-negotiable.

> If you're concerned about non-interruptibility, consider XLogFlush(). That does rather a lot of work with WALWriteLock exclusive locked. On a busy system, some backend is very frequently going to experience a non-interruptible wait for the duration of however long it takes to write and flush perhaps a whole segment. All other flushing backends are stuck in non-interruptible waits waiting for that backend to finish. I think that the group commit stuff might have regressed worst-case interruptibility for flushers by quite a bit; should we have never committed that, or do you agree with my view that it's worth it?

It wouldn't take a lot to convince me that it wasn't worth it, because I was never all that excited about that patch to begin with. I think it mostly helps in extremely artificial situations that are not likely to occur on real systems anyway. But, yeah, WALWriteLock is a problem, no doubt about it. We should try to make the number of such problems go down, not up, even if it means passing up new features that we'd really like to have.

> In contrast, what I've proposed here is in general quite unlikely to result in any I/O for the duration of the time the locks are held. Only writers will be blocked. And only those inserting into a narrow range of values around the btree leaf page. Much of the work that even those writers need to do will be unimpeded anyway; they'll just block on attempting to acquire an exclusive lock on the first btree leaf page that the value they're inserting could be on.

Sure, but you're talking about broadening the problem from the guy performing the insert to everybody who might be trying to do an insert that hits one of the same unique-index pages. Instead of holding one buffer lock, the guy performing the insert is now holding as many buffer locks as there are indexes. That's a non-trivial issue. For that matter, if the table has more than MAX_SIMUL_LWLOCKS indexes, you'll error out. In fact, if you get the number of indexes exactly right, you'll exceed MAX_SIMUL_LWLOCKS in visibilitymap_clear() and panic the whole system. Oh, and if different backends load the index list in different orders, because say the system catalog gets vacuumed between their respective relcache loads, then they may try to lock the indexes in different orders and cause an undetected deadlock.
And, drifting a bit further off-topic, even to get as far as you have, you've added overhead to every lwlock acquisition and release, even for people who never use this functionality. I'm pretty skeptical about anything that involves adding additional frammishes to the lwlock mechanism. There are a few new primitives I'd like, too, but every one we add slows things down for everybody.

> And the additional non-interruptible wait of those inserters won't be terribly much more than the wait of the backend where heap tuple insertion takes a long time anyway - that guy already has to do close to 100% of that work with a non-interruptible wait today (once we eliminate heap_prepare_insert() and toasting). The UnlockReleaseBuffer() call is right at the end of heap_insert, and the buffer is pinned and locked very close to the start.

That's true but somewhat misleading. Textually most of the function holds the buffer lock, but heap_prepare_insert(), CheckForSerializableConflictIn(), RelationGetBufferForTuple(), and XLogWrite() are the parts that do substantial amounts of computation, and only the last of those happens while holding the buffer lock. And that last is really fundamental, because we can't let any other backend see the modified buffer until we've xlog'd the changes. The problems you're proposing to create do not fall into the same category.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I haven't read the patch and the btree code is an area I really don't know, so take this for what it's worth....

It seems to me that the nature of the problem is that there will unavoidably be a nexus between the two parts of the code here. We can try to isolate it as much as possible but we're going to need a bit of a compromise.

I'm imagining a function that takes two target heap buffers and a btree key. It would descend the btree and, holding the leaf page lock, do a try_lock on the heap pages. If it fails to get the locks then it releases whatever it got and returns for the heap update to find new pages and try again.

This still leaves the potential problem with page splits and I assume it would still be tricky to call it without unsatisfactorily mixing executor and btree code. But that's as far as I got.

-- 
greg
On 2013-09-14 09:57:43 +0100, Greg Stark wrote:
> It seems to me that the nature of the problem is that there will unavoidably be a nexus between the two parts of the code here. We can try to isolate it as much as possible but we're going to need a bit of a compromise.

I think Robert's and my point is that there are several ways to approach this without doing that.

Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-09-13 14:41:46 -0700, Peter Geoghegan wrote:
> On Fri, Sep 13, 2013 at 12:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > The reason I wasn't saying "this will never get accepted" are twofold: a) I don't want to stifle alternative ideas to the "promises" idea, just because I think it's the way to go. That might stop a better idea from being articulated. b) I am not actually in the position to say it's not going to be accepted.
>
> Well, the reality is that the promises idea hasn't been described in remotely enough detail to compare it to what I have here. I've pointed out plenty of problems with it.

Even if you disagree, I still think that doesn't matter in the very least. You say:

> I think that the details of how this approach compares to others are totally pertinent. For me, that's the whole point - getting towards something that will balance all of these concerns and be acceptable.

Well, the two other people involved in the discussion so far have gone on the record saying that the presented approach is not acceptable to them. And you haven't started reacting to that.

> Yes, it's entirely possible that that could look quite different to what I have here. I do not want to reduce all this to a question of "is this one design acceptable or not?".

But the way you're discussing it so far is exactly reducing it that way. If you want the discussion to be about *how* we can implement it so that the various concerns are addressed: fsck*ing great. I am with you there. In the end, even though I have my usual strong opinions about which is the best way, I don't care which algorithm gets pursued further. At least, if, and only if, it has a fighting chance of getting committed. Which this doesn't.

> After all, it was the first thing that I considered, and I'm on the record talking about it in the 2012 dev meeting. I didn't take that approach for many good reasons.

Well, I wasn't there when you said that ;)

> The reason I ended up here is not because I didn't get the memo about holding buffer locks across complex operations being a bad thing. At least grant me that. I'm here because in all these years no one has come up with a suggestion that doesn't have some very major downsides. Like, even worse than this.

I think you're massively, massively, massively overstating the dangers of bloat here. It's a known problem that's *NOT* made worse by any of the other proposals if you compare it with the loop/lock/catch implementation of upsert that we have today as the only option. And we *DO* have infrastructure to deal with bloat, even if it could use some improvement. We *don't* have infrastructure to deal with deadlocks on lwlocks. And we're not going to get that infrastructure, because it would even further remove the "lw" part of lwlocks.

>> As to the rules you refer to, you must mean "These locks are intended to be short-term: they should not be held for long". I don't think that they will ever be held for long. At least, not once I've better managed the amount of work that a heap_insert() can do. I expect to produce a revision where toasting doesn't happen with the locks held soon. Actually, I've already written the code, I just need to do some testing.
>
> I personally think - and have stated so before - that doing a heap_insert() while holding the btree lock is unacceptable.
>
> Presumably your reason is essentially that we exclusive lock a heap buffer (exactly one heap buffer) while holding shared locks on btree index buffers.

It's that it interleaves an already complex but local locking scheme that required several years to become correct with another that is just the same. That's an utterly horrid idea.

> Is that really so different to holding an exclusive lock on a btree buffer while holding a shared lock on a heap buffer? Because that's what _bt_check_unique() does today.

Yes, it is different. But, in my opinion, _bt_check_unique() doing so is a bug that needs fixing. Not something that we want to extend. (Note that _bt_check_unique() already needs to deal with the fact that it reads an unlocked page, because it moves right in some cases)

And, as you say:

> Now, I'll grant you that there is one appreciable difference, which is that multiple unique indexes may be involved. But limiting ourselves to the primary key or something like that remains an option. And I'm not sure that it's really any worse anyway.

I don't think that's an acceptable limitation. If it were something we could lift in a release or two, maybe, but that's not what you're talking about.

> > At this point I am a bit confused why you are asking for review.
>
> I am asking for us, collectively, through consensus, to resolve the basic approach to doing this. That was something I stated right up front, pointing out details of where the discussion had gone in the past. That was my explicit goal. There has been plenty of discussing on this down through the years, but nothing ever came from it.

At the moment ISTM you're not conceding on *ANY* points. That's not very often the way to find consensus.

> Why is this an intractable problem for over a decade for us alone? Why isn't this a problem for other database systems? I'm not implying that it's because they do this. It's something that I am earnestly interested in, though. A number of people have asked me that, and I don't have a good answer for them.

Afaik all those go the route of bloat, don't they? Also, at least in the past, mysql had a long list of caveats around it...

> >> I mean, if we do the promise tuple thing, and there are multiple unique indexes, what happens when an inserter needs to block pending the outcome of another transaction? They had better go clean up the promise tuples from the other unique indexes that they're trying to insert into, because they cannot afford to hold value locks for a long time, no matter how they're implemented.
> >
> > Why? We're using normal transaction visibility rules here. We don't stop *other* values on the same index getting updated or similar.
>
> Because you're locking a value in some other, earlier unique index, all the while waiting *indefinitely* on some other value in a second or subsequent one. That isn't acceptable. A bunch of backends would back up just because one backend had this contention on the second unique index value that the others didn't actually have themselves. My design allows those other backends to immediately go through and finish.

That argument doesn't make sense to me. You're inserting a unique value. It completely makes sense that you can only insert one of them. If it's unclear whether you can insert, you're going to have to wait. That's why they are UNIQUE after all. You're describing a complete non-advantage here. It's also how unique indexes already work.
Also note that waits on xids are properly supervised by deadlock detection.

Even if it had an advantage, not blocking *for the single unique key alone* opens you to issues of livelocks where several backends retry because of each other indefinitely.

> Value locks have these kinds of hazards no matter how you implement them. Deadlocks, and unreasonable stalling as described here, are always unacceptable - whether or not the problems are detected at runtime is ultimately of marginal interest. Either way, it's a bug.

Whether postgres locks down in a way that can only be resolved by kill -9, or whether it aborts a transaction - that's, like, a couple of orders of magnitude of difference.

Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
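For context, the "loop/lock/catch implementation of upsert that we have today" referred to above is essentially the subtransaction retry loop; a sketch in PL/pgSQL, using the foo table from the earlier examples (the function and parameter names are illustrative):

create or replace function upsert_foo(k int4, v text) returns void as
$$
begin
    loop
        -- first try to update an existing row
        update foo set b = v where a = k;
        if found then
            return;
        end if;
        -- no row there: try to insert; if someone else inserted the
        -- same key concurrently, trap the error and retry the update
        begin
            insert into foo(a, b) values (k, v);
            return;
        exception when unique_violation then
            -- do nothing, loop around and try the update again
        end;
    end loop;
end;
$$ language plpgsql;

Each failed insert attempt aborts a subtransaction, which is where the bloat that the thread keeps coming back to enters the picture.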
On Sat, Sep 14, 2013 at 12:22 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I mean, if we do the promise tuple >> thing, and there are multiple unique indexes, what happens when an >> inserter needs to block pending the outcome of another transaction? >> They had better go clean up the promise tuples from the other unique >> indexes that they're trying to insert into, because they cannot afford >> to hold value locks for a long time, no matter how they're >> implemented. > > As Andres already pointed out, this is not correct. While it's true that not doing this wouldn't be incorrect, it certainly would be useful for preventing deadlocks and unnecessary contention. In a world where people expect either an insert or an update, we ought to try to reduce contention across multiple unique indexes. I can understand why that doesn't matter today, though - if you're going to insert duplicates indifferent to whether or not there will be conflicts, that's a kind of abuse, and not worth optimizing for - it seems probable that most transactions will commit. However, it seems much less probable that most upserters will insert. People may well wish to upsert all the time where an insert is hardly ever necessary, which is one reason why I have doubts about other proposals. Note that today there is no guarantee that the original waiter for a duplicate-inserting xact to complete will be the first one to get a second chance, so I think it's hard to question this on correctness grounds. Even if they are released in FIFO order, there is no reason to assume that the first waiter will win the race with a second. Most obviously, the second waiter may not even ever get the chance to block on the same xid at all (so it's not really a waiter at all) and still be able to insert, if the blocking-xact aborts after the second "waiter" starts its descent but before it checks uniqueness. All this, even though the second "waiter" arrived maybe minutes after the first. What I'm talking about here is really unlikely to result in lock starvation, because the original waiter typically gets to observe the other waiter go through, and that's reason enough to give up entirely. Now, it's kind of weird that the original waiter will still end up blocking on the xid that caused it to wait in the first instance. So there should be more thought put into that, like remembering the xid and only waiting on it on a retry, or some similar scheme. Maybe you could contrive a scenario where this causes lock starvation, but I suspect you could do the same thing for the present btree insertion code. > Just to add to > what he said, we already have long-lasting value locks in the form of > SIREAD locks. SIREAD can exist at different levels of granularity, but > one of those levels is index-page-level granularity, where they have > the function of guarding against concurrent insertions of values that > would fall within that page, which just so happens to be the same > thing you want to do here. The difference between those locks and > what you're proposing here is that they are implemented differently. > That is why those were acceptable and this is not. As the implementer of this patch, I'm obligated to put some checks in unique index insertion that everyone has to care about. There is no way around that. Complexity issues aside, I think that an argument could be made for this approach *reducing* the impact on concurrency relative to other approaches, if there aren't too many unique indexes to deal with, which is the case the vast majority of the time.
I mean, those other approaches necessitate doing so much more with *exclusive* locks held. Like inserting, maybe doing a page split, WAL-logging, all with the lock, and then either updating in place or killing the promise tuple, and WAL-logging that, with an exclusive lock held the second time around. Plus searching for everything twice. I think that frequently killing all of those broken-promise tuples could have deleterious effects on concurrency and/or index bloat of the kind only remedied by REINDEX. Do you update the freespace map too? More exclusive locks! Or if you leave it up to VACUUM (and just set the xid to InvalidXid, which is still extra work), autovacuum has to care about a new *class* of bloat - index-only bloat. Plus lots of dead duplicates are bad for performance in btrees generally. > As here, that is way more expensive than > simply grabbing and holding a share-lock on the page. But we get a > number of important benefits out of it. The backend remains > interruptible while the tuple is locked, the protocol for granting > locks is FIFO to prevent starvation, we don't suppress page eviction > while the lock is held, we can simultaneously lock arbitrarily large > numbers of tuples, and deadlocks are detected and handled cleanly. If > those requirements were negotiable, we would surely have negotiated > them away already, because the performance benefits would be immense. False equivalence. We only need to lock as many unique index *values* (not tuples) as are proposed for insertion per slot (which can be reasonably bound), and only for an instant. Clearly it would be totally unacceptable if tuple-level locks made backends uninterruptible indefinitely. Of course, this is nothing like that. >> If the value locks were made interruptible through some method, such >> as the promise tuples approach, does that really make deadlocking >> acceptable? > > Yes. It's not possible to prevent all deadlocks. It IS possible to > make sure that they are properly detected and that precisely one of > the transactions involved is rolled back to resolve the deadlock. You seem to have misunderstood me here, or perhaps I was unclear. I'm referring to deadlocks that cannot really be predicted or analyzed by the user at all - see my comments below on insertion order. >> I don't think non-interruptibility is a problem? Really, do you think >> that this kind of inflammatory rhetoric helps anybody? I said nothing >> of the sort. I recall saying something about an engineering trade-off. >> Of course I value interruptibility. > > I don't see what's inflammatory about that statement. The fact that you simply stated, in an unqualified way, that I don't think non-interruptibility is a problem, obviously. > Interruptibility is not a nice-to-have that we > can trade away from time to time; it's essential and non-negotiable. I seem to recall you saying something about the Linux kernel and their attitude to interruptibility. Yes, interruptibility is not just a nice-to-have; it is essential. However, without dismissing your other concerns, I have yet to hear a convincing argument as to why anything I've done here is going to make any difference to interruptibility that would be appreciable to any human. So far it's been a slippery slope type argument that can be equally well used to argue against some facet of almost any substantial patch ever proposed. I just don't think that regressing interruptibility marginally is *necessarily* sufficient justification for rejecting an approach outright.
FYI, *that's* how I value interruptibility generally. >> In contrast, what I've proposed here is in general quite unlikely to >> result in any I/O for the duration of the time the locks are held. >> Only writers will be blocked. And only those inserting into a narrow >> range of values around the btree leaf page. Much of the work that even >> those writers need to do will be unimpeded anyway; they'll just block >> on attempting to acquire an exclusive lock on the first btree leaf >> page that the value they're inserting could be on. > > Sure, but you're talking about broadening the problem from the guy > performing the insert to everybody who might be trying to do an insert > that hits one of the same unique-index pages. In general, that isn't that much worse than just blocking the value directly. The number of possible values that could also be blocked is quite low. The chances of it actually mattering that those additional values are locked in the still small window in which the buffer locks are held are generally fairly low, particularly on larger tables where there is naturally a large number of possible distinct values. I will however concede that the impact on inserters that want to insert a non-locked value that belongs on the locked page or its child might be worse, but it's already a problem that inserted index tuples can all end up on the same page, if not to the same extent. > Instead of holding one > buffer lock, the guy performing the insert is now holding as many > buffer locks as there are indexes. That's a non-trivial issue. Actually, as many buffer locks as there are *unique* indexes. It might be a non-trivial issue, but this whole problem is decidedly non-trivial, as I'm sure we can all agree. > For that matter, if the table has more than MAX_SIMUL_LWLOCKS indexes, > you'll error out. In fact, if you get the number of indexes exactly > right, you'll exceed MAX_SIMUL_LWLOCKS in visibilitymap_clear() and > panic the whole system. Oh, come on. We can obviously engineer a solution to that problem. I don't think I've ever seen a table with close to 100 *unique* indexes. 4 or 5 is a very high number. If we just raised an error if someone tried to do this with more than 10 unique indexes, I would guess that we'd get exactly zero complaints about it. > Oh, and if different backends load the index list in different orders, > because say the system catalog gets vacuumed between their respective > relcache loads, then they may try to lock the indexes in different > orders and cause an undetected deadlock. Undetected deadlock is really not much worse than detected deadlock here. Either way, it's a bug. And it's something that any kind of implementation will need to account for. It's not okay to *unpredictably* deadlock, in a way that the user has no control over. Today, someone can do an analysis of their application and eliminate deadlocks if they need to. That might not be terribly practical much of the time, but it can be done. It certainly is practical to do it in a localized way. I wouldn't like to compromise that. So yes, you're right that I need to control for this sort of thing better than in the extant patch, and in fact this was discussed fairly early on. But it's an inherent problem. > And, drifting a bit further off-topic, even to get as far as you have, > you've added overhead to every lwlock acquisition and release, even > for people who never use this functionality. If you look at the code, you'll see that I've made very modest modifications to LWLockRelease only.
I would be extremely surprised if the overhead was not only in the noise, but was completely impossible to detect through any conventional benchmark. These are the same kind of very modest changes made for LWLockAcquireOrWait(), and you said nothing about that at the time. Despite the fact that you now appear to think that that whole effort was largely a waste of time. > That's true but somewhat misleading. Textually most of the function > holds the buffer lock, but heap_prepare_insert(), > CheckForSerializableConflictIn(), RelationGetBufferForTuple(), and > XLogWrite() are the parts that do substantial amounts of computation, > and only the last of those happens while holding the buffer lock. I've already written modifications so that I don't have to do heap_prepare_insert() with the locks held. There is no reason to call CheckForSerializableConflictIn() with the additional locks held either. After all, "For a heap insert, we only need to check for table-level SSI locks". As for RelationGetBufferForTuple(), yes, the majority of the time it will have to do very little without acquiring an exclusive lock, because it's going to get that from the last place a heap tuple was inserted from. -- Peter Geoghegan
On Sat, Sep 14, 2013 at 3:15 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> Well, the reality is that the promises idea hasn't been described in >> remotely enough detail to compare it to what I have here. I've pointed >> out plenty of problems with it. > > Even if you disagree, I still think that doesn't matter in the very > least. It matters if you care about getting this feature. > You say: > >> I think that the details of how this approach compare to others are >> totally pertinent. For me, that's the whole point - getting towards >> something that will balance all of these concerns and be acceptable. > > Well, the two other people involved in the discussion so far have gone > on the record saying that the presented approach is not acceptable to > them. And you haven't started reacting to that. Uh, yes I have. I'm not really sure what you could mean by that. What am I refusing to address? >> Yes, it's entirely possible that that could look quite different to >> what I have here. I do not want to reduce all this to a question of >> "is this one design acceptable or not?". > > But the way you're discussing it so far is exactly reducing it that way. The fact that I was motivated to do things this way serves to illustrate the problems generally. > If you want the discussion to be about *how* we can implement it so that > the various concerns are addressed: fsck*ing great. I am with you there. Isn't that what we were doing? There has been plenty of commentary on alternative approaches. > In the end, even though I have my usual strong opinions about which is the > best way, I don't care which algorithm gets pursued further. At least, > if, and only if, it has a fighting chance of getting committed. Which > this doesn't. I don't think that any design that has been described to date is without serious problems. Causing excessive bloat, particularly in indexes, is a serious problem also. >> The reason I ended up here is not because I didn't get the memo about >> holding buffer locks across complex operations being a bad thing. At >> least grant me that. I'm here because in all these years no one has >> come up with a suggestion that doesn't have some very major downsides. >> Like, even worse than this. > > I think you're massively, massively, massively overstating the dangers > of bloat here. It's a known problem that's *NOT* getting worse by any of > the other proposals if you compare it with the loop/lock/catch > implementation of upsert that we have today as the only option. Why would I compare it with that? That's terrible, and very few of our users actually know about it anyway. Also, will an UPDATE followed by an INSERT really bloat all that much anyway? > And we > *DO* have infrastructure to deal with bloat, even if it could use some > improvement. We *don't* have infrastructure to deal with deadlocks on > lwlocks. And we're not going to get that infrastructure, because it > would even further remove the "lw" part of lwlocks. Everything I said so far is predicated on LWLocks not deadlocking here, so I'm not really sure why you'd say that. If I can't find a way to prevent deadlock, then clearly the approach is doomed. > It's that it interleaves an already complex but local locking scheme > that required several years to become correct with another that is just > the same. That's an utterly horrid idea.
You're missing my point, which is that it may be possible, with relatively modest effort, to analyze things to ensure that deadlock is impossible - regardless of the complexities of the two systems - because they're reasonably well encapsulated. See below, under "I'll say it again". Now, I can certainly understand why you wouldn't be willing to accept that at face value. The idea isn't absurd, though. You could think of the heap_insert() call as being under the control of the btree code (just as, say, heap_hot_search() is), even though the code isn't at all structured that way, and that's awkward. I'm actually slightly tempted to structure it that way. >> Is that really so different to holding an exclusive >> lock on a btree buffer while holding a shared lock on a heap buffer? >> Because that's what _bt_check_unique() does today. > > Yes, it is different. But, in my opinion, _bt_check_unique() doing so > is a bug that needs fixing. Not something that we want to extend. Well, I think you know that that's never going to happen. There are all kinds of reasons why it works that way that cannot be disavowed. My definition of a bug includes a user being affected. >> > At this point I am a bit confused why you are asking for review. >> >> I am asking for us, collectively, through consensus, to resolve the >> basic approach to doing this. That was something I stated right up >> front, pointing out details of where the discussion had gone in the >> past. That was my explicit goal. There has been plenty of discussion >> on this down through the years, but nothing ever came from it. > > At the moment ISTM you're not conceding on *ANY* points. That's not very > often the way to find consensus. Really? I've conceded plenty of points. Just now I conceded a point to Robert about insertion being blocked for inserters that want to insert a value that isn't already locked/existing, and he didn't even raise that in the first place. Most prominently, I've conceded that it is entirely questionable that I hold the buffer locks for longer - before you even responded to my original patch! I've said it many, many times, in many, many ways. It should be heavily scrutinized. But you both seem to be making general points along those lines, without reference to what I've actually done. Those general points could almost to the same extent apply to _bt_check_unique() today, which is why I have a hard time accepting them at face value. To say that what that function does is "a bug" is just not credible, because it's been around in essentially the same form since at least a time when you and I were in primary school. I'll remind you that you haven't been able to demonstrate deadlock in a way that invalidates my approach. While of course that's not how this is supposed to work, I've been too busy defending myself here to get down to the business of carefully analysing the relatively modest interactions between btree and heap that could conceivably introduce a deadlock. Yes, the burden to prove this can't deadlock is mine, but I thought I'd provide you with the opportunity to prove that it can. I'll say it again: For a deadlock, there needs to be a mutual dependency. Provided the locking phase doesn't acquire any locks other than buffer locks, and during the interaction with the heap btree inserters (or the locking phase) cannot acquire heap locks in a way that conflicts with other upserters, we will be fine.
It doesn't necessarily matter how complex each system individually is, because the two meet in such a limited area (well, two areas now, I suppose), and they only meet in one direction - there is no reciprocation where the heap code locks or otherwise interacts with index buffers. When the heap insertion is performed, all index value locks are already acquired. The locking phase cannot block itself because of the ordering of locking, but also because the locks on the heap that it takes are only shared locks. Now, this analysis is somewhat complex, and underdeveloped. But as Robert said, there are plenty of things about locking in Postgres that are complex and subtle. He also said that it doesn't matter if I can prove that it won't deadlock, but I'd like a second opinion on that, since my proof might actually be, if not simple, short, and therefore may not represent an ongoing burden in the way Robert seemed to think it would. > That argument doesn't make sense to me. You're inserting a unique > value. It completely makes sense that you can only insert one of > them. > Even if it had an advantage, not blocking *for the single unique key alone* > opens you to issues of livelocks where several backends retry because of > each other indefinitely. See my remarks to Robert. > Whether postgres locks down in a way that can only resolved by kill -9 > or whether it aborts a transaction are, like, a couple of magnitude of a > difference. Not really. I can see the advantage of having the deadlock be detectable from a defensive-coding standpoint. But index locking ordering inconsistencies, and the deadlocks they may cause are not acceptable generally. -- Peter Geoghegan
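As a user-level illustration of why lock-ordering analysis matters here (this is an analogy using ordinary row locks, not the btree buffer-lock case itself; the table and values are invented for the example):

    -- A classic, analyzable deadlock caused purely by inconsistent lock order.
    CREATE TABLE accts (id int PRIMARY KEY, bal int);
    INSERT INTO accts VALUES (1, 100), (2, 100);

    -- Session 1:
    BEGIN;
    UPDATE accts SET bal = bal - 10 WHERE id = 1;   -- locks row 1

    -- Session 2:
    BEGIN;
    UPDATE accts SET bal = bal - 10 WHERE id = 2;   -- locks row 2
    UPDATE accts SET bal = bal + 10 WHERE id = 1;   -- blocks on session 1

    -- Session 1:
    UPDATE accts SET bal = bal + 10 WHERE id = 2;   -- deadlock detected;
                                                    -- one transaction aborts

An application can eliminate this kind of deadlock by always taking its locks in a consistent order, which is exactly the sort of localized analysis the discussion above is anxious not to compromise.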
On Sat, Sep 14, 2013 at 1:57 AM, Greg Stark <stark@mit.edu> wrote: > It seems to me that the nature of the problem is that there will unavoidably > be a nexus between the two parts of the code here. We can try to isolate it > as much as possible but we're going to need a bit of a compromise. Exactly. That's why all the proposals with the exception of this one have to date involved unacceptable bloating - that's how they try and span the nexus. I'll find it very difficult to accept any implementation that is going to bloat things even worse than our upsert looping example. The only advantage of such an implementation over the upsert example is that it'll avoid burning through subxacts. The main reason I don't want to take that approach is that I know it won't be accepted, because it's a disaster. That's why the people that proposed this in various forms down through the years haven't gone and implemented it themselves. I do not accept that all of this is like the general situation with row locks. I do not think that the big costs of having many dead duplicates in a unique index can be overlooked (or perhaps the cost of cleaning them up eagerly, which is something I'd also expect to work very badly). That's something that's going to reverberate all over the place. Imagine a simple, innocent looking pattern that resulted in there being unique indexes that became hugely bloated. It's not hard. What I will concede (what I have conceded, actually) is that it would be better if the locks were more granular. Now, I'm not so much concerned about concurrent inserters inserting values that just so happen to be values that were locked. It's more the case that I'm worried about inserters blocking on other values that are incidentally locked despite not already existing, that would go on the locked page or maybe a later page. In particular, I'm concerned about the impact on SERIAL primary key columns. Not exactly an uncommon case (though one I'd already thought to optimize by locking last). What I think might actually work acceptably is if we were to create an SLRU that kept track of value-locks per buffer. The challenge there would be to have regular unique index inserters care about them, while having little to no impact on their regular performance. This might be possible by having them check the buffer for external value locks in the SLRU immediately after exclusive locking the buffer - usually that only has to happen once per index tuple insertion (assuming no duplicates necessitate retry). If they find their value in the SLRU, they do something like unlock and block on the other xact and restart. Now, obviously a lot of the details would have to be worked out, but it seems possible. In order for any of this to really be possible, there'd have to be some concession made to my position, as Greg mentions here. In other words, I'd need buy-in for the general idea of holding locks in shared memory from indexes across heap tuple insertion (subject to a sound deadlock analysis, of course). Some modest compromises may need to be made around interruptibility. I'd also probably need agreement that it's okay that value locks can not last more than an instant (they cannot be held indefinitely pending the end of a transaction). This isn't something that I imagine to be too controversial, because it's true today for a single unique index. 
As I've already outlined, anyone waiting on another transaction with a would-be duplicate to commit has very few guarantees about the order that it'll get its second shot relative to the order it initially queued up behind the successful but not-yet-committed inserter. -- Peter Geoghegan
<p dir="ltr"><br /> On 15 Sep 2013 10:19, "Peter Geoghegan" <<a href="mailto:pg@heroku.com">pg@heroku.com</a>> wrote:<br/> ><br /> > On Sat, Sep 14, 2013 at 1:57 AM, Greg Stark <<a href="mailto:stark@mit.edu">stark@mit.edu</a>>wrote:<br /> > > It seems to me that the nature of the problem isthat there will unavoidably<br /> > > be a nexus between the two parts of the code here. We can try to isolate it<br/> > > as much as possible but we're going to need a bit of a compromise.<br /><p dir="ltr">> In order forany of this to really be possible, there'd have to be<br /> > some concession made to my position, as Greg mentionshere. In other<br /> > words, I'd need buy-in for the general idea of holding locks in shared<br /> > memoryfrom indexes across heap tuple insertion (subject to a sound<br /> > deadlock analysis, of course). <p dir="ltr">Actuallythat wasn't what I meant by that.<p dir="ltr">What I meant is that there going to be some code couplingbetween the executor and btree code. That's purely a question of course structure, and will be true regardless ofthe algorithm you settle on.<p dir="ltr">What I was suggesting was an api for a function that would encapsulate that coupling.The executor would call this function which would promise to obtain all the locks needed for both operations orgive up. Effectively it would be a special btree operation which would have special knowledge of the executor only in thatit knows that being able to get a lock on two heap buffers is something the executor needs sometimes.<p dir="ltr">I'mnot sure this fits well with your syntax since it assumes the update will happen at the same time as the indexlookup but as I said I haven't read your patch, maybe it's not incompatible. I'm writing all this on my phone so it'smostly just pie in the sky brainstorming. I'm sorry if it's entirely irrelevant.<br />
Peter Geoghegan <pg@heroku.com> wrote: > There is no reason to call CheckForSerializableConflictIn() with > the additional locks held either. After all, "For a heap insert, > we only need to check for table-level SSI locks". You're only talking about not covering that call with a *new* LWLock, right? We put some effort into making sure that such calls were only inside of LWLocks which were needed for correctness. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-09-15 02:19:41 -0700, Peter Geoghegan wrote: > On Sat, Sep 14, 2013 at 1:57 AM, Greg Stark <stark@mit.edu> wrote: > > It seems to me that the nature of the problem is that there will unavoidably > > be a nexus between the two parts of the code here. We can try to isolate it > > as much as possible but we're going to need a bit of a compromise. > > Exactly. That's why all the proposals with the exception of this one > have to date involved unacceptable bloating - that's how they try and > span the nexus. > I'll find it very difficult to accept any implementation that is going > to bloat things even worse than our upsert looping example. How would any even halfway sensible example cause *more* bloat than the upsert looping thing? I'll concede that bloat is something to be aware of, but just because it's *an* issue, it's not *the* only issue. All the solutions I can think of or have heard of that have a chance of producing additional bloat also have a good chance of cleaning up that additional bloat. In the "promises" approach you can simply mark the promise index tuples as LP_DEAD in the IGNORE case if you've found a conflicting tuple. In the OR UPDATE case you can immediately reuse them. There's no heap bloat. The logic for dead items already exists in nbtree, so that's not too much complication. The case where that doesn't work is when postgres dies in between or we're signalled to abort. But that produces bloat for normal DML anyway. Any vacuum or insert can check whether the promise xid has committed and remove the promise otherwise. In the proposals that involve just inserting the heap tuple and then handling the uniqueness violation when inserting the index tuples, you can immediately mark the index tuples as dead and mark the heap tuple as prunable. > The only advantage of such an implementation over the upsert example is that > it'll avoid burning through subxacts. The main reason I don't want to > take that approach is that I know it won't be accepted, because it's a > disaster. That's why the people that proposed this in various forms > down through the years haven't gone and implemented it themselves. I > do not accept that all of this is like the general situation with row > locks. The primary advantage will be that it's actually usable by users without massive overhead in writing dozens of functions. I don't think the bloat issue had much to do with the feature not getting implemented so far. It's that nobody was willing to do the work and endure the discussions around it. And I definitely applaud you for finally tackling the issue despite that. > I do not think that the big costs of having many dead > duplicates in a unique index can be overlooked Why would there be so many duplicate index tuples? The primary user of this is going to be UPSERT. In case there's a conflicting tuple, there is going to be a new tuple version. Which will need a new index entry quite often. If there's no conflict, we will insert anyway. So, there's the case of UPSERTs that could be done as HOT updates because there's enough space on the page and none of the indexes actually have changed. As explained above, we can simply mark the index tuple as dead in that case (don't even need an exclusive lock for that, if done right). > (or perhaps the cost of > cleaning them up eagerly, which is something I'd also expect to work > very badly). Why? Remember the page you did the insert to, do a _bt_moveright() to catch any intervening splits. Mark the item as dead. Done. The next insert will repack the page if necessary (cf.
_bt_findinsertloc). > What I will concede (what I have conceded, actually) is that it would > be better if the locks were more granular. Now, I'm not so much > concerned about concurrent inserters inserting values that just so > happen to be values that were locked. It's more the case that I'm > worried about inserters blocking on other values that are incidentally > locked despite not already existing, that would go on the locked page > or maybe a later page. In particular, I'm concerned about the impact > on SERIAL primary key columns. Not exactly an uncommon case (though > one I'd already thought to optimize by locking last). Yes, I think that's the primary issue from a scalability and performance POV. Locking entire ranges of values, potentially even on inner pages (because you otherwise would have to split) isn't going to work. > What I think might actually work acceptably is if we were to create an > SLRU that kept track of value-locks per buffer. The challenge there > would be to have regular unique index inserters care about them, while > having little to no impact on their regular performance. This might be > possible by having them check the buffer for external value locks in > the SLRU immediately after exclusive locking the buffer - usually that > only has to happen once per index tuple insertion (assuming no > duplicates necessitate retry). If they find their value in the SLRU, > they do something like unlock and block on the other xact and restart. > Now, obviously a lot of the details would have to be worked out, but > it seems possible. If you can make that work, without locking heap and btree pages at the same time, yes, I think that's a possible way forward. One way to offset the cost of the SLRU lookup in the common case where there is no contention would be to have a page level flag that triggers that lookup. There should be space in btpo_flags. > In order for any of this to really be possible, there'd have to be > some concession made to my position, as Greg mentions here. In other > words, I'd need buy-in for the general idea of holding locks in shared > memory from indexes across heap tuple insertion (subject to a sound > deadlock analysis, of course). I don't have a fundamental problem with holding locks during the insert. I have a problem with holding page level lightweight locks on the btree and the heap at the same time. > Some modest compromises may need to be made around interruptibility. Why? As far as I understand that proposal, I don't see why that would be needed. > I'd also probably need agreement that > it's okay that value locks can not last more than an instant (they > cannot be held indefinitely pending the end of a transaction). This > isn't something that I imagine to be too controversial, because it's > true today for a single unique index. As I've already outlined, anyone > waiting on another transaction with a would-be duplicate to commit has > very few guarantees about the order that it'll get its second shot > relative to the order it initially queued up behind the successful but > not-yet-committed inserter. I foresee problems here. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Sep 14, 2013 at 6:27 PM, Peter Geoghegan <pg@heroku.com> wrote: > Note that today there is no guarantee that the original waiter for a > duplicate-inserting xact to complete will be the first one to get a > second chance, so I think it's hard to question this on correctness > grounds. Even if they are released in FIFO order, there is no reason > to assume that the first waiter will win the race with a second. Most > obviously, the second waiter may not even ever get the chance to block > on the same xid at all (so it's not really a waiter at all) and still > be able to insert, if the blocking-xact aborts after the second > "waiter" starts its descent but before it checks uniqueness. All this, > even though the second "waiter" arrived maybe minutes after the first. ProcLockWakeup() only wakes as many waiters from the head of the queue as can all be granted the lock without any conflicts. So I don't think there is a race condition in that path. > So far it's > been a slippery slope type argument that can be equally well used to > argue against some facet of almost any substantial patch ever > proposed. I don't completely agree with that characterization, but you do have a point. Obviously, if the differences in the area of interruptibility, starvation, deadlock risk, etc. relative to the status quo can be made small enough, then those aren't reasons to reject the approach. But I'm skeptical that you're going to be able to accomplish that, especially without adversely affecting maintainability. I think the way that you're proposing to use lwlocks here is sufficiently different from what the rest of the system does that it's going to be hard to avoid system-wide effects that can't easily be caught during code review; and like Andres, I don't share your skepticism about alternative approaches. >> For that matter, if the table has more than MAX_SIMUL_LWLOCKS indexes, >> you'll error out. In fact, if you get the number of indexes exactly >> right, you'll exceed MAX_SIMUL_LWLOCKS in visibilitymap_clear() and >> panic the whole system. > > Oh, come on. We can obviously engineer a solution to that problem. I > don't think I've ever seen a table with close to 100 *unique* indexes. > 4 or 5 is a very high number. If we just raised an error if someone > tried to do this with more than 10 unique indexes, I would guess > that we'd get exactly zero complaints about it. That's not a solution; that's a hack. > Undetected deadlock is really not much worse than detected deadlock > here. Either way, it's a bug. And it's something that any kind of > implementation will need to account for. It's not okay to > *unpredictably* deadlock, in a way that the user has no control over. > Today, someone can do an analysis of their application and eliminate > deadlocks if they need to. That might not be terribly practical much > of the time, but it can be done. It certainly is practical to do it in > a localized way. I wouldn't like to compromise that. I agree that unpredictable deadlocks are bad. I think the fundamental problem with UPSERT, MERGE, and this proposal is what happens when the conflicting tuple is present but not visible to your scan, either because it hasn't committed yet or because it has committed but is not visible to your snapshot. I'm not clear on how you handle that in your approach. > If you look at the code, you'll see that I've made very modest > modifications to LWLockRelease only.
I would be extremely surprised if > the overhead was not only in the noise, but was completely impossible > to detect through any conventional benchmark. These are the same kind > of very modest changes made for LWLockAcquireOrWait(), and you said > nothing about that at the time. Despite the fact that you now appear > to think that that whole effort was largely a waste of time. Well, I did have some concerns about the performance impact of that patch: http://www.postgresql.org/message-id/CA+TgmoaPyQKEaoFz8HkDGvRDbOmRpkGo69zjODB5=7Jh3hbPQA@mail.gmail.com I also discovered, after it was committed, that it didn't help in the way I expected: http://www.postgresql.org/message-id/CA+TgmoY8P3sD=oUViG+xZjmZk5-phuNV39rtfyzUQxU8hJtZxw@mail.gmail.com It's true that I didn't raise those concerns contemporaneously with the commit, but I didn't understand the situation well enough at that time to realize how narrow the benefit was. I've wished, on a number of occasions, to be able to add more lwlock primitives. The problem with that is that if everybody does it, we'll pretty soon end up with a mess. I attempted to address that with this proposal: http://www.postgresql.org/message-id/CA+Tgmob4YE_k5dpO0T07PNf1SOKPybo+wj4m4FryOS7Z4_yOzg@mail.gmail.com ...but nobody (including me) was very sure that was the right way forward, and it never went anywhere. However, I think the basic issue remains. I was sad to discover last week that Heikki handled this problem for the WAL scalability patch by basically copy-and-pasting much of the lwlock code and then hacking it up. I think we're well on our way to an unmaintainable mess already, and I don't want it to get worse. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
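To make the second half of the visibility problem raised above concrete - a conflicting tuple that has committed but is not visible to your snapshot - here is a minimal two-session sketch (the table and values are invented for illustration):

    -- A committed row can be invisible to your snapshot and still conflict.
    CREATE TABLE t (id int PRIMARY KEY, v text);

    -- Session 1:
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT * FROM t;                      -- snapshot taken; sees no rows

    -- Session 2:
    INSERT INTO t VALUES (1, 'theirs');   -- commits immediately

    -- Session 1:
    SELECT * FROM t WHERE id = 1;         -- still sees nothing
    INSERT INTO t VALUES (1, 'ours');     -- fails with a unique violation
                                          -- caused by a row it cannot see

Any upsert definition has to decide what it means to "update" such a row, which is what the rest of this subthread turns on.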
On 2013-09-17 12:29:51 -0400, Robert Haas wrote: > But I'm skeptical that you're going to be able to accomplish that, > especially without adversely affecting maintainability. I think the > way that you're proposing to use lwlocks here is sufficiently > different from what the rest of the system does that it's going to be > hard to avoid system-wide effects that can't easily be caught during > code review; I actually think extending lwlocks to allow downgrading an exclusive lock is a good idea, independent of this patch, and I think there are some areas of the code where we could use that capability to increase scalability. Now, that might be because I pretty much suggested using them in such a way to solve some of the problems :P I don't think they solve the issue of this patch (holding several nbtree pages locked across heap operations) though. > I agree that unpredictable deadlocks are bad. I think the fundamental > problem with UPSERT, MERGE, and this proposal is what happens when the > conflicting tuple is present but not visible to your scan, either > because it hasn't committed yet or because it has committed but is not > visible to your snapshot. I'm not clear on how you handle that in > your approach. Hm. I think it should be handled exactly the way we handle it for unique indexes today. Wait till it's clear whether you can proceed. At some point we might want to extend that logic to more cases, but that should be a separate discussion imo. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Sep 17, 2013 at 6:20 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> I agree that unpredictable deadlocks are bad. I think the fundamental >> problem with UPSERT, MERGE, and this proposal is what happens when the >> conflicting tuple is present but not visible to your scan, either >> because it hasn't committed yet or because it has committed but is not >> visible to your snapshot. I'm not clear on how you handle that in >> your approach. > > Hm. I think it should be handled exactly the way we handle it for unique > indexes today. Wait till it's clear whether you can proceed. That's what I do, although getting those details right has been of secondary concern for obvious reasons. > At some point we might want to extend that logic to more cases, but that > should be a separate discussion imo. This is essentially why I went and added a row locking component over your objections. Value locks (regardless of implementation) effectively stop an insertion from finishing, but not from starting. ISTM that locking the row with value locks held can cause deadlock. So, unfortunately, we cannot really discuss value locking and row locking separately, even though I see the appeal of trying to. Gaining an actual representative notion of the expense of releasing and re-acquiring the locks is too tightly coupled with how this is handled and how frequently we need to restart. Plus there may well be other issues in the same vein that we've yet to consider. -- Peter Geoghegan
On 2013-09-18 00:54:38 -0500, Peter Geoghegan wrote: > > At some point we might want to extend that logic to more cases, but that > > should be a separate discussion imo. > > This is essentially why I went and added a row locking component over > your objections. I didn't object to implementing row level locking. I said that if your basic algorithm without row level locks is viewed as being broken, it won't be fixed by implementing row level locking. What I meant here is just that we shouldn't implement a mode with less waiting for now, even if there might be use cases, because that will open another can of worms. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Sep 17, 2013 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Sep 14, 2013 at 6:27 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Note that today there is no guarantee that the original waiter for a >> duplicate-inserting xact to complete will be the first one to get a >> second chance > ProcLockWakeup() only wakes as many waiters from the head of the queue > as can all be granted the lock without any conflicts. So I don't > think there is a race condition in that path. Right, but what about XactLockTableWait() itself? It only acquires a ShareLock on the xid of the got-there-first inserter that potentially hasn't yet committed/aborted. There will be no conflicts between multiple second-chance-seeking blockers trying to acquire this lock concurrently, and so in fact there is (what I guess you'd consider to be) a race condition in the current btree insertion code. So my earlier point about according an upsert implementation license to optimize ordering of retries across multiple unique indexes -- that it isn't really inconsistent with the current code when dealing with only one unique index insertion -- has not been invalidated. EvalPlanQualFetch() and Do_MultiXactIdWait() also call XactLockTableWait(), for similar reasons. In my patch, the later row locking code used by INSERT...ON DUPLICATE KEY LOCK FOR UPDATE calls XactLockTableWait() too. >> So far it's >> been a slippery slope type argument that can be equally well used to >> argue against some facet of almost any substantial patch ever >> proposed. > > I don't completely agree with that characterization, but you do have a > point. Obviously, if the differences in the area of interruptibility, > starvation, deadlock risk, etc. relative to the status quo can be made > small enough, then those aren't reasons to reject the approach. That all seems fair to me. That's the standard that I'd apply as a reviewer myself. > But I'm skeptical that you're going to be able to accomplish that, > especially without adversely affecting maintainability. I think the > way that you're proposing to use lwlocks here is sufficiently > different from what the rest of the system does that it's going to be > hard to avoid system-wide effects that can't easily be caught during > code review; Fair enough. In case it isn't already totally clear to someone, I concede that it isn't going to be workable to hold even shared buffer locks across all these operations. Let's get past that, though. > and like Andres, I don't share your skepticism about > alternative approaches. Well, I expressed skepticism about one alternative approach in particular, which is the promise tuples approach. Andres seems to think that I'm overly concerned about bloat, but I'm not sure he appreciates why I'm so sensitive to it in this instance. I'll be particularly sensitive to it if value locks need to be held indefinitely rather than there being a speculative grab-the-value-locks attempt (because that increases the window in which another session can necessitate that we retry at row locking time quite considerably - see below). > I think the fundamental > problem with UPSERT, MERGE, and this proposal is what happens when the > conflicting tuple is present but not visible to your scan, either > because it hasn't committed yet or because it has committed but is not > visible to your snapshot. Yeah, you're right.
As I mentioned to Andres already, when row locking happens and there is this kind of conflict, my approach is to retry from scratch (go right back to before value lock acquisition) in the sort of scenario that generally necessitates EvalPlanQual() looping, or to throw a serialization failure where that's appropriate. After an unsuccessful attempt at row locking there could well be an interim wait for another xact to finish, before retrying (at read committed isolation level). This is why I think that value locking/retrying should be cheap, and should avoid bloat if at all possible. Forgive me if I'm making a leap here, but it seems like what you're saying is that the semantics of upsert that one might naturally expect are *arguably* fundamentally impossible, because they entail potentially locking a row that isn't current to your snapshot, and you cannot throw a serialization failure at read committed. I respectfully suggest that that exact definition of upsert isn't a useful one, because other snapshot isolation/MVCC systems operating within the same constraints must have the same issues, and yet they manage to implement something that could be called upsert that people seem happy with. > I also discovered, after it was committed, that it didn't help in the > way I expected: > > http://www.postgresql.org/message-id/CA+TgmoY8P3sD=oUViG+xZjmZk5-phuNV39rtfyzUQxU8hJtZxw@mail.gmail.com Well, at the time you didn't also provide raw commit latency benchmark results for your hardware using a tool like pg_test_fsync, which I'd consider absolutely essential to such a discussion. That's mostly or entirely what the group commit stuff does - amortize that cost among concurrently flushing transactions. Around this time, the patch was said by Heikki to just relieve lock contention around WALWriteLock - the 9.2 release notes say much the same. I never understood it that way, though Heikki disagreed with that [1]. Certainly, if relieving contention was all the patch did, then you wouldn't expect the 9.3 commit_delay implementation to help anyone, but it does: with a slow fsync holding the lock 50% *longer* can actually help tremendously. So I *always* agreed with you that there was hardware where group commit would barely help with a moderately sympathetic benchmark like the pgbench default. Not that it matters much now. > It's true that I didn't raise those concerns contemporaneously with > the commit, but I didn't understand the situation well enough at that > time to realize how narrow the benefit was. > > I've wished, on a number of occasions, to be able to add more lwlock > primitives. The problem with that is that if everybody does it, we'll > pretty soon end up with a mess. I wouldn't go that far. The number of possible additional primitives that are useful isn't that high, unless we decide that LWLocks are going to be a fundamentally different thing, which I consider unlikely. > http://www.postgresql.org/message-id/CA+Tgmob4YE_k5dpO0T07PNf1SOKPybo+wj4m4FryOS7Z4_yOzg@mail.gmail.com > > ...but nobody (including me) was very sure that was the right way > forward, and it never went anywhere. However, I think the basic issue > remains. I was sad to discover last week that Heikki handled this > problem for the WAL scalability patch by basically copy-and-pasting > much of the lwlock code and then hacking it up. I think we're well on > our way to an unmaintainable mess already, and I don't want it to get > worse. :-( I hear what you're saying about LWLocks. 
I did follow the FlexLocks stuff at the time myself. Obviously we aren't going to add new lwlock operations if they have exactly no clients. However, I think that the semantics implemented (weakening and strengthening of locks) may well be handy somewhere else. So while I wouldn't go and commit that stuff on the off chance that it will be useful, it's worth bearing in mind going forward that it's quite possible to weaken/strengthen locks. [1] http://www.postgresql.org/message-id/4FB0A673.7040002@enterprisedb.com -- Peter Geoghegan
Peter, * Peter Geoghegan (pg@heroku.com) wrote: > Forgive me if I'm making a leap here, but it seems like what you're > saying is that the semantics of upsert that one might naturally expect > are *arguably* fundamentally impossible, because they entail > potentially locking a row that isn't current to your snapshot, and you > cannot throw a serialization failure at read committed. I respectfully > suggest that that exact definition of upsert isn't a useful one, I'm not sure I follow this completely- you're saying that a definition of 'upsert' which includes having to lock rows which aren't in your current snapshot (for reasons stated) isn't a useful one. Is the implication that a useful definition of 'upsert' is that it *doesn't* have to lock rows which aren't in your current snapshot, and if so, then what would the semantics of that upsert look like? > because other snapshot isolation/MVCC systems operating within the > same constraints must have the same issues, and yet they manage to > implement something that could be called upsert that people seem happy > with. This I am generally in agreement with, to the extent that 'upsert' is something we really want and we should figure out a way to get there from here, but it wouldn't be the first time that we worked out a better solution than existing implementations. So, another '+1' from me wrt your working this issue and please don't get too discouraged that there's a lot of pressure to find a magic bullet- I think part of it is exactly because everyone wants this and wants it to be better than what's out there today. Thanks, Stephen
Hi Stephen, On Fri, Sep 20, 2013 at 6:55 PM, Stephen Frost <sfrost@snowman.net> wrote: > I'm not sure I follow this completely- you're saying that a definition > of 'upsert' which includes having to lock rows which aren't in your > current snapshot (for reasons stated) isn't a useful one. Is the > implication that a useful definition of 'upsert' is that it *doesn't* > have to lock rows which aren't in your current snapshot, and if so, then > what would the semantics of that upsert look like? No, I'm suggesting that the useful semantics are that it does potentially lock rows not yet visible to our snapshot that have committed - the latest row version. I see no alternative (we can't throw a serialization failure at read committed isolation level), and Andres seemed to agree that this was the way forward. Robert described problems he saw with this a few years ago [1]. It *is* a problem (we need to think very carefully about it), but, as I've said, it is a problem that anyone implementing this feature for a Snapshot Isolation/MVCC database would have to deal with, and several have. So, what the patch does right now is (if you squint) analogous to how SELECT FOR UPDATE uses EvalPlanQual already. However, instead of re-verifying a qual, we're re-verifying that the value locking has identified the right tid (there will probably be a different one in the subsequent iteration, or maybe we *can* go insert this time). We need consensus across unique indexes to go ahead with insertion, but once we know that we can't (and have a tid to lock), value locks can go away - we'll know if anything has changed about the tid's logical row that we need to care about when row locking. Besides, holding value locks while row locking has deadlock hazards, and, because value locks only stop insertions *finishing*, holding on to them is at best pointless. The tid we get from locking - the one that points to a would-be duplicate heap tuple - has always committed; otherwise we'd never return from locking, because that blocks pending the outcome of a duplicate-inserting-xact (and only returns the tid when that xact commits). Even though this tuple is known to be visible, it may be deleted in the interim before row locking, in which case restarting from before value locking is appropriate. It might also be updated, which would necessitate locking a later row version in order to prevent race conditions. But it seems dangerous, invasive, and maybe even generally impossible to try to wait for the transaction that updated it to commit or abort so that we can lock that later version the usual way (the usual EvalPlanQual looping thing) - better to restart value locking. The fundamental observation about value locking (at least for any half-way reasonable implementation), that I'd like to emphasize, is that short of a radical overhaul that would have many downsides, it can only ever prevent insertion from *finishing*. The big picture of my design is that it tries to quickly grab value locks, release them and grab a row lock (or insert heap tuples, index tuples, and then release value locks). If row locking fails, it waits for the conflicting xact to finish, and then restarts before the value locking of the current slot. If you think that's kind of questionable, maybe you have a point, but consider: 1. How else are you going to handle it if row locking needs to handle conflicts? You might say "I can re-verify that no unique index columns were affected instead", and maybe you can, but what if that doesn't help because they *were* changed?
Besides, doesn't this break the amcanunique contract? Surely judging what's really a duplicate is the AM's job. You're back to "I need to throw an error to get out of this but I have no good excuse to do so at read committed" -- you've lost the usual duplicate key error "excuse". I don't think you can expect holding the value locks throughout row locking to help, because, as I've said, that causes morally indefensible deadlocks, and besides, it doesn't stop what row locking would consider to be a conflict, it just stops insertion from *finishing*. 2. In the existing btree index insertion code, the order that retries occur in the event of unique index tuple insertion finding an unfinished conflicting xact *is* undefined. Yes, that's right - the first waiter is not guaranteed to be the first to get a second chance. It's not even particularly probable! See remarks from my last mail to Robert for more information. 3. People with a real purist's view on the (re)ordering of value locking must already think that EvalPlanQual() is completely ghetto for very similar reasons, and as such should just go use a higher isolation level. For the rest of us, what concurrency control anomaly can allowing this cause over and above what's already possible there? Are lock starvation risks actually appreciably raised at all? What Andres and Robert seem to expect generally - that value locks only be released when the locker has a definitive answer - actually *can* be ensured at the higher isolation levels, where the system has license to bail out by throwing a serialization failure. The trick there is just to throw an error if the first *retry* at cross-index value locking is unsuccessful or blocks on a whole other xact -- a serialization error (and not a unique constraint violation error, as would often but not necessarily otherwise occur for non-upserters). Naturally, it could also happen that at isolation levels above read committed, row locking throws a serialization failure (as is always mandated over using EvalPlanQual() or other monkeying around at higher isolation levels). > This I am generally in agreement with, to the extent that 'upsert' is > something we really want and we should figure out a way to get there > from here, but it wouldn't be the first time that we worked out a > better solution than existing implementations. So, another '+1' from me > wrt your working this issue and please don't get too discouraged that > there's a lot of pressure to find a magic bullet Thanks for the encouragement! [1] http://www.postgresql.org/message-id/AANLkTineR-rDFWENeddLg=GrkT+epMHk2j9X0YqpiTY8@mail.gmail.com -- Peter Geoghegan
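The existing READ COMMITTED behaviour that the EvalPlanQual() analogy above leans on - a blocked row locker ends up operating on the latest committed row version, not the version its snapshot can see - looks like this in ordinary SQL (table and values invented for illustration):

    -- EvalPlanQual-style behaviour that already exists at READ COMMITTED.
    CREATE TABLE t (id int PRIMARY KEY, v int);
    INSERT INTO t VALUES (1, 0);

    -- Session 1:
    BEGIN;
    UPDATE t SET v = v + 1 WHERE id = 1;   -- v becomes 1; row now locked

    -- Session 2 (READ COMMITTED):
    UPDATE t SET v = v + 1 WHERE id = 1;   -- blocks on session 1's row lock

    -- Session 1:
    COMMIT;

    -- Session 2 unblocks, re-fetches the latest row version and applies
    -- its change to that version: v ends up as 2, even though session 2's
    -- snapshot only ever saw v = 0.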
On Sun, Sep 15, 2013 at 8:23 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> I'll find it very difficult to accept any implementation that is going >> to bloat things even worse than our upsert looping example. > > How would any even halfway sensible example cause *more* bloat than the > upsert looping thing? I was away in Chicago over the week, and didn't get to answer this. Sorry about that. In the average/uncontended case, the subxact example bloats less than all alternatives to my design proposed to date (including the "unborn heap tuple" idea Robert mentioned in passing to me in person the other day, which I think is somewhat similar to a suggestion of Heikki's [1]). The average case is very important, because in general contention usually doesn't happen. But you need to also appreciate that because of the way row locking works and the guarantees value locking makes, any ON DUPLICATE KEY LOCK FOR UPDATE implementation is going to have to potentially restart in more places (as compared to the doc's example), maybe including value locking of each unique index and certainly including row locking. So the contended case might even be worse as well. On average, it is quite likely that either the UPDATE or INSERT will succeed - there has to be some concurrent activity around the same values for either to fail, and in general that's quite unlikely. If the UPDATE doesn't succeed, it won't bloat, and it's then very likely that the INSERT at the end of the loop will go ahead and succeed without itself creating bloat. Going forward with this discussion, I would like us all to take as read that the buffer locking stuff is a prototype approach to value locking, to be refined later (subject to my basic design being judged fundamentally sound). I don't think anyone believes that it's fundamentally incorrect in that it doesn't do something that it claims to do (concerns are more around what it might do or prevent that it shouldn't), and it can still drive discussion in a very useful direction. So far criticism of this patch has almost entirely been on aspects of buffer locking, but it would be much more useful for the time being to simply assume that the buffer locks *are* interruptible. It's probably okay with me to still be a bit suspicious of deadlocking, though, because if we refine the buffer locking using a more granular SLRU value locking approach, that doesn't necessarily guarantee that it's impossible, even if it does (I guess) prevent undesirable interactions with other buffer locking. [1] http://www.postgresql.org/message-id/45E845C4.6030000@enterprisedb.com -- Peter Geoghegan
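For reference, the "upsert looping example" / "subxact example" under discussion is essentially the one in the documentation's PL/pgSQL chapter, paraphrased below (table and function names as in the docs). The inner BEGIN ... EXCEPTION block is the subtransaction in question; it only leaves behind a dead heap tuple when the INSERT actually fails with a unique violation, which requires both the UPDATE and the INSERT to lose a race within a narrow window:

create table db (a int primary key, b text);

create function merge_db(key int, data text) returns void as
$$
begin
    loop
        -- first try to update the key
        update db set b = data where a = key;
        if found then
            return;
        end if;
        -- not there, so try to insert the key;
        -- if someone else inserts the same key concurrently,
        -- we could get a unique-key failure
        begin
            insert into db(a, b) values (key, data);
            return;
        exception when unique_violation then
            -- do nothing, and loop to try the update again
        end;
    end loop;
end;
$$
language plpgsql;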
On Fri, Sep 20, 2013 at 5:48 PM, Peter Geoghegan <pg@heroku.com> wrote: >> ProcLockWakeup() only wakes as many waiters from the head of the queue >> as can all be granted the lock without any conflicts. So I don't >> think there is a race condition in that path. > > Right, but what about XactLockTableWait() itself? It only acquires a > ShareLock on the xid of the got-there-first inserter that potentially > hasn't yet committed/aborted. There will be no conflicts between > multiple second-chance-seeking blockers trying to acquire this lock > concurrently, and so in fact there is (what I guess you'd consider to > be) a race condition in the current btree insertion code. I should add: README.tuplock says the following: """ The protocol for waiting for a tuple-level lock is really LockTuple() XactLockTableWait() mark tuple as locked by me UnlockTuple() When there are multiple waiters, arbitration of who is to get the lock next is provided by LockTuple(). """ So because this isn't a tuple-level lock - it's really a value-level lock - LockTuple() is not called by the btree code at all, and so arbitration of who gets the lock is, as I've said, essentially undefined. -- Peter Geoghegan
On Sat, Sep 21, 2013 at 7:22 PM, Peter Geoghegan <pg@heroku.com> wrote: > So because this isn't a tuple-level lock - it's really a value-level > lock - LockTuple() is not called by the btree code at all, and so > arbitration of who gets the lock is, as I've said, essentially > undefined. Addendum: It isn't even a value-level lock, because the buffer locks are of course released before the XactLockTableWait() call. It's a simple attempt to acquire a shared lock on an xid. -- Peter Geoghegan
Hi, I don't have time to answer the other emails today (elections, climbing), but maybe you could clarify the below? On 2013-09-21 17:07:11 -0700, Peter Geoghegan wrote: > On Sun, Sep 15, 2013 at 8:23 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> I'll find it very difficult to accept any implementation that is going > >> to bloat things even worse than our upsert looping example. > > > > How would any even halfway sensible example cause *more* bloat than the > > upsert looping thing? > > I was away in Chicago over the week, and didn't get to answer this. > Sorry about that. > > In the average/uncontended case, the subxact example bloats less than > all alternatives to my design proposed to date (including the "unborn > heap tuple" idea Robert mentioned in passing to me in person the other > day, which I think is somewhat similar to a suggestion of Heikki's > [1]). The average case is very important, because in general > contention usually doesn't happen. I can't follow here. Why does e.g. the promise tuple approach bloat more than the subxact example? The protocol is roughly: 1) Insert index pointer containing an xid to be waiting upon instead of the target tid into all indexes 2) Insert heap tuple, we can be sure there's no conflict now 3) Go through the indexes and repoint the item to point to the tid of the heaptuple instead of the xid. There's zero heap or index bloat in the uncontended case. In the contended case it's just the promise tuples from 1) that are inserted before the conflict is detected. Those can be marked as dead when the conflict happened. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Sep 22, 2013 at 2:10 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I can't follow here. Why does e.g. the promise tuple approach bloat more > than the subxact example? > The protocol is roughly: > 1) Insert index pointer containing an xid to be waiting upon instead of > the target tid into all indexes > 2) Insert heap tuple, we can be sure there's no conflict now > 3) Go through the indexes and repoint the item to point to the tid of the > heaptuple instead of the xid. > > There's zero heap or index bloat in the uncontended case. In the > contended case it's just the promise tuples from 1) that are inserted > before the conflict is detected. Those can be marked as dead when the > conflict happened. It depends on your definition of the contended case. You're assuming that insertion is the most probable outcome, when in fact much of the time updating is just as likely or even more likely. Many promise tuples may be inserted before actually seeing a conflict and deciding to update/lock for update. In order for the example in the docs to bloat at all, both the UPDATE and the INSERT need to fail within a tiny temporal window - that's what I mean by uncontended (it is usually tiny because if the UPDATE blocks, that often means it will succeed anyway, but if not the INSERT will very probably succeed). This is because the UPDATE won't bloat when no existing row is seen, because its subplan will return no rows. The INSERT will only bloat if it fails, which is generally very unlikely because of the fact that the UPDATE just did nothing. Contrast that with bloating almost every time an UPDATE is necessary (I think that bloat that is generally cleaned up synchronously is still bloat). That's before we even talk about the additional overhead. Making the locks expensive to release/clean-up could really hurt, since it appears they'll *have* to be unlocked before row locking, and during that time concurrent activity affecting the row to be locked can necessitate a full restart - that's a window we want to keep as small as possible. I think reviewer time would for now be much better spent discussing the patch at a higher level (along the lines of my recent mail to Stephen and Robert). I've been at least as guilty as anyone else in getting mired in these details. We'll be much better equipped to have this discussion afterwards, because it isn't clear to us if we really need or would find it at all useful to have long-lasting value locks, how frequently we'll need to retry and for what reasons, and so on. My immediate concern as the patch author is to come up with a better answer to the problem that Robert described [1], because "hey, I locked the row -- you take it from here user that might not have any version of it visible to you" is not good enough. I hope that there isn't a tension between solving that problem and offering the flexibility and composability of the proposed syntax. [1] http://www.postgresql.org/message-id/AANLkTineR-rDFWENeddLg=GrkT+epMHk2j9X0YqpiTY8@mail.gmail.com -- Peter Geoghegan
On 2013-09-22 12:54:57 -0700, Peter Geoghegan wrote: > On Sun, Sep 22, 2013 at 2:10 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > I can't follow here. Why does e.g. the promise tuple approach bloat more > > than the subxact example? > > The protocol is roughly: > > 1) Insert index pointer containing an xid to be waiting upon instead of > > the target tid into all indexes > > 2) Insert heap tuple, we can be sure there's no conflict now > > 3) Go through the indexes and repoint the item to point to the tid of the > > heaptuple instead of the xid. > > > > There's zero heap or index bloat in the uncontended case. In the > > contended case it's just the promise tuples from 1) that are inserted > > before the conflict is detected. Those can be marked as dead when the > > conflict happened. > > It depends on your definition of the contended case. You're assuming > that insertion is the most probable outcome, when in fact much of the > time updating is just as likely or even more likely. Many promise > tuples may be inserted before actually seeing a conflict and deciding > to update/lock for update. I still fail to see how that's relevant. For every index there are two things that can happen: a) there's a conflicting tuple. In that case we can fail at that point/convert to an update. No Bloat. b) there's no conflicting tuple. In that case we will insert a promise tuple. If there's no conflict in further indexes (i.e. we INSERT), the promise will be converted to a plain tuple. If there *is* a further conflict, you *still* need the new index tuple because by definition (the index changed) it cannot be a HOT update. So you convert it as well. No Bloat. > I think that bloat that is generally cleaned up synchronously is still > bloat I don't think it's particularly relevant, because the above will just cause bloat in case of rollbacks and such, which is nothing new, but: I fail to see the point of such a position. > I think reviewer time would for now be much better spent discussing > the patch at a higher level (along the lines of my recent mail to > Stephen and Robert). Yes, I plan to reply to those, I just didn't have time to do so this weekend. There's other stuff than PG every now and then ;) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Sep 22, 2013 at 1:39 PM, Andres Freund <andres@2ndquadrant.com> wrote: > I still fail to see how that's relevant. For every index there are two > things that can happen: > a) there's a conflicting tuple. In that case we can fail at that > point/convert to an update. No Bloat. Well, yes - if the conflict is in the first unique index you look at. > b) there's no conflicting tuple. In that case we will insert a promise > tuple. Yeah, if there is no conflict relating to any of the tuples, the cost is limited to updating the promise tuples in-place. Not exactly a trivial additional cost even then, though, because you have to exclusive-lock and WAL-log twice per index tuple. > If there's no conflict in further indexes (i.e. we INSERT), the > promise will be converted to a plain tuple. Sure. > If there *is* a further > conflict, you *still* need the new index tuple because by definition > (the index changed) it cannot be a HOT update. By definition? What do you mean? This isn't MySQL's REPLACE. This feature is almost certainly going to tacitly require the user to write the upsert SQL with a particular unique index in mind (to figure that out for ourselves, we'd need to somehow ask or infer which one was meant, which is ugly, and somewhere between very hard and impossible). The UPDATE, as typically written, probably *won't* actually touch any of the other, incidentally unique-constrained/value-locked columns that we only have to check in case that's what the user really meant, and very probably won't touch the "interesting" column appearing in the UPDATE qual itself either, so it probably *will* be a HOT update. > So you convert it as > well. No Bloat. Even if this is a practical possibility, which I doubt, the bookkeeping sounds very messy and invasive indeed. > Yes, I plan to reply to those, I just didn't have time to do so this > weekend. Great, thanks. I cannot emphasize strongly enough that I think that's the way to frame all of this. So much so that I almost managed to resist answering the above points. :-) > There's other stuff than PG every now and then ;) Hope you enjoyed the hike. -- Peter Geoghegan
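As a concrete illustration of the kind of statement being described (the table and values are invented, and the DML uses the patch's proposed syntax): the upsert is written with the primary key in mind, the second unique index matters only as a possible source of conflicts, and the UPDATE touches nothing but an unindexed column, so it is eligible to be a HOT update.

create table accounts
(
    id      int4 primary key,
    email   text unique,
    balance numeric
);

with r as (
    insert into accounts(id, email, balance)
    values (1, 'a@example.com', 100)
    on duplicate key lock for update
    returning rejects *
)
update accounts set balance = accounts.balance + 100
from r where accounts.id = r.id;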
On Fri, Sep 20, 2013 at 8:48 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Sep 17, 2013 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sat, Sep 14, 2013 at 6:27 PM, Peter Geoghegan <pg@heroku.com> wrote: >>> Note that today there is no guarantee that the original waiter for a >>> duplicate-inserting xact to complete will be the first one to get a >>> second chance > >> ProcLockWakeup() only wakes as many waiters from the head of the queue >> as can all be granted the lock without any conflicts. So I don't >> think there is a race condition in that path. > > Right, but what about XactLockTableWait() itself? It only acquires a > ShareLock on the xid of the got-there-first inserter that potentially > hasn't yet committed/aborted. That's an interesting point. As you pointed out in later emails, that case is handled for heap tuple locks, but btree uniqueness conflicts are a different kettle of fish. > Yeah, you're right. As I mentioned to Andres already, when row locking > happens and there is this kind of conflict, my approach is to retry > from scratch (go right back to before value lock acquisition) in the > sort of scenario that generally necessitates EvalPlanQual() looping, > or to throw a serialization failure where that's appropriate. After an > unsuccessful attempt at row locking there could well be an interim > wait for another xact to finish, before retrying (at read committed > isolation level). This is why I think that value locking/retrying > should be cheap, and should avoid bloat if at all possible. > > Forgive me if I'm making a leap here, but it seems like what you're > saying is that the semantics of upsert that one might naturally expect > are *arguably* fundamentally impossible, because they entail > potentially locking a row that isn't current to your snapshot, Precisely. > and you cannot throw a serialization failure at read committed. Not sure that's true, but at least it might not be the most desirable behavior. > I respectfully > suggest that that exact definition of upsert isn't a useful one, > because other snapshot isolation/MVCC systems operating within the > same constraints must have the same issues, and yet they manage to > implement something that could be called upsert that people seem happy > with. Yeah. I wonder how they do that. > I wouldn't go that far. The number of possible additional primitives > that are useful isn't that high, unless we decide that LWLocks are > going to be a fundamentally different thing, which I consider > unlikely. I'm not convinced, but we can save that argument for another day. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Sep 23, 2013 at 12:49 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Right, but what about XactLockTableWait() itself? It only acquires a >> ShareLock on the xid of the got-there-first inserter that potentially >> hasn't yet committed/aborted. > > That's an interesting point. As you pointed out in later emails, that > case is handled for heap tuple locks, but btree uniqueness conflicts > are a different kettle of fish. Right. It suits my purposes to have the value locks be held for only an instant, because: 1) It will perform much better and deadlock much less in certain scenarios if sessions are given leeway to not block each other across multiple values in multiple unique indexes (i.e. we make them "considerate", like a person with a huge shopping cart who lets another person with one item go ahead of them in the queue, and perhaps in doing so greatly reduces their own wait, because the person with one item makes the cashier immediately say 'no' to the person with all those groceries. Ahem). I don't think that this implies any additional anomalies at read committed, and I'm reasonably confident that this doesn't regress things to any degree lock starvation wise - lock starvation can only come from a bunch of inserters of the same value that consistently abort, just like the present situation with one unique index (I think it's better with multiple unique indexes than with only one - more opportunities for the would-be-starved session to hear a definitive no answer and give up). 2) If and when we improve on the buffer locking stuff (by adding a value locking SLRU), it will probably be considerably easier to be able to assume that value locks are only held briefly. For example, maybe it's okay that the implementation doesn't allow page splits on value-locked pages, and maybe that makes things much easier to reason about. If you're determined to have a strict serial ordering of value locking *without serialization failures*, I think what I've already said about the interactions between row locking and value locking demonstrates that that's close to or actually impossible. Plus, it would really suck for performance if that SLRU had to actually swap value locks to and from disk, which becomes a real possibility if they're held for a long time (mere index scans aren't going to keep the cache warm, so the worst-case latency for an innocent inserter into some narrow range of values might be really bad). Speaking of ease of implementation, how do you guarantee that the value locking waiters get the right to insert in serial order (if that's something that you value, which I don't at RC)? You have to fix the same "race" that already exists when acquiring a ShareLock on an xid, and blocking on value lock acquisition. The only possible remedy I can see for that is to integrate heap and btree locking in a much more intimate and therefore sketchy way. You need something like LockTuple() to arbitrate ordering, but what, and how, and where, and with how many buffer locks held? Most importantly: 3) As I've already mentioned, heavy value locks (like promise tuples or similar schemes, as opposed to actual heavyweight locks) concomitantly increase the window in which a conflict can be created for row locking. Most transactions last but an instant, and so the fact that another session may already be blocked trying to lock the would-be-duplicate row perhaps isn't that relevant. Doing all that clean-up is going to give other sessions increased opportunity to lock the row themselves, and ruin our day.
But these points are about long held value locks, not the cost of making their acquisition relatively expensive or inexpensive (but still more or less instantaneous), so why mention that at all? Well, since we're blocking everyone else with our value locks, they get to have a bad day too. All the while, they're perhaps virtually pre-destined to find some row to lock, but the window for something to happen to that row for that to conflict with eventual row locking (to *unnecessarily* conflict, as for example when an innocuous HOT update occurs) gets larger and larger as they wait longer and longer on value locks. Loosely speaking, things get multiplicatively worse - total gridlock is probably possible, with the deadlock detector only breaking the gridlock up a bit if we get lucky (unless, maybe, if value locks last until transaction end...which I think is nigh on impossible anyway). The bottom line is that long lasting value locks - value locks that last the duration of a transaction and are acquired serially, while guaranteeing that the inserter that gets all the value locks needed itself gets to insert - have the potential to cascade horribly, in ways that I can only really begin to reason about. That is, they do *if* we presume that they have the interactions with row locking that I believe they do, a belief that no one has taken issue with yet. Even *considering* this is largely academic, though, because without some kind of miracle guaranteeing serial ordering, a miracle that doesn't allow for serialization failures and also doesn't seriously slow down, for example, updates (by making them care about value locks *before* they do anything, or in the case of HOT updates *at all*), all of this is _impossible_. So, I say let's just do the actually-serial-ordering for value lock acquisition with serialization failures where we're > read committed. I've seriously considered what it would take to do it any other way so things would work how you and Andres expect for read committed, and it makes my head hurt, because apart from seeming unnecessary to me, it also seems completely hopeless. Am I being too negative here? Well, I guess that's possible. The fact is that it's really hard to judge, because all of this is really hard to reason about. That's what I really don't like about it. >> I respectfully >> suggest that that exact definition of upsert isn't a useful one, >> because other snapshot isolation/MVCC systems operating within the >> same constraints must have the same issues, and yet they manage to >> implement something that could be called upsert that people seem happy >> with. > > Yeah. I wonder how they do that. My guess is that they have some fancy snapshot type that is used by the equivalent of ModifyTable subplans, that is appropriately paranoid about the Halloween problem and so on. How that actually might work is far from clear, but it's a question that I have begun to consider. As I said, a concern is that it would be in tension with the generalized, composable syntax, where we don't explicitly have a "special update". I'd really like to hear other opinions, though. -- Peter Geoghegan
On Mon, Sep 23, 2013 at 12:49 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> and you cannot throw a serialization failure at read committed. > > Not sure that's true, but at least it might not be the most desirable behavior. I'm pretty sure that that's totally true. "You don't have to worry about serialization failures at read committed, except when you do" seems kind of weak to me. Especially since none of the usual suspects say the same thing. That said, it sure would be convenient if it wasn't true! -- Peter Geoghegan
Hi, Various messages are discussing semantics around visibility. I by now have a hard time keeping track. So let's keep the discussion of the desired semantics to this thread. There have been some remarks about serialization failures in read committed transactions. I agree, those shouldn't occur. But I don't actually think they are so much of a problem if we follow the path set by existing uses of the EPQ logic. The scenario described seems to be an UPSERT conflicting with a row it cannot see in the original snapshot of the query. In that case I think we just have to follow the example set by ExecUpdate, ExecDelete and heap_lock_tuple. Use the EPQ machinery (or an alternative approach with similar enough semantics) to get a new snapshot and follow the ctid chain. When we've found the end of the chain we try to update that tuple. That surely isn't free of surprising semantics, but it would follow existing semantics. Which everybody writing concurrent applications in read committed should (but doesn't) know. Adding a different set of semantics seems like a bad idea. Robert seems to have been the primary sceptic around this; what scenario are you actually concerned about? There are some scenarios that this doesn't trivially answer. But I'd like to understand the primary concerns first. Regards, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Sep 23, 2013 at 7:05 PM, Peter Geoghegan <pg@heroku.com> wrote: > It suits my purposes to have the value locks be held for only an > instant, because: > > [ detailed explanation ] I don't really disagree with any of that. TBH, I think the question of how long value locks (as you call them) are held is going to boil down to a question of how they end up being implemented. As I mentioned to you at PG Open (going through the details here for those following along at home), we could optimistically insert the new heap tuple, then go add index entries for it, and if we find a conflict, then instead of erroring out, we mark the tuple we were inserting dead and go try to update the conflicting tuple. In that implementation, if we find that we have to wait for some other transaction along the way, it's probably not worth reversing out the index entries already inserted, because getting them into the index in the first place was a WAL-logged operation, and therefore relatively expensive, and IMHO it's most likely better to just hope things work out than to risk having to redo all of that. On the other hand, if the locks are strictly in-memory, then the cost of releasing them all before we go to wait, and of reacquiring them after we finish waiting, is pretty low. There might be some modularity issues to work through there, but they might not turn out to be very painful, and the advantages you mention are certainly worth accruing if it turns out to be fairly straightforward. Personally, I think that trying to keep it all in-memory is going to be hard. The problem is that we can't de-optimize regular inserts or updates to any significant degree to cater to this feature - because as valuable as this feature is, the number of times it gets used is still going to be a whole lot smaller than the number of times it doesn't get used. Also, I tend to think that we might want to define the operation as a REPLACE-type operation with respect to a certain set of key columns; and so we'll do the insert-or-update behavior with respect only to the index on those columns and let the chips fall where they may with respect to any others. In that case this all becomes much less urgent. > Even *considering* this is largely academic, though, because without > some kind of miracle guaranteeing serial ordering, a miracle that > doesn't allow for serialization failures and also doesn't seriously > slow down, for example, updates (by making them care about value locks > *before* they do anything, or in the case of HOT updates *at all*), > all of this is _impossible_. So, I say let's just do the > actually-serial-ordering for value lock acquisition with serialization > failures where we're > read committed. I've seriously considered what > it would take to do it any other way so things would work how you and > Andres expect for read committed, and it makes my head hurt, because > apart from seeming unnecessary to me, it also seems completely > hopeless. > > Am I being too negative here? Well, I guess that's possible. The fact > is that it's really hard to judge, because all of this is really hard > to reason about. That's what I really don't like about it. Suppose we define the operation as REPLACE rather than INSERT...ON DUPLICATE KEY LOCK FOR UPDATE. Then we could do something like this: 1. Try to insert a tuple. If no unique index conflicts occur, stop. 2. Note the identity of the conflicting tuple and mark the inserted heap tuple dead. 3.
If the conflicting tuple's inserting transaction is still in progress, wait for the inserting transaction to end. 4. If the conflicting tuple is dead (e.g. because the inserter aborted), start over. 5. If the conflicting tuple's key columns no longer match the key columns of the REPLACE operation, start over. 6. If the conflicting tuple has a valid xmax, wait for the deleting or locking transaction to end. If xmax is still valid, follow the CTID chain to the updated tuple, let that be the new conflicting tuple, and resume from step 5. 7. Update the tuple, even though it may be invisible to our snapshot (a deliberate MVCC violation!). While this behavior is admittedly wonky from an MVCC perspective, I suspect that it would make a lot of people happy. >>> I respectfully >>> suggest that that exact definition of upsert isn't a useful one, >>> because other snapshot isolation/MVCC systems operating within the >>> same constraints must have the same issues, and yet they manage to >>> implement something that could be called upsert that people seem happy >>> with. >> >> Yeah. I wonder how they do that. > > My guess is that they have some fancy snapshot type that is used by > the equivalent of ModifyTable subplans, that is appropriately paranoid > about the Halloween problem and so on. How that actually might work is > far from clear, but it's a question that I have begun to consider. As > I said, a concern is that it would be in tension with the generalized, > composable syntax, where we don't explicitly have a "special update". > I'd really like to hear other opinions, though. The tension here feels fairly fundamental to me; I don't think our implementation is to blame. I think the problem isn't so much to figure out a clever trick that will make this all work in a truly elegant fashion as it is to decide exactly how we're going to compromise MVCC semantics in the least blatant way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 24, 2013 at 5:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Various messages are discussing semantics around visibility. I by now > have a hard time keeping track. So let's keep the discussion of the > desired semantics to this thread. > > There have been some remarks about serialization failures in read > committed transactions. I agree, those shouldn't occur. But I don't > actually think they are so much of a problem if we follow the path set > by existing uses of the EPQ logic. The scenario described seems to be an > UPSERT conflicting with a row it cannot see in the original snapshot of > the query. > In that case I think we just have to follow the example set by > ExecUpdate, ExecDelete and heap_lock_tuple. Use the EPQ machinery (or an > alternative approach with similar enough semantics) to get a new > snapshot and follow the ctid chain. When we've found the end of the > chain we try to update that tuple. > That surely isn't free of surprising semantics, but it would follow existing > semantics. Which everybody writing concurrent applications in read > committed should (but doesn't) know. Adding a different set of semantics > seems like a bad idea. > Robert seems to have been the primary sceptic around this; what scenario > are you actually concerned about? I'm not skeptical about offering it as an option; in fact, I just suggested basically the same thing on the other thread, before reading this. Nonetheless it IS an MVCC violation; the chances that someone will be able to use this new facility to demonstrate serialization anomalies that can't occur today seem very high to me. I feel it's perfectly fine to respond to that by saying: yep, we know that's possible, and if it's a concern in your environment then don't use this feature. But it should be clearly documented. I do think that it will be easier to get this to work if we define the operation as REPLACE, bundling all of the magic inside a single SQL command. If the user issues an INSERT first and then must try an UPDATE afterwards if the INSERT doesn't actually insert, then you're going to have problems if the UPDATE can't see the tuple with which the INSERT conflicted, and you're going to need some kind of a loop in case the UPDATE itself fails. Even if we can work out all the details, a single command that does insert-or-update seems like it will be easier to use and more efficient. You might also want to insert multiple tuples using INSERT ... VALUES (...), (...), (...); figuring out which ones were inserted and which ones must now be updated seems like a chore better avoided. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 24, 2013 at 7:35 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I don't really disagree with any of that. TBH, I think the question > of how long value locks (as you call them) are held is going to boil > down to a question of how they end up being implemented. Well, I think we can rule out value locks that are held for the duration of a transaction right away. That's just not going to fly. > As I mentioned to you at PG Open (going through the details here for those > following along at home), we could optimistically insert the new heap > tuple, then go add index entries for it, and if we find a conflict, > then instead of erroring out, we mark the tuple we were inserting dead > and go try to update the conflicting tuple. In that > implementation, if we find that we have to wait for some other > transaction along the way, it's probably not worth reversing out the > index entries already inserted, because getting them into the index in > the first place was a WAL-logged operation, and therefore relatively > expensive, and IMHO it's most likely better to just hope things work > out than to risk having to redo all of that. I'm afraid that there are things that concern me about this design. It does have one big advantage over promise tuples, which is that the possibility of index-only bloat, and even the possible need to freeze indexes separately from their heap relation, are averted (or are you going to have recovery do promise clean-up instead? Does recovery look for an eventual successful insertion relating to the promise? How far does it look?). However, while I'm just as concerned as you that backing out is too expensive, I'm equally concerned that there is no reasonable alternative to backing out, which is why cheap, quick in-memory value locks are so compelling to me. See my remarks below. > On the other hand, if the locks are strictly in-memory, then the cost > of releasing them all before we go to wait, and of reacquiring them > after we finish waiting, is pretty low. There might be some > modularity issues to work through there, but they might not turn out > to be very painful, and the advantages you mention are certainly worth > accruing if it turns out to be fairly straightforward. It's certainly a difficult situation to judge. > Personally, I think that trying to keep it all in-memory is going to > be hard. The problem is that we can't de-optimize regular inserts or > updates to any significant degree to cater to this feature - because > as valuable as this feature is, the number of times it gets used is > still going to be a whole lot smaller than the number of times it > doesn't get used. Right - I don't think that anyone would argue that any other standard should be applied. Fortunately, I'm reasonably confident that it can work. The last part of index tuple insertion, where we acquire an exclusive lock on a buffer, needs to look out for a page header bit (on pages considered for insertion of the value in question). The cost of that to anyone not using this feature is likely to be infinitesimally small. We can leave clean-up of that bit to the next inserter, who needs the exclusive lock anyway and doesn't find a corresponding SLRU entry. But really, that's a discussion for another day. I think we'd want to track value locks per pinned-by-upserter buffer, to localize any downsides on concurrency.
If we forbid page-splits in respect of a value-locked page, we can still have a stable value (buffer number) to use within a shared memory hash table, or something along those lines. We're still going to want to minimize the duration of locking under this scheme, by doing TOASTing before locking values and so on, which is quite possible. If we're really lucky, maybe the value locking stuff can be generalized or re-used as part of a btree index insertion buffer feature. > Also, I tend to think that we might want to define > the operation as a REPLACE-type operation with respect to a certain > set of key columns; and so we'll do the insert-or-update behavior with > respect only to the index on those columns and let the chips fall > where they may with respect to any others. In that case this all > becomes much less urgent. Well, MySQL's REPLACE does zero or more DELETEs followed by an INSERT, not try an INSERT, then maybe mark the heap tuple if there's a unique index dup and then go UPDATE the conflicting tuple. I mention this only because the term REPLACE has a certain baggage, and I feel it's important to be careful about such things. The only way that's going to work is if you say "use this unique index", which will look pretty gross in DML. That might actually be okay with me if we had somewhere to go from there in a future release, but I doubt that's the case. Another issue is that I'm not sure that this helps Andres much (or rather, clients of the logical changeset generation infrastructure that need to do conflict resolution), and that matters a lot to me here. > Suppose we define the operation as REPLACE rather than INSERT...ON > DUPLICATE KEY LOCK FOR UPDATE. Then we could do something like this: > > 1. Try to insert a tuple. If no unique index conflicts occur, stop. > 2. Note the identity of the conflicting tuple and mark the inserted > heap tuple dead. > 3. If the conflicting tuple's inserting transaction is still in > progress, wait for the inserting transaction to end. Sure, this is basically what the code does today (apart from marking a just-inserted tuple dead). > 4. If the conflicting tuple is dead (e.g. because the inserter > aborted), start over. Start over from where? I presume you mean the index tuple insertion, as things are today. Or do you mean the very start? > 5. If the conflicting tuple's key columns no longer match the key > columns of the REPLACE operation, start over. What definition of equality or inequality? I think you're going to have to consider stashing information about the btree operator class, which seems not ideal - a modularity violation beyond what we already do in, say, execQual.c, I think. I think in general we have to worry about the distinction between a particular btree operator class's idea of equality (doesn't have to be = operator), that exists for a particular index, and some qual's idea of equality. It would probably be quite invasive to fix this, which I for one would find hard to justify. I think my scheme is okay here while yours isn't, because mine involves row locking only, and hoping that nothing gets updated in that tiny window after transaction commit - if it doesn't, that's good enough for us, because we know that the btree code's opinion still holds - if I'm not mistaken, *nothing* can have changed to the logical row without us hearing about it (i.e. without heap_lock_tuple() returning HeapTupleUpdated). 
On the other hand, you're talking about concluding that something is not a duplicate in a way that needs to satisfy btree unique index equality (so whatever operator is associated with btree strategy number 3, equality, for some particular unique index with some particular operator class) and not necessarily just a qual written with a potentially distinct notion of equality in respect of the relevant unique-constrained datums. Maybe you can solve this one problem, but the fact remains that to do so would be a pretty bad modularity violation, even by the standards of the existing btree code. That's the basic reason why I'm averse to using EvalPlanQual() in this fashion, or in a similar fashion. Even if you solve all the problems for btree, I can't imagine what type of burden it puts on amcanunique AM authors generally - I know at least one person who won't be happy with that. :-) > 6. If the conflicting tuple has a valid xmax, wait for the deleting or > locking transaction to end. If xmax is still valid, follow the CTID > chain to the updated tuple, let that be the new conflicting tuple, and > resume from step 5. So you've arbitrarily restricted us to one value lock and one row lock per REPLACE slot processed, which sort of allows us to avoid solving the basic problem of value locking, because it isn't too bad now - no need to backtrack across indexes. Clean-up (marking the heap tuple dead) is much more expensive than releasing locks in memory (although much less expensive than promise tuple killing), but needing to clean-up is maybe less likely because conflicts can only come from one unique index. Has this really bought us anything, though? Consider that conflicts are generally only expected on one unique index anyway. Plus you still have the disconnect between value and row locking, as far as I can tell - "start from scratch" remains a possible step until very late, except you pay a lot more for clean-up - avoiding that expensive clean-up is the major benefit of introducing an SLRU-based shadow value locking scheme to the btree code. I don't see that there is a way to deal with the value locking/row locking disconnect other than to live with it in a smart way. Anyway, your design probably avoids the worst kind of gridlock. Let's assume that it works out -- my next question has to be, where can we go from there? > 7. Update the tuple, even though it may be invisible to our snapshot > (a deliberate MVCC violation!). I realize that you just wanted to sketch a design, but offhand I think that the basic problem with what you describe is that it isn't accepting of the inevitability of there being a disconnect between value and row locking. Also, this doesn't fit with any roadmap for getting a real upsert, and compromises the conceptual integrity of the AM in a way that isn't likely to be accepted, and, at the risk of saying too much before you've defended your design, perhaps even necessitates invasive changes to the already extremely complicated row locking code. > While this behavior is admittedly wonky from an MVCC perspective, I > suspect that it would make a lot of people happy. "Wonky from an MVCC perspective" is the order of the day here. :-) >> My guess is that they have some fancy snapshot type that is used by >> the equivalent of ModifyTable subplans, that is appropriately paranoid >> about the Halloween problem and so on. How that actually might work is >> far from clear, but it's a question that I have begun to consider. 
As >> I said, a concern is that it would be in tension with the generalized, >> composable syntax, where we don't explicitly have a "special update". >> I'd really like to hear other opinions, though. > > The tension here feels fairly fundamental to me; I don't think our > implementation is to blame. I think the problem isn't so much to > figure out a clever trick that will make this all work in a truly > elegant fashion as it is to decide exactly how we're going to > compromise MVCC semantics in the least blatant way. Yeah, I totally understand the problem that way. I think it would be a bit of a pity to give up the composability, which I liked, but it's something that we'll have to consider. On the other hand, perhaps we can get away with it - we simply don't know enough yet. -- Peter Geoghegan
On Sat, Sep 21, 2013 at 05:07:11PM -0700, Peter Geoghegan wrote: > In the average/uncontended case, the subxact example bloats less than > all alternatives to my design proposed to date (including the "unborn > heap tuple" idea Robert mentioned in passing to me in person the other > day, which I think is somewhat similar to a suggestion of Heikki's > [1]). The average case is very important, because in general > contention usually doesn't happen. This thread had a lot of discussion about bloating. I wonder, does the code check to see if there is a matching row _before_ adding any data? Our test-and-set code first checks to see if the lock is free, then, if it is, it locks the bus and does a test-and-set. Couldn't we easily check the indexes for matches before doing any locking? It seems that would avoid bloat in most cases, and allow for a simpler implementation. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Sep 25, 2013 at 8:19 PM, Bruce Momjian <bruce@momjian.us> wrote: > This thread had a lot of discussion about bloating. I wonder, does the > code check to see if there is a matching row _before_ adding any data? That's pretty much what the patch does. > Our test-and-set code first checks to see if the lock is free, then if > it it is, it locks the bus and does a test-and-set. Couldn't we easily > check the indexes for matches before doing any locking? It seems that > would avoid bloat in most cases, and allow for a simpler implementation. The value locks are only really necessary for getting consensus across unique indexes on whether or not to go forward, and to ensure that insertion can *finish* unhindered once we're sure that's appropriate. Once we've committed to insertion, we hold them across heap tuple insertion and release each value lock as part of something close to conventional btree index tuple insertion (with an index tuple with an ordinary heap pointer inserted). I believe that all schemes proposed to date have some variant of what could be described as value locking, such as ordinary index tuples inserted speculatively. Value locks are *not* held during row locking, and an attempt at row locking is essentially opportunistic for various reasons (it boils down to the fact that re-verifying uniqueness outside of the btree code is very unappealing, and in any case would naturally sometimes be insufficient - what happens if key values change across row versions?). This might sound a bit odd, but is in a sense no different to the current state of affairs, where the first waiter on a blocking xact that inserted a would-be duplicate is not guaranteed to be the first to get a second chance at inserting. I don't believe that there are any significant additional lock starvation hazards. In the simple case where there is a conflicting tuple that's already committed, value locks above and beyond what the btree code does today are unnecessary (provided the attempt to acquire a row lock is eventually successful, which mostly means that no one else has updated/deleted - otherwise we try again). -- Peter Geoghegan
On Wed, Sep 25, 2013 at 08:48:11PM -0700, Peter Geoghegan wrote: > On Wed, Sep 25, 2013 at 8:19 PM, Bruce Momjian <bruce@momjian.us> wrote: > > This thread had a lot of discussion about bloating. I wonder, does the > > code check to see if there is a matching row _before_ adding any data? > > That's pretty much what the patch does. So, I guess my question is if we are only bloating on a contended operation, do we expect that to happen so much that bloat is a problem? I think the big objection to the patch is the additional code complexity and the potential to slow down other sessions. If it is only bloating on a contended operation, are these two downsides worth avoiding the bloat? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Thu, Sep 26, 2013 at 07:43:15AM -0400, Bruce Momjian wrote: > On Wed, Sep 25, 2013 at 08:48:11PM -0700, Peter Geoghegan wrote: > > On Wed, Sep 25, 2013 at 8:19 PM, Bruce Momjian <bruce@momjian.us> wrote: > > > This thread had a lot of discussion about bloating. I wonder, does the > > > code check to see if there is a matching row _before_ adding any data? > > > > That's pretty much what the patch does. > > So, I guess my question is if we are only bloating on a contended > operation, do we expect that to happen so much that bloat is a problem? > > I think the big objection to the patch is the additional code complexity > and the potential to slow down other sessions. If it is only bloating > on a contended operation, are these two downsides worth avoiding the > bloat? Also, this isn't like the case where we are incrementing sequences --- I am unclear what workload is going to cause a lot of contention. If two sessions try to insert the same key, there will be bloat, but later upsert operations will already see the insert and not cause any bloat. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Thu, Sep 26, 2013 at 4:43 AM, Bruce Momjian <bruce@momjian.us> wrote: > So, I guess my question is if we are only bloating on a contended > operation, do we expect that to happen so much that bloat is a problem? Maybe I could have done a better job of explaining the nature of my concerns around bloat. I am specifically concerned about bloat and the clean-up of bloat that occurs between (or during) value locking and eventual row locking, because of the necessarily opportunistic nature of the way we go from one to the other. Bloat, and the obligation to clean it up synchronously, make row lock conflicts more likely. Conflicts make bloat more likely, because a conflict implies that another iteration, complete with more bloat, is necessary. When you consider that the feature will frequently be used with the assumption that updating is a much more likely outcome, it becomes clear that we need to be careful about this sort of interplay. Having said all that, I would have no objection to some reasonable, bounded amount of bloat occurring elsewhere if that made sense. For example, I'd certainly be happy to consider the question of whether or not it's worth doing a kind of speculative heap insertion before acquiring value locks, because that doesn't need to happen again and again in the same, critical place, in the interim between value locking and row locking. The advantage of doing that particular thing would be to reduce the duration that value locks are held - the disadvantages would be the *usual* disadvantages of bloat. However, this is obviously a premature discussion to have now, because the eventual exact nature of value locks is not yet known. > I think the big objection to the patch is the additional code complexity > and the potential to slow down other sessions. If it is only bloating > on a contended operation, are these two downsides worth avoiding the > bloat? I believe that all other schemes proposed have some degree of bloat even in the uncontended case, because they optimistically assume that an insert will occur, when in general an update is perhaps just as likely, and will bloat just the same. So, as I've said before, the definition of uncontended is important here. There is no reason to assume that alternative proposals will affect concurrency any less than my proposal - the buffer locking thing certainly isn't essential to my design. You need to weigh things like the multiple rounds of WAL-logging that other proposals require. You're right to say that all of this is complex, but I really think that quite apart from anything else, my design is simpler than others. For example, the design that Robert sketched would introduce a fairly considerable modularity violation, per my recent remarks to him, and actually plastering over that would be a significant undertaking. Now, you might counter, "but those other designs haven't been worked out enough". That's true, but then my efforts to work them out further by pointing out problems with them haven't gone very far. I have sincerely tried to see a way to make them work. -- Peter Geoghegan
On Tue, Sep 24, 2013 at 10:15 PM, Peter Geoghegan <pg@heroku.com> wrote: > Well, I think we can rule out value locks that are held for the > duration of a transaction right away. That's just not going to fly. I think I agree with that. I don't think I remember hearing that proposed. > If we're really lucky, maybe the value locking stuff can be > generalized or re-used as part of a btree index insertion buffer > feature. Well, that would be nifty. >> Also, I tend to think that we might want to define >> the operation as a REPLACE-type operation with respect to a certain >> set of key columns; and so we'll do the insert-or-update behavior with >> respect only to the index on those columns and let the chips fall >> where they may with respect to any others. In that case this all >> becomes much less urgent. > > Well, MySQL's REPLACE does zero or more DELETEs followed by an INSERT, > not try an INSERT, then maybe mark the heap tuple if there's a unique > index dup and then go UPDATE the conflicting tuple. I mention this > only because the term REPLACE has a certain baggage, and I feel it's > important to be careful about such things. I see. Well, we could try to mimic their semantics, I suppose. Those semantics seem like a POLA violation to me; who would have thought that a REPLACE could delete multiple tuples? But what do I know? > The only way that's going to work is if you say "use this unique > index", which will look pretty gross in DML. That might actually be > okay with me if we had somewhere to go from there in a future release, > but I doubt that's the case. Another issue is that I'm not sure that > this helps Andres much (or rather, clients of the logical changeset > generation infrastructure that need to do conflict resolution), and > that matters a lot to me here. Yeah, it's kind of awful. >> Suppose we define the operation as REPLACE rather than INSERT...ON >> DUPLICATE KEY LOCK FOR UPDATE. Then we could do something like this: >> >> 1. Try to insert a tuple. If no unique index conflicts occur, stop. >> 2. Note the identity of the conflicting tuple and mark the inserted >> heap tuple dead. >> 3. If the conflicting tuple's inserting transaction is still in >> progress, wait for the inserting transaction to end. > > Sure, this is basically what the code does today (apart from marking a > just-inserted tuple dead). > >> 4. If the conflicting tuple is dead (e.g. because the inserter >> aborted), start over. > > Start over from where? I presume you mean the index tuple insertion, > as things are today. Or do you mean the very start? Yes, that's what I meant. >> 5. If the conflicting tuple's key columns no longer match the key >> columns of the REPLACE operation, start over. > > What definition of equality or inequality? Binary equality, same as we'd use to decide whether an update can be done HOT. >> 7. Update the tuple, even though it may be invisible to our snapshot >> (a deliberate MVCC violation!). > > I realize that you just wanted to sketch a design, but offhand I think > that the basic problem with what you describe is that it isn't > accepting of the inevitability of there being a disconnect between > value and row locking. Also, this doesn't fit with any roadmap for > getting a real upsert, Well, there are two separate issues here: what to do about MVCC, and how to do the locking. From an MVCC perspective, I can think of only two behaviors when the conflicting tuple is committed but invisible: roll back, or update it despite it being invisible. 
If you're saying you don't like either of those choices, I couldn't agree more, but I don't have a third idea. If you do, I'm all ears. In terms of how to do the locking, what I'm mostly saying is that we could try to implement this in a way that invents as few new concepts as possible. No promise tuples, no new SLRU, no new page-level bits, just index tuples and heap tuples and so on. Ideally, we don't even change the WAL format, although step 2 might require a new record type. To the extent that what I actually described was at variance with that goal, consider it a defect in my explanation rather than an intent to vary. I think there's value in considering such an implementation because each new thing that we have to introduce in order to get this feature is a possible reason for it to be rejected - for modularity reasons, or because it hurts performance elsewhere, or because it's more code we have to maintain, or whatever. Now, what I hear you saying is, gee, the performance of that might be terrible. I'm not sure that I believe that, but it's possible that you're right. Much seems to depend on what you think the frequency of conflicts will be, and perhaps I'm assuming it will be low while you're assuming a higher value. Regardless, if the performance of the sort of implementation I'm talking about would be terrible (under some agreed-upon definition of what terrible means in this context), then that's a good argument for not doing it that way. I'm just not convinced that's the case. Basically, if there's a way we can do this without changing the on-disk format (even in a backward-compatible way), I'd be strongly inclined to go that route unless we have a really compelling reason to believe it's going to suck (or be outright impossible). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
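As an aside that may make the binary-equality test mentioned above more concrete (the examples are illustrative and not from the patch): binary equality is strictly narrower than operator class equality, so a check based on it can only cause spurious restarts, never miss a real change to the key columns. For instance, values that a unique index's default operator class treats as equal need not be byte-for-byte identical:

-- operator class equality says these are equal
select 1.0::numeric = 1.00::numeric;        -- true

-- but the stored representations differ (numeric preserves display scale),
-- so a byte-wise comparison of the kind used to judge HOT applicability
-- would report that the column changed
select 1.0::numeric::text as a, 1.00::numeric::text as b;   -- '1.0' vs '1.00'

-- case-insensitive types behave similarly: with the citext extension,
-- 'Foo' and 'foo' collide in a unique index despite not being binary-equal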
On Thu, Sep 26, 2013 at 3:07 PM, Peter Geoghegan <pg@heroku.com> wrote: > When you consider that the feature will frequently be used with the > assumption that updating is a much more likely outcome, it becomes > clear that we need to be careful about this sort of interplay. I think one thing that's pretty clear at this point is that almost any version of this feature could be optimized for either the insert case or the update case. For example, my proposal could be modified to search for a conflicting tuple first, potentially wasting an index probe (or multiple index probes, if you want to search for potential conflicts in multiple indexes) if we're inserting, but winning heavily in the update case. As written, it's optimized for the insert case. In fact, I don't know how to know which of these things we should optimize for. I wrote part of the code for an EDB proprietary feature that can do insert-or-update loads about 6 months ago[1], and we optimized it for updates. That was not, however, a matter of principle; it just turned out to be easier to implement that way. In fact, I would have assumed that the insert-mostly case was more likely, but I think the real answer is that some environments will be insert-mostly and some will be update-mostly and some will be a mix. If we really want to squeeze out every last drop of possible performance, we might need two modes: one that assumes we'll mostly insert, and another that assumes we'll mostly update. That seems a frustrating amount of detail to have to expose to the user; an implementation that was efficient in both cases would be very desirable, but I do not have a good idea how to get there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] In case you're wondering, attempting to use that feature to upsert an invisible tuple will result in the load failing with a unique index violation.
On Thu, Sep 26, 2013 at 03:33:34PM -0400, Robert Haas wrote: > On Thu, Sep 26, 2013 at 3:07 PM, Peter Geoghegan <pg@heroku.com> wrote: > > When you consider that the feature will frequently be used with the > > assumption that updating is a much more likely outcome, it becomes > > clear that we need to be careful about this sort of interplay. > > I think one thing that's pretty clear at this point is that almost any > version of this feature could be optimized for either the insert case > or the update case. For example, my proposal could be modified to > search for a conflicting tuple first, potentially wasting an index > probes (or multiple index probes, if you want to search for potential > conflicts in multiple indexes) if we're inserting, but winning heavily > in the update case. As written, it's optimized for the insert case. I assumed the code was going to do the index lookups first without a lock, and take the appropriate action, insert or update, with fallbacks for guessing wrong. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
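For what it's worth, the "look the value up first, then fall back" approach Bruce describes is roughly the retry loop that applications already have to hand-roll today, with no new value locking machinery at all. A minimal sketch of that pattern, close to the example in the PL/pgSQL documentation and optimized for the update-mostly case (table and function names are hypothetical):

CREATE TABLE tab (k int4 PRIMARY KEY, v text);

CREATE FUNCTION upsert_tab(key int4, val text) RETURNS void AS
$$
BEGIN
    LOOP
        -- Guess "update" first; this is cheap when the row usually exists.
        UPDATE tab SET v = val WHERE k = key;
        IF found THEN
            RETURN;
        END IF;
        -- Not there, so try to insert.  Another session may beat us to it,
        -- in which case we loop around and retry the update.
        BEGIN
            INSERT INTO tab(k, v) VALUES (key, val);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- lost the race; go around again
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

The point of the proposed syntax is, of course, to make this dance and its failure modes unnecessary.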
On Thu, Sep 26, 2013 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Well, I think we can rule out value locks that are held for the >> duration of a transaction right away. That's just not going to fly. > > I think I agree with that. I don't think I remember hearing that proposed. I think I might have been unclear - I mean locks that are held for the duration of *another* transaction, not our own, as we wait for that other transaction to commit/abort. I think that earlier remarks from yourself and Andres implied that this would be necessary. Perhaps I'm mistaken. Your most recent design proposal doesn't do this, but I think that that's only because it restricts the user to a single unique index - it would otherwise be necessary to sit on the earlier value locks (index tuples belonging to an unfinished transaction) pending the completion of some other conflicting transaction, which has numerous disadvantages (as described in my "it suits my purposes to have the value locks be held for only an instant" mail to you [1]). >> If we're really lucky, maybe the value locking stuff can be >> generalized or re-used as part of a btree index insertion buffer >> feature. > > Well, that would be nifty. Yes, it would. I think, based on a conversation with Rob Wultsch, that it's another area where MySQL still does quite a bit better. > I see. Well, we could try to mimic their semantics, I suppose. Those > semantics seem like a POLA violation to me; who would have thought > that a REPLACE could delete multiple tuples? But what do I know? I think that it's fairly widely acknowledged to not be very good. Every MySQL user uses INSERT...ON DUPLICATE KEY UPDATE instead. >> The only way that's going to work is if you say "use this unique >> index", which will look pretty gross in DML. > Yeah, it's kind of awful. It is. >> What definition of equality or inequality? > > Binary equality, same as we'd use to decide whether an update can be done HOT. I guess that's acceptable in theory, because binary equality is necessarily a *stricter* condition than equality according to some operator that is an equivalence relation. But the fact remains that you're just ameliorating the problem by making it happen less often (both through this kind of trick, but also by restricting us to one unique index), not actually fixing it. > Well, there are two separate issues here: what to do about MVCC, and > how to do the locking. Totally agreed. Fortunately, unlike the different aspects of value and row locking, I think that these two questions can be reasonably considered independently. > From an MVCC perspective, I can think of only > two behaviors when the conflicting tuple is committed but invisible: > roll back, or update it despite it being invisible. If you're saying > you don't like either of those choices, I couldn't agree more, but I > don't have a third idea. If you do, I'm all ears. I don't have another idea either. In fact, I'd go so far as to say that doing any third thing that's better than those two to any reasonable person is obviously impossible. But I'd add that we simply cannot roll back at read committed, so we're just going to have to hold our collective noses and do strange things with visibility. FWIW, I'm tentatively looking at doing something like this:

*************** HeapTupleSatisfiesMVCC(HeapTuple htup, S
*** 958,963 ****
--- 959,975 ----
  		 * By here, the inserting transaction has committed - have to check
  		 * when...
  		 */
+
+ 		/*
+ 		 * Not necessarily visible to snapshot under conventional MVCC rules, but
+ 		 * still locked by our xact and not updated -- importantly, normal MVCC
+ 		 * semantics apply when we update the row, so only one version will be
+ 		 * visible at once
+ 		 */
+ 		if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
+ 			TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
+ 			return true;
+
  		if (XidInMVCCSnapshot(HeapTupleHeaderGetXmin(tuple), snapshot))
  			return false;		/* treat as still in progress */

This is something that I haven't given remotely enough thought yet, so please take it with a big grain of salt. > In terms of how to do the locking, what I'm mostly saying is that we > could try to implement this in a way that invents as few new concepts > as possible. No promise tuples, no new SLRU, no new page-level bits, > just index tuples and heap tuples and so on. Ideally, we don't even > change the WAL format, although step 2 might require a new record > type. To the extent that what I actually described was at variance > with that goal, consider it a defect in my explanation rather than an > intent to vary. I think there's value in considering such an > implementation because each new thing that we have to introduce in > order to get this feature is a possible reason for it to be rejected - > for modularity reasons, or because it hurts performance elsewhere, or > because it's more code we have to maintain, or whatever. There is certainly value in considering that, and you're right to take that tack - it is generally valuable to have a patch be minimally invasive. However, ultimately that's just one aspect of any given design, an aspect that needs to be weighed against others where there is a tension. Obviously in this instance I believe, rightly or wrongly, that doing more - adding more infrastructure than might be considered strictly necessary - is the least worst thing. Also, sometimes the apparent similarity of a design to what we have today is illusory - certainly, I think you'd at least agree that the problems that bloating during the interim between value locking and row locking present are qualitatively different to other problems that bloat presents in all existing scenarios. FWIW I'm not doing things this way because I'm ambitious, and am willing to risk not having my work accepted if that means I might get something that performs better, or has more features (like not requiring the user to specify a unique index in DML). Rather, I'm doing things this way because I sincerely believe that on balance mine is the best, most forward-thinking design proposed to date, and therefore the design most likely to ultimately be accepted (even though I do of course accept that there are numerous aspects that need to be worked out still). If the whole design is ultimately not accepted, that's something that I'll have to deal with, but understand that I don't see any way to play it safe here (except, I suppose, to give up now). > Now, what I hear you saying is, gee, the performance of that might be > terrible. I'm not sure that I believe that, but it's possible that > you're right. I think that the average case will be okay, but not great. I think that the worst case performance may well be unforgivably bad, and it's a fairly plausible worst case.
Even if someone disputes its likelihood, and demonstrates that it isn't actually that likely, that isn't necessarily very re-assuring - getting all the details right is pretty subtle, especially compared to just not bloating, and just deferring to the btree code whose responsibilities include enforcing uniqueness. > Much seems to depend on what you think the frequency of > conflicts will be, and perhaps I'm assuming it will be low while > you're assuming a higher value. Regardless, if the performance of the > sort of implementation I'm talking about would be terrible (under some > agreed-upon definition of what terrible means in this context), then > that's a good argument for not doing it that way. I'm just not > convinced that's the case. All fair points. Forgive me for repeating myself, but the word "conflict" needs to be used carefully here, because there are two basic ways of interpreting it - something that happens due to concurrent xact activity around the same values, and something that happens due to there already being some row there with a conflicting value from some time ago (or that our xact inserted, even). Indeed, the former *is* generally much less likely than the latter, so the distinction is important. You could also further differentiate between value level and row level conflicts, or at least I think that you should, and that we should allow for value level conflicts. Let me try and explain myself better, with reference to a concrete example. Suppose we have a table with a primary key column, A, and a unique constraint column, B, and we lock the pk value first and the unique constraint value second. I'm assuming your design, but allowing for multiple unique indexes because I don't think doing anything less will be accepted - promise tuples have some of the same problems, as well as some other idiosyncratic ones (see my earlier remarks on recovery/freezing [2] for examples of those). So there is a fairly high probability that the pk value on A will be unique, and a fairly low probability that the unique constraint value on B will be unique, at least in this usage pattern of interest, where the user is mostly going to end up updating. Mostly, we insert a speculative regular index tuple (that points to a speculative heap tuple that we might decide to kill) into the pk column, A, right away, and then maybe block pending the resolution of a conflicting transaction on the unique constraint column B. I don't think we have any reasonable way of not blocking on A - if we go clean it up for the wait, that's going to bloat quite dramatically, *and* we have to WAL log. In any case you seemed to accept that cleaning up bloat synchronously like that was just going to be too expensive. So I suppose that rules that out. That just leaves sitting on the "value lock" (that the pk index tuple already inserted effectively is) indefinitely, pending the outcome of the first transaction. What are the consequences of sitting on that value lock indefinitely? Well, xacts are going to block on the pk value much more frequently, by simple virtue of the fact that the value locks there are held for a long time - they just needed to hear a "no" answer, which the unique constraint was in most cases happy to immediately give, so this is totally unnecessary. Contention is now in a certain sense almost as bad for every unique index as it is for the weakest link. 
That's only where the problems begin, though, and it isn't necessary for there to be bad contention on more than one unique index (the pk could just be on a serial column, say) to see bad effects. So your long-running xact that's blocking all the other sessions on its proposed value for a (or maybe even b) - that finally gets to proceed. Regardless of whether it commits or aborts, there will be a big bloat race. This is because when the other sessions get the go-ahead to proceed, they'll all run to get the row lock (one guy might insert instead). Only one will be successful, but they'll all kill their heap tuple on the assumption that they'll probably lock the row, which is only true in the average case. Now, maybe you can teach them to not bother killing the heap tuple when there are no index tuples actually inserted to ruin things, but then maybe not, and maybe it wouldn't help in this instance if you did teach them (because there's a third, otherwise irrelevant constraint or whatever). Realize you can generally only kill the heap tuple *before* you have the row lock, because otherwise a totally innocent non-HOT update (may not update any unique indexed columns at all) will deadlock with your session, which I don't think is defensible, and will probably happen often if allowed to (after all, this is upsert - users are going to want to update their locked rows!). So in this scenario, each of the original blockers will simultaneously try again and again to get the row lock as one transaction proceeds with locking and then probably commits. For every blocker's iteration (there will be n_blockers - 1 iterations, with each iteration resolving things for one blocker only), each blocker bloats. We're talking about creating duplicates in unique indexes for each and every iteration, for each and every blocker, and we all know duplicates in btree indexes are, in a word, bad. I can imagine one or two ridiculously bloated indexes in this scenario. It's even degenerative in another direction - the more aggregate bloat we have, the slower the jump from value to row locking takes, the more likely conflicts are, the more likely bloat is. Contrast this with my design, where re-ordering of would-be conflicters across unique indexes (or serialization failures) can totally nip this in the bud *if* the contention can be re-ordered around, but if not, at least there is no need to worry about aggregating bloat at all, because it creates no bloat. Now, you're probably thinking "but I said I'll reverify the row for conflicts across versions, and it'll be fine - there's generally no need to iterate and bloat again provided no unique-indexed column changed, even if that is more likely to occur due to the clean-up pre row locking". Maybe I'm being unfair, but apart from requiring a considerable amount of additional infrastructure of its own (a new EvalPlanQual()-like thing that cares about binary equality in respect of some columns only across row versions), I think that this is likely to turn out to be subtly flawed in some way, simply because of the modularity violation, so I haven't given you the benefit of the doubt about your ability to frequently avoid repeatedly asking the index + btree code what to do. 
For example, partial unique indexes - maybe something that looked okay before because you simply didn't have cause to insert into that unique index has to be considered in light of the fact that it changed across row versions - are you going to stash that knowledge too, and is it likely to affect someone who might otherwise not have these issues really badly because we have to assume the worst there? Do you want to do a value verification thing for that too, as we do when deciding to insert into partial indexes in the first place? Even if this new nothing-changed-across-versions infrastructure works, will it work often enough in practice to be worth it -- have you ever tracked the proportion of updates that were HOT updates in a production DB? It isn't uncommon for it to not be great, and I think that we can take that as a proxy for how well this will work. It could be totally legitimate for the UPDATE portion to alter a unique indexed column all the time. > Basically, if there's a way we can do this without changing the > on-disk format (even in a backward-compatible way), I'd be strongly > inclined to go that route unless we have a really compelling reason to > believe it's going to suck (or be outright impossible). I don't believe that anything that I have proposed needs to break our on-disk format - I hadn't considered what the implications might be in this area for other proposals, but it's possible that that's an additional advantage of doing value locking all in-memory. [1] http://www.postgresql.org/message-id/CAM3SWZRV0F-DjgpXu-WxGoG9eEcLawNrEiO5+3UKRp2e5s=TSg@mail.gmail.com [2] http://www.postgresql.org/message-id/CAM3SWZQUUuYYcGksVytmcGqACVMkf1ui1uvfJekM15YkWZpzhw@mail.gmail.com -- Peter Geoghegan
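To make the shape of that example concrete (all names and values here are hypothetical, and the INSERT uses the patch's proposed syntax):

CREATE TABLE accounts
(
    id    int4 PRIMARY KEY,       -- "A": conflicts here should be rare
    email text NOT NULL UNIQUE,   -- "B": most conflicts happen here
    hits  int4 NOT NULL DEFAULT 1
);

-- The upsert-mostly-updates workload under discussion:
WITH r AS (
    INSERT INTO accounts (id, email) VALUES (42, 'someone@example.com')
    ON DUPLICATE KEY LOCK FOR UPDATE
    RETURNING REJECTS *
)
UPDATE accounts a SET hits = a.hits + 1 FROM r WHERE a.email = r.email;

-- Sitting on the speculative index tuple for id = 42 (the "value lock" on A)
-- while we wait out a conflicting transaction on "email" means that any other
-- session interested in id = 42 now waits too, even though the primary key
-- index could have given it its "no conflict" answer immediately.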
On Thu, Sep 26, 2013 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I think one thing that's pretty clear at this point is that almost any > version of this feature could be optimized for either the insert case > or the update case. For example, my proposal could be modified to > search for a conflicting tuple first, potentially wasting an index > probes (or multiple index probes, if you want to search for potential > conflicts in multiple indexes) if we're inserting, but winning heavily > in the update case. I don't think that's really the case. In what sense could my design really be said to prioritize either the INSERT or the UPDATE case? I'm pretty sure that it's still necessary to get all the value locks per unique index needed up until the first one with a conflict even if you know that you're going to UPDATE for *some* reason, in order for things to be well defined (which is important, because there might be more than one conflict, and which one is locked matters - maybe we could add DDL to let unique indexes have a checking priority or something like that). The only appreciable downside of my design for updates that I can think of is that there has to be another index scan, to find the locked-for-update row to update. However, that's probably worth it, since it is at least relatively rare, and allows the user the flexibility of using a more complex UPDATE predicate than "apply to conflicter", which is something that the MySQL syntax effectively limits users to. -- Peter Geoghegan
On Thu, Sep 26, 2013 at 11:58 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Thu, Sep 26, 2013 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Well, I think we can rule out value locks that are held for the >>> duration of a transaction right away. That's just not going to fly. >> >> I think I agree with that. I don't think I remember hearing that proposed. > > I think I might have been unclear - I mean locks that are held for the > duration of *another* transaction, not our own, as we wait for that > other transaction to commit/abort. I think that earlier remarks from > yourself and Andres implied that this would be necessary. Perhaps I'm > mistaken. Your most recent design proposal doesn't do this, but I > think that that's only because it restricts the user to a single > unique index - it would otherwise be necessary to sit on the earlier > value locks (index tuples belonging to an unfinished transaction) > pending the completion of some other conflicting transaction, which > has numerous disadvantages (as described in my "it suits my purposes > to have the value locks be held for only an instant" mail to you [1]). OK, now I understand what you are saying. I don't think I agree with it. > I don't have another idea either. In fact, I'd go so far as to say > that doing any third thing that's better than those two to any > reasonable person is obviously impossible. But I'd add that we simple > cannot rollback at read committed, so we're just going to have to hold > our collective noses and do strange things with visibility. I don't accept that as a general principle. We're writing the code; we can make it behave any way we think best. > This is something that I haven't given remotely enough thought yet, so > please take it with a big grain of salt. I doubt that any change to HeapTupleSatisfiesMVCC() will be acceptable. This feature needs to restrain itself to behavior changes that only affect users of this feature, I think. > There is certainly value in considering that, and you're right to take > that tact - it is generally valuable to have a patch be minimally > invasive. However, ultimately that's just one aspect of any given > design, an aspect that needs to be weighed against others where there > is a tension. Obviously in this instance I believe, rightly or > wrongly, that doing more - adding more infrastructure than might be > considered strictly necessary - is the least worst thing. Also, > sometimes the apparent similarity of a design to what we have today is > illusory - certainly, I think you'd at least agree that the problems > that bloating during the interim between value locking and row locking > present are qualitatively different to other problems that bloat > presents in all existing scenarios. TBH, no, I don't think I agree with that. See further below. > Let me try and explain myself better, with reference to a concrete > example. Suppose we have a table with a primary key column, A, and a > unique constraint column, B, and we lock the pk value first and the > unique constraint value second. I'm assuming your design, but allowing > for multiple unique indexes because I don't think doing anything less > will be accepted - promise tuples have some of the same problems, as > well as some other idiosyncratic ones (see my earlier remarks on > recovery/freezing [2] for examples of those). OK, so far I'm right with you.
> So there is a fairly high probability that the pk value on A will be > unique, and a fairly low probability that the unique constraint value > on B will be unique, at least in this usage pattern of interest, where > the user is mostly going to end up updating. Mostly, we insert a > speculative regular index tuple (that points to a speculative heap > tuple that we might decide to kill) into the pk column, A, right away, > and then maybe block pending the resolution of a conflicting > transaction on the unique constraint column B. I don't think we have > any reasonable way of not blocking on A - if we go clean it up for the > wait, that's going to bloat quite dramatically, *and* we have to WAL > log. In any case you seemed to accept that cleaning up bloat > synchronously like that was just going to be too expensive. So I > suppose that rules that out. That just leaves sitting on the "value > lock" (that the pk index tuple already inserted effectively is) > indefinitely, pending the outcome of the first transaction. Agreed. > What are the consequences of sitting on that value lock indefinitely? > Well, xacts are going to block on the pk value much more frequently, > by simple virtue of the fact that the value locks there are held for a > long time - they just needed to hear a "no" answer, which the unique > constraint was in most cases happy to immediately give, so this is > totally unnecessary. Contention is now in a certain sense almost as > bad for every unique index as it is for the weakest link. That's only > where the problems begin, though, and it isn't necessary for there to > be bad contention on more than one unique index (the pk could just be > on a serial column, say) to see bad effects. Here's where I start to lose faith. It's unclear to me what those other transactions are doing. If they're trying to insert a record that conflicts with the primary key of the tuple we're inserting, they're probably doomed, but not necessarily; we might roll back. If they're also upserting, it's absolutely essential that they wait until we get done before deciding what to do. > So your long-running xact that's blocking all the other sessions on > its proposed value for a (or maybe even b) - that finally gets to > proceed. Regardless of whether it commits or aborts, there will be a > big bloat race. This is because when the other sessions get the > go-ahead to proceed, they'll all run to get the row lock (one guy > might insert instead). Only one will be successful, but they'll all > kill their heap tuple on the assumption that they'll probably lock the > row, which is only true in the average case. Now, maybe you can teach > them to not bother killing the heap tuple when there are no index > tuples actually inserted to ruin things, but then maybe not, and maybe > it wouldn't help in this instance if you did teach them (because > there's a third, otherwise irrelevant constraint or whatever). Supposing they are all upserters, it seems to me that what will probably happen is that one of them will lock the row and update it, and then commit. Then the next one will lock the row and update it, and then commit. And so on. It's probably important to avoid having them keep recreating speculative tuples and then killing them as long as a candidate tuple is available, so that they don't create a dead tuple per iteration. But that seems doable. 
> Realize you can generally only kill the heap tuple *before* you have > the row lock, because otherwise a totally innocent non-HOT update (may > not update any unique indexed columns at all) will deadlock with your > session, which I don't think is defensible, and will probably happen > often if allowed to (after all, this is upsert - users are going to > want to update their locked rows!). I must be obtuse; I don't see why that would deadlock. A bigger problem that I've just realized, though, is that once somebody else has blocked on a unique index insertion, they'll be stuck there until end of transaction even if we kill the tuple, because they're waiting on the xid, not the index itself. That might be fatal to my proposed design, or at least require the use of some more clever locking regimen. > Contrast this with my design, where re-ordering of would-be > conflicters across unique indexes (or serialization failures) can > totally nip this in the bud *if* the contention can be re-ordered > around, but if not, at least there is no need to worry about > aggregating bloat at all, because it creates no bloat. Yeah, possibly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Sep 27, 2013 at 5:36 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I don't have another idea either. In fact, I'd go so far as to say >> that doing any third thing that's better than those two to any >> reasonable person is obviously impossible. But I'd add that we simple >> cannot rollback at read committed, so we're just going to have to hold >> our collective noses and do strange things with visibility. > > I don't accept that as a general principal. We're writing the code; > we can make it behave any way we think best. I presume you're referring to the principle that we cannot throw serialization failures at read committed. I'd suggest that letting that happen would upset a lot of people, because it's so totally unprecedented. A large segment of our user base would just consider that to be Postgres randomly throwing errors, and would be totally dismissive of the need to do so, and not without some justification - no one else does the same. The reality is that the majority of our users don't even know what an isolation level is. I'm not just talking about people that use Postgres more casually, such as Heroku customers. I've personally talked to people who didn't even know what a transaction isolation level was, that were in a position where they really, really should have known. > I doubt that any change to HeapTupleSatisfiesMVCC() will be > acceptable. This feature needs to restrain itself to behavior changes > that only affect users of this feature, I think. I agree with the principle of what you're saying, but I'm not aware that those changes to HeapTupleSatisfiesMVCC() imply any behavioral changes for those not using the feature. Certainly, the standard regression tests and isolation tests still pass, for what it's worth. Having said that, I have not thought about it enough to be willing to actually defend that bit of code. Though I must admit that I am a little encouraged by the fact that it passes casual inspection. I am starting to wonder if it's really necessary to have a "blessed" update that can see the locked, not-otherwise-visible tuple. Doing that certainly has its disadvantages, both in terms of code complexity and in terms of being arbitrarily restrictive. We're going to have to allow the user to see the locked row after it's updated (the new row version that we create will naturally be visible to its creating xact) - is it really any worse that the user can see it before an update (or a delete)? The user could decide to effectively make the update change nothing, and see the same thing anyway. I get why you're averse to doing odd things to visibility - I was too. I just don't see that we have a choice if we want this feature to work acceptably with read committed. In addition, as it happens I just don't see that the general situation is made any worse by the fact that the user might be able to see the row before an update/delete. Isn't it also weird to update or delete something you cannot see? Couldn't EvalPlanQual() be said to be an MVCC violation on similar grounds? It also "reaches into the future". Locking a row isn't really that distinct from updating it in terms of the code footprint, but also from a logical perspective. > It's probably important to avoid having > them keep recreating speculative tuples and then killing them as long > as a candidate tuple is available, so that they don't create a dead > tuple per iteration. But that seems doable. I'm not so sure.
>> Realize you can generally only kill the heap tuple *before* you have >> the row lock, because otherwise a totally innocent non-HOT update (may >> not update any unique indexed columns at all) will deadlock with your >> session, which I don't think is defensible, and will probably happen >> often if allowed to (after all, this is upsert - users are going to >> want to update their locked rows!). > > I must be obtuse; I don't see why that would deadlock. If you don't see it, then you aren't being obtuse in asking for clarification. It's really easy to be wrong about this kind of thing. If the non-HOT update updates some random row, changing the key columns, it will lock that random row version. It will then proceed with "value locking" (i.e. inserting index tuples in the usual way, in this case with entirely new values). It might then block on one of the index tuples we, the upserter, have already inserted (these are our "value locks" under your scheme). Meanwhile, we (the upserter) might have *already* concluded that the *old* heap row that the regular updater is in the process of rendering invisible is to blame in respect of some other value in some later unique index, and that *it* must be locked. Deadlock. This seems very possible if the key values are somewhat correlated, which is probably generally quite common. The important observation here is that an updater, in effect, locks both the old and new sets of values (for non-HOT updates). And as I've already noted, any practical "value locking" implementation isn't going to be able to prevent the update from immediately locking the old, because that doesn't touch an index. Hence, there is an irresolvable disconnect between value and row locking. Are we comfortable with this? Before you answer, consider that there were lots of bugs (their words) in the MySQL implementation of this same basic idea surrounding excessive deadlocking - I heard through the grapevine that they fixed a number of bugs along these lines, and that their implementation has historically had lots of deadlocking problems. I think that the way to deal with weird, unprincipled deadlocking is to simply not hold value locks at the same time as row locks - it is my contention that the lock starvation hazards that avoiding being smarter about this may present aren't actually an issue, unless you have some kind of highly implausible perfect storm of read-committed aborters inserting around the same values - only one of those needs to commit to remedy the situation - the first "no" answer is all we need to give up. To repeat myself, that's really the essential nature of my design: it is accepting of the inevitability of there being a disconnect between value and row locking. Value locks that are implemented in a sane way can do very little; they can only prevent a conflicting insertion from *finishing*, and not from causing a conflict for row locking. > A bigger problem that I've just realized, though, is that once > somebody else has blocked on a unique index insertion, they'll be > stuck there until end of transaction even if we kill the tuple, > because they're waiting on the xid, not the index itself. That might > be fatal to my proposed design, or at least require the use of some > more clever locking regimen.
Well, it's really fatal to your proposed design *because* it implies that others will be blocked on earlier value locks, which is what I was trying to say (in saying this, I'm continuing to hold your design to the same standard as my own, which is that it must work across multiple unique indexes - I believe that you yourself accept this standard based on your remarks here). For the benefit of others who may not get what we're talking about: in my patch, that isn't a problem, because when we block on acquiring an xid ShareLock pending value conflict resolution, that means that the other guy actually did insert (and did not merely think about it), and so with that design it's entirely appropriate that we wait for his xact to end. >> Contrast this with my design, where re-ordering of would-be >> conflicters across unique indexes (or serialization failures) can >> totally nip this in the bud *if* the contention can be re-ordered >> around, but if not, at least there is no need to worry about >> aggregating bloat at all, because it creates no bloat. > > Yeah, possibly. I think that re-ordering is an important property of any design where we cannot bail out with serialization failures. I know it seems weird, because it seems like an MVCC violation to have our behavior altered as a result of a transaction that committed that isn't even visible to us. As I think you appreciate, on a certain level that's just the nature of the beast. This might sound stupid, but: you can say the same thing about unique constraint violations! I do not believe that this introduces any anomalies that read committed doesn't already permit according to the standard. -- Peter Geoghegan
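To spell out the deadlock interleaving Peter describes (hypothetical table and values; this assumes a scheme in which ordinary index tuples double as the value locks, plus the patch's proposed syntax):

CREATE TABLE ab (a int4 PRIMARY KEY, b int4 UNIQUE);
INSERT INTO ab VALUES (1, 1);   -- pre-existing, committed row

-- Session I (upserter):
--     INSERT INTO ab VALUES (2, 1) ON DUPLICATE KEY LOCK FOR UPDATE;
--   I1. inserts its index tuple for a = 2 (its "value lock" on the pk);
--   I2. sees that b = 1 conflicts with the old row (1, 1) and tries to lock
--       that row.
--
-- Session U (ordinary non-HOT update):
--     UPDATE ab SET a = 2 WHERE a = 1;
--   U1. locks the row version (1, 1);
--   U2. inserts new index tuples and blocks on session I's a = 2 entry.
--
-- With the ordering I1, U1, U2, I2, session I waits on U's row lock while U
-- waits on I's xid: a deadlock, even though U is a perfectly ordinary UPDATE
-- that never asked for any upsert behavior.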
On Tue, Sep 24, 2013 at 2:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Various messages are discussing semantics around visibility. I by now > have a hard time keeping track. So let's keep the discussion of the > desired semantics to this thread. Yes, it's pretty complicated. I meant to comment on this here, but ended up saying some stuff to Robert about this in the main thread, so I should probably direct you to that. You were probably right to start a new thread, because I think we can usefully discuss this topic in parallel, but that's just what ended up happening. -- Peter Geoghegan
On Fri, Sep 27, 2013 at 8:36 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Fri, Sep 27, 2013 at 5:36 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> I don't have another idea either. In fact, I'd go so far as to say >>> that doing any third thing that's better than those two to any >>> reasonable person is obviously impossible. But I'd add that we simple >>> cannot rollback at read committed, so we're just going to have to hold >>> our collective noses and do strange things with visibility. >> >> I don't accept that as a general principal. We're writing the code; >> we can make it behave any way we think best. > > I presume you're referring to the principle that we cannot throw > serialization failures at read committed. I'd suggest that letting > that happen would upset a lot of people, because it's so totally > unprecedented. A large segment of our user base would just consider > that to be Postgres randomly throwing errors, and would be totally > dismissive of the need to do so, and not without some justification - > no one else does the same. The reality is that the majority of our > users don't even know what an isolation level is. I'm not just talking > about people that use Postgres more casually, such as Heroku > customers. I've personally talked to people who didn't even know what > a transaction isolation level was, that were in a position where they > really, really should have known. Yes, it might not be a good idea. But I'm just saying, we get to decide. >> I doubt that any change to HeapTupleSatisfiesMVCC() will be >> acceptable. This feature needs to restrain itself to behavior changes >> that only affect users of this feature, I think. > > I agree with the principle of what you're saying, but I'm not aware > that those changes to HeapTupleSatisfiesMVCC() imply any behavioral > changes for those not using the feature. Certainly, the standard > regression tests and isolation tests still pass, for what it's worth. > Having said that, I have not thought about it enough to be willing to > actually defend that bit of code. Though I must admit that I am a > little encouraged by the fact that it passes casual inspection. Well, at a minimum, it's a performance worry. Those functions are *hot*. Single branches do matter there. > I am starting to wonder if it's really necessary to have a "blessed" > update that can see the locked, not-otherwise-visible tuple. Doing > that certainly has its disadvantages, both in terms of code complexity > and in terms of being arbitrarily restrictive. We're going to have to > allow the user to see the locked row after it's updated (the new row > version that we create will naturally be visible to its creating xact) > - is it really any worse that the user can see it before an update (or > a delete)? The user could decide to effectively make the update change > nothing, and see the same thing anyway. If we're not going to just error out over the invisible tuple the user needs some way to interact with it. The details are negotiable. > I get why you're averse to doing odd things to visibility - I was too. > I just don't see that we have a choice if we want this feature to work > acceptably with read committed. In addition, as it happens I just > don't see that the general situation is made any worse by the fact > that the user might be able to see the row before an update/delete. > Isn't is also weird to update or delete something you cannot see? > > Couldn't EvalPlanQual() be said to be an MVCC violation on similar > grounds? 
It also "reaches into the future". Locking a row isn't really > that distinct from updating it in terms of the code footprint, but > also from a logical perspective. Yes, EvalPlanQual() is definitely an MVCC violation. >>> Realize you can generally only kill the heap tuple *before* you have >>> the row lock, because otherwise a totally innocent non-HOT update (may >>> not update any unique indexed columns at all) will deadlock with your >>> session, which I don't think is defensible, and will probably happen >>> often if allowed to (after all, this is upsert - users are going to >>> want to update their locked rows!). >> >> I must be obtuse; I don't see why that would deadlock. > > If you don't see it, then you aren't being obtuse in asking for > clarification. It's really easy to be wrong about this kind of thing. > > If the non-HOT update updates some random row, changing the key > columns, it will lock that random row version. It will then proceed > with "value locking" (i.e. inserting index tuples in the usual way, in > this case with entirely new values). It might then block on one of the > index tuples we, the upserter, have already inserted (these are our > "value locks" under your scheme). Meanwhile, we (the upserter) might > have *already* concluded that the *old* heap row that the regular > updater is in the process of rendering invisible is to blame in > respect of some other value in some later unique index, and that *it* > must be locked. Deadlock. This seems very possible if the key values > are somewhat correlated, which is probably generally quite common. OK, I see. > The important observation here is that an updater, in effect, locks > both the old and new sets of values (for non-HOT updates). And as I've > already noted, any practical "value locking" implementation isn't > going to be able to prevent the update from immediately locking the > old, because that doesn't touch an index. Hence, there is an > irresolvable disconnect between value and row locking. This part I don't follow. "locking the old"? What irresolvable disconnect? I mean, they're different things; I get *that*. > Are we comfortable with this? Before you answer, consider that there > was lots of bugs (their words) in the MySQL implementation of this > same basic idea surrounding excessive deadlocking - I heard through > the grapevine that they fixed a number of bugs along these lines, and > that their implementation has historically had lots of deadlocking > problems. > > I think that the way to deal with weird, unprincipled deadlocking is > to simply not hold value locks at the same time as row locks - it is > my contention that the lock starvation hazards that avoiding being > smarter about this may present aren't actually an issue, unless you > have some kind of highly implausible perfect storm of read-committed > aborters inserting around the same values - only one of those needs to > commit to remedy the situation - the first "no" answer is all we need > to give up. OK, I take your point, I think. The existing system already acquires value locks when a tuple lock is held, during an UPDATE, and we can't change that. >>> Contrast this with my design, where re-ordering of would-be >>> conflicters across unique indexes (or serialization failures) can >>> totally nip this in the bud *if* the contention can be re-ordered >>> around, but if not, at least there is no need to worry about >>> aggregating bloat at all, because it creates no bloat. >> >> Yeah, possibly. 
> > I think that re-ordering is an important property of any design where > we cannot bail out with serialization failures. I know it seems weird, > because it seems like an MVCC violation to have our behavior altered > as a result of a transaction that committed that isn't even visible to > us. As I think you appreciate, on a certain level that's just the > nature of the beast. This might sound stupid, but: you can say the > same thing about unique constraint violations! I do not believe that > this introduces any anomalies that read committed doesn't already > permit according to the standard. I worry about the behavior being confusing and hard to understand in the presence of multiple unique indexes and reordering. Perhaps I simply don't understand the problem domain well enough yet. From a user perspective, I would really think people would want to specify a set of key columns and then update if a match is found on those key columns. Suppose there's a unique index on (a, b) and another on (c), and the user passes in (a,b,c)=(1,1,1). It's hard for me to imagine that the user will be happy to update either (1,1,2) or (2,2,1), whichever exists. In what situation would that be the desired behavior? Also, under such a programming model, if somebody drops a unique index or adds a new one, the behavior of someone's application can completely change. I have a hard time swallowing that. It's an established precedent that dropping a unique index can make some other operation fail (e.g. ADD FOREIGN KEY, and more recently CREATE VIEW .. GROUP BY), and of course it can cause performance or plan changes. But overturning the semantics is, I think, something new, and it doesn't feel like a good direction. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
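Robert's example, spelled out (hypothetical table; the final statement uses the patch's proposed syntax):

CREATE TABLE abc (a int4, b int4, c int4, UNIQUE (a, b), UNIQUE (c));
INSERT INTO abc VALUES (1, 1, 2);
-- ...or, in some other database, the pre-existing row might instead be:
-- INSERT INTO abc VALUES (2, 2, 1);

-- The upsert of (a,b,c) = (1,1,1) conflicts on (a, b) in the first case and
-- on (c) in the second, so the row that ends up locked (and presumably then
-- updated) is either (1,1,2) or (2,2,1), depending on which unique index
-- happens to raise the conflict:
INSERT INTO abc VALUES (1, 1, 1) ON DUPLICATE KEY LOCK FOR UPDATE;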
On Mon, Sep 30, 2013 at 8:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> I doubt that any change to HeapTupleSatisfiesMVCC() will be >>> acceptable. This feature needs to restrain itself to behavior changes >>> that only affect users of this feature, I think. >> >> I agree with the principle of what you're saying, but I'm not aware >> that those changes to HeapTupleSatisfiesMVCC() imply any behavioral >> changes for those not using the feature. > Well, at a minimum, it's a performance worry. Those functions are > *hot*. Single branches do matter there. Well, that certainly is a reasonable concern. Offhand, I suspect that branch prediction helps immensely. But even if it doesn't, couldn't it be the case that returning earlier there actually helps? Where we have a real xid (so TransactionIdIsCurrentTransactionId() must do more than a single test of a scalar variable), and the row is locked *only* (which is already very cheap to check - it's another scalar variable that we already test in a few other places in that function), isn't there on average a high chance that the tuple ought to be visible to our snapshot anyway? >> I am starting to wonder if it's really necessary to have a "blessed" >> update that can see the locked, not-otherwise-visible tuple. > If we're not going to just error out over the invisible tuple the user > needs some way to interact with it. The details are negotiable. I think that we will error out over an invisible tuple with higher isolation levels. Certainly, what we do there today instead of EvalPlanQual() looping is consistent with that behavior. >> Couldn't EvalPlanQual() be said to be an MVCC violation on similar >> grounds? It also "reaches into the future". Locking a row isn't really >> that distinct from updating it in terms of the code footprint, but >> also from a logical perspective. > > Yes, EvalPlanQual() is definitely an MVCC violation. So I think that you can at least see why I'd consider that the two (my tweaks to HeapTupleSatisfiesMVCC() and EvalPlanQual()) are isomorphic. It just becomes the job of this new locking infrastructure to worry about the would-be invisibility of the locked tuple, and raise a serialization error accordingly at higher isolation levels. >> The important observation here is that an updater, in effect, locks >> both the old and new sets of values (for non-HOT updates). And as I've >> already noted, any practical "value locking" implementation isn't >> going to be able to prevent the update from immediately locking the >> old, because that doesn't touch an index. Hence, there is an >> irresolvable disconnect between value and row locking. > > This part I don't follow. "locking the old"? What irresolvable > disconnect? I mean, they're different things; I get *that*. Well, if you update a row, the old row version's values are locked, in the sense that any upserter interested in inserting the same values as the old version is going to have to block pending the outcome of the updating xact. The disconnect is that any attempt at a clever dance, to interplay value and row locking such that this definitely just works first time seems totally futile - I'm emphasizing this because it's the obvious way to approach this basic problem. It turns out that it could only be done at great expense, in a way that would immediately be dismissed as totally outlandish. > OK, I take your point, I think. The existing system already acquires > value locks when a tuple lock is held, during an UPDATE, and we can't > change that. Right. 
>> I think that re-ordering is an important property of any design where >> we cannot bail out with serialization failures. > I worry about the behavior being confusing and hard to understand in > the presence of multiple unique indexes and reordering. Perhaps I > simply don't understand the problem domain well-enough yet. It's only confusing if you are worried about what concurrent sessions do with respect to each other at this low level. In which case, just use a higher isolation level and pay the price. I'm not introducing any additional anomalies described and prohibited by the standard by doing this, and indeed the order of retrying in the event of a conflict today is totally undefined, so this line of thinking is not inconsistent with how things work today. Today, strictly speaking some unique constraint violations might be more appropriate as serialization failures. So with this new functionality, when used, they're going to be actual serialization failures where that's appropriate, where we'd otherwise go do something else other than error. Why burden read committed like that? (Actually, fwiw I suspect that currently the SSI guarantees *can* be violated with unique retry re-ordering, but that's a whole other story, and is pretty subtle). Let me come right out and say it: Yes, part of the reason that I'm taking this line is because it's convenient to my implementation from a number of different perspectives. But one of those perspectives is that it will help performance in the face of contention immensely, without violating any actual precept held today (by us or by the standard or by anyone else AFAIK), and besides, my basic design is informed by sincerely-held beliefs about what will actually work within the constraints presented. > From a user perspective, I would really think people would want to > specify a set of key columns and then update if a match is found on > those key columns. Suppose there's a unique index on (a, b) and > another on (c), and the user passes in (a,b,c)=(1,1,1). It's hard for > me to imagine that the user will be happy to update either (1,1,2) or > (2,2,1), whichever exists. In what situation would that be the > desired behavior? You're right - that isn't desirable. The reason that we go to all this trouble with locking multiple values concurrently boils down to preventing the user from having to specify a constraint name - it's usually really obvious *to users* that understand their schema, so why bother them with that esoteric detail? The user *is* more or less required to have a particular constraint in mind when writing their DML (for upsert). It could be that that constraint has a 1:1 correlation with another constraint in practice, which would also work out fine - they'd specify one or the other constrained column (maybe both) in the subsequent update's predicate. But generally, yes, they're out of luck here, until we get around to implementing MERGE in its full generality, which I think what I've proposed is a logical stepping stone towards (again, because it involves locking values across unique indexes simultaneously). Now, at least what I've proposed has the advantage of allowing the user to add some controls in their update's predicate. So if they only had updating (1,1,2) in mind, they could put WHERE a = 1 AND b = 1 in there too (I'm imagining the wCTE pattern is used). They'd then be able to inspect the value of the FOUND pseudo-variable or whatever. Now, I'm assuming that we'd somehow be able to tell that the insert hasn't succeeded (i.e. 
it locked), and maybe that doesn't accord very well with these kinds of facilities as they exist today, but it doesn't seem like too much extra work (MySQL would consider that both the locked and updated rows were affected, which might help us here). MySQL's INSERT...ON DUPLICATE KEY UPDATE has nothing like this - there is no guidance as to why you went to update, and you cannot have a separate update qual. Users better just get it right! Maybe what's really needed here is INSERT...ON DUPLICATE KEY LOCK FOR UPDATE RETURNING LOCKED... . You can see what was actually locked, and act on *that* as appropriate. Though you don't get to see the actual value of default expressions and so on, which is a notable disadvantage over RETURNING REJECTS... . The advantage of RETURNING LOCKED would be you could check if it LOCKED for the reason you thought it should have. If it didn't, then surely what you'd prefer would be a unique constraint violation, so you can just go throw an error in application code (or maybe consider another value for the columns that surprised you). What do others think? > Also, under such a programming model, if somebody drops a unique index > or adds a new one, the behavior of someone's application can > completely change. I have a hard time swallowing that. It's an > established precedent that dropping a unique index can make some other > operation fail (e.g. ADD FOREIGN KEY, and more recently CREATE VIEW .. > GROUP BY), and of course it can cause performance or plan changes. > But overturning the semantics is, I think, something new, and it > doesn't feel like a good direction. In what sense is causing, or preventing an error (the current state of affairs) not a behavioral change? I'd have thought it a very significant one. If what you're saying here is true, wouldn't that mandate that we specify the name of a unique index inline, within DML? I thought we were in agreement that that wasn't desirable. If you think it's a bit odd that we lock every value while the user essentially has one constraint in mind when writing their DML, consider: 1) We need this for MERGE anyway. 2) Don't underestimate the intellectual overhead for developers and operations personnel of adding an application-defined significance to unique indexes that they don't otherwise have. It sure would suck if a refactoring effort to normalize unique index naming style had the effect of breaking a slew of application code. Certainly, everyone else seems to have reached the same conclusion in their own implementation of upsert, because they don't require that a unique index be specified, even when that could have unexpected results. 3) The problems that getting the details wrong present can be ameliorated by developers who feel it might be a problem for them, as already described. I think in the vast majority of cases it just obviously won't be a problem to begin with. -- Peter Geoghegan
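As a concrete sketch of that kind of defensive predicate, continuing the (a, b) / (c) example from above and again assuming the proposed RETURNING REJECTS form:

WITH r AS (
    INSERT INTO abc VALUES (1, 1, 1)
    ON DUPLICATE KEY LOCK FOR UPDATE
    RETURNING REJECTS *
)
UPDATE abc SET c = r.c
FROM r
WHERE abc.a = r.a AND abc.b = r.b;

-- If the conflict (and hence the row lock) was on (c) rather than on (a, b),
-- nothing matches the join, the UPDATE reports zero rows affected, and the
-- application can treat that as "the wrong constraint fired" instead of
-- silently clobbering an unrelated row.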
On Mon, Sep 30, 2013 at 3:45 PM, Peter Geoghegan <pg@heroku.com> wrote: > If you think it's a bit odd that we lock every value while the user > essentially has one constraint in mind when writing their DML, > consider: I should add to that list: 4) Locking all the values at once is necessary for the behavior of the locking to be well-defined -- I feel we need to know that some exact tuple is to blame (according to our well defined ordering for checking unique indexes for conflicts) for at least one instant in time. Given that we need to be the first to change the row without anything being altered to it, this ought to be sufficient. If you think it's bad that some other session can come in and insert a tuple that would have caused us to decide differently (before *our* transaction commits but *after* we've inserted), now you're into blaming the *wrong* tuple in the future, and I can't get excited about that - we always prefer a tuple normally visible to our snapshot, but if forced to (if there is none) we just throw a serialization failure (where appropriate). So for read committed you can have no *principled* beef with this, but for serializable you're going to naturally prefer the currently-visible tuple generally (that's the only correct behavior there that won't error - there *better* be something visible). Besides, the way the user tacitly has to use the feature with one particular constraint in mind kind of implies that this cannot happen... -- Peter Geoghegan
On Mon, Sep 30, 2013 at 9:11 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Mon, Sep 30, 2013 at 3:45 PM, Peter Geoghegan <pg@heroku.com> wrote: >> If you think it's a bit odd that we lock every value while the user >> essentially has one constraint in mind when writing their DML, >> consider: > > I should add to that list: > > 4) Locking all the values at once is necessary for the behavior of the > locking to be well-defined -- I feel we need to know that some exact > tuple is to blame (according to our well defined ordering for checking > unique indexes for conflicts) for at least one instant in time. > > Given that we need to be the first to change the row without anything > being altered to it, this ought to be sufficient. If you think it's > bad that some other session can come in and insert a tuple that would > have caused us to decide differently (before *our* transaction commits > but *after* we've inserted), now you're into blaming the *wrong* tuple > in the future, and I can't get excited about that - we always prefer a > tuple normally visible to our snapshot, but if forced to (if there is > none) we just throw a serialization failure (where appropriate). So > for read committed you can have no *principled* beef with this, but > for serializable you're going to naturally prefer the > currently-visible tuple generally (that's the only correct behavior > there that won't error - there *better* be something visible). > > Besides, the way the user tacitly has to use the feature with one > particular constraint in mind kind of implies that this cannot > happen... This patch is still marked as "Needs Review" in the CommitFest application. There's no reviewer, but in fact Andres and I both spent quite a lot of time providing design feedback (probably more than I spent on any other CommitFest patch). I think it's clear that the patch as submitted is not committable, so as far as the CommitFest goes I'm going to mark it Returned with Feedback. I think there are still some design considerations to work out here, but honestly I'm not totally sure what the remaining points of disagreement are. It would be nice to hear the opinions of a few more people on the concurrency issues, but beyond that I think that a lot of this is going to boil down to whether the details of the value locking can be made to seem palatable enough and sufficiently low-overhead in the common case. I don't believe we can comment on that in the abstract. There's still some question in my mind as to what the semantics ought to be. I do understand Peter's point that having to specify a particular index would be grotty, but I'm not sure it invalidates my point that having to work across multiple indexes could lead to surprising results in some scenarios. I'm not going to stand here and hold my breath, though: if that's the only thing that makes me nervous about the final patch, I'll not object to it on that basis. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Oct 9, 2013 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote: > This patch is still marked as "Needs Review" in the CommitFest > application. There's no reviewer, but in fact Andres and I both spent > quite a lot of time providing design feedback (probably more than I > spent on any other CommitFest patch). Right, thank you both. > I think there are still some design considerations to work out here, > but honestly I'm not totally sure what the remaining points of > disagreement are. It would be nice to hear the opinions of a few more > people on the concurrency issues, but beyond that I think that a lot > of this is going to boil down to whether the details of the value > locking can be made to seem palatable enough and sufficiently > low-overhead in the common case. I don't believe we can comment on > that in the abstract. I agree that we cannot comment on it in the abstract. I am optimistic that we can make the value locking work better without regressing the common cases (especially if we're only concerned about not regressing users that never use the feature, as opposed to having some expectation for regular inserters inserting values into the same ranges as an upserter). That's not my immediate concern, though - my immediate concern is getting the concurrency and visibility issues scrutinized. What would it take to get the patch into a committable state if the value locking had essentially the same properties (they were held instantaneously), but were perfect? There is no point in giving the value locking implementation too much further consideration unless that question can be answered. In the past I've said that row locking and value locking cannot be considered separately, but that was when it was generally assumed that value locks had to persist for a long time in a way that I don't think is feasible (and I think Robert would now agree that it's at the very least very hard). Persisting value locks basically makes not regressing the general case hard, when you think about the implementation. As Robert remarked, regular btree index insertion blocks on an xid, not a value, and cannot be easily made to appreciate that the "value lock" that a would-be duplicate index tuple represents may just be held for a short time, and not the entire duration of their inserter's transaction. > There's still some question in my mind as to what the semantics ought > to be. I do understand Peter's point that having to specify a > particular index would be grotty, but I'm not sure it invalidates my > point that having to work across multiple indexes could lead to > surprising results in some scenarios. I'm not going to stand here and > hold my breath, though: if that's the only thing that makes me nervous > about the final patch, I'll not object to it on that basis. I should be so lucky! :-) Unfortunately, I have a very busy schedule in the month ahead, including travelling to Ireland and Japan, so I don't think I'm going to get the opportunity to work on this too much. I'll try and produce a V4 that formally proposes some variant of my ideas around visibility of locked tuples. Here are some things you might not like about this patch, if we're still assuming that the value locks are a prototype and it's useful to defer discussion around their implementation: * The lock starvation hazards around going from value locking to row locking, and retrying if it doesn't work out (i.e. 
if the row and its descendant rows cannot be locked without what would ordinarily necessitate using EvalPlanQual()). I don't see what we could do about those, other than checking for changes in the row's unique index values, which would be complex. I understand the temptation to do that, but the fact is that that isn't going to work all the time - some unique index value may well change every time. By doing that you've already accepted whatever hazard may exist, and it becomes a question of degree. Which is fine, but I don't see that the current degree is actually much of a problem in the real world. * Reordering of value locks generally. I still need to ensure this will behave reasonably at higher isolation levels (i.e. they'll get a serialization failure). I think that Robert accepts that this isn't inconsistent with read committed's documented behavior, and that it is useful, and maybe even essential. * The basic question of whether or not it's possible to lock values and rows at the same time, and if that matters (because it turns out what looks like that isn't, because deleters will effectively lock values without even touching an index). I think Robert saw the difficulty of doing this, but it would be nice to get a definitive answer. I think that any MERGE implementation worth its salt will not deadlock without the potential for multiple rows to be locked in an inconsistent order, so this shouldn't either, and as I believe I demonstrated, value locks and row locks should not be held at the same time for at least that reason. Right? * The syntax. I like the composability, and the way it's likely to become idiomatic to combine it with wCTEs. Others may not. * The visibility hacks that V4 is likely to have. The fact that preserving the composable syntax may imply changes to HeapTupleSatisfiesMVCC() so that rows locked but with no currently visible version (under conventional rules) are visible to our snapshot by virtue of having been locked all the same (this only matters at read committed). So I think that what this patch really could benefit from is lots of scrutiny around the concurrency issues. It would be unfair to ask for that before at least producing a V4, so I'll clean up what I already have and post it, probably on Sunday. -- Peter Geoghegan
On Wed, Oct 9, 2013 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote: > * The lock starvation hazards around going from value locking to row > locking, and retrying if it doesn't work out (i.e. if the row and its > descendant rows cannot be locked without what would ordinarily > necessitate using EvalPlanQual()). I don't see what we could do about > those, other than checking for changes in the rows unique index > values, which would be complex. I understand the temptation to do > that, but the fact is that that isn't going to work all the time - > some unique index value may well change every time. By doing that > you've already accepted whatever hazard may exist, and it becomes a > question of degree. Which is fine, but I don't see that the current > degree is actually much of problem in the real world. Some of the decisions we make here may end up being based on measured performance rather than theoretical analysis. > * Reordering of value locks generally. I still need to ensure this > will behave reasonably at higher isolation levels (i.e. they'll get a > serialization failure). I think that Robert accepts that this isn't > inconsistent with read committed's documented behavior, and that it is > useful, and maybe even essential. I think there's a sentence missing here, or something. Obviously, the behavior at higher isolation levels is neither consistent nor inconsistent with read committed's documented behavior; it's another issue entirely. > * The basic question of whether or not it's possible to lock values > and rows at the same time, and if that matters (because it turns out > what looks like that isn't, because deleters will effectively lock > values without even touching an index). I think Robert saw the > difficulty of doing this, but it would be nice to get a definitive > answer. I think that any MERGE implementation worth its salt will not > deadlock without the potential for multiple rows to be locked in an > inconsistent order, so this shouldn't either, and as I believe I > demonstrated, value locks and row locks should not be held at the same > time for at least that reason. Right? Right. > * The syntax. I like the composability, and the way it's likely to > become idiomatic to combine it with wCTEs. Others may not. I've actually lost track of what syntax you're proposing. > * The visibility hacks that V4 is likely to have. The fact that > preserving the composable syntax may imply changes to > HeapTupleSatisfiesMVCC() so that rows locked but with no currently > visible version (under conventional rules) are visible to our snapshot > by virtue of having been locked all the same (this only matters at > read committed). I continue to think this is a bad idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Oct 9, 2013 at 5:37 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> * Reordering of value locks generally. I still need to ensure this >> will behave reasonably at higher isolation levels (i.e. they'll get a >> serialization failure). I think that Robert accepts that this isn't >> inconsistent with read committed's documented behavior, and that it is >> useful, and maybe even essential. > > I think there's a sentence missing here, or something. Obviously, the > behavior at higher isolation levels is neither consistent nor > inconsistent with read committed's documented behavior; it's another > issue entirely. Here, "this" referred to the reordering concept generally. So I was just saying that I'm not actually introducing any anomaly that is described by the standard at read committed, and that at repeatable read+, we can have actual serial ordering of value locks without requiring them to last a long time, because we can throw serialization failures, and can even do so when not strictly logically necessary. >> * The basic question of whether or not it's possible to lock values >> and rows at the same time, and if that matters (because it turns out >> what looks like that isn't, because deleters will effectively lock >> values without even touching an index). I think Robert saw the >> difficulty of doing this, but it would be nice to get a definitive >> answer. I think that any MERGE implementation worth its salt will not >> deadlock without the potential for multiple rows to be locked in an >> inconsistent order, so this shouldn't either, and as I believe I >> demonstrated, value locks and row locks should not be held at the same >> time for at least that reason. Right? > > Right. I'm glad we're on the same page with that - it's a very important consideration to my mind. >> * The syntax. I like the composability, and the way it's likely to >> become idiomatic to combine it with wCTEs. Others may not. > > I've actually lost track of what syntax you're proposing. I'm continuing to propose: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE with a much less interesting variant that could be jettisoned: INSERT...ON DUPLICATE KEY IGNORE I'm also proposing extended RETURNING to make it work with this. So the basic idea is that within Postgres, the idiomatic way to correctly do upsert becomes something like: postgres=# with r as ( insert into foo(a,b) values (5, '!'), (6, '@') on duplicate key lock for update returning rejects * ) update foo set b = r.b from r where foo.a = r.a; >> * The visibility hacks that V4 is likely to have. The fact that >> preserving the composable syntax may imply changes to >> HeapTupleSatisfiesMVCC() so that rows locked but with no currently >> visible version (under conventional rules) are visible to our snapshot >> by virtue of having been locked all the same (this only matters at >> read committed). > > I continue to think this is a bad idea. Fair enough. Is it just because of performance concerns? If so, that's probably not that hard to address. It either has a measurable impact on performance for a very unsympathetic benchmark or it doesn't. I guess that's the standard that I'll be held to, which is probably fair. Do you see the appeal of the composable syntax? I appreciate that it's odd that serializable transactions now have to worry about seeing something they shouldn't have seen (when they conclusively have to go lock a row version not current to their snapshot). But that's simpler than any of the alternatives that I see. 
Does there really need to be a new snapshot type with one tiny difference that apparently doesn't actually affect conventional clients of MVCC snapshots? -- Peter Geoghegan
On Wed, Oct 9, 2013 at 9:30 PM, Peter Geoghegan <pg@heroku.com> wrote: >>> * The syntax. I like the composability, and the way it's likely to >>> become idiomatic to combine it with wCTEs. Others may not. >> >> I've actually lost track of what syntax you're proposing. > > I'm continuing to propose: > > INSERT...ON DUPLICATE KEY LOCK FOR UPDATE > > with a much less interesting variant that could be jettisoned: > > INSERT...ON DUPLICATE KEY IGNORE > > I'm also proposing extended RETURNING to make it work with this. So > the basic idea is that within Postgres, the idiomatic way to correctly > do upsert becomes something like: > > postgres=# with r as ( > insert into foo(a,b) > values (5, '!'), (6, '@') > on duplicate key lock for update > returning rejects * > ) > update foo set b = r.b from r where foo.a = r.a; I can't claim to be enamored of this syntax. >>> * The visibility hacks that V4 is likely to have. The fact that >>> preserving the composable syntax may imply changes to >>> HeapTupleSatisfiesMVCC() so that rows locked but with no currently >>> visible version (under conventional rules) are visible to our snapshot >>> by virtue of having been locked all the same (this only matters at >>> read committed). >> >> I continue to think this is a bad idea. > > Fair enough. > > Is it just because of performance concerns? If so, that's probably not > that hard to address. It either has a measurable impact on performance > for a very unsympathetic benchmark or it doesn't. I guess that's the > standard that I'll be held to, which is probably fair. That's part of it; but I also think that HeapTupleSatisfiesMVCC() is a pretty fundamental bit of the system that I am loath to tamper with. We can try to talk ourselves into believing that the definition change will only affect this case, but I'm wary that there will be unanticipated consequences, or simply that we'll find, after it's far too late to do anything about it, that we don't particularly care for the new semantics. It's probably an overstatement to say that I'll oppose anything whatsoever that touches the semantics of that function, but not by much. > Do you see the appeal of the composable syntax? To some extent. It seems to me that what we're designing is a giant grotty hack, albeit a convenient one. But if we're not really going to get MERGE, I'm not sure how much good it is to try to pretend we've got something general. > I appreciate that it's odd that serializable transactions now have to > worry about seeing something they shouldn't have seen (when they > conclusively have to go lock a row version not current to their > snapshot). Surely that's never going to be acceptable. At read committed, locking a version not current to the snapshot might be acceptable if we hold our nose, but at any higher level I think we have to fail with a serialization complaint. > But that's simpler than any of the alternatives that I see. > Does there really need to be a new snapshot type with one tiny > difference that apparently doesn't actually affect conventional > clients of MVCC snapshots? I think that's the wrong way of thinking about it. If you're introducing a new type of snapshot, or tinkering with the semantics of an existing one, I think that's a reason to reject the patch straight off. We should be looking for a design that doesn't require that. If we can't find one, I'm not sure we should do this at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-10-11 08:43:43 -0400, Robert Haas wrote: > > I appreciate that it's odd that serializable transactions now have to > > worry about seeing something they shouldn't have seen (when they > > conclusively have to go lock a row version not current to their > > snapshot). > > Surely that's never going to be acceptable. At read committed, > locking a version not current to the snapshot might be acceptable if > we hold our nose, but at any higher level I think we have to fail with > a serialization complaint. I think an UPSERTish action in RR/SERIALIZABLE that notices a concurrent update should and has to *ALWAYS* raise a serialization failure. Anything else will cause violations of the given guarantees. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Oct 11, 2013 at 10:02 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-10-11 08:43:43 -0400, Robert Haas wrote: >> > I appreciate that it's odd that serializable transactions now have to >> > worry about seeing something they shouldn't have seen (when they >> > conclusively have to go lock a row version not current to their >> > snapshot). >> >> Surely that's never going to be acceptable. At read committed, >> locking a version not current to the snapshot might be acceptable if >> we hold our nose, but at any higher level I think we have to fail with >> a serialization complaint. > > I think an UPSERTish action in RR/SERIALIZABLE that notices a concurrent > update should and has to *ALWAYS* raise a serialization > failure. Anything else will cause violations of the given guarantees. Sorry, this was just a poor choice of words on my part. I totally agree with you here. Although I wasn't even talking about noticing a concurrent update - I was talking about noticing that a tuple that it's necessary to lock isn't visible to a serializable snapshot in the first place (which should also fail). What I actually meant was that it's odd that that one case (reason for returning) added to HeapTupleSatisfiesMVCC() will always obligate Serializable transactions to throw a serialization failure. Though that isn't strictly true; the modifications to HeapTupleSatisfiesMVCC() that I'm likely to propose also redundantly work for other cases where, if I'm not mistaken, that's okay (today, if you've exclusively locked a tuple and it hasn't been updated/deleted, why shouldn't it be visible to your snapshot?). The onus is on the executor-level code to notice this should-be-invisibility for non-read-committed, probably immediately after returning from value locking. -- Peter Geoghegan
On Fri, Oct 11, 2013 at 5:43 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> * The visibility hacks that V4 is likely to have. The fact that >>>> preserving the composable syntax may imply changes to >>>> HeapTupleSatisfiesMVCC() so that rows locked but with no currently >>>> visible version (under conventional rules) are visible to our snapshot >>>> by virtue of having been locked all the same (this only matters at >>>> read committed). >>> >>> I continue to think this is a bad idea. >> Is it just because of performance concerns? > That's part of it; but I also think that HeapTupleSatisfiesMVCC() is a > pretty fundamental bit of the system that I am loath to tamper with. > We can try to talk ourselves into believing that the definition change > will only affect this case, but I'm wary that there will be > unanticipated consequences, or simply that we'll find, after it's far > too late to do anything about it, that we don't particularly care for > the new semantics. It's probably an overstatement to say that I'll > oppose anything whatsoever that touches the semantics of that function, but > not by much. A tuple that is exclusively locked by our transaction and not updated or deleted being visible on that basis alone isn't *that* hard to reason about. Granted, we need to be very careful here, but we're talking about 3 lines of code. >> Do you see the appeal of the composable syntax? > > To some extent. It seems to me that what we're designing is a giant > grotty hack, albeit a convenient one. But if we're not really going > to get MERGE, I'm not sure how much good it is to try to pretend we've > got something general. Well, to be fair perhaps all of the things that you consider grotty hacks seem like inherent requirements to me, for any half-way reasonable upsert implementation on any system, that has the essential property of upsert: an atomic insert-or-update (or maybe serialization failure). >> But that's simpler than any of the alternatives that I see. >> Does there really need to be a new snapshot type with one tiny >> difference that apparently doesn't actually affect conventional >> clients of MVCC snapshots? > > I think that's the wrong way of thinking about it. If you're > introducing a new type of snapshot, or tinkering with the semantics of > an existing one, I think that's a reason to reject the patch straight > off. We should be looking for a design that doesn't require that. If > we can't find one, I'm not sure we should do this at all. I'm confused by this. We need to lock a row not visible to our snapshot under conventional rules. I think we can rule out serialization failures at read committed. That just leaves changing something about the visibility rules of an existing snapshot type, or creating a new snapshot type, no? It would also be unacceptable to update a tuple, and not have the new row version (which of course will still have "information from the future") visible to our snapshot - what would regular RETURNING return? So what do you have in mind? I don't think that locking a row and updating it are really that distinct anyway. The benefit of locking is that we don't have to update. We can delete, for example. Perhaps I've totally missed your point here, but to me it sounds like you're saying that certain properties must always be preserved that are fundamentally in tension with upsert working in the way people expect, and the way it is bound to actually work in numerous other systems. -- Peter Geoghegan
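For concreteness, the read-committed scenario that the proposed HeapTupleSatisfiesMVCC() exception is meant to cover is roughly the following (a sketch of the behavior under discussion, not tested output from the patch):

    -- Session 2 commits a conflicting row after session 1's statement below
    -- has taken its snapshot, but before the insert reaches the key:
    --     insert into foo values (7, 'theirs');  commit;
    --
    -- Session 1, at READ COMMITTED:
    with r as (
      insert into foo(a,b)
      values (7, 'ours')
      on duplicate key lock for update
      returning rejects *
    )
    update foo set b = r.b from r where foo.a = r.a;
    -- The (7, 'theirs') row gets locked, but it is not visible to the
    -- statement's snapshot under unmodified rules, so the UPDATE part of the
    -- wCTE would find nothing to update. The proposed exception treats a row
    -- that our own transaction has locked, and not updated or deleted, as
    -- visible - which is exactly the point in dispute.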
On Wed, Oct 9, 2013 at 1:11 PM, Peter Geoghegan <pg@heroku.com> wrote: > Unfortunately, I have a very busy schedule in the month ahead, > including travelling to Ireland and Japan, so I don't think I'm going > to get the opportunity to work on this too much. I'll try and produce > a V4 that formally proposes some variant of my ideas around visibility > of locked tuples. V4 is attached. Most notably, this adds the modifications to HeapTupleSatisfiesMVCC(), though they're neater than in the snippet I sent earlier. There is also some clean-up around row-level locking. That code has been simplified. I also try and handle serialization failures in a better way, though that really needs the attention of a subject matter expert. There are a few additional XXX comments highlighting areas of concern, particularly around serializable behavior. I've deferred making higher isolation levels care about wrongfully relying on the special HeapTupleSatisfiesMVCC() exception (e.g. they won't throw a serialization failure, mostly because I couldn't decide on where to do the test on time prior to travelling tomorrow). I've added code to do heap_prepare_insert before value locks are held. Whatever our eventual value locking implementation, that's going to be a useful optimization. Though unfortunately I ran out of time to give this the scrutiny it really deserves, I suppose that it's something that we can return to later. I ask that reviewers continue to focus on concurrency issues and broad design issues, and continue to defer discussion about an eventual value locking implementation. I continue to think that that's the most useful way of proceeding for the time being. My earlier points about probable areas of concern [1] remain a good place for reviewers to start. [1] http://www.postgresql.org/message-id/CAM3SWZSvSrTzPhjNPjahtJ0rFfS-gJFhU86Vpewf+eO8GwZXNQ@mail.gmail.com -- Peter Geoghegan
Attachment
On Fri, Oct 11, 2013 at 2:30 PM, Peter Geoghegan <pg@heroku.com> wrote: >>> But that's simpler than any of the alternatives that I see. >>> Does there really need to be a new snapshot type with one tiny >>> difference that apparently doesn't actually affect conventional >>> clients of MVCC snapshots? >> >> I think that's the wrong way of thinking about it. If you're >> introducing a new type of snapshot, or tinkering with the semantics of >> an existing one, I think that's a reason to reject the patch straight >> off. We should be looking for a design that doesn't require that. If >> we can't find one, I'm not sure we should do this at all. > > I'm confused by this. We need to lock a row not visible to our > snapshot under conventional rules. I think we can rule out > serialization failures at read committed. That just leaves changing > something about the visibility rules of an existing snapshot type, or > creating a new snapshot type, no? > > It would also be unacceptable to update a tuple, and not have the new > row version (which of course will still have "information from the > future") visible to our snapshot - what would regular RETURNING > return? So what do you have in mind? I don't think that locking a row > and updating it are really that distinct anyway. The benefit of > locking is that we don't have to update. We can delete, for example. Well, the SQL standard way of doing this type of operation is MERGE. The alternative we know exists in other databases is REPLACE; there's also INSERT .. ON DUPLICATE KEY update. In all of those cases, whatever weirdness exists around MVCC is confined to that one command. I tend to think we should do similarly, with the goal that HeapTupleSatisfiesMVCC need not change at all. I don't have the only vote here, of course, but my feeling is that that's more likely to be a good route. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
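For comparison, the MySQL statement mentioned above keeps the update action inside the INSERT itself, so whatever MVCC weirdness exists stays within that one command; roughly:

    -- MySQL, not PostgreSQL: the update arm is part of the INSERT statement
    INSERT INTO foo (a, b)
    VALUES (5, '!'), (6, '@')
    ON DUPLICATE KEY UPDATE b = VALUES(b);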
On Tue, Oct 15, 2013 at 5:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Well, the SQL standard way of doing this type of operation is MERGE. > The alternative we know exists in other databases is REPLACE; there's > also INSERT .. ON DUPLICATE KEY update. In all of those cases, > whatever weirdness exists around MVCC is confined to that one command. > I tend to think we should do similarly, with the goal that > HeapTupleSatisfiesMVCC need not change at all. I don't think that it's very pragmatic to define success in terms of not modifying a single visibility function. I feel it would be more useful to define it as providing acceptable, non-surprising semantics, while not regressing performance in other areas. The fact remains that you're going to have to create a new snapshot type even for this special case, so I don't see any win as regards managing invasiveness here. Quite the contrary, in fact. > I don't have the only vote here, of course, but my feeling is that > that's more likely to be a good route. Naturally we all want MERGE. It seems self-defeating to insist on something significantly harder that there is significantly less demand for, though. I thought that there was at least informal agreement that this sort of approach was preferable to MERGE in its full generality, based on feedback at the 2012 developer meeting. I really don't think that what I've done here is any worse than INSERT...ON DUPLICATE KEY UPDATE in any of the areas you express concern about here. REPLACE has some serious problems, and I just don't see it as a viable alternative at all - just ask any MySQL user. MERGE is of course more flexible than what I have here in some ways, but actually less flexible in other ways. I think that the real point of MERGE is that it's defined in a way that serves data warehousing use cases very well: the semantics constrain things such that the executor only has to execute a single ModifyTable node that does inserts, updates and deletes in a single scan. That's great, but what if it's useful to do that CRUD (yes, this can include selects) to entirely different tables? Or what if the relevant DML will only come in a later statement in the same transaction? -- Peter Geoghegan
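The SQL-standard MERGE being contrasted here has, in rough outline, the following single-statement, single-scan shape (shown only for comparison; Postgres had no MERGE implementation at the time, and the details below are just the standard's general form applied to the running example):

    MERGE INTO foo
    USING (VALUES (5, '!'), (6, '@')) AS s(a, b)
      ON foo.a = s.a
    WHEN MATCHED THEN
      UPDATE SET b = s.b
    WHEN NOT MATCHED THEN
      INSERT (a, b) VALUES (s.a, s.b);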
On Tue, Oct 15, 2013 at 8:07 AM, Peter Geoghegan <pg@heroku.com> wrote: > Naturally we all want MERGE. It seems self-defeating to insist on > something significantly harder that there is significant less demand > for, though. I hasten to add: which is not to imply that you're insisting rather than expressing a sentiment. -- Peter Geoghegan
On Tue, Oct 15, 2013 at 11:07 AM, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Oct 15, 2013 at 5:15 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Well, the SQL standard way of doing this type of operation is MERGE. >> The alternative we know exists in other databases is REPLACE; there's >> also INSERT .. ON DUPLICATE KEY update. In all of those cases, >> whatever weirdness exists around MVCC is confined to that one command. >> I tend to think we should do similarly, with the goal that >> HeapTupleSatisfiesMVCC need not change at all. > > I don't think that it's very pragmatic to define success in terms of > not modifying a single visibility function. I feel it would be more > useful to define it as providing acceptable, non-surprising semantics, > while not regressing performance in other areas. > > The fact remains that you're going to have a create a new snapshot > type even for this special case, so I don't see any win as regards > managing invasiveness here. Quite the contrary, in fact. Well, we might have to agree to disagree. >> I don't have the only vote here, of course, but my feeling is that >> that's more likely to be a good route. > > Naturally we all want MERGE. It seems self-defeating to insist on > something significantly harder that there is significant less demand > for, though. I thought that there was at least informal agreement that > this sort of approach was preferable to MERGE in its full generality, > based on feedback at the 2012 developer meeting. I really don't think > that what I've done here is any worse than INSERT...ON DUPLICATE KEY > UPDATE in any of the areas you express concern about here. REPLACE has > some serious problems, and I just don't see it as a viable alternative > at all - just ask any MySQL user. > > MERGE is of course more flexible to what I have here in some ways, but > actually less flexible in other ways. I think that the real point of > MERGE is that it's defined in a way that serves data warehousing use > cases very well: the semantics constrain things such that the executor > only has to execute a single ModifyTable node that does inserts, > updates and deletes in a single scan. That's great, but what if it's > useful to do that CRUD (yes, this can include selects) to entirely > different tables? Or what if the relevant DML will only come in a > later statement in the same transaction? I'm not saying "go implement MERGE". I'm saying, make the insert-or-update operation a single statement, using some syntax TBD, instead of requiring the use of a new insert statement that makes invisible rows visible as a side effect, so that you can wrap that in a CTE and feed it to an update statement. That's complex and, AFAICS, unlike how any other database product handles this. Again, other people can have different opinions on this, and that's fine. I'm just giving you mine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-10-15 11:11:24 -0400, Robert Haas wrote: > I'm not saying "go implement MERGE". I'm saying, make the > insert-or-update operation a single statement, using some syntax TBD, > instead of requiring the use of a new insert statement that makes > invisible rows visible as a side effect, so that you can wrap that in > a CTE and feed it to an update statement. That's complex and, AFAICS, > unlike how any other database product handles this. I think we most definitely should provide a single statement variant. That's the one users yearn for. I also would like a variant where I can lock a row on conflict, for multimaster scenarios, but that doesn't necessarily have to be exposed to SQL. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-10-15 10:19:17 -0700, Peter Geoghegan wrote: > On Tue, Oct 15, 2013 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > Well, I don't know that any of us can claim to have a lock on what the > > syntax should look like. > > Sure. But it's not just syntax. We're talking about functional > differences too, since you're talking about mandating an update, which > is a not the same as an "update locked row only conditionally", or a > delete. I think anything that only works by breaking visibility rules that way is a nonstarter. Doing that from the C level is one thing, exposing it this way seems a bad idea. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 15, 2013 at 10:29 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I think anything that only works by breaking visibility rules that way > is a nonstarter. Doing that from the C level is one thing, exposing it > this way seems a bad idea. What visibility rule is that? Upsert *has* to do effectively the same thing as what I've proposed - there is no getting away from it. So maybe the visibility rulebook (which as far as I can tell is "the way things work today") needs to be updated. If we did, say, INSERT...ON DUPLICATE KEY UPDATE, we'd have to update a row with potentially no visible-to-snapshot version *at all*, and make a new version of that visible. That's just what it takes. What's the difference between that and just locking? If the only difference is that it isn't necessary to modify tqual.c because you're passing a tid directly, that isn't a user-visible difference - the "rule" has been broken just the same. Arguably, it's even more of a hack, since it's a special, out-of-band visibility exception. I'm happy to have total scrutiny of changes to tqual.c, but I'm surprised that the mere fact of it having been modified is being weighed so heavily. Another thing that I'm not clear on is how an update can be backed out of if the row is modified by another xact. As I think I've already illustrated, the row locking that takes place has to be kind of opportunistic. I'm sure you could do it, but it would probably be quite invasive. -- Peter Geoghegan
On 10/15/2013 08:11 AM, Robert Haas wrote: > I'm not saying "go implement MERGE". I'm saying, make the > insert-or-update operation a single statement, using some syntax TBD, > instead of requiring the use of a new insert statement that makes > invisible rows visible as a side effect, so that you can wrap that in > a CTE and feed it to an update statement. That's complex and, AFAICS, > unlike how any other database product handles this. Hmmm. Is the plan NOT to eventually get to a single-statement upsert? If not, then I'm not that keen on this feature. I can't say that anybody I know who's migrating from MySQL would use a 2-statement version of upsert; if they were prepared for that, then they'd be prepared to just rewrite their stuff as proper insert/updates anyway. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Oct 15, 2013 at 10:58 AM, Josh Berkus <josh@agliodbs.com> wrote: > Hmmm. Is the plan NOT to eventually get to a single-statement upsert? > If not, then I'm not that keen on this feature. See the original e-mail in the thread for what I imagine idiomatic usage will look like. http://www.postgresql.org/message-id/CAM3SWZThwrKtvurf1aWAiH8qThGNMZAfyDcNw8QJu7pqHk5AGQ@mail.gmail.com -- Peter Geoghegan
On Tue, Oct 15, 2013 at 11:05 AM, Peter Geoghegan <pg@heroku.com> wrote: > See the original e-mail in the thread for what I imagine idiomatic > usage will look like. > > http://www.postgresql.org/message-id/CAM3SWZThwrKtvurf1aWAiH8qThGNMZAfyDcNw8QJu7pqHk5AGQ@mail.gmail.com Note also that this doesn't preclude a variant with a more direct update part (not that I think that's all that compelling). Doing things this way was motivated by: 1) Serving the needs of logical changeset generation plugins, even if Andres doesn't think that needs to be exposed through SQL. He and I both want something that does this with low overhead (in particular, no subtransactions). 2) Getting something effective into the next release. MERGE-like flexibility seems like a very desirable thing. And the implementation's infrastructure can be used by an eventual MERGE implementation. 3) Being simple enough that huge bike shedding over syntax might not be necessary. Making insert statements grow an update tumor is likely to get messy fast. I know because I tried it myself. -- Peter Geoghegan
Peter, > Note also that this doesn't preclude a variant with a more direct > update part (not that I think that's all that compelling). Doing > things this way was motivated by: I can see the value in the CTE format for this for existing PostgreSQL users. (although, AFAICT it doesn't allow for the implementation of one of my personal desires, which is UPDATE ... ON NOT FOUND INSERT, for cases where updates are expected to occur 95% of the time, but that's another topic. Unless "rejects" for an Update could be the leftover rows, but then we're getting into full MERGE.). I'm just pointing out that this doesn't do much for the MySQL migration case; the rewrite is too complex to automate. I'd been assuming that we had some plans to implement a MySQL-friendly syntax for 9.5, and this version was a stepping stone to that. Does this version make a distinction between PRIMARY KEY constraints and UNIQUE indexes? If not, how does it pick among keys? If so, what about tables with no PRIMARY KEY for various reasons (like unique GiST indexes?) -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
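A sketch of how Josh's update-first idea might be spelled, purely for illustration - no such syntax exists or is being proposed in the patch:

    -- hypothetical only: try the update, fall back to an insert if no row
    -- was found
    update foo set b = '!' where a = 5
      on not found insert (a, b) values (5, '!');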
On Tue, Oct 15, 2013 at 11:23 AM, Josh Berkus <josh@agliodbs.com> wrote: > (although, AFAICT it doesn't allow for the implementation of one of my > personal desires, which is UPDATE ... ON NOT FOUND INSERT, for cases > where updates are expected to occur 95% of the time, but that's another > topic. Unless "rejects" for an Update could be the leftover rows, but > then we're getting into full MERGE.). This isn't really all that inefficient for that case. Certainly, the balance in cost between mostly-insert cases and mostly-update cases is a strength of my basic approach over others. > Does this version make a distinction between PRIMARY KEY constraints and > UNIQUE indexes? If not, how does it pick among keys? If so, what about > tables with no PRIMARY KEY for various reasons (like unique GiST indexes?) We thought about prioritizing where to look (mostly as a performance optimization), but right now no. It works with amcanunique methods, which in practice means btrees. There is no such thing as a GiST unique index, so I guess you're referring to an exclusion constraint on an equality operator. That doesn't work with this, but why would you want it to? As for generalizing this to work with exclusion constraints, which I guess you might have also meant, that's a much more difficult and much less compelling proposition, in my opinion. -- Peter Geoghegan
On 10/15/2013 11:38 AM, Peter Geoghegan wrote: > We thought about prioritizing where to look (mostly as a performance > optimization), but right now no. It works with amcanunique methods, > which in practice means btrees. There is no such thing as a GiST > unique index, so I guess you're referring to an exclusion constraint > on an equality operator. That doesn't work with this, but why would > you want it to? As for generalizing this to work with exclusion > constraints, which I guess you might have also meant, that's a much > more difficult and much less compelling proposition, in my opinion. Yeah, that was one thing I was thinking of. Also, because you can't INDEX CONCURRENTLY a PK, I've been building a lot of databases which have no PKs, only UNIQUE indexes. Historically, this hasn't been an issue because aside from wonky annoyances (like the CONCURRENTLY case), Postgres doesn't distinguish between UNIQUE indexes and PRIMARY KEYs -- as, indeed, it shouldn't, since they're both keys, and the whole concept of a "primary key" is a legacy of index-organized databases, which PostgreSQL is not. However, it does seem like the new syntax could be extended with an optional "USING unique_index_name" in the future (9.5), no? I'm just checking that we're not painting ourselves into a corner with this particular implementation. It's OK if it doesn't implement most things now; it's bad if it is impossible to build on and we have to support it forever. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Oct 15, 2013 at 11:55 AM, Josh Berkus <josh@agliodbs.com> wrote: > However, it does seem like the new syntax could be extended with and > optional "USING unqiue_index_name" in the future (9.5), no? There is no reason why we couldn't do that and just consider that one unique index. Whether we should is another question - I certainly think that mandating it would be very bad. > I'm just checking that we're not painting ourselves into a corner with > this particular implementation. It's OK if it doesn't implement most > things now; it's bad if it is impossible to build on and we have to > support it forever. I don't believe it does. In essence this just simply inserts a row, and rather than throwing a unique constraint violation, locks the row that prevented insertion from proceeding in respect of any tuple proposed for insertion where it does not. That's all. You can build lots of things with it that you can't today. Or you can not use it at all. So that covers semantics, I'd say. As for implementation: I believe that the implementation is by far the most forward thinking (in terms of building infrastructure for a proper MERGE) of any proposal to date. -- Peter Geoghegan
On 10/15/2013 12:03 PM, Peter Geoghegan wrote: > On Tue, Oct 15, 2013 at 11:55 AM, Josh Berkus <josh@agliodbs.com> wrote: >> However, it does seem like the new syntax could be extended with and >> optional "USING unqiue_index_name" in the future (9.5), no? > > There is no reason why we couldn't do that and just consider that one > unique index. Whether we should is another question - What's the "shouldn't" argument, if any? > I certainly > think that mandating it would be very bad. Agreed. If there is a PK, we should allow the user to use it implicitly. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
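If the syntax were later extended in the direction Josh suggests, it might look something like the following (hypothetical; the patch accepts no index name, and both the clause placement and the index name foo_a_key are made up for illustration):

    with r as (
      insert into foo(a,b)
      values (5, '!')
      on duplicate key lock for update
      using foo_a_key          -- consider only this one unique index
      returning rejects *
    )
    update foo set b = r.b from r where foo.a = r.a;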
On 2013-10-15 10:53:35 -0700, Peter Geoghegan wrote: > On Tue, Oct 15, 2013 at 10:29 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > I think anything that only works by breaking visibility rules that way > > is a nonstarter. Doing that from the C level is one thing, exposing it > > this way seems a bad idea. > > What visibility rule is that? The early return you added to HTSMVCC. At the very least it opens you to lots of halloween problem like scenarios. > Upsert *has* to do effectively the same thing as what I've proposed - > there is no getting away from it. So maybe the visibility rulebook > (which as far as I can tell is "the way things work today") needs to > be updated. If we did, say, INSERT...ON DUPLICATE KEY UPDATE, we'd > have to update a row with potentially no visible-to-snapshot version > *at all*, and make a new version of that visible. That's just what it > takes. What's the difference between that and just locking? If the > only difference is that it isn't necessary to modify tqual.c because > you're passing a tid directly, that isn't a user-visible difference - > the "rule" has been broken just the same. Arguably, it's even more of > a hack, since it's a special, out-of-band visibility exception. No, doing it in special case code is fundamentally different since those locations deal only with one row at a time. There's no scans that can pass over that row. That's why I think exposing the "on conflict lock" logic to anything but C isn't going to fly btw. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-10-15 11:23:44 -0700, Josh Berkus wrote: > (although, AFAICT it doesn't allow for the implementation of one of my > personal desires, which is UPDATE ... ON NOT FOUND INSERT, for cases > where updates are expected to occur 95% of the time, but that's another > topic. Unless "rejects" for an Update could be the leftover rows, but > then we're getting into full MERGE.). FWIW I can't see the above syntax as something working very well - you fundamentally have to SET every column and it only makes sense in UPDATEs that provably affect only one row. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-10-15 11:55:06 -0700, Josh Berkus wrote: > Also, because you can't INDEX CONCURRENTLY a PK, I've been building a > lot of databases which have no PKs, only UNIQUE indexes. You know that you can add prebuilt primary keys using ALTER TABLE ... ADD CONSTRAINT ... PRIMARY KEY (...) USING indexname? > Postgres doesn't distinguish between UNIQUE indexes > and PRIMARY KEYs -- as, indeed, it shouldn't, since they're both keys, > and the whole concept of a "primary key" is a legacy of index-organized > databases, which PostgreSQL is not. There are some other differences; for one, primary keys are automatically picked up by foreign keys if the referenced columns aren't specified, for another we do not yet automatically recognize NOT NULL UNIQUE columns in GROUP BY. > However, it does seem like the new syntax could be extended with an > optional "USING unique_index_name" in the future (9.5), no? Yes. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
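Spelled out, the prebuilt-primary-key route Andres refers to is existing PostgreSQL syntax (the index name here is made up):

    create unique index concurrently foo_a_idx on foo (a);
    alter table foo add constraint foo_pkey primary key using index foo_a_idx;
    -- the index is renamed to match the constraint name and becomes the
    -- table's primary key; the indexed column(s) must be (or will be made)
    -- NOT NULL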
On 10/15/2013 02:31 PM, Andres Freund wrote: > On 2013-10-15 11:55:06 -0700, Josh Berkus wrote: >> Also, because you can't INDEX CONCURRENTLY a PK, I've been building a >> lot of databases which have no PKs, only UNIQUE indexes. > > You know that you can add prebuilt primary keys using ALTER TABLE > ... ADD CONSTRAINT ... PRIMARY KEY (...) USING indexname? That still requires an ACCESS EXCLUSIVE lock, and then can't be dropped using DROP INDEX CONCURRENTLY. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Oct 15, 2013 at 2:25 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-10-15 10:53:35 -0700, Peter Geoghegan wrote: >> On Tue, Oct 15, 2013 at 10:29 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> > I think anything that only works by breaking visibility rules that way >> > is a nonstarter. Doing that from the C level is one thing, exposing it >> > this way seems a bad idea. >> >> What visibility rule is that? > > The early return you added to HTSMVCC. > > At the very least it opens you to lots of halloween problem like > scenarios. The term "visibility rule" as you've used it here is suggestive of some authoritative rule that should obviously never even be bent. I'd suggest that what Postgres does isn't very useful as an authority on this matter, because Postgres doesn't have upsert. Besides, today Postgres doesn't just bend the rules (that is, some kind of classic notion of MVCC as described in "Concurrency Control in Distributed Database Systems" or something), it totally breaks them, at least in READ COMMITTED mode (and what I've proposed here just occurs in RC mode). It is not actually in evidence that this approach introduces Halloween problems. In order for HTSMVCC to controversially indicate visibility under my scheme, it is not sufficient for the row version to just be exclusive locked by our xact without otherwise being visible - it must also *not be updated*. Now, I'll freely admit that this could still be problematic - there might have been a subtlety I missed. But since an actual example of where this is problematic hasn't been forthcoming, I take it that it isn't obvious to either yourself or Robert that it actually is. Any scheme that involves playing cute tricks with visibility (which is to say, any credible upsert implementation) needs very careful thought. -- Peter Geoghegan
On Tue, Oct 15, 2013 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I'm not saying "go implement MERGE". I'm saying, make the > insert-or-update operation a single statement, using some syntax TBD, > instead of requiring the use of a new insert statement that makes > invisible rows visible as a side effect, so that you can wrap that in > a CTE and feed it to an update statement. That's complex and, AFAICS, > unlike how any other database product handles this. Well, lots of other databases have their own unique way of doing this - apart from MySQL's INSERT...ON DUPLICATE KEY UPDATE, there is a variant within Teradata, Sybase and SQLite. They're all different. And in the case of Teradata, it was an interim feature towards MERGE which came in a much later release, which is how I see this. No other database system even has writeable CTEs, of course. It's a fairly recent idea. > Again, other people can have different opinions on this, and that's > fine. I'm just giving you mine. I will defer to the majority opinion here. But you also expressed concern about surprising results due to the wrong unique constraint violation being the source of a conflict. Couldn't this syntax (with the wCTE upsert pattern) help with that, by naming the constant inserted in the update too? It would be pretty simple to expose that, and far less grotty than naming a unique index in DML. -- Peter Geoghegan
On Tue, Oct 15, 2013 at 11:34 AM, Peter Geoghegan <pg@heroku.com> wrote: >> Again, other people can have different opinions on this, and that's >> fine. I'm just giving you mine. > > I will defer to the majority opinion here. But you also expressed > concern about surprising results due to the wrong unique constraint > violation being the source of a conflict. Couldn't this syntax (with > the wCTE upsert pattern) help with that, by naming the constant > inserted in the update too? It would be pretty simple to expose that, > and far less grotty than naming a unique index in DML. Well, I don't know that any of us can claim to have a lock on what the syntax should look like. I think we need to hear some proposals. You've heard my gripe about the current syntax (which Andres appears to share), but I shan't attempt to prejudice you in favor of my preferred alternative, because I don't have one yet. There could be other ways of avoiding that problem, though. Here's an example: UPSERT table (keycol1, ..., keycoln) = (keyval1, ..., keyvaln) SET (nonkeycol1, ..., nonkeycoln) = (nonkeyval1, ..., nonkeyvaln) That's pretty ugly on multiple levels, and I'm definitely not proposing that exact thing, but the idea is: look for a record that matches on the key columns/values; if found, update the non-key columns with the corresponding values; if not found, construct a new row with both the key and nonkey column sets and insert it. If no matching unique index exists we'll have to fail, but we stop short of having to mention the name of that index. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
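Filling in Robert's strawman with the running example's columns (again, explicitly not a proposal, just the shape made concrete):

    -- strawman only: match on key column a; update b if found, else insert
    UPSERT foo (a) = (5) SET (b) = ('!');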
On Tue, Oct 15, 2013 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Well, I don't know that any of us can claim to have a lock on what the > syntax should look like. Sure. But it's not just syntax. We're talking about functional differences too, since you're talking about mandating an update, which is not the same as an "update locked row only conditionally", or a delete. I get that it's a little verbose, but then this is ORM plumbing for many of those that would prefer a more succinct syntax. Those people would also benefit from having their ORM do something much more powerful for them when needed. > I think we need to hear some proposals. Agreed. > You've heard my gripe about the current syntax (which Andres appears > to share), but I shan't attempt to prejudice you in favor of my > preferred alternative, because I don't have one yet. FWIW, I sincerely see very real advantages to what I've proposed here. To me, the fact that it's convenient to implement is beside the point. > There could be > other ways of avoiding that problem, though. Here's an example: > > UPSERT table (keycol1, ..., keycoln) = (keyval1, ..., keyvaln) SET > (nonkeycol1, ..., nonkeycoln) = (nonkeyval1, ..., nonkeyvaln) > > That's pretty ugly on multiple levels, and I'm definitely not > proposing that exact thing, but the idea is: look for a record that > matches on the key columns/values; if found, update the non-key > columns with the corresponding values; if not found, construct a new > row with both the key and nonkey column sets and insert it. If no > matching unique index exists we'll have to fail, but we stop short of > having to mention the name of that index. What if you want to update the key columns - either the potential conflict-causing one, or another? What about composite unique constraints? MySQL certainly supports all that, for example. -- Peter Geoghegan
On Tue, Oct 15, 2013 at 1:19 PM, Peter Geoghegan <pg@heroku.com> wrote: >> There could be >> other ways of avoiding that problem, though. Here's an example: >> >> UPSERT table (keycol1, ..., keycoln) = (keyval1, ..., keyvaln) SET >> (nonkeycol1, ..., nonkeycoln) = (nonkeyval1, ..., nonkeyvaln) >> >> That's pretty ugly on multiple levels, and I'm definitely not >> proposing that exact thing, but the idea is: look for a record that >> matches on the key columns/values; if found, update the non-key >> columns with the corresponding values; if not found, construct a new >> row with both the key and nonkey column sets and insert it. If no >> matching unique index exists we'll have to fail, but we stop short of >> having to mention the name of that index. > > What if you want to update the key columns - either the potential > conflict-causing one, or another? I'm not sure what that means in the context of an UPSERT operation. If the update case is, when a = 1 then make a = 2, then which value goes in column a when we insert, 1 or 2? But I suppose if you can work that out it's just a matter of mentioning the column as both a key column and a non-key column. > What about composite unique > constraints? MySQL certainly supports all that, for example. That's why it allows you to specify N key columns rather than restricting you to just one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 14.10.2013 07:12, Peter Geoghegan wrote: > On Wed, Oct 9, 2013 at 1:11 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Unfortunately, I have a very busy schedule in the month ahead, >> including travelling to Ireland and Japan, so I don't think I'm going >> to get the opportunity to work on this too much. I'll try and produce >> a V4 that formally proposes some variant of my ideas around visibility >> of locked tuples. > > V4 is attached. > > Most notably, this adds the modifications to HeapTupleSatisfiesMVCC(), > though they're neater than in the snippet I sent earlier. > > There is also some clean-up around row-level locking. That code has > been simplified. I also try and handle serialization failures in a > better way, though that really needs the attention of a subject matter > expert. > > There are a few additional XXX comments highlighting areas of concern, > particularly around serializable behavior. I've deferred making higher > isolation levels care about wrongfully relying on the special > HeapTupleSatisfiesMVCC() exception (e.g. they won't throw a > serialization failure, mostly because I couldn't decide on where to do > the test on time prior to travelling tomorrow). > > I've added code to do heap_prepare_insert before value locks are held. > Whatever our eventual value locking implementation, that's going to be > a useful optimization. Though unfortunately I ran out of time to give > this the scrutiny it really deserves, I suppose that it's something > that we can return to later. > > I ask that reviewers continue to focus on concurrency issues and broad > design issues, and continue to defer discussion about an eventual > value locking implementation. I continue to think that that's the most > useful way of proceeding for the time being. My earlier points about > probable areas of concern [1] remain a good place for reviewers to > start. I think it's important to recap the design goals of this. I don't think these have been listed before, so let me try: * It should be usable and perform well for both large batch updates and small transactions. * It should perform well both when there are no duplicates, and when there are lots of duplicates And from that follows some finer requirements: * Performance when there are no duplicates should be close to raw INSERT performance. * Performance when all rows are duplicates should be close to raw UPDATE performance. * We should not leave behind large numbers of dead tuples in either case. Anything else I'm missing? What about exclusion constraints? I'd like to see this work for them as well. Currently, exclusion constraints are checked after the tuple is inserted, and you abort if the constraint was violated. We could still insert the heap and index tuples first, but instead of aborting on violation, we would kill the heap tuple we already inserted and retry. There are some complications there, like how to wake up any other backends that are waiting to grab a lock on the tuple we just killed, but it seems doable. That would, however, perform badly and leave garbage behind if there are duplicates. A refinement of that would be to first check for constraint violations, then insert the tuple, and then check again. That would avoid the garbage in most cases, but would perform much more poorly when there are no duplicates, because it needs two index scans for every insertion. A further refinement would be to keep track of how many duplicates there have been recently, and switch between the two strategies based on that. 
That cost of doing two scans could be alleviated by using markpos/restrpos to do the second scan. That is presumably cheaper than starting a whole new scan with the same key. (markpos/restrpos don't currently work for non-MVCC snapshots, so that'd need to be fixed, though). And that detour with exclusion constraints takes me back to the current patch :-). What if you implemented the unique check in a similar fashion too (when doing INSERT ON DUPLICATE KEY ...)? First, scan for a conflicting key, and mark the position. Then do the insertion to that position. If the insertion fails because of a duplicate key (which was inserted after we did the first scan), mark the heap tuple as dead, and start over. The indexam changes would be quite similar to the changes you made in your patch, but instead of keeping the page locked, you'd only hold a pin on the target page (if even that). The first indexam call would check that the key doesn't exist, and remember the insert position. The second call would re-find the previous position, and insert the tuple, checking again that there really wasn't a duplicate key violation. The locking aspects would be less scary than your current patch. I'm not sure if that would perform as well as your current patch. I must admit your current approach is pretty optimal performance-wise. But I'd like to see it, and that would be a solution for exclusion constraints in any case. One limitation with your current approach is that the number of lwlocks you can hold simultaneously is limited (MAX_SIMUL_LWLOCKS == 100). Another limitation is that the minimum for shared_buffers is only 16. Neither of those is a serious problem in real applications - no-one runs with shared_buffers=16 and no sane schema has a hundred unique indexes, but it's still something to consider. - Heikki
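For readers unfamiliar with the exclusion constraint case being discussed, the following is the sort of schema involved. The table is invented for illustration, but the SQL itself works today: a conflicting insert simply raises an error, which is the behavior the proposal above would improve on.

    -- Illustration only: an exclusion constraint of the kind discussed above.
    create extension btree_gist;
    create table reservation (
        room   int4,
        during tsrange,
        exclude using gist (room with =, during with &&)
    );
    insert into reservation values (1, '[2013-12-01,2013-12-02)');
    insert into reservation values (1, '[2013-12-01 18:00,2013-12-03)');
    -- ERROR: conflicting key value violates exclusion constraint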
On Mon, Nov 18, 2013 at 6:44 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > I think it's important to recap the design goals of this. Seems reasonable to list them out. > * It should be usable and perform well for both large batch updates and > small transactions. I think that that's a secondary goal, a question to be considered but perhaps deferred during this initial effort. I agree that it certainly is important. > * It should perform well both when there are no duplicates, and when there > are lots of duplicates I think this is very important. > And from that follows some finer requirements: > > * Performance when there are no duplicates should be close to raw INSERT > performance. > > * Performance when all rows are duplicates should be close to raw UPDATE > performance. > > * We should not leave behind large numbers of dead tuples in either case. I agree with all that. > Anything else I'm missing? I think so, yes. I'll add: * Should not deadlock unreasonably. If the UPDATE case is to work and perform almost as well as a regular UPDATE, that must mean that it has essentially the same characteristics as plain UPDATE. In particular, I feel fairly strongly that it is not okay for upserts to deadlock with each other unless the possibility of each transaction locking multiple rows (in an inconsistent order) exists. I don't want to repeat the mistakes of MySQL here. This is a point that I stressed to Robert on a previous occasion [1]. It's why value locks and row locks cannot be held at the same time. Incidentally, that implies that all alternative schemes involving bloat will bloat once per attempt, I believe. I'll also add: * Should anticipate a day when Postgres needs plumbing for SQL MERGE, which is still something we want, particularly for batch operations. I realize that the standard doesn't strictly require MERGE to handle the concurrency issues, but even still I don't think that an implementation that doesn't is practicable - does such an implementation currently exist in any other system? > What about exclusion constraints? I'd like to see this work for them as > well. Currently, exclusion constraints are checked after the tuple is > inserted, and you abort if the constraint was violated. We could still > insert the heap and index tuples first, but instead of aborting on > violation, we would kill the heap tuple we already inserted and retry. There > are some complications there, like how to wake up any other backends that > are waiting to grab a lock on the tuple we just killed, but it seems doable. I agree that it's at least doable. > That would, however, perform badly and leave garbage behind if there are > duplicates. A refinement of that would be to first check for constraint > violations, then insert the tuple, and then check again. That would avoid > the garbage in most cases, but would perform much more poorly when there are > no duplicates, because it needs two index scans for every insertion. A > further refinement would be to keep track of how many duplicates there have > been recently, and switch between the two strategies based on that. Seems like an awful lot of additional mechanism. > That cost of doing two scans could be alleviated by using markpos/restrpos > to do the second scan. That is presumably cheaper than starting a whole new > scan with the same key. 
(markpos/restrpos don't currently work for non-MVCC > snapshots, so that'd need to be fixed, though) Well, it seems like we could already use a "pick up where you left off" mechanism in the case of regular btree index tuple insertions into unique indexes -- after all, we don't do that in the event of blocking pending the outcome of the other transaction (that inserted a duplicate that we need to conclusively know has or has not committed) today. The fact that this doesn't already exist leaves me less than optimistic about the prospect of making it work to facilitate a scheme such as the one you describe here. (Today we still need to catch a committed version of the tuple that would make our tuple a duplicate from a fresh index scan, only *after* waiting for a transaction to commit/abort at the end of our original index scan). So we're already pretty naive about this, even though it would pay to not be. Making something like markpos work for the purposes of an upsert implementation seems not only hard, but also like a possible modularity violation. Are we not unreasonably constraining the implementation going forward? My patch respects the integrity of the am abstraction, and doesn't really add any knowledge to the core system about how amcanunique index methods might go about implementing the new "amlock" method. The core system worries a bit about the "low level locks" (as it naively refers to value locks), and doesn't consider that it has the right to hold on to them for more than an instant, but that's about it. Plus we don't have to worry about whether something does or does not work for a certain snapshot type with my approach, because as with the current unique index btree coding, it operates at a lower level than that, and does not need to consider visibility as such. The markpos and restpos am methods only called for regular index (only) scans, that don't need to worry about things that are not visible. Of course, upsert needs to worry about invisible-but-conclusively-live things. This seems much harder, and basically implies value locking of some kind, if I'm not mistaken. So have you really gained anything? So what I've done, aside from being, as you say below, close to optimal, is in a sense defined in terms of existing, well-established abstractions. I feel it's easier to reason about the implications of holding value locks (whatever the implementation) for longer and across multiple operations than it is to do all this instead. What I've done with locking is scary, but not as scary as the worst case of alternative implementations. > And that detour with exclusion constraints takes me back to the current > patch :-). What if you implemented the unique check in a similar fashion too > (when doing INSERT ON DUPLICATE KEY ...)? First, scan for a conflicting key, > and mark the position. Then do the insertion to that position. If the > insertion fails because of a duplicate key (which was inserted after we did > the first scan), mark the heap tuple as dead, and start over. The indexam > changes would be quite similar to the changes you made in your patch, but > instead of keeping the page locked, you'd only hold a pin on the target page > (if even that). The first indexam call would check that the key doesn't > exist, and remember the insert position. The second call would re-find the > previous position, and insert the tuple, checking again that there really > wasn't a duplicate key violation. The locking aspects would be less scary > than your current patch. 
> > I'm not sure if that would perform as well as your current patch. I must > admit your current approach is pretty optimal performance-wise. But I'd like > to see it, and that would be a solution for exclusion constraints in any > case. I'm certainly not opposed to making something like this work for exclusion constraints. Certainly, I want this to be as general as possible. But I don't think that it needs to be a blocker, and I don't think we gain anything in code footprint by addressing that by being as general as possible in our approach to the basic concurrency issue. After all, we're going to have to repeat the basic pattern in multiple modules. With exclusion constraints, we'd have to worry about a single slot proposed for insertion violating (and therefore presumably obliging us to lock) every row in the table. Are we going to have a mechanism for spilling a tid array potentially sized in gigabytes to disk (relating to just one slot proposed for insertion)? Is it principled to have that one slot project out rejects consisting of (say) the entire table? Is it even useful to lock multiple rows if we can't really update them, because they'll overlap each other when all updated with the one value? These are complicated questions, and frankly I don't have the bandwidth to answer them too soon. I just want to implement a feature that there is obviously huge pent up demand for, that has in the past put Postgres at a strategic disadvantage. I don't think it is unsound to define ON DUPLICATE KEY in terms of unique indexes. That's how we represent uniques...it isn't spelt ON OVERLAPPING or whatever. That seems like an addition, a nice-to-have, and maybe not even that, because exclusion-constrained columns *aren't* keys, and people aren't likely to want to upsert details of a booking (the typical exclusion constraint use-case) with the booking range in the UPDATE part's predicate. They'd just do it by key, because they'd already have a booking number PK value or whatever. Making this perform as well as possible is an important consideration. All alternative approaches that involve bloat concern me, and for reasons that I'm not sure were fully appreciated during earlier discussion on this thread: I'm worried about the worst case, not the average case. I am worried about a so-called "thundering herd" scenario. You need something like LockTuple() to arbitrate ordering, which seems complex, and like a further modularity violation. If this is to perform well when there are lots of existing tuples to be updated (with contention that wouldn't be considered unreasonable for plain updates), the amount of bloat generated by a thundering herd could be really really bad (once per attempt per "head of cattle"/upserter) . It's hard to say for sure how much of a problem this is, but I think it needs to be considered. It's a problem that I'm not sure we have the tools to analyze ahead of time. It's easier to pin down and reason about the conventional value locking stuff, because we know how deadlocks work. We know how to do analysis of deadlock hazards, and the surface area actually turns out to be not too large there. > One fairly limitation with your current approach is that the number of > lwlocks you can hold simultaneously is limited (MAX_SIMUL_LWLOCKS == 100). > Another limitation is that the minimum for shared_buffers is only 16. 
> Neither of those is a serious problem in real applications - no-one runs > with shared_buffers=16 and no sane schema has a hundred unique indexes, but > it's still something to consider. I was under the impression, based on previous feedback, that what I've done with LWLocks was unlikely to be accepted. I proceeded under the assumption that we'll be able to ameliorate these problems, as for example by implementing an alternative value locking mechanism (an SLRU?) that is similar to what I've done to date (in particular, very cheap and fast), but without all the down-sides that concerned Robert and Andres, and now you. As I said, I still think that's easier and safer than all alternative approaches described to date. It just so happens that I also believe it will perform a lot better in the average case too, but that isn't a key advantage to my mind. You're right that the value locking is scary. I think we need to very carefully consider it, once I have buy-in on the basic approach. I really do think it's the least-worst approach described to date. It isn't like we can't discuss making it inherently less scary, but I hesitate to do that now, given that I don't know if that discussion will go anywhere. Thanks for your efforts on reviewing my work here! Do you think it would be useful at this juncture to write a patch to make the order of locking across unique indexes well-defined? I think it may well have independent value to get the insertion into unique indexes (that can throw errors) out of the way when doing a regular slot insertion. Better to abort the transaction as soon as possible. [1] http://www.postgresql.org/message-id/CAM3SWZRfrw+zXe7CKt6-QTCuvKQ-Oi7gnbBOPqQsvddU=9M7_g@mail.gmail.com -- Peter Geoghegan
On Mon, Nov 18, 2013 at 4:37 PM, Peter Geoghegan <pg@heroku.com> wrote: > You're right that the value locking is scary. I think we need to very > carefully consider it, once I have buy-in on the basic approach. I > really do think it's the least-worst approach described to date. It > isn't like we can't discuss making it inherently less scary, but I > hesitate to do that now, given that I don't know if that discussion > will go anywhere. One possible compromise would be "promise tuples" where we know we'll be able to keep our promise. In other words: 1. We lock values in the first phase, in more or less the manner of the extant patch. 2. When a consensus exists that heap tuple insertion proceeds, we proceed with insertion of these promise index tuples (and probably keep just a pin on the relevant pages). 3. Proceed with insertion of the heap tuple (with no "value locks" of any kind held). 4. Go back to the unique indexes, update the heap tid and unset the index tuple flag (that indicates that the tuples are in this promise state). Probably we can even be bright about re-finding the existing promise tuples with their proper heap tid (e.g. maybe we can avoid doing a regular index scan at least some of the time - chances are pretty good that the index tuple is on the same page as before, so it's generally well worth a shot looking there first). As with the earlier promise tuple proposals, we store our xid in the ItemPointer. 5. Finally, insertion of non-unique index tuples occurs in the regular manner. Obviously the big advantage here is that we don't have to worry about value locking across heap tuple insertion at all, and yet we don't have to worry about bloating, because we really do know that insertion proper will proceed when inserting *this* type of promise index tuple. Maybe that even makes it okay to just use buffer locks, if we think some more about the other edge cases. Regular index scans take the aforementioned flag as a kind of visibility hint, perhaps, so we don't have to worry about them. And VACUUM would kill any dead promise tuples - this would be much less of a concern than with the earlier promise tuple proposals, because it is extremely non routine. Maybe it's fine to not make autovacuum concerned about a whole new class of (index-only) bloat, which seemed like a big problem with those earlier proposals, simply because crashes within this tiny window are hopefully so rare that it couldn't possibly amount to much bloat in the grand scheme of things (at least before a routine VACUUM - UPDATEs tend to necessitate those). If you have 50 upserting backends in this tiny window during a crash, that would be only 50 dead index tuples. Given the window is so tiny, I doubt it would be much of a problem at all - even 50 seems like a very high number. The track_counts counts that drive autovacuum here are already not crash safe, so I see no regression. Now, you still have to value lock across multiple btree unique indexes, and I understand there are reservations about this. But the surface area is made significantly smaller at reasonably low cost. Furthermore, doing TOASTing out-of-line and so on ceases to be necessary. The LOCK FOR UPDATE case is the same as before. Nothing else changes. FWIW, without presuming anything about value locking implementation, I'm not too worried about making the implementation scale to very large numbers of unique indexes, with very low shared_buffer settings. We already have a fairly similar situation with max_locks_per_transaction and so on, no?
-- Peter Geoghegan
On 19.11.2013 02:37, Peter Geoghegan wrote: > On Mon, Nov 18, 2013 at 6:44 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> * It should be usable and perform well for both large batch updates and >> small transactions. > > I think that that's a secondary goal, a question to be considered but > perhaps deferred during this initial effort. I agree that it certainly > is important. Ok. Which use case are you targeting during this initial effort, batch updates or small OLTP transactions? >> Anything else I'm missing? > > I think so, yes. I'll add: > > * Should not deadlock unreasonably. > > If the UPDATE case is to work and perform almost as well as a regular > UPDATE, that must mean that it has essentially the same > characteristics as plain UPDATE. In particular, I feel fairly strongly > that it is not okay for upserts to deadlock with each other unless the > possibility of each transaction locking multiple rows (in an > inconsistent order) exists. Agreed. >> What about exclusion constraints? I'd like to see this work for them as >> well. Currently, exclusion constraints are checked after the tuple is >> inserted, and you abort if the constraint was violated. We could still >> insert the heap and index tuples first, but instead of aborting on >> violation, we would kill the heap tuple we already inserted and retry. There >> are some complications there, like how to wake up any other backends that >> are waiting to grab a lock on the tuple we just killed, but it seems doable. > > I agree that it's at least doable. > >> That would, however, perform badly and leave garbage behind if there are >> duplicates. A refinement of that would be to first check for constraint >> violations, then insert the tuple, and then check again. That would avoid >> the garbage in most cases, but would perform much more poorly when there are >> no duplicates, because it needs two index scans for every insertion. A >> further refinement would be to keep track of how many duplicates there have >> been recently, and switch between the two strategies based on that. > > Seems like an awful lot of additional mechanism. Not really. Once you have the code in place to do the kill-inserted-tuple dance on a conflict, all you need is to do an extra index search before it. And once you have that, it's not hard to add some kind of a heuristic to either do the pre-check or skip it. >> That cost of doing two scans could be alleviated by using markpos/restrpos >> to do the second scan. That is presumably cheaper than starting a whole new >> scan with the same key. (markpos/restrpos don't currently work for non-MVCC >> snapshots, so that'd need to be fixed, though) > > Well, it seems like we could already use a "pick up where you left > off" mechanism in the case of regular btree index tuple insertions > into unique indexes -- after all, we don't do that in the event of > blocking pending the outcome of the other transaction (that inserted a > duplicate that we need to conclusively know has or has not committed) > today. The fact that this doesn't already exist leaves me less than > optimistic about the prospect of making it work to facilitate a scheme > such as the one you describe here. (Today we still need to catch a > committed version of the tuple that would make our tuple a duplicate > from a fresh index scan, only *after* waiting for a transaction to > commit/abort at the end of our original index scan). So we're already > pretty naive about this, even though it would pay to not be. 
We just haven't bothered to optimize for the case that you have to wait. That's going to be slow anyway. Also, after sleeping, the insertion position might've moved right a lot, if a lot of insertions happened during the sleep, so it might be best to do a new scan anyway. > Making something like markpos work for the purposes of an upsert > implementation seems not only hard, but also like a possible > modularity violation. Are we not unreasonably constraining the > implementation going forward? My patch respects the integrity of the > am abstraction, and doesn't really add any knowledge to the core > system about how amcanunique index methods might go about implementing > the new "amlock" method. The core system worries a bit about the "low > level locks" (as it naively refers to value locks), and doesn't > consider that it has the right to hold on to them for more than an > instant, but that's about it. Plus we don't have to worry about > whether something does or does not work for a certain snapshot type > with my approach, because as with the current unique index btree > coding, it operates at a lower level than that, and does not need to > consider visibility as such. > > The markpos and restpos am methods only called for regular index > (only) scans, that don't need to worry about things that are not > visible. Of course, upsert needs to worry about > invisible-but-conclusively-live things. This seems much harder, and > basically implies value locking of some kind, if I'm not mistaken. So > have you really gained anything? I probably shouldn't have mentioned markpos/restrpos, you're right that it's not a good idea to conflate that with index insertion. Nevertheless, some kind of an API for doing a duplicate-key check prior to insertion, and remembering the location for the actual insert later, seems sensible. It's certainly no more of a modularity violation than the value-locking scheme you're proposing. What I'm thinking is a new indexam function, let's call it "pre-insert". The pre-insert function checks for any possible unique key violations, just like insertion, but doesn't modify the index. Also, as an optimization, it can remember the position where the insertion will go to later, and return an opaque token to represent that. That token can be passed to the insert-function later, which can use it to quickly re-find the insert position. In other words, very similar to the index_lock function you're proposing, but it doesn't keep the page locked. >> And that detour with exclusion constraints takes me back to the current >> patch :-). What if you implemented the unique check in a similar fashion too >> (when doing INSERT ON DUPLICATE KEY ...)? First, scan for a conflicting key, >> and mark the position. Then do the insertion to that position. If the >> insertion fails because of a duplicate key (which was inserted after we did >> the first scan), mark the heap tuple as dead, and start over. The indexam >> changes would be quite similar to the changes you made in your patch, but >> instead of keeping the page locked, you'd only hold a pin on the target page >> (if even that). The first indexam call would check that the key doesn't >> exist, and remember the insert position. The second call would re-find the >> previous position, and insert the tuple, checking again that there really >> wasn't a duplicate key violation. The locking aspects would be less scary >> than your current patch. >> >> I'm not sure if that would perform as well as your current patch. 
I must >> admit your current approach is pretty optimal performance-wise. But I'd like >> to see it, and that would be a solution for exclusion constraints in any >> case. > > I'm certainly not opposed to making something like this work for > exclusion constraints. Certainly, I want this to be as general as > possible. But I don't think that it needs to be a blocker, and I don't > think we gain anything in code footprint by addressing that by being > as general as possible in our approach to the basic concurrency issue. > After all, we're going to have to repeat the basic pattern in multiple > modules. Well, I don't know what to say. I *do* have a hunch that we'd gain much in code footprint by making this general. I don't understand what pattern you'd need to repeat in multiple modules. Here's a patch, implementing a rough version of the scheme I'm trying to explain. It's not as polished as yours, but it ought to be enough to evaluate the code footprint and performance. It doesn't make any changes to the indexam API, and it works the same with exclusion constraints and unique constraints. As it stands, it doesn't leave bloat behind, except when a concurrent insertion with a conflicting key happens between the first "pre-check" and the actual insertion. That should be rare in practice. What have you been using to performance test this? > With exclusion constraints, we'd have to worry about a single slot > proposed for insertion violating (and therefore presumably obliging us > to lock) every row in the table. Are we going to have a mechanism for > spilling a tid array potentially sized in gigabytes to disk (relating > to just one slot proposed for insertion)? Is it principled to have > that one slot project out rejects consisting of (say) the entire > table? Is it even useful to lock multiple rows if we can't really > update them, because they'll overlap each other when all updated with > the one value? Hmm. I think what you're referring to is the case where you try to insert a row so that it violates an exclusion constraint, and in a way that it conflicts with a large number of existing tuples. For example, if you have a calendar application with a constraint that two reservations must not overlap, and you try to insert a new reservation that covers, say, a whole decade. That's not a problem for ON DUPLICATE KEY IGNORE, as you just ignore the conflict and move on. For ON DUPLICATE KEY LOCK FOR UPDATE, I guess we would need to handle a large TID array. Or maybe we can arrange it so that the tuples are locked as we scan them, without having to collect them all in a large array. (the attached patch only locks the first existing tuple that conflicts; that needs to be fixed) RETURNING REJECTS is not an issue here, as that just returns the rejected rows we were about to insert, not the existing rows in the table. - Heikki
Attachment
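The "whole decade" scenario can be made concrete using the illustrative reservation table sketched earlier (room int4, during tsrange, non-overlapping per room). Applying the ON DUPLICATE KEY syntax to an exclusion constraint, as below, is hypothetical - whether the feature should cover exclusion constraints at all is exactly the open question in this exchange.

    -- Hypothetical: the proposed syntax, if it were extended to exclusion
    -- constraints as suggested above; table is the earlier illustration.
    insert into reservation values (1, '[2014-01-01,2024-01-01)')
        on duplicate key ignore;
    -- IGNORE could simply skip the row, however many reservations it
    -- overlaps; LOCK FOR UPDATE would in principle have to lock every
    -- overlapping row for room 1 in that ten-year range, which is the
    -- large TID array problem described above.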
On Tue, Nov 19, 2013 at 5:13 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Ok. Which use case are you targeting during this initial effort, batch > updates or small OLTP transactions? OLTP transactions are probably my primary concern. I just realized that I wasn't actually very clear on that point in my most recent e-mail -- my apologies. What we really need for batching, and what we should work towards in the medium term is MERGE, where a single table scan does everything. However, I also care about facilitating conflict resolution in multi-master replication systems, so I think we definitely ought to consider that carefully if at all possible. Incidentally, Andres said a few weeks back that he thinks that what I've proposed ought to be only exposed to C code, owing to the fact that it necessitates the visibility trick (actually, I think UPSERT does generally, but what I've done has, I suppose, necessitated making it more explicit/general - i.e. modifications are added to HeapTupleSatisfiesMVCC()). I don't understand what difference it makes to only exposed it at the C level - what I've proposed in this area is either correct or incorrect (Andres mentioned the Halloween problem). Furthermore, I presume that it's broadly useful to have Bucardo-style custom conflict resolution policies, without people having to get their hands dirty with C, and I think having this at the SQL level helps there. Plus, as I've said many times, the flexibility this syntax offers is likely to be broadly useful for ordinary SQL clients - this is almost as good as SQL MERGE for many cases. >> Seems like an awful lot of additional mechanism. > > Not really. Once you have the code in place to do the kill-inserted-tuple > dance on a conflict, all you need is to do an extra index search before it. > And once you have that, it's not hard to add some kind of a heuristic to > either do the pre-check or skip it. Perhaps. > I probably shouldn't have mentioned markpos/restrpos, you're right that it's > not a good idea to conflate that with index insertion. Nevertheless, some > kind of an API for doing a duplicate-key check prior to insertion, and > remembering the location for the actual insert later, seems sensible. It's > certainly no more of a modularity violation than the value-locking scheme > you're proposing. I'm not so sure - in principle, any locking implementation can be used by any conceivable amcanunique indexing method. The core system knows that it isn't okay to sit on them all day long, but that doesn't seem very onerous. >> I'm certainly not opposed to making something like this work for >> exclusion constraints. Certainly, I want this to be as general as >> possible. But I don't think that it needs to be a blocker, and I don't >> think we gain anything in code footprint by addressing that by being >> as general as possible in our approach to the basic concurrency issue. >> After all, we're going to have to repeat the basic pattern in multiple >> modules. > > > Well, I don't know what to say. I *do* have a hunch that we'd gain much in > code footprint by making this general. I don't understand what pattern you'd > need to repeat in multiple modules. Now that I see this rough patch, I better appreciate what you mean. I withdraw this objection. > Here's a patch, implementing a rough version of the scheme I'm trying to > explain. It's not as polished as yours, but it ought to be enough to > evaluate the code footprint and performance. 
It doesn't make any changes to > the indexam API, and it works the same with exclusion constraints and unique > constraints. As it stands, it doesn't leave bloat behind, except when a > concurrent insertion with a conflicting key happens between the first > "pre-check" and the actual insertion. That should be rare in practice. > > What have you been using to performance test this? I was just testing my patch against a custom pgbench workload, involving running upserts against a table from a fixed range of PK values. It's proven kind of difficult to benchmark this in the way that pgbench has proved useful for in the past. Pretty soon the table's PK range is "saturated", so they're all updates, but on the other hand how do you balance the INSERT or UPDATE case? Multiple unique indexes are the interesting case for comparing both approaches. I didn't really worry about performance so much as correctness, and for multiple unique constraints your approach clearly falls down, as explained below. >> Is it even useful to lock multiple rows if we can't really >> update them, because they'll overlap each other when all updated with >> the one value? > > > Hmm. I think what you're referring to is the case where you try to insert a > row so that it violates an exclusion constraint, and in a way that it > conflicts with a large number of existing tuples. For example, if you have a > calendar application with a constraint that two reservations must not > overlap, and you try to insert a new reservation that covers, say, a whole > decade. Right. > That's not a problem for ON DUPLICATE KEY IGNORE, as you just ignore the > conflict and move on. For ON DUPLICATE KEY LOCK FOR UPDATE, I guess we would > need to handle a large TID array. Or maybe we can arrange it so that the > tuples are locked as we scan them, without having to collect them all in a > large array. > > (the attached patch only locks the first existing tuple that conflicts; that > needs to be fixed) I'm having a hard time seeing how ON DUPLICATE KEY LOCK FOR UPDATE is of very much use to exclusion constraints at all. Perhaps I lack imagination here. However, ON DUPLICATE KEY IGNORE certainly *is* useful with exclusion constraints, and I'm not dismissive of that. I think we ought to at least be realistic about the concerns that inform your approach here. I don't think that making this work for exclusion constraints is all that compelling; I'll take it, I guess (not that there is obviously a dichotomy between doing btree locking and doing ECs too), but I doubt people put "overlaps" operators in the predicates of DML very often *at all*, and doubt even more that there is actual demand for upserting there. I think that the reason that you prefer this design is almost entirely down to possible hazards with btree locking around what I've done (or, indeed anything that approximates what I've done); maybe that's so obvious that it didn't even occur to you to mention it, but I think it should be acknowledged. I don't think that using index locking of *some* form is unreasonable. Certainly, I think that from reading the literature (e.g. [1]) one can find evidence that btree page index locking as part of value locking seems like a common technique in many popular RDBMSs, and presumably forms an important part of their SQL MERGE implementations. As it says in that paper: """ Thus, non-leaf pages do not require locks and are protected by latches only. The remainder of this paper focuses on locks. 
""" They talk here about a very traditional System-R architecture - "Assumptions about the database environment are designed to be very traditional". Latches here are basically equivalent to our buffer locks, and what they call locks we call heavyweight locks. So I'm pretty sure many other *traditional* systems handle value locking by escalating a "latch" to a leaf-page-level heavyweight lock (it's often more granular too). I think that the advantages are fairly fundamental. I think that "4.1 Locks on keys and ranges" of this paper is interesting. I've also found a gentler introduction to traditional btree key locking [2]. In that paper, section "5 Protecting a B-tree’s logical contents" it is said: """ Latches must be managed carefully in key range locking if lockable resources are defined by keys that may be deleted if not protected. Until the lock request is inserted into the lock manager’s data structures, the latch on the data structure in the buffer pool is required to ensure the existence of the key value. On the other hand, if a lock cannot be granted immediately, the thread should not hold a latch while the transaction waits. Thus, after waiting for a key value lock, a transaction must repeat its root-to-leaf search for the key. """ So I strongly suspect that some other systems have found it useful to escalate from a latch (buffer/page lock) to a lock (heavyweight lock). I have some concerns about what you've done that may limit my immediate ability to judge performance, and the relative merits of both approaches generally. Now, I know you just wanted to sketch something out, and that's fine, but I'm only sharing my thoughts. I am particularly worried about the worst case (for either approach), particularly with more than 1 unique index. I am also worried about livelock hazards (again, in particular with more than 1 index) - I am not asserting that they exist in your patch, but they are definitely more difficult to reason about. Value locking works because once a page lock is acquired, all unique indexes are inserted into. Could you have two upserters livelock each other with two unique indexes with 1:1 correlated values in practice (i.e. 2 unique indexes that might almost work as 1 composite index)? That is a reasonable usage of upsert, I think. We never wait on another transaction if there is a conflict when inserting - we just do the usual UNIQUE_CHECK_PARTIAL thing (we don't wait for other xact during btree insertion). This breaks the IGNORE case (how does it determine the final outcome of the transaction that inserted what may be a conflict, iff the conflict was only found during insertion?), which would probably be fine for our purposes if that were the only issue, but I have concerns about its effects on the ON DUPLICATE KEY LOCK FOR UPDATE case too. I don't like that an upserter's ExecInsertIndexTuples() won't wait on other xids generally, I think. Why should the code btree-insert even though it knows it's going to kill the heap tuple? It makes things very hard to reason about. If you are just mostly thinking about exclusion constraints here, then I'm not sure that even at this point that it's okay that the IGNORE case doesn't work there, because IGNORE is the only thing that makes much sense for exclusion constraints. The unacceptable-deadlocking-pattern generally occurs when we try to lock two different row versions. Your patch is fairly easy to make deadlock. Regarding this: /* * At this point we have either a conflict or a potential conflict. 
If * we're not supposed to raise error, just return the fact of the * potential conflict without waiting to see if it's real. */ if (errorOK && !wait) { conflict = true; if (conflictTid) *conflictTid = tup->t_self; break; } Don't we really just have only a potential conflict? Even if conflictTid is committed? I think it's odd that you insert btree index tuples without ever worrying about waiting (which is what breaks the IGNORE case, you might say). UNIQUE_CHECK_PARTIAL never gives an xid to wait on from within _bt_check_unique(). Won't that itself make other sessions block pending the outcome of our transaction (in non-upserting ExecInsertIndexTuples(), or in ExecCheckIndexConstraints())? Could that be why your patch deadlocks unreasonable (that is, in the way you've already agreed, in your most recent mail, isn't okay)? Isn't it only already okay that UNIQUE_CHECK_PARTIAL might do that for deferred unique indexes because of re-checking, which may then abort the xact? How will this work?: * XXX: If we know or assume that there are few duplicates, it would * be better to skip this, and just optimistically proceed with the * insertion below. You would then leave behind some garbage when a * conflict happens, but if it's rare, it doesn't matter much. Some * kind of heuristic might be in order here, like stop doing these * pre-checks if the last 100 insertions have not been duplicates. ...when you consider that the only place a tid can come from is this pre-check? Anyway, consider the following simple test-case of your patch. postgres=# create unlogged table foo ( a int4 primary key, b int4 unique ); CREATE TABLE If I run the attached pgbench script like this: pg@hamster:~/pgbench-tools/tests$ pgbench -f upsert.sql -n -c 50 -T 20 I can get it to deadlock (and especially to throw unique constraint violations) like crazy. Single unique indexes seemed okay, though I have my doubts that only allowing one unique index gets us far, or that it will be acceptable to have the user specify a unique index in DML or something. I discussed this with Robert in relation to his design upthread. Multiple unique constraints were *always* the hard case. I mean, my patch only really does something unconventional *because* of that case, really. One unique index is easy. Leaving discussion of value locking aside, just how rough is this revision of yours? What do you think of certain controversial aspects of my design that remain unchanged, such as the visibility trick (as actually implemented, and/or just in principle)? What about the syntax itself? It is certainly valuable to have additional MERGE-like functionality above and beyond the basic "upsert", not least for multi-master conflict resolution with complex resolution policies, and this syntax gets us much of that. How would you feel about making it possible for the UPDATE to use a tidscan, by projecting out the tid that caused a conflict, as a semi-documented optimization? It might be unfortunate if someone tried to UPDATE based on that ctid twice, but that is a less common requirement. It is kind of an abuse of notation, because of course you're not supposed to be projecting out the conflict-causer but the rejects, but perhaps we can live with that, if we can live with the basic idea. I'm sorry if my thoughts here are not fully developed, but it's hard to pin this stuff down. Especially since I'm guessing what is and isn't essential to your design in this rough sketch. 
Thanks [1] http://zfs.informatik.rwth-aachen.de/btw2007/paper/p18.pdf [2] http://www.hpl.hp.com/techreports/2010/HPL-2010-9.pdf -- Peter Geoghegan
Attachment
On Sat, Nov 23, 2013 at 11:52 PM, Peter Geoghegan <pg@heroku.com> wrote: > pg@hamster:~/pgbench-tools/tests$ pgbench -f upsert.sql -n -c 50 -T 20 > > I can get it to deadlock (and especially to throw unique constraint > violations) like crazy. I'm sorry, this test-case is an earlier one that is actually entirely invalid for the purpose stated (though my concerns stated above remain - I just didn't think the multi-unique-index case had been exercised enough, and so did this at the last minute). Please omit it from your consideration. I think I have been working too late... -- Peter Geoghegan
On Sat, Nov 23, 2013 at 11:52 PM, Peter Geoghegan <pg@heroku.com> wrote: > I have some concerns about what you've done that may limit my > immediate ability to judge performance, and the relative merits of > both approaches generally. Now, I know you just wanted to sketch > something out, and that's fine, but I'm only sharing my thoughts. I am > particularly worried about the worst case (for either approach), > particularly with more than 1 unique index. I am also worried about > livelock hazards (again, in particular with more than 1 index) - I am > not asserting that they exist in your patch, but they are definitely > more difficult to reason about. Value locking works because once a > page lock is acquired, all unique indexes are inserted into. Could you > have two upserters livelock each other with two unique indexes with > 1:1 correlated values in practice (i.e. 2 unique indexes that might > almost work as 1 composite index)? That is a reasonable usage of > upsert, I think. So I had it backwards: In fact, it isn't possible to get your patch to deadlock when it should - it livelocks instead (where with my patch, as far as I can tell, we predictably and correctly have detected deadlocks). I see an infinite succession of "insertion conflicted after pre-check" DEBUG1 elog messages, and no progress, which is an obvious indication of livelock. My test does involve 2 unique indexes - that's generally the hard case to get right. Dozens of backends are tied-up in livelock. Test case for this is attached. My patch is considerably slowed down by the way this test-case tangles everything up, but does get through each pgbench run/loop in the bash script predictably enough. And when I kill the test-case, a bunch of backends are not left around, stuck in perpetual livelock (with my patch it takes only a few seconds for the deadlock detector to get around to killing every backend). I'm also seeing this: Client 45 aborted in state 2: ERROR: attempted to lock invisible tuple Client 55 aborted in state 2: ERROR: attempted to lock invisible tuple Client 41 aborted in state 2: ERROR: attempted to lock invisible tuple To me this seems like a problem with the (potential) total lack of locking that your approach takes (inserting btree unique index tuples as in your patch is a form of value locking...sort of...it's a little hard to reason about as presented). Do you think this might be an inherent problem, or can you suggest a way to make your approach still work? So I probably should have previously listed as a requirement for our design: * Doesn't just work with one unique index. Naming a unique index directly in DML, or assuming that the PK is intended seems quite weak to me. This is something I discussed plenty with Robert, and I guess I just forgot to repeat myself when asked. Thanks -- Peter Geoghegan
Attachment
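The attached test case is not reproduced in the archive. Purely to illustrate the shape described above - two unique indexes whose values are kept 1:1 correlated, hammered by many concurrent upserters - a sketch under the patch's proposed syntax might look like the following; the table, value range, and client count are invented for the example.

    -- Sketch only; not the attached test case.
    -- One-time setup:
    create unlogged table upsert_race (
        k1 int4 primary key,
        k2 int4 unique      -- kept 1:1 correlated with k1 by the workload
    );
    -- race.sql, the per-client pgbench script (9.3-era \setrandom syntax),
    -- run with something like: pgbench -f race.sql -n -c 50 -T 20
    \setrandom v 1 100
    with r as (
        insert into upsert_race (k1, k2)
        values (:v, :v)
        on duplicate key lock for update
        returning rejects *
    )
    update upsert_race set k2 = r.k2 from r where upsert_race.k1 = r.k1;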
On 11/18/2013 06:44 AM, Heikki Linnakangas wrote: > I think it's important to recap the design goals of this. I don't think > these have been listed before, so let me try: > > * It should be usable and perform well for both large batch updates and > small transactions. > > * It should perform well both when there are no duplicates, and when > there are lots of duplicates > > And from that follows some finer requirements: > > * Performance when there are no duplicates should be close to raw INSERT > performance. > > * Performance when all rows are duplicates should be close to raw UPDATE > performance. > > * We should not leave behind large numbers of dead tuples in either case. I think this is setting the bar way too high for an initial feature. Would we like to eventually have all of those things? Yes. Do we need to have all of them for 9.4? No. It's more useful to measure this feature against the current alternatives used by our users, which are upsert functions and similar patterns. If we can make things easier and more efficient than those (which shouldn't be hard), then it's a worthwhile step forwards. That being said, the other requirement I am concerned about is being able to support the syntax of this feature in commonly used ORMs. That is, can I write a fairly small Django or Rails extension which does upsert using this patch? Fortunately, I think I can ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Nov 26, 2013 at 9:11 AM, Josh Berkus <josh@agliodbs.com> wrote: >> * It should be usable and perform well for both large batch updates and >> small transactions. >> >> * It should perform well both when there are no duplicates, and when >> there are lots of duplicates >> >> And from that follows some finer requirements: >> >> * Performance when there are no duplicates should be close to raw INSERT >> performance. >> >> * Performance when all rows are duplicates should be close to raw UPDATE >> performance. >> >> * We should not leave behind large numbers of dead tuples in either case. > > I think this is setting the bar way too high for an initial feature. > Would we like to eventually have all of those things? Yes. Do we need > to have all of them for 9.4? No. The requirements around performance/bloat have a lot to do with making the feature work reasonably well for multi-master conflict resolution. They also have much more to do with the worst case than the average case. If the worst case really is terribly bad, that ends up being a major gotcha. I'm not concerned about bloat as such, but in any case whether or not Heikki's design can mostly avoid bloat is, for now, of secondary importance. I feel the need to re-iterate something I've already said: I don't see that I have a concession to make here with a view to pragmatically getting something useful into 9.4. I am playing it as safe as I think I can. > It's more useful to measure this feature against the current > alternatives used by our users, which are upsert functions and similar > patterns. If we can make things easier and more efficient than those > (which shouldn't be hard), then it's a worthwhile step forwards. Actually, it's very hard. I don't have license to burn through xids. -- Peter Geoghegan
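The "upsert functions and similar patterns" mentioned above are usually a retry loop along the lines of the merge_db() example in the PostgreSQL documentation, roughly as sketched below ("kv" is an invented table). Peter's remark about burning through xids refers to the EXCEPTION block: each failed INSERT attempt is an aborted subtransaction, and therefore a consumed xid, which is part of what makes this plpgsql pattern a poor baseline under heavy contention.

    -- Roughly the documented retry-loop pattern; "kv" is invented here.
    create table kv (k int4 primary key, v text);

    create function upsert_kv(key int4, val text) returns void as $$
    begin
        loop
            update kv set v = val where k = key;
            if found then
                return;
            end if;
            begin
                insert into kv (k, v) values (key, val);
                return;
            exception when unique_violation then
                -- Someone else inserted the key concurrently: loop and
                -- retry the UPDATE. Each failed attempt here is an
                -- aborted subtransaction, i.e. a consumed xid.
                null;
            end;
        end loop;
    end;
    $$ language plpgsql;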
On 11/26/13 01:59, Peter Geoghegan wrote: > On Sat, Nov 23, 2013 at 11:52 PM, Peter Geoghegan <pg@heroku.com> wrote: >> I have some concerns about what you've done that may limit my >> immediate ability to judge performance, and the relative merits of >> both approaches generally. Now, I know you just wanted to sketch >> something out, and that's fine, but I'm only sharing my thoughts. I am >> particularly worried about the worst case (for either approach), >> particularly with more than 1 unique index. I am also worried about >> livelock hazards (again, in particular with more than 1 index) - I am >> not asserting that they exist in your patch, but they are definitely >> more difficult to reason about. Value locking works because once a >> page lock is acquired, all unique indexes are inserted into. Could you >> have two upserters livelock each other with two unique indexes with >> 1:1 correlated values in practice (i.e. 2 unique indexes that might >> almost work as 1 composite index)? That is a reasonable usage of >> upsert, I think. > > So I had it backwards: In fact, it isn't possible to get your patch to > deadlock when it should - it livelocks instead (where with my patch, > as far as I can tell, we predictably and correctly have detected > deadlocks). I see an infinite succession of "insertion conflicted > after pre-check" DEBUG1 elog messages, and no progress, which is an > obvious indication of livelock. My test does involve 2 unique indexes > - that's generally the hard case to get right. Dozens of backends are > tied-up in livelock. > > Test case for this is attached. Great, thanks! I forgot to reset the "conflicted" variable when looping to retry, so that once it got into the "insertion conflicted after pre-check" situation, it never got out of it. After fixing that bug, I'm getting a correctly-detected deadlock every now and then with that test case. > I'm also seeing this: > > Client 45 aborted in state 2: ERROR: attempted to lock invisible tuple > Client 55 aborted in state 2: ERROR: attempted to lock invisible tuple > Client 41 aborted in state 2: ERROR: attempted to lock invisible tuple Hmm. That's because the trick I used to kill the just-inserted tuple confuses a concurrent heap_lock_tuple call. It doesn't expect the tuple it's locking to become invisible. Actually, doesn't your patch have the same bug? If you're about to lock a tuple in ON DUPLICATE KEY LOCK FOR UPDATE, and the transaction that inserted the duplicate row aborts just before the heap_lock_tuple() call, I think you'd also see that error. > To me this seems like a problem with the (potential) total lack of > locking that your approach takes (inserting btree unique index tuples > as in your patch is a form of value locking...sort of...it's a little > hard to reason about as presented). Do you think this might be an > inherent problem, or can you suggest a way to make your approach still > work? Just garden-variety bugs :-). Attached patch fixes both issues. > So I probably should have previously listed as a requirement for our design: > > * Doesn't just work with one unique index. Naming a unique index > directly in DML, or assuming that the PK is intended seems quite weak > to me. > > This is something I discussed plenty with Robert, and I guess I just > forgot to repeat myself when asked. Totally agreed on that. - Heikki
Attachment
On Tue, Nov 26, 2013 at 11:32 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > After fixing that bug, I'm getting a correctly-detected deadlock every now > and then with that test case. We'll probably want to carefully consider how predictably/deterministically this occurs. > Hmm. That's because the trick I used to kill the just-inserted tuple > confuses a concurrent heap_lock_tuple call. It doesn't expect the tuple it's > locking to become invisible. Actually, doesn't your patch have the same bug? > If you're about to lock a tuple in ON DUPLICATE KEY LOCK FOR UPDATE, and the > transaction that inserted the duplicate row aborts just before the > heap_lock_tuple() call, I think you'd also see that error. Yes, that's true. It will occur much more frequently with your previous revision, but the V4 patch is also affected. >> To me this seems like a problem with the (potential) total lack of >> locking that your approach takes (inserting btree unique index tuples >> as in your patch is a form of value locking...sort of...it's a little >> hard to reason about as presented). Do you think this might be an >> inherent problem, or can you suggest a way to make your approach still >> work? > > > Just garden-variety bugs :-). Attached patch fixes both issues. Great. I'll let you know what I think. >> * Doesn't just work with one unique index. Naming a unique index >> directly in DML, or assuming that the PK is intended seems quite weak >> to me. > Totally agreed on that. Good. BTW, you keep forgetting to add "expected" output of the new isolation tests. -- Peter Geoghegan
On Tue, Nov 26, 2013 at 1:41 PM, Peter Geoghegan <pg@heroku.com> wrote: > Great. I'll let you know what I think. So having taken a look at what you've done here, some concerns remain. I'm coming up with a good explanation/test case, which might be easier than trying to explain it any other way. There are some visibility-related race conditions even still, with the same test case as before. It takes a good while to recreate, but can be done after several hours on an 8 core server under my control: pg@gerbil:~/pgdata$ ls -l -h -a hack_log.log -rw-rw-r-- 1 pg pg 1.6G Nov 27 05:10 hack_log.log pg@gerbil:~/pgdata$ cat hack_log.log | grep visible ERROR: attempted to update invisible tuple ERROR: attempted to update invisible tuple ERROR: attempted to update invisible tuple FWIW I'm pretty sure that my original patch has the same bug, but it hardly matters now. -- Peter Geoghegan
On Tue, Nov 26, 2013 at 8:19 PM, Peter Geoghegan <pg@heroku.com> wrote: > There are some visibility-related race conditions even still I also see this, sandwiched between the very many "deadlock detected" errors recorded over 6 or so hours (this is in chronological order, with no ERRORs omitted within the range shown): ERROR: deadlock detected ERROR: deadlock detected ERROR: deadlock detected ERROR: unable to fetch updated version of tuple ERROR: unable to fetch updated version of tuple ERROR: unable to fetch updated version of tuple ERROR: unable to fetch updated version of tuple ERROR: unable to fetch updated version of tuple ERROR: unable to fetch updated version of tuple ERROR: unable to fetch updated version of tuple ERROR: unable to fetch updated version of tuple ERROR: unable to fetch updated version of tuple ERROR: unable to fetch updated version of tuple ERROR: unable to fetch updated version of tuple ERROR: deadlock detected ERROR: deadlock detected ERROR: deadlock detected ERROR: deadlock detected This, along with the already-discussed "attempted to update invisible tuple" forms a full account of unexpected ERRORs seen during the extended run of the test case, so far. Since it took me a relatively long time to recreate this, it may not be trivial to do so. Unless you don't think it's useful to do so, I'm going to give this test a full 24 hours, just in case it shows up anything else like this. -- Peter Geoghegan
On 2013-11-27 01:09:49 -0800, Peter Geoghegan wrote: > On Tue, Nov 26, 2013 at 8:19 PM, Peter Geoghegan <pg@heroku.com> wrote: > > There are some visibility-related race conditions even still > > I also see this, sandwiched between the very many "deadlock detected" > errors recorded over 6 or so hours (this is in chronological order, > with no ERRORs omitted within the range shown): > > ERROR: deadlock detected > ERROR: deadlock detected > ERROR: deadlock detected > ERROR: unable to fetch updated version of tuple > ERROR: unable to fetch updated version of tuple > ERROR: unable to fetch updated version of tuple > ERROR: unable to fetch updated version of tuple > ERROR: unable to fetch updated version of tuple > ERROR: unable to fetch updated version of tuple > ERROR: unable to fetch updated version of tuple > ERROR: unable to fetch updated version of tuple > ERROR: unable to fetch updated version of tuple > ERROR: unable to fetch updated version of tuple > ERROR: unable to fetch updated version of tuple > ERROR: deadlock detected > ERROR: deadlock detected > ERROR: deadlock detected > ERROR: deadlock detected > > This, along with the already-discussed "attempted to update invisible > tuple" forms a full account of unexpected ERRORs seen during the > extended run of the test case, so far. I think at least the "unable to fetch updated version of tuple" ERRORs are likely to be an unrelated 9.3+ BUG that I've recently reported. Alvaro has a patch. C.f. 20131124000203.GA4403@alap2.anarazel.de Even the "deadlock detected" errors might be a fkey-locking issue. Bug #8434, but that's really hard to know without more details. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 27, 2013 at 1:20 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Even the "deadlock detected" errors might be a fkey-locking issue. Bug > #8434, but that's really hard to know without more details. Thanks, I was aware of that but didn't make the connection. I've written a test-case that is designed to exercise one case that deadlocks like crazy - deadlocking is the expected, correct behavior. The deadlock errors are not in themselves suspicious. Actually, if anything I find it suspicious that there aren't more deadlocks. -- Peter Geoghegan
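The deadlock-heavy test case itself is attached upstream and not reproduced here, but the kind of deadlock that is expected and correct arises whenever two sessions lock existing rows in opposite orders within their transactions. A minimal sketch with an invented table:

create table kv (k int primary key, v text);
insert into kv values (1, 'a'), (2, 'b');

-- Session 1:
begin;
insert into kv values (1, 'x') on duplicate key lock for update returning rejects *;
-- Session 2:
begin;
insert into kv values (2, 'y') on duplicate key lock for update returning rejects *;
-- Session 1 (blocks on the row lock held by session 2):
insert into kv values (2, 'x') on duplicate key lock for update returning rejects *;
-- Session 2 (would block on session 1's lock; the deadlock detector
-- aborts one of the two transactions, which is the correct outcome):
insert into kv values (1, 'y') on duplicate key lock for update returning rejects *;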
On Wed, Nov 27, 2013 at 1:09 AM, Peter Geoghegan <pg@heroku.com> wrote: > Since it took me a relatively long time to recreate this, it may not > be trivial to do so. Unless you don't think it's useful to do so, I'm > going to give this test a full 24 hours, just in case it shows up > anything else like this. I see a further, distinct error message this morning: "ERROR: unrecognized heap_lock_tuple status: 1" This is a would-be "attempted to lock invisible tuple" error, but with the error raised by some heap_lock_tuple() call site, unlike the previous situation where heap_lock_tuple() raised the error directly. Since with the most recent revision, we handle this (newly possible) return code in the new ExecLockHeapTupleForUpdateSpec() function, that just leaves EvalPlanQualFetch() as a plausible place to see it, given the codepaths exercised in the test case. -- Peter Geoghegan
On Wed, Nov 27, 2013 at 1:09 AM, Peter Geoghegan <pg@heroku.com> wrote: > This, along with the already-discussed "attempted to update invisible > tuple" forms a full account of unexpected ERRORs seen during the > extended run of the test case, so far. Actually, it was slightly misleading of me to say it's the same test-case; in fact, this time I ran each pgbench run with a variable, random number of seconds between 2 and 20 inclusive (as opposed to always 2 seconds). If you happen to need help recreating this, I am happy to give it. -- Peter Geoghegan
What's the status of this patch? I posted my version using a quite different approach than your original patch. You did some testing of that, and ran into unrelated bugs. Have they been fixed now? Where do we go from here? Are you planning to continue based on my proof-of-concept patch, fixing the known issues with that? Or do you need more convincing? - Heikki
On Thu, Dec 12, 2013 at 1:23 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > What's the status of this patch? I posted my version using a quite different > approach than your original patch. You did some testing of that, and ran > into unrelated bugs. Have they been fixed now? Sorry, I dropped the ball on this. I'm doing a bit more testing of an approach to fixing the new bugs. I'll let you know how I get on tomorrow (later today for you). -- Peter Geoghegan
On Thu, Dec 12, 2013 at 1:47 AM, Peter Geoghegan <pg@heroku.com> wrote: > Sorry, I dropped the ball on this. Thank you for your patience, Heikki. I attached two revisions - one of my patch (btreelock_insert_on_dup) and one of your alternative design (exclusion_insert_on_dup). In both cases I've added a new visibility rule to HeapTupleSatisfiesUpdate(), and enabled projecting on duplicate-causing-tid by means of the ctid system column when RETURNING REJECTS. I'm not in an immediate position to satisfy myself that the former revision is correct (I'm travelling tomorrow morning and running a bit short on time) and I'm not proposing the latter for inclusion as part of the feature (that's a discussion we may have in time, but it serves a useful purpose during testing). Both of these revisions have identical ad-hoc test cases included as new files - see testcase.sh and upsert.sql. My patch doesn't have any unique constraint violations, and has pretty consistent performance, while yours has many unique constraint violations. I'd like to hear your thoughts on the testcase, and the design implications. -- Peter Geoghegan
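As a rough illustration of the ctid projection mentioned above (invented table; the precise target list accepted by RETURNING REJECTS in the attached revisions may differ), the duplicate-causing row's tid can be carried out of the insert for use by a wrapping statement:

create table tab (k int primary key, v text);

with r as (
    insert into tab(k, v) values (1, 'new'), (2, 'new')
    on duplicate key lock for update
    returning rejects ctid, k, v
)
select * from r;
-- ctid identifies the existing, now-locked row that caused each
-- rejection, alongside the values that were proposed for insertion.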
Peter Geoghegan <pg@heroku.com> writes: > I attached two revisions - one of my patch (btreelock_insert_on_dup) > and one of your alternative design (exclusion_insert_on_dup). I spent a little bit of time looking at btreelock_insert_on_dup. AFAICT it executes FormIndexDatum() for later indexes while holding assorted buffer locks in earlier indexes. That really ain't gonna do, because in the case of an expression index, FormIndexDatum will execute nearly arbitrary user-defined code, which might well result in accesses to those indexes or others. What we'd have to do is refactor so that all the index tuple values get computed before we start to insert any of them. That doesn't seem impossible, but it implies a good deal more refactoring than has been done here. Once we do that, I wonder if we couldn't get rid of the LWLockWeaken/Strengthen stuff. That scares the heck out of me; I think it's deadlock problems waiting to happen. Another issue is that the number of buffer locks being held doesn't seem to be bounded by anything much. The current LWLock infrastructure has a hard limit on how many lwlocks can be held per backend. Also, the lack of any doc updates makes it hard to review this. I can see that you don't want to touch the user-facing docs until the syntax is agreed on, but at the very least you ought to produce real updates for the indexam API spec, since you're changing that significantly. BTW, so far as the syntax goes, I'm quite distressed by having to make REJECTS into a fully-reserved word. It's not reserved according to the standard, and it seems pretty likely to be something that apps might be using as a table or column name. regards, tom lane
On Fri, Dec 13, 2013 at 4:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I spent a little bit of time looking at btreelock_insert_on_dup. AFAICT > it executes FormIndexDatum() for later indexes while holding assorted > buffer locks in earlier indexes. That really ain't gonna do, because in > the case of an expression index, FormIndexDatum will execute nearly > arbitrary user-defined code, which might well result in accesses to those > indexes or others. What we'd have to do is refactor so that all the index > tuple values get computed before we start to insert any of them. That > doesn't seem impossible, but it implies a good deal more refactoring than > has been done here. We were proceeding on the basis that what I'd done, if deemed acceptable in principle, could eventually be replaced by an alternative value locking implementation that more or less similarly extends the limited way in which value locking already occurs (i.e. unique index enforcement's buffer locking), but without the downsides. While I certainly appreciate your input, I still think that there is a controversy about what implementation gets us the most useful semantics, and I think we should now focus on resolving it. I am not sure that Heikki's approach is functionally equivalent to mine. At the very least, I think the trade-off of doing one or the other should be well understood. > Once we do that, I wonder if we couldn't get rid of the LWLockWeaken/ > Strengthen stuff. That scares the heck out of me; I think it's deadlock > problems waiting to happen. There are specific caveats around using those. I think that they could be useful elsewhere, but are likely to only ever have a few clients. As previously mentioned, the same semantics appear in other similar locking primitives in other domains, so fwiw it really doesn't strike me as all that controversial. I agree that their *usage* is not acceptable as-is. I've only left the usage in the patch to give us some basis for reasoning about the performance on mixed workloads for comparative purposes. Perhaps I shouldn't have even done that, to better focus reviewer attention on the semantics implied by each implementation. > Also, the lack of any doc updates makes it hard to review this. I can > see that you don't want to touch the user-facing docs until the syntax > is agreed on, but at the very least you ought to produce real updates > for the indexam API spec, since you're changing that significantly. I'll certainly do that in any future revision. -- Peter Geoghegan
On Thu, Dec 12, 2013 at 4:18 PM, Peter Geoghegan <pg@heroku.com> wrote: > Both of these revisions have identical ad-hoc test cases included as > new files - see testcase.sh and upsert.sql. My patch doesn't have any > unique constraint violations, and has pretty consistent performance, > while yours has many unique constraint violations. I'd like to hear > your thoughts on the testcase, and the design implications. I withdraw the test-case. Both approaches behave similarly if you look for long enough, and that's okay. I also think that changes to HeapTupleSatisfiesUpdate() are made unnecessary by recent bug fixes to that function. The test case previously described [1] that broke that is no longer recreatable, at least so far. Do you think that we need to throw a serialization failure within ExecLockHeapTupleForUpdateSpec() iff heap_lock_tuple() returns HeapTupleInvisible and IsolationUsesXactSnapshot()? Also, I'm having a hard time figuring out a good choke point to catch MVCC snapshots availing of our special visibility rule where they should not due to IsolationUsesXactSnapshot(). It seems sufficient to continue to assume that Postgres won't attempt to lock any tid invisible under conventional MVCC rules in the first place, except within ExecLockHeapTupleForUpdateSpec(), but what do we actually do within ExecLockHeapTupleForUpdateSpec()? I'm thinking of a new tqual.c routine concerning the tuple being in the future that we re-check when IsolationUsesXactSnapshot(). That's not very modular, though. Maybe we'd go through heapam.c. I think it doesn't matter that what now constitute MVCC snapshots (with the new, special "reach into the future" rule) have that new rule, for the purposes of higher isolation levels, because we'll have a serialization failure within ExecLockHeapTupleForUpdateSpec() before this is allowed to become a problem. In order for the new rule to be relevant, we'd have to be the Xact to lock in the first place, and as an xact in non-read-committed mode, we'd be sure to call the new tqual.c "in the future" routine or whatever. Only upserters can lock a row in the future, so it is the job of upserters to care about this special case. Incidentally, I tried to rebase recently, and saw some shift/reduce conflicts due to 1b4f7f93b4693858cb983af3cd557f6097dab67b, "Allow empty target list in SELECT". The fix for that is not immediately obvious. So I think we should proceed with the non-conclusive-check-first approach (if only on pragmatic grounds), but even now I'm not really sure. I think there might be unprincipled deadlocking should ExecInsertIndexTuples() fail to be completely consistent about its ordering of insertion - the use of dirty snapshots (including as part of conventional !UNIQUE_CHECK_PARTIAL unique index enforcement) plays a part in this risk. Roughly speaking, heap_delete() doesn't render the tuple immediately invisible to some-other-xact's dirty snapshot [2], and I think that could have unpleasant interactions, even if it is also beneficial in some ways. Our old, dead tuples from previous attempts stick around, and function as "value locks" to everyone else, since for example _bt_check_unique() cares about visibility having merely been affected, which is grounds for blocking. More counter-intuitive still, we go ahead with "value locking" (i.e. 
btree UNIQUE_CHECK_PARTIAL tuple insertion originating from the main speculative ExecInsertIndexTuples() call) even though we already know that we will delete the corresponding heap row (which, as noted, still satisfies HeapTupleSatisfiesDirty() and so is value-lock-like). Empirically, retrying because ExecInsertIndexTuples() returns some recheckIndexes occurs infrequently, so maybe that makes all of this okay. Or maybe it happens infrequently *because* we don't give up on insertion when it looks like the current iteration is futile. Maybe just inserting into every unique index, and then blocking on an xid within ExecCheckIndexConstraints(), works out fairly and performs reasonably in all common cases. It's pretty damn subtle, though, and I worry about the worst case performance, and basic correctness issues for these reasons. The fact that deferred unique indexes also use UNIQUE_CHECK_PARTIAL is cold comfort -- that only ever has to throw an error on conflict, and only once. We haven't "earned the right" to lock *all* values in all unique indexes, but kind of do so anyway in the event of an "insertion conflicted after pre-check". Another concern that bears reiterating is: I think making the lock-for-update case work for exclusion constraints is a lot of additional complexity for a very small return. Do you think it's worth optimizing ExecInsertIndexTuples() to avoid futile non-unique/exclusion constrained index tuple insertion? [1] http://www.postgresql.org/message-id/CAM3SWZS2--GOvUmYA2ks_aNyfesb0_H6T95_k8+wyx7Pi=CQvw@mail.gmail.com [2] https://github.com/postgres/postgres/blob/94b899b829657332bda856ac3f06153d09077bd1/src/backend/utils/time/tqual.c#L798 -- Peter Geoghegan
On Wed, Dec 18, 2013 at 8:39 PM, Peter Geoghegan <pg@heroku.com> wrote: > Empirically, retrying because ExecInsertIndexTuples() returns some > recheckIndexes occurs infrequently, so maybe that makes all of this > okay. Or maybe it happens infrequently *because* we don't give up on > insertion when it looks like the current iteration is futile. Maybe > just inserting into every unique index, and then blocking on an xid > within ExecCheckIndexConstraints(), works out fairly and performs > reasonably in all common cases. It's pretty damn subtle, though, and I > worry about the worst case performance, and basic correctness issues > for these reasons. I realized that it's possible to create the problem that I'd previously predicted with "promise tuples" [1] some time ago, that are similar in some regards to what Heikki has here. At the time, Robert seemed to agree that this was a concern [2]. I have a very simple testcase attached, much simpler that previous testcases, that reproduces deadlock for the patch exclusion_insert_on_dup.2013_12_12.patch.gz at scale=1 frequently, and occasionally when scale=10 (for tiny, single-statement transactions). With scale=100, I can't get it to deadlock on my laptop (60 clients in all cases), at least in a reasonable time period. With the patch btreelock_insert_on_dup.2013_12_12.patch.gz, it will never deadlock, even with scale=1, simply because value locks are not held on to across row locking. This is why I characterized the locking as "opportunistic" on several occasions in months past. The test-case is actually much simpler than the one I describe in [1], and much simpler than all previous test-cases, as there is only one unique index, though the problem is essentially the same. It is down to old "value locks" held across retries - with "exclusion_...", we can't *stop* locking things from previous locking attempts (where a locking attempt is btree insertion with the UNIQUE_CHECK_PARTIAL flag), because dirty snapshots still see inserted-then-deleted-in-other-xact tuples. This deadlocking seems unprincipled and unjustified, which is a concern that I had all along, and a concern that Heikki seemed to share more recently [3]. This is why I felt strongly all along that value locks ought to be cheap to both acquire and _release_, and it's unfortunate that so much time was wasted on tangential issues, though I do accept some responsibility for that. So, I'd like to request as much scrutiny as possible from as wide as possible a cross section of contributors of this test case specifically. This feature's development is deadlocked on resolving this unprincipled deadlocking controversy. This is a relatively easy thing to have an opinion on, and I'd like to hear as many as possible. Is this deadlocking something we can live with? What is a reasonable path forward? Perhaps I am being pedantic in considering unnecessary deadlocking as ipso facto unacceptable (after all, MySQL lived with this kind of problem for long enough, even if it has gotten better for them recently), but there is a very real danger of painting ourselves into a corner with these concurrency issues. I aim to have the community understand ahead of time the exact set of trade-offs/semantics implied by our chosen implementation, whatever the outcome. That seems very important. I myself lean towards this being a blocker for the "exclusion_" approach at least as presented. 
Now, you might say to yourself "why should I assume that this isn't just attributable to btree page buffer locks being coarser than other approaches to value locking?". That's a reasonable point, and indeed it's why I avoided lower scale values in prior, more complicated test-cases, but that doesn't actually account for the problem highlighted: In this test-case we do not hold buffer locks across other buffer locks within a single backend (at least in any new way), nor do we lock rows while holding buffer locks within a single backend. Quite simply, the conventional btree value locking approach doesn't attempt to lock 2 things within a backend at the same time, and you need to do that to get a deadlock, so there are no deadlocks. Importantly, the "btree_..." implementation can release value locks. Thanks P.S. In the interest of reproducibility, I attach new revisions of each patch, even though there is no reason to believe that any changes since the last revision posted are significant to the test-case. There was some diff minimization, plus I incorporated some (but not all) unrelated feedback from Tom. It wasn't immediately obvious, at least to me, that "rejects" can be made to be an unreserved keyword, due to shift/reduce conflicts, but I did document AM changes. Hopefully this gives some indication of the essential nature or intent of my design that we may work towards refining (depending on the outcome of discussion here, of course). P.P.S. Be careful not to fall afoul of the shift/reduce conflicts when applying either patch on top of commit 1b4f7f93b4693858cb983af3cd557f6097dab67b. I'm working on a fix that allows a clean rebase. [1] http://www.postgresql.org/message-id/CAM3SWZRfrw+zXe7CKt6-QTCuvKQ-Oi7gnbBOPqQsvddU=9M7_g@mail.gmail.com [2] http://www.postgresql.org/message-id/CA+TgmobwDZSVcKWTmVNBxeHSe4LCnW6zon2soH6L7VoO+7tAzw@mail.gmail.com [3] http://www.postgresql.org/message-id/528B640F.50601@vmware.com -- Peter Geoghegan
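The attached test case is not reproduced in this archive, but the general shape described above, namely many clients running tiny, single-statement upserting transactions against one unique index, can be sketched as follows (names invented; \setrandom is simply the pgbench idiom of that era for picking a random key per transaction, and in a real pgbench script of that era the statement would need to sit on a single line):

-- setup; the number of distinct keys plays the role of "scale":
create table upsert_race (k int primary key, v text);

-- per-transaction work run by e.g. 60 concurrent pgbench clients:
\setrandom k 1 100
with r as (
    insert into upsert_race(k, v) values (:k, 'val')
    on duplicate key lock for update
    returning rejects *
)
update upsert_race set v = r.v from r where upsert_race.k = r.k;

Because each transaction is a single statement and never waits while already holding a row lock it acquired earlier, any deadlock reported by such a workload is unprincipled in the sense argued above.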
On 12/20/2013 06:06 AM, Peter Geoghegan wrote: > On Wed, Dec 18, 2013 at 8:39 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Empirically, retrying because ExecInsertIndexTuples() returns some >> recheckIndexes occurs infrequently, so maybe that makes all of this >> okay. Or maybe it happens infrequently *because* we don't give up on >> insertion when it looks like the current iteration is futile. Maybe >> just inserting into every unique index, and then blocking on an xid >> within ExecCheckIndexConstraints(), works out fairly and performs >> reasonably in all common cases. It's pretty damn subtle, though, and I >> worry about the worst case performance, and basic correctness issues >> for these reasons. > > I realized that it's possible to create the problem that I'd > previously predicted with "promise tuples" [1] some time ago, that are > similar in some regards to what Heikki has here. At the time, Robert > seemed to agree that this was a concern [2]. > > I have a very simple testcase attached, much simpler that previous > testcases, that reproduces deadlock for the patch > exclusion_insert_on_dup.2013_12_12.patch.gz at scale=1 frequently, and > occasionally when scale=10 (for tiny, single-statement transactions). > With scale=100, I can't get it to deadlock on my laptop (60 clients in > all cases), at least in a reasonable time period. With the patch > btreelock_insert_on_dup.2013_12_12.patch.gz, it will never deadlock, > even with scale=1, simply because value locks are not held on to > across row locking. This is why I characterized the locking as > "opportunistic" on several occasions in months past. > > The test-case is actually much simpler than the one I describe in [1], > and much simpler than all previous test-cases, as there is only one > unique index, though the problem is essentially the same. It is down > to old "value locks" held across retries - with "exclusion_...", we > can't *stop* locking things from previous locking attempts (where a > locking attempt is btree insertion with the UNIQUE_CHECK_PARTIAL > flag), because dirty snapshots still see > inserted-then-deleted-in-other-xact tuples. This deadlocking seems > unprincipled and unjustified, which is a concern that I had all along, > and a concern that Heikki seemed to share more recently [3]. This is > why I felt strongly all along that value locks ought to be cheap to > both acquire and _release_, and it's unfortunate that so much time was > wasted on tangential issues, though I do accept some responsibility > for that. Hmm. If I understand the problem correctly, it's that as soon as another backend sees the tuple you've inserted and calls XactLockTableWait(), it will not stop waiting even if we later decide to kill the already-inserted tuple. One approach to fix that would be to release and immediately re-acquire the transaction-lock, when you kill an already-inserted tuple. Then teach the callers of XactLockTableWait() to re-check if the tuple is still alive. I'm just waving hands here, but the general idea is to somehow wake up others when you kill the tuple. We could make use of that facility to also let others to proceed, if you delete a tuple in the same transaction that you insert it. It's a corner case, not worth much on its own, but I think it would fall out of the above machinery for free, and be an easier way to test it than inducing deadlocks with ON DUPLICATE. - Heikki
On Fri, Dec 20, 2013 at 3:39 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Hmm. If I understand the problem correctly, it's that as soon as another > backend sees the tuple you've inserted and calls XactLockTableWait(), it > will not stop waiting even if we later decide to kill the already-inserted > tuple. > > One approach to fix that would be to release and immediately re-acquire the > transaction-lock, when you kill an already-inserted tuple. Then teach the > callers of XactLockTableWait() to re-check if the tuple is still alive. That particular mechanism sounds like a recipe for unintended consequences. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas escribió: > On Fri, Dec 20, 2013 at 3:39 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: > > Hmm. If I understand the problem correctly, it's that as soon as another > > backend sees the tuple you've inserted and calls XactLockTableWait(), it > > will not stop waiting even if we later decide to kill the already-inserted > > tuple. > > > > One approach to fix that would be to release and immediately re-acquire the > > transaction-lock, when you kill an already-inserted tuple. Then teach the > > callers of XactLockTableWait() to re-check if the tuple is still alive. > > That particular mechanism sounds like a recipe for unintended consequences. Yep, what I thought too. There are probably other ways to make that general idea work though. I didn't follow this thread carefully, but is the idea that there would be many promise tuples "live" at any one time, or only one? Because if there's only one, or a very limited number, it might be workable to sleep on that tuple's lock instead of the xact's lock. Another thought is to have a different LockTagType that signals a transaction that's doing the INSERT/ON DUPLICATE thingy, and remote backends sleep on that instead of the regular transaction lock. That different lock type could be released and reacquired as proposed by Heikki above without danger of unintended consequences. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 12/20/2013 10:56 PM, Alvaro Herrera wrote: > Robert Haas escribió: >> On Fri, Dec 20, 2013 at 3:39 PM, Heikki Linnakangas >> <hlinnakangas@vmware.com> wrote: >>> Hmm. If I understand the problem correctly, it's that as soon as another >>> backend sees the tuple you've inserted and calls XactLockTableWait(), it >>> will not stop waiting even if we later decide to kill the already-inserted >>> tuple. >>> >>> One approach to fix that would be to release and immediately re-acquire the >>> transaction-lock, when you kill an already-inserted tuple. Then teach the >>> callers of XactLockTableWait() to re-check if the tuple is still alive. >> >> That particular mechanism sounds like a recipe for unintended consequences. > > Yep, what I thought too. > > There are probably other ways to make that general idea work though. I > didn't follow this thread carefully, but is the idea that there would be > many promise tuples "live" at any one time, or only one? Because if > there's only one, or a very limited number, it might be workable to > sleep on that tuple's lock instead of the xact's lock. Only one. heap_update() and heap_delete() also grab a heavy-weight lock on the tuple, before calling XactLockTableWait(). _bt_doinsert() does not, but it could. Perhaps we can take advantage of that. - Heikki
On Fri, Dec 20, 2013 at 12:39 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Hmm. If I understand the problem correctly, it's that as soon as another > backend sees the tuple you've inserted and calls XactLockTableWait(), it > will not stop waiting even if we later decide to kill the already-inserted > tuple. Forgive me for being pedantic, but I wouldn't describe it that way. Quite simply, the tuples speculatively inserted (and possibly later deleted) are functionally value locks, that presently cannot be easily released (so my point is it doesn't matter if you're currently waiting on XactLockTableWait() or are just about to). I have to wonder about the performance implications of fixing this, even if we suppose the fix is itself inexpensive. The current approach probably benefits from not having to re-acquire value locks from previous attempts, since everyone still has to care about "value locks" from our previous attempts. The more I think about it, the more opposed I am to letting this slide, which is a notion I had considered last night, if only because MySQL did so for many years. This is qualitatively different from other cases where we deadlock. Even back when we exclusive locked rows as part of foreign key enforcement, I think it was more or less always possible to do an analysis of the dependencies that existed, to ensure that locks were acquired in a predictable order so that deadlocking could not occur. Now, maybe that isn't practical for an entire app, but it is practical to do in a localized way as problems emerge. In contrast, if we allowed unprincipled deadlocking, the only advice we could give is "stop doing so much upserting". -- Peter Geoghegan
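The localized analysis referred to above can be as simple as having every upserter take its row locks in one consistent key order. A sketch with an invented table, assuming rows in the VALUES list are processed in the order given, as in the examples upthread:

create table acct (id int primary key, balance numeric not null);

-- Two sessions that both touch ids 17 and 42 can deadlock if one lists
-- (17, 42) and the other (42, 17). Always listing keys in ascending
-- order means all sessions acquire the row locks in the same order,
-- which rules out the lock cycle:
with r as (
    insert into acct(id, balance) values (17, 100), (42, 100)
    on duplicate key lock for update
    returning rejects *
)
update acct set balance = acct.balance + r.balance
from r where acct.id = r.id;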
On Fri, Dec 20, 2013 at 1:12 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> There are probably other ways to make that general idea work though. I >> didn't follow this thread carefully, but is the idea that there would be >> many promise tuples "live" at any one time, or only one? Because if >> there's only one, or a very limited number, it might be workable to >> sleep on that tuple's lock instead of the xact's lock. > > Only one. > > heap_update() and heap_delete() also grab a heavy-weight lock on the tuple, > before calling XactLockTableWait(). _bt_doinsert() does not, but it could. > Perhaps we can take advantage of that. I am skeptical of this approach. It sounds like you're saying that you'd like to intersperse value and row locking, such that you'd definitely get a row lock on your first attempt after detecting a duplicate. With respect, I dismissed this months ago. Why should it be okay to leave earlier, actually inserted index tuples (from earlier unique indexes) behind? You still have to delete those (that is, the heap tuple) on conflict, and what you outline is sufficiently hand-wavey for me to strongly doubt the feasibility of making earlier btree tuples not behave as pseudo value locks ***in all relevant contexts***. How exactly do you determine that row versions were *deleted*? How do you sensibly differentiate between updates and deletes, or do you? What of lock starvation hazards? Perhaps I've misunderstood, but detecting and reasoning about deletedness like this seems like a major modularity violation, even by the standards of the btree AM. Do XactLockTableWait() callers have to re-check tuple-deletedness both before and after their XactLockTableWait() call? For regular non-upserting inserters too? I think that the way forward is to refine my design in order to upgrade locks from exclusive buffer locks to something else, managed by the lock manager but perhaps through an additional layer of indirection. As previously outlined, I'm thinking of a new SLRU-based granular value locking infrastructure built for this purpose, with btree inserters marking pages as having an entry in this table. That doesn't sound like much fun to go and implement, but it's reasonably well precedented, if authoritative transaction processing papers are anything to go by, as previously noted [1]. I hate to make a plausibility argument, particularly at this late stage, but: no one, myself included, has managed to find any holes in the semantics implied by my implementation in the last few months. It is relatively easy to reason about, and doesn't leave the idea of an amcanunique abstraction in tatters, nor does it expand the already byzantine tuple locking infrastructure in a whole new direction. These are strong advantages. It really isn't hard to imagine a totally sound implementation of the same idea -- what I do with buffer locks, but without actual buffer locks and their obvious attendant disadvantages, and without appreciably regressing the performance of non-upsert use-cases. AFAICT, there is way less uncertainty around doing this, unless you think that unprincipled deadlocking is morally defensible, which I don't believe you or anyone else does. [1] http://www.postgresql.org/message-id/CAM3SWZQ9XMM8bZyNX3memy1AMQcKqXuUSy8t1iFqZz999U_AGQ@mail.gmail.com -- Peter Geoghegan
On Fri, Dec 20, 2013 at 11:59 PM, Peter Geoghegan <pg@heroku.com> wrote: > I think that the way forward is to refine my design in order to > upgrade locks from exclusive buffer locks to something else, managed > by the lock manager but perhaps through an additional layer of > indirection. As previously outlined, I'm thinking of a new SLRU-based > granular value locking infrastructure built for this purpose, with > btree inserters marking pages as having an entry in this table. I'm working on a revision that holds lmgr page-level exclusive locks (and buffer pins) across multiple operations. This isn't too different to what you've already seen, since they are still only held for an instant. Notably, hash indexes currently quickly grab and release lmgr page-level locks, though they're the only existing clients of that infrastructure. I think on reflection that fully-fledged value locking may be overkill, given the fact that these locks are only held for an instant, and only need to function as a choke point for unique index insertion, and only when upserting occurs. This approach seems promising. It didn't take me very long to get it to a place where it passed a few prior test-cases of mine, with fairly varied input, though the patch isn't likely to be posted for another few days. I think I can get it to a place where it doesn't regress regular insertion at all. I think that that will tick all of the many boxes, without unwieldy complexity and without compromising conceptual integrity. I mention this now because obviously time is a factor. If you think there's something I need to do, or that there's some way that I can more usefully coordinate with you, please let me know. Likewise for anyone else following. -- Peter Geoghegan
On Sun, Dec 22, 2013 at 6:42 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Fri, Dec 20, 2013 at 11:59 PM, Peter Geoghegan <pg@heroku.com> wrote: >> I think that the way forward is to refine my design in order to >> upgrade locks from exclusive buffer locks to something else, managed >> by the lock manager but perhaps through an additional layer of >> indirection. As previously outlined, I'm thinking of a new SLRU-based >> granular value locking infrastructure built for this purpose, with >> btree inserters marking pages as having an entry in this table. > > I'm working on a revision that holds lmgr page-level exclusive locks > (and buffer pins) across multiple operations. This isn't too > different to what you've already seen, since they are still only held > for an instant. Notably, hash indexes currently quickly grab and > release lmgr page-level locks, though they're the only existing > clients of that infrastructure. I think on reflection that > fully-fledged value locking may be overkill, given the fact that these > locks are only held for an instant, and only need to function as a > choke point for unique index insertion, and only when upserting > occurs. > > This approach seems promising. It didn't take me very long to get it > to a place where it passed a few prior test-cases of mine, with fairly > varied input, though the patch isn't likely to be posted for another > few days. I think I can get it to a place where it doesn't regress > regular insertion at all. I think that that will tick all of the many > boxes, without unwieldy complexity and without compromising conceptual > integrity. > > I mention this now because obviously time is a factor. If you think > there's something I need to do, or that there's some way that I can > more usefully coordinate with you, please let me know. Likewise for > anyone else following. I don't think this is a project to rush through. We've lived without MERGE/UPSERT for several years now, and we can live without it for another release cycle while we try to reach agreement on the way forward. I can tell that you're convinced you know the right way forward here, and you may be right, but I don't think you've convinced everyone else - maybe not even anyone else. I wouldn't suggest modeling anything you do on the way hash indexes use heavyweight locks. That is a performance disaster, not to mention being basically a workaround for the fact that whoever wrote the code originally didn't bother figuring out any way that splitting a bucket could be accomplished in a crash-safe manner, even in theory. If it weren't for that, we'd be using buffer locks there. That doesn't necessarily mean that page-level heavyweight locks aren't the right thing here, but the performance aspects of any such approach will need to be examined carefully. To be honest, I am still not altogether sold on any part of this feature. I don't like the fact that it violates MVCC - although I admit that some form of violation is inevitable in any feature in this area unless we're content to live with many serialization failures, I don't like the particular way it violates MVCC, I don't like the syntax (returns rejects? blech), and I don't like the fact that getting the locking right, or even getting the semantics right, seems to be so darned hard. I think we're in real danger of building something that will be too complex, or just too weird, for users to use, and too complex to maintain as well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Dec 23, 2013 at 7:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I don't think this is a project to rush through. We've lived without > MERGE/UPSERT for several years now, and we can live without it for > another release cycle while we try to reach agreement on the way > forward. I can tell that you're convinced you know the right way > forward here, and you may be right, but I don't think you've convinced > everyone else - maybe not even anyone else. That may be. Attention from reviewers has been in relatively short supply. Not that that isn't always true. > I wouldn't suggest modeling anything you do on the way hash indexes > using heavyweight locks. That is a performance disaster, not to > mention being basically a workaround for the fact that whoever wrote > the code originally didn't bother figuring out any way that splitting > a bucket could be accomplished in a crash-safe manner, even in theory. > If it weren't for that, we'd be using buffer locks there. Having looked at the code for the first time recently, I'd agree that hash indexes are a disaster. A major advantage of The Lehman and Yao Algorithm, as prominently noted in the paper, is that exclusive locks are only acquired on leaf pages to increase concurrency. Since I only propose to extend this to a heavyweight page lock, and still only for an instant, it seems reasonable to assume that the performance will be acceptable for an initial version of this. It's not as if most places will have to pay any heed to this heavyweight lock - index scans and non-upserting inserts are generally unaffected. We can later optimize performance as we measure a need to do so. Early indications are that the performance is reasonable. Holding value locks for more than an instant doesn't make sense. The reason is simple: when upserting, we're tacitly only really looking for violations on one particular unique index. We just lock them all at once because the implementation doesn't know *which* unique index. So in actuality, it's really no different from existing potential-violation handling for unique indexes, except we have to do some extra work in addition to the usual restart from scratch stuff (iff we have multiple unique indexes). > To be honest, I am still not altogether sold on any part of this > feature. I don't like the fact that it violates MVCC - although I > admit that some form of violation is inevitable in any feature in this > area unless we're content to live with many serialization failures, I > don't like the particular way it violates MVCC Discussions around visibility issues have not been very useful. As I've said, I don't like the term "MVCC violation", because it's suggestive of some classical, codified definition of MVCC, a definition that doesn't actually exist anywhere, even in research papers, AFAICT. So while I understand your concerns around the modifications to HeapTupleSatisfiesMVCC(), and while I respect that we need to be mindful of the performance impact, my position is that if that really is what we need to do, we might as well be honest about it, and express intent succinctly and directly. This is a position that is orthogonal to the proposed syntax, even if that is convenient to my patch. It's already been demonstrated that yes, the MVCC violation can be problematic when we call HeapTupleSatisfiesUpdate(), which is a bug that was fixed by making another modest modification to HeapTupleSatisfiesUpdate(). 
It is notable that that bug would have still occurred had a would-be-HTSMVCC-invisible tuple been passed through any other means. What problem, specifically, do you envisage avoiding by doing it some other way? What other way do you have in mind? We invested huge effort into more granular FK locking when we had a few complaints about it. I wouldn't be surprised if that effort modestly regressed HeapTupleSatisfiesMVCC(). On the other hand, this feature has been in very strong demand for over a decade, and has a far smaller code footprint. I don't want to denigrate the FK locking stuff in any way - it is a fantastic piece of work - but it's important to have a sense of proportion about these things. In order to make visibility work in the way we require, we're almost always just doing additional checking of infomask bits, and the t_infomask variable is probably already in a CPU register (this is a huge simplification, but is essentially true). Like you, I have noted that HeapTupleSatisfiesMVCC() is a fairly hot routine during profiling before, but it's not *that* hot. It's understandable that you raise these points, but from my perspective it's hard to address your concerns without more concrete objections. > I don't like the > syntax (returns rejects? blech) I suppose it isn't ideal in some ways. On the other hand, it is extremely flexible, with many of the same advantages of SQL MERGE. Importantly, it will facilitate merging as part of conflict resolution on multi-master replication systems, which I think is of considerable strategic importance even beyond having a simple upsert. I would like to see us get this into 9.4, and get MERGE into 9.5. > and I don't like the fact that > getting the locking right, or even getting the semantics right, seems > to be so darned hard. I think we're in real danger of building > something that will be too complex, or just too weird, for users to > use, and too complex to maintain as well. Please don't conflate confusion or uncertainty around alternative approaches with confusion or uncertainty around mine - *far* more time has been spent discussing the former. While I respect the instinct that says we ought to be very conservative around changing anything that the btree AM does, I really don't think my design is itself all that complicated. I've been very consistent even in the face of strong criticism. What I have now is essentially the same design as back in early September. After the initial ON DUPLICATE KEY IGNORE patch in August, I soon realized that value locking and row locking could not be sensibly considered in isolation, and over the objections of others pushed ahead with integrating the two. I believe now as I believed then that value locks need to be cheap to release (or it at least needs to be possible), and that it was okay to drop all value locks when we need to deal with a possible conflict/getting an xid shared lock - if those unlocked pages have separate conflicts on our next attempt, the feature is being badly misused (for upserting) or it doesn't matter because we only need one conclusive "No" answer (for IGNOREing, but also for upserting). I have been trying to focus attention of these aspects throughout this discussion. I'm not sure how successful I was here. -- Peter Geoghegan
On Mon, Dec 23, 2013 at 5:59 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Mon, Dec 23, 2013 at 7:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I don't think this is a project to rush through. We've lived without >> MERGE/UPSERT for several years now, and we can live without it for >> another release cycle while we try to reach agreement on the way >> forward. I can tell that you're convinced you know the right way >> forward here, and you may be right, but I don't think you've convinced >> everyone else - maybe not even anyone else. > > That may be. Attention from reviewers has been in relatively short > supply. Not that that isn't always true. I think concrete concerns about usability have largely been subordinated to abstruse discussions about locking protocols. A discussion strictly about what syntax people would consider reasonable, perhaps on another thread, might elicit broader participation (although this week might not be the right time to try to attract an audience). > Having looked at the code for the first time recently, I'd agree that > hash indexes are a disaster. A major advantage of The Lehman and Yao > Algorithm, as prominently noted in the paper, is that exclusive locks > are only acquired on leaf pages to increase concurrency. Since I only > propose to extend this to a heavyweight page lock, and still only for > an instant, it seems reasonable to assume that the performance will be > acceptable for an initial version of this. It's not as if most places > will have to pay any heed to this heavyweight lock - index scans and > non-upserting inserts are generally unaffected. We can later optimize > performance as we measure a need to do so. Early indications are that > the performance is reasonable. OK. >> To be honest, I am still not altogether sold on any part of this >> feature. I don't like the fact that it violates MVCC - although I >> admit that some form of violation is inevitable in any feature in this >> area unless we're content to live with many serialization failures, I >> don't like the particular way it violates MVCC > > Discussions around visibility issues have not been very useful. As > I've said, I don't like the term "MVCC violation", because it's > suggestive of some classical, codified definition of MVCC, a > definition that doesn't actually exist anywhere, even in research > papers, AFAICT. I don't know whether or not that's literally true, but like Potter Stewart, I don't think there's any real ambiguity about the underlying concept. The concepts of read->write, write->read, and write->write dependencies between transactions are well-described in textbooks such as Jim Gray's Transaction Processing: Concepts and Techniques and this paper on MVCC: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.552&rep=rep1&type=pdf I think the definition of an MVCC violation is that a snapshot sees the effects of a transaction which committed after that snapshot was taken. And maybe it's good and right that this patch is introducing a new way for that to happen, or maybe it's not, but the point is that we get to decide. > I've been very consistent even in the face of strong criticism. What I > have now is essentially the same design as back in early September. > After the initial ON DUPLICATE KEY IGNORE patch in August, I soon > realized that value locking and row locking could not be sensibly > considered in isolation, and over the objections of others pushed > ahead with integrating the two. 
I believe now as I believed then that > value locks need to be cheap to release (or it at least needs to be > possible), and that it was okay to drop all value locks when we need > to deal with a possible conflict/getting an xid shared lock - if those > unlocked pages have separate conflicts on our next attempt, the > feature is being badly misused (for upserting) or it doesn't matter > because we only need one conclusive "No" answer (for IGNOREing, but > also for upserting). I'm not saying that you haven't been consistent, or that you've done anything wrong at all. I'm just saying that the default outcome is that we change nothing, and the fact that nobody's been able to demonstrate an approach is clearly superior to what you've proposed does not mean we have to accept what you've proposed. I am not necessarily advocating for rejecting your proposed approach, although I do have concerns about it, but I think it is clear that it is not backed by any meaningful amount of consensus. Maybe that will change in the next two months, and maybe it won't. If it doesn't, whether through any fault of yours or not, I don't think this is going in. If this is all perfectly clear to you already, then I apologize for belaboring the point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
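Concretely, the violation being debated takes roughly this shape (two-session sketch with an invented table; whether this interleaving should instead raise a serialization failure at the higher isolation levels is one of the open questions upthread):

create table t (k int primary key, v text);

-- Session A:
begin transaction isolation level repeatable read;
select * from t where k = 7;   -- snapshot taken; sees no such row
-- Session B:
insert into t values (7, 'theirs');
commit;
-- Session A:
with r as (
    insert into t values (7, 'mine')
    on duplicate key lock for update
    returning rejects *
)
select * from r;
-- For the upsert to make progress, A has to treat B's row as the
-- conflicting row and lock it, even though B committed after A's
-- snapshot was taken. That is the sense of "MVCC violation" at issue.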
On 2013-12-23 14:59:31 -0800, Peter Geoghegan wrote: > On Mon, Dec 23, 2013 at 7:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > I don't think this is a project to rush through. We've lived without > > MERGE/UPSERT for several years now, and we can live without it for > > another release cycle while we try to reach agreement on the way > > forward. Agreed, but I really think it's one of the biggest weaknesses of postgres at this point. > > I can tell that you're convinced you know the right way > > forward here, and you may be right, but I don't think you've convinced > > everyone else - maybe not even anyone else. > That may be. Attention from reviewers has been in relatively short > supply. Not that that isn't always true. I don't really see the lack of review as being crucial at this point. At least I have quite some doubts about the approach you've chosen and I have voiced it - so have others. Whether yours is workable seems to hinge entirely on whether you can build a scalable, maintainable value-locking scheme. Besides some thoughts about using slru.c for it I haven't seen much about the design of that part - might just have missed it though. Personally I can't ad-lib a design for it, but I haven't thought about it too much. I don't think there's too much reviewers can do before you've provided a POC implementation of real value locking. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 24, 2013 at 4:09 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I don't really see the lack of review as being crucial at this point. At > least I have quite some doubts about the approach you've chosen and I > have voiced it - so have others. Apparently you haven't been keeping up with this thread. The approach that Heikki outlined with his POC patch was demonstrated to deadlock in an unprincipled manner - it took me a relatively long time to figure this out because I didn't try a simple enough test-case. There is every reason to think that alternative promise tuple approaches would behave similarly, short of some very invasive, radical changes to how we wait on XID share locks that I really don't think are going to fly. That's why I chose this approach: at no point did anyone have a plausible alternative that didn't have similar problems, and I personally saw no alternative. It wasn't really a choice at all. In hindsight I should have known better than to think that people would be willing to defer discussion of a more acceptable value locking implementation to consider the interactions between the different subsystems, which I felt were critical and warranted up-front discussion, a belief which has now been borne out. Lesson learned. It's a pity that that's the way things are, because that discussion could have been really useful, and saved us all some time. > I don't think there's too much reviewers can do before you've provided a > POC implementation of real value locking. I don't see what is functionally insufficient about the value locking that you've already seen. I'm currently working towards extending the buffer locking to use a heavyweight lock held only for an instant, but potentially across multiple operations, although of course only when upserting occurs so as to not regress regular insertion. If you're still of the opinion that it is necessary to hold value locks of some form on earlier unique indexes, as you wait maybe for hours on some conflicting xid, then I still disagree with you for reasons recently re-stated [1]. You never stated a reason why you thought it was necessary. If you have one now, please share it. Note that I release all value locks before row locking too, which is necessary because to do any less will cause unprincipled deadlocking, as we've seen. Other than that, I have no idea what your continued objection to my design would be once the buffer level exclusive locks are replaced with page level heavyweight locks across complex (though brief) operations (I guess you might not like the visibility stuff or the syntax, but that isn't what you're talking about here). More granular value locking might help boost performance, but maybe not even by much, since we're only locking a single leaf page per unique index against insertion, and only for an instant. I see no reason to make the coarser-than-necessary granularity of the value locking a blocker. Predicate locks on btree leaf pages acquired by SSI are also coarser than strictly necessary. [1] http://www.postgresql.org/message-id/CAM3SWZSOdUmg4899tJc09R2uoRTYhb0VL9AasC1Fz7AW4GsR-g@mail.gmail.com -- Peter Geoghegan
Hi, On 2013-12-24 13:18:36 -0800, Peter Geoghegan wrote: > On Tue, Dec 24, 2013 at 4:09 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > I don't really see the lack of review as being crucial at this point. At > > least I have quite some doubts about the approach you've chosen and I > > have voiced it - so have others. > > Apparently you haven't been keeping up with this thread. The approach > that Heikki outlined with his POC patch was demonstrated to deadlock > in an unprincipled manner - it took me a relatively long time to > figure this out because I didn't try a simple enough test-case. So? I still have the fear that you approach will end up being way too complicated and full of layering violations. I didn't say it's a no-go (not that I have veto powers, even if I'd consider it one). And yes, I still think that promise tuples might be a better solution regardless of the issues you mentioned, but you know what? That doesn't matter. Me thinking it's the better approach is primarily based on gut feeling, and I clearly haven't voiced clear enough reasons to convince you. So you going with your own, possibly more substantiated, gut feeling is perfectly alright. Unless I go ahead and write a POC of my own at least ;) > In hindsight I should have known better than to think that people > would be willing to defer discussion of a more acceptable value > locking implementation to consider the interactions between the > different subsystems, which I felt were critical and warranted > up-front discussion, a belief which has now been borne out. > Lesson learned. It's a pity that that's the way things are, because that > discussion could have been really useful, and saved us all some time. Whoa? What? Not convincing everyone is far from it being a useless discussion. Such an attitude sure is not the way to go to elicit more feedback. And it clearly gave you the feedback that most people regard holding buffer locks across other nontrivial operations, in a potentially unbounded number, as a fundamental problem. > > I don't think there's too much reviewers can do before you've provided a > > POC implementation of real value locking. > > I don't see what is functionally insufficient about the value locking > that you've already seen. I still think it's fundamentally unacceptable to hold buffer locks across any additional complex operations. So yes, I think the current state is fundamentally insufficient. Note that the case of the existing uniqueness checking already is bad, but it at least will never run any user defined code in that context, just HeapTupleSatisfies* and HOT code. So I don't think arguments of the "we're already doing it in uniqueness checking" ilk have much merit. > If you're still of the opinion that it is necessary to hold value locks of some > form on earlier unique indexes, as you wait maybe for hours on some > conflicting xid, then I still disagree with you for reasons recently > re-stated [1]. I guess you're referring to: On 2013-12-23 14:59:31 -0800, Peter Geoghegan wrote: > Holding value locks for more than an instant doesn't make sense. The > reason is simple: when upserting, we're tacitly only really looking > for violations on one particular unique index. We just lock them all > at once because the implementation doesn't know *which* unique index. 
> So in actuality, it's really no different from existing > potential-violation handling for unique indexes, except we have to do > some extra work in addition to the usual restart from scratch stuff > (iff we have multiple unique indexes). I think the point here really is that you assume that we're always only looking for conflicts with one unique index. If that's all we want to support - sure, only the keys in that index need to be locked. I don't think that's necessarily a given, especially when you just want to look at the conflict in detail, without using a subtransaction. > You never stated a reason why you thought it was > necessary. If you have one now, please share it. Note that I release > all value locks before row locking too, which is necessary because to > do any less will cause unprincipled deadlocking, as we've seen. I can't sensibly comment upon that right now, I'd need to read more code to understand what you're doing there. > Other than that, I have no idea what your continued objection to my > design would be once the buffer level exclusive locks are replaced > with page level heavyweight locks across complex (though brief) > operations Well, you haven't delivered that part yet, that's pretty much my point, no? I don't think you can easily do this by just additionally taking a new kind of heavyweight locks in the new codepaths - that will still allow deadlocks with the old codepaths taking only lwlocks. So this is a nontrivial sub-project which very well might influence whether the approach is deemed acceptable or not. > (I guess you might not like the visibility stuff or the > syntax, but that isn't what you're talking about here). I don't particularly care about that for now. I think we can find common ground, even if it will take some further emails. It probably isn't what's in there right now, but I don't think you've intended it as such. I don't think the visibility modifications are a good thing (or correct) as is, but I don't think they are necessary for your approach to make sense. > I've been very consistent even in the face of strong criticism. What I > have now is essentially the same design as back in early September. Uh. And why's that necessarily a good thing? Minor details I noticed in passing: * Your tqual.c bit isn't correct, you're forgetting multixacts. * You several times mention "core" in comments as if this wouldn't be part of it, that seems confusing. * Doesn't ExecInsert() essentially busy-loop if there's a concurrent non-committed insert? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
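To make the multiple-unique-index point concrete, here is a hypothetical sketch (invented table and values, not taken from the patch or from any test case in this thread): a single proposed row can conflict with different existing rows on different unique indexes, so locking only the first conflict found leaves the others unresolved.

create table account (id int4 primary key, email text unique);
insert into account values (1, 'a@example.com');
insert into account values (2, 'b@example.com');

-- This row conflicts with row 1 on "email" and with row 2 on "id":
insert into account values (2, 'a@example.com')
on duplicate key lock for update;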
On Wed, Dec 25, 2013 at 6:25 AM, Andres Freund <andres@2ndquadrant.com> wrote: > So? I still have the fear that you approach will end up being way too > complicated and full of layering violations. I didn't say it's a no-go > (not that I have veto powers, even if I'd consider it one). Apart from not resulting in unprincipled deadlocking, it respects the AM abstraction more than all other approaches outlined. Inserting tuples as value locks just isn't a great approach, even if you ignore the fact you must come up with a whole new way to release your "value locks" without ending your xact. > And yes, I still think that promise tuples might be a better solution > regardless of the issues you mentioned, but you know what? That doesn't > matter. Me thinking it's the better approach is primarily based on gut > feeling, and I clearly haven't voiced clear enough reasons to convince > you. So you going with your own, possibly more substantiated, gut > feeling is perfectly alright. Unless I go ahead and write a POC of my > own at least ;) My position is not based on a gut feeling. It is based on carefully considering the interactions of the constituent parts, plus the experience of actually building a working prototype. > Whoa? What? Not convincing everyone is far from it being a useless > discussion. Such an attitude sure is not the way to go to elicit more > feedback. > And it clearly gave you the feedback that most people regard holding > buffer locks across other nontrivial operations, in a potentially > unbounded number, as a fundamental problem. Uh, I knew that it was a problem all along. While I explored ways of ameliorating the problem, I specifically stated that we should discuss the subsystems interactions/design, which you were far too quick to dismiss. The overall design is far more pertinent than one specific mechanism. While I certainly welcome your participation, if you want to be an effective reviewer I suggest examining your own attitude. Everyone wants this feature. >> I don't see what is functionally insufficient about the value locking >> that you've already seen. > > I still think it's fundamentally unacceptable to hold buffer locks > across any additional complex operations. So yes, I think the current > state is fundamentally insufficient. I said *functionally* insufficient. Buffer locks demonstrably do a perfectly fine job of value locking. Of course the current state is insufficient, but I'm talking about design here. >> Holding value locks for more than an instant doesn't make sense. The >> reason is simple: when upserting, we're tacitly only really looking >> for violations on one particular unique index. We just lock them all >> at once because the implementation doesn't know *which* unique index. >> So in actuality, it's really no different from existing >> potential-violation handling for unique indexes, except we have to do >> some extra work in addition to the usual restart from scratch stuff >> (iff we have multiple unique indexes). > > I think the point here really is that that you assume that we're always > only looking for conflicts with one unique index. If that's all we want > to support - sure, only the keys in that index need to be locked. > I don't think that's necessarily a given, especially when you just want > to look at the conflict in detail, without using a subtransaction. Why would I not assume that? It's perfectly obvious from the syntax that you can't do much if you don't know ahead of time where the conflict might be. 
It's just like the MySQL feature - the user had better know where it might be. Now, at least with my syntax as a user you have some capacity to recover if you consider ahead of time that you might get it wrong. But clearly it is rejected rows, not conflicting rows, that are projected, and multiple conflicts per row are not accounted for. We lock on the first conflict, which with idiomatic usage will be the only possible conflict. That isn't the only reason why value locks don't need to be held for more than an instant. It's just the most obvious one. Incidentally, there are many implementation reasons why "true value locking", where value locks are held indefinitely, is extremely difficult. When I referred to an SLRU, I was just exploring the idea of making value locks (still only held for an instant) more granular. On closer examination it looks to me like premature optimization, though. >> You never stated a reason why you thought it was >> necessary. If you have one now, please share it. Note that I release >> all value locks before row locking too, which is necessary because to >> do any less will cause unprincipled deadlocking, as we've seen. > > I can't sensibly comment upon that right now, I'd need to read more code > to understand what you're doing there. You could have looked at it back in September, if only you'd given these interactions the up-front consideration that they warranted. Nothing has changed there at all. > Well, you haven't delivered that part yet, that's pretty much my point, > no? > I don't think you can easily do this by just additionally taking a new > kind of heavyweight locks in the new codepaths - that will still allow > deadlocks with the old codepaths taking only lwlocks. So this is a > nontrivial sub-project which very well might influence whether the > approach is deemed acceptable or not. I have already written the code, and am in the process of cleaning it up and gaining confidence that I haven't missed something. It's not trivial, and there are some subtleties, but I think that your level of skepticism around the difficulty of doing this is excessive. Naturally, non-speculative insertion does have to care about the heavyweight locks sometimes, but only when a page-level flag is found to be set. >> (I guess you might not like the visibility stuff or the >> syntax, but that isn't what you're talking about here). > > I don't particularly care about that for now. I think we can find common > ground, even if it will take some further emails. It probably isn't > what's in there right now, but I don't think you've intended it as such. I certainly hope we can find common ground. I want to work with you. >> I've been very consistent even in the face of strong criticism. What I >> have now is essentially the same design as back in early September. > > Uh. And why's that necessarily a good thing? It isn't necessarily, but you've taken my comments out of context. I was addressing Robert, and his general point about there being confusion around the semantics and locking protocol aspects. My point was: if that general impression was created, it is almost entirely because of discussion of other approaches. The fact that I've been consistent on design aspects clearly indicates that no one has said anything to make me reconsider my position. If that's just because there hasn't been enough scrutiny of my design, then I can hardly be blamed; I've been begging for that kind of scrutiny. I have been the one casting doubt on other designs, and quite successfully I might add.
The fact that there was confusion about those other approaches should not prejudice anyone against my approach. That doesn't mean I'm right, of course, but as long as no one is examining those aspects, and as long as no one appreciates what considerations informed the design I came up with, we won't make progress. Can we focus on the design, and how things fit together, please? > Minor details I noticed in passing: > * Your tqual.c bit isn't correct, you're forgetting multixacts. I knew that was broken, but I don't grok the tuple locking code. Perhaps you can suggest a correct formulation. > * You several times mention "core" in comments as if this wouldn't be part of > it, that seems confusing. Well, the executor has no knowledge of the underlying AM, even if it is btree. What terminology do you suggest that captures that? > * Doesn't ExecInsert() essentially busy-loop if there's a concurrent > non-committed insert? No, it does not. It frees earlier value locks, and waits on the other xact in the usual manner, and then restarts from scratch. There is a XactLockTableWait() call in _bt_lockinsert(), just like _bt_doinsert(). It might be the case that waiters have to spin a few times if there are lots of conflicts on the same row, but that's similar to the current state of affairs in _bt_doinsert(). If there were transactions that kept aborting you'd see the same thing today, as when doing upserting with subtransactions with lots of conflicts. It is true that there is nothing to arbitrate the ordering (i.e. there is no LockTupleTuplock()/LockTuple() call in the btree code), but I think that doesn't matter because that arbitration occurs when something close to conventional row locking occurs in nodeModifyTable.c (or else we just IGNORE). If you think that there could be unpleasant interactions because of the lack of LockTuple() arbitration within _bt_lockinsert() or something like that, please provide a test-case. -- Peter Geoghegan
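For anyone unfamiliar with what "upserting with subtransactions" refers to, this is the long-standing pattern (roughly the example from the documentation; the table and names here are hypothetical) in which each unique_violation burns a subtransaction and forces a retry:

-- assumes: create table kv (k int4 primary key, v text);
create or replace function merge_kv(key int4, data text) returns void as
$$
begin
    loop
        -- first try to update an existing row
        update kv set v = data where k = key;
        if found then
            return;
        end if;
        -- not there: try to insert; a concurrent insert of the same key
        -- raises unique_violation, and we loop to try the update again
        begin
            insert into kv(k, v) values (key, data);
            return;
        exception when unique_violation then
            -- do nothing, and loop to try the update again
        end;
    end loop;
end;
$$ language plpgsql;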
On Fri, Dec 20, 2013 at 12:39 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Hmm. If I understand the problem correctly, it's that as soon as another > backend sees the tuple you've inserted and calls XactLockTableWait(), it > will not stop waiting even if we later decide to kill the already-inserted > tuple. > > One approach to fix that would be to release and immediately re-acquire the > transaction-lock, when you kill an already-inserted tuple. Then teach the > callers of XactLockTableWait() to re-check if the tuple is still alive. I'm > just waving hands here, but the general idea is to somehow wake up others > when you kill the tuple. While mulling this over further, I had an idea about this: suppose we marked the tuple in some fashion that indicates that it's a promise tuple. I imagine an infomask bit, although the concept makes me wince a bit since we don't exactly have bit space coming out of our ears there. Leaving that aside for the moment, whenever somebody looks at the tuple with a mind to calling XactLockTableWait(), they can see that it's a promise tuple and decide to wait on some other heavyweight lock instead. The simplest thing might be for us to acquire a heavyweight lock on the promise tuple before making index entries for it, and then have callers always wait on that instead of transitioning from the tuple lock to the xact lock. Then we don't need to do anything screwy like releasing our transaction lock; if we decide to kill the promise tuple, we have a lock to release that pertains specifically to that tuple. This might be a dumb idea; I'm just thinking out loud. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 26, 2013 at 5:58 PM, Robert Haas <robertmhaas@gmail.com> wrote: > While mulling this over further, I had an idea about this: suppose we > marked the tuple in some fashion that indicates that it's a promise > tuple. I imagine an infomask bit, although the concept makes me wince > a bit since we don't exactly have bit space coming out of our ears > there. Leaving that aside for the moment, whenever somebody looks at > the tuple with a mind to calling XactLockTableWait(), they can see > that it's a promise tuple and decide to wait on some other heavyweight > lock instead. The simplest thing might be for us to acquire a > heavyweight lock on the promise tuple before making index entries for > it, and then have callers wait on that instead always instead of > transitioning from the tuple lock to the xact lock. I think the interlocking with buffer locks and heavyweight locks to make that work could be complex. I'm working on a scheme where we always acquire a page heavyweight lock ahead of acquiring an equivalent buffer lock, and without any other buffer locks held (for the critical choke point buffer, to implement value locking). With my scheme, you may have to retry, but only in the event of page splits and only at the choke point. In any case, what you describe here strikes me as an expansion on the already less than ideal modularity violation within the btree AM (i.e. the way it buffer locks the heap with its own index buffers concurrently for uniqueness checking). It might be that the best argument for explicit value locks (implemented as page heavyweight locks or whatever) is that they are completely distinct to row locks, and are an abstraction managed entirely by the AM itself, quite similar to the historic, limited value locking that unique index enforcement has always used. If we take Heikki's POC patch as representative of promise tuple schemes in general, this scheme might not be good enough. Index tuple insertions don't wait on each other there, and immediately report conflict. We need pre-checking to get an actual conflict TID in that patch, with no help from btree available. I'm generally opposed to making value locks of any stripe be held for more than an instant (so we should not hold them indefinitely pending another conflicting xact finishing). It's not just that it's convenient to my implementation; I also happen to think that it makes no sense. Should you really lock a value in an earlier unique index for hours, pending conflicter xact finishing, because you just might happen to want to insert said value, but probably not? -- Peter Geoghegan
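As an illustration of the cost being questioned here (hypothetical table and timeline; this is only a sketch of the scenario, not output from either patch): with two unique indexes, a waiter that kept its earlier value locks would block third parties that have nothing to do with the eventual conflict.

create table t (a int4 primary key, b int4 unique);

-- Session 1, in a long-running transaction, inserts a row and stays open:
begin;
insert into t values (1, 7);

-- Session 2 upserts a row whose "b" value collides with session 1's
-- uncommitted tuple, so it must wait for session 1 to finish - perhaps
-- for hours:
insert into t values (10, 7) on duplicate key lock for update;

-- If session 2 kept a value lock on a = 10 in the primary key index for
-- that whole wait, session 3's entirely unrelated insertion would also
-- hang for hours:
insert into t values (10, 8);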
On 2013-12-26 21:11:27 -0800, Peter Geoghegan wrote: > I'm generally opposed to making value locks of any stripe be held for > more than an instant (so we should not hold them indefinitely pending > another conflicting xact finishing). It's not just that it's > convenient to my implementation; I also happen to think that it makes > no sense. Should you really lock a value in an earlier unique index > for hours, pending conflicter xact finishing, because you just might > happen to want to insert said value, but probably not? There are some advantages: For one, it allows you to guarantee forward progress if you do it right, which surely isn't a bad property to have. For another, it's much more in line with the way normal uniqueness checks work. Possibly the disadvantages outweigh the advantages, but that's a far cry from making no sense. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2013-12-25 15:27:36 -0800, Peter Geoghegan wrote: > Uh, I knew that it was a problem all along. While I explored ways of > ameliorating the problem, I specifically stated that we should discuss > the subsystems interactions/design, which you were far too quick to > dismiss. Aha? > The overall design is far more pertinent than one specific > mechanism. While I certainly welcome your participation, if you want > to be an effective reviewer I suggest examining your own attitude. > Everyone wants this feature. You know what. I don't particularly feel the need to be a reviewer of this patch. I comment because there didn't seem to be enough comments on some parts and because I see some things as problematic. If you don't want those comments, ok. No problem. > >> Holding value locks for more than an instant doesn't make sense. The > >> reason is simple: when upserting, we're tacitly only really looking > >> for violations on one particular unique index. We just lock them all > >> at once because the implementation doesn't know *which* unique index. > >> So in actuality, it's really no different from existing > >> potential-violation handling for unique indexes, except we have to do > >> some extra work in addition to the usual restart from scratch stuff > >> (iff we have multiple unique indexes). > > > > I think the point here really is that you assume that we're always > > only looking for conflicts with one unique index. If that's all we want > > to support - sure, only the keys in that index need to be locked. > > I don't think that's necessarily a given, especially when you just want > > to look at the conflict in detail, without using a subtransaction. > > Why would I not assume that? It's perfectly obvious from the syntax > that you can't do much if you don't know ahead of time where the > conflict might be. Because it's a damn useful feature to have. As I said above: > if that's all we want to support - sure, only the keys in that index > need to be locked. I don't think the current syntax the feature implements can be used as the sole argument for what the feature should be able to support. If you think from the angle of an async MM replication solution replicating a table with multiple unique keys, not having to specify a single index we expect conflicts from is surely helpful. > >> You never stated a reason why you thought it was > >> necessary. If you have one now, please share it. Note that I release > >> all value locks before row locking too, which is necessary because to > >> do any less will cause unprincipled deadlocking, as we've seen. > > > > I can't sensibly comment upon that right now, I'd need to read more code > > to understand what you're doing there. > > You could have looked at it back in September, if only you'd given > these interactions the up-front consideration that they warranted. > Nothing has changed there at all. Holy fuck. Peter. Believe it or not, I don't remember all code, comments & design that I've read at some point. And that sometimes means that I need to re-read code to judge some things. That I don't have time to fully do so on the 24th doesn't strike me as particularly surprising. > > Well, you haven't delivered that part yet, that's pretty much my point, > > no? > > I don't think you can easily do this by just additionally taking a new > > kind of heavyweight locks in the new codepaths - that will still allow > > deadlocks with the old codepaths taking only lwlocks.
So this is a > > nontrivial sub-project which very well might influence whether the > > approach is deemed acceptable or not. > > I have already written the code, and am in the process of cleaning it > up and gaining confidence that I haven't missed something. It's not > trivial, and there are some subtleties, but I think that your level of > skepticism around the difficulty of doing this is excessive. > Naturally, non-speculative insertion does have to care about the > heavyweight locks sometimes, but only when a page-level flag is found > to be set. Cool then. > >> I've been very consistent even in the face of strong criticism. What I > >> have now is essentially the same design as back in early September. > > > > Uh. And why's that necessarily a good thing? > > It isn't necessarily, but you've taken my comments out of context. It's demonstrative of the reaction to a good part of the doubts expressed. > Can we focus on the design, and how things fit together, > please? I don't understand you here. You want people to discuss the high level design but then criticize us for discussing the high level design when it involves *possibly* doing things differently. Evaluating approaches *is* focusing on the design. And saying that a basic constituent part doesn't work - like using the buffer locking for value locking, which you loudly doubted for some time - *is* design criticism. The pointed out weakness very well might be non-existent because of a misunderstanding, or relatively easily fixable. > > Minor details I noticed in passing: > > * Your tqual.c bit isn't correct, you're forgetting multixacts. > > I knew that was broken, but I don't grok the tuple locking code. > Perhaps you can suggest a correct formulation. I don't think there's nice high-level infrastructure to do what you need here yet. You probably need a variant of MultiXactIdIsRunning() like MultiXactIdAreMember() that checks whether any of our xids is participating. Which you then check when xmax is a multi. Unfortunately I am afraid that it won't be ok to check HEAP_XMAX_IS_LOCKED_ONLY xmaxes only - it might have been a no-key update + some concurrent key-share lock where the updater aborted. Now, I think you only acquire FOR UPDATE locks so far, but using subtransactions you still can get into such a scenario, even involving FOR UPDATE locks. > > * You several times mention "core" in comments as if this wouldn't be part of > > it, that seems confusing. > > Well, the executor has no knowledge of the underlying AM, even if it is btree. > What terminology do you suggest that captures that? I don't have a particularly nice suggestion. "generic" maybe. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 27, 2013 at 12:57 AM, Andres Freund <andres@2ndquadrant.com> wrote: > You know what. I don't particularly feel the need to be a reviewer of > this patch. I comment because there didn't seem to be enough comments on some > parts and because I see some things as problematic. If you don't want > those comments, ok. No problem. I was attempting to make a point about the controversy this generated in September. Which is: we were talking past each other. It was an unfortunate, unproductive use of our time - there was some useful discussion, but in general far more heat than light was generated. I don't want to play the blame game. I want to avoid that situation in the future, since it's obvious to me that it was totally avoidable. Okay? > I don't think the current syntax the feature implements can be used as > the sole argument for what the feature should be able to support. > > If you think from the angle of an async MM replication solution > replicating a table with multiple unique keys, not having to specify a > single index we expect conflicts from is surely helpful. Well, you're not totally on your own for something like that with this feature. You can project the conflicter's tid, and possibly do a more sophisticated recovery, like inspecting the locked row and iterating. That's probably not at all ideal, but then I can't even imagine what the best interface for what you describe here looks like. If there are multiple conflicts, do you delete or update some or all of them? How do you express that concept from a DML statement? Maybe you could project the conflict rows (with perhaps more than 1 for each row proposed for insertion) and not the rejected ones, but it's hard to imagine making that intuitive or succinct (conflicting rows could be projected twice or more for separate conflicts, etc). Maybe what I have here is in practical terms actually a pretty decent approximation of what you want. It seems undesirable to give other use-cases baggage around locking values for an indefinite period, just to make this work for MM replication, especially since it isn't clear that it actually could be used effectively by a MM replication solution given the syntax, or any conceivable alternative syntax or variant. Could SQL MERGE do this for you? Offhand I don't think that it could. In fact, I think it would be much less useful than what I've proposed for this use-case. Its "WHEN NOT MATCHED THEN" clause doesn't let you introspect details of what matched and did not match. Furthermore, though I haven't verified this, off-hand I suspect other systems are fussy about what you want to merge on. All examples of MERGE use I've found after a quick Google search show merging on simple equi-join criteria (a sketch of that style of MERGE follows at the end of this message). >> Can we focus on the design, and how things fit together, >> please? > I don't understand you here. You want people to discuss the high level > design but then criticize us for discussing the high level design when > it involves *possibly* doing things differently. Evaluating approaches > *is* focusing on the design. I spent several weeks earnestly thrashing out details of Heikki's design. I am open to any alternative design that meets the criteria I outlined to Heikki, with which Heikki was in full agreement. One of those criteria was that unprincipled deadlocking, which would never occur with equivalent update statements, should not occur. Unfortunately, Heikki's POC patch did not meet that standard.
I have limited enthusiasm for making it or a similar scheme meet that standard by further burdening the btree AM with additional knowledge of the heap or row locking. Since in the past you've expressed general concerns about the modularity violation within the btree AM today, I assume that you aren't too enthusiastic about that kind of expansion either. > Unfortunately I am afraid that it won't be ok to check > HEAP_XMAX_IS_LOCKED_ONLY xmaxes only - it might have been a no-key > update + some concurrent key-share lock where the updater aborted. Now, > I think you only acquire FOR UPDATE locks so far That's right. Just FOR UPDATE locks. > but using > subtransactions you still can get into such a scenario, even involving > FOR UPDATE locks. Sigh. -- Peter Geoghegan
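For reference, the style of MERGE being referred to above looks roughly like the following (SQL-standard syntax sketched from memory, with invented names; PostgreSQL does not currently implement MERGE). Note that the WHEN clauses give no way to see which constraint would have been violated, or by which existing row:

merge into tgt
using (values (1, 'x'), (2, 'y')) as src(k, v)
on tgt.k = src.k                      -- a simple equi-join condition
when matched then
    update set v = src.v
when not matched then
    insert (k, v) values (src.k, src.v);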
Attached revision only uses heavyweight page locks across complex operations. I haven't benchmarked it, but it appears to perform reasonably well. I haven't attempted to measure a regression for regular insertion, but offhand it seems likely that any regression would be well within the noise - more or less immeasurably small. I won't repeat too much of what is already well commented in the patch. For those that would like a relatively quick summary of what I've done, I include inline a new section that I've added to the nbtree README: Notes about speculative insertion --------------------------------- As an amcanunique AM, the btree implementation is required to support "speculative insertion". This means that the value locking method through which unique index enforcement conventionally occurs is extended and generalized, such that insertion is staggered: the core code attempts to get full consensus on whether values proposed for insertion will not cause duplicate key violations. Speculative insertion is only possible for unique index insertion without deferred uniqueness checking (since speculative insertion into a deferred unique constraint's index is a contradiction in terms). For conventional unique index insertion, the Btree implementation exclusive locks a buffer holding the first page that the value to be inserted could possibly be on, though only for an instant, during and shortly after uniqueness verification. It would not be acceptable to hold this lock across complex operations for the duration of the remainder of the first phase of speculative insertion. Therefore, we convert this exclusive buffer lock to an exclusive page lock managed by the lock manager, thereby greatly ameliorating the consequences of undiscovered deadlocking implementation bugs (though deadlocks are not expected), and minimizing the impact on system interruptibility, while not affecting index scans. It may be useful to informally think of the page lock type acquired by speculative insertion as similar to an intention exclusive lock, a type of lock found in some legacy 2PL database systems that use multi-granularity locking. A session establishes the exclusive right to subsequently establish a full write lock, without actually blocking reads of the page unless and until a lock conversion actually occurs, at which point both reads and writes are blocked. Under this mental model, buffer shared locks can be thought of as intention shared locks. As implemented, these heavyweight locks are only relevant to the insertion case; at no other point are they actually considered, since insertion is the only way through which new values are introduced. The first page a value proposed for insertion into an index could be on represents a natural choke point for our extended, though still rather limited system of value locking. Naturally, when we perform a "lock escalation" and acquire an exclusive buffer lock, all other buffer locks on the same buffer are blocked, which is how the implementation localizes knowledge about the heavyweight lock to insertion-related routines. Apart from deletion, which is concomitantly prevented by holding a pin on the buffer throughout, all exclusive locking of Btree buffers happen as a direct or indirect result of insertion, so this approach is sufficient. (Actually, an exclusive lock may still be acquired without insertion to initialize a root page, but that hardly matters.) 
Note that all value locks (including buffer pins) are dropped immediately as speculative insertion is aborted, as the implementation waits on the outcome of another xact, or as "insertion proper" occurs. These page-level locks are not intended to last more than an instant. In general, the protocol for heavyweight locking Btree pages is that heavyweight locks are acquired before any buffer locks are held, while the locks are only released after all buffer locks are released. While not a hard and fast rule, presently we avoid heavyweight page locking more than one page per unique index concurrently. Happy new year -- Peter Geoghegan
Attachment
On 12/26/2013 01:27 AM, Peter Geoghegan wrote: > On Wed, Dec 25, 2013 at 6:25 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> And yes, I still think that promise tuples might be a better solution >> regardless of the issues you mentioned, but you know what? That doesn't >> matter. Me thinking it's the better approach is primarily based on gut >> feeling, and I clearly haven't voiced clear enough reasons to convince >> you. So you going with your own, possibly more substantiated, gut >> feeling is perfectly alright. Unless I go ahead and write a POC of my >> own at least ;) > > My position is not based on a gut feeling. It is based on carefully > considering the interactions of the constituent parts, plus the > experience of actually building a working prototype. I also carefully considered all that stuff, and reached a different conclusion. Plus I also actually built a working prototype (for some approximation of "working" - it's still a prototype). >> Whoa? What? Not convincing everyone is far from it being a useless >> discussion. Such an attitude sure is not the way to go to elicit more >> feedback. >> And it clearly gave you the feedback that most people regard holding >> buffer locks across other nontrivial operations, in a potentially >> unbounded number, as a fundamental problem. > > Uh, I knew that it was a problem all along. While I explored ways of > ameliorating the problem, I specifically stated that we should discuss > the subsystems interactions/design, which you were far too quick to > dismiss. The overall design is far more pertinent than one specific > mechanism. While I certainly welcome your participation, if you want > to be an effective reviewer I suggest examining your own attitude. > Everyone wants this feature. Frankly I'm pissed off that you dismissed from the start the approach that seems much better to me. I gave you a couple of pointers very early on: look at the way we do exclusion constraints, and try to do something like promise tuples or killing an already-inserted tuple. You dismissed that, so I had to write that prototype myself. Even after that, you have spent zero effort to resolve the remaining issues with that approach, proclaiming that it's somehow fundamentally flawed and that locking index pages is obviously better. It's not. Sure, it still needs work, but the remaining issue isn't that difficult to resolve. Surely not any more complicated than what you did with heavy-weight locks on b-tree pages in your latest patch. Now, enough with the venting. Back to drawing board, to figure out how best to fix the deadlock issue with the insert_on_dup-kill-on-conflict-2.patch. Please help me with that. PS. In btreelock_insert_on_dup_v5.2013_12_28.patch, the language used in the additional text in README is quite difficult to read. Too many difficult sentences and constructs for a non-native English speaker like me. I had to look up "concomitantly" in a dictionary and I'm still not sure I understand that sentence :-). - Heikki
On 12/27/2013 07:11 AM, Peter Geoghegan wrote: > On Thu, Dec 26, 2013 at 5:58 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> While mulling this over further, I had an idea about this: suppose we >> marked the tuple in some fashion that indicates that it's a promise >> tuple. I imagine an infomask bit, although the concept makes me wince >> a bit since we don't exactly have bit space coming out of our ears >> there. Leaving that aside for the moment, whenever somebody looks at >> the tuple with a mind to calling XactLockTableWait(), they can see >> that it's a promise tuple and decide to wait on some other heavyweight >> lock instead. The simplest thing might be for us to acquire a >> heavyweight lock on the promise tuple before making index entries for >> it, and then have callers wait on that instead always instead of >> transitioning from the tuple lock to the xact lock. Yeah, that seems like it should work. You might not even need an infomask bit for that; just take the "other heavyweight lock" always before calling XactLockTableWait(), whether it's a promise tuple or not. If it's not, acquiring the extra lock is a waste of time but if you're going to sleep anyway, the overhead of one extra lock acquisition hardly matters. > I think the interlocking with buffer locks and heavyweight locks to > make that work could be complex. Hmm. Can you elaborate? The inserter has to acquire the heavyweight lock before releasing the buffer lock, because otherwise another inserter (or deleter or updater) might see the tuple, acquire the heavyweight lock, and fall to sleep on XactLockTableWait(), before the inserter has grabbed the heavyweight lock. If that race condition happens, you have the original problem again, ie. the updater unnecessarily waits for the inserting transaction to finish, even though it already killed the tuple it inserted. That seems easy to avoid. If the heavyweight lock uses the transaction id as the key, just like XactLockTableInsert/XactLockTableWait, you can acquire it before doing the insertion. Peter, can you give that a try, please? - Heikki
On Sun, Dec 29, 2013 at 8:18 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> My position is not based on a gut feeling. It is based on carefully >> considering the interactions of the constituent parts, plus the >> experience of actually building a working prototype. > > > I also carefully considered all that stuff, and reached a different > conclusion. Plus I also actually built a working prototype (for some > approximation of "working" - it's still a prototype). Well, clearly you're in agreement with me about unprincipled deadlocking. That's what I was referring to here. > Frankly I'm pissed off that you dismissed from the start the approach that > seems much better to me. I gave you a couple of pointers very early on: look > at the way we do exclusion constraints, and try to do something like promise > tuples or killing an already-inserted tuple. You dismissed that, so I had to > write that prototype myself. Sorry, but that isn't consistent with my recollection at all. The first e-mail you sent to any of the threads on this was on 2013-11-18. Your first cut at a prototype was on 2013-11-19, the very next day. If you think that I ought to have been able to know what you had in mind based on conversations at pgConf.EU, you're giving me way too much credit. The only thing vaguely along those lines that I recall you mentioning to me in Dublin was that you thought I should make this work with exclusion constraints - I was mostly explaining what I'd done, and why. I was pleased that you listened courteously, but I didn't have a clue what you had in mind, not least because to the best of my recollection you didn't say anything about killing tuples. I'm not going to swear that you didn't say something like that, since a lot of things were said in a relatively short period, but it's certainly true that I was quite curious about how you might go about incorporating exclusion constraints into this for a while before you began visibly participating on list. Now, when you actually posted the prototype, I realized that it was the same basic design that I'd cited in my very first e-mail on the IGNORE patch (the one that didn't have row locking at all) - nobody else wanted to do heap insertion first for promise tuples. I read that 2007 thread [1] a long time ago, but that recognition only came when you posted your first prototype, or perhaps shortly before when you started participating on list. I am unconvinced that making this work for exclusion constraints is compelling, except for IGNORE, which is sensible but something I would not weigh heavily at all. In any case, your implementation currently doesn't lock more than one row per tuple proposed for insertion, even though exclusion constraints could have a huge number of rows to lock when you propose to insert a row with a range covering a decade, where with unique indexes you only ever lock either 0 or 1 rows per slot (a sketch of that scenario follows at the end of this message). I could fairly easily extend my patch to have it work for exclusion constraints with IGNORE only. You didn't try and convince me that what you have proposed is better than what I have. You immediately described your approach. You did say some things about buffer locking, but you didn't differentiate between what was essential to my design, and what was incidental, merely calling it scary (i.e. you did something similar to what you're accusing me of here - you didn't dismiss it, but you didn't address it either).
If you look back at the discussion throughout late November and much of December, it is true that I am consistently critical, but that was clearly a useful exercise, because now we know there is a problem to fix. Why is your approach better? You never actually said. In short, I think that my approach may be better because it doesn't conflate row locking with value locking (therefore making it easier to reason about, maintaining modularity), and that it never bloats, and that releasing locks is clearly cheap which may matter a lot sometimes. I don't think the "intent exclusive" locking of my most recent revision is problematic for performance - as the L&Y paper says, exclusive locking leaf pages only is not that problematic. Extending that in a way that still allows reads, only for an instant isn't going to be too problematic. I'm not sure that this is essential to your design, and I'm not sure what your thoughts are on this, but Andres has defended the idea of promise tuples that lock old values indefinitely pending the outcome of another xact where we'll probably want to update, and I think this is a bad idea. Andres recently seemed less convinced of this than in the past [2], but I'd like to hear your thoughts. It's very pertinent, because I think releasing locks needs to be cheap, and rendering old promise tuples invisible is not cheap. I'm not trying to piss anyone off here - I need all the help I can get. These are important questions, and I'm not posing them to you to be contrary. > Even after that, you have spent zero effort to > resolve the remaining issues with that approach, proclaiming that it's > somehow fundamentally flawed and that locking index pages is obviously > better. It's not. Sure, it still needs work, but the remaining issue isn't > that difficult to resolve. Surely not any more complicated than what you did > with heavy-weight locks on b-tree pages in your latest patch. I didn't say that locking index pages is obviously better, and I certainly didn't say anything about what you've done being fundamentally flawed. I said that I "have limited enthusiasm for expanding the modularity violation that exists within the btree AM". Based on what Andres has said in the recent past on this thread about the current btree code, that "in my opinion, bt_check_unique() doing [locking heap and btree buffers concurrently] is a bug that needs fixing" [3], can you really blame me? What this patch does not need is another controversy. It seems pretty reasonable and sane that we'd implement this by generalizing from the existing mechanism. Plus there is plenty of evidence of other systems escalating what they call "latches" and what we call buffer locks to heavyweight locks, I believe going back to the 1970s. It's far from radical. > Now, enough with the venting. Back to drawing board, to figure out how best > to fix the deadlock issue with the insert_on_dup-kill-on-conflict-2.patch. > Please help me with that. I will help you. I'll look at it tomorrow. > PS. In btreelock_insert_on_dup_v5.2013_12_28.patch, the language used in the > additional text in README is quite difficult to read. Too many difficult > sentences and constructs for a non-native English speaker like me. I had to > look up "concomitantly" in a dictionary and I'm still not sure I understand > that sentence :-). Perhaps I should have eschewed obfuscation and espoused elucidation here. I was trying to fit the style of the surrounding text. 
I just mean that aside from the obvious reason for holding a pin, doing so at the same time precludes deletion of the buffer, which requires a "super exclusive" lock on the buffer. [1] http://www.postgresql.org/message-id/1172858409.3760.1618.camel@silverbirch.site [2] http://www.postgresql.org/message-id/20131227075453.GB17584@alap2.anarazel.de [3] http://www.postgresql.org/message-id/20130914221524.GF4071@awork2.anarazel.de -- Peter Geoghegan
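Here is the sketch promised above of the exclusion constraint case (hypothetical table; nothing here comes from either patch): with an exclusion constraint, one proposed row can conflict with - and so, under an upsert-style row locking scheme, would have to lock - an arbitrary number of existing rows, where a unique index only ever implicates zero or one.

create table reservation (during tsrange, exclude using gist (during with &&));

-- Many short, non-overlapping reservations already exist, e.g.:
insert into reservation values ('[2014-01-01, 2014-01-02)');
insert into reservation values ('[2014-01-02, 2014-01-03)');

-- A single proposed row spanning a decade overlaps every one of them:
insert into reservation values ('[2014-01-01, 2024-01-01)');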
On 12/30/2013 05:57 AM, Peter Geoghegan wrote: > Now, when you actually posted the prototype, I realized that it was > the same basic design that I'd cited in my very first e-mail on the > IGNORE patch (the one that didn't have row locking at all) - nobody > else wanted to do heap insertion first for promise tuples. I read that > 2007 thread [1] a long time ago, but that recognition only came when > you posted your first prototype, or perhaps shortly before when you > started participating on list. Ah, I didn't remember that thread. Yeah, apparently I proposed the exact same design back then. Simon complained about the dead tuples being left behind, but I don't think that's a big issue with the design we've been playing around now; you only end up with dead tuples when two backends try to insert the same value concurrently, which shouldn't happen very often. Other than that, there wasn't any discussion on whether that's a good approach or not. > In short, I think that my approach may be better because it doesn't > conflate row locking with value locking (therefore making it easier > to reason about, maintaining modularity), You keep saying that, but I don't understand what you mean. With your approach, an already-inserted heap tuple acts like a value lock, just like in my approach. You have the page-level locks on b-tree pages in addition to that, but the heap-tuple based mechanism is there too. > I'm not sure that this is essential to your design, and I'm not sure > what your thoughts are on this, but Andres has defended the idea of > promise tuples that lock old values indefinitely pending the outcome > of another xact where we'll probably want to update, and I think this > is a bad idea. Andres recently seemed less convinced of this than in > the past [2], but I'd like to hear your thoughts. It's very pertinent, > because I think releasing locks needs to be cheap, and rendering old > promise tuples invisible is not cheap. Well, killing an old promise tuple is not cheap, but it shouldn't happen often. In both approaches, what probably matters more is the overhead of the extra heavy-weight locking. But this is all speculation, until we see some benchmarks. > I said that I "have limited enthusiasm for > expanding the modularity violation that exists within the btree AM". > Based on what Andres has said in the recent past on this thread about > the current btree code, that "in my opinion, bt_check_unique() doing > [locking heap and btree buffers concurrently] is a bug that needs > fixing" [3], can you really blame me? What this patch does not need is > another controversy. It seems pretty reasonable and sane that we'd > implement this by generalizing from the existing mechanism. _bt_check_unique() is a modularity violation, agreed. Beauty is in the eye of the beholder, I guess, but I don't see either patch making that any better or worse. >> Now, enough with the venting. Back to drawing board, to figure out how best >> to fix the deadlock issue with the insert_on_dup-kill-on-conflict-2.patch. >> Please help me with that. > > I will help you. I'll look at it tomorrow. Thanks! > [1] http://www.postgresql.org/message-id/1172858409.3760.1618.camel@silverbirch.site > > [2] http://www.postgresql.org/message-id/20131227075453.GB17584@alap2.anarazel.de > > [3] http://www.postgresql.org/message-id/20130914221524.GF4071@awork2.anarazel.de - Heikki
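As a starting point for the benchmarking being called for, one obvious stress test is a small pgbench custom script that hammers the upsert path on a small key range, run with something like pgbench -n -f upsert.sql -c 16 -j 4 -T 60 against a table created as create table upsert_bench (k int4 primary key, v text). This is only a suggested sketch with invented names; \setrandom is the pgbench meta-command of this vintage:

\setrandom k 1 1000
insert into upsert_bench values (:k, 'x') on duplicate key lock for update;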
On 2013-12-29 19:57:31 -0800, Peter Geoghegan wrote: > On Sun, Dec 29, 2013 at 8:18 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: > >> My position is not based on a gut feeling. It is based on carefully > >> considering the interactions of the constituent parts, plus the > >> experience of actually building a working prototype. > > > > > > I also carefully considered all that stuff, and reached a different > > conclusion. Plus I also actually built a working prototype (for some > > approximation of "working" - it's still a prototype). > > Well, clearly you're in agreement with me about unprincipled > deadlocking. That's what I was referring to here. Maybe you should describe what you mean by "unprincipled". Sure, the current patch deadlocks, but I don't see anything fundamental, unresolvable there. So I don't understand what the word unprincipled means in that sentence. > Andres recently seemed less convinced of this than in > the past [2], but I'd like to hear your thoughts. Not really, I just don't have the time/energy to fight for it (aka write a POC) at the moment. I still think any form of promise tuple, be it index- or heap-based, is a much better, more general approach than yours. That doesn't preclude other approaches from being workable though. > I didn't say that locking index pages is obviously better, and I > certainly didn't say anything about what you've done being > fundamentally flawed. I said that I "have limited enthusiasm for > expanding the modularity violation that exists within the btree AM". > Based on what Andres has said in the recent past on this thread about > the current btree code, that "in my opinion, bt_check_unique() doing > [locking heap and btree buffers concurrently] is a bug that needs > fixing" [3], can you really blame me? Uh. But that was said in the context of *your* approach being flawed. Because it - at that time, I didn't look at the newest version yet - extended the concept of holding btree page locks across external operations to far more code, even including user-defined code! And you argued that that isn't a problem using _bt_check_unique() as an argument. I don't really see why your patch is less of a modularity violation than Heikki's POC. It's just a different direction. > > PS. In btreelock_insert_on_dup_v5.2013_12_28.patch, the language used in the > > additional text in README is quite difficult to read. Too many difficult > > sentences and constructs for a non-native English speaker like me. I had to > > look up "concomitantly" in a dictionary and I'm still not sure I understand > > that sentence :-). +1 on the simpler language front as a fellow non-native speaker. Personally, the biggest thing I think you could do in favor of your position is trying to be a bit more succinct in the mailing list discussions. I certainly fail at times at that as well, but I really try to work on it... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Dec 30, 2013 at 8:19 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Maybe you should describe what you mean by "unprincipled". Sure, the > current patch deadlocks, but I don't see anything fundamental, > unresolvable there. So I don't understand what the word unprincipled > means in that sentence. Maybe it is resolvable, and maybe it's worth resolving - I never said that it wasn't, I just said that I doubt the latter. By unprincipled deadlocking, I mean deadlocking that cannot be reasonably prevented by a user. Currently, I think that never deadlocking is a reasonable aspiration for all applications. It's never really necessary. When it occurs, we can advise users to do simple analysis and application refactoring to prevent it. With unprincipled deadlocking, we can give no such advice. The only advice we can give is to stop doing so much upserting, which is a big departure from how things are today. AFAICT, no one disagrees with my view that this is bad, and probably unacceptable. > Uh. But that was said in the context of *your* approach being > flawed. Because it - at that time, I didn't look at the newest version > yet - extended the concept of holding btree page locks across external > operations to far more code, even including user-defined code! And > you argued that that isn't a problem using _bt_check_unique() as an > argument. That's a distortion of my position at the time. I acknowledged from the start that all buffer locking was problematic (e.g. [1]), and was exploring alternative locking approaches and the merit of the design. This is obviously the kind of project that needs to be worked at through iterative prototyping. While arguing that deadlocking would not occur, I lost sight of the bigger picture. But even if that wasn't true, I don't know why you feel the need to go on and on about buffer locking like this months later. Are you trying to be provocative? Can you *please* stop? Everyone knows that the btree heap access is a modularity violation. Even the AM docs say that the heap access is "without a doubt ugly and non-modular". So my original point remains, which is that expanding that is obviously going to be controversial, and probably legitimately so. I thought that your earlier remarks on _bt_check_unique() were a good example of this sentiment, but I hardly need such an example. > I don't really see why your patch is less of a modularity violation than > Heikki's POC. It's just a different direction. My approach does not regress modularity because it doesn't do anything extra with the heap at all, and only btree insertion routines are affected. Locking is totally localized to the btree insertion routines - one .c file. At no other point does anything else have to care, and it's obvious that this situation won't change in the future when we decide to do something else cute with infomask bits or whatever. That's a *huge* distinction. [1] http://www.postgresql.org/message-id/CAM3SWZR2X4HJg7rjn0K4+hFdguCYX2prEP0Y3a7nccEjEowqqw@mail.gmail.com -- Peter Geoghegan
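To spell out the distinction with a hypothetical example (ordinary statements, no upsert involved): today's deadlocks have a standard remedy - acquire row locks in a consistent key order - whereas the deadlocks at issue here admit no comparable rewrite.

create table counters (k int4 primary key, v int4);
insert into counters values (1, 0), (2, 0);

-- Session 1:
begin;
update counters set v = v + 1 where k = 1;
-- Session 2:
begin;
update counters set v = v + 1 where k = 2;
-- Session 1, now blocking on session 2's row:
update counters set v = v + 1 where k = 2;
-- Session 2, blocking on session 1's row; a deadlock is detected and one
-- session is rolled back:
update counters set v = v + 1 where k = 1;

-- The advice users can be given today: have every transaction lock keys in
-- a consistent order (here, 1 before 2), and the cycle cannot arise.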
On 2013-12-30 12:29:22 -0800, Peter Geoghegan wrote: > But even if that wasn't > true, I don't know why you feel the need to go on and on about buffer > locking like this months later. Are you trying to be provocative? Can > you *please* stop? ERR? Peter? *You* quoted a statement of mine that only made sense in its original context. And I *did* say that the point about buffer locking applied to the *past* version of the patch. Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Dec 30, 2013 at 12:45 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-12-30 12:29:22 -0800, Peter Geoghegan wrote: >> But even if that wasn't >> true, I don't know why you feel the need to go on and on about buffer >> locking like this months later. Are you trying to be provocative? Can >> you *please* stop? > > ERR? Peter? *You* quoted a statement of mine that only made sense in > it's original context. And I *did* say that the point about buffer > locking applied to the *past* version of the patch. Not so. You suggested it was a bug that needed to be fixed, completely independently of this effort. You clearly referred to the current code. "Yes, it it is different. But, in my opinion, bt_check_unique() doing so is a bug that needs fixing. Not something that we want to extend." -- Peter Geoghegan
On 12/30/2013 12:45 PM, Andres Freund wrote: > On 2013-12-30 12:29:22 -0800, Peter Geoghegan wrote: >> But even if that wasn't >> true, I don't know why you feel the need to go on and on about buffer >> locking like this months later. Are you trying to be provocative? Can >> you *please* stop? > > ERR? Peter? *You* quoted a statement of mine that only made sense in > it's original context. And I *did* say that the point about buffer > locking applied to the *past* version of the patch. Alright this seems to have gone from confusion about the proposal to confusion about the confusion. Might I suggest a cooling off period and a return to the discussion in possibly a Wiki page where the points/counter points could be laid out more efficiently? > > > Andres > -- Adrian Klaver adrian.klaver@gmail.com
On Mon, Dec 30, 2013 at 7:22 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Ah, I didn't remember that thread. Yeah, apparently I proposed the exact > same design back then. Simon complained about the dead tuples being left > behind, but I don't think that's a big issue with the design we've been > playing around now; you only end up with dead tuples when two backends try > to insert the same value concurrently, which shouldn't happen very often. Right, because you check first, which also has a cost, paid in CPU cycles and memory bandwidth, and buffer lock contention. As opposed to a cost almost entirely localized to inserters into a single leaf page per unique index for only an instant. You're checking *all* unique indexes. You call check_exclusion_or_unique_constraint() once per unique index (or EC), and specify to wait on the xact, at least until a conflict is found. So if you're waiting on an xact, your conclusion that earlier unique indexes had no conflicts could soon become totally obsolete. So for non-idiomatic usage, say like the usage Andres in particular cares about for MM conflict resolution, I worry about the implications of that. I'm not asserting that it's a problem, but it does seem like something that's quite hard to reason about. Maybe Andres can comment. >> In short, I think that my approach may be better because it doesn't >> conflate row locking with value locking (therefore making it easier >> to reason about, maintaining modularity), > > You keep saying that, but I don't understand what you mean. With your > approach, an already-inserted heap tuple acts like a value lock, just like > in my approach. You have the page-level locks on b-tree pages in addition to > that, but the heap-tuple based mechanism is there too. Yes, but that historic behavior isn't really value locking at all. That's very much like row locking, because there is a row, not the uncertain intent to try to insert a row. Provided your transaction commits and the client's transaction doesn't delete the row, the row is definitely there. For upsert, conflicts may well be the norm, not the exception. Value locking is the exclusive lock on the buffer held during _bt_check_unique(). I'm trying to safely extend that mechanism, to reach consensus among unique indexes, which to me seems the most logical and conservative approach. For idiomatic usage, it's only sensible for there to be a conflict on one unique index, known ahead of time. If you don't know where the conflict will be, then typically your DML statement is unpredictable, just like the MySQL feature. Otherwise, for MM conflict resolution, I think it makes sense to pick those conflicts off, one at a time, dealing with exactly one row per conflict. I mean, even with your approach, you're still not dealing with later conflicts in later unique indexes, right? The fact that you prevent conflicts on previously non-conflicting unique indexes only, and, I suppose, not later ones too, seems quite arbitrary. > Well, killing an old promise tuple is not cheap, but it shouldn't happen > often. In both approaches, what probably matters more is the overhead of the > extra heavy-weight locking. But this is all speculation, until we see some > benchmarks. Fair enough. We'll know more when we have fixed the exclusion constraint supporting patch, which will allow us to make a fair comparison. I'm working on it. Although I'd suggest that having dead duplicates in indexes where that's avoidable is a cost that isn't necessarily that easily characterized. 
I especially don't like that you're currently doing the UNIQUE_CHECK_PARTIAL deferred unique constraint thing of always inserting, continuing on for *all* unique indexes regardless of finding a duplicate. Whatever overhead my approach may imply around lock contention, clearly the cost to index scans is zero. The other thing is that if you're holding old "value locks" (i.e. previously inserted btree tuples, from earlier unique indexes) pending resolving a value conflict, you're holding those value locks indefinitely pending the completion of the other guy's xact, just in case there ends up being no conflict, which in general is unlikely. So in extreme cases, that could be the difference between waiting all day (on someone sitting on a lock that they very probably have no use for), and not waiting at all. > _bt_check_unique() is a modularity violation, agreed. Beauty is in the eye > of the beholder, I guess, but I don't see either patch making that any > better or worse. Clearly the way in which you envisage releasing locks to prevent unprincipled deadlocking implies that btree has to know more about the heap, and maybe even that the heap has to know something about btree, or at least about amcanunique AMs (including possible future amcanunique AMs that may or may not be well suited to implementing this the same way). -- Peter Geoghegan
On Sun, Dec 29, 2013 at 9:09 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >>> While mulling this over further, I had an idea about this: suppose we >>> marked the tuple in some fashion that indicates that it's a promise >>> tuple. I imagine an infomask bit, although the concept makes me wince >>> a bit since we don't exactly have bit space coming out of our ears >>> there. Leaving that aside for the moment, whenever somebody looks at >>> the tuple with a mind to calling XactLockTableWait(), they can see >>> that it's a promise tuple and decide to wait on some other heavyweight >>> lock instead. The simplest thing might be for us to acquire a >>> heavyweight lock on the promise tuple before making index entries for >>> it, and then have callers wait on that instead always instead of >>> transitioning from the tuple lock to the xact lock. > > Yeah, that seems like it should work. You might not even need an infomask > bit for that; just take the "other heavyweight lock" always before calling > XactLockTableWait(), whether it's a promise tuple or not. If it's not, > acquiring the extra lock is a waste of time but if you're going to sleep > anyway, the overhead of one extra lock acquisition hardly matters. Are you suggesting that I lock the tuple only (say, through a special LockPromiseTuple() call), or lock the tuple *and* call XactLockTableWait() afterwards? You and Robert don't seem to be in agreement about which here. From here on I assume Robert's idea (only get the special promise lock where appropriate), because that makes more sense to me. I've taken a look at this idea, but got frustrated. You're definitely going to need an infomask bit for this. Otherwise, how do you differentiate between a "pending" promise tuple and a "fulfilled" promise tuple (or a tuple that never had anything to do with promises in the first place)? You'll want to wake up as soon as it becomes clear that the former is not going to become the latter on the one hand. On the other hand, you really will want to wait until xact end on the pending promise tuple when it becomes a fulfilled promise, or on an already-fulfilled promise tuple, or a plain old tuple. It's either locking the promise tuple, or locking the xid; never both, because the combination makes no sense to any case (unless you're talking about the case where you lock the promise tuple and then later *somehow* decide that you need to lock the xid as the upserter releases promise tuple locks directly within ExecInsert() upon successful insertion). The fact that your LockPromiseTuple() call didn't find someone else with the lock does not mean no one ever promised the tuple (assuming no infomask bit has the relevant info). Obviously you can't just have upserters hold on to the promise tuple locks until xact end if the promiser's insertion succeeds, for the same reason we don't with regular in-memory tuple locks: they're totally unbounded. So not only are you going to need an infomask promise bit, you're going to need to go and unset the bit in the event of a *successful* insertion, so that waiters know to wait on your xact now when you finally UnlockPromiseTuple() within ExecInsert() to finish off successful insertion. *And*, all XactLockTableWait() promise waiters need to go back and check that just-in-case. This problem illustrates what I mean about conflating row locking with value locking. >> I think the interlocking with buffer locks and heavyweight locks to >> make that work could be complex. > > Hmm. Can you elaborate? 
What I meant is that you should be wary of what you go on to describe below. > The inserter has to acquire the heavyweight lock before releasing the buffer > lock, because otherwise another inserter (or deleter or updater) might see > the tuple, acquire the heavyweight lock, and fall to sleep on > XactLockTableWait(), before the inserter has grabbed the heavyweight lock. > If that race condition happens, you have the original problem again, ie. the > updater unnecessarily waits for the inserting transaction to finish, even > though it already killed the tuple it inserted. Right. Can you suggest a workaround to the above problems? -- Peter Geoghegan
On 12/31/2013 09:18 AM, Peter Geoghegan wrote: > On Sun, Dec 29, 2013 at 9:09 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >>>> While mulling this over further, I had an idea about this: suppose we >>>> marked the tuple in some fashion that indicates that it's a promise >>>> tuple. I imagine an infomask bit, although the concept makes me wince >>>> a bit since we don't exactly have bit space coming out of our ears >>>> there. Leaving that aside for the moment, whenever somebody looks at >>>> the tuple with a mind to calling XactLockTableWait(), they can see >>>> that it's a promise tuple and decide to wait on some other heavyweight >>>> lock instead. The simplest thing might be for us to acquire a >>>> heavyweight lock on the promise tuple before making index entries for >>>> it, and then have callers wait on that instead always instead of >>>> transitioning from the tuple lock to the xact lock. >> >> Yeah, that seems like it should work. You might not even need an infomask >> bit for that; just take the "other heavyweight lock" always before calling >> XactLockTableWait(), whether it's a promise tuple or not. If it's not, >> acquiring the extra lock is a waste of time but if you're going to sleep >> anyway, the overhead of one extra lock acquisition hardly matters. > > Are you suggesting that I lock the tuple only (say, through a special > LockPromiseTuple() call), or lock the tuple *and* call > XactLockTableWait() afterwards? You and Robert don't seem to be in > agreement about which here. I meant the latter, ie. grab the new kind of lock first, then check if the tuple is still there, and then call XactLockTableWait() as usual. >> The inserter has to acquire the heavyweight lock before releasing the buffer >> lock, because otherwise another inserter (or deleter or updater) might see >> the tuple, acquire the heavyweight lock, and fall to sleep on >> XactLockTableWait(), before the inserter has grabbed the heavyweight lock. >> If that race condition happens, you have the original problem again, ie. the >> updater unnecessarily waits for the inserting transaction to finish, even >> though it already killed the tuple it inserted. > > Right. Can you suggest a workaround to the above problems? Umm, I did, in the next paragraph ;-) : > That seems easy to avoid. If the heavyweight lock uses the > transaction id as the key, just like > XactLockTableInsert/XactLockTableWait, you can acquire it before > doing the insertion. Let me elaborate that. The idea is to have new heavy-weight lock that's just like the transaction lock used by XactLockTableInsert/XactLockTableWait, but separate from that. Let's call it PromiseTupleInsertionLock. The insertion procedure in INSERT ... ON DUPLICATE looks like this: 1. PromiseTupleInsertionLockAcquire(<my xid>) 2. Insert heap tuple 3. Insert index tuples 4. Check if conflict happened. Kill the already-inserted tuple on conflict. 5. PromiseTupleInsertionLockRelease(<my xid>) IOW, the only change to the current patch is that you acquire the new kind of lock before starting the insertion, and you release it after you've killed the tuple, or you know you're not going to kill it. - Heikki
On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > 1. PromiseTupleInsertionLockAcquire(<my xid>) > 2. Insert heap tuple > 3. Insert index tuples > 4. Check if conflict happened. Kill the already-inserted tuple on conflict. > 5. PromiseTupleInsertionLockRelease(<my xid>) > > IOW, the only change to the current patch is that you acquire the new kind > of lock before starting the insertion, and you release it after you've > killed the tuple, or you know you're not going to kill it. Where does row locking fit in there? - you may need to retry when that part is incorporated, of course. What if you have multiple promise tuples from a contended attempt to insert a single slot, or multiple broken promise tuples across multiple slots or even multiple commands in the same xact? -- Peter Geoghegan
On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> Are you suggesting that I lock the tuple only (say, through a special >> LockPromiseTuple() call), or lock the tuple *and* call >> XactLockTableWait() afterwards? You and Robert don't seem to be in >> agreement about which here. > > I meant the latter, ie. grab the new kind of lock first, then check if the > tuple is still there, and then call XactLockTableWait() as usual. I don't follow this either. Through what exact mechanism does the waiter know that there was a wait on the PromiseTupleInsertionLockAcquire() call, and so it should not wait on XactLockTableWait()? Does whatever mechanism you have in mind not have race conditions? -- Peter Geoghegan
On 2013-12-27 14:11:44 -0800, Peter Geoghegan wrote: > On Fri, Dec 27, 2013 at 12:57 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > I don't think the current syntax the feature implements can be used as > > the sole argument what the feature should be able to support. > > > > If you think from the angle of a async MM replication solution > > replicating a table with multiple unique keys, not having to specify a > > single index we to expect conflicts from, is surely helpful. > > Well, you're not totally on your own for something like that with this > feature. You can project the conflicter's tid, and possibly do a more > sophisticated recovery, like inspecting the locked row and iterating. Yea, but in that case I *do* conflict with more than one index and old values need to stay locked. Otherwise anything resembling forward-progress guarantee is lost. > That's probably not at all ideal, but then I can't even imagine what > the best interface for what you describe here looks like. If there are > multiple conflicts, do you delete or update some or all of them? How > do you express that concept from a DML statement? For my usecases just getting the tid back is fine - it's in C anyway. But I'd rather be in a position to do it from SQL as well... If there are multiple conflicts the conflicting row should be updated. If we didn't release the value locks on the individual indexes, we can know beforehand whether only one row is going to be affected. If there really are more than one, error out. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 2, 2014 at 1:49 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> Well, you're not totally on your own for something like that with this >> feature. You can project the conflicter's tid, and possibly do a more >> sophisticated recovery, like inspecting the locked row and iterating. > > Yea, but in that case I *do* conflict with more than one index and old > values need to stay locked. Otherwise anything resembling > forward-progress guarantee is lost. I'm not sure I understand. In a very real sense they do stay locked. What is insufficient about locking the definitively visible row with the value, rather than the value itself? What distinction are you making? On the first conflict you can delete the row you locked, and then re-try, possibly further merging some stuff from the just-deleted row when you next upsert. It's possible that an "earlier" unique index value that is unlocked before row locking proceeds will get a new would-be duplicate after you're returned a locked row, but it's not obvious that that's a problem for your use-case (a problem that can't be worked around), or that promise tuples get you anything better. >> That's probably not at all ideal, but then I can't even imagine what >> the best interface for what you describe here looks like. If there are >> multiple conflicts, do you delete or update some or all of them? How >> do you express that concept from a DML statement? > > For my usecases just getting the tid back is fine - it's in C > anyway. But I'd rather be in a position to do it from SQL as well... I believe you can. -- Peter Geoghegan
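To make the "delete the row you locked, and then re-try" idea above a little more concrete, here is a rough sketch in SQL. The table, column names and values are made up, and it assumes the patch's ON DUPLICATE KEY LOCK FOR UPDATE / RETURNING REJECTS syntax; which row actually gets locked when more than one unique index conflicts is exactly the point under discussion.

create table acct (id int primary key, email text unique, note text);

-- Attempt an insert; on conflict, lock the first conflicting row found,
-- delete it, and have the application retry the same statement.
with rej as (
  insert into acct (id, email, note)
  values (7, 'x@example.com', 'merged')
  on duplicate key lock for update
  returning rejects *)
delete from acct using rej
where acct.id = rej.id or acct.email = rej.email;
-- Note: if both unique indexes conflict, but with *different* existing
-- rows, only one of them is locked here, which is the forward-progress
-- concern raised above.  The loop ends when the insert succeeds (the
-- delete then affects no rows).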
On 2014-01-02 02:20:02 -0800, Peter Geoghegan wrote: > On Thu, Jan 2, 2014 at 1:49 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> Well, you're not totally on your own for something like that with this > >> feature. You can project the conflicter's tid, and possibly do a more > >> sophisticated recovery, like inspecting the locked row and iterating. > > > > Yea, but in that case I *do* conflict with more than one index and old > > values need to stay locked. Otherwise anything resembling > > forward-progress guarantee is lost. > > I'm not sure I understand. In a very real sense they do stay locked. > What is insufficient about locking the definitively visible row with > the value, rather than the value itself? Locking the definitely visible row only works if there's a row matching the index's columns. If the values of the new row don't have corresponding values in all the indexes you have the same old race conditions again. I think to be useful for many cases you really need to be able to ask for a potentially conflicting row and be sure that if there's none you are able to insert the row separately. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
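For reference, the two-step pattern being described ("ask for a potentially conflicting row... insert the row separately") looks roughly like the following without any value locking; the table is the same hypothetical one as above. The race is that another session can insert a conflicting row between the two statements, which is what speculative insertion / value locking is meant to close.

-- Step 1: look for (and lock) a potentially conflicting row.
select * from acct where email = 'x@example.com' for update;

-- Step 2: if no row came back, insert -- but by now another session may
-- already have inserted 'x@example.com', so this can still fail with a
-- unique violation.
insert into acct (id, email, note) values (8, 'x@example.com', 'new');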
On Tue, Dec 31, 2013 at 4:12 AM, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> 1. PromiseTupleInsertionLockAcquire(<my xid>) >> 2. Insert heap tuple >> 3. Insert index tuples >> 4. Check if conflict happened. Kill the already-inserted tuple on conflict. >> 5. PromiseTupleInsertionLockRelease(<my xid>) >> >> IOW, the only change to the current patch is that you acquire the new kind >> of lock before starting the insertion, and you release it after you've >> killed the tuple, or you know you're not going to kill it. > > Where does row locking fit in there? - you may need to retry when that > part is incorporated, of course. What if you have multiple promise > tuples from a contended attempt to insert a single slot, or multiple > broken promise tuples across multiple slots or even multiple commands > in the same xact? Yeah, it seems like PromiseTupleInsertionLockAcquire should be locking the tuple, rather than the XID. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 01/02/2014 02:53 PM, Robert Haas wrote: > On Tue, Dec 31, 2013 at 4:12 AM, Peter Geoghegan <pg@heroku.com> wrote: >> On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas >> <hlinnakangas@vmware.com> wrote: >>> 1. PromiseTupleInsertionLockAcquire(<my xid>) >>> 2. Insert heap tuple >>> 3. Insert index tuples >>> 4. Check if conflict happened. Kill the already-inserted tuple on conflict. >>> 5. PromiseTupleInsertionLockRelease(<my xid>) >>> >>> IOW, the only change to the current patch is that you acquire the new kind >>> of lock before starting the insertion, and you release it after you've >>> killed the tuple, or you know you're not going to kill it. >> >> Where does row locking fit in there? - you may need to retry when that >> part is incorporated, of course. What if you have multiple promise >> tuples from a contended attempt to insert a single slot, or multiple >> broken promise tuples across multiple slots or even multiple commands >> in the same xact? You can only have one speculative insertion in progress at a time. After you've done all the index insertions and checked that you really didn't conflict with anyone, you're not going to go back and kill the tuple anymore. After that point, the insertion is not speculation anymore. > Yeah, it seems like PromiseTupleInsertionLockAcquire should be locking > the tuple, rather than the XID. Well, that would be ideal, because we already have tuple locks. It would be nice to use the same concept for this. It's a bit tricky, however. I guess the most straightforward way to do it would be to grab a heavy-weight lock after you've inserted the tuple, but before releasing the buffer lock. I don't immediately see a problem with that, although it's a bit scary to acquire a heavy-weight lock while holding a buffer lock. - Heikki
On Thu, Jan 2, 2014 at 11:08 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 01/02/2014 02:53 PM, Robert Haas wrote: >> On Tue, Dec 31, 2013 at 4:12 AM, Peter Geoghegan <pg@heroku.com> wrote: >>> >>> On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas >>> <hlinnakangas@vmware.com> wrote: >>>> >>>> 1. PromiseTupleInsertionLockAcquire(<my xid>) >>>> 2. Insert heap tuple >>>> 3. Insert index tuples >>>> 4. Check if conflict happened. Kill the already-inserted tuple on >>>> conflict. >>>> 5. PromiseTupleInsertionLockRelease(<my xid>) >>>> >>>> IOW, the only change to the current patch is that you acquire the new >>>> kind >>>> of lock before starting the insertion, and you release it after you've >>>> killed the tuple, or you know you're not going to kill it. >>> >>> >>> Where does row locking fit in there? - you may need to retry when that >>> part is incorporated, of course. What if you have multiple promise >>> tuples from a contended attempt to insert a single slot, or multiple >>> broken promise tuples across multiple slots or even multiple commands >>> in the same xact? > > You can only have one speculative insertion in progress at a time. After > you've done all the index insertions and checked that you really didn't > conflict with anyone, you're not going to go back and kill the tuple > anymore. After that point, the insertion is not speculation anymore. Yeah... but how does someone examining the tuple know that? We need to avoid having them block on the promise-tuple insertion lock if we've reacquired it meanwhile for a new speculative insertion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I decided to make at least a cursory attempt to measure or characterize the performance of each of our approaches to value locking. Being fair here is a non-trivial matter, because upserts can behave quite differently based on the need to insert or update, lock contention and so on. Also, I knew that anything I came up with would not be comparing like with like: as things stand, the btree locking code is more or less correct, and the alternative exclusion-constraint-supporting implementation is more or less incorrect (of course, you may yet describe a way to fix the unprincipled deadlocking previously revealed by my testcase [1], but it is far from clear what impact this fix will have on performance). Still, something is better than nothing. This was run on a Linux server ("Linux 3.8.0-31-generic #46~precise1-Ubuntu") with these specifications: https://www.hetzner.de/en/hosting/produkte_rootserver/ex40 . Everything fits in shared_buffers, but the I/O system is probably the weakest link here. To be 100% clear, I am comparing btreelock_insert_on_dup.v5.2013_12_28.patch.gz [2] with exclusion_insert_on_dup.2013_12_19.patch.gz [3]. I'm also testing a third approach, involving avoidance of exclusive buffer locks and heavyweight locks for upserts in the first phase of speculative insertion. That patch is unposted, but shows a modest improvement over [2]. I ran this against the table foo:

pgbench=# \d+ foo
                         Table "public.foo"
 Column |  Type   | Modifiers | Storage  | Stats target | Description
--------+---------+-----------+----------+--------------+-------------
 a      | integer | not null  | plain    |              |
 b      | integer |           | plain    |              |
 c      | text    |           | extended |              |
Indexes:
    "foo_pkey" PRIMARY KEY, btree (a)
Has OIDs: no

My custom script was:

\setrandom rec 1 :scale
with rej as(insert into foo(a, b, c) values(:rec, :rec, 'insert') on duplicate key lock for update returning rejects *) update foo set c = 'update' from rej where foo.a = rej.a;

I specified that each pgbench client in each run should last for 200k upserts (with 100k possible distinct key values), not that it should last some number of seconds. The idea is that there is a reasonable mix of inserts and updates initially, for lower client counts, but exactly the same number of queries are run for each patch, so as to normalize the effects of contention across both runs (this sure is hand-wavy, but likely better than nothing). I'm just looking for approximate numbers here, and I'm sure that you could find more than one way to benchmark this feature, with varying degrees of sympathy towards each of our two approaches to value locking. This benchmark isn't sympathetic to btree locking at all, because there is a huge amount of contention for the higher client counts, with 100% of possible rows updated by the time we're done at 16 clients, for example. To compensate somewhat for the relatively low duration of each run, I take an average-of-5, rather than an average-of-3, as representative for each client count + run/patch combination. The full report of results is here: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/upsert-cmp/ My executive summary is that the exclusion patch performs about the same on lower client counts, presumably due to not having the additional window of btree lock contention. By 8 clients, the exclusion patch does noticeably better, but it's a fairly modest improvement.
Forgive me if I'm belaboring the point, but even though I'm benchmarking the simplest possible upsert statements, had I chosen small pgbench scale factors (e.g. scales that would see 100 - 1000 possible distinct key values in total) the btree locking implementation would surely win very convincingly, just because the alternative implementation would spend almost all of its time deadlocked, waiting for the deadlock detector to free clients in one-second deadlock_timeout cycles. My goal here was just to put a rough number on how these two approaches compare, while trying to be as fair as possible. I have to wonder about the extent to which the exclusion approach benefits from holding old value locks. So even if the unprincipled deadlocking issue can be fixed without much additional cost, it might be that the simple fact that that approach holds those pseudo "value locks" (i.e. old dead rows from previous iterations on the same tuple slot) indefinitely helps performance, and losing that property alone will hurt performance, even though losing it is necessary. For those who wonder what the effect with multiple unique indexes would be, that isn't really all that relevant, since contention on multiple unique indexes isn't expected with idiomatic usage (though I suppose an upsert's non-HOT update would have to compete). [1] http://www.postgresql.org/message-id/CAM3SWZShbE29KpoD44cVc3vpZJGmDer6k_6FGHiSzeOZGmTFSQ@mail.gmail.com [2] http://www.postgresql.org/message-id/CAM3SWZRpnkuVrENDV3zM=BNTCv8-X3PYXt76pohGyAuP1iq-ug@mail.gmail.com [3] http://www.postgresql.org/message-id/CAM3SWZShbE29KpoD44cVc3vpZJGmDer6k_6FGHiSzeOZGmTFSQ@mail.gmail.com -- Peter Geoghegan
On Thu, Jan 2, 2014 at 8:08 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> Yeah, it seems like PromiseTupleInsertionLockAcquire should be locking >> the tuple, rather than the XID. > > Well, that would be ideal, because we already have tuple locks. It would be > nice to use the same concept for this. It's a bit tricky, however. I guess > the most straightforward way to do it would be to grab a heavy-weight lock > after you've inserted the tuple, but before releasing the buffer lock. I > don't immediately see a problem with that, although it's a bit scary to > acquire a heavy-weight lock while holding a buffer lock. That's a really big modularity violation. Everything after RelationPutHeapTuple() but before the buffer unlock in heap_insert() is currently a critical section. I'm not saying that it can't be done, but it certainly is scary. We also have heavyweight page locks, currently used by hash indexes. That approach does not require us to contort the row locking code, and certainly does not require us to acquire heavyweight locks with buffer locks already held. I could understand your initial disinclination to do things this way, particularly when the unprincipled deadlocking problem was not well understood, but I think that this must tip the balance in favor of the approach I advocate. What I've done with heavyweight locks is a modest, localized, logical expansion of the existing mechanism, one that is easy to reason about, with room for further optimization in the future, and that still has reasonable performance characteristics today, including, I believe, better worst-case latency. Heavyweight locks on btree pages are very well precedented, if you look beyond Postgres. -- Peter Geoghegan
On Thu, Jan 2, 2014 at 2:37 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Locking the definitely visible row only works if there's a row matching > the index's columns. If the values of the new row don't have > corresponding values in all the indexes you have the same old race > conditions again. I still don't get it - perhaps you should break down exactly what you mean with an example. I'm talking about potentially doing multiple upserts per row proposed for insertion to handle multiple conflicts, perhaps with some deletes between upserts, not just one upsert with a single update part. > I think to be useful for many cases you really need to be able to ask > for a potentially conflicting row and be sure that if there's none you > are able to insert the row separately. Why? What work do you need to perform after reserving the right to insert but before inserting? Can't you just upsert resulting in insert, and then perform that work, potentially deleting the row inserted if and when you change your mind? Is there any real difference between what that does for you, and what any particular variety of promise tuple might do for you? -- Peter Geoghegan
On Thu, Jan 2, 2014 at 11:58 AM, Peter Geoghegan <pg@heroku.com> wrote: > My executive summary is that the exclusion patch performs about the > same on lower client counts, presumably due to not having the > additional window of btree lock contention. By 8 clients, the > exclusion patch does noticeably better, but it's a fairly modest > improvement. I forgot to mention that synchronous_commit was turned off, so as to eliminate noise that might have been added by commit latency, while still obligating btree to WAL log everything with an exclusive buffer lock held. -- Peter Geoghegan
This patch doesn't apply anymore.
On Fri, Jan 3, 2014 at 7:39 AM, Peter Eisentraut <peter_e@gmx.net> wrote: > This patch doesn't apply anymore. Yes, there was some bit-rot. I previously deferred dealing with a shift/reduce conflict implied by commit 1b4f7f93b4693858cb983af3cd557f6097dab67b. I've fixed that problem now using non operator precedence, and performed a clean rebase on master. I've also fixed the basis of your much earlier complaint about breakage of ecpg's regression tests (without adding support for the feature to ecpg). All make check-world tests pass. Patch is attached. I have yet to figure out how to make REJECTS a non-reserved keyword, or even just a type_func_name_keyword, though intuitively I have a sense that the latter ought to be possible. This is the same basic patch as benchmarked above, with various tricks to avoid stronger lock acquisition when that's likely profitable (we can even do _bt_check_unique() with only a shared lock and no hwlock much of the time, on the well-informed suspicion that it won't be necessary to insert, but only to return a TID). There has also been some clean-up to aspects of serializable behavior, but that needs further attention and scrutiny from a subject matter expert, hopefully Heikki. Though it's probably also true that I should find time to think about transaction isolation some more. I've since had another idea relating to performance optimization, which was to hint that the last attempt to insert a key was unsuccessful, so the next one (after the conflicting transaction's commit/abort) of that same value will very likely conflict too, making lock avoidance profitable on average. This appears to be much more effective than the previous woolly heuristic (never published, just benchmarked), which I've left in as an additional reason to avoid heavyweight locking, if only for discussion. This benchmark now shows my approach winning convincingly with this additional "priorConflict" optimization: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/upsert-cmp-2/ If someone had time to independently recreate the benchmark I have here, or perhaps to benchmark the patch in some other way, that would be useful (for full details see my recent e-mail about the prior benchmark, where the exact details are described - this is the same, but with one more run for the priorConflict optimization). Subtleties of visibility also obviously deserve closer inspection, but perhaps I shouldn't be so hasty: no consensus on the way forward looks even close to emerging. How do people feel about my approach now? -- Peter Geoghegan
Attachment
On Fri, Dec 13, 2013 at 4:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > BTW, so far as the syntax goes, I'm quite distressed by having to make > REJECTS into a fully-reserved word. It's not reserved according to the > standard, and it seems pretty likely to be something that apps might be > using as a table or column name. I've been looking at this, but I'm having a hard time figuring out a way to eliminate shift/reduce conflicts while not maintaining REJECTS as a fully reserved keyword - I'm pretty sure it's impossible with an LALR parser. I'm not totally enamored with the exact syntax proposed -- I appreciate the flexibility on the one hand, but on the other hand I suppose that REJECTS could just as easily be any number of other words. One possible compromise would be to use a synonym that is not imagined to be in use very widely, although I looked up "reject" in a thesaurus and didn't feel too great about that idea afterwards. Another idea would be to have a REJECTING keyword, as the sort of complement of RETURNING (currently you can still ask for RETURNING, without REJECTS but with ON DUPLICATE KEY LOCK FOR UPDATE if that happens to make sense). I think that would work fine, and might actually be more elegant. Now, REJECTING will probably have to be a reserved keyword, but that seems less problematic, particularly as RETURNING is itself a reserved keyword not described by the standard. In my opinion REJECTING would reinforce the notion of projecting the complement of what RETURNING would project in the same context. -- Peter Geoghegan
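For illustration only, the two spellings under discussion might look like this; REJECTING is not implemented, and both the table and values here are made up:

-- What the patch implements today:
insert into tab (k, v) values (1, 'x')
on duplicate key lock for update
returning rejects *;

-- The mooted alternative, with REJECTING as a complement of RETURNING:
insert into tab (k, v) values (1, 'x')
on duplicate key lock for update
rejecting *;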
I've worked on a simple set of tests, written quickly in bash, that I think exercise interesting cases: https://github.com/petergeoghegan/upsert Perhaps most notably, these tests make comparisons between the performance of ordinary inserts with a serial primary key table, and effectively equivalent upserts that always insert. Even though a SERIAL primary key is used, which you might imagine to be a bad case for the extended heavyweight leaf page locks, performance seems almost as good as regular insertion (though I should add that there was still only one unique index). Upserts that only update are significantly slower, and end up at very approximately 2/3 the performance of equivalent updates (a figure that I arrive at through running the tests on my laptop, which is not terribly rigorous, so perhaps take that with a grain of salt). That the update-only case is slower is hardly surprising, since those require an "extra" index scan as we re-find the conflict tuple. Using ctid projection to avoid the index scan doesn't work, at least not without extra in-application handling, because a tid scan is never used if you write a typical wCTE upsert statement - the planner forces a seqscan. It would have been interesting to see where a tid scan for the Update/ModifyTable node's nestloop join left things, but I didn't get that far. I think that if we had a really representative test (no "extra" index scan), the performance of upserts that only update would similarly be almost as good as regular updates for many representative cases. The upsert tests also provide cursory smoke testing for cases of interest. I suggest comparing the test cases, and their performance/behavior, between the exclusion* and btree* patches. A new revision of my patch is attached. There have mostly just been improvements made that are equally applicable to promise tuples, so this should not be interpreted as a rejection of promise tuples, so much as a way of using my time efficiently while I wait to hear back about how others feel my approach compares. I'm still rather curious about what people think in light of recent developments. :-) Changes include: * Another ecpg fix. * Misc polishing and refactoring to btree code. Most notably I now cache btree insertion scankeys across phases (as well as continuing to cache each IndexTuple). * contrib/pageinspect notes when btree leaf pages have the locked flag bit set. * Better lock/unlock ordering. Commit a1dfaef6c6e2da34dbc808f205665a6c96362149 added strict ordering of indexes because of exclusive heavyweight locks held on indexes. Back then (~2000), heavyweight page locks were also acquired on btree pages [1], and in general it was expected that heavyweight locks could be held on indexes across complex operations (I don't think this is relied upon anymore, but the assumption that it could happen remains). I've extended this so that relcache ensures that we always insert into primary key indexes first (and in the case of my patch, lock values in PKs last, to minimize the locking window). It seems sensible as a general bloat avoidance measure to get their insertion out of the way, so that we're guaranteed that the same slot's index tuples bloat only unique indexes when the slot is responsible for a unique violation, rather than just ordering by oid, where you can get dead index tuples in previously inserted non-unique indexes. * I believe I've fixed the bug in the modifications made to HeapTupleSatisfiesMVCC(), though I'd like confirmation that I have the details right.
What do you think, Andres? * I stop higher isolation levels from availing of the aforementioned modifications to HeapTupleSatisfiesMVCC(), since that would certainly be inconsistent with user expectations. I'm not sure which "read phenomenon" described by the standard this violates, but it's obviously inconsistent with the spirit of the higher isolation levels to be able to see a row committed by a transaction that is conceptually still in progress. It's bad enough to do it for READ COMMITTED, but I think a consensus may be emerging that that's the least worst thing (correct me if I'm mistaken). * The "priorConflict" optimization, which was previously shown to really help performance [2], has been slightly improved to remember row locking conflicts too. [1] https://github.com/postgres/postgres/blob/a1dfaef6c6e2da34dbc808f205665a6c96362149/src/backend/access/nbtree/nbtpage.c#L314 [2] http://www.postgresql.org/message-id/CAM3SWZQZTAN1fDiq4o2umGOaczbpemyQoM-6OxgUFBzi+dQzkg@mail.gmail.com -- Peter Geoghegan
Attachment
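As a rough guide to what is being compared in the test suite mentioned above (the real scripts are in the linked repository; the table here is made up): a SERIAL primary key means the upsert always inserts, while the update-only case goes through the wCTE and pays for an extra index scan to re-find the conflicting tuple.

create table ins_only (id serial primary key, payload text);

-- "Always insert" comparison: plain insert vs. the same insert with the
-- ON DUPLICATE KEY LOCK FOR UPDATE part (which never conflicts here).
insert into ins_only (payload) values ('x');
insert into ins_only (payload) values ('x') on duplicate key lock for update;

-- "Always update" comparison: a plain update vs. an upsert that always
-- finds a duplicate and updates the locked row through the wCTE.
update ins_only set payload = 'y' where id = 1;
with rej as (
  insert into ins_only (id, payload) values (1, 'y')
  on duplicate key lock for update
  returning rejects *)
update ins_only set payload = rej.payload from rej where ins_only.id = rej.id;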
On Tue, Jan 7, 2014 at 8:46 PM, Peter Geoghegan <pg@heroku.com> wrote: > I've worked on a simple set of tests, written quickly in bash, that I > think exercise interesting cases: > > https://github.com/petergeoghegan/upsert > > Perhaps most notably, these tests make comparisons between the > performance of ordinary inserts with a serial primary key table, and > effectively equivalent upserts that always insert. While I realize that everyone is busy, I'm concerned about the lack of discussion here. It's been 6 full days since I posted my benchmark, which I expected to quickly clear some things up, or at least garner interest, and yet no one has commented here since. Here is a summary of the situation, at least as I understand it: * My patch has been shown to perform much better than the alternative "promise tuples" proposal. The benchmark previously published, referred to above, makes this evident for workloads with lots of contention [1]. Now, to cover everything, I've gone on to benchmark inserts into a table foo(serial, int4, text) that lock the row using the new infrastructure. The SERIAL column is the primary key. I'm trying to characterize the overhead of the extended value locking here, by showing the same case (a worst case) with and without the overhead. Here are the results: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/vs-vanilla-insert/ (asynchronous commits, logged table) With both extremes covered, the data suggests that my patch performs very well by *any* standard. But if we consider how things compare to the alternative proposal, all indications are that performance is far superior (at least for representative cases without too many unique indexes, not that I suspect things are much different with many). Previous concerns about the cost of extended leaf page locking ought to be almost put to rest by this benchmark, because inserting a sequence of btree index tuple integers in succession is a particularly bad case, and yet in practice the implementation does very well. (With my patch, we're testing the same statement with an ON DUPLICATE KEY LOCK FOR UPDATE part, but there are naturally no conflicts on the SERIAL PK - on master we're testing the same INSERT statement without that, inserting sequence values just as before, only without the worst-case value locking overhead). * The alternative exclusion* patch still deadlocks in an unprincipled fashion, when simple, idiomatic usage encounters contention. Heikki intends to produce a revision that fixes the problem, though having considered it carefully myself, I don't know what mechanism he has in mind, and frankly I'm skeptical. More importantly, I have to question whether we should continue to pursue that alternative approach, given what we now know about its performance characteristics. It could be improved, but not by terribly much, particularly for the case where there is plenty of update contention, which was shown in [1] to be approximately 2-3 times slower than extended page locking (*and* it's already looking for would-be duplicates *first*). I'm trying to be as fair as possible, and yet the difference is huge. It's going to be really hard to beat something where the decision to try to see if we should insert or update comes so late: the decision is made as late as possible and is based on strong indicators of the likely outcome, while the cost of making the wrong choice is very low.
With shared buffer locks held calling _bt_check_unique(), we still lock out concurrent would-be duplicate insertion, and so don't need to restart from scratch (to insert instead) in the same way as with the alternative proposal's largely AM-naive approach. * I am not aware that anyone considers that there are any open items yet. I've addressed all of those now. Progress here is entirely blocked on waiting for review feedback. With the new priorConflict lock strength optimization, my patch is in some ways similar to what Heikki proposed (in the exclusion* patch). It's as if the first phase, the locking operation, is an index scan with an identity crisis. It can decide to continue to be an "index scan" (albeit an oddball one with an insertion scankey that, using shared buffer locks, prevents concurrent duplicate insertion, for very efficient uniqueness checks), or it can decide to actually insert, at the last possible moment. The second phase is picked up with much of the work already complete from the first, so the amount of wasted work is very close to zero in all cases. How can anything beat that? If the main argument for the exclusion approach is that it works with exclusion constraints, then I can still go and make what I've done work there too (for the IGNORE case, which I maintain is the only exclusion constraint variant of this that is useful to users). In any case I think making anything work for exclusion constraints should be a relatively low priority. I'd like to hear more opinions on what I've done here, if anyone has bandwidth to spare. I doubt I need to remind anybody that this is a feature of considerable strategic importance. We need this, and we've been unnecessarily at a disadvantage to other systems by not having it for all these years. Every application developer wants this feature - it's a *very* frequent complaint. [1] http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/upsert-cmp-2/ -- Peter Geoghegan
On 01/10/2014 05:36 AM, Peter Geoghegan wrote: > While I realize that everyone is busy, I'm concerned about the lack of > discussing here. It's been 6 full days since I posted my benchmark, > which I expected to quickly clear some things up, or at least garner > interest, and yet no one has commented here since. Nah, that's nothing. I have a patch in the January commitfest that was already posted for the previous commitfest. It received zero review back then, and still has no reviewer signed up, let alone anyone actually reviewing it. And arguably it's a bug fix! http://www.postgresql.org/message-id/5285071B.1040100@vmware.com Wink wink, if you're looking for patches to review... ;-) > The alternative exclusion* patch still deadlocks in an unprincipled > fashion, when simple, idiomatic usage encounters contention. Heikki > intends to produce a revision that fixes the problem, though having > considered it carefully myself, I don't know what mechanism he has in > mind, and frankly I'm skeptical. Here's an updated patch. Hope it works now... This is based on an older version, and doesn't include any fixes from your latest btreelock_insert_on_dup.v7.2014_01_07.patch. Please check the common parts, and copy over any relevant changes. The fix for the deadlocking issue consists of a few parts. First, there's a new heavy-weight lock type, a speculative insertion lock, which works somewhat similarly to XactLockTableWait(), but is only held for the duration of a single speculative insertion. When a backend is about to begin a speculative insertion, it first acquires the speculative insertion lock. When it's done with the insertion, meaning it has either cancelled it by killing the already-inserted tuple or decided that it's going to go ahead with it, the lock is released. The speculative insertion lock is keyed by Xid, and token. The lock can be taken many times in the same transaction, and token's purpose is to distinguish which insertion is currently in progress. The token is simply a backend-local counter, incremented each time the lock is taken. In addition to the heavy-weight lock, there are new fields in PGPROC to indicate which tuple the backend is currently inserting. When the tuple is inserted, the backend fills in the relation's relfilenode and item pointer in MyProc->specInsert* fields, while still holding the buffer lock. The current speculative insertion token is also stored there. With that mechanism, when another backend sees a tuple whose xmin is still in progress, it can check if the insertion is a speculative insertion. To do that, scan the proc array, and find the backend with the given xid. Then, check that the relfilenode and itempointer in that backend's PGPROC slot match the tuple, and make note of the token the backend had advertised. HeapTupleSatisfiesDirty() does the proc array check, and returns the token in the snapshot, alongside snapshot->xmin. The caller can then use that information in place of XactLockTableWait(). There would be other ways to skin the cat, but this seemed like the quickest to implement. One more straightforward approach would be to use the tuple's TID directly in the speculative insertion lock's key, instead of Xid+token, but then the inserter would have to grab the heavy-weight lock while holding the buffer lock, which seems dangerous. Another alternative would be to store token in the heap tuple header, instead of PGPROC; a tuple that's still being speculatively inserted has no xmax, so it could be placed in that field. Or ctid. 
> More importantly, I have to question > whether we should continue to pursue that alternative approach, giving > what we now know about its performance characteristics. Yes. > It could be > improved, but not by terribly much, particularly for the case where > there is plenty of update contention, which was shown in [1] to be > approximately 2-3 times slower than extended page locking (*and* it's > already looking for would-be duplicates*first*). I'm trying to be as > fair as possible, and yet the difference is huge. *shrug*. I'm not too concerned about performance during contention. But let's see how this fixed version performs. Could you repeat the tests you did with this? Any guesses what the bottleneck is? At a quick glance at a profile of a pgbench run with this patch, I didn't see anything out of ordinary, so I'm guessing it's lock contention somewhere. - Heikki
Attachment
On 01/08/2014 06:46 AM, Peter Geoghegan wrote: > A new revision of my patch is attached. I'm getting deadlocks with this patch, using the test script you posted earlier in http://www.postgresql.org/message-id/CAM3SWZQh=8xNVgbBzYHJeXUJBHwZNjUTjEZ9t-DBO9t_mX_8Kw@mail.gmail.com. Am doing something wrong, or is that a regression? - Heikki
On Fri, Jan 10, 2014 at 7:12 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > I'm getting deadlocks with this patch, using the test script you posted > earlier in > http://www.postgresql.org/message-id/CAM3SWZQh=8xNVgbBzYHJeXUJBHwZNjUTjEZ9t-DBO9t_mX_8Kw@mail.gmail.com. > Am doing something wrong, or is that a regression? Yes. The point of that test case was that it made your V1 livelock (which you fixed), not deadlock in a way detected by the deadlock detector, which is the correct behavior. This testcase was the one that showed up *unprincipled* deadlocking: http://www.postgresql.org/message-id/CAM3SWZShbE29KpoD44cVc3vpZJGmDer6k_6FGHiSzeOZGmTFSQ@mail.gmail.com I'd focus on that test case. -- Peter Geoghegan
On 01/10/2014 08:37 PM, Peter Geoghegan wrote: > On Fri, Jan 10, 2014 at 7:12 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> I'm getting deadlocks with this patch, using the test script you posted >> earlier in >> http://www.postgresql.org/message-id/CAM3SWZQh=8xNVgbBzYHJeXUJBHwZNjUTjEZ9t-DBO9t_mX_8Kw@mail.gmail.com. >> Am doing something wrong, or is that a regression? > > Yes. The point of that test case was that it made your V1 livelock > (which you fixed), not deadlock in a way detected by the deadlock > detector, which is the correct behavior. Oh, ok. Interesting. With the patch version I posted today, I'm not getting deadlocks. I'm not getting duplicates in the table either, so it looks like the promise tuple approach somehow avoids the deadlocks, while the btreelock patch does not. Why does it deadlock with the btreelock patch? I don't see why it should. If you have two backends inserting a single tuple, and they conflict, one of them should succeed to insert, and the other one should update. - Heikki
On Fri, Jan 10, 2014 at 11:28 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Why does it deadlock with the btreelock patch? I don't see why it should. If > you have two backends inserting a single tuple, and they conflict, one of > them should succeed to insert, and the other one should update. Are you sure that it doesn't make your patch deadlock too, with enough pressure? I've made that mistake myself. That test-case made my patch deadlock (in a detected fashion) when it used buffer locks as a value locking prototype - I say as much right there in the November mail you linked to. I think that's acceptable, because it's non-sensible use of the feature (my point was only that it shouldn't livelock). The test case is naively locking a row without knowing ahead of time (or pro-actively checking) if the conflict is on the first or second unique index. So before too long, you're updating the "wrong" row (no existing lock is really held), based on the 'a' column's projected value, when in actuality the conflict was on the 'b' column's projected value. Conditions are right for deadlock, because two rows are locked, not one. Although I have not yet properly considered your most recent revision, I can't imagine why the same would not apply there, since the row locking component is (probably) still identical. Granted, that distinction between row locking and value locking is a bit fuzzy in your approach, but if you happened to not insert any rows in any previous iterations (i.e. there were no unfilled promise tuples), and you happened to perform conflict handling first, it could still happen, albeit with lower probability, no? -- Peter Geoghegan
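Schematically, the kind of test case being talked about looks like the following (the real script is in the linked message; table name and the :a/:b placeholders are made up). With two unique indexes, the UPDATE joins on "a" even when the conflict was actually on "b", so a second row, different from the one the insert locked, ends up locked too.

create table race (a int primary key, b int unique, c text);

-- Run concurrently from several sessions, with overlapping values:
with rej as (
  insert into race (a, b, c) values (:a, :b, 'insert')
  on duplicate key lock for update
  returning rejects *)
update race set c = 'update' from rej where race.a = rej.a;
-- If the conflict was on "b", the row locked by the insert and the row
-- locked by the update are different rows: conditions for deadlock when
-- two sessions hit them in the opposite order.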
On 01/10/2014 10:00 PM, Peter Geoghegan wrote: > On Fri, Jan 10, 2014 at 11:28 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> Why does it deadlock with the btreelock patch? I don't see why it should. If >> you have two backends inserting a single tuple, and they conflict, one of >> them should succeed to insert, and the other one should update. > > Are you sure that it doesn't make your patch deadlock too, with enough > pressure? I've made that mistake myself. > > That test-case made my patch deadlock (in a detected fashion) when it > used buffer locks as a value locking prototype - I say as much right > there in the November mail you linked to. I think that's acceptable, > because it's non-sensible use of the feature (my point was only that > it shouldn't livelock). The test case is naively locking a row without > knowing ahead of time (or pro-actively checking) if the conflict is on > the first or second unique index. So before too long, you're updating > the "wrong" row (no existing lock is really held), based on the 'a' > column's projected value, when in actuality the conflict was on the > 'b' column's projected value. Conditions are right for deadlock, > because two rows are locked, not one. I see. Yeah, I also get deadlocks when I change the update statement to use "foo.b = rej.b" instead of "foo.a = rej.a". I think it's down to the order in which the indexes are processed, ie. which conflict you see first. This is pretty much the same issue we discussed wrt. exclusion constraints. If the tuple being inserted conflicts with several existing tuples, what to do? I think the best answer would be to return and lock them all. It could still deadlock, but it's nevertheless less surprising behavior than returning one of the tuples at random. Actually, we could even avoid the deadlock by always locking the tuples in a certain order, although I'm not sure if it's worth the trouble. - Heikki
On Fri, Jan 10, 2014 at 1:25 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > This is pretty much the same issue we discussed wrt. exclusion contraints. > If the tuple being inserted conflicts with several existing tuples, what to > do? I think the best answer would be to return and lock them all. It could > still deadlock, but it's nevertheless less surprising behavior than > returning one of the tuples in random. Actually, we could even avoid the > deadlock by always locking the tuples in a certain order, although I'm not > sure if it's worth the trouble. I understand and accept that as long as we're intent on locking more than one row per transaction, that action could deadlock with another session doing something similar. Actually, I've even encountered people giving advice in relation to proprietary systems along the lines of: "if your big SQL MERGE statement is deadlocking excessively, you might try hinting to make sure a nested loop join is used". I think that this kind of ugly compromise is unavoidable in those scenarios (in reality the most popular strategy is probably "cross your fingers"). But as everyone agrees, the common case where an xact only upserts one row should never deadlock with another, similar xact. So *that* isn't a problem I have with making row locking work for exclusion constraints. My problem is that in general I'm not sold on the actual utility of making this kind of row locking work with exclusion constraints. I'm sincerely having a hard time thinking of a practical use-case (although, as I've said, I want to make it work with IGNORE). Even if you work all this row locking stuff out, and the spill-to-disk aspect out, the interface is still wrong, because you need to figure out a way to project more than one reject per slot. Maybe I lack imagination around how to make that work, but there are a lot of "ifs" and "buts" either way. -- Peter Geoghegan
On 1/10/14, 4:40 PM, Peter Geoghegan wrote: > My problem is that in general I'm not sold on the actual utility of > making this kind of row locking work with exclusion constraints. I'm > sincerely having a hard time thinking of a practical use-case > (although, as I've said, I want to make it work with IGNORE). Even if > you work all this row locking stuff out, and the spill-to-disk aspect > out, the interface is still wrong, because you need to figure out a > way to project more than one reject per slot. Maybe I lack imagination > around how to make that work, but there are a lot of "ifs" and "buts" > either way. Well, the usual example for exclusion constraints is resource scheduling (ie: scheduling what room a class will be held in). In that context is it hard to believe that you might want to MERGE a set of new classroom assignments in? -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
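The room-scheduling schema usually given for exclusion constraints looks roughly like this (a sketch; names are made up, and the int equality operator in the exclusion constraint requires the btree_gist extension). The point is that a proposed row can overlap any number of existing rows at once, which is what makes "lock the conflicting row" ambiguous here in a way it isn't for a unique index.

create extension if not exists btree_gist;

create table class_schedule (
  room   int,
  class  text,
  during tsrange,
  exclude using gist (room with =, during with &&)
);

-- This row is rejected if it overlaps *any* existing booking for room 101,
-- and it may overlap several of them at the same time.
insert into class_schedule
values (101, 'Databases', '[2014-01-13 09:00, 2014-01-13 12:00)');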
On Fri, Jan 10, 2014 at 4:09 PM, Jim Nasby <jim@nasby.net> wrote: > Well, the usual example for exclusion constraints is resource scheduling > (ie: scheduling what room a class will be held in). In that context is it > hard to believe that you might want to MERGE a set of new classroom > assignments in? So you schedule a class that clashes with 3 other classes, and you want to update all 3 rows/classes with details from your one row proposed for insertion? That makes no sense, unless the classes were in fixed time slots, in which case you could use a unique constraint to begin with. You can't change the rows to have the same time range for all 3. So you have to delete two first, and update the range of one. Which two? And you can't really rely on having locked existing rows operating as a kind of "persistent value lock", as I do, because you've locked a row with a different range to the one you care about - someone can still insert another row that doesn't block on that one but blocks on your range. So you really do need a sophisticated, fully formed value locking infrastructure to make it work, for a feature of marginal utility at best. I'm having a hard time imagining any user actually wanting to do any of this, and I'm having a harder time still imagining anyone putting in the work to make it possible, if indeed it is possible. No one has ever implemented fully formed predicate locking in a commercial database system, because it's an NP-complete problem [1], [2]. Only very limited special cases are practicable, and I'm pretty sure this isn't one of them. [1] http://library.riphah.edu.pk/acm/disk_1/text/1-2/SIGMOD79/P127.PDF [2] http://books.google.com/books?id=wV5Ran71zNoC&pg=PA284&lpg=PA284&dq=predicate+locking+np+complete&source=bl&ots=PgNJ5H3L8V&sig=fOZ2Wr4fIxj0eFQD0tCGPLTsfY0&hl=en&sa=X&ei=PpTQUquoBMfFsATtw4CADA&ved=0CDIQ6AEwAQ#v=onepage&q=predicate%20locking%20np%20complete&f=false -- Peter Geoghegan
On 1/10/14, 6:51 PM, Peter Geoghegan wrote: > On Fri, Jan 10, 2014 at 4:09 PM, Jim Nasby<jim@nasby.net> wrote: >> >Well, the usual example for exclusion constraints is resource scheduling >> >(ie: scheduling what room a class will be held in). In that context is it >> >hard to believe that you might want to MERGE a set of new classroom >> >assignments in? > So you schedule a class that clashes with 3 other classes, and you > want to update all 3 rows/classes with details from your one row > proposed for insertion? Nuts, I was misunderstanding the scenario. I thought this was simply going to violate exclusion constraints. I see what you're saying now, and I'm not coming up with a scenario either. Perhaps Jeff Davis could, since he created them...if he can't then I'd say we're safe ignoring that aspect. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Fri, Jan 10, 2014 at 7:09 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Nah, that's nothing. I have a patch in the January commitfest that was already posted for the
> previous commitfest. It received zero review back then, and still has no reviewer signed up, let
> alone anyone actually reviewing it. And arguably it's a bug fix!
>
> http://www.postgresql.org/message-id/5285071B.1040100@vmware.com
>
> Wink wink, if you're looking for patches to review... ;-)

Yeah, I did intend to take a closer look at that one (I've looked at it but have nothing to share yet). I've been a little busy with other things. That patch is more of the kind where it's a matter of determining if what you've done is exactly correct (no one would disagree with the substance of what you propose), whereas there is uncertainty about whether I've gotten the semantics right and so on. But that's no excuse. :-)

>> The alternative exclusion* patch still deadlocks in an unprincipled fashion

> Here's an updated patch. Hope it works now... This is based on an older version, and doesn't
> include any fixes from your latest btreelock_insert_on_dup.v7.2014_01_07.patch. Please check the
> common parts, and copy over any relevant changes.

Okay, attached is a revision with some of my fixes for other parts of the code merged (in particular, for the grammar, ecpg and some aspects of row locking and visibility).

Some quick observations on your patch - maybe this is obvious, and you have work-arounds in mind, but this is just my first impression:

* You're always passing HEAP_INSERT_SPECULATIVE to heap_insert, and therefore in the event of any sort of insertion always getting an exclusive lock on the ProcArray. I guess the fact that this always happens, and not just when upserting, is an oversight (I know you just wanted to throw together a POC), but even still that seems kind of questionable. Everyone knows that contention during GetSnapshotData is still a big problem for us. Taking an exclusive ProcArrayLock perhaps as frequently as more than once per slot seems like a really bad idea, even if it's limited to speculative inserters.

* It seems questionable that you don't at least have a shared ProcArrayLock when you set the token value in SetSpeculativeInsertionToken() (as you know, MyPgXact->xmin can be set with such a shared lock, so doing something similar here might be okay, but it's far from obvious that no lock is okay). Now, I guess you can point towards MinimumActiveBackends() as a kind of precedent, but that seems substantially less scary than what you've done, because that's just reading whether a field is zero or non-zero. Obviously the implications of actually doing this are that things get even worse for performance. And even a shared lock might not be good enough - I'd have to think about it some more to give a firmer opinion.

> The fix for the deadlocking issue consists of a few parts. First, there's a new heavy-weight lock
> type, a speculative insertion lock, which works somewhat similarly to XactLockTableWait(), but is
> only held for the duration of a single speculative insertion. When a backend is about to begin a
> speculative insertion, it first acquires the speculative insertion lock. When it's done with the
> insertion, meaning it has either cancelled it by killing the already-inserted tuple or decided
> that it's going to go ahead with it, the lock is released.
I'm afraid I must reiterate my earlier objection to the general thrust of what you're doing, which is that it is evidently unnecessary to spread knowledge of value locking around the system, as opposed to localizing knowledge of it to one module, in this case nbtinsert.c. While it's true that the idea of the AM abstraction is already perhaps a little strained, this seems like a big expansion on that problem. Why should this approach make sense for every conceivable AM that supports some notion of a constraint? Heavyweight exclusive locks on indexes (typically at the page level), persisting across complex operations, are not a new thing for Postgres.

> HeapTupleSatisfiesDirty() does the proc array check, and returns the token in the snapshot,
> alongside snapshot->xmin. The caller can then use that information in place of
> XactLockTableWait().

That seems like a modularity violation too. The HeapTupleSatisfiesMVCC changes reflect a genuine need to make every MVCC snapshot care about the special visibility exception, whereas only one or two HeapTupleSatisfiesDirty() callers will ever care about speculative insertion. Even if you're unmoved by the modularity/aesthetic argument (which is not to suppose that you actually are), the fact that you're calling SpeculativeInsertionIsInProgress(), which acquires a shared ProcArrayLock much of the time from within HeapTupleSatisfiesDirty(), may have seriously regressed foreign key enforcement, for example. You're going to need something like a new type of snapshot, basically, and we probably already have too many of those. But then, can you really get away with a new snapshot type so most existing places are unaffected? Why shouldn't ri_triggers.c have to care? Offhand I think it must care, unless you go give it some special knowledge too. So you either risk regressing performance badly, or play whack-a-mole with all of the other dirty snapshot callsites. That seems like a textbook example of a really bad modularity violation. The consequences may spread beyond that, further than we can easily predict.

>> More importantly, I have to question whether we should continue to pursue that alternative
>> approach, given what we now know about its performance characteristics.
>
> Yes.

Okay. Unfortunately, I must press you on this point: what is it that you don't like about what I've done? What aspects of my approach concern you, and specifically what aspects of my approach do you hope to avoid? If you take a close look at how value locking is performed, it actually is very similar to the existing mechanism, counterintuitive though that is. It's a modest expansion on how things already work. I contend that my approach is, apart from everything else, the more conservative of the two.

>> It could be improved, but not by terribly much, particularly for the case where there is plenty
>> of update contention, which was shown in [1] to be approximately 2-3 times slower than extended
>> page locking (*and* it's already looking for would-be duplicates *first*). I'm trying to be as
>> fair as possible, and yet the difference is huge.
>
> *shrug*. I'm not too concerned about performance during contention. But let's see how this fixed
> version performs. Could you repeat the tests you did with this?

Why would you not be too concerned about the performance with contention? It's a very important aspect.
But even if you don't, if you look at the transaction throughput with only a single client in the update-heavy benchmark [1] (with one client there is a fair mix of inserts and updates), my approach still comes out far ahead. Transaction throughput is almost 100% higher, with the *difference* exceeding 150% at 8 clients but never reaching too much higher. I think that the problem isn't so much with contention between clients as much as with contention between inserts and updates, which affects everyone to approximately the same degree. And the average max latency across runs for one client is 130.447 ms, as opposed to 0.705 ms with my patch - that's less than 1%. Whatever way you cut it, the performance of my approach is far superior. Although we should certainly investigate the impact of your most recent revision, and I intend to, how can you not consider those differences to be extremely significant? And honestly, I have a really hard time imagining what you've done here had anything other than a strong negative effect on performance, in which case the difference in performance will be wider still. > Any guesses what the bottleneck is? At a quick glance at a profile of a > pgbench run with this patch, I didn't see anything out of ordinary, so I'm > guessing it's lock contention somewhere. See my previous remarks on "index scan with an identity crisis" [2]. I'm pretty sure it was mostly down to the way you optimistically proceed with duplicate index tuple insertion (which you'll do even with the btree code knowing that it almost certainly won't work out, something that makes less sense than with deferred unique constraints, where the user has specifically indicated that things may well work out by making the constraint deferred in the first place). I also think that the way my approach very effectively avoids wasted effort (including but not limited to unnecessarily heavy locking) plays an important role in making it perform so well. This turns out to be much more important than the downside of having value locks be slightly coarser than strictly necessary. When I tried to quantify that overhead with a highly unsympathetic benchmark, the difference was barely measurable [2][3]. [1] http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/upsert-cmp-2/ [2] http://www.postgresql.org/message-id/CAM3SWZQBhS0JriD6EfeW3MoTXy1eK-8Wdr6FvFFR0AyCDgCBvA@mail.gmail.com [3] http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/vs-vanilla-insert/ -- Peter Geoghegan
On Sat, Jan 11, 2014 at 12:51 AM, Peter Geoghegan <pg@heroku.com> wrote: > On Fri, Jan 10, 2014 at 4:09 PM, Jim Nasby <jim@nasby.net> wrote: >> Well, the usual example for exclusion constraints is resource scheduling >> (ie: scheduling what room a class will be held in). In that context is it >> hard to believe that you might want to MERGE a set of new classroom >> assignments in? > > So you schedule a class that clashes with 3 other classes, and you > want to update all 3 rows/classes with details from your one row > proposed for insertion? Well, perhaps you want to mark the events as conflicting with your new event? -- greg
On Fri, Jan 10, 2014 at 10:01 PM, Greg Stark <stark@mit.edu> wrote: >> So you schedule a class that clashes with 3 other classes, and you >> want to update all 3 rows/classes with details from your one row >> proposed for insertion? > > > Well, perhaps you want to mark the events as conflicting with your new event? But short of a sophisticated persistent value locking implementation (which I'm pretty skeptical of the feasibility of), more conflicting events could be added at any moment. I doubt that you're appreciably any better off than if you were to simply check with a select query, even though that approach is obviously broken. In general, making row locking work for exclusion constraints, so you can treat them in a way that allows you to merge on arbitrary operators seems to me like a tar pit. -- Peter Geoghegan
On Fri, Jan 10, 2014 at 7:59 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> *shrug*. I'm not too concerned about performance during contention. But let's see how this fixed
>> version performs. Could you repeat the tests you did with this?
>
> Why would you not be too concerned about the performance with contention? It's a very important
> aspect. But even if you don't, if you look at the transaction throughput with only a single
> client in the update-heavy benchmark [1] (with one client there is a fair mix of inserts and
> updates), my approach still comes out far ahead. Transaction throughput is almost 100% higher,
> with the *difference* exceeding 150% at 8 clients but never reaching too much higher. I think
> that the problem isn't so much with contention between clients as much as with contention between
> inserts and updates, which affects everyone to approximately the same degree. And the average max
> latency across runs for one client is 130.447 ms, as opposed to 0.705 ms with my patch - that's
> less than 1%. Whatever way you cut it, the performance of my approach is far superior. Although
> we should certainly investigate the impact of your most recent revision, and I intend to, how can
> you not consider those differences to be extremely significant?

So I re-ran the same old benchmark, where we're almost exclusively updating. Results for your latest revision were very similar to my patch:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/exclusion-no-deadlock/

This suggests that the main problem encountered was lock contention among old, broken promise tuples. Note that this benchmark doesn't involve any checkpointing, and everything fits in memory. Opportunistic pruning is possible, which I'd imagine helps a lot with the bloat, at least in this benchmark - there are only ever 100,000 live tuples. That might not always be true, of course.

In any case, my patch is bound to win decisively for the other extreme, the insert-only case, because the overhead of doing an index scan first is always wasted there with your approach, and the overhead of extended btree leaf page locking has been shown to be quite low. In the past you've spoken of avoiding that overhead through an adaptive strategy based on statistics, but I think you'll have a hard time beating a strategy where the decision comes as late as possible, and is informed by highly localized page-level metadata already available. My implementation can abort an attempt to just read an existing would-be duplicate very inexpensively (with no strong locks), going back to just after the _bt_search() to get a heavyweight lock if just reading doesn't work out (if there is no duplicate found), so as to not waste all of its prior work. Doing one of the two extremes of insert-mostly or update-only well is relatively easy; dynamically adapting to one or the other is much harder. Especially if it's a consistent mix of inserts and updates, where general observations aren't terribly useful.

All other concerns of mine still remain, including the concern over the extra locking of the proc array - I'm concerned about the performance impact of that on other parts of the system not exercised by this test.

--
Peter Geoghegan
On Sat, Jan 11, 2014 at 2:39 AM, Peter Geoghegan <pg@heroku.com> wrote:
> So I re-ran the same old benchmark, where we're almost exclusively updating. Results for your
> latest revision were very similar to my patch:
>
> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/exclusion-no-deadlock/

To put that in context, here is a previously unpublished repeat of the same benchmark on the slightly improved v6, my second most recently submitted revision:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/upsert-cmp-3/

(recall that I improved things a bit by remembering row-locking conflicts, not just conflicts when we try value locking - that made a small additional difference, reflected here but not in /upsert-cmp-2/ ).

The numbers for each patch are virtually identical. I guess I could improve my patch by not always getting a heavyweight lock on the first insert attempt, based on the general observation that we have previously always updated. My concern would be that that would happen at the expense of the other case.

--
Peter Geoghegan
On 01/11/2014 12:40 AM, Peter Geoghegan wrote: > My problem is that in general I'm not sold on the actual utility of > making this kind of row locking work with exclusion constraints. I'm > sincerely having a hard time thinking of a practical use-case > (although, as I've said, I want to make it work with IGNORE). Even if > you work all this row locking stuff out, and the spill-to-disk aspect > out, the interface is still wrong, because you need to figure out a > way to project more than one reject per slot. Maybe I lack imagination > around how to make that work, but there are a lot of "ifs" and "buts" > either way. Exclusion constraints can be used to implement uniqueness checks with SP-GiST or GiST indexes. For example, if you want to enforce that there are no two tuples with the same x and y coordinates, ie. use a point as the key. You could add a b-tree index just to enforce the constraint, but it's better if you don't have to. In general, it's just always better if features don't have implementation-specific limitations like this. - Heikki
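As a concrete sketch of that case (the table name and values are illustrative only), an exclusion constraint over a GiST index can stand in for a unique constraint on a type that b-tree has no opclass for:

    create table marker (
        p point,
        -- "~=" is the point "same as" operator; a second row with the same
        -- point as an existing one is rejected
        exclude using gist (p with ~=)
    );

    insert into marker values (point '(1,2)');
    insert into marker values (point '(1,2)');  -- fails with an exclusion constraint violation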
On 01/11/2014 12:39 PM, Peter Geoghegan wrote: > In any case, my patch is bound to win decisively for the other > extreme, the insert-only case, because the overhead of doing an index > scan first is always wasted there with your approach, and the overhead > of extended btree leaf page locking has been shown to be quite low. Quite possibly. Run the benchmark, and we'll see how big a difference we're talking about. > In > the past you've spoken of avoiding that overhead through an adaptive > strategy based on statistics, but I think you'll have a hard time > beating a strategy where the decision comes as late as possible, and > is informed by highly localized page-level metadata already available. > My implementation can abort an attempt to just read an existing > would-be duplicate very inexpensively (with no strong locks), going > back to just after the _bt_search() to get a heavyweight lock if just > reading doesn't work out (if there is no duplicate found), so as to > not waste all of its prior work. Doing one of the two extremes of > insert-mostly or update-only well is relatively easy; dynamically > adapting to one or the other is much harder. Especially if it's a > consistent mix of inserts and updates, where general observations > aren't terribly useful. Another way to optimize it is to keep the b-tree page pinned after doing the pre-check. Then you don't need to descend the tree again when doing the insert. That would require small indexam API changes, but wouldn't be too invasive, I think. > All other concerns of mine still remain, including the concern over > the extra locking of the proc array - I'm concerned about the > performance impact of that on other parts of the system not exercised > by this test. Yeah, I'm not thrilled about that part either. Fortunately there are other ways to implement that. In fact, I think you could just not bother taking the ProcArrayLock when setting the fields. The danger is that another backend sees a mixed state of the fields, but that's OK. The worst that can happen is that it will do an unnecessary lock/release on the heavy-weight lock. And to reduce the overhead when reading the fields, you could merge the SpeculativeInsertionIsInProgress() check into TransactionIdIsInProgress(). The call site in tqual.c always calls it together with TransactionIdIsInProgress(), which scans the proc array anyway, while holding the lock. - Heikki
On Mon, Jan 13, 2014 at 12:23 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Exclusion constraints can be used to implement uniqueness checks with SP-GiST or GiST indexes.
> For example, if you want to enforce that there are no two tuples with the same x and y
> coordinates, ie. use a point as the key. You could add a b-tree index just to enforce the
> constraint, but it's better if you don't have to. In general, it's just always better if features
> don't have implementation-specific limitations like this.

That seems rather narrow. Among other things, I worry about the baggage for users of documenting support for SP-GiST/GiST: "We support it, but it only really works for the case where you're using exclusion constraints as unique constraints, something that might make sense in certain narrow contexts, contrary to our earlier general statement that a unique index should be preferred there". We catalog amcanunique methods as the way that we support unique indexes. I really do feel that that's the appropriate level to support the feature at, and I have not precluded other amcanunique implementations from doing the same, having documented the intended value locking interface/contract for the benefit of any future amcanunique AM author. It's ON DUPLICATE KEY, not ON OVERLAPPING KEY, or any other syntax suggestive of exclusion constraints and their arbitrary commutative operators.

--
Peter Geoghegan
On Mon, Jan 13, 2014 at 1:53 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Mon, Jan 13, 2014 at 12:23 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> Exclusion constraints can be used to implement uniqueness checks with >> SP-GiST or GiST indexes. For example, if you want to enforce that there are >> no two tuples with the same x and y coordinates, ie. use a point as the key. >> You could add a b-tree index just to enforce the constraint, but it's better >> if you don't have to. In general, it's just always better if features don't >> have implementation-specific limitations like this. > > That seems rather narrow. Among other things, I worry about the > baggage for users in documenting supporting SP-GiST/GiST. "We support > it, but it only really works for the case where you're using exclusion > constraints as unique constraints, something that might make sense in > certain narrow contexts, contrary to our earlier general statement > that a unique index should be preferred there". We catalog amcanunique > methods as the way that we support unique indexes. I really do feel > that that's the appropriate level to support the feature at, and I > have not precluded other amcanunique implementations from doing the > same, having documented the intended value locking interface/contract > for the benefit of any future amcanunique AM author. It's ON DUPLICATE > KEY, not ON OVERLAPPING KEY, or any other syntax suggestive of > exclusion constraints and their arbitrary commutative operators. For what it's worth, I agree with Heikki. There's probably nothing sensible an upsert can do if it conflicts with more than one tuple, but if it conflicts with just exactly one, it oughta be OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jan 13, 2014 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: > For what it's worth, I agree with Heikki. There's probably nothing > sensible an upsert can do if it conflicts with more than one tuple, > but if it conflicts with just exactly one, it oughta be OK. If there is exactly one, *and* the existing value is exactly the same as the value proposed for insertion (or, I suppose, a subset of the existing value, but that's so narrow that it might as well not apply). In short, when you're using an exclusion constraint as a unique constraint. Which is very narrow indeed. Weighing the costs and the benefits, that seems like far more cost than benefit, before we even consider anything beyond simply explaining the applicability and limitations of upserting with exclusion constraints. It's generally far cleaner to define speculative insertion as something that happens with unique indexes only. -- Peter Geoghegan
On 01/13/2014 10:53 PM, Peter Geoghegan wrote: > On Mon, Jan 13, 2014 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> For what it's worth, I agree with Heikki. There's probably nothing >> sensible an upsert can do if it conflicts with more than one tuple, >> but if it conflicts with just exactly one, it oughta be OK. > > If there is exactly one, *and* the existing value is exactly the same > as the value proposed for insertion (or, I suppose, a subset of the > existing value, but that's so narrow that it might as well not apply). > In short, when you're using an exclusion constraint as a unique > constraint. Which is very narrow indeed. Weighing the costs and the > benefits, that seems like far more cost than benefit, before we even > consider anything beyond simply explaining the applicability and > limitations of upserting with exclusion constraints. It's generally > far cleaner to define speculative insertion as something that happens > with unique indexes only. Well, even if you don't agree that locking all the conflicting rows for update is sensible, it's still perfectly sensible to return the rejected rows to the user. For example, you're inserting N rows, and if some of them violate a constraint, you still want to insert the non-conflicting rows instead of rolling back the whole transaction. - Heikki
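The scenario being described looks something like the sketch below, reusing the syntax from the posted patch against a hypothetical table tab(k int primary key, v text); the point is simply that the conflicting rows are reported rather than aborting the whole statement:

    -- suppose keys 1 and 2 already exist: only the row with key 3 is inserted,
    -- while the two conflicting rows are projected as rejects
    with r as (
        insert into tab(k, v)
        values (1, 'x'), (2, 'y'), (3, 'z')
        on duplicate key lock for update
        returning rejects *
    )
    select k, v from r;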
On Mon, Jan 13, 2014 at 12:58 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Well, even if you don't agree that locking all the conflicting rows for > update is sensible, it's still perfectly sensible to return the rejected > rows to the user. For example, you're inserting N rows, and if some of them > violate a constraint, you still want to insert the non-conflicting rows > instead of rolling back the whole transaction. Right, but with your approach, can you really be sure that you have the right rejecting tuple ctid (not reject)? In other words, as you wait for the exclusion constraint to conclusively indicate that there is a conflict, minutes may have passed in which time other conflicts may emerge in earlier unique indexes. Whereas with an approach where values are locked, you are guaranteed that earlier unique indexes have no conflicting values. Maintaining that property seems useful, since we check in a well-defined order, and we're still projecting a ctid. Unlike when row locking is involved, we can make no assumptions or generalizations around where conflicts will occur. Although that may also be a general concern with your approach when row locking, for multi-master replication use-cases. There may be some value in knowing it cannot have been earlier unique indexes (and so the existing values for those unique indexes in the locked row should stay the same - don't many conflict resolution policies work that way?). -- Peter Geoghegan
On Mon, Jan 13, 2014 at 12:49 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> In any case, my patch is bound to win decisively for the other extreme, the insert-only case,
>> because the overhead of doing an index scan first is always wasted there with your approach, and
>> the overhead of extended btree leaf page locking has been shown to be quite low.
>
> Quite possibly. Run the benchmark, and we'll see how big a difference we're talking about.

I'll come up with something and let you know.

> Another way to optimize it is to keep the b-tree page pinned after doing the pre-check. Then you
> don't need to descend the tree again when doing the insert. That would require small indexam API
> changes, but wouldn't be too invasive, I think.

You'll still need a callback to drop the pin when it transpires that there is a conflict in a later unique index, and state to pass a bt stack back, at which point you've already made exactly the same changes to the AM interface as in my proposal. The only difference is that the core code doesn't rely on the value locks being released after an instant, but that isn't something that you take advantage of. Furthermore, AFAIK there is no reason to think that anything other than btree will benefit, which makes it a bit unfortunate that the AM has to support it generally. So, again, it's kind of a modularity violation, and it may not even actually be possible, since _bt_search() is only callable with an insertion scankey, which is the context in which the existing guarantee around releasing locks and re-searching from that point applies, for reasons that seem to me to be very subtle. At the very least you need to pass a btstack to _bt_doinsert() to save the work of re-scanning, as I do.

>> All other concerns of mine still remain, including the concern over the extra locking of the
>> proc array - I'm concerned about the performance impact of that on other parts of the system not
>> exercised by this test.
>
> Yeah, I'm not thrilled about that part either. Fortunately there are other ways to implement
> that. In fact, I think you could just not bother taking the ProcArrayLock when setting the
> fields. The danger is that another backend sees a mixed state of the fields, but that's OK. The
> worst that can happen is that it will do an unnecessary lock/release on the heavy-weight lock.
> And to reduce the overhead when reading the fields, you could merge the
> SpeculativeInsertionIsInProgress() check into TransactionIdIsInProgress(). The call site in
> tqual.c always calls it together with TransactionIdIsInProgress(), which scans the proc array
> anyway, while holding the lock.

Currently in your patch all insertions do SpeculativeInsertionLockAcquire(GetCurrentTransactionId()) - presumably this is not something you intend to keep. Also, you should not do this for regular insertion:

    if (options & HEAP_INSERT_SPECULATIVE)
        SetSpeculativeInsertion(relation->rd_node, &heaptup->t_self);

Can you explain the following, please?:

+ /*
+  * Returns a speculative insertion token for waiting for the insertion to
+  * finish.
+  */
+ uint32
+ SpeculativeInsertionIsInProgress(TransactionId xid, RelFileNode rel, ItemPointer tid)
+ {
+     uint32 result = 0;
+     ProcArrayStruct *arrayP = procArray;
+     int index;

Why is this optimization correct? Presently it allows your patch to avoid getting a shared ProcArrayLock from HeapTupleSatisfiesDirty().
+     if (TransactionIdPrecedes(xid, TransactionXmin))
+         return false;

So from HeapTupleSatisfiesDirty(), you're checking if "xid" (the passed tuple's xmin) precedes our transaction's xmin (well, that of our last snapshot updated by GetSnapshotData()). This is set within GetSnapshotData(), but we're dealing with a dirty snapshot with no xmin, so TransactionXmin pertains to our MVCC snapshot, not our dirty snapshot.

It isn't really true that TransactionIdIsInProgress() gets the same shared ProcArrayLock in a similar fashion, for a full linear search; I think that the various fast-paths make it far less likely than it is for SpeculativeInsertionIsInProgress() (or, perhaps, should be). Here is what that other routine does in around the same place:

    /*
     * Don't bother checking a transaction older than RecentXmin; it could not
     * possibly still be running.  (Note: in particular, this guarantees that
     * we reject InvalidTransactionId, FrozenTransactionId, etc as not
     * running.)
     */
    if (TransactionIdPrecedes(xid, RecentXmin))
    {
        xc_by_recent_xmin_inc();
        return false;
    }

This extant code checks against RecentXmin, *not* TransactionXmin. It also caches things quite effectively, but that caching isn't very useful to you here. It checks latestCompletedXid before doing a linear search through the proc array too.

--
Peter Geoghegan
On Mon, Jan 13, 2014 at 6:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
> + uint32
> + SpeculativeInsertionIsInProgress(TransactionId xid, RelFileNode rel, ItemPointer tid)
> + {

For the purposes of preventing unprincipled deadlocking, commenting out the following (the only caller of the above) has no immediately discernible effect with any of the test-cases that I've published:

      /* XXX shouldn't we fall through to look at xmax? */
+     /* XXX why? or is that now covered by the above check? */
+     snapshot->speculativeToken =
+         SpeculativeInsertionIsInProgress(HeapTupleHeaderGetRawXmin(tuple),
+                                          rnode,
+                                          &htup->t_self);
+
+     snapshot->xmin = HeapTupleHeaderGetRawXmin(tuple);
      return true;        /* in insertion by other */

I think that the prevention of unprincipled deadlocking is all down to this immediately prior piece of code, at least in those test cases:

!     /*
!      * in insertion by other.
!      *
!      * Before returning true, check for the special case that the
!      * tuple was deleted by the same transaction that inserted it.
!      * Such a tuple will never be visible to anyone else, whether
!      * the transaction commits or aborts.
!      */
!     if (!(tuple->t_infomask & HEAP_XMAX_INVALID) &&
!         !(tuple->t_infomask & HEAP_XMAX_COMMITTED) &&
!         !(tuple->t_infomask & HEAP_XMAX_IS_MULTI) &&
!         !HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
!         HeapTupleHeaderGetRawXmax(tuple) == HeapTupleHeaderGetRawXmin(tuple))
!     {
!         return false;
!     }

But why should it be acceptable to change the semantics of dirty snapshots like this, which previously always returned true when control reached here? It is a departure from their traditional behavior, not limited to clients of this new promise tuple infrastructure. Now, it becomes entirely a matter of whether we tried to insert before or after the deleting xact's deletion (of a tuple it originally inserted) as to whether or not we block. So in general we don't get to "keep our old value locks" until xact end when we update or delete.

Even if you don't consider this a bug for existing dirty snapshot clients (I myself do - we can't rely on deleting a row and re-inserting the same values now, which could be particularly undesirable for updates), I have already described how we can take advantage of deleting tuples while still holding on to their "value locks" [1] to Andres. I think it'll be very important for multi-master conflict resolution. I've already described this useful property of dirty snapshots numerous times on this thread in relation to different aspects, as it happens. It's essential.

Anyway, I guess you're going to need an infomask bit to fix this, so you can differentiate between 'promise' tuples and 'proper' tuples. Those are in short supply. I still think this problem is more or less down to a modularity violation, and I suspect that this is not the last problem that will be found along these lines if we continue to pursue this approach.

[1] http://www.postgresql.org/message-id/CAM3SWZQpLSGPS2Kd=-n6HVYiqkF_mCxmX-Q72ar9UPzQ-X6F6Q@mail.gmail.com

--
Peter Geoghegan
On 01/14/2014 12:20 PM, Peter Geoghegan wrote: > I think that the prevention of unprincipled deadlocking is all down to > this immediately prior piece of code, at least in those test cases: > ! /* > ! * in insertion by other. > ! * > ! * Before returning true, check for the special case that the > ! * tuple was deleted by the same transaction that inserted it. > ! * Such a tuple will never be visible to anyone else, whether > ! * the transaction commits or aborts. > ! */ > ! if (!(tuple->t_infomask & HEAP_XMAX_INVALID) && > ! !(tuple->t_infomask & HEAP_XMAX_COMMITTED) && > ! !(tuple->t_infomask & HEAP_XMAX_IS_MULTI) && > ! !HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) && > ! HeapTupleHeaderGetRawXmax(tuple) == HeapTupleHeaderGetRawXmin(tuple)) > ! { > ! return false; > ! } > > But why should it be acceptable to change the semantics of dirty > snapshots like this, which previously always returned true when > control reached here? It is a departure from their traditional > behavior, not limited to clients of this new promise tuple > infrastructure. Now, it becomes entirely a matter of whether we tried > to insert before or after the deleting xact's deletion (of a tuple it > originally inserted) as to whether or not we block. So in general we > don't get to "keep our old value locks" until xact end when we update > or delete. Hmm. So the scenario would be that a process inserts a tuple, but kills it again later in the transaction, and then re-inserts the same value. The expectation is that because it inserted the value once already, inserting it again will not block. Ie. inserting and deleting a tuple effectively acquires a value-lock on the inserted values. > Even if you don't consider this a bug for existing dirty > snapshot clients (I myself do - we can't rely on deleting a row and > re-inserting the same values now, which could be particularly > undesirable for updates), Yeah, it would be bad if updates start failing because of this. We could add a check for that, and return true if the tuple was updated rather than deleted. > I have already described how we can take > advantage of deleting tuples while still holding on to their "value > locks" [1] to Andres. I think it'll be very important for multi-master > conflict resolution. I've already described this useful property of > dirty snapshots numerous times on this thread in relation to different > aspects, as it happens. It's essential. I didn't understand that description. > Anyway, I guess you're going to need an infomask bit to fix this, so > you can differentiate between 'promise' tuples and 'proper' tuples. Yeah, that's one way. Or you could set xmin to invalid, to make the killed tuple look thoroughly dead to everyone. > Those are in short supply. I still think this problem is more or less > down to a modularity violation, and I suspect that this is not the > last problem that will be found along these lines if we continue to > pursue this approach. You have suspected that many times throughout this thread, and every time there's been a relatively simple solutions to the issues you've raised. I suspect that's also going to be true for whatever mundane next issue you come up with. - Heikki
On 01/14/2014 12:44 AM, Peter Geoghegan wrote: > On Mon, Jan 13, 2014 at 12:58 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> Well, even if you don't agree that locking all the conflicting rows for >> update is sensible, it's still perfectly sensible to return the rejected >> rows to the user. For example, you're inserting N rows, and if some of them >> violate a constraint, you still want to insert the non-conflicting rows >> instead of rolling back the whole transaction. > > Right, but with your approach, can you really be sure that you have > the right rejecting tuple ctid (not reject)? In other words, as you > wait for the exclusion constraint to conclusively indicate that there > is a conflict, minutes may have passed in which time other conflicts > may emerge in earlier unique indexes. Whereas with an approach where > values are locked, you are guaranteed that earlier unique indexes have > no conflicting values. Maintaining that property seems useful, since > we check in a well-defined order, and we're still projecting a ctid. > Unlike when row locking is involved, we can make no assumptions or > generalizations around where conflicts will occur. Although that may > also be a general concern with your approach when row locking, for > multi-master replication use-cases. There may be some value in knowing > it cannot have been earlier unique indexes (and so the existing values > for those unique indexes in the locked row should stay the same - > don't many conflict resolution policies work that way?). I don't understand what you're saying. Can you give an example? In the use case I was envisioning above, ie. you insert N rows, and if any of them violate constraint, you still want to insert the non-violating instead of rolling back the whole transaction, you don't care. You don't care what existing rows the new rows conflicted with. Even if you want to know what you conflicted with, I can't make sense of what you're saying. In the btreelock approach, the value locks are immediately released once you discover that there's conflict. So by the time you get to do anything with the ctid of the existing tuple you conflicted with, new conflicting tuples might've appeared. - Heikki
On Tue, Jan 14, 2014 at 2:43 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Hmm. So the scenario would be that a process inserts a tuple, but kills it again later in the
> transaction, and then re-inserts the same value. The expectation is that because it inserted the
> value once already, inserting it again will not block. Ie. inserting and deleting a tuple
> effectively acquires a value-lock on the inserted values.

Right.

> Yeah, it would be bad if updates start failing because of this. We could add a check for that,
> and return true if the tuple was updated rather than deleted.

Why would you fix it that way?

>> I have already described how we can take advantage of deleting tuples while still holding on to
>> their "value locks" [1] to Andres. I think it'll be very important for multi-master conflict
>> resolution. I've already described this useful property of dirty snapshots numerous times on
>> this thread in relation to different aspects, as it happens. It's essential.
>
> I didn't understand that description.

I was describing how deleting existing locked rows, and re-inserting, could deal with multiple conflicts for multi-master replication use-cases. It hardly matters much though, because it's not as if the usefulness and necessity of this property of dirty snapshots is in question.

>> Anyway, I guess you're going to need an infomask bit to fix this, so you can differentiate
>> between 'promise' tuples and 'proper' tuples.
>
> Yeah, that's one way. Or you could set xmin to invalid, to make the killed tuple look thoroughly
> dead to everyone.

I think you'll have to use an infomask bit so everyone knows that this is a promise tuple from the start. Otherwise, I suspect that there are race conditions. The problem was that inserted-then-deleted-in-same-xact tuples (both regular and promise) were invisible to all xacts' dirty snapshots, when they should have only been invisible to the deleting xact's dirty snapshot. So it isn't obvious to me how you interlock things such that another xact doesn't incorrectly decide that it has to wait on what is really a promise tuple's xact for the full duration of that xact, having found no speculative insertion token to ShareLock (which implies unprincipled deadlocking), while simultaneously having other sessions not fail to see as dirty-visible a same-xact-inserted-deleted non-promise tuple (thereby ensuring those other sessions correctly conclude that it is necessary to wait for the end of the xmin/xmax xact). If you set the xmin to invalid too late, it doesn't help any existing waiters. Even if setting xmin to invalid is workable, it's a strike against the performance of your approach, because it's another heap buffer exclusive lock.

> You have suspected that many times throughout this thread, and every time there's been a
> relatively simple solutions to the issues you've raised. I suspect that's also going to be true
> for whatever mundane next issue you come up with.

I don't think it's a mundane issue. But in any case, you haven't addressed why you think your proposal is more or less better than my proposal, which is the pertinent question. You haven't given me so much as a high level summary of whatever misgivings you may have about it, even though I've asked you to comment on my approach to value locking several times. You haven't pointed out that it has any specific bug (which is not to suppose that that's because there are none).
The point is that it is not my contention that what you're proposing is totally unworkable. Rather, I think that the original proposal will probably ultimately perform better in all cases, is easier to reason about and is certainly far more modular. It appears to me to be the more conservative of the two proposals. In all sincerity, I simply don't know what factors you're weighing here. In saying that, I really don't mean to imply that you're assigning weight to things in a way that I am in disagreement with. I simply don't understand what is important to you here, and why your proposal preserves or enhances the things that you believe are important. Would you please explain your position along those lines? Now, I'll concede that it will be harder to make the IGNORE syntax work with exclusion constraints with what I've done, which would be nice. However, in my opinion that should be given far less weight than these other issues. It's ON DUPLICATE KEY...; no one could reasonably assume that exclusion constraints were covered. Also, upserting with exclusion constraints is a non-starter. It's only applicable to the case where you're using exclusion constraints exactly as you would use unique constraints, which has to be very rare. It will cause much more confusion than anything else. INSERT IGNORE in MySQL works with NOT NULL constraints, unique constraints, and all other constraints. FWIW I think that it would be kind of arbitrary to make IGNORE work with exclusion constraints and not other types of constraints, whereas when it's specifically ON DUPLICATE KEY, that seems far less surprising. -- Peter Geoghegan
On 01/14/2014 11:22 PM, Peter Geoghegan wrote: > On Tue, Jan 14, 2014 at 2:43 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> You have suspected that many times throughout this thread, and every time >> there's been a relatively simple solutions to the issues you've raised. I >> suspect that's also going to be true for whatever mundane next issue you >> come up with. > > I don't think it's a mundane issue. But in any case, you haven't > addressed why you think your proposal is more or less better than my > proposal, which is the pertinent question. 1. It's simpler. 2. Works for exclusion constraints. > You haven't given me so > much as a high level summary of whatever misgivings you may have about > it, even though I've asked you to comment on my approach to value > locking several times. You haven't pointed out that it has any > specific bug (which is not to suppose that that's because there are > none). The point is that it is not my contention that what you're > proposing is totally unworkable. Rather, I think that the original > proposal will probably ultimately perform better in all cases, is > easier to reason about and is certainly far more modular. It appears > to me to be the more conservative of the two proposals. In all > sincerity, I simply don't know what factors you're weighing here. In > saying that, I really don't mean to imply that you're assigning weight > to things in a way that I am in disagreement with. I simply don't > understand what is important to you here, and why your proposal > preserves or enhances the things that you believe are important. Would > you please explain your position along those lines? I guess that simplicity is in the eye of the beholder, but please take a look at git diff --stat: 41 files changed, 1224 insertions(+), 107 deletions(-) vs. 50 files changed, 2215 insertions(+), 240 deletions(-) Admittedly, some of the difference comes from the fact that you've spent a lot more time commenting and polishing the btreelock patch. But mostly I dislike additional complexity required in b-tree for this. I don't think B-tree locking is more conservative. The insert-and-then-check approach is already used by exclusion constraints, I'm just extending it to not abort on conflict, but do something else. - Heikki
On 01/14/2014 11:22 PM, Peter Geoghegan wrote:
> The problem was that inserted-then-deleted-in-same-xact tuples (both regular and promise) were
> invisible to all xacts' dirty snapshots, when they should have only been invisible to the
> deleting xact's dirty snapshot.

Right.

> So it isn't obvious to me how you interlock things such that another xact doesn't incorrectly
> decide that it has to wait on what is really a promise tuple's xact for the full duration of that
> xact, having found no speculative insertion token to ShareLock (which implies unprincipled
> deadlocking), while simultaneously having other sessions not fail to see as dirty-visible a
> same-xact-inserted-deleted non-promise tuple (thereby ensuring those other sessions correctly
> conclude that it is necessary to wait for the end of the xmin/xmax xact). If you set the xmin to
> invalid too late, it doesn't help any existing waiters.

If a backend finds no speculative insertion token to ShareLock, then it really isn't a speculative insertion, and the process should sleep on the xid as usual.

Once we remove the modification in HeapTupleSatisfiesDirty() that made it return false when xmin == xmax, the problem that arises is that another backend that sees the killed tuple incorrectly determines that it has to wait for that transaction to finish, even though it was a speculatively inserted tuple that was killed, and hence can be ignored. We can avoid that problem by setting xmin to invalid, or otherwise marking the tuple as dead.

Attached is a patch doing that, to again demonstrate what I mean. I'm not sure if setting xmin to invalid is really the best way to mark the tuple dead; I don't think a tuple's xmin can currently be InvalidTransactionId under any other circumstances, so there might be some code out there that's not prepared for it. So using an infomask bit might indeed be better. Or something else entirely.

- Heikki
On Tue, Jan 14, 2014 at 2:16 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> I don't think it's a mundane issue. But in any case, you haven't addressed why you think your
>> proposal is more or less better than my proposal, which is the pertinent question.
>
> 1. It's simpler.
>
> 2. Works for exclusion constraints.

Thank you for clarifying where you're coming from.

> I guess that simplicity is in the eye of the beholder, but please take a look at git diff --stat:
>
> 41 files changed, 1224 insertions(+), 107 deletions(-)
>
> vs.
>
> 50 files changed, 2215 insertions(+), 240 deletions(-)
>
> Admittedly, some of the difference comes from the fact that you've spent a lot more time
> commenting and polishing the btreelock patch. But mostly I dislike additional complexity required
> in b-tree for this.

It's very much down to differences in how well commented and documented each patch is. I have a fully formed amendment to the AM interface, complete with documentation of the AM and btree aspects, and detailed comments around how the parts fit together. But you've already explored doing something similar to what I do, to similarly avoid having to refind the page (less the heavyweight locking), which seems almost equivalent to what I propose in terms of its impact on btree, before we consider anything else.

> I don't think B-tree locking is more conservative. The insert-and-then-check approach is already
> used by exclusion constraints, I'm just extending it to not abort on conflict, but do something
> else.

If you examine what I actually do, you'll see that it's pretty much equivalent to how the extant value locking of unique btree indexes has always worked. It's just that the process is staggered at an exact point, the point where traditionally we hold no buffer locks, only a buffer pin (although we do additionally verify that the index gives the go-ahead before getting to later indexes, to get consensus to proceed with insertion). The suggestion that mine is the conservative approach is also based on the fact that database systems have made use of page-level exclusive locks on indexes, managed by the lock manager, persisting over complex operations in many different contexts for many years. This includes Postgres, where for many years the relcache has taken precautions against deadlocking in such AMs by ordering the list of indexes associated with a relation by pg_index.indexrelid. Currently this may not be necessary, but the principle stands.

The insert-then-check approach of exclusion constraints is quite different to what is proposed here, because exclusion constraints only ever have to abort the xact if things don't work out. There is no value locking. That's far easier to pin down. You definitely don't have to do anything new with visibility.

--
Peter Geoghegan
On Tue, Jan 14, 2014 at 3:25 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Attached is a patch doing that, to again demonstrate what I mean. I'm not > sure if setting xmin to invalid is really the best way to mark the tuple > dead; I don't think a tuple's xmin can currently be InvalidTransaction under > any other circumstances, so there might be some code out there that's not > prepared for it. So using an infomask bit might indeed be better. Or > something else entirely. Have you thought about the implications for other snapshot types (or other tqual.c routines)? My concern is that a client of that infrastructure (either current or future) could spuriously conclude that a heap tuple satisfied it, when in fact only a promise tuple satisfied it. It wouldn't necessarily follow that the promise would be fulfilled, nor that there would be some other proper heap tuple equivalent to that fulfilled promise tuple as far as those clients are concerned. heap_delete() will not call HeapTupleSatisfiesUpdate() when you're deleting a promise tuple, which on the face of it is fine - it's always going to technically be instantaneously invisible, because it's always created by the same command id (i.e. HeapTupleSatisfiesUpdate() would just return HeapTupleInvisible if called). So far so good, but we are technically doing something else quite new - deleting a would-be instantaneously invisible tuple. So like your concern about setting xmin to invalid, my concern is that code may exist that treats cmin < cmax as an invariant. Now, you might think that that would be a manageable concern, and to be fair a look at the ComboCids code that mostly arbitrates that stuff seems to indicate that it's okay, but it's still worth noting. I think you should consider breaking off the relcache parts of my patch and committing them, because they're independently useful. If we are going to have a lot of conflicts that need to be handled by a heap_delete(), there is no point in inserting non-unique index tuples for what is not yet conclusively a proper (non-promise) tuple. Those should always come last. And even without upsert, strictly inserting into unique indexes first seems like a useful thing relative to the cost. Unique violations are the cause of many aborted transactions, and there is no need to ever bloat non-unique indexes of the same slot when that happens. -- Peter Geoghegan
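To illustrate the point about ordering index insertions (the table and index names here are purely illustrative), consider a table with both a unique index and a non-unique index:

    create table t (
        id      int primary key,
        payload text
    );
    create index t_payload_idx on t (payload);

    insert into t values (1, 'a');
    insert into t values (1, 'b');  -- unique violation

If the unique index is always maintained first, the failing second insert can be rejected before any index tuple for 'b' is added to t_payload_idx, so the aborted insertion leaves no bloat behind in the non-unique index.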
On Tue, Jan 14, 2014 at 3:07 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> Right, but with your approach, can you really be sure that you have the right rejecting tuple
>> ctid (not reject)? In other words, as you wait for the exclusion constraint to conclusively
>> indicate that there is a conflict, minutes may have passed in which time other conflicts may
>> emerge in earlier unique indexes. Whereas with an approach where values are locked, you are
>> guaranteed that earlier unique indexes have no conflicting values. Maintaining that property
>> seems useful, since we check in a well-defined order, and we're still projecting a ctid. Unlike
>> when row locking is involved, we can make no assumptions or generalizations around where
>> conflicts will occur. Although that may also be a general concern with your approach when row
>> locking, for multi-master replication use-cases. There may be some value in knowing it cannot
>> have been earlier unique indexes (and so the existing values for those unique indexes in the
>> locked row should stay the same - don't many conflict resolution policies work that way?).
>
> I don't understand what you're saying. Can you give an example?
>
> In the use case I was envisioning above, ie. you insert N rows, and if any of them violate
> constraint, you still want to insert the non-violating instead of rolling back the whole
> transaction, you don't care. You don't care what existing rows the new rows conflicted with.
>
> Even if you want to know what you conflicted with, I can't make sense of what you're saying. In
> the btreelock approach, the value locks are immediately released once you discover that there's
> conflict. So by the time you get to do anything with the ctid of the existing tuple you
> conflicted with, new conflicting tuples might've appeared.

That's true, but at least the timeframe in which an additional conflict may occur on just-locked index values is bound to more or less an instant. In any case, how important this is is an interesting question, and perhaps one that Andres can weigh in on as someone that knows a lot about multi-master replication.

This issue is particularly interesting because this testcase appears to make both patches livelock, for reasons that I believe are related:

https://github.com/petergeoghegan/upsert/blob/master/torture.sh

I have an idea of what I could do to fix this, but I don't have time to make sure that my hunch is correct. I'm travelling tomorrow to give a talk at PDX pug, so I'll have limited access to e-mail.

--
Peter Geoghegan
On Wed, Jan 15, 2014 at 8:23 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I have an idea of what I could do to fix this, but I don't have time to make sure that my hunch
> is correct.

It might just be a matter of:

@@ -186,6 +186,13 @@ ExecLockHeapTupleForUpdateSpec(EState *estate,
     switch (test)
     {
         case HeapTupleInvisible:
+            /*
+             * Tuple may have originated from this command, in which case it's
+             * already locked
+             */
+            if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple.t_data)) &&
+                HeapTupleHeaderGetCmin(tuple.t_data) == estate->es_output_cid)
+                return true;
             /* Tuple became invisible; try again */
             if (IsolationUsesXactSnapshot())
                 ereport(ERROR,

--
Peter Geoghegan
On 01/16/2014 03:25 AM, Peter Geoghegan wrote:
> I think you should consider breaking off the relcache parts of my
> patch and committing them, because they're independently useful. If we
> are going to have a lot of conflicts that need to be handled by a
> heap_delete(), there is no point in inserting non-unique index tuples
> for what is not yet conclusively a proper (non-promise) tuple. Those
> should always come last. And even without upsert, strictly inserting
> into unique indexes first seems like a useful thing relative to the
> cost. Unique violations are the cause of many aborted transactions,
> and there is no need to ever bloat non-unique indexes of the same slot
> when that happens.

Makes sense. Can you extract that into a separate patch, please?

I was wondering if that might cause deadlocks if an existing index is changed from unique to non-unique, or vice versa, as the ordering would change. But we don't have a DDL command to change that, so the question is moot.

- Heikki
On Thu, Jan 16, 2014 at 12:35 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Makes sense. Can you extract that into a separate patch, please?

Okay.

On an unrelated note, here are results for a benchmark that compares the two patches for an insert heavy workload: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/insert-heavy-cmp/

I should point out that this is a sympathetic case for the exclusion approach; there is only one unique index involved, and the heap tuples were relatively wide:

pg@gerbil:~/pgbench-tools/tests$ cat tpc-b-upsert.sql
\set nbranches 1000000000
\set naccounts 1000000000
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom delta -5000 5000
with rej as(insert into pgbench_accounts(aid, bid, abalance, filler) values(:aid, :bid, :delta, 'filler') on duplicate key lock for update returning rejects aid, abalance) update pgbench_accounts set abalance = pgbench_accounts.abalance + rej.abalance from rej where pgbench_accounts.aid = rej.aid;

(This benchmark used an unlogged table, if only because to do otherwise would severely starve this particular server of I/O.)

-- Peter Geoghegan
On Wed, Jan 15, 2014 at 11:02 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It might just be a matter of:
>
> @@ -186,6 +186,13 @@ ExecLockHeapTupleForUpdateSpec(EState *estate,
>      switch (test)
>      {
>          case HeapTupleInvisible:
> +            /*
> +             * Tuple may have originated from this command, in which case it's
> +             * already locked
> +             */
> +            if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple.t_data)) &&
> +                HeapTupleHeaderGetCmin(tuple.t_data) == estate->es_output_cid)
> +                return true;
>              /* Tuple became invisible; try again */
>              if (IsolationUsesXactSnapshot())
>                  ereport(ERROR,

I think we need to give this some more thought. I have not addressed the implications for MVCC snapshots here.

I think that I'll need to raise a WARNING along the lines of "your snapshot isn't going to consider the locked tuple visible because the same command inserted it", or perhaps even raise an ERROR regardless of isolation level (although note that I'm not suggesting that we raise an ERROR in the event of receiving HeapTupleInvisible from heap_lock_tuple()/HTSU() for other reasons, which *is* possible, nor am I suggesting that later commands of the same xact would ever see this ERROR).

I'm comfortable with the idea of what you might loosely describe as a "READ COMMITTED mode serialization failure" here, because this case is so much more narrow than the other case I've proposed making a special exception to the general semantics of MVCC snapshots to accommodate (i.e. the case where a tuple is locked from an xact logically still-in-progress to our snapshot in RC mode). I think I'll be happy to declare that usage of the feature that hits this issue is somewhere between questionable and wrong. It probably isn't worth making another, similar HTSMVCC exception for this case. But ISTM that we still have to do *something* other than simply credit users with taking care to avoid tripping up on this.

-- Peter Geoghegan
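(To make the problem concrete -- a hedged sketch, with a table of my own invention rather than one from this mail: when two rows proposed for insertion in the same command conflict with each other, the row that ends up needing to be locked was inserted by that very command, and the command's own MVCC snapshot will not see it:)

    create table tab(k int4 primary key, v text);

    -- (1, 'x') is inserted, then (1, 'y') conflicts with it. The conflicting
    -- row is instantaneously invisible to this command's snapshot, so the
    -- outer UPDATE could never find it, which is why a WARNING or ERROR
    -- seems called for here.
    with r as (
      insert into tab values (1, 'x'), (1, 'y')
      on duplicate key lock for update
      returning rejects *
    )
    update tab set v = r.v from r where tab.k = r.k;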
On Thu, Jan 16, 2014 at 3:35 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Makes sense. Can you extract that into a separate patch, please?
>
> I was wondering if that might cause deadlocks if an existing index is
> changed from unique to non-unique, or vice versa, as the ordering would
> change. But we don't have a DDL command to change that, so the question is
> moot.

It's not hard to imagine someone wanting to add such a DDL command.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jan 18, 2014 at 5:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I was wondering if that might cause deadlocks if an existing index is
>> changed from unique to non-unique, or vice versa, as the ordering would
>> change. But we don't have a DDL command to change that, so the question is
>> moot.
>
> It's not hard to imagine someone wanting to add such a DDL command.

Perhaps, but the burden of solving that problem ought to rest with whoever eventually proposes the command. Certainly, if someone did so today, I would object on the grounds that their patch would preclude us from ever prioritizing unique indexes to get them out of the way during insertion, so I am not actually making such an effort any more difficult than it already is. Moreover, avoiding entirely predictable index bloat is more important than making the implementation of this yet-to-be-proposed feature easier. I was surprised when I learned that things didn't already work this way.

The attached patch, broken off from my larger patch, has the relcache sort indexes by (!indisunique, relindexid).

-- Peter Geoghegan
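(A hedged approximation of that ordering, expressed as a catalog query against the hypothetical orders table from the earlier sketch rather than as the relcache change itself -- unique indexes first, then everything else, with index OID as the tiebreaker:)

    select indexrelid::regclass, indisunique
    from pg_index
    where indrelid = 'orders'::regclass
    order by indisunique desc, indexrelid;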
On Thu, Jan 16, 2014 at 6:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I think we need to give this some more thought. I have not addressed
> the implications for MVCC snapshots here.

So I gave this some more thought, and this is what I came up with:

+ static bool
+ ExecLockHeapTupleForUpdateSpec(EState *estate,
+                                ResultRelInfo *relinfo,
+                                ItemPointer tid)
+ {
+     Relation    relation = relinfo->ri_RelationDesc;
+     HeapTupleData tuple;
+     HeapUpdateFailureData hufd;
+     HTSU_Result test;
+     Buffer      buffer;
+
+     Assert(ItemPointerIsValid(tid));
+
+     /* Lock tuple for update */
+     tuple.t_self = *tid;
+     test = heap_lock_tuple(relation, &tuple,
+                            estate->es_output_cid,
+                            LockTupleExclusive, false, /* wait */
+                            true, &buffer, &hufd);
+     ReleaseBuffer(buffer);
+
+     switch (test)
+     {
+         case HeapTupleInvisible:
+             /*
+              * Tuple may have originated from this transaction, in which case
+              * it's already locked. However, to avoid having to consider the
+              * case where the user locked an instantaneously invisible row
+              * inserted in the same command, throw an error.
+              */
+             if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple.t_data)))
+                 ereport(ERROR,
+                         (errcode(ERRCODE_UNIQUE_VIOLATION),
+                          errmsg("could not lock instantaneously invisible tuple inserted in same transaction"),
+                          errhint("Ensure that no rows proposed for insertion in the same command have constrained values that duplicate each other.")));
+             if (IsolationUsesXactSnapshot())
+                 ereport(ERROR,
+                         (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+                          errmsg("could not serialize access due to concurrent update")));
+             /* Tuple became invisible due to concurrent update; try again */
+             return false;
+         case HeapTupleSelfUpdated:
+             /*

I'm just throwing an error when locking the tuple returns HeapTupleInvisible, and the xmin of the tuple is our xid. It's sufficient to just check TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple.t_data)), because there is no way that _bt_check_unique() could consider the tuple dirty visible + conclusively fit for a lock attempt if it came from our xact, while at the same time for the same tuple HeapTupleSatisfiesUpdate() indicated invisibility, unless the tuple originated from the same command. Checking against subxacts or ancestor xacts is at worst redundant.

I am happy with this. ISTM that it'd be hard to argue that any reasonable and well-informed person would ever thank us for trying harder here, although it took me a while to reach that position. To understand what I mean, consider what MySQL does when in a similar position. I didn't actually check, but based on the fact that their docs don't consider this question I guess MySQL would go update the tuple inserted by that same "INSERT....ON DUPLICATE KEY UPDATE" command. Most of the time the conflicting tuples proposed for insertion by the user are in *some* way different (i.e. if the table was initially empty and you did a regular insert, inserting those same tuples would cause a unique constraint violation all on their own, but without there being any fully identical tuples among these hypothetical tuples proposed for insertion). It seems obvious that the order in which each tuple is evaluated for insert-or-update on MySQL is more or less undefined. And so by allowing this, they arguably allow their users to miss something they should not: they don't end up doing anything useful with the datums originally inserted in the command, but then subsequently updated over with something else in the same command.
MySQL users are not notified that this happened, and are probably blissfully unaware that there has been a limited form of data loss. So it's The Right Thing to say to Postgres users: "if you inserted these rows into the table when it was empty, there'd *still* definitely be a unique constraint violation, and you need to sort that out before asking Postgres to handle conflicts with concurrent sessions and existing data, where rows that come from earlier commands in your xact count as existing data".

The only problem I can see with that is that we cannot complain consistently for practical reasons, as when we lock *some other* xact's tuple rather than inserting in the same command two or more times. But at least when that happens they can definitely update two or more times (i.e. the row that we "locked twice" is visible). Naturally we can't catch every error a DML author may make.

-- Peter Geoghegan
On Sat, Jan 18, 2014 at 6:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
> MySQL users are not notified that this happened, and are probably
> blissfully unaware that there has been a limited form of data loss. So
> it's The Right Thing to say to Postgres users: "if you inserted these
> rows into the table when it was empty, there'd *still* definitely be a
> unique constraint violation, and you need to sort that out before
> asking Postgres to handle conflicts with concurrent sessions and
> existing data, where rows that come from earlier commands in your xact
> count as existing data".

I Googled and found evidence indicating that a number of popular proprietary systems' SQL MERGE implementations do much the same thing. You may get an "attempt to UPDATE the same row twice" error on both SQL Server and Oracle. I wouldn't like to speculate on whether the standard requires this of MERGE, but to require it seems very sensible.

> The only problem I can see with that is that
> we cannot complain consistently for practical reasons, as when we lock
> *some other* xact's tuple rather than inserting in the same command
> two or more times.

Actually, maybe it would be practical to complain that the same UPSERT command attempted to lock a row twice with at least *almost* total accuracy, and not just for the particularly problematic case where tuple visibility is not assured. Personally, I favor just making "case HeapTupleSelfUpdated:" within the patch's ExecLockHeapTupleForUpdateSpec() function complain when "hufd.cmax == estate->es_output_cid" (currently there is a separate complaint, but only when those two variables are unequal). That's probably almost perfect in practice.

If we wanted perfection, which would be to always complain when two rows were locked by the same UPSERT command, it would be a matter of having heap_lock_tuple indicate to the patch's ExecLockHeapTupleForUpdateSpec() caller that the row was already locked, so that it could complain in a special way for the locked-not-updated case. But that is hard, because there is no way for it to know if the current *command* locked the tuple, and that's the only case that we are justified in raising an error for.

But now that I think about it some more, maybe always complaining when we lock but have not yet updated is not just not worth the trouble, but is in fact bogus. It's not obvious what precise behavior is correct here. I was worried about someone updating something twice, but maybe it's fully sufficient to do what I've already proposed, while in addition documenting that you cannot on-duplicate-key-lock a tuple that has already been inserted or updated within the same command. It will be very rare for anyone to trip up over that in practice (e.g. by locking twice and spuriously updating the same row twice or more in a later command). Users learn to not try this kind of thing by having it break immediately; the fact that it doesn't break with 100% reliability is good enough (plus it doesn't *really* fail to break when it should because of how things are documented).

-- Peter Geoghegan
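(A hedged sketch of the case that cannot be caught this way, reusing the same hypothetical tab table as in the earlier sketch: two rows proposed by one command both conflict with the same pre-existing, visible row, so the command locks that row twice without error, and it is not well defined which of the two rejected rows' values a later UPDATE driven by the rejects ends up applying:)

    create table tab(k int4 primary key, v text);
    insert into tab values (5, 'old');

    -- Both proposed rows are rejected by the same existing row (k = 5).
    -- That row is visible, so no error is raised; it is simply locked twice,
    -- and the UPDATE applies one of the two rejected values, unpredictably.
    with r as (
      insert into tab values (5, 'x'), (5, 'y')
      on duplicate key lock for update
      returning rejects *
    )
    update tab set v = r.v from r where tab.k = r.k;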
On Sat, Jan 18, 2014 at 7:49 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Personally, I favor just making "case HeapTupleSelfUpdated:" within
> the patch's ExecLockHeapTupleForUpdateSpec() function complain when
> "hufd.cmax == estate->es_output_cid" (currently there is a separate
> complaint, but only when those two variables are unequal). That's
> probably almost perfect in practice.

Actually, there isn't really a need to do so, since I believe in practice the tuple locked will always be instantaneously invisible (when we have the scope to avoid this "updated the tuple twice in the same command" problem by forbidding it in the style of SQL MERGE). However, I think I'm going to propose that we still do something in the ExecLockHeapTupleForUpdateSpec() HeapTupleSelfUpdated handler (in addition to HeapTupleInvisible), because that'll still be illustrative dead code.

-- Peter Geoghegan
On Thu, Jan 16, 2014 at 12:35 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> I think you should consider breaking off the relcache parts of my >> patch and committing them, because they're independently useful. > > Makes sense. Can you extract that into a separate patch, please? Perhaps you can take a look at this again, when you get a chance. -- Peter Geoghegan
On 02/07/2014 01:27 PM, Peter Geoghegan wrote:
> On Thu, Jan 16, 2014 at 12:35 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>> I think you should consider breaking off the relcache parts of my
>>> patch and committing them, because they're independently useful.
>>
>> Makes sense. Can you extract that into a separate patch, please?
>
> Perhaps you can take a look at this again, when you get a chance.

The relcache parts? I don't think a separate patch ever appeared that could be reviewed.

Looking again at the last emails in this whole thread, I don't have anything to add. At this point, I think it's pretty clear this won't make it into 9.4, so I'm going to mark this as "returned with feedback". If someone else thinks this still has a chance and is willing to review this and beat it into shape, please resurrect it quickly.

- Heikki
On Mon, Feb 10, 2014 at 11:57 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > The relcache parts? I don't think a separate patch ever appeared that could > be reviewed. I posted the patch on January 18th: http://www.postgresql.org/message-id/CAM3SWZTh4VkESoT7dCrWbPRN7zZhNZ-Wa6zmvO1FF7gBNOjNOg@mail.gmail.com I was under the impression that you agreed that this was independently valuable, regardless of the outcome here. -- Peter Geoghegan
On Sun, Jan 19, 2014 at 2:17 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I'm just throwing an error when locking the tuple returns
> HeapTupleInvisible, and the xmin of the tuple is our xid.

I would like some feedback on this point. We need to consider how exactly to avoid updating the same tuple inserted by our command. Updating a tuple we inserted cannot be allowed to happen, not least because to do so causes livelock.

A related consideration that I raised in mid to late January that hasn't been commented on is avoiding updating the same tuple twice, and where we come down on that with respect to where our responsibility to the user starts and ends. For example, SQL MERGE officially forbids this, but MySQL's INSERT...ON DUPLICATE KEY UPDATE seems not to, probably due to implementation considerations.

-- Peter Geoghegan
On Mon, Feb 10, 2014 at 06:40:30PM +0000, Peter Geoghegan wrote:
> On Sun, Jan 19, 2014 at 2:17 AM, Peter Geoghegan <pg@heroku.com> wrote:
> > I'm just throwing an error when locking the tuple returns
> > HeapTupleInvisible, and the xmin of the tuple is our xid.
>
> I would like some feedback on this point. We need to consider how
> exactly to avoid updating the same tuple inserted by our command.
> Updating a tuple we inserted cannot be allowed to happen, not least
> because to do so causes livelock.
>
> A related consideration that I raised in mid to late January that
> hasn't been commented on is avoiding updating the same tuple twice,
> and where we come down on that with respect to where our
> responsibility to the user starts and ends. For example, SQL MERGE
> officially forbids this, but MySQL's INSERT...ON DUPLICATE KEY UPDATE
> seems not to, probably due to implementation considerations.

Where are we on this?

-- Bruce Momjian <bruce@momjian.us>  http://momjian.us
   EnterpriseDB                      http://enterprisedb.com
   + Everyone has their own god. +
On Thu, Apr 17, 2014 at 9:52 AM, Bruce Momjian <bruce@momjian.us> wrote: > Where are we on this? My hope is that I can get agreement on a way forward during pgCon. Or, at the very least, explain the issues as I see them in a relatively accessible and succinct way to those interested. -- Peter Geoghegan