Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - Mailing list pgsql-hackers

From:           Peter Geoghegan
Subject:        Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Msg-id:         CAM3SWZQUUuYYcGksVytmcGqACVMkf1ui1uvfJekM15YkWZpzhw@mail.gmail.com
In response to: Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE (Robert Haas <robertmhaas@gmail.com>)
List:           pgsql-hackers
On Tue, Sep 24, 2013 at 7:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't really disagree with any of that. TBH, I think the question
> of how long value locks (as you call them) are held is going to boil
> down to a question of how they end up being implemented.

Well, I think we can rule out value locks that are held for the
duration of a transaction right away. That's just not going to fly.

> As I mentioned to you at PG Open (going through the details here for
> those following along at home), we could optimistically insert the
> new heap tuple, then go add index entries for it, and if we find a
> conflict, then instead of erroring out, we mark the tuple we were
> inserting dead and go try to update the conflicting tuple instead.
> In that implementation, if we find that we have to wait for some
> other transaction along the way, it's probably not worth reversing
> out the index entries already inserted, because getting them into
> the index in the first place was a WAL-logged operation, and
> therefore relatively expensive, and IMHO it's most likely better to
> just hope things work out than to risk having to redo all of that.

I'm afraid that there are things that concern me about this design.
It does have one big advantage over promise tuples, which is that the
possibility of index-only bloat, and even the possible need to freeze
indexes separately from their heap relation, are averted (or are you
going to have recovery do promise clean-up instead? Does recovery
look for an eventual successful insertion relating to the promise?
How far does it look?). However, while I'm just as concerned as you
are that backing out is too expensive, I'm equally concerned that
there is no reasonable alternative to backing out, which is why
cheap, quick in-memory value locks are so compelling to me. See my
remarks below.

> On the other hand, if the locks are strictly in-memory, then the
> cost of releasing them all before we go to wait, and of reacquiring
> them after we finish waiting, is pretty low. There might be some
> modularity issues to work through there, but they might not turn out
> to be very painful, and the advantages you mention are certainly
> worth accruing if it turns out to be fairly straightforward.

It's certainly a difficult situation to judge.

> Personally, I think that trying to keep it all in-memory is going to
> be hard. The problem is that we can't de-optimize regular inserts or
> updates to any significant degree to cater to this feature - because
> as valuable as this feature is, the number of times it gets used is
> still going to be a whole lot smaller than the number of times it
> doesn't get used.

Right - I don't think that anyone would argue that any other standard
should be applied. Fortunately, I'm reasonably confident that it can
work. The last part of index tuple insertion, where we acquire an
exclusive lock on a buffer, needs to look out for a page header bit
(on pages considered for insertion of the value). The cost of that to
anyone not using this feature is likely to be infinitesimally small.
We can leave clean-up of that bit to the next inserter, who needs the
exclusive lock anyway and doesn't find a corresponding SLRU entry.
But really, that's a discussion for another day. I think we'd want to
track value locks per pinned-by-upserter buffer, to localize any
impact on concurrency.
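To make that fast path concrete, here's a toy standalone model (plain
C with pthreads, not backend code - every name in it is invented for
illustration, and the question of whether two values actually
conflict is elided entirely). The point is that an ordinary inserter,
which already holds the exclusive buffer lock, pays only a single
flag test unless the bit is actually set:

    /*
     * Toy model of per-buffer value locking.  A flag on the buffer
     * header (standing in for the page header bit) says "somebody may
     * hold value locks here"; the locks themselves live in a small
     * shared table keyed by buffer number.  Hypothetical names
     * throughout; this is not PostgreSQL code.
     */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define VLOCK_SLOTS 256     /* toy size */

    typedef struct BufferHdr
    {
        uint32_t    buf_id;
        bool        maybe_value_locked; /* the "page header bit" */
    } BufferHdr;

    typedef struct ValueLock
    {
        uint32_t    buf_id;
        bool        in_use;
    } ValueLock;

    static ValueLock vlocks[VLOCK_SLOTS];
    static pthread_mutex_t vlock_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Upserter: register a value lock against a pinned, locked buffer. */
    static bool
    value_lock_acquire(BufferHdr *buf)
    {
        bool        ok = false;

        pthread_mutex_lock(&vlock_lock);
        for (int i = 0; i < VLOCK_SLOTS; i++)
        {
            if (!vlocks[i].in_use)
            {
                vlocks[i].buf_id = buf->buf_id;
                vlocks[i].in_use = true;
                buf->maybe_value_locked = true;
                ok = true;
                break;
            }
        }
        pthread_mutex_unlock(&vlock_lock);
        return ok;
    }

    /*
     * Ordinary inserter, already holding the exclusive buffer lock.
     * Fast path: one flag test.  If the flag is set but no entry is
     * found, the inserter clears the stale flag itself - the lazy
     * clean-up described above.
     */
    static bool
    insert_must_wait(BufferHdr *buf)
    {
        bool        found = false;

        if (!buf->maybe_value_locked)
            return false;       /* the overwhelmingly common case */

        pthread_mutex_lock(&vlock_lock);
        for (int i = 0; i < VLOCK_SLOTS; i++)
        {
            if (vlocks[i].in_use && vlocks[i].buf_id == buf->buf_id)
                found = true;
        }
        if (!found)
            buf->maybe_value_locked = false;    /* lazy clean-up */
        pthread_mutex_unlock(&vlock_lock);
        return found;
    }

    int
    main(void)
    {
        BufferHdr   buf = {42, false};

        printf("must wait? %d\n", insert_must_wait(&buf));  /* 0: fast path */
        value_lock_acquire(&buf);
        printf("must wait? %d\n", insert_must_wait(&buf));  /* 1: value locked */
        return 0;
    }

Of course, a real implementation needs the shared table's key to stay
stable for as long as the lock is held, which is what the next point
is about.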
If we forbid page-splits in respect of a value-locked page, we can
still have a stable value (buffer number) to use within a shared
memory hash table, or something along those lines. We're still going
to want to minimize the duration of locking under this scheme, by
doing TOASTing before locking values and so on, which is quite
possible. If we're really lucky, maybe the value locking stuff can be
generalized or re-used as part of a btree index insertion buffer
feature.

> Also, I tend to think that we might want to define the operation as
> a REPLACE-type operation with respect to a certain set of key
> columns; and so we'll do the insert-or-update behavior with respect
> only to the index on those columns and let the chips fall where they
> may with respect to any others. In that case this all becomes much
> less urgent.

Well, MySQL's REPLACE does zero or more DELETEs followed by an
INSERT; it does not try an INSERT, maybe mark the heap tuple dead on
finding a unique index duplicate, and then go UPDATE the conflicting
tuple. I mention this only because the term REPLACE carries a certain
baggage, and I feel it's important to be careful about such things.
The only way that's going to work is if you say "use this unique
index", which will look pretty gross in DML. That might actually be
okay with me if we had somewhere to go from there in a future
release, but I doubt that's the case. Another issue is that I'm not
sure that this helps Andres much (or rather, clients of the logical
changeset generation infrastructure that need to do conflict
resolution), and that matters a lot to me here.

> Suppose we define the operation as REPLACE rather than INSERT...ON
> DUPLICATE KEY LOCK FOR UPDATE. Then we could do something like this:
>
> 1. Try to insert a tuple. If no unique index conflicts occur, stop.
>
> 2. Note the identity of the conflicting tuple and mark the inserted
> heap tuple dead.
>
> 3. If the conflicting tuple's inserting transaction is still in
> progress, wait for the inserting transaction to end.

Sure, this is basically what the code does today (apart from marking
a just-inserted tuple dead).

> 4. If the conflicting tuple is dead (e.g. because the inserter
> aborted), start over.

Start over from where? I presume you mean from the index tuple
insertion, as things are today. Or do you mean from the very start?

> 5. If the conflicting tuple's key columns no longer match the key
> columns of the REPLACE operation, start over.

What definition of equality or inequality? I think you're going to
have to consider stashing information about the btree operator class,
which seems not ideal - a modularity violation beyond what we already
do in, say, execQual.c. In general, I think we have to worry about
the distinction between a particular btree operator class's idea of
equality (which doesn't have to be the = operator), which exists for
a particular index, and some qual's idea of equality. It would
probably be quite invasive to fix this, which I for one would find
hard to justify. I think my scheme is okay here while yours isn't,
because mine involves row locking only, and hoping that nothing gets
updated in that tiny window after transaction commit - if nothing
does, that's good enough for us, because we know that the btree
code's opinion still holds - if I'm not mistaken, *nothing* about the
logical row can have changed without us hearing about it (i.e.
without heap_lock_tuple() returning HeapTupleUpdated).
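To see why I'm harping on the definition of equality, consider this
trivial standalone program (invented for this email, nothing to do
with backend code): a comparator standing in for a case-insensitive
operator class's idea of equality disagrees with a naive bytewise
qual about whether two keys are duplicates, so a step-5 re-check
written in terms of the wrong one would wrongly start over, or
wrongly not:

    /*
     * Toy demonstration of the two notions of "equal" at issue: an
     * invented comparator standing in for a case-insensitive operator
     * class's equality, versus a qual that compares bytewise.
     */
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* stand-in for an opclass support function's comparison */
    static int
    opclass_cmp(const char *a, const char *b)
    {
        while (*a && *b)
        {
            int         ca = tolower((unsigned char) *a++);
            int         cb = tolower((unsigned char) *b++);

            if (ca != cb)
                return ca < cb ? -1 : 1;
        }
        return (*a != 0) - (*b != 0);
    }

    int
    main(void)
    {
        const char *existing = "Foo";
        const char *proposed = "FOO";

        printf("opclass sees a duplicate: %s\n",
               opclass_cmp(existing, proposed) == 0 ? "yes" : "no");
        printf("bytewise qual sees a duplicate: %s\n",
               strcmp(existing, proposed) == 0 ? "yes" : "no");
        return 0;
    }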
On the other hand, you're talking about concluding that something is
not a duplicate in a way that needs to satisfy btree unique index
equality (so whatever operator is associated with btree strategy
number 3, equality, for some particular unique index with some
particular operator class), and not necessarily just a qual written
with a potentially distinct notion of equality in respect of the
relevant unique-constrained datums. Maybe you can solve this one
problem, but the fact remains that to do so would be a pretty bad
modularity violation, even by the standards of the existing btree
code. That's the basic reason why I'm averse to using EvalPlanQual()
in this fashion, or in a similar fashion. Even if you solve all the
problems for btree, I can't imagine what type of burden it puts on
amcanunique AM authors generally - I know at least one person who
won't be happy with that. :-)

> 6. If the conflicting tuple has a valid xmax, wait for the deleting
> or locking transaction to end. If xmax is still valid, follow the
> CTID chain to the updated tuple, let that be the new conflicting
> tuple, and resume from step 5.

So you've arbitrarily restricted us to one value lock and one row
lock per REPLACE slot processed, which sort of allows us to avoid
solving the basic problem of value locking, because it isn't too bad
now - no need to backtrack across indexes. Clean-up (marking the heap
tuple dead) is much more expensive than releasing locks in memory
(although much less expensive than promise tuple killing), but
needing to clean up is perhaps less likely, because conflicts can
only come from one unique index. Has this really bought us anything,
though? Consider that conflicts are generally only expected on one
unique index anyway. Plus you still have the disconnect between value
and row locking, as far as I can tell - "start from scratch" remains
a possible step until very late, except you pay a lot more for
clean-up - avoiding that expensive clean-up is the major benefit of
introducing an SLRU-based shadow value locking scheme to the btree
code. I don't see that there is a way to deal with the value
locking/row locking disconnect other than to live with it in a smart
way. Anyway, your design probably avoids the worst kind of gridlock.
Let's assume that it works out - my next question has to be, where
can we go from there?

> 7. Update the tuple, even though it may be invisible to our snapshot
> (a deliberate MVCC violation!).

I realize that you just wanted to sketch a design, but offhand I
think that the basic problem with what you describe is that it isn't
accepting of the inevitability of there being a disconnect between
value and row locking. Also, this doesn't fit with any roadmap for
getting a real upsert, and it compromises the conceptual integrity of
the AM in a way that isn't likely to be accepted, and, at the risk of
saying too much before you've defended your design, perhaps even
necessitates invasive changes to the already extremely complicated
row locking code.
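Just to make sure we're talking about the same control flow, here is
my reading of your steps 1 through 7, rendered as a standalone C
skeleton. The "index", the "tuples" and every helper here are toy
stand-ins invented for this sketch - in the real backend these would
be heap and index insertion, visibility checks, waiting on the
inserter or on xmax, and following the CTID chain:

    /* Standalone control-flow skeleton of the quoted steps 1-7. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef struct Tuple
    {
        int         key;
        bool        dead;
        bool        inserter_in_progress;
        bool        xmax_valid;
        struct Tuple *ctid_next;    /* newer row version, if updated */
    } Tuple;

    #define NSLOTS 4
    static Tuple *unique_index[NSLOTS];  /* toy single-column unique index */

    /* Step 1: returns NULL on success, else the conflicting tuple. */
    static Tuple *
    index_insert(Tuple *newtup)
    {
        for (int i = 0; i < NSLOTS; i++)
            if (unique_index[i] && !unique_index[i]->dead &&
                unique_index[i]->key == newtup->key)
                return unique_index[i];
        for (int i = 0; i < NSLOTS; i++)
            if (unique_index[i] == NULL)
            {
                unique_index[i] = newtup;
                return NULL;
            }
        return NULL;            /* toy: pretend the index never fills up */
    }

    static void
    replace(Tuple *newtup)
    {
        Tuple      *conflict;

    restart:
        conflict = index_insert(newtup);        /* step 1 */
        if (conflict == NULL)
        {
            printf("inserted key %d\n", newtup->key);
            return;
        }
        newtup->dead = true;                    /* step 2 */
        if (conflict->inserter_in_progress)     /* step 3: "wait" */
            conflict->inserter_in_progress = false; /* toy: xact ended */
        if (conflict->dead)                     /* step 4 */
        {
            newtup->dead = false;
            goto restart;
        }
    recheck:
        if (conflict->key != newtup->key)       /* step 5 */
        {
            newtup->dead = false;
            goto restart;
        }
        if (conflict->xmax_valid)               /* step 6: "wait", then... */
        {
            if (conflict->ctid_next != NULL)    /* it was updated */
            {
                conflict = conflict->ctid_next;
                goto recheck;                   /* "resume from step 5" */
            }
            conflict->xmax_valid = false;       /* locker/deleter aborted */
        }
        /* step 7: the deliberate MVCC violation */
        printf("updated existing tuple with key %d\n", conflict->key);
    }

    int
    main(void)
    {
        Tuple       existing = {1, false, false, false, NULL};
        Tuple       newtup = {1, false, false, false, NULL};

        unique_index[0] = &existing;
        replace(&newtup);   /* conflicts with "existing" and updates it */
        return 0;
    }

Note how "start over" remains reachable from step 5, long after any
value lock has been given up - that's the value locking/row locking
disconnect I keep going on about.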
> While this behavior is admittedly wonky from an MVCC perspective, I
> suspect that it would make a lot of people happy.

"Wonky from an MVCC perspective" is the order of the day here. :-)

>> My guess is that they have some fancy snapshot type that is used
>> by the equivalent of ModifyTable subplans, that is appropriately
>> paranoid about the Halloween problem and so on. How that actually
>> might work is far from clear, but it's a question that I have begun
>> to consider. As I said, a concern is that it would be in tension
>> with the generalized, composable syntax, where we don't explicitly
>> have a "special update". I'd really like to hear other opinions,
>> though.
>
> The tension here feels fairly fundamental to me; I don't think our
> implementation is to blame. I think the problem isn't so much to
> figure out a clever trick that will make this all work in a truly
> elegant fashion as it is to decide exactly how we're going to
> compromise MVCC semantics in the least blatant way.

Yeah, I understand the problem in just that way. I think it would be
a bit of a pity to give up the composability, which I liked, but it's
something that we'll have to consider. On the other hand, perhaps we
can get away with it - we simply don't know enough yet.

--
Peter Geoghegan