
From: Peter Geoghegan
Subject: Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Msg-id: CAM3SWZQUUuYYcGksVytmcGqACVMkf1ui1uvfJekM15YkWZpzhw@mail.gmail.com
In response to: Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Tue, Sep 24, 2013 at 7:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't really disagree with any of that.  TBH, I think the question
> of how long value locks (as you call them) are held is going to boil
> down to a question of how they end up being implemented.

Well, I think we can rule out value locks that are held for the
duration of a transaction right away. That's just not going to fly.

> As I mentioned to you at PG Open (going through the details here for those
> following along at home), we could optimistically insert the new heap
> tuple, then go add index entries for it, and if we find a conflict,
> then instead of erroring out, we mark the tuple we were inserting dead
> and go try to update the conflicting tuple instead.  In that
> implementation, if we find that we have to wait for some other
> transaction along the way, it's probably not worth reversing out the
> index entries already inserted, because getting them into the index in
> the first place was a WAL-logged operation, and therefore relatively
> expensive, and IMHO it's most likely better to just hope things work
> out than to risk having to redo all of that.

I'm afraid that there are things that concern me about this design. It
does have one big advantage over promise-tuples, which is that the
possibility of index-only bloat, and even the possible need to freeze
indexes separately from their heap relation, are averted (or are you
going to have recovery do promise clean-up instead? Does recovery look
for an eventual successful insertion relating to the promise? How far
does it look?). However, while I'm just as concerned as you that
backing out is too expensive, I'm equally concerned that there is no
reasonable alternative to backing out, which is why cheap, quick
in-memory value locks are so compelling to me. See my remarks below.

> On the other hand, if the locks are strictly in-memory, then the cost
> of releasing them all before we go to wait, and of reacquiring them
> after we finish waiting, is pretty low.  There might be some
> modularity issues to work through there, but they might not turn out
> to be very painful, and the advantages you mention are certainly worth
> accruing if it turns out to be fairly straightforward.

It's certainly a difficult situation to judge.

> Personally, I think that trying to keep it all in-memory is going to
> be hard.  The problem is that we can't de-optimize regular inserts or
> updates to any significant degree to cater to this feature - because
> as valuable as this feature is, the number of times it gets used is
> still going to be a whole lot smaller than the number of times it
> doesn't get used.

Right - I don't think that anyone would argue that any other standard
should be applied. Fortunately, I'm reasonably confident that it can
work. The last part of index tuple insertion, where we acquire an
exclusive lock on a buffer, needs to look out for a page header bit
(on pages considered for insertion of the value in question). The cost
of that to anyone not using this feature is likely to be
infinitesimally small. We can leave clean-up of that bit to the next
inserter, who needs the exclusive lock anyway and who won't find a
corresponding SLRU entry. But really, that's a discussion for another
day. I think we'd want to track value locks per pinned-by-upserter
buffer, to localize any downsides for concurrency. If we forbid
page-splits in respect of a
value-locked page, we can still have a stable value (buffer number) to
use within a shared memory hash table, or something along those lines.
We're still going to want to minimize the duration of locking under
this scheme, by doing TOASTing before locking values and so on, which
is quite possible.
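
To make the fast path concrete, here is roughly the kind of check I
have in mind (a sketch only - BTP_VALUE_LOCKED and
_bt_valuelock_exists() are names I've invented for illustration;
nothing like them exists today):

    #include "access/nbtree.h"      /* BTPageOpaque, btpo_flags */
    #include "storage/bufmgr.h"     /* Buffer, BufferGetPage() */
    #include "storage/bufpage.h"    /* Page, PageGetSpecialPointer() */

    /*
     * Sketch only.  BTP_VALUE_LOCKED is an invented page header flag
     * bit; _bt_valuelock_exists() stands in for a probe of a shared
     * memory hash table keyed by buffer number.
     */
    static bool
    _bt_page_value_locked(Relation rel, Buffer buf)
    {
        Page            page = BufferGetPage(buf);
        BTPageOpaque    opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        /* Fast path: bit clear means no upserter value-locked this page */
        if ((opaque->btpo_flags & BTP_VALUE_LOCKED) == 0)
            return false;

        /*
         * The bit may be stale; the authoritative state lives in shared
         * memory.  The next exclusive locker that finds no entry clears
         * the stale bit.
         */
        return _bt_valuelock_exists(rel, buf);
    }

The point being that an inserter not using the feature pays only for
testing a flags word in a page it already holds an exclusive lock on.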

If we're really lucky, maybe the value locking stuff can be
generalized or re-used as part of a btree index insertion buffer
feature.

> Also, I tend to think that we might want to define
> the operation as a REPLACE-type operation with respect to a certain
> set of key columns; and so we'll do the insert-or-update behavior with
> respect only to the index on those columns and let the chips fall
> where they may with respect to any others.  In that case this all
> becomes much less urgent.

Well, MySQL's REPLACE does zero or more DELETEs followed by an INSERT;
it doesn't try an INSERT, maybe mark the heap tuple dead if there's a
unique index dup, and then go UPDATE the conflicting tuple. I mention
this only because the term REPLACE carries certain baggage, and I feel
it's important to be careful about such things.

The only way that's going to work is if you say "use this unique
index", which will look pretty gross in DML. That might actually be
okay with me if we had somewhere to go from there in a future release,
but I doubt that's the case. Another issue is that I'm not sure that
this helps Andres much (or rather, clients of the logical changeset
generation infrastructure that need to do conflict resolution), and
that matters a lot to me here.

> Suppose we define the operation as REPLACE rather than INSERT...ON
> DUPLICATE KEY LOCK FOR UPDATE.  Then we could do something like this:
>
> 1. Try to insert a tuple.  If no unique index conflicts occur, stop.
> 2. Note the identity of the conflicting tuple and mark the inserted
> heap tuple dead.
> 3. If the conflicting tuple's inserting transaction is still in
> progress, wait for the inserting transaction to end.

Sure, this is basically what the code does today (apart from marking a
just-inserted tuple dead).

> 4. If the conflicting tuple is dead (e.g. because the inserter
> aborted), start over.

Start over from where? I presume you mean the index tuple insertion,
as things are today. Or do you mean the very start?

> 5. If the conflicting tuple's key columns no longer match the key
> columns of the REPLACE operation, start over.

What definition of equality or inequality? I think you're going to
have to consider stashing information about the btree operator class,
which seems not ideal - a modularity violation beyond what we already
do in, say, execQual.c. In general, we have to worry about the
distinction between a particular btree operator class's idea of
equality (which doesn't have to be the = operator), as it exists for a
particular index, and some qual's idea of equality. It would probably
be quite invasive to fix this, which I for one would find hard to
justify.
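
To illustrate the distinction: the equality that actually matters to
the unique index is whatever its opclass has under btree strategy
number 3, which you'd have to fish out of the catalogs along these
lines (a sketch using existing lookup routines; error handling for an
InvalidOid result omitted):

    #include "access/skey.h"        /* BTEqualStrategyNumber */
    #include "utils/lsyscache.h"    /* get_opfamily_member() */
    #include "utils/rel.h"          /* rd_opfamily, rd_opcintype */

    /*
     * Sketch: find the "equals" operator that the index's own opclass
     * uses for a given key column.  It need not be the "=" operator
     * appearing in any qual.
     */
    static Oid
    index_equality_operator(Relation indexRel, int keycol)
    {
        Oid     opfamily = indexRel->rd_opfamily[keycol - 1];
        Oid     opcintype = indexRel->rd_opcintype[keycol - 1];

        return get_opfamily_member(opfamily, opcintype, opcintype,
                                   BTEqualStrategyNumber);
    }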

I think my scheme is okay here while yours isn't, because mine
involves row locking only, plus hoping that nothing gets updated in
that tiny window after transaction commit. If nothing does, that's
good enough for us, because we know that the btree code's opinion
still holds - if I'm not mistaken, *nothing* can have changed about
the logical row without us hearing about it (i.e. without
heap_lock_tuple() returning HeapTupleUpdated). On the other hand,
you're talking about concluding that something is not a duplicate in a
way that needs to satisfy btree unique index equality (whatever
operator is associated with btree strategy number 3, equality, for
some particular unique index with some particular operator class), and
not necessarily just a qual written with a potentially distinct notion
of equality in respect of the relevant unique-constrained datums.
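
Schematically, the dance I have in mind goes something like this (a
sketch only - every helper here is invented shorthand; in particular,
lock_conflicting_row() stands in for heap_lock_tuple(), whose real
interface takes several more arguments):

    /*
     * Sketch of the retry dance.  HTSU_Result and HeapTupleUpdated
     * are real; all helper functions are invented for illustration.
     */
    static void
    upsert_lock_dance(void)
    {
        ItemPointerData conflict_tid;

        for (;;)
        {
            HTSU_Result result;

            /* probe the unique index with in-memory value locks held */
            if (!find_conflict(&conflict_tid))
            {
                insert_and_release_value_locks();   /* no duplicate */
                return;
            }

            /* duplicate found: trade our value locks for a row lock */
            release_value_locks();
            result = lock_conflicting_row(&conflict_tid);

            if (result == HeapTupleUpdated)
            {
                /*
                 * The row changed in the tiny window after we released
                 * our value locks; the btree code's verdict may be
                 * stale, so start over from value locking.
                 */
                continue;
            }

            return;     /* row locked; the conflict we saw still stands */
        }
    }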

Maybe you can solve this one problem, but the fact remains that to do
so would be a pretty bad modularity violation, even by the standards
of the existing btree code. That's the basic reason why I'm averse to
using EvalPlanQual() in this fashion, or in a similar fashion. Even if
you solve all the problems for btree, I can't imagine what type of
burden it puts on amcanunique AM authors generally - I know at least
one person who won't be happy with that.  :-)

> 6. If the conflicting tuple has a valid xmax, wait for the deleting or
> locking transaction to end.  If xmax is still valid, follow the CTID
> chain to the updated tuple, let that be the new conflicting tuple, and
> resume from step 5.

So you've arbitrarily restricted us to one value lock and one row lock
per REPLACE slot processed, which sort of allows us to avoid solving
the basic problem of value locking, because it isn't too bad now - no
need to backtrack across indexes. Clean-up (marking the heap tuple
dead) is much more expensive than releasing locks in memory (although
much less expensive than promise tuple killing), but needing to clean
up is maybe less likely because conflicts can only come from one
unique index. Has this really bought us anything, though? Consider
that conflicts are generally only expected on one unique index anyway.
Plus you still have the disconnect between value and row locking, as
far as I can tell - "start from scratch" remains a possible step until
very late, except that you pay a lot more for clean-up. Avoiding that
expensive clean-up is the major benefit of introducing an SLRU-based
shadow value locking scheme to the btree code. I don't see that there
is a way to deal with the value locking/row locking disconnect other
than to live with it in a smart way.

Anyway, your design probably avoids the worst kind of gridlock. Let's
assume that it works out -- my next question has to be, where can we
go from there?

> 7. Update the tuple, even though it may be invisible to our snapshot
> (a deliberate MVCC violation!).

I realize that you just wanted to sketch a design, but offhand I think
that the basic problem with what you describe is that it doesn't
accept the inevitability of a disconnect between value and row
locking. Also, it doesn't fit with any roadmap for getting a real
upsert, and it compromises the conceptual integrity of the AM in a way
that isn't likely to be accepted. At the risk of saying too much
before you've defended your design, it perhaps even necessitates
invasive changes to the already extremely complicated row locking
code.

> While this behavior is admittedly wonky from an MVCC perspective, I
> suspect that it would make a lot of people happy.

"Wonky from an MVCC perspective" is the order of the day here. :-)

>> My guess is that they have some fancy snapshot type that is used by
>> the equivalent of ModifyTable subplans, that is appropriately paranoid
>> about the Halloween problem and so on. How that actually might work is
>> far from clear, but it's a question that I have begun to consider. As
>> I said, a concern is that it would be in tension with the generalized,
>> composable syntax, where we don't explicitly have a "special update".
>> I'd really like to hear other opinions, though.
>
> The tension here feels fairly fundamental to me; I don't think our
> implementation is to blame.  I think the problem isn't so much to
> figure out a clever trick that will make this all work in a truly
> elegant fashion as it is to decide exactly how we're going to
> compromise MVCC semantics in the least blatant way.

Yeah, I totally understand the problem that way. I think it would be a
bit of a pity to give up the composability, which I liked, but it's
something that we'll have to consider. On the other hand, perhaps we
can get away with it - we simply don't know enough yet.

-- 
Peter Geoghegan


