Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Date
Msg-id CAM3SWZRNPgZeTwHgzwGHXmC1aVi9=9A3Fi-WnQh=pS9uPjQJNQ@mail.gmail.com
Whole thread Raw
In response to Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
List pgsql-hackers
On Fri, Sep 13, 2013 at 12:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> The reason I wasn't saying "this will never get accepted" are twofold:
> a) I don't want to stiffle alternative ideas to the "promises" idea,
> just because I think it's the way to go. That might stop a better idea
> from being articulated. b) I am not actually in the position to say it's
> not going to be accepted.

Well, the reality is that the promises idea hasn't been described in
remotely enough detail to compare it to what I have here. I've pointed
out plenty of problems with it. After all, it was the first thing that
I considered, and I'm on the record talking about it in the 2012 dev
meeting. I didn't take that approach for many good reasons.

The reason I ended up here is not because I didn't get the memo about
holding buffer locks across complex operations being a bad thing. At
least grant me that. I'm here because in all these years no one has
come up with a suggestion that doesn't have some very major downsides.
Like, even worse than this.

>> As to the rules you refer to, you must mean "These locks are intended
>> to be short-term: they should not be held for long". I don't think
>> that they will ever be held for long. At least, when I've managed the
>> amount of work that a heap_insert() can do better. I expect to produce
>> a revision where toasting doesn't happen with the locks held soon.
>> Actually, I've already written the code, I just need to do some
>> testing.
>
> I personally think - and have stated so before - that doing a
> heap_insert() while holding the btree lock is unacceptable.

Presumably your reason is essentially that we exclusive lock a heap
buffer (exactly one heap buffer) while holding shared locks on btree
index buffers. Is that really so different to holding an exclusive
lock on a btree buffer while holding a shared lock on a heap buffer?
Because that's what _bt_check_unique() does today.

Now, I'll grant you that there is one appreciable difference, which is
that multiple unique indexes may be involved. But limiting ourselves
to the primary key or something like that remains an option. And I'm
not sure that it's really any worse anyway.

>> The btree code is different, though: It implements a well-defined
>> interface, with much clearer separation of concerns.
>
> Which you're completely violating by linking the btree buffer locking
> with the heap locking. It's not about the btree code alone.

You're right that it isn't about just the btree code.

In order for a deadlock to occur, there must be a mutual dependency.
What code could feel entitled to hold buffer locks on btree buffers
and heap buffers at the same time except the btree code itself? It
already does so. But no one else does the same thing. If anyone did
anything with a heap buffer lock held that could result in a call into
one of the btree access method functions (I'm not contemplating the
possibility of this other code locking the btree buffer *directly*),
I'm quite sure that that would be rejected outright today, because
that causes deadlocks. Certainly, vacuumlazy.c doesn't do it, for
example. Why would anyone ever want to do that anyway? I cannot think
of any reason. I suppose that that does still leave "transitive
dependencies", but now you're stretching things. After all, you're not
supposed to hold buffer locks for long! The dependency would have to
transit through, say, one of the system LWLocks used for WAL Logging.
Seems pretty close to impossible that it'd be an issue - index stuff
is only WAL-logged as index tuples are inserted (that is, as the locks
are finally released). Everyone automatically does that kind of thing
in a consistent order of locking, unlocking in the opposite order
anyway.

But what of the btree code deadlocking with itself? There are only a
few functions (2 or 3) where that's possible even in principle. I
think that they're going to be not too hard to analyze. For example,
with insertion, the trick is to always lock in a consistent order and
unlock/insert in the opposite order. The heap shared lock(s) needed in
the btree code cannot deadlock with another upserter because once the
other upserter has that exclusive heap buffer lock, it's *inevitable*
that it will release all of its shared buffer locks. Granted, I could
stand to think about this case more, but you get the idea - it *is*
possible to clamp down on the code that needs to care about this stuff
to a large degree. It's subtle, but btrees are generally considered
pretty complicated, and the btree code already cares about some odd
cases like these (it takes special precuations for catalog indexes,
for example).

The really weird thing about my patch is that the btree code trusts
the executor to call the heapam code to do the right thing in the
right way - it now knows more than I'd prefer. Would you be happier if
the btree code took more direct responsibility for the heap tuple
insertion instead? Before you say "that's ridiculous", consider the
big modularity violation that has always existed. It may be no more
ridiculous than that. And that existing state of affairs may be no
less ridiculous than living with what I've already done.

> At this point I am a bit confused why you are asking for review.

I am asking for us, collectively, through consensus, to resolve the
basic approach to doing this. That was something I stated right up
front, pointing out details of where the discussion had gone in the
past. That was my explicit goal. There has been plenty of discussing
on this down through the years, but nothing ever came from it.

Why is this an intractable problem for over a decade for us alone? Why
isn't this a problem for other database systems? I'm not implying that
it's because they do this. It's something that I am earnestly
interested in, though. A number of people have asked me that, and I
don't have a good answer for them.

>> I mean, if we do the promise tuple thing, and there are multiple
>> unique indexes, what happens when an inserter needs to block pending
>> the outcome of another transaction?  They had better go clean up the
>> promise tuples from the other unique indexes that they're trying to
>> insert into, because they cannot afford to hold value locks for a long
>> time, no matter how they're implemented.
>
> Why? We're using normal transaction visibility rules here. We don't stop
> *other* values on the same index getting updated or similar.

Because you're locking a value in some other, earlier unique index,
all the while waiting *indefinitely* on some other value in a second
or subsequent one. That isn't acceptable. A bunch of backends would
back up just because one backend had this contention on the second
unique index value that the others didn't actually have themselves. My
design allows those other backends to immediately go through and
finish.

Value locks have these kinds of hazards no matter how you implement
them. Deadlocks, and unreasonable stalling as described here is always
unacceptable - whether or not the problems are detected at runtime is
ultimately of marginal interest. Either way, it's a bug.

I think that the details of how this approach compare to others are
totally pertinent. For me, that's the whole point - getting towards
something that will balance all of these concerns and be acceptable.
Yes, it's entirely possible that that could look quite different to
what I have here. I do not want to reduce all this to a question of
"is this one design acceptable or not?". Am I not allowed to propose a
design to drive discussion? That's how the most important features get
implemented around here.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Kevin Grittner
Date:
Subject: Re: record identical operator
Next
From: Marko Tiikkaja
Date:
Subject: plpgsql.print_strict_params