Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Date
Msg-id CAM3SWZRyrS02qQq9mjvmzLVoiaCdoXM2WOQs=K31i=YV8GO9TA@mail.gmail.com
In response to Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
List pgsql-hackers
On Wed, Sep 11, 2013 at 2:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Nice to see the next version, won't have time to look in any details in
> the next few days tho.

Thanks Andres!

> I think for this approach to be workable you also need to explain how we
> can deal with stuff like toast insertion that may need to write hundreds
> of megabytes all the while leaving an entire value-range of the unique
> key share locked.

Right. That is a question that needs to be addressed in a future revision.

> I still think that even doing a plain heap insertion is longer than
> acceptable to hold even a share lock over a btree page

Well, there is really only one way of judging something like that, and
that's to do a benchmark. I still haven't taken the time to "pick
the low-hanging fruit" here that I'd mentioned - there are some
fairly obvious ways to shorten the window in which value locks are
held.
Furthermore, I'm sort of at a loss as to what a fair benchmark would
look like - what is actually representative here? Also, what's the
baseline? It's not as if someone has an alternative, competing patch.
We can only hypothesize what additional costs those other approaches
introduce, unless someone has a suggestion as to how they can be
simulated without writing the full patch, which is something I'd
entertain.

As I've already pointed out, all page splits occur with the same
buffer exclusive lock held. Only, in our case, we're weakening that
lock to a shared lock. So I don't think that the heap insertion is
going to be that big of a deal, particularly in the average case.
Having said that, it's a question that surely must be closely examined
before proceeding much further. And yes, the worst case could be
pretty bad, and that surely matters too.

> The easiest answer is doing the toasting before doing the index locking,
> but that will result in bloat, the avoidance of which seems to be the
> primary advantage of your approach.

I would say that the primary advantage of my approach is that it's
much simpler than any other approach that has been considered by
others in the past. The approach is easier to reason about because
it's really just an extension of how btrees already do value locking.
Granted, I haven't adequately demonstrated that things really are so
rosy, but I think I'll be able to. The key point is that, with
trivial exceptions, other parts of the code, like VACUUM, don't take
it upon themselves to acquire locks on btree buffers directly - they
go through the AM interface instead. What do they know about what
makes sense for a particular AM? The surface area actually turns out
to be fairly manageable.

With the promise tuple approach, it's more the maintainability
overhead of new *classes* of bloat that I'm concerned about than the
bloat itself, and all the edge cases that are likely to be introduced.
But yes, the overhead of doing all that extra writing (including
WAL-logging twice), and the fact that it all has to happen with an
exclusive lock on the leaf page buffer, are also concerns of mine.
With v3 of my patch, we still only have to do the preliminary work -
finding the right page and verifying that there are no duplicates -
once. So with recent revisions, the time spent holding exclusive
locks under my proposed approach is approximately half that of the
alternative proposals (assuming no page split is necessary). In the
worst case, the number of values locked on the leaf page is quite
localized and manageable, as a natural consequence of the fact that
it's a btree leaf page. I haven't run any numbers, but for an int4
btree (which really is the worst case here), 200 or so read-locked
values would be quite close to as bad as things get. Plus, if there
isn't a second phase of locking - which is on average a strong
possibility - those locks would hardly be held at all. Contrast that
with having to do lots of exclusive locking for all that clean-up.
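For context, the "200 or so" figure is consistent with a back-of-envelope calculation. The sizes below are assumed from the on-disk layout, not measured, and fillfactor and high-key overhead are ignored:

```python
# Roughly how many int4 index tuples fit on one default 8KB btree
# leaf page?  (Assumed sizes, not measured figures.)
BLCKSZ = 8192        # default PostgreSQL block size
PAGE_HEADER = 24     # page header at the start of the page
BTREE_SPECIAL = 16   # btree opaque data in the special space
LINE_POINTER = 4     # per-tuple item pointer in the line pointer array
INT4_ITUP = 16       # 8-byte index tuple header + int4 key, MAXALIGNed

usable = BLCKSZ - PAGE_HEADER - BTREE_SPECIAL
per_entry = LINE_POINTER + INT4_ITUP
print(usable // per_entry)  # ~407 entries on a completely full page
```

A leaf page that is roughly half full - for instance, just after a split - would therefore hold around 200 entries, which is where the figure above comes from.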

I might experiment with weakening the exclusive lock even earlier in
my next revision, and/or strengthening later. Off hand, I can't see a
reason for not weakening after we find the first leaf page that the
key might be on (granted, I haven't thought about it that much) -
_bt_check_unique() does not have license to alter the buffer already
proposed for insertion. Come to think of it, all of this new buffer
lock weakening/strengthening stuff might independently justify itself
as an optimization to regular btree index tuple insertion. That's a
whole other patch, though -- it's a big ambition to have as a sort of
incidental adjunct to what is already a big, complex patch.

In practice the vast majority of insertions don't involve TOASTing.
That's not an excuse for allowing the worst case to be really bad in
terms of its impact on query response time, but it may well justify
having whatever ameliorating measures we take result in bloat. It's at
least the kind of bloat we're more or less used to dealing with, and
have already invested a lot in controlling. Plus, bloat-wise, it
can't be any worse than inserting the tuple and having the
transaction abort on a duplicate, since with regular insertion that
already happens after TOASTing has done its work.

-- 
Peter Geoghegan


