Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - Mailing list pgsql-hackers

From Robert Haas
Subject Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Msg-id CA+TgmobzvesvFWy+Et0K3Y8TwthLan-dr9n3MBStHrZKd+TE1g@mail.gmail.com
In response to Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE  (Peter Geoghegan <pg@heroku.com>)
Responses Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
List pgsql-hackers
On Fri, Sep 13, 2013 at 2:59 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I would suggest letting those other individuals speak for themselves
> too. Particularly if you're going to name someone who is on vacation
> like that.

You seem to be under the impression that I'm mentioning Tom's name, or
Andres's, because I need to win some kind of an argument.  I don't.
We're not going to accept a patch that uses lwlocks in the way that
you are proposing.

> I mean, if we do the promise tuple
> thing, and there are multiple unique indexes, what happens when an
> inserter needs to block pending the outcome of another transaction?
> They had better go clean up the promise tuples from the other unique
> indexes that they're trying to insert into, because they cannot afford
> to hold value locks for a long time, no matter how they're
> implemented.

As Andres already pointed out, this is not correct.  Just to add to
what he said, we already have long-lasting value locks in the form of
SIREAD locks.  SIREAD locks can exist at different levels of
granularity, and one of those levels is index-page granularity, where
their function is to guard against concurrent insertions of values
that would fall within that page, which just so happens to be the same
thing you want to do here.  The difference between those locks and
what you're proposing here is that they are implemented differently.
That is why those were acceptable and this is not.

> That could take much longer than just releasing a shared
> buffer lock, since for each unique index the promise tuple must be
> re-found from scratch. There are huge issues with additional
> complexity and bloat. Oh, and now your lightweight locks aren't so
> lightweight any more.

Yep, totally agreed.  If you simply lock the buffer, or take some
other action which freezes out all concurrent modifications to the
page, then re-finding the lock is much simpler.  On the other hand,
it's much simpler precisely because you've reduced concurrency to the
degree necessary to make it simple.  And reducing concurrency is bad.
Similarly, complexity and bloat are not great things taken in
isolation, but many of our existing locking schemes are already very
complex.  Tuple locks result in a complex jig that involves locking
the tuple via the heavyweight lock manager, performing a WAL-logged
modification to the page, and then releasing the lock in the
heavyweight lock manager.  As here, that is way more expensive than
simply grabbing and holding a share-lock on the page.  But we get a
number of important benefits out of it.  The backend remains
interruptible while the tuple is locked, the protocol for granting
locks is FIFO to prevent starvation, we don't suppress page eviction
while the lock is held, we can simultaneously lock arbitrarily large
numbers of tuples, and deadlocks are detected and handled cleanly.  If
those requirements were negotiable, we would surely have negotiated
them away already, because the performance benefits would be immense.
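
To make a couple of those benefits concrete, here is a rough
illustration using plain row locking (big_table is just a made-up
name):

    BEGIN;
    -- Row-locks every tuple in the table.  This works even with millions
    -- of rows, because each lock is recorded on the tuple itself (xmax
    -- plus infomask bits) via a WAL-logged page change, not as a per-row
    -- entry in shared memory, and the scan stays cancellable
    -- (statement_timeout, user interrupt) the whole time.
    SELECT * FROM big_table FOR UPDATE;
    COMMIT;

That only works because the general-purpose protocol described above
is doing the bookkeeping.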

> If the value locks were made interruptible through some method, such
> as the promise tuples approach, does that really make deadlocking
> acceptable?

Yes.  It's not possible to prevent all deadlocks.  It IS possible to
make sure that they are properly detected and that precisely one of
the transactions involved is rolled back to resolve the deadlock.
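
For illustration, here's the classic two-session case (the accounts
table and its values are made up):

    -- Session 1
    BEGIN;
    UPDATE accounts SET balance = balance - 10 WHERE id = 1;

    -- Session 2
    BEGIN;
    UPDATE accounts SET balance = balance - 10 WHERE id = 2;
    UPDATE accounts SET balance = balance + 10 WHERE id = 1;  -- blocks on session 1

    -- Session 1
    UPDATE accounts SET balance = balance + 10 WHERE id = 2;  -- completes the cycle
    -- After deadlock_timeout, one of the two backends notices the cycle
    -- and aborts its own transaction with ERROR: deadlock detected; the
    -- other session's UPDATE is then granted and it can commit normally.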

> You can hardly compare a buffer's LWLock with a system one that
> protects critical shared memory structures. We're talking about a
> shared lock on a single btree leaf page per unique index involved in
> upserting.

Actually, I can and I am.  Buffers ARE critical shared memory structures.

>> A further problem is that a backend which holds even one lwlock can't
>> be interrupted.  We've had this argument before and it seems that you
>> don't think that non-interruptibility is a problem, but it is project
>> policy to allow for timely interrupts in all parts of the backend and
>> we're not going to change that policy for this patch.
>
> I don't think non-interruptibility is a problem? Really, do you think
> that this kind of inflammatory rhetoric helps anybody? I said nothing
> of the sort. I recall saying something about an engineering trade-off.
> Of course I value interruptibility.

I don't see what's inflammatory about that statement.  The point is
that this isn't the first time you've proposed a change which would
harm interruptibility and it isn't the first time I've objected on
precisely that basis.  Interruptibility is not a nice-to-have that we
can trade away from time to time; it's essential and non-negotiable.

> If you're concerned about non-interruptibility, consider XLogFlush().
> That does rather a lot of work with WALWriteLock exclusive locked. On
> a busy system, some backend is very frequently going to experience a
> non-interruptible wait for the duration of however long it takes to
> write and flush perhaps a whole segment. All other flushing backends
> are stuck in non-interruptible waits waiting for that backend to
> finish. I think that the group commit stuff might have regressed
> worst-case interruptibility for flushers by quite a bit; should we
> have never committed that, or do you agree with my view that it's
> worth it?

It wouldn't take a lot to convince me that it wasn't worth it, because
I was never all that excited about that patch to begin with.  I think
it mostly helps in extremely artificial situations that are not likely
to occur on real systems anyway.  But, yeah, WALWriteLock is a
problem, no doubt about it.  We should try to make the number of such
problems go down, not up, even if it means passing up new features
that we'd really like to have.

> In contrast, what I've proposed here is in general quite unlikely to
> result in any I/O for the duration of the time the locks are held.
> Only writers will be blocked. And only those inserting into a narrow
> range of values around the btree leaf page. Much of the work that even
> those writers need to do will be unimpeded anyway; they'll just block
> on attempting to acquire an exclusive lock on the first btree leaf
> page that the value they're inserting could be on.

Sure, but you're talking about broadening the problem from the guy
performing the insert to everybody who might be trying to do an insert
that hits one of the same unique-index pages.  Instead of holding one
buffer lock, the guy performing the insert is now holding as many
buffer locks as there are indexes.  That's a non-trivial issue.

For that matter, if the table has more than MAX_SIMUL_LWLOCKS indexes,
you'll error out.  In fact, if you get the number of indexes exactly
right, you'll exceed MAX_SIMUL_LWLOCKS in visibilitymap_clear() and
panic the whole system.

Oh, and if different backends load the index list in different orders,
because, say, the system catalog gets vacuumed between their respective
relcache loads, then they may try to lock the indexes in different
orders and cause an undetected deadlock.
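
To see what detection buys you, compare the same ordering mistake made
with heavyweight locks (a and b are made-up table names; LOCK TABLE
defaults to ACCESS EXCLUSIVE mode):

    -- Session 1
    BEGIN;
    LOCK TABLE a;

    -- Session 2
    BEGIN;
    LOCK TABLE b;
    LOCK TABLE a;   -- blocks behind session 1

    -- Session 1
    LOCK TABLE b;   -- blocks behind session 2, completing the cycle
    -- After deadlock_timeout, one session fails with ERROR: deadlock
    -- detected and the other session's LOCK is granted.

With lwlocks there is no deadlock detector watching, so the equivalent
ordering mistake just leaves both backends stuck, uninterruptibly, with
no error and nothing short of killing the backends to clear it.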

And, drifting a bit further off-topic, even to get as far as you have,
you've added overhead to every lwlock acquisition and release, even
for people who never use this functionality.  I'm pretty skeptical
about anything that involves adding additional frammishes to the
lwlock mechanism.  There are a few new primitives I'd like, too, but
every one we add slows things down for everybody.

> And the additional
> non-interruptible wait of those inserters won't be terribly much more
> than the wait of the backend where heap tuple insertion takes a long
> time anyway - that guy already has to do close to 100% of that work
> with a non-interruptible wait today (once we eliminate
> heap_prepare_insert() and toasting). The UnlockReleaseBuffer() call is
> right at the end of heap_insert, and the buffer is pinned and locked
> very close to the start.

That's true but somewhat misleading.  Textually, most of the function
runs with the buffer lock held, but heap_prepare_insert(),
CheckForSerializableConflictIn(), RelationGetBufferForTuple(), and
XLogWrite() are the parts that do substantial amounts of computation,
and only the last of those happens while holding the buffer lock.  And
that last is really fundamental, because we can't let any other
backend see the modified buffer until we've xlog'd the changes.  The
problems you're proposing to create do not fall into the same
category.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


