Thread: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
The attached patch implements INSERT...ON DUPLICATE KEY LOCK FOR
UPDATE. This is similar to INSERT...ON DUPLICATE KEY IGNORE (which is
also proposed as part of this new revision of the patch), but
additionally acquires a row exclusive lock on the row that prevents
insertion from proceeding in respect of some tuple proposed for
insertion.

This feature offers something that I believe could be reasonably
described as upsert. Consider:

postgres=# create table foo(a int4 primary key, b text);
CREATE TABLE
postgres=# with r as (
insert into foo(a,b)
values (5, '!'), (6, '@')
on duplicate key lock for update
returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;
UPDATE 0

Here there are 0 rows affected by the update, because all work was
done in the insert. If I do it again, 2 rows are affected by the
update:

postgres=# with r as (
insert into foo(a,b)
values (5, '!'), (6, '@')
on duplicate key lock for update
returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;
UPDATE 2

Obviously, rejects were now projected into the wCTE, and the
underlying rows were locked. The idea is that we can update the rows,
confident that each rejection-causing row will be updated in a
race-free fashion. I personally prefer this to something like MySQL's
INSERT...ON DUPLICATE KEY UPDATE, because it's more flexible. For
example, we could have deleted the locked rows instead, if that
happened to make sense. Making this kind of usage idiomatic feels to
me like the Postgres way to do upsert. Others may differ here. I will
however concede that it'll be unfortunate to not have some MySQL
compatibility, for the benefit of people porting widely used web
frameworks.

I'm not really sure if I should have done something smarter here than
lock the first duplicate found, or if it's okay that that's all I do.
That's another discussion entirely. Though previously Andres and I did
cover the question of prioritizing unique indexes, so that the most
sensible duplicate for the particular situation was returned,
according to some criteria.

As previously covered, I felt that including a row locking component
was essential to reasoning about our requirements for what I've termed
"speculative insertion" -- the basic implementation of value locking
that is needed to make all this work. As I said in that earlier
thread, there are many opinions about this, and it isn't obvious which
one is right. Any approach needs to have its interactions with row
locking considered right up front. Those who consider this a new
patch with new functionality, or even a premature expansion on what
I've already posted, should carefully consider that. Do we really want
to assume that these two things are orthogonal? I think that they're
probably not, but even if that turns out not to be the case, it's an
unnecessary risk to take.

Row locking
==========

Row locking is implemented with calls to a new function above
ExecInsert. We don't bother with the usual EvalPlanQual looping
pattern for now, preferring to just re-check from scratch if there is
a concurrent update from another session (see comments in
ExecLockHeapTupleForUpdateSpec() for details). We might do better
here. I haven't considered the row locking functionality in too much
detail since the last revision, preferring to focus on value locking.
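
To illustrate the shape of that retry approach, here is a minimal
sketch - schematic only, not the patch's actual code. It assumes the
existing heap_lock_tuple() interface, and the variable and label names
are made up:

    HeapTupleData tuple;
    HTSU_Result   test;
    Buffer        buffer;
    HeapUpdateFailureData hufd;

retry:
    /* speculative insertion found a duplicate at 'conflict_tid' */
    tuple.t_self = conflict_tid;
    test = heap_lock_tuple(relation, &tuple, estate->es_output_cid,
                           LockTupleExclusive,
                           false,   /* nowait=false: block on updaters */
                           true,    /* follow any update chain */
                           &buffer, &hufd);
    switch (test)
    {
        case HeapTupleMayBeUpdated:
            break;          /* row locked; project it as a reject */
        case HeapTupleUpdated:
            ReleaseBuffer(buffer);
            goto retry;     /* concurrent update: re-check from scratch */
        default:
            elog(ERROR, "unexpected heap_lock_tuple result: %d", test);
    }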

Buffer locking/value locking
======================

Andres raised concerns about the previous patch's use of exclusive
buffer locks for extended periods (i.e. during a single heap tuple
insertion). These locks served as extended value locks. With this
revision, we don't hold exclusive buffer locks for the duration of
heap insertion - we hold shared buffer locks instead. I believe that
Andres' principal concern was the impact on concurrent index scans by
readers, so I think that all of this will go some way towards
alleviating his concerns generally.

This necessitated inventing entirely new LWLock semantics around
"weakening" (from exclusive to shared) and "strengthening" (from
shared to exclusive) of locks already held. Of course, as you'd
expect, there are some tricky race hazards surrounding these new
functions that clients need to be mindful of. These have been
documented within lwlock.c.

I looked for a precedent for these semantics, and found a few. Perhaps
the most prominent was Boost, a highly regarded, peer-reviewed C++
library. Boost implements exactly these semantics for some of its
thread synchronization/mutex primitives:


http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable

They have a concept of upgradable ownership, which is just like shared
ownership, except, I gather, that the owner reserves the exclusive
right to upgrade to an exclusive lock (for them it's not quite an
exclusive lock; it's an upgradable/downgradable exclusive lock). My
solution is to push that responsibility onto the client - I admonish
something along the lines of "don't let more than one shared locker do
this at a time per LWLock". I am of course mindful of this caveat in
my modifications to the btree code, where I "weaken" and then later
"strengthen" an exclusive lock - the trick here is that before I
weaken I get a regular exclusive lock, and I only actually weaken
after that when going ahead with insertion.

I suspect that this may not be the only place where this trick is helpful.

This intended usage is described in the relevant comments added to lwlock.c.
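
Roughly, the intended usage pattern is as follows - this is a sketch,
not the patch itself, and I'm assuming here that LWLockWeaken() and
LWLockStrengthen() take the lock id, mirroring LWLockRelease():

    /* find insertion point, check for duplicates, as today: */
    LWLockAcquire(buf_content_lock, LW_EXCLUSIVE);

    /*
     * No duplicate: weaken to shared before going off to insert the
     * heap tuple. Readers can now scan the page, but no writer can
     * change it, so our insertion point cannot go stale.
     */
    LWLockWeaken(buf_content_lock);

    /* ... heap tuple insertion happens here, under shared lock ... */

    /*
     * Strengthen back to exclusive for the index tuple insertion
     * proper. Safe only because no other shared locker attempts the
     * same thing: every would-be strengthener starts, as we did,
     * from the exclusive lock.
     */
    LWLockStrengthen(buf_content_lock);
    /* ... re-verify position, insert index tuple ... */
    LWLockRelease(buf_content_lock);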

Testing
======

This time around, in order to build confidence in the new LWLock
infrastructure for buffer locking, on debug builds we re-verify that
the value proposed for insertion on the locked page is in fact not on
that page as expected during the second phase, and that our previous
insertion point calculation is still considered correct. This is kind
of like the way we re-verify the wait-queue-is-in-lsn-order
invariant in syncrep.c on debug builds. It's really a fancier
assertion - it doesn't just test the state of scalar variables.

This was invaluable during development of the new LWLock infrastructure.
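
The shape of the re-check is roughly the following - illustrative
only; _bt_isequal() is nbtinsert.c's existing local helper, and the
cached_offset/itup_scankey names are made up:

    #ifdef USE_ASSERT_CHECKING
        /*
         * Phase two: the insertion point computed in phase one should
         * be where a fresh binary search still lands, and the value
         * should still be absent at that offset.
         */
        Assert(_bt_binsrch(rel, buf, natts, itup_scankey, false) ==
               cached_offset);
        Assert(!_bt_isequal(itupdesc, BufferGetPage(buf), cached_offset,
                            natts, itup_scankey));
    #endif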

Just as before, but this time with just shared buffer locks held
during heap tuple insertion, the patch has resisted considerable
brute-force efforts to break it (e.g. using pgbench to get many
sessions speculatively inserting values into a table; many different
INSERT... ON DUPLICATE KEY LOCK FOR UPDATE statements, interspersed
with UPDATE, DELETE and SELECT statements; checking whether spurious
duplicate tuple insertions, deadlocks, or assertion failures occur).

As always, isolation tests are included.

Bugs
====

I fixed the bug that Andres reported in relation to multiple unique
indexes' interaction with waits for another transaction's end during
speculative insertion.

I did not get around to fixing the broken ecpg regression tests, as
reported by Peter Eisentraut. I was a little puzzled by the problem
there. I'll return to it in a while, or perhaps someone else can
propose a solution.

Thoughts?

--
Peter Geoghegan


Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sun, Sep 8, 2013 at 10:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
> This necessitated inventing entirely new LWLock semantics around
> "weakening" (from exclusive to shared) and "strengthening" (from
> shared to exclusive) of locks already held. Of course, as you'd
> expect, there are some tricky race hazards surrounding these new
> functions that clients need to be mindful of. These have been
> documented within lwlock.c.

I've since found that I can fairly reliably get this to deadlock at
high client counts (say, 95, which will do it on my 4 core laptop with
a little patience). To get this to happen, I used pgbench with a
single INSERT...ON DUPLICATE KEY IGNORE transaction script. The more
varied workload that I tested this most recent revision (v2) with the
most, with a transaction consisting of a mixture of different
statements (UPDATEs, DELETEs, INSERT...ON DUPLICATE KEY LOCK FOR
UPDATE) did not show the problem.

What I've been doing to recreate this is pgbench runs in an infinite
loop from a bash script, with a new table created for each iteration.
Each iteration has 95 clients "speculatively insert" a total of 1500
possible tuples for 15 seconds. After this period, the table has
exactly 1500 tuples, with primary key values 1 - 1500. Usually, after
about 5 - 20 minutes, deadlock occurs.

This was never a problem with the exclusive lock coding (v1),
unsurprisingly - after all, as far as buffer locks are concerned, it
did much the same thing as the existing code.

I've made some adjustments to LWLockWeaken, LWLockStrengthen and
LWLockRelease that made the deadlocks go away. Or at least, no
deadlocks or other problems manifested themselves using the same test
case for over two hours. Attached revision includes these changes, as
well as a few minor comment tweaks here and there.

I am working on an analysis of the broader deadlock hazards - the
implications of simultaneously holding multiple shared buffer locks
(that is, one for every unique index btree leaf page participating in
value locking) for the duration of each heap tuple insertion (each
heap_insert() call). I'm particularly looking for unexpected ways in
which this locking could interact with other parts of the code that
also acquire buffer locks, for example vacuumlazy.c. I'll also try and
estimate how much of a maintainability burden unexpected locking
interactions with these other subsystems might be.

In case it isn't obvious, the deadlocking issue addressed by this
revision is not inherent to my design or anything like that - the bugs
fixed by this revision are entirely confined to lwlock.c.

--
Peter Geoghegan


Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
Hi Peter,

Nice to see the next version, won't have time to look into any details
in the next few days tho.

On 2013-09-10 22:25:34 -0700, Peter Geoghegan wrote:
> I am working on an analysis of the broader deadlock hazards - the
> implications of simultaneously holding multiple shared buffer locks
> (that is, one for every unique index btree leaf page participating in
> value locking) for the duration of each heap tuple insertion (each
> heap_insert() call). I'm particularly looking for unexpected ways in
> which this locking could interact with other parts of the code that
> also acquire buffer locks, for example vacuumlazy.c. I'll also try and
> estimate how much of a maintainability burden unexpected locking
> interactions with these other subsystems might be.

I think for this approach to be workable you also need to explain how we
can deal with stuff like toast insertion that may need to write hundreds
of megabytes all the while leaving an entire value-range of the unique
key share locked.

I still think that even a plain heap insertion takes longer than it's
acceptable to hold even a share lock on a btree page, but as long as
stuff like toast insertions can happen while doing so, that's peanuts.

The easiest answer is doing the toasting before doing the index locking,
but that will result in bloat, the avoidance of which seems to be the
primary advantage of your approach.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Wed, Sep 11, 2013 at 2:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Nice to see the next version, won't have time to look into any details
> in the next few days tho.

Thanks Andres!

> I think for this approach to be workable you also need to explain how we
> can deal with stuff like toast insertion that may need to write hundreds
> of megabytes all the while leaving an entire value-range of the unique
> key share locked.

Right. That is a question that needs to be addressed in a future revision.

> I still think that even a plain heap insertion takes longer than it's
> acceptable to hold even a share lock on a btree page

Well, there is really only one way of judging something like that, and
that's to do a benchmark. I still haven't taken the time to "pick the
low hanging fruit" here that I'd mentioned - there are some fairly
obvious ways to shorten the window in which value locks are held.
Furthermore, I'm sort of at a loss as to what a fair benchmark would
look like - what is actually representative here? Also, what's the
baseline? It's not as if someone has an alternative, competing patch.
We can only hypothesize what additional costs those other approaches
introduce, unless someone has a suggestion as to how they can be
simulated without writing the full patch, which is something I'd
entertain.

As I've already pointed out, all page splits occur with the same
buffer exclusive lock held. Only, in our case, we're weakening that
lock to a shared lock. So I don't think that the heap insertion is
going to be that big of a deal, particularly in the average case.
Having said that, it's a question that surely must be closely examined
before proceeding much further. And yes, the worst case could be
pretty bad, and that surely matters too.

> The easiest answer is doing the toasting before doing the index locking,
> but that will result in bloat, the avoidance of which seems to be the
> primary advantage of your approach.

I would say that the primary advantage of my approach is that it's
much simpler than any other approach that has been considered by
others in the past. The approach is easier to reason about because
it's really just an extension of how btrees already do value locking.
Granted, I haven't adequately demonstrated that things really are so
rosy, but I think I'll be able to. The key point is that with trivial
exception, all other parts of the code, like VACUUM, don't consider
themselves to directly have license to acquire locks on btree buffers
- they go through the AM interface instead. What do they know about
what makes sense for a particular AM? The surface area actually turns
out to be fairly manageable.

With the promise tuple approach, it's more the maintainability
overhead of new *classes* of bloat that I'm concerned about than the
bloat itself, and all the edge cases that are likely to be introduced.
But yes, the overhead of doing all that extra writing (including
WAL-logging twice), and the fact that it all has to happen with an
exclusive lock on the leaf page buffer is also a concern of mine. With
v3 of my patch, we still only have to do all the preliminary work like
finding the right page and verifying that there are no duplicates
once. So with recent revisions, the amount of time spent exclusive
locking with my proposed approach is now approximately half the time
of alternative proposals (assuming no page split is necessary). In the
worst case, the number of values locked on the leaf page is quite
localized and manageable, as a natural consequence of the fact that
it's a btree leaf page. I haven't run any numbers, but for an int4
btree (which really is the worst case here), 200 or so read-locked
values would be quite close to as bad as things got. Plus, if there
isn't a second phase of locking, which is on average a strong
possibility, those locks would be hardly held at all - contrast that
with having to do lots of exclusive locking for all that clean-up.
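
For what it's worth, a back-of-the-envelope version of that estimate
(rough arithmetic only, assuming default 8KB pages and 64-bit
MAXALIGN):

    per entry: 8 byte IndexTuple header + 4 byte int4 key, MAXALIGNed
               to 16 bytes, plus a 4 byte line pointer = 20 bytes
    per page:  8192 bytes, less the 24 byte page header and 16 byte
               btree special area ~= 8150 usable bytes
    =>         ~400 entries fully packed; at the 50-70% fill typical
               of leaf pages after splits, roughly 200-280 entries

So 200 or so is about the right order.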

I might experiment with weakening the exclusive lock even earlier in
my next revision, and/or strengthening later. Off hand, I can't see a
reason for not weakening after we find the first leaf page that the
key might be on (granted, I haven't thought about it that much) -
_bt_check_unique() does not have license to alter the buffer already
proposed for insertion. Come to think of it, all of this new buffer
lock weakening/strengthening stuff might independently justify itself
as an optimization to regular btree index tuple insertion. That's a
whole other patch, though -- it's a big ambition to have as a sort of
incidental adjunct to what is already a big, complex patch.

In practice the vast majority of insertions don't involve TOASTing.
That's not an excuse for allowing the worst case to be really bad in
terms of its impact on query response time, but it may well justify
having whatever ameliorating measures we take result in bloat. It's at
least the kind of bloat we're more or less used to dealing with, and
have already invested a lot in controlling. Plus bloat-wise it can't
be any worse than just inserting the tuple and having the transaction
abort on a duplicate, since that already happens after toasting has
done its work with regular insertion.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Wed, Sep 11, 2013 at 8:47 PM, Peter Geoghegan <pg@heroku.com> wrote:
> In practice the vast majority of insertions don't involve TOASTing.
> That's not an excuse for allowing the worst case to be really bad in
> terms of its impact on query response time, but it may well justify
> having whatever ameliorating measures we take result in bloat. It's at
> least the kind of bloat we're more or less used to dealing with, and
> have already invested a lot in controlling. Plus bloat-wise it can't
> be any worse than just inserting the tuple and having the transaction
> abort on a duplicate, since that already happens after toasting has
> done its work with regular insertion.

Andres is being very polite here, but the reality is that this
approach has zero chance of being accepted.  You can't hold buffer
locks for a long period of time across complex operations.  Full stop.
It's a violation of the rules that are clearly documented in
src/backend/storage/buffer/README, which have been in place for a very
long time, and this patch is nowhere near important enough to warrant
a revision of those rules.  We are not going to risk breaking every
bit of code anywhere in the backend or in third-party code that takes
a buffer lock.  You are never going to convince me, or Tom, that the
benefit of doing that is worth the risk; in fact, I have a hard time
believing that you'll find ANY committer who thinks this approach is
worth considering.

Even if you get the code to run without apparent deadlocks, that
doesn't mean there aren't any; it just means that you haven't found
them all yet.  And even if you managed to squash every such hazard
that exists today, so what?  Fundamentally, locking protocols that
don't include deadlock detection don't scale.  You can use such locks
in limited contexts where proofs of correctness are straightforward,
but trying to stretch them beyond that point results not only in bugs,
but also in bad performance and unmaintainable code.  With a much more
complex locking regimen, even if your code is absolutely bug-free,
you've put a burden on the next guy who wants to change anything; how
will he avoid breaking things?  Our buffer locking regimen suffers
from painful complexity and serious maintenance difficulties as is.
Moreover, we've already got performance and scalability problems that
are attributable to every backend in the system piling up waiting on a
single lwlock, or a group of simultaneously-held lwlocks.
Dramatically broadening the scope of where lwlocks are used and for
how long they're held is going to make that a whole lot worse.  What's
worse, the problems will be subtle, restricted to the people using
this feature, and very difficult to measure on production systems, and
I have no confidence they'd ever get fixed.

A further problem is that a backend which holds even one lwlock can't
be interrupted.  We've had this argument before and it seems that you
don't think that non-interruptibility is a problem, but it is project
policy to allow for timely interrupts in all parts of the backend and
we're not going to change that policy for this patch.  Heavyweight
locks are heavy weight precisely because they provide services - like
deadlock detection, satisfactory interrupt handling, and, also
importantly, FIFO queuing behavior - that are *important* for locks
that are held over an extended period of time.  We're not going to go
put those services into the lightweight lock mechanism because then it
would no longer be light weight, and we're not going to ignore the
importance of them, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Sep 13, 2013 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Andres is being very polite here, but the reality is that this
> approach has zero chance of being accepted.

I quite like Andres, but I have yet to see him behave as you describe
in a situation where someone proposed what was fundamentally a bad
idea. Maybe you should let him speak for himself?

> You can't hold buffer
> locks for a long period of time across complex operations.  Full stop.
>  It's a violation of the rules that are clearly documented in
> src/backend/storage/buffer/README, which have been in place for a very
> long time, and this patch is nowhere near important enough to warrant
> a revision of those rules.

The importance of this patch is a value judgement. Our users have been
screaming for this for over ten years, so to my mind it has a fairly
high importance. Also, every other database system of every stripe
worth mentioning has something approximately equivalent to this,
including ones with much less functionality generally. The fact that
we don't is a really unfortunate omission.

As to the rules you refer to, you must mean "These locks are intended
to be short-term: they should not be held for long". I don't think
that they will ever be held for long. At least, when I've managed the
amount of work that a heap_insert() can do better. I expect to produce
a revision where toasting doesn't happen with the locks held soon.
Actually, I've already written the code, I just need to do some
testing.

> We are not going to risk breaking every
> bit of code anywhere in the backend or in third-party code that takes
> a buffer lock.  You are never going to convince me, or Tom, that the
> benefit of doing that is worth the risk; in fact, I have a hard time
> believing that you'll find ANY committer who thinks this approach is
> worth considering.

I would suggest letting those other individuals speak for themselves
too. Particularly if you're going to name someone who is on vacation
like that.

> Even if you get the code to run without apparent deadlocks, that
> doesn't mean there aren't any;

Of course it doesn't. Who said otherwise?

> Our buffer locking regimen suffers
> from painful complexity and serious maintenance difficulties as is.

That's true to a point, but it has more to do with things like how
VACUUM interacts with hio.c. Things like this:

/*
 * Release the file-extension lock; it's now OK for someone else to extend
 * the relation some more.  Note that we cannot release this lock before
 * we have buffer lock on the new page, or we risk a race condition
 * against vacuumlazy.c --- see comments therein.
 */
if (needLock)
    UnlockRelationForExtension(relation, ExclusiveLock);

The btree code is different, though: It implements a well-defined
interface, with much clearer separation of concerns. As I've said
already, with trivial exception (think contrib), no external code
considers itself to have license to obtain locks of any sort on btree
buffers. No external code of ours - without exception - does anything
with multiple locks, or exclusive locks on btree buffers. I'll remind
you that I'm only holding shared locks when control is outside of the
btree code.

Even within the btree code, the number of access method functions that
could conflict with what I do here (that acquire exclusive locks) is
very small when you exclude things that only exclusive lock the
meta-page (there are also very few of those). So the surface area is
quite small.

I'm not denying that there is a cost, or that I haven't expanded
things in a direction I'd prefer not to. I just think that it may well
be worth it, particularly when you consider the alternatives - this
may well be the least worst thing. I mean, if we do the promise tuple
thing, and there are multiple unique indexes, what happens when an
inserter needs to block pending the outcome of another transaction?
They had better go clean up the promise tuples from the other unique
indexes that they're trying to insert into, because they cannot afford
to hold value locks for a long time, no matter how they're
implemented. That could take much longer than just releasing a shared
buffer lock, since for each unique index the promise tuple must be
re-found from scratch. There are huge issues with additional
complexity and bloat. Oh, and now your lightweight locks aren't so
lightweight any more.

If the value locks were made interruptible through some method, such
as the promise tuples approach, does that really make deadlocking
acceptable? So at least your system didn't seize up. But on the other
hand, the user randomly had a deadlock error through no fault of their
own. The former may be worse, but the latter is also inexcusable. In
general, the best solution is just to not have deadlock hazards. I
wouldn't be surprised if reasoning about deadlocking was harder with
that alternative approach to value locking, not easier.

> Moreover, we've already got performance and scalability problems that
> are attributable to every backend in the system piling up waiting on a
> single lwlock, or a group of simultaneously-held lwlocks.
> Dramatically broadening the scope of where lwlocks are used and for
> how long they're held is going to make that a whole lot worse.

You can hardly compare a buffer's LWLock with a system one that
protects critical shared memory structures. We're talking about a
shared lock on a single btree leaf page per unique index involved in
upserting.

> A further problem is that a backend which holds even one lwlock can't
> be interrupted.  We've had this argument before and it seems that you
> don't think that non-interruptibility is a problem, but it is project
> policy to allow for timely interrupts in all parts of the backend and
> we're not going to change that policy for this patch.

I don't think non-interruptibility is a problem? Really, do you think
that this kind of inflammatory rhetoric helps anybody? I said nothing
of the sort. I recall saying something about an engineering trade-off.
Of course I value interruptibility.

If you're concerned about non-interruptibility, consider XLogFlush().
That does rather a lot of work with WALWriteLock exclusive locked. On
a busy system, some backend is very frequently going to experience a
non-interruptible wait for the duration of however long it takes to
write and flush perhaps a whole segment. All other flushing backends
are stuck in non-interruptible waits waiting for that backend to
finish. I think that the group commit stuff might have regressed
worst-case interruptibility for flushers by quite a bit; should we
have never committed that, or do you agree with my view that it's
worth it?

In contrast, what I've proposed here is in general quite unlikely to
result in any I/O for the duration of the time the locks are held.
Only writers will be blocked. And only those inserting into a narrow
range of values around the btree leaf page. Much of the work that even
those writers need to do will be unimpeded anyway; they'll just block
on attempting to acquire an exclusive lock on the first btree leaf
page that the value they're inserting could be on. And the additional
non-interruptible wait of those inserters won't be terribly much more
than the wait of the backend where heap tuple insertion takes a long
time anyway - that guy already has to do close to 100% of that work
with a non-interruptible wait today (once we eliminate
heap_prepare_insert() and toasting). The UnlockReleaseBuffer() call is
right at the end of heap_insert, and the buffer is pinned and locked
very close to the start.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Stephen Frost
Date:
* Peter Geoghegan (pg@heroku.com) wrote:
> I would suggest letting those other individuals speak for themselves
> too. Particularly if you're going to name someone who is on vacation
> like that.

It was my first concern regarding this patch.
Thanks,
    Stephen

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Sep 13, 2013 at 12:14 PM, Stephen Frost <sfrost@snowman.net> wrote:
> It was my first concern regarding this patch.

It was my first concern too.


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-09-13 11:59:54 -0700, Peter Geoghegan wrote:
> On Fri, Sep 13, 2013 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Andres is being very polite here, but the reality is that this
> > approach has zero chance of being accepted.
> 
> I quite like Andres, but I have yet to see him behave as you describe
> in a situation where someone proposed what was fundamentally a bad
> idea. Maybe you should let him speak for himself?

Unfortunately I have to agree with Robert here, I think it's a complete
no-go to do what you propose so far, and I've several times now presented
arguments why I think so.
The reasons I wasn't saying "this will never get accepted" are twofold:
a) I don't want to stifle alternative ideas to the "promises" idea,
just because I think it's the way to go. That might stop a better idea
from being articulated. b) I am not actually in the position to say it's
not going to be accepted.

*I* think that unless you make some fundamental and very, very clever
modifications to your algorithm that end up *not holding a lock over
other operations at all*, it's not going to get committed. And I'll chip
in with my -1.
And clever modification doesn't mean slightly restructuring heapam.c's
operations.

> The importance of this patch is a value judgement. Our users have been
> screaming for this for over ten years, so to my mind it has a fairly
> high importance. Also, every other database system of every stripe
> worth mentioning has something approximately equivalent to this,
> including ones with much less functionality generally. The fact that
> we don't is a really unfortunate omission.

I agree it's quite important, but that doesn't mean we have to do stuff
that we think is unacceptable, especially as there *are* other ways to
do it.

> As to the rules you refer to, you must mean "These locks are intended
> to be short-term: they should not be held for long". I don't think
> that they will ever be held for long. At least, when I've managed the
> amount of work that a heap_insert() can do better. I expect to produce
> a revision where toasting doesn't happen with the locks held soon.
> Actually, I've already written the code, I just need to do some
> testing.

I personally think - and have stated so before - that doing a
heap_insert() while holding the btree lock is unacceptable.

> The btree code is different, though: It implements a well-defined
> interface, with much clearer separation of concerns.

Which you're completely violating by linking the btree buffer locking
with the heap locking. It's not about the btree code alone.

At this point I am a bit confused why you are asking for review.

> I mean, if we do the promise tuple thing, and there are multiple
> unique indexes, what happens when an inserter needs to block pending
> the outcome of another transaction?  They had better go clean up the
> promise tuples from the other unique indexes that they're trying to
> insert into, because they cannot afford to hold value locks for a long
> time, no matter how they're implemented.

Why? We're using normal transaction visibility rules here. We don't stop
*other* values on the same index getting updated or similar.
And anyway, it doesn't matter which problems the "promises" idea
has. We're discussing your proposal here.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Sep 13, 2013 at 12:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> The reasons I wasn't saying "this will never get accepted" are twofold:
> a) I don't want to stifle alternative ideas to the "promises" idea,
> just because I think it's the way to go. That might stop a better idea
> from being articulated. b) I am not actually in the position to say it's
> not going to be accepted.

Well, the reality is that the promises idea hasn't been described in
remotely enough detail to compare it to what I have here. I've pointed
out plenty of problems with it. After all, it was the first thing that
I considered, and I'm on the record talking about it in the 2012 dev
meeting. I didn't take that approach for many good reasons.

The reason I ended up here is not because I didn't get the memo about
holding buffer locks across complex operations being a bad thing. At
least grant me that. I'm here because in all these years no one has
come up with a suggestion that doesn't have some very major downsides.
Like, even worse than this.

>> As to the rules you refer to, you must mean "These locks are intended
>> to be short-term: they should not be held for long". I don't think
>> that they will ever be held for long. At least, when I've managed the
>> amount of work that a heap_insert() can do better. I expect to produce
>> a revision where toasting doesn't happen with the locks held soon.
>> Actually, I've already written the code, I just need to do some
>> testing.
>
> I personally think - and have stated so before - that doing a
> heap_insert() while holding the btree lock is unacceptable.

Presumably your reason is essentially that we exclusive lock a heap
buffer (exactly one heap buffer) while holding shared locks on btree
index buffers. Is that really so different to holding an exclusive
lock on a btree buffer while holding a shared lock on a heap buffer?
Because that's what _bt_check_unique() does today.

Now, I'll grant you that there is one appreciable difference, which is
that multiple unique indexes may be involved. But limiting ourselves
to the primary key or something like that remains an option. And I'm
not sure that it's really any worse anyway.

>> The btree code is different, though: It implements a well-defined
>> interface, with much clearer separation of concerns.
>
> Which you're completely violating by linking the btree buffer locking
> with the heap locking. It's not about the btree code alone.

You're right that it isn't about just the btree code.

In order for a deadlock to occur, there must be a mutual dependency.
What code could feel entitled to hold buffer locks on btree buffers
and heap buffers at the same time except the btree code itself? It
already does so. But no one else does the same thing. If anyone did
anything with a heap buffer lock held that could result in a call into
one of the btree access method functions (I'm not contemplating the
possibility of this other code locking the btree buffer *directly*),
I'm quite sure that that would be rejected outright today, because
that causes deadlocks. Certainly, vacuumlazy.c doesn't do it, for
example. Why would anyone ever want to do that anyway? I cannot think
of any reason. I suppose that that does still leave "transitive
dependencies", but now you're stretching things. After all, you're not
supposed to hold buffer locks for long! The dependency would have to
transit through, say, one of the system LWLocks used for WAL Logging.
Seems pretty close to impossible that it'd be an issue - index stuff
is only WAL-logged as index tuples are inserted (that is, as the locks
are finally released). Everyone automatically does that kind of thing
in a consistent order of locking, unlocking in the opposite order
anyway.

But what of the btree code deadlocking with itself? There are only a
few functions (2 or 3) where that's possible even in principle. I
think that they're going to be not too hard to analyze. For example,
with insertion, the trick is to always lock in a consistent order and
unlock/insert in the opposite order. The heap shared lock(s) needed in
the btree code cannot deadlock with another upserter because once the
other upserter has that exclusive heap buffer lock, it's *inevitable*
that it will release all of its shared buffer locks. Granted, I could
stand to think about this case more, but you get the idea - it *is*
possible to clamp down on the code that needs to care about this stuff
to a large degree. It's subtle, but btrees are generally considered
pretty complicated, and the btree code already cares about some odd
cases like these (it takes special precautions for catalog indexes,
for example).
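
Schematically, the discipline for insertion is something like this
(a sketch with hypothetical helpers, not the patch's code):

    int     i;

    /*
     * Value-lock every unique index in one agreed order (say,
     * ascending OID); that way no two upserters can each hold a
     * lock the other needs.
     */
    for (i = 0; i < nindexes; i++)
        value_lock(indexes[i]);

    /* ... exclusive-lock one heap buffer, insert the heap tuple ... */

    /* release in the opposite order */
    for (i = nindexes - 1; i >= 0; i--)
        value_unlock(indexes[i]);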

The really weird thing about my patch is that the btree code trusts
the executor to call the heapam code to do the right thing in the
right way - it now knows more than I'd prefer. Would you be happier if
the btree code took more direct responsibility for the heap tuple
insertion instead? Before you say "that's ridiculous", consider the
big modularity violation that has always existed. It may be no more
ridiculous than that. And that existing state of affairs may be no
less ridiculous than living with what I've already done.

> At this point I am a bit confused why you are asking for review.

I am asking for us, collectively, through consensus, to resolve the
basic approach to doing this. That was something I stated right up
front, pointing out details of where the discussion had gone in the
past. That was my explicit goal. There has been plenty of discussion
on this down through the years, but nothing ever came from it.

Why is this an intractable problem for over a decade for us alone? Why
isn't this a problem for other database systems? I'm not implying that
it's because they do this. It's something that I am earnestly
interested in, though. A number of people have asked me that, and I
don't have a good answer for them.

>> I mean, if we do the promise tuple thing, and there are multiple
>> unique indexes, what happens when an inserter needs to block pending
>> the outcome of another transaction?  They had better go clean up the
>> promise tuples from the other unique indexes that they're trying to
>> insert into, because they cannot afford to hold value locks for a long
>> time, no matter how they're implemented.
>
> Why? We're using normal transaction visibility rules here. We don't stop
> *other* values on the same index getting updated or similar.

Because you're locking a value in some other, earlier unique index,
all the while waiting *indefinitely* on some other value in a second
or subsequent one. That isn't acceptable. A bunch of backends would
back up just because one backend had this contention on the second
unique index value that the others didn't actually have themselves. My
design allows those other backends to immediately go through and
finish.

Value locks have these kinds of hazards no matter how you implement
them. Deadlocks, and unreasonable stalling as described here is always
unacceptable - whether or not the problems are detected at runtime is
ultimately of marginal interest. Either way, it's a bug.

I think that the details of how this approach compare to others are
totally pertinent. For me, that's the whole point - getting towards
something that will balance all of these concerns and be acceptable.
Yes, it's entirely possible that that could look quite different to
what I have here. I do not want to reduce all this to a question of
"is this one design acceptable or not?". Am I not allowed to propose a
design to drive discussion? That's how the most important features get
implemented around here.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Kevin Grittner
Date:
Peter Geoghegan <pg@heroku.com> wrote:

> we exclusive lock a heap buffer (exactly one heap buffer) while
> holding shared locks on btree index buffers. Is that really so
> different to holding an exclusive lock on a btree buffer while
> holding a shared lock on a heap buffer? Because that's what
> _bt_check_unique() does today.

Is it possible to get a deadlock doing only one of those two
things?  Is it possible to avoid a deadlock doing both of them?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Fri, Sep 13, 2013 at 2:59 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I would suggest letting those other individuals speak for themselves
> too. Particularly if you're going to name someone who is on vacation
> like that.

You seem to be under the impression that I'm mentioning Tom's name, or
Andres's, because I need to win some kind of an argument.  I don't.
We're not going to accept a patch that uses lwlocks in the way that
you are proposing.

> I mean, if we do the promise tuple
> thing, and there are multiple unique indexes, what happens when an
> inserter needs to block pending the outcome of another transaction?
> They had better go clean up the promise tuples from the other unique
> indexes that they're trying to insert into, because they cannot afford
> to hold value locks for a long time, no matter how they're
> implemented.

As Andres already pointed out, this is not correct.  Just to add to
what he said, we already have long-lasting value locks in the form of
SIREAD locks. SIREAD can exist at different levels of granularity, but
one of those levels is index-page-level granularity, where they have
the function of guarding against concurrent insertions of values that
would fall within that page, which just so happens to be the same
thing you want to do here.  The difference between those locks and
what you're proposing here is that they are implemented differently.
That is why those were acceptable and this is not.

> That could take much longer than just releasing a shared
> buffer lock, since for each unique index the promise tuple must be
> re-found from scratch. There are huge issues with additional
> complexity and bloat. Oh, and now your lightweight locks aren't so
> lightweight any more.

Yep, totally agreed.  If you simply lock the buffer, or take some
other action which freezes out all concurrent modifications to the
page, then re-finding the lock is much simpler.  On the other hand,
it's much simpler precisely because you've reduced concurrency to the
degree necessary to make it simple.  And reducing concurrency is bad.
Similarly, complexity and bloat are not great things taken in
isolation, but many of our existing locking schemes are already very
complex.  Tuple locks result in a complex jig that involves locking
the tuple via the heavyweight lock manager, performing a WAL-logged
modification to the page, and then releasing the lock in the
heavyweight lock manager.  As here, that is way more expensive than
simply grabbing and holding a share-lock on the page.  But we get a
number of important benefits out of it.  The backend remains
interruptible while the tuple is locked, the protocol for granting
locks is FIFO to prevent starvation, we don't suppress page eviction
while the lock is held, we can simultaneously lock arbitrarily large
numbers of tuples, and deadlocks are detected and handled cleanly.  If
those requirements were negotiable, we would surely have negotiated
them away already, because the performance benefits would be immense.
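
Schematically, the jig looks something like this (a rough outline
only; heap_lock_tuple() in heapam.c has the real sequence, with many
more cases):

    /* heavyweight lock: interruptible, FIFO, deadlock-checked */
    LockTuple(relation, &tid, ExclusiveLock);

    /* short-term buffer lock for the page modification itself */
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    /* ... set the tuple's infomask lock bits, WAL-log the change ... */
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

    /* the heavyweight lock is then given up again */
    UnlockTuple(relation, &tid, ExclusiveLock);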

> If the value locks were made interruptible through some method, such
> as the promise tuples approach, does that really make deadlocking
> acceptable?

Yes.  It's not possible to prevent all deadlocks.  It IS possible to
make sure that they are properly detected and that precisely one of
the transactions involved is rolled back to resolve the deadlock.

> You can hardly compare a buffer's LWLock with a system one that
> protects critical shared memory structures. We're talking about a
> shared lock on a single btree leaf page per unique index involved in
> upserting.

Actually, I can and I am.  Buffers ARE critical shared memory structures.

>> A further problem is that a backend which holds even one lwlock can't
>> be interrupted.  We've had this argument before and it seems that you
>> don't think that non-interruptibility is a problem, but it is project
>> policy to allow for timely interrupts in all parts of the backend and
>> we're not going to change that policy for this patch.
>
> I don't think non-interruptibility is a problem? Really, do you think
> that this kind of inflammatory rhetoric helps anybody? I said nothing
> of the sort. I recall saying something about an engineering trade-off.
> Of course I value interruptibility.

I don't see what's inflammatory about that statement.  The point is
that this isn't the first time you've proposed a change which would
harm interruptibility and it isn't the first time I've objected on
precisely that basis.  Interruptibility is not a nice-to-have that we
can trade away from time to time; it's essential and non-negotiable.

> If you're concerned about non-interruptibility, consider XLogFlush().
> That does rather a lot of work with WALWriteLock exclusive locked. On
> a busy system, some backend is very frequently going to experience a
> non-interruptible wait for the duration of however long it takes to
> write and flush perhaps a whole segment. All other flushing backends
> are stuck in non-interruptible waits waiting for that backend to
> finish. I think that the group commit stuff might have regressed
> worst-case interruptibility for flushers by quite a bit; should we
> have never committed that, or do you agree with my view that it's
> worth it?

It wouldn't take a lot to convince me that it wasn't worth it, because
I was never all that excited about that patch to begin with.  I think
it mostly helps in extremely artificial situations that are not likely
to occur on real systems anyway.  But, yeah, WALWriteLock is a
problem, no doubt about it.  We should try to make the number of such
problems go down, not up, even if it means passing up new features
that we'd really like to have.

> In contrast, what I've proposed here is in general quite unlikely to
> result in any I/O for the duration of the time the locks are held.
> Only writers will be blocked. And only those inserting into a narrow
> range of values around the btree leaf page. Much of the work that even
> those writers need to do will be unimpeded anyway; they'll just block
> on attempting to acquire an exclusive lock on the first btree leaf
> page that the value they're inserting could be on.

Sure, but you're talking about broadening the problem from the guy
performing the insert to everybody who might be trying to perform an
insert
that hits one of the same unique-index pages.  Instead of holding one
buffer lock, the guy performing the insert is now holding as many
buffer locks as there are indexes.   That's a non-trivial issue.

For that matter, if the table has more than MAX_SIMUL_LWLOCKS indexes,
you'll error out.  In fact, if you get the number of indexes exactly
right, you'll exceed MAX_SIMUL_LWLOCKS in visibilitymap_clear() and
panic the whole system.

Oh, and if different backends load the index list in different orders,
because say the system catalog gets vacuumed between their respective
relcache loads, then they may try to lock the indexes in different
orders and cause an undetected deadlock.

And, drifting a bit further off-topic, even to get as far as you have,
you've added overhead to every lwlock acquisition and release, even
for people who never use this functionality.  I'm pretty skeptical
about anything that involves adding additional frammishes to the
lwlock mechanism.  There are a few new primitives I'd like, too, but
every one we add slows things down for everybody.

> And the additional
> non-interruptible wait of those inserters won't be terribly much more
> than the wait of the backend where heap tuple insertion takes a long
> time anyway - that guy already has to do close to 100% of that work
> with a non-interruptible wait today (once we eliminate
> heap_prepare_insert() and toasting). The UnlockReleaseBuffer() call is
> right at the end of heap_insert, and the buffer is pinned and locked
> very close to the start.

That's true but somewhat misleading.  Textually most of the function
holds the buffer lock, but heap_prepare_insert(),
CheckForSerializableConflictIn(), RelationGetBufferForTuple(), and
XLogWrite() are the parts that do substantial amounts of computation,
and only the last of those happens while holding the buffer lock.  And
that last is really fundamental, because we can't let any other
backend see the modified buffer until we've xlog'd the changes.  The
problems you're proposing to create do not fall into the same
category.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Greg Stark
Date:
I haven't read the patch and the btree code is an area I really don't
know, so take this for what it's worth...

It seems to me that the nature of the problem is that there will
unavoidably be a nexus between the two parts of the code here. We can
try to isolate it as much as possible but we're going to need a bit of
a compromise.

I'm imagining a function that takes two target heap buffers and a
btree key. It would descend the btree and holding the leaf page lock
do a try_lock on the heap pages. If it fails to get the locks then it
releases whatever it got and returns for the heap update to find new
pages and try again.

This still leaves the potential problem with page splits and I assume
it would still be tricky to call it without unsatisfactorily mixing
executor and btree code. But that's as far as I got.

--
greg

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-09-14 09:57:43 +0100, Greg Stark wrote:
> It seems to me that the nature of the problem is that there will
> unavoidably be a nexus between the two parts of the code here. We can try
> to isolate it as much as possible but we're going to need a bit of a
> compromise.

I think Robert's point and mine is that there are several ways to
approach this without doing that.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-09-13 14:41:46 -0700, Peter Geoghegan wrote:
> On Fri, Sep 13, 2013 at 12:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > The reasons I wasn't saying "this will never get accepted" are twofold:
> > a) I don't want to stifle alternative ideas to the "promises" idea,
> > just because I think it's the way to go. That might stop a better idea
> > from being articulated. b) I am not actually in the position to say it's
> > not going to be accepted.
> 
> Well, the reality is that the promises idea hasn't been described in
> remotely enough detail to compare it to what I have here. I've pointed
> out plenty of problems with it.

Even if you disagree, I still think that doesn't matter in the very
least. You say:

> I think that the details of how this approach compare to others are
> totally pertinent. For me, that's the whole point - getting towards
> something that will balance all of these concerns and be acceptable.

Well, the two other people involved in the discussion so far have gone
on the record saying that the presented approach is not acceptable to
them. And you haven't started reacting to that.

> Yes, it's entirely possible that that could look quite different to
> what I have here. I do not want to reduce all this to a question of
> "is this one design acceptable or not?".

But the way you're discussing it so far is exactly reducing it that way.

If you want the discussion to be about *how* we can implement it so that
the various concerns are addressed: fsck*ing great. I am with you there.

In the end, even though I have my usual strong opinions about which is
the best way, I don't care which algorithm gets pursued further. At least,
if, and only if, it has a fighting chance of getting committed. Which
this doesn't.

> After all, it was the first thing that
> I considered, and I'm on the record talking about it in the 2012 dev
> meeting. I didn't take that approach for many good reasons.

Well, I wasn't there when you said that ;)

> The reason I ended up here is not because I didn't get the memo about
> holding buffer locks across complex operations being a bad thing. At
> least grant me that. I'm here because in all these years no one has
> come up with a suggestion that doesn't have some very major downsides.
> Like, even worse than this.

I think you're massively, massively, massively overstating the dangers
of bloat here. It's a known problem that's *NOT* made worse by any of
the other proposals if you compare it with the loop/lock/catch
implementation of upsert that we have today as the only option. And we
*DO* have infrastructure to deal with bloat, even if it could use some
improvement. We *don't* have infrastructure to deal with deadlocks on
lwlocks. And we're not going to get that infrastructure, because it
would even further remove the "lw" part of lwlocks.

> >> As to the rules you refer to, you must mean "These locks are intended
> >> to be short-term: they should not be held for long". I don't think
> >> that they will ever be held for long. At least, when I've managed the
> >> amount of work that a heap_insert() can do better. I expect to produce
> >> a revision where toasting doesn't happen with the locks held soon.
> >> Actually, I've already written the code, I just need to do some
> >> testing.
> >
> > I personally think - and have stated so before - that doing a
> > heap_insert() while holding the btree lock is unacceptable.
> 
> Presumably your reason is essentially that we exclusive lock a heap
> buffer (exactly one heap buffer) while holding shared locks on btree
> index buffers.

It's that it interleaves an already complex but local locking scheme
that required several years to become correct with another that is just
the same. That's an utterly horrid idea.

> Is that really so different to holding an exclusive
> lock on a btree buffer while holding a shared lock on a heap buffer?
> Because that's what _bt_check_unique() does today.

Yes, it is different. But, in my opinion, _bt_check_unique() doing so
is a bug that needs fixing. Not something that we want to extend.

(Note that _bt_check_unique() already needs to deal with the fact that
it reads an unlocked page, because it moves right in some cases)

And, as you say:

> Now, I'll grant you that there is one appreciable difference, which is
> that multiple unique indexes may be involved. But limiting ourselves
> to the primary key or something like that remains an option. And I'm
> not sure that it's really any worse anyway.

I don't think that's an acceptable limitation. If it were something we
could lift in a release or two, maybe, but that's not what you're
talking about.

> > At this point I am a bit confused why you are asking for review.
> 
> I am asking for us, collectively, through consensus, to resolve the
> basic approach to doing this. That was something I stated right up
> front, pointing out details of where the discussion had gone in the
> past. That was my explicit goal. There has been plenty of discussing
> on this down through the years, but nothing ever came from it.

At the moment ISTM you're not conceding on *ANY* points. That's not very
often the way to find consensus.

> Why is this an intractable problem for over a decade for us alone? Why
> isn't this a problem for other database systems? I'm not implying that
> it's because they do this. It's something that I am earnestly
> interested in, though. A number of people have asked me that, and I
> don't have a good answer for them.

Afaik all those go the route of bloat, don't they? Also, at least in the
past, mysql had a long list of caveats around it...

> >> I mean, if we do the promise tuple thing, and there are multiple
> >> unique indexes, what happens when an inserter needs to block pending
> >> the outcome of another transaction?  They had better go clean up the
> >> promise tuples from the other unique indexes that they're trying to
> >> insert into, because they cannot afford to hold value locks for a long
> >> time, no matter how they're implemented.
> >
> > Why? We're using normal transaction visibility rules here. We don't stop
> > *other* values on the same index getting updated or similar.

> Because you're locking a value in some other, earlier unique index,
> all the while waiting *indefinitely* on some other value in a second
> or subsequent one. That isn't acceptable. A bunch of backends would
> back up just because one backend had this contention on the second
> unique index value that the others didn't actually have themselves. My
> design allows those other backends to immediately go through and
> finish.

That argument doesn't make sense to me. You're inserting a unique
value. It completely makes sense that you can only insert one of
them. If it's unclear whether you can insert, you're going to have to
wait. That's why they are UNIQUE, after all. You're describing a complete
non-advantage here. It's also how unique indexes already work.
Also note that waits on xids are properly supervised by deadlock detection.

Even if it had an advantage, not blocking *for the single unique key alone*
opens you to issues of livelocks where several backends retry because of
each other indefinitely.

> Value locks have these kinds of hazards no matter how you implement
> them. Deadlocks, and unreasonable stalling as described here is always
> unacceptable - whether or not the problems are detected at runtime is
> ultimately of marginal interest. Either way, it's a bug.

Whether postgres locks down in a way that can only be resolved by kill -9
or whether it aborts a transaction are, like, a couple of orders of
magnitude apart.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sat, Sep 14, 2013 at 12:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I mean, if we do the promise tuple
>> thing, and there are multiple unique indexes, what happens when an
>> inserter needs to block pending the outcome of another transaction?
>> They had better go clean up the promise tuples from the other unique
>> indexes that they're trying to insert into, because they cannot afford
>> to hold value locks for a long time, no matter how they're
>> implemented.
>
> As Andres already pointed out, this is not correct.

While skipping that cleanup is not incorrect, doing it certainly would
be useful for preventing deadlocks and unnecessary contention. In a
world where people expect either an insert or an update, we ought to
try and reduce contention across multiple unique indexes. I can
understand why that doesn't matter today, though - if you're going to
insert duplicates indifferent to whether or not there will be
conflicts, that's a kind of abuse, and not worth optimizing, since it
seems probable that most transactions will commit. However, it seems
much less probable that most upserters will insert. People may well
wish to upsert all the time where an insert is hardly ever necessary,
which is one reason why I have doubts about other proposals.

Note that today there is no guarantee that the original waiter for a
duplicate-inserting xact to complete will be the first one to get a
second chance, so I think it's hard to question this on correctness
grounds. Even if they are released in FIFO order, there is no reason
to assume that the first waiter will win the race with a second. Most
obviously, the second waiter may never even get the chance to block
on the same xid at all (so it's not really a waiter at all) and still
be able to insert, if the blocking-xact aborts after the second
"waiter" starts its descent but before it checks uniqueness. All this,
even though the second "waiter" arrived maybe minutes after the first.

What I'm talking about here is really unlikely to result in lock
starvation, because the original waiter typically gets to observe the
other waiter go through, and that's reason enough to give up entirely.
Now, it's kind of weird that the original waiter will still end up
blocking on the xid that caused it to wait in the first instance. So
there should be more thought put into that, like remembering the xid
and only waiting on it on a retry, or some similar scheme. Maybe you
could contrive a scenario where this causes lock starvation, but I
suspect you could do the same thing for the present btree insertion
code.

> Just to add to
> what he said, we already have long-lasting value locks in the form of
> SIREAD locks. SIREAD can exist at different levels of granularity, but
> one of those levels is index-page-level granularity, where they have
> the function of guarding against concurrent insertions of values that
> would fall within that page, which just so happens to be the same
> thing you want to do here.  The difference between those locks and
> what you're proposing here is that they are implemented differently.
> That is why those were acceptable and this is not.

As the implementer of this patch, I'm obligated to put some checks in
unique index insertion that everyone has to care about. There is no
way around that. Complexity issues aside, I think that an argument
could be made for this approach *reducing* the impact on concurrency
relative to other approaches, if there aren't too many unique indexes
to deal with, which is the case the vast majority of the time. I mean,
those other approaches necessitate doing so much more with *exclusive*
locks held. Like inserting, maybe doing a page split, WAL-logging, all
with the lock, and then either updating in place or killing the
promise tuple, and WAL-logging that, with an exclusive lock held the
second time around. Plus searching for everything twice. I think that
frequently killing all of those broken-promise tuples could have
deleterious effects on concurrency and/or index bloat of the kind only
remedied by REINDEX. Do you update the freespace map too? More
exclusive locks! Or if you leave it up to VACUUM (and just set the xid
to InvalidXid, which is still extra work), autovacuum has to care
about a new *class* of bloat - index-only bloat. Plus lots of dead
duplicates are bad for performance in btrees generally.

> As here, that is way more expensive than
> simply grabbing and holding a share-lock on the page.  But we get a
> number of important benefits out of it.  The backend remains
> interruptible while the tuple is locked, the protocol for granting
> locks is FIFO to prevent starvation, we don't suppress page eviction
> while the lock is held, we can simultaneously lock arbitrarily large
> numbers of tuples, and deadlocks are detected and handled cleanly.  If
> those requirements were negotiable, we would surely have negotiated
> them away already, because the performance benefits would be immense.

False equivalence. We only need to lock as many unique index *values*
(not tuples) as are proposed for insertion per slot (which can be
reasonably bounded), and only for an instant. Clearly it would be
totally unacceptable if tuple-level locks made backends
uninterruptible indefinitely. Of course, this is nothing like that.

>> If the value locks were made interruptible through some method, such
>> as the promise tuples approach, does that really make deadlocking
>> acceptable?
>
> Yes.  It's not possible to prevent all deadlocks.  It IS possible to
> make sure that they are properly detected and that precisely one of
> the transactions involved is rolled back to resolve the deadlock.

You seem to have misunderstood me here, or perhaps I was unclear. I'm
referring to deadlocks that cannot really be predicted or analyzed by
the user at all - see my comments below on insertion order.

>> I don't think non-interruptibility is a problem? Really, do you think
>> that this kind of inflammatory rhetoric helps anybody? I said nothing
>> of the sort. I recall saying something about an engineering trade-off.
>> Of course I value interruptibility.
>
> I don't see what's inflammatory about that statement.

The fact that you simply stated, without qualification, that I don't
think non-interruptibility is a problem, obviously.

> Interruptibility is not a nice-to-have that we
> can trade away from time to time; it's essential and non-negotiable.

I seem to recall you saying something about the Linux kernel and their
attitude to interruptibility. Yes, interruptibility is not just a
nice-to-have; it is essential. However, without dismissing your
other concerns, I have yet to hear a convincing argument as to why
anything I've done here is going to make any difference to
interruptibility that would be appreciable to any human. So far it's
been a slippery slope type argument that can be equally well used to
argue against some facet of almost any substantial patch ever
proposed. I just don't think that regressing interruptibility
marginally is *necessarily* sufficient justification for rejecting an
approach outright. FYI, *that's* how I value interruptibility
generally.

>> In contrast, what I've proposed here is in general quite unlikely to
>> result in any I/O for the duration of the time the locks are held.
>> Only writers will be blocked. And only those inserting into a narrow
>> range of values around the btree leaf page. Much of the work that even
>> those writers need to do will be unimpeded anyway; they'll just block
>> on attempting to acquire an exclusive lock on the first btree leaf
>> page that the value they're inserting could be on.
>
> Sure, but you're talking about broadening the problem from the guy
> performing the insert to everybody who might be trying to do an insert
> that hits one of the same unique-index pages.

In general, that isn't that much worse than just blocking the value
directly. The number of possible values that could also be blocked is
quite low. The chances of it actually mattering that those additional
values are locked in the still-small window in which the buffer locks
are held are generally fairly low, particularly on larger tables
where there is naturally a large number of possible distinct values. I
will however concede that the impact on inserters that want to insert
a non-locked value that belongs on the locked page or its child might
be worse, but it's already a problem that inserted index tuples can
all end up on the same page, if not to the same extent.

> Instead of holding one
> buffer lock, the guy performing the insert is now holding as many
> buffer locks as there are indexes.   That's a non-trivial issue.

Actually, as many buffer locks as there are *unique* indexes. It might
be a non-trivial issue, but this whole problem is decidedly
non-trivial, as I'm sure we can all agree.

> For that matter, if the table has more than MAX_SIMUL_LWLOCKS indexes,
> you'll error out.  In fact, if you get the number of indexes exactly
> right, you'll exceed MAX_SIMUL_LWLOCKS in visibilitymap_clear() and
> panic the whole system.

Oh, come on. We can obviously engineer a solution to that problem. I
don't think I've ever seen a table with close to 100 *unique* indexes.
4 or 5 is a very high number. If we just raised an error if someone
tried to do this with more than 10 unique indexes, I would guess
that we'd get exactly zero complaints about it.

> Oh, and if different backends load the index list in different orders,
> because say the system catalog gets vacuumed between their respective
> relcache loads, then they may try to lock the indexes in different
> orders and cause an undetected deadlock.

Undetected deadlock is really not much worse than detected deadlock
here. Either way, it's a bug. And it's something that any kind of
implementation will need to account for. It's not okay to
*unpredictably* deadlock, in a way that the user has no control over.
Today, someone can do an analysis of their application and eliminate
deadlocks if they need to. That might not be terribly practical much
of the time, but it can be done. It certainly is practical to do it in
a localized way. I wouldn't like to compromise that.
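
(The kind of localized analysis I mean is the usual one. For example,
with an invented table:

-- Session 1:
BEGIN;
UPDATE accounts SET bal = bal - 1 WHERE id = 1;
-- Session 2:
BEGIN;
UPDATE accounts SET bal = bal - 1 WHERE id = 2;
-- Session 1, now blocking on session 2's row lock:
UPDATE accounts SET bal = bal + 1 WHERE id = 2;
-- Session 2, completing the cycle; the deadlock is detected and
-- reported in one of the two sessions:
UPDATE accounts SET bal = bal + 1 WHERE id = 1;

The application-level fix is to always take row locks in a consistent
order, e.g. ordered by id. No analogous fix is available to the user
if backends internally lock indexes in an order the user can neither
see nor control.)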

So yes, you're right that I need to control for this sort of thing
better than in the extant patch, and in fact this was discussed fairly
early on. But it's an inherent problem.

> And, drifting a bit further off-topic, even to get as far as you have,
> you've added overhead to every lwlock acquisition and release, even
> for people who never use this functionality.

If you look at the code, you'll see that I've made very modest
modifications to LWLockRelease only. I would be extremely surprised if
the overhead was not only in the noise, but was completely impossible
to detect through any conventional benchmark. These are the same kind
of very modest changes made for LWLockAcquireOrWait(), and you said
nothing about that at the time. Despite the fact that you now appear
to think that that whole effort was largely a waste of time.

> That's true but somewhat misleading.  Textually most of the function
> holds the buffer lock, but heap_prepare_insert(),
> CheckForSerializableConflictIn(), and RelationGetBufferForTuple(), and
> XLogWrite() are the parts that do substantial amounts of computation,
> and only the last of those happens while holding the buffer lock.

I've already written modifications so that I don't have to do
heap_prepare_insert() with the locks held. There is no reason to call
CheckForSerializableConflictIn() with the additional locks held
either. After all, "For a heap insert, we only need to check for
table-level SSI locks". As for RelationGetBufferForTuple(), yes, the
majority of the time it will have to do very little without acquiring
an exclusive lock, because it's going to get that from the last place
a heap tuple was inserted from.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sat, Sep 14, 2013 at 3:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Well, the reality is that the promises idea hasn't been described in
>> remotely enough detail to compare it to what I have here. I've pointed
>> out plenty of problems with it.
>
> Even if you disagree, I still think that doesn't matter in the very
> least.

It matters if you care about getting this feature.

> You say:
>
>> I think that the details of how this approach compare to others are
>> totally pertinent. For me, that's the whole point - getting towards
>> something that will balance all of these concerns and be acceptable.
>
> Well, the two other people involved in the discussion so far have gone
> on the record saying that the presented approach is not acceptable to
> them. And you haven't started reacting to that.

Uh, yes I have. I'm not really sure what you could mean by that. What
am I refusing to address?

>> Yes, it's entirely possible that that could look quite different to
>> what I have here. I do not want to reduce all this to a question of
>> "is this one design acceptable or not?".
>
> But the way you're discussing it so far is exactly reducing it that way.

The fact that I was motivated to do things this way serves to
illustrate the problems generally.

> If you want the discussion to be about *how* we can implement it so that
> the various concerns are addressed: fsck*ing great. I am with you there.

Isn't that what we were doing? There has been plenty of commentary on
alternative approaches.

> In the end, even though I have my usual strong opinions which is the
> best way, I don't care which algorithm gets pursued further. At least,
> if, and only if, it has a fighting chance of getting committed. Which
> this doesn't.

I don't think any design that has been described to date is free of
serious problems. Causing excessive bloat, particularly in indexes,
is a serious problem too.

>> The reason I ended up here is not because I didn't get the memo about
>> holding buffer locks across complex operations being a bad thing. At
>> least grant me that. I'm here because in all these years no one has
>> come up with a suggestion that doesn't have some very major downsides.
>> Like, even worse than this.
>
> I think you're massively, massively, massively overstating the dangers
> of bloat here. It's a known problem that's *NOT* made worse by any of
> the other proposals if you compare them with the loop/lock/catch
> implementation of upsert that we have today as the only option.

Why would I compare it with that? That's terrible, and very few of our
users actually know about it anyway. Also, will an UPDATE followed by
an INSERT really bloat all that much anyway?

> And we
> *DO* have infrastructure to deal with bloat, even if it could use some
> improvement. We *don't* have infrastructure to deal with deadlocks on
> lwlocks. And we're not going to get that infrastructure, because it
> would even further remove the "lw" part of lwlocks.

Everything I said so far is predicated on LWLocks not deadlocking
here, so I'm not really sure why you'd say that. If I can't find a way
to prevent deadlock, then clearly the approach is doomed.

> It's that it interleaves an already complex but local locking scheme
> that required several years to become correct with another that is just
> as complex. That's an utterly horrid idea.

You're missing my point, which is that it may be possible, with
relatively modest effort, to analyze things to ensure that deadlock is
impossible - regardless of the complexities of the two systems -
because they're reasonably well encapsulated. See below, under "I'll
say it again".

Now, I can certainly understand why you wouldn't be willing to accept
that at face value. The idea isn't absurd, though. You could think of
the heap_insert() call as being under the control of the btree code
(just as, say, heap_hot_search() is), even though the code isn't at
all structured that way, and that's awkward. I'm actually slightly
tempted to structure it that way.

>> Is that really so different to holding an exclusive
>> lock on a btree buffer while holding a shared lock on a heap buffer?
>> Because that's what _bt_check_unique() does today.
>
> Yes, it is different. But, in my opinion, _bt_check_unique() doing so
> is a bug that needs fixing. Not something that we want to extend.

Well, I think you know that that's never going to happen. There are
all kinds of reasons why it works that way that cannot be disavowed.
My definition of a bug includes a user being affected.

>> > At this point I am a bit confused why you are asking for review.
>>
>> I am asking for us, collectively, through consensus, to resolve the
>> basic approach to doing this. That was something I stated right up
>> front, pointing out details of where the discussion had gone in the
>> past. That was my explicit goal. There has been plenty of discussing
>> on this down through the years, but nothing ever came from it.
>
> At the moment ISTM you're not conceding on *ANY* points. That's not very
> often the way to find consensus.

Really? I've conceded plenty of points. Just now I conceded a point to
Robert about insertion being blocked for inserters that want to insert
a value that isn't already locked/existing, and he didn't even raise
that in the first place. Most prominently, I've conceded that it is
entirely questionable that I hold the buffer locks for longer - before
you even responded to my original patch! I've said it many many times
many many ways. It should be heavily scrutinized. But you both seem to
be making general points along those lines, without reference to what
I've actually done. Those general points could almost to the same
extent apply to _bt_check_unique() today, which is why I have a hard
time accepting them at face value. To say that what that function does
is "a bug" is just not credible, because it's been around in
essentially the same form since at least a time when you and I were in
primary school. I'll remind you that you haven't been able to
demonstrate deadlock in a way that invalidates my approach. While of
course that's not how this is supposed to work, I've been too busy
defending myself here to get down to the business of carefully
analysing the relatively modest interactions between btree and heap
that could conceivably introduce a deadlock. Yes, the burden to prove
this can't deadlock is mine, but I thought I'd provide you with the
opportunity to prove that it can.

I'll say it again: For a deadlock, there needs to be a mutual
dependency. Provided the locking phase doesn't acquire any locks other
than buffer locks, and, during the interaction with the heap, btree
inserters (or the locking phase) cannot acquire heap locks in a way
that conflicts with other upserters, we will be fine. It doesn't
necessarily matter how complex each system individually is, because
the two meet in such a limited area (well, two areas now, I suppose),
and they only meet in one direction - there is no reciprocation where
the heap code locks or otherwise interacts with index buffers. When
the heap insertion is performed, all index value locks are already
acquired. The locking phase cannot block itself because of the
ordering of locking, but also because the locks on the heap that it
takes are only shared locks.

Now, this analysis is somewhat complex, and underdeveloped. But as
Robert said, there are plenty of things about locking in Postgres that
are complex and subtle. He also said that it doesn't matter if I can
prove that it won't deadlock, but I'd like a second opinion on that,
since my proof might actually be, if not simple, short, and therefore
may not represent an ongoing burden in the way Robert seemed to think
it would.

> That argument doesn't make sense to me. You're inserting a unique
> value. It completely makes sense that you can only insert one of
> them.

> Even if it had an advantage, not blocking *for the single unique key alone*
> opens you to issues of livelocks where several backends retry because of
> each other indefinitely.

See my remarks to Robert.

> Whether postgres locks down in a way that can only be resolved by kill -9
> or whether it aborts a transaction are, like, a couple of orders of
> magnitude apart.

Not really. I can see the advantage of having the deadlock be
detectable from a defensive-coding standpoint. But index lock
ordering inconsistencies, and the deadlocks they may cause, are not
generally acceptable.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sat, Sep 14, 2013 at 1:57 AM, Greg Stark <stark@mit.edu> wrote:
> It seems to me that the nature of the problem is that there will unavoidably
> be a nexus between the two parts of the code here. We can try to isolate it
> as much as possible but we're going to need a bit of a compromise.

Exactly. That's why all the proposals with the exception of this one
have to date involved unacceptable bloating - that's how they try and
span the nexus.

I'll find it very difficult to accept any implementation that is going
to bloat things even worse than our upsert looping example. The only
advantage of such an implementation over the upsert example is that
it'll avoid burning through subxacts. The main reason I don't want to
take that approach is that I know it won't be accepted, because it's a
disaster. That's why the people that proposed this in various forms
down through the years haven't gone and implemented it themselves. I
do not accept that all of this is like the general situation with row
locks. I do not think that the big costs of having many dead
duplicates in a unique index can be overlooked (or perhaps the cost of
cleaning them up eagerly, which is something I'd also expect to work
very badly). That's something that's going to reverberate all over the
place. Imagine a simple, innocent looking pattern that resulted in
there being unique indexes that became hugely bloated. It's not hard.

What I will concede (what I have conceded, actually) is that it would
be better if the locks were more granular. Now, I'm not so much
concerned about concurrent inserters inserting values that just so
happen to be values that were locked. It's more the case that I'm
worried about inserters blocking on other values that are incidentally
locked despite not already existing, that would go on the locked page
or maybe a later page. In particular, I'm concerned about the impact
on SERIAL primary key columns. Not exactly an uncommon case (though
one I'd already thought to optimize by locking last).
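
(Concretely - an invented example of the SERIAL worry:

CREATE TABLE events (id serial PRIMARY KEY, payload text);

-- Many sessions, concurrently:
INSERT INTO events (payload) VALUES ('x');

Every id comes from the same sequence, so every new index tuple goes
to the rightmost leaf page of the primary key index. An upserter
holding a value lock on that page, for some unrelated value that
happens to belong there, would momentarily block all of these
otherwise non-conflicting inserters.)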

What I think might actually work acceptably is if we were to create an
SLRU that kept track of value-locks per buffer. The challenge there
would be to have regular unique index inserters care about them, while
having little to no impact on their regular performance. This might be
possible by having them check the buffer for external value locks in
the SLRU immediately after exclusive locking the buffer - usually that
only has to happen once per index tuple insertion (assuming no
duplicates necessitate retry). If they find their value in the SLRU,
they do something like unlock and block on the other xact and restart.
Now, obviously a lot of the details would have to be worked out, but
it seems possible.

In order for any of this to really be possible, there'd have to be
some concession made to my position, as Greg mentions here. In other
words, I'd need buy-in for the general idea of holding locks in shared
memory from indexes across heap tuple insertion (subject to a sound
deadlock analysis, of course). Some modest compromises may need to be
made around interruptibility. I'd also probably need agreement that
it's okay that value locks cannot last more than an instant (they
cannot be held indefinitely pending the end of a transaction). This
isn't something that I imagine to be too controversial, because it's
true today for a single unique index. As I've already outlined, anyone
waiting on another transaction with a would-be duplicate to commit has
very few guarantees about the order that it'll get its second shot
relative to the order it initially queued up behind the successful but
not-yet-committed inserter.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Greg Stark
Date:
On 15 Sep 2013 10:19, "Peter Geoghegan" <pg@heroku.com> wrote:
>
> On Sat, Sep 14, 2013 at 1:57 AM, Greg Stark <stark@mit.edu> wrote:
> > It seems to me that the nature of the problem is that there will unavoidably
> > be a nexus between the two parts of the code here. We can try to isolate it
> > as much as possible but we're going to need a bit of a compromise.
>
> In order for any of this to really be possible, there'd have to be
> some concession made to my position, as Greg mentions here. In other
> words, I'd need buy-in for the general idea of holding locks in shared
> memory from indexes across heap tuple insertion (subject to a sound
> deadlock analysis, of course).

Actually that wasn't what I meant by that.

What I meant is that there is going to be some code coupling between
the executor and btree code. That's purely a question of code
structure, and will be true regardless of the algorithm you settle on.

What I was suggesting was an API for a function that would encapsulate
that coupling. The executor would call this function, which would
promise to obtain all the locks needed for both operations or give up.
Effectively it would be a special btree operation which would have
special knowledge of the executor only in that it knows that being able
to get a lock on two heap buffers is something the executor needs
sometimes.

I'm not sure this fits well with your syntax since it assumes the
update will happen at the same time as the index lookup, but as I said
I haven't read your patch, maybe it's not incompatible. I'm writing all
this on my phone so it's mostly just pie in the sky brainstorming. I'm
sorry if it's entirely irrelevant.

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Kevin Grittner
Date:
Peter Geoghegan <pg@heroku.com> wrote:

> There is no reason to call CheckForSerializableConflictIn() with
> the additional locks held either. After all, "For a heap insert,
> we only need to check for table-level SSI locks".

You're only talking about not covering that call with a *new*
LWLock, right?  We put some effort into making sure that such calls
were only inside of LWLocks which were needed for correctness.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-09-15 02:19:41 -0700, Peter Geoghegan wrote:
> On Sat, Sep 14, 2013 at 1:57 AM, Greg Stark <stark@mit.edu> wrote:
> > It seems to me that the nature of the problem is that there will unavoidably
> > be a nexus between the two parts of the code here. We can try to isolate it
> > as much as possible but we're going to need a bit of a compromise.
> 
> Exactly. That's why all the proposals with the exception of this one
> have to date involved unacceptable bloating - that's how they try and
> span the nexus.

> I'll find it very difficult to accept any implementation that is going
> to bloat things even worse than our upsert looping example.

How would any even halfway sensible example cause *more* bloat than the
upsert looping thing?
I'll concede that bloat is something to be aware of, but just because
it's *an* issue, it's not *the* only issue.

All the solutions I can think of/have heard of that have a chance of
producing additional bloat also have a good chance of cleaning up that
additional bloat.

In the "promises" approach you simply can mark the promise index tuples
as LP_DEAD in the IGNORE case if you've found a conflicting tuple. In
the OR UPDATE case you can immediately reuse them. There's no heap
bloat. The logic for dead items already exists in nbtree, so that's not
too much complication. The case where that doesn't work is when postgres
dies in between or we're signalled to abort. But that produces bloat for
normal DML anyway. Any vacuum or insert can check whether the promise
xid has committed and remove the promise otherwise.

In the proposals that involve just inserting the heap tuple and then
handling the uniqueness violation when inserting the index tuples, you
can immediately mark the index tuples as dead and the heap tuple as
prunable.


> The only  advantage of such an implementation over the upsert example is that
> it'll avoid burning through subxacts. The main reason I don't want to
> take that approach is that I know it won't be accepted, because it's a
> disaster. That's why the people that proposed this in various forms
> down through the years haven't gone and implemented it themselves. I
> do not accept that all of this is like the general situation with row
> locks.

The primary advantage will be that it's actually usable by users without
massive overhead in writing dozens of functions.

I don't think the bloat issue had much to do with the feature not
getting implemented so far. It's that nobody was willing to do the work
and endure the discussions around it. And I definitely applaud you for
finally tackling the issue despite that.

> I do not think that the big costs of having many dead
> duplicates in a unique index can be overlooked

Why would there be so many duplicate index tuples? The primary user of
this is going to be UPSERT. In case there's a conflicting tuple, there
is going to be a new tuple version. Which will need a new index entry
quite often. If there's no conflict, we will insert anyway.
So, there's the case of UPSERTs that could be done as HOT updates
because there's enough space on the page and none of the indexed
values have actually changed. As explained above, we can simply mark the index
tuple as dead in that case (don't even need an exclusive lock for that,
if done right).

> (or perhaps the cost of
> cleaning them up eagerly, which is something I'd also expect to work
> very badly).

Why? Remember the page you did the insert to, do a _bt_moveright() to
catch any intervening splits. Mark the item as dead. Done. The next insert will
repack the page if necessary (cf. _bt_findinsertloc).

> What I will concede (what I have conceded, actually) is that it would
> be better if the locks were more granular. Now, I'm not so much
> concerned about concurrent inserters inserting values that just so
> happen to be values that were locked. It's more the case that I'm
> worried about inserters blocking on other values that are incidentally
> locked despite not already existing, that would go on the locked page
> or maybe a later page. In particular, I'm concerned about the impact
> on SERIAL primary key columns. Not exactly an uncommon case (though
> one I'd already thought to optimize by locking last).

Yes, I think that's the primary issue from a scalability and performance
POV. Locking entire ranges of values, potentially even on inner pages
(because you otherwise would have to split) isn't going to work.

> What I think might actually work acceptably is if we were to create an
> SLRU that kept track of value-locks per buffer. The challenge there
> would be to have regular unique index inserters care about them, while
> having little to no impact on their regular performance. This might be
> possible by having them check the buffer for external value locks in
> the SLRU immediately after exclusive locking the buffer - usually that
> only has to happen once per index tuple insertion (assuming no
> duplicates necessitate retry). If they find their value in the SLRU,
> they do something like unlock and block on the other xact and restart.
> Now, obviously a lot of the details would have to be worked out, but
> it seems possible.

If you can make that work, without locking heap and btree pages at the
same time, yes, I think that's a possible way forward. One way to offset
the cost of SLRU in the common case where there is no contention would
be to have a page level flag that triggers that lookup. There should be
space in btpo_flags.

> In order for any of this to really be possible, there'd have to be
> some concession made to my position, as Greg mentions here. In other
> words, I'd need buy-in for the general idea of holding locks in shared
> memory from indexes across heap tuple insertion (subject to a sound
> deadlock analysis, of course).

I don't have a fundamental problem with holding locks during the
insert. I have a problem with holding page level lightweight locks on
the btree and the heap at the same time.

> Some modest compromises may need to be made around interruptibility.

Why? As far as I understand that proposal, I don't see why that would be needed.

> I'd also probably need agreement that
> it's okay that value locks can not last more than an instant (they
> cannot be held indefinitely pending the end of a transaction). This
> isn't something that I imagine to be too controversial, because it's
> true today for a single unique index. As I've already outlined, anyone
> waiting on another transaction with a would-be duplicate to commit has
> very few guarantees about the order that it'll get its second shot
> relative to the order it initially queued up behind the successful but
> not-yet-committed inserter.

I foresee problems here.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Sat, Sep 14, 2013 at 6:27 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Note that today there is no guarantee that the original waiter for a
> duplicate-inserting xact to complete will be the first one to get a
> second chance, so I think it's hard to question this on correctness
> grounds. Even if they are released in FIFO order, there is no reason
> to assume that the first waiter will win the race with a second. Most
> obviously, the second waiter may never even get the chance to block
> on the same xid at all (so it's not really a waiter at all) and still
> be able to insert, if the blocking-xact aborts after the second
> "waiter" starts its descent but before it checks uniqueness. All this,
> even though the second "waiter" arrived maybe minutes after the first.

ProcLockWakeup() only wakes as many waiters from the head of the queue
as can all be granted the lock without any conflicts.  So I don't
think there is a race condition in that path.

> So far it's
> been a slippery slope type argument that can be equally well used to
> argue against some facet of almost any substantial patch ever
> proposed.

I don't completely agree with that characterization, but you do have a
point.  Obviously, if the differences in the area of interruptibility,
starvation, deadlock risk, etc. can be made small enough relative to
the status quo, then those aren't reasons to reject the approach.

But I'm skeptical that you're going to be able to accomplish that,
especially without adversely affecting maintainability.  I think the
way that you're proposing to use lwlocks here is sufficiently
different from what the rest of the system does that it's going to be
hard to avoid system-wide effects that can't easily be caught during
code review; and like Andres, I don't share your skepticism about
alternative approaches.

>> For that matter, if the table has more than MAX_SIMUL_LWLOCKS indexes,
>> you'll error out.  In fact, if you get the number of indexes exactly
>> right, you'll exceed MAX_SIMUL_LWLOCKS in visibilitymap_clear() and
>> panic the whole system.
>
> Oh, come on. We can obviously engineer a solution to that problem. I
> don't think I've ever seen a table with close to 100 *unique* indexes.
> 4 or 5 is a very high number. If we just raised an error if someone
> tried to do this with more than 10 unique indexes, I would guess
> that we'd get exactly zero complaints about it.

That's not a solution; that's a hack.

> Undetected deadlock is really not much worse than detected deadlock
> here. Either way, it's a bug. And it's something that any kind of
> implementation will need to account for. It's not okay to
> *unpredictably* deadlock, in a way that the user has no control over.
> Today, someone can do an analysis of their application and eliminate
> deadlocks if they need to. That might not be terribly practical much
> of the time, but it can be done. It certainly is practical to do it in
> a localized way. I wouldn't like to compromise that.

I agree that unpredictable deadlocks are bad.  I think the fundamental
problem with UPSERT, MERGE, and this proposal is what happens when the
conflicting tuple is present but not visible to your scan, either
because it hasn't committed yet or because it has committed but is not
visible to your snapshot.  I'm not clear on how you handle that in
your approach.

> If you look at the code, you'll see that I've made very modest
> modifications to LWLockRelease only. I would be extremely surprised if
> the overhead was not only in the noise, but was completely impossible
> to detect through any conventional benchmark. These are the same kind
> of very modest changes made for LWLockAcquireOrWait(), and you said
> nothing about that at the time. Despite the fact that you now appear
> to think that that whole effort was largely a waste of time.

Well, I did have some concerns about the performance impact of that patch:

http://www.postgresql.org/message-id/CA+TgmoaPyQKEaoFz8HkDGvRDbOmRpkGo69zjODB5=7Jh3hbPQA@mail.gmail.com

I also discovered, after it was committed, that it didn't help in the
way I expected:

http://www.postgresql.org/message-id/CA+TgmoY8P3sD=oUViG+xZjmZk5-phuNV39rtfyzUQxU8hJtZxw@mail.gmail.com

It's true that I didn't raise those concerns contemporaneously with
the commit, but I didn't understand the situation well enough at that
time to realize how narrow the benefit was.

I've wished, on a number of occasions, to be able to add more lwlock
primitives.  The problem with that is that if everybody does it, we'll
pretty soon end up with a mess.  I attempted to address that with this
proposal:

http://www.postgresql.org/message-id/CA+Tgmob4YE_k5dpO0T07PNf1SOKPybo+wj4m4FryOS7Z4_yOzg@mail.gmail.com

...but nobody (including me) was very sure that was the right way
forward, and it never went anywhere.  However, I think the basic issue
remains.  I was sad to discover last week that Heikki handled this
problem for the WAL scalability patch by basically copy-and-pasting
much of the lwlock code and then hacking it up.  I think we're well on
our way to an unmaintainable mess already, and I don't want it to get
worse.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-09-17 12:29:51 -0400, Robert Haas wrote:
> But I'm skeptical that you're going to be able to accomplish that,
> especially without adversely affecting maintainability.  I think the
> way that you're proposing to use lwlocks here is sufficiently
> different from what the rest of the system does that it's going to be
> hard to avoid system-wide effects that can't easily be caught during
> code review;

I actually think extending lwlocks to allow downgrading an exclusive
lock is a good idea, independent of this patch, and I think there are
some areas of the code where we could use that capability to increase
scalability. Now, that might be because I pretty much suggested using
them in such a way to solve some of the problems :P

I don't think they solve the issue of this patch (holding several nbtree
pages locked across heap operations) though.

> I agree that unpredictable deadlocks are bad.  I think the fundamental
> problem with UPSERT, MERGE, and this proposal is what happens when the
> conflicting tuple is present but not visible to your scan, either
> because it hasn't committed yet or because it has committed but is not
> visible to your snapshot.  I'm not clear on how you handle that in
> your approach.

Hm. I think it should be handled exactly the way we handle it for unique
indexes today. Wait till it's clear whether you can proceed.
At some point we might want to extend that logic to more cases, but that
should be a separate discussion imo.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Sep 17, 2013 at 6:20 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I agree that unpredictable deadlocks are bad.  I think the fundamental
>> problem with UPSERT, MERGE, and this proposal is what happens when the
>> conflicting tuple is present but not visible to your scan, either
>> because it hasn't committed yet or because it has committed but is not
>> visible to your snapshot.  I'm not clear on how you handle that in
>> your approach.
>
> Hm. I think it should be handled exactly the way we handle it for unique
> indexes today. Wait till it's clear whether you can proceed.

That's what I do, although getting those details right has been of
secondary concern for obvious reasons.

> At some point we might want to extend that logic to more cases, but that
> should be a separate discussion imo.

This is essentially why I went and added a row locking component over
your objections. Value locks (regardless of implementation)
effectively stop an insertion from finishing, but not from starting.
ISTM that locking the row with value locks held can cause deadlock.
So, unfortunately, we cannot really discuss value locking and row
locking separately, even though I see the appeal of trying to. Gaining
an actual representative notion of the expense of releasing and
re-acquiring the locks is too tightly coupled with how this is handled
and how frequently we need to restart. Plus there may well be other
issues in the same vein that we've yet to consider.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-09-18 00:54:38 -0500, Peter Geoghegan wrote:
> > At some point we might want to extend that logic to more cases, but that
> > should be a separate discussion imo.
> 
> This is essentially why I went and added a row locking component over
> your objections.

I didn't object to implementing row level locking. I said that if your
basic algorithm without row level locks is viewed as being broken, it
won't be fixed by implementing row level locking.

What I meant here is just that we shouldn't implement a mode with less
waiting for now, even if there might be use cases, because that will open
another can of worms.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Sep 17, 2013 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Sep 14, 2013 at 6:27 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Note that today there is no guarantee that the original waiter for a
>> duplicate-inserting xact to complete will be the first one to get a
>> second chance

> ProcLockWakeup() only wakes as many waiters from the head of the queue
> as can all be granted the lock without any conflicts.  So I don't
> think there is a race condition in that path.

Right, but what about XactLockTableWait() itself? It only acquires a
ShareLock on the xid of the got-there-first inserter that potentially
hasn't yet committed/aborted. There will be no conflicts between
multiple second-chance-seeking blockers trying to acquire this lock
concurrently, and so in fact there is (what I guess you'd consider to
be) a race condition in the current btree insertion code. So my
earlier point about according an upsert implementation license to
optimize ordering of retries across multiple unique indexes -- that it
isn't really inconsistent with the current code when dealing with only
one unique index insertion -- has not been invalidated.

EvalPlanQualFetch() and Do_MultiXactIdWait() also call
XactLockTableWait(), for similar reasons. In my patch, the later row
locking code used by INSERT...ON DUPLICATE KEY LOCK FOR UPDATE calls
XactLockTableWait() too.

>> So far it's
>> been a slippery slope type argument that can be equally well used to
>> argue against some facet of almost any substantial patch ever
>> proposed.
>
> I don't completely agree with that characterization, but you do have a
> point.  Obviously, if the differences in the area of interruptibility,
> starvation, deadlock risk, etc. can be made small enough relative to
> the status quo, then those aren't reasons to reject the approach.

That all seems fair to me. That's the standard that I'd apply as a
reviewer myself.

> But I'm skeptical that you're going to be able to accomplish that,
> especially without adversely affecting maintainability.  I think the
> way that you're proposing to use lwlocks here is sufficiently
> different from what the rest of the system does that it's going to be
> hard to avoid system-wide effects that can't easily be caught during
> code review;

Fair enough. In case it isn't already totally clear to someone, I
concede that it isn't going to be workable to hold even shared buffer
locks across all these operations. Let's get past that, though.

> and like Andres, I don't share your skepticism about
> alternative approaches.

Well, I expressed skepticism about one alternative approach in
particular, which is the promise tuples approach. Andres seems to
think that I'm overly concerned about bloat, but I'm not sure he
appreciates why I'm so sensitive to it in this instance. I'll be
particularly sensitive to it if value locks need to be held
indefinitely rather than there being a speculative
grab-the-value-locks attempt (because that increases the window in
which another session can necessitate that we retry at row locking
time quite considerably - see below).

> I think the fundamental
> problem with UPSERT, MERGE, and this proposal is what happens when the
> conflicting tuple is present but not visible to your scan, either
> because it hasn't committed yet or because it has committed but is not
> visible to your snapshot.

Yeah, you're right. As I mentioned to Andres already, when row locking
happens and there is this kind of conflict, my approach is to retry
from scratch (go right back to before value lock acquisition) in the
sort of scenario that generally necessitates EvalPlanQual() looping,
or to throw a serialization failure where that's appropriate. After an
unsuccessful attempt at row locking there could well be an interim
wait for another xact to finish, before retrying (at read committed
isolation level). This is why I think that value locking/retrying
should be cheap, and should avoid bloat if at all possible.

Forgive me if I'm making a leap here, but it seems like what you're
saying is that the semantics of upsert that one might naturally expect
are *arguably* fundamentally impossible, because they entail
potentially locking a row that isn't current to your snapshot, and you
cannot throw a serialization failure at read committed. I respectfully
suggest that that exact definition of upsert isn't a useful one,
because other snapshot isolation/MVCC systems operating within the
same constraints must have the same issues, and yet they manage to
implement something that could be called upsert that people seem happy
with.

> I also discovered, after it was committed, that it didn't help in the
> way I expected:
>
> http://www.postgresql.org/message-id/CA+TgmoY8P3sD=oUViG+xZjmZk5-phuNV39rtfyzUQxU8hJtZxw@mail.gmail.com

Well, at the time you didn't also provide raw commit latency benchmark
results for your hardware using a tool like pg_test_fsync, which I'd
consider absolutely essential to such a discussion. That's mostly or
entirely what the group commit stuff does - amortize that cost among
concurrently flushing transactions. Around this time, the patch was
said by Heikki to just relieve lock contention around WALWriteLock -
the 9.2 release notes say much the same. I never understood it that
way, though Heikki disagreed with that [1].

Certainly, if relieving contention was all the patch did, then you
wouldn't expect the 9.3 commit_delay implementation to help anyone,
but it does: with a slow fsync, holding the lock 50% *longer* can
actually help tremendously. So I *always* agreed with you that there
was hardware where group commit would barely help with a moderately
sympathetic benchmark like the pgbench default. Not that it matters
much now.

> It's true that I didn't raise those concerns contemporaneously with
> the commit, but I didn't understand the situation well enough at that
> time to realize how narrow the benefit was.
>
> I've wished, on a number of occasions, to be able to add more lwlock
> primitives.  The problem with that is that if everybody does it, we'll
> pretty soon end up with a mess.

I wouldn't go that far. The number of possible additional primitives
that are useful isn't that high, unless we decide that LWLocks are
going to be a fundamentally different thing, which I consider
unlikely.

> http://www.postgresql.org/message-id/CA+Tgmob4YE_k5dpO0T07PNf1SOKPybo+wj4m4FryOS7Z4_yOzg@mail.gmail.com
>
> ...but nobody (including me) was very sure that was the right way
> forward, and it never went anywhere.  However, I think the basic issue
> remains.  I was sad to discover last week that Heikki handled this
> problem for the WAL scalability patch by basically copy-and-pasting
> much of the lwlock code and then hacking it up.  I think we're well on
> our way to an unmaintainable mess already, and I don't want it to get
> worse.  :-(

I hear what you're saying about LWLocks. I did follow the FlexLocks
stuff at the time myself. Obviously we aren't going to add new lwlock
operations if they have exactly no clients. However, I think that the
semantics implemented (weakening and strengthening of locks) may well
be handy somewhere else. So while I wouldn't go and commit that stuff
on the off chance that it will be useful, it's worth bearing in mind
going forward that it's quite possible to weaken/strengthen locks.


[1] http://www.postgresql.org/message-id/4FB0A673.7040002@enterprisedb.com

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Stephen Frost
Date:
Peter,

* Peter Geoghegan (pg@heroku.com) wrote:
> Forgive me if I'm making a leap here, but it seems like what you're
> saying is that the semantics of upsert that one might naturally expect
> are *arguably* fundamentally impossible, because they entail
> potentially locking a row that isn't current to your snapshot, and you
> cannot throw a serialization failure at read committed. I respectfully
> suggest that that exact definition of upsert isn't a useful one,

I'm not sure I follow this completely- you're saying that a definition
of 'upsert' which includes having to lock rows which aren't in your
current snapshot (for reasons stated) isn't a useful one.  Is the
implication that a useful definition of 'upsert' is that it *doesn't*
have to lock rows which aren't in your current snapshot, and if so, then
what would the semantics of that upsert look like?

> because other snapshot isolation/MVCC systems operating within the
> same constraints must have the same issues, and yet they manage to
> implement something that could be called upsert that people seem happy
> with.

This I am generally in agreement with, to the extent that 'upsert' is
something we really want and we should figure out a way to get there
from here, but it wouldn't be the first time that we worked out a
better solution than existing implementations.  So, another '+1' from me
wrt your working this issue and please don't get too discouraged that
there's a lot of pressure to find a magic bullet - I think part of it is
exactly because everyone wants this and wants it to be better than
what's out there today.
Thanks,
    Stephen

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
Hi Stephen,

On Fri, Sep 20, 2013 at 6:55 PM, Stephen Frost <sfrost@snowman.net> wrote:
> I'm not sure I follow this completely- you're saying that a definition
> of 'upsert' which includes having to lock rows which aren't in your
> current snapshot (for reasons stated) isn't a useful one.  Is the
> implication that a useful definition of 'upsert' is that it *doesn't*
> have to lock rows which aren't in your current snapshot, and if so, then
> what would the semantics of that upsert look like?

No, I'm suggesting that the useful semantics are that it does
potentially lock rows not yet visible to our snapshot that have
committed - the latest row version. I see no alternative (we can't
throw a serialization failure at read committed isolation level), and
Andres seemed to agree that this was the way forward. Robert described
problems he saw with this a few years ago [1]. It *is* a problem (we
need to think very carefully about it), but, as I've said, it is a
problem that anyone implementing this feature for a Snapshot
Isolation/MVCC database would have to deal with, and several have.
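
For illustration (invented table - this is just today's FOR UPDATE
behavior at read committed, not anything new from the patch):

-- Session 1:
BEGIN;
UPDATE tab SET v = 'new' WHERE k = 1;

-- Session 2, read committed; blocks on session 1's row lock:
SELECT * FROM tab WHERE k = 1 FOR UPDATE;

-- Session 1:
COMMIT;

-- Session 2 now returns and locks the (1, 'new') version - a row
-- version that was never visible to the snapshot its scan started
-- with.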

So, what the patch does right now is (if you squint) analogous to how
SELECT FOR UPDATE uses EvalPlanQual already. However, instead of
re-verifying a qual, we're re-verifying that the value locking has
identified the right tid (there will probably be a different one in
the subsequent iteration, or maybe we *can* go insert this time). We
need consensus across unique indexes to go ahead with insertion, but
once we know that we can't (and have a tid to lock), value locks can
go away - we'll know if anything has changed about the tid's logical
row that we need to care about when row locking. Besides, holding
value locks while row locking has deadlock hazards, and, because value
locks only stop insertions *finishing*, holding on to them is at best
pointless.

The tid we get from locking, which points to a would-be duplicate heap
tuple, always refers to a committed tuple - otherwise we'd never return
from locking, because that blocks pending the outcome of a duplicate-inserting xact
(and only returns the tid when that xact commits). Even though this
tuple is known to be visible, it may be deleted in the interim before
row locking, in which case restarting from before value locking is
appropriate. It might also be updated, which would necessitate locking
a later row version in order to prevent race conditions. But it seems
dangerous, invasive, and maybe even generally impossible to try and
wait for the transaction that updated to commit or abort so that we
can lock that later version the usual way (the usual EvalPlanQual
looping thing) - better to restart value locking.

The fundamental observation about value locking (at least for any
half-way reasonable implementation), that I'd like to emphasize, is
that short of a radical overhaul that would have many downsides, it
can only ever prevent insertion from *finishing*. The big picture of
my design is that it tries to quickly grab value locks, release them
and grab a row lock (or insert heap tuples, index tuples, and then
release value locks). If row locking fails, it waits for the
conflicter xact to finish, and then restarts before the value locking
of the current slot. If you think that's kind of questionable, maybe
you have a point, but consider:

1. How else are you going to handle it if row locking needs to handle
conflicts? You might say "I can re-verify that no unique index columns
were affected instead", and maybe you can, but what if that doesn't
help because they *were* changed? Besides, doesn't this break the
amcanunique contract? Surely judging what's really a duplicate is the
AM's job.

You're back to "I need to throw an error to get out of this but I have
no good excuse to do so at read committed" -- you've lost the usual
duplicate key error "excuse". I don't think you can expect holding the
value locks throughout row locking to help, because, as I've said,
that causes morally indefensible deadlocks, and besides, it doesn't
stop what row locking would consider to be a conflict, it just stops
insertion from *finishing*.

2. In the existing btree index insertion code, the order that retries
occur in the event of unique index tuple insertion finding an
unfinished conflicting xact *is* undefined. Yes, that's right - the
first waiter is not guaranteed to be the first to get a second chance.
It's not even particularly probable! See remarks from my last mail to
Robert for more information.

3. People with a real purist's view on the (re)ordering of value
locking must already think that EvalPlanQual() is completely ghetto
for very similar reasons, and as such should just go use a higher
isolation level. For the rest of us, what concurrency control anomaly
can allowing this cause over and above what's already possible there?
Are lock starvation risks actually appreciably raised at all?

What Andres and Robert seem to expect generally - that value locks
only be released when the locker has a definitive answer - actually
*can* be ensured at the higher isolation levels, where the system has
license to bail out by throwing a serialization failure. The trick
there is just to throw an error if the first *retry* at cross-index
value locking is unsuccessful or blocks on a whole other xact -- a
serialization error (and not a unique constraint violation error, as
would often, though not necessarily, occur otherwise for non-upserters).
Naturally, it could also happen that at > read committed, row locking
throws a serialization failure (as is always mandated over using
EvalPlanQual() or other monkeying around at higher isolation levels).

> This I am generally in agreement with, to the extent that 'upsert' is
> something we really want and we should figure out a way to get there
> from here, but it wouldn't be the first time that we worked out a
> better solution than existing implementations.  So, another '+1' from me
> wrt your working this issue and please don't get too discouraged that
> there's a lot of pressure to find a magic bullet

Thanks for the encouragement!

[1] http://www.postgresql.org/message-id/AANLkTineR-rDFWENeddLg=GrkT+epMHk2j9X0YqpiTY8@mail.gmail.com
-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sun, Sep 15, 2013 at 8:23 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I'll find it very difficult to accept any implementation that is going
>> to bloat things even worse than our upsert looping example.
>
> How would any even halfway sensible example cause *more* bloat than the
> upsert looping thing?

I was away in Chicago over the week, and didn't get to answer this.
Sorry about that.

In the average/uncontended case, the subxact example bloats less than
all alternatives to my design proposed to date (including the "unborn
heap tuple" idea Robert mentioned in passing to me in person the other
day, which I think is somewhat similar to a suggestion of Heikki's
[1]). The average case is very important, because in general
contention usually doesn't happen. But you need to also appreciate
that because of the way row locking works and the guarantees value
locking makes, any ON DUPLICATE KEY LOCK FOR UPDATE implementation is
going to have to potentially restart in more places (as compared to
the doc's example), maybe including value locking of each unique index
and certainly including row locking. So the contended case might even
be worse as well.

On average, it is quite likely that either the UPDATE or INSERT will
succeed - there has to be some concurrent activity around the same
values for either to fail, and in general that's quite unlikely. If
the UPDATE doesn't succeed, it won't bloat, and it's then very likely
that the INSERT at the end of the loop will go ahead and succeed
without itself creating bloat.
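
For reference, the "subxact example" here is the upsert loop from the
docs, essentially the following PL/pgSQL (abridged; a table
db(a int primary key, b text) is assumed):

create function merge_db(key int, data text) returns void as
$$
begin
    loop
        -- first try to update the key
        update db set b = data where a = key;
        if found then
            return;
        end if;
        -- not there, so try to insert the key; if someone else
        -- inserts the same key concurrently, the unique-key failure
        -- is caught by the subtransaction and we loop to retry
        begin
            insert into db(a, b) values (key, data);
            return;
        exception when unique_violation then
            -- do nothing, and loop to try the UPDATE again
        end;
    end loop;
end;
$$ language plpgsql;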

Going forward with this discussion, I would like us all to take as
read that the buffer locking stuff is a prototype approach to value
locking, to be refined later (subject to my basic design being judged
fundamentally sound). I don't think anyone believes that it's
fundamentally incorrect in that it doesn't do something that it claims
to do (concerns are more around what it might do or prevent that it
shouldn't), and it can still drive discussion in a very useful
direction. So far criticism of this patch has almost entirely been on
aspects of buffer locking, but it would be much more useful for the
time being to simply assume that the buffer locks *are* interruptible.
It's probably okay with me to still be a bit suspicious of
deadlocking, though: refining the buffer locking into a more granular
SLRU value locking approach doesn't necessarily guarantee that
deadlock is impossible, even if it does (I guess) prevent undesirable
interactions with other buffer locking.

[1] http://www.postgresql.org/message-id/45E845C4.6030000@enterprisedb.com

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Sep 20, 2013 at 5:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> ProcLockWakeup() only wakes as many waiters from the head of the queue
>> as can all be granted the lock without any conflicts.  So I don't
>> think there is a race condition in that path.
>
> Right, but what about XactLockTableWait() itself? It only acquires a
> ShareLock on the xid of the got-there-first inserter that potentially
> hasn't yet committed/aborted. There will be no conflicts between
> multiple second-chance-seeking blockers trying to acquire this lock
> concurrently, and so in fact there is (what I guess you'd consider to
> be) a race condition in the current btree insertion code.

I should add: README.tuplock says the following:

"""
 The protocol for waiting for a tuple-level lock is really

     LockTuple()
     XactLockTableWait()
     mark tuple as locked by me
     UnlockTuple()

When there are multiple waiters, arbitration of who is to get the lock next
is provided by LockTuple().

"""

So because this isn't a tuple-level lock - it's really a value-level
lock - LockTuple() is not called by the btree code at all, and so
arbitration of who gets the lock is, as I've said, essentially
undefined.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sat, Sep 21, 2013 at 7:22 PM, Peter Geoghegan <pg@heroku.com> wrote:
> So because this isn't a tuple-level lock - it's really a value-level
> lock - LockTuple() is not called by the btree code at all, and so
> arbitration of who gets the lock is, as I've said, essentially
> undefined.

Addendum: It isn't even a value-level lock, because the buffer locks
are of course released before the XactLockTableWait() call. It's a
simple attempt to acquire a shared lock on an xid.


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
Hi,

I don't have time to answer the other emails today (elections,
climbing), but maybe you could clarify the below?

On 2013-09-21 17:07:11 -0700, Peter Geoghegan wrote:
> On Sun, Sep 15, 2013 at 8:23 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I'll find it very difficult to accept any implementation that is going
> >> to bloat things even worse than our upsert looping example.
> >
> > How would any even halfway sensible example cause *more* bloat than the
> > upsert looping thing?
> 
> I was away in Chicago over the week, and didn't get to answer this.
> Sorry about that.
> 
> In the average/uncontended case, the subxact example bloats less than
> all alternatives to my design proposed to date (including the "unborn
> heap tuple" idea Robert mentioned in passing to me in person the other
> day, which I think is somewhat similar to a suggestion of Heikki's
> [1]). The average case is very important, because in general
> contention usually doesn't happen.

I can't follow here. Why does e.g. the promise tuple approach bloat more
than the subxact example?
The protocol is roughly:
1) Insert index pointer containing an xid to be waiting upon instead of
   the target tid into all indexes
2) Insert heap tuple, we can be sure there's no conflict now
3) Go through the indexes and repoint the item to point to the tid of the
   heaptuple instead of the xid.

There's zero heap or index bloat in the uncontended case. In the
contended case it's just the promise tuples from 1) that are inserted
before the conflict is detected. Those can be marked as dead when the
conflict happened.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sun, Sep 22, 2013 at 2:10 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I can't follow here. Why does e.g. the promise tuple approach bloat more
> than the subxact example?
> The protocol is roughly:
> 1) Insert index pointer containing an xid to be waiting upon instead of
>    the target tid into all indexes
> 2) Insert heap tuple, we can be sure there's no conflict now
> 3) Go through the indexes and repoint the item to point to the tid of the
>    heaptuple instead of the xid.
>
> There's zero heap or index bloat in the uncontended case. In the
> contended case it's just the promise tuples from 1) that are inserted
> before the conflict is detected. Those can be marked as dead when the
> conflict happened.

It depends on your definition of the contended case. You're assuming
that insertion is the most probable outcome, when in fact much of the
time updating is just as likely or even more likely. Many promise
tuples may be inserted before actually seeing a conflict and deciding
to update/lock for update. In order for the example in the docs to
bloat at all, both the UPDATE and the INSERT need to fail within a
tiny temporal window - anything outside that window is what I mean by
uncontended (the window is usually tiny because if the UPDATE blocks,
that often means it will succeed anyway, and if not, the INSERT will
very probably succeed).

This is because the UPDATE won't bloat when no existing row is seen,
because its subplan will return no rows. The INSERT will only bloat if
it fails, which is generally very unlikely because of the fact that
the UPDATE just did nothing. Contrast that with bloating almost every
time an UPDATE is necessary (I think that bloat that is generally
cleaned up synchronously is still bloat). That's before we even talk
about the additional overhead. Making the locks expensive to
release/clean-up could really hurt, since it appears they'll *have* to
be unlocked before row locking, and during that time concurrent
activity affecting the row to be locked can necessitate a full restart
- that's a window we want to keep as small as possible.

I think reviewer time would for now be much better spent discussing
the patch at a higher level (along the lines of my recent mail to
Stephen and Robert). I've been at least as guilty as anyone else in
getting mired in these details. We'll be much better equipped to have
this discussion afterwards, because it isn't clear to us if we really
need or would find it at all useful to have long-lasting value locks,
how frequently we'll need to retry and for what reasons, and so on.

My immediate concern as the patch author is to come up with a better
answer to the problem that Robert described [1], because "hey, I
locked the row -- you take it from here user that might not have any
version of it visible to you" is not good enough. I hope that there
isn't a tension between solving that problem and offering the
flexibility and composability of the proposed syntax.

[1] http://www.postgresql.org/message-id/AANLkTineR-rDFWENeddLg=GrkT+epMHk2j9X0YqpiTY8@mail.gmail.com

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-09-22 12:54:57 -0700, Peter Geoghegan wrote:
> On Sun, Sep 22, 2013 at 2:10 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I can't follow here. Why does e.g. the promise tuple approach bloat more
> > than the subxact example?
> > The protocol is roughly:
> > 1) Insert index pointer containing an xid to be waiting upon instead of
> >    the target tid into all indexes
> > 2) Insert heap tuple, we can be sure there's no conflict now
> > 3) Go through the indexes and repoint the item to point to the tid of the
> >    heaptuple instead of the xid.
> >
> > There's zero heap or index bloat in the uncontended case. In the
> > contended case it's just the promise tuples from 1) that are inserted
> > before the conflict is detected. Those can be marked as dead when the
> > conflict happened.
> 
> It depends on your definition of the contended case. You're assuming
> that insertion is the most probable outcome, when in fact much of the
> time updating is just as likely or even more likely. Many promise
> tuples may be inserted before actually seeing a conflict and deciding
> to update/lock for update. 

I still fail to see how that's relevant. For every index there's two
things that can happen:
a) there's a conflicting tuple. In that case we can fail at that
point/convert to an update. No Bloat.
b) there's no conflicting tuple. In that case we will insert a promise
tuple. If there's no conflict in further indexes (i.e. we INSERT), the
promise will be converted to a plain tuple. If there *is* a further
conflict, you *still* need the new index tuple because by definition
(the index changed) it cannot be an HOT update. So you convert it as
well. No Bloat.

> I think that bloat that is generally cleaned up synchronously is still
> bloat

I don't think it's particularly relevant because the above will just
cause bloat in the case of rollbacks and such, which is nothing new, but:
I fail to see the point of such a position.

> I think reviewer time would for now be much better spent discussing
> the patch at a higher level (along the lines of my recent mail to
> Stephen and Robert).

Yes, I plan to reply to those, I just didn't have time to do so this
weekend. There's other stuff than PG every now and then ;)

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sun, Sep 22, 2013 at 1:39 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I still fail to see how that's relevant. For every index there's two
> things that can happen:
> a) there's a conflicting tuple. In that case we can fail at that
> point/convert to an update. No Bloat.

Well, yes - if the conflict is in the first unique index you look at.

> b) there's no conflicting tuple. In that case we will insert a promise
> tuple.

Yeah, if there is no conflict relating to any of the tuples, the cost
is limited to updating the promise tuples in-place. Not exactly a
trivial additional cost even then though, because you have to
exclusive lock and WAL-log twice per index tuple.

> If there's no conflict in further indexes (i.e. we INSERT), the
> promise will be converted to a plain tuple.

Sure.

> If there *is* a further
> conflict, you *still* need the new index tuple because by definition
> (the index changed) it cannot be an HOT update.

By definition? What do you mean? This isn't MySQL's REPLACE. This
feature is almost certainly going to tacitly require the user to write
the upsert SQL with a particular unique index in mind (to figure that
out for ourselves, we'd need to somehow ask or infer it, which is ugly
and somewhere between very hard and impossible). The UPDATE, as
typically written, probably
*won't* actually update any of the other, incidentally
unique-constrained/value locked columns that we have to check in case
that's what the user really meant, and very probably not the
"interesting" column appearing in the UPDATE qual itself, so it
probably *will* be a HOT update.

> So you convert it as
> well. No Bloat.

Even if this is a practical possibility, which I doubt, the
bookkeeping sounds very messy and invasive indeed.

> Yes, I plan to reply to those, I just didn't have time to do so this
> weekend.

Great, thanks. I cannot emphasize strongly enough how I think that's
the way to frame all of this. So much so that I almost managed to
resist answering the above points. :-)

>  There's other stuff than PG every now and then ;)

Hope you enjoyed the hike.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Fri, Sep 20, 2013 at 8:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Sep 17, 2013 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Sep 14, 2013 at 6:27 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> Note that today there is no guarantee that the original waiter for a
>>> duplicate-inserting xact to complete will be the first one to get a
>>> second chance
>
>> ProcLockWakeup() only wakes as many waiters from the head of the queue
>> as can all be granted the lock without any conflicts.  So I don't
>> think there is a race condition in that path.
>
> Right, but what about XactLockTableWait() itself? It only acquires a
> ShareLock on the xid of the got-there-first inserter that potentially
> hasn't yet committed/aborted.

That's an interesting point.  As you pointed out in later emails, that
case is handled for heap tuple locks, but btree uniqueness conflicts
are a different kettle of fish.

> Yeah, you're right. As I mentioned to Andres already, when row locking
> happens and there is this kind of conflict, my approach is to retry
> from scratch (go right back to before value lock acquisition) in the
> sort of scenario that generally necessitates EvalPlanQual() looping,
> or to throw a serialization failure where that's appropriate. After an
> unsuccessful attempt at row locking there could well be an interim
> wait for another xact to finish, before retrying (at read committed
> isolation level). This is why I think that value locking/retrying
> should be cheap, and should avoid bloat if at all possible.
>
> Forgive me if I'm making a leap here, but it seems like what you're
> saying is that the semantics of upsert that one might naturally expect
> are *arguably* fundamentally impossible, because they entail
> potentially locking a row that isn't current to your snapshot,

Precisely.

> and you cannot throw a serialization failure at read committed.

Not sure that's true, but at least it might not be the most desirable behavior.

> I respectfully
> suggest that that exact definition of upsert isn't a useful one,
> because other snapshot isolation/MVCC systems operating within the
> same constraints must have the same issues, and yet they manage to
> implement something that could be called upsert that people seem happy
> with.

Yeah.  I wonder how they do that.

> I wouldn't go that far. The number of possible additional primitives
> that are useful isn't that high, unless we decide that LWLocks are
> going to be a fundamentally different thing, which I consider
> unlikely.

I'm not convinced, but we can save that argument for another day.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Sep 23, 2013 at 12:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Right, but what about XactLockTableWait() itself? It only acquires a
>> ShareLock on the xid of the got-there-first inserter that potentially
>> hasn't yet committed/aborted.
>
> That's an interesting point.  As you pointed out in later emails, that
> case is handled for heap tuple locks, but btree uniqueness conflicts
> are a different kettle of fish.

Right.

It suits my purposes to have the value locks be held for only an
instant, because:

1) It will perform much better and deadlock much less in certain
scenarios if sessions are given leeway to not block each other across
multiple values in multiple unique indexes (i.e. we make them
"considerate", like a person with a huge shopping cart that lets
another person with one thing ahead of them in a queue, and perhaps in
doing so greatly reduces their own wait because the guy with one thing
makes the cashier immediately say 'no' to the person with all those
groceries. Ahem). I don't think that this implies any additional
anomalies at read committed, and I'm reasonably confident that this
doesn't regress things to any degree lock starvation wise - lock
starvation can only come from a bunch of inserters of the same value
that consistently abort, just like the present situation with one
unique index (I think it's better with multiple unique indexes than
with only one - more opportunities for the would-be-starved session to
hear a definitive no answer and give up).

2) It will probably be considerably easier if and when we improve on
the buffer locking stuff (by adding a value locking SLRU) to assume
that they'll be shortly held. For example, maybe it's okay that the
implementation doesn't allow page splits on value-locked pages, and
maybe that makes things much easier to reason about. If you're
determined to have a strict serial ordering of value locking *without
serialization failures*, I think what I've already said about the
interactions between row locking and value locking demonstrates that
that's close to or actually impossible. Plus, it would really suck for
performance if that SLRU had to actually swap value locks to and from
disk, which becomes a real possibility if they're really long held
(mere index scans aren't going to keep the cache warm, so the
worst-case latency for an innocent inserter into some narrow range of
values might be really bad).

Speaking of ease of implementation, how do you guarantee that the
value locking waiters get the right to insert in serial order (if
that's something that you value, which I don't at RC)? You have to fix
the same "race" that already exists when acquiring a ShareLock on an
xid, and blocking on value lock acquisition. The only possible remedy
I can see for that is to integrate heap and btree locking in a much
more intimate and therefore sketchy way. You need something like
LockTuple() to arbitrate ordering, but what, and how, and where, and
with how many buffer locks held?

Most importantly:

3) As I've already mentioned, heavy value locks (like promise tuples
or similar schemes, as opposed to actual heavyweight locks)
concomitantly increase the window in which a conflict can be created
for row locking. Most transactions last but an instant, and so the
fact that another session may already be blocked trying to lock the
would-be-duplicate row perhaps isn't that relevant. Doing all that
clean-up is going to give other sessions increased opportunity to lock
the row themselves, and ruin our day.

But these points are about long held value locks, not the cost of
making their acquisition relatively expensive or inexpensive (but
still more or less instantaneous), so why mention that at all? Well,
since we're blocking everyone else with our value locks, they get to
have a bad day too. All the while, they're perhaps virtually
pre-destined to find some row to lock, but the window for something to
happen to that row that conflicts with eventual row locking (an
*unnecessary* conflict, as for example when an innocuous HOT update
occurs) gets larger and larger as they wait longer and longer on value
locks. Loosely speaking, things get multiplicatively worse - total
gridlock is probably possible, with the deadlock detector only
breaking the gridlock up a bit if we get lucky (unless, maybe, if
value locks last until transaction end...which I think is nigh on
impossible anyway).

The bottom line is that long lasting value locks - value locks that
last the duration of a transaction and are acquired serially, while
guaranteeing that the inserter that gets all the value locks needed
itself gets to insert - have the potential to cascade horribly, in
ways that I can only really begin to reason about. That is, they do
*if* we presume that they have the interactions with row locking that
I believe they do, a belief that no one has taken issue with yet.

Even *considering* this is largely academic, though, because without
some kind of miracle guaranteeing serial ordering, a miracle that
doesn't allow for serialization failures and also doesn't seriously
slow down, for example, updates (by making them care about value locks
*before* they do anything, or in the case of HOT updates *at all*),
all of this is _impossible_. So, I say let's just do the
actually-serial-ordering for value lock acquisition with serialization
failures where we're > read committed. I've seriously considered what
it would take to do it any other way so things would work how you and
Andres expect for read committed, and it makes my head hurt, because
apart from seeming unnecessary to me, it also seems completely
hopeless.

Am I being too negative here? Well, I guess that's possible. The fact
is that it's really hard to judge, because all of this is really hard
to reason about. That's what I really don't like about it.

>> I respectfully
>> suggest that that exact definition of upsert isn't a useful one,
>> because other snapshot isolation/MVCC systems operating within the
>> same constraints must have the same issues, and yet they manage to
>> implement something that could be called upsert that people seem happy
>> with.
>
> Yeah.  I wonder how they do that.

My guess is that they have some fancy snapshot type that is used by
the equivalent of ModifyTable subplans, that is appropriately paranoid
about the Halloween problem and so on. How that actually might work is
far from clear, but it's a question that I have begun to consider. As
I said, a concern is that it would be in tension with the generalized,
composable syntax, where we don't explicitly have a "special update".
I'd really like to hear other opinions, though.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Sep 23, 2013 at 12:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> and you cannot throw a serialization failure at read committed.
>
> Not sure that's true, but at least it might not be the most desirable behavior.

I'm pretty sure that that's totally true. "You don't have to worry
about serialization failures at read committed, except when you do"
seems kind of weak to me. Especially since none of the usual suspects
say the same thing. That said, it sure would be convenient if it
wasn't true!

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - visibility semantics

From
Andres Freund
Date:
Hi,

Various messages are discussing semantics around visibility. By now I
have a hard time keeping track. So let's keep the discussion of the
desired semantics to this thread.

There have been some remarks about serialization failures in read
committed transactions. I agree, those shouldn't occur. But I don't
actually think they are so much of a problem if we follow the path set
by existing uses of the EPQ logic. The scenario described seems to be an
UPSERT conflicting with a row it cannot see in the original snapshot of
the query.
In that case I think we just have to follow the example laid by
ExecUpdate, ExecDelete and heap_lock_tuple. Use the EPQ machinery (or an
alternative approach with similar enough semantics) to get a new
snapshot and follow the ctid chain. When we've found the end of the
chain we try to update that tuple.
That surely isn't free of surprising semantics, but it would follow existing
semantics. Which everybody writing concurrent applications in read
committed should (but doesn't) know. Adding a different set of semantics
seems like a bad idea.
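
To illustrate the existing semantics I mean, consider this sketch
(plain UPDATEs at read committed; a table t(k int primary key, v int)
with one committed row (1, 0) is assumed):

-- Session 1:
begin;
update t set v = v + 1 where k = 1;

-- Session 2 (blocks on session 1's row lock):
update t set v = v + 1 where k = 1;

-- Session 1:
commit;

-- Session 2 now unblocks: the EPQ machinery re-evaluates its qual
-- against the latest version of the row - a version that was never
-- visible to session 2's snapshot - and updates it, so v ends up 2.
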
Robert seems to have been the primary sceptic around this - what scenario
are you actually concerned about?

There are some scenarios that this doesn't trivially answer. But I'd like to
understand the primary concerns first.

Regards,

Andres

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Mon, Sep 23, 2013 at 7:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It suits my purposes to have the value locks be held for only an
> instant, because:
>
> [ detailed explanation ]

I don't really disagree with any of that.  TBH, I think the question
of how long value locks (as you call them) are held is going to boil
down to a question of how they end up being implemented.   As I
mentioned to you at PG Open (going through the details here for those
following along at home), we could optimistically insert the new heap
tuple, then go add index entries for it, and if we find a conflict,
then instead of erroring out, we mark the tuple we were inserting dead
and go try to update the conflicting tuple instead.  In that
implementation, if we find that we have to wait for some other
transaction along the way, it's probably not worth reversing out the
index entries already inserted, because getting them into the index in
the first place was a WAL-logged operation, and therefore relatively
expensive, and IMHO it's most likely better to just hope things work
out than to risk having to redo all of that.

On the other hand, if the locks are strictly in-memory, then the cost
of releasing them all before we go to wait, and of reacquiring them
after we finish waiting, is pretty low.  There might be some
modularity issues to work through there, but they might not turn out
to be very painful, and the advantages you mention are certainly worth
accruing if it turns out to be fairly straightforward.

Personally, I think that trying to keep it all in-memory is going to
be hard.  The problem is that we can't de-optimize regular inserts or
updates to any significant degree to cater to this feature - because
as valuable as this feature is, the number of times it gets used is
still going to be a whole lot smaller than the number of times it
doesn't get used.  Also, I tend to think that we might want to define
the operation as a REPLACE-type operation with respect to a certain
set of key columns; and so we'll do the insert-or-update behavior with
respect only to the index on those columns and let the chips fall
where they may with respect to any others.  In that case this all
becomes much less urgent.

> Even *considering* this is largely academic, though, because without
> some kind of miracle guaranteeing serial ordering, a miracle that
> doesn't allow for serialization failures and also doesn't seriously
> slow down, for example, updates (by making them care about value locks
> *before* they do anything, or in the case of HOT updates *at all*),
> all of this is _impossible_. So, I say let's just do the
> actually-serial-ordering for value lock acquisition with serialization
> failures where we're > read committed. I've seriously considered what
> it would take to do it any other way so things would work how you and
> Andres expect for read committed, and it makes my head hurt, because
> apart from seeming unnecessary to me, it also seems completely
> hopeless.
>
> Am I being too negative here? Well, I guess that's possible. The fact
> is that it's really hard to judge, because all of this is really hard
> to reason about. That's what I really don't like about it.

Suppose we define the operation as REPLACE rather than INSERT...ON
DUPLICATE KEY LOCK FOR UPDATE.  Then we could do something like this:

1. Try to insert a tuple.  If no unique index conflicts occur, stop.
2. Note the identity of the conflicting tuple and mark the inserted
heap tuple dead.
3. If the conflicting tuple's inserting transaction is still in
progress, wait for the inserting transaction to end.
4. If the conflicting tuple is dead (e.g. because the inserter
aborted), start over.
5. If the conflicting tuple's key columns no longer match the key
columns of the REPLACE operation, start over.
6. If the conflicting tuple has a valid xmax, wait for the deleting or
locking transaction to end.  If xmax is still valid, follow the CTID
chain to the updated tuple, let that be the new conflicting tuple, and
resume from step 5.
7. Update the tuple, even though it may be invisible to our snapshot
(a deliberate MVCC violation!).

While this behavior is admittedly wonky from an MVCC perspective, I
suspect that it would make a lot of people happy.
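
To make the wonkiness concrete, here is a sketch of step 7 in action
(the REPLACE syntax is hypothetical; a table t(k int primary key,
v text) with one committed row (1, 'old') is assumed):

-- Session 1:
begin;
update t set v = 'new' where k = 1;

-- Session 2:
begin;
select v from t where k = 1;       -- snapshot sees 'old'
replace into t (k, v) values (1, 'mine');
-- conflicts on k; blocks at step 6 on session 1's xmax

-- Session 1:
commit;

-- Session 2 unblocks, follows the ctid chain to the (1, 'new')
-- version - one its snapshot has never seen - and updates it anyway,
-- per step 7.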

>>> I respectfully
>>> suggest that that exact definition of upsert isn't a useful one,
>>> because other snapshot isolation/MVCC systems operating within the
>>> same constraints must have the same issues, and yet they manage to
>>> implement something that could be called upsert that people seem happy
>>> with.
>>
>> Yeah.  I wonder how they do that.
>
> My guess is that they have some fancy snapshot type that is used by
> the equivalent of ModifyTable subplans, that is appropriately paranoid
> about the Halloween problem and so on. How that actually might work is
> far from clear, but it's a question that I have begun to consider. As
> I said, a concern is that it would be in tension with the generalized,
> composable syntax, where we don't explicitly have a "special update".
> I'd really like to hear other opinions, though.

The tension here feels fairly fundamental to me; I don't think our
implementation is to blame.  I think the problem isn't so much to
figure out a clever trick that will make this all work in a truly
elegant fashion as it is to decide exactly how we're going to
compromise MVCC semantics in the least blatant way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - visibility semantics

From
Robert Haas
Date:
On Tue, Sep 24, 2013 at 5:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Various messages are discussing semantics around visibility. I by now
> have a hard time keeping track. So let's keep the discussion of the
> desired semantics to this thread.
>
> There have been some remarks about serialization failures in read
> committed transactions. I agree, those shouldn't occur. But I don't
> actually think they are so much of a problem if we follow the path set
> by existing uses of the EPQ logic. The scenario described seems to be an
> UPSERT conflicting with a row it cannot see in the original snapshot of
> the query.
> In that case I think we just have to follow the example laid by
> ExecUpdate, ExecDelete and heap_lock_tuple. Use the EPQ machinery (or an
> alternative approach with similar enough semantics) to get a new
> snapshot and follow the ctid chain. When we've found the end of the
> chain we try to update that tuple.
> That surely isn't free of surprising semantics, but it would follow existing
> semantics. Which everybody writing concurrent applications in read
> committed should (but doesn't) know. Adding a different set of semantics
> seems like a bad idea.
> Robert seems to have been the primary sceptic around this - what scenario
> are you actually concerned about?

I'm not skeptical about offering it as an option; in fact, I just
suggested basically the same thing on the other thread, before reading
this.  Nonetheless it IS an MVCC violation; the chances that someone
will be able to use this new facility to demonstrate serialization
anomalies that can't occur today seem very high to me.  I feel it's
perfectly fine to respond to that by saying: yep, we know that's
possible, if it's a concern in your environment then don't use this
feature.  But it should be clearly documented.

I do think that it will be easier to get this to work if we have a
define the operation as REPLACE, bundling all of the magic inside a
single SQL command.  If the user issues an INSERT first and then must
try an UPDATE afterwards if the INSERT doesn't actually insert, then
you're going to have problems if the UPDATE can't see the tuple with
which the INSERT conflicted, and you're going to need some kind of a
loop in case the UPDATE itself fails.  Even if we can work out all the
details, a single command that does insert-or-update seems like it
will be easier to use and more efficient.  You might also want to
insert multiple tuples using INSERT ... VALUES (...), (...), (...);
figuring out which ones were inserted and which ones must now be
updated seems like a chore better avoided.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Sep 24, 2013 at 7:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't really disagree with any of that.  TBH, I think the question
> of how long value locks (as you call them) are held is going to boil
> down to a question of how they end up being implemented.

Well, I think we can rule out value locks that are held for the
duration of a transaction right away. That's just not going to fly.

> As I mentioned to you at PG Open (going through the details here for those
> following along at home), we could optimistically insert the new heap
> tuple, then go add index entries for it, and if we find a conflict,
> then instead of erroring out, we mark the tuple we were inserting dead
> and go try to update the conflicting tuple instead.  In that
> implementation, if we find that we have to wait for some other
> transaction along the way, it's probably not worth reversing out the
> index entries already inserted, because getting them into the index in
> the first place was a WAL-logged operation, and therefore relatively
> expensive, and IMHO it's most likely better to just hope things work
> out than to risk having to redo all of that.

I'm afraid that there are things that concern me about this design. It
does have one big advantage over promise-tuples, which is that the
possibility of index-only bloat, and even the possible need to freeze
indexes separately from their heap relation, are averted (or are you
going to have recovery do promise clean-up instead? Does recovery look
for an eventual successful insertion relating to the promise? How far
does it look?). However, while I'm just as concerned as you that
backing out is too expensive, I'm equally concerned that there is no
reasonable alternative to backing out, which is why cheap, quick
in-memory value locks are so compelling to me. See my remarks below.

> On the other hand, if the locks are strictly in-memory, then the cost
> of releasing them all before we go to wait, and of reacquiring them
> after we finish waiting, is pretty low.  There might be some
> modularity issues to work through there, but they might not turn out
> to be very painful, and the advantages you mention are certainly worth
> accruing if it turns out to be fairly straightforward.

It's certainly a difficult situation to judge.

> Personally, I think that trying to keep it all in-memory is going to
> be hard.  The problem is that we can't de-optimize regular inserts or
> updates to any significant degree to cater to this feature - because
> as valuable as this feature is, the number of times it gets used is
> still going to be a whole lot smaller than the number of times it
> doesn't get used.

Right - I don't think that anyone would argue that any other standard
should be applied. Fortunately, I'm reasonably confident that it can
work. The last part of index tuple insertion, where we acquire an
exclusive lock on a buffer, needs to look out for a page header bit
(on pages considered for insertion of its value). The cost of that to
anyone not using this feature is likely to be infinitesimally small.
We can leave clean-up of that bit to the next inserter, who needs the
exclusive lock anyway and doesn't find a corresponding SLRU entry. But
really, that's a discussion for another day. I think we'd want to
track value locks per pinned-by-upserter buffer, to localize any
downsides on concurrency. If we forbid page-splits in respect of a
value-locked page, we can still have a stable value (buffer number) to
use within a shared memory hash table, or something along those lines.
We're still going to want to minimize the duration of locking under
this scheme, by doing TOASTing before locking values and so on, which
is quite possible.

If we're really lucky, maybe the value locking stuff can be
generalized or re-used as part of a btree index insertion buffer
feature.

> Also, I tend to think that we might want to define
> the operation as a REPLACE-type operation with respect to a certain
> set of key columns; and so we'll do the insert-or-update behavior with
> respect only to the index on those columns and let the chips fall
> where they may with respect to any others.  In that case this all
> becomes much less urgent.

Well, MySQL's REPLACE does zero or more DELETEs followed by an INSERT,
not try an INSERT, then maybe mark the heap tuple if there's a unique
index dup and then go UPDATE the conflicting tuple. I mention this
only because the term REPLACE has a certain baggage, and I feel it's
important to be careful about such things.
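
For instance, in MySQL a single REPLACE can delete two rows and insert
one, when the proposed row conflicts with different existing rows in
different unique indexes (MySQL syntax; my own example, not from their
docs):

create table t (a int primary key, b int unique, c text);
insert into t values (1, 10, 'x'), (2, 20, 'y');
replace into t values (1, 20, 'z');
-- conflicts with the first row on "a" and the second row on "b";
-- both are deleted and the new row inserted (3 rows affected),
-- leaving only (1, 20, 'z')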

The only way that's going to work is if you say "use this unique
index", which will look pretty gross in DML. That might actually be
okay with me if we had somewhere to go from there in a future release,
but I doubt that's the case. Another issue is that I'm not sure that
this helps Andres much (or rather, clients of the logical changeset
generation infrastructure that need to do conflict resolution), and
that matters a lot to me here.

> Suppose we define the operation as REPLACE rather than INSERT...ON
> DUPLICATE KEY LOCK FOR UPDATE.  Then we could do something like this:
>
> 1. Try to insert a tuple.  If no unique index conflicts occur, stop.
> 2. Note the identity of the conflicting tuple and mark the inserted
> heap tuple dead.
> 3. If the conflicting tuple's inserting transaction is still in
> progress, wait for the inserting transaction to end.

Sure, this is basically what the code does today (apart from marking a
just-inserted tuple dead).

> 4. If the conflicting tuple is dead (e.g. because the inserter
> aborted), start over.

Start over from where? I presume you mean the index tuple insertion,
as things are today. Or do you mean the very start?

> 5. If the conflicting tuple's key columns no longer match the key
> columns of the REPLACE operation, start over.

What definition of equality or inequality? I think you're going to
have to consider stashing information about the btree operator class,
which seems not ideal - a modularity violation beyond what we already
do in, say, execQual.c, I think. I think in general we have to worry
about the distinction between a particular btree operator class's idea
of equality (doesn't have to be = operator), that exists for a
particular index, and some qual's idea of equality. It would probably
be quite invasive to fix this, which I for one would find hard to
justify.

I think my scheme is okay here while yours isn't, because mine
involves row locking only, and hoping that nothing gets updated in
that tiny window after transaction commit - if it doesn't, that's good
enough for us, because we know that the btree code's opinion still
holds - if I'm not mistaken, *nothing* can have changed to the logical
row without us hearing about it (i.e. without heap_lock_tuple()
returning HeapTupleUpdated). On the other hand, you're talking about
concluding that something is not a duplicate in a way that needs to
satisfy btree unique index equality (so whatever operator is
associated with btree strategy number 3, equality, for some particular
unique index with some particular operator class) and not necessarily
just a qual written with a potentially distinct notion of equality in
respect of the relevant unique-constrained datums.

Maybe you can solve this one problem, but the fact remains that to do
so would be a pretty bad modularity violation, even by the standards
of the existing btree code. That's the basic reason why I'm averse to
using EvalPlanQual() in this fashion, or in a similar fashion. Even if
you solve all the problems for btree, I can't imagine what type of
burden it puts on amcanunique AM authors generally - I know at least
one person who won't be happy with that.  :-)

> 6. If the conflicting tuple has a valid xmax, wait for the deleting or
> locking transaction to end.  If xmax is still valid, follow the CTID
> chain to the updated tuple, let that be the new conflicting tuple, and
> resume from step 5.

So you've arbitrarily restricted us to one value lock and one row lock
per REPLACE slot processed, which sort of allows us to avoid solving
the basic problem of value locking, because it isn't too bad now - no
need to backtrack across indexes. Clean-up (marking the heap tuple
dead) is much more expensive than releasing locks in memory (although
much less expensive than promise tuple killing), but needing to
clean-up is maybe less likely because conflicts can only come from one
unique index. Has this really bought us anything, though? Consider
that conflicts are generally only expected on one unique index anyway.
Plus you still have the disconnect between value and row locking, as
far as I can tell - "start from scratch" remains a possible step until
very late, except you pay a lot more for clean-up - avoiding that
expensive clean-up is the major benefit of introducing an SLRU-based
shadow value locking scheme to the btree code. I don't see that there
is a way to deal with the value locking/row locking disconnect other
than to live with it in a smart way.

Anyway, your design probably avoids the worst kind of gridlock. Let's
assume that it works out -- my next question has to be, where can we
go from there?

> 7. Update the tuple, even though it may be invisible to our snapshot
> (a deliberate MVCC violation!).

I realize that you just wanted to sketch a design, but offhand I think
that the basic problem with what you describe is that it isn't
accepting of the inevitability of there being a disconnect between
value and row locking. Also, this doesn't fit with any roadmap for
getting a real upsert, and compromises the conceptual integrity of the
AM in a way that isn't likely to be accepted, and, at the risk of
saying too much before you've defended your design, perhaps even
necessitates invasive changes to the already extremely complicated row
locking code.

> While this behavior is admittedly wonky from an MVCC perspective, I
> suspect that it would make a lot of people happy.

"Wonky from an MVCC perspective" is the order of the day here. :-)

>> My guess is that they have some fancy snapshot type that is used by
>> the equivalent of ModifyTable subplans, that is appropriately paranoid
>> about the Halloween problem and so on. How that actually might work is
>> far from clear, but it's a question that I have begun to consider. As
>> I said, a concern is that it would be in tension with the generalized,
>> composable syntax, where we don't explicitly have a "special update".
>> I'd really like to hear other opinions, though.
>
> The tension here feels fairly fundamental to me; I don't think our
> implementation is to blame.  I think the problem isn't so much to
> figure out a clever trick that will make this all work in a truly
> elegant fashion as it is to decide exactly how we're going to
> compromise MVCC semantics in the least blatant way.

Yeah, I totally understand the problem that way. I think it would be a
bit of a pity to give up the composability, which I liked, but it's
something that we'll have to consider. On the other hand, perhaps we
can get away with it - we simply don't know enough yet.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Bruce Momjian
Date:
On Sat, Sep 21, 2013 at 05:07:11PM -0700, Peter Geoghegan wrote:
> In the average/uncontended case, the subxact example bloats less than
> all alternatives to my design proposed to date (including the "unborn
> heap tuple" idea Robert mentioned in passing to me in person the other
> day, which I think is somewhat similar to a suggestion of Heikki's
> [1]). The average case is very important, because in general
> contention usually doesn't happen.

This thread had a lot of discussion about bloating.  I wonder, does the
code check to see if there is a matching row _before_ adding any data? 
Our test-and-set code first checks to see if the lock is free, then if
it it is, it locks the bus and does a test-and-set.   Couldn't we easily
check the indexes for matches before doing any locking?  It seems that
would avoid bloat in most cases, and allow for a simpler implementation.
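
In SQL terms, I have in mind something like the following pre-check,
which writes nothing when a match already exists (my own sketch - and
I realize that on its own it is racy, since two sessions can both pass
the check and then both try to insert):

insert into foo(a, b)
select 5, '!'
where not exists (select 1 from foo where a = 5);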

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Wed, Sep 25, 2013 at 8:19 PM, Bruce Momjian <bruce@momjian.us> wrote:
> This thread had a lot of discussion about bloating.  I wonder, does the
> code check to see if there is a matching row _before_ adding any data?

That's pretty much what the patch does.

> Our test-and-set code first checks to see if the lock is free, then if
> it is, it locks the bus and does a test-and-set.  Couldn't we easily
> check the indexes for matches before doing any locking?  It seems that
> would avoid bloat in most cases, and allow for a simpler implementation.

The value locks are only really necessary for getting consensus across
unique indexes on whether or not to go forward, and to ensure that
insertion can *finish* unhindered once we're sure that's appropriate.
Once we've committed to insertion, we hold them across heap tuple
insertion and release each value lock as part of something close to
conventional btree index tuple insertion (with an index tuple with an
ordinary heap pointer inserted). I believe that all schemes proposed
to date have some variant of what could be described as value locking,
such as ordinary index tuples inserted speculatively.

Value locks are *not* held during row locking, and an attempt at row
locking is essentially opportunistic for various reasons (it boils
down to the fact that re-verifying uniqueness outside of the btree
code is very unappealing, and in any case would naturally sometimes be
insufficient - what happens if key values change across row
versions?). This might sound a bit odd, but is in a sense no different
to the current state of affairs, where the first waiter on a blocking
xact that inserted a would-be duplicate is not guaranteed to be the
first to get a second chance at inserting. I don't believe that there
are any significant additional lock starvation hazards.

In the simple case where there is a conflicting tuple that's already
committed, value locks above and beyond what the btree code does today
are unnecessary (provided the attempt to acquire a row lock is
eventually successful, which mostly means that no one else has
updated/deleted - otherwise we try again).

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Bruce Momjian
Date:
On Wed, Sep 25, 2013 at 08:48:11PM -0700, Peter Geoghegan wrote:
> On Wed, Sep 25, 2013 at 8:19 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > This thread had a lot of discussion about bloating.  I wonder, does the
> > code check to see if there is a matching row _before_ adding any data?
> 
> That's pretty much what the patch does.

So, I guess my question is: if we are only bloating on a contended
operation, do we expect that to happen so much that bloat is a problem?

I think the big objection to the patch is the additional code complexity
and the potential to slow down other sessions.  If it is only bloating
on a contended operation, are these two downsides worth it just to
avoid the bloat?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Bruce Momjian
Date:
On Thu, Sep 26, 2013 at 07:43:15AM -0400, Bruce Momjian wrote:
> On Wed, Sep 25, 2013 at 08:48:11PM -0700, Peter Geoghegan wrote:
> > On Wed, Sep 25, 2013 at 8:19 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > > This thread had a lot of discussion about bloating.  I wonder, does the
> > > code check to see if there is a matching row _before_ adding any data?
> > 
> > That's pretty much what the patch does.
> 
> So, I guess my question is: if we are only bloating on a contended
> operation, do we expect that to happen so much that bloat is a problem?
> 
> I think the big objection to the patch is the additional code complexity
> and the potential to slow down other sessions.  If it is only bloating
> on a contended operation, are these two downsides worth it just to
> avoid the bloat?

Also, this isn't like the case where we are incrementing sequences --- I
am unclear what workload is going to cause a lot of contention.  If two
sessions try to insert the same key, there will be bloat, but later
upsert operations will already see the insert and not cause any bloat.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Sep 26, 2013 at 4:43 AM, Bruce Momjian <bruce@momjian.us> wrote:
> So, I guess my question is if we are only bloating on a contended
> operation, do we expect that to happen so much that bloat is a problem?

Maybe I could have done a better job of explaining the nature of my
concerns around bloat.

I am specifically concerned about bloat and the clean-up of bloat that
occurs between (or during) value locking and eventual row locking,
because of the necessarily opportunistic nature of the way we go from
one to the other. Bloat, and the obligation to clean it up
synchronously, make row lock conflicts more likely. Conflicts make
bloat more likely, because a conflict implies that another iteration,
complete with more bloat, is necessary.

When you consider that the feature will frequently be used with the
assumption that updating is a much more likely outcome, it becomes
clear that we need to be careful about this sort of interplay.

Having said all that, I would have no objection to some reasonable,
bound amount of bloat occurring elsewhere if that made sense. For
example, I'd certainly be happy to consider the question of whether or
not it's worth doing a kind of speculative heap insertion before
acquiring value locks, because that doesn't need to happen again and
again in the same, critical place, in the interim between value
locking and row locking. The advantage of doing that particular thing
would be to reduce the duration that value locks are held - the
disadvantages would be the *usual* disadvantages of bloat. However,
this is obviously a premature discussion to have now, because the
eventual exact nature of value locks is not known.

> I think the big objection to the patch is the additional code complexity
> and the potential to slow down other sessions.  If it is only bloating
> on a contended operation, are these two downsides worth avoiding the
> bloat?

I believe that all other schemes proposed have some degree of bloat
even in the uncontended case, because they optimistically assume that
an insert will occur, when in general an update is perhaps just as
likely, and will bloat just the same. So, as I've said before,
definition of uncontended is important here.

There is no reason to assume that alternative proposals will affect
concurrency any less than my proposal - the buffer locking thing
certainly isn't essential to my design. You need to weigh things like
WAL-logging multiple times, which other proposals require. You're right
to say that all of this is complex, but I really think that quite
apart from anything else, my design is simpler than others. For
example, the design that Robert sketched would introduce a fairly
considerable modularity violation, per my recent remarks to him, and
actually plastering over that would be a considerable undertaking.
Now, you might counter, "but those other designs haven't been worked
out enough". That's true, but then my efforts to work them out further
by pointing out problems with them haven't gone very far. I have
sincerely tried to see a way to make them work.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Tue, Sep 24, 2013 at 10:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Well, I think we can rule out value locks that are held for the
> duration of a transaction right away. That's just not going to fly.

I think I agree with that.  I don't think I remember hearing that proposed.

> If we're really lucky, maybe the value locking stuff can be
> generalized or re-used as part of a btree index insertion buffer
> feature.

Well, that would be nifty.

>> Also, I tend to think that we might want to define
>> the operation as a REPLACE-type operation with respect to a certain
>> set of key columns; and so we'll do the insert-or-update behavior with
>> respect only to the index on those columns and let the chips fall
>> where they may with respect to any others.  In that case this all
>> becomes much less urgent.
>
> Well, MySQL's REPLACE does zero or more DELETEs followed by an INSERT,
> rather than trying an INSERT, then maybe marking the heap tuple if
> there's a unique index dup and then going to UPDATE the conflicting
> tuple. I mention this
> only because the term REPLACE has a certain baggage, and I feel it's
> important to be careful about such things.

I see.  Well, we could try to mimic their semantics, I suppose.  Those
semantics seem like a POLA violation to me; who would have thought
that a REPLACE could delete multiple tuples?  But what do I know?
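
To make that concrete, here is a sketch of the MySQL behavior in
question (MySQL syntax, invented table - not something being proposed
here):

create table t (a int primary key, b int unique);
insert into t values (1, 1), (2, 2);
replace into t values (1, 2);
-- The new row conflicts with (1,1) on the primary key and with (2,2)
-- on the unique index, so REPLACE deletes *both* existing rows before
-- inserting (1,2): two rows deleted, one inserted.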

> The only way that's going to work is if you say "use this unique
> index", which will look pretty gross in DML. That might actually be
> okay with me if we had somewhere to go from there in a future release,
> but I doubt that's the case. Another issue is that I'm not sure that
> this helps Andres much (or rather, clients of the logical changeset
> generation infrastructure that need to do conflict resolution), and
> that matters a lot to me here.

Yeah, it's kind of awful.

>> Suppose we define the operation as REPLACE rather than INSERT...ON
>> DUPLICATE KEY LOCK FOR UPDATE.  Then we could do something like this:
>>
>> 1. Try to insert a tuple.  If no unique index conflicts occur, stop.
>> 2. Note the identity of the conflicting tuple and mark the inserted
>> heap tuple dead.
>> 3. If the conflicting tuple's inserting transaction is still in
>> progress, wait for the inserting transaction to end.
>
> Sure, this is basically what the code does today (apart from marking a
> just-inserted tuple dead).
>
>> 4. If the conflicting tuple is dead (e.g. because the inserter
>> aborted), start over.
>
> Start over from where? I presume you mean the index tuple insertion,
> as things are today. Or do you mean the very start?

Yes, that's what I meant.

>> 5. If the conflicting tuple's key columns no longer match the key
>> columns of the REPLACE operation, start over.
>
> What definition of equality or inequality?

Binary equality, same as we'd use to decide whether an update can be done HOT.

>> 7. Update the tuple, even though it may be invisible to our snapshot
>> (a deliberate MVCC violation!).
>
> I realize that you just wanted to sketch a design, but offhand I think
> that the basic problem with what you describe is that it isn't
> accepting of the inevitability of there being a disconnect between
> value and row locking. Also, this doesn't fit with any roadmap for
> getting a real upsert,

Well, there are two separate issues here: what to do about MVCC, and
how to do the locking.  From an MVCC perspective, I can think of only
two behaviors when the conflicting tuple is committed but invisible:
roll back, or update it despite it being invisible.  If you're saying
you don't like either of those choices, I couldn't agree more, but I
don't have a third idea.  If you do, I'm all ears.

In terms of how to do the locking, what I'm mostly saying is that we
could try to implement this in a way that invents as few new concepts
as possible.  No promise tuples, no new SLRU, no new page-level bits,
just index tuples and heap tuples and so on.  Ideally, we don't even
change the WAL format, although step 2 might require a new record
type.  To the extent that what I actually described was at variance
with that goal, consider it a defect in my explanation rather than an
intent to vary.  I think there's value in considering such an
implementation because each new thing that we have to introduce in
order to get this feature is a possible reason for it to be rejected -
for modularity reasons, or because it hurts performance elsewhere, or
because it's more code we have to maintain, or whatever.

Now, what I hear you saying is, gee, the performance of that might be
terrible.  I'm not sure that I believe that, but it's possible that
you're right. Much seems to depend on what you think the frequency of
conflicts will be, and perhaps I'm assuming it will be low while
you're assuming a higher value.  Regardless, if the performance of the
sort of implementation I'm talking about would be terrible (under some
agreed-upon definition of what terrible means in this context), then
that's a good argument for not doing it that way.  I'm just not
convinced that's the case.

Basically, if there's a way we can do this without changing the
on-disk format (even in a backward-compatible way), I'd be strongly
inclined to go that route unless we have a really compelling reason to
believe it's going to suck (or be outright impossible).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Thu, Sep 26, 2013 at 3:07 PM, Peter Geoghegan <pg@heroku.com> wrote:
> When you consider that the feature will frequently be used with the
> assumption that updating is a much more likely outcome, it becomes
> clear that we need to be careful about this sort of interplay.

I think one thing that's pretty clear at this point is that almost any
version of this feature could be optimized for either the insert case
or the update case.  For example, my proposal could be modified to
search for a conflicting tuple first, potentially wasting an index
probe (or multiple index probes, if you want to search for potential
conflicts in multiple indexes) if we're inserting, but winning heavily
in the update case.  As written, it's optimized for the insert case.

In fact, I don't know how to know which of these things we should
optimize for.  I wrote part of the code for an EDB proprietary feature
that can do insert-or-update loads about 6 months ago[1], and we
optimized it for updates.  That was not, however, a matter of
principle; it just turned out to be easier to implement that way.  In
fact, I would have assumed that the insert-mostly case was more
likely, but I think the real answer is that some environments will be
insert-mostly and some will be update-mostly and some will be a mix.

If we really want to squeeze out every last drop of possible
performance, we might need two modes: one that assumes we'll mostly
insert, and another that assumes we'll mostly update.  That seems a
frustrating amount of detail to have to expose to the user; an
implementation that was efficient in both cases would be very
desirable, but I do not have a good idea how to get there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] In case you're wondering, attempting to use that feature to upsert
an invisible tuple will result in the load failing with a unique index
violation.



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Bruce Momjian
Date:
On Thu, Sep 26, 2013 at 03:33:34PM -0400, Robert Haas wrote:
> On Thu, Sep 26, 2013 at 3:07 PM, Peter Geoghegan <pg@heroku.com> wrote:
> > When you consider that the feature will frequently be used with the
> > assumption that updating is a much more likely outcome, it becomes
> > clear that we need to be careful about this sort of interplay.
> 
> I think one thing that's pretty clear at this point is that almost any
> version of this feature could be optimized for either the insert case
> or the update case.  For example, my proposal could be modified to
> search for a conflicting tuple first, potentially wasting an index
> probe (or multiple index probes, if you want to search for potential
> conflicts in multiple indexes) if we're inserting, but winning heavily
> in the update case.  As written, it's optimized for the insert case.

I assumed the code was going to do the index lookups first without a
lock, and take the appropriate action, insert or update, with fallbacks
for guessing wrong.
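
For comparison, the unreliable client-side version of that idea looks
something like this (invented table; the races are the point):

update t set v = 'x' where k = 1;
-- if zero rows were affected:
insert into t (k, v) values (1, 'x');
-- ...which can still raise a unique violation if another session
-- inserts k = 1 in between - hence the need for fallbacks when the
-- initial guess turns out wrong.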

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Sep 26, 2013 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, I think we can rule out value locks that are held for the
>> duration of a transaction right away. That's just not going to fly.
>
> I think I agree with that.  I don't think I remember hearing that proposed.

I think I might have been unclear - I mean locks that are held for the
duration of *another* transaction, not our own, as we wait for that
other transaction to commit/abort. I think that earlier remarks from
yourself and Andres implied that this would be necessary. Perhaps I'm
mistaken. Your most recent design proposal doesn't do this, but I
think that that's only because it restricts the user to a single
unique index - it would otherwise be necessary to sit on the earlier
value locks (index tuples belonging to an unfinished transaction)
pending the completion of some other conflicting transaction, which
has numerous disadvantages (as described in my "it suits my purposes
to have the value locks be held for only an instant" mail to you [1]).

>> If we're really lucky, maybe the value locking stuff can be
>> generalized or re-used as part of a btree index insertion buffer
>> feature.
>
> Well, that would be nifty.

Yes, it would. I think, based on a conversation with Rob Wultsch, that
it's another area where MySQL still does quite a bit better.

> I see.  Well, we could try to mimic their semantics, I suppose.  Those
> semantics seem like a POLA violation to me; who would have thought
> that a REPLACE could delete multiple tuples?  But what do I know?

I think that it's fairly widely acknowledged to not be very good.
Every MySQL user uses INSERT...ON DUPLICATE KEY UPDATE instead.

>> The only way that's going to work is if you say "use this unique
>> index", which will look pretty gross in DML.

> Yeah, it's kind of awful.

It is.

>> What definition of equality or inequality?
>
> Binary equality, same as we'd use to decide whether an update can be done HOT.

I guess that's acceptable in theory, because binary equality is
necessarily a *stricter* condition than equality according to some
operator that is an equivalence relation. But the fact remains that
you're just ameliorating the problem by making it happen less often
(both through this kind of trick, but also by restricting us to one
unique index), not actually fixing it.

> Well, there are two separate issues here: what to do about MVCC, and
> how to do the locking.

Totally agreed. Fortunately, unlike the different aspects of value and
row locking, I think that these two questions can be reasonably
considered independently.

> From an MVCC perspective, I can think of only
> two behaviors when the conflicting tuple is committed but invisible:
> roll back, or update it despite it being invisible.  If you're saying
> you don't like either of those choices, I couldn't agree more, but I
> don't have a third idea.  If you do, I'm all ears.

I don't have another idea either. In fact, I'd go so far as to say
that doing any third thing that's better than those two to any
reasonable person is obviously impossible. But I'd add that we simply
cannot roll back at read committed, so we're just going to have to hold
our collective noses and do strange things with visibility.

FWIW, I'm tentatively looking at doing something like this:

*************** HeapTupleSatisfiesMVCC(HeapTuple htup, S
*** 958,963 ****
--- 959,975 ----
   * By here, the inserting transaction has committed - have to check
   * when...
   */
+
+ /*
+  * Not necessarily visible to snapshot under conventional MVCC rules, but
+  * still locked by our xact and not updated -- importantly, normal MVCC
+  * semantics apply when we update the row, so only one version will be
+  * visible at once
+  */
+ if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
+     TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
+     return true;
+
  if (XidInMVCCSnapshot(HeapTupleHeaderGetXmin(tuple), snapshot))
      return false;      /* treat as still in progress */

This is something that I haven't given remotely enough thought yet, so
please take it with a big grain of salt.

> In terms of how to do the locking, what I'm mostly saying is that we
> could try to implement this in a way that invents as few new concepts
> as possible.  No promise tuples, no new SLRU, no new page-level bits,
> just index tuples and heap tuples and so on.  Ideally, we don't even
> change the WAL format, although step 2 might require a new record
> type.  To the extent that what I actually described was at variance
> with that goal, consider it a defect in my explanation rather than an
> intent to vary.  I think there's value in considering such an
> implementation because each new thing that we have to introduce in
> order to get this feature is a possible reason for it to be rejected -
> for modularity reasons, or because it hurts performance elsewhere, or
> because it's more code we have to maintain, or whatever.

There is certainly value in considering that, and you're right to take
that tack - it is generally valuable to have a patch be minimally
invasive. However, ultimately that's just one aspect of any given
design, an aspect that needs to be weighed against others where there
is a tension. Obviously in this instance I believe, rightly or
wrongly, that doing more - adding more infrastructure than might be
considered strictly necessary - is the least worst thing. Also,
sometimes the apparent similarity of a design to what we have today is
illusory - certainly, I think you'd at least agree that the problems
that bloating during the interim between value locking and row locking
present are qualitatively different to other problems that bloat
presents in all existing scenarios.

FWIW I'm not doing things this way because I'm ambitious, and am
willing to risk not having my work accepted if that means I might get
something that performs better, or has more features (like not
requiring the user to specify a unique index in DML). Rather, I'm
doing things this way because I sincerely believe that on balance mine
is the best, most forward-thinking design proposed to date, and
therefore the design most likely to ultimately be accepted (even
though I do of course accept that there are numerous aspects that need
to be worked out still). If the whole design is ultimately not
accepted, that's something that I'll have to deal with, but understand
that I don't see any way to play it safe here (except, I suppose, to
give up now).

> Now, what I hear you saying is, gee, the performance of that might be
> terrible.  I'm not sure that I believe that, but it's possible that
> you're right.

I think that the average case will be okay, but not great. I think
that the worst case performance may well be unforgivably bad, and it's
a fairly plausible worst case. Even if someone disputes its
likelihood, and demonstrates that it isn't actually that likely, that
isn't necessarily very reassuring - getting all the details right is
pretty subtle, especially compared to just not bloating, and just
deferring to the btree code whose responsibilities include enforcing
uniqueness.

> Much seems to depend on what you think the frequency of
> conflicts will be, and perhaps I'm assuming it will be low while
> you're assuming a higher value.  Regardless, if the performance of the
> sort of implementation I'm talking about would be terrible (under some
> agreed-upon definition of what terrible means in this context), then
> that's a good argument for not doing it that way.  I'm just not
> convinced that's the case.

All fair points. Forgive me for repeating myself, but the word
"conflict" needs to be used carefully here, because there are two
basic ways of interpreting it - something that happens due to
concurrent xact activity around the same values, and something that
happens due to there already being some row there with a conflicting
value from some time ago (or that our xact inserted, even). Indeed,
the former *is* generally much less likely than the latter, so the
distinction is important. You could also further differentiate between
value level and row level conflicts, or at least I think that you
should, and that we should allow for value level conflicts.

Let me try and explain myself better, with reference to a concrete
example. Suppose we have a table with a primary key column, A, and a
unique constraint column, B, and we lock the pk value first and the
unique constraint value second. I'm assuming your design, but allowing
for multiple unique indexes because I don't think doing anything less
will be accepted - promise tuples have some of the same problems, as
well as some other idiosyncratic ones (see my earlier remarks on
recovery/freezing [2] for examples of those).

So there is a fairly high probability that the pk value on A will be
unique, and a fairly low probability that the unique constraint value
on B will be unique, at least in this usage pattern of interest, where
the user is mostly going to end up updating. Mostly, we insert a
speculative regular index tuple (that points to a speculative heap
tuple that we might decide to kill) into the pk column, A, right away,
and then maybe block pending the resolution of a conflicting
transaction on the unique constraint column B. I don't think we have
any reasonable way of not blocking on A - if we go clean it up for the
wait, that's going to bloat quite dramatically, *and* we have to WAL
log. In any case you seemed to accept that cleaning up bloat
synchronously like that was just going to be too expensive. So I
suppose that rules that out. That just leaves sitting on the "value
lock" (that the pk index tuple already inserted effectively is)
indefinitely, pending the outcome of the first transaction.
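
To make the scenario concrete, a minimal sketch (schema, values, and
session numbering invented; the commentary describes the design under
discussion, not committed behavior):

create table upserted (a int primary key, b text unique);

-- Session 0 (still open): insert into upserted values (0, 'x');
-- Session 1 (upserter): proposes (1, 'x'); inserts its speculative
--   index tuple for a = 1 into the pk index, then blocks on session
--   0's xid because of the conflict on b = 'x'.
-- Session 2: proposes a = 1; it now blocks on session 1, even though
--   all it needed was an immediate "no" answer from the pk index.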

What are the consequences of sitting on that value lock indefinitely?
Well, xacts are going to block on the pk value much more frequently,
by simple virtue of the fact that the value locks there are held for a
long time - they just needed to hear a "no" answer, which the unique
constraint was in most cases happy to immediately give, so this is
totally unnecessary. Contention is now in a certain sense almost as
bad for every unique index as it is for the weakest link. That's only
where the problems begin, though, and it isn't necessary for there to
be bad contention on more than one unique index (the pk could just be
on a serial column, say) to see bad effects.

So your long-running xact that's blocking all the other sessions on
its proposed value for a (or maybe even b) - that finally gets to
proceed. Regardless of whether it commits or aborts, there will be a
big bloat race. This is because when the other sessions get the
go-ahead to proceed, they'll all run to get the row lock (one guy
might insert instead). Only one will be successful, but they'll all
kill their heap tuple on the assumption that they'll probably lock the
row, which is only true in the average case. Now, maybe you can teach
them to not bother killing the heap tuple when there are no index
tuples actually inserted to ruin things, but then maybe not, and maybe
it wouldn't help in this instance if you did teach them (because
there's a third, otherwise irrelevant constraint or whatever).

Realize you can generally only kill the heap tuple *before* you have
the row lock, because otherwise a totally innocent non-HOT update (may
not update any unique indexed columns at all) will deadlock with your
session, which I don't think is defensible, and will probably happen
often if allowed to (after all, this is upsert - users are going to
want to update their locked rows!).

So in this scenario, each of the original blockers will simultaneously
try again and again to get the row lock as one transaction proceeds
with locking and then probably commits. For every blocker's iteration
(there will be n_blockers - 1 iterations, with each iteration
resolving things for one blocker only), each blocker bloats - so the
aggregate bloat grows quadratically, on the order of n_blockers^2 / 2
dead tuples for a single contended value. We're talking about creating
duplicates in unique indexes for each and every iteration, for each and
every blocker, and we all know duplicates in btree indexes are, in a
word, bad. I can imagine one or two
ridiculously bloated indexes in this scenario. It's even degenerative
in another direction - the more aggregate bloat we have, the slower
the jump from value to row locking takes, the more likely conflicts
are, the more likely bloat is.

Contrast this with my design, where re-ordering of would-be
conflicters across unique indexes (or serialization failures) can
totally nip this in the bud *if* the contention can be re-ordered
around, but if not, at least there is no need to worry about
aggregating bloat at all, because it creates no bloat.

Now, you're probably thinking "but I said I'll reverify the row for
conflicts across versions, and it'll be fine - there's generally no
need to iterate and bloat again provided no unique-indexed column
changed, even if that is more likely to occur due to the clean-up done
before row locking". Maybe I'm being unfair, but apart from requiring a
considerable amount of additional infrastructure of its own (a new
EvalPlanQual()-like thing that cares about binary equality in respect
of some columns only across row versions), I think that this is likely
to turn out to be subtly flawed in some way, simply because of the
modularity violation, so I haven't given you the benefit of the doubt
about your ability to frequently avoid repeatedly asking the index +
btree code what to do. For example, partial unique indexes - maybe
something that looked okay before because you simply didn't have cause
to insert into that unique index has to be considered in light of the
fact that it changed across row versions - are you going to stash that
knowledge too, and is it likely to affect someone who might otherwise
not have these issues really badly because we have to assume the worst
there? Do you want to do a value verification thing for that too, as
we do when deciding to insert into partial indexes in the first place?
Even if this new nothing-changed-across-versions infrastructure works,
will it work often enough in practice to be worth it -- have you ever
tracked the proportion of updates that were HOT updates in a
production DB? It isn't uncommon for it to not be great, and I think
that we can take that as a proxy for how well this will work. It could
be totally legitimate for the UPDATE portion to alter a unique indexed
column all the time.
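
For anyone who wants to check that proportion on their own system, the
standard statistics views already track it; something like:

select relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) as hot_pct
from pg_stat_user_tables
order by n_tup_upd desc;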

> Basically, if there's a way we can do this without changing the
> on-disk format (even in a backward-compatible way), I'd be strongly
> inclined to go that route unless we have a really compelling reason to
> believe it's going to suck (or be outright impossible).

I don't believe that anything that I have proposed needs to break our
on-disk format - I hadn't considered what the implications might be in
this area for other proposals, but it's possible that that's an
additional advantage of doing value locking all in-memory.

[1] http://www.postgresql.org/message-id/CAM3SWZRV0F-DjgpXu-WxGoG9eEcLawNrEiO5+3UKRp2e5s=TSg@mail.gmail.com

[2] http://www.postgresql.org/message-id/CAM3SWZQUUuYYcGksVytmcGqACVMkf1ui1uvfJekM15YkWZpzhw@mail.gmail.com
-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Sep 26, 2013 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think one thing that's pretty clear at this point is that almost any
> version of this feature could be optimized for either the insert case
> or the update case.  For example, my proposal could be modified to
> search for a conflicting tuple first, potentially wasting an index
> probe (or multiple index probes, if you want to search for potential
> conflicts in multiple indexes) if we're inserting, but winning heavily
> in the update case.

I don't think that's really the case.

In what sense could my design really be said to prioritize either the
INSERT or the UPDATE case? I'm pretty sure that it's still necessary
to get all the value locks per unique index needed up until the first
one with a conflict even if you know that you're going to UPDATE for
*some* reason, in order for things to be well defined (which is
important, because there might be more than one conflict, and which
one is locked matters - maybe we could add DDL to let unique indexes
have a checking priority or something like that).

The only appreciable downside of my design for updates that I can
think of is that there has to be another index scan, to find the
locked-for-update row to update. However, that's probably worth it,
since it is at least relatively rare, and allows the user the
flexibility of using a more complex UPDATE predicate than "apply to
conflicter", which is something that the MySQL syntax effectively
limits users to.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Thu, Sep 26, 2013 at 11:58 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 26, 2013 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Well, I think we can rule out value locks that are held for the
>>> duration of a transaction right away. That's just not going to fly.
>>
>> I think I agree with that.  I don't think I remember hearing that proposed.
>
> I think I might have been unclear - I mean locks that are held for the
> duration of *another* transaction, not our own, as we wait for that
> other transaction to commit/abort. I think that earlier remarks from
> yourself and Andres implied that this would be necessary. Perhaps I'm
> mistaken. Your most recent design proposal doesn't do this, but I
> think that that's only because it restricts the user to a single
> unique index - it would otherwise be necessary to sit on the earlier
> value locks (index tuples belonging to an unfinished transaction)
> pending the completion of some other conflicting transaction, which
> has numerous disadvantages (as described in my "it suits my purposes
> to have the value locks be held for only an instant" mail to you [1]).

OK, now I understand what you are saying.  I don't think I agree with it.

> I don't have another idea either. In fact, I'd go so far as to say
> that doing any third thing that's better than those two to any
> reasonable person is obviously impossible. But I'd add that we simply
> cannot roll back at read committed, so we're just going to have to hold
> our collective noses and do strange things with visibility.

I don't accept that as a general principle.  We're writing the code;
we can make it behave any way we think best.

> This is something that I haven't given remotely enough thought yet, so
> please take it with a big grain of salt.

I doubt that any change to HeapTupleSatisfiesMVCC() will be
acceptable.  This feature needs to restrain itself to behavior changes
that only affect users of this feature, I think.

> There is certainly value in considering that, and you're right to take
> that tack - it is generally valuable to have a patch be minimally
> invasive. However, ultimately that's just one aspect of any given
> design, an aspect that needs to be weighed against others where there
> is a tension. Obviously in this instance I believe, rightly or
> wrongly, that doing more - adding more infrastructure than might be
> considered strictly necessary - is the least worst thing. Also,
> sometimes the apparent similarity of a design to what we have today is
> illusory - certainly, I think you'd at least agree that the problems
> that bloating during the interim between value locking and row locking
> present are qualitatively different to other problems that bloat
> presents in all existing scenarios.

TBH, no, I don't think I agree with that.  See further below.

> Let me try and explain myself better, with reference to a concrete
> example. Suppose we have a table with a primary key column, A, and a
> unique constraint column, B, and we lock the pk value first and the
> unique constraint value second. I'm assuming your design, but allowing
> for multiple unique indexes because I don't think doing anything less
> will be accepted - promise tuples have some of the same problems, as
> well as some other idiosyncratic ones (see my earlier remarks on
> recovery/freezing [2] for examples of those).

OK, so far I'm right with you.

> So there is a fairly high probability that the pk value on A will be
> unique, and a fairly low probability that the unique constraint value
> on B will be unique, at least in this usage pattern of interest, where
> the user is mostly going to end up updating. Mostly, we insert a
> speculative regular index tuple (that points to a speculative heap
> tuple that we might decide to kill) into the pk column, A, right away,
> and then maybe block pending the resolution of a conflicting
> transaction on the unique constraint column B. I don't think we have
> any reasonable way of not blocking on A - if we go clean it up for the
> wait, that's going to bloat quite dramatically, *and* we have to WAL
> log. In any case you seemed to accept that cleaning up bloat
> synchronously like that was just going to be too expensive. So I
> suppose that rules that out. That just leaves sitting on the "value
> lock" (that the pk index tuple already inserted effectively is)
> indefinitely, pending the outcome of the first transaction.

Agreed.

> What are the consequences of sitting on that value lock indefinitely?
> Well, xacts are going to block on the pk value much more frequently,
> by simple virtue of the fact that the value locks there are held for a
> long time - they just needed to hear a "no" answer, which the unique
> constraint was in most cases happy to immediately give, so this is
> totally unnecessary. Contention is now in a certain sense almost as
> bad for every unique index as it is for the weakest link. That's only
> where the problems begin, though, and it isn't necessary for there to
> be bad contention on more than one unique index (the pk could just be
> on a serial column, say) to see bad effects.

Here's where I start to lose faith.  It's unclear to me what those
other transactions are doing.  If they're trying to insert a record
that conflicts with the primary key of the tuple we're inserting,
they're probably doomed, but not necessarily; we might roll back.  If
they're also upserting, it's absolutely essential that they wait until
we get done before deciding what to do.

> So your long-running xact that's blocking all the other sessions on
> its proposed value for a (or maybe even b) - that finally gets to
> proceed. Regardless of whether it commits or aborts, there will be a
> big bloat race. This is because when the other sessions get the
> go-ahead to proceed, they'll all run to get the row lock (one guy
> might insert instead). Only one will be successful, but they'll all
> kill their heap tuple on the assumption that they'll probably lock the
> row, which is only true in the average case. Now, maybe you can teach
> them to not bother killing the heap tuple when there are no index
> tuples actually inserted to ruin things, but then maybe not, and maybe
> it wouldn't help in this instance if you did teach them (because
> there's a third, otherwise irrelevant constraint or whatever).

Supposing they are all upserters, it seems to me that what will
probably happen is that one of them will lock the row and update it,
and then commit.  Then the next one will lock the row and update it,
and then commit.  And so on.  It's probably important to avoid having
them keep recreating speculative tuples and then killing them as long
as a candidate tuple is available, so that they don't create a dead
tuple per iteration.  But that seems doable.

> Realize you can generally only kill the heap tuple *before* you have
> the row lock, because otherwise a totally innocent non-HOT update (may
> not update any unique indexed columns at all) will deadlock with your
> session, which I don't think is defensible, and will probably happen
> often if allowed to (after all, this is upsert - users are going to
> want to update their locked rows!).

I must be obtuse; I don't see why that would deadlock.

A bigger problem that I've just realized, though, is that once
somebody else has blocked on a unique index insertion, they'll be
stuck there until end of transaction even if we kill the tuple,
because they're waiting on the xid, not the index itself.  That might
be fatal to my proposed design, or at least require the use of some
more clever locking regimen.

> Contrast this with my design, where re-ordering of would-be
> conflicters across unique indexes (or serialization failures) can
> totally nip this in the bud *if* the contention can be re-ordered
> around, but if not, at least there is no need to worry about
> aggregating bloat at all, because it creates no bloat.

Yeah, possibly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Sep 27, 2013 at 5:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't have another idea either. In fact, I'd go so far as to say
>> that doing any third thing that's better than those two to any
>> reasonable person is obviously impossible. But I'd add that we simply
>> cannot roll back at read committed, so we're just going to have to hold
>> our collective noses and do strange things with visibility.
>
> I don't accept that as a general principle.  We're writing the code;
> we can make it behave any way we think best.

I presume you're referring to the principle that we cannot throw
serialization failures at read committed. I'd suggest that letting
that happen would upset a lot of people, because it's so totally
unprecedented. A large segment of our user base would just consider
that to be Postgres randomly throwing errors, and would be totally
dismissive of the need to do so, and not without some justification -
no one else does the same. The reality is that the majority of our
users don't even know what an isolation level is. I'm not just talking
about people that use Postgres more casually, such as Heroku
customers. I've personally talked to people who didn't even know what
a transaction isolation level was, who were in a position where they
really, really should have known.

> I doubt that any change to HeapTupleSatisfiesMVCC() will be
> acceptable.  This feature needs to restrain itself to behavior changes
> that only affect users of this feature, I think.

I agree with the principle of what you're saying, but I'm not aware
that those changes to HeapTupleSatisfiesMVCC() imply any behavioral
changes for those not using the feature. Certainly, the standard
regression tests and isolation tests still pass, for what it's worth.
Having said that, I have not thought about it enough to be willing to
actually defend that bit of code. Though I must admit that I am a
little encouraged by the fact that it passes casual inspection.

I am starting to wonder if it's really necessary to have a "blessed"
update that can see the locked, not-otherwise-visible tuple. Doing
that certainly has its disadvantages, both in terms of code complexity
and in terms of being arbitrarily restrictive. We're going to have to
allow the user to see the locked row after it's updated (the new row
version that we create will naturally be visible to its creating xact)
- is it really any worse that the user can see it before an update (or
a delete)? The user could decide to effectively make the update change
nothing, and see the same thing anyway.
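
To illustrate, assuming the wCTE pattern and an invented table t(k int
primary key, v text), a no-op update already exposes the locked row:

with r as (
  insert into t (k, v) values (1, 'x')
  on duplicate key lock for update
  returning rejects *
)
update t set v = t.v  -- changes nothing
from r where t.k = r.k
returning t.*;        -- yet the (unchanged) new row version is visible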

I get why you're averse to doing odd things to visibility - I was too.
I just don't see that we have a choice if we want this feature to work
acceptably with read committed. In addition, as it happens I just
don't see that the general situation is made any worse by the fact
that the user might be able to see the row before an update/delete.
Isn't it also weird to update or delete something you cannot see?

Couldn't EvalPlanQual() be said to be an MVCC violation on similar
grounds? It also "reaches into the future". Locking a row isn't really
that distinct from updating it, both in terms of the code footprint
and from a logical perspective.

> It's probably important to avoid having
> them keep recreating speculative tuples and then killing them as long
> as a candidate tuple is available, so that they don't create a dead
> tuple per iteration.  But that seems doable.

I'm not so sure.

>> Realize you can generally only kill the heap tuple *before* you have
>> the row lock, because otherwise a totally innocent non-HOT update (may
>> not update any unique indexed columns at all) will deadlock with your
>> session, which I don't think is defensible, and will probably happen
>> often if allowed to (after all, this is upsert - users are going to
>> want to update their locked rows!).
>
> I must be obtuse; I don't see why that would deadlock.

If you don't see it, then you aren't being obtuse in asking for
clarification. It's really easy to be wrong about this kind of thing.

If the non-HOT update updates some random row, changing the key
columns, it will lock that random row version. It will then proceed
with "value locking" (i.e. inserting index tuples in the usual way, in
this case with entirely new values). It might then block on one of the
index tuples we, the upserter, have already inserted (these are our
"value locks" under your scheme). Meanwhile, we (the upserter) might
have *already* concluded that the *old* heap row that the regular
updater is in the process of rendering invisible is to blame in
respect of some other value in some later unique index, and that *it*
must be locked. Deadlock. This seems very possible if the key values
are somewhat correlated, which is probably generally quite common.
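
A concrete interleaving of the kind I mean (schema and values
invented; the "value locks" here are just the index tuples each
session has inserted, per your scheme):

create table t (a int primary key, b int unique);
insert into t values (1, 1), (2, 2);

-- Updater:  update t set a = 3, b = 3 where a = 2;
--   locks the row (2,2), then starts inserting index tuples for its
--   new values a = 3 and b = 3.
-- Upserter:  insert into t values (3, 2)
--              on duplicate key lock for update;
--   has already inserted its index tuple for a = 3, and concludes
--   that the old row (2,2) must be locked, because of b = 2.
-- The updater now blocks on the upserter's a = 3 index tuple, while
-- the upserter blocks on the updater's row lock on (2,2). Deadlock,
-- with neither session doing anything unreasonable.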

The important observation here is that an updater, in effect, locks
both the old and new sets of values (for non-HOT updates). And as I've
already noted, any practical "value locking" implementation isn't
going to be able to prevent the update from immediately locking the
old, because that doesn't touch an index. Hence, there is an
irresolvable disconnect between value and row locking.

Are we comfortable with this? Before you answer, consider that there
were lots of bugs (their words) in the MySQL implementation of this
same basic idea surrounding excessive deadlocking - I heard through
the grapevine that they fixed a number of bugs along these lines, and
that their implementation has historically had lots of deadlocking
problems.

I think that the way to deal with weird, unprincipled deadlocking is
to simply not hold value locks at the same time as row locks - it is
my contention that the lock starvation hazards that may arise from not
being smarter about this aren't actually an issue, unless you
have some kind of highly implausible perfect storm of read-committed
aborters inserting around the same values - only one of those needs to
commit to remedy the situation - the first "no" answer is all we need
to give up.

To repeat myself, that's really the essential nature of my design: it
is accepting of the inevitability of there being a disconnect between
value and row locking. Value locks that are implemented in a sane way
can do very little; they can only prevent a conflicting insertion from
*finishing*, and not from causing a conflict for row locking.

> A bigger problem that I've just realized, though, is that once
> somebody else has blocked on a unique index insertion, they'll be
> stuck there until end of transaction even if we kill the tuple,
> because they're waiting on the xid, not the index itself.  That might
> be fatal to my proposed design, or at least require the use of some
> more clever locking regimen.

Well, it's really fatal to your proposed design *because* it implies
that others will be blocked on earlier value locks, which is what I
was trying to say (in saying this, I'm continuing to hold your design
to the same standard as my own, which is that it must work across
multiple unique indexes - I believe that you yourself accept this
standard based on your remarks here).

For the benefit of others who may not get what we're talking about: in
my patch, that isn't a problem, because when we block on acquiring an
xid ShareLock pending value conflict resolution, that means that the
other guy actually did insert (and did not merely think about it), and
so with that design it's entirely appropriate that we wait for his
xact to end.

>> Contrast this with my design, where re-ordering of would-be
>> conflicters across unique indexes (or serialization failures) can
>> totally nip this in the bud *if* the contention can be re-ordered
>> around, but if not, at least there is no need to worry about
>> aggregating bloat at all, because it creates no bloat.
>
> Yeah, possibly.

I think that re-ordering is an important property of any design where
we cannot bail out with serialization failures. I know it seems weird,
because it seems like an MVCC violation to have our behavior altered
as a result of a transaction that committed that isn't even visible to
us. As I think you appreciate, on a certain level that's just the
nature of the beast. This might sound stupid, but: you can say the
same thing about unique constraint violations! I do not believe that
this introduces any anomalies that read committed doesn't already
permit according to the standard.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - visibility semantics

From
Peter Geoghegan
Date:
On Tue, Sep 24, 2013 at 2:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Various messages are discussing semantics around visibility. I by now
> have a hard time keeping track. So let's keep the discussion of the
> desired semantics to this thread.

Yes, it's pretty complicated.

I meant to comment on this here, but ended up saying some stuff to
Robert about this in the main thread, so I should probably direct you
to that. You were probably right to start a new thread, because I
think we can usefully discuss this topic in parallel, but that's just
what ended up happening.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Fri, Sep 27, 2013 at 8:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Sep 27, 2013 at 5:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I don't have another idea either. In fact, I'd go so far as to say
>>> that doing any third thing that's better than those two to any
>>> reasonable person is obviously impossible. But I'd add that we simply
>>> cannot roll back at read committed, so we're just going to have to hold
>>> our collective noses and do strange things with visibility.
>>
>> I don't accept that as a general principle.  We're writing the code;
>> we can make it behave any way we think best.
>
> I presume you're referring to the principle that we cannot throw
> serialization failures at read committed. I'd suggest that letting
> that happen would upset a lot of people, because it's so totally
> unprecedented. A large segment of our user base would just consider
> that to be Postgres randomly throwing errors, and would be totally
> dismissive of the need to do so, and not without some justification -
> no one else does the same. The reality is that the majority of our
> users don't even know what an isolation level is. I'm not just talking
> about people that use Postgres more casually, such as Heroku
> customers. I've personally talked to people who didn't even know what
> a transaction isolation level was, who were in a position where they
> really, really should have known.

Yes, it might not be a good idea.  But I'm just saying, we get to decide.

>> I doubt that any change to HeapTupleSatisfiesMVCC() will be
>> acceptable.  This feature needs to restrain itself to behavior changes
>> that only affect users of this feature, I think.
>
> I agree with the principle of what you're saying, but I'm not aware
> that those changes to HeapTupleSatisfiesMVCC() imply any behavioral
> changes for those not using the feature. Certainly, the standard
> regression tests and isolation tests still pass, for what it's worth.
> Having said that, I have not thought about it enough to be willing to
> actually defend that bit of code. Though I must admit that I am a
> little encouraged by the fact that it passes casual inspection.

Well, at a minimum, it's a performance worry.  Those functions are
*hot*.  Single branches do matter there.

> I am starting to wonder if it's really necessary to have a "blessed"
> update that can see the locked, not-otherwise-visible tuple. Doing
> that certainly has its disadvantages, both in terms of code complexity
> and in terms of being arbitrarily restrictive. We're going to have to
> allow the user to see the locked row after it's updated (the new row
> version that we create will naturally be visible to its creating xact)
> - is it really any worse that the user can see it before an update (or
> a delete)? The user could decide to effectively make the update change
> nothing, and see the same thing anyway.

If we're not going to just error out over the invisible tuple, the user
needs some way to interact with it.  The details are negotiable.

> I get why you're averse to doing odd things to visibility - I was too.
> I just don't see that we have a choice if we want this feature to work
> acceptably with read committed. In addition, as it happens I just
> don't see that the general situation is made any worse by the fact
> that the user might be able to see the row before an update/delete.
> Isn't it also weird to update or delete something you cannot see?
>
> Couldn't EvalPlanQual() be said to be an MVCC violation on similar
> grounds? It also "reaches into the future". Locking a row isn't really
> that distinct from updating it in terms of the code footprint, but
> also from a logical perspective.

Yes, EvalPlanQual() is definitely an MVCC violation.

>>> Realize you can generally only kill the heap tuple *before* you have
>>> the row lock, because otherwise a totally innocent non-HOT update (may
>>> not update any unique indexed columns at all) will deadlock with your
>>> session, which I don't think is defensible, and will probably happen
>>> often if allowed to (after all, this is upsert - users are going to
>>> want to update their locked rows!).
>>
>> I must be obtuse; I don't see why that would deadlock.
>
> If you don't see it, then you aren't being obtuse in asking for
> clarification. It's really easy to be wrong about this kind of thing.
>
> If the non-HOT update updates some random row, changing the key
> columns, it will lock that random row version. It will then proceed
> with "value locking" (i.e. inserting index tuples in the usual way, in
> this case with entirely new values). It might then block on one of the
> index tuples we, the upserter, have already inserted (these are our
> "value locks" under your scheme). Meanwhile, we (the upserter) might
> have *already* concluded that the *old* heap row that the regular
> updater is in the process of rendering invisible is to blame in
> respect of some other value in some later unique index, and that *it*
> must be locked. Deadlock. This seems very possible if the key values
> are somewhat correlated, which is probably generally quite common.

OK, I see.

> The important observation here is that an updater, in effect, locks
> both the old and new sets of values (for non-HOT updates). And as I've
> already noted, any practical "value locking" implementation isn't
> going to be able to prevent the update from immediately locking the
> old, because that doesn't touch an index. Hence, there is an
> irresolvable disconnect between value and row locking.

This part I don't follow.  "locking the old"?  What irresolvable
disconnect?  I mean, they're different things; I get *that*.

> Are we comfortable with this? Before you answer, consider that there
> were lots of bugs (their words) in the MySQL implementation of this
> same basic idea surrounding excessive deadlocking - I heard through
> the grapevine that they fixed a number of bugs along these lines, and
> that their implementation has historically had lots of deadlocking
> problems.
>
> I think that the way to deal with weird, unprincipled deadlocking is
> to simply not hold value locks at the same time as row locks - it is
> my contention that the lock starvation hazards that may arise from not
> being smarter about this aren't actually an issue, unless you
> have some kind of highly implausible perfect storm of read-committed
> aborters inserting around the same values - only one of those needs to
> commit to remedy the situation - the first "no" answer is all we need
> to give up.

OK, I take your point, I think.  The existing system already acquires
value locks when a tuple lock is held, during an UPDATE, and we can't
change that.

>>> Contrast this with my design, where re-ordering of would-be
>>> conflicters across unique indexes (or serialization failures) can
>>> totally nip this in the bud *if* the contention can be re-ordered
>>> around, but if not, at least there is no need to worry about
>>> aggregating bloat at all, because it creates no bloat.
>>
>> Yeah, possibly.
>
> I think that re-ordering is an important property of any design where
> we cannot bail out with serialization failures. I know it seems weird,
> because it seems like an MVCC violation to have our behavior altered
> as a result of a committed transaction that isn't even visible to
> us. As I think you appreciate, on a certain level that's just the
> nature of the beast. This might sound stupid, but: you can say the
> same thing about unique constraint violations! I do not believe that
> this introduces any anomalies that read committed doesn't already
> permit according to the standard.

I worry about the behavior being confusing and hard to understand in
the presence of multiple unique indexes and reordering.  Perhaps I
simply don't understand the problem domain well enough yet.

From a user perspective, I would really think people would want to
specify a set of key columns and then update if a match is found on
those key columns. Suppose there's a unique index on (a, b) and
another on (c), and the user passes in (a,b,c)=(1,1,1).  It's hard for
me to imagine that the user will be happy to update either (1,1,2) or
(2,2,1), whichever exists.  In what situation would that be the
desired behavior?
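
Spelled out (hypothetical schema):

create table t (a int, b int, c int);
create unique index t_ab on t (a, b);
create unique index t_c on t (c);
insert into t values (1, 1, 2), (2, 2, 1);

-- Upserting (a,b,c)=(1,1,1) now conflicts with (1,1,2) via t_ab *and*
-- with (2,2,1) via t_c.  Which existing row ends up locked (and
-- presumably updated) falls out of implementation details like index
-- ordering, not anything the user asked for.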

Also, under such a programming model, if somebody drops a unique index
or adds a new one, the behavior of someone's application can
completely change.  I have a hard time swallowing that.  It's an
established precedent that dropping a unique index can make some other
operation fail (e.g. ADD FOREIGN KEY, and more recently CREATE VIEW ..
GROUP BY), and of course it can cause performance or plan changes.
But overturning the semantics is, I think, something new, and it
doesn't feel like a good direction.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Sep 30, 2013 at 8:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I doubt that any change to HeapTupleSatisfiesMVCC() will be
>>> acceptable.  This feature needs to restrain itself to behavior changes
>>> that only affect users of this feature, I think.
>>
>> I agree with the principle of what you're saying, but I'm not aware
>> that those changes to HeapTupleSatisfiesMVCC() imply any behavioral
>> changes for those not using the feature.

> Well, at a minimum, it's a performance worry.  Those functions are
> *hot*.  Single branches do matter there.

Well, that certainly is a reasonable concern. Offhand, I suspect that
branch prediction helps immensely. But even if it doesn't, couldn't it
be the case that returning earlier there actually helps? Where we have
a real xid (so TransactionIdIsCurrentTransactionId() must do more than
a single test of a scalar variable), and the row is locked *only*
(which is already very cheap to check - it's another scalar variable
that we already test in a few other places in that function), isn't
there on average a high chance that the tuple ought to be visible to
our snapshot anyway?

>> I am starting to wonder if it's really necessary to have a "blessed"
>> update that can see the locked, not-otherwise-visible tuple.

> If we're not going to just error out over the invisible tuple, the user
> needs some way to interact with it.  The details are negotiable.

I think that we will error out over an invisible tuple with higher
isolation levels. Certainly, what we do there today instead of
EvalPlanQual() looping is consistent with that behavior.
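
In other words, something like this sketch of the intended behavior
(invented table; the error text is what comparable lock/update
conflicts already produce today at this isolation level):

begin isolation level repeatable read;
insert into t (k, v) values (1, 'x')
  on duplicate key lock for update;
-- if the conflicting tuple is committed but invisible to our snapshot:
-- ERROR:  could not serialize access due to concurrent update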

>> Couldn't EvalPlanQual() be said to be an MVCC violation on similar
>> grounds? It also "reaches into the future". Locking a row isn't really
>> that distinct from updating it, both in terms of the code footprint
>> and from a logical perspective.
>
> Yes, EvalPlanQual() is definitely an MVCC violation.

So I think that you can at least see why I'd consider that the two (my
tweaks to HeapTupleSatisfiesMVCC() and EvalPlanQual()) are isomorphic.
It just becomes the job of this new locking infrastructure to worry
about the would-be invisibility of the locked tuple, and raise a
serialization error accordingly at higher isolation levels.

>> The important observation here is that an updater, in effect, locks
>> both the old and new sets of values (for non-HOT updates). And as I've
>> already noted, any practical "value locking" implementation isn't
>> going to be able to prevent the update from immediately locking the
>> old, because that doesn't touch an index. Hence, there is an
>> irresolvable disconnect between value and row locking.
>
> This part I don't follow.  "locking the old"?  What irresolvable
> disconnect?  I mean, they're different things; I get *that*.

Well, if you update a row, the old row version's values are locked, in
the sense that any upserter interested in inserting the same values as
the old version is going to have to block pending the outcome of the
updating xact.

The disconnect is that any attempt at a clever dance, interplaying
value and row locking so that this definitely just works the first
time, seems totally futile - I'm emphasizing this because it's the obvious
way to approach this basic problem. It turns out that it could only be
done at great expense, in a way that would immediately be dismissed as
totally outlandish.

> OK, I take your point, I think.  The existing system already acquires
> value locks when a tuple lock is held, during an UPDATE, and we can't
> change that.

Right.

>> I think that re-ordering is an important property of any design where
>> we cannot bail out with serialization failures.

> I worry about the behavior being confusing and hard to understand in
> the presence of multiple unique indexes and reordering.  Perhaps I
> simply don't understand the problem domain well enough yet.

It's only confusing if you are worried about what concurrent sessions
do with respect to each other at this low level. In which case, just
use a higher isolation level and pay the price. I'm not introducing
any additional anomalies described and prohibited by the standard by
doing this, and indeed the order of retrying in the event of a
conflict today is totally undefined, so this line of thinking is not
inconsistent with how things work today. Today, strictly speaking some
unique constraint violations might be more appropriate as
serialization failures. So with this new functionality, when used,
they're going to be actual serialization failures where that's
appropriate, where we'd otherwise do something other than
error. Why burden read committed like that? (Actually, fwiw I suspect
that currently the SSI guarantees *can* be violated with unique retry
re-ordering, but that's a whole other story, and is pretty subtle).

Let me come right out and say it: Yes, part of the reason that I'm
taking this line is because it's convenient to my implementation from
a number of different perspectives. But one of those perspectives is
that it will help performance in the face of contention immensely,
without violating any actual precept held today (by us or by the
standard or by anyone else AFAIK), and besides, my basic design is
informed by sincerely-held beliefs about what will actually work
within the constraints presented.

> From a user perspective, I would really think people would want to
> specify a set of key columns and then update if a match is found on
> those key columns. Suppose there's a unique index on (a, b) and
> another on (c), and the user passes in (a,b,c)=(1,1,1).  It's hard for
> me to imagine that the user will be happy to update either (1,1,2) or
> (2,2,1), whichever exists.  In what situation would that be the
> desired behavior?

You're right - that isn't desirable. The reason that we go to all this
trouble with locking multiple values concurrently boils down to
preventing the user from having to specify a constraint name - it's
usually really obvious *to users* who understand their schema, so why
bother them with that esoteric detail? The user *is* more or less
required to have a particular constraint in mind when writing their
DML (for upsert). It could be that that constraint has a 1:1
correlation with another constraint in practice, which would also work
out fine - they'd specify one or the other constrained column (maybe
both) in the subsequent update's predicate. But generally, yes,
they're out of luck here until we get around to implementing MERGE in
its full generality - and I think what I've proposed is a logical
stepping stone towards that (again, because it involves locking values
across unique indexes simultaneously).

Now, at least what I've proposed has the advantage of allowing the
user to add some controls in their update's predicate. So if they only
had updating (1,1,2) in mind, they could put WHERE a = 1 AND b = 1 in
there too (I'm imagining the wCTE pattern is used). They'd then be
able to inspect the value of the FOUND pseudo-variable or whatever.
Now, I'm assuming that we'd somehow be able to tell that the insert
hasn't succeeded (i.e. it locked), and maybe that doesn't accord very
well with these kinds of facilities as they exist today, but it
doesn't seem like too much extra work (MySQL would consider that both
the locked and updated rows were affected, which might help us here).
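To illustrate with a sketch (using the proposed syntax and Robert's
hypothetical pair of unique indexes; the table and column names here
are made up):

create table bar(a int4, b int4, c int4, unique (a, b), unique (c));

with r as (
insert into bar(a, b, c)
values (1, 1, 1)
on duplicate key lock for update
returning rejects *
)
update bar set c = r.c from r
where bar.a = r.a and bar.b = r.b;

If the conflict was actually on (c) - that is, (2, 2, 1) already
existed - then the update's predicate matches nothing, and the
application can notice that zero rows were affected instead of blindly
updating (2, 2, 1).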

MySQL's INSERT...ON DUPLICATE KEY UPDATE has nothing like this - there
is no guidance as to why you went to update, and you cannot have a
separate update qual. Users better just get it right!
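For contrast, the MySQL idiom looks something like this (MySQL syntax,
shown only for comparison):

insert into bar(a, b, c)
values (1, 1, 1)
on duplicate key update c = values(c);

Whichever unique key happened to conflict, the same assignment runs
against whichever row was found - there is no way to qualify it.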

Maybe what's really needed here is INSERT...ON DUPLICATE KEY LOCK FOR
UPDATE RETURNING LOCKED... . You can see what was actually locked, and
act on *that* as appropriate. Though you don't get to see the actual
value of default expressions and so on, which is a notable
disadvantage over RETURNING REJECTS... .

The advantage of RETURNING LOCKED would be you could check if it
LOCKED for the reason you thought it should have. If it didn't, then
surely what you'd prefer would be a unique constraint violation, so
you can just go throw an error in application code (or maybe consider
another value for the columns that surprised you).
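A sketch of how that check might look (RETURNING LOCKED being entirely
hypothetical at this point):

with r as (
insert into bar(a, b, c)
values (1, 1, 1)
on duplicate key lock for update
returning locked *
)
select * from r where not (a = 1 and b = 1);

Any row that comes back was locked because of some unique index other
than the one the application had in mind, and could be escalated to an
error in application code.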

What do others think?

> Also, under such a programming model, if somebody drops a unique index
> or adds a new one, the behavior of someone's application can
> completely change.  I have a hard time swallowing that.  It's an
> established precedent that dropping a unique index can make some other
> operation fail (e.g. ADD FOREIGN KEY, and more recently CREATE VIEW ..
> GROUP BY), and of course it can cause performance or plan changes.
> But overturning the semantics is, I think, something new, and it
> doesn't feel like a good direction.

In what sense is causing, or preventing an error (the current state of
affairs) not a behavioral change? I'd have thought it a very
significant one. If what you're saying here is true, wouldn't that
mandate that we specify the name of a unique index inline, within DML?
I thought we were in agreement that that wasn't desirable.

If you think it's a bit odd that we lock every value while the user
essentially has one constraint in mind when writing their DML,
consider:

1) We need this for MERGE anyway.

2) Don't underestimate the intellectual overhead for developers and
operations personnel of adding an application-defined significance to
unique indexes that they don't otherwise have. It sure would suck if a
refactoring effort to normalize unique index naming style had the
effect of breaking a slew of application code. Certainly, everyone
else seems to have reached the same conclusion in their own
implementation of upsert, because they don't require that a unique
index be specified, even when that could have unexpected results.

3) The problems that getting the details wrong present can be
ameliorated by developers who feel it might be a problem for them, as
already described. I think in the vast majority of cases it just
obviously won't be a problem to begin with.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Sep 30, 2013 at 3:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
> If you think it's a bit odd that we lock every value while the user
> essentially has one constraint in mind when writing their DML,
> consider:

I should add to that list:

4) Locking all the values at once is necessary for the behavior of the
locking to be well-defined -- I feel we need to know that some exact
tuple is to blame (according to our well defined ordering for checking
unique indexes for conflicts) for at least one instant in time.

Given that we need to be the first to change the row, before anything
else has altered it, this ought to be sufficient. If you think it's
bad that some other session can come in and insert a tuple that would
have caused us to decide differently (before *our* transaction commits
but *after* we've inserted), now you're into blaming the *wrong* tuple
in the future, and I can't get excited about that - we always prefer a
tuple normally visible to our snapshot, but if forced to (if there is
none) we just throw a serialization failure (where appropriate). So
for read committed you can have no *principled* beef with this, but
for serializable you're going to naturally prefer the
currently-visible tuple generally (that's the only correct behavior
there that won't error - there *better* be something visible).

Besides, the way the user tacitly has to use the feature with one
particular constraint in mind kind of implies that this cannot
happen...

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Mon, Sep 30, 2013 at 9:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Sep 30, 2013 at 3:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> If you think it's a bit odd that we lock every value while the user
>> essentially has one constraint in mind when writing their DML,
>> consider:
>
> I should add to that list:
>
> 4) Locking all the values at once is necessary for the behavior of the
> locking to be well-defined -- I feel we need to know that some exact
> tuple is to blame (according to our well defined ordering for checking
> unique indexes for conflicts) for at least one instant in time.
>
> Given that we need to be the first to change the row, before anything
> else has altered it, this ought to be sufficient. If you think it's
> bad that some other session can come in and insert a tuple that would
> have caused us to decide differently (before *our* transaction commits
> but *after* we've inserted), now you're into blaming the *wrong* tuple
> in the future, and I can't get excited about that - we always prefer a
> tuple normally visible to our snapshot, but if forced to (if there is
> none) we just throw a serialization failure (where appropriate). So
> for read committed you can have no *principled* beef with this, but
> for serializable you're going to naturally prefer the
> currently-visible tuple generally (that's the only correct behavior
> there that won't error - there *better* be something visible).
>
> Besides, the way the user tacitly has to use the feature with one
> particular constraint in mind kind of implies that this cannot
> happen...

This patch is still marked as "Needs Review" in the CommitFest
application.  There's no reviewer, but in fact Andres and I both spent
quite a lot of time providing design feedback (probably more than I
spent on any other CommitFest patch).  I think it's clear that the
patch as submitted is not committable, so as far as the CommitFest
goes I'm going to mark it Returned with Feedback.

I think there are still some design considerations to work out here,
but honestly I'm not totally sure what the remaining points of
disagreement are.  It would be nice to hear the opinions of a few more
people on the concurrency issues, but beyond that I think that a lot
of this is going to boil down to whether the details of the value
locking can be made to seem palatable enough and sufficiently
low-overhead in the common case.  I don't believe we can comment on
that in the abstract.

There's still some question in my mind as to what the semantics ought
to be.  I do understand Peter's point that having to specify a
particular index would be grotty, but I'm not sure it invalidates my
point that having to work across multiple indexes could lead to
surprising results in some scenarios. I'm not going to stand here and
hold my breath, though: if that's the only thing that makes me nervous
about the final patch, I'll not object to it on that basis.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Wed, Oct 9, 2013 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> This patch is still marked as "Needs Review" in the CommitFest
> application.  There's no reviewer, but in fact Andres and I both spent
> quite a lot of time providing design feedback (probably more than I
> spent on any other CommitFest patch).

Right, thank you both.

> I think there are still some design considerations to work out here,
> but honestly I'm not totally sure what the remaining points of
> disagreement are.  It would be nice to hear the opinions of a few more
> people on the concurrency issues, but beyond that I think that a lot
> of this is going to boil down to whether the details of the value
> locking can be made to seem palatable enough and sufficiently
> low-overhead in the common case.  I don't believe we can comment on
> that in the abstract.

I agree that we cannot comment on it in the abstract. I am optimistic
that we can make the value locking work better without regressing the
common cases (especially if we're only concerned about not regressing
users that never use the feature, as opposed to having some
expectation for regular inserters inserting values into the same
ranges as an upserter). That's not my immediate concern, though - my
immediate concern is getting the concurrency and visibility issues
scrutinized.

What would it take to get the patch into a committable state if the
value locking had essentially the same properties (they were held
instantaneously), but were perfect? There is no point in giving the
value locking implementation too much further consideration unless
that question can be answered. In the past I've said that row locking
and value locking cannot be considered separately, but that was when
it was generally assumed that value locks had to persist for a long
time in a way that I don't think is feasible (and I think Robert would
now agree that it's at the very least very hard). Persisting value
locks basically makes not regressing the general case hard, when you
think about the implementation. As Robert remarked, regular btree
index insertion blocks on an xid, not a value, and cannot easily be
made to appreciate that the "value lock" that a would-be duplicate
index tuple represents may just be held for a short time, and not the
entire duration of the inserter's transaction.

> There's still some question in my mind as to what the semantics ought
> to be.  I do understand Peter's point that having to specify a
> particular index would be grotty, but I'm not sure it invalidates my
> point that having to work across multiple indexes could lead to
> surprising results in some scenarios. I'm not going to stand here and
> hold my breath, though: if that's the only thing that makes me nervous
> about the final patch, I'll not object to it on that basis.

I should be so lucky!   :-)

Unfortunately, I have a very busy schedule in the month ahead,
including travelling to Ireland and Japan, so I don't think I'm going
to get the opportunity to work on this too much. I'll try and produce
a V4 that formally proposes some variant of my ideas around visibility
of locked tuples.

Here are some things you might not like about this patch, if we're
still assuming that the value locks are prototype and it's useful to
defer discussion around their implementation:

* The lock starvation hazards around going from value locking to row
locking, and retrying if it doesn't work out (i.e. if the row and its
descendant rows cannot be locked without what would ordinarily
necessitate using EvalPlanQual()). I don't see what we could do about
those, other than checking for changes in the row's unique index
values, which would be complex. I understand the temptation to do
that, but the fact is that that isn't going to work all the time -
some unique index value may well change every time. By doing that
you've already accepted whatever hazard may exist, and it becomes a
question of degree. Which is fine, but I don't see that the current
degree is actually much of a problem in the real world.

* Reordering of value locks generally. I still need to ensure this
will behave reasonably at higher isolation levels (i.e. they'll get a
serialization failure). I think that Robert accepts that this isn't
inconsistent with read committed's documented behavior, and that it is
useful, and maybe even essential.

* The basic question of whether or not it's possible to lock values
and rows at the same time, and if that matters (because it turns out
that what looks like doing so actually isn't, since deleters will
effectively lock values without even touching an index). I think
Robert saw the
difficulty of doing this, but it would be nice to get a definitive
answer. I think that any MERGE implementation worth its salt will not
deadlock without the potential for multiple rows to be locked in an
inconsistent order, so this shouldn't either, and as I believe I
demonstrated, value locks and row locks should not be held at the same
time for at least that reason. Right?

* The syntax. I like the composability, and the way it's likely to
become idiomatic to combine it with wCTEs. Others may not.

* The visibility hacks that V4 is likely to have. The fact that
preserving the composable syntax may imply changes to
HeapTupleSatisfiesMVCC() so that rows locked but with no currently
visible version (under conventional rules) are visible to our snapshot
by virtue of having been locked all the same (this only matters at
read committed).

So I think that what this patch really could benefit from is lots of
scrutiny around the concurrency issues. It would be unfair to ask for
that before at least producing a V4, so I'll clean up what I already
have and post it, probably on Sunday.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Wed, Oct 9, 2013 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> * The lock starvation hazards around going from value locking to row
> locking, and retrying if it doesn't work out (i.e. if the row and its
> descendant rows cannot be locked without what would ordinarily
> necessitate using EvalPlanQual()). I don't see what we could do about
> those, other than checking for changes in the row's unique index
> values, which would be complex. I understand the temptation to do
> that, but the fact is that that isn't going to work all the time -
> some unique index value may well change every time. By doing that
> you've already accepted whatever hazard may exist, and it becomes a
> question of degree. Which is fine, but I don't see that the current
> degree is actually much of a problem in the real world.

Some of the decisions we make here may end up being based on measured
performance rather than theoretical analysis.

> * Reordering of value locks generally. I still need to ensure this
> will behave reasonably at higher isolation levels (i.e. they'll get a
> serialization failure). I think that Robert accepts that this isn't
> inconsistent with read committed's documented behavior, and that it is
> useful, and maybe even essential.

I think there's a sentence missing here, or something.  Obviously, the
behavior at higher isolation levels is neither consistent nor
inconsistent with read committed's documented behavior; it's another
issue entirely.

> * The basic question of whether or not it's possible to lock values
> and rows at the same time, and if that matters (because it turns out
> that what looks like doing so actually isn't, since deleters will
> effectively lock values without even touching an index). I think
> Robert saw the
> difficulty of doing this, but it would be nice to get a definitive
> answer. I think that any MERGE implementation worth its salt will not
> deadlock without the potential for multiple rows to be locked in an
> inconsistent order, so this shouldn't either, and as I believe I
> demonstrated, value locks and row locks should not be held at the same
> time for at least that reason. Right?

Right.

> * The syntax. I like the composability, and the way it's likely to
> become idiomatic to combine it with wCTEs. Others may not.

I've actually lost track of what syntax you're proposing.

> * The visibility hacks that V4 is likely to have. The fact that
> preserving the composable syntax may imply changes to
> HeapTupleSatisfiesMVCC() so that rows locked but with no currently
> visible version (under conventional rules) are visible to our snapshot
> by virtue of having been locked all the same (this only matters at
> read committed).

I continue to think this is a bad idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Wed, Oct 9, 2013 at 5:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> * Reordering of value locks generally. I still need to ensure this
>> will behave reasonably at higher isolation levels (i.e. they'll get a
>> serialization failure). I think that Robert accepts that this isn't
>> inconsistent with read committed's documented behavior, and that it is
>> useful, and maybe even essential.
>
> I think there's a sentence missing here, or something.  Obviously, the
> behavior at higher isolation levels is neither consistent nor
> inconsistent with read committed's documented behavior; it's another
> issue entirely.

Here, "this" referred to the reordering concept generally. So I was
just saying that I'm not actually introducing any anomaly that is
described by the standard at read committed, and that at repeatable
read+, we can have actual serial ordering of value locks without
requiring them to last a long time, because we can throw serialization
failures, and can even do so when not strictly logically necessary.

>> * The basic question of whether or not it's possible to lock values
>> and rows at the same time, and if that matters (because it turns out
>> that what looks like doing so actually isn't, since deleters will
>> effectively lock values without even touching an index). I think
>> Robert saw the
>> difficulty of doing this, but it would be nice to get a definitive
>> answer. I think that any MERGE implementation worth its salt will not
>> deadlock without the potential for multiple rows to be locked in an
>> inconsistent order, so this shouldn't either, and as I believe I
>> demonstrated, value locks and row locks should not be held at the same
>> time for at least that reason. Right?
>
> Right.

I'm glad we're on the same page with that - it's a very important
consideration to my mind.

>> * The syntax. I like the composability, and the way it's likely to
>> become idiomatic to combine it with wCTEs. Others may not.
>
> I've actually lost track of what syntax you're proposing.

I'm continuing to propose:

INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

with a much less interesting variant that could be jettisoned:

INSERT...ON DUPLICATE KEY IGNORE

I'm also proposing extended RETURNING to make it work with this. So
the basic idea is that within Postgres, the idiomatic way to correctly
do upsert becomes something like:

postgres=# with r as (
insert into foo(a,b)
values (5, '!'), (6, '@')
on duplicate key lock for update
returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;

>> * The visibility hacks that V4 is likely to have. The fact that
>> preserving the composable syntax may imply changes to
>> HeapTupleSatisfiesMVCC() so that rows locked but with no currently
>> visible version (under conventional rules) are visible to our snapshot
>> by virtue of having been locked all the same (this only matters at
>> read committed).
>
> I continue to think this is a bad idea.

Fair enough.

Is it just because of performance concerns? If so, that's probably not
that hard to address. It either has a measurable impact on performance
for a very unsympathetic benchmark or it doesn't. I guess that's the
standard that I'll be held to, which is probably fair.

Do you see the appeal of the composable syntax?

I appreciate that it's odd that serializable transactions now have to
worry about seeing something they shouldn't have seen (when they
conclusively have to go lock a row version not current to their
snapshot). But that's simpler than any of the alternatives that I see.
Does there really need to be a new snapshot type with one tiny
difference that apparently doesn't actually affect conventional
clients of MVCC snapshots?

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Wed, Oct 9, 2013 at 9:30 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> * The syntax. I like the composability, and the way it's likely to
>>> become idiomatic to combine it with wCTEs. Others may not.
>>
>> I've actually lost track of what syntax you're proposing.
>
> I'm continuing to propose:
>
> INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
>
> with a much less interesting variant that could be jettisoned:
>
> INSERT...ON DUPLICATE KEY IGNORE
>
> I'm also proposing extended RETURNING to make it work with this. So
> the basic idea is that within Postgres, the idiomatic way to correctly
> do upsert becomes something like:
>
> postgres=# with r as (
> insert into foo(a,b)
> values (5, '!'), (6, '@')
> on duplicate key lock for update
> returning rejects *
> )
> update foo set b = r.b from r where foo.a = r.a;

I can't claim to be enamored of this syntax.

>>> * The visibility hacks that V4 is likely to have. The fact that
>>> preserving the composable syntax may imply changes to
>>> HeapTupleSatisfiesMVCC() so that rows locked but with no currently
>>> visible version (under conventional rules) are visible to our snapshot
>>> by virtue of having been locked all the same (this only matters at
>>> read committed).
>>
>> I continue to think this is a bad idea.
>
> Fair enough.
>
> Is it just because of performance concerns? If so, that's probably not
> that hard to address. It either has a measurable impact on performance
> for a very unsympathetic benchmark or it doesn't. I guess that's the
> standard that I'll be held to, which is probably fair.

That's part of it; but I also think that HeapTupleSatisfiesMVCC() is a
pretty fundamental bit of the system that I am loathe to tamper with.
We can try to talk ourselves into believing that the definition change
will only affect this case, but I'm wary that there will be
unanticipated consequences, or simply that we'll find, after it's far
too late to do anything about it, that we don't particularly care for
the new semantics.  It's probably an overstatement to say that I'll
oppose anything whatsoever that touches the semantics of that function,
but not by much.

> Do you see the appeal of the composable syntax?

To some extent.  It seems to me that what we're designing is a giant
grotty hack, albeit a convenient one.  But if we're not really going
to get MERGE, I'm not sure how much good it is to try to pretend we've
got something general.

> I appreciate that it's odd that serializable transactions now have to
> worry about seeing something they shouldn't have seen (when they
> conclusively have to go lock a row version not current to their
> snapshot).

Surely that's never going to be acceptable.  At read committed,
locking a version not current to the snapshot might be acceptable if
we hold our nose, but at any higher level I think we have to fail with
a serialization complaint.

> But that's simpler than any of the alternatives that I see.
> Does there really need to be a new snapshot type with one tiny
> difference that apparently doesn't actually affect conventional
> clients of MVCC snapshots?

I think that's the wrong way of thinking about it.  If you're
introducing a new type of snapshot, or tinkering with the semantics of
an existing one, I think that's a reason to reject the patch straight
off.  We should be looking for a design that doesn't require that.  If
we can't find one, I'm not sure we should do this at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-10-11 08:43:43 -0400, Robert Haas wrote:
> > I appreciate that it's odd that serializable transactions now have to
> > worry about seeing something they shouldn't have seen (when they
> > conclusively have to go lock a row version not current to their
> > snapshot).
> 
> Surely that's never going to be acceptable.  At read committed,
> locking a version not current to the snapshot might be acceptable if
> we hold our nose, but at any higher level I think we have to fail with
> a serialization complaint.

I think an UPSERTish action in RR/SERIALIZABLE that notices a concurrent
update should and has to *ALWAYS* raise a serialization
failure. Anything else will cause violations of the given guarantees.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Oct 11, 2013 at 10:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-11 08:43:43 -0400, Robert Haas wrote:
>> > I appreciate that it's odd that serializable transactions now have to
>> > worry about seeing something they shouldn't have seen (when they
>> > conclusively have to go lock a row version not current to their
>> > snapshot).
>>
>> Surely that's never going to be acceptable.  At read committed,
>> locking a version not current to the snapshot might be acceptable if
>> we hold our nose, but at any higher level I think we have to fail with
>> a serialization complaint.
>
> I think an UPSERTish action in RR/SERIALIZABLE that notices a concurrent
> update should and has to *ALWAYS* raise a serialization
> failure. Anything else will cause violations of the given guarantees.

Sorry, this was just a poor choice of words on my part. I totally
agree with you here. Although I wasn't even talking about noticing a
concurrent update - I was talking about noticing that a tuple that
it's necessary to lock isn't visible to a serializable snapshot in the
first place (which should also fail).

What I actually meant was that it's odd that that one case (reason for
returning) added to HeapTupleSatisfiesMVCC() will always obligate
Serializable transactions to throw a serialization failure. Though
that isn't strictly true; the modifications to
HeapTupleSatisfiesMVCC() that I'm likely to propose also redundantly
work for other cases where, if I'm not mistaken, that's okay (today,
if you've exclusively locked a tuple and it hasn't been
updated/deleted, why shouldn't it be visible to your snapshot?). The
onus is on the executor-level code to notice this
should-be-invisibility for non-read-committed, probably immediately
after returning from value locking.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Oct 11, 2013 at 5:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> * The visibility hacks that V4 is likely to have. The fact that
>>>> preserving the composable syntax may imply changes to
>>>> HeapTupleSatisfiesMVCC() so that rows locked but with no currently
>>>> visible version (under conventional rules) are visible to our snapshot
>>>> by virtue of having been locked all the same (this only matters at
>>>> read committed).
>>>
>>> I continue to think this is a bad idea.

>> Is it just because of performance concerns?

> That's part of it; but I also think that HeapTupleSatisfiesMVCC() is a
> pretty fundamental bit of the system that I am loathe to tamper with.
> We can try to talk ourselves into believing that the definition change
> will only affect this case, but I'm wary that there will be
> unanticipated consequences, or simply that we'll find, after it's far
> too late to do anything about it, that we don't particularly care for
> the new semantics.  It's probably an overstatement to say that I'll
> oppose anything whatsoever that touches the semantics of that function,
> but not by much.

A tuple that is exclusively locked by our transaction and not updated
or deleted being visible on that basis alone isn't *that* hard to
reason about. Granted, we need to be very careful here, but we're
talking about 3 lines of code.

>> Do you see the appeal of the composable syntax?
>
> To some extent.  It seems to me that what we're designing is a giant
> grotty hack, albeit a convenient one.  But if we're not really going
> to get MERGE, I'm not sure how much good it is to try to pretend we've
> got something general.

Well, to be fair, perhaps all of the things that you consider grotty
hacks seem like inherent requirements to me, for any half-way
reasonable upsert implementation on any system that has the essential
property of upsert: an atomic insert-or-update (or maybe a
serialization failure).

>> But that's simpler than any of the alternatives that I see.
>> Does there really need to be a new snapshot type with one tiny
>> difference that apparently doesn't actually affect conventional
>> clients of MVCC snapshots?
>
> I think that's the wrong way of thinking about it.  If you're
> introducing a new type of snapshot, or tinkering with the semantics of
> an existing one, I think that's a reason to reject the patch straight
> off.  We should be looking for a design that doesn't require that.  If
> we can't find one, I'm not sure we should do this at all.

I'm confused by this. We need to lock a row not visible to our
snapshot under conventional rules. I think we can rule out
serialization failures at read committed. That just leaves changing
something about the visibility rules of an existing snapshot type, or
creating a new snapshot type, no?

It would also be unacceptable to update a tuple, and not have the new
row version (which of course will still have "information from the
future") visible to our snapshot - what would regular RETURNING
return? So what do you have in mind? I don't think that locking a row
and updating it are really that distinct anyway. The benefit of
locking is that we don't have to update. We can delete, for example.
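For example (a sketch using the proposed syntax), the same idiom
supports insert-or-delete just as readily as upsert:

with r as (
insert into foo(a, b)
values (5, '!')
on duplicate key lock for update
returning rejects *
)
delete from foo using r where foo.a = r.a;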

Perhaps I've totally missed your point here, but to me it sounds like
you're saying that certain properties must always be preserved that
are fundamentally in tension with upsert working in the way people
expect, and the way it is bound to actually work in numerous other
systems.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Wed, Oct 9, 2013 at 1:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Unfortunately, I have a very busy schedule in the month ahead,
> including travelling to Ireland and Japan, so I don't think I'm going
> to get the opportunity to work on this too much. I'll try and produce
> a V4 that formally proposes some variant of my ideas around visibility
> of locked tuples.

V4 is attached.

Most notably, this adds the modifications to HeapTupleSatisfiesMVCC(),
though they're neater than in the snippet I sent earlier.

There is also some clean-up around row-level locking. That code has
been simplified. I also try to handle serialization failures in a
better way, though that really needs the attention of a subject matter
expert.

There are a few additional XXX comments highlighting areas of concern,
particularly around serializable behavior. I've deferred making higher
isolation levels care about wrongfully relying on the special
HeapTupleSatisfiesMVCC() exception (e.g. they won't throw a
serialization failure), mostly because I couldn't decide where to do
the test in time before travelling tomorrow.

I've added code to do heap_prepare_insert before value locks are held.
Whatever our eventual value locking implementation, that's going to be
a useful optimization. Though unfortunately I ran out of time to give
this the scrutiny it really deserves, I suppose that it's something
that we can return to later.

I ask that reviewers continue to focus on concurrency issues and broad
design issues, and continue to defer discussion about an eventual
value locking implementation. I continue to think that that's the most
useful way of proceeding for the time being. My earlier points about
probable areas of concern [1] remain a good place for reviewers to
start.

[1] http://www.postgresql.org/message-id/CAM3SWZSvSrTzPhjNPjahtJ0rFfS-gJFhU86Vpewf+eO8GwZXNQ@mail.gmail.com

--
Peter Geoghegan


Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Fri, Oct 11, 2013 at 2:30 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> But that's simpler than any of the alternatives that I see.
>>> Does there really need to be a new snapshot type with one tiny
>>> difference that apparently doesn't actually affect conventional
>>> clients of MVCC snapshots?
>>
>> I think that's the wrong way of thinking about it.  If you're
>> introducing a new type of snapshot, or tinkering with the semantics of
>> an existing one, I think that's a reason to reject the patch straight
>> off.  We should be looking for a design that doesn't require that.  If
>> we can't find one, I'm not sure we should do this at all.
>
> I'm confused by this. We need to lock a row not visible to our
> snapshot under conventional rules. I think we can rule out
> serialization failures at read committed. That just leaves changing
> something about the visibility rules of an existing snapshot type, or
> creating a new snapshot type, no?
>
> It would also be unacceptable to update a tuple, and not have the new
> row version (which of course will still have "information from the
> future") visible to our snapshot - what would regular RETURNING
> return? So what do you have in mind? I don't think that locking a row
> and updating it are really that distinct anyway. The benefit of
> locking is that we don't have to update. We can delete, for example.

Well, the SQL standard way of doing this type of operation is MERGE.
The alternative we know exists in other databases is REPLACE; there's
also INSERT .. ON DUPLICATE KEY update.  In all of those cases,
whatever weirdness exists around MVCC is confined to that one command.
I tend to think we should do similarly, with the goal that
HeapTupleSatisfiesMVCC need not change at all.

I don't have the only vote here, of course, but my feeling is that
that's more likely to be a good route.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Oct 15, 2013 at 5:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, the SQL standard way of doing this type of operation is MERGE.
> The alternative we know exists in other databases is REPLACE; there's
> also INSERT .. ON DUPLICATE KEY update.  In all of those cases,
> whatever weirdness exists around MVCC is confined to that one command.
>  I tend to think we should do similarly, with the goal that
> HeapTupleSatisfiesMVCC need not change at all.

I don't think that it's very pragmatic to define success in terms of
not modifying a single visibility function. I feel it would be more
useful to define it as providing acceptable, non-surprising semantics,
while not regressing performance in other areas.

The fact remains that you're going to have to create a new snapshot
type even for this special case, so I don't see any win as regards
managing invasiveness here. Quite the contrary, in fact.

> I don't have the only vote here, of course, but my feeling is that
> that's more likely to be a good route.

Naturally we all want MERGE. It seems self-defeating to insist on
something significantly harder that there is significantly less demand
for, though. I thought that there was at least informal agreement that
this sort of approach was preferable to MERGE in its full generality,
based on feedback at the 2012 developer meeting. I really don't think
that what I've done here is any worse than INSERT...ON DUPLICATE KEY
UPDATE in any of the areas you express concern about here. REPLACE has
some serious problems, and I just don't see it as a viable alternative
at all - just ask any MySQL user.

MERGE is of course more flexible than what I have here in some ways, but
actually less flexible in other ways. I think that the real point of
MERGE is that it's defined in a way that serves data warehousing use
cases very well: the semantics constrain things such that the executor
only has to execute a single ModifyTable node that does inserts,
updates and deletes in a single scan. That's great, but what if it's
useful to do that CRUD (yes, this can include selects) to entirely
different tables? Or what if the relevant DML will only come in a
later statement in the same transaction?

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Oct 15, 2013 at 8:07 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Naturally we all want MERGE. It seems self-defeating to insist on
> something significantly harder that there is significantly less demand
> for, though.

I hasten to add: which is not to imply that you're insisting rather
than expressing a sentiment.


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Tue, Oct 15, 2013 at 11:07 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Oct 15, 2013 at 5:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, the SQL standard way of doing this type of operation is MERGE.
>> The alternative we know exists in other databases is REPLACE; there's
>> also INSERT .. ON DUPLICATE KEY update.  In all of those cases,
>> whatever weirdness exists around MVCC is confined to that one command.
>>  I tend to think we should do similarly, with the goal that
>> HeapTupleSatisfiesMVCC need not change at all.
>
> I don't think that it's very pragmatic to define success in terms of
> not modifying a single visibility function. I feel it would be more
> useful to define it as providing acceptable, non-surprising semantics,
> while not regressing performance in other areas.
>
> The fact remains that you're going to have to create a new snapshot
> type even for this special case, so I don't see any win as regards
> managing invasiveness here. Quite the contrary, in fact.

Well, we might have to agree to disagree.

>> I don't have the only vote here, of course, but my feeling is that
>> that's more likely to be a good route.
>
> Naturally we all want MERGE. It seems self-defeating to insist on
> something significantly harder that there is significantly less demand
> for, though. I thought that there was at least informal agreement that
> this sort of approach was preferable to MERGE in its full generality,
> based on feedback at the 2012 developer meeting. I really don't think
> that what I've done here is any worse than INSERT...ON DUPLICATE KEY
> UPDATE in any of the areas you express concern about here. REPLACE has
> some serious problems, and I just don't see it as a viable alternative
> at all - just ask any MySQL user.
>
> MERGE is of course more flexible than what I have here in some ways, but
> actually less flexible in other ways. I think that the real point of
> MERGE is that it's defined in a way that serves data warehousing use
> cases very well: the semantics constrain things such that the executor
> only has to execute a single ModifyTable node that does inserts,
> updates and deletes in a single scan. That's great, but what if it's
> useful to do that CRUD (yes, this can include selects) to entirely
> different tables? Or what if the relevant DML will only come in a
> later statement in the same transaction?

I'm not saying "go implement MERGE".  I'm saying, make the
insert-or-update operation a single statement, using some syntax TBD,
instead of requiring the use of a new insert statement that makes
invisible rows visible as a side effect, so that you can wrap that in
a CTE and feed it to an update statement.  That's complex and, AFAICS,
unlike how any other database product handles this.

Again, other people can have different opinions on this, and that's
fine.  I'm just giving you mine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-10-15 11:11:24 -0400, Robert Haas wrote:
> I'm not saying "go implement MERGE".  I'm saying, make the
> insert-or-update operation a single statement, using some syntax TBD,
> instead of requiring the use of a new insert statement that makes
> invisible rows visible as a side effect, so that you can wrap that in
> a CTE and feed it to an update statement.  That's complex and, AFAICS,
> unlike how any other database product handles this.

I think we most definitely should provide a single statement
variant. That's the one users yearn for.

I also would like a variant where I can lock a row on conflict, for
multimaster scenarios, but that doesn't necessarily have to be exposed
to SQL.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-10-15 10:19:17 -0700, Peter Geoghegan wrote:
> On Tue, Oct 15, 2013 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Well, I don't know that any of us can claim to have a lock on what the
> > syntax should look like.
> 
> Sure. But it's not just syntax. We're talking about functional
> differences too, since you're talking about mandating an update, which
is not the same as an "update locked row only conditionally", or a
> delete.

I think anything that only works by breaking visibility rules that way
is a nonstarter. Doing that from the C level is one thing, exposing it
this way seems a bad idea.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Oct 15, 2013 at 10:29 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I think anything that only works by breaking visibility rules that way
> is a nonstarter. Doing that from the C level is one thing, exposing it
> this way seems a bad idea.

What visibility rule is that? Upsert *has* to do effectively the same
thing as what I've proposed - there is no getting away from it. So
maybe the visibility rulebook (which as far as I can tell is "the way
things work today") needs to be updated. If we did, say, INSERT...ON
DUPLICATE KEY UPDATE, we'd have to update a row with potentially no
visible-to-snapshot version *at all*, and make a new version of that
visible. That's just what it takes. What's the difference between that
and just locking? If the only difference is that it isn't necessary to
modify tqual.c because you're passing a tid directly, that isn't a
user-visible difference - the "rule" has been broken just the same.
Arguably, it's even more of a hack, since it's a special, out-of-band
visibility exception. I'm happy to have total scrutiny of changes to
tqual.c, but I'm surprised that the mere fact of it having been
modified is being weighed so heavily.

Another thing that I'm not clear on is how an update can be backed out
of if the row is modified by another xact. As I think I've already
illustrated, the row locking that takes place has to be kind of
opportunistic. I'm sure you could do it, but it would probably be
quite invasive.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Josh Berkus
Date:
On 10/15/2013 08:11 AM, Robert Haas wrote:
> I'm not saying "go implement MERGE".  I'm saying, make the
> insert-or-update operation a single statement, using some syntax TBD,
> instead of requiring the use of a new insert statement that makes
> invisible rows visible as a side effect, so that you can wrap that in
> a CTE and feed it to an update statement.  That's complex and, AFAICS,
> unlike how any other database product handles this.

Hmmm.  Is the plan NOT to eventually get to a single-statement upsert?
If not, then I'm not that keen on this feature.  I can't say that
anybody I know who's migrating from MySQL would use a 2-statement
version of upsert; if they were prepared for that, then they'd be
prepared to just rewrite their stuff as proper insert/updates anyway.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Oct 15, 2013 at 10:58 AM, Josh Berkus <josh@agliodbs.com> wrote:
> Hmmm.  Is the plan NOT to eventually get to a single-statement upsert?
> If not, then I'm not that keen on this feature.

See the original e-mail in the thread for what I imagine idiomatic
usage will look like.

http://www.postgresql.org/message-id/CAM3SWZThwrKtvurf1aWAiH8qThGNMZAfyDcNw8QJu7pqHk5AGQ@mail.gmail.com

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Oct 15, 2013 at 11:05 AM, Peter Geoghegan <pg@heroku.com> wrote:
> See the original e-mail in the thread for what I imagine idiomatic
> usage will look like.
>
> http://www.postgresql.org/message-id/CAM3SWZThwrKtvurf1aWAiH8qThGNMZAfyDcNw8QJu7pqHk5AGQ@mail.gmail.com


Note also that this doesn't preclude a variant with a more direct
update part (not that I think that's all that compelling). Doing
things this way was motivated by:

1) Serving the needs of logical changeset generation plugins, even if
Andres doesn't think that needs to be exposed through SQL. He and I
both want something that does this with low overhead (in particular,
no subtransactions).

2) Getting something effective into the next release. MERGE-like
flexibility seems like a very desirable thing. And the
implementation's infrastructure can be used by an eventual MERGE
implementation.

3) Being simple enough that huge bike shedding over syntax might not
be necessary. Making insert statements grow an update tumor is likely
to get messy fast. I know because I tried it myself.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Josh Berkus
Date:
Peter,

> Note also that this doesn't preclude a variant with a more direct
> update part (not that I think that's all that compelling). Doing
> things this way was motivated by:

I can see the value in the CTE format for this for existing PostgreSQL
users.

(although, AFAICT it doesn't allow for the implementation of one of my
personal desires, which is UPDATE ... ON NOT FOUND INSERT, for cases
where updates are expected to occur 95% of the time, but that's another
topic. Unless "rejects" for an Update could be the leftover rows, but
then we're getting into full MERGE.).

I'm just pointing out that this doesn't do much for the MySQL migration
case; the rewrite is too complex to automate.  I'd been assuming that we
had some plans to implement a MySQL-friendly syntax for 9.5, and this
version was a stepping stone to that.

Does this version make a distinction between PRIMARY KEY constraints and
UNIQUE indexes?  If not, how does it pick among keys?  If so, what about
tables with no PRIMARY KEY for various reasons (like unique GiST indexes)?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Oct 15, 2013 at 11:23 AM, Josh Berkus <josh@agliodbs.com> wrote:
> (although, AFAICT it doesn't allow for the implementation of one of my
> personal desires, which is UPDATE ... ON NOT FOUND INSERT, for cases
> where updates are expected to occur 95% of the time, but that's another
> topic. Unless "rejects" for an Update could be the leftover rows, but
> then we're getting into full MERGE.).

This isn't really all that inefficient for that case. Certainly, the
balance in cost between mostly-insert cases and mostly-update cases is
a strength of my basic approach over others.

> Does this version make a distinction between PRIMARY KEY constraints and
> UNIQUE indexes?  If not, how does it pick among keys?  If so, what about
> tables with no PRIMARY KEY for various reasons (like unique GiST indexes?)

We thought about prioritizing where to look (mostly as a performance
optimization), but right now no. It works with amcanunique methods,
which in practice means btrees. There is no such thing as a GiST
unique index, so I guess you're referring to an exclusion constraint
on an equality operator. That doesn't work with this, but why would
you want it to? As for generalizing this to work with exclusion
constraints, which I guess you might have also meant, that's a much
more difficult and much less compelling proposition, in my opinion.
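For reference, this is the sort of thing I mean by an exclusion
constraint on an equality operator - a sketch with made-up names,
requiring the btree_gist extension:

create extension btree_gist;
create table baz(a int4, exclude using gist (a with =));

This enforces uniqueness of "a", but through the exclusion constraint
machinery rather than an amcanunique access method, so it isn't
covered by the proposed feature.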

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Josh Berkus
Date:
On 10/15/2013 11:38 AM, Peter Geoghegan wrote:
> We thought about prioritizing where to look (mostly as a performance
> optimization), but right now no. It works with amcanunique methods,
> which in practice means btrees. There is no such thing as a GiST
> unique index, so I guess you're referring to an exclusion constraint
> on an equality operator. That doesn't work with this, but why would
> you want it to? As for generalizing this to work with exclusion
> constraints, which I guess you might have also meant, that's a much
> more difficult and much less compelling proposition, in my opinion.

Yeah, that was one thing I was thinking of.

Also, because you can't INDEX CONCURRENTLY a PK, I've been building a
lot of databases which have no PKs, only UNIQUE indexes.  Historically,
this hasn't been an issue because aside from wonky annoyances (like the
CONCURRENTLY case), Postgres doesn't distinguish between UNIQUE indexes
and PRIMARY KEYs -- as, indeed, it shouldn't, since they're both keys,
and the whole concept of a "primary key" is a legacy of index-organized
databases, which PostgreSQL is not.

However, it does seem like the new syntax could be extended with an
optional "USING unique_index_name" in the future (9.5), no?

I'm just checking that we're not painting ourselves into a corner with
this particular implementation.  It's OK if it doesn't implement most
things now; it's bad if it is impossible to build on and we have to
support it forever.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Oct 15, 2013 at 11:55 AM, Josh Berkus <josh@agliodbs.com> wrote:
> However, it does seem like the new syntax could be extended with an
> optional "USING unique_index_name" in the future (9.5), no?

There is no reason why we couldn't do that and just consider that one
unique index. Whether we should is another question - I certainly
think that mandating it would be very bad.

> I'm just checking that we're not painting ourselves into a corner with
> this particular implementation.  It's OK if it doesn't implement most
> things now; it's bad if it is impossible to build on and we have to
> support it forever.

I don't believe it does. In essence this simply inserts a row and,
rather than throwing a unique constraint violation, locks the row that
prevented insertion from proceeding, for any tuple proposed for
insertion that could not be inserted. That's all. You can build lots
of things with it that you can't today. Or you can not use it at all.
So that covers semantics, I'd say.

As for implementation: I believe that the implementation is by far the
most forward thinking (in terms of building infrastructure for a
proper MERGE) of any proposal to date.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Josh Berkus
Date:
On 10/15/2013 12:03 PM, Peter Geoghegan wrote:
> On Tue, Oct 15, 2013 at 11:55 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> However, it does seem like the new syntax could be extended with an
>> optional "USING unique_index_name" in the future (9.5), no?
> 
> There is no reason why we couldn't do that and just consider that one
> unique index. Whether we should is another question - 

What's the "shouldn't" argument, if any?

> I certainly
> think that mandating it would be very bad.

Agreed.  If there is a PK, we should allow the user to use it implicitly.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-10-15 10:53:35 -0700, Peter Geoghegan wrote:
> On Tue, Oct 15, 2013 at 10:29 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I think anything that only works by breaking visibility rules that way
> > is a nonstarter. Doing that from the C level is one thing, exposing it
> > this way seems a bad idea.
> 
> What visibility rule is that?

The early return you added to HTSMVCC.

At the very least it opens you up to lots of Halloween-problem-like
scenarios.

> Upsert *has* to do effectively the same thing as what I've proposed -
> there is no getting away from it. So maybe the visibility rulebook
> (which as far as I can tell is "the way things work today") needs to
> be updated. If we did, say, INSERT...ON DUPLICATE KEY UPDATE, we'd
> have to update a row with potentially no visible-to-snapshot version
> *at all*, and make a new version of that visible. That's just what it
> takes. What's the difference between that and just locking? If the
> only difference is that it isn't necessary to modify tqual.c because
> you're passing a tid directly, that isn't a user-visible difference -
> the "rule" has been broken just the same.  Arguably, it's even more of
> a hack, since it's a special, out-of-band visibility exception.

No, doing it in special-case code is fundamentally different, since
those locations deal only with one row at a time. There are no scans
that can pass over that row.
That's why I think exposing the "on conflict lock" logic to anything but
C isn't going to fly btw.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-10-15 11:23:44 -0700, Josh Berkus wrote:
> (although, AFAICT it doesn't allow for the implementation of one of my
> personal desires, which is UPDATE ... ON NOT FOUND INSERT, for cases
> where updates are expected to occur 95% of the time, but that's another
> topic. Unless "rejects" for an UPDATE could be the leftover rows, but
> then we're getting into full MERGE.).

FWIW I can't see the above syntax as something working very well - you
fundamentally have to SET every column and it only makes sense in
UPDATEs that provably affect only one row.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-10-15 11:55:06 -0700, Josh Berkus wrote:
> Also, because you can't INDEX CONCURRENTLY a PK, I've been building a
> lot of databases which have no PKs, only UNIQUE indexes.

You know that you can add prebuilt primary keys using ALTER TABLE
... ADD CONSTRAINT ... PRIMARY KEY USING INDEX indexname?
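
For anyone who hasn't seen that pattern, a minimal sketch (the table,
column and index names are invented):

create unique index concurrently foo_a_idx on foo (a);
alter table foo add constraint foo_pkey
primary key using index foo_a_idx;
-- the index is renamed to match the constraint name (foo_pkey)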

> Postgres doesn't distinguish between UNIQUE indexes
> and PRIMARY KEYs -- as, indeed, it shouldn't, since they're both keys,
> and the whole concept of a "primary key" is a legacy of index-organized
> databases, which PostgreSQL is not.

There are some other differences: for one, primary keys are
automatically picked up by foreign keys if the referenced columns
aren't specified; for another, we do not yet automatically recognize
NOT NULL UNIQUE columns in GROUP BY.
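
To illustrate the first difference (tables invented for the purpose):

create table parent (id int4 primary key);
create table child (parent_id int4 references parent);
-- no referenced column list given: the foreign key defaults to
-- parent's primary key, i.e. references parent (id)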

> However, it does seem like the new syntax could be extended with an
> optional "USING unique_index_name" in the future (9.5), no?

Yes.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Josh Berkus
Date:
On 10/15/2013 02:31 PM, Andres Freund wrote:
> On 2013-10-15 11:55:06 -0700, Josh Berkus wrote:
>> Also, because you can't INDEX CONCURRENTLY a PK, I've been building a
>> lot of databases which have no PKs, only UNIQUE indexes.
> 
> You know that you can add prebuilt primary keys using ALTER TABLE
> ... ADD CONSTRAINT ... PRIMARY KEY USING INDEX indexname?

That still requires an ACCESS EXCLUSIVE lock, and then can't be dropped
using DROP INDEX CONCURRENTLY.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Oct 15, 2013 at 2:25 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-15 10:53:35 -0700, Peter Geoghegan wrote:
>> On Tue, Oct 15, 2013 at 10:29 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > I think anything that only works by breaking visibility rules that way
>> > is a nonstarter. Doing that from the C level is one thing, exposing it
>> > this way seems a bad idea.
>>
>> What visibility rule is that?
>
> The early return you added to HTSMVCC.
>
> At the very least it opens you to lots of halloween problem like
> scenarios.

The term "visibility rule" as you've used it here is suggestive of
some authoritative rule that should obviously never even be bent. I'd
suggest that what Postgres does isn't very useful as an authority on
this matter, because Postgres doesn't have upsert. Besides, today
Postgres doesn't just bend the rules (that is, some kind of classic
notion of MVCC as described in "Concurrency Control in Distributed
Database Systems" or something), it totally breaks them, at least in
READ COMMITTED mode (and what I've proposed here just occurs in RC
mode).

It is not actually in evidence that this approach introduces Halloween
problems. In order for HTSMVCC to controversially indicate visibility
under my scheme, it is not sufficient for the row version to just be
exclusive-locked by our xact without otherwise being visible - it must
also *not be updated*. Now, I'll freely admit that this could still be
problematic - there might have been a subtlety I missed. But since an
actual example of where this is problematic hasn't been forthcoming, I
take it that it isn't obvious to either yourself or Robert that it
actually is. Any scheme that involves playing cute tricks with
visibility (which is to say, any credible upsert implementation) needs
very careful thought.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Oct 15, 2013 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not saying "go implement MERGE".  I'm saying, make the
> insert-or-update operation a single statement, using some syntax TBD,
> instead of requiring the use of a new insert statement that makes
> invisible rows visible as a side effect, so that you can wrap that in
> a CTE and feed it to an update statement.  That's complex and, AFAICS,
> unlike how any other database product handles this.

Well, lots of other databases have their own unique way of doing this
- apart from MySQL's INSERT...ON DUPLICATE KEY UPDATE, there are
variants in Teradata, Sybase and SQLite. They're all different. And
in the case of Teradata, it was an interim feature towards MERGE which
came in a much later release, which is how I see this.

No other database system even has writeable CTEs, of course. It's a
fairly recent idea.

> Again, other people can have different opinions on this, and that's
> fine.  I'm just giving you mine.

I will defer to the majority opinion here. But you also expressed
concern about surprising results due to the wrong unique constraint
violation being the source of a conflict. Couldn't this syntax (with
the wCTE upsert pattern) help with that, by naming the constant
inserted in the update too? It would be pretty simple to expose that,
and far less grotty than naming a unique index in DML.


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Tue, Oct 15, 2013 at 11:34 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> Again, other people can have different opinions on this, and that's
>> fine.  I'm just giving you mine.
>
> I will defer to the majority opinion here. But you also expressed
> concern about surprising results due to the wrong unique constraint
> violation being the source of a conflict. Couldn't this syntax (with
> the wCTE upsert pattern) help with that, by naming the constant
> inserted in the update too? It would be pretty simple to expose that,
> and far less grotty than naming a unique index in DML.

Well, I don't know that any of us can claim to have a lock on what the
syntax should look like.  I think we need to hear some proposals.
You've heard my gripe about the current syntax (which Andres appears
to share), but I shan't attempt to prejudice you in favor of my
preferred alternative, because I don't have one yet.  There could be
other ways of avoiding that problem, though.  Here's an example:

UPSERT table (keycol1, ..., keycoln) = (keyval1, ..., keyvaln) SET
(nonkeycol1, ..., nonkeycoln) = (nonkeyval1, ..., nonkeyvaln)

That's pretty ugly on multiple levels, and I'm definitely not
proposing that exact thing, but the idea is: look for a record that
matches on the key columns/values; if found, update the non-key
columns with the corresponding values; if not found, construct a new
row with both the key and nonkey column sets and insert it.  If no
matching unique index exists we'll have to fail, but we stop short of
having to mention the name of that index.
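
Purely to illustrate - this is invented syntax, not a proposal - the
simplest case might read:

upsert foo (a) = (42) set (b) = ('x');
-- if a row with a = 42 exists, set its b to 'x';
-- otherwise insert the row (42, 'x')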

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Oct 15, 2013 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, I don't know that any of us can claim to have a lock on what the
> syntax should look like.

Sure. But it's not just syntax. We're talking about functional
differences too, since you're talking about mandating an update, which
is not the same as an "update locked row only conditionally", or a
delete.

I get that it's a little verbose, but then this is ORM plumbing for
many of those that would prefer a more succinct syntax. Those people
would also benefit from having their ORM do something much more
powerful for them when needed.

> I think we need to hear some proposals.

Agreed.

> You've heard my gripe about the current syntax (which Andres appears
> to share), but I shan't attempt to prejudice you in favor of my
> preferred alternative, because I don't have one yet.

FWIW, I sincerely see very real advantages to what I've proposed here.
To me, the fact that it's convenient to implement is beside the point.

> There could be
> other ways of avoiding that problem, though.  Here's an example:
>
> UPSERT table (keycol1, ..., keycoln) = (keyval1, ..., keyvaln) SET
> (nonkeycol1, ..., nonkeycoln) = (nonkeyval1, ..., nonkeyvaln)
>
> That's pretty ugly on multiple levels, and I'm definitely not
> proposing that exact thing, but the idea is: look for a record that
> matches on the key columns/values; if found, update the non-key
> columns with the corresponding values; if not found, construct a new
> row with both the key and nonkey column sets and insert it.  If no
> matching unique index exists we'll have to fail, but we stop short of
> having to mention the name of that index.

What if you want to update the key columns - either the potential
conflict-causing one, or another? What about composite unique
constraints? MySQL certainly supports all that, for example.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Tue, Oct 15, 2013 at 1:19 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> There could be
>> other ways of avoiding that problem, though.  Here's an example:
>>
>> UPSERT table (keycol1, ..., keycoln) = (keyval1, ..., keyvaln) SET
>> (nonkeycol1, ..., nonkeycoln) = (nonkeyval1, ..., nonkeyvaln)
>>
>> That's pretty ugly on multiple levels, and I'm definitely not
>> proposing that exact thing, but the idea is: look for a record that
>> matches on the key columns/values; if found, update the non-key
>> columns with the corresponding values; if not found, construct a new
>> row with both the key and nonkey column sets and insert it.  If no
>> matching unique index exists we'll have to fail, but we stop short of
>> having to mention the name of that index.
>
> What if you want to update the key columns - either the potential
> conflict-causing one, or another?

I'm not sure what that means in the context of an UPSERT operation.
If the update case is, when a = 1 then make a = 2, then which value
goes in column a when we insert, 1 or 2?  But I suppose if you can
work that out it's just a matter of mentioning the column as both a
key column and a non-key column.
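
In the invented notation from my earlier example, perhaps:

upsert foo (a) = (1) set (a, b) = (2, 'x');
-- match on a = 1; if found, set a = 2 and b = 'x'; which value
-- column a gets on insert is the open question above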

> What about composite unique
> constraints? MySQL certainly supports all that, for example.

That's why it allows you to specify N key columns rather than
restricting you to just one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 14.10.2013 07:12, Peter Geoghegan wrote:
> On Wed, Oct 9, 2013 at 1:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Unfortunately, I have a very busy schedule in the month ahead,
>> including travelling to Ireland and Japan, so I don't think I'm going
>> to get the opportunity to work on this too much. I'll try and produce
>> a V4 that formally proposes some variant of my ideas around visibility
>> of locked tuples.
>
> V4 is attached.
>
> Most notably, this adds the modifications to HeapTupleSatisfiesMVCC(),
> though they're neater than in the snippet I sent earlier.
>
> There is also some clean-up around row-level locking. That code has
> been simplified. I also try and handle serialization failures in a
> better way, though that really needs the attention of a subject matter
> expert.
>
> There are a few additional XXX comments highlighting areas of concern,
> particularly around serializable behavior. I've deferred making higher
> isolation levels care about wrongfully relying on the special
> HeapTupleSatisfiesMVCC() exception (e.g. they won't throw a
> serialization failure, mostly because I couldn't decide on where to do
> the test on time prior to travelling tomorrow).
>
> I've added code to do heap_prepare_insert before value locks are held.
> Whatever our eventual value locking implementation, that's going to be
> a useful optimization. Though unfortunately I ran out of time to give
> this the scrutiny it really deserves, I suppose that it's something
> that we can return to later.
>
> I ask that reviewers continue to focus on concurrency issues and broad
> design issues, and continue to defer discussion about an eventual
> value locking implementation. I continue to think that that's the most
> useful way of proceeding for the time being. My earlier points about
> probable areas of concern [1] remain a good place for reviewers to
> start.

I think it's important to recap the design goals of this. I don't think 
these have been listed before, so let me try:

* It should be usable and perform well for both large batch updates and 
small transactions.

* It should perform well both when there are no duplicates, and when 
there are lots of duplicates

And from that follows some finer requirements:

* Performance when there are no duplicates should be close to raw INSERT 
performance.

* Performance when all rows are duplicates should be close to raw UPDATE 
performance.

* We should not leave behind large numbers of dead tuples in either case.

Anything else I'm missing?


What about exclusion constraints? I'd like to see this work for them as 
well. Currently, exclusion constraints are checked after the tuple is 
inserted, and you abort if the constraint was violated. We could still 
insert the heap and index tuples first, but instead of aborting on 
violation, we would kill the heap tuple we already inserted and retry. 
There are some complications there, like how to wake up any other 
backends that are waiting to grab a lock on the tuple we just killed, 
but it seems doable.
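
For concreteness, the kind of constraint I have in mind (names
invented):

create table reservation
(
during tsrange,
exclude using gist (during with &&)
);
-- no two rows may have overlapping "during" ranges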

That would, however, perform badly and leave garbage behind if there are 
duplicates. A refinement of that would be to first check for constraint 
violations, then insert the tuple, and then check again. That would 
avoid the garbage in most cases, but would perform much more poorly when 
there are no duplicates, because it needs two index scans for every 
insertion. A further refinement would be to keep track of how many 
duplicates there have been recently, and switch between the two 
strategies based on that.

That cost of doing two scans could be alleviated by using 
markpos/restrpos to do the second scan. That is presumably cheaper than 
starting a whole new scan with the same key. (markpos/restrpos don't 
currently work for non-MVCC snapshots, so that'd need to be fixed, though)

And that detour with exclusion constraints takes me back to the current 
patch :-). What if you implemented the unique check in a similar fashion 
too (when doing INSERT ON DUPLICATE KEY ...)? First, scan for a 
conflicting key, and mark the position. Then do the insertion to that 
position. If the insertion fails because of a duplicate key (which was 
inserted after we did the first scan), mark the heap tuple as dead, and 
start over. The indexam changes would be quite similar to the changes 
you made in your patch, but instead of keeping the page locked, you'd 
only hold a pin on the target page (if even that). The first indexam 
call would check that the key doesn't exist, and remember the insert 
position. The second call would re-find the previous position, and 
insert the tuple, checking again that there really wasn't a duplicate 
key violation. The locking aspects would be less scary than your current 
patch.

I'm not sure if that would perform as well as your current patch. I must 
admit your current approach is pretty optimal performance-wise. But I'd 
like to see it, and that would be a solution for exclusion constraints 
in any case.

One fairly significant limitation with your current approach is that
the number of lwlocks you can hold simultaneously is limited
(MAX_SIMUL_LWLOCKS == 100). Another limitation is that the minimum for
shared_buffers is only 16. Neither of those is a serious problem in
real applications - no-one runs with shared_buffers=16 and no sane
schema has a hundred unique indexes, but it's still something to
consider.

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Nov 18, 2013 at 6:44 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I think it's important to recap the design goals of this.

Seems reasonable to list them out.

> * It should be usable and perform well for both large batch updates and
> small transactions.

I think that that's a secondary goal, a question to be considered but
perhaps deferred during this initial effort. I agree that it certainly
is important.

> * It should perform well both when there are no duplicates, and when there
> are lots of duplicates

I think this is very important.

> And from that follows some finer requirements:
>
> * Performance when there are no duplicates should be close to raw INSERT
> performance.
>
> * Performance when all rows are duplicates should be close to raw UPDATE
> performance.
>
> * We should not leave behind large numbers of dead tuples in either case.

I agree with all that.

> Anything else I'm missing?

I think so, yes. I'll add:

* Should not deadlock unreasonably.

If the UPDATE case is to work and perform almost as well as a regular
UPDATE, that must mean that it has essentially the same
characteristics as plain UPDATE. In particular, I feel fairly strongly
that it is not okay for upserts to deadlock with each other unless the
possibility of each transaction locking multiple rows (in an
inconsistent order) exists. I don't want to repeat the mistakes of
MySQL here. This is a point that I stressed to Robert on a previous
occasion [1]. It's why value locks and row locks cannot be held at the
same time. Incidentally, that implies that all alternative schemes
involving bloat will bloat once per attempt, I believe.
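
To spell out the one acceptable deadlock - a sketch with an invented
table t, keyed on k, and two concurrent sessions:

-- session 1:
begin;
select * from t where k = 1 for update;
-- session 2:
begin;
select * from t where k = 2 for update;
-- session 1, which blocks on session 2's lock:
select * from t where k = 2 for update;
-- session 2, which completes the cycle; the deadlock detector
-- aborts one of the two transactions:
select * from t where k = 1 for update;

Upserts that each lock only a single row should never be able to do
this to each other.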

I'll also add:

* Should anticipate a day when Postgres needs plumbing for SQL MERGE,
which is still something we want, particularly for batch operations. I
realize that the standard doesn't strictly require MERGE to handle the
concurrency issues, but even so I don't think that an implementation
that doesn't is practicable - does such an implementation currently
exist in any other system?

> What about exclusion constraints? I'd like to see this work for them as
> well. Currently, exclusion constraints are checked after the tuple is
> inserted, and you abort if the constraint was violated. We could still
> insert the heap and index tuples first, but instead of aborting on
> violation, we would kill the heap tuple we already inserted and retry. There
> are some complications there, like how to wake up any other backends that
> are waiting to grab a lock on the tuple we just killed, but it seems doable.

I agree that it's at least doable.

> That would, however, perform badly and leave garbage behind if there are
> duplicates. A refinement of that would be to first check for constraint
> violations, then insert the tuple, and then check again. That would avoid
> the garbage in most cases, but would perform much more poorly when there are
> no duplicates, because it needs two index scans for every insertion. A
> further refinement would be to keep track of how many duplicates there have
> been recently, and switch between the two strategies based on that.

Seems like an awful lot of additional mechanism.

> That cost of doing two scans could be alleviated by using markpos/restrpos
> to do the second scan. That is presumably cheaper than starting a whole new
> scan with the same key. (markpos/restrpos don't currently work for non-MVCC
> snapshots, so that'd need to be fixed, though)

Well, it seems like we could already use a "pick up where you left
off" mechanism in the case of regular btree index tuple insertions
into unique indexes -- after all, we don't do that in the event of
blocking pending the outcome of the other transaction (that inserted a
duplicate that we need to conclusively know has or has not committed)
today. The fact that this doesn't already exist leaves me less than
optimistic about the prospect of making it work to facilitate a scheme
such as the one you describe here. (Today we still need to catch a
committed version of the tuple that would make our tuple a duplicate
from a fresh index scan, only *after* waiting for a transaction to
commit/abort at the end of our original index scan). So we're already
pretty naive about this, even though it would pay to not be.

Making something like markpos work for the purposes of an upsert
implementation seems not only hard, but also like a possible
modularity violation. Are we not unreasonably constraining the
implementation going forward? My patch respects the integrity of the
am abstraction, and doesn't really add any knowledge to the core
system about how amcanunique index methods might go about implementing
the new "amlock" method. The core system worries a bit about the "low
level locks" (as it naively refers to value locks), and doesn't
consider that it has the right to hold on to them for more than an
instant, but that's about it. Plus we don't have to worry about
whether something does or does not work for a certain snapshot type
with my approach, because as with the current unique index btree
coding, it operates at a lower level than that, and does not need to
consider visibility as such.

The markpos and restrpos am methods are only called for regular index
scans, which don't need to worry about things that are not visible. Of
course, upsert needs to worry about invisible-but-conclusively-live
things. This seems much harder, and basically implies value locking of
some kind, if I'm not mistaken. So have you really gained anything?

So what I've done, aside from being, as you say below, close to
optimal, is in a sense defined in terms of existing, well-established
abstractions. I feel it's easier to reason about the implications of
holding value locks (whatever the implementation) for longer and
across multiple operations than it is to do all this instead. What
I've done with locking is scary, but not as scary as the worst case of
alternative implementations.

> And that detour with exclusion constraints takes me back to the current
> patch :-). What if you implemented the unique check in a similar fashion too
> (when doing INSERT ON DUPLICATE KEY ...)? First, scan for a conflicting key,
> and mark the position. Then do the insertion to that position. If the
> insertion fails because of a duplicate key (which was inserted after we did
> the first scan), mark the heap tuple as dead, and start over. The indexam
> changes would be quite similar to the changes you made in your patch, but
> instead of keeping the page locked, you'd only hold a pin on the target page
> (if even that). The first indexam call would check that the key doesn't
> exist, and remember the insert position. The second call would re-find the
> previous position, and insert the tuple, checking again that there really
> wasn't a duplicate key violation. The locking aspects would be less scary
> than your current patch.
>
> I'm not sure if that would perform as well as your current patch. I must
> admit your current approach is pretty optimal performance-wise. But I'd like
> to see it, and that would be a solution for exclusion constraints in any
> case.

I'm certainly not opposed to making something like this work for
exclusion constraints. Certainly, I want this to be as general as
possible. But I don't think that it needs to be a blocker, and I don't
think we gain anything in code footprint by addressing that by being
as general as possible in our approach to the basic concurrency issue.
After all, we're going to have to repeat the basic pattern in multiple
modules.

With exclusion constraints, we'd have to worry about a single slot
proposed for insertion violating (and therefore presumably obliging us
to lock) every row in the table. Are we going to have a mechanism for
spilling a tid array potentially sized in gigabytes to disk (relating
to just one slot proposed for insertion)? Is it principled to have
that one slot project out rejects consisting of (say) the entire
table? Is it even useful to lock multiple rows if we can't really
update them, because they'll overlap each other when all updated with
the one value? These are complicated questions, and frankly I don't
have the bandwidth to answer them too soon. I just want to implement a
feature that there is obviously huge pent up demand for, that has in
the past put Postgres at a strategic disadvantage. I don't think it is
unsound to define ON DUPLICATE KEY in terms of unique indexes. That's
how we represent uniques...it isn't spelt ON OVERLAPPING or whatever.
That seems like an addition, a nice-to-have, and maybe not even that,
because exclusion-constrained columns *aren't* keys, and people aren't
likely to want to upsert details of a booking (the typical exclusion
constraint use-case) with the booking range in the UPDATE part's
predicate. They'd just do it by key, because they'd already have a
booking number PK value or whatever.

Making this perform as well as possible is an important consideration.
All alternative approaches that involve bloat concern me, and for
reasons that I'm not sure were fully appreciated during earlier
discussion on this thread: I'm worried about the worst case, not the
average case. I am worried about a so-called "thundering herd"
scenario. You need something like LockTuple() to arbitrate ordering,
which seems complex, and like a further modularity violation. If this
is to perform well when there are lots of existing tuples to be
updated (with contention that wouldn't be considered unreasonable for
plain updates), the amount of bloat generated by a thundering herd
could be really bad (once per attempt per "head of cattle"/upserter).
It's hard to say for sure how much of a problem
this is, but I think it needs to be considered. It's a problem that
I'm not sure we have the tools to analyze ahead of time. It's easier
to pin down and reason about the conventional value locking stuff,
because we know how deadlocks work. We know how to do analysis of
deadlock hazards, and the surface area actually turns out to be not
too large there.

> One fairly significant limitation with your current approach is that the
> number of lwlocks you can hold simultaneously is limited (MAX_SIMUL_LWLOCKS == 100).
> Another limitation is that the minimum for shared_buffers is only 16.
> Neither of those is a serious problem in real applications - no-one runs
> with shared_buffers=16 and no sane schema has a hundred unique indexes, but
> it's still something to consider.

I was under the impression, based on previously feedback, that what
I've done with LWLocks was unlikely to be accepted. I proceeded under
the assumption that we'll be able to ameliorate these problems, as for
example by implementing an alternative value locking mechanism (an
SLRU?) that is similar to what I've done to date (in particular, very
cheap and fast), but without all the down-sides that concerned Robert
and Andres, and now you. As I said, I still think that's easier and
safer than all alternative approaches described to date. It just so
happens that I also believe it will perform a lot better in the
average case too, but that isn't a key advantage to my mind.

You're right that the value locking is scary. I think we need to very
carefully consider it, once I have buy-in on the basic approach. I
really do think it's the least-worst approach described to date. It
isn't like we can't discuss making it inherently less scary, but I
hesitate to do that now, given that I don't know if that discussion
will go anywhere.

Thanks for your efforts on reviewing my work here! Do you think it
would be useful at this juncture to write a patch to make the order of
locking across unique indexes well-defined? I think it may well have
independent value to get the insertion into unique indexes (that can
throw errors) out of the way when doing a regular slot insertion.
Better to abort the transaction as soon as possible.

[1] http://www.postgresql.org/message-id/CAM3SWZRfrw+zXe7CKt6-QTCuvKQ-Oi7gnbBOPqQsvddU=9M7_g@mail.gmail.com

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Nov 18, 2013 at 4:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
> You're right that the value locking is scary. I think we need to very
> carefully consider it, once I have buy-in on the basic approach. I
> really do think it's the least-worst approach described to date. It
> isn't like we can't discuss making it inherently less scary, but I
> hesitate to do that now, given that I don't know if that discussing
> will go anywhere.

One possible compromise would be "promise tuples" where we know we'll
be able to keep our promise. In other words:

1. We lock values in the first phase, in more or less the manner of
the extant patch.

2. When a consensus exists that heap tuple insertion can proceed, we
proceed with insertion of these promise index tuples (and probably
keep just a pin on the relevant pages).

3. Proceed with insertion of the heap tuple (with no "value locks" of
any kind held).

4. Go back to the unique indexes, update the heap tid and unset the
index tuple flag (that indicates that the tuples are in this promise
state). Probably we can even be bright about re-finding the existing
promise tuples with their proper heap tid (e.g. maybe we can avoid
doing a regular index scan at least some of the time - chances are
pretty good that the index tuple is on the same page as before, so
it's generally well worth a shot looking there first). As with the
earlier promise tuple proposals, we store our xid in the ItemPointer.

5. Finally, insertion of non-unique index tuples occurs in the regular manner.

Obviously the big advantage here is that we don't have to worry about
value locking across heap tuple insertion at all, and yet we don't
have to worry about bloating, because we really do know that insertion
proper will proceed when inserting *this* type of promise index tuple.
Maybe that even makes it okay to just use buffer locks, if we think
some more about the other edge cases. Regular index scans take the
aforementioned flag as a kind of visibility hint, perhaps, so we don't
have to worry about them. And VACUUM would kill any dead promise
tuples - this would be much less of a concern than with the earlier
promise tuple proposals, because it is extremely non-routine. Maybe
it's fine to not make autovacuum concerned about a whole new class of
(index-only) bloat, which seemed like a big problem with those earlier
proposals, simply because crashes within this tiny window are
hopefully so rare that it couldn't possibly amount to much bloat in
the grand scheme of things (at least before a routine VACUUM - UPDATEs
tend to necessitate those). If you have 50 upserting backends in this
tiny window during a crash, that would be only 50 dead index tuples.
Given the window is so tiny, I doubt it would be much of a problem at
all - even 50 seems like a very high number. The track_counts counts
that drive autovacuum here are already not crash safe, so I see no
regression.

Now, you still have to value lock across multiple btree unique
indexes, and I understand there are reservations about this. But the
surface area is made significantly smaller at reasonably low cost.
Furthermore, doing TOASTing out-of-line and so on ceases to be
necessary.

The LOCK FOR UPDATE case is the same as before. Nothing else changes.

FWIW, without presuming anything about value locking implementation,
I'm not too worried about making the implementation scale to very
large numbers of unique indexes, with very low shared_buffer settings.
We already have a fairly similar situation with
max_locks_per_transaction and so on, no?

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 19.11.2013 02:37, Peter Geoghegan wrote:
> On Mon, Nov 18, 2013 at 6:44 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> * It should be usable and perform well for both large batch updates and
>> small transactions.
>
> I think that that's a secondary goal, a question to be considered but
> perhaps deferred during this initial effort. I agree that it certainly
> is important.

Ok. Which use case are you targeting during this initial effort, batch
updates or small OLTP transactions?

>> Anything else I'm missing?
>
> I think so, yes. I'll add:
>
> * Should not deadlock unreasonably.
>
> If the UPDATE case is to work and perform almost as well as a regular
> UPDATE, that must mean that it has essentially the same
> characteristics as plain UPDATE. In particular, I feel fairly strongly
> that it is not okay for upserts to deadlock with each other unless the
> possibility of each transaction locking multiple rows (in an
> inconsistent order) exists.

Agreed.

>> What about exclusion constraints? I'd like to see this work for them as
>> well. Currently, exclusion constraints are checked after the tuple is
>> inserted, and you abort if the constraint was violated. We could still
>> insert the heap and index tuples first, but instead of aborting on
>> violation, we would kill the heap tuple we already inserted and retry. There
>> are some complications there, like how to wake up any other backends that
>> are waiting to grab a lock on the tuple we just killed, but it seems doable.
>
> I agree that it's at least doable.
>
>> That would, however, perform badly and leave garbage behind if there are
>> duplicates. A refinement of that would be to first check for constraint
>> violations, then insert the tuple, and then check again. That would avoid
>> the garbage in most cases, but would perform much more poorly when there are
>> no duplicates, because it needs two index scans for every insertion. A
>> further refinement would be to keep track of how many duplicates there have
>> been recently, and switch between the two strategies based on that.
>
> Seems like an awful lot of additional mechanism.

Not really. Once you have the code in place to do the
kill-inserted-tuple dance on a conflict, all you need is to do an extra
index search before it. And once you have that, it's not hard to add
some kind of a heuristic to either do the pre-check or skip it.

>> That cost of doing two scans could be alleviated by using markpos/restrpos
>> to do the second scan. That is presumably cheaper than starting a whole new
>> scan with the same key. (markpos/restrpos don't currently work for non-MVCC
>> snapshots, so that'd need to be fixed, though)
>
> Well, it seems like we could already use a "pick up where you left
> off" mechanism in the case of regular btree index tuple insertions
> into unique indexes -- after all, we don't do that in the event of
> blocking pending the outcome of the other transaction (that inserted a
> duplicate that we need to conclusively know has or has not committed)
> today. The fact that this doesn't already exist leaves me less than
> optimistic about the prospect of making it work to facilitate a scheme
> such as the one you describe here. (Today we still need to catch a
> committed version of the tuple that would make our tuple a duplicate
> from a fresh index scan, only *after* waiting for a transaction to
> commit/abort at the end of our original index scan). So we're already
> pretty naive about this, even though it would pay to not be.

We just haven't bothered to optimize for the case that you have to wait.
That's going to be slow anyway. Also, after sleeping, the insertion
position might've moved right a lot, if a lot of insertions happened
during the sleep, so it might be best to do a new scan anyway.

> Making something like markpos work for the purposes of an upsert
> implementation seems not only hard, but also like a possible
> modularity violation. Are we not unreasonably constraining the
> implementation going forward? My patch respects the integrity of the
> am abstraction, and doesn't really add any knowledge to the core
> system about how amcanunique index methods might go about implementing
> the new "amlock" method. The core system worries a bit about the "low
> level locks" (as it naively refers to value locks), and doesn't
> consider that it has the right to hold on to them for more than an
> instant, but that's about it. Plus we don't have to worry about
> whether something does or does not work for a certain snapshot type
> with my approach, because as with the current unique index btree
> coding, it operates at a lower level than that, and does not need to
> consider visibility as such.
>
> The markpos and restrpos am methods are only called for regular index
> scans, which don't need to worry about things that are not visible. Of
> course, upsert needs to worry about invisible-but-conclusively-live
> things. This seems much harder, and basically implies value locking of
> some kind, if I'm not mistaken. So have you really gained anything?

I probably shouldn't have mentioned markpos/restrpos, you're right that
it's not a good idea to conflate that with index insertion.
Nevertheless, some kind of an API for doing a duplicate-key check prior
to insertion, and remembering the location for the actual insert later,
seems sensible. It's certainly no more of a modularity violation than
the value-locking scheme you're proposing.

What I'm thinking is a new indexam function, let's call it "pre-insert".
The pre-insert function checks for any possible unique key violations,
just like insertion, but doesn't modify the index. Also, as an
optimization, it can remember the position where the insertion will go
to later, and return an opaque token to represent that. That token can
be passed to the insert-function later, which can use it to quickly
re-find the insert position. In other words, very similar to the
index_lock function you're proposing, but it doesn't keep the page locked.

>> And that detour with exclusion constraints takes me back to the current
>> patch :-). What if you implemented the unique check in a similar fashion too
>> (when doing INSERT ON DUPLICATE KEY ...)? First, scan for a conflicting key,
>> and mark the position. Then do the insertion to that position. If the
>> insertion fails because of a duplicate key (which was inserted after we did
>> the first scan), mark the heap tuple as dead, and start over. The indexam
>> changes would be quite similar to the changes you made in your patch, but
>> instead of keeping the page locked, you'd only hold a pin on the target page
>> (if even that). The first indexam call would check that the key doesn't
>> exist, and remember the insert position. The second call would re-find the
>> previous position, and insert the tuple, checking again that there really
>> wasn't a duplicate key violation. The locking aspects would be less scary
>> than your current patch.
>>
>> I'm not sure if that would perform as well as your current patch. I must
>> admit your current approach is pretty optimal performance-wise. But I'd like
>> to see it, and that would be a solution for exclusion constraints in any
>> case.
>
> I'm certainly not opposed to making something like this work for
> exclusion constraints. Certainly, I want this to be as general as
> possible. But I don't think that it needs to be a blocker, and I don't
> think we gain anything in code footprint by addressing that by being
> as general as possible in our approach to the basic concurrency issue.
> After all, we're going to have to repeat the basic pattern in multiple
> modules.

Well, I don't know what to say. I *do* have a hunch that we'd gain much
in code footprint by making this general. I don't understand what
pattern you'd need to repeat in multiple modules.

Here's a patch, implementing a rough version of the scheme I'm trying to
explain. It's not as polished as yours, but it ought to be enough to
evaluate the code footprint and performance. It doesn't make any changes
to the indexam API, and it works the same with exclusion constraints and
unique constraints. As it stands, it doesn't leave bloat behind, except
when a concurrent insertion with a conflicting key happens between the
first "pre-check" and the actual insertion. That should be rare in practice.

What have you been using to performance test this?

> With exclusion constraints, we'd have to worry about a single slot
> proposed for insertion violating (and therefore presumably obliging us
> to lock) every row in the table. Are we going to have a mechanism for
> spilling a tid array potentially sized in gigabytes to disk (relating
> to just one slot proposed for insertion)? Is it principled to have
> that one slot project out rejects consisting of (say) the entire
> table? Is it even useful to lock multiple rows if we can't really
> update them, because they'll overlap each other when all updated with
> the one value?

Hmm. I think what you're referring to is the case where you try to
insert a row so that it violates an exclusion constraint, and in a way
that it conflicts with a large number of existing tuples. For example,
if you have a calendar application with a constraint that two
reservations must not overlap, and you try to insert a new reservation
that covers, say, a whole decade.
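
With a table like the reservation sketch from my earlier mail, that
would be something like:

insert into reservation (during)
values (tsrange('2014-01-01', '2024-01-01'));
-- conflicts with every existing reservation in that ten-year span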

That's not a problem for ON DUPLICATE KEY IGNORE, as you just ignore the
conflict and move on. For ON DUPLICATE KEY LOCK FOR UPDATE, I guess we
would need to handle a large TID array. Or maybe we can arrange it so
that the tuples are locked as we scan them, without having to collect
them all in a large array.

(the attached patch only locks the first existing tuple that conflicts;
that needs to be fixed)

RETURNING REJECTS is not an issue here, as that just returns the
rejected rows we were about to insert, not the existing rows in the table.

- Heikki

Attachment

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Nov 19, 2013 at 5:13 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Ok. Which use case are you targeting during this initial effort, batch
> updates or small OLTP transactions?

OLTP transactions are probably my primary concern. I just realized
that I wasn't actually very clear on that point in my most recent
e-mail -- my apologies. What we really need for batching, and what we
should work towards in the medium term is MERGE, where a single table
scan does everything.

However, I also care about facilitating conflict resolution in
multi-master replication systems, so I think we definitely ought to
consider that carefully if at all possible. Incidentally, Andres said
a few weeks back that he thinks that what I've proposed ought to be
only exposed to C code, owing to the fact that it necessitates the
visibility trick (actually, I think UPSERT does generally, but what
I've done has, I suppose, necessitated making it more explicit/general
- i.e. modifications are added to HeapTupleSatisfiesMVCC()). I don't
understand what difference it makes to only expose it at the C level
- what I've proposed in this area is either correct or incorrect
(Andres mentioned the Halloween problem). Furthermore, I presume that
it's broadly useful to have Bucardo-style custom conflict resolution
policies, without people having to get their hands dirty with C, and I
think having this at the SQL level helps there. Plus, as I've said
many times, the flexibility this syntax offers is likely to be broadly
useful for ordinary SQL clients - this is almost as good as SQL MERGE
for many cases.

>> Seems like an awful lot of additional mechanism.
>
> Not really. Once you have the code in place to do the kill-inserted-tuple
> dance on a conflict, all you need is to do an extra index search before it.
> And once you have that, it's not hard to add some kind of a heuristic to
> either do the pre-check or skip it.

Perhaps.

> I probably shouldn't have mentioned markpos/restrpos, you're right that it's
> not a good idea to conflate that with index insertion. Nevertheless, some
> kind of an API for doing a duplicate-key check prior to insertion, and
> remembering the location for the actual insert later, seems sensible. It's
> certainly no more of a modularity violation than the value-locking scheme
> you're proposing.

I'm not so sure - in principle, any locking implementation can be used
by any conceivable amcanunique indexing method. The core system knows
that it isn't okay to sit on them all day long, but that doesn't seem
very onerous.

>> I'm certainly not opposed to making something like this work for
>> exclusion constraints. Certainly, I want this to be as general as
>> possible. But I don't think that it needs to be a blocker, and I don't
>> think we gain anything in code footprint by addressing that by being
>> as general as possible in our approach to the basic concurrency issue.
>> After all, we're going to have to repeat the basic pattern in multiple
>> modules.
>
>
> Well, I don't know what to say. I *do* have a hunch that we'd gain much in
> code footprint by making this general. I don't understand what pattern you'd
> need to repeat in multiple modules.

Now that I see this rough patch, I better appreciate what you mean. I
withdraw this objection.

> Here's a patch, implementing a rough version of the scheme I'm trying to
> explain. It's not as polished as yours, but it ought to be enough to
> evaluate the code footprint and performance. It doesn't make any changes to
> the indexam API, and it works the same with exclusion constraints and unique
> constraints. As it stands, it doesn't leave bloat behind, except when a
> concurrent insertion with a conflicting key happens between the first
> "pre-check" and the actual insertion. That should be rare in practice.
>
> What have you been using to performance test this?

I was just testing my patch against a custom pgbench workload,
involving running upserts against a table from a fixed range of PK
values. It's proven kind of difficult to benchmark this in the way
that pgbench has proved useful for in the past. Pretty soon the
table's PK range is "saturated", so they're all updates, but on the
other hand how do you balance the INSERT and UPDATE cases?
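
For reference, a minimal sketch of the kind of script I mean, against
an invented table upsert_test(k int4 primary key, v text); the
statement is kept on one line, since pgbench of this vintage reads
scripts line by line:

\setrandom k 1 10000
with r as (insert into upsert_test(k, v) values (:k, 'val') on duplicate key lock for update returning rejects *) update upsert_test set v = r.v from r where upsert_test.k = r.k;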

Multiple unique indexes are the interesting case for comparing both
approaches. I didn't really worry about performance so much as
correctness, and for multiple unique constraints your approach clearly
falls down, as explained below.

>> Is it even useful to lock multiple rows if we can't really
>> update them, because they'll overlap each other when all updated with
>> the one value?
>
>
> Hmm. I think what you're referring to is the case where you try to insert a
> row so that it violates an exclusion constraint, and in a way that it
> conflicts with a large number of existing tuples. For example, if you have a
> calendar application with a constraint that two reservations must not
> overlap, and you try to insert a new reservation that covers, say, a whole
> decade.

Right.

> That's not a problem for ON DUPLICATE KEY IGNORE, as you just ignore the
> conflict and move on. For ON DUPLICATE KEY LOCK FOR UPDATE, I guess we would
> need to handle a large TID array. Or maybe we can arrange it so that the
> tuples are locked as we scan them, without having to collect them all in a
> large array.
>
> (the attached patch only locks the first existing tuple that conflicts; that
> needs to be fixed)

I'm having a hard time seeing how ON DUPLICATE KEY LOCK FOR UPDATE is
of very much use to exclusion constraints at all. Perhaps I lack
imagination here. However, ON DUPLICATE KEY IGNORE certainly *is*
useful with exclusion constraints, and I'm not dismissive of that.

I think we ought to at least be realistic about the concerns that
inform your approach here. I don't think that making this work for
exclusion constraints is all that compelling; I'll take it, I guess
(not that there is obviously a dichotomy between doing btree locking
and doing ECs too), but I doubt people put "overlaps" operators in the
predicates of DML very often *at all*, and doubt even more that there
is actual demand for upserting there. I think that the reason that you
prefer this design is almost entirely down to possible hazards with
btree locking around what I've done (or, indeed anything that
approximates what I've done); maybe that's so obvious that it didn't
even occur to you to mention it, but I think it should be
acknowledged. I don't think that using index locking of *some* form is
unreasonable. Certainly, I think that from reading the literature
(e.g. [1]) one can find evidence that btree page index locking as part
of value locking seems like a common technique in many popular RDBMSs,
and presumably forms an important part of their SQL MERGE
implementations. As it says in that paper:

"""
Thus, non-leaf pages do not require locks and are protected by latches
only. The remainder of this paper focuses on locks.
"""

They talk here about a very traditional System-R architecture -
"Assumptions about the database environment are designed to be very
traditional". Latches here are basically equivalent to our buffer
locks, and what they call locks we call heavyweight locks. So I'm
pretty sure many other *traditional* systems handle value locking by
escalating a "latch" to a leaf-page-level heavyweight lock (it's often
more granular too). I think that the advantages are fairly
fundamental.

I think that "4.1 Locks on keys and ranges" of this paper is interesting.

I've also found a gentler introduction to traditional btree key
locking [2]. In that paper, section "5 Protecting a B-tree’s logical
contents" it is said:

"""
Latches must be managed carefully in key range locking if lockable
resources are defined by keys that may be deleted if not protected.
Until the lock request is inserted into the lock manager’s data
structures, the latch on the data structure in the buffer pool is
required to ensure the existence of the key value. On the other hand,
if a lock cannot be granted immediately, the thread should not hold a
latch while the transaction waits. Thus, after waiting for a key value
lock, a transaction must repeat its root-to-leaf search for the key.
"""

So I strongly suspect that some other systems have found it useful to
escalate from a latch (buffer/page lock) to a lock (heavyweight lock).

I have some concerns about what you've done that may limit my
immediate ability to judge performance, and the relative merits of
both approaches generally. Now, I know you just wanted to sketch
something out, and that's fine, but I'm only sharing my thoughts. I am
particularly worried about the worst case (for either approach),
particularly with more than 1 unique index. I am also worried about
livelock hazards (again, in particular with more than 1 index) - I am
not asserting that they exist in your patch, but they are definitely
more difficult to reason about. Value locking works because once a
page lock is acquired, all unique indexes are inserted into. Could you
have two upserters livelock each other with two unique indexes with
1:1 correlated values in practice (i.e. 2 unique indexes that might
almost work as 1 composite index)? That is a reasonable usage of
upsert, I think.

We never wait on another transaction if there is a conflict when
inserting - we just do the usual UNIQUE_CHECK_PARTIAL thing (we don't
wait for the other xact during btree insertion). This breaks the IGNORE
case (how does it determine the final outcome of the transaction that
inserted what may be a conflict, iff the conflict was only found
during insertion?), which would probably be fine for our purposes if
that were the only issue, but I have concerns about its effects on the
ON DUPLICATE KEY LOCK FOR UPDATE case too. I don't like that an
upserter's ExecInsertIndexTuples() won't wait on other xids generally,
I think. Why should the code btree-insert even though it knows it's
going to kill the heap tuple? It makes things very hard to reason
about.

If you are just mostly thinking about exclusion constraints here, then
I'm not sure that even at this point that it's okay that the IGNORE
case doesn't work there, because IGNORE is the only thing that makes
much sense for exclusion constraints.

The unacceptable deadlocking pattern generally occurs when we try to
lock two different row versions. Your patch is fairly easy to make
deadlock.

Regarding this:

        /*
         * At this point we have either a conflict or a potential conflict. If
         * we're not supposed to raise error, just return the fact of the
         * potential conflict without waiting to see if it's real.
         */
        if (errorOK && !wait)
        {
            conflict = true;
            if (conflictTid)
                *conflictTid = tup->t_self;
            break;
        }

Don't we really just have only a potential conflict? Even if
conflictTid is committed?

I think it's odd that you insert btree index tuples without ever
worrying about waiting (which is what breaks the IGNORE case, you
might say). UNIQUE_CHECK_PARTIAL never gives an xid to wait on from
within _bt_check_unique(). Won't that itself make other sessions block
pending the outcome of our transaction (in non-upserting
ExecInsertIndexTuples(), or in ExecCheckIndexConstraints())? Could
that be why your patch deadlocks unreasonably (that is, in the way
you've already agreed, in your most recent mail, isn't okay)?

Isn't it only okay that UNIQUE_CHECK_PARTIAL might do that for
deferred unique indexes because of the re-check, which may then abort
the xact?

How will this work?:

         * XXX: If we know or assume that there are few duplicates, it would
         * be better to skip this, and just optimistically proceed with the
         * insertion below. You would then leave behind some garbage when a
         * conflict happens, but if it's rare, it doesn't matter much. Some
         * kind of heuristic might be in order here, like stop doing these
         * pre-checks if the last 100 insertions have not been duplicates.

...when you consider that the only place a tid can come from is this pre-check?

Anyway, consider the following simple test-case of your patch.

postgres=# create unlogged table foo
(
a int4 primary key,
b int4 unique
);
CREATE TABLE

If I run the attached pgbench script like this:

pg@hamster:~/pgbench-tools/tests$ pgbench -f upsert.sql -n -c 50 -T 20

I can get it to deadlock (and especially to throw unique constraint
violations) like crazy.  Single unique indexes seemed okay, though I
have my doubts that only allowing one unique index gets us far, or
that it will be acceptable to have the user specify a unique index in
DML or something. I discussed this with Robert in relation to his
design upthread. Multiple unique constraints were *always* the hard
case. I mean, my patch only really does something unconventional
*because* of that case, really. One unique index is easy.

Leaving discussion of value locking aside, just how rough is this
revision of yours? What do you think of certain controversial aspects
of my design that remain unchanged, such as the visibility trick (as
actually implemented, and/or just in principle)? What about the syntax
itself? It is certainly valuable to have additional MERGE-like
functionality above and beyond the basic "upsert", not least for
multi-master conflict resolution with complex resolution policies, and
this syntax gets us much of that.

How would you feel about making it possible for the UPDATE to use a
tidscan, by projecting out the tid that caused a conflict, as a
semi-documented optimization? It might be unfortunate if someone tried
to UPDATE based on that ctid twice, but that is a less common
requirement. It is kind of an abuse of notation, because of course
you're not supposed to be projecting out the conflict-causer but the
rejects, but perhaps we can live with that, if we can live with the
basic idea.
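
A hypothetical sketch, assuming the conflict-causer's ctid were
projected alongside the reject (which is exactly the abuse of
notation I mean):

with r as (
insert into foo(a, b)
values (7, '#')
on duplicate key lock for update
returning rejects foo.ctid, *
)
update foo set b = r.b from r where foo.ctid = r.ctid;
-- the update can then re-find the locked row with a tidscan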

I'm sorry if my thoughts here are not fully developed, but it's hard
to pin this stuff down. Especially since I'm guessing what is and
isn't essential to your design in this rough sketch.

Thanks

--
Peter Geoghegan

Attachment

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sat, Nov 23, 2013 at 11:52 PM, Peter Geoghegan <pg@heroku.com> wrote:
> pg@hamster:~/pgbench-tools/tests$ pgbench -f upsert.sql -n -c 50 -T 20
>
> I can get it to deadlock (and especially to throw unique constraint
> violations) like crazy.

I'm sorry, this test-case is an earlier one that is actually entirely
invalid for the purpose stated (though my concerns stated above remain
- I just didn't think the multi-unique-index case had been exercised
enough, and so did this at the last minute). Please omit it from your
consideration. I think I have been working too late...


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sat, Nov 23, 2013 at 11:52 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I have some concerns about what you've done that may limit my
> immediate ability to judge performance, and the relative merits of
> both approaches generally. Now, I know you just wanted to sketch
> something out, and that's fine, but I'm only sharing my thoughts. I am
> particularly worried about the worst case (for either approach),
> particularly with more than 1 unique index. I am also worried about
> livelock hazards (again, in particular with more than 1 index) - I am
> not asserting that they exist in your patch, but they are definitely
> more difficult to reason about. Value locking works because once a
> page lock is acquired, all unique indexes are inserted into. Could you
> have two upserters livelock each other with two unique indexes with
> 1:1 correlated values in practice (i.e. 2 unique indexes that might
> almost work as 1 composite index)? That is a reasonable usage of
> upsert, I think.

So I had it backwards: In fact, it isn't possible to get your patch to
deadlock when it should - it livelocks instead (whereas with my
patch, as far as I can tell, deadlocks are predictably and correctly
detected). I see an infinite succession of "insertion conflicted
after pre-check" DEBUG1 elog messages, and no progress, which is an
obvious indication of livelock. My test does involve 2 unique indexes
- that's generally the hard case to get right. Dozens of backends are
tied-up in livelock.

Test case for this is attached. My patch is considerably slowed down
by the way this test-case tangles everything up, but does get through
each pgbench run/loop in the bash script predictably enough. And when
I kill the test-case, no backends are left around stuck in perpetual
livelock (with my patch it takes only a few seconds for the deadlock
detector to get around to killing every backend).
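
The shape of the workload is roughly this, as a hedged simplification
(the attached upsert.sql is authoritative; note that pgbench wants
each SQL command on a single line):

\setrandom k 1 100
with r as (insert into foo(a, b) values (:k, :k) on duplicate key lock for update returning rejects *) update foo set b = r.b from r where foo.a = r.a;

Since a and b are 1:1 correlated, concurrent upserters can conflict on
either unique index.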

I'm also seeing this:

Client 45 aborted in state 2: ERROR:  attempted to lock invisible tuple
Client 55 aborted in state 2: ERROR:  attempted to lock invisible tuple
Client 41 aborted in state 2: ERROR:  attempted to lock invisible tuple

To me this seems like a problem with the (potential) total lack of
locking that your approach takes (inserting btree unique index tuples
as in your patch is a form of value locking...sort of...it's a little
hard to reason about as presented). Do you think this might be an
inherent problem, or can you suggest a way to make your approach still
work?

So I probably should have previously listed as a requirement for our design:

* Doesn't just work with one unique index. Naming a unique index
directly in DML, or assuming that the PK is intended seems quite weak
to me.

This is something I discussed plenty with Robert, and I guess I just
forgot to repeat myself when asked.

Thanks
--
Peter Geoghegan

Attachment

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Josh Berkus
Date:
On 11/18/2013 06:44 AM, Heikki Linnakangas wrote:
> I think it's important to recap the design goals of this. I don't think
> these have been listed before, so let me try:
> 
> * It should be usable and perform well for both large batch updates and
> small transactions.
> 
> * It should perform well both when there are no duplicates, and when
> there are lots of duplicates
> 
> And from that follows some finer requirements:
> 
> * Performance when there are no duplicates should be close to raw INSERT
> performance.
> 
> * Performance when all rows are duplicates should be close to raw UPDATE
> performance.
> 
> * We should not leave behind large numbers of dead tuples in either case.

I think this is setting the bar way too high for an initial feature.
Would we like to eventually have all of those things?  Yes.  Do we need
to have all of them for 9.4?  No.

It's more useful to measure this feature against the current
alternatives used by our users, which are upsert functions and similar
patterns.  If we can make things easier and more efficient than those
(which shouldn't be hard), then it's a worthwhile step forwards.

That being said, the other requirement I am concerned about is being
able to support the syntax of this feature in commonly used ORMs.  That
is, can I write a fairly small Django or Rails extension which does
upsert using this patch?  Fortunately, I think I can ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Nov 26, 2013 at 9:11 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> * It should be usable and perform well for both large batch updates and
>> small transactions.
>>
>> * It should perform well both when there are no duplicates, and when
>> there are lots of duplicates
>>
>> And from that follows some finer requirements:
>>
>> * Performance when there are no duplicates should be close to raw INSERT
>> performance.
>>
>> * Performance when all rows are duplicates should be close to raw UPDATE
>> performance.
>>
>> * We should not leave behind large numbers of dead tuples in either case.
>
> I think this is setting the bar way too high for an initial feature.
> Would we like to eventually have all of those things?  Yes.  Do we need
> to have all of them for 9.4?  No.

The requirements around performance/bloat have a lot to do with making
the feature work reasonably well for multi-master conflict resolution.
They also have much more to do with the worst case than the average
case. If the worst case really is terribly bad, that ends up being a
major gotcha. I'm not concerned about bloat as such, but in any case
whether or not Heikki's design can mostly avoid bloat is, for now, of
secondary importance.

I feel the need to re-iterate something I've already said: I don't see
that I have a concession to make here with a view to pragmatically
getting something useful into 9.4. I am playing it as safe as I think
I can.

> It's more useful to measure this feature against the current
> alternatives used by our users, which are upsert functions and similar
> patterns.  If we can make things easier and more efficient than those
> (which shouldn't be hard), then it's a worthwhile step forwards.

Actually, it's very hard. I don't have license to burn through xids
the way the existing subtransaction-based upsert functions do.
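
The pattern I mean is the usual looping upsert function, along the
lines of the documentation's PL/pgSQL example - a hedged sketch, with
hypothetical names:

create function upsert_foo(k int4, v int4) returns void as
$$
begin
    loop
        -- try to update an existing row first
        update foo set b = v where a = k;
        if found then
            return;
        end if;
        -- no row there: try to insert; the exception clause makes
        -- this block a subtransaction
        begin
            insert into foo(a, b) values (k, v);
            return;
        exception when unique_violation then
            -- somebody else inserted concurrently; retry the update
        end;
    end loop;
end;
$$ language plpgsql;

Every trip through that exception block burns a subtransaction, and
therefore an xid. A server feature doesn't get to do that.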

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 11/26/13 01:59, Peter Geoghegan wrote:
> On Sat, Nov 23, 2013 at 11:52 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I have some concerns about what you've done that may limit my
>> immediate ability to judge performance, and the relative merits of
>> both approaches generally. Now, I know you just wanted to sketch
>> something out, and that's fine, but I'm only sharing my thoughts. I am
>> particularly worried about the worst case (for either approach),
>> particularly with more than 1 unique index. I am also worried about
>> livelock hazards (again, in particular with more than 1 index) - I am
>> not asserting that they exist in your patch, but they are definitely
>> more difficult to reason about. Value locking works because once a
>> page lock is acquired, all unique indexes are inserted into. Could you
>> have two upserters livelock each other with two unique indexes with
>> 1:1 correlated values in practice (i.e. 2 unique indexes that might
>> almost work as 1 composite index)? That is a reasonable usage of
>> upsert, I think.
>
> So I had it backwards: In fact, it isn't possible to get your patch to
> deadlock when it should - it livelocks instead (whereas with my
> patch, as far as I can tell, deadlocks are predictably and correctly
> detected). I see an infinite succession of "insertion conflicted
> after pre-check" DEBUG1 elog messages, and no progress, which is an
> obvious indication of livelock. My test does involve 2 unique indexes
> - that's generally the hard case to get right. Dozens of backends are
> tied-up in livelock.
>
> Test case for this is attached.

Great, thanks! I forgot to reset the "conflicted" variable when looping
to retry, so that once it got into the "insertion conflicted after
pre-check" situation, it never got out of it.

After fixing that bug, I'm getting a correctly-detected deadlock every
now and then with that test case.

> I'm also seeing this:
>
> Client 45 aborted in state 2: ERROR:  attempted to lock invisible tuple
> Client 55 aborted in state 2: ERROR:  attempted to lock invisible tuple
> Client 41 aborted in state 2: ERROR:  attempted to lock invisible tuple

Hmm. That's because the trick I used to kill the just-inserted tuple
confuses a concurrent heap_lock_tuple call. It doesn't expect the tuple
it's locking to become invisible. Actually, doesn't your patch have the
same bug? If you're about to lock a tuple in ON DUPLICATE KEY LOCK FOR
UPDATE, and the transaction that inserted the duplicate row aborts just
before the heap_lock_tuple() call, I think you'd also see that error.

> To me this seems like a problem with the (potential) total lack of
> locking that your approach takes (inserting btree unique index tuples
> as in your patch is a form of value locking...sort of...it's a little
> hard to reason about as presented). Do you think this might be an
> inherent problem, or can you suggest a way to make your approach still
> work?

Just garden-variety bugs :-). Attached patch fixes both issues.

> So I probably should have previously listed as a requirement for our design:
>
> * Doesn't just work with one unique index. Naming a unique index
> directly in DML, or assuming that the PK is intended seems quite weak
> to me.
>
> This is something I discussed plenty with Robert, and I guess I just
> forgot to repeat myself when asked.

Totally agreed on that.

- Heikki

Attachment

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Nov 26, 2013 at 11:32 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> After fixing that bug, I'm getting a correctly-detected deadlock every now
> and then with that test case.

We'll probably want to carefully consider how
predictably/deterministically this occurs.

> Hmm. That's because the trick I used to kill the just-inserted tuple
> confuses a concurrent heap_lock_tuple call. It doesn't expect the tuple it's
> locking to become invisible. Actually, doesn't your patch have the same bug?
> If you're about to lock a tuple in ON DUPLICATE KEY LOCK FOR UPDATE, and the
> transaction that inserted the duplicate row aborts just before the
> heap_lock_tuple() call, I think you'd also see that error.

Yes, that's true. It will occur much more frequently with your
previous revision, but the V4 patch is also affected.

>> To me this seems like a problem with the (potential) total lack of
>> locking that your approach takes (inserting btree unique index tuples
>> as in your patch is a form of value locking...sort of...it's a little
>> hard to reason about as presented). Do you think this might be an
>> inherent problem, or can you suggest a way to make your approach still
>> work?
>
>
> Just garden-variety bugs :-). Attached patch fixes both issues.

Great. I'll let you know what I think.

>> * Doesn't just work with one unique index. Naming a unique index
>> directly in DML, or assuming that the PK is intended seems quite weak
>> to me.

> Totally agreed on that.

Good.

BTW, you keep forgetting to add "expected" output of the new isolation tests.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Nov 26, 2013 at 1:41 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Great. I'll let you know what I think.

So having taken a look at what you've done here, some concerns remain.
I'm coming up with a good explanation/test case, which might be easier
than trying to explain it any other way.

There are some visibility-related race conditions even still, with the
same test case as before. It takes a good while to recreate, but can
be done after several hours on an 8 core server under my control:

pg@gerbil:~/pgdata$ ls -l -h -a hack_log.log
-rw-rw-r-- 1 pg pg 1.6G Nov 27 05:10 hack_log.log
pg@gerbil:~/pgdata$ cat hack_log.log | grep visible
ERROR:  attempted to update invisible tuple
ERROR:  attempted to update invisible tuple
ERROR:  attempted to update invisible tuple

FWIW I'm pretty sure that my original patch has the same bug, but it
hardly matters now.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Nov 26, 2013 at 8:19 PM, Peter Geoghegan <pg@heroku.com> wrote:
> There are some visibility-related race conditions even still

I also see this, sandwiched between the very many "deadlock detected"
errors recorded over 6 or so hours (this is in chronological order,
with no ERRORs omitted within the range shown):

ERROR:  deadlock detected
ERROR:  deadlock detected
ERROR:  deadlock detected
ERROR:  unable to fetch updated version of tuple
ERROR:  unable to fetch updated version of tuple
ERROR:  unable to fetch updated version of tuple
ERROR:  unable to fetch updated version of tuple
ERROR:  unable to fetch updated version of tuple
ERROR:  unable to fetch updated version of tuple
ERROR:  unable to fetch updated version of tuple
ERROR:  unable to fetch updated version of tuple
ERROR:  unable to fetch updated version of tuple
ERROR:  unable to fetch updated version of tuple
ERROR:  unable to fetch updated version of tuple
ERROR:  deadlock detected
ERROR:  deadlock detected
ERROR:  deadlock detected
ERROR:  deadlock detected

This, along with the already-discussed "attempted to update invisible
tuple" forms a full account of unexpected ERRORs seen during the
extended run of the test case, so far.

Since it took me a relatively long time to recreate this, it may not
be trivial to do so. Unless you don't think it's useful to do so, I'm
going to give this test a full 24 hours, just in case it shows up
anything else like this.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-11-27 01:09:49 -0800, Peter Geoghegan wrote:
> On Tue, Nov 26, 2013 at 8:19 PM, Peter Geoghegan <pg@heroku.com> wrote:
> > There are some visibility-related race conditions even still
> 
> I also see this, sandwiched between the very many "deadlock detected"
> errors recorded over 6 or so hours (this is in chronological order,
> with no ERRORs omitted within the range shown):
> 
> ERROR:  deadlock detected
> ERROR:  deadlock detected
> ERROR:  deadlock detected
> ERROR:  unable to fetch updated version of tuple
> ERROR:  unable to fetch updated version of tuple
> ERROR:  unable to fetch updated version of tuple
> ERROR:  unable to fetch updated version of tuple
> ERROR:  unable to fetch updated version of tuple
> ERROR:  unable to fetch updated version of tuple
> ERROR:  unable to fetch updated version of tuple
> ERROR:  unable to fetch updated version of tuple
> ERROR:  unable to fetch updated version of tuple
> ERROR:  unable to fetch updated version of tuple
> ERROR:  unable to fetch updated version of tuple
> ERROR:  deadlock detected
> ERROR:  deadlock detected
> ERROR:  deadlock detected
> ERROR:  deadlock detected
> 
> This, along with the already-discussed "attempted to update invisible
> tuple" forms a full account of unexpected ERRORs seen during the
> extended run of the test case, so far.

I think at least the "unable to fetch updated version of tuple" ERRORs
are likely to be an unrelated 9.3+ bug that I've recently
reported. Alvaro has a patch. Cf. 20131124000203.GA4403@alap2.anarazel.de

Even the "deadlock detected" errors might be a fkey-locking issue. Bug
#8434, but that's really hard to know without more details.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Wed, Nov 27, 2013 at 1:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Even the "deadlock detected" errors might be a fkey-locking issue. Bug
> #8434, but that's really hard to know without more details.

Thanks, I was aware of that but didn't make the connection.

I've written a test-case that is designed to exercise one case that
deadlocks like crazy - deadlocking is the expected, correct behavior.
The deadlock errors are not in themselves suspicious. Actually, if
anything I find it suspicious that there aren't more deadlocks.


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Wed, Nov 27, 2013 at 1:09 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Since it took me a relatively long time to recreate this, it may not
> be trivial to do so. Unless you don't think it's useful to do so, I'm
> going to give this test a full 24 hours, just in case it shows up
> anything else like this.

I see a further, distinct error message this morning:

"ERROR:  unrecognized heap_lock_tuple status: 1"

This is a would-be "attempted to lock invisible tuple" error, but with
the error raised by some heap_lock_tuple() call site, unlike the
previous situation where heap_lock_tuple() raised the error directly.
Since with the most recent revision, we handle this (newly possible)
return code in the new ExecLockHeapTupleForUpdateSpec() function, that
just leaves EvalPlanQualFetch() as a plausible place to see it, given
the codepaths exercised in the test case.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Wed, Nov 27, 2013 at 1:09 AM, Peter Geoghegan <pg@heroku.com> wrote:
> This, along with the already-discussed "attempted to update invisible
> tuple" forms a full account of unexpected ERRORs seen during the
> extended run of the test case, so far.

Actually, it was slightly misleading of me to say it's the same
test-case; in fact, this time I ran each pgbench run with a variable,
random number of seconds between 2 and 20 inclusive (as opposed to
always 2 seconds). If you happen to need help recreating this, I am
happy to give it.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
What's the status of this patch? I posted my version using a quite 
different approach than your original patch. You did some testing of 
that, and ran into unrelated bugs. Have they been fixed now?

Where do we go from here? Are you planning to continue based on my 
proof-of-concept patch, fixing the known issues with that? Or do you 
need more convincing?

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Dec 12, 2013 at 1:23 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> What's the status of this patch? I posted my version using a quite different
> approach than your original patch. You did some testing of that, and ran
> into unrelated bugs. Have they been fixed now?

Sorry, I dropped the ball on this. I'm doing a bit more testing of an
approach to fixing the new bugs. I'll let you know how I get on
tomorrow (later today for you).


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Dec 12, 2013 at 1:47 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Sorry, I dropped the ball on this.

Thank you for your patience, Heikki.

I attached two revisions - one of my patch (btreelock_insert_on_dup)
and one of your alternative design (exclusion_insert_on_dup). In both
cases I've added a new visibility rule to HeapTupleSatisfiesUpdate(),
and enabled projecting the duplicate-causing tid by means of the ctid
system column when RETURNING REJECTS. I'm not in an immediate position
to satisfy myself that the former revision is correct (I'm travelling
tomorrow morning and running a bit short on time) and I'm not
proposing the latter for inclusion as part of the feature (that's a
discussion we may have in time, but it serves a useful purpose during
testing).

Both of these revisions have identical ad-hoc test cases included as
new files - see testcase.sh and upsert.sql. My patch doesn't have any
unique constraint violations, and has pretty consistent performance,
while yours has many unique constraint violations. I'd like to hear
your thoughts on the testcase, and the design implications.

--
Peter Geoghegan

Attachment

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> I attached two revisions - one of my patch (btreelock_insert_on_dup)
> and one of your alternative design (exclusion_insert_on_dup).

I spent a little bit of time looking at btreelock_insert_on_dup.  AFAICT
it executes FormIndexDatum() for later indexes while holding assorted
buffer locks in earlier indexes.  That really ain't gonna do, because in
the case of an expression index, FormIndexDatum will execute nearly
arbitrary user-defined code, which might well result in accesses to those
indexes or others.  What we'd have to do is refactor so that all the index
tuple values get computed before we start to insert any of them.  That
doesn't seem impossible, but it implies a good deal more refactoring than
has been done here.

Once we do that, I wonder if we couldn't get rid of the LWLockWeaken/
Strengthen stuff.  That scares the heck out of me; I think it's deadlock
problems waiting to happen.

Another issue is that the number of buffer locks being held doesn't seem
to be bounded by anything much.  The current LWLock infrastructure has a
hard limit on how many lwlocks can be held per backend.

Also, the lack of any doc updates makes it hard to review this.  I can
see that you don't want to touch the user-facing docs until the syntax
is agreed on, but at the very least you ought to produce real updates
for the indexam API spec, since you're changing that significantly.

BTW, so far as the syntax goes, I'm quite distressed by having to make
REJECTS into a fully-reserved word.  It's not reserved according to the
standard, and it seems pretty likely to be something that apps might be
using as a table or column name.
        regards, tom lane



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Dec 13, 2013 at 4:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I spent a little bit of time looking at btreelock_insert_on_dup.  AFAICT
> it executes FormIndexDatum() for later indexes while holding assorted
> buffer locks in earlier indexes.  That really ain't gonna do, because in
> the case of an expression index, FormIndexDatum will execute nearly
> arbitrary user-defined code, which might well result in accesses to those
> indexes or others.  What we'd have to do is refactor so that all the index
> tuple values get computed before we start to insert any of them.  That
> doesn't seem impossible, but it implies a good deal more refactoring than
> has been done here.

We were proceeding on the basis that what I'd done, if deemed
acceptable in principle, could eventually be replaced by an
alternative value locking implementation - one that similarly extends
the limited way in which value locking already occurs (i.e. unique
index enforcement's buffer locking), but without the downsides.
While I certainly appreciate your input, I still think that there is a
controversy about what implementation gets us the most useful
semantics, and I think we should now focus on resolving it. I am not
sure that Heikki's approach is functionally equivalent to mine. At the
very least, I think the trade-off of doing one or the other should be
well understood.

> Once we do that, I wonder if we couldn't get rid of the LWLockWeaken/
> Strengthen stuff.  That scares the heck out of me; I think it's deadlock
> problems waiting to happen.

There are specific caveats around using those. I think that they could
be useful elsewhere, but are likely to only ever have a few clients.
As previously mentioned, the same semantics appear in other similar
locking primitives in other domains, so fwiw it really doesn't strike
me as all that controversial. I agree that their *usage* is not
acceptable as-is. I've only left the usage in the patch to give us
some basis for reasoning about the performance on mixed workloads for
comparative purposes. Perhaps I shouldn't have even done that, to
better focus reviewer attention on the semantics implied by each
implementation.

> Also, the lack of any doc updates makes it hard to review this.  I can
> see that you don't want to touch the user-facing docs until the syntax
> is agreed on, but at the very least you ought to produce real updates
> for the indexam API spec, since you're changing that significantly.

I'll certainly do that in any future revision.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Dec 12, 2013 at 4:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Both of these revisions have identical ad-hoc test cases included as
> new files - see testcase.sh and upsert.sql. My patch doesn't have any
> unique constraint violations, and has pretty consistent performance,
> while yours has many unique constraint violations. I'd like to hear
> your thoughts on the testcase, and the design implications.

I withdraw the test-case. Both approaches behave similarly if you look
for long enough, and that's okay.

I also think that changes to HeapTupleSatisfiesUpdate() are made
unnecessary by recent bug fixes to that function. The test case
previously described [1] that broke that is no longer recreatable, at
least so far.

Do you think that we need to throw a serialization failure within
ExecLockHeapTupleForUpdateSpec() iff heap_lock_tuple() returns
HeapTupleInvisible and IsolationUsesXactSnapshot()? Also, I'm having a
hard time figuring out a good choke point to catch MVCC snapshots
availing of our special visibility rule where they should not due to
IsolationUsesXactSnapshot(). It seems sufficient to continue to assume
that Postgres won't attempt to lock any tid invisible under
conventional MVCC rules in the first place, except within
ExecLockHeapTupleForUpdateSpec(), but what do we actually do within
ExecLockHeapTupleForUpdateSpec()? I'm thinking of a new tqual.c
routine concerning the tuple being in the future that we re-check when
IsolationUsesXactSnapshot(). That's not very modular, though. Maybe
we'd go through heapam.c.

I think it doesn't matter, for the purposes of higher isolation
levels, that what now constitute MVCC snapshots have the new, special
"reach into the future" rule, because we'll have a serialization
failure within ExecLockHeapTupleForUpdateSpec() before this is allowed
to become a problem. In order for the new rule to be relevant, we'd
have to be the Xact to lock in the first place, and as
relevant, we'd have to be the Xact to lock in the first place, and as
an xact in non-read-committed mode, we'd be sure to call the new
tqual.c "in the future" routine or whatever. Only upserters can lock a
row in the future, so it is the job of upserters to care about this
special case.
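
For comparison, the precedent I have in mind is how ordinary row
locking already behaves at higher isolation levels - stock Postgres,
nothing patch-specific:

-- session 1
begin isolation level repeatable read;
select * from foo;   -- acquires the transaction snapshot

-- session 2 (autocommit)
update foo set b = 0 where a = 1;

-- session 1
select * from foo where a = 1 for update;
ERROR:  could not serialize access due to concurrent update

The open question is where the equivalent check belongs when the tuple
being locked is in the future, rather than merely concurrently
updated.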

Incidentally, I tried to rebase recently, and saw some shift/reduce
conflicts due to 1b4f7f93b4693858cb983af3cd557f6097dab67b, "Allow
empty target list in SELECT". The fix for that is not immediately
obvious.

So I think we should proceed with the non-conclusive-check-first
approach (if only on pragmatic grounds), but even now I'm not really
sure. I think there might be unprincipled deadlocking should
ExecInsertIndexTuples() fail to be completely consistent about its
ordering of insertion - the use of dirty snapshots (including as part
of conventional !UNIQUE_CHECK_PARTIAL unique index enforcement) plays
a part in this risk. Roughly speaking, heap_delete() doesn't render
the tuple immediately invisible to some-other-xact's dirty snapshot
[2], and I think that could have unpleasant interactions, even if it
is also beneficial in some ways. Our old, dead tuples from previous
attempts stick around, and function as "value locks" to everyone else,
since for example _bt_check_unique() cares about visibility having
merely been affected, which is grounds for blocking. More
counter-intuitive still, we go ahead with "value locking" (i.e. btree
UNIQUE_CHECK_PARTIAL tuple insertion originating from the main
speculative ExecInsertIndexTuples() call) even though we already know
that we will delete the corresponding heap row (which, as noted, still
satisfies HeapTupleSatisfiesDirty() and so is value-lock-like).

Empirically, retrying because ExecInsertIndexTuples() returns some
recheckIndexes occurs infrequently, so maybe that makes all of this
okay. Or maybe it happens infrequently *because* we don't give up on
insertion when it looks like the current iteration is futile. Maybe
just inserting into every unique index, and then blocking on an xid
within ExecCheckIndexConstraints(), works out fairly well and performs
reasonably in all common cases. It's pretty damn subtle, though, and I
worry about the worst case performance, and basic correctness issues
for these reasons. The fact that deferred unique indexes also use
UNIQUE_CHECK_PARTIAL is cold comfort -- that only ever has to throw
an error on conflict, and only once. We haven't "earned the right" to
lock *all* values in all unique indexes, but kind of do so anyway in
the event of an "insertion conflicted after pre-check".

Another concern that bears reiterating is: I think making the
lock-for-update case work for exclusion constraints is a lot of
additional complexity for a very small return.

Do you think it's worth optimizing ExecInsertIndexTuples() to avoid
futile non-unique/exclusion constrained index tuple insertion?

[1] http://www.postgresql.org/message-id/CAM3SWZS2--GOvUmYA2ks_aNyfesb0_H6T95_k8+wyx7Pi=CQvw@mail.gmail.com

[2] https://github.com/postgres/postgres/blob/94b899b829657332bda856ac3f06153d09077bd1/src/backend/utils/time/tqual.c#L798

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Wed, Dec 18, 2013 at 8:39 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Empirically, retrying because ExecInsertIndexTuples() returns some
> recheckIndexes occurs infrequently, so maybe that makes all of this
> okay. Or maybe it happens infrequently *because* we don't give up on
> insertion when it looks like the current iteration is futile. Maybe
> just inserting into every unique index, and then blocking on an xid
> within ExecCheckIndexConstraints(), works out fairly well and performs
> reasonably in all common cases. It's pretty damn subtle, though, and I
> worry about the worst case performance, and basic correctness issues
> for these reasons.

I realized that it's possible to create the problem that I'd
previously predicted with "promise tuples" [1] some time ago, that are
similar in some regards to what Heikki has here. At the time, Robert
seemed to agree that this was a concern [2].

I have a very simple test-case attached, much simpler than previous
test-cases, that reproduces deadlock for the patch
exclusion_insert_on_dup.2013_12_12.patch.gz at scale=1 frequently, and
occasionally when scale=10 (for tiny, single-statement transactions).
With scale=100, I can't get it to deadlock on my laptop (60 clients in
all cases), at least in a reasonable time period. With the patch
btreelock_insert_on_dup.2013_12_12.patch.gz, it will never deadlock,
even with scale=1, simply because value locks are not held on to
across row locking. This is why I characterized the locking as
"opportunistic" on several occasions in months past.

The test-case is actually much simpler than the one I describe in [1],
and much simpler than all previous test-cases, as there is only one
unique index, though the problem is essentially the same. It is down
to old "value locks" held across retries - with "exclusion_...", we
can't *stop* locking things from previous locking attempts (where a
locking attempt is btree insertion with the UNIQUE_CHECK_PARTIAL
flag), because dirty snapshots still see
inserted-then-deleted-in-other-xact tuples. This deadlocking seems
unprincipled and unjustified, which is a concern that I had all along,
and a concern that Heikki seemed to share more recently [3]. This is
why I felt strongly all along that value locks ought to be cheap to
both acquire and _release_, and it's unfortunate that so much time was
wasted on tangential issues, though I do accept some responsibility
for that.
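
A hedged reconstruction of the interleaving, with both backends
upserting the same value:

-- backend 1: pre-check sees no duplicate; inserts heap + index tuple
-- backend 2: pre-check sees no duplicate; inserts its own tuples
-- backend 1: _bt_check_unique() sees backend 2's tuple; kills its own
--            heap tuple, then XactLockTableWait()s on backend 2
-- backend 2: _bt_check_unique() sees backend 1's tuple - dead, but
--            still satisfying a dirty snapshot - then
--            XactLockTableWait()s on backend 1
-- Each waits on the other: deadlock. Backend 1's "value lock" is a
-- tuple it has already deleted, and cannot release.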

So, I'd like to request as much scrutiny as possible from as wide as
possible a cross section of contributors of this test case
specifically. This feature's development is deadlocked on resolving
this unprincipled deadlocking controversy. This is a relatively easy
thing to have an opinion on, and I'd like to hear as many as possible.

Is this deadlocking something we can live with? What is a reasonable
path forward? Perhaps I am being pedantic in considering unnecessary
deadlocking as ipso facto unacceptable (after all, MySQL lived with
this kind of problem for long enough, even if it has gotten better for
them recently), but there is a very real danger of painting ourselves
into a corner with these concurrency issues. I aim to have the
community understand ahead of time the exact set of
trade-offs/semantics implied by our chosen implementation, whatever
the outcome. That seems very important. I myself lean towards this
being a blocker for the "exclusion_" approach at least as presented.

Now, you might say to yourself "why should I assume that this isn't
just attributable to btree page buffer locks being coarser than other
approaches to value locking?". That's a reasonable point, and indeed
it's why I avoided lower scale values in prior, more complicated
test-cases, but that doesn't actually account for the problem
highlighted: In this test-case we do not hold buffer locks across
other buffer locks within a single backend (at least in any new way),
nor do we lock rows while holding buffer locks within a single
backend. Quite simply, the conventional btree value locking approach
doesn't attempt to lock 2 things within a backend at the same time,
and you need to do that to get a deadlock, so there are no deadlocks.
Importantly, the "btree_..." implementation can release value locks.

Thanks

P.S. In the interest of reproducibility, I attach new revisions of
each patch, even though there is no reason to believe that any changes
since the last revision posted are significant to the test-case. There
was some diff minimization, plus I incorporated some (but not all)
unrelated feedback from Tom. It wasn't immediately obvious, at least
to me, that "rejects" can be made to be an unreserved keyword, due to
shift/reduce conflicts, but I did document AM changes. Hopefully this
gives some indication of the essential nature or intent of my design
that we may work towards refining (depending on the outcome of
discussion here, of course).

P.P.S. Be careful not to fall afoul of the shift/reduce conflicts when
applying either patch on top of commit
1b4f7f93b4693858cb983af3cd557f6097dab67b. I'm working on a fix that
allows a clean rebase.

[1] http://www.postgresql.org/message-id/CAM3SWZRfrw+zXe7CKt6-QTCuvKQ-Oi7gnbBOPqQsvddU=9M7_g@mail.gmail.com

[2] http://www.postgresql.org/message-id/CA+TgmobwDZSVcKWTmVNBxeHSe4LCnW6zon2soH6L7VoO+7tAzw@mail.gmail.com

[3] http://www.postgresql.org/message-id/528B640F.50601@vmware.com

--
Peter Geoghegan

Attachment

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 12/20/2013 06:06 AM, Peter Geoghegan wrote:
> On Wed, Dec 18, 2013 at 8:39 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Empirically, retrying because ExecInsertIndexTuples() returns some
>> recheckIndexes occurs infrequently, so maybe that makes all of this
>> okay. Or maybe it happens infrequently *because* we don't give up on
>> insertion when it looks like the current iteration is futile. Maybe
>> just inserting into every unique index, and then blocking on an xid
>> within ExecCheckIndexConstraints(), works out fairly well and performs
>> reasonably in all common cases. It's pretty damn subtle, though, and I
>> worry about the worst case performance, and basic correctness issues
>> for these reasons.
>
> I realized that it's possible to create the problem that I'd
> previously predicted with "promise tuples" [1] some time ago, that are
> similar in some regards to what Heikki has here. At the time, Robert
> seemed to agree that this was a concern [2].
>
> I have a very simple test-case attached, much simpler than previous
> test-cases, that reproduces deadlock for the patch
> exclusion_insert_on_dup.2013_12_12.patch.gz at scale=1 frequently, and
> occasionally when scale=10 (for tiny, single-statement transactions).
> With scale=100, I can't get it to deadlock on my laptop (60 clients in
> all cases), at least in a reasonable time period. With the patch
> btreelock_insert_on_dup.2013_12_12.patch.gz, it will never deadlock,
> even with scale=1, simply because value locks are not held on to
> across row locking. This is why I characterized the locking as
> "opportunistic" on several occasions in months past.
>
> The test-case is actually much simpler than the one I describe in [1],
> and much simpler than all previous test-cases, as there is only one
> unique index, though the problem is essentially the same. It is down
> to old "value locks" held across retries - with "exclusion_...", we
> can't *stop* locking things from previous locking attempts (where a
> locking attempt is btree insertion with the UNIQUE_CHECK_PARTIAL
> flag), because dirty snapshots still see
> inserted-then-deleted-in-other-xact tuples. This deadlocking seems
> unprincipled and unjustified, which is a concern that I had all along,
> and a concern that Heikki seemed to share more recently [3]. This is
> why I felt strongly all along that value locks ought to be cheap to
> both acquire and _release_, and it's unfortunate that so much time was
> wasted on tangential issues, though I do accept some responsibility
> for that.

Hmm. If I understand the problem correctly, it's that as soon as another 
backend sees the tuple you've inserted and calls XactLockTableWait(), it 
will not stop waiting even if we later decide to kill the 
already-inserted tuple.

One approach to fix that would be to release and immediately re-acquire 
the transaction-lock, when you kill an already-inserted tuple. Then 
teach the callers of XactLockTableWait() to re-check if the tuple is 
still alive. I'm just waving hands here, but the general idea is to 
somehow wake up others when you kill the tuple.

We could make use of that facility to also let others to proceed, if you 
delete a tuple in the same transaction that you insert it. It's a corner 
case, not worth much on its own, but I think it would fall out of the 
above machinery for free, and be an easier way to test it than inducing 
deadlocks with ON DUPLICATE.

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Fri, Dec 20, 2013 at 3:39 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Hmm. If I understand the problem correctly, it's that as soon as another
> backend sees the tuple you've inserted and calls XactLockTableWait(), it
> will not stop waiting even if we later decide to kill the already-inserted
> tuple.
>
> One approach to fix that would be to release and immediately re-acquire the
> transaction-lock, when you kill an already-inserted tuple. Then teach the
> callers of XactLockTableWait() to re-check if the tuple is still alive.

That particular mechanism sounds like a recipe for unintended consequences.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Alvaro Herrera
Date:
Robert Haas escribió:
> On Fri, Dec 20, 2013 at 3:39 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> > Hmm. If I understand the problem correctly, it's that as soon as another
> > backend sees the tuple you've inserted and calls XactLockTableWait(), it
> > will not stop waiting even if we later decide to kill the already-inserted
> > tuple.
> >
> > One approach to fix that would be to release and immediately re-acquire the
> > transaction-lock, when you kill an already-inserted tuple. Then teach the
> > callers of XactLockTableWait() to re-check if the tuple is still alive.
> 
> That particular mechanism sounds like a recipe for unintended consequences.

Yep, what I thought too.

There are probably other ways to make that general idea work though.  I
didn't follow this thread carefully, but is the idea that there would be
many promise tuples "live" at any one time, or only one?  Because if
there's only one, or a very limited number, it might be workable to
sleep on that tuple's lock instead of the xact's lock.

Another thought is to have a different LockTagType that signals a
transaction that's doing the INSERT/ON DUPLICATE thingy, and remote
backends sleep on that instead of the regular transaction lock.  That
different lock type could be released and reacquired as proposed by
Heikki above without danger of unintended consequences.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 12/20/2013 10:56 PM, Alvaro Herrera wrote:
> Robert Haas escribió:
>> On Fri, Dec 20, 2013 at 3:39 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> Hmm. If I understand the problem correctly, it's that as soon as another
>>> backend sees the tuple you've inserted and calls XactLockTableWait(), it
>>> will not stop waiting even if we later decide to kill the already-inserted
>>> tuple.
>>>
>>> One approach to fix that would be to release and immediately re-acquire the
>>> transaction-lock, when you kill an already-inserted tuple. Then teach the
>>> callers of XactLockTableWait() to re-check if the tuple is still alive.
>>
>> That particular mechanism sounds like a recipe for unintended consequences.
>
> Yep, what I thought too.
>
> There are probably other ways to make that general idea work though.  I
> didn't follow this thread carefully, but is the idea that there would be
> many promise tuples "live" at any one time, or only one?  Because if
> there's only one, or a very limited number, it might be workable to
> sleep on that tuple's lock instead of the xact's lock.

Only one.

heap_update() and heap_delete() also grab a heavy-weight lock on the 
tuple, before calling XactLockTableWait(). _bt_doinsert() does not, but 
it could. Perhaps we can take advantage of that.

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Dec 20, 2013 at 12:39 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Hmm. If I understand the problem correctly, it's that as soon as another
> backend sees the tuple you've inserted and calls XactLockTableWait(), it
> will not stop waiting even if we later decide to kill the already-inserted
> tuple.

Forgive me for being pedantic, but I wouldn't describe it that way.
Quite simply, the tuples speculatively inserted (and possibly later
deleted) are functionally value locks, that presently cannot be easily
released (so my point is it doesn't matter if you're currently waiting
on XactLockTableWait() or are just about to). I have to wonder about
the performance implications of fixing this, even if we suppose the
fix is itself inexpensive. The current approach probably benefits from
not having to re-acquire value locks from previous attempts, since
everyone still has to care about "value locks" from our previous
attempts.

The more I think about it, the more opposed I am to letting this
slide, which is a notion I had considered last night, if only because
MySQL did so for many years. This is qualitatively different from
other cases where we deadlock. Even back when we exclusive locked rows
as part of foreign key enforcement, I think it was more or less always
possible to do an analysis of the dependencies that existed, to ensure
that locks were acquired in a predictable order so that deadlocking
could not occur. Now, maybe that isn't practical for an entire app,
but it is practical to do in a localized way as problems emerge. In
contrast, if we allowed unprincipled deadlocking, the only advice we
could give is "stop doing so much upserting".


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Dec 20, 2013 at 1:12 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> There are probably other ways to make that general idea work though.  I
>> didn't follow this thread carefully, but is the idea that there would be
>> many promise tuples "live" at any one time, or only one?  Because if
>> there's only one, or a very limited number, it might be workable to
>> sleep on that tuple's lock instead of the xact's lock.
>
> Only one.
>
> heap_update() and heap_delete() also grab a heavy-weight lock on the tuple,
> before calling XactLockTableWait(). _bt_doinsert() does not, but it could.
> Perhaps we can take advantage of that.

I am skeptical of this approach. It sounds like you're saying that
you'd like to intersperse value and row locking, such that you'd
definitely get a row lock on your first attempt after detecting a
duplicate. With respect, I dismissed this months ago. Why should it be
okay to leave earlier, actually inserted index tuples (from earlier
unique indexes) behind? You still have to delete those (that is, the
heap tuple) on conflict, and what you outline is sufficiently
hand-wavey for me to strongly doubt the feasibility of making earlier
btree tuples not behave as pseudo value locks ***in all relevant
contexts***. How exactly do you determine that row versions were
*deleted*? How do you sensibly differentiate between updates and
deletes, or do you? What of lock starvation hazards? Perhaps I've
misunderstood, but detecting and reasoning about deletedness like this
seems like a major modularity violation, even by the standards of the
btree AM. Do XactLockTableWait() callers have to re-check
tuple-deletedness both before and after their XactLockTableWait()
call? For regular non-upserting inserters too?

I think that the way forward is to refine my design in order to
upgrade locks from exclusive buffer locks to something else, managed
by the lock manager but perhaps through an additional layer of
indirection. As previously outlined, I'm thinking of a new SLRU-based
granular value locking infrastructure built for this purpose, with
btree inserters marking pages as having an entry in this table. That
doesn't sound like much fun to go and implement, but it's reasonably
well precedented, if authoritative transaction processing papers are
anything to go by, as previously noted [1].

I hate to make a plausibility argument, particularly at this late
stage, but: no one, myself included, has managed to find any holes in
the semantics implied by my implementation in the last few months. It
is relatively easy to reason about, and doesn't leave the idea of an
amcanunique abstraction in tatters, nor does it expand the already
byzantine tuple locking infrastructure in a whole new direction. These
are strong advantages. It really isn't hard to imagine a totally sound
implementation of the same idea -- what I do with buffer locks, but
without actual buffer locks and their obvious attendant disadvantages,
and without appreciably regressing the performance of non-upsert
use-cases. AFAICT, there is way less uncertainty around doing this,
unless you think that unprincipled deadlocking is morally defensible,
which I don't believe you or anyone else does.

[1] http://www.postgresql.org/message-id/CAM3SWZQ9XMM8bZyNX3memy1AMQcKqXuUSy8t1iFqZz999U_AGQ@mail.gmail.com
-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Dec 20, 2013 at 11:59 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I think that the way forward is to refine my design in order to
> upgrade locks from exclusive buffer locks to something else, managed
> by the lock manager but perhaps through an additional layer of
> indirection. As previously outlined, I'm thinking of a new SLRU-based
> granular value locking infrastructure built for this purpose, with
> btree inserters marking pages as having an entry in this table.

I'm working on a revision that holds lmgr page-level exclusive locks
(and buffer pins) across multiple operations.  This isn't too
different to what you've already seen, since they are still only held
for an instant. Notably, hash indexes currently quickly grab and
release lmgr page-level locks, though they're the only existing
clients of that infrastructure. I think on reflection that
fully-fledged value locking may be overkill, given the fact that these
locks are only held for an instant, and only need to function as a
choke point for unique index insertion, and only when upserting
occurs.

This approach seems promising. It didn't take me very long to get it
to a place where it passed a few prior test-cases of mine, with fairly
varied input, though the patch isn't likely to be posted for another
few days. I think I can get it to a place where it doesn't regress
regular insertion at all. I think that that will tick all of the many
boxes, without unwieldy complexity and without compromising conceptual
integrity.

I mention this now because obviously time is a factor. If you think
there's something I need to do, or that there's some way that I can
more usefully coordinate with you, please let me know. Likewise for
anyone else following.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Sun, Dec 22, 2013 at 6:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Dec 20, 2013 at 11:59 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I think that the way forward is to refine my design in order to
>> upgrade locks from exclusive buffer locks to something else, managed
>> by the lock manager but perhaps through an additional layer of
>> indirection. As previously outlined, I'm thinking of a new SLRU-based
>> granular value locking infrastructure built for this purpose, with
>> btree inserters marking pages as having an entry in this table.
>
> I'm working on a revision that holds lmgr page-level exclusive locks
> (and buffer pins) across multiple operations.  This isn't too
> different to what you've already seen, since they are still only held
> for an instant. Notably, hash indexes currently quickly grab and
> release lmgr page-level locks, though they're the only existing
> clients of that infrastructure. I think on reflection that
> fully-fledged value locking may be overkill, given the fact that these
> locks are only held for an instant, and only need to function as a
> choke point for unique index insertion, and only when upserting
> occurs.
>
> This approach seems promising. It didn't take me very long to get it
> to a place where it passed a few prior test-cases of mine, with fairly
> varied input, though the patch isn't likely to be posted for another
> few days. I think I can get it to a place where it doesn't regress
> regular insertion at all. I think that that will tick all of the many
> boxes, without unwieldy complexity and without compromising conceptual
> integrity.
>
> I mention this now because obviously time is a factor. If you think
> there's something I need to do, or that there's some way that I can
> more usefully coordinate with you, please let me know. Likewise for
> anyone else following.

I don't think this is a project to rush through.  We've lived without
MERGE/UPSERT for several years now, and we can live without it for
another release cycle while we try to reach agreement on the way
forward.  I can tell that you're convinced you know the right way
forward here, and you may be right, but I don't think you've convinced
everyone else - maybe not even anyone else.

I wouldn't suggest modeling anything you do on the way hash indexes
use heavyweight locks.  That is a performance disaster, not to
mention being basically a workaround for the fact that whoever wrote
the code originally didn't bother figuring out any way that splitting
a bucket could be accomplished in a crash-safe manner, even in
theory.  If it weren't for that, we'd be using buffer locks there.  That
doesn't necessarily mean that page-level heavyweight locks aren't the
right thing here, but the performance aspects of any such approach
will need to be examined carefully.

To be honest, I am still not altogether sold on any part of this
feature.  I don't like the fact that it violates MVCC - although I
admit that some form of violation is inevitable in any feature in this
area unless we're content to live with many serialization failures, I
don't like the particular way it violates MVCC, I don't like the
syntax (returns rejects? blech), and I don't like the fact that
getting the locking right, or even getting the semantics right, seems
to be so darned hard.  I think we're in real danger of building
something that will be too complex, or just too weird, for users to
use, and too complex to maintain as well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Dec 23, 2013 at 7:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think this is a project to rush through.  We've lived without
> MERGE/UPSERT for several years now, and we can live without it for
> another release cycle while we try to reach agreement on the way
> forward.  I can tell that you're convinced you know the right way
> forward here, and you may be right, but I don't think you've convinced
> everyone else - maybe not even anyone else.

That may be. Attention from reviewers has been in relatively short
supply. Not that that isn't always true.

> I wouldn't suggest modeling anything you do on the way hash indexes
> use heavyweight locks.  That is a performance disaster, not to
> mention being basically a workaround for the fact that whoever wrote
> the code originally didn't bother figuring out any way that splitting
> a bucket could be accomplished in a crash-safe manner, even in theory.
>  If it weren't for that, we'd be using buffer locks there.

Having looked at the code for the first time recently, I'd agree that
hash indexes are a disaster. A major advantage of the Lehman and Yao
algorithm, as prominently noted in the paper, is that exclusive locks
are only acquired on leaf pages to increase concurrency. Since I only
propose to extend this to a heavyweight page lock, and still only for
an instant, it seems reasonable to assume that the performance will be
acceptable for an initial version of this. It's not as if most places
will have to pay any heed to this heavyweight lock - index scans and
non-upserting inserts are generally unaffected. We can later optimize
performance as we measure a need to do so. Early indications are that
the performance is reasonable.

Holding value locks for more than an instant doesn't make sense. The
reason is simple: when upserting, we're tacitly only really looking
for violations on one particular unique index. We just lock them all
at once because the implementation doesn't know *which* unique index.
So in actuality, it's really no different from existing
potential-violation handling for unique indexes, except we have to do
some extra work in addition to the usual restart from scratch stuff
(iff we have multiple unique indexes).
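
To illustrate, with a hypothetical schema (not from either patch):

create table account
(
    id    int4 primary key, -- the index the upserter tacitly "means"
    email text unique       -- but a conflict here must be handled too
);

The statement has no way to say that only id conflicts are
anticipated, so the implementation must momentarily value-lock both
indexes before deciding between inserting and locking a row.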

> To be honest, I am still not altogether sold on any part of this
> feature.  I don't like the fact that it violates MVCC - although I
> admit that some form of violation is inevitable in any feature in this
> area unless we're content to live with many serialization failures, I
> don't like the particular way it violates MVCC

Discussions around visibility issues have not been very useful. As
I've said, I don't like the term "MVCC violation", because it's
suggestive of some classical, codified definition of MVCC, a
definition that doesn't actually exist anywhere, even in research
papers, AFAICT. So while I understand your concerns around the
modifications to HeapTupleSatisfiesMVCC(), and while I respect that we
need to be mindful of the performance impact, my position is that if
that really is what we need to do, we might as well be honest about
it, and express intent succinctly and directly. This is a position
that is orthogonal to the proposed syntax, even if that is convenient
to my patch. It's already been demonstrated that yes, the MVCC
violation can be problematic when we call HeapTupleSatisfiesUpdate(),
which is a bug that was fixed by making another modest modification to
HeapTupleSatisfiesUpdate(). It is notable that that bug would have
still occurred had a would-be-HTSMVCC-invisible tuple been passed
through any other means. What problem, specifically, do you envisage
avoiding by doing it some other way? What other way do you have in
mind?

We invested huge effort into more granular FK locking when we had a
few complaints about it. I wouldn't be surprised if that effort
modestly regressed HeapTupleSatisfiesMVCC(). On the other hand, this
feature has been in very strong demand for over a decade, and has a
far smaller code footprint. I don't want to denigrate the FK locking
stuff in any way - it is a fantastic piece of work - but it's
important to have a sense of proportion about these things. In order
to make visibility work in the way we require, we're almost always
just doing additional checking of infomask bits, and the t_infomask
variable is probably already in a CPU register (this is a huge
simplification, but is essentially true). Like you, I have noted
before that HeapTupleSatisfiesMVCC() is a fairly hot routine during
profiling, but it's not *that* hot.

It's understandable that you raise these points, but from my
perspective it's hard to address your concerns without more concrete
objections.

> I don't like the
> syntax (returning rejects? blech)

I suppose it isn't ideal in some ways. On the other hand, it is
extremely flexible, with many of the same advantages of SQL MERGE.
Importantly, it will facilitate merging as part of conflict resolution
on multi-master replication systems, which I think is of considerable
strategic importance even beyond having a simple upsert.

I would like to see us get this into 9.4, and get MERGE into 9.5.

> and I don't like the fact that
> getting the locking right, or even getting the semantics right, seems
> to be so darned hard.  I think we're in real danger of building
> something that will be too complex, or just too weird, for users to
> use, and too complex to maintain as well.

Please don't conflate confusion or uncertainty around alternative
approaches with confusion or uncertainty around mine - *far* more time
has been spent discussing the former. While I respect the instinct
that says we ought to be very conservative around changing anything
that the btree AM does, I really don't think my design is itself all
that complicated.

I've been very consistent even in the face of strong criticism. What I
have now is essentially the same design as back in early September.
After the initial ON DUPLICATE KEY IGNORE patch in August, I soon
realized that value locking and row locking could not be sensibly
considered in isolation, and over the objections of others pushed
ahead with integrating the two. I believe now as I believed then that
value locks need to be cheap to release (or it at least needs to be
possible), and that it was okay to drop all value locks when we need
to deal with a possible conflict/getting an xid shared lock - if those
unlocked pages have separate conflicts on our next attempt, the
feature is being badly misused (for upserting) or it doesn't matter
because we only need one conclusive "No" answer (for IGNOREing, but
also for upserting).

I have been trying to focus attention on these aspects throughout this
discussion. I'm not sure how successful I was here.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Mon, Dec 23, 2013 at 5:59 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Dec 23, 2013 at 7:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't think this is a project to rush through.  We've lived without
>> MERGE/UPSERT for several years now, and we can live without it for
>> another release cycle while we try to reach agreement on the way
>> forward.  I can tell that you're convinced you know the right way
>> forward here, and you may be right, but I don't think you've convinced
>> everyone else - maybe not even anyone else.
>
> That may be. Attention from reviewers has been in relatively short
> supply. Not that that isn't always true.

I think concrete concerns about usability have largely been
subordinated to abstruse discussions about locking protocols.  A
discussion strictly about what syntax people would consider
reasonable, perhaps on another thread, might elicit broader
participation (although this week might not be the right time to try
to attract an audience).

> Having looked at the code for the first time recently, I'd agree that
> hash indexes are a disaster. A major advantage of The Lehman and Yao
> Algorithm, as prominently noted in the paper, is that exclusive locks
> are only acquired on leaf pages to increase concurrency. Since I only
> propose to extend this to a heavyweight page lock, and still only for
> an instant, it seems reasonable to assume that the performance will be
> acceptable for an initial version of this. It's not as if most places
> will have to pay any heed to this heavyweight lock - index scans and
> non-upserting inserts are generally unaffected. We can later optimize
> performance as we measure a need to do so. Early indications are that
> the performance is reasonable.

OK.

>> To be honest, I am still not altogether sold on any part of this
>> feature.  I don't like the fact that it violates MVCC - although I
>> admit that some form of violation is inevitable in any feature in this
>> area unless we're content to live with many serialization failures, I
>> don't like the particular way it violates MVCC
>
> Discussions around visibility issues have not been very useful. As
> I've said, I don't like the term "MVCC violation", because it's
> suggestive of some classical, codified definition of MVCC, a
> definition that doesn't actually exist anywhere, even in research
> papers, AFAICT.

I don't know whether or not that's literally true, but like Potter
Stewart, I don't think there's any real ambiguity about the underlying
concept.  The concepts of read->write, write->read, and write->write
dependencies between transactions are well-described in textbooks such
as Jim Gray's Transaction Processing: Concepts and Techniques and this
paper on MVCC:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.552&rep=rep1&type=pdf

I think the definition of an MVCC violation is that a snapshot sees
the effects of a transaction which committed after that snapshot was
taken.  And maybe it's good and right that this patch is introducing a
new way for that to happen, or maybe it's not, but the point is that
we get to decide.
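
To spell that definition out with a sketch (the sessions and table are
invented for illustration; this shows the anomaly in general, not this
patch's specific behavior):

-- session 1
begin isolation level repeatable read;
select count(*) from t;  -- first statement takes the snapshot; say it sees 0 rows

-- session 2
insert into t values (1);
commit;

-- session 1, same transaction: if any statement can now observe or
-- lock session 2's row, the snapshot is seeing the effects of a
-- transaction that committed after the snapshot was taken - an MVCC
-- violation in the above sense.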

> I've been very consistent even in the face of strong criticism. What I
> have now is essentially the same design as back in early September.
> After the initial ON DUPLICATE KEY IGNORE patch in August, I soon
> realized that value locking and row locking could not be sensibly
> considered in isolation, and over the objections of others pushed
> ahead with integrating the two. I believe now as I believed then that
> value locks need to be cheap to release (or it at least needs to be
> possible), and that it was okay to drop all value locks when we need
> to deal with a possible conflict/getting an xid shared lock - if those
> unlocked pages have separate conflicts on our next attempt, the
> feature is being badly misused (for upserting) or it doesn't matter
> because we only need one conclusive "No" answer (for IGNOREing, but
> also for upserting).

I'm not saying that you haven't been consistent, or that you've done
anything wrong at all.  I'm just saying that the default outcome is
that we change nothing, and the fact that nobody's been able to
demonstrate that an approach is clearly superior to what you've proposed
does not mean we have to accept what you've proposed.  I am not
necessarily advocating for rejecting your proposed approach, although
I do have concerns about it, but I think it is clear that it is not
backed by any meaningful amount of consensus.  Maybe that will change
in the next two months, and maybe it won't.  If it doesn't, whether
through any fault of yours or not, I don't think this is going in.  If
this is all perfectly clear to you already, then I apologize for
belaboring the point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-12-23 14:59:31 -0800, Peter Geoghegan wrote:
> On Mon, Dec 23, 2013 at 7:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > I don't think this is a project to rush through.  We've lived without
> > MERGE/UPSERT for several years now, and we can live without it for
> > another release cycle while we try to reach agreement on the way
> > forward.

Agreed, but I really think it's one of the biggest weaknesses of
postgres at this point.

> >  I can tell that you're convinced you know the right way
> > forward here, and you may be right, but I don't think you've convinced
> > everyone else - maybe not even anyone else.

> That may be. Attention from reviewers has been in relatively short
> supply. Not that that isn't always true.

I don't really see the lack of review as being crucial at this point. At
least I have quite some doubts about the approach you've chosen and I
have voiced it - so have others. Whether yours is workable seems to
hinge entirely on whether you can build a scalable, maintainable
value-locking scheme. Besides some thoughts about using slru.c for it I
haven't seen much about the design of that part - might just have missed
it though. Personally I can't ad-lib a design for it, but I haven't
thought about it too much.
I don't think there's too much reviewers can do before you've provided a
POC implementation of real value locking.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Dec 24, 2013 at 4:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I don't really see the lack of review as being crucial at this point. At
> least I have quite some doubts about the approach you've chosen and I
> have voiced it - so have others.

Apparently you haven't been keeping up with this thread. The approach
that Heikki outlined with his POC patch was demonstrated to deadlock
in an unprincipled manner - it took me a relatively long time to
figure this out because I didn't try a simple enough test-case. There
is every reason to think that alternative promise tuple approaches
would behave similarly, short of some very invasive, radical changes
to how we wait on XID share locks that I really don't think are going
to fly. That's why I chose this approach: at no point did anyone have
a plausible alternative that didn't have similar problems, and I
personally saw no alternative. It wasn't really a choice at all.

In hindsight I should have known better than to think that people
would be willing to defer discussion of a more acceptable value
locking implementation to consider the interactions between the
different subsystems, which I felt were critical and warranted
up-front discussion, a belief which has now been borne out. Lesson
learned. It's a pity that that's the way things are, because that
discussion could have been really useful, and saved us all some time.

> I don't think there's too much reviewers can do before you've provided a
> POC implementation of real value locking.

I don't see what is functionally insufficient about the value locking
that you've already seen. I'm currently working towards extending the
buffer locking to use a heavyweight lock held only for an instant, but
potentially across multiple operations, although of course only when
upserting occurs so as to not regress regular insertion. If you're
still of the opinion that it is necessary to hold value locks of some
form on earlier unique indexes, as you wait maybe for hours on some
conflicting xid, then I still disagree with you for reasons recently
re-stated [1]. You never stated a reason why you thought it was
necessary. If you have one now, please share it. Note that I release
all value locks before row locking too, which is necessary because to
do any less will cause unprincipled deadlocking, as we've seen.

Other than that, I have no idea what your continued objection to my
design would be once the buffer level exclusive locks are replaced
with page level heavyweight locks across complex (though brief)
operations (I guess you might not like the visibility stuff or the
syntax, but that isn't what you're talking about here). More granular
value locking might help boost performance, but maybe not even by
much, since we're only locking a single leaf page per unique index
against insertion, and only for an instant. I see no reason to make
the coarser-than-necessary granularity of the value locking a blocker.
Predicate locks on btree leaf pages acquired by SSI are also coarser
than strictly necessary.

[1] http://www.postgresql.org/message-id/CAM3SWZSOdUmg4899tJc09R2uoRTYhb0VL9AasC1Fz7AW4GsR-g@mail.gmail.com

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
Hi,

On 2013-12-24 13:18:36 -0800, Peter Geoghegan wrote:
> On Tue, Dec 24, 2013 at 4:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I don't really see the lack of review as being crucial at this point. At
> > least I have quite some doubts about the approach you've chosen and I
> > have voiced it - so have others.
> 
> Apparently you haven't been keeping up with this thread. The approach
> that Heikki outlined with his POC patch was demonstrated to deadlock
> in an unprincipled manner - it took me a relatively long time to
> figure this out because I didn't try a simple enough test-case.

So? I still have the fear that your approach will end up being way too
complicated and full of layering violations. I didn't say it's a no-go
(not that I have veto powers, even if I'd consider it one).

And yes, I still think that promise tuples might be a better solution
regardless of the issues you mentioned, but you know what? That doesn't
matter. Me thinking it's the better approach is primarily based on gut
feeling, and I clearly haven't voiced clear enough reasons to convince
you. So you going with your own, possibly more substantiated, gut
feeling is perfectly alright. Unless I go ahead and write a POC of my
own at least ;)

> In hindsight I should have known better than to think that people
> would be willing to defer discussion of a more acceptable value
> locking implementation to consider the interactions between the
> different subsystems, which I felt were critical and warranted
> up-front discussion, a belief which has now been borne out.
> Lesson learned. It's a pity that that's the way things are, because that
> discussion could have been really useful, and saved us all some time.

Whoa? What? Not convincing everyone is far from it being a useless
discussion. Such an attitude sure is not the way to go to elicit more
feedback.
And it clearly gave you the feedback that most people regard holding
buffer locks across other nontrivial operations, in a potentially
unbounded number, as a fundamental problem.

> > I don't think there's too much reviewers can do before you've provided a
> > POC implementation of real value locking.
> 
> I don't see what is functionally insufficient about the value locking
> that you've already seen.

I still think it's fundamentally unacceptable to hold buffer locks
across any additional complex operations. So yes, I think the current
state is fundamentally insufficient.
Note that the case of the existing uniqueness checking already is bad,
but it at least will never run any user defined code in that context,
just HeapTupleSatisfies* and HOT code. So I don't think arguments of the
"we're already doing it in uniqueness checking" ilk have much merit.

> If you're  still of the opinion that it is necessary to hold value locks of some
> form on earlier unique indexes, as you wait maybe for hours on some
> conflicting xid, then I still disagree with you for reasons recently
> re-stated [1].

I guess you're referring to:

On 2013-12-23 14:59:31 -0800, Peter Geoghegan wrote:
> Holding value locks for more than an instant doesn't make sense. The
> reason is simple: when upserting, we're tacitly only really looking
> for violations on one particular unique index. We just lock them all
> at once because the implementation doesn't know *which* unique index.
> So in actuality, it's really no different from existing
> potential-violation handling for unique indexes, except we have to do
> some extra work in addition to the usual restart from scratch stuff
> (iff we have multiple unique indexes).

I think the point here really is that you assume that we're always
only looking for conflicts with one unique index. If that's all we want
to support - sure, only the keys in that index need to be locked.
I don't think that's necessarily a given, especially when you just want
to look at the conflict in detail, without using a subtransaction.

> You never stated a reason why you thought it was
> necessary. If you have one now, please share it. Note that I release
> all value locks before row locking too, which is necessary because to
> do any less will cause unprincipled deadlocking, as we've seen.

I can't sensibly comment upon that right now, I'd need to read more code
to understand what you're doing there.

> Other than that, I have no idea what your continued objection to my
> design would be once the buffer level exclusive locks are replaced
> with page level heavyweight locks across complex (though brief)
> operations

Well, you haven't delivered that part yet, that's pretty much my point,
no?
I don't think you can easily do this by just additionally taking a new
kind of heavyweight lock in the new codepaths - that will still allow
deadlocks with the old codepaths taking only lwlocks. So this is a
nontrivial sub-project which very well might influence whether the
approach is deemed acceptable or not.

> (I guess you might not like the visibility stuff or the
> syntax, but that isn't what you're talking about here).

I don't particularly care about that for now. I think we can find common
ground, even if it will take some further emails. It probably isn't
what's in there right now, but I don't think you've intended it as such.

I don't think the visibility modifications are a good thing (or correct)
as is, but I don't think they are necessary for your approach to make
sense.

> I've been very consistent even in the face of strong criticism. What I
> have now is essentially the same design as back in early September.

Uh. And why's that necessarily a good thing?


Minor details I noticed in passing:
* Your tqual.c bit isn't correct, you're forgetting multixacts.
* You several times mention "core" in comments as if this wouldn't be part of it; that seems confusing.
* Doesn't ExecInsert() essentially busy-loop if there's a concurrent non-committed insert?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Wed, Dec 25, 2013 at 6:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> So? I still have the fear that your approach will end up being way too
> complicated and full of layering violations. I didn't say it's a no-go
> (not that I have veto powers, even if I'd consider it one).

Apart from not resulting in unprincipled deadlocking, it respects the
AM abstraction more than all other approaches outlined. Inserting
tuples as value locks just isn't a great approach, even if you ignore
the fact you must come up with a whole new way to release your "value
locks" without ending your xact.

> And yes, I still think that promise tuples might be a better solution
> regardless of the issues you mentioned, but you know what? That doesn't
> matter. Me thinking it's the better approach is primarily based on gut
> feeling, and I clearly haven't voiced clear enough reasons to convince
> you. So you going with your own, possibly more substantiated, gut
> feeling is perfectly alright. Unless I go ahead and write a POC of my
> own at least ;)

My position is not based on a gut feeling. It is based on carefully
considering the interactions of the constituent parts, plus the
experience of actually building a working prototype.

> Whoa? What? Not convincing everyone is far from it being a useless
> discussion. Such an attitude sure is not the way to go to elicit more
> feedback.
> And it clearly gave you the feedback that most people regard holding
> buffer locks across other nontrivial operations, in a potentially
> unbounded number, as a fundamental problem.

Uh, I knew that it was a problem all along. While I explored ways of
ameliorating the problem, I specifically stated that we should discuss
the subsystems interactions/design, which you were far too quick to
dismiss. The overall design is far more pertinent than one specific
mechanism. While I certainly welcome your participation, if you want
to be an effective reviewer I suggest examining your own attitude.
Everyone wants this feature.

>> I don't see what is functionally insufficient about the value locking
>> that you've already seen.
>
> I still think it's fundamentally unacceptable to hold buffer locks
> across any additional complex operations. So yes, I think the current
> state is fundamentally insufficient.

I said *functionally* insufficient. Buffer locks demonstrably do a
perfectly fine job of value locking. Of course the current state is
insufficient, but I'm talking about design here.

>> Holding value locks for more than an instant doesn't make sense. The
>> reason is simple: when upserting, we're tacitly only really looking
>> for violations on one particular unique index. We just lock them all
>> at once because the implementation doesn't know *which* unique index.
>> So in actuality, it's really no different from existing
>> potential-violation handling for unique indexes, except we have to do
>> some extra work in addition to the usual restart from scratch stuff
>> (iff we have multiple unique indexes).
>
> I think the point here really is that you assume that we're always
> only looking for conflicts with one unique index. If that's all we want
> to support - sure, only the keys in that index need to be locked.
> I don't think that's necessarily a given, especially when you just want
> to look at the conflict in detail, without using a subtransaction.

Why would I not assume that? It's perfectly obvious from the syntax
that you can't do much if you don't know ahead of time where the
conflict might be. It's just like the MySQL feature - the user had
better know where it might be. Now, at least with my syntax as a user
you have some capacity to recover if you consider ahead of time that
you might get it wrong. But clearly it is the rejected rows, not the
conflicting rows, that are projected, and multiple conflicts per row
are not accounted for.
We lock on the first conflict, which with idiomatic usage will be the
only possible conflict.

That isn't the only reason why value locks don't need to be held for
more than an instant. It's just the most obvious one.

Incidentally, there are many implementation reasons why "true value
locking", where value locks are held indefinitely is extremely
difficult. When I referred to an SLRU, I was just exploring the idea
of making value locks (still only held for an instant) more granular.
On closer examination it looks to me like premature optimization,
though.

>> You never stated a reason why you thought it was
>> necessary. If you have one now, please share it. Note that I release
>> all value locks before row locking too, which is necessary because to
>> do any less will cause unprincipled deadlocking, as we've seen.
>
> I can't sensibly comment upon that right now, I'd need to read more code
> to understand what you're doing there.

You could have looked at it back in September, if only you'd given
these interactions the up-front consideration that they warranted.
Nothing has changed there at all.

> Well, you haven't delivered that part yet, that's pretty much my point,
> no?
> I don't think you can easily do this by just additionally taking a new
> kind of heavyweight lock in the new codepaths - that will still allow
> deadlocks with the old codepaths taking only lwlocks. So this is a
> nontrivial sub-project which very well might influence whether the
> approach is deemed acceptable or not.

I have already written the code, and am in the process of cleaning it
up and gaining confidence that I haven't missed something. It's not
trivial, and there are some subtleties, but I think that your level of
skepticism around the difficulty of doing this is excessive.
Naturally, non-speculative insertion does have to care about the
heavyweight locks sometimes, but only when a page-level flag is found
to be set.

>> (I guess you might not like the visibility stuff or the
>> syntax, but that isn't what you're talking about here).
>
> I don't particularly care about that for now. I think we can find common
> ground, even if it will take some further emails. It probably isn't
> what's in there right now, but I don't think you've intended it as such.

I certainly hope we can find common ground. I want to work with you.

>> I've been very consistent even in the face of strong criticism. What I
>> have now is essentially the same design as back in early September.
>
> Uh. And why's that necessarily a good thing?

It isn't necessarily, but you've taken my comments out of context. I
was addressing Robert, and his general point about there being
confusion around the semantics and locking protocol aspects. My point
was: if that general impression was created, it is almost entirely
because of discussion of other approaches. The fact that I've been
consistent on design aspects clearly indicates that no one has said
anything to make me reconsider my position. If that's just because
there hasn't been enough scrutiny of my design, then I can hardly be
blamed; I've been begging for that kind of scrutiny.

I have been the one casting doubt on other designs, and quite
successfully I might add. The fact that there was confusion about
those other approaches should not prejudice anyone against my
approach. That doesn't mean I'm right, of course, but as long as no
one is examining those aspects, and as long as no one appreciates what
considerations informed the design I came up with, we won't make
progress. Can we focus on the design, and how things fit together,
please?

> Minor details I noticed in passing:
> * Your tqual.c bit isn't correct, you're forgetting multixacts.

I knew that was broken, but I don't grok the tuple locking code.
Perhaps you can suggest a correct formulation.

> * You several times mention "core" in comments as if this wouldn't be
>   part of it; that seems confusing.

Well, the executor is naive of the underlying AM, even if it is btree.
What terminology do you suggest that captures that?

> * Doesn't ExecInsert() essentially busy-loop if there's a concurrent
>   non-committed insert?

No, it does not. It frees earlier value locks, and waits on the other
xact in the usual manner, and then restarts from scratch. There is an
XactLockTableWait() call in _bt_lockinsert(), just like in
_bt_doinsert(). It might be the case that waiters have to spin a few
times if there are lots of conflicts on the same row, but that's
similar to the current state of affairs in _bt_doinsert(). If there
were transactions that kept aborting you'd see the same thing today,
as when doing upserting with subtransactions with lots of conflicts.
It is true that there is nothing to arbitrate the ordering (i.e. there
is no LockTupleTuplock()/LockTuple() call in the btree code), but I
think that doesn't matter because that arbitration occurs when
something close to conventional row locking occurs in
nodeModifyTable.c (or else we just IGNORE).
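
(For reference, the "upserting with subtransactions" pattern mentioned
above is the familiar retry loop, essentially the example from the
documentation - db(a int primary key, b text) being its invented table:)

create function merge_db(k int, data text) returns void as
$$
begin
    loop
        -- first try to update an existing row
        update db set b = data where a = k;
        if found then
            return;
        end if;
        -- not there, so try to insert; a concurrent insert of the same
        -- key raises unique_violation, and we loop to retry the update
        begin
            insert into db(a, b) values (k, data);
            return;
        exception when unique_violation then
            -- do nothing; loop again
        end;
    end loop;
end;
$$ language plpgsql;

The BEGIN ... EXCEPTION block starts a subtransaction on every
iteration, which is what makes heavy conflict rates expensive here.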

If you think that there could be unpleasant interactions because of
the lack of LockTuple() arbitration within _bt_lockinsert() or
something like that, please provide a test-case.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Fri, Dec 20, 2013 at 12:39 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Hmm. If I understand the problem correctly, it's that as soon as another
> backend sees the tuple you've inserted and calls XactLockTableWait(), it
> will not stop waiting even if we later decide to kill the already-inserted
> tuple.
>
> One approach to fix that would be to release and immediately re-acquire the
> transaction-lock, when you kill an already-inserted tuple. Then teach the
> callers of XactLockTableWait() to re-check if the tuple is still alive. I'm
> just waving hands here, but the general idea is to somehow wake up others
> when you kill the tuple.

While mulling this over further, I had an idea about this: suppose we
marked the tuple in some fashion that indicates that it's a promise
tuple.  I imagine an infomask bit, although the concept makes me wince
a bit since we don't exactly have bit space coming out of our ears
there.  Leaving that aside for the moment, whenever somebody looks at
the tuple with a mind to calling XactLockTableWait(), they can see
that it's a promise tuple and decide to wait on some other heavyweight
lock instead.  The simplest thing might be for us to acquire a
heavyweight lock on the promise tuple before making index entries for
it, and then have callers wait on that instead always instead of
transitioning from the tuple lock to the xact lock.

Then we don't need to do anything screwy like releasing our
transaction lock; if we decide to kill the promise tuple, we have a
lock to release that pertains specifically to that tuple.

This might be a dumb idea; I'm just thinking out loud.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Dec 26, 2013 at 5:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> While mulling this over further, I had an idea about this: suppose we
> marked the tuple in some fashion that indicates that it's a promise
> tuple.  I imagine an infomask bit, although the concept makes me wince
> a bit since we don't exactly have bit space coming out of our ears
> there.  Leaving that aside for the moment, whenever somebody looks at
> the tuple with a mind to calling XactLockTableWait(), they can see
> that it's a promise tuple and decide to wait on some other heavyweight
> lock instead.  The simplest thing might be for us to acquire a
> heavyweight lock on the promise tuple before making index entries for
> it, and then have callers wait on that instead always instead of
> transitioning from the tuple lock to the xact lock.

I think the interlocking with buffer locks and heavyweight locks to
make that work could be complex. I'm working on a scheme where we
always acquire a page heavyweight lock ahead of acquiring an
equivalent buffer lock, and without any other buffer locks held (for
the critical choke point buffer, to implement value locking). With my
scheme, you may have to retry, but only in the event of page splits
and only at the choke point. In any case, what you describe here
strikes me as an expansion on the already less than ideal modularity
violation within the btree AM (i.e. the way it buffer locks the heap
with its own index buffers concurrently for uniqueness checking). It
might be that the best argument for explicit value locks (implemented
as page heavyweight locks or whatever) is that they are completely
distinct to row locks, and are an abstraction managed entirely by the
AM itself, quite similar to the historic, limited value locking that
unique index enforcement has always used.

If we take Heikki's POC patch as representative of promise tuple
schemes in general, this scheme might not be good enough. Index tuple
insertions don't wait on each other there, and immediately report
conflict. We need pre-checking to get an actual conflict TID in that
patch, with no help from btree available.

I'm generally opposed to making value locks of any stripe be held for
more than an instant (so we should not hold them indefinitely pending
another conflicting xact finishing). It's not just that it's
convenient to my implementation; I also happen to think that it makes
no sense. Should you really lock a value in an earlier unique index
for hours, pending conflicter xact finishing, because you just might
happen to want to insert said value, but probably not?

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-12-26 21:11:27 -0800, Peter Geoghegan wrote:
> I'm generally opposed to making value locks of any stripe be held for
> more than an instant (so we should not hold them indefinitely pending
> another conflicting xact finishing). It's not just that it's
> convenient to my implementation; I also happen to think that it makes
> no sense. Should you really lock a value in an earlier unique index
> for hours, pending conflicter xact finishing, because you just might
> happen to want to insert said value, but probably not?

There are some advantages: For one, it allows you to guarantee forward
progress if you do it right, which surely isn't a bad property to
have. For another, it's much more in line with the way normal uniqueness
checks work.
Possibly the disadvantages outweigh the advantages, but that's a far cry
from making no sense.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
Hi,

On 2013-12-25 15:27:36 -0800, Peter Geoghegan wrote:
> Uh, I knew that it was a problem all along. While I explored ways of
> ameliorating the problem, I specifically stated that we should discuss
> the subsystems interactions/design, which you were far too quick to
> dismiss.

Aha?

> The overall design is far more pertinent than one specific
> mechanism. While I certainly welcome your participation, if you want
> to be an effective reviewer I suggest examining your own attitude.
> Everyone wants this feature.

You know what. I don't particularly feel the need to be a reviewer of
this patch. I comment because there didn't seem enough comments on some
parts and because I see some things as problematic. If you don't want
those comments, ok. No problem.

> >> Holding value locks for more than an instant doesn't make sense. The
> >> reason is simple: when upserting, we're tacitly only really looking
> >> for violations on one particular unique index. We just lock them all
> >> at once because the implementation doesn't know *which* unique index.
> >> So in actuality, it's really no different from existing
> >> potential-violation handling for unique indexes, except we have to do
> >> some extra work in addition to the usual restart from scratch stuff
> >> (iff we have multiple unique indexes).
> >
> > I think the point here really is that you assume that we're always
> > only looking for conflicts with one unique index. If that's all we want
> > to support - sure, only the keys in that index need to be locked.
> > I don't think that's necessarily a given, especially when you just want
> > to look at the conflict in detail, without using a subtransaction.
> 
> Why would I not assume that? It's perfectly obvious from the syntax
> that you can't do much if you don't know ahead of time where the
> conflict might be.

Because it's a damn useful feature to have. As I said above:
> if that's all we want to support - sure, only the keys in that index
> need to be locked.

I don't think the current syntax the feature implements can be used as
the sole argument for what the feature should be able to support.

If you think from the angle of an async MM replication solution
replicating a table with multiple unique keys, not having to specify a
single index we expect conflicts from, is surely helpful.

> >> You never stated a reason why you thought it was
> >> necessary. If you have one now, please share it. Note that I release
> >> all value locks before row locking too, which is necessary because to
> >> do any less will cause unprincipled deadlocking, as we've seen.
> >
> > I can't sensibly comment upon that right now, I'd need to read more code
> > to understand what you're doing there.
> 
> You could have looked at it back in September, if only you'd given
> these interactions the up-front consideration that they warranted.
> Nothing has changed there at all.

Holy fuck. Peter. Believe it or not, I don't remember all code, comments
& design that I've read at some point. And that sometimes means that I
need to re-read code to judge some things. That I don't have time to
fully do so on the 24th doesn't strike me as particularly surprising.

> > Well, you haven't delivered that part yet, that's pretty much my point,
> > no?
> > I don't think you can easily do this by just additionally taking a new
> > kind of heavyweight lock in the new codepaths - that will still allow
> > deadlocks with the old codepaths taking only lwlocks. So this is a
> > nontrivial sub-project which very well might influence whether the
> > approach is deemed acceptable or not.
> 
> I have already written the code, and am in the process of cleaning it
> up and gaining confidence that I haven't missed something. It's not
> trivial, and there are some subtleties, but I think that your level of
> skepticism around the difficulty of doing this is excessive.
> Naturally, non-speculative insertion does have to care about the
> heavyweight locks sometimes, but only when a page-level flag is found
> to be set.

Cool then.

> >> I've been very consistent even in the face of strong criticism. What I
> >> have now is essentially the same design as back in early September.
> >
> > Uh. And why's that necessarily a good thing?
> 
> It isn't necessarily, but you've taken my comments out of context.

It's demonstrative of the reaction to a good part of the doubts
expressed.

> Can we focus on the design, and how things fit together,
> please?

I don't understand you here. You want people to discuss the high level
design but then criticize us for discussing the high level design when
it involves *possibly* doing things differently. Evaluating approaches
*is* focusing on the design.
And saying that a basic constituent part doesn't work - like using the
buffer locking for value locking, which you loudly doubted for some time
- *is* design criticism. The pointed-out weakness very well might be
non-existent because of a misunderstanding, or relatively easily
fixable.

> > Minor details I noticed in passing:
> > * Your tqual.c bit isn't correct, you're forgetting multixacts.
> 
> I knew that was broken, but I don't grok the tuple locking code.
> Perhaps you can suggest a correct formulation.

I don't think there's nice high-level infrastructure to do what you need
here yet. You probably need a variant of MultiXactIdIsRunning() like
MultiXactIdAreMember() that checks whether any of our xids is
participating, which you then check when xmax is a multi.

Unfortunately I am afraid that it won't be ok to check
HEAP_XMAX_IS_LOCKED_ONLY xmaxes only - it might have been a no-key
update + some concurrent key-share lock where the updater aborted. Now,
I think you only acquire FOR UPDATE locks so far, but using
subtransactions you still can get into such a scenario, even involving
FOR UPDATE locks.
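
(To illustrate the scenario - a two-session sketch against an invented
table t(k int primary key, v text), showing how xmax can end up a
multixact that is not locked-only even though the updater aborted:)

-- session 1
begin;
select * from t where k = 1 for key share;

-- session 2
begin;
update t set v = 'x' where k = 1;  -- no-key update; since session 1
                                   -- holds a key-share lock, xmax
                                   -- becomes a multixact with both
                                   -- members
rollback;                          -- updater aborted, but xmax is
                                   -- still that multixact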

> > * You several times mention "core" in comments as if this wouldn't be
> >   part of it; that seems confusing.
> 
> Well, the executor is naive of the underlying AM, even if it is btree.
> What terminology do you suggest that captures that?

I don't have a particularly nice suggestion. "generic" maybe.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Dec 27, 2013 at 12:57 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> You know what. I don't particularly feel the need to be a reviewer of
> this patch. I comment because there didn't seem enough comments on some
> parts and because I see some things as problematic. If you don't want
> those comments, ok. No problem.

I was attempting to make a point about the controversy this generated
in September. Which is: we were talking past each other. It was an
unfortunate, unproductive use of our time - there was some useful
discussion, but in general far more heat than light was generated. I
don't want to play the blame game. I want to avoid that situation in
the future, since it's obvious to me that it was totally avoidable.
Okay?

> I don't think the current syntax the feature implements can be used as
> the sole argument for what the feature should be able to support.
>
> If you think from the angle of an async MM replication solution
> replicating a table with multiple unique keys, not having to specify a
> single index we expect conflicts from, is surely helpful.

Well, you're not totally on your own for something like that with this
feature. You can project the conflicter's tid, and possibly do a more
sophisticated recovery, like inspecting the locked row and iterating.
That's probably not at all ideal, but then I can't even imagine what
the best interface for what you describe here looks like. If there are
multiple conflicts, do you delete or update some or all of them? How
do you express that concept from a DML statement? Maybe you could
project the conflict rows (with perhaps more than 1 for each row
proposed for insertion) and not the rejected, but it's hard to imagine
making that intuitive or succinct (conflicting rows could be projected
twice or more for separate conflicts, etc). Maybe what I have here is
in practical terms actually a pretty decent approximation of what you
want.
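
(A sketch of what that might look like - the table is invented, and it
assumes no more than what's described above, i.e. that rejects are
projected while the conflicting rows are locked:)

with r as (
    insert into replica_tab (id, payload)
    values (7, 'remote version')
    on duplicate key lock for update
    returning rejects *
)
select t.id, t.payload as local_payload, r.payload as remote_payload
from replica_tab t join r on t.id = r.id;
-- inspect the locked local row alongside the incoming one, then
-- resolve the conflict in a later step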

It seems undesirable to give other use-cases baggage around locking
values for an indefinite period, just to make this work for MM
replication, especially since it isn't clear that it actually could be
used effectively by a MM replication solution given the syntax, or any
conceivable alternative syntax or variant.

Could SQL MERGE do this for you? Offhand I don't think that it could.
In fact, I think it would be much less useful than what I've proposed
for this use-case. Its "WHEN NOT MATCHED THEN" clause doesn't let you
introspect details of what matched and did not match. Furthermore,
though I haven't verified this, off-hand I suspect other systems are
fussy about what you want to merge on. All examples of MERGE use I've
found after a quick Google search show merging on simple equi-join
criteria.
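
(The typical shape, in systems that do have MERGE - illustrative only,
with invented table names, since PostgreSQL has no MERGE at this point:)

merge into accounts t
using (values (1, 100)) as s (acct_id, amount)
on t.acct_id = s.acct_id            -- the simple equi-join criteria
when matched then
    update set balance = t.balance + s.amount
when not matched then
    insert (acct_id, balance) values (s.acct_id, s.amount);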

>> Can we focus on the design, and how things fit together,
>> please?
>
> I don't understand you here. You want people to discuss the high level
> design but then criticize us for discussing the high level design when
> it involves *possibly* doing things differently. Evaluating approaches
> *is* focusing on the design.

I spent several weeks earnestly thrashing out details of Heikki's
design. I am open to any alternative design that meets the criteria I
outlined to Heikki, with which Heikki was in full agreement. One of
those criteria was that unprincipled deadlocking that would never
occur with equivalent update statements should not occur.
Unfortunately, Heikki's POC patch did not meet that standard. I have
limited enthusiasm for making it or a similar scheme meet that
standard by further burdening the btree AM with additional knowledge
of the heap or row locking. Since in the past you've expressed general
concerns about the modularity violation within the btree AM today, I
assume that you aren't too enthusiastic about that kind of expansion
either.

> Unfortunately I am afraid that it won't be ok to check
> HEAP_XMAX_IS_LOCKED_ONLY xmaxes only - it might have been a no-key
> update + some concurrent key-share lock where the updater aborted. Now,
> I think you only acquire FOR UPDATE locks so far

That's right. Just FOR UPDATE locks.

> but using
> subtransactions you still can get into such a scenario, even involving
> FOR UPDATE locks.

Sigh.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
Attached revision only uses heavyweight page locks across complex
operations. I haven't benchmarked it, but it appears to perform
reasonably well. I haven't attempted to measure a regression for
regular insertion, but offhand it seems likely that any regression
would be well within the noise - more or less immeasurably small. I
won't repeat too much of what is already well commented in the patch.
For those that would like a relatively quick summary of what I've
done, I include inline a new section that I've added to the nbtree
README:

Notes about speculative insertion
---------------------------------

As an amcanunique AM, the btree implementation is required to support
"speculative insertion".  This means that the value locking method
through which unique index enforcement conventionally occurs is
extended and generalized, such that insertion is staggered:  the core
code attempts to get full consensus on whether values proposed for
insertion will not cause duplicate key violations.  Speculative
insertion is only possible for unique index insertion without deferred
uniqueness checking (since speculative insertion into a deferred
unique constraint's index is a contradiction in terms).

For conventional unique index insertion, the Btree implementation
exclusive locks a buffer holding the first page that the value to be
inserted could possibly be on, though only for an instant, during and
shortly after uniqueness verification.  It would not be acceptable to
hold this lock across complex operations for the duration of the
remainder of the first phase of speculative insertion.  Therefore, we
convert this exclusive buffer lock to an exclusive page lock managed
by the lock manager, thereby greatly ameliorating the consequences of
undiscovered deadlocking implementation bugs (though deadlocks are not
expected), and minimizing the impact on system interruptibility, while
not affecting index scans.

It may be useful to informally think of the page lock type acquired by
speculative insertion as similar to an intention exclusive lock, a
type of lock found in some legacy 2PL database systems that use
multi-granularity locking.  A session establishes the exclusive right
to subsequently establish a full write lock, without actually blocking
reads of the page unless and until a lock conversion actually occurs,
at which point both reads and writes are blocked.  Under this mental
model, buffer shared locks can be thought of as intention shared
locks.

As implemented, these heavyweight locks are only relevant to the
insertion case; at no other point are they actually considered, since
insertion is the only way through which new values are introduced.
The first page a value proposed for insertion into an index could be
on represents a natural choke point for our extended, though still
rather limited system of value locking.  Naturally, when we perform a
"lock escalation" and acquire an exclusive buffer lock, all other
buffer locks on the same buffer are blocked, which is how the
implementation localizes knowledge about the heavyweight lock to
insertion-related routines.  Apart from deletion, which is
concomitantly prevented by holding a pin on the buffer throughout, all
exclusive locking of Btree buffers happen as a direct or indirect
result of insertion, so this approach is sufficient. (Actually, an
exclusive lock may still be acquired without insertion to initialize a
root page, but that hardly matters.)

Note that all value locks (including buffer pins) are dropped
immediately as speculative insertion is aborted, as the implementation
waits on the outcome of another xact, or as "insertion proper" occurs.
These page-level locks are not intended to last more than an instant.
In general, the protocol for heavyweight locking Btree pages is that
heavyweight locks are acquired before any buffer locks are held, while
the locks are only released after all buffer locks are released.
While not a hard and fast rule, presently we avoid heavyweight page
locking more than one page per unique index concurrently.


Happy new year
--
Peter Geoghegan


Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 12/26/2013 01:27 AM, Peter Geoghegan wrote:
> On Wed, Dec 25, 2013 at 6:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> And yes, I still think that promise tuples might be a better solution
>> regardless of the issues you mentioned, but you know what? That doesn't
>> matter. Me thinking it's the better approach is primarily based on gut
>> feeling, and I clearly haven't voiced clear enough reasons to convince
>> you. So you going with your own, possibly more substantiated, gut
>> feeling is perfectly alright. Unless I go ahead and write a POC of my
>> own at least ;)
>
> My position is not based on a gut feeling. It is based on carefully
> considering the interactions of the constituent parts, plus the
> experience of actually building a working prototype.

I also carefully considered all that stuff, and reached a different 
conclusion. Plus I also actually built a working prototype (for some 
approximation of "working" - it's still a prototype).

>> Whoa? What? Not convincing everyone is far from it being a useless
>> discussion. Such an attitude sure is not the way to go to elicit more
>> feedback.
>> And it clearly gave you the feedback that most people regard holding
>> buffer locks across other nontrivial operations, in a potentially
>> unbounded number, as a fundamental problem.
>
> Uh, I knew that it was a problem all along. While I explored ways of
> ameliorating the problem, I specifically stated that we should discuss
> the subsystems interactions/design, which you were far too quick to
> dismiss. The overall design is far more pertinent than one specific
> mechanism. While I certainly welcome your participation, if you want
> to be an effective reviewer I suggest examining your own attitude.
> Everyone wants this feature.

Frankly I'm pissed off that you dismissed from the start the approach 
that seems much better to me. I gave you a couple of pointers very early 
on: look at the way we do exclusion constraints, and try to do something 
like promise tuples or killing an already-inserted tuple. You dismissed 
that, so I had to write that prototype myself. Even after that, you have 
spent zero effort to resolve the remaining issues with that approach, 
proclaiming that it's somehow fundamentally flawed and that locking 
index pages is obviously better. It's not. Sure, it still needs work, 
but the remaining issue isn't that difficult to resolve. Surely not any 
more complicated than what you did with heavy-weight locks on b-tree 
pages in your latest patch.

Now, enough with the venting. Back to drawing board, to figure out how 
best to fix the deadlock issue with the 
insert_on_dup-kill-on-conflict-2.patch. Please help me with that.

PS. In btreelock_insert_on_dup_v5.2013_12_28.patch, the language used in 
the additional text in README is quite difficult to read. Too many 
difficult sentences and constructs for a non-native English speaker like 
me. I had to look up "concomitantly" in a dictionary and I'm still not 
sure I understand that sentence :-).

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 12/27/2013 07:11 AM, Peter Geoghegan wrote:
> On Thu, Dec 26, 2013 at 5:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> While mulling this over further, I had an idea about this: suppose we
>> marked the tuple in some fashion that indicates that it's a promise
>> tuple.  I imagine an infomask bit, although the concept makes me wince
>> a bit since we don't exactly have bit space coming out of our ears
>> there.  Leaving that aside for the moment, whenever somebody looks at
>> the tuple with a mind to calling XactLockTableWait(), they can see
>> that it's a promise tuple and decide to wait on some other heavyweight
>> lock instead.  The simplest thing might be for us to acquire a
>> heavyweight lock on the promise tuple before making index entries for
>> it, and then have callers wait on that instead always instead of
>> transitioning from the tuple lock to the xact lock.

Yeah, that seems like it should work. You might not even need an 
infomask bit for that; just take the "other heavyweight lock" always 
before calling XactLockTableWait(), whether it's a promise tuple or not. 
If it's not, acquiring the extra lock is a waste of time but if you're 
going to sleep anyway, the overhead of one extra lock acquisition hardly 
matters.

> I think the interlocking with buffer locks and heavyweight locks to
> make that work could be complex.

Hmm. Can you elaborate?

The inserter has to acquire the heavyweight lock before releasing the 
buffer lock, because otherwise another inserter (or deleter or updater) 
might see the tuple, acquire the heavyweight lock, and fall to sleep on 
XactLockTableWait(), before the inserter has grabbed the heavyweight 
lock. If that race condition happens, you have the original problem 
again, ie. the updater unnecessarily waits for the inserting transaction 
to finish, even though it already killed the tuple it inserted.

That seems easy to avoid. If the heavyweight lock uses the transaction 
id as the key, just like XactLockTableInsert/XactLockTableWait, you can 
acquire it before doing the insertion.

Peter, can you give that a try, please?

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sun, Dec 29, 2013 at 8:18 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> My position is not based on a gut feeling. It is based on carefully
>> considering the interactions of the constituent parts, plus the
>> experience of actually building a working prototype.
>
>
> I also carefully considered all that stuff, and reached a different
> conclusion. Plus I also actually built a working prototype (for some
> approximation of "working" - it's still a prototype).

Well, clearly you're in agreement with me about unprincipled
deadlocking. That's what I was referring to here.

> Frankly I'm pissed off that you dismissed from the start the approach that
> seems much better to me. I gave you a couple of pointers very early on: look
> at the way we do exclusion constraints, and try to do something like promise
> tuples or killing an already-inserted tuple. You dismissed that, so I had to
> write that prototype myself.

Sorry, but that isn't consistent with my recollection at all. The
first e-mail you sent to any of the threads on this was on 2013-11-18.
Your first cut at a prototype was on 2013-11-19, the very next day. If
you think that I ought to have been able to know what you had in mind
based on conversations at pgConf.EU, you're giving me way too much
credit. The only thing vaguely along those lines that I recall you
mentioning to me in Dublin was that you thought I should make this
work with exclusion constraints - I was mostly explaining what I'd
done, and why. I was pleased that you listened courteously, but I
didn't have a clue what you had in mind, not least because to the best
of my recollection you didn't say anything about killing tuples. I'm
not going to swear that you didn't say something like that, since a
lot of things were said in a relatively short period, but it's
certainly true that I was quite curious about how you might go about
incorporating exclusion constraints into this for a while before you
began visibly participating on list.

Now, when you actually posted the prototype, I realized that it was
the same basic design that I'd cited in my very first e-mail on the
IGNORE patch (the one that didn't have row locking at all) - nobody
else wanted to do heap insertion first for promise tuples. I read that
2007 thread [1] a long time ago, but that recognition only came when
you posted your first prototype, or perhaps shortly before when you
started participating on list.

I am unconvinced that making this work for exclusion constraints is
compelling, except for IGNORE, which is sensible but something I would
not weigh heavily at all. In any case, your implementation currently
doesn't lock more than one row per tuple proposed for insertion, even
though with exclusion constraints a huge number of rows might need to
be locked (propose to insert a row with a range covering a decade, and
many rows may conflict), where with unique indexes you only ever lock
either 0 or 1 rows per slot. I could fairly easily extend my patch to
have it work for exclusion constraints with IGNORE only.

You didn't try and convince me that what you have proposed is better
than what I have. You immediately described your approach. You did say
some things about buffer locking, but you didn't differentiate between
what was essential to my design, and what was incidental, merely
calling it scary (i.e. you did something similar to what you're
accusing me of here - you didn't dismiss it, but you didn't address it
either). If you look back at the discussion throughout late November
and much of December, it is true that I am consistently critical, but
that was clearly a useful exercise, because now we know there is a
problem to fix.

Why is your approach better? You never actually said. In short, I
think that my approach may be better because it doesn't conflate row
locking with value locking (therefore making it easier to reason
about, maintaining modularity), that it never bloats, and that
releasing locks is clearly cheap, which may matter a lot sometimes. I
don't think the "intent exclusive" locking of my most recent revision
is problematic for performance - as the L&Y paper says, exclusively
locking only leaf pages is not that problematic. Extending that in a
way that still allows reads, if only for an instant, isn't going to be
too problematic.

I'm not sure that this is essential to your design, and I'm not sure
what your thoughts are on this, but Andres has defended the idea of
promise tuples that lock old values indefinitely pending the outcome
of another xact where we'll probably want to update, and I think this
is a bad idea. Andres recently seemed less convinced of this than in
the past [2], but I'd like to hear your thoughts. It's very pertinent,
because I think releasing locks needs to be cheap, and rendering old
promise tuples invisible is not cheap.

I'm not trying to piss anyone off here - I need all the help I can
get. These are important questions, and I'm not posing them to you to
be contrary.

> Even after that, you have spent zero effort to
> resolve the remaining issues with that approach, proclaiming that it's
> somehow fundamentally flawed and that locking index pages is obviously
> better. It's not. Sure, it still needs work, but the remaining issue isn't
> that difficult to resolve. Surely not any more complicated than what you did
> with heavy-weight locks on b-tree pages in your latest patch.

I didn't say that locking index pages is obviously better, and I
certainly didn't say anything about what you've done being
fundamentally flawed. I said that I "have limited enthusiasm for
expanding the modularity violation that exists within the btree AM".
Based on what Andres has said in the recent past on this thread about
the current btree code, that "in my opinion, bt_check_unique() doing
[locking heap and btree buffers concurrently] is a bug that needs
fixing" [3], can you really blame me? What this patch does not need is
another controversy. It seems pretty reasonable and sane that we'd
implement this by generalizing from the existing mechanism. Plus there
is plenty of evidence of other systems escalating what they call
"latches" and what we call buffer locks to heavyweight locks, I
believe going back to the 1970s. It's far from radical.

> Now, enough with the venting. Back to drawing board, to figure out how best
> to fix the deadlock issue with the insert_on_dup-kill-on-conflict-2.patch.
> Please help me with that.

I will help you. I'll look at it tomorrow.

> PS. In btreelock_insert_on_dup_v5.2013_12_28.patch, the language used in the
> additional text in README is quite difficult to read. Too many difficult
> sentences and constructs for a non-native English speaker like me. I had to
> look up "concomitantly" in a dictionary and I'm still not sure I understand
> that sentence :-).

Perhaps I should have eschewed obfuscation and espoused elucidation
here. I was trying to fit the style of the surrounding text. I just
mean that aside from the obvious reason for holding a pin, doing so at
the same time precludes deletion of the page, which requires a
"super exclusive" lock on the buffer.

[1] http://www.postgresql.org/message-id/1172858409.3760.1618.camel@silverbirch.site

[2] http://www.postgresql.org/message-id/20131227075453.GB17584@alap2.anarazel.de

[3] http://www.postgresql.org/message-id/20130914221524.GF4071@awork2.anarazel.de
-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 12/30/2013 05:57 AM, Peter Geoghegan wrote:
> Now, when you actually posted the prototype, I realized that it was
> the same basic design that I'd cited in my very first e-mail on the
> IGNORE patch (the one that didn't have row locking at all) - nobody
> else wanted to do heap insertion first for promise tuples. I read that
> 2007 thread [1] a long time ago, but that recognition only came when
> you posted your first prototype, or perhaps shortly before when you
> started participating on list.

Ah, I didn't remember that thread. Yeah, apparently I proposed the exact 
same design back then. Simon complained about the dead tuples being left 
behind, but I don't think that's a big issue with the design we've been 
playing around with now; you only end up with dead tuples when two backends 
try to insert the same value concurrently, which shouldn't happen very 
often. Other than that, there wasn't any discussion on whether that's a 
good approach or not.

> In short, I think that my approach may be better because it doesn't
> conflate row locking with value locking (therefore making it easier
> to reason about, maintaining modularity),

You keep saying that, but I don't understand what you mean. With your 
approach, an already-inserted heap tuple acts like a value lock, just 
like in my approach. You have the page-level locks on b-tree pages in 
addition to that, but the heap-tuple based mechanism is there too.

> I'm not sure that this is essential to your design, and I'm not sure
> what your thoughts are on this, but Andres has defended the idea of
> promise tuples that lock old values indefinitely pending the outcome
> of another xact where we'll probably want to update, and I think this
> is a bad idea. Andres recently seemed less convinced of this than in
> the past [2], but I'd like to hear your thoughts. It's very pertinent,
> because I think releasing locks needs to be cheap, and rendering old
> promise tuples invisible is not cheap.

Well, killing an old promise tuple is not cheap, but it shouldn't happen 
often. In both approaches, what probably matters more is the overhead of 
the extra heavy-weight locking. But this is all speculation, until we 
see some benchmarks.

> I said that I "have limited enthusiasm for
> expanding the modularity violation that exists within the btree AM".
> Based on what Andres has said in the recent past on this thread about
> the current btree code, that "in my opinion, bt_check_unique() doing
> [locking heap and btree buffers concurrently] is a bug that needs
> fixing" [3], can you really blame me? What this patch does not need is
> another controversy. It seems pretty reasonable and sane that we'd
> implement this by generalizing from the existing mechanism.

_bt_check_unique() is a modularity violation, agreed. Beauty is in the 
eye of the beholder, I guess, but I don't see either patch making that 
any better or worse.

>> Now, enough with the venting. Back to drawing board, to figure out how best
>> to fix the deadlock issue with the insert_on_dup-kill-on-conflict-2.patch.
>> Please help me with that.
>
> I will help you. I'll look at it tomorrow.

Thanks!

> [1] http://www.postgresql.org/message-id/1172858409.3760.1618.camel@silverbirch.site
>
> [2] http://www.postgresql.org/message-id/20131227075453.GB17584@alap2.anarazel.de
>
> [3] http://www.postgresql.org/message-id/20130914221524.GF4071@awork2.anarazel.de

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-12-29 19:57:31 -0800, Peter Geoghegan wrote:
> On Sun, Dec 29, 2013 at 8:18 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> >> My position is not based on a gut feeling. It is based on carefully
> >> considering the interactions of the constituent parts, plus the
> >> experience of actually building a working prototype.
> >
> >
> > I also carefully considered all that stuff, and reached a different
> > conclusion. Plus I also actually built a working prototype (for some
> > approximation of "working" - it's still a prototype).
> 
> Well, clearly you're in agreement with me about unprincipled
> deadlocking. That's what I was referring to here.

Maybe you should describe what you mean by "unprincipled". Sure, the
current patch deadlocks, but I don't see anything fundamental,
unresolvable there. So I don't understand what the word unprincipled
means in that sentence.

> Andres recently seemed less convinced of this than in
> the past [2], but I'd like to hear your thoughts.

Not really, I just don't have the time/energy to fight for it (aka write
a POC) at the moment.
I still think any form of promise tuple, be it index- or heap-based,
is a much better, more general approach than yours. That doesn't
preclude other approaches from being workable though.

> I didn't say that locking index pages is obviously better, and I
> certainly didn't say anything about what you've done being
> fundamentally flawed. I said that I "have limited enthusiasm for
> expanding the modularity violation that exists within the btree AM".
> Based on what Andres has said in the recent past on this thread about
> the current btree code, that "in my opinion, bt_check_unique() doing
> [locking heap and btree buffers concurrently] is a bug that needs
> fixing" [3], can you really blame me?

Uh. But that was said in the context of *your* approach being
flawed. Because it - at that time, I hadn't looked at the newest version
yet - extended the concept of holding btree page locks across external
operations to far more code, even including user-defined code! And
you argued that that isn't a problem, using _bt_check_unique() as an
argument.

I don't really see why your patch is less of a modularity violation than
Heikki's POC. It's just a different direction.

> > PS. In btreelock_insert_on_dup_v5.2013_12_28.patch, the language used in the
> > additional text in README is quite difficult to read. Too many difficult
> > sentences and constructs for a non-native English speaker like me. I had to
> > look up "concomitantly" in a dictionary and I'm still not sure I understand
> > that sentence :-).

+1 on the simpler language front as a fellow non-native speaker.

Personally, the biggest thing I think you could do in favor of your
position is to try to be a bit more succinct in the mailing list
discussions. I certainly fail at that at times as well, but I really
try to work on it...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Dec 30, 2013 at 8:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Maybe you should describe what you mean by "unprincipled". Sure, the
> current patch deadlocks, but I don't see anything fundamental,
> unresolvable there. So I don't understand what the word unprincipled
> means in that sentence.

Maybe it is resolvable, and maybe it's worth resolving - I never said
that it wasn't, I just said that I doubt the latter. By unprincipled
deadlocking, I mean deadlocking that cannot be reasonably prevented by
a user. Currently, I think that never deadlocking is a reasonable
aspiration for all applications. It's never really necessary. When it
occurs, we can advise users to do simple analysis and application
refactoring to prevent it. With unprincipled deadlocking, we can give
no such advice. The only advice we can give is to stop doing so much
upserting, which is a big departure from how things are today. AFAICT,
no one disagrees with my view that this is bad, and probably
unacceptable.

> Uh. But that was said in the context of *your* approach being
> flawed. Because it - at that time, I hadn't looked at the newest version
> yet - extended the concept of holding btree page locks across external
> operations to far more code, even including user-defined code! And
> you argued that that isn't a problem, using _bt_check_unique() as an
> argument.

That's a distortion of my position at the time. I acknowledged from
the start that all buffer locking was problematic (e.g. [1]), and was
exploring alternative locking approaches and the merit of the design.
This is obviously the kind of project that needs to be worked at
through iterative prototyping. While arguing that deadlocking would
not occur, I lost sight of the bigger picture. But even if that wasn't
true, I don't know why you feel the need to go on and on about buffer
locking like this months later. Are you trying to be provocative? Can
you *please* stop?

Everyone knows that the btree heap access is a modularity violation.
Even the AM docs say that the heap access is "without a doubt ugly
and non-modular". So my original point remains, which is that
expanding that is obviously going to be controversial, and probably
legitimately so. I thought that your earlier remarks on
_bt_check_unique() were a good example of this sentiment, but I hardly
need such an example.

> I don't really see why your patch is less of a modularity violation than
> Heikki's POC. It's just a different direction.

My approach does not regress modularity because it doesn't do anything
extra with the heap at all, and only btree insertion routines are
affected. Locking is totally localized to the btree insertion routines
- one .c file. At no other point does anything else have to care, and
it's obvious that this situation won't change in the future when we
decide to do something else cute with infomask bits or whatever.
That's a *huge* distinction.

[1] http://www.postgresql.org/message-id/CAM3SWZR2X4HJg7rjn0K4+hFdguCYX2prEP0Y3a7nccEjEowqqw@mail.gmail.com
-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-12-30 12:29:22 -0800, Peter Geoghegan wrote:
> But even if that wasn't
> true, I don't know why you feel the need to go on and on about buffer
> locking like this months later. Are you trying to be provocative? Can
> you *please* stop?

ERR? Peter? *You* quoted a statement of mine that only made sense in
its original context. And I *did* say that the point about buffer
locking applied to the *past* version of the patch.


Andres

--
 Andres Freund                     http://www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Dec 30, 2013 at 12:45 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-12-30 12:29:22 -0800, Peter Geoghegan wrote:
>> But even if that wasn't
>> true, I don't know why you feel the need to go on and on about buffer
>> locking like this months later. Are you trying to be provocative? Can
>> you *please* stop?
>
> ERR? Peter? *You* quoted a statement of mine that only made sense in
> its original context. And I *did* say that the point about buffer
> locking applied to the *past* version of the patch.

Not so. You suggested it was a bug that needed to be fixed, completely
independently of this effort. You clearly referred to the current
code.

"Yes, it it is different. But, in my opinion, bt_check_unique() doing
so is a bug that needs fixing. Not something that we want to extend."

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Adrian Klaver
Date:
On 12/30/2013 12:45 PM, Andres Freund wrote:
> On 2013-12-30 12:29:22 -0800, Peter Geoghegan wrote:
>> But even if that wasn't
>> true, I don't know why you feel the need to go on and on about buffer
>> locking like this months later. Are you trying to be provocative? Can
>> you *please* stop?
>
> ERR? Peter? *You* quoted a statement of mine that only made sense in
> its original context. And I *did* say that the point about buffer
> locking applied to the *past* version of the patch.

Alright, this seems to have gone from confusion about the proposal to 
confusion about the confusion. Might I suggest a cooling-off period and 
a return to the discussion, possibly in a Wiki page where the 
points/counterpoints could be laid out more efficiently?

>
>
> Andres
>


-- 
Adrian Klaver
adrian.klaver@gmail.com



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Dec 30, 2013 at 7:22 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Ah, I didn't remember that thread. Yeah, apparently I proposed the exact
> same design back then. Simon complained about the dead tuples being left
> behind, but I don't think that's a big issue with the design we've been
> playing around with now; you only end up with dead tuples when two backends try
> to insert the same value concurrently, which shouldn't happen very often.

Right, because you check first, which also has a cost, paid in CPU
cycles and memory bandwidth, and buffer lock contention. As opposed to
a cost almost entirely localized to inserters into a single leaf page
per unique index for only an instant. You're checking *all* unique
indexes.

You call check_exclusion_or_unique_constraint() once per unique index
(or EC), and specify to wait on the xact, at least until a conflict is
found. So if you're waiting on an xact, your conclusion that earlier
unique indexes had no conflicts could soon become totally obsolete. So
for non-idiomatic usage, say like the usage Andres in particular cares
about for MM conflict resolution, I worry about the implications of
that. I'm not asserting that it's a problem, but it does seem like
something that's quite hard to reason about. Maybe Andres can comment.

>> In short, I think that my approach may be better because it doesn't
>> conflate row locking with value locking (therefore making it easier
>> to reason about, maintaining modularity),
>
> You keep saying that, but I don't understand what you mean. With your
> approach, an already-inserted heap tuple acts like a value lock, just like
> in my approach. You have the page-level locks on b-tree pages in addition to
> that, but the heap-tuple based mechanism is there too.

Yes, but that historic behavior isn't really value locking at all.
That's very much like row locking, because there is a row, not the
uncertain intent to try to insert a row. Provided your transaction
commits and the client's transaction doesn't delete the row, the row
is definitely there. For upsert, conflicts may well be the norm, not
the exception.

Value locking is the exclusive lock on the buffer held during
_bt_check_unique(). I'm trying to safely extend that mechanism, to
reach consensus among unique indexes, which to me seems the most
logical and conservative approach. For idiomatic usage, it's only
sensible for there to be a conflict on one unique index, known ahead
of time. If you don't know where the conflict will be, then typically
your DML statement is unpredictable, just like the MySQL feature.
Otherwise, for MM conflict resolution, I think it makes sense to pick
those conflicts off, one at a time, dealing with exactly one row per
conflict. I mean, even with your approach, you're still not dealing
with later conflicts in later unique indexes, right? The fact that you
prevent conflicts on previously non-conflicting unique indexes only,
and, I suppose, not later ones too, seems quite arbitrary.

> Well, killing an old promise tuple is not cheap, but it shouldn't happen
> often. In both approaches, what probably matters more is the overhead of the
> extra heavy-weight locking. But this is all speculation, until we see some
> benchmarks.

Fair enough. We'll know more when we have fixed the exclusion
constraint supporting patch, which will allow us to make a fair
comparison. I'm working on it. Although I'd suggest that having dead
duplicates in indexes where that's avoidable is a cost that isn't
necessarily that easily characterized. I especially don't like that
you're currently doing the UNIQUE_CHECK_PARTIAL deferred unique
constraint thing of always inserting, continuing on for *all* unique
indexes regardless of finding a duplicate. Whatever overhead my
approach may imply around lock contention, clearly the cost to index
scans is zero.

The other thing is that if you're holding old "value locks" (i.e.
previously inserted btree tuples, from earlier unique indexes) pending
resolving a value conflict, you're holding those value locks
indefinitely pending the completion of the other guy's xact, just in
case there ends up being no conflict, which in general is unlikely. So
in extreme cases, that could be the difference between waiting all day
(on someone sitting on a lock that they very probably have no use
for), and not waiting at all.

> _bt_check_unique() is a modularity violation, agreed. Beauty is in the eye
> of the beholder, I guess, but I don't see either patch making that any
> better or worse.

Clearly the way in which you envisage releasing locks to prevent
unprincipled deadlocking implies that btree has to know more about the
heap, and maybe even that the heap has to know something about btree,
or at least about amcanunique AMs (including possible future
amcanunique AMs that may or may not be well suited to implementing
this the same way).

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sun, Dec 29, 2013 at 9:09 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>>> While mulling this over further, I had an idea about this: suppose we
>>> marked the tuple in some fashion that indicates that it's a promise
>>> tuple.  I imagine an infomask bit, although the concept makes me wince
>>> a bit since we don't exactly have bit space coming out of our ears
>>> there.  Leaving that aside for the moment, whenever somebody looks at
>>> the tuple with a mind to calling XactLockTableWait(), they can see
>>> that it's a promise tuple and decide to wait on some other heavyweight
>>> lock instead.  The simplest thing might be for us to acquire a
>>> heavyweight lock on the promise tuple before making index entries for
>>> it, and then have callers always wait on that instead of
>>> transitioning from the tuple lock to the xact lock.
>
> Yeah, that seems like it should work. You might not even need an infomask
> bit for that; just take the "other heavyweight lock" always before calling
> XactLockTableWait(), whether it's a promise tuple or not. If it's not,
> acquiring the extra lock is a waste of time but if you're going to sleep
> anyway, the overhead of one extra lock acquisition hardly matters.

Are you suggesting that I lock the tuple only (say, through a special
LockPromiseTuple() call), or lock the tuple *and* call
XactLockTableWait() afterwards? You and Robert don't seem to be in
agreement about which here. From here on I assume Robert's idea (only
get the special promise lock where appropriate), because that makes
more sense to me.

I've taken a look at this idea, but got frustrated. You're definitely
going to need an infomask bit for this. Otherwise, how do you
differentiate between a "pending" promise tuple and a "fulfilled"
promise tuple (or a tuple that never had anything to do with promises
in the first place)? On the one hand, you'll want to wake up as soon
as it becomes clear that the former is not going to become the latter.
On the other hand, you really will want to wait until xact end
on the pending promise tuple when it becomes a fulfilled promise, or
on an already-fulfilled promise tuple, or a plain old tuple. It's
either locking the promise tuple, or locking the xid; never both,
because the combination makes no sense in any case (unless you're
talking about the case where you lock the promise tuple and then later
*somehow* decide that you need to lock the xid as the upserter
releases promise tuple locks directly within ExecInsert() upon
successful insertion).

The fact that your LockPromiseTuple() call didn't find someone else
with the lock does not mean no one ever promised the tuple (assuming
no infomask bit has the relevant info).

Obviously you can't just have upserters hold on to the promise tuple
locks until xact end if the promiser's insertion succeeds, for the
same reason we don't with regular in-memory tuple locks: they're
totally unbounded. So not only are you going to need an infomask
promise bit, you're going to need to go and unset the bit in the event
of a *successful* insertion, so that waiters know to wait on your xact
now when you finally UnlockPromiseTuple() within ExecInsert() to
finish off successful insertion. *And* all XactLockTableWait()
promise waiters need to go back and check that, just in case.
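
To sketch what I mean, the waiter side would presumably have to look
something like the following, where HEAP_PROMISE_PENDING,
LockPromiseTuple(), UnlockPromiseTuple() and the final re-check helper
are all hypothetical:

/* On finding a conflicting, in-progress tuple: */
if (tuple->t_data->t_infomask & HEAP_PROMISE_PENDING)   /* hypothetical bit */
{
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    /* Sleep until the promise is either kept or broken */
    LockPromiseTuple(relation, &tuple->t_self);         /* hypothetical */
    UnlockPromiseTuple(relation, &tuple->t_self);

    /*
     * Re-check, as described above: if the bit was since unset, the
     * promise was kept (insertion succeeded), and we must now wait on
     * the xact as usual; if the tuple was killed, we're done.
     */
    if (promise_was_fulfilled(relation, &tuple->t_self)) /* hypothetical */
        XactLockTableWait(xwait);
}
else
    XactLockTableWait(xwait);   /* plain tuple: wait for xact end */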

This problem illustrates what I mean about conflating row locking with
value locking.

>> I think the interlocking with buffer locks and heavyweight locks to
>> make that work could be complex.
>
> Hmm. Can you elaborate?

What I meant is that you should be wary of what you go on to describe below.

> The inserter has to acquire the heavyweight lock before releasing the buffer
> lock, because otherwise another inserter (or deleter or updater) might see
> the tuple, acquire the heavyweight lock, and fall to sleep on
> XactLockTableWait(), before the inserter has grabbed the heavyweight lock.
> If that race condition happens, you have the original problem again, ie. the
> updater unnecessarily waits for the inserting transaction to finish, even
> though it already killed the tuple it inserted.

Right. Can you suggest a workaround to the above problems?
-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 12/31/2013 09:18 AM, Peter Geoghegan wrote:
> On Sun, Dec 29, 2013 at 9:09 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>>> While mulling this over further, I had an idea about this: suppose we
>>>> marked the tuple in some fashion that indicates that it's a promise
>>>> tuple.  I imagine an infomask bit, although the concept makes me wince
>>>> a bit since we don't exactly have bit space coming out of our ears
>>>> there.  Leaving that aside for the moment, whenever somebody looks at
>>>> the tuple with a mind to calling XactLockTableWait(), they can see
>>>> that it's a promise tuple and decide to wait on some other heavyweight
>>>> lock instead.  The simplest thing might be for us to acquire a
>>>> heavyweight lock on the promise tuple before making index entries for
>>>> it, and then have callers always wait on that instead of
>>>> transitioning from the tuple lock to the xact lock.
>>
>> Yeah, that seems like it should work. You might not even need an infomask
>> bit for that; just take the "other heavyweight lock" always before calling
>> XactLockTableWait(), whether it's a promise tuple or not. If it's not,
>> acquiring the extra lock is a waste of time but if you're going to sleep
>> anyway, the overhead of one extra lock acquisition hardly matters.
>
> Are you suggesting that I lock the tuple only (say, through a special
> LockPromiseTuple() call), or lock the tuple *and* call
> XactLockTableWait() afterwards? You and Robert don't seem to be in
> agreement about which here.

I meant the latter, ie. grab the new kind of lock first, then check if 
the tuple is still there, and then call XactLockTableWait() as usual.

>> The inserter has to acquire the heavyweight lock before releasing the buffer
>> lock, because otherwise another inserter (or deleter or updater) might see
>> the tuple, acquire the heavyweight lock, and fall to sleep on
>> XactLockTableWait(), before the inserter has grabbed the heavyweight lock.
>> If that race condition happens, you have the original problem again, ie. the
>> updater unnecessarily waits for the inserting transaction to finish, even
>> though it already killed the tuple it inserted.
>
> Right. Can you suggest a workaround to the above problems?

Umm, I did, in the next paragraph ;-) :

> That seems easy to avoid. If the heavyweight lock uses the
> transaction id as the key, just like
> XactLockTableInsert/XactLockTableWait, you can acquire it before
> doing the insertion.

Let me elaborate on that. The idea is to have a new heavy-weight lock that's 
just like the transaction lock used by 
XactLockTableInsert/XactLockTableWait, but separate from that. Let's 
call it PromiseTupleInsertionLock. The insertion procedure in INSERT ... 
ON DUPLICATE looks like this:

1. PromiseTupleInsertionLockAcquire(<my xid>)
2. Insert heap tuple
3. Insert index tuples
4. Check if conflict happened. Kill the already-inserted tuple on conflict.
5. PromiseTupleInsertionLockRelease(<my xid>)

IOW, the only change to the current patch is that you acquire the new 
kind of lock before starting the insertion, and you release it after 
you've killed the tuple, or you know you're not going to kill it.
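
Or, rendered as a rough C sketch of those five steps (heap_insert()
and ExecInsertIndexTuples() are the existing calls; the
PromiseTupleInsertionLock* pair, the conflict check and the kill step
are hypothetical):

List          *recheckIndexes;
TransactionId  xid = GetCurrentTransactionId();

PromiseTupleInsertionLockAcquire(xid);                  /* step 1 */
heap_insert(relation, heaptup, cid, 0, NULL);           /* step 2 */
recheckIndexes = ExecInsertIndexTuples(slot,
                                       &(heaptup->t_self),
                                       estate);         /* step 3 */
if (conflict_found(recheckIndexes))                     /* step 4 */
    kill_promise_tuple(relation, heaptup);              /* hypothetical */
PromiseTupleInsertionLockRelease(xid);                  /* step 5 */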

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> 1. PromiseTupleInsertionLockAcquire(<my xid>)
> 2. Insert heap tuple
> 3. Insert index tuples
> 4. Check if conflict happened. Kill the already-inserted tuple on conflict.
> 5. PromiseTupleInsertionLockRelease(<my xid>)
>
> IOW, the only change to the current patch is that you acquire the new kind
> of lock before starting the insertion, and you release it after you've
> killed the tuple, or you know you're not going to kill it.

Where does row locking fit in there? - you may need to retry when that
part is incorporated, of course. What if you have multiple promise
tuples from a contended attempt to insert a single slot, or multiple
broken promise tuples across multiple slots or even multiple commands
in the same xact?

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> Are you suggesting that I lock the tuple only (say, through a special
>> LockPromiseTuple() call), or lock the tuple *and* call
>> XactLockTableWait() afterwards? You and Robert don't seem to be in
>> agreement about which here.
>
> I meant the latter, ie. grab the new kind of lock first, then check if the
> tuple is still there, and then call XactLockTableWait() as usual.

I don't follow this either. Through what exact mechanism does the
waiter know that there was a wait on the
PromiseTupleInsertionLockAcquire() call, and so it should not wait on
XactLockTableWait()? Does whatever mechanism you have in mind not have
race conditions?


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2013-12-27 14:11:44 -0800, Peter Geoghegan wrote:
> On Fri, Dec 27, 2013 at 12:57 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I don't think the current syntax the feature implements can be used as
> > the sole argument what the feature should be able to support.
> >
> > If you think from the angle of an async MM replication solution
> > replicating a table with multiple unique keys, not having to specify a
> > single index we expect conflicts from is surely helpful.
> 
> Well, you're not totally on your own for something like that with this
> feature. You can project the conflicter's tid, and possibly do a more
> sophisticated recovery, like inspecting the locked row and iterating.

Yea, but in that case I *do* conflict with more than one index and old
values need to stay locked. Otherwise anything resembling a
forward-progress guarantee is lost.

> That's probably not at all ideal, but then I can't even imagine what
> the best interface for what you describe here looks like. If there are
> multiple conflicts, do you delete or update some or all of them? How
> do you express that concept from a DML statement?

For my use cases just getting the tid back is fine - it's in C
anyway. But I'd rather be in a position to do it from SQL as well...

If there are multiple conflicts the conflicting row should be
updated. If we didn't release the value locks on the individual indexes,
we can know beforehand whether only one row is going to be affected. If
there really are more than one, error out.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Jan 2, 2014 at 1:49 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Well, you're not totally on your own for something like that with this
>> feature. You can project the conflicter's tid, and possibly do a more
>> sophisticated recovery, like inspecting the locked row and iterating.
>
> Yea, but in that case I *do* conflict with more than one index and old
> values need to stay locked. Otherwise anything resembling a
> forward-progress guarantee is lost.

I'm not sure I understand. In a very real sense they do stay locked.
What is insufficient about locking the definitively visible row with
the value, rather than the value itself? What distinction are you
making? On the first conflict you can delete the row you locked, and
then re-try, possibly further merging some stuff from the just-deleted
row when you next upsert.

It's possible that an "earlier" unique index value that is unlocked
before row locking proceeds will get a new would-be duplicate after
you're returned a locked row, but it's not obvious that that's a
problem for your use-case (a problem that can't be worked around), or
that promise tuples get you anything better.

>> That's probably not at all ideal, but then I can't even imagine what
>> the best interface for what you describe here looks like. If there are
>> multiple conflicts, do you delete or update some or all of them? How
>> do you express that concept from a DML statement?
>
> For my use cases just getting the tid back is fine - it's in C
> anyway. But I'd rather be in a position to do it from SQL as well...

I believe you can.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Andres Freund
Date:
On 2014-01-02 02:20:02 -0800, Peter Geoghegan wrote:
> On Thu, Jan 2, 2014 at 1:49 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> Well, you're not totally on your own for something like that with this
> >> feature. You can project the conflicter's tid, and possibly do a more
> >> sophisticated recovery, like inspecting the locked row and iterating.
> >
> > Yea, but in that case I *do* conflict with more than one index and old
> > values need to stay locked. Otherwise anything resembling a
> > forward-progress guarantee is lost.
> 
> I'm not sure I understand. In a very real sense they do stay locked.
> What is insufficient about locking the definitively visible row with
> the value, rather than the value itself?

Locking the definitely visible row only works if there's a row matching
the index's columns. If the values of the new row don't have
corresponding values in all the indexes you have the same old race
conditions again.
I think that to be useful for many cases, you really need to be able
to ask for a potentially conflicting row, and be sure that if there's
none you are able to insert the row separately.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Tue, Dec 31, 2013 at 4:12 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> 1. PromiseTupleInsertionLockAcquire(<my xid>)
>> 2. Insert heap tuple
>> 3. Insert index tuples
>> 4. Check if conflict happened. Kill the already-inserted tuple on conflict.
>> 5. PromiseTupleInsertionLockRelease(<my xid>)
>>
>> IOW, the only change to the current patch is that you acquire the new kind
>> of lock before starting the insertion, and you release it after you've
>> killed the tuple, or you know you're not going to kill it.
>
> Where does row locking fit in there? - you may need to retry when that
> part is incorporated, of course. What if you have multiple promise
> tuples from a contended attempt to insert a single slot, or multiple
> broken promise tuples across multiple slots or even multiple commands
> in the same xact?

Yeah, it seems like PromiseTupleInsertionLockAcquire should be locking
the tuple, rather than the XID.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 01/02/2014 02:53 PM, Robert Haas wrote:
> On Tue, Dec 31, 2013 at 4:12 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> 1. PromiseTupleInsertionLockAcquire(<my xid>)
>>> 2. Insert heap tuple
>>> 3. Insert index tuples
>>> 4. Check if conflict happened. Kill the already-inserted tuple on conflict.
>>> 5. PromiseTupleInsertionLockRelease(<my xid>)
>>>
>>> IOW, the only change to the current patch is that you acquire the new kind
>>> of lock before starting the insertion, and you release it after you've
>>> killed the tuple, or you know you're not going to kill it.
>>
>> Where does row locking fit in there? - you may need to retry when that
>> part is incorporated, of course. What if you have multiple promise
>> tuples from a contended attempt to insert a single slot, or multiple
>> broken promise tuples across multiple slots or even multiple commands
>> in the same xact?

You can only have one speculative insertion in progress at a time. After 
you've done all the index insertions and checked that you really didn't 
conflict with anyone, you're not going to go back and kill the tuple 
anymore. After that point, the insertion is not speculation anymore.

> Yeah, it seems like PromiseTupleInsertionLockAcquire should be locking
> the tuple, rather than the XID.

Well, that would be ideal, because we already have tuple locks. It would 
be nice to use the same concept for this. It's a bit tricky, however. I 
guess the most straightforward way to do it would be to grab a 
heavy-weight lock after you've inserted the tuple, but before releasing 
the buffer lock. I don't immediately see a problem with that, although 
it's a bit scary to acquire a heavy-weight lock while holding a buffer lock.
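
Something like the following sketch, using the existing LockTuple()
lmgr call purely for illustration; the scary part is the heavyweight
acquisition while the buffer lock is still held:

/* In the insertion path, with the buffer content lock still held: */
RelationPutHeapTuple(relation, buffer, heaptup);
/* ... mark the buffer dirty, write WAL ... */

/*
 * Take the heavyweight tuple lock before anyone else can possibly see
 * the tuple. Heavyweight acquisition can sleep or error out, which is
 * normally forbidden while a buffer lock is held.
 */
LockTuple(relation, &(heaptup->t_self), ExclusiveLock);
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);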

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Thu, Jan 2, 2014 at 11:08 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 01/02/2014 02:53 PM, Robert Haas wrote:
>> On Tue, Dec 31, 2013 at 4:12 AM, Peter Geoghegan <pg@heroku.com> wrote:
>>>
>>> On Tue, Dec 31, 2013 at 12:52 AM, Heikki Linnakangas
>>> <hlinnakangas@vmware.com> wrote:
>>>>
>>>> 1. PromiseTupleInsertionLockAcquire(<my xid>)
>>>> 2. Insert heap tuple
>>>> 3. Insert index tuples
>>>> 4. Check if conflict happened. Kill the already-inserted tuple on
>>>> conflict.
>>>> 5. PromiseTupleInsertionLockRelease(<my xid>)
>>>>
>>>> IOW, the only change to the current patch is that you acquire the new
>>>> kind
>>>> of lock before starting the insertion, and you release it after you've
>>>> killed the tuple, or you know you're not going to kill it.
>>>
>>>
>>> Where does row locking fit in there? - you may need to retry when that
>>> part is incorporated, of course. What if you have multiple promise
>>> tuples from a contended attempt to insert a single slot, or multiple
>>> broken promise tuples across multiple slots or even multiple commands
>>> in the same xact?
>
> You can only have one speculative insertion in progress at a time. After
> you've done all the index insertions and checked that you really didn't
> conflict with anyone, you're not going to go back and kill the tuple
> anymore. After that point, the insertion is not speculation anymore.

Yeah... but how does someone examining the tuple know that?  We need
to avoid having them block on the promise-tuple insertion lock if
we've reacquired it meanwhile for a new speculative insertion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
I decided to make at least a cursory attempt to measure or
characterize the performance of each of our approaches to value
locking. Being fair here is non-trivial, because upserts can behave
quite differently based on the need to insert or update, lock
contention, and so on. Also, I knew that anything I came up with would
not be comparing like with like: as things stand, the btree locking
code is more or less correct, and the alternative
exclusion-constraint-supporting implementation is more or less
incorrect (of course, you may yet describe a way to fix the
unprincipled deadlocking previously revealed by my testcase [1], but
it is far from clear what impact this fix will have on performance).
Still, something is better than nothing.

This was run on a Linux server ("Linux 3.8.0-31-generic
#46~precise1-Ubuntu") with these specifications:
https://www.hetzner.de/en/hosting/produkte_rootserver/ex40 .
Everything fits in shared_buffers, but the I/O system is probably the
weakest link here.

To be 100% clear, I am comparing
btreelock_insert_on_dup.v5.2013_12_28.patch.gz [2] with
exclusion_insert_on_dup.2013_12_19.patch.gz [3]. I'm also testing a
third approach, involving avoidance of exclusive buffer locks and
heavyweight locks for upserts in the first phase of speculative
insertion. That patch is unposted, but shows a modest improvement over
[2].

I ran this against the table foo:

pgbench=# \d+ foo
                         Table "public.foo"
 Column |  Type   | Modifiers | Storage  | Stats target | Description
--------+---------+-----------+----------+--------------+-------------
 a      | integer | not null  | plain    |              |
 b      | integer |           | plain    |              |
 c      | text    |           | extended |              |
Indexes:
    "foo_pkey" PRIMARY KEY, btree (a)
Has OIDs: no

My custom script was:

\setrandom rec 1 :scale
with rej as (insert into foo(a, b, c) values(:rec, :rec, 'insert') on
duplicate key lock for update returning rejects *) update foo set c =
'update' from rej where foo.a = rej.a;

I specified that each pgbench client in each run should last for 200k
upserts (with 100k possible distinct key values), not that it should
last some number of seconds. The idea is that there is a reasonable
mix of inserts and updates initially, for lower client counts, but
exactly the same number of queries are run for each patch, so as to
normalize the effects of contention across both runs (this sure is
hand-wavy, but likely better than nothing). I'm just looking for
approximate numbers here, and I'm sure that you could find more than
one way to benchmark this feature, with varying degrees of sympathy
towards each of our two approaches to value locking. This benchmark
isn't sympathetic to btree locking at all, because there is a huge
amount of contention for the higher client counts, with 100% of
possible rows updated by the time we're done at 16 clients, for
example.

To compensate somewhat for the relatively low duration of each run, I
take an average-of-5, rather than an average-of-3 as representative
for each client count + run/patch combination.

A full report of the results is here:
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/upsert-cmp/

My executive summary is that the exclusion patch performs about the
same on lower client counts, presumably due to not having the
additional window of btree lock contention. By 8 clients, the
exclusion patch does noticeably better, but it's a fairly modest
improvement.

Forgive me if I'm belaboring the point, but even though I'm
benchmarking the simplest possible upsert statements, had I chosen
small pgbench scale factors (e.g. scales that would see 100 - 1000
possible distinct key values in total) the btree locking
implementation would surely win very convincingly, just because the
alternative implementation would spend almost all of its time
deadlocked, waiting for the deadlock detector to free clients in
one-second deadlock_timeout cycles. My goal here was just to put a rough
number on how these two approaches compare, while trying to be as fair
as possible.

I have to wonder about the extent to which the exclusion approach
benefits from holding old value locks. So even if the unprincipled
deadlocking issue can be fixed without much additional cost, it might
be that the simple fact that that approach holds those pseudo "value
locks" (i.e. old dead rows from previous iterations on the same tuple
slot) indefinitely helps performance, and that losing that property
alone will hurt, even though losing it is necessary.

For those that wonder what the effect on multiple unique indexes would
be, that isn't really all that relevant, since contention on multiple
unique indexes isn't expected with idiomatic usage (though I suppose
an upsert's non-HOT update would have to compete).

[1] http://www.postgresql.org/message-id/CAM3SWZShbE29KpoD44cVc3vpZJGmDer6k_6FGHiSzeOZGmTFSQ@mail.gmail.com

[2] http://www.postgresql.org/message-id/CAM3SWZRpnkuVrENDV3zM=BNTCv8-X3PYXt76pohGyAuP1iq-ug@mail.gmail.com

[3] http://www.postgresql.org/message-id/CAM3SWZShbE29KpoD44cVc3vpZJGmDer6k_6FGHiSzeOZGmTFSQ@mail.gmail.com

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Jan 2, 2014 at 8:08 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> Yeah, it seems like PromiseTupleInsertionLockAcquire should be locking
>> the tuple, rather than the XID.
>
> Well, that would be ideal, because we already have tuple locks. It would be
> nice to use the same concept for this. It's a bit tricky, however. I guess
> the most straightforward way to do it would be to grab a heavy-weight lock
> after you've inserted the tuple, but before releasing the buffer lock. I
> don't immediately see a problem with that, although it's a bit scary to
> acquire a heavy-weight lock while holding a buffer lock.

That's a really big modularity violation. Everything after
RelationPutHeapTuple() but before the buffer unlock in heap_insert()
is currently a critical section. I'm not saying that it can't be done,
but it certainly is scary.

We also have heavyweight page locks, currently used by hash indexes.
That approach does not require us to contort the row locking code, and
certainly does not require us to acquire heavyweight locks with buffer
locks already held. I could understand your initial disinclination to
doing things this way, particularly when the unprincipled deadlocking
problem was not well understood, but I think that this must tip the
balance in favor of the approach I advocate. What I've done with
heavyweight locks is a modest, localized, logical expansion of the
existing mechanism. It is easy to reason about, leaves room for
further optimization in the future, and still has reasonable
performance characteristics today, including, I believe, better
worst-case latency. Heavyweight locks on btree pages are very well
precedented, if you look beyond Postgres.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Jan 2, 2014 at 2:37 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Locking the definitely visible row only works if there's a row matching
> the index's columns. If the values of the new row don't have
> corresponding values in all the indexes you have the same old race
> conditions again.

I still don't get it - perhaps you should break down exactly what you
mean with an example. I'm talking about potentially doing multiple
upserts per row proposed for insertion to handle multiple conflicts,
perhaps with some deletes between upserts, not just one upsert with a
single update part.

> I think to be useful for many cases you really need to be able to ask
> for a potentially conflicting row and be sure that if there's none you
> are able to insert the row separately.

Why? What work do you need to perform after reserving the right to
insert but before inserting? Can't you just upsert resulting in
insert, and then perform that work, potentially deleting the row
inserted if and when you change your mind? Is there any real
difference between what that does for you, and what any particular
variety of promise tuple might do for you?

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Jan 2, 2014 at 11:58 AM, Peter Geoghegan <pg@heroku.com> wrote:
> My executive summary is that the exclusion patch performs about the
> same on lower client counts, presumably due to not having the
> additional window of btree lock contention. By 8 clients, the
> exclusion patch does noticeably better, but it's a fairly modest
> improvement.

I forgot to mention that synchronous_commit was turned off, so as to
eliminate noise that might have been added by commit latency, while
still obligating btree to WAL-log everything with an exclusive buffer
lock held.


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Eisentraut
Date:
This patch doesn't apply anymore.



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Jan 3, 2014 at 7:39 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> This patch doesn't apply anymore.

Yes, there was some bit-rot. I previously deferred dealing with a
shift/reduce conflict implied by commit
1b4f7f93b4693858cb983af3cd557f6097dab67b. I've fixed that problem now
using non-operator precedence, and performed a clean rebase on master.
I've also fixed the basis of your much earlier complaint about
breakage of ecpg's regression tests (without adding support for the
feature to ecpg). All make check-world tests pass. Patch is attached.
I have yet to figure out how to make REJECTS a non-reserved keyword,
or even just a type_func_name_keyword, though intuitively I have a
sense that the latter ought to be possible.

This is the same basic patch as benchmarked above, with various tricks
to avoid stronger lock acquisition when that's likely profitable (we
can even do _bt_check_unique() with only a shared lock and no hwlock
much of the time, on the well-informed suspicion that it won't be
necessary to insert, but only to return a TID). There has also been
some clean-up to aspects of serializable behavior, but that needs
further attention and scrutiny from a subject matter expert, hopefully
Heikki. Though it's probably also true that I should find time to
think about transaction isolation some more.

I've since had another idea relating to performance optimization,
which was to hint that the last attempt to insert a key was
unsuccessful, so the next one (after the conflicting transaction's
commit/abort) of that same value will very likely conflict too, making
lock avoidance profitable on average. This appears to be much more
effective than the previous woolly heuristic (never published, just
benchmarked), which I've left in as an additional reason to avoid
heavyweight locking, if only for discussion. This benchmark now shows
my approach winning convincingly with this additional "priorConflict"
optimization:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/upsert-cmp-2/
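
For illustration, the shape of the priorConflict heuristic is roughly
this (the flag's placement and the helper names are illustrative, not
the actual patch code):

/*
 * priorConflict is a per-backend hint: did the last attempt to insert
 * this value find a duplicate? If so, the retry made after the
 * conflicting xact's commit/abort very probably conflicts again, so
 * probe with only a shared buffer lock and no hwlock first,
 * escalating to the stronger locks only if we must actually insert.
 */
if (!priorConflict)
    take_strong_value_locks();              /* hypothetical */

found_duplicate = probe_unique_indexes();   /* simplified */
priorConflict = found_duplicate;            /* remember for next time */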

If someone had time to independently recreate the benchmark I have
here, or perhaps to benchmark the patch in some other way, that would
be useful (for full details see my recent e-mail about the prior
benchmark, where the exact details are described - this is the same,
but with one more run for the priorConflict optimization).

Subtleties of visibility also obviously deserve closer inspection, but
perhaps I shouldn't be so hasty: no consensus on the way forward looks
even close to emerging. How do people feel about my approach now?

--
Peter Geoghegan

Attachment

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Dec 13, 2013 at 4:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, so far as the syntax goes, I'm quite distressed by having to make
> REJECTS into a fully-reserved word.  It's not reserved according to the
> standard, and it seems pretty likely to be something that apps might be
> using as a table or column name.

I've been looking at this, but I'm having a hard time figuring out a
way to eliminate shift/reduce conflicts while not maintaining REJECTS
as a fully reserved keyword - I'm pretty sure it's impossible with an
LALR parser. I'm not totally enamored with the exact syntax proposed
-- I appreciate the flexibility on the one hand, but on the other hand
I suppose that REJECTS could just as easily be any number of other
words.

One possible compromise would be to use a synonym that is not imagined
to be in use very widely, although I looked up "reject" in a thesaurus
and didn't feel too great about that idea afterwards. Another idea
would be to have a REJECTING keyword, as the sort of complement of
RETURNING (currently you can still ask for RETURNING, without REJECTS
but with ON DUPLICATE KEY LOCK FOR UPDATE if that happens to make
sense). I think that would work fine, and might actually be more
elegant. Now, REJECTING will probably have to be a reserved keyword,
but that seems less problematic, particularly as RETURNING is itself a
reserved keyword not described by the standard. In my opinion
REJECTING would reinforce the notion of projecting the complement of
what RETURNING would project in the same context.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
I've worked on a simple set of tests, written quickly in bash, that I
think exercise interesting cases:

https://github.com/petergeoghegan/upsert

Perhaps most notably, these tests make comparisons between the
performance of ordinary inserts with a serial primary key table, and
effectively equivalent upserts that always insert. Even though a
SERIAL primary key is used, which you might imagine to be a bad case
for the extended heavyweight leaf page locks, performance seems almost
as good as regular insertion (though I should add that there was still
only one unique index). Upserts that only update are significantly
slower, and end up at very approximately 2/3 the performance of
equivalent updates (a figure that I arrive at through running the
tests on my laptop, which is not terribly rigorous, so perhaps take
that with a grain of salt). That the update-only case is slower is
hardly surprising, since those require an "extra" index scan as we
re-find the conflict tuple. Using ctid projection to avoid the index
scan doesn't work, at least not without extra in-application handling,
because a tid scan is never used if you write a typical wCTE upsert
statement - the planner forces a seqscan. It would have been
interesting to see where a tid scan for the Update/ModifyTable node's
nestloop join left things, but I didn't get that far. I think that if
we had a really representative test (no "extra" index scan), the
performance of upserts that only update would similarly be almost as
good as regular updates for many representative cases.

The upsert tests also provide cursory smoke testing for cases of
interest. I suggest comparing the test cases, and their
performance/behavior between the exclusion* and btree* patches.

A new revision of my patch is attached. The changes are mostly
improvements that are equally applicable to promise tuples, so this
should not be interpreted as a rejection of promise tuples so much as
a way of using my time efficiently while I wait to hear back about how
others feel my approach compares. I'm still rather curious about what
people think in light of recent developments. :-)

Changes include:

* Another ecpg fix.

* Misc polishing and refactoring to btree code. Most notably I now
cache btree insertion scankeys across phases (as well as continuing to
cache each IndexTuple).

* contrib/pageinspect notes when btree leaf pages have the locked flag bit set.

* Better lock/unlock ordering. Commit
a1dfaef6c6e2da34dbc808f205665a6c96362149 added strict ordering of
indexes because of exclusive heavyweight locks held on indexes. Back
then (~2000), heavyweight page locks were also acquired on btree pages
[1], and in general it was expected that heavyweight locks could be
held on indexes across complex operations (I don't think this is
relied upon anymore, but the assumption that it could happen remains).
I've extended this so that the relcache ensures that we always insert
into primary key indexes first (and, in the case of my patch, lock
values in PKs last, to minimize the locking window). Getting the
primary key insertion out of the way seems sensible as a general bloat
avoidance measure: when a slot is responsible for a unique violation,
its index tuples then bloat only unique indexes, whereas simply
ordering by oid can leave dead index tuples in previously inserted
non-unique indexes.

* I believe I've fixed the bug in the modifications made to
HeapTupleSatisfiesMVCC(), though I'd like confirmation that I have the
details right. What do you think, Andres?

* I stop higher isolation levels from availing themselves of the
aforementioned modifications to HeapTupleSatisfiesMVCC(), since that
would certainly
be inconsistent with user expectations. I'm not sure what "read
phenomenon" described by the standard this violates, but it's
obviously inconsistent with the spirit of the higher isolation levels
to be able to see a row committed by a transaction conceptually
still-in-progress. It's bad enough to do it for READ COMMITTED, but I
think a consensus may be emerging that that's the least worst thing
(correct me if I'm mistaken).

* The "priorConfict" optimization, which was previously shown to
really help performance [2] has been slightly improved to remember row
locking conflicts too.

[1]
https://github.com/postgres/postgres/blob/a1dfaef6c6e2da34dbc808f205665a6c96362149/src/backend/access/nbtree/nbtpage.c#L314

[2] http://www.postgresql.org/message-id/CAM3SWZQZTAN1fDiq4o2umGOaczbpemyQoM-6OxgUFBzi+dQzkg@mail.gmail.com

--
Peter Geoghegan

Attachment

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Jan 7, 2014 at 8:46 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I've worked on a simple set of tests, written quickly in bash, that I
> think exercise interesting cases:
>
> https://github.com/petergeoghegan/upsert
>
> Perhaps most notably, these tests make comparisons between the
> performance of ordinary inserts with a serial primary key table, and
> effectively equivalent upserts that always insert.

While I realize that everyone is busy, I'm concerned about the lack of
discussion here. It's been 6 full days since I posted my benchmark,
which I expected to quickly clear some things up, or at least garner
interest, and yet no one has commented here since.

Here is a summary of the situation, at least as I understand it:

* My patch has been shown to perform much better than the alternative
"promise tuples" proposal. The benchmark previously published,
referred to above makes this evident for workloads with lots of
contention [1].

Now, to cover everything, I've gone on to benchmark inserts into a
table foo(serial, int4, text) that lock the row using the new
infrastructure. The SERIAL column is the primary key. I'm trying to
characterize the overhead of the extended value locking here, by
showing the same case (a worst case) with and without the overhead.
Here are the results:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/vs-vanilla-insert/

(asynchronous commits, logged table)

With both extremes covered, the data suggests that my patch performs
very well by *any* standard. But if we consider how things compare to
the alternative proposal, all indications are that performance is far
superior (at least for representative cases without too many unique
indexes, not that I suspect things are much different with many).
Previous concerns about the cost of extended leaf page locking ought
to be almost put to rest by this benchmark, because inserting a
sequence of btree index tuple integers in succession is a particularly
bad case, and yet in practice the implementation does very well. (With
my patch, we're testing the same statement with an ON DUPLICATE KEY
LOCK FOR UPDATE part, but there are naturally no conflicts on the
SERIAL PK - on master we're testing the same INSERT statement without
that, inserting sequence values just as before, only without the
worst-case value locking overhead).

* The alternative exclusion* patch still deadlocks in an unprincipled
fashion, when simple, idiomatic usage encounters contention. Heikki
intends to produce a revision that fixes the problem, though having
considered it carefully myself, I don't know what mechanism he has in
mind, and frankly I'm skeptical. More importantly, I have to question
whether we should continue to pursue that alternative approach, given
what we now know about its performance characteristics. It could be
improved, but not by terribly much, particularly for the case where
there is plenty of update contention, which was shown in [1] to be
approximately 2-3 times slower than extended page locking (*and* it's
already looking for would-be duplicates *first*). I'm trying to be as
fair as possible, and yet the difference is huge. It's going to be
really hard to beat something where the decision to try to see if we
should insert or update comes so late: the decision is made as late as
possible, is based on strong indicators of likely outcome, while the
cost of making the wrong choice is very low. With shared buffer locks
held calling _bt_check_unique(), we still lock out concurrent would-be
duplicate insertion, and so don't need to restart from scratch (to
insert instead) in the same way as with the alternative proposal's
largely AM-naive approach.

* I am not aware of anyone considering there to be any remaining open
items; I've addressed all of those now. Progress here is entirely
blocked on waiting for review feedback.

With the new priorConflict lock strength optimization, my patch is in
some ways similar to what Heikki proposed (in the exclusion* patch).
It's as if the first phase, the locking operation, is an index scan
with an identity crisis. It can decide to continue to be an "index
scan" (albeit an oddball one with an insertion scankey that, using
shared buffer locks, prevents concurrent duplicate insertion, making
for very efficient uniqueness checks), or it can decide to actually
insert, at
the last possible moment. The second phase is picked up with much of
the work already complete from the first, so the amount of wasted work
is very close to zero in all cases. How can anything beat that?

If the main argument for the exclusion approach is that it works with
exclusion constraints, then I can still go and make what I've done
work there too (for the IGNORE case, which I maintain is the only
exclusion constraint variant of this that is useful to users). In any
case I think making anything work for exclusion constraints should be
a relatively low priority.

I'd like to hear more opinions on what I've done here, if anyone has
bandwidth to spare. I doubt I need to remind anybody that this is a
feature of considerable strategic importance. We need this, and we've
been unnecessarily at a disadvantage to other systems by not having it
for all these years. Every application developer wants this feature -
it's a *very* frequent complaint.

[1] http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/upsert-cmp-2/
-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 01/10/2014 05:36 AM, Peter Geoghegan wrote:
> While I realize that everyone is busy, I'm concerned about the lack of
> discussion here. It's been 6 full days since I posted my benchmark,
> which I expected to quickly clear some things up, or at least garner
> interest, and yet no one has commented here since.

Nah, that's nothing. I have a patch in the January commitfest that was
already posted for the previous commitfest. It received zero review back
then, and still has no reviewer signed up, let alone anyone actually
reviewing it. And arguably it's a bug fix!

http://www.postgresql.org/message-id/5285071B.1040100@vmware.com

Wink wink, if you're looking for patches to review... ;-)

>   The alternative exclusion* patch still deadlocks in an unprincipled
> fashion, when simple, idiomatic usage encounters contention. Heikki
> intends to produce a revision that fixes the problem, though having
> considered it carefully myself, I don't know what mechanism he has in
> mind, and frankly I'm skeptical.

Here's an updated patch. Hope it works now... This is based on an older
version, and doesn't include any fixes from your latest
btreelock_insert_on_dup.v7.2014_01_07.patch. Please check the common
parts, and copy over any relevant changes.

The fix for the deadlocking issue consists of a few parts. First,
there's a new heavy-weight lock type, a speculative insertion lock,
which works somewhat similarly to XactLockTableWait(), but is only held
for the duration of a single speculative insertion. When a backend is
about to begin a speculative insertion, it first acquires the
speculative insertion lock. When it's done with the insertion, meaning
it has either cancelled it by killing the already-inserted tuple or
decided that it's going to go ahead with it, the lock is released.

The speculative insertion lock is keyed by Xid and token. The lock can
be taken many times in the same transaction, and token's purpose is to
distinguish which insertion is currently in progress. The token is
simply a backend-local counter, incremented each time the lock is taken.

In addition to the heavy-weight lock, there are new fields in PGPROC to
indicate which tuple the backend is currently inserting. When the tuple
is inserted, the backend fills in the relation's relfilenode and item
pointer in MyProc->specInsert* fields, while still holding the buffer
lock. The current speculative insertion token is also stored there.

With that mechanism, when another backend sees a tuple whose xmin is
still in progress, it can check if the insertion is a speculative
insertion. To do that, scan the proc array, and find the backend with
the given xid. Then, check that the relfilenode and itempointer in that
backend's PGPROC slot match the tuple, and make note of the token the
backend had advertised.

HeapTupleSatisfiesDirty() does the proc array check, and returns the
token in the snapshot, alongside snapshot->xmin. The caller can then use
that information in place of XactLockTableWait().


There would be other ways to skin the cat, but this seemed like the
quickest to implement. One more straightforward approach would be to use
the tuple's TID directly in the speculative insertion lock's key,
instead of Xid+token, but then the inserter would have to grab the
heavy-weight lock while holding the buffer lock, which seems dangerous.
Another alternative would be to store token in the heap tuple header,
instead of PGPROC; a tuple that's still being speculatively inserted has
no xmax, so it could be placed in that field. Or ctid.

> More importantly, I have to question
> whether we should continue to pursue that alternative approach, given
> what we now know about its performance characteristics.

Yes.

> It could be
> improved, but not by terribly much, particularly for the case where
> there is plenty of update contention, which was shown in [1] to be
> approximately 2-3 times slower than extended page locking (*and* it's
> already looking for would-be duplicates *first*). I'm trying to be as
> fair as possible, and yet the difference is huge.

*shrug*. I'm not too concerned about performance during contention. But
let's see how this fixed version performs. Could you repeat the tests
you did with this?

Any guesses what the bottleneck is? At a quick glance at a profile of a
pgbench run with this patch, I didn't see anything out of ordinary, so
I'm guessing it's lock contention somewhere.

- Heikki

Attachment

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 01/08/2014 06:46 AM, Peter Geoghegan wrote:
> A new revision of my patch is attached.

I'm getting deadlocks with this patch, using the test script you posted 
earlier in 
http://www.postgresql.org/message-id/CAM3SWZQh=8xNVgbBzYHJeXUJBHwZNjUTjEZ9t-DBO9t_mX_8Kw@mail.gmail.com. 
Am doing something wrong, or is that a regression?

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Jan 10, 2014 at 7:12 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I'm getting deadlocks with this patch, using the test script you posted
> earlier in
> http://www.postgresql.org/message-id/CAM3SWZQh=8xNVgbBzYHJeXUJBHwZNjUTjEZ9t-DBO9t_mX_8Kw@mail.gmail.com.
> Am doing something wrong, or is that a regression?

Yes. The point of that test case was that it made your V1 livelock
(which you fixed), not that it deadlocks in a way detected by the
deadlock detector, which is the correct behavior.

This testcase was the one that showed up *unprincipled* deadlocking:

http://www.postgresql.org/message-id/CAM3SWZShbE29KpoD44cVc3vpZJGmDer6k_6FGHiSzeOZGmTFSQ@mail.gmail.com

I'd focus on that test case.
-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 01/10/2014 08:37 PM, Peter Geoghegan wrote:
> On Fri, Jan 10, 2014 at 7:12 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> I'm getting deadlocks with this patch, using the test script you posted
>> earlier in
>> http://www.postgresql.org/message-id/CAM3SWZQh=8xNVgbBzYHJeXUJBHwZNjUTjEZ9t-DBO9t_mX_8Kw@mail.gmail.com.
>> Am doing something wrong, or is that a regression?
>
> Yes. The point of that test case was that it made your V1 livelock
> (which you fixed), not deadlock in a way detected by the deadlock
> detector, which is the correct behavior.

Oh, ok. Interesting. With the patch version I posted today, I'm not 
getting deadlocks. I'm not getting duplicates in the table either, so it 
looks like the promise tuple approach somehow avoids the deadlocks, 
while the btreelock patch does not.

Why does it deadlock with the btreelock patch? I don't see why it 
should. If you have two backends inserting a single tuple, and they 
conflict, one of them should succeed to insert, and the other one should 
update.

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Jan 10, 2014 at 11:28 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Why does it deadlock with the btreelock patch? I don't see why it should. If
> you have two backends inserting a single tuple, and they conflict, one of
> them should succeed to insert, and the other one should update.

Are you sure that it doesn't make your patch deadlock too, with enough
pressure? I've made that mistake myself.

That test-case made my patch deadlock (in a detected fashion) when it
used buffer locks as a value locking prototype - I say as much right
there in the November mail you linked to. I think that's acceptable,
because it's a non-sensible use of the feature (my point was only that
it shouldn't livelock). The test case is naively locking a row without
knowing ahead of time (or pro-actively checking) if the conflict is on
the first or second unique index. So before too long, you're updating
the "wrong" row (no existing lock is really held), based on the 'a'
column's projected value, when in actuality the conflict was on the
'b' column's projected value. Conditions are right for deadlock,
because two rows are locked, not one.
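
A sketch of the shape of that hazard, with a hypothetical
two-unique-index table (not the actual test script):

create table tab (a int4 primary key, b int4 unique, v text);

with r as (
insert into tab(a, b, v)
values (3, 2, 'x')
on duplicate key lock for update
returning rejects *
)
update tab set v = r.v from r where tab.a = r.a;

-- If the conflict was on b, the insert locks the row with b = 2, while
-- the update joins on a and touches some other row with a = 3. With two
-- rows locked per session, two concurrent sessions can deadlock.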

Although I have not yet properly considered your most recent revision,
I can't imagine why the same would not apply there, since the row
locking component is (probably) still identical. Granted, that
distinction between row locking and value locking is a bit fuzzy in
your approach, but if you happened to not insert any rows in any
previous iterations (i.e. there were no unfilled promise tuples), and
you happened to perform conflict handling first, it could still
happen, albeit with lower probability, no?

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 01/10/2014 10:00 PM, Peter Geoghegan wrote:
> On Fri, Jan 10, 2014 at 11:28 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Why does it deadlock with the btreelock patch? I don't see why it should. If
>> you have two backends inserting a single tuple, and they conflict, one of
>> them should succeed to insert, and the other one should update.
>
> Are you sure that it doesn't make your patch deadlock too, with enough
> pressure? I've made that mistake myself.
>
> That test-case made my patch deadlock (in a detected fashion) when it
> used buffer locks as a value locking prototype - I say as much right
> there in the November mail you linked to. I think that's acceptable,
> because it's non-sensible use of the feature (my point was only that
> it shouldn't livelock). The test case is naively locking a row without
> knowing ahead of time (or pro-actively checking) if the conflict is on
> the first or second unique index. So before too long, you're updating
> the "wrong" row (no existing lock is really held), based on the 'a'
> column's projected value, when in actuality the conflict was on the
> 'b' column's projected value. Conditions are right for deadlock,
> because two rows are locked, not one.

I see. Yeah, I also get deadlocks when I change update statement to use 
"foo.b = rej.b" instead of "foo.a = rej.a". I think it's down to the 
indexes are processed, ie. which conflict you see first.

This is pretty much the same issue we discussed wrt. exclusion
constraints. If the tuple being inserted conflicts with several existing
tuples, what to do? I think the best answer would be to return and lock 
them all. It could still deadlock, but it's nevertheless less surprising 
behavior than returning one of the tuples at random. Actually, we could 
even avoid the deadlock by always locking the tuples in a certain order, 
although I'm not sure if it's worth the trouble.
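
That would be the locking analogue of the familiar application-level
idiom for avoiding lock-order deadlocks (a generic illustration,
nothing from either patch):

-- lock overlapping sets of rows in a consistent order:
select * from tab where k in (1, 2, 3) order by k for update;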

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Jan 10, 2014 at 1:25 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> This is pretty much the same issue we discussed wrt. exclusion constraints.
> If the tuple being inserted conflicts with several existing tuples, what to
> do? I think the best answer would be to return and lock them all. It could
> still deadlock, but it's nevertheless less surprising behavior than
> returning one of the tuples in random. Actually, we could even avoid the
> deadlock by always locking the tuples in a certain order, although I'm not
> sure if it's worth the trouble.

I understand and accept that as long as we're intent on locking more
than one row per transaction, that action could deadlock with another
session doing something similar. Actually, I've even encountered
people giving advice in relation to proprietary systems along the
lines of: "if your big SQL MERGE statement is deadlocking excessively,
you might try hinting to make sure a nested loop join is used". I
think that this kind of ugly compromise is unavoidable in those
scenarios (in reality the most popular strategy is probably "cross
your fingers"). But as everyone agrees, the common case where an xact
only upserts one row should never deadlock with another, similar xact.
So *that* isn't a problem I have with making row locking work for
exclusion constraints.

My problem is that in general I'm not sold on the actual utility of
making this kind of row locking work with exclusion constraints. I'm
sincerely having a hard time thinking of a practical use-case
(although, as I've said, I want to make it work with IGNORE). Even if
you work all this row locking stuff out, and the spill-to-disk aspect
out, the interface is still wrong, because you need to figure out a
way to project more than one reject per slot. Maybe I lack imagination
around how to make that work, but there are a lot of "ifs" and "buts"
either way.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Jim Nasby
Date:
On 1/10/14, 4:40 PM, Peter Geoghegan wrote:
> My problem is that in general I'm not sold on the actual utility of
> making this kind of row locking work with exclusion constraints. I'm
> sincerely having a hard time thinking of a practical use-case
> (although, as I've said, I want to make it work with IGNORE). Even if
> you work all this row locking stuff out, and the spill-to-disk aspect
> out, the interface is still wrong, because you need to figure out a
> way to project more than one reject per slot. Maybe I lack imagination
> around how to make that work, but there are a lot of "ifs" and "buts"
> either way.

Well, the usual example for exclusion constraints is resource scheduling
(ie: scheduling what room a class will be held in). In that context, is
it hard to believe that you might want to MERGE a set of new classroom
assignments in?
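
For reference, that use-case looks something like this (the textbook
schema; btree_gist is needed for the plain-equality part):

create extension btree_gist;

create table class_schedule (
    room   text,
    during tsrange,
    exclude using gist (room with =, during with &&)
);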
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Jan 10, 2014 at 4:09 PM, Jim Nasby <jim@nasby.net> wrote:
> Well, the usual example for exclusion constraints is resource scheduling
> (ie: scheduling what room a class will be held in). In that context is it
> hard to believe that you might want to MERGE a set of new classroom
> assignments in?

So you schedule a class that clashes with 3 other classes, and you
want to update all 3 rows/classes with details from your one row
proposed for insertion? That makes no sense, unless the classes were
in fixed time slots, in which case you could use a unique constraint
to begin with. You can't change the rows to have the same time range
for all 3. So you have to delete two first, and update the range of
one. Which two? And you can't really rely on having locked existing
rows operating as a kind of "persistent value lock", as I do, because
you've locked a row with a different range to the one you care about -
someone can still insert another row that doesn't block on that one
but blocks on your range. So you really do need a sophisticated, fully
formed value locking infrastructure to make it work, for a feature of
marginal utility at best. I'm having a hard time imagining any user
actually wanting to do any of this, and I'm having a harder time still
imagining anyone putting in the work to make it possible, if indeed it
is possible.

No one has ever implemented fully formed predicate locking in a
commercial database system, because it's an NP-complete problem [1],
[2]. Only very limited special cases are practicable, and I'm pretty
sure this isn't one of them.

[1] http://library.riphah.edu.pk/acm/disk_1/text/1-2/SIGMOD79/P127.PDF

[2]
http://books.google.com/books?id=wV5Ran71zNoC&pg=PA284&lpg=PA284&dq=predicate+locking+np+complete&source=bl&ots=PgNJ5H3L8V&sig=fOZ2Wr4fIxj0eFQD0tCGPLTsfY0&hl=en&sa=X&ei=PpTQUquoBMfFsATtw4CADA&ved=0CDIQ6AEwAQ#v=onepage&q=predicate%20locking%20np%20complete&f=false
--
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Jim Nasby
Date:
On 1/10/14, 6:51 PM, Peter Geoghegan wrote:
> On Fri, Jan 10, 2014 at 4:09 PM, Jim Nasby<jim@nasby.net>  wrote:
>> >Well, the usual example for exclusion constraints is resource scheduling
>> >(ie: scheduling what room a class will be held in). In that context is it
>> >hard to believe that you might want to MERGE a set of new classroom
>> >assignments in?
> So you schedule a class that clashes with 3 other classes, and you
> want to update all 3 rows/classes with details from your one row
> proposed for insertion?

Nuts, I was misunderstanding the scenario. I thought this was simply going to violate exclusion constraints.

I see what you're saying now, and I'm not coming up with a scenario either. Perhaps Jeff Davis could, since he created
them...if he can't then I'd say we're safe ignoring that aspect.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Jan 10, 2014 at 7:09 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Nah, that's nothing. I have a patch in the January commitfest that was
> already posted for the previous commitfest. It received zero review back
> then, and still has no reviewer signed up, let alone anyone actually
> reviewing it. And arguably it's a bug fix!
>
> http://www.postgresql.org/message-id/5285071B.1040100@vmware.com
>
> Wink wink, if you're looking for patches to review... ;-)

Yeah, I did intend to take a closer look at that one (I've looked at
it but have nothing to share yet). I've been a little busy with other
things. That patch is more of the kind where it's a matter of
determining if what you've done is exactly correct (no one would
disagree with the substance of what you propose), whereas there is
uncertainty about whether I've gotten the semantics right and so on.
But that's no excuse. :-)

>>   The alternative exclusion* patch still deadlocks in an unprincipled
>> fashion

> Here's an updated patch. Hope it works now... This is based on an older
> version, and doesn't include any fixes from your latest
> btreelock_insert_on_dup.v7.2014_01_07.patch. Please check the common parts,
> and copy over any relevant changes.

Okay, attached is a revision with some of my fixes for other parts of
the code merged (in particular, for the grammar, ecpg and some aspects
of row locking and visibility).

Some quick observations on your patch - maybe this is obvious, and you
have work-arounds in mind, but this is just my first impression:

* You're always passing HEAP_INSERT_SPECULATIVE to heap_insert, and
therefore in the event of any sort of insertion always taking an
exclusive ProcArrayLock. I guess the fact that this always happens,
and not just when upserting, is an oversight (I know you just wanted
to throw together a POC), but even still that seems kind of
questionable. Everyone knows that contention during GetSnapshotData is
still a big problem for us. Taking an exclusive ProcArrayLock perhaps
more than once per slot seems like a really bad idea, even if it's
limited to speculative inserters.

* It seems questionable that you don't at least have a shared
ProcArrayLock when you set the token value in
SetSpeculativeInsertionToken() (as you know, MyPgXact->xmin can be set
with such a shared lock, so doing something similar here might be
okay, but it's far from obvious that no lock is okay). Now, I guess
you can point towards MinimumActiveBackends() as a kind of precedent,
but that seems substantially less scary than what you've done, because
that's just reading if a field is zero or non-zero. Obviously the
implications of actually doing this are that things get even worse for
performance. And even a shared lock might not be good enough - I'd
have to think about it some more to give a firmer opinion.

> The fix for the deadlocking issue consists of a few parts. First, there's a
> new heavy-weight lock type, a speculative insertion lock, which works
> somewhat similarly to XactLockTableWait(), but is only held for the duration
> of a single speculative insertion. When a backend is about to begin a
> speculative insertion, it first acquires the speculative insertion lock.
> When it's done with the insertion, meaning it has either cancelled it by
> killing the already-inserted tuple or decided that it's going to go ahead
> with it, the lock is released.

I'm afraid I must reiterate my earlier objection to the general thrust
of what you're doing, which is that it is evidently unnecessary to
spread knowledge of value locking around the system, as opposed to
localizing knowledge of it to one module, in this case nbtinsert.c.
While it's true that the idea of the AM abstraction is already perhaps
a little strained, this seems like a big expansion on that problem.
Why should this approach make sense for every conceivable AM that
supports some notion of a constraint? Heavyweight exclusive locks on
indexes (at the page level typically), persisting across complex
operations are not a new thing for Postgres.

> HeapTupleSatisfiesDirty() does the proc array check, and returns the token
> in the snapshot, alongside snapshot->xmin. The caller can then use that
> information in place of XactLockTableWait().

That seems like a modularity violation too. The HeapTupleSatisfiesMVCC
changes reflect a genuine need to make every MVCC snapshot care about
the special visibility exception, whereas only one or two
HeapTupleSatisfiesDirty() callers will ever care about speculative
insertion. Even if you're unmoved by the modularity/aesthetic argument
(which is not to suppose that you actually are), the fact that you're
calling SpeculativeInsertionIsInProgress(), which acquires a shared
ProcArrayLock much of the time from within HeapTupleSatisfiesDirty(),
may have seriously regressed foreign key enforcement, for example.
You're going to need something like a new type of snapshot, basically,
and we probably already have too many of those. But then, can you
really get away with a new snapshot type so most existing places are
unaffected? Why shouldn't ri_triggers.c have to care? Offhand I think
it must care, unless you go give it some special knowledge too. So you
either risk regressing performance badly, or play whack-a-mole with
all of the other dirty snapshot callsites. That seems like a textbook
example of a really bad modularity violation. The consequences may
spread beyond that, further than we can easily predict.

>> More importantly, I have to question
>> whether we should continue to pursue that alternative approach, given
>> what we now know about its performance characteristics.
>
> Yes.

Okay. Unfortunately, I must press you on this point: what is it that
you don't like about what I've done? What aspects of my approach
concern you, and specifically what aspects of my approach do you hope
to avoid? If you take a close look at how value locking is performed,
it actually is very similar to the existing mechanism,
counterintuitive though that is. It's a modest expansion on how things
already work. I contend that my approach is, apart from everything
else, the more conservative of the two.

>> It could be
>> improved, but not by terribly much, particularly for the case where
>> there is plenty of update contention, which was shown in [1] to be
>> approximately 2-3 times slower than extended page locking (*and* it's
>> already looking for would-be duplicates *first*). I'm trying to be as
>> fair as possible, and yet the difference is huge.
>
> *shrug*. I'm not too concerned about performance during contention. But
> let's see how this fixed version performs. Could you repeat the tests you
> did with this?

Why would you not be too concerned about the performance with
contention? It's a very important aspect. But even if you don't, if
you look at the transaction throughput with only a single client in
the update-heavy benchmark [1] (with one client there is a fair mix of
inserts and updates), my approach still comes out far ahead.
Transaction throughput is almost 100% higher, with the *difference*
exceeding 150% at 8 clients but never reaching too much higher. I
think that the problem isn't so much with contention between clients
as much as with contention between inserts and updates, which affects
everyone to approximately the same degree. And the average max latency
across runs for one client is 130.447 ms, as opposed to 0.705 ms with
my patch - that's less than 1%. Whatever way you cut it, the
performance of my approach is far superior. Although we should
certainly investigate the impact of your most recent revision, and I
intend to, how can you not consider those differences to be extremely
significant? And honestly, I have a really hard time imagining what
you've done here had anything other than a strong negative effect on
performance, in which case the difference in performance will be wider
still.

> Any guesses what the bottleneck is? At a quick glance at a profile of a
> pgbench run with this patch, I didn't see anything out of ordinary, so I'm
> guessing it's lock contention somewhere.

See my previous remarks on "index scan with an identity crisis" [2].
I'm pretty sure it was mostly down to the way you optimistically
proceed with duplicate index tuple insertion (which you'll do even
with the btree code knowing that it almost certainly won't work out,
something that makes less sense than with deferred unique constraints,
where the user has specifically indicated that things may well work
out by making the constraint deferred in the first place). I also
think that the way my approach very effectively avoids wasted effort
(including but not limited to unnecessarily heavy locking) plays an
important role in making it perform so well. This turns out to be much
more important than the downside of having value locks be slightly
coarser than strictly necessary. When I tried to quantify that
overhead with a highly unsympathetic benchmark, the difference was
barely measurable [2][3].

[1] http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/upsert-cmp-2/

[2] http://www.postgresql.org/message-id/CAM3SWZQBhS0JriD6EfeW3MoTXy1eK-8Wdr6FvFFR0AyCDgCBvA@mail.gmail.com

[3] http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/vs-vanilla-insert/

--
Peter Geoghegan

Attachment

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Greg Stark
Date:
On Sat, Jan 11, 2014 at 12:51 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Jan 10, 2014 at 4:09 PM, Jim Nasby <jim@nasby.net> wrote:
>> Well, the usual example for exclusion constraints is resource scheduling
>> (ie: scheduling what room a class will be held in). In that context is it
>> hard to believe that you might want to MERGE a set of new classroom
>> assignments in?
>
> So you schedule a class that clashes with 3 other classes, and you
> want to update all 3 rows/classes with details from your one row
> proposed for insertion?


Well, perhaps you want to mark the events as conflicting with your new event?

-- 
greg



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Jan 10, 2014 at 10:01 PM, Greg Stark <stark@mit.edu> wrote:
>> So you schedule a class that clashes with 3 other classes, and you
>> want to update all 3 rows/classes with details from your one row
>> proposed for insertion?
>
>
> Well, perhaps you want to mark the events as conflicting with your new event?

But short of a sophisticated persistent value locking implementation
(which I'm pretty skeptical of the feasibility of), more conflicting
events could be added at any moment. I doubt that you're appreciably
better off than if you were to simply check with a select query,
even though that approach is obviously broken. In general, making row
locking work for exclusion constraints, so you can treat them in a way
that allows you to merge on arbitrary operators, seems to me like a tar
pit.


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Fri, Jan 10, 2014 at 7:59 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> *shrug*. I'm not too concerned about performance during contention. But
>> let's see how this fixed version performs. Could you repeat the tests you
>> did with this?
>
> Why would you not be too concerned about the performance with
> contention? It's a very important aspect. But even if you don't, if
> you look at the transaction throughput with only a single client in
> the update-heavy benchmark [1] (with one client there is a fair mix of
> inserts and updates), my approach still comes out far ahead.
> Transaction throughput is almost 100% higher, with the *difference*
> exceeding 150% at 8 clients but never reaching too much higher. I
> think that the problem isn't so much with contention between clients
> as much as with contention between inserts and updates, which affects
> everyone to approximately the same degree. And the average max latency
> across runs for one client is 130.447 ms, as opposed to 0.705 ms with
> my patch - that's less than 1%. Whatever way you cut it, the
> performance of my approach is far superior. Although we should
> certainly investigate the impact of your most recent revision, and I
> intend to, how can you not consider those differences to be extremely
> significant?

So I re-ran the same old benchmark, where we're almost exclusively
updating. Results for your latest revision were very similar to my
patch:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/exclusion-no-deadlock/

This suggests that the main problem encountered was lock contention
among old, broken promise tuples. Note that this benchmark doesn't
involve any checkpointing, and everything fits in memory.
Opportunistic pruning is possible, which I'd imagine helps a lot with
the bloat, at least in this benchmark - there are only ever 100,000
live tuples. That might not always be true, of course.

In any case, my patch is bound to win decisively for the other
extreme, the insert-only case, because the overhead of doing an index
scan first is always wasted there with your approach, and the overhead
of extended btree leaf page locking has been shown to be quite low. In
the past you've spoken of avoiding that overhead through an adaptive
strategy based on statistics, but I think you'll have a hard time
beating a strategy where the decision comes as late as possible, and
is informed by highly localized page-level metadata already available.
My implementation can abort an attempt to just read an existing
would-be duplicate very inexpensively (with no strong locks), going
back to just after the _bt_search() to get a heavyweight lock if just
reading doesn't work out (if there is no duplicate found), so as to
not waste all of its prior work. Doing one of the two extremes of
insert-mostly or update-only well is relatively easy; dynamically
adapting to one or the other is much harder. Especially if it's a
consistent mix of inserts and updates, where general observations
aren't terribly useful.

All other concerns of mine still remain, including the concern over
the extra locking of the proc array - I'm concerned about the
performance impact of that on other parts of the system not exercised
by this test.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sat, Jan 11, 2014 at 2:39 AM, Peter Geoghegan <pg@heroku.com> wrote:
> So I re-ran the same old benchmark, where we're almost exclusively
> updating. Results for your latest revision were very similar to my
> patch:
>
> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/exclusion-no-deadlock/

To put that in context, here is a previously unpublished repeat of the
same benchmark on the slightly improved second most recently submitted
revision of mine, v6:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/upsert-cmp-3/

(recall that I improved things a bit by remembering row-locking
conflicts, not just conflicts when we try value locking - that made a
small additional difference, reflected here but not in /upsert-cmp-2/).

The numbers for each patch are virtually identical. I guess I could
improve my patch by not always getting a heavyweight lock on the first
insert attempt, based on the general observation that we have
previously always updated. My concern would be that that would happen
at the expense of the other case.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 01/11/2014 12:40 AM, Peter Geoghegan wrote:
> My problem is that in general I'm not sold on the actual utility of
> making this kind of row locking work with exclusion constraints. I'm
> sincerely having a hard time thinking of a practical use-case
> (although, as I've said, I want to make it work with IGNORE). Even if
> you work all this row locking stuff out, and the spill-to-disk aspect
> out, the interface is still wrong, because you need to figure out a
> way to project more than one reject per slot. Maybe I lack imagination
> around how to make that work, but there are a lot of "ifs" and "buts"
> either way.

Exclusion constraints can be used to implement uniqueness checks with 
SP-GiST or GiST indexes. For example, if you want to enforce that there 
are no two tuples with the same x and y coordinates, ie. use a point as 
the key. You could add a b-tree index just to enforce the constraint, 
but it's better if you don't have to. In general, it's just always 
better if features don't have implementation-specific limitations like this.
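
For instance (a minimal sketch, assuming the built-in GiST point
opclass's ~= "same as" operator):

create table points (
    p point,
    exclude using gist (p with ~=)
);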

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 01/11/2014 12:39 PM, Peter Geoghegan wrote:
> In any case, my patch is bound to win decisively for the other
> extreme, the insert-only case, because the overhead of doing an index
> scan first is always wasted there with your approach, and the overhead
> of extended btree leaf page locking has been shown to be quite low.

Quite possibly. Run the benchmark, and we'll see how big a difference 
we're talking about.

> In
> the past you've spoken of avoiding that overhead through an adaptive
> strategy based on statistics, but I think you'll have a hard time
> beating a strategy where the decision comes as late as possible, and
> is informed by highly localized page-level metadata already available.
> My implementation can abort an attempt to just read an existing
> would-be duplicate very inexpensively (with no strong locks), going
> back to just after the _bt_search() to get a heavyweight lock if just
> reading doesn't work out (if there is no duplicate found), so as to
> not waste all of its prior work. Doing one of the two extremes of
> insert-mostly or update-only well is relatively easy; dynamically
> adapting to one or the other is much harder. Especially if it's a
> consistent mix of inserts and updates, where general observations
> aren't terribly useful.

Another way to optimize it is to keep the b-tree page pinned after doing 
the pre-check. Then you don't need to descend the tree again when doing 
the insert. That would require small indexam API changes, but wouldn't 
be too invasive, I think.

> All other concerns of mine still remain, including the concern over
> the extra locking of the proc array - I'm concerned about the
> performance impact of that on other parts of the system not exercised
> by this test.

Yeah, I'm not thrilled about that part either. Fortunately there are 
other ways to implement that. In fact, I think you could just not bother 
taking the ProcArrayLock when setting the fields. The danger is that 
another backend sees a mixed state of the fields, but that's OK. The 
worst that can happen is that it will do an unnecessary lock/release on 
the heavy-weight lock. And to reduce the overhead when reading the 
fields, you could merge the SpeculativeInsertionIsInProgress() check 
into TransactionIdIsInProgress(). The call site in tqual.c always calls 
it together with TransactionIdIsInProgress(), which scans the proc array 
anyway, while holding the lock.

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Jan 13, 2014 at 12:23 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Exclusion constraints can be used to implement uniqueness checks with
> SP-GiST or GiST indexes. For example, if you want to enforce that there are
> no two tuples with the same x and y coordinates, ie. use a point as the key.
> You could add a b-tree index just to enforce the constraint, but it's better
> if you don't have to. In general, it's just always better if features don't
> have implementation-specific limitations like this.

That seems rather narrow. Among other things, I worry about the
baggage for users in documenting support for SP-GiST/GiST. "We support
it, but it only really works for the case where you're using exclusion
constraints as unique constraints, something that might make sense in
certain narrow contexts, contrary to our earlier general statement
that a unique index should be preferred there". We catalog amcanunique
methods as the way that we support unique indexes. I really do feel
that that's the appropriate level to support the feature at, and I
have not precluded other amcanunique implementations from doing the
same, having documented the intended value locking interface/contract
for the benefit of any future amcanunique AM author. It's ON DUPLICATE
KEY, not ON OVERLAPPING KEY, or any other syntax suggestive of
exclusion constraints and their arbitrary commutative operators.


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Mon, Jan 13, 2014 at 1:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Jan 13, 2014 at 12:23 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Exclusion constraints can be used to implement uniqueness checks with
>> SP-GiST or GiST indexes. For example, if you want to enforce that there are
>> no two tuples with the same x and y coordinates, ie. use a point as the key.
>> You could add a b-tree index just to enforce the constraint, but it's better
>> if you don't have to. In general, it's just always better if features don't
>> have implementation-specific limitations like this.
>
> That seems rather narrow. Among other things, I worry about the
> baggage for users in documenting supporting SP-GiST/GiST. "We support
> it, but it only really works for the case where you're using exclusion
> constraints as unique constraints, something that might make sense in
> certain narrow contexts, contrary to our earlier general statement
> that a unique index should be preferred there". We catalog amcanunique
> methods as the way that we support unique indexes. I really do feel
> that that's the appropriate level to support the feature at, and I
> have not precluded other amcanunique implementations from doing the
> same, having documented the intended value locking interface/contract
> for the benefit of any future amcanunique AM author. It's ON DUPLICATE
> KEY, not ON OVERLAPPING KEY, or any other syntax suggestive of
> exclusion constraints and their arbitrary commutative operators.

For what it's worth, I agree with Heikki.  There's probably nothing
sensible an upsert can do if it conflicts with more than one tuple,
but if it conflicts with just exactly one, it oughta be OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Jan 13, 2014 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> For what it's worth, I agree with Heikki.  There's probably nothing
> sensible an upsert can do if it conflicts with more than one tuple,
> but if it conflicts with just exactly one, it oughta be OK.

If there is exactly one, *and* the existing value is exactly the same
as the value proposed for insertion (or, I suppose, a subset of the
existing value, but that's so narrow that it might as well not apply).
In short, when you're using an exclusion constraint as a unique
constraint. Which is very narrow indeed. Weighing the costs and the
benefits, that seems like far more cost than benefit, before we even
consider anything beyond simply explaining the applicability and
limitations of upserting with exclusion constraints. It's generally
far cleaner to define speculative insertion as something that happens
with unique indexes only.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 01/13/2014 10:53 PM, Peter Geoghegan wrote:
> On Mon, Jan 13, 2014 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> For what it's worth, I agree with Heikki.  There's probably nothing
>> sensible an upsert can do if it conflicts with more than one tuple,
>> but if it conflicts with just exactly one, it oughta be OK.
>
> If there is exactly one, *and* the existing value is exactly the same
> as the value proposed for insertion (or, I suppose, a subset of the
> existing value, but that's so narrow that it might as well not apply).
> In short, when you're using an exclusion constraint as a unique
> constraint. Which is very narrow indeed. Weighing the costs and the
> benefits, that seems like far more cost than benefit, before we even
> consider anything beyond simply explaining the applicability and
> limitations of upserting with exclusion constraints. It's generally
> far cleaner to define speculative insertion as something that happens
> with unique indexes only.

Well, even if you don't agree that locking all the conflicting rows for 
update is sensible, it's still perfectly sensible to return the rejected 
rows to the user. For example, you're inserting N rows, and if some of 
them violate a constraint, you still want to insert the non-conflicting 
rows instead of rolling back the whole transaction.
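
For instance, something like this (a sketch; "staging" is a made-up
source table):

insert into tab(k, v)
select k, v from staging
on duplicate key lock for update
returning rejects *;

The non-conflicting rows are inserted, and only the conflicting ones
come back to the client (or to a wCTE), rather than the whole statement
rolling back.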

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Jan 13, 2014 at 12:58 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Well, even if you don't agree that locking all the conflicting rows for
> update is sensible, it's still perfectly sensible to return the rejected
> rows to the user. For example, you're inserting N rows, and if some of them
> violate a constraint, you still want to insert the non-conflicting rows
> instead of rolling back the whole transaction.

Right, but with your approach, can you really be sure that you have
the ctid of the right conflicting tuple (as opposed to the reject
itself)? In other words, as you
wait for the exclusion constraint to conclusively indicate that there
is a conflict, minutes may have passed in which time other conflicts
may emerge in earlier unique indexes. Whereas with an approach where
values are locked, you are guaranteed that earlier unique indexes have
no conflicting values. Maintaining that property seems useful, since
we check in a well-defined order, and we're still projecting a ctid.
Unlike when row locking is involved, we can make no assumptions or
generalizations around where conflicts will occur. Although that may
also be a general concern with your approach when row locking, for
multi-master replication use-cases. There may be some value in knowing
it cannot have been earlier unique indexes (and so the existing values
for those unique indexes in the locked row should stay the same -
don't many conflict resolution policies work that way?).


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Jan 13, 2014 at 12:49 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> In any case, my patch is bound to win decisively for the other
>> extreme, the insert-only case, because the overhead of doing an index
>> scan first is always wasted there with your approach, and the overhead
>> of extended btree leaf page locking has been shown to be quite low.
>
> Quite possibly. Run the benchmark, and we'll see how big a difference we're
> talking about.

I'll come up with something and let you know.

> Another way to optimize it is to keep the b-tree page pinned after doing the
> pre-check. Then you don't need to descend the tree again when doing the
> insert. That would require small indexam API changes, but wouldn't be too
> invasive, I think.

You'll still need a callback to drop the pin when it transpires that
there is a conflict in a later unique index, and state to pass a bt
stack back, at which point you've already made exactly the same
changes to the AM interface as in my proposal. The only difference is
that the core code doesn't rely on the value locks being released
after an instant, but that isn't something that you take advantage of.
Furthermore, AFAIK there is no reason to think that anything other
than btree will benefit, which makes it a bit unfortunate that the AM
has to support it generally.

So, again, it's kind of a modularity violation, and it may not even
actually be possible, since _bt_search() is only callable with an
insertion scankey, which is the context in which the existing
guarantee around releasing locks and re-searching from that point
applies, for reasons that seem to me to be very subtle. At the very
least you need to pass a btstack to _bt_doinsert() to save the work of
re-scanning, as I do.

>> All other concerns of mine still remain, including the concern over
>> the extra locking of the proc array - I'm concerned about the
>> performance impact of that on other parts of the system not exercised
>> by this test.
>
> Yeah, I'm not thrilled about that part either. Fortunately there are other
> ways to implement that. In fact, I think you could just not bother taking
> the ProcArrayLock when setting the fields. The danger is that another
> backend sees a mixed state of the fields, but that's OK. The worst that can
> happen is that it will do an unnecessary lock/release on the heavy-weight
> lock. And to reduce the overhead when reading the fields, you could merge
> the SpeculativeInsertionIsInProgress() check into
> TransactionIdIsInProgress(). The call site in tqual.c always calls it
> together with TransactionIdIsInProgress(), which scans the proc array
> anyway, while holding the lock.

Currently in your patch all insertions do
SpeculativeInsertionLockAcquire(GetCurrentTransactionId()) -
presumably this is not something you intend to keep. Also, you should
not do this for regular insertion:

if (options & HEAP_INSERT_SPECULATIVE)SetSpeculativeInsertion(relation->rd_node, &heaptup->t_self);

Can you explain the following, please?:

+ /*
+  * Returns a speculative insertion token for waiting for the insertion to
+  * finish.
+  */
+ uint32
+ SpeculativeInsertionIsInProgress(TransactionId xid, RelFileNode rel,
+                                  ItemPointer tid)
+ {
+     uint32        result = 0;
+     ProcArrayStruct *arrayP = procArray;
+     int            index;

Why is this optimization correct? Presently it allows your patch to
avoid getting a shared ProcArrayLock from HeapTupleSatisfiesDirty().

+     if (TransactionIdPrecedes(xid, TransactionXmin))
+         return false;

So from HeapTupleSatisfiesDirty(), you're checking if "xid" (the
passed tuple's xmin) precedes our transaction's xmin (well, that of
our last snapshot updated by GetSnapshotData()). This is set within
GetSnapshotData(), but we're dealing with a dirty snapshot with no
xmin, so TransactionXmin pertains to our MVCC snapshot, not our dirty
snapshot.

It isn't really true that TransactionIdIsInProgress() gets the same
shared ProcArrayLock in a similar fashion, for a full linear search; I
think that the various fast-paths make it far less likely than it is
for SpeculativeInsertionIsInProgress() (or, perhaps, should be). Here
is what that other routine does in around the same place:
/*
 * Don't bother checking a transaction older than RecentXmin; it could not
 * possibly still be running.  (Note: in particular, this guarantees that
 * we reject InvalidTransactionId, FrozenTransactionId, etc as not
 * running.)
 */
if (TransactionIdPrecedes(xid, RecentXmin))
{
    xc_by_recent_xmin_inc();
    return false;
}
 

This extant code checks against RecentXmin, *not* TransactionXmin.  It
also caches things quite effectively, but that caching isn't very
useful to you here. It checks latestCompletedXid before doing a linear
search through the proc array too.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Jan 13, 2014 at 6:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
> + uint32
> + SpeculativeInsertionIsInProgress(TransactionId xid, RelFileNode rel,
> +                                  ItemPointer tid)
> + {

For the purposes of preventing unprincipled deadlocking, commenting
out the following (the only caller of the above) has no immediately
discernible effect with any of the test-cases that I've published:
              /* XXX shouldn't we fall through to look at xmax? */
+             /* XXX why? or is that now covered by the above check? */
+             snapshot->speculativeToken =
+                 SpeculativeInsertionIsInProgress(HeapTupleHeaderGetRawXmin(tuple),
+                                                  rnode,
+                                                  &htup->t_self);
+
+             snapshot->xmin = HeapTupleHeaderGetRawXmin(tuple);
              return true;        /* in insertion by other */

I think that the prevention of unprincipled deadlocking is all down to
this immediately prior piece of code, at least in those test cases:

!             /*
!              * in insertion by other.
!              *
!              * Before returning true, check for the special case that the
!              * tuple was deleted by the same transaction that inserted it.
!              * Such a tuple will never be visible to anyone else, whether
!              * the transaction commits or aborts.
!              */
!             if (!(tuple->t_infomask & HEAP_XMAX_INVALID) &&
!                 !(tuple->t_infomask & HEAP_XMAX_COMMITTED) &&
!                 !(tuple->t_infomask & HEAP_XMAX_IS_MULTI) &&
!                 !HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
!                 HeapTupleHeaderGetRawXmax(tuple) == HeapTupleHeaderGetRawXmin(tuple))
!             {
!                 return false;
!             }

But why should it be acceptable to change the semantics of dirty
snapshots like this, which previously always returned true when
control reached here? It is a departure from their traditional
behavior, not limited to clients of this new promise tuple
infrastructure. Now, whether or not we block becomes entirely a matter
of whether we tried to insert before or after the deleting xact's
deletion (of a tuple it originally inserted). So in general we don't
get to "keep our old value locks" until xact end when we update or
delete. Even if you don't consider this a bug for existing dirty
snapshot clients (I myself do - we can't rely on deleting a row and
re-inserting the same values now, which could be particularly
undesirable for updates), I have already described how we can take
advantage of deleting tuples while still holding on to their "value
locks" [1] to Andres. I think it'll be very important for multi-master
conflict resolution. I've already described this useful property of
dirty snapshots numerous times on this thread in relation to different
aspects, as it happens. It's essential.

Anyway, I guess you're going to need an infomask bit to fix this, so
you can differentiate between 'promise' tuples and 'proper' tuples.
Those are in short supply. I still think this problem is more or less
down to a modularity violation, and I suspect that this is not the
last problem that will be found along these lines if we continue to
pursue this approach.
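
For what it's worth, with such a bit the dirty-snapshot side could be
as simple as the following sketch (HEAP_PROMISE is a hypothetical
infomask bit; the other xmax qualifications from the check quoted above
are omitted for brevity):

/*
 * Sketch only: with a dedicated bit, only killed *promise* tuples are
 * disregarded, while ordinary inserted-then-deleted tuples keep their
 * traditional dirty-snapshot visibility.
 */
if ((tuple->t_infomask & HEAP_PROMISE) &&
    HeapTupleHeaderGetRawXmax(tuple) == HeapTupleHeaderGetRawXmin(tuple))
    return false;        /* killed promise tuple; safe to ignore */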

[1] http://www.postgresql.org/message-id/CAM3SWZQpLSGPS2Kd=-n6HVYiqkF_mCxmX-Q72ar9UPzQ-X6F6Q@mail.gmail.com
-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 01/14/2014 12:20 PM, Peter Geoghegan wrote:
> I think that the prevention of unprincipled deadlocking is all down to
> this immediately prior piece of code, at least in those test cases:

> !             /*
> !              * in insertion by other.
> !              *
> !              * Before returning true, check for the special case that the
> !              * tuple was deleted by the same transaction that inserted it.
> !              * Such a tuple will never be visible to anyone else, whether
> !              * the transaction commits or aborts.
> !              */
> !             if (!(tuple->t_infomask & HEAP_XMAX_INVALID) &&
> !                 !(tuple->t_infomask & HEAP_XMAX_COMMITTED) &&
> !                 !(tuple->t_infomask & HEAP_XMAX_IS_MULTI) &&
> !                 !HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
> !                 HeapTupleHeaderGetRawXmax(tuple) == HeapTupleHeaderGetRawXmin(tuple))
> !             {
> !                 return false;
> !             }
>
> But why should it be acceptable to change the semantics of dirty
> snapshots like this, which previously always returned true when
> control reached here? It is a departure from their traditional
> behavior, not limited to clients of this new promise tuple
> infrastructure. Now, whether or not we block becomes entirely a matter
> of whether we tried to insert before or after the deleting xact's
> deletion (of a tuple it originally inserted). So in general we don't
> get to "keep our old value locks" until xact end when we update or
> delete.

Hmm. So the scenario would be that a process inserts a tuple, but kills 
it again later in the transaction, and then re-inserts the same value. 
The expectation is that because it inserted the value once already, 
inserting it again will not block. Ie. inserting and deleting a tuple 
effectively acquires a value-lock on the inserted values.

> Even if you don't consider this a bug for existing dirty
> snapshot clients (I myself do - we can't rely on deleting a row and
> re-inserting the same values now, which could be particularly
> undesirable for updates),

Yeah, it would be bad if updates start failing because of this. We could 
add a check for that, and return true if the tuple was updated rather 
than deleted.

> I have already described how we can take
> advantage of deleting tuples while still holding on to their "value
> locks" [1] to Andres. I think it'll be very important for multi-master
> conflict resolution. I've already described this useful property of
> dirty snapshots numerous times on this thread in relation to different
> aspects, as it happens. It's essential.

I didn't understand that description.

> Anyway, I guess you're going to need an infomask bit to fix this, so
> you can differentiate between 'promise' tuples and 'proper' tuples.

Yeah, that's one way. Or you could set xmin to invalid, to make the 
killed tuple look thoroughly dead to everyone.

> Those are in short supply. I still think this problem is more or less
> down to a modularity violation, and I suspect that this is not the
> last problem that will be found along these lines if we continue to
> pursue this approach.

You have suspected that many times throughout this thread, and every 
time there's been a relatively simple solution to the issues you've 
raised. I suspect that's also going to be true for whatever mundane next 
issue you come up with.

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 01/14/2014 12:44 AM, Peter Geoghegan wrote:
> On Mon, Jan 13, 2014 at 12:58 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Well, even if you don't agree that locking all the conflicting rows for
>> update is sensible, it's still perfectly sensible to return the rejected
>> rows to the user. For example, you're inserting N rows, and if some of them
>> violate a constraint, you still want to insert the non-conflicting rows
>> instead of rolling back the whole transaction.
>
> Right, but with your approach, can you really be sure that you have
> the right rejecting tuple ctid (not reject)? In other words, as you
> wait for the exclusion constraint to conclusively indicate that there
> is a conflict, minutes may have passed in which time other conflicts
> may emerge in earlier unique indexes. Whereas with an approach where
> values are locked, you are guaranteed that earlier unique indexes have
> no conflicting values. Maintaining that property seems useful, since
> we check in a well-defined order, and we're still projecting a ctid.
> Unlike when row locking is involved, we can make no assumptions or
> generalizations around where conflicts will occur. Although that may
> also be a general concern with your approach when row locking, for
> multi-master replication use-cases. There may be some value in knowing
> it cannot have been earlier unique indexes (and so the existing values
> for those unique indexes in the locked row should stay the same -
> don't many conflict resolution policies work that way?).

I don't understand what you're saying. Can you give an example?

In the use case I was envisioning above - you insert N rows, and if 
any of them violate a constraint, you still want to insert the 
non-violating rows instead of rolling back the whole transaction - you 
don't care. You don't care what existing rows the new rows conflicted 
with.

Even if you want to know what you conflicted with, I can't make sense of 
what you're saying. In the btreelock approach, the value locks are 
immediately released once you discover that there's conflict. So by the 
time you get to do anything with the ctid of the existing tuple you 
conflicted with, new conflicting tuples might've appeared.

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Jan 14, 2014 at 2:43 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Hmm. So the scenario would be that a process inserts a tuple, but kills it
> again later in the transaction, and then re-inserts the same value. The
> expectation is that because it inserted the value once already, inserting it
> again will not block. Ie. inserting and deleting a tuple effectively
> acquires a value-lock on the inserted values.

Right.

> Yeah, it would be bad if updates start failing because of this. We could add
> a check for that, and return true if the tuple was updated rather than
> deleted.

Why would you fix it that way?

>> I have already described how we can take
>> advantage of deleting tuples while still holding on to their "value
>> locks" [1] to Andres. I think it'll be very important for multi-master
>> conflict resolution. I've already described this useful property of
>> dirty snapshots numerous times on this thread in relation to different
>> aspects, as it happens. It's essential.
>
> I didn't understand that description.

I was describing how deleting existing locked rows, and re-inserting,
could deal with multiple conflicts for multi-master replication
use-cases. It hardly matters much though, because it's not as if the
usefulness and necessity of this property of dirty snapshots is in
question.

>> Anyway, I guess you're going to need an infomask bit to fix this, so
>> you can differentiate between 'promise' tuples and 'proper' tuples.
>
> Yeah, that's one way. Or you could set xmin to invalid, to make the killed
> tuple look thoroughly dead to everyone.

I think you'll have to use an infomask bit so everyone knows that
this is a promise tuple from the start. Otherwise, I suspect that
there are race conditions. The problem was that
inserted-then-deleted-in-same-xact tuples (both regular and promise)
were invisible to all xacts' dirty snapshots, when they should have
only been invisible to the deleting xact's dirty snapshot. So it isn't
obvious to me how you interlock things such that another xact doesn't
incorrectly decide that it has to wait on what is really a promise
tuple's xact for the full duration of that xact, having found no
speculative insertion token to ShareLock (which implies unprincipled
deadlocking), while at the same time ensuring that other sessions do
not fail to see as dirty-visible a same-xact-inserted-deleted
non-promise tuple (so that those other sessions correctly conclude
that it is necessary to wait for the end of the xmin/xmax xact). If
you set the xmin to invalid too late, it doesn't help any existing
waiters.

Even if setting xmin to invalid is workable, it's a strike against the
performance of your approach, because it's another heap buffer
exclusive lock.

> You have suspected that many times throughout this thread, and every time
> there's been a relatively simple solution to the issues you've raised. I
> suspect that's also going to be true for whatever mundane next issue you
> come up with.

I don't think it's a mundane issue. But in any case, you haven't
addressed why you think your proposal is more or less better than my
proposal, which is the pertinent question. You haven't given me so
much as a high level summary of whatever misgivings you may have about
it, even though I've asked you to comment on my approach to value
locking several times. You haven't pointed out that it has any
specific bug (which is not to suppose that that's because there are
none). The point is that it is not my contention that what you're
proposing is totally unworkable. Rather, I think that the original
proposal will probably ultimately perform better in all cases, is
easier to reason about and is certainly far more modular. It appears
to me to be the more conservative of the two proposals. In all
sincerity, I simply don't know what factors you're weighing here. In
saying that, I really don't mean to imply that you're assigning weight
to things in a way that I am in disagreement with. I simply don't
understand what is important to you here, and why your proposal
preserves or enhances the things that you believe are important. Would
you please explain your position along those lines?

Now, I'll concede that it will be harder to make the IGNORE syntax
work with exclusion constraints with what I've done, which would be
nice. However, in my opinion that should be given far less weight than
these other issues. It's ON DUPLICATE KEY...; no one could reasonably
assume that exclusion constraints were covered. Also, upserting with
exclusion constraints is a non-starter. It's only applicable to the
case where you're using exclusion constraints exactly as you would use
unique constraints, which has to be very rare. It will cause much more
confusion than anything else.

INSERT IGNORE in MySQL works with NOT NULL constraints, unique
constraints, and all other constraints. FWIW I think that it would be
kind of arbitrary to make IGNORE work with exclusion constraints and
not other types of constraints, whereas when it's specifically ON
DUPLICATE KEY, that seems far less surprising.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 01/14/2014 11:22 PM, Peter Geoghegan wrote:
> On Tue, Jan 14, 2014 at 2:43 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> You have suspected that many times throughout this thread, and every time
>> there's been a relatively simple solution to the issues you've raised. I
>> suspect that's also going to be true for whatever mundane next issue you
>> come up with.
>
> I don't think it's a mundane issue. But in any case, you haven't
> addressed why you think your proposal is more or less better than my
> proposal, which is the pertinent question.

1. It's simpler.

2. Works for exclusion constraints.

> You haven't given me so
> much as a high level summary of whatever misgivings you may have about
> it, even though I've asked you to comment on my approach to value
> locking several times. You haven't pointed out that it has any
> specific bug (which is not to suppose that that's because there are
> none). The point is that it is not my contention that what you're
> proposing is totally unworkable. Rather, I think that the original
> proposal will probably ultimately perform better in all cases, is
> easier to reason about and is certainly far more modular. It appears
> to me to be the more conservative of the two proposals. In all
> sincerity, I simply don't know what factors you're weighing here. In
> saying that, I really don't mean to imply that you're assigning weight
> to things in a way that I am in disagreement with. I simply don't
> understand what is important to you here, and why your proposal
> preserves or enhances the things that you believe are important. Would
> you please explain your position along those lines?

I guess that simplicity is in the eye of the beholder, but please take a 
look at git diff --stat:
 41 files changed, 1224 insertions(+), 107 deletions(-)

vs.
 50 files changed, 2215 insertions(+), 240 deletions(-)

Admittedly, some of the difference comes from the fact that you've spent 
a lot more time commenting and polishing the btreelock patch. But mostly 
I dislike the additional complexity required in b-tree for this.

I don't think B-tree locking is more conservative. The 
insert-and-then-check approach is already used by exclusion constraints, 
I'm just extending it to not abort on conflict, but do something else.

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 01/14/2014 11:22 PM, Peter Geoghegan wrote:
> The problem was that
> inserted-then-deleted-in-same-xact tuples (both regular and promise)
> were invisible to all xacts' dirty snapshots, when they should have
> only been invisible to the deleting xact's dirty snapshot.

Right.

> So it isn't
> obvious to me how you interlock things such that another xact doesn't
> incorrectly decide that it has to wait on what is really a promise
> tuple's xact for the full duration of that xact, having found no
> speculative insertion token to ShareLock (which implies unprincipled
> deadlocking), while at the same time ensuring that other sessions do
> not fail to see as dirty-visible a same-xact-inserted-deleted
> non-promise tuple (so that those other sessions correctly conclude
> that it is necessary to wait for the end of the xmin/xmax xact). If
> you set the xmin to invalid too late, it doesn't help any existing
> waiters.

If a backend finds no speculative insertion token to ShareLock, then it
really isn't a speculative insertion, and the process should sleep on
the xid as usual.
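
Roughly, the waiter side described above amounts to this (pseudo-code,
not the patch itself; SpeculativeInsertionWait() stands in for whatever
ShareLocks the token, and xwait is the xmin we'd otherwise sleep on):

if (snapshot->speculativeToken != 0)
    SpeculativeInsertionWait(xwait, snapshot->speculativeToken);
else
    XactLockTableWait(xwait);        /* ordinary case: sleep on the xid */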

Once we remove the modification in HeapTupleSatisfiesDirty() that made
it return false when xmin == xmax, the problem that arises is that
another backend that sees the killed tuple incorrectly determines that
it has to wait for that transaction to finish, even though it was a
speculatively inserted tuple that was killed, and hence can be ignored.
We can avoid that problem by setting xmin to invalid, or otherwise
marking the tuple as dead.

Attached is a patch doing that, to again demonstrate what I mean. I'm
not sure if setting xmin to invalid is really the best way to mark the
tuple dead; I don't think a tuple's xmin can currently be
InvalidTransaction under any other circumstances, so there might be some
code out there that's not prepared for it. So using an infomask bit
might indeed be better. Or something else entirely.
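
For concreteness, the core of what the patch does is just this (sketch
only; the real thing also has to get WAL-logging and any existing
waiters right):

/* Mark an aborted speculative insertion as dead to everyone */
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
HeapTupleHeaderSetXmin(tuple->t_data, InvalidTransactionId);
/* or alternatively: tuple->t_data->t_infomask |= HEAP_XMIN_INVALID; */
MarkBufferDirty(buffer);
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);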

- Heikki

Attachment

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Jan 14, 2014 at 2:16 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> I don't think it's a mundane issue. But in any case, you haven't
>> addressed why you think your proposal is more or less better than my
>> proposal, which is the pertinent question.
>
> 1. It's simpler.
>
> 2. Works for exclusion constraints.

Thank you for clarifying where you're coming from.

> I guess that simplicity is in the eye of the beholder, but please take a
> look at git diff --stat:
>
>  41 files changed, 1224 insertions(+), 107 deletions(-)
>
> vs.
>
>  50 files changed, 2215 insertions(+), 240 deletions(-)
>
> Admittedly, some of the difference comes from the fact that you've spent a
> lot more time commenting and polishing the btreelock patch. But mostly I
> dislike the additional complexity required in b-tree for this.

It's very much down to differences in how well commented and
documented each patch is. I have a fully formed amendment to the AM
interface, complete with documentation of the AM and btree aspects,
and detailed comments on how the parts fit together. Besides, you've
already explored doing something similar to what I do in order to
avoid having to refind the page (less the heavyweight locking), which
seems almost equivalent to what I propose in terms of its impact on
btree, before we consider anything else.

> I don't think B-tree locking is more conservative. The insert-and-then-check
> approach is already used by exclusion constraints, I'm just extending it to
> not abort on conflict, but do something else.

If you examine what I actually do, you'll see that it's pretty much
equivalent to how the extant value locking of unique btree indexes has
always worked. It's just that the process is staggered at an exact
point, the point where traditionally we hold no buffer locks, only a
buffer pin (although we do additionally verify that the index gives
the go-ahead before getting to later indexes, to get consensus to
proceed with insertion).

The suggestion that mine is the conservative approach is also based on
the fact that database systems have made use of page-level exclusive
locks on indexes, managed by the lock manager and persisting over
complex operations, in many different contexts for many years.  This
includes Postgres, where for many years the relcache has taken
precautions against deadlocking in such AMs by ordering the list of
indexes associated with a relation by pg_index.indexrelid. Currently
this may not be necessary, but the principle stands.

The insert-then-check approach of exclusion constraints is quite
different to what is proposed here, because exclusion constraints only
ever have to abort the xact if things don't work out. There is no
value locking. That's far easier to pin down. You definitely don't
have to do anything new with visibility.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Jan 14, 2014 at 3:25 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Attached is a patch doing that, to again demonstrate what I mean. I'm not
> sure if setting xmin to invalid is really the best way to mark the tuple
> dead; I don't think a tuple's xmin can currently be InvalidTransaction under
> any other circumstances, so there might be some code out there that's not
> prepared for it. So using an infomask bit might indeed be better. Or
> something else entirely.

Have you thought about the implications for other snapshot types (or
other tqual.c routines)? My concern is that a client of that
infrastructure (either current or future) could spuriously conclude
that a heap tuple satisfied it, when in fact only a promise tuple
satisfied it. It wouldn't necessarily follow that the promise would be
fulfilled, nor that there would be some other proper heap tuple
equivalent to that fulfilled promise tuple as far as those clients are
concerned.

heap_delete() will not call HeapTupleSatisfiesUpdate() when you're
deleting a promise tuple, which on the face of it is fine - it's
always going to technically be instantaneously invisible, because it's
always created by the same command id (i.e. HeapTupleSatisfiesUpdate()
would just return HeapTupleInvisible if called). So far so good, but
we are technically doing something else quite new - deleting a
would-be instantaneously invisible tuple. So like your concern about
setting xmin to invalid, my concern is that code may exist that treats
cmin < cmax as an invariant. Now, you might think that that would be a
manageable concern, and to be fair a look at the ComboCids code that
mostly arbitrates that stuff seems to indicate that it's okay, but
it's still worth noting.

I think you should consider breaking off the relcache parts of my
patch and committing them, because they're independently useful. If we
are going to have a lot of conflicts that need to be handled by a
heap_delete(), there is no point in inserting non-unique index tuples
for what is not yet conclusively a proper (non-promise) tuple. Those
should always come last. And even without upsert, strictly inserting
into unique indexes first seems like a useful thing relative to the
cost. Unique violations are the cause of many aborted transactions,
and there is no need to ever bloat non-unique indexes of the same slot
when that happens.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Tue, Jan 14, 2014 at 3:07 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> Right, but with your approach, can you really be sure that you have
>> the right rejecting tuple ctid (not reject)? In other words, as you
>> wait for the exclusion constraint to conclusively indicate that there
>> is a conflict, minutes may have passed in which time other conflicts
>> may emerge in earlier unique indexes. Whereas with an approach where
>> values are locked, you are guaranteed that earlier unique indexes have
>> no conflicting values. Maintaining that property seems useful, since
>> we check in a well-defined order, and we're still projecting a ctid.
>> Unlike when row locking is involved, we can make no assumptions or
>> generalizations around where conflicts will occur. Although that may
>> also be a general concern with your approach when row locking, for
>> multi-master replication use-cases. There may be some value in knowing
>> it cannot have been earlier unique indexes (and so the existing values
>> for those unique indexes in the locked row should stay the same -
>> don't many conflict resolution policies work that way?).
>
> I don't understand what you're saying. Can you give an example?
>
> In the use case I was envisioning above - you insert N rows, and if any
> of them violate a constraint, you still want to insert the non-violating
> rows instead of rolling back the whole transaction - you don't care. You
> don't care what existing rows the new rows conflicted with.
>
> Even if you want to know what you conflicted with, I can't make sense of
> what you're saying. In the btreelock approach, the value locks are
> immediately released once you discover that there's conflict. So by the time
> you get to do anything with the ctid of the existing tuple you conflicted
> with, new conflicting tuples might've appeared.

That's true, but at least the timeframe in which an additional
conflict may occur on just-locked index values is bound to more or
less an instant. In any case, how important this is is an interesting
question, and perhaps one that Andres can weigh in on as someone who
knows a lot about multi-master replication. This issue is particularly
interesting because this testcase appears to make both patches
livelock, for reasons that I believe are related:

https://github.com/petergeoghegan/upsert/blob/master/torture.sh

I have an idea of what I could do to fix this, but I don't have time
to make sure that my hunch is correct. I'm travelling tomorrow to give
a talk at PDX pug, so I'll have limited access to e-mail.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Wed, Jan 15, 2014 at 8:23 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I have an idea of what I could do to fix this, but I don't have time
> to make sure that my hunch is correct.

It might just be a matter of:

@@ -186,6 +186,13 @@ ExecLockHeapTupleForUpdateSpec(EState *estate,
     switch (test)
     {
         case HeapTupleInvisible:
+            /*
+             * Tuple may have originated from this command, in which case it's
+             * already locked
+             */
+            if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple.t_data)) &&
+                HeapTupleHeaderGetCmin(tuple.t_data) == estate->es_output_cid)
+                return true;
             /* Tuple became invisible;  try again */
             if (IsolationUsesXactSnapshot())
                 ereport(ERROR,

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 01/16/2014 03:25 AM, Peter Geoghegan wrote:
> I think you should consider breaking off the relcache parts of my
> patch and committing them, because they're independently useful. If we
> are going to have a lot of conflicts that need to be handled by a
> heap_delete(), there is no point in inserting non-unique index tuples
> for what is not yet conclusively a proper (non-promise) tuple. Those
> should always come last. And even without upsert, strictly inserting
> into unique indexes first seems like a useful thing relative to the
> cost. Unique violations are the cause of many aborted transactions,
> and there is no need to ever bloat non-unique indexes of the same slot
> when that happens.

Makes sense. Can you extract that into a separate patch, please?

I was wondering if that might cause deadlocks if an existing index is 
changed from unique to non-unique, or vice versa, as the ordering would 
change. But we don't have a DDL command to change that, so the question 
is moot.

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Jan 16, 2014 at 12:35 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Makes sense. Can you extract that into a separate patch, please?

Okay.

On an unrelated note, here are results for a benchmark that compares
the two patches for an insert heavy workload:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/insert-heavy-cmp/

I should point out that this is a sympathetic case for the exclusion
approach; there is only one unique index involved, and the heap tuples
were relatively wide:

pg@gerbil:~/pgbench-tools/tests$ cat tpc-b-upsert.sql
\set nbranches 1000000000
\set naccounts 1000000000
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom delta -5000 5000
with rej as(insert into pgbench_accounts(aid, bid, abalance, filler)
values(:aid, :bid, :delta, 'filler') on duplicate key lock for update
returning rejects aid, abalance) update pgbench_accounts set abalance
= pgbench_accounts.abalance + rej.abalance from rej where
pgbench_accounts.aid = rej.aid;

(This benchmark used an unlogged table, if only because to do
otherwise would severely starve this particular server of I/O.)
-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Wed, Jan 15, 2014 at 11:02 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It might just be a matter of:
>
> @@ -186,6 +186,13 @@ ExecLockHeapTupleForUpdateSpec(EState *estate,
>         switch (test)
>         {
>                 case HeapTupleInvisible:
> +                       /*
> +                        * Tuple may have originated from this command, in which case it's
> +                        * already locked
> +                        */
> +                       if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple.t_data)) &&
> +                               HeapTupleHeaderGetCmin(tuple.t_data) == estate->es_output_cid)
> +                               return true;
>                         /* Tuple became invisible;  try again */
>                         if (IsolationUsesXactSnapshot())
>                                 ereport(ERROR,

I think we need to give this some more thought. I have not addressed
the implications for MVCC snapshots here. I think that I'll need to
raise a WARNING along the lines of "your snapshot isn't going to
consider the locked tuple visible because the same command inserted
it", or perhaps even raise an ERROR regardless of isolation level
(although note that I'm not suggesting that we raise an ERROR in the
event of receiving HeapTupleInvisible from heap_lock_tuple()/HTSU()
for other reasons, which *is* possible, nor am I suggesting that later
commands of the same xact would ever see this ERROR). I'm comfortable
with the idea of what you might loosely describe as a "READ COMMITTED
mode serialization failure" here, because this case is so much more
narrow than the other case I've proposed making a special exception to
the general semantics of MVCC snapshots to accommodate (i.e. the case
where a tuple is locked from an xact logically still-in-progress to
our snapshot in RC mode).

I think I'll be happy to declare that usage of the feature that hits
this issue is somewhere between questionable and wrong. It probably
isn't worth making another, similar HTSMVCC exception for this case.
But ISTM that we still have to do *something* other than simply credit
users with taking care to avoid tripping up on this.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Robert Haas
Date:
On Thu, Jan 16, 2014 at 3:35 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Makes sense. Can you extract that into a separate patch, please?
>
> I was wondering if that might cause deadlocks if an existing index is
> changed from unique to non-unique, or vice versa, as the ordering would
> change. But we don't have a DDL command to change that, so the question is
> moot.

It's not hard to imagine someone wanting to add such a DDL command.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sat, Jan 18, 2014 at 5:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I was wondering if that might cause deadlocks if an existing index is
>> changed from unique to non-unique, or vice versa, as the ordering would
>> change. But we don't have a DDL command to change that, so the question is
>> moot.
>
> It's not hard to imagine someone wanting to add such a DDL command.

Perhaps, but the burden of solving that problem ought to rest with
whoever eventually proposes the command. Certainly, if someone did so
today, I would object on the grounds that their patch precluded us
from ever prioritizing unique indexes, to get them out of the way
during insertion, so I am not actually making such an effort more
difficult than it already is. Moreover, avoiding entirely predictable
index bloat is more important than easing the implementation of this
yet-to-be-proposed feature. I was surprised when I learned that things
didn't already work this way.

The attached patch, broken off from my main patch, has the relcache
sort indexes by (!indisunique, indexrelid).
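
The ordering itself amounts to a two-key comparator along these lines
(a sketch with a made-up entry struct; the actual patch works over the
relcache's index OID list):

typedef struct
{
    Oid     indexoid;
    bool    indisunique;
} IndexSortEntry;

static int
index_priority_cmp(const void *a, const void *b)
{
    const IndexSortEntry *ia = (const IndexSortEntry *) a;
    const IndexSortEntry *ib = (const IndexSortEntry *) b;

    /* unique indexes come first, so conflicts are found early */
    if (ia->indisunique != ib->indisunique)
        return ia->indisunique ? -1 : 1;
    /* break ties by OID, giving a stable order across all backends */
    if (ia->indexoid != ib->indexoid)
        return (ia->indexoid < ib->indexoid) ? -1 : 1;
    return 0;
}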

--
Peter Geoghegan

Attachment

Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Jan 16, 2014 at 6:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I think we need to give this some more thought. I have not addressed
> the implications for MVCC snapshots here.

So I gave this some more thought, and this is what I came up with:

+ static bool
+ ExecLockHeapTupleForUpdateSpec(EState *estate,
+                                ResultRelInfo *relinfo,
+                                ItemPointer tid)
+ {
+     Relation                relation = relinfo->ri_RelationDesc;
+     HeapTupleData            tuple;
+     HeapUpdateFailureData     hufd;
+     HTSU_Result             test;
+     Buffer                    buffer;
+
+     Assert(ItemPointerIsValid(tid));
+
+     /* Lock tuple for update */
+     tuple.t_self = *tid;
+     test = heap_lock_tuple(relation, &tuple,
+                            estate->es_output_cid,
+                            LockTupleExclusive, false, /* wait */
+                            true, &buffer, &hufd);
+     ReleaseBuffer(buffer);
+
+     switch (test)
+     {
+         case HeapTupleInvisible:
+             /*
+              * Tuple may have originated from this transaction, in which case
+              * it's already locked.  However, to avoid having to consider the
+              * case where the user locked an instantaneously invisible row
+              * inserted in the same command, throw an error.
+              */
+             if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple.t_data)))
+                 ereport(ERROR,
+                         (errcode(ERRCODE_UNIQUE_VIOLATION),
+                          errmsg("could not lock instantaneously invisible tuple inserted in same transaction"),
+                          errhint("Ensure that no rows proposed for insertion in the same command have constrained values that duplicate each other.")));
+             if (IsolationUsesXactSnapshot())
+                 ereport(ERROR,
+                         (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+                          errmsg("could not serialize access due to concurrent update")));
+             /* Tuple became invisible due to concurrent update; try again */
+             return false;
+         case HeapTupleSelfUpdated:
+             /*

I'm just throwing an error when locking the tuple returns
HeapTupleInvisible, and the xmin of the tuple is our xid.

It's sufficient to just check
TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple.t_data)),
because there is no way that _bt_check_unique() could consider the
tuple dirty visible + conclusively fit for a lock attempt if it came
from our xact, while at the same time for the same tuple
HeapTupleSatisfiesUpdate() indicated invisibility, unless the tuple
originated from the same command. Checking against subxacts or
ancestor xacts is at worst redundant.

I am happy with this. ISTM that it'd be hard to argue that any
reasonable and well-informed person would ever thank us for trying
harder here, although it took me a while to reach that position. To
understand what I mean, consider what MySQL does when in a similar
position. I didn't actually check, but based on the fact that their
docs don't consider this question I guess MySQL would go update the
tuple inserted by that same "INSERT...ON DUPLICATE KEY UPDATE"
command. Most of the time the conflicting tuples proposed for
insertion by the user are in *some* way different (i.e. if the table
was initially empty and you did a regular insert, inserting those same
tuples would cause a unique constraint violation all on their own, but
without there being any fully identical tuples among these
hypothetical tuples proposed for insertion). It seems obvious that the
order in which each tuple is evaluated for insert-or-update on MySQL
is more or less undefined. And so by allowing this, they arguably
let their users miss something they should not: nothing useful ends up
being done with the datums originally inserted by the command, which
are simply updated over with something else later in the same command.

MySQL users are not notified that this happened, and are probably
blissfully unaware that there has been a limited form of data loss. So
it's The Right Thing to say to Postgres users: "if you inserted these
rows into the table when it was empty, there'd *still* definitely be a
unique constraint violation, and you need to sort that out before
asking Postgres to handle conflicts with concurrent sessions and
existing data, where rows that come from earlier commands in your xact
count as existing data". The only problem I can see with that is that
we cannot complain consistently for practical reasons, as when we lock
*some other* xact's tuple rather than inserting in the same command
two or more times. But at least when that happens they can definitely
update two or more times (i.e. the row that we "locked twice" is
visible). Naturally we can't catch every error a DML author may make.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sat, Jan 18, 2014 at 6:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
> MySQL users are not notified that this happened, and are probably
> blissfully unaware that there has been a limited form of data loss. So
> it's The Right Thing to say to Postgres users: "if you inserted these
> rows into the table when it was empty, there'd *still* definitely be a
> unique constraint violation, and you need to sort that out before
> asking Postgres to handle conflicts with concurrent sessions and
> existing data, where rows that come from earlier commands in your xact
> count as existing data".

I Googled and found evidence indicating that a number of popular
proprietary systems' SQL MERGE implementations do much the same thing.
You may get an "attempt to UPDATE the same row twice" error on both
SQL Server and Oracle. I wouldn't like to speculate whether the
standard requires this of MERGE, but requiring it seems very sensible.

> The only problem I can see with that is that
> we cannot complain consistently for practical reasons, as when we lock
> *some other* xact's tuple rather than inserting in the same command
> two or more times.

Actually, maybe it would be practical to complain, with at least
*almost* total accuracy, that the same UPSERT command attempted to
lock a row twice, and not just in the particularly problematic case
where tuple visibility is not assured.

Personally, I favor just making "case HeapTupleSelfUpdated:" within
the patch's ExecLockHeapTupleForUpdateSpec() function complain when
"hufd.cmax == estate->es_output_cid)" (currently there is a separate
complaint, but only when those two variables are unequal). That's
probably almost perfect in practice.

If we wanted perfection, which would be to always complain when two
rows were locked by the same UPSERT command, it would be a matter of
having heap_lock_tuple indicate to the patch's
ExecLockHeapTupleForUpdateSpec() caller that the row was already
locked, so that it could complain in a special way for the
locked-not-updated case. But that is hard, because there is no way for
it to know if the current *command* locked the tuple, and that's the
only case that we are justified in raising an error for.

But now that I think about it some more, maybe always complaining when
we lock but have not yet updated is not just not worth the trouble,
but is in fact bogus. It's not obvious what precise behavior is
correct here. I was worried about someone updating something twice,
but maybe it's fully sufficient to do what I've already proposed,
while in addition documenting that you cannot on-duplicate-key-lock a
tuple that has already been inserted or updated within the same
command. It will be very rare for anyone to trip up over that in
practice (e.g. by locking twice and spuriously updating the same row
twice or more in a later command). Users learn to not try this kind of
thing by having it break immediately; the fact that it doesn't break
with 100% reliability is good enough (plus it doesn't *really* fail to
break when it should because of how things are documented).

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sat, Jan 18, 2014 at 7:49 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Personally, I favor just making "case HeapTupleSelfUpdated:" within
> the patch's ExecLockHeapTupleForUpdateSpec() function complain when
> "hufd.cmax == estate->es_output_cid)" (currently there is a separate
> complaint, but only when those two variables are unequal). That's
> probably almost perfect in practice.

Actually, there isn't really a need to do so, since I believe in
practice the tuple locked will always be instantaneously invisible
(when we have the scope to avoid this "updated the tuple twice in the
same command" problem by forbidding it in the style of SQL MERGE).
However, I think I'm going to propose that we still do something in
the ExecLockHeapTupleForUpdateSpec() HeapTupleSelfUpdated handler (in
addition to HeapTupleInvisible), because that'll still be illustrative
dead code.


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Jan 16, 2014 at 12:35 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> I think you should consider breaking off the relcache parts of my
>> patch and committing them, because they're independently useful.
>
> Makes sense. Can you extract that into a separate patch, please?

Perhaps you can take a look at this again, when you get a chance.


-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Heikki Linnakangas
Date:
On 02/07/2014 01:27 PM, Peter Geoghegan wrote:
> On Thu, Jan 16, 2014 at 12:35 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>> I think you should consider breaking off the relcache parts of my
>>> patch and committing them, because they're independently useful.
>>
>> Makes sense. Can you extract that into a separate patch, please?
>
> Perhaps you can take a look at this again, when you get a chance.

The relcache parts? I don't think a separate patch ever appeared that 
could be reviewed.

Looking again at the last emails in this whole thread, I don't have 
anything to add. At this point, I think it's pretty clear this won't 
make it into 9.4, so I'm going to mark this as "returned with feedback". 
If someone else thinks this still has a chance and is willing to review 
this and beat it into shape, please resurrect it quickly.

- Heikki



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Mon, Feb 10, 2014 at 11:57 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> The relcache parts? I don't think a separate patch ever appeared that could
> be reviewed.


I posted the patch on January 18th:
http://www.postgresql.org/message-id/CAM3SWZTh4VkESoT7dCrWbPRN7zZhNZ-Wa6zmvO1FF7gBNOjNOg@mail.gmail.com

I was under the impression that you agreed that this was independently
valuable, regardless of the outcome here.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Sun, Jan 19, 2014 at 2:17 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I'm just throwing an error when locking the tuple returns
> HeapTupleInvisible, and the xmin of the tuple is our xid.

I would like some feedback on this point. We need to consider how
exactly to avoid updating the same tuple inserted by our command.
Updating a tuple we inserted cannot be allowed to happen, not least
because to do so causes livelock.

A related consideration that I raised in mid to late January that
hasn't been commented on is avoiding updating the same tuple twice,
and where we come down on that with respect to where our
responsibility to the user starts and ends. For example, SQL MERGE
officially forbids this, but MySQL's INSERT...ON DUPLICATE KEY UPDATE
seems not to, probably due to implementation considerations.

-- 
Peter Geoghegan



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Bruce Momjian
Date:
On Mon, Feb 10, 2014 at 06:40:30PM +0000, Peter Geoghegan wrote:
> On Sun, Jan 19, 2014 at 2:17 AM, Peter Geoghegan <pg@heroku.com> wrote:
> > I'm just throwing an error when locking the tuple returns
> > HeapTupleInvisible, and the xmin of the tuple is our xid.
> 
> I would like some feedback on this point. We need to consider how
> exactly to avoid updating the same tuple inserted by our command.
> Updating a tuple we inserted cannot be allowed to happen, not least
> because to do so causes livelock.
> 
> A related consideration that I raised in mid to late January that
> hasn't been commented on is avoiding updating the same tuple twice,
> and where we come down on that with respect to where our
> responsibility to the user starts and ends. For example, SQL MERGE
> officially forbids this, but MySQL's INSERT...ON DUPLICATE KEY UPDATE
> seems not to, probably due to implementation considerations.

Where are we on this?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From
Peter Geoghegan
Date:
On Thu, Apr 17, 2014 at 9:52 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Where are we on this?

My hope is that I can get agreement on a way forward during pgCon. Or,
at the very least, explain the issues as I see them in a relatively
accessible and succinct way to those interested.


-- 
Peter Geoghegan