Thread: Making joins involving ctid work for the benefit of UPSERT

Making joins involving ctid work for the benefit of UPSERT

From: Peter Geoghegan
I'm working on UPSERT again. I think that in order to make useful
progress, I'll have to find a better way of overcoming the visibility
issues (i.e. the problem of what to do about a
still-in-progress-to-our-snapshot row being locked at READ COMMITTED
isolation level [1][2]).

I've made some tentative additions to my earlier patch to change the syntax:

postgres=# explain analyze
with c as (
  insert into ints(key, val) values (1, 'a'), (2, 'b')
  on conflict
  select * for update)
  update ints set val = c.val from c conflicts;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Update on ints  (cost=0.03..62.34 rows=2 width=104) (actual time=0.095..0.095 rows=0 loops=1)
   CTE c
     ->  Insert on ints ints_1  (cost=0.00..0.03 rows=2 width=36) (actual time=0.039..0.049 rows=2 loops=1)
           ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=36) (actual time=0.001..0.002 rows=2 loops=1)
   ->  Nested Loop  (cost=0.00..62.31 rows=2 width=104) (actual time=0.063..0.080 rows=2 loops=1)
         Join Filter: (ints.ctid = c.ctid)
         Rows Removed by Join Filter: 6
         ->  CTE Scan on c  (cost=0.00..0.04 rows=2 width=100) (actual time=0.044..0.055 rows=2 loops=1)
         ->  Materialize  (cost=0.00..28.45 rows=1230 width=10) (actual time=0.005..0.006 rows=4 loops=2)
               ->  Seq Scan on ints  (cost=0.00..22.30 rows=1230 width=10) (actual time=0.009..0.009 rows=4 loops=1)
 Planning time: 0.132 ms
 Execution time: 0.159 ms
(12 rows)

This works, as far as it goes. Parse analysis adds the ctid join to
the top-level UPDATE based on the fact that a magical UPDATE statement
(magical due to having a CONFLICTS clause) references in its FromList
a CTE containing an INSERT ON CONFLICT (a reference that CONFLICTS
mandates during parse analysis), with that CTE itself projecting a
magical set of rejected rows sufficient to get a locked c.ctid
implicitly. (FWIW, DELETEs may also accept a CONFLICTS clause and be
used like this.)

The optimizer does not support joins involving ctid, which is why
there is a seqscan plan - even setting "enable_seqscan = off" does not
alter the plan right now.

Apart from the obvious fact that we're doing an unnecessary Seq Scan
on the target table, which is unacceptable on performance grounds
(since we do in fact already know exactly what list of tids to pass to
the top-level ModifyTable node to UPDATE), there is another problem:
Since I cannot use a tid scan, I cannot play tricks with visibility
within tidscans for the benefit of solving the basic problem of doing
the right thing for READ COMMITTED. Clearly I need to add a new "MVCC
violation" in the spirit of EvalPlanQual() to avoid this problem
(since READ COMMITTED serialization failures are unacceptable to
users), but I'd like this new MVCC violation to be as selective as
possible.
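To make the visibility problem concrete, here is a toy model of it in Python (illustrative only, with invented names; it is not PostgreSQL code): a READ COMMITTED snapshot only sees rows whose inserting transaction committed before the snapshot was taken, so the unique index can force us to treat a concurrently inserted row as a conflict and lock it, even though an ordinary MVCC scan cannot return that row.

```python
# Toy model of the READ COMMITTED visibility problem. All names
# (Row, visible, snapshot_xmin) are invented for illustration.

class Row:
    def __init__(self, xmin, key):
        self.xmin = xmin   # xid of the inserting transaction
        self.key = key     # value in the unique column

def visible(row, snapshot_xmin):
    # Simplified MVCC rule: a row is visible only if its inserter
    # committed before our snapshot was taken.
    return row.xmin < snapshot_xmin

# Our transaction took its snapshot at xid 100.
snapshot_xmin = 100
# A concurrent transaction (xid 105) inserted key=1 and committed later.
conflicting = Row(xmin=105, key=1)

# The unique index still reports a conflict when we try to insert key=1 ...
conflict_found = conflicting.key == 1
# ... but an ordinary MVCC scan cannot see the row we just locked:
assert conflict_found and not visible(conflicting, snapshot_xmin)
print("conflict on a row our snapshot cannot see")
```

This is exactly the gap the proposed "MVCC violation" has to bridge: the row is locked and must be updated, but the snapshot says it does not exist.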

Let me take a step back, though. Presently UPDATE WHERE CURRENT OF
does something that isn't a million miles from what I have in mind to
address the problem. That feature exists for "sensitive cursors",
where we lock rows (since FOR UPDATE was specified with the DECLARE
CURSOR statement) as they're fetched from the cursor. The upshot is
that in general, rows can change "out from under us" since they're not
yet locked, including in ways that might result in a new row version
not satisfying the original SELECT FOR UPDATE query predicate -
clearly users don't want to have to deal with that when they
subsequently go to UPDATE WHERE CURRENT OF. Special handling is
required. Purely because UPDATE WHERE CURRENT OF was specified, as
part of this special handling that update must use a tid scan. There
is a little bit of special machinery within the optimizer (within
cost_tidscan()) to force this for the isCurrentOf case.

The tid scan code within the executor also has special UPDATE WHERE
CURRENT OF handling. TidListCreate() specially handles the case by
asking execCurrentOf() to look up the cursor's portal, and to get a
tid from its QueryDesc for the current cursor position. TidNext()
specially fetches the most recent version visible to our estate's
Snapshot when we UPDATE WHERE CURRENT OF, without considering the
original SELECT FOR UPDATE predicate (and whether or not it would in
any sense continue to be satisfied). Each tid is then fetched from the
heap and returned. I think I ought to be able to do something similar
with this new CONFLICTS clause, by similarly marshaling some list of
tids ultimately originating from the ModifyTable node that locked a
potentially-not-yet-visible row (or maybe the optimizer recognizes the
internal special case and creates the tid scan node "visibility
credulous"). I'd then return heap tuples by using a dirty snapshot.
Since this can only happen for the UPDATE's tid scan, and there must
be a tid scan, it should be correct (on its own fairly reasonable
terms, for READ COMMITTED). After all, at this juncture for the
purposes of UPSERT it will already certainly be the case that the
tuple is locked.
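A minimal sketch of that idea (illustrative Python with invented names, not executor code): instead of scanning the table and re-checking MVCC visibility, the UPDATE's tid scan would be handed the conflict TIDs that the insert already discovered, and would fetch those heap tuples under a dirty-snapshot-style rule that also admits locked, not-yet-visible tuples.

```python
# Illustrative sketch only: a "tid scan" that fetches exactly the
# tuples named by a conflict TID list, using a dirty-snapshot-style
# visibility test rather than the transaction's MVCC snapshot.

def dirty_visible(tup):
    # Dirty-snapshot rule: accept committed tuples and ones whose
    # inserter is still in progress (we hold a lock, so they're stable).
    return tup["state"] in ("committed", "in_progress")

def tid_scan(heap, conflict_tids):
    # Return heap tuples for exactly the TIDs the insert reported.
    return [heap[tid] for tid in conflict_tids if dirty_visible(heap[tid])]

heap = {
    (0, 1): {"state": "committed",   "key": 1, "val": "a"},
    (0, 2): {"state": "in_progress", "key": 2, "val": "b"},  # invisible to MVCC
    (0, 3): {"state": "committed",   "key": 3, "val": "c"},
}
# The ModifyTable insert already told us which TIDs conflicted:
rows = tid_scan(heap, [(0, 1), (0, 2)])
print([r["key"] for r in rows])  # → [1, 2]
```

Note that no whole-table scan appears anywhere: the conflict TIDs fully determine which tuples are fetched, which is the point of wanting real tid-join support in the optimizer.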

I would like to:

* Get feedback on the hazards of taking this direction, and indeed the
advisability of the whole approach in general.

For example, I think we'd always want to use a DirtySnapshot, and
never try to just get the latest version visible to our estate's
snapshot, but I might be wrong about that. Since my magical UPDATE
could have additional boolean conditions in its predicate (that is,
additions to the implicit ctid join, in contrast to the existing
magical WHERE CURRENT OF variant of UPDATE where that's always the
only thing in the predicate), the "morality" of *which* tid should get
to the UPDATE should certainly be considered. Of particular concern is
the fact that having Postgres do one or the other could make the
difference between an UPDATE qual being satisfied or not being
satisfied. Maybe this is an argument against being able to tack on
additional boolean conditions in the following fashion (which I'd
prefer to support without also supporting more than one join (the
magical implicit one), even though as I mentioned WHERE CURRENT OF
does not support boolean conditions):

with c as (
  insert into ints(key, val) values (1, 'a'), (2, 'b')
  on conflict
  select * for update)
  update ints set val = c.val from c conflicts where c.key != 1;

* Get some direction on what it would take to make joins involving
ctid work, since all of this hinges upon that infrastructure becoming
available. This appears to be a Simple Matter of Programming (at least
for someone that happens to already have a good understanding of the
optimizer), and is anticipated by this comment within tidpath.c:
 * There is currently no special support for joins involving CTID; in
 * particular nothing corresponding to best_inner_indexscan().  Since it's
 * not very useful to store TIDs of one table in another table, there
 * doesn't seem to be enough use-case to justify adding a lot of code
 * for that.

[1] http://www.postgresql.org/message-id/AANLkTineR-rDFWENeddLg=GrkT+epMHk2j9X0YqpiTY8@mail.gmail.com
[2] http://www.pgcon.org/2014/schedule/attachments/327_upsert_weird.pdf, slide 25
-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From: Peter Geoghegan
On Thu, Jul 17, 2014 at 6:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> This appears to be a Simple Matter of Programming (at least
> for someone that happens to already have a good understanding of the
> optimizer), and is anticipated by this comment within tidpath.c:
>
>  * There is currently no special support for joins involving CTID; in
>  * particular nothing corresponding to best_inner_indexscan().  Since it's
>  * not very useful to store TIDs of one table in another table, there
>  * doesn't seem to be enough use-case to justify adding a lot of code
>  * for that.

By the way, this comment is obsolete since best_inner_indexscan() was
removed in 2012 by the parameterized paths patch.

-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From: Andres Freund
Hi,

On 2014-07-17 18:18:41 -0700, Peter Geoghegan wrote:
> I'm working on UPSERT again. I think that in order to make useful
> progress, I'll have to find a better way of overcoming the visibility
> issues (i.e. the problem of what to do about a
> still-in-progress-to-our-snapshot row being locked at READ COMMITTED
> isolation level [1][2]).

Agreed.

> I've made some tentative additions to my earlier patch to change the syntax:
> 
> postgres=# explain analyze
> with c as (
>   insert into ints(key, val) values (1, 'a'), (2, 'b')
>   on conflict
>   select * for update)
>   update ints set val = c.val from c conflicts;

I still don't agree that this is a sane syntax for upsert. Avoiding it
would reduce the scope of the problems you have measurably. We should
provide a syntax that does the UPSERT itself, instead of delegating that
to the user as you're proposing above.  Afaics nearly none of the
problems you're debating in your email would exist if upsert were
entirely done internally.

I think the things that use "weird" visibility semantics are pretty much
all doing that internally (things being EvalPlanQual stuff for
INSERT/UPDATE/DELETE and the referential integrity triggers). I don't
see sufficient reason why we should break away from that here. Yes,
there's stuff that cannot be done when it's done internally, but we can
live with those not being possible.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Making joins involving ctid work for the benefit of UPSERT

From: Peter Geoghegan
On Fri, Jul 18, 2014 at 10:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I think the things that use "weird" visibility semantics are pretty much
> all doing that internally (things being EvalPlanQual stuff for
> INSERT/UPDATE/DELETE and the referential integrity triggers). I don't
> see sufficient reason why we should break away from that here. Yes,
> there's stuff that cannot be done when it's done internally, but we can
> live with those not being possible.

The whole point of what I was proposing was that those semantics would
only apply to a special tid scan node. Perhaps I missed something, but
I'm not sure why you'd consider that I was breaking away from that
here at all.

-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From: Andres Freund
On 2014-07-18 10:53:36 -0700, Peter Geoghegan wrote:
> On Fri, Jul 18, 2014 at 10:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I think the things that use "weird" visibility semantics are pretty much
> > all doing that internally (things being EvalPlanQual stuff for
> > INSERT/UPDATE/DELETE and the referential integrity triggers). I don't
> > see sufficient reason why we should break away from that here. Yes,
> > there's stuff that cannot be done when it's done internally, but we can
> > live with those not being possible.
> 
> The whole point of what I was proposing was that those semantics would
> only apply to a special tid scan node. Perhaps I missed something, but
> I'm not sure why you'd consider that I was breaking away from that
> here at all.

I don't see why you'd need such a node at all if we had a fully builtin
UPSERT. The whole stuff with ON CONFLICT SELECT FOR UPDATE and then
UPDATE ... FROM c CONFLICTS is too complicated and exposes stuff that
barely anybody will understand, let alone use correctly in queries they
write themselves.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Making joins involving ctid work for the benefit of UPSERT

From: Peter Geoghegan
On Fri, Jul 18, 2014 at 11:06 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I don't see why you'd need such a node at all if we had a fully builtin
> UPSERT. The whole stuff with ON CONFLICT SELECT FOR UPDATE and then
> UPDATE ... FROM c CONFLICTS is too complicated and exposes stuff that
> barely anybody will understand, let alone use correctly in queries they
> write themselves.

I accept that there will be a need for certain restrictions. Most
obviously, if you update the target table referencing a CTE like this,
not using the special CONFLICTS clause in the UPDATE (or DELETE) is an
error. And as I mentioned, you may only join the projected duplicates
to the UPDATE ModifyTable - an attempt to join any more relations is
an error. In short, this *is* a fully built-in upsert.


-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From: Andres Freund
On 2014-07-18 11:14:34 -0700, Peter Geoghegan wrote:
> On Fri, Jul 18, 2014 at 11:06 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I don't see why you'd need such a node at all if we had a fully builtin
> > UPSERT. The whole stuff with ON CONFLICT SELECT FOR UPDATE and then
> > UPDATE ... FROM c CONFLICTS is too complicated and exposes stuff that
> > barely anybody will understand, let alone use correctly in queries they
> > write themselves.
> 
> I accept that there will be a need for certain restrictions. Most
> obviously, if you update the target table referencing a CTE like this,
> not using the special CONFLICTS clause in the UPDATE (or DELETE) is an
> error. And as I mentioned, you may only join the projected duplicates
> to the UPDATE ModifyTable - an attempt to join any more relations is
> an error. In short, this *is* a fully built-in upsert.

Meh. An understandable syntax wouldn't require the pullups with a special
scan node and such. I think you're attempting a sort of genericity
that's making your (important!) goal much harder to reach.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Making joins involving ctid work for the benefit of UPSERT

From: Peter Geoghegan
On Fri, Jul 18, 2014 at 11:23 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Meh. An understandable syntax wouldn't require the pullups with a special
> scan node and such. I think you're attempting a sort of genericity
> that's making your (important!) goal much harder to reach.

Well, maybe. If the genericity of this syntax isn't what people want,
I may go with something else. At the suggestion of Robert I did start
a dedicated thread on that question months back (explicitly putting
aside the nitty-gritty details of the implementation), but that didn't
go anywhere.

-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From: Peter Geoghegan
On Fri, Jul 18, 2014 at 11:32 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Well, maybe. If the genericity of this syntax isn't what people want,
> I may go with something else.

By the way, I didn't mention that there is now also the optional
ability to specify a "merge on" unique index directly in DML. It would
be much nicer to specify a sort key instead, but that might be tricky
in the general case. I imagine that other systems solve the problem by
being very restrictive in what will be (implicitly) accepted as a
merge-on index. Seemingly there are problems with all major SQL MERGE
implementations, so I don't necessarily expect to be able to draw
lessons from them in any way here.

-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From: Peter Geoghegan
On Fri, Jul 18, 2014 at 11:23 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Meh. An understandable syntax wouldn't require the pullups with a special
> scan node and such.

Well, in general ExecModifyTable()/ExecUpdate() trusts the tid passed
to match the qual in the query. Unless you're willing to give up on
the idea of having a qual on the update (other than something like an
ON CONFLICTS qual, obviously) I think you'll probably end up with some
kind of pseudo-scan node anyway, if only for consistency with how
things already work, to give you somewhere to show the Filter in
explain output and so on. ExecScan() probably needs to ExecQual().
Inserts don't have scan nodes, but updates do, and so it seems pretty
odd to make the "Insert" node (a ModifyTable node) pullup from a
result node (as always), and the "Update" node (also a ModifyTable
node) treat the first ModifyTable node as its own pseudo-scan node,
when in fact we are scanning the heap in a manner not unlike a
conventional update. Or do you envision a second result node where an
update qual might be evaluated? I am not enthused about either
possibility.

The idea of not having the ability to put a predicate on the update at
all, much like the MySQL thing, isn't completely outrageous, but it
certainly isn't going to go down well with those who have already
expressed concerns about how much of a foot gun this could be.
-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From: Robert Haas
On Sat, Jul 19, 2014 at 10:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Jul 18, 2014 at 11:23 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Meh. An understandable syntax wouldn't require the pullups with a special
>> scan node and such.
>
> Well, in general ExecModifyTable()/ExecUpdate() trusts the tid passed
> to match the qual in the query. Unless you're willing to give up on
> the idea of having a qual on the update (other than something like an
> ON CONFLICTS qual, obviously) I think you'll probably end up with some
> kind of pseudo-scan node anyway, if only for consistency with how
> things already work, to give you somewhere to show the Filter in
> explain output and so on. ExecScan() probably needs to ExecQual().
> Inserts don't have scan nodes, but updates do, and so it seems pretty
> odd to make the "Insert" node (a ModifyTable node) pullup from a
> result node (as always), and the "Update" node (also a ModifyTable
> node) treat the first ModifyTable node as its own pseudo-scan node,
> when in fact we are scanning the heap in a manner not unlike a
> conventional update. Or do you envision a second result node where an
> update qual might be evaluated? I am not enthused about either
> possibility.
>
> The idea of not having the ability to put a predicate on the update at
> all, much like the MySQL thing, isn't completely outrageous, but it
> certainly isn't going to go down well with those who have already
> expressed concerns about how much of a foot gun this could be.

This all seems completely to one side of Andres's point.  I think what
he's saying is: don't implement an SQL syntax of the form INSERT ON
CONFLICT and let people use that to implement upsert.  Instead,
directly implement an SQL command called UPSERT or similar.  That's
long been my intuition as well.  I think generality here is not your
friend.

I'd suggest something like:

UPSERT table SET col = value [, ...] WHERE predicate;

with semantics like this:

1. Search the table, using any type of scan you like, for a row
matching the given predicate.
2. If you find more than one tuple that is visible to your scan, error.
3. If you find exactly one tuple that is visible to your scan, update it.  Stop.
4. If you find no tuple that is visible to your scan, but you do find
at least one tuple whose xmin has committed and whose xmax has not
committed, then (4a) if the isolation level is REPEATABLE READ or
higher, error; (4b) if there is more than one such tuple, error; else
(4c) update it, in violation of normal MVCC visibility rules.
5. Construct a new tuple using the col/value pairs from the SET clause
and try to insert it.  If this would fail with a unique index
violation, back out and go back to step 1.
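The five numbered steps might be caricatured as a loop like this (a toy Python model of the control flow only, with invented names and data structures; the real thing would live in the executor with proper locking and snapshot management):

```python
# Toy model of the proposed UPSERT control flow. "visible" stands in
# for ordinary MVCC visibility; "pending" for the committed-xmin /
# uncommitted-xmax case. All names are invented for illustration.

def upsert(table, match, new_row, read_committed=True):
    while True:
        vis = [r for r in table if r["state"] == "visible" and match(r)]
        pend = [r for r in table if r["state"] == "pending" and match(r)]
        if len(vis) > 1:                     # step 2
            raise RuntimeError("predicate matched more than one visible row")
        if len(vis) == 1:                    # step 3: plain update
            vis[0].update(new_row)
            return "updated"
        if pend:                             # step 4
            if not read_committed:
                raise RuntimeError("serialization failure")
            if len(pend) > 1:
                raise RuntimeError("more than one pending match")
            pend[0].update(new_row)          # MVCC-violating update
            return "updated-pending"
        # Step 5: try the insert; on a "unique violation", loop back.
        if any(r["key"] == new_row["key"] for r in table):
            continue  # a real implementation would re-check visibility here
        table.append(dict(new_row, state="visible"))
        return "inserted"
```

For example, calling it twice with the same key first inserts and then updates the same logical row; a "pending" row is updated in violation of the snapshot, as step 4 describes.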

Having said all that, I believe the INSERT ON CONFLICT syntax is more
easily comprehensible than previous proposals.  But I still tend to
agree with Andres that an explicit UPSERT syntax or something like it,
that captures all of the MVCC games inside itself, is likely
preferable from a user standpoint, whatever the implementation ends up
looking like.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making joins involving ctid work for the benefit of UPSERT

From: Peter Geoghegan
On Wed, Jul 23, 2014 at 9:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> This all seems completely to one side of Andres's point.  I think what
> he's saying is: don't implement an SQL syntax of the form INSERT ON
> CONFLICT and let people use that to implement upsert.  Instead,
> directly implement an SQL command called UPSERT or similar.  That's
> long been my intuition as well.  I think generality here is not your
> friend.

Sure, I was just making one fairly narrow point in relation to Andres'
concern about executor structuring of the UPSERT.

> I'd suggest something like:
>
> UPSERT table SET col = value [, ...] WHERE predicate;

Without dismissing the overall concern shared by you and Andres, that
particular update-driven syntax doesn't strike me as something that
should be pursued. I would like to be able to insert or update more
than a single row at a time, for one thing. For another, what exactly
appears in the predicate? Is it more or less the same as an UPDATE's
predicate?

> with semantics like this:
>
> 1. Search the table, using any type of scan you like, for a row
> matching the given predicate.

Perhaps I've misunderstood, but this is fundamentally different to
what I'd always thought would need to happen. I always understood that
the UPSERT should be insert-driven, and so not really have an initial
scan at all in the same sense that every Insert lacks one. Moreover,
I've always thought that everyone agreed on that. We go to insert, and
then in the course of inserting index tuples do something with dirty
snapshots. That's how we get information on conflicts.
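That insert-driven flow can be caricatured as follows (toy Python, invented names): the unique-index insertion is itself the conflict check, and because the probe behaves like a dirty snapshot, seeing uncommitted entries too, the insert reports the conflicting TID directly.

```python
# Toy sketch: conflict detection as a by-product of index insertion.
# The probe sees all entries, committed or not, so the insert itself
# hands back the conflicting TID. All names are invented.

def index_insert(unique_index, key, tid):
    # Dirty-snapshot-style check: any existing entry is a conflict.
    if key in unique_index:
        return ("conflict", unique_index[key])  # TID of the existing row
    unique_index[key] = tid
    return ("inserted", tid)

idx = {}
assert index_insert(idx, 1, (0, 1)) == ("inserted", (0, 1))
# Second insert of the same key: the index probe reports the conflict
# TID directly, with no separate scan of the table needed to find it.
assert index_insert(idx, 1, (0, 7)) == ("conflict", (0, 1))
print("conflict TID comes straight from the index insert")
```

That reported TID is exactly the knowledge that would be wasted if the subsequent update had to re-discover the row with its own scan.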

For one thing we cannot use any kind of scan unless there is a new
mechanism for seeing not-yet-visible tuples, like some kind of
non-MVCC snapshot that those scan nodes can selectively use. Now, I
guess that you intend that in this scenario you go straight to 5, and
then your insert finds the not-yet-visible conflict. But it's not
clear that when 5 fails, and we pick up from 1 that we then get a new
MVCC Snapshot or something else that gives us a hope of succeeding
this time around. And it seems quite wasteful to not use knowledge of
conflicts from the insert at all - that's one thing I'm trying to
address here; removing unnecessary index scans when we actually know
what our conflict TIDs are.

> 2. If you find more than one tuple that is visible to your scan, error.

This point seems to concern making the UPSERT UPDATE only affect zero
or one rows. Is that right? If so, why do you think that's a useful
constraint to impose?

> 3. If you find exactly one tuple that is visible to your scan, update it.  Stop.
> 4. If you find no tuple that is visible to your scan, but you do find
> at least one tuple whose xmin has committed and whose xmax has not
> committed, then (4a) if the isolation level is REPEATABLE READ or
> higher, error; (4b) if there is more than one such tuple, error; else
> (4b) update it, in violation of normal MVCC visibility rules.

Point 4b ("if there is more than one such tuple...") seems like it
needs some practical or theoretical justification - do you just not
want to follow an update chain? If so, why? What would the error
message actually be? And, to repeat myself: Why should an MVCC
snapshot see more than one such tuple in the ordinary course of
scanning a relation, since there is by definition a unique constraint
that prevents this? Or maybe I'm wrong; maybe you don't even require
that. That isn't clear.

As you know, I've always opposed any type of READ COMMITTED
serialization failure. If you're worried about lock starvation hazards
- although I don't think strong concerns here are justified - we can
always put in a fail-safe number of retries before throwing an error.
I'm comfortable with that because I think based on experimentation
that it's going to be virtually impossible to hit. Apparently other
snapshot isolation databases acquire a new snapshot, and re-do the
command rather than using an EvalPlanQual()-like mechanism and thereby
violating snapshot isolation. Those other systems have just such a
hard limit on retries before erroring, and it seems to work out okay
for them.
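In outline, that retry scheme looks something like this (illustrative Python; MAX_RETRIES and the per-attempt "snapshot" are stand-ins, not anything from the patch):

```python
# Outline of retry-with-a-cap: re-run the command with a fresh snapshot
# on each conflict, and only error out after a fixed number of attempts.
# Purely illustrative; the limit of 100 is an arbitrary placeholder.

MAX_RETRIES = 100

def upsert_with_retries(attempt_upsert):
    for n in range(MAX_RETRIES):
        if attempt_upsert(snapshot=n):   # fresh snapshot per attempt
            return n + 1                 # number of attempts it took
    raise RuntimeError("too many retries")  # the fail-safe error

# Succeeds on the third snapshot in this contrived example:
attempts = upsert_with_retries(lambda snapshot: snapshot >= 2)
print(attempts)  # → 3
```

The claim above is that the error branch is almost never reached in practice, which is why a hard cap is tolerable even for READ COMMITTED.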

> 5. Construct a new tuple using the col/value pairs from the SET clause
> and try to insert it.  If this would fail with a unique index
> violation, back out and go back to step 1.

Again, I can't see why you'd do this step last and not first, since
this is naturally the place to initially "see into the future" with a
dirty snapshot.

> Having said all that, I believe the INSERT ON CONFLICT syntax is more
> easily comprehensible than previous proposals.  But I still tend to
> agree with Andres that an explicit UPSERT syntax or something like it,
> that captures all of the MVCC games inside itself, is likely
> preferable from a user standpoint, whatever the implementation ends up
> looking like.

Okay then. If you both feel that way, I will come up with something
closer to what you sketch. But for now I still strongly feel it ought
to be driven by an insert. Perhaps I've misunderstood you entirely,
though.

-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From: Greg Stark
On Wed, Jul 23, 2014 at 5:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'd suggest something like:
>
> UPSERT table SET col = value [, ...] WHERE predicate;

I don't think this is actually good enough. Typical use cases are
things like "increment this counter or insert a new one starting at
0".

-- 
greg



Re: Making joins involving ctid work for the benefit of UPSERT

From: Kevin Grittner
Peter Geoghegan <pg@heroku.com> wrote:

> I still strongly feel it ought to be driven by an insert

Could you clarify that?  Does this mean that you feel that we
should write to the heap before reading the index to see if the row
will be a duplicate?  If so, I think that is a bad idea, since this
will sometimes be used to apply a new data set which hasn't changed
much from the old, and that approach will perform poorly for this
use case, causing a lot of bloat.  It certainly would work well for
the case that most of the rows are expected to be INSERTs rather
than UPDATEs, but I'm not sure that's justification for causing
extreme bloat in the other cases.

Also, just a reminder that I'm going to squawk loudly if the
implementation does not do something fairly predictable and sane
for the case that the table has more than one UNIQUE index and you
attempt to UPSERT a row that is a duplicate of one row on one of
the indexes and a different row on a different index.  The example
discussed during your PGCon talk was something like a city table
with two columns, each with a UNIQUE constraint, containing:

 city_id | city_name
---------+-----------
       1 | Toronto
       2 | Ottawa

... and an UPSERT comes through for (1, 'Ottawa').  We would all
like for that never to happen, but it will.  There must be sane and
documented behavior in that case.
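The city example can be made concrete with a toy check (illustrative Python, invented names): the proposed row conflicts with a *different* existing row on each unique index, so there is no single row an upsert could coherently update.

```python
# Toy check for the two-unique-index scenario: upserting (1, 'Ottawa')
# conflicts with different rows on the city_id and city_name indexes.

rows = [(1, "Toronto"), (2, "Ottawa")]
by_id = {row[0]: row for row in rows}    # unique index on city_id
by_name = {row[1]: row for row in rows}  # unique index on city_name

def conflicts(new_row):
    # Collect every existing row the new row collides with, per index.
    found = set()
    if new_row[0] in by_id:
        found.add(by_id[new_row[0]])
    if new_row[1] in by_name:
        found.add(by_name[new_row[1]])
    return found

clash = conflicts((1, "Ottawa"))
print(sorted(clash))  # → [(1, 'Toronto'), (2, 'Ottawa')]
```

Two distinct rows conflict at once, which is precisely the case whose behavior has to be pinned down and documented before the feature ships.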

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making joins involving ctid work for the benefit of UPSERT

From: Robert Haas
On Wed, Jul 23, 2014 at 4:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Jul 23, 2014 at 9:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> This all seems completely to one side of Andres's point.  I think what
>> he's saying is: don't implement an SQL syntax of the form INSERT ON
>> CONFLICT and let people use that to implement upsert.  Instead,
>> directly implement an SQL command called UPSERT or similar.  That's
>> long been my intuition as well.  I think generality here is not your
>> friend.
>
> Sure, I was just making one fairly narrow point in relation to Andres'
> concern about executor structuring of the UPSERT.
>
>> I'd suggest something like:
>>
>> UPSERT table SET col = value [, ...] WHERE predicate;
>
> Without dismissing the overall concern shared by you and Andres, that
> particular update-driven syntax doesn't strike me as something that
> should be pursued. I would like to be able to insert or update more
> than a single row at a time, for one thing. For another, what exactly
> appears in the predicate? Is it more or less the same as an UPDATE's
> predicate?

To the last question, yes.  To the first point, I'm not bent on this
particular syntax, but I am +1 for the idea that the command is
something specifically upsert-like rather than something more generic
like an ON CONFLICT SELECT that lets you do arbitrary things with the
returned rows.

>> with semantics like this:
>>
>> 1. Search the table, using any type of scan you like, for a row
>> matching the given predicate.
>
> Perhaps I've misunderstood, but this is fundamentally different to
> what I'd always thought would need to happen. I always understood that
> the UPSERT should be insert-driven, and so not really have an initial
> scan at all in the same sense that every Insert lacks one. Moreover,
> I've always thought that everyone agreed on that. We go to insert, and
> then in the course of inserting index tuples do something with dirty
> snapshots. That's how we get information on conflicts.

It's certainly arguable whether you should INSERT and then turn failures
into an update or try to UPDATE and then turn failures into an INSERT;
we might even want to have both options available, though that smells
a little like airing too much of our dirty laundry.  But I think I
generally favor optimizing for the UPDATE case for more or less the
same reasons Kevin articulated.

> For one thing we cannot use any kind of scan unless there is a new
> mechanism for seeing not-yet-visible tuples, like some kind of
> non-MVCC snapshot that those scan nodes can selectively use. Now, I
> guess that you intend that in this scenario you go straight to 5, and
> then your insert finds the not-yet-visible conflict. But it's not
> clear that when 5 fails, and we pick up from 1 that we then get a new
> MVCC Snapshot or something else that gives us a hope of succeeding
> this time around. And it seems quite wasteful to not use knowledge of
> conflicts from the insert at all - that's one thing I'm trying to
> address here; removing unnecessary index scans when we actually know
> what our conflict TIDs are.

Here you seem to be suggesting that I intended to propose your existing
design rather than something else, which I didn't.  In this design,
you find the conflict (at most one) by scanning for the tuple to be
updated.

>> 2. If you find more than one tuple that is visible to your scan, error.
>
> This point seems to concern making the UPSERT UPDATE only affect zero
> or one rows. Is that right? If so, why do you think that's a useful
> constraint to impose?

Because nobody wants an operation to either insert 1 tuple or update
n>=1 tuples.  The intention is that the predicate should probably be
something like WHERE unique_key = 'some_value', but you can use
something else.  So it's kinda like saying which index you care about
for a particular operation, except without having to explicitly name
an index.  But in any event you should use a predicate that uniquely
identifies the tuple you want to update.

>> 3. If you find exactly one tuple that is visible to your scan, update it.  Stop.
>> 4. If you find no tuple that is visible to your scan, but you do find
>> at least one tuple whose xmin has committed and whose xmax has not
>> committed, then (4a) if the isolation level is REPEATABLE READ or
>> higher, error; (4b) if there is more than one such tuple, error; else
>> (4c) update it, in violation of normal MVCC visibility rules.
>
> Point 4b ("if there is more than one such tuple...") seems like it
> needs some practical or theoretical justification - do you just not
> want to follow an update chain? If so, why? What would the error
> message actually be? And, to repeat myself: Why should an MVCC
> snapshot see more than one such tuple in the ordinary course of
> scanning a relation, since there is by definition a unique constraint
> that prevents this? Or maybe I'm wrong; maybe you don't even require
> that. That isn't clear.

In case (4b), like in case (2), you've done something like UPSERT tab
SET ... WHERE non_unique_index = 'value_appearing_more_than_once'.
You shouldn't do that, because you have only one replacement tuple
(embodied in the SET clause).
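The scan-first semantics under discussion (find the target by predicate; error on multiple visible matches; update a single match; otherwise insert) can be sketched over a toy in-memory table. This is purely an illustration of the proposed user-visible behavior, not of any PostgreSQL internals; all names are invented, and step 4 (updating a not-yet-visible tuple in violation of MVCC rules) has no analogue in this model:

```python
# Toy sketch of the scan-first UPSERT semantics: the table is a plain
# list of dicts, and "visible" simply means present in the list.

def upsert_scan_first(table, predicate, new_values):
    """table: list of dicts; predicate: row -> bool; new_values: dict."""
    matches = [row for row in table if predicate(row)]
    if len(matches) > 1:
        # Step 2: the predicate must uniquely identify the target row.
        raise ValueError("predicate matched more than one row")
    if len(matches) == 1:
        # Step 3: exactly one visible match -- update it in place.
        matches[0].update(new_values)
        return "updated"
    # Step 5: no visible match -- insert.  (Step 4, the not-yet-visible
    # in-progress tuple case, cannot be modeled without snapshots.)
    table.append(dict(new_values))
    return "inserted"
```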

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making joins involving ctid work for the benefit of UPSERT

From
David G Johnston
Date:
Peter Geoghegan-3 wrote
>> with semantics like this:
>>
>> 1. Search the table, using any type of scan you like, for a row
>> matching the given predicate.
> 
> Perhaps I've misunderstood, but this is fundamentally different to
> what I'd always thought would need to happen. I always understood that
> the UPSERT should be insert-driven, and so not really have an initial
> scan at all in the same sense that every Insert lacks one. Moreover,
> I've always thought that everyone agreed on that. We go to insert, and
> then in the course of inserting index tuples do something with dirty
> snapshots. That's how we get information on conflicts.

SQL standard notwithstanding, there really is no way to pick one
implementation and not have it be sub-optimal in some situations.  Unless
there is a high barrier why not introduce syntax:

SCAN FIRST; INSERT FIRST

that allows the user to specify the behavior that they expect would be most
efficient given their existing/new data ratio?


>> Having said all that, I believe the INSERT ON CONFLICT syntax is more
>> easily comprehensible than previous proposals.  But I still tend to
>> agree with Andres that an explicit UPSERT syntax or something like it,
>> that captures all of the MVCC games inside itself, is likely
>> preferable from a user standpoint, whatever the implementation ends up
>> looking like.
> 
> Okay then. If you both feel that way, I will come up with something
> closer to what you sketch. But for now I still strongly feel it ought
> to be driven by an insert. Perhaps I've misunderstood you entirely,
> though.

Getting a little syntax crazy here but having all of:

UPSERT [SCAN|INSERT] FIRST
INSERT ON CONFLICT UPDATE - same as INSERT FIRST
UPDATE ON MISSING INSERT - same as SCAN FIRST

with the corresponding two implementations would make the user interface
slightly more complicated, but adaptable to the actual data that the
user has.


You could basically perform a two-phase pass: run the user-requested
algorithm, then for all failures attempt the alternate algorithm, and
error if both fail.

I am not at all fluent on the concurrency issues here, and the MVCC
violations and retries that might be considered, but at a high level there
is disagreement here simply because both answers are "correct" and ideally
both can be provided to the user.
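The two-phase idea above has a simple shape, sketched here as a toy Python illustration (everything here is invented for the sketch; "failure" is just an exception raised by the per-row algorithm):

```python
# Toy sketch of the two-phase pass: run the user-requested algorithm
# first, then retry the failures with the alternate algorithm; if the
# alternate also fails for a row, that error propagates.

def two_phase_upsert(rows, primary, alternate):
    """primary/alternate: callables that raise on failure for a row."""
    failures = []
    for row in rows:
        try:
            primary(row)
        except Exception:
            failures.append(row)
    for row in failures:
        # Second pass: the alternate algorithm; error if it also fails.
        alternate(row)
```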

David J.






--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Making-joins-involving-ctid-work-for-the-benefit-of-UPSERT-tp5811919p5812640.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.



Re: Making joins involving ctid work for the benefit of UPSERT

From
Peter Geoghegan
Date:
On Wed, Jul 23, 2014 at 3:01 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Could you clarify that?  Does this mean that you feel that we
> should write to the heap before reading the index to see if the row
> will be a duplicate?  If so, I think that is a bad idea, since this
> will sometimes be used to apply a new data set which hasn't changed
> much from the old, and that approach will perform poorly for this
> use case, causing a lot of bloat.  It certainly would work well for
> the case that most of the rows are expected to be INSERTs rather
> than DELETEs, but I'm not sure that's justification for causing
> extreme bloat in the other cases.

No, I think we should stagger ordinary index insertion: we first lock
the indexes using the existing heavyweight page-level index locks,
then, if successful, insert a heap tuple before finally inserting the
index tuples. My design doesn't cause bloat
under any circumstances. Heikki's design, which he sketched with an
actual POC implementation, involved possible bloat in the event of a
conflict. He also had to go and delete the promise tuple (from within
ExecInsert()) in the event of the conflict before row locking in order
to prevent unprincipled deadlocking. Andres wanted to do something
else similarly involving "promise tuples", where the xid on the
inserter was stored in indexes with a special flag. That could also
cause bloat. I think that could be particularly bad when conflicts
necessitate visiting indexes one by one to kill promise tuples, as
opposed to just killing one heap tuple as in the case of Heikki's
design.

Anyway, both of those designs, and my own design are insert-driven.
The main difference between the design that Heikki sketched and my own
is that mine does not cause bloat, but is more invasive to the nbtree
code (but less invasive to a lot of other places to make the
deadlocking-ultimately-conflicting tuple killing work). But I believe
that Heikki's design is identical to my own in terms of user-visible
semantics. That said, his design was just a sketch, and it wouldn't
be fair to hold him to it.

> Also, just a reminder that I'm going to squawk loudly if the
> implementation does not do something fairly predictable and sane
> for the case that the table has more than one UNIQUE index and you
> attempt to UPSERT a row that is a duplicate of one row on one of
> the indexes and a different row on a different index.

Duly noted.  :-)

I think that it's going to have to support that one way or the other.
It might be the case that I'll want to make the choice of unique index
optionally "implicit", but it's clear that we want to be able to
specify a specific unique index in one form or another. Actually, I've
already added that. It's just optional right now. I haven't found a
better way than just specifying the name of the unique index in
DML, which is ugly; that's the main reason I want to make it
optional. Perhaps we can overcome this.

-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Wed, Jul 23, 2014 at 4:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
> > On Wed, Jul 23, 2014 at 9:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> >> 2. If you find more than one tuple that is visible to your scan, error.
> >
> > This point seems to concern making the UPSERT UPDATE only affect zero
> > or one rows. Is that right? If so, why do you think that's a useful
> > constraint to impose?
> 
> Because nobody wants an operation to either insert 1 tuple or update
> n>=1 tuples.  The intention is that the predicate should probably be
> something like WHERE unique_key = 'some_value', but you can use
> something else.  So it's kinda like saying which index you care about
> for a particular operation, except without having to explicitly name
> an index.  But in any event you should use a predicate that uniquely
> identifies the tuple you want to update.

This seemed a nice idea when I first read it earlier today, but now I'm
not so sure.  Are you saying that it wouldn't be allowed to use an
UPSERT with some sort of join, such that each joined row would produce
either one insert or one update?  To clarify: suppose I import some
external data into a temp table, then run UPSERT "USING" that table so
that the rows end up in a permanent table; some of the rows might be
already in the permanent table, some others might not.  I would hope
that by the time UPSERT is done, all the rows are in the permanent
table.  Would that raise an error, with your proposed design?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Making joins involving ctid work for the benefit of UPSERT

From
Peter Geoghegan
Date:
On Wed, Jul 23, 2014 at 3:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> To the last question, yes.  To the first point, I'm not bent on this
> particular syntax, but I am +1 for the idea that the command is
> something specifically upsert-like rather than something more generic
> like an ON CONFLICT SELECT that lets you do arbitrary things with the
> returned rows.

FWIW I wasn't proposing that you'd be able to do arbitrary things.
There'd be a few restrictions, such as forbidding unexpected DML
commands, and requiring that only a special update reference the
special rejected-projecting CTE. They just wouldn't be arbitrary
restrictions. But that's not that interesting to me anymore; you and
Andres want a dedicated DML command with the update part built in,
that isn't as flexible. Okay, fine. I don't think it makes the
implementation any easier, but that's what I'll do.

> It's certainly arguable whether you should INSERT and then turn failures
> into an update or try to UPDATE and then turn failures into an INSERT;
> we might even want to have both options available, though that smells
> a little like airing too much of our dirty laundry.  But I think I
> generally favor optimizing for the UPDATE case for more or less the
> same reasons Kevin articulated.

I don't see the connection between this and Kevin's remarks. And FWIW,
I don't see a reason to favor inserts or updates. Fortunately, what I
have balances both cases very well, and doesn't cause bloat. The work
of descending the index to lock it isn't wasted if an update is
required. My implementation decides to either insert or update at
literally the latest possible moment.

>> For one thing we cannot use any kind of scan unless there is a new
>> mechanism for seeing not-yet-visible tuples, like some kind of
>> non-MVCC snapshot that those scan nodes can selectively use. Now, I
>> guess that you intend that in this scenario you go straight to 5, and
>> then your insert finds the not-yet-visible conflict. But it's not
>> clear that when 5 fails, and we pick up from 1 that we then get a new
>> MVCC Snapshot or something else that gives us a hope of succeeding
>> this time around. And it seems quite wasteful to not use knowledge of
>> conflicts from the insert at all - that's one thing I'm trying to
>> address here; removing unnecessary index scans when we actually know
>> what our conflict TIDs are.
>
> Here you seem to be suggesting that I intended to propose your existing
> design rather than something else, which I didn't.  In this design,
> you find the conflict (at most one) by scanning for the tuple to be
> updated.

Yes, but what if you don't see a conflict because it isn't visible to
your snapshot, and then you insert, and only then (step 5), presumably
with a dirty snapshot, you find a conflict? How does the loop
terminate if that brings you back to step 1 with the same MVCC
snapshot feeding the update?

>>> 2. If you find more than one tuple that is visible to your scan, error.
>>
>> This point seems to concern making the UPSERT UPDATE only affect zero
>> or one rows. Is that right? If so, why do you think that's a useful
>> constraint to impose?
>
> Because nobody wants an operation to either insert 1 tuple or update
> n>=1 tuples.  The intention is that the predicate should probably be
> something like WHERE unique_key = 'some_value', but you can use
> something else.  So it's kinda like saying which index you care about
> for a particular operation, except without having to explicitly name
> an index.  But in any event you should use a predicate that uniquely
> identifies the tuple you want to update.

I agree that you want to uniquely identify each tuple. What I meant
was, why should we not be able to upsert multiple rows in a single
command? What's wrong with that?

>> Point 4b ("if there is more than one such tuple...") seems like it
>> needs some practical or theoretical justification - do you just not
>> want to follow an update chain? If so, why? What would the error
>> message actually be? And, to repeat myself: Why should an MVCC
>> snapshot see more than one such tuple in the ordinary course of
>> scanning a relation, since there is by definition a unique constraint
>> that prevents this? Or maybe I'm wrong; maybe you don't even require
>> that. That isn't clear.
>
> In case (4b), like in case (2), you've done something like UPSERT tab
> SET ... WHERE non_unique_index = 'value_appearing_more_than_once'.
> You shouldn't do that, because you have only one replacement tuple
> (embodied in the SET clause).

Oh, I totally agree that we should throw an error if any one row is
affected more than once by the same command. Indeed, SQL MERGE
requires this, since to do otherwise would leave the final value of
the row unspecified. It seemed to me from your original explanation of
point 4b that you were saying something much stronger. But if that's
all you meant, then I agree.

-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From
Peter Geoghegan
Date:
On Wed, Jul 23, 2014 at 9:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'd suggest something like:
>
> UPSERT table SET col = value [, ...] WHERE predicate;

I think I see another problem with this. The UPDATE must provide a
WHERE clause Var on a unique indexed column (let's say it's
constrained to provide a "(Var [unique-indexed opclass support
function 3 op] Const)" qual during parse analysis because you asked
for UPSERT). But it can also affect 0 rows for other reasons, since
this UPDATE qual can have arbitrary other expressions. So in general
you have any number of reasons for not updating, which implies that
you must insert, which might not be possible because there might even
still be a duplicate key violation in a not-yet-visible-to-you row (even
just in the unique index you intended to merge on). Whereas, when
inserting, there is exactly one reason (per row) to go and update - a
conflict (i.e. a would-be duplicate violation). And this is a conflict
that is meaningful to every transaction, regardless of their snapshot,
since it represents an objective fact about the physical presence of
an index tuple.

So, do you make the UPDATE differentiate between different reasons for
the UPDATE failing, only inserting when an UPSERT would have happened
had you omitted the extra stuff in your UPSERT predicate? Can you
really differentiate anyway? And isn't the choice to do the insert
based on what your MVCC snapshot happens to see - rather than
something that there is necessarily objective truth on, the physical
presence of a duplicate tuple in an index - rather arbitrary? It's a
different situation to my implementation, where you do an insert-like
thing, and then find a conflict row to lock and then update, because
at that point the row is successfully locked, and the WHERE clause may
be evaluated against rejects and the *most recent version* of the
locked, conflict-causing row. The most recent, locked version is not
arbitrary at all. That's the version you ought to evaluate a predicate
against before finally deciding to UPDATE.

I assume you agree with my view that UPSERT should always insert or
update. This kind of business sounds closer to SQL MERGE, where an
insert may not drive things, and you accept these kinds of anomalies
in exchange for a lot more flexibility in not having inserts drive
things. Did you happen to read my post on that?
-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From
Robert Haas
Date:
On Wed, Jul 23, 2014 at 7:32 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>> Because nobody wants an operation to either insert 1 tuple or update
>> n>=1 tuples.  The intention is that the predicate should probably be
>> something like WHERE unique_key = 'some_value', but you can use
>> something else.  So it's kinda like saying which index you care about
>> for a particular operation, except without having to explicitly name
>> an index.  But in any event you should use a predicate that uniquely
>> identifies the tuple you want to update.
>
> This seemed a nice idea when I first read it earlier today, but now I'm
> not so sure.  Are you saying that it wouldn't be allowed to use an
> UPSERT with some sort of join, such that each joined row would produce
> either one insert or one update?  To clarify: suppose I import some
> external data into a temp table, then run UPSERT "USING" that table so
> that the rows end up in a permanent table; some of the rows might be
> already in the permanent table, some others might not.  I would hope
> that by the time UPSERT is done, all the rows are in the permanent
> table.  Would that raise an error, with your proposed design?

Yeah, my syntax didn't have a mechanism for that.  I agree we should
have one.  I was just brainstorming.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making joins involving ctid work for the benefit of UPSERT

From
Robert Haas
Date:
On Wed, Jul 23, 2014 at 7:35 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> It's certainly arguable whether you should INSERT and then turn failures
>> into an update or try to UPDATE and then turn failures into an INSERT;
>> we might even want to have both options available, though that smells
>> a little like airing too much of our dirty laundry.  But I think I
>> generally favor optimizing for the UPDATE case for more or less the
>> same reasons Kevin articulated.
>
> I don't see the connection between this and Kevin's remarks. And FWIW,
> I don't see a reason to favor inserts or updates. Fortunately, what I
> have balances both cases very well, and doesn't cause bloat. The work
> of descending the index to lock it isn't wasted if an update is
> required. My implementation decides to either insert or update at
> literally the latest possible moment.

AFAIUI, this is because your implementation uses lwlocks in a way that
Andres and I both find unacceptable.  My suspicion is that any version
of this that ends up getting committed is going to involve a risk of
bloat in cases involving retries, and I think it will be easier to
minimize bloat in an update-driven implementation.  But I suppose
that's speculative.

>> Here you seem to be suggesting that I intended to propose your existing
>> design rather than something else, which I didn't.  In this design,
>> you find the conflict (at most one) by scanning for the tuple to be
>> updated.
>
> Yes, but what if you don't see a conflict because it isn't visible to
> your snapshot, and then you insert, and only then (step 5), presumably
> with a dirty snapshot, you find a conflict? How does the loop
> terminate if that brings you back to step 1 with the same MVCC
> snapshot feeding the update?

Good point.  Maybe the syntax should be something like:

UPSERT table (keycol [, keycol] ...) { VALUES (val [, val] ...) [,
...] | select_query }

That would address both the concern about being able to pipe multiple
tuples through it and the point you just raised.  We look for a row
that matches each given tuple on the key columns; if one is found, we
update it; if none is found, we insert.
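That key-matching behavior can be sketched over a toy table; this is only an illustration of the visible semantics just described (match each incoming tuple on the key columns, update if found, insert otherwise), with all names invented, not a proposed implementation:

```python
# Toy sketch of UPSERT keyed on explicit key columns: for each incoming
# tuple, look for an existing row that matches on key_cols; update the
# match if found, otherwise insert the tuple.

def upsert_by_key(table, key_cols, tuples):
    """table: list of dicts; key_cols: list of column names."""
    for incoming in tuples:
        key = tuple(incoming[c] for c in key_cols)
        for row in table:
            if tuple(row[c] for c in key_cols) == key:
                row.update(incoming)          # found: update it
                break
        else:
            table.append(dict(incoming))      # none found: insert
```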

> I agree that you want to uniquely identify each tuple. What I meant
> was, why should we not be able to upsert multiple rows in a single
> command? What's wrong with that?

Nothing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making joins involving ctid work for the benefit of UPSERT

From
Peter Geoghegan
Date:
On Mon, Jul 28, 2014 at 8:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> AFAIUI, this is because your implementation uses lwlocks in a way that
> Andres and I both find unacceptable.

That's not the case. My implementation uses page-level heavyweight
locks. The nbtree AM used to use them for other stuff. Plenty of other
systems use index level locks managed by a heavyweight lock manager.

>>> Here you seem to be suggesting that I intended to propose your existing
>>> design rather than something else, which I didn't.  In this design,
>>> you find the conflict (at most one) by scanning for the tuple to be
>>> updated.
>>
>> Yes, but what if you don't see a conflict because it isn't visible to
>> your snapshot, and then you insert, and only then (step 5), presumably
>> with a dirty snapshot, you find a conflict? How does the loop
>> terminate if that brings you back to step 1 with the same MVCC
>> snapshot feeding the update?
>
> Good point.  Maybe the syntax should be something like:
>
> UPSERT table (keycol [, keycol] ...) { VALUES (val [, val] ...) [,
> ...] | select_query }
>
> That would address both the concern about being able to pipe multiple
> tuples through it and the point you just raised.  We look for a row
> that matches each given tuple on the key columns; if one is found, we
> update it; if none is found, we insert.

That basically is my design, except that (tangentially) yours risks
bloat in exchange for not having to use a real locking mechanism, and
has a different syntax. The parts of index insertion that I stagger
include an initial part that is more or less just an index scan. With
this design you'll have to set up things so that all
indexes can be directly accessed in the manner of ExecInsert() (get a
list of them from the relcache, open them in an order that avoids
possible deadlocks, etc). Why not just use ExecInsert()? I don't think
I'm alone in seeing things that way.

On a mostly unrelated note, I'll remind you of the reason that I felt
it was best to lock indexes. It wasn't so much about avoiding bloat as
it was about avoiding deadlocks. When I highlighted the issue,
Heikki's prototype, which did insert optimistically rather than
locking, was then made to go and physically "super delete" the
upsert-insert conflicting heap tuple (inserted optimistically before
its index tuples), before going to lock a row, in order to avoid
unprincipled deadlocking. In contrast, my design just used a callback
that released page level heavyweight locks before going to lock a row.
Heikki's prototype involved making it so that *even someone else's
dirty snapshot* didn't see our dead speculatively-inserted heap tuple.

Anyway, making all that happen is fairly invasive to a bunch of places
that are just as critical as the nbtree code. I'm not saying it can't
be done, or even that it definitely shouldn't be, but taking an
approach that produces bloat, rather than locking values the way other
systems do (and, to a limited extent Postgres already does) is at
least way more invasive than it first appears. Plus,  I ask you to
consider that.

-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From
Peter Geoghegan
Date:
On Mon, Jul 28, 2014 at 10:43 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Plus,  I ask you to
> consider that.

Excuse me. I meant "Plus, you avoid bloat. I ask you to consider that".

-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From
Peter Geoghegan
Date:
On Mon, Jul 28, 2014 at 10:43 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On a mostly unrelated note, I'll remind you of the reason that I felt
> it was best to lock indexes. It wasn't so much about avoiding bloat as
> it was about avoiding deadlocks. When I highlighted the issue,
> Heikki's prototype, which did insert optimistically rather than
> locking, was then made to go and physically "super delete" the
> upsert-insert conflicting heap tuple (inserted optimistically before
> its index tuples), before going to lock a row, in order to avoid
> unprincipled deadlocking. In contrast, my design just used a callback
> that released page level heavyweight locks before going to lock a row.
> Heikki's prototype involved making it so that *even someone else's
> dirty snapshot* didn't see our dead speculatively-inserted heap tuple.
>
> Anyway, making all that happen is fairly invasive to a bunch of places
> that are just as critical as the nbtree code.

I think I should be more concrete about why this is more complicated
than it first appears. Andres said at pgCon that he would still go
with a design where "promise tuples" are inserted into indexes ahead
of any heap tuple (which differs from Heikki's prototype, where heap
tuple insertion occurs first). Accounting for deadlocking issues could
be particularly problematic with that design, since we must kill
tuples from each index in turn before row locking. In any case the
need to efficiently *release* locks must be weighed carefully. It
isn't obvious, but we must release locks if there is a conflict.

After Heikki acknowledged the problem [1], he produced a revision
addressing it. The details of the workaround and a patch were posted
[2]. I think it's fair to say that this area is a lot messier than it
first appears. If anyone wants to propose an alternative design, they
are of course quite welcome to, but I ask that the very real
difficulties with those designs be acknowledged. AFAICT, only Heikki
has independently acknowledged these issues on list.

In case it isn't obvious, let me be clear about what I care about
here. I feel it's important to get something that is easy to reason
about - you write a DML statement, and within certain fairly
reasonable parameters Postgres does the rest. I think we should not
accept something that may even occasionally throw deadlock errors,
unique violations, or RC-level serialization failures through no
fault of the user. That would be inferior to the plpgsql looping
pattern we promote, which does the right thing and avoids all of this.
It would be awful to have to tell users who hit problems like this
that they should just stop doing so much upserting, or use the old
pattern.

[1] http://www.postgresql.org/message-id/52B4AAF0.5090806@vmware.com
[2] http://www.postgresql.org/message-id/52D00D2D.6030307@vmware.com

-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From
Robert Haas
Date:
On Mon, Jul 28, 2014 at 1:43 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Jul 28, 2014 at 8:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> AFAIUI, this is because your implementation uses lwlocks in a way that
>> Andres and I both find unacceptable.
>
> That's not the case. My implementation uses page-level heavyweight
> locks. The nbtree AM used to use them for other stuff. Plenty of other
> systems use index level locks managed by a heavyweight lock manager.

Oh, OK.  I think I missed that development somewhere.

>> Good point.  Maybe the syntax should be something like:
>>
>> UPSERT table (keycol [, keycol] ...) { VALUES (val [, val] ...) [,
>> ...] | select_query }
>>
>> That would address both the concern about being able to pipe multiple
>> tuples through it and the point you just raised.  We look for a row
>> that matches each given tuple on the key columns; if one is found, we
>> update it; if none is found, we insert.
>
> That basically is my design, except that (tangentially) yours risks
> bloat in exchange for not having to use a real locking mechanism, and
> has a different syntax.

I think it would be advisable to separate the syntax from the
implementation.  Presumably you can make your implementation use some
reasonable syntax we can all agree on, and conversely my proposed
syntax could be made to have a different set of semantics.  There's
some connection between the syntax and semantics, of course, but it's
not 100%.  I mention this because I was mostly concerned with getting
to a reasonable syntax proposal, not so much the implementation
details.  It may well be that your implementation details are perfect
at this point; I don't know because I haven't looked, and I'm not an
expert on that area of the code anyway.  But I have looked at your
syntax, which I wasn't altogether keen on.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making joins involving ctid work for the benefit of UPSERT

From
Peter Geoghegan
Date:
On Tue, Jul 29, 2014 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think it would be advisable to separate the syntax from the
> implementation.  Presumably you can make your implementation use some
> reasonable syntax we can all agree on, and conversely my proposed
> syntax could be made to have a different set of semantics.  There's
> some connection between the syntax and semantics, of course, but it's
> not 100%.  I mention this because I was mostly concerned with getting
> to a reasonable syntax proposal, not so much the implementation
> details.  It may well be that your implementation details are perfect
> at this point; I don't know because I haven't looked, and I'm not an
> expert on that area of the code anyway.  But I have looked at your
> syntax, which I wasn't altogether keen on.


Fair enough. I think the syntax should reflect the fact that upserts
are driven by inserts, though. Users will get into trouble with a
syntax that allows a predicate that is evaluated before any rows are
locked.

-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From
Robert Haas
Date:
On Tue, Jul 29, 2014 at 1:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Jul 29, 2014 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think it would be advisable to separate the syntax from the
>> implementation.  Presumably you can make your implementation use some
>> reasonable syntax we can all agree on, and conversely my proposed
>> syntax could be made to have a different set of semantics.  There's
>> some connection between the syntax and semantics, of course, but it's
>> not 100%.  I mention this because I was mostly concerned with getting
>> to a reasonable syntax proposal, not so much the implementation
>> details.  It may well be that your implementation details are perfect
>> at this point; I don't know because I haven't looked, and I'm not an
>> expert on that area of the code anyway.  But I have looked at your
>> syntax, which I wasn't altogether keen on.
>
> Fair enough. I think the syntax should reflect the fact that upserts
> are driven by inserts, though. Users will get into trouble with a
> syntax that allows a predicate that is evaluated before any rows are
> locked.

You've convinced me that only indexable predicates can be sensibly
used here.  I'm not sure that's the same as saying that upserts are
driven by inserts, though.  I'd tend to interpret that to mean
insert-the-heap-tuple-first, but I think what you're doing is
check-the-indexes-first.  The terminology in this area is tricky.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making joins involving ctid work for the benefit of UPSERT

From
Peter Geoghegan
Date:
On Wed, Jul 30, 2014 at 10:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> You've convinced me that only indexable predicates can be sensibly
> used here.  I'm not sure that's the same as saying that upserts are
> driven by inserts, though.  I'd tend to interpret that to mean
> insert-the-heap-tuple-first, but I think what you're doing is
> check-the-indexes-first.  The terminology in this area is tricky.

I'm glad. Indeed, I am proposing checking/locking indexes first, but
that is actually pretty similar to what Heikki did in his prototype
(inserting a heap tuple first), in that those heap tuples were
conceptually speculative/optimistic "value locks". Heikki's need to
"super delete" in the event of a conflict (by setting the speculative
heap tuple's xmin to InvalidTransactionId) demonstrates that.
Furthermore, Heikki's last revision ended up with places like
_bt_checkunique() looking in the procarray to see whether the xact to
wait on was still around and had a "speculative token", so that we
could wait for that token to be released rather than waiting for the
whole xact to commit or abort (avoiding unprincipled deadlocking
hinges upon this):

+ /*
+  * If it's a speculative insertion, wait for it to finish (ie.
+  * to go ahead with the insertion, or kill the tuple). Otherwise
+  * wait for the transaction to finish as usual.
+  */
+ if (speculativeToken)
+     SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+     XactLockTableWait(xwait);

So, as I said, what he ended up with was actually very close to what I
wanted to do. I felt it was more logical to just use an index page
lock, given that such locks have been used before and given the
limited code footprint (although I accept that, in general, altering
the nbtree code is best avoided). But Heikki did ultimately understand
the considerations that informed my design, and so ultimately IMV the
disagreement there was more or less just on that one detail. (Not that
the rest of the patch was perfect or anything, but it was useful to
have my design *understood*). You can hopefully see why I'd
characterize both of those designs as insertion driven.
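For illustration only, the insertion-driven protocol being discussed
might be sketched as a toy, single-session model. This is not
PostgreSQL code: the `SpeculativeUpsert` class, its in-memory heap and
unique index, and the return values are all invented for the sketch;
"super delete" here just models clearing a speculative tuple's xmin.

```python
INVALID_XID = 0  # stands in for InvalidTransactionId

class SpeculativeUpsert:
    """Toy model of an insert-driven upsert with speculative heap
    tuples: insert optimistically, and on a unique-index conflict
    "super delete" our own tuple and update the existing row."""

    def __init__(self):
        self.heap = []          # heap tuples: {'xmin', 'key', 'val'}
        self.unique_index = {}  # key -> heap position of the live tuple

    def upsert(self, xid, key, val):
        # 1. Optimistically insert a speculative heap tuple.
        self.heap.append({'xmin': xid, 'key': key, 'val': val})
        pos = len(self.heap) - 1
        # 2. Check the unique index for a conflicting live tuple.
        conflict = self.unique_index.get(key)
        if conflict is not None and self.heap[conflict]['xmin'] != INVALID_XID:
            # 3a. Conflict: "super delete" our speculative tuple by
            #     clearing its xmin, then update the existing row.
            self.heap[pos]['xmin'] = INVALID_XID
            self.heap[conflict]['val'] = val
            return 'updated'
        # 3b. No conflict: publish the insertion in the unique index.
        self.unique_index[key] = pos
        return 'inserted'
```

The point of the sketch is that the heap tuple itself acts as the
"value lock": it exists before we know whether the statement will turn
out to be an insert or an update.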

The terminology is very tricky here, though. It might even be the
trickiest aspect of the whole thing, because at times I've had a hard
time communicating what I mean effectively on list. Anyway, I'll be
sure to emphasize the distinction in future.

-- 
Peter Geoghegan



Re: Making joins involving ctid work for the benefit of UPSERT

From
Bruce Momjian
Date:
On Mon, Jul 28, 2014 at 11:37:07AM -0400, Robert Haas wrote:
> > Yes, but what if you don't see a conflict because it isn't visible to
> > your snapshot, and then you insert, and only then (step 5), presumably
> > with a dirty snapshot, you find a conflict? How does the loop
> > terminate if that brings you back to step 1 with the same MVCC
> > snapshot feeding the update?
> 
> Good point.  Maybe the syntax should be something like:
> 
> UPSERT table (keycol [, keycol] ...) { VALUES (val [, val] ...) [,
> ...] | select_query }

One idea would be to allow UPSERT with constants (single row), and use
CTEs with a SELECT or INSERT/RETURNING for multi-row upserts.
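The loop-termination hazard quoted at the top of this message can be
modeled in a few lines. This is purely illustrative and not
PostgreSQL's actual snapshot machinery: `visible`, `upsert_loop`, the
snapshot/committed xid sets, and the sentinel xid 99 are all invented
for the sketch. It only shows why re-checking with the original MVCC
snapshot can loop forever, while a dirty-snapshot re-check terminates.

```python
def visible(snapshot_xids, row_xmin):
    """Toy MVCC rule: a row is visible to a snapshot iff its inserting
    xact had committed when the snapshot was taken."""
    return row_xmin in snapshot_xids

def upsert_loop(table, key, snapshot_xids, committed_xids, use_dirty,
                max_iters=5):
    """Toy retry loop. 'table' maps key -> xmin of the row holding it.
    Returns the action taken, or 'livelock' if the loop never ends."""
    for _ in range(max_iters):
        xmin = table.get(key)
        if xmin is not None:
            # Re-check with either the original MVCC snapshot or a
            # dirty snapshot that sees any committed row.
            seen = (xmin in committed_xids) if use_dirty \
                   else visible(snapshot_xids, xmin)
            if seen:
                return 'update'
        if xmin is None:
            table[key] = 99  # our own xid; the insert succeeds
            return 'insert'
        # Unique violation against a row our MVCC snapshot cannot see:
        # go back to step 1 and try again.
    return 'livelock'
```

With a conflicting row committed after our snapshot was taken, the
MVCC re-check never sees it and the loop spins; the dirty re-check
finds it and the statement resolves to an update.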

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +