Re: Making joins involving ctid work for the benefit of UPSERT - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Making joins involving ctid work for the benefit of UPSERT
Date:
Msg-id: CA+TgmoYELwsp2V1FUkrhY7ZFJduovbFGFF4z0=DvB+ONvtnK8A@mail.gmail.com
In response to: Re: Making joins involving ctid work for the benefit of UPSERT (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Making joins involving ctid work for the benefit of UPSERT
           Re: Making joins involving ctid work for the benefit of UPSERT
List: pgsql-hackers
On Wed, Jul 23, 2014 at 4:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Jul 23, 2014 at 9:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> This all seems completely to one side of Andres's point.  I think what
>> he's saying is: don't implement an SQL syntax of the form INSERT ON
>> CONFLICT and let people use that to implement upsert.  Instead,
>> directly implement an SQL command called UPSERT or similar.  That's
>> long been my intuition as well.  I think generality here is not your
>> friend.
>
> Sure, I was just making one fairly narrow point in relation to Andres'
> concern about executor structuring of the UPSERT.
>
>> I'd suggest something like:
>>
>> UPSERT table SET col = value [, ...] WHERE predicate;
>
> Without dismissing the overall concern shared by you and Andres, that
> particular update-driven syntax doesn't strike me as something that
> should be pursued. I would like to be able to insert or update more
> than a single row at a time, for one thing. For another, what exactly
> appears in the predicate? Is it more or less the same as an UPDATE's
> predicate?

To the last question, yes.  To the first point, I'm not bent on this
particular syntax, but I am +1 for the idea that the command is
something specifically upsert-like rather than something more generic
like an ON CONFLICT SELECT that lets you do arbitrary things with the
returned rows.
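
For concreteness, I'm imagining something written like this (hypothetical
syntax, and the table and column names are invented for the example):

    UPSERT accounts SET balance = 100 WHERE account_id = 7;

i.e. update the row with account_id = 7 if the scan finds one, and
otherwise insert one.  The command itself says "upsert"; it doesn't
expose a general hook for doing arbitrary things with conflicting rows.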

>> with semantics like this:
>>
>> 1. Search the table, using any type of scan you like, for a row
>> matching the given predicate.
>
> Perhaps I've misunderstood, but this is fundamentally different to
> what I'd always thought would need to happen. I always understood that
> the UPSERT should be insert-driven, and so not really have an initial
> scan at all in the same sense that every Insert lacks one. Moreover,
> I've always thought that everyone agreed on that. We go to insert, and
> then in the course of inserting index tuples do something with dirty
> snapshots. That's how we get information on conflicts.

It's certainly arguable whether you should INSERT and then turn failures
into an update or try to UPDATE and then turn failures into an INSERT;
we might even want to have both options available, though that smells
a little like airing too much of our dirty laundry.  But I think I
generally favor optimizing for the UPDATE case for more or less the
same reasons Kevin articulated.
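
To be clear about what "UPDATE and then turn failures into an INSERT"
looks like when people hand-roll it today, here's the usual PL/pgSQL
loop, much like the example in the docs (just a sketch; assumes a table
db(key int primary key, data text)):

    CREATE FUNCTION merge_db(k INT, v TEXT) RETURNS VOID AS
    $$
    BEGIN
        LOOP
            -- first, try to update an existing row
            UPDATE db SET data = v WHERE key = k;
            IF found THEN
                RETURN;
            END IF;
            -- no such row, so try to insert one; a concurrent session
            -- may insert the same key first, in which case we loop
            -- around and retry the UPDATE
            BEGIN
                INSERT INTO db (key, data) VALUES (k, v);
                RETURN;
            EXCEPTION WHEN unique_violation THEN
                NULL;  -- fall through and retry the UPDATE
            END;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

The semantics I sketched upthread amount, roughly, to having the server
do that loop for you.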

> For one thing we cannot use any kind of scan unless there is a new
> mechanism for seeing not-yet-visible tuples, like some kind of
> non-MVCC snapshot that those scan nodes can selectively use. Now, I
> guess that you intend that in this scenario you go straight to 5, and
> then your insert finds the not-yet-visible conflict. But it's not
> clear that when 5 fails, and we pick up from 1 that we then get a new
> MVCC Snapshot or something else that gives us a hope of succeeding
> this time around. And it seems quite wasteful to not use knowledge of
> conflicts from the insert at all - that's one thing I'm trying to
> address here; removing unnecessary index scans when we actually know
> what our conflict TIDs are.

Here you seem to be suggesting that I intended to propose your existing
design rather than something else, which I didn't.  In this design,
you find the conflict (at most one) by scanning for the tuple to be
updated.

>> 2. If you find more than one tuple that is visible to your scan, error.
>
> This point seems to concern making the UPSERT UPDATE only affect zero
> or one rows. Is that right? If so, why do you think that's a useful
> constraint to impose?

Because nobody wants an operation to either insert 1 tuple or update
n>=1 tuples.  The intention is that the predicate should probably be
something like WHERE unique_key = 'some_value', but you can use
something else.  So it's kinda like saying which index you care about
for a particular operation, except without having to explicitly name
an index.  But in any event you should use a predicate that uniquely
identifies the tuple you want to update.
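
Concretely (same invented names and hypothetical syntax as above):

    -- fine: account_id has a unique index, so the predicate matches
    -- at most one visible row
    UPSERT accounts SET balance = 100 WHERE account_id = 7;

    -- errors under point 2 if more than one row has region = 'EMEA',
    -- because there is only one replacement tuple to apply
    UPSERT accounts SET balance = 100 WHERE region = 'EMEA';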

>> 3. If you find exactly one tuple that is visible to your scan, update it.  Stop.
>> 4. If you find no tuple that is visible to your scan, but you do find
>> at least one tuple whose xmin has committed and whose xmax has not
>> committed, then (4a) if the isolation level is REPEATABLE READ or
>> higher, error; (4b) if there is more than one such tuple, error; else
>> (4c) update it, in violation of normal MVCC visibility rules.
>
> Point 4b ("if there is more than one such tuple...") seems like it
> needs some practical or theoretical justification - do you just not
> want to follow an update chain? If so, why? What would the error
> message actually be? And, to repeat myself: Why should an MVCC
> snapshot see more than one such tuple in the ordinary course of
> scanning a relation, since there is by definition a unique constraint
> that prevents this? Or maybe I'm wrong; maybe you don't even require
> that. That isn't clear.

In case (4b), like in case (2), you've done something like UPSERT tab
SET ... WHERE non_unique_index = 'value_appearing_more_than_once'.
You shouldn't do that, because you have only one replacement tuple
(embodied in the SET clause).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


