Thread: [HACKERS] INSERT ... ON CONFLICT () SELECT

[HACKERS] INSERT ... ON CONFLICT () SELECT

From
Matt Pulver
Date:
Hello,

I am looking to add a new language feature that returns the rows that conflict on an INSERT, and would appreciate feedback and guidance on this.

Here is an example.

To implement a get_or_create_id() function, this is how it must currently be done:

CREATE TABLE example (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    value FLOAT);
CREATE UNIQUE INDEX ON example (name);

CREATE FUNCTION get_or_create_id(_name TEXT) RETURNS INT AS
$$
WITH get AS (
    SELECT id FROM example WHERE name=_name
), new AS (
    INSERT INTO example (name) VALUES (_name)
    ON CONFLICT (name) DO NOTHING
    RETURNING id
)
SELECT id FROM get
UNION ALL
SELECT id FROM new
$$
LANGUAGE sql;

SELECT get_or_create_id('foo'); -- 1
SELECT get_or_create_id('bar'); -- 2
SELECT get_or_create_id('foo'); -- 1

With the proposed "INSERT ... ON CONFLICT () SELECT" feature, the get_or_create_id() function is simplified to:

CREATE FUNCTION get_or_create_id(_name TEXT) RETURNS INT AS
$$
INSERT INTO example (name) VALUES (_name)
ON CONFLICT (name) DO SELECT
RETURNING id
$$
LANGUAGE sql;

In the case of a CONFLICT, the selected rows are exactly those same rows that would be operated on by an ON CONFLICT () DO UPDATE clause. These rows are then made available to the RETURNING clause in the same manner. Just like "DO NOTHING", the "DO SELECT" clause takes no arguments. It only makes the conflicting rows available to the RETURNING clause.

Tom Lane has previously responded to a similar request which was ill-defined, especially in the context of exclusion constraints. I believe that by SELECTing exactly those same rows that an UPDATE clause would on a CONFLICT, this becomes well-defined, even with exclusion constraints.

Feedback/guidance is most welcome.

Best regards,
Matt

Re: [HACKERS] INSERT ... ON CONFLICT () SELECT

From
Peter Geoghegan
Date:
On Sat, Jun 17, 2017 at 7:49 AM, Matt Pulver <mpulver@unitytechgroup.com> wrote:
> With the proposed "INSERT ... ON CONFLICT () SELECT" feature, the
> get_or_create_id() function is simplified to:

Are you locking the existing rows? Because otherwise, the
determination that they're conflicting can become obsolete immediately
afterwards. (I guess you would be.)

The problem with this design and similar designs is that presumably
the user is sometimes projecting the conflicting rows with the
intention of separately updating them in a wCTE. That might not work,
because only ON CONFLICT doesn't use the MVCC snapshot, in order to
ensure that an UPDATE is guaranteed when an INSERT cannot go ahead.
That isn't what you're doing in the example you gave, but surely some
users would try to do things like that, and get very confused.

I think that what you propose to do here would likely create a lot of
confusion by mixing MVCC semantics with special UPSERT visibility
semantics ("show me the latest row version visible to any possible
snapshot for the special update") even without a separate UPDATE, in
fact. Would you be okay if "id" appeared duplicated in the rows you
project in your new syntax, even when there is a separate unique
constraint on that column? I suppose that there is some risk of things
like that today, but this would make the "sleight of hand" used by ON
CONFLICT DO UPDATE more likely to cause problems.

-- 
Peter Geoghegan



Re: [HACKERS] INSERT ... ON CONFLICT () SELECT

From
Matt Pulver
Date:
On Sat, Jun 17, 2017 at 9:55 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Sat, Jun 17, 2017 at 7:49 AM, Matt Pulver <mpulver@unitytechgroup.com> wrote:
> With the proposed "INSERT ... ON CONFLICT () SELECT" feature, the
> get_or_create_id() function is simplified to:

Are you locking the existing rows? Because otherwise, the
determination that they're conflicting can become obsolete immediately
afterwards. (I guess you would be.)

If that is required in order to return the rows in their conflicted state, then yes.


The problem with this design and similar designs is that presumably
the user is sometimes projecting the conflicting rows with the
intention of separately updating them in a wCTE. That might not work,
because only ON CONFLICT doesn't use the MVCC snapshot, in order to
ensure that an UPDATE is guaranteed when an INSERT cannot go ahead.
That isn't what you're doing in the example you gave, but surely some
users would try to do things like that, and get very confused.

Ultimately the proposed "INSERT ... ON CONFLICT () DO SELECT" syntax is still an INSERT statement, not a SELECT, so a user should not expect rows returned from it to be available for UPDATE/DELETE in another part of a wCTE. Anyone who understands this behavior for an INSERT statement, let alone the current "INSERT ... ON CONFLICT DO UPDATE" should not be too surprised if the same thing applies to the new "INSERT ... ON CONFLICT DO SELECT".


I think that what you propose to do here would likely create a lot of
confusion by mixing MVCC semantics with special UPSERT visibility
semantics ("show me the latest row version visible to any possible
snapshot for the special update") even without a separate UPDATE, in
fact. Would you be okay if "id" appeared duplicated in the rows you
project in your new syntax, even when there is a separate unique
constraint on that column? I suppose that there is some risk of things
like that today, but this would make the "sleight of hand" used by ON
CONFLICT DO UPDATE more likely to cause problems.

 Good point. Here is an example using the example table from my previous email:

INSERT INTO example (name) VALUES ('foo'), ('foo')
ON CONFLICT (name) DO SELECT
RETURNING *

Here are a couple options of how to handle this:

1) Return two identical rows (with the same id).
2) Produce an error, with error message:
"ERROR:  ON CONFLICT DO SELECT command cannot reference row a second time
HINT:  Ensure that no rows proposed for insertion within the same command have duplicate constrained values."

This would be nearly identical to the existing error message that is produced when running:

INSERT INTO example (name) VALUES ('foo'), ('foo')
ON CONFLICT (name) DO UPDATE SET value=1
RETURNING *

which gives the error message:
"ERROR:  ON CONFLICT DO UPDATE command cannot affect row a second time
HINT:  Ensure that no rows proposed for insertion within the same command have duplicate constrained values."

Technically, an error doesn't need to be produced for the above "ON CONFLICT DO SELECT" statement - I think it is still perfectly well-defined to return duplicate rows as in option 1. Option 2 is more for the benefit of the user who is probably doing something wrong by attempting to INSERT a set of rows that violate a constraint. What do you think would be best?

Thank you for the discussion.

Best regards,
Matt

Re: [HACKERS] INSERT ... ON CONFLICT () SELECT

From
Peter Geoghegan
Date:
On Sun, Jun 18, 2017 at 4:33 AM, Matt Pulver <mpulver@unitytechgroup.com> wrote:
> This would be nearly identical to the existing error message that is
> produced when running:
>
> INSERT INTO example (name) VALUES ('foo'), ('foo')
> ON CONFLICT (name) DO UPDATE SET value=1
> RETURNING *
>
>
> which gives the error message:
> "ERROR:  ON CONFLICT DO UPDATE command cannot affect row a second time
> HINT:  Ensure that no rows proposed for insertion within the same command
> have duplicate constrained values."

FWIW, the fact that ON CONFLICT DO UPDATE will never update a row that
it itself inserted has proved valuable. For example, transition
tables, which appeared within statement triggers, have no need to
consider the case where an inserted tuple appears for the INSERT
statement case, as well as the UPDATE statement case (in another
form).

> Technically, an error doesn't need to be produced for the above "ON CONFLICT
> DO SELECT" statement - I think it is still perfectly well-defined to return
> duplicate rows as in option 1.

Returning rows with duplicate values seems rather unorthodox.

-- 
Peter Geoghegan



Re: [HACKERS] INSERT ... ON CONFLICT () SELECT

From
Matt Pulver
Date:
On Sun, Jun 18, 2017 at 9:21 PM, Peter Geoghegan <pg@bowt.ie> wrote:
Returning rows with duplicate values seems rather unorthodox.

Ok, then option 2 it is.

In summary, this is what I am going to (attempt to) implement for the new syntax:

INSERT ...
ON CONFLICT (...) DO SELECT
RETURNING ...
  1. Rows that are in conflict are made available to the RETURNING clause. In other words, it is like an idempotent "ON CONFLICT DO UPDATE".
  2. Similarly, insertion sets that would cause the error "ON CONFLICT DO UPDATE command cannot affect row a second time" if it were an "ON CONFLICT DO UPDATE" statement will also cause a similar error for "ON CONFLICT DO SELECT". This will prevent duplicate rows from being returned.
  3. Like an "ON CONFLICT DO UPDATE", the returned rows cannot be changed by another part of the wCTE, even if no actual insertions occurred.
Unless I have missed anything, I think all other issues have been adequately addressed. Since there are no red lights, I shall proceed. :)

Best regards,
Matt