Thread: Why is unique constraint needed for upsert?

Why is unique constraint needed for upsert?

From
Seamus Abshere
Date:
hi all,

Upsert is usually defined [1] in reference to violating a unique key:

> Insert, if unique constraint violation then update; or update, if not found then insert.

Is this theoretically preferable to just looking for a row that matches
certain criteria, updating it if found or inserting otherwise?

For an example of the latter approach, see MongoDB's flavor of upsert
[2]. You just give it a "query" and an "update". It seems to me this is
better because it functions correctly whether or not an index is in place.
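In SQL terms, the behavior I have in mind is roughly this sketch (using a
hypothetical dogs table with name and age columns; no index on name would
be required):

  -- update a matching row if one exists...
  UPDATE dogs SET age = 5 WHERE name = 'Jerry';
  -- ...otherwise insert one
  INSERT INTO dogs (name, age)
  SELECT 'Jerry', 5
  WHERE NOT EXISTS (SELECT 1 FROM dogs WHERE name = 'Jerry');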

Best, thanks,
Seamus


[1]
http://postgresql.uservoice.com/forums/21853-general/suggestions/245202-merge-upsert-replace
[2] http://docs.mongodb.org/manual/reference/method/db.collection.update/

--
Seamus Abshere, SCEA
https://github.com/seamusabshere


Re: Why is unique constraint needed for upsert?

From
John R Pierce
Date:
On 7/23/2014 10:21 AM, Seamus Abshere wrote:
> hi all,
>
> Upsert is usually defined [1] in reference to violating a unique key:
>
>> Insert, if unique constraint violation then update; or update, if not
>> found then insert.
>
> Is this theoretically preferable to just looking for a row that
> matches certain criteria, updating it if found or inserting otherwise?

what happens when two connections do this more or less concurrently, in
transactions?



--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: Why is unique constraint needed for upsert?

From
Igor Neyman
Date:
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of John R Pierce
Sent: Wednesday, July 23, 2014 1:32 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Why is unique constraint needed for upsert?

On 7/23/2014 10:21 AM, Seamus Abshere wrote:
> hi all,
>
> Upsert is usually defined [1] in reference to violating a unique key:
>
>> Insert, if unique constraint violation then update; or update, if not 
>> found then insert.
>
> Is this theoretically preferable to just looking for a row that 
> matches certain criteria, updating it if found or inserting otherwise?

what happens when two connections do this more or less concurrently, in transactions?



-- 
john r pierce                                      37N 122W
somewhere on the middle of the left coast


Well, that's exactly why OP prefers Mongo, which doesn't care about such "small" things as ACID.

Regards,
Igor Neyman

Re: Why is unique constraint needed for upsert?

From
Tom Lane
Date:
John R Pierce <pierce@hogranch.com> writes:
> On 7/23/2014 10:21 AM, Seamus Abshere wrote:
>> Upsert is usually defined [1] in reference to violating a unique key:
>> Is this theoretically preferable to just looking for a row that
>> matches certain criteria, updating it if found or inserting otherwise?

> what happens when two connections do this more or less concurrently, in
> transactions?

For the OP's benefit --- the subtext John left unstated is that the
unique-key mechanism has already solved the problem of preventing
concurrent updates from creating duplicate keys.  If we build a version of
UPSERT that doesn't rely on a unique index then it'll need some entirely
new mechanism to prevent concurrent key insertion.  (And if you don't care
about concurrent cases, you don't really need UPSERT ...)

            regards, tom lane


Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

From
Seamus Abshere
Date:
On 7/23/14 3:40 PM, Tom Lane wrote:
> John R Pierce <pierce@hogranch.com> writes:
>> On 7/23/2014 10:21 AM, Seamus Abshere wrote:
>>> Upsert is usually defined [1] in reference to violating a unique key:
>>> Is this theoretically preferable to just looking for a row that
>>> matches certain criteria, updating it if found or inserting otherwise?
>
>> what happens when two connections do this more or less concurrently, in
>> transactions?
>
> For the OP's benefit --- the subtext John left unstated is that the
> unique-key mechanism has already solved the problem of preventing
> concurrent updates from creating duplicate keys.  If we build a version of
> UPSERT that doesn't rely on a unique index then it'll need some entirely
> new mechanism to prevent concurrent key insertion.  (And if you don't care
> about concurrent cases, you don't really need UPSERT ...)

hi all,

What if we treat atomicity as optional? You could have extremely
readable syntax like:

> -- no guarantees, no index required
> UPSERT age = 5 INTO dogs WHERE name = 'Jerry';

> -- optionally tell us how you want to deal with collision
> UPSERT age = 3 INTO dogs WHERE name = 'Jerry' ON DUPLICATE KEY KEEP NEWEST;
> UPSERT age = 5 INTO dogs WHERE name = 'Jerry' ON DUPLICATE KEY KEEP NEWEST;

> -- only **require** (by throwing an error) a unique index or a locked table for queries like
> UPSERT age = age+1 INTO dogs WHERE name = 'Jerry';

Obviously this flies in the face of what most people say the
"fundamental Upsert property" is [1]

> At READ COMMITTED isolation level, you should always get an atomic insert or update [1]

I just think there are a lot of non-concurrent bulk loading and
processing workflows that could benefit from the performance advantages
of upsert (one trip to database).

Best, thanks,
Seamus

[1] http://www.pgcon.org/2014/schedule/events/661.en.html

--
Seamus Abshere, SCEA
https://github.com/seamusabshere


Seamus Abshere <seamus@abshere.net> writes:
> On 7/23/14 3:40 PM, Tom Lane wrote:
>> For the OP's benefit --- the subtext John left unstated is that the
>> unique-key mechanism has already solved the problem of preventing
>> concurrent updates from creating duplicate keys.

> What if we treat atomicity as optional?

You'll get a *much* warmer response to that kind of suggestion from
MongoDB or MySQL, no doubt.  PG is not in the business of optional
data integrity.

> I just think there are a lot of non-concurrent bulk loading and
> processing workflows that could benefit from the performance advantages
> of upsert (one trip to database).

What exactly is your argument for supposing that an UPSERT without an
underlying index would perform so well?  It seems much more likely
that it'd suck, because of having to do full-table scans to look
for existing rows.

            regards, tom lane


On 7/23/2014 1:45 PM, Seamus Abshere wrote:
>
> What if we treat atomicity as optional? You could have extremely
> readable syntax like:

atomicity is not and never will be optional in PostgreSQL.

>> -- no guarantees, no index required
>> UPSERT age = 5 INTO dogs WHERE name = 'Jerry';

and if there are several rows with name='Jerry', you'd want to update them
ALL?  if name isn't indexed, this will, as Tom suggests, require a FULL
table scan, and it still will have issues with concurrency (connection 1
scans the table, finds nothing, starts to insert; connection 2 scans the
table, finds nothing, starts to insert; poof, now we have two records?!?).
If name *is* indexed and unique, this collision will cause an error at
commit for at least one of those connections.
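roughly, the timeline looks like this (a sketch with two sessions, A and B,
both at READ COMMITTED, against that same dogs table):

-- A: BEGIN;
-- A: SELECT 1 FROM dogs WHERE name = 'Jerry';           -- finds no row
-- B: BEGIN;
-- B: SELECT 1 FROM dogs WHERE name = 'Jerry';           -- finds no row
-- A: INSERT INTO dogs (name, age) VALUES ('Jerry', 5);
-- B: INSERT INTO dogs (name, age) VALUES ('Jerry', 5);
-- A: COMMIT;
-- B: COMMIT;
--
-- no unique index on name:  both commits succeed, two 'Jerry' rows
-- unique index on name:     B's INSERT blocks behind A's, then fails
--                           with a unique_violation once A commits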

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

From
Seamus Abshere
Date:
On 7/23/14 6:03 PM, John R Pierce wrote:
> On 7/23/2014 1:45 PM, Seamus Abshere wrote:
>> What if we treat atomicity as optional?
>
> atomicity is not and never will be optional in PostgreSQL.

I'm wondering what a minimal definition of upsert could be - possibly
separating concurrency handling out as a (rigorously defined) option for
those who need it.

>>> -- no guarantees, no index required
>>> UPSERT age = 5 INTO dogs WHERE name = 'Jerry';
>
> and if there's several rows with name='Jerry', you'd want to update them
> ALL ?  if name isn't indexed, this will, as Tom suggests, require a FULL
> table scan, and it still will have issues with concurrency

Ah, I was just saying, in terms of correctness, it seems to me that
upsert shouldn't NEED an index to work, just like you don't need an index
on "name" when you say WHERE name = 'Jerry' in SELECTs or INSERTs or
UPDATEs.

Appreciate the defense of data integrity in any case!!

Best,
Seamus

--
Seamus Abshere, SCEA
https://github.com/seamusabshere


Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

From
David G Johnston
Date:
seamusabshere wrote
>> At READ COMMITTED isolation level, you should always get an atomic insert
>> or update [1]
>
> I just think there are a lot of non-concurrent bulk loading and
> processing workflows that could benefit from the performance advantages
> of upsert (one trip to database).

Bulk load raw data into UNLOGGED staging table
LOCK production table
UPDATE matched records
INSERT unmatched records
UNLOCK production table
TRUNCATE staging table

This seems like it would be sufficient for "non-concurrent bulk loading"...
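In rough SQL that might look like the following sketch (dogs(name, age)
stands in for the production table, dogs_staging and the file path are
placeholders, and SHARE ROW EXCLUSIVE is one reasonable lock mode for
keeping other writers out):

CREATE UNLOGGED TABLE dogs_staging (name text, age int);

COPY dogs_staging FROM '/path/to/bulk_load.csv' WITH (FORMAT csv);

BEGIN;
LOCK TABLE dogs IN SHARE ROW EXCLUSIVE MODE;     -- "LOCK production table"
UPDATE dogs d                                    -- UPDATE matched records
   SET age = s.age
  FROM dogs_staging s
 WHERE d.name = s.name;
INSERT INTO dogs (name, age)                     -- INSERT unmatched records
SELECT s.name, s.age
  FROM dogs_staging s
 WHERE NOT EXISTS (SELECT 1 FROM dogs d WHERE d.name = s.name);
COMMIT;                                          -- lock is released at commit
TRUNCATE dogs_staging;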

David J.






Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

From
David G Johnston
Date:
seamusabshere wrote
> On 7/23/14 6:03 PM, John R Pierce wrote:
>> On 7/23/2014 1:45 PM, Seamus Abshere wrote:
>>> What if we treat atomicity as optional?
>>
>> atomicity is not and never will be optional in PostgreSQL.
>
> I'm wondering what a minimal definition of upsert could be - possibly
> separating concurrency handling out as a (rigorously defined) option for
> those who need it.

I don't know how you can avoid the implicit need for an "IF" in the
algorithm.  I guess if you had some way to force an INSERT to automatically
hide any previous entries/records with the same PK you could move the
checking to the read side of the equation - and deal with the necessary
periodic cleanup.  At this point you are basically implementing a Temporal
database...

If you leave the checking to occur during the write, why wouldn't you want
an index to make that go faster?  It isn't mandatory, but any performant
implementation is going to use one.

You can enforce a "unique constraint violation" without an index, so your
initial premise is wrong - though again, why would you want to?

Also, why do you assume MongoDB doesn't use an index to execute the supplied
query?

From your link:

"To prevent MongoDB from inserting the same document more than once, create
a unique index on the name field. With a unique index, if an application
issues a group of upsert operations, exactly one update() would successfully
insert a new document."

Given we do not have native UPSERT, I'm not sure where your question is
coming from anyway.  I'm not sure what the plans are for UPSERT at the
moment, but nothing prevents us from performing the UPSERT comparison on a
non-unique set of columns.  If the only unique index on a table is its
"serial" column then you would get behavior similar to MongoDB w/o a unique
index on "name".

Though that does re-up the question about what happens when you issue a
subsequent UPSERT and more than one matching record is returned... the most
logical answer being to apply the update to all matched records.

I have difficulty imagining a situation where this would be desirable.  If I
am using UPSERT I am defining a complete entity that I need to cause to
exist.  If three of them already exist there is some differentiating factor
between them that my UPSERT command would clobber.  The example given in the
MongoDB link is not a particularly convincing use-case.

David J.







On 7/23/14 6:50 PM, David G Johnston wrote:
> seamusabshere wrote
>> On 7/23/14 6:03 PM, John R Pierce wrote:
>>> On 7/23/2014 1:45 PM, Seamus Abshere wrote:
>>>> What if we treat atomicity as optional?
>>> atomicity is not and never will be optional in PostgreSQL.
>> I'm wondering what a minimal definition of upsert could be - possibly
>> separating concurrency handling out as a (rigorously defined) option for
>> those who need it.
> Given we do not have native UPSERT I'm not sure where your question is
> coming from anyway.  I'm not sure what the plans are for UPSERT at the
> moment but nothing prevents us from performing the UPSERT comparison on a
> non-unique set of columns.

hi David,

My argument lives and dies on the assumption that UPSERT would be useful
even if it was (when given with no options) just a macro for

>   UPDATE db SET b = data WHERE a = key;
>   IF NOT found THEN
>     INSERT INTO db(a,b) VALUES (key, data);
>   END IF;

Adding things like unique indexes would work like you would expect with
individual INSERTs or UPDATEs - your statement might raise an exception.
Then, going beyond, UPSERT would optionally support atomic "a = a+1"
stuff, special actions to take on duplicate keys, all the concurrency
stuff that people have been talking about.
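(For comparison, that fully concurrency-safe version is more or less the
hand-rolled retry loop people use today, much like the example in the
PL/pgSQL docs - a sketch, assuming a is the primary key of db, with
upsert_db just an illustrative name:)

  CREATE FUNCTION upsert_db(key int, data text) RETURNS void AS $$
  BEGIN
      LOOP
          -- first try to update the key
          UPDATE db SET b = data WHERE a = key;
          IF found THEN
              RETURN;
          END IF;
          -- not there, so try to insert; if another session inserts the
          -- same key concurrently, catch the error and retry the UPDATE
          BEGIN
              INSERT INTO db(a,b) VALUES (key, data);
              RETURN;
          EXCEPTION WHEN unique_violation THEN
              NULL;  -- loop back and try the UPDATE again
          END;
      END LOOP;
  END;
  $$ LANGUAGE plpgsql;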

IMO having such a complicated definition of what an upsert "must" be
makes it a unicorn when it could just be a sibling to INSERT and UPDATE.

Best,
Seamus


--
Seamus Abshere, SCEA
https://github.com/seamusabshere


On 7/23/2014 3:29 PM, Seamus Abshere wrote:
> My argument lives and dies on the assumption that UPSERT would be
> useful even if it was (when given with no options) just a macro for
>
>>   UPDATE db SET b = data WHERE a = key;
>>   IF NOT found THEN
>>     INSERT INTO db(a,b) VALUES (key, data);
>>   END IF;

but that won't work if two connections execute similar 'upserts'
concurrently.    both updates will see the record isn't there, then one
or the other insert will fail, depending on which transaction commits first.



--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

From
David G Johnston
Date:

seamusabshere wrote
> hi David,
>
> My argument lives and dies on the assumption that UPSERT would be useful
> even if it was (when given with no options) just a macro for
>
>>   UPDATE db SET b = data WHERE a = key;
>>   IF NOT found THEN
>>     INSERT INTO db(a,b) VALUES (key, data);
>>   END IF;
>
> Adding things like unique indexes would work like you would expect with
> individual INSERTs or UPDATEs - your statement might raise an exception.
> Then, going beyond, UPSERT would optionally support atomic "a = a+1"
> stuff, special actions to take on duplicate keys, all the concurrency
> stuff that people have been talking about.
>
> IMO having such a complicated definition of what an upsert "must" be
> makes it a unicorn when it could just be a sibling to INSERT and UPDATE.


Fair enough.  I'd personally much rather have a staging table and use writeable CTEs to implement something that simple - retrying on the off chance an error occurs.
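Something like this sketch, say, using the same dogs(name, age) example (it
is not race-free, which is why the retry is needed):

WITH updated AS (
    UPDATE dogs SET age = 5 WHERE name = 'Jerry'
    RETURNING name
)
INSERT INTO dogs (name, age)
SELECT 'Jerry', 5
WHERE NOT EXISTS (SELECT 1 FROM updated);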

I'd use UPSERT (probably still with a staging table) if I expect a high level of concurrency is going to force me to retry often and the implementation will handle that for me.

To be honest though I haven't given it that much thought as I've had little need for it.

David J.



On 7/23/14 7:45 PM, John R Pierce wrote:
> On 7/23/2014 3:29 PM, Seamus Abshere wrote:
>> My argument lives and dies on the assumption that UPSERT would be
>> useful even if it was (when given with no options) just a macro for
>>
>>>   UPDATE db SET b = data WHERE a = key;
>>>   IF NOT found THEN
>>>     INSERT INTO db(a,b) VALUES (key, data);
>>>   END IF;
>
> but that won't work if two connections execute similar 'upserts'
> concurrently.    both updates will see the record isn't there, then one
> or the other insert will fail, depending on which transaction commits
> first.

John,

Right - if you had a situation where that might happen, you would use a
slightly more advanced version of the UPSERT command (and/or add a
unique index).

UPSERT, in this conception and in its most basic form, would be subject
to many of the same (and more) concurrency concerns as basic INSERTs and
UPDATEs.

Providing options may be preferable to magically handling everything.

Best,
Seamus


--
Seamus Abshere, SCEA
+598 99 54 99 54
https://github.com/seamusabshere


On 7/23/2014 3:58 PM, Seamus Abshere wrote:
> Right - if you had a situation where that might happen, you would use
> a slightly more advanced version of the UPSERT command (and/or add a
> unique index).

a unique index wouldn't resolve the problem.  without one, you'd end up
with two records; with one, you'd end up with an error.

naive programmers never seem to expect concurrency, it's something that
just happens.

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast