Thread: UPSERT

UPSERT

From
Jonathan Scher
Date:
Hello,

I'd like to work on TODO item:> Add REPLACE or UPSERT command that does UPDATE, or on failure, INSERT

could you please tell me if I'm going in the right way?

There are some different syntaxes possible, but MySQL has an interesting 
one here:
http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html

INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1;
This allow to make an insert, and if the key is already there to modify 
the value depending on the current one.

Then I have two choices possible:
- Search for existing tuples among key or unique constraint, then if 
nothing is found, insert it.
- Try to insert a new row, catch if there is any error, and then search 
for all tuple matching.

As it would be a new command, I have no idea on what the data could be.
Does syntax meet your needs? Which choice should I implement?

Regards
Jonathan Scher


Re: UPSERT

From
Andrew Dunstan
Date:
Jonathan Scher wrote:
> Hello,
>
> I'd like to work on TODO item:
> > Add REPLACE or UPSERT command that does UPDATE, or on failure, INSERT
>
> could you please tell me if I'm going in the right way?
>
> There are some different syntaxes possible, but MySQL has an 
> interesting one here:
> http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html
>
> INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1;
> This allow to make an insert, and if the key is already there to 
> modify the value depending on the current one.
>
> Then I have two choices possible:
> - Search for existing tuples among key or unique constraint, then if 
> nothing is found, insert it.
> - Try to insert a new row, catch if there is any error, and then 
> search for all tuple matching.
>
> As it would be a new command, I have no idea on what the data could be.
> Does syntax meet your needs? Which choice should I implement?

Good. Some thoughts from the top of the head:

Is "insert or on failure update" semantically equivalent to "update or 
on failure insert"? If not the former seems more desirable to me anyway.

What are the syntax alternatives? This one from MySQL doesn't seem too 
bad, but it would be good to have them all on the table.

My instinct would be to follow your first strategy, i.e. detect which 
path is needed rather than try one and then if it fails do the other.

cheers

andrew



Re: UPSERT

From
"Florian G. Pflug"
Date:
Andrew Dunstan wrote:
> Jonathan Scher wrote:
>> Hello,
>>
>> I'd like to work on TODO item:
>> > Add REPLACE or UPSERT command that does UPDATE, or on failure, INSERT
>>
>> could you please tell me if I'm going in the right way?
>>
>> There are some different syntaxes possible, but MySQL has an 
>> interesting one here:
>> http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html
>>
>> INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1;
>> This allow to make an insert, and if the key is already there to 
>> modify the value depending on the current one.

May this could be generalized to a generic "<stmt> on <error> do <stmt>"?
You could then write
"update table set c=c+1 on not_found do insert into table (a,b,c) values (1,2,3)"

Just an idea I just had...

greetings, Florian Pflug



Re: UPSERT

From
Gregory Stark
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:

>>> INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1;
>>> This allow to make an insert, and if the key is already there to modify the
>>> value depending on the current one.
>
> May this could be generalized to a generic "<stmt> on <error> do <stmt>"?
> You could then write
> "update table set c=c+1 on not_found do insert into table (a,b,c) values (1,2,3)"
>
> Just an idea I just had...

We have such a thing, subtransactions.

The reason UPSERT or ON DUPLICATE is interesting is because it provides a way
to do it atomically. That is, you keep the locks acquired from the duplicate
key check and if it fails you update the same records you just found violating
the duplicate key.

If the user tries to do the same thing he has to repeat the search after the
duplicate key check has released the locks so it's possible they've been
deleted or updated since. So the user has to loop in case the update fails to
find any records and he has to start over trying to insert. The same problem
plagues you if you do it the other way around too.

The tricky part is avoiding race conditions. The way the unique index code
avoids having someone else come along and insert at the same time is by
holding a lock on an index page. I'm not sure if you can keep that lock while
you go lock the tuples for the update.



--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: UPSERT

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> My instinct would be to follow your first strategy, i.e. detect which 
> path is needed rather than try one and then if it fails do the other.

The very first thing you need to think about is how to solve the race
condition problem, ie, two backends concurrently trying to insert
identical data.  Until you have a plausible mechanism for that, the
whole thing is pie-in-the-sky.
        regards, tom lane


Re: UPSERT

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> My instinct would be to follow your first strategy, i.e. detect which 
>> path is needed rather than try one and then if it fails do the other.
> 
> The very first thing you need to think about is how to solve the race
> condition problem, ie, two backends concurrently trying to insert
> identical data.  Until you have a plausible mechanism for that, the
> whole thing is pie-in-the-sky.

How about:

1. Insert new heap tuple
2. Try to insert the index tuple. If there's a duplicate tuple, lock the 
existing tuple instead of throwing an error.
3. If there was no duplicate, we're done.

4. Otherwise, kill the new tuple inserted in step 1, by setting it's 
xmin to InvalidTransactionId.
5. Perform the UPDATE on the existing tuple.

This requires one change to the indexam api: a duplicate key violation 
needs to lock the existing tuple instead of throwing an error.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: UPSERT

From
"Florian G. Pflug"
Date:
Gregory Stark wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
> 
>>>> INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1;
>>>> This allow to make an insert, and if the key is already there to modify the
>>>> value depending on the current one.
>> May this could be generalized to a generic "<stmt> on <error> do <stmt>"?
>> You could then write
>> "update table set c=c+1 on not_found do insert into table (a,b,c) values (1,2,3)"
>>
>> Just an idea I just had...
> 
> We have such a thing, subtransactions.

Yeah, I know - but the syntax above would provide a way to write that "inline"
instead of doing it at the application (or plpgsql) level.

> The reason UPSERT or ON DUPLICATE is interesting is because it provides a way
> to do it atomically. That is, you keep the locks acquired from the duplicate
> key check and if it fails you update the same records you just found violating
> the duplicate key.
> 
> If the user tries to do the same thing he has to repeat the search after the
> duplicate key check has released the locks so it's possible they've been
> deleted or updated since. So the user has to loop in case the update fails to
> find any records and he has to start over trying to insert. The same problem
> plagues you if you do it the other way around too.
I agree - my "generic syntax" seems to be too generic, and doesn't take
locking into account.. :-(

> The tricky part is avoiding race conditions. The way the unique index code
> avoids having someone else come along and insert at the same time is by
> holding a lock on an index page. I'm not sure if you can keep that lock while
> you go lock the tuples for the update.

Maybe doing the following would work:
start:
do_index_lookup
if (found_row) {  lock_row  if (acquired_lock) {    do_update    return  }  //Row was deleted
}
create_row_on_heap
create_index_entry
if (success)  return
else {  mark_row_as_deleted //or remove row?  goto start
}

It seems like this would work without creating a subtransaction, but
I'm not really sure..

greetings, Florian Pflug


Re: UPSERT

From
"Simon Riggs"
Date:
On Fri, 2007-03-02 at 15:41 +0000, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Andrew Dunstan <andrew@dunslane.net> writes:
> >> My instinct would be to follow your first strategy, i.e. detect which 
> >> path is needed rather than try one and then if it fails do the other.
> > 
> > The very first thing you need to think about is how to solve the race
> > condition problem, ie, two backends concurrently trying to insert
> > identical data.  Until you have a plausible mechanism for that, the
> > whole thing is pie-in-the-sky.
> 
> How about:
> 
> 1. Insert new heap tuple
> 2. Try to insert the index tuple. If there's a duplicate tuple, lock the 
> existing tuple instead of throwing an error.
> 3. If there was no duplicate, we're done.
> 
> 4. Otherwise, kill the new tuple inserted in step 1, by setting it's 
> xmin to InvalidTransactionId.
> 5. Perform the UPDATE on the existing tuple.
> 
> This requires one change to the indexam api: a duplicate key violation 
> needs to lock the existing tuple instead of throwing an error.

So if the INSERT fails we will leave two dead copies of the tuples? Hmm.

Seems like we should try to locate a row first, then INSERT if we cannot
find one. That's slower on INSERT but more balanced overall - sometimes
the input will generate all UPDATEs, sometimes all INSERTs we'll never
know.


I'm a bit surprised the TODO didn't mention the MERGE statement, which
is the SQL:2003 syntax for specifying this as an atomic statement. There
are lots of other syntaxes, the most simple of which are the MySQL
REPLACE and Teradata's UPDATE ... ELSE INSERT. As seductive as they are,
I'd say that's all the more reason to go with the single approved
syntax. If MySQL are standards compliant, they will support that also,
so we get MySQL compatibility either way.

Another thought that really ought to be on the TODO is a MERGE FROM
(pick your syntax) that allows MERGE to act like a COPY, reading data
from an external data file. That would save effort, since the only way
of doing this currently is to do a COPY then an UPDATE and then an
INSERT. So the MERGE FROM would reduce three operations to just a single
command. 

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: UPSERT

From
Bricklen Anderson
Date:
Simon Riggs wrote:
> I'm a bit surprised the TODO didn't mention the MERGE statement, which
> is the SQL:2003 syntax for specifying this as an atomic statement.

http://archives.postgresql.org/pgsql-hackers/2004-05/thrd5.php#00497

There is a thread there entitled "Adding MERGE to the TODO list"


Re: UPSERT

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> Seems like we should try to locate a row first, then INSERT if we cannot
> find one. That's slower on INSERT but more balanced overall

Except it still has the race condition.

> I'm a bit surprised the TODO didn't mention the MERGE statement, which
> is the SQL:2003 syntax for specifying this as an atomic statement.

I believe we concluded that MERGE doesn't actually do quite what people
want/expect.  Please go back and read the archives.
        regards, tom lane


Re: UPSERT

From
Tom Lane
Date:
Bricklen Anderson <banderson@presinet.com> writes:
> http://archives.postgresql.org/pgsql-hackers/2004-05/thrd5.php#00497
> There is a thread there entitled "Adding MERGE to the TODO list"

The more interesting discussion is the one that got it taken off TODO again,
from Nov 2005.  Try these threads:
http://archives.postgresql.org/pgsql-hackers/2005-11/msg00501.php
http://archives.postgresql.org/pgsql-hackers/2005-11/msg00536.php
        regards, tom lane


Re: UPSERT

From
"Simon Riggs"
Date:
On Fri, 2007-03-02 at 13:19 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > Seems like we should try to locate a row first, then INSERT if we cannot
> > find one. That's slower on INSERT but more balanced overall
> 
> Except it still has the race condition.

I'm not saying it didn't; but dropping in two dead copies of a tuple
isn't acceptable either.

> > I'm a bit surprised the TODO didn't mention the MERGE statement, which
> > is the SQL:2003 syntax for specifying this as an atomic statement.
> 
> I believe we concluded that MERGE doesn't actually do quite what people
> want/expect.  Please go back and read the archives.

Yes, it was my thread. I recall that there wasn't any acceptable answer
to how it could be done with reasonable efficiency.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: UPSERT

From
Bricklen Anderson
Date:
Tom Lane wrote:
> Bricklen Anderson <banderson@presinet.com> writes:
>> http://archives.postgresql.org/pgsql-hackers/2004-05/thrd5.php#00497
>> There is a thread there entitled "Adding MERGE to the TODO list"
> 
> The more interesting discussion is the one that got it taken off TODO again,
> from Nov 2005.  Try these threads:
> http://archives.postgresql.org/pgsql-hackers/2005-11/msg00501.php
> http://archives.postgresql.org/pgsql-hackers/2005-11/msg00536.php
> 
>             regards, tom lane

Yeah, that's a better set of threads.


Re: UPSERT

From
Josh Berkus
Date:
Couple notes:

(1) Upsert is not just a desire of MySQL users.  I just spec'd a major 
proprietary-database replacement project at a fortune 500 where they want an 
Upsert and are unhappy that PostgreSQL doesn't have it.  Unfortunately, they 
don't want to spring for development funds :-(

(2) Doing upsert by checking for a unique key violaton error leads to horrible 
performance in addition to the previously mentioned race conditions.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: UPSERT

From
Hannu Krosing
Date:
Ühel kenal päeval, R, 2007-03-02 kell 10:13, kirjutas Tom Lane:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > My instinct would be to follow your first strategy, i.e. detect which 
> > path is needed rather than try one and then if it fails do the other.
> 
> The very first thing you need to think about is how to solve the race
> condition problem, ie, two backends concurrently trying to insert
> identical data.  

Then one of them will update the data inserted by whoeved got the insert
first.

> Until you have a plausible mechanism for that, the
> whole thing is pie-in-the-sky.

Is'nt the standard way of doing it thus:

UPDATE
IF NOT FOUND THEN  INSERT IF DUPLICATE KEY THEN   UPDATE END IF
END IF

At least this is how UPSERT is usually done in plpgsql


-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



Re: UPSERT

From
Bruno Wolff III
Date:
On Sun, Mar 04, 2007 at 14:55:47 +0200, Hannu Krosing <hannu@skype.net> wrote:
> 
> UPDATE
> IF NOT FOUND THEN 
>   INSERT
>   IF DUPLICATE KEY THEN
>     UPDATE
>   END IF
> END IF

I believe it is possible for the above to fail. For example another
transaction could create a matching record between the update and insert
and then another transaction could delete it between the insert and the
second update.


Re: UPSERT

From
Hannu Krosing
Date:
Ühel kenal päeval, P, 2007-03-04 kell 07:46, kirjutas Bruno Wolff III:
> On Sun, Mar 04, 2007 at 14:55:47 +0200,
>   Hannu Krosing <hannu@skype.net> wrote:
> > 
> > UPDATE
> > IF NOT FOUND THEN 
> >   INSERT
> >   IF DUPLICATE KEY THEN
> >     UPDATE
> >   END IF
> > END IF
> 
> I believe it is possible for the above to fail. For example another
> transaction could create a matching record between the update and insert
> and then another transaction could delete it between the insert and the
> second update.

Then we may do the second part as a loop and hope that eventually we hit
the right point with either INSERT or UPDATE:
UPDATEWHILE NOT FOUND THEN   INSERT  IF DUPLICATE KEY THEN    UPDATE  END IFEND WHILE

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



Re: UPSERT

From
Petr Jelinek
Date:
Bruno Wolff III wrote:
> On Sun, Mar 04, 2007 at 14:55:47 +0200,
>   Hannu Krosing <hannu@skype.net> wrote:
>> UPDATE
>> IF NOT FOUND THEN
>>   INSERT
>>   IF DUPLICATE KEY THEN
>>     UPDATE
>>   END IF
>> END IF
>
> I believe it is possible for the above to fail. For example another
> transaction could create a matching record between the update and insert
> and then another transaction could delete it between the insert and the
> second update.

You know we have example in manual right ?
http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE

:)

-- 
Regards
Petr Jelinek (PJMODOS)




Re: UPSERT

From
Martijn van Oosterhout
Date:
On Sun, Mar 04, 2007 at 02:55:47PM +0200, Hannu Krosing wrote:
> Is'nt the standard way of doing it thus:
>
> UPDATE
> IF NOT FOUND THEN
>   INSERT
>   IF DUPLICATE KEY THEN
>     UPDATE
>   END IF
> END IF
>
> At least this is how UPSERT is usually done in plpgsql

Well, you need to loop, because that last UPDATE can get a not-found
again, so you have to keep trying both until they work.

I think MERGE would still be cool, because then it's only one command
that has to be repeated, rather than two.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.