Thread: UPSERT
Hello,

I'd like to work on this TODO item:

> Add REPLACE or UPSERT command that does UPDATE, or on failure, INSERT

Could you please tell me if I'm going the right way?

There are several possible syntaxes, but MySQL has an interesting one here:
http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html

INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1;

This allows making an insert, and if the key is already there, modifying
the value based on the current one.

I then have two possible choices:
- Search for an existing tuple matching a key or unique constraint, and if
  nothing is found, insert it.
- Try to insert a new row, catch any error, and then search for all
  matching tuples.

As this would be a new command, I have no idea what the data could look like.
Does this syntax meet your needs? Which choice should I implement?

Regards
Jonathan Scher
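For illustration, here is a minimal plain-SQL sketch of the first strategy
(look up an existing row, insert only if nothing matched), assuming a
hypothetical table t(a int PRIMARY KEY, b int, c int). On its own this is
not safe against concurrent inserts of the same key; the race-condition
problem is discussed further down the thread.

    -- Hypothetical table: CREATE TABLE t (a int PRIMARY KEY, b int, c int);

    -- Step 1: update the row if it already exists.
    UPDATE t SET c = c + 1 WHERE a = 1;

    -- Step 2: insert only if no row with that key exists.
    -- (Racy: another session may insert a = 1 between the two statements.)
    INSERT INTO t (a, b, c)
    SELECT 1, 2, 3
    WHERE NOT EXISTS (SELECT 1 FROM t WHERE a = 1);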
Jonathan Scher wrote:
> Hello,
>
> I'd like to work on this TODO item:
>
> > Add REPLACE or UPSERT command that does UPDATE, or on failure, INSERT
>
> Could you please tell me if I'm going the right way?
>
> There are several possible syntaxes, but MySQL has an interesting one here:
> http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html
>
> INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1;
>
> This allows making an insert, and if the key is already there, modifying
> the value based on the current one.
>
> I then have two possible choices:
> - Search for an existing tuple matching a key or unique constraint, and if
>   nothing is found, insert it.
> - Try to insert a new row, catch any error, and then search for all
>   matching tuples.
>
> As this would be a new command, I have no idea what the data could look like.
> Does this syntax meet your needs? Which choice should I implement?

Good. Some thoughts off the top of my head:

Is "insert, or on failure update" semantically equivalent to "update, or on
failure insert"? If not, the former seems more desirable to me anyway.

What are the syntax alternatives? This one from MySQL doesn't seem too bad,
but it would be good to have them all on the table.

My instinct would be to follow your first strategy, i.e. detect which path
is needed rather than try one and then, if it fails, do the other.

cheers

andrew
Andrew Dunstan wrote:
> Jonathan Scher wrote:
>> Hello,
>>
>> I'd like to work on this TODO item:
>>
>> > Add REPLACE or UPSERT command that does UPDATE, or on failure, INSERT
>>
>> Could you please tell me if I'm going the right way?
>>
>> There are several possible syntaxes, but MySQL has an interesting one here:
>> http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html
>>
>> INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1;
>> This allows making an insert, and if the key is already there, modifying
>> the value based on the current one.

Maybe this could be generalized to a generic "<stmt> on <error> do <stmt>"?
You could then write:

update table set c=c+1 on not_found do insert into table (a,b,c) values (1,2,3)

Just an idea I just had...

greetings, Florian Pflug
"Florian G. Pflug" <fgp@phlo.org> writes: >>> INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1; >>> This allow to make an insert, and if the key is already there to modify the >>> value depending on the current one. > > May this could be generalized to a generic "<stmt> on <error> do <stmt>"? > You could then write > "update table set c=c+1 on not_found do insert into table (a,b,c) values (1,2,3)" > > Just an idea I just had... We have such a thing, subtransactions. The reason UPSERT or ON DUPLICATE is interesting is because it provides a way to do it atomically. That is, you keep the locks acquired from the duplicate key check and if it fails you update the same records you just found violating the duplicate key. If the user tries to do the same thing he has to repeat the search after the duplicate key check has released the locks so it's possible they've been deleted or updated since. So the user has to loop in case the update fails to find any records and he has to start over trying to insert. The same problem plagues you if you do it the other way around too. The tricky part is avoiding race conditions. The way the unique index code avoids having someone else come along and insert at the same time is by holding a lock on an index page. I'm not sure if you can keep that lock while you go lock the tuples for the update. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> My instinct would be to follow your first strategy, i.e. detect which
> path is needed rather than try one and then if it fails do the other.

The very first thing you need to think about is how to solve the race
condition problem, i.e. two backends concurrently trying to insert identical
data. Until you have a plausible mechanism for that, the whole thing is
pie in the sky.

			regards, tom lane
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> My instinct would be to follow your first strategy, i.e. detect which
>> path is needed rather than try one and then if it fails do the other.
>
> The very first thing you need to think about is how to solve the race
> condition problem, i.e. two backends concurrently trying to insert
> identical data. Until you have a plausible mechanism for that, the
> whole thing is pie in the sky.

How about:

1. Insert the new heap tuple.
2. Try to insert the index tuple. If there's a duplicate, lock the existing
   tuple instead of throwing an error.
3. If there was no duplicate, we're done.
4. Otherwise, kill the new tuple inserted in step 1 by setting its xmin to
   InvalidTransactionId.
5. Perform the UPDATE on the existing tuple.

This requires one change to the indexam API: a duplicate key violation needs
to lock the existing tuple instead of throwing an error.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Gregory Stark wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>
>>>> INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1;
>>>> This allows making an insert, and if the key is already there, modifying
>>>> the value based on the current one.
>>
>> Maybe this could be generalized to a generic "<stmt> on <error> do <stmt>"?
>> You could then write
>> "update table set c=c+1 on not_found do insert into table (a,b,c) values (1,2,3)"
>>
>> Just an idea I just had...
>
> We have such a thing: subtransactions.

Yeah, I know - but the syntax above would provide a way to write that
"inline" instead of doing it at the application (or plpgsql) level.

> The reason UPSERT or ON DUPLICATE is interesting is that it provides a way
> to do this atomically. That is, you keep the locks acquired from the
> duplicate-key check, and if the check fails you update the same records
> you just found violating the duplicate key.
>
> If the user tries to do the same thing himself, he has to repeat the search
> after the duplicate-key check has released the locks, so it's possible the
> records have been deleted or updated since. So the user has to loop: if the
> update fails to find any records, he has to start over trying to insert.
> The same problem plagues you if you do it the other way around, too.

I agree - my "generic syntax" seems to be too generic, and doesn't take
locking into account.. :-(

> The tricky part is avoiding race conditions. The way the unique index code
> avoids having someone else come along and insert at the same time is by
> holding a lock on an index page. I'm not sure if you can keep that lock
> while you go lock the tuples for the update.

Maybe doing the following would work:

start:
  do_index_lookup
  if (found_row) {
    lock_row
    if (acquired_lock) {
      do_update
      return
    }
    // Row was deleted
  }
  create_row_on_heap
  create_index_entry
  if (success)
    return
  else {
    mark_row_as_deleted  // or remove the row?
    goto start
  }

It seems like this would work without creating a subtransaction, but I'm not
really sure..

greetings, Florian Pflug
On Fri, 2007-03-02 at 15:41 +0000, Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> My instinct would be to follow your first strategy, i.e. detect which
>>> path is needed rather than try one and then if it fails do the other.
>>
>> The very first thing you need to think about is how to solve the race
>> condition problem, i.e. two backends concurrently trying to insert
>> identical data. Until you have a plausible mechanism for that, the
>> whole thing is pie in the sky.
>
> How about:
>
> 1. Insert the new heap tuple.
> 2. Try to insert the index tuple. If there's a duplicate, lock the
>    existing tuple instead of throwing an error.
> 3. If there was no duplicate, we're done.
> 4. Otherwise, kill the new tuple inserted in step 1 by setting its xmin
>    to InvalidTransactionId.
> 5. Perform the UPDATE on the existing tuple.
>
> This requires one change to the indexam API: a duplicate key violation
> needs to lock the existing tuple instead of throwing an error.

So if the INSERT fails we will leave two dead copies of the tuple? Hmm.

Seems like we should try to locate a row first, then INSERT if we cannot
find one. That's slower on INSERT but more balanced overall - sometimes the
input will generate all UPDATEs, sometimes all INSERTs; we'll never know.

I'm a bit surprised the TODO didn't mention the MERGE statement, which is
the SQL:2003 syntax for specifying this as an atomic statement. There are
lots of other syntaxes, the simplest of which are MySQL's REPLACE and
Teradata's UPDATE ... ELSE INSERT. As seductive as they are, I'd say that's
all the more reason to go with the single approved syntax. If MySQL is
standards-compliant, it will support that too, so we get MySQL compatibility
either way.

Another thought that really ought to be on the TODO is a MERGE FROM (pick
your syntax) that allows MERGE to act like a COPY, reading data from an
external data file. That would save effort, since the only way of doing this
currently is a COPY, then an UPDATE, and then an INSERT. MERGE FROM would
reduce those three operations to a single command.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
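For reference, a SQL:2003-style MERGE for the running example might look
roughly like the sketch below. Syntax details vary between implementations,
PostgreSQL does not implement MERGE at this point, and the table t and the
VALUES source are assumptions carried over from the earlier examples.

    MERGE INTO t
    USING (VALUES (1, 2, 3)) AS src (a, b, c)
       ON t.a = src.a
    WHEN MATCHED THEN
        UPDATE SET c = t.c + 1
    WHEN NOT MATCHED THEN
        INSERT (a, b, c) VALUES (src.a, src.b, src.c);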
Simon Riggs wrote:
> I'm a bit surprised the TODO didn't mention the MERGE statement, which
> is the SQL:2003 syntax for specifying this as an atomic statement.

http://archives.postgresql.org/pgsql-hackers/2004-05/thrd5.php#00497

There is a thread there entitled "Adding MERGE to the TODO list".
"Simon Riggs" <simon@2ndquadrant.com> writes: > Seems like we should try to locate a row first, then INSERT if we cannot > find one. That's slower on INSERT but more balanced overall Except it still has the race condition. > I'm a bit surprised the TODO didn't mention the MERGE statement, which > is the SQL:2003 syntax for specifying this as an atomic statement. I believe we concluded that MERGE doesn't actually do quite what people want/expect. Please go back and read the archives. regards, tom lane
Bricklen Anderson <banderson@presinet.com> writes:
> http://archives.postgresql.org/pgsql-hackers/2004-05/thrd5.php#00497
> There is a thread there entitled "Adding MERGE to the TODO list"

The more interesting discussion is the one that got it taken off TODO again,
from Nov 2005. Try these threads:
http://archives.postgresql.org/pgsql-hackers/2005-11/msg00501.php
http://archives.postgresql.org/pgsql-hackers/2005-11/msg00536.php

			regards, tom lane
On Fri, 2007-03-02 at 13:19 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> Seems like we should try to locate a row first, then INSERT if we cannot
>> find one. That's slower on INSERT but more balanced overall
>
> Except it still has the race condition.

I'm not saying it doesn't; but dropping in two dead copies of a tuple isn't
acceptable either.

>> I'm a bit surprised the TODO didn't mention the MERGE statement, which
>> is the SQL:2003 syntax for specifying this as an atomic statement.
>
> I believe we concluded that MERGE doesn't actually do quite what people
> want/expect. Please go back and read the archives.

Yes, that was my thread. I recall that there wasn't any acceptable answer to
how it could be done with reasonable efficiency.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Tom Lane wrote:
> Bricklen Anderson <banderson@presinet.com> writes:
>> http://archives.postgresql.org/pgsql-hackers/2004-05/thrd5.php#00497
>> There is a thread there entitled "Adding MERGE to the TODO list"
>
> The more interesting discussion is the one that got it taken off TODO
> again, from Nov 2005. Try these threads:
> http://archives.postgresql.org/pgsql-hackers/2005-11/msg00501.php
> http://archives.postgresql.org/pgsql-hackers/2005-11/msg00536.php
>
> 			regards, tom lane

Yeah, that's a better set of threads.
A couple of notes:

(1) Upsert is not just a desire of MySQL users. I just spec'd a major
proprietary-database replacement project at a Fortune 500 where they want an
upsert and are unhappy that PostgreSQL doesn't have it. Unfortunately, they
don't want to spring for development funds :-(

(2) Doing upsert by checking for a unique-key violation error leads to
horrible performance, in addition to the previously mentioned race
conditions.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
On Fri, 2007-03-02 at 10:13, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> My instinct would be to follow your first strategy, i.e. detect which
>> path is needed rather than try one and then if it fails do the other.
>
> The very first thing you need to think about is how to solve the race
> condition problem, i.e. two backends concurrently trying to insert
> identical data.

Then one of them will update the data inserted by whoever got the insert in
first.

> Until you have a plausible mechanism for that, the whole thing is
> pie in the sky.

Isn't the standard way of doing it thus:

UPDATE
IF NOT FOUND THEN
    INSERT
    IF DUPLICATE KEY THEN
        UPDATE
    END IF
END IF

At least this is how UPSERT is usually done in plpgsql.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com
On Sun, Mar 04, 2007 at 14:55:47 +0200,
  Hannu Krosing <hannu@skype.net> wrote:
>
> UPDATE
> IF NOT FOUND THEN
>     INSERT
>     IF DUPLICATE KEY THEN
>         UPDATE
>     END IF
> END IF

I believe it is possible for the above to fail. For example, another
transaction could create a matching record between the update and the
insert, and then another transaction could delete it between the insert and
the second update.
On Sun, 2007-03-04 at 07:46, Bruno Wolff III wrote:
> On Sun, Mar 04, 2007 at 14:55:47 +0200,
>   Hannu Krosing <hannu@skype.net> wrote:
>>
>> UPDATE
>> IF NOT FOUND THEN
>>     INSERT
>>     IF DUPLICATE KEY THEN
>>         UPDATE
>>     END IF
>> END IF
>
> I believe it is possible for the above to fail. For example, another
> transaction could create a matching record between the update and the
> insert, and then another transaction could delete it between the insert
> and the second update.

Then we may do the second part as a loop, and hope that eventually we hit
the right point with either INSERT or UPDATE:

UPDATE
WHILE NOT FOUND
    INSERT
    IF DUPLICATE KEY THEN
        UPDATE
    END IF
END WHILE

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com
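A rough PL/pgSQL sketch of that loop, assuming the same hypothetical table t
with an integer primary key a and a value column c (the function name and
parameters are made up for illustration):

    CREATE FUNCTION upsert_t(k int, v int) RETURNS void AS $$
    BEGIN
        LOOP
            -- Try to update an existing row first.
            UPDATE t SET c = v WHERE a = k;
            IF FOUND THEN
                RETURN;
            END IF;
            -- No row there: try to insert one.  If another session
            -- inserted the same key concurrently, we get a
            -- unique_violation and go around the loop to retry the UPDATE.
            BEGIN
                INSERT INTO t (a, c) VALUES (k, v);
                RETURN;
            EXCEPTION WHEN unique_violation THEN
                -- do nothing; fall through and retry
            END;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

Note that the EXCEPTION block starts a subtransaction on every insert
attempt, which is exactly the unique-key-violation overhead objected to
earlier in the thread.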
Bruno Wolff III wrote:
> On Sun, Mar 04, 2007 at 14:55:47 +0200,
>   Hannu Krosing <hannu@skype.net> wrote:
>> UPDATE
>> IF NOT FOUND THEN
>>     INSERT
>>     IF DUPLICATE KEY THEN
>>         UPDATE
>>     END IF
>> END IF
>
> I believe it is possible for the above to fail. For example, another
> transaction could create a matching record between the update and the
> insert, and then another transaction could delete it between the insert
> and the second update.

You know we have an example of this in the manual, right?
http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE
:)

--
Regards
Petr Jelinek (PJMODOS)
On Sun, Mar 04, 2007 at 02:55:47PM +0200, Hannu Krosing wrote:
> Isn't the standard way of doing it thus:
>
> UPDATE
> IF NOT FOUND THEN
>     INSERT
>     IF DUPLICATE KEY THEN
>         UPDATE
>     END IF
> END IF
>
> At least this is how UPSERT is usually done in plpgsql

Well, you need to loop, because that last UPDATE can come up not-found
again, so you have to keep trying both until one of them works.

I think MERGE would still be cool, because then it's only one command that
has to be repeated, rather than two.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to
> litigate.