SQL MERGE is quite distinct from UPSERT - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | SQL MERGE is quite distinct from UPSERT |
Date | |
Msg-id | CAM3SWZRP0c3g6+aJ=YYDGYAcTZg0xA8-1_FCVo5Xm7hrEL34kw@mail.gmail.com Whole thread Raw |
Responses |
Re: SQL MERGE is quite distinct from UPSERT
Re: SQL MERGE is quite distinct from UPSERT |
List | pgsql-hackers |
Someone asked me privately why we weren't immediately pursuing SQL MERGE, as opposed to yet another non-standard variant of UPSERT (UPSERT is loosely defined here as an insert-or-update DML command that goes to update based on would-be unique constraint violations, and does one or the other *atomically*). I believe at least one other system (Teradata) that now has both formerly only had the latter. I think it makes sense to have both. While I believe that it's already generally accepted that SQL MERGE is distinct from UPSERT in fairly fundamental ways, in that it solves a somewhat distinct set of problems, I would like to put my thoughts on the matter on record. This is mostly for the sake of the archives. The first reason why the two are distinct is that MERGE, as implemented within other major systems and as described by the SQL standard does not require an atomic insert-or-update for even the simplest possible case of OLTP style insert-or-update. As I went into during my talk at pgCon, even these simple cases may throw errors on other systems' MERGE implementations. This leeway to fail may be valuable for some use cases though, like the bulk loading use case. Simon Riggs has characterized MERGE as something that exists for the benefit of those two fairly distinct use cases [1], and I agree with him. I suspect that the predictability of MERGE's behavior suffers due to trying to support both at once, and I think that MERGE should be specifically promoted as a bulk loading command if we ever get around to adding it to PostgreSQL. The second reason they differ is that SQL MERGE as implemented within other systems doesn't require that a unique index be defined on the column or columns of the MERGE statement's outer join. I am not enthusiastic about the idea of falling back to table-level locking when an appropriate index isn't available. That could be a foot-gun for the large majority of users that just want to solve their basic insert-or-update problem. I did think to mention this when asked about my work on UPSERT, but the next question was: "So why not just throw an error when it's not possible to use an available unique index"? This is a fair point, but doesn't address all the problems with trying to make MERGE perform upserts that meet my requirements for UPSERT [2](principally the requirement of atomicity in the sense of always getting an insert or update). These two points, particularly the first are likely to weigh upon the implementation of both, which is why we cannot very well just insist upon having appropriate unique indexes defined. It buys my UPSERT implementation plenty to have the upsert driven by an insert into a unique index. With UPSERT, having an insert occur is always an outcome that the DML statement's author will consider acceptable. SQL MERGE statements can be written without any WHEN NOT MATCHED THEN INSERT; SQL MERGE statement authors are not necessarily going to expect an insert at all. As I outlined in my pgCon talk, there appears to be a fundamental trade-off involved when implementing UPSERT. My implementation is prepared for the possibility that upon finding a duplicate, and upon subsequently attempting to lock a row, the row may be found to be deleted. When that happens, we try again without having made any useful progress in the first iteration. The ideal of a strict queue of upserters, with totally fair row locking style lock arbitration appears to be basically impossible, at least provided you're unwilling to give up insert-or-update atomicity (that is, to get this you have to be willing to give up with an error even at READ COMMITTED, which is a cop out, or buy into using something like table level locks, which is also a cop out). Theoretical lock starvation hazards exist in my implementation because I don't do this, but all indications are that this is fine in the real world. In this regard my implementation is similar to the current looping upsert subtransaction pattern we've been recommending for years [3]. We don't hold on to "value locks" on values within indexes (except to finish staggered index-locks-first inserting - but not to update/row lock). This is all fine in the real world because *somebody* has to make progress when there is highly contended upserting. Some session has to insert-or-update successfully - otherwise you cannot have a conflict to begin with. Livelocking is therefore not possible, and you need a *sustained* perfect storm of conflicts starving out one session in order to see lock starvation at all. SQL MERGE has an outer join on a table of proposed values (as opposed to an inner join against already rejected values). MERGE allows you to do something other than insert WHEN NOT MATCHED (i.e. to do nothing), and I think that being able to always insert at that juncture gives us a way to usefully terminate the loop where there might not be another way. Suppose you had a similar "keep retrying" implementation (similar to what I've proposed for UPSERT), but built this for SQL MERGE. Ignoring the fact that no other SQL MERGE implementation has insert-or-update-or-delete atomicity, you insist upon this for your implementation. Consider how this would behave with a MERGE statement that had only a WHEN MATCHED THEN UPDATE clause and/or a WHEN MATCHED THEN DELETE clause. You need to do nothing for the WHEN NOT MATCHED case, but how exactly do you decide when and how to do nothing when it's based on some index value *not* being there? This isn't the same as INSERT IGNORE - it's the exact opposite. We're doing nothing because there was no conflict found, *not* because there was some conflict conclusively found. Furthermore, not only are we doing nothing, we're specifically not doing something else (updating). We're not doing updating in a way predicated on the physical absence of a value from an index at one instant in time. But what on earth does that absence tell us? In what sense is the value not there? Is it in the heap, but not yet in any index? Is it visible to our estate snapshot? My sense is that any attempt to resolve these questions would be extremely ticklish, and would likely introduce livelock hazards or deadlock hazards. You end up with head melting Heisenberg-esque questions when you try to impose this requirement on SQL MERGE. AFAICT that's why no one else has succeeded in doing so. I think you're just supposed to live with those problems ("Nothing visible to my MVCC snapshot? Good enough, do WHEN NOT MATCHED handling then"). A lot of the time these problems don't come up. To state it another way, my implementation relies on something *conclusively* being there (conclusively committed and visible to new snapshots) - our new index tuple, or someone else's existing one. If it's the latter case we're finally done when we conclusively lock the tuple without a conflict (that is, without it being deleted in the interim). We rely on things happening conclusively (things that have locking implications, like inserting an index tuple or locking the row). We cannot conclusively conclude that something is not there without extreme measures that are very questionable. At a high level SQL MERGE is quite distinct from UPSERT, in that it is a utility command that performs inserts, updates and deletes while avoiding race conditions (e.g. unique constraint violations) on a more or less best effort basis. MERGE is conceptually messy. In contrast UPSERT is actually atomic, and having its behavior be relatively easy to reason about ought to be the top priority. There is a *really* big demand for UPSERT from users, not MERGE, although MERGE is certainly useful too. [1] http://www.postgresql.org/message-id/1132000496.3388.31.camel@holly [2] http://www.pgcon.org/2014/schedule/attachments/327_upsert_weird.pdf, Slide 8, "Goals for UPSERT in Postgres" [3] http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE -- Peter Geoghegan
pgsql-hackers by date: