Thread: Making joins involving ctid work for the benefit of UPSERT
I'm working on UPSERT again. I think that in order to make useful progress, I'll have to find a better way of overcoming the visibility issues (i.e. the problem of what to do about a still-in-progress-to-our-snapshot row being locked at READ COMMITTED isolation level [1][2]).

I've made some tentative additions to my earlier patch to change the syntax:

postgres=# explain analyze
with c as (
    insert into ints(key, val) values (1, 'a'), (2, 'b')
    on conflict
    select * for update)
update ints set val = c.val from c conflicts;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Update on ints  (cost=0.03..62.34 rows=2 width=104) (actual time=0.095..0.095 rows=0 loops=1)
   CTE c
     ->  Insert on ints ints_1  (cost=0.00..0.03 rows=2 width=36) (actual time=0.039..0.049 rows=2 loops=1)
           ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=36) (actual time=0.001..0.002 rows=2 loops=1)
   ->  Nested Loop  (cost=0.00..62.31 rows=2 width=104) (actual time=0.063..0.080 rows=2 loops=1)
         Join Filter: (ints.ctid = c.ctid)
         Rows Removed by Join Filter: 6
         ->  CTE Scan on c  (cost=0.00..0.04 rows=2 width=100) (actual time=0.044..0.055 rows=2 loops=1)
         ->  Materialize  (cost=0.00..28.45 rows=1230 width=10) (actual time=0.005..0.006 rows=4 loops=2)
               ->  Seq Scan on ints  (cost=0.00..22.30 rows=1230 width=10) (actual time=0.009..0.009 rows=4 loops=1)
 Planning time: 0.132 ms
 Execution time: 0.159 ms
(12 rows)

This works, as far as it goes. Parse analysis adds the ctid join to the top-level UPDATE based on the fact that a magical UPDATE statement (magical due to having a CONFLICTS clause) references a CTE in its FromList (this is mandated by CONFLICTS during parse analysis), a CTE containing an INSERT ON CONFLICT, and itself projecting a magical set of rejected rows sufficient to get a locked c.ctid implicitly. (FWIW, DELETEs may also accept a CONFLICTS clause and be used like this.)
The optimizer does not support joins involving ctid, which is why there is a seqscan plan - even setting "enable_seqscan = off" does not alter the plan right now. Apart from the obvious fact that we're doing an unnecessary Seq Scan on the target table, which is unacceptable on performance grounds (since we do in fact already know exactly what list of tids to pass to the top-level ModifyTable node to UPDATE), there is another problem: since I cannot use a tid scan, I cannot play tricks with visibility within tidscans for the benefit of solving the basic problem of doing the right thing for READ COMMITTED. Clearly I need to add a new "MVCC violation" in the spirit of EvalPlanQual() to avoid this problem (since READ COMMITTED serialization failures are unacceptable to users), but I'd like this new MVCC violation to be as selective as possible.

Let me take a step back, though. Presently, UPDATE WHERE CURRENT OF does something that isn't a million miles from what I have in mind to address the problem. That feature exists for "sensitive cursors", where we lock rows (since FOR UPDATE was specified with the DECLARE CURSOR statement) as they're fetched from the cursor. The upshot is that in general, rows can change "out from under us" since they're not yet locked, including in ways that might result in a new row version not satisfying the original SELECT FOR UPDATE query predicate - clearly users don't want to have to deal with that when they subsequently go to UPDATE WHERE CURRENT OF. Special handling is required. Purely because UPDATE WHERE CURRENT OF was specified, as part of this special handling that update must use a tid scan. There is a little bit of special machinery within the optimizer (within cost_tidscan()) to force this for the isCurrentOf case. The tid scan code within the executor also has special UPDATE WHERE CURRENT OF handling.
TidListCreate() specially handles the case by asking execCurrentOf() to look up the cursor's portal, and to get a tid from its QueryDesc for the current cursor position. TidNext() specially fetches the most recent version visible to our estate's Snapshot when we UPDATE WHERE CURRENT OF, without considering the original SELECT FOR UPDATE predicate (and whether or not it would in any sense continue to be satisfied). Each tid is then fetched from the heap and returned.

I think I ought to be able to do something similar with this new CONFLICTS clause, by similarly marshaling some list of tids ultimately originating from the ModifyTable node that locked a potentially-not-yet-visible row (or maybe the optimizer recognizes the internal special case and creates the tid scan node "visibility credulous"). I'd then return heap tuples by using a dirty snapshot. Since this can only happen for the UPDATE's tid scan, and there must be a tid scan, it should be correct (on its own fairly reasonable terms, for READ COMMITTED). After all, at this juncture for the purposes of UPSERT it will already certainly be the case that the tuple is locked.

I would like to:

* Get feedback on the hazards of taking this direction, and indeed the advisability of the whole approach in general. For example, I think we'd always want to use a DirtySnapshot, and never try to just get the latest version visible to our estate snapshot, but I might be wrong about that. Since my magical UPDATE could have additional boolean conditions in its predicate (that is, additions to the implicit ctid join, in contrast to the existing magical WHERE CURRENT OF variant of UPDATE, where that's always the only thing in the predicate), the "morality" of *which* tid should get to the UPDATE should certainly be considered. Of particular concern is the fact that having Postgres do one or the other could make the difference between an UPDATE qual being satisfied or not being satisfied.
Maybe this is an argument against being able to tack on additional boolean conditions in the following fashion (which I'd prefer to support without also supporting more than one join (the magical implicit one), even though as I mentioned WHERE CURRENT OF does not support boolean conditions):

with c as (
    insert into ints(key, val) values (1, 'a'), (2, 'b')
    on conflict
    select * for update)
update ints set val = c.val from c conflicts where c.key != 1;

* Get some direction on what it would take to make joins involving ctid work, since all of this hinges upon that infrastructure becoming available. This appears to be a Simple Matter of Programming (at least for someone that happens to already have a good understanding of the optimizer), and is anticipated by this comment within tidpath.c:

 * There is currently no special support for joins involving CTID; in
 * particular nothing corresponding to best_inner_indexscan(). Since it's
 * not very useful to store TIDs of one table in another table, there
 * doesn't seem to be enough use-case to justify adding a lot of code
 * for that.

[1] http://www.postgresql.org/message-id/AANLkTineR-rDFWENeddLg=GrkT+epMHk2j9X0YqpiTY8@mail.gmail.com
[2] http://www.pgcon.org/2014/schedule/attachments/327_upsert_weird.pdf , slide 25
--
Peter Geoghegan
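[Editor's note: the TidNext()-style behaviour described above - given a known tid, fetch the newest version of that row without re-checking the original predicate - can be pictured with a toy model. Everything below (the HeapTuple class, the next_version link standing in for a t_ctid forward pointer) is an illustrative assumption, not PostgreSQL source.]

```python
# Toy model of "fetch the latest tuple version for a known tid".
# HeapTuple and next_version are illustrative stand-ins for heap
# tuples and their t_ctid-style forward pointers; this is NOT how
# the executor is actually implemented.

class HeapTuple:
    def __init__(self, val):
        self.val = val
        self.next_version = None  # link to a newer version, if updated

def fetch_latest(tup):
    """Chase the update chain to the most recent tuple version,
    ignoring whether that version still satisfies the original
    query predicate (the WHERE CURRENT OF behaviour described above)."""
    while tup.next_version is not None:
        tup = tup.next_version
    return tup

v1 = HeapTuple("old")
v1.next_version = HeapTuple("new")   # a concurrent UPDATE created a new version
assert fetch_latest(v1).val == "new"
```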
On Thu, Jul 17, 2014 at 6:18 PM, Peter Geoghegan <pg@heroku.com> wrote: > This appears to be a Simple Matter of Programming (at least > for someone that happens to already have a good understanding of the > optimizer), and is anticipated by this comment within tidpath.c: > > * There is currently no special support for joins involving CTID; in > * particular nothing corresponding to best_inner_indexscan(). Since it's > * not very useful to store TIDs of one table in another table, there > * doesn't seem to be enough use-case to justify adding a lot of code > * for that. By the way, this comment is obsolete since best_inner_indexscan() was removed in 2012 by the parameterized paths patch. -- Peter Geoghegan
Hi,

On 2014-07-17 18:18:41 -0700, Peter Geoghegan wrote:
> I'm working on UPSERT again. I think that in order to make useful
> progress, I'll have to find a better way of overcoming the visibility
> issues (i.e. the problem of what to do about a
> still-in-progress-to-our-snapshot row being locked at READ COMMITTED
> isolation level [1][2]).

Agreed.

> I've made some tentative additions to my earlier patch to change the syntax:
>
> postgres=# explain analyze
> with c as (
>     insert into ints(key, val) values (1, 'a'), (2, 'b')
>     on conflict
>     select * for update)
> update ints set val = c.val from c conflicts;

I still don't agree that this is a sane syntax for upsert. Avoiding it would reduce the scope of the problems you have measurably. We should provide a syntax that does the UPSERT itself, instead of delegating that to the user as you're proposing above. Afaics nearly none of the problems you're debating in your email would exist if upsert were entirely done internally.

I think the things that use "weird" visibility semantics are pretty much all doing that internally (things being EvalPlanQual stuff for INSERT/UPDATE/DELETE and the referential integrity triggers). I don't see sufficient reason why we should break away from that here. Yes, there's stuff that cannot be done when it's done internally, but we can live with those not being possible.

Greetings,

Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jul 18, 2014 at 10:46 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I think the things that use "wierd" visibility semantics are pretty much > all doing that internally (things being EvalPlanQual stuff for > INSERT/UPDATE/DELETE and the referential integrity triggers). I don't > see sufficient reason why we should break away from that here. Yes, > there's stuff that cannot be done when it's done internally, but we can > live with those not being possible. The whole point of what I was proposing was that those semantics would only apply to a special tid scan node. Perhaps I missed something, but I'm not sure why you'd consider that I was breaking away from that here at all. -- Peter Geoghegan
On 2014-07-18 10:53:36 -0700, Peter Geoghegan wrote: > On Fri, Jul 18, 2014 at 10:46 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > I think the things that use "wierd" visibility semantics are pretty much > > all doing that internally (things being EvalPlanQual stuff for > > INSERT/UPDATE/DELETE and the referential integrity triggers). I don't > > see sufficient reason why we should break away from that here. Yes, > > there's stuff that cannot be done when it's done internally, but we can > > live with those not being possible. > > The whole point of what I was proposing was that those semantics would > only apply to a special tid scan node. Perhaps I missed something, but > I'm not sure why you'd consider that I was breaking away from that > here at all. I don't see why you'd need such a node at all if we had a fully builtin UPSERT. The whole stuff with ON CONFLICT SELECT FOR UPDATE and then UPDATE ... FROM c CONFLICTS is too complicated and exposes stuff that barely anybody will understand, let alone use correctly in queries they write themselves. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jul 18, 2014 at 11:06 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I don't see why you'd need such a node at all if we had a fully builtin > UPSERT. The whole stuff with ON CONFLICT SELECT FOR UPDATE and then > UPDATE ... FROM c CONFLICTS is too complicated and exposes stuff that > barely anybody will understand, let alone use correctly in queries they > write themselves. I accept that there will be a need for certain restrictions. Most obviously, if you update the target table referencing a CTE like this, not using the special CONFLICTS clause in the UPDATE (or DELETE) is an error. And as I mentioned, you may only join the projected duplicates to the UPDATE ModifyTable - an attempt to join any more relations is an error. In short, this *is* a fully built-in upsert. -- Peter Geoghegan
On 2014-07-18 11:14:34 -0700, Peter Geoghegan wrote: > On Fri, Jul 18, 2014 at 11:06 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > I don't see why you'd need such a node at all if we had a fully builtin > > UPSERT. The whole stuff with ON CONFLICT SELECT FOR UPDATE and then > > UPDATE ... FROM c CONFLICTS is too complicated and exposes stuff that > > barely anybody will understand, let alone use correctly in queries they > > write themselves. > > I accept that there will be a need for certain restrictions. Most > obviously, if you update the target table referencing a CTE like this, > not using the special CONFLICTS clause in the UPDATE (or DELETE) is an > error. And as I mentioned, you may only join the projected duplicates > to the UPDATE ModifyTable - an attempt to join any more relations is > an error. In short, this *is* a fully built-in upsert. Meh. An understandable syntax wouldn't require the pullups with a special scan node and such. I think you're attempting a sort of genericity that's making your (important!) goal much harder to reach. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jul 18, 2014 at 11:23 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Meh. A understandable syntax wouldn't require the pullups with a special > scan node and such. I think you're attempting a sort of genericity > that's making your (important!) goal much harder to reach. Well, maybe. If the genericity of this syntax isn't what people want, I may go with something else. At the suggestion of Robert I did start a dedicated thread on that question months back (explicitly putting aside the nitty-gritty details of the implementation), but that didn't go anywhere. -- Peter Geoghegan
On Fri, Jul 18, 2014 at 11:32 AM, Peter Geoghegan <pg@heroku.com> wrote: > Well, maybe. If the genericity of this syntax isn't what people want, > I may go with something else. By the way, I didn't mention that there is now also the optional ability to specify a "merge on" unique index directly in DML. It would be much nicer to specify a sort key instead, but that might be tricky in the general case. I imagine that other systems solve the problem by being very restrictive in what will be (implicitly) accepted as a merge-on index. Seemingly there are problems with all major SQL MERGE implementations, so I don't necessarily expect to be able to draw lessons from them in any way here. -- Peter Geoghegan
On Fri, Jul 18, 2014 at 11:23 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Meh. An understandable syntax wouldn't require the pullups with a special
> scan node and such.

Well, in general ExecModifyTable()/ExecUpdate() trusts the tid passed to match the qual in the query. Unless you're willing to give up on the idea of having a qual on the update (other than something like an ON CONFLICTS qual, obviously) I think you'll probably end up with some kind of pseudo-scan node anyway, if only for consistency with how things already work, to give you somewhere to show the Filter in explain output and so on. ExecScan() probably needs to ExecQual(). Inserts don't have scan nodes, but updates do, and so it seems pretty odd to make the "Insert" node (a ModifyTable node) pullup from a result node (as always), and the "Update" node (also a ModifyTable node) treat the first ModifyTable node as its own pseudo-scan node, when in fact we are scanning the heap in a manner not unlike a conventional update. Or do you envision a second result node where an update qual might be evaluated? I am not enthused about either possibility.

The idea of not having the ability to put a predicate on the update at all, much like the MySQL thing, isn't completely outrageous, but it certainly isn't going to go down well with those that have already expressed concerns about how much of a foot gun this could be.
--
Peter Geoghegan
On Sat, Jul 19, 2014 at 10:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Jul 18, 2014 at 11:23 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Meh. A understandable syntax wouldn't require the pullups with a special
>> scan node and such.
>
> Well, in general ExecModifyTable()/ExecUpdate() trusts the tid passed
> to match the qual in the query. Unless you're willing to give up on
> the idea of having a qual on the update (other than something like an
> ON CONFLICTS qual, obviously) I think you'll probably end up with some
> kind of pseudo-scan node anyway, if only for consistency with how
> things already work, to give you somewhere to show the Filter in
> explain output and so on. ExecScan() probably needs to ExecQual().
> Inserts don't have scan nodes, but updates do, and so it seems pretty
> odd to make the "Insert" node (a ModifyTable node) pullup from a
> result node (as always), and the "Update" node (also a ModifyTable
> node) treat the first ModifyTable node as its own pseudo-scan node,
> when in fact we are scanning the heap in a manner not unlike a
> conventional update. Or do you envision a second result node where an
> update qual might be evaluated? I am not enthused about either
> possibility.
>
> The idea of not having the ability to put a predicate on the update at
> all, much like the MySQL thing, isn't completely outrageous, but it
> certainly isn't going to go down well with those that have already
> expressed concerns about how much of a foot gun this could be.

This all seems completely to one side of Andres's point. I think what he's saying is: don't implement an SQL syntax of the form INSERT ON CONFLICT and let people use that to implement upsert. Instead, directly implement an SQL command called UPSERT or similar. That's long been my intuition as well. I think generality here is not your friend.

I'd suggest something like:

UPSERT table SET col = value [, ...] WHERE predicate;

with semantics like this:

1. Search the table, using any type of scan you like, for a row matching the given predicate.
2. If you find more than one tuple that is visible to your scan, error.
3. If you find exactly one tuple that is visible to your scan, update it. Stop.
4. If you find no tuple that is visible to your scan, but you do find at least one tuple whose xmin has committed and whose xmax has not committed, then (4a) if the isolation level is REPEATABLE READ or higher, error; (4b) if there is more than one such tuple, error; else (4c) update it, in violation of normal MVCC visibility rules.
5. Construct a new tuple using the col/value pairs from the SET clause and try to insert it. If this would fail with a unique index violation, back out and go back to step 1.

Having said all that, I believe the INSERT ON CONFLICT syntax is more easily comprehensible than previous proposals. But I still tend to agree with Andres that an explicit UPSERT syntax or something like it, that captures all of the MVCC games inside itself, is likely preferable from a user standpoint, whatever the implementation ends up looking like.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
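[Editor's note: the five steps above can be sketched as a toy loop over an in-memory table. Everything here - TupleVersion, MAX_RETRIES, the list standing in for a heap - is an illustrative assumption, not a proposed implementation; it only shows the control flow of the proposed semantics.]

```python
# Toy simulation of the proposed UPSERT semantics (steps 1-5 above).
# Names and data structures are illustrative, not PostgreSQL code.

MAX_RETRIES = 100  # illustrative cap on the step-5 -> step-1 retry loop

class TupleVersion:
    def __init__(self, key, val, visible, in_progress=False):
        self.key = key
        self.val = val
        self.visible = visible          # visible to our MVCC snapshot
        self.in_progress = in_progress  # xmin committed, xmax not committed

def upsert(heap, predicate, new_key, new_val, isolation="read committed"):
    for _ in range(MAX_RETRIES):
        # Step 1: scan for rows matching the predicate.
        visible = [t for t in heap if t.visible and predicate(t)]
        # Step 2: more than one visible match is an error.
        if len(visible) > 1:
            raise ValueError("predicate matches more than one visible row")
        # Step 3: exactly one visible match -> update it, stop.
        if len(visible) == 1:
            visible[0].val = new_val
            return "updated"
        # Step 4: no visible match, but an in-progress tuple exists.
        pending = [t for t in heap if t.in_progress and predicate(t)]
        if pending:
            if isolation != "read committed":
                raise ValueError("could not serialize access")   # 4a
            if len(pending) > 1:
                raise ValueError("ambiguous in-progress tuple")  # more than one: error
            pending[0].val = new_val   # update despite normal MVCC visibility
            pending[0].visible = True
            return "updated"
        # Step 5: insert; on a (simulated) unique violation, back out
        # and go back to step 1.
        if any(t.visible and t.key == new_key for t in heap):
            continue
        heap.append(TupleVersion(new_key, new_val, visible=True))
        return "inserted"
    raise RuntimeError("retry limit exceeded")
```

For example, `upsert(heap, lambda t: t.key == 1, 1, "a")` inserts on an empty heap and updates on a second call.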
On Wed, Jul 23, 2014 at 9:58 AM, Robert Haas <robertmhaas@gmail.com> wrote: > This all seems completely to one side of Andres's point. I think what > he's saying is: don't implement an SQL syntax of the form INSERT ON > CONFLICT and let people use that to implement upsert. Instead, > directly implement an SQL command called UPSERT or similar. That's > long been my intuition as well. I think generality here is not your > friend. Sure, I was just making one fairly narrow point in relation to Andres' concern about executor structuring of the UPSERT. > I'd suggest something like: > > UPSERT table SET col = value [, ...] WHERE predicate; Without dismissing the overall concern shared by you and Andres, that particular update-driven syntax doesn't strike me as something that should be pursued. I would like to be able to insert or update more than a single row at a time, for one thing. For another, what exactly appears in the predicate? Is it more or less the same as an UPDATE's predicate? > with semantics like this: > > 1. Search the table, using any type of scan you like, for a row > matching the given predicate. Perhaps I've misunderstood, but this is fundamentally different to what I'd always thought would need to happen. I always understood that the UPSERT should be insert-driven, and so not really have an initial scan at all in the same sense that every Insert lacks one. Moreover, I've always thought that everyone agreed on that. We go to insert, and then in the course of inserting index tuples do something with dirty snapshots. That's how we get information on conflicts. For one thing we cannot use any kind of scan unless there is a new mechanism for seeing not-yet-visible tuples, like some kind of non-MVCC snapshot that those scan nodes can selectively use. Now, I guess that you intend that in this scenario you go straight to 5, and then your insert finds the not-yet-visible conflict. 
But it's not clear that when 5 fails, and we pick up from 1 that we then get a new MVCC Snapshot or something else that gives us a hope of succeeding this time around. And it seems quite wasteful to not use knowledge of conflicts from the insert at all - that's one thing I'm trying to address here; removing unnecessary index scans when we actually know what our conflict TIDs are. > 2. If you find more than one tuple that is visible to your scan, error. This point seems to concern making the UPSERT UPDATE only affect zero or one rows. Is that right? If so, why do you think that's a useful constraint to impose? > 3. If you find exactly one tuple that is visible to your scan, update it. Stop. > 4. If you find no tuple that is visible to your scan, but you do find > at least one tuple whose xmin has committed and whose xmax has not > committed, then (4a) if the isolation level is REPEATABLE READ or > higher, error; (4b) if there is more than one such tuple, error; else > (4b) update it, in violation of normal MVCC visibility rules. Point 4b ("if there is more than one such tuple...") seems like it needs some practical or theoretical justification - do you just not want to follow an update chain? If so, why? What would the error message actually be? And, to repeat myself: Why should an MVCC snapshot see more than one such tuple in the ordinary course of scanning a relation, since there is by definition a unique constraint that prevents this? Or maybe I'm wrong; maybe you don't even require that. That isn't clear. As you know, I've always opposed any type of READ COMMITTED serialization failure. If you're worried about lock starvation hazards - although I don't think strong concerns here are justified - we can always put in a fail-safe number of retries before throwing an error. I'm comfortable with that because I think based on experimentation that it's going to be virtually impossible to hit. 
Apparently other snapshot isolation databases acquire a new snapshot, and re-do the command rather than using an EvalPlanQual()-like mechanism and thereby violating snapshot isolation. Those other systems have just such a hard limit on retries before erroring, and it seems to work out okay for them. > 5. Construct a new tuple using the col/value pairs from the SET clause > and try to insert it. If this would fail with a unique index > violation, back out and go back to step 1. Again, I can't see why you'd do this step last and not first, since this is naturally the place to initially "see into the future" with a dirty snapshot. > Having said all that, I believe the INSERT ON CONFLICT syntax is more > easily comprehensible than previous proposals. But I still tend to > agree with Andres that an explicit UPSERT syntax or something like it, > that captures all of the MVCC games inside itself, is likely > preferable from a user standpoint, whatever the implementation ends up > looking like. Okay then. If you both feel that way, I will come up with something closer to what you sketch. But for now I still strongly feel it ought to be driven by an insert. Perhaps I've misunderstood you entirely, though. -- Peter Geoghegan
On Wed, Jul 23, 2014 at 5:58 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I'd suggest something like: > > UPSERT table SET col = value [, ...] WHERE predicate; I don't think this is actually good enough. Typical use cases are things like "increment this counter or insert a new one starting at 0". -- greg
Peter Geoghegan <pg@heroku.com> wrote:

> I still strongly feel it ought to be driven by an insert

Could you clarify that? Does this mean that you feel that we should write to the heap before reading the index to see if the row will be a duplicate? If so, I think that is a bad idea, since this will sometimes be used to apply a new data set which hasn't changed much from the old, and that approach will perform poorly for this use case, causing a lot of bloat. It certainly would work well for the case that most of the rows are expected to be INSERTs rather than UPDATEs, but I'm not sure that's justification for causing extreme bloat in the other cases.

Also, just a reminder that I'm going to squawk loudly if the implementation does not do something fairly predictable and sane for the case that the table has more than one UNIQUE index and you attempt to UPSERT a row that is a duplicate of one row on one of the indexes and a different row on a different index. The example discussed during your PGCon talk was something like a city table with two columns, each with a UNIQUE constraint, containing:

 city_id | city_name
---------+-----------
       1 | Toronto
       2 | Ottawa

... and an UPSERT comes through for (1, 'Ottawa'). We would all like for that never to happen, but it will. There must be sane and documented behavior in that case.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
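[Editor's note: the hazard in the city example above can be made concrete with a small simulation; the table, column list, and helper below are purely illustrative. The point is that the proposed row conflicts with a *different* existing row on each unique index, so "which row do we update?" has no obvious answer.]

```python
# Toy illustration of a multi-unique-index conflict: the proposed row
# (1, 'Ottawa') collides with one row via city_id and a different row
# via city_name. Illustrative only, not PostgreSQL code.

table = [
    {"city_id": 1, "city_name": "Toronto"},
    {"city_id": 2, "city_name": "Ottawa"},
]
unique_columns = ["city_id", "city_name"]  # each has a UNIQUE constraint

def conflicting_rows(new_row):
    """Map each unique column to the existing row it conflicts with."""
    conflicts = {}
    for col in unique_columns:
        for row in table:
            if row[col] == new_row[col]:
                conflicts[col] = row
    return conflicts

c = conflicting_rows({"city_id": 1, "city_name": "Ottawa"})
assert c["city_id"]["city_name"] == "Toronto"   # id collides with Toronto's row
assert c["city_name"]["city_id"] == 2           # name collides with Ottawa's row
```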
On Wed, Jul 23, 2014 at 4:05 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Wed, Jul 23, 2014 at 9:58 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> This all seems completely to one side of Andres's point. I think what >> he's saying is: don't implement an SQL syntax of the form INSERT ON >> CONFLICT and let people use that to implement upsert. Instead, >> directly implement an SQL command called UPSERT or similar. That's >> long been my intuition as well. I think generality here is not your >> friend. > > Sure, I was just making one fairly narrow point in relation to Andres' > concern about executor structuring of the UPSERT. > >> I'd suggest something like: >> >> UPSERT table SET col = value [, ...] WHERE predicate; > > Without dismissing the overall concern shared by you and Andres, that > particular update-driven syntax doesn't strike me as something that > should be pursued. I would like to be able to insert or update more > than a single row at a time, for one thing. For another, what exactly > appears in the predicate? Is it more or less the same as an UPDATE's > predicate? To the last question, yes. To the first point, I'm not bent on this particular syntax, but I am +1 for the idea that the command is something specifically upsert-like rather than something more generic like an ON CONFLICT SELECT that lets you do arbitrary things with the returned rows. >> with semantics like this: >> >> 1. Search the table, using any type of scan you like, for a row >> matching the given predicate. > > Perhaps I've misunderstood, but this is fundamentally different to > what I'd always thought would need to happen. I always understood that > the UPSERT should be insert-driven, and so not really have an initial > scan at all in the same sense that every Insert lacks one. Moreover, > I've always thought that everyone agreed on that. We go to insert, and > then in the course of inserting index tuples do something with dirty > snapshots. 
That's how we get information on conflicts. It's certainly arguable whether you should INSERT and then turn failures into an update or try to UPDATE and then turn failures into an INSERT; we might even want to have both options available, though that smells a little like airing too much of our dirty laundry. But I think I generally favor optimizing for the UPDATE case for more or less the same reasons Kevin articulated. > For one thing we cannot use any kind of scan unless there is a new > mechanism for seeing not-yet-visible tuples, like some kind of > non-MVCC snapshot that those scan nodes can selectively use. Now, I > guess that you intend that in this scenario you go straight to 5, and > then your insert finds the not-yet-visible conflict. But it's not > clear that when 5 fails, and we pick up from 1 that we then get a new > MVCC Snapshot or something else that gives us a hope of succeeding > this time around. And it seems quite wasteful to not use knowledge of > conflicts from the insert at all - that's one thing I'm trying to > address here; removing unnecessary index scans when we actually know > what our conflict TIDs are. Here you seem to be suggesting that I intended to propose your existing design rather than something else, which I didn't. In this design, you find the conflict (at most one) by scanning for the tuple to be updated. >> 2. If you find more than one tuple that is visible to your scan, error. > > This point seems to concern making the UPSERT UPDATE only affect zero > or one rows. Is that right? If so, why do you think that's a useful > constraint to impose? Because nobody wants an operation to either insert 1 tuple or update n>=1 tuples. The intention is that the predicate should probably be something like WHERE unique_key = 'some_value', but you can use something else. So it's kinda like saying which index you care about for a particular operation, except without having to explicitly name an index. 
But in any event you should use a predicate that uniquely identifies the tuple you want to update. >> 3. If you find exactly one tuple that is visible to your scan, update it. Stop. >> 4. If you find no tuple that is visible to your scan, but you do find >> at least one tuple whose xmin has committed and whose xmax has not >> committed, then (4a) if the isolation level is REPEATABLE READ or >> higher, error; (4b) if there is more than one such tuple, error; else >> (4b) update it, in violation of normal MVCC visibility rules. > > Point 4b ("if there is more than one such tuple...") seems like it > needs some practical or theoretical justification - do you just not > want to follow an update chain? If so, why? What would the error > message actually be? And, to repeat myself: Why should an MVCC > snapshot see more than one such tuple in the ordinary course of > scanning a relation, since there is by definition a unique constraint > that prevents this? Or maybe I'm wrong; maybe you don't even require > that. That isn't clear. In case (4b), like in case (2), you've done something like UPSERT tab SET ... WHERE non_unique_index = 'value_appearing_more_than_once'. You shouldn't do that, because you have only one replacement tuple (embodied in the SET clause). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Peter Geoghegan-3 wrote >> with semantics like this: >> >> 1. Search the table, using any type of scan you like, for a row >> matching the given predicate. > > Perhaps I've misunderstood, but this is fundamentally different to > what I'd always thought would need to happen. I always understood that > the UPSERT should be insert-driven, and so not really have an initial > scan at all in the same sense that every Insert lacks one. Moreover, > I've always thought that everyone agreed on that. We go to insert, and > then in the course of inserting index tuples do something with dirty > snapshots. That's how we get information on conflicts.

SQL standard notwithstanding, there really is no way to pick one implementation and not have it be sub-optimal in some situations. Unless there is a high barrier, why not introduce syntax - SCAN FIRST; INSERT FIRST - that allows the user to specify the behavior that they expect would be most efficient given their existing/new data ratio?

>> Having said all that, I believe the INSERT ON CONFLICT syntax is more >> easily comprehensible than previous proposals. But I still tend to >> agree with Andres that an explicit UPSERT syntax or something like it, >> that captures all of the MVCC games inside itself, is likely >> preferable from a user standpoint, whatever the implementation ends up >> looking like. > > Okay then. If you both feel that way, I will come up with something > closer to what you sketch. But for now I still strongly feel it ought > to be driven by an insert. Perhaps I've misunderstood you entirely, > though.

Getting a little syntax crazy here, but having all of:

UPSERT [SCAN|INSERT] FIRST
INSERT ON CONFLICT UPDATE - same as INSERT FIRST
UPDATE ON MISSING INSERT - same as SCAN FIRST

with the corresponding 2 implementations would make the user interface slightly more complicated but able to be conformed to the actual data that the user has. 
You could basically perform a two-phase pass: run the user-requested algorithm, then for all failures attempt the alternate algorithm, and error only if both fail. I am not at all fluent on the concurrency issues here, and the MVCC violations and re-tries that might be considered, but at a high level there is disagreement here simply because both answers are "correct" and ideally both can be provided to the user. David J.
On Wed, Jul 23, 2014 at 3:01 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > Could you clarify that? Does this mean that you feel that we > should write to the heap before reading the index to see if the row > will be a duplicate? If so, I think that is a bad idea, since this > will sometimes be used to apply a new data set which hasn't changed > much from the old, and that approach will perform poorly for this > use case, causing a lot of bloat. It certainly would work well for > the case that most of the rows are expected to be INSERTs rather > than UPDATEs, but I'm not sure that's justification for causing > extreme bloat in the other cases. No, I think we should stagger ordinary index insertion: we lock indexes using the existing heavyweight page-level index locks, then if successful insert a heap tuple, before finally inserting index tuples. My design doesn't cause bloat under any circumstances. Heikki's design, which he sketched with an actual POC implementation, involved possible bloating in the event of a conflict. He also had to go and delete the promise tuple (from within ExecInsert()) in the event of a conflict before row locking, in order to prevent unprincipled deadlocking. Andres wanted to do something else similarly involving "promise tuples", where the xid of the inserter was stored in indexes with a special flag. That could also cause bloat. I think that could be particularly bad when conflicts necessitate visiting indexes one by one to kill promise tuples, as opposed to just killing one heap tuple as in the case of Heikki's design. Anyway, both of those designs, and my own, are insert-driven. The main difference between the design that Heikki sketched and my own is that mine does not cause bloat, but is more invasive to the nbtree code (but less invasive to a lot of other places to make the deadlocking-ultimately-conflicting tuple killing work). 
But I believe that Heikki's design is identical to my own in terms of user-visible semantics. That said, his design was just a sketch, and it wouldn't be fair to hold him to it. > Also, just a reminder that I'm going to squawk loudly if the > implementation does not do something fairly predictable and sane > for the case that the table has more than one UNIQUE index and you > attempt to UPSERT a row that is a duplicate of one row on one of > the indexes and a different row on a different index. Duly noted. :-) I think that it's going to have to support that one way or the other. It might be the case that I'll want to make the choice of unique index optionally "implicit", but it's clear that we want to be able to specify a specific unique index in one form or another. Actually, I've already added that. It's just optional right now. I haven't found a better way than by just specifying the name of the unique index in DML, which is ugly, which is the main reason I want to make it optional. Perhaps we can overcome this. -- Peter Geoghegan
Robert Haas wrote: > On Wed, Jul 23, 2014 at 4:05 PM, Peter Geoghegan <pg@heroku.com> wrote: > > On Wed, Jul 23, 2014 at 9:58 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> 2. If you find more than one tuple that is visible to your scan, error. > > > > This point seems to concern making the UPSERT UPDATE only affect zero > > or one rows. Is that right? If so, why do you think that's a useful > > constraint to impose? > > Because nobody wants an operation to either insert 1 tuple or update > n>=1 tuples. The intention is that the predicate should probably be > something like WHERE unique_key = 'some_value', but you can use > something else. So it's kinda like saying which index you care about > for a particular operation, except without having to explicitly name > an index. But in any event you should use a predicate that uniquely > identifies the tuple you want to update. This seemed a nice idea when I first read it earlier today, but now I'm not so sure. Are you saying that it wouldn't be allowed to use an UPSERT with some sort of join, such that each joined row would produce either one insert or one update? To clarify: suppose I import some external data into a temp table, then run UPSERT "USING" that table so that the rows end up in a permanent table; some of the rows might be already in the permanent table, some others might not. I would hope that by the time UPSERT is done, all the rows are in the permanent table. Would that raise an error, with your proposed design? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
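Concretely, the import scenario described above might look like this (the UPSERT "USING" syntax is hypothetical, as are the table names and file path):

```sql
-- Stage externally sourced rows in a temp table.
CREATE TEMP TABLE staging (key int, val text);
COPY staging FROM '/tmp/import.csv' (FORMAT csv);

-- Hypothetical syntax: each staging row should end up in the
-- permanent table as exactly one insert or one update, with no
-- error raised for rows that already exist.
UPSERT permanent USING staging s
    SET val = s.val
    WHERE permanent.key = s.key;
```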
On Wed, Jul 23, 2014 at 3:27 PM, Robert Haas <robertmhaas@gmail.com> wrote: > To the last question, yes. To the first point, I'm not bent on this > particular syntax, but I am +1 for the idea that the command is > something specifically upsert-like rather than something more generic > like an ON CONFLICT SELECT that lets you do arbitrary things with the > returned rows. FWIW I wasn't proposing that you'd be able to do arbitrary things. There'd be a few restrictions, such as forbidding unexpected DML commands, and requiring that only a special update reference the special rejected-projecting CTE. They just wouldn't be arbitrary restrictions. But that's not that interesting to me anymore; you and Andres want a dedicated DML command with the update part built in, which isn't as flexible. Okay, fine. I don't think it makes the implementation any easier, but that's what I'll do. > It's certainly arguable whether you should INSERT and then turn failures > into an update or try to UPDATE and then turn failures into an INSERT; > we might even want to have both options available, though that smells > a little like airing too much of our dirty laundry. But I think I > generally favor optimizing for the UPDATE case for more or less the > same reasons Kevin articulated. I don't see the connection between this and Kevin's remarks. And FWIW, I don't see a reason to favor inserts or updates. Fortunately, what I have balances both cases very well, and doesn't cause bloat. The work of descending the index to lock it isn't wasted if an update is required. My implementation decides to either insert or update at literally the latest possible moment. >> For one thing we cannot use any kind of scan unless there is a new >> mechanism for seeing not-yet-visible tuples, like some kind of >> non-MVCC snapshot that those scan nodes can selectively use. Now, I >> guess that you intend that in this scenario you go straight to 5, and >> then your insert finds the not-yet-visible conflict. 
But it's not >> clear that when 5 fails, and we pick up from 1, that we then get a new >> MVCC snapshot or something else that gives us a hope of succeeding >> this time around. And it seems quite wasteful to not use knowledge of >> conflicts from the insert at all - that's one thing I'm trying to >> address here; removing unnecessary index scans when we actually know >> what our conflict TIDs are. > > Here you seem to be suggesting that I intended to propose your existing > design rather than something else, which I didn't. In this design, > you find the conflict (at most one) by scanning for the tuple to be > updated. Yes, but what if you don't see a conflict because it isn't visible to your snapshot, and then you insert, and only then (step 5), presumably with a dirty snapshot, you find a conflict? How does the loop terminate if that brings you back to step 1 with the same MVCC snapshot feeding the update? >>> 2. If you find more than one tuple that is visible to your scan, error. >> >> This point seems to concern making the UPSERT UPDATE only affect zero >> or one rows. Is that right? If so, why do you think that's a useful >> constraint to impose? > > Because nobody wants an operation to either insert 1 tuple or update > n>=1 tuples. The intention is that the predicate should probably be > something like WHERE unique_key = 'some_value', but you can use > something else. So it's kinda like saying which index you care about > for a particular operation, except without having to explicitly name > an index. But in any event you should use a predicate that uniquely > identifies the tuple you want to update. I agree that you want to uniquely identify each tuple. What I meant was, why should we not be able to upsert multiple rows in a single command? What's wrong with that? >> Point 4b ("if there is more than one such tuple...") seems like it >> needs some practical or theoretical justification - do you just not >> want to follow an update chain? If so, why? 
What would the error >> message actually be? And, to repeat myself: Why should an MVCC >> snapshot see more than one such tuple in the ordinary course of >> scanning a relation, since there is by definition a unique constraint >> that prevents this? Or maybe I'm wrong; maybe you don't even require >> that. That isn't clear. > > In case (4b), like in case (2), you've done something like UPSERT tab > SET ... WHERE non_unique_index = 'value_appearing_more_than_once'. > You shouldn't do that, because you have only one replacement tuple > (embodied in the SET clause). Oh, I totally agree that we should throw an error if any one row is affected more than once by the same command. Indeed, SQL MERGE requires this, since to do otherwise would leave the final value of the row unspecified. It seemed to me from your original explanation of point 4b that you were saying something much stronger. But if that's all you meant, then I agree. -- Peter Geoghegan
On Wed, Jul 23, 2014 at 9:58 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I'd suggest something like: > > UPSERT table SET col = value [, ...] WHERE predicate; I think I see another problem with this. The UPDATE must provide a WHERE clause Var on a unique indexed column (let's say it's constrained to provide a "(Var [unique-indexed opclass support function 3 op] Const)" qual during parse analysis because you asked for UPSERT). But it can also affect 0 rows for other reasons, since this UPDATE qual can have arbitrary other expressions. So in general you have any number of reasons for not updating, which implies that you must insert, which might not be possible because there might even still be a duplicate key violation in a not-yet-visible-to-you row (even just in the unique index you intended to merge on). Whereas, when inserting, there is exactly one reason (per row) to go and update - a conflict (i.e. a would-be duplicate violation). And this is a conflict that is meaningful to every transaction, regardless of their snapshot, since it represents an objective fact about the physical presence of an index tuple. So, do you make the UPDATE differentiate between different reasons for the UPDATE failing, only inserting when an UPSERT would have happened had you omitted the extra stuff in your UPSERT predicate? Can you really differentiate anyway? And isn't the choice to do the insert based on what your MVCC snapshot happens to see - rather than something that there is necessarily objective truth on, the physical presence of a duplicate tuple in an index - rather arbitrary? It's a different situation to my implementation, where you do an insert-like thing, and then find a conflict row to lock and then update, because at that point the row is successfully locked, and the WHERE clause may be evaluated against rejects and the *most recent version* of the locked, conflict-causing row. The most recent, locked version is not arbitrary at all. 
That's the version you ought to evaluate a predicate against before finally deciding to UPDATE. I assume you agree with my view that UPSERT should always insert or update. This kind of business sounds closer to SQL MERGE, where an insert may not drive things, and you accept these kinds of anomalies in exchange for a lot more flexibility in not having inserts drive things. Did you happen to read my post on that? -- Peter Geoghegan
On Wed, Jul 23, 2014 at 7:32 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> Because nobody wants an operation to either insert 1 tuple or update >> n>=1 tuples. The intention is that the predicate should probably be >> something like WHERE unique_key = 'some_value', but you can use >> something else. So it's kinda like saying which index you care about >> for a particular operation, except without having to explicitly name >> an index. But in any event you should use a predicate that uniquely >> identifies the tuple you want to update. > > This seemed a nice idea when I first read it earlier today, but now I'm > not so sure. Are you saying that it wouldn't be allowed to use an > UPSERT with some sort of join, such that each joined row would produce > either one insert or one update? To clarify: suppose I import some > external data into a temp table, then run UPSERT "USING" that table so > that the rows end up in a permanent table; some of the rows might be > already in the permanent table, some others might not. I would hope > that by the time UPSERT is done, all the rows are in the permanent > table. Would that raise an error, with your proposed design? Yeah, my syntax didn't have a mechanism for that. I agree we should have one. I was just brainstorming. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jul 23, 2014 at 7:35 PM, Peter Geoghegan <pg@heroku.com> wrote: >> It's certainly arguable whether you should INSERT and then turn failures >> into an update or try to UPDATE and then turn failures into an INSERT; >> we might even want to have both options available, though that smells >> a little like airing too much of our dirty laundry. But I think I >> generally favor optimizing for the UPDATE case for more or less the >> same reasons Kevin articulated. > > I don't see the connection between this and Kevin's remarks. And FWIW, > I don't see a reason to favor inserts or updates. Fortunately, what I > have balances both cases very well, and doesn't cause bloat. The work > of descending the index to lock it isn't wasted if an update is > required. My implementation decides to either insert or update at > literally the latest possible moment. AFAIUI, this is because your implementation uses lwlocks in a way that Andres and I both find unacceptable. My suspicion is that any version of this that ends up getting committed is going to involve a risk of bloat in cases involving retries, and I think it will be easier to minimize bloat in an update-driven implementation. But I suppose that's speculative. >> Here you seem to be suggesting that I intended to propose your existing >> design rather than something else, which I didn't. In this design, >> you find the conflict (at most one) by scanning for the tuple to be >> updated. > > Yes, but what if you don't see a conflict because it isn't visible to > your snapshot, and then you insert, and only then (step 5), presumably > with a dirty snapshot, you find a conflict? How does the loop > terminate if that brings you back to step 1 with the same MVCC > snapshot feeding the update? Good point. Maybe the syntax should be something like: UPSERT table (keycol [, keycol] ...) { VALUES (val [, val] ...) [, ...] 
| select_query } That would address both the concern about being able to pipe multiple tuples through it and the point you just raised. We look for a row that matches each given tuple on the key columns; if one is found, we update it; if none is found, we insert. > I agree that you want to uniquely identify each tuple. What I meant > was, why should we not be able to upsert multiple rows in a single > command? What's wrong with that? Nothing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
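For illustration, that sketch would cover both the single- and multi-row cases (hypothetical syntax; "staging" is an invented source table):

```sql
-- Hypothetical single-row form: match on "key"; update the matched
-- row if one with key = 1 exists, otherwise insert the tuple.
UPSERT ints (key) VALUES (1, 'a');

-- Hypothetical multi-row form, piping tuples in from a query:
UPSERT ints (key) SELECT key, val FROM staging;
```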
On Mon, Jul 28, 2014 at 8:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > AFAIUI, this is because your implementation uses lwlocks in a way that > Andres and I both find unacceptable. That's not the case. My implementation uses page-level heavyweight locks. The nbtree AM used to use them for other stuff. Plenty of other systems use index level locks managed by a heavyweight lock manager. >>> Here you seem to be suggesting that I intended to propose your existing >>> design rather than something else, which I didn't. In this design, >>> you find the conflict (at most one) by scanning for the tuple to be >>> updated. >> >> Yes, but what if you don't see a conflict because it isn't visible to >> your snapshot, and then you insert, and only then (step 5), presumably >> with a dirty snapshot, you find a conflict? How does the loop >> terminate if that brings you back to step 1 with the same MVCC >> snapshot feeding the update? > > Good point. Maybe the syntax should be something like: > > UPSERT table (keycol [, keycol] ...) { VALUES (val [, val] ...) [, > ...] | select_query } > > That would address both the concern about being able to pipe multiple > tuples through it and the point you just raised. We look for a row > that matches each given tuple on the key columns; if one is found, we > update it; if none is found, we insert. That basically is my design, except that (tangentially) yours risks bloat in exchange for not having to use a real locking mechanism, and has a different syntax. The parts of index insertion that I stagger include an initial part that is more or less just an index scan. With this design you'll have to set up things so that all indexes can be directly accessed in the manner of ExecInsert() (get a list of them from the relcache, open them in an order that avoids possible deadlocks, etc). Why not just use ExecInsert()? I don't think I'm alone in seeing things that way. 
On a mostly unrelated note, I'll remind you of the reason that I felt it was best to lock indexes. It wasn't so much about avoiding bloat as it was about avoiding deadlocks. When I highlighted the issue, Heikki's prototype, which did insert optimistically rather than locking, was then made to go and physically "super delete" the upsert-insert conflicting heap tuple (inserted optimistically before its index tuples), before going to lock a row, in order to avoid unprincipled deadlocking. In contrast, my design just used a callback that released page level heavyweight locks before going to lock a row. Heikki's prototype involved making it so that *even someone else's dirty snapshot* didn't see our dead speculatively-inserted heap tuple. Anyway, making all that happen is fairly invasive to a bunch of places that are just as critical as the nbtree code. I'm not saying it can't be done, or even that it definitely shouldn't be, but taking an approach that produces bloat, rather than locking values the way other systems do (and, to a limited extent Postgres already does) is at least way more invasive than it first appears. Plus, I ask you to consider that. -- Peter Geoghegan
On Mon, Jul 28, 2014 at 10:43 AM, Peter Geoghegan <pg@heroku.com> wrote: > Plus, I ask you to > consider that. Excuse me. I meant "Plus, you avoid bloat. I ask you to consider that". -- Peter Geoghegan
On Mon, Jul 28, 2014 at 10:43 AM, Peter Geoghegan <pg@heroku.com> wrote: > On a mostly unrelated note, I'll remind you of the reason that I felt > it was best to lock indexes. It wasn't so much about avoiding bloat as > it was about avoiding deadlocks. When I highlighted the issue, > Heikki's prototype, which did insert optimistically rather than > locking, was then made to go and physically "super delete" the > upsert-insert conflicting heap tuple (inserted optimistically before > its index tuples), before going to lock a row, in order to avoid > unprincipled deadlocking. In contrast, my design just used a callback > that released page level heavyweight locks before going to lock a row. > Heikki's prototype involved making it so that *even someone else's > dirty snapshot* didn't see our dead speculatively-inserted heap tuple. > > Anyway, making all that happen is fairly invasive to a bunch of places > that are just as critical as the nbtree code. I think I should be more concrete about why this is more complicated than it first appears. Andres said at pgCon that he would still go with a design where "promise tuples" are inserted into indexes ahead of any heap tuple (which differs from Heikki's prototype, where heap tuple insertion occurs first). Accounting for deadlocking issues could be particularly problematic with that design, since we must kill tuples from each index in turn before row locking. In any case the need to efficiently *release* locks must be weighed carefully. It isn't obvious, but we must release locks if there is a conflict. After Heikki acknowledged the problem [1], he produced a revision addressing it. The details of the workaround and a patch were posted [2]. I think it's fair to say that this area is a lot messier than it first appears. If anyone wants to propose an alternative design, they are of course quite welcome to, but I ask that the very real difficulties with those designs be acknowledged. 
AFAICT, only Heikki has independently acknowledged these issues on list. In case it isn't obvious, let me be clear about what I care about here. I feel it's important to get something that is easy to reason about - you write a DML statement, and within certain fairly reasonable parameters Postgres does the rest. I think we should not accept something that may even occasionally throw deadlock errors, or unique violations, or RC-level serialization failures through no fault of the user. That would be inferior to the plpgsql looping pattern we promote that does the right thing and avoids all of this. It would be awful to have to tell users who hit problems like this that they should just stop doing so much upserting, or use the old pattern. [1] http://www.postgresql.org/message-id/52B4AAF0.5090806@vmware.com [2] http://www.postgresql.org/message-id/52D00D2D.6030307@vmware.com -- Peter Geoghegan
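For reference, the looping pattern alluded to above is essentially the example from the PostgreSQL documentation, sketched here for the "ints" table used upthread:

```sql
-- Retry loop: try UPDATE first; if it affects no rows, try INSERT;
-- if someone else inserted the key concurrently, the unique
-- violation is trapped and we loop back to the UPDATE.
CREATE FUNCTION upsert_ints(k int, v text) RETURNS void AS
$$
BEGIN
    LOOP
        UPDATE ints SET val = v WHERE key = k;
        IF found THEN
            RETURN;
        END IF;
        BEGIN
            INSERT INTO ints (key, val) VALUES (k, v);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- Do nothing; loop and try the UPDATE again.
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```

Note that the pattern only avoids errors; it can still loop more than once under contention, which is part of what a built-in UPSERT is meant to do better.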
On Mon, Jul 28, 2014 at 1:43 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Mon, Jul 28, 2014 at 8:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> AFAIUI, this is because your implementation uses lwlocks in a way that >> Andres and I both find unacceptable. > > That's not the case. My implementation uses page-level heavyweight > locks. The nbtree AM used to use them for other stuff. Plenty of other > systems use index level locks managed by a heavyweight lock manager. Oh, OK. I think I missed that development somewhere. >> Good point. Maybe the syntax should be something like: >> >> UPSERT table (keycol [, keycol] ...) { VALUES (val [, val] ...) [, >> ...] | select_query } >> >> That would address both the concern about being able to pipe multiple >> tuples through it and the point you just raised. We look for a row >> that matches each given tuple on the key columns; if one is found, we >> update it; if none is found, we insert. > > That basically is my design, except that (tangentially) yours risks > bloat in exchange for not having to use a real locking mechanism, and > has a different syntax. I think it would be advisable to separate the syntax from the implementation. Presumably you can make your implementation use some reasonable syntax we can all agree on, and conversely my proposed syntax could be made to have a different set of semantics. There's some connection between the syntax and semantics, of course, but it's not 100%. I mention this because I was mostly concerned with getting to a reasonable syntax proposal, not so much the implementation details. It may well be that your implementation details are perfect at this point; I don't know because I haven't looked, and I'm not an expert on that area of the code anyway. But I have looked at your syntax, which I wasn't altogether keen on. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jul 29, 2014 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think it would be advisable to separate the syntax from the > implementation. Presumably you can make your implementation use some > reasonable syntax we can all agree on, and conversely my proposed > syntax could be made to have a different set of semantics. There's > some connection between the syntax and semantics, of course, but it's > not 100%. I mention this because I was mostly concerned with getting > to a reasonable syntax proposal, not so much the implementation > details. It may well be that your implementation details are perfect > at this point; I don't know because I haven't looked, and I'm not an > expert on that area of the code anyway. But I have looked at your > syntax, which I wasn't altogether keen on. Fair enough. I think the syntax should reflect the fact that upserts are driven by inserts, though. Users will get into trouble with a syntax that allows a predicate that is evaluated before any rows are locked. -- Peter Geoghegan
On Tue, Jul 29, 2014 at 1:28 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Jul 29, 2014 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think it would be advisable to separate the syntax from the >> implementation. Presumably you can make your implementation use some >> reasonable syntax we can all agree on, and conversely my proposed >> syntax could be made to have a different set of semantics. There's >> some connection between the syntax and semantics, of course, but it's >> not 100%. I mention this because I was mostly concerned with getting >> to a reasonable syntax proposal, not so much the implementation >> details. It may well be that your implementation details are perfect >> at this point; I don't know because I haven't looked, and I'm not an >> expert on that area of the code anyway. But I have looked at your >> syntax, which I wasn't altogether keen on. > > Fair enough. I think the syntax should reflect the fact that upserts > are driven by inserts, though. Users will get into trouble with a > syntax that allows a predicate that is evaluated before any rows are > locked. You've convinced me that only indexable predicates can be sensibly used here. I'm not sure that's the same as saying that upserts are driven by inserts, though. I'd tend to interpret that to mean insert-the-heap-tuple-first, but I think what you're doing is check-the-indexes-first. The terminology in this area is tricky. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jul 30, 2014 at 10:36 AM, Robert Haas <robertmhaas@gmail.com> wrote: > You've convinced me that only indexable predicates can be sensibly > used here. I'm not sure that's the same as saying that upserts are > driven by inserts, though. I'd tend to interpret that to mean > insert-the-heap-tuple-first, but I think what you're doing is > check-the-indexes-first. The terminology in this area is tricky. I'm glad. Indeed, I am proposing checking/locking indexes first, but that is actually pretty similar to what Heikki did in his prototype (insert a heap tuple first), in that those heap tuples were conceptually speculative/optimistic "value locks". Heikki's need to "super delete" in the event of a conflict (by setting the speculative heap tuple's xmin to InvalidTransactionId) demonstrates that. Furthermore, Heikki's last revision ended up having places like _bt_checkunique() looking in the procarray to see if the xact to wait on was still around and had a "speculative token", so that we could wait on that to be released rather than the whole xact to commit/abort (avoiding unprincipled deadlocking hinges upon this): + /* + * If it's a speculative insertion, wait for it to finish (ie. + * to go ahead with the insertion, or kill the tuple). Otherwise + * wait for the transaction to finish as usual. + */ + if (speculativeToken) + SpeculativeInsertionWait(xwait, speculativeToken); + else + XactLockTableWait(xwait); So, as I said, what he ended up with was actually very close to what I wanted to do. I felt it was more logical to just use an index page lock, given the fact that they have been used before, and given the limited code footprint (although I do accept that in general altering the nbtree code is best avoided). But Heikki did ultimately understand the considerations that informed my design, and so ultimately IMV the disagreement there was more or less just on that one detail. 
(Not that the rest of the patch was perfect or anything, but it was useful to have my design *understood*). You can hopefully see why I'd characterize both of those designs as insertion driven. The terminology is very tricky here, though. It might even be the trickiest aspect to the whole thing, because at times I've had a hard time communicating what I mean effectively on list. Anyway, I'll be sure to emphasize the distinction in future. -- Peter Geoghegan
On Mon, Jul 28, 2014 at 11:37:07AM -0400, Robert Haas wrote: > > Yes, but what if you don't see a conflict because it isn't visible to > > your snapshot, and then you insert, and only then (step 5), presumably > > with a dirty snapshot, you find a conflict? How does the loop > > terminate if that brings you back to step 1 with the same MVCC > > snapshot feeding the update? > > Good point. Maybe the syntax should be something like: > > UPSERT table (keycol [, keycol] ...) { VALUES (val [, val] ...) [, > ...] | select_query } One idea would be to allow UPSERT with constants (single row), and use CTEs with a SELECT or INSERT/RETURNING for multi-row upserts. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
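A rough sketch of the CTE idea for the multi-row case (illustrative only - a plain wCTE like this has exactly the race conditions discussed upthread, since the INSERT's snapshot cannot see concurrent insertions):

```sql
-- Update rows that already exist, then insert the remainder.
WITH updated AS (
    UPDATE ints i SET val = s.val
    FROM staging s
    WHERE i.key = s.key
    RETURNING i.key
)
INSERT INTO ints (key, val)
SELECT s.key, s.val
FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM updated u WHERE u.key = s.key);
```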