Thread: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0
Attached is a cumulative patch set - version 2.0 of INSERT ... ON CONFLICT {UPDATE | IGNORE}. This revision does not offer a variant implementing approach #1 to value locking [1] (only approach #2), since maintaining both approaches in parallel has just about outlived its usefulness.

I'm calling this version 2.0 because it has RLS support. This is significant because AFAICT it's the last feature whose interactions with UPSERT needed to be considered. I've worked through a rather long list of existing interrelated features, implementing support in each case. I've had feedback from others on what behavior is appropriate when that wasn't obvious, and have made sure those areas had appropriate support. This now includes RLS, but past revisions added support for inheritance, updatable views, statement-level triggers, postgres_fdw, column-level privileges, partial indexes, exclusion constraints, and more. Basically, I think we're done, or very nearly done, discussing those aspects and the semantics/syntax in general. Certainly, support for these other interrelated features is quite comprehensive at this point. Now the core mechanism of the patch should be discussed in detail. The general structure and design is also worth discussing. After months and months of discussion, it now seems very likely that the semantics offered are the right ones. Since even before V1.0 was posted back in August, that's all that we've discussed, really (apart from the recent back and forth with Heikki on value locking bugs, of course).

I've approached RLS along the lines Stephen seemed to think would work best following extensive discussion [2], or at least I believe that I've produced RLS support that is what we informally agreed on. All security barrier quals are treated as WITH CHECK OPTIONs in the context of ON CONFLICT UPDATE. INSERTs don't have to deal with UPDATE-related policies/WITH CHECK OPTIONs, but when the update path is taken, both the INSERT- and UPDATE-related policies must pass. They must pass for the tuple that necessitated taking the UPDATE path (the locked tuple to be updated), and also for the finished tuple added back to the relation by ExecUpdate() (see the sketch below). There are 3 possible calls to ExecWithCheckOptions() in the context of INSERT ... ON CONFLICT UPDATE: the 2 that I just mentioned, which involve both UPDATE and INSERT WITH CHECK options, and also the existing ExecWithCheckOptions() call in ExecInsert(). RLS support is provided in a separate cumulative commit in the hope that this makes it easier to review by a subject matter expert. Documentation [3] and tests covering RLS are provided, of course.

I also include various bugfixes to approach #2 to value locking (these were all previously separately posted, but are now integrated into the main ON CONFLICT commit). Specifically, these are fixes for the bugs that emerged thanks to Jeff Janes' great work on stress testing [4]. With these fixes, I have been unable to reproduce any problem with this patch with the test suite, even after many days of running the script on a quad-core server, with constant concurrent VACUUM runs, etc. I think that we still need to think about the issues that transpired with exclusion constraints, but since I couldn't find another problem with an adapted version of Jeff's tool that tested exclusion constraints, I'm inclined to think that it should be possible to support exclusion constraints for the IGNORE variant. It would be great to have more input on stress testing from Jeff.

Thoughts?
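To make the policy interaction concrete, here is a minimal sketch (the table and policy names are hypothetical, and the parenthesized inference clause follows the patch's syntax):

CREATE TABLE accounts (id int PRIMARY KEY, balance int, owner text);
ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;
CREATE POLICY acc_ins ON accounts FOR INSERT WITH CHECK (balance >= 0);
CREATE POLICY acc_upd ON accounts FOR UPDATE USING (owner = current_user);

-- If the UPDATE path is taken, acc_ins's WITH CHECK and acc_upd's USING
-- qual (enforced like a WITH CHECK OPTION here) must both pass - for the
-- existing (locked) tuple, and for the finished updated tuple:
INSERT INTO accounts (id, balance, owner) VALUES (1, 100, current_user)
ON CONFLICT (id) UPDATE
SET balance = TARGET.balance + EXCLUDED.balance;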
[1] https://wiki.postgresql.org/wiki/Value_locking#.231._Heavyweight_page_locking_.28Peter_Geoghegan.29
[2] http://www.postgresql.org/message-id/20150109214041.GK3062@tamriel.snowman.net
[3] http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/sql-createpolicy.html
[4] https://github.com/petergeoghegan/jjanes_upsert

--
Peter Geoghegan
Attachment
- 0008-User-visible-documentation-for-INSERT-.-ON-CONFLICT-.patch
- 0007-Internal-documentation-for-INSERT-.-ON-CONFLICT-UPDA.patch
- 0006-Tests-for-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patch
- 0005-RLS-support-for-ON-CONFLICT-UPDATE.patch
- 0004-Project-updates-from-ON-CONFLICT-UPDATE-RETURNING.patch
- 0003-EXCLUDED-expressions-within-ON-CONFLICT-UPDATE.patch
- 0002-Support-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patch
- 0001-Make-UPDATE-privileges-distinct-from-INSERT-privileg.patch
On Sat, Jan 10, 2015 at 8:32 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I also include various bugfixes to approach #2 to value locking (these
> were all previously separately posted, but are now integrated into the
> main ON CONFLICT commit). Specifically, these are fixes for the bugs
> that emerged thanks to Jeff Janes' great work on stress testing [4].
> With these fixes, I have been unable to reproduce any problem with
> this patch with the test suite, even after many days of running the
> script on a quad-core server, with constant concurrent VACUUM runs,
> etc.

I continued with this since posting V2.0. I've run this bash script, which invokes Jeff's script at various client counts, with runs of various durations (each client does a fixed amount of work):

https://github.com/petergeoghegan/jjanes_upsert/blob/master/run_test.sh

As previously discussed, Jeff's script comprehensively verifies the correctness of the final values of a few thousand rows within a table after many concurrent upserts, within and across upserting sessions, and with concurrent deletions, too. When building Postgres for this stress test, I included Jeff's modifications that artificially increase the XID burn rate (I chose a burn rate of x50). This makes anti-wraparound VACUUMs much more frequent. I'm also looking out for outlier query execution durations, because in theory they could indicate an unknown lock starvation problem. I haven't seen any notable outliers after over a week of testing.

> I think that we still need to think about the issues that
> transpired with exclusion constraints, but since I couldn't find
> another problem with an adapted version of Jeff's tool that tested
> exclusion constraints, I'm inclined to think that it should be
> possible to support exclusion constraints for the IGNORE variant.

Exclusion constraints were my focus with stress testing today. I performed equivalent verification of upserts using exclusion constraints (this is a hack; exclusion constraints are only intended to be used with the IGNORE variant, but this way I get better test coverage than I otherwise would). Unfortunately, even with the recent bugfixes, there are still problems. On this server (rather than my laptop), with 8 clients I can see errors like this before too long (note that this output includes custom instrumentation from Jeff):

"""""""
6670 2015-01-17 18:02:54 PST LOG: JJ scan_all 1, relfrozenid -813636509
6670 2015-01-17 18:02:54 PST LOG: JJ freezeLimit -661025537
6670 2015-01-17 18:02:54 PST LOG: JJ freeze_min_age 50000000 vacuum_freeze_table_age 150000000 freeze_table_age 150000000 ReadNew -611025384
6670 2015-01-17 18:02:54 PST LOG: JJ scan_all 1, relfrozenid -813636101
6670 2015-01-17 18:02:54 PST LOG: JJ transaction ID wrap limit is 1352632427, limited by database with OID 12746
6670 2015-01-17 18:02:54 PST LOG: autovacuum: done processing database "postgres" at recent Xid of 3683945176 recent mxid of 1
6668 2015-01-17 18:02:54 PST ERROR: conflicting key value violates exclusion constraint "upsert_race_test_index_excl"
6668 2015-01-17 18:02:54 PST DETAIL: Key (index)=(7142) conflicts with existing key (index)=(600).
6668 2015-01-17 18:02:54 PST STATEMENT: insert into upsert_race_test (index, count) values ('7142','1') on conflict update set count=TARGET.count + EXCLUDED.count where TARGET.index = EXCLUDED.index returning upsert_race_test.count
"""""""

It's always an exclusion violation that I see here.
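For reference, the shape of the test table is roughly as follows (a sketch reconstructed from the names in the log above; it assumes btree_gist, with the exclusion constraint standing in for a unique constraint so that upserts can be exercised):

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE upsert_race_test (index integer, count integer);
ALTER TABLE upsert_race_test
  ADD CONSTRAINT upsert_race_test_index_excl
  EXCLUDE USING gist (index WITH =);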
As you can see, the query involved has no "unique index inference" specification, per the hack to make this work with exclusion constraints (the artificially much greater XID burn rate might have also increased the likelihood of this error dramatically). You'll also note that the DETAIL message seems to indicate that this btree_gist-based exclusion constraint doesn't behave like a unique constraint at all, because the conflicting new value (7142) is not at all the same as the existing value (600). But that's wrong - it's supposed to be B-Tree-like.

In short, there are further race conditions with exclusion constraints. I think that the fundamental, unfixable race condition here is the disconnect between index tuple insertion and checking for would-be exclusion violations that exclusion constraints naturally have here, that unique indexes naturally don't have [1] (note that I'm talking only about approach #2 to value locking here; approach #1 isn't in V2.0). I suspect that the feature is not technically feasible to make work correctly with exclusion constraints, end of story. VACUUM interlocking is probably also involved here, but the unfixable race condition seems like our fundamental problem.

We could possibly spend a lot of time discussing whether or not I'm right about it being inherently impossible to make INSERT ... ON CONFLICT IGNORE work with exclusion constraints. However, I strongly suggest that we cut scope and at least leave them out of any version that can be committed for 9.5, and instead work on other areas, because it is at least now clear that they are much harder to get right than unique constraints. Besides, making exclusion constraints work with INSERT ... ON CONFLICT IGNORE is nice, but ultimately not all that important. For that matter, I think that INSERT ... ON CONFLICT IGNORE is generally not all that important compared to ON CONFLICT UPDATE. I'd cut scope by cutting ON CONFLICT IGNORE if that was the consensus; we could add ON CONFLICT IGNORE back in 9.6, once we have a better sense of exclusion constraints here. Exclusion constraints can never be useful with ON CONFLICT UPDATE anyway.

Please work with me towards a committable patch. I think we have every chance of committing this for 9.5, with value locking approach #2, provided we now cut scope a bit. As I mention above, V2.0 has stood up to more than a week of aggressive, comprehensive stress testing/custom correctness verification on an 8 core box (plus numerous other stress tests in months past). UPSERT (which never involved exclusion constraints) is a very comprehensive and mature effort, and I think it now needs one big push from a senior community member. I feel that I cannot do anything more without that input.

[1] http://www.postgresql.org/message-id/54A7C76D.3070101@vmware.com

--
Peter Geoghegan
On Sat, Jan 17, 2015 at 6:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I continued with this since posting V2.0.

Attached version (V2.1) fixes bit-rot caused by the recent changes by Stephen ("Fix column-privilege leak in error-message paths"). More precisely, it is rebased on top of today's 17792b commit.

I have not addressed the recently described problems with exclusion constraints. I hope we can do so shortly. Simply removing IGNORE support until such time as we straighten that all out (9.6?) seems like the simplest solution. No need to block the progress of "UPSERT", since exclusion constraint support was only ever going to be useful for the less compelling IGNORE variant. What do other people think? Do you agree with my view that we should shelve IGNORE support for now, Heikki?

There is one minor bugfix here: I have tightened up the conditions under which user-defined rule application will be rejected. Previously, I neglected to specifically check for UPDATE rules when an INSERT ... ON CONFLICT UPDATE statement was considered. That's been fixed.

On the stress-testing front, I'm still running Jeff Janes' tool [1], while also continuing to use his Postgres modifications to artificially increase the XID burn rate. However, my personal server is no longer used for this task. I'm using an AWS EC2 instance - an r3.8xlarge. This server provides 32 logical cores, and uses an "Intel Xeon E5-2670 v2 @ 2.50GHz" CPU. It seems reasonable to suppose that any latent concurrency bugs are more likely to reveal themselves when using the new server. Anyone who would like access to the server should contact me privately. It's a throw-away EC2 instance, so this isn't particularly difficult to arrange.

Thanks

[1] https://github.com/petergeoghegan/jjanes_upsert

--
Peter Geoghegan
Attachment
- 0008-User-visible-documentation-for-INSERT-.-ON-CONFLICT-.patch
- 0007-Internal-documentation-for-INSERT-.-ON-CONFLICT-UPDA.patch
- 0006-Tests-for-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patch
- 0005-RLS-support-for-ON-CONFLICT-UPDATE.patch
- 0004-Project-updates-from-ON-CONFLICT-UPDATE-RETURNING.patch
- 0003-EXCLUDED-expressions-within-ON-CONFLICT-UPDATE.patch
- 0002-Support-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patch
- 0001-Make-UPDATE-privileges-distinct-from-INSERT-privileg.patch
On Thu, Jan 29, 2015 at 11:38 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Simply removing IGNORE support until such time as we straighten
> that all out (9.6?) seems like the simplest solution. No need to block
> the progress of "UPSERT", since exclusion constraint support was
> only ever going to be useful for the less compelling IGNORE variant.
> What do other people think? Do you agree with my view that we should
> shelve IGNORE support for now, Heikki?

I appreciate the work you're doing and (as a user rather than a pg-hacker) don't want to butt in, but if it were possible to allow support for IGNORE for unique but not exclusion constraints, that would be really helpful for my own use cases: being able to insert from a dataset into a table containing unique constraints, without having to first check the dataset for uniqueness (within both itself and the target table), is very useful.

It's possible that I've misunderstood anyway, and that you mean purely that exclusion constraint IGNORE should be shelved until 9.6, in which case I apologise. Of course, if there's no way to allow unique constraint IGNORE without exclusion constraints, then just ignore me; I (along, I'm sure, with all the others who are following this conversation from afar) will be incredibly grateful for the work you've done either way.

I suppose there's no reason why we couldn't use a no-op ON CONFLICT UPDATE anyway, but that does seem rather messy and (I imagine) would involve rather more work (unless the optimizer were able to optimize away the "update"? I don't know enough to be able to say whether it would).

Thanks

Geoff
On Fri, Jan 30, 2015 at 6:59 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> I appreciate the work you're doing and (as a user rather than a
> pg-hacker) don't want to butt in, but if it were possible to allow
> support for IGNORE for unique but not exclusion constraints, that
> would be really helpful for my own use cases: being able to insert
> from a dataset into a table containing unique constraints, without
> having to first check the dataset for uniqueness (within both itself
> and the target table), is very useful.
>
> It's possible that I've misunderstood anyway, and that you mean purely
> that exclusion constraint IGNORE should be shelved until 9.6, in which
> case I apologise.

Well, the issue is that we can't really add exclusion constraints to the IGNORE case later. So the fact that we cannot do exclusion constraints implies that we can either shelve IGNORE and maybe look at it later, or accept that we'll never support exclusion constraints with IGNORE - we'd then include IGNORE, without exclusion constraint support, now and forever. I tend to think that we'll end up doing the latter anyway, but I really don't want to add additional risk of this not getting into 9.5 by arguing about that now. It doesn't matter that much.

> I suppose there's no reason why we couldn't use a no-op ON CONFLICT
> UPDATE anyway

Right. IGNORE isn't really all that compelling for that reason. Note that this will still lock the unmodified row, though.

--
Peter Geoghegan
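For concreteness, the no-op arrangement being discussed would look something like this (a sketch; the table is hypothetical, and the inference clause follows the patch's syntax):

-- IGNORE-like behavior via a no-op update; on conflict, the existing
-- row is still locked (and rewritten) even though its values don't change:
INSERT INTO tab (key, val) VALUES (1, 'x')
ON CONFLICT (key) UPDATE SET val = TARGET.val;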
Hi,

A first (not actually that quick :() look through the patches to see what actually happened in the last months. I didn't keep up with the thread.

Generally the split into the individual commits doesn't seem to make much sense to me. The commits individually (except the first) aren't individually committable and aren't even meaningful. Splitting off the internal docs, tests and such actually just seems to make reviewing harder because you miss context. Splitting it so that individual pieces are committable and reviewable makes sense, but... I have no problem doing the user docs later. If you split off RLS support, you need to throw an error before it's implemented.

0001:
* References INSERT with ON CONFLICT UPDATE, can thus not be committed
  independently. I don't think those references really are needed.
* I'm not a fan of the increased code duplication in
  ExecCheckRTEPerms(). Can you look into cleaning that up?
* Also the comments inside the ACL_INSERT case still reference UPDATE.

Other than that I think we can just go ahead and commit this ahead of time, mentioning ON CONFLICT UPDATE (OCU henceforth) in the commit message only.

0007:
* "AMs" alone isn't particularly unique.
* Without the context of the discussion "unprincipled deadlocks" aren't
  well defined.
* Too many "" words.
* Waiting "too long" isn't defined. Neither is why that'd imply
  unprincipled deadlocks. Somewhat understandable with the context of
  the discussion, but surely not a couple of years down the road.
* As is, I don't find the README entry super helpful. It should state
  the reason for doing this clearly, start at the higher level, and
  then move to the details.
* Misses details about the speculative heavyweight locking of tuples.

0002:
* Tentatively I'd say that killspeculative should be done via a separate
  function instead of heap_delete()
* I think we should, as you ponder in a comment, do the OCU specific
  stuff lazily and/or in a separate function from BuildIndexInfo(). That
  function is already quite visible in profiles, and the additional work
  isn't entirely trivial.
* I doubt logical decoding works with the patch as it stands.
* The added ereport (i.e. user facing) error message in
  ExecInsertIndexTuples won't help a user at all.
* Personally I don't care one iota for comments like "Get information
  from the result relation info structure.". Yes, one of these already
  exists, but ...
* If an arbiter index is passed to ExecCheckIndexConstraints(), can't we
  abort the loop after checking it? Also, do we really have to iterate
  over indexes for that case? How about moving the loop contents to a
  separate function and using that separately for the arbiter cases?
* Don't like the comment above check_exclusion_or_unique_constraint()
  much. Makes too much of a special case of OCU.
* ItemPointerIsValid
* ExecCheckHeapTupleVisible's comment "It is not acceptable to proceed "
  sounds like you're talking with a child or so ;)
* ExecCheckHeapTupleVisible()'s errhint() sounds like an
  argument/excuse (actually like a code comment). That's not going to
  help a user at all.
* I find the modified control flow in ExecInsert() pretty darn ugly. I
  think this needs to be cleaned up. The speculative case should imo be
  a separate function or something.
* /*
   * This may occur when an instantaneously invisible tuple is blamed
   * as a conflict because multiple rows are inserted with the same
   * constrained values.
  How can this happen? We don't insert multiple rows with the same
  command id?
* ExecLockUpdatedTuple() has (too?) many comments, but little overview
  of what/why it is doing what it does on a higher level.
* plan_speculative_use_index: "Use the planner to decide speculative
  insertion arbiter index" - Huh? "rel is the target to undergo ON
  CONFLICT UPDATE/IGNORE." - Which rel?
* Formulations such as "fundamental nexus" are hard to understand imo.
* Perhaps it has previously been discussed, but I'm not convinced by the
  reasoning for not looking at opclasses in infer_unique_index(). This
  seems like it'd prohibit ever having e.g. case insensitive opclasses -
  something surely worthwhile.
* Doesn't infer_unique_index() have to look for indisvalid? This isn't
  going to work well with an invalid (not to speak of a !ready) index.
* Is ->relation in the UpdateStmt generated in transformInsertStmt ever
  used for anything? If so, it'd possibly generate some nastiness due
  to repeated name lookups. Looks like it'll be used in
  transformUpdateStmt.
* What's the reason for the !pstate->p_parent? Also why the parens?
  pstate->p_is_speculative =
      (pstate->parentParseState &&
       (!pstate->p_parent_cte &&
        pstate->parentParseState->p_is_insert &&
        pstate->parentParseState->p_is_speculative));
* Why did you need to make DISTINCT and ON %nonassoc in the grammar?
* The whole speculative insert logic isn't really well documented. Why,
  for example, do we actually need the token? And why are there no
  issues with overflow? And where is it documented that a 0 means
  there's no token? ...
* Isn't "SpecType" an awfully generic (and nondescriptive) name?
* /* XXX: Make sure that re-use of bits is safe here */ - no, not
  unless you change existing checks.
* /*
   * Immediately VACUUM "super-deleted" tuples
   */
  if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple),
                          InvalidTransactionId))
      return HEAPTUPLE_DEAD;
  Is that branch really needed? Shouldn't it just be happening as a
  consequence of the already existing code? Same in SatisfiesMVCC. If
  you actually needed that block, it'd need to be done in SatisfiesSelf
  as well, no? You have a comment about a possible loop - but that seems
  wrong to me, implying that HEAP_XMIN_COMMITTED was set invalidly.

Ok, I can't focus at all any further at this point. But there's enough comments here that some even might make sense ;)

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 30 January 2015 at 21:58, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Jan 30, 2015 at 6:59 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
>> I suppose there's no reason why we couldn't use a no-op ON CONFLICT
>> UPDATE anyway
>
> Right. IGNORE isn't really all that compelling for that reason. Note
> that this will still lock the unmodified row, though.

Mmmf. So I would have to make sure that my source tuples were unique before doing the INSERT (otherwise the first ON CONFLICT UPDATE for a tuple would block any other)? That's potentially very slow :(

When you say that you can't add exclusion constraints later, do you mean from a coding point of view, or just because people would get confused about whether exclusion constraints could be IGNOREd or not?

Geoff
On 2 February 2015 at 14:32, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> Mmmf. So I would have to make sure that my source tuples were unique
> before doing the INSERT (otherwise the first ON CONFLICT UPDATE for a
> tuple would block any other)? That's potentially very slow :(

Replying to my own message, because it occurs to me I might be being stupid (surely not :) ).

When you say "this will still lock the unmodified row", did you mean just that it's locked to _other_ processes until commit? That would be much less impactful.

Geoff
On 01/18/2015 04:48 AM, Peter Geoghegan wrote:
> I think that the fundamental, unfixable race condition here is the
> disconnect between index tuple insertion and checking for would-be
> exclusion violations that exclusion constraints naturally have here,
> that unique indexes naturally don't have [1] (note that I'm talking
> only about approach #2 to value locking here; approach #1 isn't in
> V2.0). I suspect that the feature is not technically feasible to make
> work correctly with exclusion constraints, end of story. VACUUM
> interlocking is probably also involved here, but the unfixable race
> condition seems like our fundamental problem.

It's not a fundamental, unfixable race condition. In [1], I gave you three ideas straight off the top of my head on how that could be fixed.

> Please work with me towards a committable patch.

I'm trying...

> [1] http://www.postgresql.org/message-id/54A7C76D.3070101@vmware.com

- Heikki
On 01/30/2015 01:38 AM, Peter Geoghegan wrote:
> I have not addressed the recently described problems with exclusion
> constraints. I hope we can do so shortly. Simply removing IGNORE
> support until such time as we straighten that all out (9.6?) seems
> like the simplest solution. No need to block the progress of "UPSERT",
> since exclusion constraint support was only ever going to be useful
> for the less compelling IGNORE variant. What do other people think? Do
> you agree with my view that we should shelve IGNORE support for now,
> Heikki?

No, I don't agree. Let's fix it.

- Heikki
On Mon, Feb 02, 2015 at 4:48 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> I think that the fundamental, unfixable race condition here is the
>> disconnect between index tuple insertion and checking for would-be
>> exclusion violations that exclusion constraints naturally have here,
>> that unique indexes naturally don't have [1] (note that I'm talking
>> only about approach #2 to value locking here; approach #1 isn't in
>> V2.0). I suspect that the feature is not technically feasible to make
>> work correctly with exclusion constraints, end of story. VACUUM
>> interlocking is probably also involved here, but the unfixable race
>> condition seems like our fundamental problem.
>
> It's not a fundamental, unfixable race condition. In [1], I gave you
> three ideas straight off the top of my head on how that could be fixed.

That was different - there, I tried to make it work by fixing some bugs. However, I'm now finding myself up against these new bugs. I think that the underlying cause is the lack of any real locking (unlike with the B-Tree AM) in *both* cases, but I don't even know that for sure. The error messages you see are quite odd - why should a btree_gist-based exclusion constraint cause a violation when non-conflicting values are inserted? There is some other race condition here. This wasn't a livelock (or a deadlock), which is what your comments in early January applied to. I think that this has something to do with VACUUM interlocking. But with the B-Tree AM (which we're handling differently, by re-using infrastructure used for deferred unique constraints), things work quite well. The patch stands up to Jeff's vigorous stress-tests.

I'm not fundamentally in disagreement with you about any of this. All I'm saying is that we should cut scope today. We're not precluding picking up an IGNORE feature that does support exclusion constraints in the future. Why should we insist upon having that in the first cut? It's both significantly harder and significantly less useful to users, so cutting it makes perfect sense AFAICT. As I've said many times, exclusion constraint support was only ever going to be useful to the IGNORE variant (I've tested exclusion constraints by contriving a case to make them do UPSERTs, but this is only for the benefit of the stress-test).

Thanks

--
Peter Geoghegan
On 01/30/2015 01:38 AM, Peter Geoghegan wrote:
> On the stress-testing front, I'm still running Jeff Janes' tool [1],
> while also continuing to use his Postgres modifications to
> artificially increase the XID burn rate.
>
> [1] https://github.com/petergeoghegan/jjanes_upsert

I followed the instructions in README.md to reproduce this. I downloaded the tool, applied the upsert patchset, applied the hack to parse_clause.c as instructed in the README.md file, installed btree_gist, and ran count_upsert_exclusion.pl. It failed immediately with an assertion failure:

TRAP: FailedAssertion("!(node->spec != SPEC_INSERT || node->arbiterIndex != ((Oid) 0))", File: "nodeModifyTable.c", Line: 1619)

Is that just because of the hack in parse_clause.c?

With assertions disabled, count_upsert_exclusion.pl ran successfully to the end. I also tried running "VACUUM FREEZE upsert_race_test" in a loop in another session at the same time, but it didn't make a difference. How quickly do you see the errors?

I also tried applying crash_REL9_5.patch from the jjanes_upsert kit, and set jj_xid=10000 to increase the XID burn rate, but I'm still not seeing any errors.

- Heikki
On Tue, Feb 3, 2015 at 2:05 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> TRAP: FailedAssertion("!(node->spec != SPEC_INSERT || node->arbiterIndex !=
> ((Oid) 0))", File: "nodeModifyTable.c", Line: 1619)
>
> Is that just because of the hack in parse_clause.c?

Yes. I never built with assertions and so didn't see this, but it doesn't matter.

> With assertions disabled, count_upsert_exclusion.pl ran successfully to the
> end. I also tried running "VACUUM FREEZE upsert_race_test" in a loop in
> another session at the same time, but it didn't make a difference. How
> quickly do you see the errors?
>
> I also tried applying crash_REL9_5.patch from the jjanes_upsert kit, and set
> jj_xid=10000 to increase the XID burn rate, but I'm still not seeing any errors.

Did you build fully optimized, assertion-free code? I've been doing that. I found it necessary to recreate some of the bugs Jeff's tool caught. I also think that I might have needed an 8 core box to see it, but I'm less sure about that.

--
Peter Geoghegan
On 02/03/2015 08:17 PM, Peter Geoghegan wrote:
> On Tue, Feb 3, 2015 at 2:05 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> TRAP: FailedAssertion("!(node->spec != SPEC_INSERT || node->arbiterIndex !=
>> ((Oid) 0))", File: "nodeModifyTable.c", Line: 1619)
>>
>> Is that just because of the hack in parse_clause.c?
>
> Yes. I never built with assertions and so didn't see this, but it
> doesn't matter.
>
>> With assertions disabled, count_upsert_exclusion.pl ran successfully to the
>> end. I also tried running "VACUUM FREEZE upsert_race_test" in a loop in
>> another session at the same time, but it didn't make a difference. How
>> quickly do you see the errors?
>>
>> I also tried applying crash_REL9_5.patch from the jjanes_upsert kit, and set
>> jj_xid=10000 to increase the XID burn rate, but I'm still not seeing any errors.
>
> Did you build fully optimized, assertion-free code? I've been doing
> that. I found it necessary to recreate some of the bugs Jeff's tool
> caught. I also think that I might have needed an 8 core box to see it,
> but I'm less sure about that.

I had compiled with -O0, but without assertions. I tried now again with -O3. It's been running for about 10 minutes now, and I haven't seen any errors.

Since you can reproduce this, it would be good if you could debug it. The error message where the alleged duplicate key actually had a different value is a bit scary. It makes me wonder whether it might be a bug with exclusion constraints in general, or just with the patch.

- Heikki
On Wed, Feb 4, 2015 at 9:54 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> I had compiled with -O0, but without assertions. I tried now again with -O3.
> It's been running for about 10 minutes now, and I haven't seen any errors.

Did you run with an artificially high XID burn rate (i.e. did you also apply Jeff's modifications to Postgres, and specify a high burn rate using his custom GUC)? Maybe that was important.

--
Peter Geoghegan
On Tue, Feb 2, 2015 at 01:06 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> A first (not actually that quick :() look through the patches to see
> what actually happened in the last months. I didn't keep up with the
> thread.

So, let me get this out of the way: This is the first in-depth technical review that this work has had in a long time. Thank you for your help here.

> Generally the split into the individual commits doesn't seem to make
> much sense to me. The commits individually (except the first) aren't
> individually committable and aren't even meaningful. Splitting off the
> internal docs, tests and such actually just seems to make reviewing
> harder because you miss context. Splitting it so that individual pieces
> are committable and reviewable makes sense, but... I have no problem
> doing the user docs later. If you split off RLS support, you need to
> throw an error before it's implemented.

I mostly agree. Basically, I did not intend for all of the patches to be individually committed. The mechanism by which EXCLUDED.* expressions are added is somewhat novel, and deserves to be independently *considered*. I'm trying to show how the parts fit together more so than breaking things down into smaller commits (as you picked up on, 0001 is the exception - that is genuinely intended to be committed early). Also, those commit messages give me the opportunity to put those parts in their appropriate context vis-a-vis our discussions. They refer to the Wiki, for example, or reasons why pg_stat_statements shouldn't care about ExcludedExpr. Obviously the final commit messages won't look that way.

> 0001:
> * References INSERT with ON CONFLICT UPDATE, can thus not be committed
>   independently. I don't think those references really are needed.
> * I'm not a fan of the increased code duplication in
>   ExecCheckRTEPerms(). Can you look into cleaning that up?
> * Also the comments inside the ACL_INSERT case still reference UPDATE.
>
> Other than that I think we can just go ahead and commit this ahead of
> time, mentioning ON CONFLICT UPDATE (OCU henceforth) in the commit
> message only.

Cool. Attached revision makes those changes.

> 0007:
> * "AMs" alone isn't particularly unique.
> * Without the context of the discussion "unprincipled deadlocks" aren't
>   well defined.
> * Too many "" words.
> * Waiting "too long" isn't defined. Neither is why that'd imply
>   unprincipled deadlocks. Somewhat understandable with the context of
>   the discussion, but surely not a couple of years down the road.
> * As is, I don't find the README entry super helpful. It should state
>   the reason for doing this clearly, start at the higher level, and
>   then move to the details.
> * Misses details about the speculative heavyweight locking of tuples.

Fair points. I'll work through that feedback. Actually, I think we should memorialize that "unprincipled deadlocks" should be avoided in some more general way, since they are after all a general problem that we've seen elsewhere. I'm not sure about how to go about doing that, though.

> 0002:
> * Tentatively I'd say that killspeculative should be done via a separate
>   function instead of heap_delete()

Really? I guess if that were to happen, it would entail refactoring heap_delete() to call a static function, which was also called by a new kill_speculative() function that does this. Otherwise, you'd have far too much duplication.

> * I think we should, as you ponder in a comment, do the OCU specific
>   stuff lazily and/or in a separate function from BuildIndexInfo(). That
>   function is already quite visible in profiles, and the additional work
>   isn't entirely trivial.

Okay.

> * I doubt logical decoding works with the patch as it stands.

I thought so. Perhaps you could suggest a better use of the available XLOG_HEAP_* bits. I knew I needed to consider that more carefully (hence the XXX comment), but didn't get around to it.

> * The added ereport (i.e. user facing) error message in
>   ExecInsertIndexTuples won't help a user at all.

So, this:

>> /* Skip this index-update if the predicate isn't satisfied */
>> if (!ExecQual(predicate, econtext, false))
>> + {
>> +     if (arbiterIdx == indexRelation->rd_index->indexrelid)
>> +         ereport(ERROR,
>> +                 (errcode(ERRCODE_TRIGGERED_ACTION_EXCEPTION),
>> +                  errmsg("partial arbiter unique index has predicate that does not cover tuple proposed for insertion"),
>> +                  errdetail("ON CONFLICT inference clause implies that the tuple proposed for insertion actually be covered by partial predicate for index \"%s\".",
>> +                            RelationGetRelationName(indexRelation)),
>> +                  errhint("ON CONFLICT inference clause must infer a unique index that covers the final tuple, after BEFORE ROW INSERT triggers fire."),
>> +                  errtableconstraint(heapRelation,
>> +                                     RelationGetRelationName(indexRelation))));
>>     continue;
>> + }

Yeah, that isn't a great error message. This happens here because you are using a partial unique index (and so you must have had an inference specification with a "WHERE" to get here). However, what you actually went on to insert would not be covered by this partial unique index, and so could never take the alternative path - so the statement is likely not well thought out. Maybe it would be better to silently always let the INSERT succeed as an INSERT. *That* actually wasn't really discussed - this is all my idea.

> * Personally I don't care one iota for comments like "Get information
>   from the result relation info structure.". Yes, one of these already
>   exists, but ...

Okay.

> * If an arbiter index is passed to ExecCheckIndexConstraints(), can't we
>   abort the loop after checking it? Also, do we really have to iterate
>   over indexes for that case? How about moving the loop contents to a
>   separate function and using that separately for the arbiter cases?

Well, the failure to do that implies very few extra cycles, but sure. I'll add a new reason to break at the end, when check_exclusion_or_unique_constraint() is called in respect of a particular (inferred) arbiter unique index.

> * Don't like the comment above check_exclusion_or_unique_constraint()
>   much. Makes too much of a special case of OCU.

I guess I should just refer to speculative insertion.

> * ItemPointerIsValid

What about it?

> * ExecCheckHeapTupleVisible's comment "It is not acceptable to proceed "
>   sounds like you're talking with a child or so ;)

Fair point. I should say "It would not be consistent with the guarantees of the higher isolation levels...".

> * ExecCheckHeapTupleVisible()'s errhint() sounds like an
>   argument/excuse (actually like a code comment). That's not going to
>   help a user at all.

Really? I thought it might be less than intuitive that higher isolation levels cannot decide to do nothing on the basis of something not in their MVCC snapshot. But come to think of it, yeah, that errhint() isn't adding much over the main error message.

> * I find the modified control flow in ExecInsert() pretty darn ugly. I
>   think this needs to be cleaned up. The speculative case should imo be
>   a separate function or something.

Hmm. I'm not quite sold on that. Basically, if we did that, we'd still have a function that was more or less a strict superset of ExecInsert(). What have we saved? What I do agree with is that ExecInsert() should be refactored to make the common case (a vanilla insert) look like the common case, whereas the uncommon case (an upsert) should be dealt with specially. There is room for improvement. Is that a fair compromise?

> * /*
>    * This may occur when an instantaneously invisible tuple is blamed
>    * as a conflict because multiple rows are inserted with the same
>    * constrained values.
>   How can this happen? We don't insert multiple rows with the same
>   command id?

This is a cardinality violation [1]. It can definitely happen - just try the examples you see on the Wiki. This is possible because I modified heap_lock_tuple() to return HeapTupleInvisible (and not just complain directly when HeapTupleSatisfiesUpdate() returns "HeapTupleInvisible"). It's also possible because we're using a DirtySnapshot at various points. This is sort of like how ExecUpdate() handles a return value of "HeapTupleSelfUpdated" from heap_update(). Not quite, though, because 1) I prefer to throw an error (rather than silently not UPDATE that slot), and 2) we're not dealing with MVCC semantics, so the return values are different in the two cases. The *nature* of the situation handled is similar between UPSERTs (in ExecLockUpdatedTuple()) and vanilla UPDATEs (in ExecUpdate()), though. Does that make sense? (There is a concrete sketch below.)

> * ExecLockUpdatedTuple() has (too?) many comments, but little overview
>   of what/why it is doing what it does on a higher level.

Fair point. Seems like material for a better worked out executor README.

> * plan_speculative_use_index: "Use the planner to decide speculative
>   insertion arbiter index" - Huh? "rel is the target to undergo ON
>   CONFLICT UPDATE/IGNORE." - Which rel?

Sorry, that's an obsolete comment (the function signature changed). It should refer to the target of the Query being planned.

> * Formulations such as "fundamental nexus" are hard to understand imo.

I'm trying to suggest that INSERT ... ON CONFLICT UPDATE is not quite two separate top-level commands, and yet is also not a new, distinct type of top-level command. This is mostly a high level design decision that maximizes code reuse.

> * Perhaps it has previously been discussed, but I'm not convinced by the
>   reasoning for not looking at opclasses in infer_unique_index(). This
>   seems like it'd prohibit ever having e.g. case insensitive opclasses -
>   something surely worthwhile.

I don't think anyone gave that idea the thumbs-up. However, I really don't see the problem. Sure, we could have case insensitive opclasses in the future, and you may want to make a unique index using one. But then it becomes a matter of whatever unique indexes are available. The limitation is only that you cannot explicitly indicate that you want a certain opclass. It comes down to whatever unique indexes happen to be available, since of course taking the alternative path is arbitrated by a would-be unique violation. It's a bit odd that we're leaving it up to the available indexes to decide on semantics like that, but the problem is so narrow and the solution so involved that I'd argue it's acceptable.

> * Doesn't infer_unique_index() have to look for indisvalid? This isn't
>   going to work well with an invalid (not to speak of a !ready) index.

It does (check IndexIsValid()). I think the mistake I made here was not checking IndexIsReady(), since that is an additional concern above what the similar get_relation_info() function must consider.

> * Is ->relation in the UpdateStmt generated in transformInsertStmt ever
>   used for anything? If so, it'd possibly generate some nastiness due
>   to repeated name lookups. Looks like it'll be used in
>   transformUpdateStmt.

What, you mean security issues, for example? I have a hard time seeing how that could work in practice, given that the one and only target RTE is marked with the appropriate updatedCols originating from transformUpdateStmt(). Still, it is a concern generally - better safe than sorry. I was thinking of plugging it by ensuring that the Relations matched, but that might not be good enough. Maybe it would be better to bite the bullet and have transformUpdateStmt() use the same Relation directly, which is something I hoped to avoid (better to have transformUpdateStmt() know as little as possible about this, I'd say).

> * What's the reason for the !pstate->p_parent? Also why the parens?
>   pstate->p_is_speculative =
>       (pstate->parentParseState &&
>        (!pstate->p_parent_cte &&
>         pstate->parentParseState->p_is_insert &&
>         pstate->parentParseState->p_is_speculative));

You mean the "!pstate->p_parent_cte"? That's there because you can get queries to segfault if this logic doesn't consider that a data-modifying CTE can have an UPDATE that appears within a CTE referenced from an INSERT. :-)

> * Why did you need to make DISTINCT and ON %nonassoc in the grammar?

To prevent a shift/reduce conflict, I changed the associativity. Without this, here are the details of State 700, which has the conflict (from gram.output):

"""""
State 700

 1465 opt_distinct: DISTINCT .
 1466             | DISTINCT . ON '(' expr_list ')'

    ON  shift, and go to state 1094

    ON        [reduce using rule 1465 (opt_distinct)]
    $default  reduce using rule 1465 (opt_distinct)
"""""

> * The whole speculative insert logic isn't really well documented. Why,
>   for example, do we actually need the token? And why are there no
>   issues with overflow? And where is it documented that a 0 means
>   there's no token? ...

Fair enough. Presumably it's okay that overflow theoretically could occur, because a race is all but impossible. The token represents a particular attempt by some backend at inserting a tuple, that needs to be waited on specifically only if it is their active attempt (and the xact is still running). Otherwise, you get unprincipled deadlocks. Even if by some incredible set of circumstances it wraps around, the worst case scenario is an unprincipled deadlock, which is hardly the end of the world given the immense number of insertions required, and the immense unlikelihood that things would work out that way - it'd be basically impossible.

I'll document the "0" thing.

> * Isn't "SpecType" an awfully generic (and nondescriptive) name?

OK. That'll be changed.

> * /* XXX: Make sure that re-use of bits is safe here */ - no, not
>   unless you change existing checks.

I think that this is a restatement of your remarks on logical decoding. No?

> * /*
>    * Immediately VACUUM "super-deleted" tuples
>    */
>   if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple),
>                           InvalidTransactionId))
>       return HEAPTUPLE_DEAD;
>   Is that branch really needed? Shouldn't it just be happening as a
>   consequence of the already existing code? Same in SatisfiesMVCC. If
>   you actually needed that block, it'd need to be done in SatisfiesSelf
>   as well, no? You have a comment about a possible loop - but that seems
>   wrong to me, implying that HEAP_XMIN_COMMITTED was set invalidly.

Indeed, this code is kind of odd. While I think the omission within SatisfiesSelf() may be problematic too, if you really want to know why this code is needed, comment it out and run Jeff's stress-test. It will reliably break. This code:

"""""
if (HeapTupleHeaderXminInvalid(tuple))
    return HEAPTUPLE_DEAD;
"""""

and this code:

"""""
/*
 * Immediately VACUUM "super-deleted" tuples
 */
if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple),
                        InvalidTransactionId))
    return HEAPTUPLE_DEAD;
"""""

are not equivalent (nor is the latter a superset of the former). Maybe they should be, but they're not. What's more, a heap tuple header's raw xmin was never able to change before now, so there was no reason for it to ever be InvalidTransactionId. See my new comments within EvalPlanQualFetch() remarking on how it's now possible for that to change (previously, the comment there claimed that it wasn't possible).

> Ok, I can't focus at all any further at this point. But there's enough
> comments here that some even might make sense ;)

Most do. :-)

Thanks

[1] https://wiki.postgresql.org/wiki/UPSERT#.22Cardinality_violation.22_errors_in_detail

--
Peter Geoghegan
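To make the cardinality-violation case concrete, here is a sketch (hypothetical table, using the patch's syntax):

CREATE TABLE tab (key int PRIMARY KEY, val text);

-- The second VALUES row conflicts with the first, which was inserted by
-- this very command and so is "instantaneously invisible"; rather than
-- updating a row twice in one command, this raises a cardinality
-- violation error:
INSERT INTO tab (key, val) VALUES (1, 'a'), (1, 'b')
ON CONFLICT (key) UPDATE SET val = EXCLUDED.val;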
Attachment
On Wed, Feb 4, 2015 at 10:03 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Feb 4, 2015 at 9:54 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> I had compiled with -O0, but without assertions. I tried now again with -O3.
>> It's been running for about 10 minutes now, and I haven't seen any errors.
>
> Did you run with an artificially high XID burn rate (i.e. did you also
> apply Jeff's modifications to Postgres, and specify a high burn rate
> using his custom GUC)? Maybe that was important.

Excuse me: I now see that you specifically indicated that you did. But it looks like your XID burn rate was quite a lot higher than mine (assuming that you were consistent in using "jj_xid=10000", although I'm not asserting that that was significant).

I attach a log of output from an example session where exclusion constraints are shown to break (plus the corresponding server log, plus /proc/cpuinfo on the off chance that that's significant). As you can see from the fact that the span of time recorded in the log is only a couple of minutes, this is really easy for me to recreate... sometimes. I could not recreate the problem with only 4 clients (on this 8 core server) after a few dozen attempts, and then I couldn't recreate the issue at all, so clearly those details matter. It might have something to do with CPU scaling, which I've found can significantly affect outcomes for things like this (it looks like my hosting provider changed settings in the system BIOS recently, such that I cannot set the CPU governor to "performance").

Perhaps you could take a crack at recreating this, Jeff?

Thanks

--
Peter Geoghegan
Attachment
On 29 January 2015 at 23:38, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Jan 17, 2015 at 6:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I continued with this since posting V2.0.
>
> Attached version (V2.1) fixes bit-rot caused by the recent changes by
> Stephen ("Fix column-privilege leak in error-message paths"). More
> precisely, it is rebased on top of today's 17792b commit.
Patch 0002 no longer applies due to a conflict in src/backend/executor/execUtils.c.
Thom
On Wed, Feb 4, 2015 at 04:49:46PM -0800, Peter Geoghegan wrote:
> On Tue, Feb 2, 2015 at 01:06 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > A first (not actually that quick :() look through the patches to see
> > what actually happened in the last months. I didn't keep up with the
> > thread.
>
> So, let me get this out of the way: This is the first in-depth
> technical review that this work has had in a long time. Thank you for
> your help here.

I looked at all the patches too. The patch is only 9k lines, not huge.

Other than the locking part, the biggest part of this patch is adjusting things so that an INSERT can change into an UPDATE. The code that handles SELECT/INSERT/UPDATE/DELETE is already complex, and this makes it even more so. I have no idea how we can be sure we have hit every single case, but I am also unclear how we will _ever_ know we have hit them all.

We know people want this feature, and this patch seems to be our best bet at getting it. If we push this off to 9.6, I am not sure what that buys us.

--
Bruce Momjian <bruce@momjian.us>          http://momjian.us
EnterpriseDB                              http://enterprisedb.com

+ Everyone has their own god. +
On Fri, Feb 6, 2015 at 1:51 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Other than the locking part, the biggest part of this patch is adjusting
> things so that an INSERT can change into an UPDATE.

Thanks for taking a look at it. That's somewhat cleaned up in the attached patch series - V2.2. This has been rebased to repair the minor bit-rot pointed out by Thom.

Highlights:

* Better parser representation. The auxiliary UPDATE never uses its own RangeVar (in fact, UpdateStmt.relation is never set). This eliminates any possibility of repeated name lookup problems, addressing Andres' concern, but it also results in better code. The auxiliary UPDATE is not modified by the parent INSERT at all - rather, the UPDATE knows to fetch its target Relation from the parent INSERT's ParseState. There is no need to artificially cut off the parent within the auxiliary UPDATE to make sure the independent RTE is not visible (during parse analysis, prior to merging the two, as happened in earlier revisions) - that was a kludge that I'm glad to be rid of. There is no merging of distinct INSERT and UPDATE Relations/RTEs because there is only ever one Relation/RTE to begin with. (Previously, the merging merged RTE selectedCols and updatedCols into the parent INSERT - for column-level privileges, for example.) I'm also a lot less cute about determining whether an UPDATE is an auxiliary UPDATE from within the parser, which was also a concern raised by Andres. About 90% of the special case code previously in transformInsertStmt() is now in setTargetTable(). This is a significant improvement all around, since the code is now more consistent with existing parse analysis code - setTargetTable() is naturally where the auxiliary UPDATE figures out details on its target, and builds an EXCLUDED RTE and adds it to the namespace as a special case (just like regular UPDATE targets, which similarly get added to the namespace + joinlist).

  All of this implies a slight behavioral change (which is documented): the TARGET.* alias is now visible everywhere. So you see it within every node of EXPLAIN output, and if you want to qualify a RETURNING column, the TARGET.* alias must be used (not the original table name). I think that this is an improvement too, although it is arguably a slight behavioral change to INSERTs in general (I can't think why anyone would particularly want to qualify with an alias in INSERT's RETURNING, though). Note that the EXCLUDED.* pseudo-alias is still only visible within the UPDATE's targetlist and WHERE clause; I think it would be a bad idea to make the EXCLUDED.* tuples visible from RETURNING [1]. (See the example below.)

* Cleaner ExecInsert() control flow. Andres rightly complained that the existing control flow was convoluted. I believe he will find this revision a lot clearer, although I have not gone so far as creating something like an ExecUpsert().

* Better documentation. The executor README has been overhauled to describe the flow of things from a higher level. The procarray changes are better documented by comments, too.

* Special work previously within BuildIndexInfo() that is needed only for unique indexes in the UPSERT case is now done only in the UPSERT case. There is now no added overhead in BuildIndexInfo() for existing cases.

* Worked on feedback on various points of style raised by Andres (e.g. an errhint() was removed).

* Better explanation of the re-use of XLOG_HEAP* flag bits. I believe that it's fine to reuse the "(1<<7)" bit, given that each distinct use of the bit can only appear in distinct record types (that is, the bit is used by xl_heap_multi_insert, and now xl_heap_delete). Those two uses should be mutually exclusive. It's not as if we've had to be economical with the use of heap flag XLog record bits before now, so the best approach here isn't obvious. For now, I see no problem with this reuse.

* SnapshotSelf (that is, HeapTupleSatisfiesSelf()) has additions analogous to the previous additions to the HeapTupleSatisfiesVacuum() and HeapTupleSatisfiesMVCC() visibility routines. I still don't think that the changes to tqual.c are completely satisfactory, but as long as they're directly necessary (which they evidently are - Jeff's stress-testing tool shows that), I should at least make the appropriate changes everywhere. We should definitely focus on why they're necessary, and consider if we can do better.

* There was some squashing of commits, since Andres felt that they weren't all useful as separate commits. I've still split out the RTE permissions commit, as well as the RLS commit (plus the documentation and test commits, FWIW). I hope that this will make it easier to review parts of the patch, without being generally excessive.

When documentation and tests are left out, the entire patch series comes to:

68 files changed, 2958 insertions(+), 297 deletions(-)

which is not too big.

Thanks

[1] http://www.postgresql.org/message-id/CAM3SWZTcpy9rroLM3TkfuU4HDLrEtuGzxLptGn2vLhVAFwQCVA@mail.gmail.com

--
Peter Geoghegan
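For illustration, here is how the V2.2 alias rules look in use (a sketch with a hypothetical table, using the patch's syntax):

-- TARGET.* is now usable everywhere, including RETURNING, where the
-- original table name may no longer be used as a qualifier; EXCLUDED.*
-- remains visible only in the UPDATE targetlist and WHERE clause:
INSERT INTO tab (key, val) VALUES (1, 'a')
ON CONFLICT (key) UPDATE SET val = EXCLUDED.val
  WHERE TARGET.key = EXCLUDED.key
RETURNING TARGET.val;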
Attachment
- 0001-Make-UPDATE-privileges-distinct-from-INSERT-privileg.patch
- 0002-Support-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patch
- 0003-RLS-support-for-ON-CONFLICT-UPDATE.patch
- 0004-Tests-for-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patch
- 0005-Internal-documentation-for-INSERT-.-ON-CONFLICT-UPDA.patch
- 0006-User-visible-documentation-for-INSERT-.-ON-CONFLICT-.patch
On 2015-02-04 16:49:46 -0800, Peter Geoghegan wrote: > On Tue, Feb 2, 2015 at 01:06 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > Generally the split into the individual commits doesn't seem to make > > much sense to me. I think trying to make that possible is a good idea in patches of this size. It e.g. seems entirely possible to structure the patchset so that the speculative lock infrastructure is added first and the rest later. I've not thought more about how to split it up further, but I'm pretty sure it's possible. > > The commits individually (except the first) aren't > > indivdiually commitable and aren't even meaningful. Splitting off the > > internal docs, tests and such actually just seems to make reviewing > > harder because you miss context. Splitting it so that individual piece > > are committable and reviewable makes sense, but... I have no problem > > doing the user docs later. If you split of RLS support, you need to > > throw an error before it's implemented. > > I mostly agree. Basically, I did not intend for all of the patches to > be individually committed. The mechanism by which EXCLUDED.* > expressions are added is somewhat novel, and deserves to be > independently *considered*. I'm trying to show how the parts fit > together more so than breaking things down in to smaller commits (as > you picked up on, 0001 is the exception - that is genuinely intended > to be committed early). Also, those commit messages give me the > opportunity to put those parts in their appropriate context vis-a-vis > our discussions. They refer to the Wiki, for example, or reasons why > pg_stat_statements shouldn't care about ExcludedExpr. Obviously the > final commit messages won't look that way. FWIW, I don't think anything here really should refer to the wiki... > > 0002: > > * Tentatively I'd say that killspeculative should be done via a separate > > function instead of heap_delete() > > Really? I guess if that were to happen, it would entail refactoring > heap_delete() to call a static function, which was also called by a > new kill_speculative() function that does this. Otherwise, you'd have > far too much duplication. I don't really think there actually is that much common inbetween those. It looks to me that most of the code in heap_delete isn't actually relevant for this case and could be cut short. My guess is that only the WAL logging would be separated out. > > * I doubt logical decoding works with the patch as it stands. > > I thought so. Perhaps you could suggest a better use of the available > XLOG_HEAP_* bits. I knew I needed to consider that more carefully > (hence the XXX comment), but didn't get around to it. I think you probably need to add test macros that make sure only the individual bits are sets, and not the combination and then only use those. > > * If a arbiter index is passed to ExecCheckIndexConstraints(), can't we > > abort the loop after checking it? Also, do we really have to iterate > > over indexes for that case? How about moving the loop contents to a > > separate function and using that separately for the arbiter cases? > > Well, the failure to do that implies very few extra cycles, but sure. It's not that much about the CPU cycles, but also about the mental ones. If you have to think what happens if there's more than one match... > > * ItemPointerIsValid > > What about it? Uh. Oh. No idea. 
I wrote this pretty late at night ;) > > * /* > > * This may occur when an instantaneously invisible tuple is blamed > > * as a conflict because multiple rows are inserted with the same > > * constrained values. > > How can this happen? We don't insert multiple rows with the same > > command id? > > This is a cardinality violation [1]. It can definitely happen - just > try the examples you see on the Wiki. I don't care about the wiki from the point of view of code comments. This needs to be understandable in five years' time. > > * Perhaps it has previously been discussed but I'm not convinced by the > > reasoning for not looking at opclasses in infer_unique_index(). This > > seems like it'd prohibit ever having e.g. case insensitive opclasses - > > something surely worthwhile. > > I don't think anyone gave that idea the thumbs-up. However, I really > don't see the problem. Sure, we could have case insensitive opclasses > in the future, and you may want to make a unique index using one. Then the problem suddenly becomes that previous choices of indexes/statements aren't possible anymore. It seems much better to introduce the syntax now and not have too much of a use case for it. > > * The whole speculative insert logic isn't really well documented. Why, > > for example, do we actually need the token? And why are there no > > issues with overflow? And where is it documented that a 0 means > > there's no token? ... > > Fair enough. Presumably it's okay that overflow theoretically could > occur, because a race is all but impossible. The token represents a > particular attempt by some backend at inserting a tuple, that needs to > be waited on specifically only if it is their active attempt (and the > xact is still running). Otherwise, you get unprincipled deadlocks. > Even if by some incredible set of circumstances it wraps around, worst > case scenario you get an unprincipled deadlock, which is hardly the end > of the world given the immense number of insertions required, and the > immense unlikelihood that things would work out that way - it'd be > basically impossible. > > I'll document the "0" thing. It's really not about me understanding it right now, but about the longer term. > > * /* XXX: Make sure that re-use of bits is safe here */ - no, not > > unless you change existing checks. > > I think that this is a restatement of your remarks on logical > decoding. No? Yea. By here it was even later :P. Can you add an UPSERT test for logical decoding? I doubt it'll work right now, even in the repost. > > * /* > > * Immediately VACUUM "super-deleted" tuples > > */ > > if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple), > > InvalidTransactionId)) > > return HEAPTUPLE_DEAD; > > Is that branch really needed? Shouldn't it just be happening as a > > consequence of the already existing code? Same in SatisfiesMVCC. If > > you actually needed that block, it'd need to be done in SatisfiesSelf > > as well, no? You have a comment about a possible loop - but that seems > > wrong to me, implying that HEAP_XMIN_COMMITTED was set invalidly. > > Indeed, this code is kind of odd. While I think the omission within > SatisfiesSelf() may be problematic too, if you really want to know why > this code is needed, uncomment it and run Jeff's stress-test. It will > reliably break. Again. I don't care about running some strange tool when reading code comments. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
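To make the refactoring Andres suggests above concrete, here is a minimal sketch of what a dedicated entry point might look like - the function name and body are illustrative assumptions, not code from the patch:

    /*
     * Illustrative sketch only: kill a speculatively-inserted tuple via a
     * dedicated entry point, rather than funnelling through heap_delete().
     * Per the observation above, most of heap_delete() is irrelevant for
     * this case; roughly only the WAL logging would be shared.
     */
    void
    heap_kill_speculative(Relation relation, HeapTuple tuple)
    {
        /* Mark the promise tuple super-deleted by invalidating xmin */
        HeapTupleHeaderSetXmin(tuple->t_data, InvalidTransactionId);

        /*
         * ... then WAL-log the change, reusing the xl_heap_delete record
         * with the XLOG_HEAP_KILLED_SPECULATIVE_TUPLE flag set, so that
         * redo super-deletes the tuple too ...
         */
    }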
On Tue, Feb 10, 2015 at 12:04 AM, Andres Freund <andres@2ndquadrant.com> wrote: > FWIW, I don't think anything here really should refer to the wiki... The Wiki pages have done a good job of summarizing things...it certainly didn't seem that hard for you to get up to speed here. Also, as you'll know from working on logical decoding, it's hard to know what is complex and esoteric and what is relatively accessible when you're this close to a big project. I recall that you said as much before. I'm focused on signposting so that reviewers can follow what each patch does with the minimum of effort (with reference to the Wiki or whatever). I see no reason to not do things that way until commit...it feels like there is less chance of the concepts going over people's heads this way. > I don't really think there actually is that much in common between > those. It looks to me that most of the code in heap_delete isn't > actually relevant for this case and could be cut short. My guess is that > only the WAL logging would be separated out. I'll think about that some more. >> > * I doubt logical decoding works with the patch as it stands. >> >> I thought so. Perhaps you could suggest a better use of the available >> XLOG_HEAP_* bits. I knew I needed to consider that more carefully >> (hence the XXX comment), but didn't get around to it. > > I think you probably need to add test macros that make sure only the > individual bits are set, and not the combination, and then only use those. This too. >> > * /* >> > * This may occur when an instantaneously invisible tuple is blamed >> > * as a conflict because multiple rows are inserted with the same >> > * constrained values. >> > How can this happen? We don't insert multiple rows with the same >> > command id? >> >> This is a cardinality violation [1]. It can definitely happen - just >> try the examples you see on the Wiki. > > I don't care about the wiki from the point of view of code comments. This needs > to be understandable in five years' time. That wasn't clear before - you seemed to me to be questioning if this was even possible. There is a section in the INSERT SQL reference page about cardinality violations, so we're certainly talking about something that a code reader has likely heard of. Also, the nitty-gritty showing various scenarios on the Wiki is the quickest way to know what is possible (but is much too long-winded for user visible documentation or code comments). >> > * Perhaps it has previously been discussed but I'm not convinced by the >> > reasoning for not looking at opclasses in infer_unique_index(). This >> > seems like it'd prohibit ever having e.g. case insensitive opclasses - >> > something surely worthwhile. >> >> I don't think anyone gave that idea the thumbs-up. However, I really >> don't see the problem. Sure, we could have case insensitive opclasses >> in the future, and you may want to make a unique index using one. > > Then the problem suddenly becomes that previous choices of > indexes/statements aren't possible anymore. It seems much better to > introduce the syntax now and not have too much of a use case for > it. The only way the lack of a way of specifying which opclass to use could bite us is if there were two *actually* defined unique indexes on the same column, each with different "equality" operators. That seems like kind of a funny scenario, even if that were quite possible (even if non-default opclasses existed that had non-identical "equality" operators, which is basically not the case today).
I grant that it is a bit odd that we're talking about unique index definitions affecting semantics, but that is to a certain extent the nature of the beast. As a compromise, I suggest having the inference specification optionally accept a named opclass per attribute, in the style of CREATE INDEX (I'm already reusing a bit of the raw parser support for CREATE INDEX, you see) - that'll make inference insist on that opclass, rather than make it a strict matter of costing available alternatives (not that any alternative is expected with idiomatic usage). That implies no additional parser overhead, really. If that's considered ugly, then at least it's an ugly thing that literally no one will ever use in the foreseeable future...and an ugly thing that is no more necessary in CREATE INDEX than here (and yet CREATE INDEX lives with the ugliness). > It's really not about me understanding it right now, but about the longer term. Sure. > Can you add an UPSERT test for logical decoding? I doubt it'll work right > now, even in the repost. Okay. >> > * /* >> > * Immediately VACUUM "super-deleted" tuples >> > */ >> > if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple), >> > InvalidTransactionId)) >> > return HEAPTUPLE_DEAD; >> > Is that branch really needed? Shouldn't it just be happening as a >> > consequence of the already existing code? Same in SatisfiesMVCC. If >> > you actually needed that block, it'd need to be done in SatisfiesSelf >> > as well, no? You have a comment about a possible loop - but that seems >> > wrong to me, implying that HEAP_XMIN_COMMITTED was set invalidly. >> >> Indeed, this code is kind of odd. While I think the omission within >> SatisfiesSelf() may be problematic too, if you really want to know why >> this code is needed, uncomment it and run Jeff's stress-test. It will >> reliably break. > > Again. I don't care about running some strange tool when reading code > comments. Again, I thought you were skeptical about the very need for this code (and not how it was presented). If that was the case, that tool would provide you with a pretty quick and easy way of satisfying yourself that it is needed. The actual reason that it is needed is that without it, the system can see a "broken promise" tuple as spuriously visible. This will cause Jeff's tool to spit out a bunch of errors due to finding all-NULL values in these tuples. VACUUM could not reclaim the tuples unless and until they stopped appearing visible for VACUUM's purposes (which this particular code snippet relates to). Maybe the comment could be improved, and maybe the code could be improved, but the code is necessary as things stand. -- Peter Geoghegan
On Tue, Feb 10, 2015 at 9:21 AM, Peter Geoghegan <pg@heroku.com> wrote: > * There was some squashing of commits, since Andres felt that they > weren't all useful as separate commits. I've still split out the RTE > permissions commit, as well as the RLS commit (plus the documentation > and test commits, FWIW). I hope that this will make it easier to > review parts of the patch, without being generally excessive. > > When documentation and tests are left out, the entire patch series is left at: Patch moved to next CF. -- Michael
On 2/9/15 6:21 PM, Peter Geoghegan wrote: > Thanks for taking a look at it. That's somewhat cleaned up in the > attached patchseries - V2.2. In patch 1, "sepgsql is also affected by this commit. Note that this commit necessitates an initdb, since stored ACLs are broken." Doesn't that warrant bumping catversion? Patch 2 + * When killing a speculatively-inserted tuple, we set xmin to invalid and +if (!(xlrec->flags & XLOG_HEAP_KILLED_SPECULATIVE_TUPLE)) When doing this, should we also set the HEAP_XMIN_INVALID hint bit? <reads more of patch> Ok, I see we're not doing this because the check for a super deleted tuple is already cheap. Probably worth mentioning that in the comment... ExecInsert(): + * We don't suppress the effects (or, perhaps, side-effects) of + * BEFORE ROW INSERT triggers. This isn't ideal, but then we + * cannot proceed with even considering uniqueness violations until + * these triggers fire on the one hand, but on the other hand they + * have the ability to execute arbitrary user-defined code which + * may perform operations entirely outside the system's ability to + * nullify. I'm a bit confused as to why we're calling BEFORE triggers out here... hasn't this always been true for both BEFORE *and* AFTER triggers? The comment makes me wonder if there's some magic that happens with AFTER... +spec != SPEC_NONE? HEAP_INSERT_SPECULATIVE:0 Non-standard formatting. Given the size of the patch and work already put into it, I'd just leave it for the next formatting run; I only mention it in case someone has some compelling reason not to. ExecLockUpdateTuple(): + * Try to lock tuple for update as part of speculative insertion. If + * a qual originating from ON CONFLICT UPDATE is satisfied, update + * (but still lock row, even though it may not satisfy estate's + * snapshot). I find this confusing... which row is being locked? The previously inserted one? Perhaps a better wording would be "satisfied, update. Lock the original row even if it doesn't satisfy estate's snapshot." infer_unique_index(): +/* + * We need not lock the relation since it was already locked, either by + * the rewriter or when expand_inherited_rtentry() added it to the query's + * rangetable. + */ +relationObjectId = rt_fetch(parse->resultRelation, parse->rtable)->relid; + +relation = heap_open(relationObjectId, NoLock); Seems like there should be an Assert here. Also, the comment should probably go before the heap_open call. heapam_xlog.h: +/* reuse xl_heap_multi_insert-only bit for xl_heap_delete */ I wish this would mention why it's safe to do this. Also, the comment mentions xl_heap_delete when the #define is for XLOG_HEAP_KILLED_SPECULATIVE_TUPLE; presumably that's wrong. Perhaps: /* * reuse XLOG_HEAP_LAST_MULTI_INSERT bit for * XLOG_HEAP_KILLED_SPECULATIVE_TUPLE. This is safe because we never do * multi-inserts for INSERT ON CONFLICT. */ I'll review the remaining patches later. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 02/10/2015 02:21 AM, Peter Geoghegan wrote: > On Fri, Feb 6, 2015 at 1:51 PM, Bruce Momjian <bruce@momjian.us> wrote: >> Other than the locking part, the biggest part of this patch is adjusting >> things so that an INSERT can change into an UPDATE. > > Thanks for taking a look at it. That's somewhat cleaned up in the > attached patchseries - V2.2. This has been rebased to repair the minor > bit-rot pointed out by Thom. I don't really have the energy to review this patch in one batch, so I'd really like to split this into two: 1. Solve the existing "problem" with exclusion constraints that if two insertions happen concurrently, one of them might be aborted with "deadlock detected" error, rather than "conflicting key violation" error. That really wouldn't be worth fixing otherwise, but it happens to be a useful stepping stone for this upsert patch. 2. All the rest. I took a stab at extracting the parts needed to do 1. See attached. I didn't modify ExecUpdate to use speculative insertions, so the issue remains for UPDATEs, but that's easy to add. I did not solve the potential for livelocks (discussed at http://www.postgresql.org/message-id/CAM3SWZTfTt_fehet3tU3YKCpCYPYnNaUqUZ3Q+NAASnH-60teA@mail.gmail.com). The patch always super-deletes the already-inserted tuple on conflict, and then waits for the other inserter. That would be nice to fix, using one of the ideas from that thread, or something else. We never really debated the options for how to do the speculative insertion and super-deletion. This patch is still based on the quick & dirty demo patches I posted about a year ago, in response to issues you found with earlier versions. That might be OK - maybe I really hit the nail on the head designing those things and got them right on the first try - but more likely there are better alternatives. Are we happy with acquiring the SpeculativeInsertLock heavy-weight lock for every insertion? That seems bad for performance reasons. Also, are we happy with adding the new fields to the proc array? Another alternative that was discussed was storing the speculative insertion token on the heap tuple itself. (See http://www.postgresql.org/message-id/52D00D2D.6030307@vmware.com) Are we happy with the way super-deletion works? Currently, the xmin field is set to invalid to mark the tuple as super-deleted. That required checks in HeapTupleSatisfies* functions. One alternative would be to set xmax equal to xmin, and teach the code that currently calls XactLockTableWait() to check if xmax=xmin, and not consider the tuple as conflicting. - Heikki
On Sat, Feb 14, 2015 at 2:06 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> Thanks for taking a look at it. That's somewhat cleaned up in the >> attached patchseries - V2.2. This has been rebased to repair the minor >> bit-rot pointed out by Thom. > > I don't really have the energy to review this patch in one batch, so I'd > really like to split this into two: I think we're all feeling worn out at this point, by this patch and by others. I do appreciate your making the effort. > 1. Solve the existing "problem" with exclusion constraints that if two > insertions happen concurrently, one of them might be aborted with "deadlock > detected" error, rather than "conflicting key violation" error. That really > wouldn't be worth fixing otherwise, but it happens to be a useful stepping > stone for this upsert patch. > > 2. All the rest. I think we need a more pragmatic approach to dealing with the exclusion constraint problems. > I took a stab at extracting the parts needed to do 1. See attached. I didn't > modify ExecUpdate to use speculative insertions, so the issue remains for > UPDATEs, but that's easy to add. Cool. > I did not solve the potential for livelocks (discussed at > http://www.postgresql.org/message-id/CAM3SWZTfTt_fehet3tU3YKCpCYPYnNaUqUZ3Q+NAASnH-60teA@mail.gmail.com). > The patch always super-deletes the already-inserted tuple on conflict, and > then waits for the other inserter. That would be nice to fix, using one of > the ideas from that thread, or something else. How about we don't super-delete at all with exclusion constraints? I'd be willing to accept unprincipled deadlocks for exclusion constraints, because they already exist today for UPSERT/NOSERT type use cases, and with idiomatic usage seem far less likely for the IGNORE variant (especially with exclusion constraints). I can see people using ON CONFLICT UPDATE where they'd almost or actually be better off using a plain UPDATE - that's quite a different situation. I find livelocks to be a *very* scary prospect, and I don't think the remediations that were mentioned are going to fly. It's just too messy, and too much of a niche use case. TBH I am skeptical of the idea that we can fix exclusion constraints properly in this way at all, at least not without the exponential backoff thing, which just seems horrible. > We never really debated the options for how to do the speculative insertion > and super-deletion. This patch is still based on the quick & dirty demo > patches I posted about a year ago, in response to issues you found with > earlier versions. That might be OK - maybe I really hit the nail on > designing those things and got them right on first try - but more likely > there are better alternatives. Intuitively, it seems likely that you're right here. However, it was necessary to work through the approach to see what the problems are. For example, the need for modifications to tqual.c became apparent only through putting a full implementation of ON CONFLICT UPDATE through various tests. In general, I've emphasized that the problems with any given value locking implementation are non-obvious. Anyone who thinks that he can see all the problems with his approach to value locking without having a real implementation that is tested for problems like unprincipled deadlocks is probably wrong. That's sort of where I'm coming from with suggesting we allow unprincipled deadlocks with exclusion constraints. I'm not confident that we can find a perfect solution, and know that it's a perfect solution.
It's too odd, and too niche a requirement. Although you understood what I was on about when I first talked about unprincipled deadlocks, I think that acceptance of that idea came much later from other people, because it's damn complicated. It's not worth it to add some weird "Dining philosophers" exponential backoff thing to make sure that the IGNORE variant when used with exclusion constraints can never deadlock in an unprincipled way, but if it is worth it then it seems unreasonable to suppose that this patch needs to solve that pre-existing problem. No? If we do something like an exponential backoff, which I think might work, I fear that that kind of yak-shaving will leave us with something impossible to justify committing; a total mess. Better the devil you know (possible unprincipled deadlocks with the IGNORE variant + exclusion constraints). > Are we happy with acquiring the SpeculativeInsertLock heavy-weight lock for > every insertion? That seems bad for performance reasons. Also, are we happy > with adding the new fields to the proc array? Another alternative that was > discussed was storing the speculative insertion token on the heap tuple > itself. (See > http://www.postgresql.org/message-id/52D00D2D.6030307@vmware.com) Whatever works, really. I can't say that the performance implications of acquiring that hwlock are at the forefront of my mind. I never found that to be a big problem on an 8 core box, relative to vanilla INSERTs, FWIW - lock contention is not normal, and may be where any heavyweight lock costs would really be encountered. Besides, the update path won't have to do this at all. I don't see how storing the promise token in heap tuples buys us not having to involve heavyweight locking of tokens. (I think you may have made a thinko in suggesting otherwise) > Are we happy with the way super-deletion works? Currently, the xmin field is > set to invalid to mark the tuple as super-deleted. That required checks in > HeapTupleSatisfies* functions. One alternative would be to set xmax equal to > xmin, and teach the code that currently calls XactLockTableWait() to check if > xmax=xmin, and not consider the tuple as conflicting. That couldn't work without further HeapTupleSatisfiesDirty() logic. Besides, why should one transaction be entitled to insert a conflicting value tuple just because a still-running transaction deleted it (having also inserted the tuple itself)? Didn't one prototype version of value locking #2 have that as a bug [1]? In fact, originally, didn't the "xmin set to invalid" thing come immediately from realizing that that wasn't workable? I too think the tqual.c changes are ugly. I can't see a way around using a token lock, though - I would only consider (and only consider) refining the interface by which a waiter becomes aware that it must wait on the outcome of the inserting xact's speculative insertion/value lock (and in particular, that it should not wait on xact outcome). We clearly need something to wait on that isn't an XID...heavyweight locking of a token value is the obvious way of doing that. (thinks about it some more for a while) It seems like what you're really taking issue with - the real issue - is that we're introducing this whole new idea of making a tuple visible for a moment, a moment that is potentially only a slice of its originating transaction's total duration.
Setting xmin to InvalidTransactionId is one way to do that, and may, counterintuitively, even be the best way, but the fact that we're playing new games with visibility is the real point of concern. We could do something like store the token in the heap tuple, allowing us to make fewer additions to PGPROC, but that seems kind of pointless (and a waste of disk space). Playing new games with visibility is the nature of the beast. We keep talking about mechanism, but what if we *did* have the infomask bit space to represent that xmin=xmax is a broken promise tuple (and not a deleted, self-inserted, combocid-requiring tuple that may or may not have been a promise tuple at some point in time)? I think that a minor aesthetic improvement is the best we can hope for, and maybe even that is too much to expect. Maybe we should just own the fact that we're playing a new sort of game with visibility, and keep things as they are. You might consider that setting xmin to InvalidTransactionId is more "honest" than any alternative scheme that happens to avoid changes to HeapTupleSatisfiesMVCC() and so on. Jim Nasby said something about setting the HEAP_XMIN_INVALID hint bit. Maybe he is right...if that can be made to be reliable (always WAL-logged), it could be marginally better than setting xmin to InvalidTransactionId. But only marginally; it seems like your discomfort is basically inherent to playing these new games with visibility, which is inherent to this design. As I said, I am having a hard time seeing a way to do anything more than polish what we have here. That's probably fine, though...it hasn't proven to be too problematic (exclusion constraints aside). Not having to change HeapTupleSatisfiesMVCC() and so on (to check if xmin = InvalidTransactionId) is not obviously a useful goal here, since ISTM that any alternative would have to *logically* do the same thing. If I'm off the mark about your thinking here, please correct me... are you just worried about extra cycles in the HeapTupleSatisfies* routines? [1] http://www.postgresql.org/message-id/52D5C74E.6090608@vmware.com -- Peter Geoghegan
On Fri, Feb 13, 2015 at 7:22 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > In patch 1, "sepgsql is also affected by this commit. Note that this commit > necessitates an initdb, since stored ACLs are broken." > > Doesn't that warrant bumping catversion? Yes, but traditionally that is left until the last minute - when the patch is committed. That's why it's called out in the commit message (it isn't otherwise obvious - it's not a common catversion-necessitating change, like an addition to a system catalog). > Patch 2 > + * When killing a speculatively-inserted tuple, we set xmin to invalid > and > +if (!(xlrec->flags & XLOG_HEAP_KILLED_SPECULATIVE_TUPLE)) > > When doing this, should we also set the HEAP_XMIN_INVALID hint bit? > > <reads more of patch> > > Ok, I see we're not doing this because the check for a super deleted tuple > is already cheap. Probably worth mentioning that in the comment... See my remarks to Heikki on this. I don't think it adds much. > ExecInsert(): > + * We don't suppress the effects (or, perhaps, side-effects) of > + * BEFORE ROW INSERT triggers. This isn't ideal, but then we > + * cannot proceed with even considering uniqueness violations until > + * these triggers fire on the one hand, but on the other hand they > + * have the ability to execute arbitrary user-defined code which > + * may perform operations entirely outside the system's ability to > + * nullify. > > I'm a bit confused as to why we're calling BEFORE triggers out here... > hasn't this always been true for both BEFORE *and* AFTER triggers? The > comment makes me wonder if there's some magic that happens with AFTER... Yes, but the difference here is that the UPDATE path could be taken (which is sort of like when a before row insert trigger returns NULL). What I'm calling out is the dependency on firing before row insert triggers to decide if the alternative path must be taken. Roughly speaking, we must perform "part" of the INSERT (firing of before row insert triggers) in order to decide to do or not do an INSERT. This is, as I said, similar to when those triggers return NULL, and won't matter with idiomatic patterns of before trigger usage. Still feels worth calling out, because sometimes users do foolishly write before triggers with many external side-effects. > ExecLockUpdateTuple(): > + * Try to lock tuple for update as part of speculative insertion. If > + * a qual originating from ON CONFLICT UPDATE is satisfied, update > + * (but still lock row, even though it may not satisfy estate's > + * snapshot). > > I find this confusing... which row is being locked? The previously inserted > one? Perhaps a better wording would be "satisfied, update. Lock the original > row even if it doesn't satisfy estate's snapshot." Take a look at the executor README. We're locking the only row that can be locked when an UPSERT non-conclusively thinks to take the UPDATE path: the row that was found during our pre-check. We can only UPDATE when we find such a tuple, and then lock it without finding a row-level conflict. > infer_unique_index(): > +/* > + * We need not lock the relation since it was already locked, either by > + * the rewriter or when expand_inherited_rtentry() added it to the query's > + * rangetable. > + */ > +relationObjectId = rt_fetch(parse->resultRelation, parse->rtable)->relid; > + > +relation = heap_open(relationObjectId, NoLock); > > Seems like there should be an Assert here. Also, the comment should probably > go before the heap_open call. An Assert() of what?
Note that the similar function get_relation_info() does about the same thing here. > heapam_xlog.h: > +/* reuse xl_heap_multi_insert-only bit for xl_heap_delete */ > I wish this would mention why it's safe to do this. Also, the comment > mentions xl_heap_delete when the #define is for > XLOG_HEAP_KILLED_SPECULATIVE_TUPLE; presumably that's wrong. Perhaps: > /* > * reuse XLOG_HEAP_LAST_MULTI_INSERT bit for > * XLOG_HEAP_KILLED_SPECULATIVE_TUPLE. This is safe because we never do > * multi-inserts for INSERT ON CONFLICT. > */ It's safe, as the comment indicates, because the former is only used for xl_heap_multi_insert records, while the latter is only used for xl_heap_delete records. There is no ambiguity, because naturally we're always able to establish what type of record is under consideration before checking the bit is set. The XLOG_HEAP_* format is used for other flags there, despite the fact that other flags (like XLOG_HEAP_PREFIX_FROM_OLD) can only appear in certain context-appropriate xl_heap_* records. AFAICT, all that I've done that's new here is rely on that, since a bunch of those bits were used up in the last year or two, and the need to even consider bit reuse here is a new problem. > I'll review the remaining patches later. Thanks. -- Peter Geoghegan
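For illustration, here is a sketch of the disambiguation Peter describes - the helper below is hypothetical and only mirrors the reasoning, not the patch's actual code:

    /*
     * Sketch (hypothetical helper): a reused flag bit is unambiguous so
     * long as it is only ever tested against the record type it belongs
     * to.  XLOG_HEAP_KILLED_SPECULATIVE_TUPLE shares its bit with
     * XLOG_HEAP_LAST_MULTI_INSERT, but the former appears only in
     * xl_heap_delete records and the latter only in xl_heap_multi_insert
     * records, so the record type always settles which meaning applies.
     */
    static bool
    delete_killed_speculative_tuple(xl_heap_delete *xlrec)
    {
        /* Safe: we know this flags field came from an xl_heap_delete */
        return (xlrec->flags & XLOG_HEAP_KILLED_SPECULATIVE_TUPLE) != 0;
    }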
On 02/16/2015 02:44 AM, Peter Geoghegan wrote: > On Sat, Feb 14, 2015 at 2:06 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> I did not solve the potential for livelocks (discussed at >> http://www.postgresql.org/message-id/CAM3SWZTfTt_fehet3tU3YKCpCYPYnNaUqUZ3Q+NAASnH-60teA@mail.gmail.com). >> The patch always super-deletes the already-inserted tuple on conflict, and >> then waits for the other inserter. That would be nice to fix, using one of >> the ideas from that thread, or something else. > > How about we don't super-delete at all with exclusion constraints? I'd > be willing to accept unprincipled deadlocks for exclusion constraints, > because they already exist today for UPSERT/NOSERT type use cases, and > with idiomatic usage seem far less likely for the IGNORE variant > (especially with exclusion constraints). So INSERT ON CONFLICT IGNORE on a table with an exclusion constraint might fail. I don't like that. The point of having the command in the first place is to deal with concurrency issues. If it sometimes doesn't work, it's broken. > I can see people using ON > CONFLICT UPDATE where they'd almost or actually be better off using a > plain UPDATE - that's quite a different situation. I find livelocks to > be a *very* scary prospect, and I don't think the remediations that > were mentioned are going to fly. It's just too messy, and too much of > a niche use case. TBH I am skeptical of the idea that we can fix > exclusion constraints properly in this way at all, at least not > without the exponential backoff thing, which just seems horrible. The idea of comparing the TIDs of the tuples as a tie-breaker seems most promising to me. If the conflicting tuple's TID is smaller than the inserted tuple's, super-delete and wait. Otherwise, wait without super-deletion. That's really simple. Do you see a problem with that? > Although you understood what I was on about when I first talked about > unprincipled deadlocks, I think that acceptance of that idea came > much later from other people, because it's damn complicated. BTW, it would be good to explain somewhere in comments or a README the term "unprincipled deadlock". It's been a useful concept, and hard to grasp. If you defined it once, with examples and everything, then you could just have "See .../README" in other places that need to refer to it. > It's not worth it to add > some weird "Dining philosophers" exponential backoff thing to make > sure that the IGNORE variant when used with exclusion constraints can > never deadlock in an unprincipled way, but if it is worth it then it > seems unreasonable to suppose that this patch needs to solve that > pre-existing problem. No? The point of solving the existing problem is that it allows us to split the patch, for easier review. >> Are we happy with acquiring the SpeculativeInsertLock heavy-weight lock for >> every insertion? That seems bad for performance reasons. Also, are we happy >> with adding the new fields to the proc array? Another alternative that was >> discussed was storing the speculative insertion token on the heap tuple >> itself. (See >> http://www.postgresql.org/message-id/52D00D2D.6030307@vmware.com) > > Whatever works, really. I can't say that the performance implications > of acquiring that hwlock are at the forefront of my mind. I never > found that to be a big problem on an 8 core box, relative to vanilla > INSERTs, FWIW - lock contention is not normal, and may be where any > heavyweight lock costs would really be encountered. Oh, cool.
I guess the fast-path in lmgr.c kicks ass, then :-). > I don't see how storing the promise token in heap tuples buys us not > having to involve heavyweight locking of tokens. (I think you may have > made a thinko in suggesting otherwise) It wouldn't get rid of heavyweight locks, but it would allow getting rid of the procarray changes. The inserter could generate a token, then acquire the hw-lock on the token, and lastly insert the heap tuple with the correct token. >> Are we happy with the way super-deletion works? Currently, the xmin field is >> set to invalid to mark the tuple as super-deleted. That required checks in >> HeapTupleSatisfies* functions. One alternative would be to set xmax equal to >> xmin, and teach the code that currently calls XactLockTableWait() to check if >> xmax=xmin, and not consider the tuple as conflicting. > > That couldn't work without further HeapTupleSatisfiesDirty() logic. Why not? > Besides, why should one transaction be entitled to insert a > conflicting value tuple just because a still-running transaction > deleted it (having also inserted the tuple itself)? Didn't one > prototype version of value locking #2 have that as a bug [1]? In fact, > originally, didn't the "xmin set to invalid" thing come immediately > from realizing that that wasn't workable? Ah, right. So the problem was that some code might assume that if you insert a row, delete it in the same transaction, and then insert the same value again, the 2nd insertion will always succeed because the previous insertion effectively acquired a value-lock on the key. Ok, so we can't unconditionally always ignore tuples with xmin==xmax. We would need an infomask bit to indicate super-deletion, and only ignore it if the bit is set. I'm starting to think that we should bite the bullet and consume an infomask bit for this. The infomask bits are a scarce resource, but we should use them when it makes sense. It would be good for forensic purposes too, to leave a trace that a super-deletion happened. > I too think the tqual.c changes are ugly. I can't see a way around > using a token lock, though - I would only consider (and only consider) > refining the interface by which a waiter becomes aware that it must > wait on the outcome of the inserting xact's speculative > insertion/value lock (and in particular, that it should not wait on > xact outcome). We clearly need something to wait on that isn't an > XID...heavyweight locking of a token value is the obvious way of doing > that. Yeah. > Jim Nasby said something about setting the HEAP_XMIN_INVALID hint bit. > Maybe he is right...if that can be made to be reliable (always > WAL-logged), it could be marginally better than setting xmin to > InvalidTransactionId. I'm not a big fan of that. The xmin-invalid bit is currently always just a hint. > But only marginally; it seems like your > discomfort is basically inherent to playing these new games with > visibility, which is inherent to this design. No, I get that we're playing games with visibility. I want to find the best way to implement those games. > As I said, I am having a > hard time seeing a way to do anything more than polish what we have > here. That's probably fine, though...it hasn't proven to be too > problematic (exclusion constraints aside). > > Not having to change HeapTupleSatisfiesMVCC() and so on (to check if > xmin = InvalidTransactionId) is not obviously a useful goal here, > since ISTM that any alternative would have to *logically* do the same > thing.
> If I'm off the mark about your thinking here, please correct > me... are you just worried about extra cycles in the > HeapTupleSatisfies* routines? Extra cycles yes, but even more importantly, code readability and maintainability. - Heikki
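A rough sketch of the TID tie-breaker Heikki proposes above, under the assumption that both sessions have already physically inserted their tuples and each has found the other's as a conflict (all function names other than ItemPointerCompare() are hypothetical):

    /*
     * Sketch of the proposed tie-break, not actual patch code.  Exactly
     * one of the two conflicting backends backs off, so the pair cannot
     * livelock; and the backend that waits while keeping its tuple does
     * so deterministically, avoiding the unprincipled deadlocks.
     */
    if (ItemPointerCompare(&conflictTid, &myTid) < 0)
    {
        /* Conflicting tuple has the smaller TID: we back off first ... */
        super_delete_tuple(relation, &myTid);       /* hypothetical */
        /* ... then wait for the other inserter, and retry from scratch */
        wait_for_speculative_token(otherToken);     /* hypothetical */
    }
    else
    {
        /* We win the tie-break: wait without super-deleting */
        wait_for_speculative_token(otherToken);     /* hypothetical */
    }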
On 2015-02-16 10:00:24 +0200, Heikki Linnakangas wrote: > On 02/16/2015 02:44 AM, Peter Geoghegan wrote: > >>Are we happy with acquiring the SpeculativeInsertLock heavy-weight lock for > >>every insertion? That seems bad for performance reasons. Also, are we happy > >>with adding the new fields to the proc array? Another alternative that was > >>discussed was storing the speculative insertion token on the heap tuple > >>itself. (See > >>http://www.postgresql.org/message-id/52D00D2D.6030307@vmware.com) > > > >Whatever works, really. I can't say that the performance implications > >of acquiring that hwlock are at the forefront of my mind. I never > >found that to be a big problem on an 8 core box, relative to vanilla > >INSERTs, FWIW - lock contention is not normal, and may be where any > >heavyweight lock costs would really be encountered. > > Oh, cool. I guess the fast-path in lmgr.c kicks ass, then :-). I don't think it actually uses the fast path, does it? IIRC that's restricted to LOCKTAG_RELATION when done via LockAcquireExtended + open-coded for the VirtualXactLock table. I'm not super worried atm either. Worth checking, but probably not worth obsessing over. > >Besides, why should one transaction be entitled to insert a > >conflicting value tuple just because a still-running transaction > >deleted it (having also inserted the tuple itself)? Didn't one > >prototype version of value locking #2 have that as a bug [1]? In fact, > >originally, didn't the "xmin set to invalid" thing come immediately > >from realizing that that wasn't workable? > > Ah, right. So the problem was that some code might assume that if you insert > a row, delete it in the same transaction, and then insert the same value > again, the 2nd insertion will always succeed because the previous insertion > effectively acquired a value-lock on the key. > > Ok, so we can't unconditionally always ignore tuples with xmin==xmax. We > would need an infomask bit to indicate super-deletion, and only ignore it if > the bit is set. > > I'm starting to think that we should bite the bullet and consume an infomask > bit for this. The infomask bits are a scarce resource, but we should use > them when it makes sense. It would be good for forensic purposes too, to > leave a trace that a super-deletion happened. Well, we IIRC don't have any left right now. We could reuse MOVED_IN|MOVED_OUT, as that combination was never set in the past. But it'd essentially use two infomask bits forever... > >Jim Nasby said something about setting the HEAP_XMIN_INVALID hint bit. > >Maybe he is right...if that can be made to be reliable (always > >WAL-logged), it could be marginally better than setting xmin to > >InvalidTransactionId. > > I'm not a big fan of that. The xmin-invalid bit is currently always just a > hint. We already rely on XMIN_INVALID being set correctly for freezing. C.f. HeapTupleHeaderXminFrozen, HeapTupleHeaderXminInvalid, et al. So it'd not necessarily end up being that bad. And the super deletion could easily just set it in the course of its WAL logging. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
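If memory serves, the 9.4-era htup_details.h definitions behind Andres's point look roughly like this (quoted from memory, so treat as approximate):

    /* Freezing reuses the combination of both xmin hint bits: */
    #define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)

    /*
     * Consequently "xmin invalid" only counts as invalid when the
     * COMMITTED bit is not also set - both bits together mean frozen,
     * which is why the invalid bit is already more than a hint.
     */
    #define HeapTupleHeaderXminFrozen(tup) \
    ( \
        ((tup)->t_infomask & HEAP_XMIN_FROZEN) == HEAP_XMIN_FROZEN \
    )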
On Mon, Feb 16, 2015 at 12:00 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > So INSERT ON CONFLICT IGNORE on a table with an exclusion constraint might > fail. I don't like that. The point of having the command in the first place > is to deal with concurrency issues. If it sometimes doesn't work, it's > broken. I don't like it either, although I think it wouldn't come up very often with exclusion constraints - basically, it would rarely be noticed due to the different use cases. To be honest, in suggesting this idea I was hedging against us not figuring out a solution to that problem in time. You didn't like my suggestion of dropping IGNORE entirely, either. I'll do my best to come up with something, but I'm uncomfortable that at this late stage, ON CONFLICT IGNORE support for exclusion constraints seems like a risk to the entire project. I ask that if push comes to shove you show some flexibility here. I'll try my best to ensure that you don't have to even consider committing something with a notable omission. You don't have to give me an answer to this now. > The idea of comparing the TIDs of the tuples as a tie-breaker seems most > promising to me. If the conflicting tuple's TID is smaller than the inserted > tuple's, super-delete and wait. Otherwise, wait without super-deletion. > That's really simple. Do you see a problem with that? No. I'll work on that, and see how it stands up to stress testing. Come to think of it, that does seem most promising. > BTW, it would be good to explain somewhere in comments or a README the term > "unprincipled deadlock". It's been a useful concept, and hard to grasp. If > you defined it once, with examples and everything, then you could just have > "See .../README" in other places that need to refer to it. Agreed. I have described those in the revised executor README, in case you missed that. Do you think they ought to have their own section? Note that hash indexes have "unprincipled deadlock" issues, but no one has bothered to fix them. >> Whatever works, really. I can't say that the performance implications >> of acquiring that hwlock are at the forefront of my mind. I never >> found that to be a big problem on an 8 core box, relative to vanilla >> INSERTs, FWIW - lock contention is not normal, and may be where any >> heavyweight lock costs would really be encountered. > > Oh, cool. I guess the fast-path in lmgr.c kicks ass, then :-). Seems that way. But even if that wasn't true, it wouldn't matter, since I don't see that we have a choice. >> I don't see how storing the promise token in heap tuples buys us not >> having to involve heavyweight locking of tokens. (I think you may have >> made a thinko in suggesting otherwise) > > It wouldn't get rid of heavyweight locks, but it would allow getting rid of > the procarray changes. The inserter could generate a token, then acquire the > hw-lock on the token, and lastly insert the heap tuple with the correct > token. Do you really think that's worth the disk overhead? I generally agree with the "zero overhead" principle, which is that anyone not using the feature should pay no price for it (or vanishingly close to no price). Can't say that I have a good sense of the added distributed cost (if any) to be paid by adding new fields to the PGPROC struct (since the PGXACT struct was added in 9.2). Is your only concern that the PGPROC array will now have more fields, making it more complicated? Surely that's a price worth paying to avoid making these heap tuples artificially somewhat larger.
You're probably left with tuples that are at least 8 bytes larger, once alignment is taken into consideration. That doesn't seem great. >> That couldn't work without further HeapTupleSatisfiesDirty() logic. > > Why not? Just meant that it wasn't sufficient to check xmin == xmax, while allowing that something like that could work with extra work (e.g. the use of infomask bits)... > Ok, so we can't unconditionally always ignore tuples with xmin==xmax. We > would need an infomask bit to indicate super-deletion, and only ignore it if > the bit is set. ...which is what you say here. > I'm starting to think that we should bite the bullet and consume an infomask > bit for this. The infomask bits are a scarce resource, but we should use > them when it makes sense. It would be good for forensic purposes too, to > leave a trace that a super-deletion happened. > >> I too think the tqual.c changes are ugly. I can't see a way around >> using a token lock, though - I would only consider (and only consider) >> refining the interface by which a waiter becomes aware that it must >> wait on the outcome of the inserting xact's speculative >> insertion/value lock (and in particular, that it should not wait on >> xact outcome). We clearly need something to wait on that isn't an >> XID...heavyweight locking of a token value is the obvious way of doing >> that. > > > Yeah. > >> Jim Nasby said something about setting the HEAP_XMIN_INVALID hint bit. >> Maybe he is right...if that can be made to be reliable (always >> WAL-logged), it could be marginally better than setting xmin to >> InvalidTransactionId. > > > I'm not a big fan of that. The xmin-invalid bit is currently always just a > hint. Well, Andres makes the point that that isn't quite so. TBH, I have a hard time justifying the use of MOVED_IN|MOVED_OUT...it's not as if there is a correctness problem with either setting xmin to InvalidTransactionId, or setting the xmin-invalid hint bit (with appropriate precautions so that it's not really just a hint). And as was pointed out, there is something to be said for preserving tuple header xmin where possible, for forensic reasons. >> But only marginally; it seems like your >> discomfort is basically inherent to playing these new games with >> visibility, which is inherent to this design. > > > No, I get that we're playing games with visibility. I want to find the best > way to implement those games. That's useful information. Thanks. >> are you just worried about extra cycles in the >> HeapTupleSatisfies* routines? > > Extra cycles yes, but even more importantly, code readability and > maintainability. Sure. -- Peter Geoghegan
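To make the alternative concrete, here is a sketch of the token-on-the-heap-tuple protocol Heikki describes above - the shape of the idea only, with all names other than standard PostgreSQL ones being hypothetical:

    /*
     * Sketch (hypothetical names throughout): speculative insertion with
     * the token stored on the heap tuple itself, so no procarray changes
     * are needed.  A token of 0 means "none", so the counter skips it.
     */
    static uint32 speculative_token = 0;        /* per-backend counter */

    static void
    speculative_insert(Relation rel, HeapTuple tup)
    {
        uint32  token = ++speculative_token;        /* never hand out 0 */

        LockSpeculativeToken(token);                /* heavyweight lock */
        HeapTupleSetSpeculativeToken(tup, token);   /* stamp the tuple */
        /* ... heap_insert(), index insertion, conflict checks; a waiter
         * reads the token off our tuple and waits on its lock ... */
        UnlockSpeculativeToken(token);  /* after confirming, or after
                                         * super-deleting on conflict */
    }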
On Mon, Feb 16, 2015 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote: >>> Jim Nasby said something about setting the HEAP_XMIN_INVALID hint bit. >>> Maybe he is right...if that can be made to be reliable (always >>> WAL-logged), it could be marginally better than setting xmin to >>> InvalidTransactionId. >> >> >> I'm not a big fan of that. The xmin-invalid bit is currently always just a >> hint. > > Well, Andres makes the point that that isn't quite so. Hmm. So the tqual.c routines actually check "if (HeapTupleHeaderXminInvalid(tuple))". Which is:

#define HeapTupleHeaderXminInvalid(tup) \
( \
    ((tup)->t_infomask & (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)) == \
        HEAP_XMIN_INVALID \
)

What appears to happen if I try to only use the HEAP_XMIN_INVALID bit (and not explicitly set xmin to InvalidTransactionId, and not explicitly check that xmin isn't InvalidTransactionId in each HeapTupleSatisfies* routine) is that after a little while, Jeff Janes' tool turns up spurious unique violations, as if some tuple has become visible when it shouldn't have. I guess this is because the HEAP_XMIN_COMMITTED hint bit can still be set, which in effect invalidates the HEAP_XMIN_INVALID hint bit. It takes about 2 minutes before this happens for the first time when fsync = off, following a fresh initdb, which is unacceptable. I'm not sure if it's worth trying to figure out how HEAP_XMIN_COMMITTED gets set. Not that I'm 100% sure that that's what this really is; it just seems very likely. Attached broken patch (broken_visible.patch) shows the changes made to reveal this problem. Only the changes to tqual.c and heap_delete() should matter here, since I did not test recovery. I also thought about unifying the check for "if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple), InvalidTransactionId))" with the conventional HeapTupleHeaderXminInvalid() macro, and leaving everything else as-is. This is no good either, and for similar reasons - control often won't reach the macro, which is behind a check of "if (!HeapTupleHeaderXminCommitted(tuple))". The best I think we can hope for is having a dedicated "if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple), InvalidTransactionId))" macro HeapTupleHeaderSuperDeleted() to do the work each time, which does not need to be checked so often. A second attached patch (compact_tqual_works.patch - which is non-broken, AFAICT) shows how this is possible, while also moving the check further down within each tqual.c routine (which seems more in keeping with the fact that super deletion is a relatively obscure concept). I haven't been able to break this variant using my existing test suite, so this seems promising as a way of reducing tqual.c clutter. However, as I said the other day, this is basically just polish. -- Peter Geoghegan
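A minimal sketch of the dedicated macro described above, plus how a tqual.c routine might use it - the macro body is inferred directly from the description, but treat the exact form as an assumption rather than the patch's actual code:

    /*
     * Sketch: a dedicated test for super-deleted tuples, checked once
     * per tqual.c routine instead of sprinkling raw-xmin comparisons
     * throughout the visibility logic.
     */
    #define HeapTupleHeaderSuperDeleted(tup) \
        TransactionIdEquals(HeapTupleHeaderGetRawXmin(tup), \
                            InvalidTransactionId)

    /* e.g., within HeapTupleSatisfiesMVCC() and friends: */
    if (HeapTupleHeaderSuperDeleted(tuple))
        return false;   /* a super-deleted promise tuple is never visible */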
On Mon, Feb 16, 2015 at 6:32 PM, Peter Geoghegan <pg@heroku.com> wrote: > The best I think we can hope for is having a dedicated "if > (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple), > InvalidTransactionId))" macro HeapTupleHeaderSuperDeleted() to do the > work each time, which does not need to be checked so often. A second > attached patch (compact_tqual_works.patch - which is non-broken, > AFAICT) shows how this is possible, while also moving the check > further down within each tqual.c routine (which seems more in keeping > with the fact that super deletion is a relatively obscure concept). I attach the patch series of V2.3. Highlights: * Overhaul of speculative insertion related changes within tqual.c. Refactored for readability as outlined in my earlier comments quoted above. Assertions added, serving to show exactly where super deleted tuples are and are not expected. * Formally forbid INSERT ... ON CONFLICT into system catalogs. If nothing else, this obviates the need for historic snapshots to care about super deleted tuples. * Minor setrefs.c tweaks. Minor ExecInitModifyTable() tweaks, too. * Fix for minor bitrot against master branch. * Further comments on the speculativeInsertionToken per-backend variable. * Livelock insurance for exclusion constraints. Importantly, Heikki wanted us to break out the patch to fix the current problem of theoretical deadlock risks [1] ahead of committing ON CONFLICT UPDATE/IGNORE. Heikki acknowledged that there were still theoretical livelock risks in his reworked minimal patch. After careful consideration, I have not broken out the changes to do this incrementally along the lines that Heikki suggested. Heikki seemed to think that the deadlock problems were not really worth fixing independently of ON CONFLICT UPDATE support, but rather represented a useful way of committing code incrementally. Do I have that right? Certainly, anyone would agree that unprincipled deadlocks (for regular inserters with exclusion constraints) are better than livelocks. Heikki did not address the livelock risks with his minimal reworked patch, which I've done here for ON CONFLICT. The way I chose to break the livelock (what I call "livelock insurance") does indeed involve comparing XIDs, which Heikki thought most promising. But it also involves having the oldest XID wait on another session's speculative token in the second phase, which ordinarily does not occur in the second phase/check_exclusion_or_unique_constraint() call. The idea is that one session is guaranteed to be the waiter that has a second iteration within its second-phase check_exclusion_or_unique_constraint() call, where, following the super deletion of conflict TIDs by the other conflicting session or sessions, it reliably finds that it can proceed (those other sessions are denied the opportunity to reach their second phase by our second-phase waiter's still-not-super-deleted tuple). However, it's difficult to see how to map this onto general exclusion constraint insertion + enforcement. In Heikki's recent sketch of this [1], there is no pre-check, since that is considered an "UPSERT thing" deferred until a later patch, and therefore my scheme here cannot work (recall that for plain inserts with exclusion constraints, we insert first and check last).
I have a hard time justifying adding the pre-check for plain exclusion constraint inserters given the total lack of complaints from the field regarding unprincipled deadlocks, and given the fact that it would probably make the code more complicated than it needs to be. It is critical that there be a pre-check to prevent livelock, though, because otherwise the restarting sessions can go straight to their "second" phase (check_exclusion_or_unique_constraint() call), without ever realizing that they should not do so. Therefore, as I said, I have not broken out the code in line with Heikki's suggestion. It's possible that I have it wrong here - I was wrong to dismiss Heikki's contention that the livelock hazards were fixable without too much pain - but I don't think so. It seems like the livelock insurance is pretty close to or actually free, and doesn't imply that a "second phase wait for token" needs to happen too often. With heavy contention on a small number of possible tuples (100), and 8 clients deleting and inserting, I can see it happening only once every couple of hundred milliseconds on my laptop. It's not hard to imagine why the code didn't obviously appear to be broken before now, as the window for an actual livelock must have been small. Also, deadlocks bring about more deadlocks (since the deadlock detector kicks in), whereas livelocks do not bring about more livelocks. I haven't been able to reproduce earlier apparent bugs with exclusion constraints [2] recently. I can only speculate that they were fixed. Does anyone with a big server care to run the procedure outlined for exclusion constraints in the jjanes_upsert tool [3]? It would be nice to have additional confidence that the exclusion constraint stuff is robust. [1] http://www.postgresql.org/message-id/54DFC6F8.5050108@vmware.com [2] http://www.postgresql.org/message-id/CAM3SWZTkHOwyA5A9ib=uVf0Vs326yoCBdpp_NYkDjM2_-ScxFA@mail.gmail.com [3] https://github.com/petergeoghegan/jjanes_upsert -- Peter Geoghegan
- 0006-User-visible-documentation-for-INSERT-.-ON-CONFLICT-.patch
- 0005-Internal-documentation-for-INSERT-.-ON-CONFLICT-UPDA.patch
- 0004-Tests-for-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patch
- 0003-RLS-support-for-ON-CONFLICT-UPDATE.patch
- 0002-Support-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patch
- 0001-Make-UPDATE-privileges-distinct-from-INSERT-privileg.patch
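To summarize the "livelock insurance" rule from the V2.3 mail above as code - a sketch of one reading of the scheme only, with hypothetical names, since the actual patch is more involved:

    /*
     * Sketch of "livelock insurance" during the second-phase
     * check_exclusion_or_unique_constraint() call.  On a conflict
     * between speculative inserters, only the session with the oldest
     * XID waits on the other's speculative token; every other session
     * super-deletes its own tuple and restarts from the pre-check.  The
     * oldest session's still-not-super-deleted tuple blocks the others
     * at their pre-check, so exactly one session reliably proceeds.
     */
    if (TransactionIdPrecedes(GetCurrentTransactionId(), conflictXid))
    {
        /* Oldest XID: wait here in the second phase, then re-scan */
        SpeculativeTokenWait(conflictToken);    /* hypothetical */
        goto recheck;                           /* illustrative */
    }
    else
    {
        /* Newer XID: back off - super-delete own tuple, restart */
        super_delete_tuple(rel, &myTid);        /* hypothetical */
        goto restart_from_precheck;             /* illustrative */
    }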
On Tue, Feb 10, 2015 at 12:09 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Then the problem suddenly becomes that previous choices of >> indexes/statements aren't possible anymore. It seems much better to >> introduce the syntax now and not have too much of a use case for >> it. > > The only way the lack of a way of specifying which opclass to use > could bite us is if there were two *actually* defined unique indexes > on the same column, each with different "equality" operators. That > seems like kind of a funny scenario, even if that were quite possible > (even if non-default opclasses existed that had non-identical > "equality" operators, which is basically not the case today). > > I grant that it is a bit odd that we're talking about unique index > definitions affecting semantics, but that is to a certain extent the > nature of the beast. As a compromise, I suggest having the inference > specification optionally accept a named opclass per attribute, in the > style of CREATE INDEX (I'm already reusing a bit of the raw parser > support for CREATE INDEX, you see) - that'll make inference insist on > that opclass, rather than make it a strict matter of costing available > alternatives (not that any alternative is expected with idiomatic > usage). That implies no additional parser overhead, really. If that's > considered ugly, then at least it's an ugly thing that literally no > one will ever use in the foreseeable future...and an ugly thing that > is no more necessary in CREATE INDEX than here (and yet CREATE INDEX > lives with the ugliness). Any thoughts on this, anyone? AFAICT, only this and the behavior of logical decoding are open items at this point. I'd like to close out both of those sooner rather than later. Thanks -- Peter Geoghegan
On 02/18/2015 11:43 PM, Peter Geoghegan wrote: > Heikki seemed to think that the deadlock problems were not really > worth fixing independently of ON CONFLICT UPDATE support, but rather > represented a useful way of committing code incrementally. Do I have > that right? Yes. > The way I chose to break the livelock (what I call "livelock > insurance") does indeed involve comparing XIDs, which Heikki thought > most promising. But it also involves having the oldest XID wait on > another session's speculative token in the second phase, which > ordinarily does not occur in the second > phase/check_exclusion_or_unique_constraint() call. The idea is that > one session is guaranteed to be the waiter that has a second iteration > within its second-phase check_exclusion_or_unique_constraint() call, > where it (following the super deletion of conflict TIDs by the other > conflicting session or sessions) reliably finds that it can proceed > (those other sessions are denied the opportunity to reach their second > phase by our second-phase waiter's still-not-super-deleted tuple). > > However, it's difficult to see how to map this onto general exclusion > constraint insertion + enforcement. In Heikki's recent sketch of this > [1], there is no pre-check, since that is considered an "UPSERT thing" > deferred until a later patch, and therefore my scheme here cannot work > (recall that for plain inserts with exclusion constraints, we insert > first and check last). I have a hard time justifying adding the > pre-check for plain exclusion constraint inserters given the total > lack of complaints from the field regarding unprincipled deadlocks, > and given the fact that it would probably make the code more > complicated than it needs to be. It is critical that there be a > pre-check to prevent livelock, though, because otherwise the > restarting sessions can go straight to their "second" phase > (check_exclusion_or_unique_constraint() call), without ever realizing > that they should not do so. Hmm. I haven't looked at your latest patch, but I don't think you need to pre-check for this to work. To recap, the situation is that two backends have already inserted the heap tuple, and then see that the other backend's tuple conflicts. To avoid a livelock, it's enough that one backend super-deletes its own tuple first, before waiting for the other to complete, while the other backend waits without super-deleting. No? > It seems like the livelock insurance is pretty close to or actually > free, and doesn't imply that a "second phase wait for token" needs to > happen too often. With heavy contention on a small number of possible > tuples (100), and 8 clients deleting and inserting, I can see it > happening only once every couple of hundred milliseconds on my laptop. > It's not hard to imagine why the code didn't obviously appear to be > broken before now, as the window for an actual livelock must have been > small. Also, deadlocks bring about more deadlocks (since the deadlock > detector kicks in), whereas livelocks do not bring about more > livelocks. It might be easier to provoke the livelocks with a GiST opclass that's unusually slow. I wrote the attached opclass for the purpose of testing this a while ago, but I haven't actually gotten around to doing much with it. It's called "useless_gist", because it's a GiST opclass for integers, like btree_gist, but the penalty and picksplit functions are totally dumb.
The result is that the tuples are put into the index in pretty much random order, and every scan has to scan the whole index. I'm posting it here, in the hope that it happens to be useful, but I don't really know if it is. - Heikki
Attachment
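The attachment itself is not reproduced here, but a "totally dumb" GiST penalty function of the kind described might look something like this - a sketch, not necessarily the attached code, with useless_gist_penalty as an assumed function name:

#include "postgres.h"
#include "fmgr.h"
#include "access/gist.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(useless_gist_penalty);

/*
 * Deliberately useless GiST penalty: every subtree looks equally
 * attractive, so tuples land in essentially random order and scans
 * degrade toward whole-index scans.
 */
Datum
useless_gist_penalty(PG_FUNCTION_ARGS)
{
    float  *penalty = (float *) PG_GETARG_POINTER(2);

    *penalty = 0.0f;    /* no subtree is ever preferred over another */
    PG_RETURN_POINTER(penalty);
}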
On 02/16/2015 11:31 AM, Andres Freund wrote: > On 2015-02-16 10:00:24 +0200, Heikki Linnakangas wrote: >> I'm starting to think that we should bite the bullet and consume an infomask >> bit for this. The infomask bits are a scarce resource, but we should use >> them when it makes sense. It would be good for forensic purposes too, to >> leave a trace that a super-deletion happened. > > Well, IIRC we don't have any left right now. We could reuse > MOVED_IN|MOVED_OUT, as that never was set in the past. But it'd > essentially use two infomask bits forever... t_infomask is all used, but t_infomask2 has two bits left:

> /*
>  * information stored in t_infomask2:
>  */
> #define HEAP_NATTS_MASK   0x07FF  /* 11 bits for number of attributes */
> /* bits 0x1800 are available */
> #define HEAP_KEYS_UPDATED 0x2000  /* tuple was updated and key cols
>                                    * modified, or tuple deleted */
> #define HEAP_HOT_UPDATED  0x4000  /* tuple was HOT-updated */
> #define HEAP_ONLY_TUPLE   0x8000  /* this is heap-only tuple */
>
> #define HEAP2_XACT_MASK   0xE000  /* visibility-related bits */

- Heikki
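To illustrate what consuming one of the free t_infomask2 bits might look like - a purely hypothetical flag name and value, not anything in the patch as posted:

/* Hypothetical: claim one of the two free t_infomask2 bits (0x1800) */
#define HEAP_SUPER_DELETED      0x0800

#define HeapTupleHeaderIsSuperDeleted(tup) \
    (((tup)->t_infomask2 & HEAP_SUPER_DELETED) != 0)

#define HeapTupleHeaderSetSuperDeleted(tup) \
    ((tup)->t_infomask2 |= HEAP_SUPER_DELETED)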
On Thu, Feb 19, 2015 at 5:21 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Hmm. I haven't looked at your latest patch, but I don't think you need to > pre-check for this to work. To recap, the situation is that two backends > have already inserted the heap tuple, and then see that the other backend's > tuple conflicts. To avoid a livelock, it's enough that one backend > super-deletes its own tuple first, before waiting for the other to complete, > while the other backend waits without super-deleting. No? I fully agree with your summary here. However, why should we suppose that while we wait, the other backends don't both delete and then re-insert their tuple? They need the pre-check to know not to re-insert their tuple (seeing our tuple, immediately after we wake as the preferred backend with the older XID) in order to break the race. But today, exclusion constraints are optimistic in that the insert happens first, and only then the check. The pre-check turns that the other way around, in a limited though necessary sense. Granted, it's unlikely that we'd livelock due to one session always deleting and then nullifying that by re-inserting in time, but the theoretical risk seems real. Therefore, I'm not inclined to bother committing something without a pre-check, particularly since we're not really interested in fixing unprincipled deadlocks for ordinary exclusion constraint inserters (useful to know that you also think that doesn't matter, BTW). Does that make sense? This is explained in the "livelock insurance" comments (new to V2.3) in check_exclusion_or_unique_constraint(). (Not that I think that explanation is easier to follow than this one.) > It might be easier to provoke the livelocks with a GiST opclass that's > unusually slow. I wrote the attached opclass for the purpose of testing this > a while ago, but I haven't actually gotten around to doing much with it. It's > called "useless_gist", because it's a GiST opclass for integers, like > btree_gist, but the penalty and picksplit functions are totally dumb. The > result is that the tuples are put into the index in pretty much random order, > and every scan has to scan the whole index. I'm posting it here, in the hope > that it happens to be useful, but I don't really know if it is. Thanks. I'll try and use this for testing. Haven't been able to break exclusion constraints with the jjanes_upsert test suite in a long time now. -- Peter Geoghegan
On 02/19/2015 08:16 PM, Peter Geoghegan wrote: > On Thu, Feb 19, 2015 at 5:21 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> Hmm. I haven't looked at your latest patch, but I don't think you need to >> pre-check for this to work. To recap, the situation is that two backends >> have already inserted the heap tuple, and then see that the other backend's >> tuple conflicts. To avoid a livelock, it's enough that one backend >> super-deletes its own tuple first, before waiting for the other to complete, >> while the other backend waits without super-deleting. No? > > I fully agree with your summary here. However, why should we suppose > that while we wait, the other backends don't both delete and then > re-insert their tuple? They need the pre-check to know not to > re-insert their tuple (seeing our tuple, immediately after we wake as > the preferred backend with the older XID) in order to break the race. > But today, exclusion constraints are optimistic in that the insert > happens first, and only then the check. The pre-check turns that the > other way around, in a limited though necessary sense. I'm not sure I understand exactly what you're saying, but AFAICS the pre-check doesn't completely solve that either. It's entirely possible that the other backend deletes its tuple, our backend then performs the pre-check, and the other backend re-inserts its tuple again. Sure, the pre-check reduces the chances, but we're talking about a rare condition to begin with, so I don't think it makes sense to add much code just to reduce the chances further. - Heikki
On Thu, Feb 19, 2015 at 11:10 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> I fully agree with your summary here. However, why should we suppose >> that while we wait, the other backends don't both delete and then >> re-insert their tuple? They need the pre-check to know not to >> re-insert their tuple (seeing our tuple, immediately after we wake as >> the preferred backend with the older XID) in order to break the race. >> But today, exclusion constraints are optimistic in that the insert >> happens first, and only then the check. The pre-check turns that the >> other way around, in a limited though necessary sense. > > I'm not sure I understand exactly what you're saying, but AFAICS the > pre-check doesn't completely solve that either. It's entirely possible that > the other backend deletes its tuple, our backend then performs the > pre-check, and the other backend re-inserts its tuple again. Sure, the > pre-check reduces the chances, but we're talking about a rare condition to > begin with, so I don't think it makes sense to add much code just to reduce > the chances further. But super deletion occurs *before* releasing the token lock, which is the last thing we do before looping around and starting again. So iff we're the oldest XID, the one that gets to "win" by unexpectedly waiting on another's token in our second phase (second call to check_exclusion_or_unique_constraint()), we will not, in fact, see anyone else's tuple, because they'll all be forced to go through the first phase and find our pre-existing, never-deleted tuple, so we can't see any new tuple from them. And, because they super delete before releasing their token, they'll definitely have super deleted when we're woken up, so we can't see any old/existing tuple either. We have our tuple inserted this whole time - ergo, we do, in fact, "win" reliably. The fly in the ointment is regular inserters, perhaps, but we've agreed that they're not too important here, and even when that happens we're in "deadlock land", not "livelock land", which is obviously a nicer place to live. Does that make sense? -- Peter Geoghegan
On 02/19/2015 10:09 PM, Peter Geoghegan wrote: > On Thu, Feb 19, 2015 at 11:10 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >>> I fully agree with your summary here. However, why should we suppose >>> that while we wait, the other backends don't both delete and then >>> re-insert their tuple? They need the pre-check to know not to >>> re-insert their tuple (seeing our tuple, immediately after we wake as >>> the preferred backend with the older XID) in order to break the race. >>> But today, exclusion constraints are optimistic in that the insert >>> happens first, and only then the check. The pre-check turns that the >>> other way around, in a limited though necessary sense. >> >> I'm not sure I understand exactly what you're saying, but AFAICS the >> pre-check doesn't completely solve that either. It's entirely possible that >> the other backend deletes its tuple, our backend then performs the >> pre-check, and the other backend re-inserts its tuple again. Sure, the >> pre-check reduces the chances, but we're talking about a rare condition to >> begin with, so I don't think it makes sense to add much code just to reduce >> the chances further. > > But super deletion occurs *before* releasing the token lock, which is > the last thing we do before looping around and starting again. So iff > we're the oldest XID, the one that gets to "win" by unexpectedly > waiting on another's token in our second phase (second call to > check_exclusion_or_unique_constraint()), we will not, in fact, see > anyone else's tuple, because they'll all be forced to go through the > first phase and find our pre-existing, never-deleted tuple, so we > can't see any new tuple from them. And, because they super delete > before releasing their token, they'll definitely have super deleted > when we're woken up, so we can't see any old/existing tuple either. We > have our tuple inserted this whole time - ergo, we do, in fact, "win" > reliably. So, um, are you agreeing that there is no problem? Or did I misunderstand? If you see a potential issue here, can you explain it as a simple list of steps, please. - Heikki
On Fri, Feb 20, 2015 at 11:34 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > So, um, are you agreeing that there is no problem? Or did I misunderstand? > If you see a potential issue here, can you explain it as a simple list of > steps, please. Yes. I'm saying that AFAICT, there is no livelock hazard provided other sessions must do the pre-check (as they must for ON CONFLICT IGNORE). So I continue to believe that they must pre-check, which you questioned. The only real downside here (with not doing the pre-check for regular inserters with exclusion constraints) is that we can't fix unprincipled deadlocks for general exclusion constraint inserters (since we don't want to make them pre-check), but we don't actually care about that directly. But it also means that it's hard to see how we can incrementally commit such a fix to break down the patch into more manageable chunks, which is a little unfortunate. Hard to break down the problem into steps, since it isn't a problem that I was able to recreate (as a noticeable livelock). But the point of what I was saying is that the first-phase pre-check lets the other sessions notice the tuple of the one session that got stuck waiting in the second phase, and so they don't insert a new tuple on that iteration. Conventional insertion with exclusion constraints inserts first and only then looks for conflicts (since there is no pre-check). More concretely, if you're the transaction that does not "break" here, within check_exclusion_or_unique_constraint(), and so unexpectedly waits in the second phase:

+    /*
+     * At this point we have either a conflict or a potential conflict. If
+     * we're not supposed to raise error, just return the fact of the
+     * potential conflict without waiting to see if it's real.
+     */
+    if (violationOK && !wait)
+    {
+        /*
+         * For unique indexes, detecting conflict is coupled with physical
+         * index tuple insertion, so we won't be called for recheck
+         */
+        Assert(!indexInfo->ii_Unique);
+
+        conflict = true;
+        if (conflictTid)
+            *conflictTid = tup->t_self;
+
+        /*
+         * Livelock insurance.
+         *
+         * When doing a speculative insertion pre-check, we cannot have an
+         * "unprincipled deadlock" with another session, fundamentally
+         * because there is no possible mutual dependency, since we only
+         * hold a lock on our token, without attempting to lock anything
+         * else (maybe this is not the first iteration, but no matter;
+         * we'll have super deleted and released insertion token lock if
+         * so, and all locks needed are already held.  Also, our XID lock
+         * is irrelevant.)
+         *
+         * In the second phase, where there is a re-check for conflicts, we
+         * can't deadlock either (we never lock another thing, since we
+         * don't wait in that phase).  However, a theoretical livelock
+         * hazard exists: Two sessions could each see each other's
+         * conflicting tuple, and each could go and delete, retrying
+         * forever.
+         *
+         * To break the mutual dependency, we may wait on the other xact
+         * here over our caller's request to not do so (in the second
+         * phase).  This does not imply the risk of unprincipled deadlocks
+         * either, because if we end up unexpectedly waiting, the other
+         * session will super delete its own tuple *before* releasing its
+         * token lock and freeing us, and without attempting to wait on us
+         * to release our token lock.  We'll take another iteration here,
+         * after waiting on the other session's token, not find a conflict
+         * this time, and then proceed (assuming we're the oldest XID).
+         *
+         * N.B.: Unprincipled deadlocks are still theoretically possible
+         * with non-speculative insertion with exclusion constraints, but
+         * this seems inconsequential, since an error was inevitable for
+         * one of the sessions anyway.  We only worry about speculative
+         * insertion's problems, since they're likely with idiomatic usage.
+         */
+        if (TransactionIdPrecedes(xwait, GetCurrentTransactionId()))
+            break;  /* go and super delete/restart speculative insertion */
+    }
+

Then you must successfully insert when you finally "goto retry" and do another iteration within check_exclusion_or_unique_constraint(). The other conflicters can't have failed to notice your pre-existing tuple, and can't have failed to super delete their own tuples before you are woken (provided you really are the single oldest XID). Is that any clearer? -- Peter Geoghegan
On 02/20/2015 10:39 PM, Peter Geoghegan wrote: > On Fri, Feb 20, 2015 at 11:34 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> So, um, are you agreeing that there is no problem? Or did I misunderstand? >> If you see a potential issue here, can you explain it as a simple list of >> steps, please. > > Yes. I'm saying that AFAICT, there is no livelock hazard provided > other sessions must do the pre-check (as they must for ON CONFLICT > IGNORE). So I continue to believe that they must pre-check, which you > questioned. > ... > Hard to break down the problem into steps, since it isn't a problem > that I was able to recreate (as a noticeable livelock). Then I refuse to believe that the livelock hazard exists without the pre-check. If you have a livelock scenario in mind, it really shouldn't be that difficult to write down the list of steps. - Heikki
On Fri, Feb 20, 2015 at 1:07 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Then I refuse to believe that the livelock hazard exists without the > pre-check. If you have a livelock scenario in mind, it really shouldn't be > that difficult to write down the list of steps. I just meant practical, recreatable steps - a test case. I should emphasize that what I'm saying is not that important. Even if I am wrong, I'm not suggesting that we do anything that we don't both agree is needed anyway. If I'm right, that is merely an impediment to incrementally committing the work by "fixing" exclusion constraints, AFAICT. Ultimately, that isn't all that important. Anyway, here is how I think livelocks could happen, in theory, with regular insertion into a table with exclusion constraints, with your patch [1] applied (which has no pre-check):

* Session 1 physically inserts, and then checks for a conflict.

* Session 2 physically inserts, and then checks for a conflict.

* Session 1 sees session 2's conflicting TID, then super deletes and releases token.

* Session 2 sees session 1's conflicting TID, then super deletes and releases token.

* Session 1 waits or tries to wait on session 2's token. It isn't held anymore, or is only held for an instant.

* Session 2 waits or tries to wait on session 1's token. It isn't held anymore, or is only held for an instant.

* Session 1 restarts from scratch, having made no useful progress in respect of the slot being inserted.

* Session 2 restarts from scratch, having made no useful progress in respect of the slot being inserted.

(Livelock)

If there is a tie-break on XID (you omitted this from your patch [1] but acknowledge it as an omission), then that doesn't really change anything (without adding a pre-check, too). That's because: What do we actually do or not do when we're the oldest XID, that gets to "win"? Do we not wait on anything, and just declare that we're done? Then I think that breaks exclusion constraint enforcement, because we need to rescan the index to do that (i.e., "goto retry"). Do we wait on their token, as my most recent revision does, but *without* a pre-check, for regular inserters? Then I think that our old tuple could keep multiple other sessions spinning indefinitely. Only by checking for conflicts *first*, without a non-super-deleted physical index tuple, can these other sessions notice that there is a conflict *without creating more conflicts*, which is what I believe is really needed. At the very least it's something I'm much more comfortable with, and that seems like reason enough to do it that way, given that we don't actually care about unprincipled deadlocks with regular inserters with exclusion constraints. Why take the chance of livelocking such inserters, though? I hope that we don't get bogged down on this, because, as I said, it doesn't matter too much. I'm tempted to concede the point for that reason, since the livelock is probably virtually impossible. I'm just giving you my opinion on how to make the handling of exclusion constraints as reliable as possible. Thanks [1] http://www.postgresql.org/message-id/54DFC6F8.5050108@vmware.com -- Peter Geoghegan
On 02/21/2015 12:15 AM, Peter Geoghegan wrote: > On Fri, Feb 20, 2015 at 1:07 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> Then I refuse to believe that the livelock hazard exists without the >> pre-check. If you have a livelock scenario in mind, it really shouldn't be >> that difficult to write down the list of steps. > > I just meant practical, recreatable steps - a test case. I should > emphasize that what I'm saying is not that important. Even if I am > wrong, I'm not suggesting that we do anything that we don't both agree > is needed anyway. If I'm right, that is merely an impediment to > incrementally committing the work by "fixing" exclusion constraints, > AFAICT. Ultimately, that isn't all that important. Anyway, here is how > I think livelocks could happen, in theory, with regular insertion into > a table with exclusion constraints, with your patch [1] applied (which > has no pre-check):
>
> * Session 1 physically inserts, and then checks for a conflict.
>
> * Session 2 physically inserts, and then checks for a conflict.
>
> * Session 1 sees session 2's conflicting TID, then super deletes and releases token.
>
> * Session 2 sees session 1's conflicting TID, then super deletes and releases token.
>
> * Session 1 waits or tries to wait on session 2's token. It isn't held anymore, or is only held for an instant.
>
> * Session 2 waits or tries to wait on session 1's token. It isn't held anymore, or is only held for an instant.
>
> * Session 1 restarts from scratch, having made no useful progress in respect of the slot being inserted.
>
> * Session 2 restarts from scratch, having made no useful progress in respect of the slot being inserted.
>
> (Livelock)
>
> If there is a tie-break on XID (you omitted this from your patch [1] > but acknowledge it as an omission), then that doesn't really change > anything (without adding a pre-check, too). That's because: What do we > actually do or not do when we're the oldest XID, that gets to "win"? Ah, ok, I can see the confusion now. > Do we not wait on anything, and just declare that we're done? Then I > think that breaks exclusion constraint enforcement, because we need to > rescan the index to do that (i.e., "goto retry"). Do we wait on their > token, as my most recent revision does, but *without* a pre-check, for > regular inserters? Then I think that our old tuple could keep multiple > other sessions spinning indefinitely. What I had in mind is that the "winning" inserter waits on the other inserter's token, without super-deleting. Like all inserts do today. So the above scenario becomes:

* Session 1 physically inserts, and then checks for a conflict.

* Session 2 physically inserts, and then checks for a conflict.

* Session 1 sees session 2's conflicting TID. Session 1's XID is older, so it "wins". It waits for session 2's token, without super-deleting.

* Session 2 sees session 1's conflicting TID. It super deletes, releases token, and sleeps on session 1's token.

* Session 1 wakes up. It looks at session 2's tuple again and sees that it was super-deleted. There are no further conflicts, so the insertion is complete, and it releases the token.

* Session 2 wakes up. It looks at session 1's tuple again and sees that it's still there. It goes back to sleep, this time on session 1's XID.

* Session 1 commits. Session 2 wakes up, sees that the tuple is still there, and throws a "constraint violation" error.

- Heikki
On Sat, Feb 21, 2015 at 11:15 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Ah, ok, I can see the confusion now. Cool. >> Do we not wait on anything, and just declare that we're done? Then I >> think that breaks exclusion constraint enforcement, because we need to >> rescan the index to do that (i.e., "goto retry"). Do we wait on their >> token, as my most recent revision does, but *without* a pre-check, for >> regular inserters? Then I think that our old tuple could keep multiple >> other sessions spinning indefinitely. > > What I had in mind is that the "winning" inserter waits on the other > inserter's token, without super-deleting. Like all inserts do today. So the > above scenario becomes:
>
> * Session 1 physically inserts, and then checks for a conflict.
>
> * Session 2 physically inserts, and then checks for a conflict.
>
> * Session 1 sees session 2's conflicting TID. Session 1's XID is older, so it "wins". It waits for session 2's token, without super-deleting.
>
> * Session 2 sees session 1's conflicting TID. It super deletes, releases token, and sleeps on session 1's token.
>
> * Session 1 wakes up. It looks at session 2's tuple again and sees that it was super-deleted. There are no further conflicts, so the insertion is complete, and it releases the token.
>
> * Session 2 wakes up. It looks at session 1's tuple again and sees that it's still there. It goes back to sleep, this time on session 1's XID.
>
> * Session 1 commits. Session 2 wakes up, sees that the tuple is still there, and throws a "constraint violation" error.

I think we're actually 100% in agreement, then. I just prefer to have the second-to-last step (the check without a promise tuple visible to anyone made by the "loser") occur as part of the pre-check that happens anyway with ON CONFLICT IGNORE. Otherwise, you'll end up with some much more complicated control flow that has to care about not doing that twice for ON CONFLICT IGNORE...and for the benefit of what? For regular inserters, that we don't actually care about fixing unprincipled deadlocks for? Are we on the same page now? -- Peter Geoghegan
On 02/21/2015 10:41 PM, Peter Geoghegan wrote: > On Sat, Feb 21, 2015 at 11:15 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> What I had in mind is that the "winning" inserter waits on the other >> inserter's token, without super-deleting. Like all inserts do today. So the >> above scenario becomes:
>>
>> * Session 1 physically inserts, and then checks for a conflict.
>>
>> * Session 2 physically inserts, and then checks for a conflict.
>>
>> * Session 1 sees session 2's conflicting TID. Session 1's XID is older, so it "wins". It waits for session 2's token, without super-deleting.
>>
>> * Session 2 sees session 1's conflicting TID. It super deletes, releases token, and sleeps on session 1's token.
>>
>> * Session 1 wakes up. It looks at session 2's tuple again and sees that it was super-deleted. There are no further conflicts, so the insertion is complete, and it releases the token.
>>
>> * Session 2 wakes up. It looks at session 1's tuple again and sees that it's still there. It goes back to sleep, this time on session 1's XID.
>>
>> * Session 1 commits. Session 2 wakes up, sees that the tuple is still there, and throws a "constraint violation" error.
>
> I think we're actually 100% in agreement, then. I just prefer to have > the second-to-last step (the check without a promise tuple visible to > anyone made by the "loser") occur as part of the pre-check that > happens anyway with ON CONFLICT IGNORE. Otherwise, you'll end up with > some much more complicated control flow that has to care about not > doing that twice for ON CONFLICT IGNORE...and for the benefit of what? > For regular inserters, that we don't actually care about fixing > unprincipled deadlocks for? Right. That will allow me to review and test the locking part of the patch separately. - Heikki
On 02/17/2015 02:11 AM, Peter Geoghegan wrote: >>> Whatever works, really. I can't say that the performance implications >>> of acquiring that hwlock are at the forefront of my mind. I never >>> found that to be a big problem on an 8 core box, relative to vanilla >>> INSERTs, FWIW - lock contention is not normal, and may be where any >>> heavyweight lock costs would really be encountered. >> Oh, cool. I guess the fast-path in lmgr.c kicks ass, then :-). > Seems that way. But even if that wasn't true, it wouldn't matter, > since I don't see that we have a choice. I did some quick performance testing on this. For easy testing, I used a checkout of git master, and simply added LockAcquire + LockRelease calls to ExecInsert, around the heap_insert() call. The test case I used was:

psql -c "create table footest (id serial primary key);"
echo "insert into footest select from generate_series(1, 10000);" > inserts.sql
pgbench -n -f inserts.sql postgres -T100 -c4

With the extra lock calls, I got 56 tps on my laptop. With unpatched git master, I got 60 tps. I also looked at the profile with "perf", and indeed about 10% of the CPU time was spent in LockAcquire and LockRelease together. So the extra locking incurs about 10% overhead. I think this was pretty much a worst-case scenario, but not a hugely unrealistic one - many real-world tables have only a few columns, and few indexes. With more CPUs you would probably start to see contention, in addition to just the extra overhead. Are we OK with a 10% overhead, caused by the locking? That's probably acceptable if that's what it takes to get UPSERT. But it's not OK just to solve the deadlock issue with regular insertions into a table with exclusion constraints. Can we find a scheme to eliminate that overhead? - Heikki
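For reference, the kind of modification described - bracketing heap_insert() in ExecInsert() with heavyweight lock calls - would look roughly like the sketch below. The transaction lock tag is used here only as a convenient stand-in; the real patch uses a dedicated speculative insertion token, and re-taking one's own XID lock exercises a slightly cheaper path than a fresh lock would:

/* Sketch only: measure lock-manager overhead around heap_insert() */
LOCKTAG     tag;

SET_LOCKTAG_TRANSACTION(tag, GetCurrentTransactionId());
(void) LockAcquire(&tag, ExclusiveLock, false, false);

newId = heap_insert(resultRelationDesc, tuple,
                    estate->es_output_cid, 0, NULL);

LockRelease(&tag, ExclusiveLock, false);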
On Mon, Mar 2, 2015 at 11:20 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Are we OK with a 10% overhead, caused by the locking? That's probably > acceptable if that's what it takes to get UPSERT. But it's not OK just to > solve the deadlock issue with regular insertions into a table with exclusion > constraints. Can we find a scheme to eliminate that overhead? Looks like you tested a B-Tree index here. That doesn't seem particularly representative of what you'd see with exclusion constraints. -- Peter Geoghegan
On 03/02/2015 09:29 PM, Peter Geoghegan wrote: > On Mon, Mar 2, 2015 at 11:20 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Are we OK with a 10% overhead, caused by the locking? That's probably >> acceptable if that's what it takes to get UPSERT. But it's not OK just to >> solve the deadlock issue with regular insertions into a table with exclusion >> constraints. Can we find a scheme to eliminate that overhead? > > Looks like you tested a B-Tree index here. That doesn't seem > particularly representative of what you'd see with exclusion > constraints. Hmm. I used a b-tree to estimate the effect that the locking would have in the UPSERT case, for UPSERT into a table with a b-tree index. But you're right that for the question of whether this is acceptable for the case of regular insert into a table with exclusion constraints, other indexams are more interesting. In a very quick test, the overhead with a single GiST index seems to be about 5%. IMHO that's still a noticeable slowdown, so let's try to find a way to eliminate it. - Heikki
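For comparison, an exclusion constraint analogue of the earlier pgbench test might look like this (a sketch; the table and script names are made up, and btree_gist is needed for "=" support under GiST):

-- setup
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE extest (id serial, EXCLUDE USING gist (id WITH =));

-- inserts.sql, run via: pgbench -n -f inserts.sql -T100 -c4 postgres
INSERT INTO extest SELECT FROM generate_series(1, 10000);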
On Mon, Mar 2, 2015 at 12:15 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Hmm. I used a b-tree to estimate the effect that the locking would have in > the UPSERT case, for UPSERT into a table with a b-tree index. But you're > right that for the question of whether this is acceptable for the case of > regular insert into a table with exclusion constraints, other indexams are > more interesting. In a very quick test, the overhead with a single GiST > index seems to be about 5%. IMHO that's still a noticeable slowdown, so > let's try to find a way to eliminate it. Honestly, I don't know why you're so optimistic that this can be fixed, even with that heavyweight lock (for regular inserters + exclusion constraints). My experimental branch, which I showed you privately, shows big problems when I simulate upserting with exclusion constraints (so we insert first, handle exclusion violations using the traditional upsert subxact pattern that does work with B-Trees). It's much harder to back out of a heap_update() than it is to back out of a heap_insert(), since we might well be updating a tuple visible to some other MVCC snapshot. Whereas we can super delete a tuple knowing that only a dirty snapshot might have seen it, which bounds the complexity (only wait sites for B-Trees + exclusion constraints need to worry about speculative insertion tokens and so on). My experimental branch works just fine (with a variant jjanes_upsert with subxact looping), until I need to restart an update after a "failed" heap_update() that still returned HeapTupleMayBeUpdated (having super deleted within an ExecUpdate() call). There is no good way to do that for ExecUpdate() that I can see, because an existing, visible row is affected (unlike with ExecInsert()). Even if it was possible, it would be hugely invasive to already very complicated code paths. I continue to believe that the best way forward is to incrementally commit the work by committing ON CONFLICT IGNORE first. That way, speculative tokens will remain strictly the concern of UPSERTers or sessions doing INSERT ... ON CONFLICT IGNORE. Unless you're happy to have UPDATEs still deadlock, and only fix unprincipled deadlocks for the INSERT case. I don't know why you want to make the patch incremental by "fixing" unprincipled deadlocks in the first place, since you've said that you don't really care about it as a real-world problem, and because it now appears to have significant additional difficulties that were not anticipated. Please, let's focus on getting ON CONFLICT IGNORE into a committable state - that's the best way of incrementally committing this work. Fixing unprincipled deadlocks is not a useful way of incrementally committing this work. I've spent several days producing a prototype that shows the exact nature of the problem. If you think I'm mistaken, and that fixing unprincipled deadlocks first is the right thing to do, please explain why with reference to that prototype. AFAICT, doing things your way is going to add significant additional complexity for *no appreciable benefit*. You've already gotten exactly what you were looking for in every other regard. In particular, ON CONFLICT UPDATE could work with exclusion constraints without any of these problems, because of the way we do the auxiliary update there (we lock the row ahead of updating/qual evaluation). I've bent over backwards to make sure that is the case. Please, throw me a bone here. Thank you -- Peter Geoghegan
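The "traditional upsert subxact pattern" being simulated above is the documented retry loop, adapted here for an exclusion constraint - a sketch that assumes a hypothetical table extest(k int, v text) with an exclusion constraint on k (exclusion_violation is the standard condition name for SQLSTATE 23P01):

CREATE FUNCTION upsert_extest(key int, val text) RETURNS void AS $$
BEGIN
    LOOP
        -- try the UPDATE path first
        UPDATE extest SET v = val WHERE k = key;
        IF found THEN
            RETURN;
        END IF;
        -- no row yet: try to insert, in a subtransaction
        BEGIN
            INSERT INTO extest (k, v) VALUES (key, val);
            RETURN;
        EXCEPTION WHEN exclusion_violation THEN
            -- a concurrent session inserted first; retry the UPDATE
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;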
On 03/02/2015 11:21 PM, Peter Geoghegan wrote: > On Mon, Mar 2, 2015 at 12:15 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Hmm. I used a b-tree to estimate the effect that the locking would have in >> the UPSERT case, for UPSERT into a table with a b-tree index. But you're >> right that for the question of whether this is acceptable for the case of >> regular insert into a table with exclusion constraints, other indexams are >> more interesting. In a very quick test, the overhead with a single GiST >> index seems to be about 5%. IMHO that's still a noticeable slowdown, so >> let's try to find a way to eliminate it. > > Honestly, I don't know why you're so optimistic that this can be > fixed, even with that heavyweight lock (for regular inserters + > exclusion constraints). We already concluded that it can be fixed with the heavyweight lock. See http://www.postgresql.org/message-id/54F4A0E0.4070602@iki.fi. Do you see some new problem with it that hasn't been discussed yet? To eliminate the heavyweight lock, we'll need some new ideas, but it doesn't seem that hard. > My experimental branch, which I showed you privately, shows big > problems when I simulate upserting with exclusion constraints (so we > insert first, handle exclusion violations using the traditional upsert > subxact pattern that does work with B-Trees). It's much harder to back > out of a heap_update() than it is to back out of a heap_insert(), > since we might well be updating a tuple visible to some other MVCC > snapshot. Whereas we can super delete a tuple knowing that only a > dirty snapshot might have seen it, which bounds the complexity (only > wait sites for B-Trees + exclusion constraints need to worry about > speculative insertion tokens and so on). > > My experimental branch works just fine (with a variant jjanes_upsert > with subxact looping), until I need to restart an update after a > "failed" heap_update() that still returned HeapTupleMayBeUpdated > (having super deleted within an ExecUpdate() call). There is no good > way to do that for ExecUpdate() that I can see, because an existing, > visible row is affected (unlike with ExecInsert()). Even if it was > possible, it would be hugely invasive to already very complicated code > paths. Ah, so we can't easily use super-deletion to back out an UPDATE. I had not considered that. > I continue to believe that the best way forward is to incrementally > commit the work by committing ON CONFLICT IGNORE first. That way, > speculative tokens will remain strictly the concern of UPSERTers or > sessions doing INSERT ... ON CONFLICT IGNORE. Ok, let's try that. Can you cut a patch that does just ON CONFLICT IGNORE, please? - Heikki
On Tue, Mar 3, 2015 at 12:05 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> My experimental branch works just fine (with a variant jjanes_upsert >> with subxact looping), until I need to restart an update after a >> "failed" heap_update() that still returned HeapTupleMayBeUpdated >> (having super deleted within an ExecUpdate() call). There is no good >> way to do that for ExecUpdate() that I can see, because an existing, >> visible row is affected (unlike with ExecInsert()). Even if it was >> possible, it would be hugely invasive to already very complicated code >> paths. > > Ah, so we can't easily use super-deletion to back out an UPDATE. I had not > considered that. Yeah. When I got into considering making EvalPlanQualFetch() look at speculative tokens, it became abundantly clear that that code would never be committed, even if I could make it work. >> I continue to believe that the best way forward is to incrementally >> commit the work by committing ON CONFLICT IGNORE first. That way, >> speculative tokens will remain strictly the concern of UPSERTers or >> sessions doing INSERT ... ON CONFLICT IGNORE. > > Ok, let's try that. Can you cut a patch that does just ON CONFLICT IGNORE, > please? Of course. I'll have that for you shortly. Thanks -- Peter Geoghegan