Thread: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
Attached is a cumulative patch set - version 2.0 of INSERT ... ON
CONFLICT {UPDATE | IGNORE}.

This revision does not offer a variant implementing approach #1 to
value locking [1] (only approach #2), since maintaining both
approaches in parallel has about outlived its usefulness.

I'm calling this version 2.0 because it has RLS support. This is
significant because AFAICT it's the last feature that needs to have
interactions with UPSERT considered. I've worked through a rather long
list of existing interrelated features, implementing support in each
case. I've had feedback from others on what behavior is appropriate
when that wasn't obvious, and have made sure those areas had
appropriate support. This now includes RLS, but past revisions added
support for inheritance, updatable views, statement-level triggers,
postgres_fdw, column-level privileges, partial indexes, exclusion
constraints, and more. Basically, I think we're done with discussing
those aspects, and the semantics/syntax in general, or are pretty
close to done. Certainly, support for these other interrelated
features is quite comprehensive at this point. Now the core mechanism
of the patch should be discussed in detail. The general structure and
design is also interesting. After months and months of discussion, it
now seems very likely that the semantics offered are the right ones.
Since even before V1.0 was posted back in August, that's all that
we've discussed, really (apart from the recent back and forth with
Heikki on value locking bugs, of course).

I've approached RLS along the lines Stephen seemed to think would work
best following extensive discussion [2], or at least I believe that
I've produced RLS support that is what we informally agreed on. All
security barrier quals are treated as WITH CHECK OPTIONs in the
context of ON CONFLICT UPDATE. INSERTs don't have to deal with
UPDATE-related policies/WITH CHECK OPTIONs, but when the update path
is taken, the INSERT- and UPDATE-related policies must both pass.
They must pass for the tuple that necessitated taking the UPDATE path
(the locked tuple to be updated), and also the finished tuple added
back to the relation by ExecUpdate(). There are three possible calls
to ExecWithCheckOptions() in the context of INSERT ... ON CONFLICT
UPDATE: the two that I just mentioned, which involve the UPDATE *and*
INSERT WITH CHECK OPTIONs, plus the preexisting ExecWithCheckOptions()
call in ExecInsert().
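
To make that concrete, here is a sketch (the table and policy names
are illustrative only, and this uses the patch's current ON CONFLICT
syntax):

"""""
create table accounts (id int primary key, owner text, balance int);
alter table accounts enable row level security;

create policy accounts_insert on accounts for insert
    with check (owner = current_user);
create policy accounts_update on accounts for update
    using (owner = current_user);

-- The INSERT policy is checked against the proposed tuple; when the
-- UPDATE path is taken, the UPDATE policy is checked against both the
-- locked conflicting tuple and the final updated tuple:
insert into accounts (id, owner, balance)
values (1, current_user, 100)
on conflict (id)
update set balance = TARGET.balance + EXCLUDED.balance;
"""""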

RLS support is provided in a separate cumulative commit, in the hope
that this makes it easier for a subject matter expert to review.
Documentation [3] and tests covering RLS are provided, of course.

I also include various bugfixes to approach #2 to value locking (these
were all previously separately posted, but are now integrated into the
main ON CONFLICT commit). Specifically, these are fixes for the bugs
that emerged thanks to Jeff Janes' great work on stress testing [4].
With these fixes, I have been unable to reproduce any problem with
this patch with the test suite, even after many days of running the
script on a quad-core server, with constant concurrent VACUUM runs,
etc. I think that we still need to think about the issues that
transpired with exclusion constraints, but since I couldn't find
another problem with an adapted version of Jeff's tool that tested
exclusion constraints, I'm inclined to think that it should be
possible to support exclusion constraints for the IGNORE variant.

It would be great to have more input on stress testing from Jeff.

Thoughts?

[1] https://wiki.postgresql.org/wiki/Value_locking#.231._Heavyweight_page_locking_.28Peter_Geoghegan.29
[2] http://www.postgresql.org/message-id/20150109214041.GK3062@tamriel.snowman.net
[3] http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/sql-createpolicy.html
[4] https://github.com/petergeoghegan/jjanes_upsert
--
Peter Geoghegan

Attachment

Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Sat, Jan 10, 2015 at 8:32 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I also include various bugfixes to approach #2 to value locking (these
> were all previously separately posted, but are now integrated into the
> main ON CONFLICT commit). Specifically, these are fixes for the bugs
> that emerged thanks to Jeff Janes' great work on stress testing [4].
> With these fixes, I have been unable to reproduce any problem with
> this patch with the test suite, even after many days of running the
> script on a quad-core server, with constant concurrent VACUUM runs,
> etc.

I continued with this since posting V2.0. I've run this bash script,
which invokes Jeff's script at various client counts, with runs of
various duration (since each client does a fixed amount of work):

https://github.com/petergeoghegan/jjanes_upsert/blob/master/run_test.sh

As previously discussed, Jeff's script comprehensively verifies the
correctness of the final values of a few thousand rows within a table
after many concurrent upserts, within and across upserting sessions,
and with concurrent deletions, too.
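
Per the log output quoted below, the statement that each client
repeatedly issues looks roughly like this (a sketch; the authoritative
version is in Jeff's tool, and the key varies per iteration):

"""""
-- Each client upserts random keys into a small table, accumulating
-- counts that the script later checks against its own bookkeeping.
-- (The exclusion constraint hack discussed below omits the "(index)"
-- inference specification.)
insert into upsert_race_test (index, count)
values (7142, 1)
on conflict (index)
update set count = TARGET.count + EXCLUDED.count
where TARGET.index = EXCLUDED.index
returning upsert_race_test.count;
"""""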

When building Postgres for this stress test, I included Jeff's
modifications that increase the XID burn rate artificially (I chose a
burn rate of X50). This makes anti-wraparound VACUUMs much more
frequent. I'm also looking out for outlier query execution durations,
because in theory they could indicate an unknown lock starvation
problem. I haven't seen any notable outliers after over a week of
testing.

> I think that we still need to think about the issues that
> transpired with exclusion constraints, but since I couldn't find
> another problem with an adapted version of Jeff's tool that tested
> exclusion constraints, I'm inclined to think that it should be
> possible to support exclusion constraints for the IGNORE variant.

Exclusion constraints were my focus with stress testing today. I
performed equivalent verification of upserts using exclusion
constraints (this is a hack; exclusion constraints are only intended
to be used with the IGNORE variant, but I get better test coverage
than I might otherwise this way). Unfortunately, even with the recent
bugfixes, there are still problems. On this server (rather than my
laptop), with 8 clients I can see errors like this before too long
(note that this output includes custom instrumentation from Jeff):

"""""""
6670 2015-01-17 18:02:54 PST LOG:  JJ scan_all 1, relfrozenid -813636509
6670 2015-01-17 18:02:54 PST LOG:  JJ freezeLimit -661025537
6670 2015-01-17 18:02:54 PST LOG:  JJ freeze_min_age 50000000
vacuum_freeze_table_age 150000000 freeze_table_age 150000000 ReadNew
-611025384
6670 2015-01-17 18:02:54 PST LOG:  JJ scan_all 1, relfrozenid -813636101
6670 2015-01-17 18:02:54 PST LOG:  JJ transaction ID wrap limit is
1352632427, limited by database with OID 12746
6670 2015-01-17 18:02:54 PST LOG:  autovacuum: done processing
database "postgres" at recent Xid of 3683945176 recent mxid of 1
6668 2015-01-17 18:02:54 PST ERROR:  conflicting key value violates
exclusion constraint "upsert_race_test_index_excl"
6668 2015-01-17 18:02:54 PST DETAIL:  Key (index)=(7142) conflicts
with existing key (index)=(600).
6668 2015-01-17 18:02:54 PST STATEMENT:  insert into upsert_race_test
(index, count) values ('7142','1') on conflict
update set count=TARGET.count + EXCLUDED.count
where TARGET.index = EXCLUDED.index
returning upsert_race_test.count
"""""""

It's always an exclusion violation problem that I see here.

As you can see, the query involved has no "unique index inference"
specification, per the hack to make this work with exclusion
constraints (the artificially much greater XID burn rate might have
also increased the likelihood of this error dramatically). You'll also
note that the DETAIL message seems to indicate that this
btree_gist-based exclusion constraint doesn't behave like a unique
constraint at all, because the conflicting new value (7142) is not at
all the same as the existing value (600). But that's wrong -- it's
supposed to be B-Tree-like. In short, there are further race
conditions with exclusion constraints.
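
For reference, the constraint under test looks something like this (a
sketch; the authoritative definition lives in Jeff's tool):

"""""
-- A btree_gist-based exclusion constraint that ought to behave
-- exactly like UNIQUE (index); the DETAIL above nonetheless reports
-- two unequal values (7142 and 600) as conflicting.
create extension if not exists btree_gist;

create table upsert_race_test (
    index int,
    count int,
    exclude using gist (index with =)
);
"""""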

I think that the fundamental, unfixable race condition here is the
disconnect between index tuple insertion and checking for would-be
exclusion violations that exclusion constraints naturally have here,
that unique indexes naturally don't have [1] (note that I'm talking
only about approach #2 to value locking here; approach #1 isn't in
V2.0). I suspect that the feature is not technically feasible to make
work correctly with exclusion constraints, end of story. VACUUM
interlocking is probably also involved here, but the unfixable race
condition seems like our fundamental problem.

We could possibly spend a lot of time discussing whether or not I'm
right about it being inherently impossible to make INSERT ... ON
CONFLICT IGNORE work with exclusion constraints. However, I strongly
suggest that we cut scope and at least leave them out of any version
that can be committed for 9.5, and instead work on other areas,
because it is at least now clear that they are much harder to get
right than unique constraints. Besides, making exclusion constraints
work with INSERT ... ON CONFLICT IGNORE is nice, but ultimately not
all that important. For that matter I think that INSERT ... ON
CONFLICT IGNORE is more generally not all that important compared to
ON CONFLICT UPDATE. I'd cut scope by cutting ON CONFLICT IGNORE if
that was the consensus... we could add back ON CONFLICT IGNORE in 9.6
when we had a better sense of exclusion constraints here. Exclusion
constraints can never be useful with ON CONFLICT UPDATE anyway.

Please work with me towards a committable patch. I think we have every
chance of committing this for 9.5, with value locking approach #2,
provided we now cut scope a bit. As I mention above, V2.0 has stood up
to more than a week of aggressive, comprehensive stress testing/custom
correctness verification on an 8 core box (plus numerous other stress
tests in months past). UPSERT (which never involved exclusion
constraints) is a very comprehensive and mature effort, and I think it
now needs one big push from a senior community member. I feel that I
cannot do anything more without that input.

[1] http://www.postgresql.org/message-id/54A7C76D.3070101@vmware.com
-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Sat, Jan 17, 2015 at 6:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I continued with this since posting V2.0.

Attached version (V2.1) fixes bit-rot caused by the recent changes by
Stephen ("Fix column-privilege leak in error-message paths"). More
precisely, it is rebased on top of today's 17792b commit.

I have not addressed the recently described problems with exclusion
constraints. I hope we can do so shortly. Simply removing IGNORE
support until such time as we straighten that all out (9.6?) seems
like the simplest solution. No need to block the progress of "UPSERT",
since exclusion constraint support was only ever going to be useful
for the less compelling IGNORE variant. What do other people think? Do
you agree with my view that we should shelve IGNORE support for now,
Heikki?

There is one minor bugfix here: I have tightened up the conditions
under which user-defined rule application will be rejected.
Previously, I neglected to specifically check for UPDATE rules when an
INSERT ... ON CONFLICT UPDATE statement was considered. That's been
fixed.

On the stress-testing front, I'm still running Jeff Janes' tool [1],
while also continuing to use his Postgres modifications to
artificially increase the XID burn rate. However, my personal server
is no longer used for this task. I'm using an AWS EC2 instance - a
r3.8xlarge. This server provides 32 logical cores, and uses an "Intel
Xeon E5-2670 v2 @ 2.50GHz" CPU. It seems reasonable to suppose that
any latent concurrency bugs are more likely to reveal themselves when
using the new server.

Anyone who would like access to the server should contact me
privately. It's a throw-away EC2 instance, so this isn't particularly
difficult to do.

Thanks

[1] https://github.com/petergeoghegan/jjanes_upsert
--
Peter Geoghegan

Attachment

Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Geoff Winkless
Date:
On Thu, Jan 29, 2015 at 11:38 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Simply removing IGNORE support until such time as we straighten
> that all out (9.6?) seems like the simplest solution. No need to block
> the progress of "UPSERT", since exclusion constraint support was
> only ever going to be useful for the less compelling IGNORE variant.
> What do other people think? Do you agree with my view that we should
> shelve IGNORE support for now, Heikki?

I appreciate the work you're doing and (as a user rather than a
pg-hacker) don't want to butt in but if it would be possible to allow
support for IGNORE for unique but not exclusion constraints that would
be really helpful for my own use cases, where being able to insert
from a dataset into a table containing unique constraints without
having to first check the dataset for uniqueness (within both itself
and the target table) is very useful.

It's possible that I've misunderstood anyway and that you mean purely
that exclusion constraint IGNORE should be shelved until 9.6, in which
case I apologise.

Of course if there's no way to allow unique constraint IGNORE without
exclusion constraints then just ignore me; I (along I'm sure with all
the others who are following this conversation from afar) will be
incredibly grateful for the work you've done either way.

I suppose there's no reason why we couldn't use a no-op ON CONFLICT
UPDATE anyway, but that does seem rather messy and (I imagine) would
involve rather more work (unless the optimizer were able to optimize
away the "update"? I don't know enough to be able to say if it would).

Thanks

Geoff



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Fri, Jan 30, 2015 at 6:59 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> I appreciate the work you're doing and (as a user rather than a
> pg-hacker) don't want to butt in but if it would be possible to allow
> support for IGNORE for unique but not exclusion constraints that would
> be really helpful for my own use cases, where being able to insert
> from a dataset into a table containing unique constraints without
> having to first check the dataset for uniqueness (within both itself
> and the target table) is very useful.
>
> It's possible that I've misunderstood anyway and that you mean purely
> that exclusion constraint IGNORE should be shelved until 9.6, in which
> case I apologise.

Well, the issue is that we can't really add exclusion constraints to
the IGNORE case later. So the fact that we cannot do exclusion
constraints kind of implies that we can either shelve IGNORE and maybe
look at it later, or accept that we'll never support exclusion
constraints with IGNORE. We'd then include IGNORE without exclusion
constraint support now and forever. I tend to think that we'll end up
doing the latter anyway, but I really don't want to add additional
risk of this not getting into 9.5 by arguing about that now. It
doesn't matter that much.
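
For clarity, the IGNORE variant at issue is simply this (table names
are placeholders), with unique indexes as the only supported arbiters:

"""""
insert into dataset_target (k, v)
select k, v from dataset_source
on conflict ignore;  -- skip rows whose insertion would conflict
"""""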

> I suppose there's no reason why we couldn't use a no-op ON CONFLICT
> UPDATE anyway

Right. IGNORE isn't really all that compelling for that reason. Note
that this will still lock the unmodified row, though.

-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Andres Freund
Date:
Hi,

A first (not actually that quick :() look through the patches to see
what actually happened in the last months. I didn't keep up with the
thread.

Generally the split into the individual commits doesn't seem to make
much sense to me. The commits individually (except the first) aren't
individually committable and aren't even meaningful. Splitting off the
internal docs, tests and such actually just seems to make reviewing
harder because you miss context. Splitting it so that individual
pieces are committable and reviewable makes sense, but... I have no
problem doing the user docs later. If you split off RLS support, you
need to throw an error before it's implemented.

0001:
* References INSERT with ON CONFLICT UPDATE, can thus not be committed
  independently. I don't think those references really are needed.
* I'm not a fan of the increased code duplication in
  ExecCheckRTEPerms(). Can you look into cleaning that up?
* Also the comments inside the ACL_INSERT case still reference UPDATE.

Other than that I think we can just go ahead and commit this ahead of
time. Mentioning ON CONFLICT UPDATE (OCU henceforth) in the commit
message only.

0007:
* "AMs" alone isn't particularly unique.
* Without the context of the discussion "unprincipled deadlocks" aren't
  well defined.
* Too many "" words.
* Waiting "too long" isn't defined. Neither is why that'd imply
  unprincipled deadlocks. Somewhat understandable with the context of
  the discussion, but surely not a couple years down the road.
* As is I don't find the README entry super helpful. It should state
  what the reason for doing this is clearly, start at the higher level,
  and then move to the details.
* Misses details about the speculative heavyweight locking of tuples.

0002:
* Tentatively I'd say that killspeculative should be done via a
  separate function instead of heap_delete()
* I think we should, as you ponder in a comment, do the OCU specific
  stuff lazily and/or in a separate function from BuildIndexInfo().
  That function is already quite visible in profiles, and the
  additional work isn't entirely trivial.
* I doubt logical decoding works with the patch as it stands.
* The added ereport (i.e. user facing) error message in
  ExecInsertIndexTuples won't help a user at all.
* Personally I don't care one iota for comments like "Get information
  from the result relation info structure.". Yes, one of these already
  exists, but ...
* If an arbiter index is passed to ExecCheckIndexConstraints(), can't
  we abort the loop after checking it? Also, do we really have to
  iterate over indexes for that case? How about moving the loop
  contents to a separate function and using that separately for the
  arbiter cases?
* Don't like the comment above check_exclusion_or_unique_constraint()'s
  much. Makes too much of a special case of OCU
* ItemPointerIsValid
* ExecCheckHeapTupleVisible's comment "It is not acceptable to proceed "
  sounds like you're talking with a child or so ;)
* ExecCheckHeapTupleVisible()'s errhint() sounds like an
  argument/excuse (actually like a code comment). That's not going to
  help a user at all.
* I find the modified control flow in ExecInsert() pretty darn ugly. I
  think this needs to be cleaned up. The speculative case should imo
  be a separate function or something.
* /*
   * This may occur when an instantaneously invisible tuple is blamed
   * as a conflict because multiple rows are inserted with the same
   * constrained values.
  How can this happen? We don't insert multiple rows with the same
  command id?
* ExecLockUpdatedTuple() has (too?) many comments, but little overview
  of what/why it is doing what it does on a higher level.
* plan_speculative_use_index: "Use the planner to decide speculative
  insertion arbiter index" - Huh? "rel is the target to undergo ON
  CONFLICT UPDATE/IGNORE." - Which rel?
* formulations as "fundamental nexus" are hard to understand imo.
* Perhaps it has previously been discussed but I'm not convinced by
  the reasoning for not looking at opclasses in infer_unique_index().
  This seems like it'd prohibit ever having e.g. case insensitive
  opclasses - something surely worthwhile.
* Doesn't infer_unique_index() have to look for indisvalid? This isn't
  going to work well with an invalid (not to speak of a !ready) index.
* Is ->relation in the UpdateStmt generated in transformInsertStmt
  ever used for anything? If so, it'd possibly generate some possible
  nastiness due to repeated name lookups. Looks like it'll be used in
  transformUpdateStmt
* What's the reason for the !pstate->p_parent? Also why the parens?
  pstate->p_is_speculative = (pstate->parentParseState &&
                              (!pstate->p_parent_cte &&
                               pstate->parentParseState->p_is_insert &&
                               pstate->parentParseState->p_is_speculative));
* Why did you need to make DISTINCT and ON %nonassoc in the grammar?
* The whole speculative insert logic isn't really well documented.
  Why, for example, do we actually need the token? And why are there
  no issues with overflow? And where is it documented that a 0 means
  there's no token? ...
* Isn't "SpecType" an awfully generic (and nondescriptive) name?
* /* XXX:  Make sure that re-use of bits is safe here */ - no, not
  unless you change existing checks.
* /*
   * Immediately VACUUM "super-deleted" tuples
   */
  if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple),
                          InvalidTransactionId))
      return HEAPTUPLE_DEAD;
  Is that branch really needed? Shouldn't it just be happening as a
  consequence of the already existing code? Same in SatisfiesMVCC. If
  you actually needed that block, it'd need to be done in
  SatisfiesSelf as well, no? You have a comment about a possible loop
  - but that seems wrong to me, implying that HEAP_XMIN_COMMITTED was
  set invalidly.


Ok, I can't focus at all any further at this point. But there's enough
comments here that some even might make sense ;)

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Geoff Winkless
Date:
On 30 January 2015 at 21:58, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Jan 30, 2015 at 6:59 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
>> I suppose there's no reason why we couldn't use a no-op ON CONFLICT
>> UPDATE anyway
>
> Right. IGNORE isn't really all that compelling for that reason. Note
> that this will still lock the unmodified row, though.

Mmmf. So I would have to make sure that my source tuples were unique
before doing the INSERT (otherwise the first ON CONFLICT UPDATE for a
tuple would block any other)? That's potentially very slow :(

When you say that you can't add exclusion constraints later, do you
mean from a coding point of view or just because people would get
confused whether exclusion constraints could be IGNOREd or not?

Geoff



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Geoff Winkless
Date:
On 2 February 2015 at 14:32, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> Mmmf. So I would have to make sure that my source tuples were unique
> before doing the INSERT (otherwise the first ON CONFLICT UPDATE for a
> tuple would block any other)? That's potentially very slow :(

Replying to my own message, because it occurs to me I might be being
stupid (surely not :) )

When you say "this will still lock the unmodified row" did you mean
just that it's locked to _other_ processes until commit? That would be
much less impactful.

Geoff



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 01/18/2015 04:48 AM, Peter Geoghegan wrote:
> I think that the fundamental, unfixable race condition here is the
> disconnect between index tuple insertion and checking for would-be
> exclusion violations that exclusion constraints naturally have here,
> that unique indexes naturally don't have [1] (note that I'm talking
> only about approach #2 to value locking here; approach #1 isn't in
> V2.0). I suspect that the feature is not technically feasible to make
> work correctly with exclusion constraints, end of story. VACUUM
> interlocking is probably also involved here, but the unfixable race
> condition seems like our fundamental problem.

It's not a fundamental, unfixable race condition. In [1], I gave you 
three ideas straight off the top of my head on how that could be fixed.

> Please work with me towards a committable patch.

I'm trying...

> [1] http://www.postgresql.org/message-id/54A7C76D.3070101@vmware.com

- Heikki




Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 01/30/2015 01:38 AM, Peter Geoghegan wrote:
> I have not addressed the recently described problems with exclusion
> constraints. I hope we can do so shortly. Simply removing IGNORE
> support until such time as we straighten that all out (9.6?) seems
> like the simplest solution. No need to block the progress of "UPSERT",
> since exclusion constraint support was only ever going to be useful
> for the less compelling IGNORE variant. What do other people think? Do
> you agree with my view that we should shelve IGNORE support for now,
> Heikki?

No, I don't agree. Let's fix it.

- Heikki




Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Mon, Feb 02, 2015 at 4:48 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> I think that the fundamental, unfixable race condition here is the
>> disconnect between index tuple insertion and checking for would-be
>> exclusion violations that exclusion constraints naturally have here,
>> that unique indexes naturally don't have [1] (note that I'm talking
>> only about approach #2 to value locking here; approach #1 isn't in
>> V2.0). I suspect that the feature is not technically feasible to make
>> work correctly with exclusion constraints, end of story. VACUUM
>> interlocking is probably also involved here, but the unfixable race
>> condition seems like our fundamental problem.
>
> It's not a fundamental, unfixable race condition. In [1], I gave you
> three ideas straight off the top of my head on how that could be fixed.

That was different - I tried to make it work by fixing some bugs
there. However, I'm now finding myself up against these new bugs. I
think that the underlying cause is the lack of any real locking
(unlike with the B-Tree AM) in *both* cases, but I don't even know
that for sure. The error messages you see are quite odd - why should a
btree_gist-based exclusion constraint cause a violation when
non-conflicting values are inserted? There is some other race
condition here. This wasn't a livelock (or a deadlock), which is what
your comments in early January apply to. I think that this has
something to do with VACUUM interlocking. But with the B-Tree AM
(which we're handling differently, by re-using infrastructure used for
deferred unique constraints), things work quite well. The patch stands
up to Jeff's vigorous stress-tests.

I'm not fundamentally in disagreement with you about any of this. All
I'm saying is that we should cut scope today. We're not precluding
picking up an IGNORE feature that does support exclusion constraints
in the future. Why should we insist upon having that in the first cut?
It's both significantly harder, and significantly less useful to
users, and so cutting that makes perfect sense AFAICT. As I've said
many times, exclusion constraint support was only ever going to be
useful to the IGNORE variant (I've tested exclusion constraints by
contriving a case to make them do UPSERTs, but this is only for the
benefit of the stress-test).

Thanks
-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 01/30/2015 01:38 AM, Peter Geoghegan wrote:
> On the stress-testing front, I'm still running Jeff Janes' tool [1],
> while also continuing to use his Postgres modifications to
> artificially increase the XID burn rate.
>
> [1] https://github.com/petergeoghegan/jjanes_upsert

I followed the instructions in README.md to reproduce this. I downloaded 
the tool, applied the upsert patchset, applied the hack to 
parse_clause.c as instructed in the README.md file, installed 
btree_gist, and ran count_upsert_exclusion.pl.

It failed immediately with an assertion failure:

TRAP: FailedAssertion("!(node->spec != SPEC_INSERT || node->arbiterIndex 
!= ((Oid) 0))", File: "nodeModifyTable.c", Line: 1619)

Is that just because of the hack in parse_clause.c?

With assertions disabled, count_upsert_exclusion.pl ran successfully to 
the end. I also tried running "VACUUM FREEZE upsert_race_test" in a loop 
in another session at the same time, but it didn't make a difference. 
How quickly do you see the errors?

I also tried applying crash_REL9_5.patch from the jjanes_upsert kit, and 
set jj_xid=10000 to increase XID burn rate, but I'm still not seeing any 
errors.

- Heikki




Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Tue, Feb 3, 2015 at 2:05 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> TRAP: FailedAssertion("!(node->spec != SPEC_INSERT || node->arbiterIndex !=
> ((Oid) 0))", File: "nodeModifyTable.c", Line: 1619)
>
> Is that just because of the hack in parse_clause.c?

Yes. I never built with assertions and so didn't see this, but it
doesn't matter.

> With assertions disabled, count_upsert_exclusion.pl ran successfully to the
> end. I also tried running "VACUUM FREEZE upsert_race_test" in a loop in
> another session at the same time, but it didn't make a difference. How
> quickly do you see the errors?
>
> I also tried applying crash_REL9_5.patch from the jjanes_upsert kit, and set
> jj_xid=10000 to increase XID burn rate, but I'm still not seeing any errors.

Did you build fully optimized, assertion-free code? I've been doing
that. I found it necessary to recreate some of the bugs Jeff's tool
caught. I also think that I might have needed an 8 core box to see it,
but less sure about that.

-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 02/03/2015 08:17 PM, Peter Geoghegan wrote:
> On Tue, Feb 3, 2015 at 2:05 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> TRAP: FailedAssertion("!(node->spec != SPEC_INSERT || node->arbiterIndex !=
>> ((Oid) 0))", File: "nodeModifyTable.c", Line: 1619)
>>
>> Is that just because of the hack in parse_clause.c?
>
> Yes. I never built with assertions and so didn't see this, but it
> doesn't matter.
>
>> With assertions disabled, count_upsert_exclusion.pl ran successfully to the
>> end. I also tried running "VACUUM FREEZE upsert_race_test" in a loop in
>> another session at the same time, but it didn't make a difference. How
>> quickly do you see the errors?
>>
>> I also tried applying crash_REL9_5.patch from the jjanes_upsert kit, and set
>> jj_xid=10000 to increase XID burn rate, but I'm still not seeing any errors.
>
> Did you build fully optimized, assertion-free code? I've been doing
> that. I found it necessary to recreate some of the bugs Jeff's tool
> caught. I also think that I might have needed an 8 core box to see it,
> but less sure about that.

I had compiled with -O0, but without assertions. I tried now again with 
-O3. It's been running for about 10 minutes now, and I haven't seen any 
errors.

Since you can reproduce this, it would be good if you could debug this. 
The error message where the alleged duplicate key actually had a 
different value is a bit scary. Makes me wonder if it might be a bug 
with exclusion constraints in general, or just with the patch.

- Heikki




Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Wed, Feb 4, 2015 at 9:54 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I had compiled with -O0, but without assertions. I tried now again with -O3.
> It's been running for about 10 minutes now, and I haven't seen any errors.

Did you run with an artificially high XID burn rate (i.e. did you also
apply Jeff's modifications to Postgres, and specify a high burn rate
using his custom GUC)? Maybe that was important.


-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Tue, Feb 2, 2015 at 01:06 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> A first (not actually that quick :() look through the patches to see
> what actually happened in the last months. I didn't keep up with the
> thread.

So, let me get this out of the way: This is the first in-depth
technical review that this work has had in a long time. Thank you for
your help here.

> Generally the split into the individual commits doesn't seem to make
> much sense to me. The commits individually (except the first) aren't
> indivdiually commitable and aren't even meaningful. Splitting off the
> internal docs, tests and such actually just seems to make reviewing
> harder because you miss context. Splitting it so that individual piece
> are committable and reviewable makes sense, but... I have no problem
> doing the user docs later. If you split of RLS support, you need to
> throw an error before it's implemented.

I mostly agree. Basically, I did not intend for all of the patches to
be individually committed. The mechanism by which EXCLUDED.*
expressions are added is somewhat novel, and deserves to be
independently *considered*. I'm trying to show how the parts fit
together more so than breaking things down into smaller commits (as
you picked up on, 0001 is the exception - that is genuinely intended
to be committed early). Also, those commit messages give me the
opportunity to put those parts in their appropriate context vis-a-vis
our discussions. They refer to the Wiki, for example, or reasons why
pg_stat_statements shouldn't care about ExcludedExpr. Obviously the
final commit messages won't look that way.

> 0001:
> * References INSERT with ON CONFLICT UPDATE, can thus not be committed
>   independently. I don't think those references really are needed.
> * I'm not a fan of the increased code duplication in
>   ExecCheckRTEPerms(). Can you look into cleaning that up?
> * Also the comments inside the ACL_INSERT case still reference UPDATE.
>
> Other than that I think we can just go ahead and commit this ahead of
> time. Mentioning ON CONFLICT UPDATE (OCU henceforth) in the commit
> message only.

Cool. Attached revision makes those changes.

> 0007:
> * "AMs" alone isn't particularly unique.
> * Without the context of the discussion "unprincipled deadlocks" aren't
>   well defined.
> * Too many "" words.
> * Waiting "too long" isn't defined. Neither is why that'd imply
>   unprincipled deadlocks. Somewhat understandable with the context of
>   the discussion, but surely not a couple years down the road.
> * As is I don't find the README entry super helpful. It should state
>   what the reason for doing this is clearly, start at the higher level,
>   and then move to the details.
> * Misses details about the speculative heavyweight locking of tuples.

Fair points. I'll work through that feedback.

Actually, I think we should memorialize that "unprincipled deadlocks"
should be avoided in some more general way, since they are after all a
general problem that we've seen elsewhere. I'm not sure about how to
go about doing that, though.

> 0002:
> * Tentatively I'd say that killspeculative should be done via a separate
>   function instead of heap_delete()

Really? I guess if that were to happen, it would entail refactoring
heap_delete() to call a static function, which was also called by a
new kill_speculative() function that does this. Otherwise, you'd have
far too much duplication.

> * I think we should, as you ponder in a comment, do the OCU specific
>   stuff lazily and/or in a separate function from BuildIndexInfo(). That
>   function is already quite visible in profiles, and the additional work
>   isn't entirely trivial.

Okay.

> * I doubt logical decoding works with the patch as it stands.

I thought so. Perhaps you could suggest a better use of the available
XLOG_HEAP_* bits. I knew I needed to consider that more carefully
(hence the XXX comment), but didn't get around to it.

> * The added ereport (i.e. user facing) error message in
>   ExecInsertIndexTuples won't help a user at all.

So, this:

>>    /* Skip this index-update if the predicate isn't satisfied */
>>    if (!ExecQual(predicate, econtext, false))
>> + {
>> +     if (arbiterIdx == indexRelation->rd_index->indexrelid)
>> +         ereport(ERROR,
>> +                 (errcode(ERRCODE_TRIGGERED_ACTION_EXCEPTION),
>> +                  errmsg("partial arbiter unique index has predicate that does not cover tuple proposed for insertion"),
>> +                  errdetail("ON CONFLICT inference clause implies that the tuple proposed for insertion actually be covered by partial predicate for index \"%s\".",
>> +                            RelationGetRelationName(indexRelation)),
>> +                  errhint("ON CONFLICT inference clause must infer a unique index that covers the final tuple, after BEFORE ROW INSERT triggers fire."),
>> +                  errtableconstraint(heapRelation,
>> +                                     RelationGetRelationName(indexRelation))));
>>        continue;
>> + }

Yeah, that isn't a great error message. This happens here because you
are using a partial unique index (and so you must have had an
inference specification with a "WHERE" to get here). However, what you
actually went to insert would not be covered by this partial unique
index, and so couldn't ever take the alternative path, and so is
likely not thought out. Maybe it would be better to silently always
let the INSERT succeed as an INSERT. *That* actually wasn't really
discussed - this is all my idea.
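
To illustrate the case in question (hypothetical names; the syntax is
the patch's inference clause with a "WHERE"):

"""""
create table orders (id int, active bool, note text);
create unique index orders_active_id on orders (id) where active;

-- The inference clause names the partial index by its predicate, but
-- the tuple proposed for insertion (active = false) isn't covered by
-- it, so the alternative path could never be taken; as things stand
-- this raises the error above rather than proceeding as an INSERT.
insert into orders (id, active, note)
values (1, false, 'x')
on conflict (id) where active
update set note = EXCLUDED.note;
"""""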

> * Personally I don't care one iota for comments like "Get information
>   from the result relation info structure.". Yes, one of these already
>   exists, but ...

Okay.

> * If an arbiter index is passed to ExecCheckIndexConstraints(), can't we
>   abort the loop after checking it? Also, do we really have to iterate
>   over indexes for that case? How about moving the loop contents to a
>   separate function and using that separately for the arbiter cases?

Well, the failure to do that implies very few extra cycles, but sure.
I'll add a new reason to break at the end, when
check_exclusion_or_unique_constraint() is called in respect of a
particular (inferred) arbiter unique index.

> * Don't like the comment above check_exclusion_or_unique_constraint()'s
>   much. Makes too much of a special case of OCU

I guess I should just refer to speculative insertion.

> * ItemPointerIsValid

What about it?

> * ExecCheckHeapTupleVisible's comment "It is not acceptable to proceed "
>   sounds like you're talking with a child or so ;)

Fair point. I should say "It would not be consistent with the
guarantees of the higher isolation levels..."

> * ExecCheckHeapTupleVisible()'s errhint() sounds like an
>   argument/excuse (actually like a code comment). That's not going to
>   help a user at all.

Really? I thought it might be less than intuitive that higher
isolation levels cannot decide to do nothing on the basis of something
not in their MVCC snapshot. But come to think of it, yeah, that
errhint() isn't adding much over the main error message.

> * I find the modified control flow in ExecInsert() pretty darn ugly. I
>   think this needs to be cleaned up. The speculative case should imo be
>   a separate function or something.

Hmm. I'm not quite sold on that. Basically, if we did that, we'd still
have a function that was more or less a strict superset of
ExecInsert(). What have we saved?

What I do agree with is that ExecInsert() should be refactored to make
the common case (a vanilla insert) look like the common case, whereas
the uncommon case (an upsert) should have that dealt with specially.
There is room for improvement. Is that a fair compromise?

> * /*
>    * This may occur when an instantaneously invisible tuple is blamed
>    * as a conflict because multiple rows are inserted with the same
>    * constrained values.
>    How can this happen? We don't insert multiple rows with the same
>    command id?

This is a cardinality violation [1]. It can definitely happen - just
try the examples you see on the Wiki. This is possible because I
modified heap_lock_tuple() to return HeapTupleInvisible (and not just
complain directly when HeapTupleSatisfiesUpdate() returns
"HeapTupleInvisible"). It's also possible because we're using a
DirtySnapshot at various points. This is sort of like how ExecUpdate()
handles a return value of "HeapTupleSelfUpdated" from heap_update().
Not quite though, because 1) I prefer to throw an error (rather than
silently not UPDATE that slot), and 2) we're not dealing with MVCC
semantics, so the return values are different in both cases. The
*nature* of the situation handled is similar between UPSERTs (in
ExecLockUpdatedTuple()) and vanilla UPDATEs (in ExecUpdate()), though.
Does that make sense?
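
The simplest way I know of to see a cardinality violation (sketched
against the stress-test table):

"""""
-- Two rows in a single command conflict with each other, so the
-- second is blamed against an instantaneously invisible tuple
-- inserted by the same command; an error is thrown rather than
-- updating the same row twice.
insert into upsert_race_test (index, count)
values (1, 1), (1, 1)
on conflict (index)
update set count = TARGET.count + EXCLUDED.count;
"""""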

> * ExecLockUpdatedTuple() has (too?) many comments, but little overview
>   of what/why it is doing what it does on a higher level.

Fair point. Seems like material for a better worked out executor README.

> * plan_speculative_use_index: "Use the planner to decide speculative
>   insertion arbiter index" - Huh? " rel is the target to undergo ON
>   CONFLICT UPDATE/IGNORE." - Which rel?

Sorry, that's an obsolete comment (the function signature changed). It
should refer to the target of the Query being planned.

> * formulations as "fundamental nexus" are hard to understand imo.

I'm trying to suggest that INSERT ... ON CONFLICT UPDATE is not quite
two separate top-level commands, and yet is also not a new, distinct
type of top-level command. This is mostly a high level design decision
that maximizes code reuse.

> * Perhaps it has previously been discussed but I'm not convinced by the
>   reasoning for not looking at opclasses in infer_unique_index(). This
>   seems like it'd prohibit ever having e.g. case insensitive opclasses -
>   something surely worthwhile.

I don't think anyone gave that idea the thumbs-up. However, I really
don't see the problem. Sure, we could have case insensitive opclasses
in the future, and you may want to make a unique index using one. But
then it becomes a matter of whatever unique indexes are available. The
limitation is only that you cannot explicitly indicate that you want a
certain opclass. It comes down to whatever unique indexes happen to be
available, since of course taking the alternative path is arbitrated
by a would-be unique violation. It's a bit odd that we're leaving it
up to the available indexes to decide on semantics like that, but the
problem is so narrow and the solution so involved that I'd argue it's
acceptable.
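
To sketch what I mean (hypothetical names, and assuming the inference
specification accepts expressions just as it does plain columns):

"""""
create table users (email text);
create unique index users_email_ci on users (lower(email));

-- Inference works from columns/expressions only; there is no way to
-- name a particular opclass here, so whichever unique index happens
-- to exist decides the semantics.
insert into users (email)
values ('Alice@example.com')
on conflict (lower(email))
update set email = EXCLUDED.email;
"""""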

> * Doesn't infer_unique_index() have to look for indisvalid? This isn't
>   going to work well with an invalid (not to speak of a !ready) index.

It does (check IndexIsValid()). I think the mistake I made here was
not checking IndexIsReady(), since that is an additional concern above
what the similar get_relation_info() function must consider.

> * Is ->relation in the UpdateStmt generated in transformInsertStmt ever
>   used for anything? If so, it'd possibly generate some possible
>   nastiness due to repeated name lookups. Looks like it'll be used in
>   transformUpdateStmt

What, you mean security issues, for example? I have a hard time seeing
how that could work in practice, given that the one and only target
RTE is marked with the appropriate updatedCols originating from
transformUpdateStmt(). Still, it is a concern generally - better safe
than sorry. I was thinking of plugging it by ensuring that the
Relations matched, but that might not be good enough. Maybe it would
be better to bite the bullet and have transformUpdateStmt() use the
same Relation directly, which is something I hoped to avoid (better to
have transformUpdateStmt() know as little as possible about this, I'd
say).

> * What's the reason for the !pstate->p_parent? Also why the parens?
> pstate->p_is_speculative = (pstate->parentParseState &&
> (!pstate->p_parent_cte &&
> pstate->parentParseState->p_is_insert &&
> pstate->parentParseState->p_is_speculative));

You mean the "!pstate->p_parent_cte"? That's there because you can get
queries to segfault if this logic doesn't consider that a
data-modifying CTE can have an UPDATE that appears within a CTE
referenced from an INSERT.  :-)
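
That is, a query of roughly this shape (hypothetical table names) must
not have its inner UPDATE treated as speculative:

"""""
with moved as (
    update source_tab set flag = true
    where id = 1
    returning id, payload
)
insert into target_tab (id, payload)
select id, payload from moved;
"""""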

> * Why did you need to make DISTINCT and ON %nonassoc in the grammar?

To prevent a shift/reduce conflict, I changed the associativity.
Without this, here are the details of State 700, which has the
conflict (from gram.output):

"""""
State 700

  1465 opt_distinct: DISTINCT .
  1466             | DISTINCT . ON '(' expr_list ')'

    ON  shift, and go to state 1094

    ON        [reduce using rule 1465 (opt_distinct)]
    $default  reduce using rule 1465 (opt_distinct)

"""""

> * The whole speculative insert logic isn't really well documented. Why,
>   for example, do we actually need the token? And why are there no
>   issues with overflow? And where is it documented that a 0 means
>   there's no token? ...

Fair enough. Presumably it's okay that overflow theoretically could
occur, because a race is all but impossible. The token represents a
particular attempt by some backend at inserting a tuple, that needs to
be waited on specifically only if it is their active attempt (and the
xact is still running). Otherwise, you get unprincipled deadlocks.
Even if by some incredibly unlikely set of circumstances it wraps
around, worst case scenario you get an unprincipled deadlock, which is
hardly the end
of the world given the immense number of insertions required, and the
immense unlikelihood that things would work out that way - it'd be
basically impossible.

I'll document the "0" thing.

> * Isn't "SpecType" a awfully generic (and nondescriptive) name?

OK. That'll be changed.

> * /* XXX:  Make sure that re-use of bits is safe here */ - no, not
>   unless you change existing checks.

I think that this is a restatement of your remarks on logical decoding. No?

> * /*
> * Immediately VACUUM "super-deleted" tuples
> */
> if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple),
> InvalidTransactionId))
> return HEAPTUPLE_DEAD;
>   Is that branch really needed? Shouldn't it just be happening as a
>   consequence of the already existing code? Same in SatisfiesMVCC. If
>   you actually needed that block, it'd need to be done in SatisfiesSelf
>   as well, no? You have a comment about a possible loop - but that seems
>   wrong to me, implying that HEAP_XMIN_COMMITTED was set invalidly.

Indeed, this code is kind of odd. While I think the omission within
SatisfiesSelf() may be problematic too, if you really want to know why
this code is needed, comment it out and run Jeff's stress-test. It
will reliably break.

This code:

"""""
if (HeapTupleHeaderXminInvalid(tuple))
  return HEAPTUPLE_DEAD;

"""""

and this code:

"""""
/*
 * Immediately VACUUM "super-deleted" tuples
 */
if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple),
    InvalidTransactionId))
    return HEAPTUPLE_DEAD;

"""""

are not equivalent (nor is the latter a superset of the former). Maybe
they should be, but they're not. What's more, a heap tuple's raw xmin
was never able to change before, and there was never a reason for it
to be InvalidTransactionId. See my new comments within
EvalPlanQualFetch() remarking on how it's now possible for raw xmin to
change (before, the comment claimed that it wasn't possible).

> Ok, I can't focus at all any further at this point. But there's enough
> comments here that some even might make sense ;)

Most do.  :-)

Thanks.

[1] https://wiki.postgresql.org/wiki/UPSERT#.22Cardinality_violation.22_errors_in_detail
--
Peter Geoghegan

Attachment

Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Wed, Feb 4, 2015 at 10:03 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Feb 4, 2015 at 9:54 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> I had compiled with -O0, but without assertions. I tried now again with -O3.
>> It's been running for about 10 minutes now, and I haven't seen any errors.
>
> Did you run with an artificially high XID burn rate (i.e. did you also
> apply Jeff's modifications to Postgres, and specify a high burn rate
> using his custom GUC)? Maybe that was important.

Excuse me: I now see that you specifically indicated that you did. But
it looks like your XID burn rate was quite a lot higher than mine
(assuming that you were consistent in using "jj_xid=10000", although
I'm not asserting that that was significant).

I attach a log of output from an example session where exclusion
constraints are shown to break (plus the corresponding server log,
plus /proc/cpuinfo on the off chance that that's significant). As you
can see from the fact that the span of time recorded in the log is
only a couple of minutes, this is really easy for me to
recreate... sometimes. I could not recreate the problem with only 4
clients (on this 8 core server) after a few dozen attempts, and then I
couldn't recreate the issue at all, so clearly those details matter.
It might have something to do with CPU scaling, which I've found can
significantly affect outcomes for things like this (looks like my
hosting provider changed settings in the system BIOS recently, such
that I cannot set the CPU governor to "performance").

Perhaps you could take a crack at recreating this, Jeff?

Thanks
--
Peter Geoghegan

Attachment

Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Thom Brown
Date:
On 29 January 2015 at 23:38, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Jan 17, 2015 at 6:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I continued with this since posting V2.0.
>
> Attached version (V2.1) fixes bit-rot caused by the recent changes by
> Stephen ("Fix column-privilege leak in error-message paths"). More
> precisely, it is rebased on top of today's 17792b commit.

Patch 0002 no longer applies due to a conflict in
src/backend/executor/execUtils.c.

Thom

Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Bruce Momjian
Date:
On Wed, Feb  4, 2015 at 04:49:46PM -0800, Peter Geoghegan wrote:
> On Tue, Feb 2, 2015 at 01:06 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > A first (not actually that quick :() look through the patches to see
> > what actually happened in the last months. I didn't keep up with the
> > thread.
> 
> So, let me get this out of the way: This is the first in-depth
> technical review that this work has had in a long time. Thank you for
> your help here.

I looked at all the patches too.  The patch is only 9k lines, not huge. 

Other than the locking part, the biggest part of this patch is adjusting
things so that an INSERT can change into an UPDATE.  The code that
handles SELECT/INSERT/UPDATE/DELETE is already complex, and this makes
it even more so.  I have no idea how we can be sure we have hit every
single case, but I am also unclear how we will _ever_ know we have hit
them all.

We know people want this feature, and this patch seems to be our best
bet to getting it.  If we push this off for 9.6, I am not sure what that
buys us.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

 + Everyone has their own god. +



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Fri, Feb 6, 2015 at 1:51 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Other than the locking part, the biggest part of this patch is adjusting
> things so that an INSERT can change into an UPDATE.

Thanks for taking a look at it. That's somewhat cleaned up in the
attached patchseries - V2.2. This has been rebased to repair the minor
bit-rot pointed out by Thom.

Highlights:

* Better parser representation.

The auxiliary UPDATE never uses its own RangeVar (in fact,
UpdateStmt.relation is never set). This eliminates any possibility of
repeated name lookup problems, addressing Andres' concern. But it also
results in better code. The auxiliary UPDATE is not modified by the
parent INSERT at all - rather, the UPDATE knows to fetch its target
Relation from the parsestate parent INSERT. There is no need to
artificially cut off the parent within the auxiliary UPDATE to make
sure the independent RTE is not visible (during parse analysis, prior
to merging the two, as happened in earlier revisions) - that was a
kludge that I'm glad to be rid of. There is no merging of distinct
INSERT and UPDATE Relations/RTEs because there is only ever one
Relation/RTE to begin with. Previously, that merging folded the RTE's
selectedCols and updatedCols into the parent INSERT's (for
column-level privileges, for example). I'm also a lot less cute about determining
whether an UPDATE is an auxiliary UPDATE from within the parser, which
was also a concern raised by Andres.

About 90% of the special case code previously in transformInsertStmt()
is now in setTargetTable(). This is a significant improvement all
around, since the code is now more consistent with existing parse
analysis code - setTargetTable() is naturally where the auxilary
UPDATE figures out details on its target, and builds an EXCLUDED RTE
and adds it to the namespace as a special case (just like for regular
UPDATE targets, which similarly get added to the namespace +
joinlist).

All of this implies a slight behavioral change (which is documented):
The TARGET.* alias is now visible everywhere. So you see it within
every node of EXPLAIN output, and if you want to qualify a RETURNING
column, the TARGET.* alias must be used (not the original table name).
I think that this is an improvement too, although it is arguably a
slight behavioral change to INSERTs in general (can't think why anyone
would particularly want to qualify with an alias in INSERT's
RETURNING, though). Note that the EXCLUDED.* pseudo-alias is still
only visible within the UPDATE's targetlist and WHERE clause. I think
it would be a bad idea to make the EXCLUDED.* tuples visible from
RETURNING [1].
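
Concretely, the change looks like this (sketched against the
stress-test table seen earlier in the thread):

"""""
insert into upsert_race_test (index, count)
values (7142, 1)
on conflict (index)
update set count = TARGET.count + EXCLUDED.count
returning TARGET.count;  -- qualifying as "upsert_race_test.count"
                         -- is no longer accepted here
"""""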

* Cleaner ExecInsert() control flow. Andres rightly complained that
the existing control flow was convoluted. I believe he will find this
revision a lot clearer, although I have not gone so far as creating
something like an ExecUpsert().

* Better documentation. The executor README has been overhauled to
describe the flow of things from a higher level. The procarray changes
are better documented by comments, too.

* Special work previously within BuildIndexInfo() that is needed for
unique indexes for the UPSERT case only is now done only in the UPSERT
case. There is now no added overhead in BuildIndexInfo() for existing
cases.

* Worked on feedback on various points of style raised by Andres (e.g.
an errhint() was removed).

* Better explanation of the re-use of XLOG_HEAP* flag bits. I believe
that it's fine to reuse the "(1<<7)" bit, given that each distinct use
of the bit can only appear in distinct record types (that is, the bit
is used by xl_heap_multi_insert, and now xl_heap_delete). Those two
uses should be mutually exclusive. It's not as if we've had to be
economical with the use of heap flag XLog record bits before now, so
the best approach here isn't obvious. For now, I see no problem with
this reuse.

* SnapshotSelf (that is, HeapTupleSatisfiesSelf()) has additions
analogous to previous additions to the HeapTupleSatisfiesVacuum() and
HeapTupleSatisfiesMVCC() visibility routines. I still don't think that
the changes to tqual.c are completely satisfactory, but as long as
they're directly necessary (which they evidently are - Jeff's
stress-testing tool shows that) then I should at least make the
appropriate changes everywhere. We should definitely focus on why
they're necessary, and consider if we can do better.

* There was some squashing of commits, since Andres felt that they
weren't all useful as separate commits. I've still split out the RTE
permissions commit, as well as the RLS commit (plus the documentation
and test commits, FWIW). I hope that this will make it easier to
review parts of the patch, without being generally excessive.

When documentation and tests are left out, the entire patch series is left at:

 68 files changed, 2958 insertions(+), 297 deletions(-)

which is not too big.

Thanks

[1] http://www.postgresql.org/message-id/CAM3SWZTcpy9rroLM3TkfuU4HDLrEtuGzxLptGn2vLhVAFwQCVA@mail.gmail.com
--
Peter Geoghegan

Attachment

Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Andres Freund
Date:
On 2015-02-04 16:49:46 -0800, Peter Geoghegan wrote:
> On Tue, Feb 2, 2015 at 01:06 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Generally the split into the individual commits doesn't seem to make
> > much sense to me.

I think trying to make that possible is a good idea in patches of this
size. It e.g. seems entirely possible to structure the patchset so that
the speculative lock infrastructure is added first and the rest
later. I've not thought more about how to split it up further, but I'm
pretty sure it's possible.

> > The commits individually (except the first) aren't
> > individually committable and aren't even meaningful. Splitting off the
> > internal docs, tests and such actually just seems to make reviewing
> > harder because you miss context. Splitting it so that individual pieces
> > are committable and reviewable makes sense, but... I have no problem
> > doing the user docs later. If you split off RLS support, you need to
> > throw an error before it's implemented.
>
> I mostly agree. Basically, I did not intend for all of the patches to
> be individually committed. The mechanism by which EXCLUDED.*
> expressions are added is somewhat novel, and deserves to be
> independently *considered*. I'm trying to show how the parts fit
> together more so than breaking things down into smaller commits (as
> you picked up on, 0001 is the exception - that is genuinely intended
> to be committed early). Also, those commit messages give me the
> opportunity to put those parts in their appropriate context vis-a-vis
> our discussions. They refer to the Wiki, for example, or reasons why
> pg_stat_statements shouldn't care about ExcludedExpr. Obviously the
> final commit messages won't look that way.

FWIW, I don't think anything here really should refer to the wiki...

> > 0002:
> > * Tentatively I'd say that killspeculative should be done via a separate
> >   function instead of heap_delete()
>
> Really? I guess if that were to happen, it would entail refactoring
> heap_delete() to call a static function, which was also called by a
> new kill_speculative() function that does this. Otherwise, you'd have
> far too much duplication.

I don't really think there actually is that much in common between
those. It looks to me like most of the code in heap_delete isn't
actually relevant for this case and could be cut short. My guess is that
only the WAL logging would be separated out.

> > * I doubt logical decoding works with the patch as it stands.
>
> I thought so. Perhaps you could suggest a better use of the available
> XLOG_HEAP_* bits. I knew I needed to consider that more carefully
> (hence the XXX comment), but didn't get around to it.

I think you probably need to add test macros that make sure only the
individual bits are set, not the combination, and then only use those.
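
Something like this, perhaps - a sketch only, reusing the flag names
from the patch (both names denote the same (1<<7) bit, but belong to
different record types, so any test must be tied to the record type):

#define XLOG_HEAP_LAST_MULTI_INSERT         (1 << 7)  /* xl_heap_multi_insert only */
#define XLOG_HEAP_KILLED_SPECULATIVE_TUPLE  (1 << 7)  /* xl_heap_delete only */

/* hypothetical per-record-type test macros: */
#define XlHeapDeleteIsSuperDeletion(xlrec) \
    (((xlrec)->flags & XLOG_HEAP_KILLED_SPECULATIVE_TUPLE) != 0)
#define XlHeapMultiInsertIsLast(xlrec) \
    (((xlrec)->flags & XLOG_HEAP_LAST_MULTI_INSERT) != 0)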

> > * If a arbiter index is passed to ExecCheckIndexConstraints(), can't we
> >   abort the loop after checking it? Also, do we really have to iterate
> >   over indexes for that case? How about moving the loop contents to a
> >   separate function and using that separately for the arbiter cases?
>
> Well, the failure to do that implies very few extra cycles, but sure.

It's not so much about the CPU cycles as about the mental
ones. If you have to think what happens if there's more than one
match...

> > * ItemPointerIsValid
>
> What about it?

Uh. Oh. No idea. I wrote this pretty late at night ;)


> > * /*
> >    * This may occur when an instantaneously invisible tuple is blamed
> >    * as a conflict because multiple rows are inserted with the same
> >    * constrained values.
> >    How can this happen? We don't insert multiple rows with the same
> >    command id?
>
> This is a cardinality violation [1]. It can definitely happen - just
> try the examples you see on the Wiki.

I don't care about the wiki from the point of code comments. This needs
to be understandable in five years time.

> > * Perhaps it has previously been discussed but I'm not convinced by the
> >   reasoning for not looking at opclasses in infer_unique_index(). This
> >   seems like it'd prohibit ever having e.g. case insensitive opclasses -
> >   something surely worthwhile.
>
> I don't think anyone gave that idea the thumbs-up. However, I really
> don't see the problem. Sure, we could have case insensitive opclasses
> in the future, and you may want to make a unique index using one.

Then the problem suddenly becomes that previous choices of
indexes/statements aren't possible anymore. It seems much better to
introduce the syntax now and not have too much of a use case for
it.


> > * The whole speculative insert logic isn't really well documented. Why,
> >   for example, do we actually need the token? And why are there no
> >   issues with overflow? And where is it documented that a 0 means
> >   there's no token? ...
>
> Fair enough. Presumably it's okay that overflow theoretically could
> occur, because a race is all but impossible. The token represents a
> particular attempt by some backend at inserting a tuple, that needs to
> be waited on specifically only if it is their active attempt (and the
> xact is still running). Otherwise, you get unprincipled deadlocks.
> Even if by some incredible set of circumstances it wraps around, worst
> case scenario you get an unprincipled deadlock, which is hardly the end
> of the world given the immense number of insertions required, and the
> immense unlikelihood that things would work out that way - it'd be
> basically impossible.
>
> I'll document the "0" thing.

It's really not about me understanding it right now, but about longer term.

> > * /* XXX:  Make sure that re-use of bits is safe here */ - no, not
> >   unless you change existing checks.
>
> I think that this is a restatement of your remarks on logical
> decoding. No?

Yea. By here it was even later :P.

Can you add an UPSERT test for logical decoding? I doubt it'll work right
now, even in the repost.


> > * /*
> > * Immediately VACUUM "super-deleted" tuples
> > */
> > if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple),
> > InvalidTransactionId))
> > return HEAPTUPLE_DEAD;
> >   Is that branch really needed? Shouldn't it just be happening as a
> >   consequence of the already existing code? Same in SatisfiesMVCC. If
> >   you actually needed that block, it'd need to be done in SatisfiesSelf
> >   as well, no? You have a comment about a possible loop - but that seems
> >   wrong to me, implying that HEAP_XMIN_COMMITTED was set invalidly.
>
> Indeed, this code is kind of odd. While I think the omission within
> SatisfiesSelf() may be problematic too, if you really want to know why
> this code is needed, uncomment it and run Jeff's stress-test. It will
> reliably break.

Again. I don't care about running some strange tool when reading code
comments.


Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Tue, Feb 10, 2015 at 12:04 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> FWIW, I don't think anything here really should refer to the wiki...

The Wiki pages have done a good job of summarizing things...it
certainly didn't seem that hard for you to get up to speed here. Also,
as you'll know from working on logical decoding, it's hard to know
what is complex and esoteric and what is relatively accessible when
you're this close to a big project. I recall that you said as much
before. I'm focused on signposting so that reviewers can follow what
each patch does with the minimum of effort (with reference to the Wiki
or whatever). I see no reason to not do things that way until
commit...it feels like there is less chance of the concepts going over
people's heads this way.

> I don't really think there actually is that much in common between
> those. It looks to me like most of the code in heap_delete isn't
> actually relevant for this case and could be cut short. My guess is that
> only the WAL logging would be separated out.

I'll think about that some more.

>> > * I doubt logical decoding works with the patch as it stands.
>>
>> I thought so. Perhaps you could suggest a better use of the available
>> XLOG_HEAP_* bits. I knew I needed to consider that more carefully
>> (hence the XXX comment), but didn't get around to it.
>
> I think you probably need to add test macros that make sure only the
> individual bits are set, not the combination, and then only use those.

This too.

>> > * /*
>> >    * This may occur when an instantaneously invisible tuple is blamed
>> >    * as a conflict because multiple rows are inserted with the same
>> >    * constrained values.
>> >    How can this happen? We don't insert multiple rows with the same
>> >    command id?
>>
>> This is a cardinality violation [1]. It can definitely happen - just
>> try the examples you see on the Wiki.
>
> I don't care about the wiki from the point of code comments. This needs
> to be understandable in five years time.

That wasn't clear before - you seemed to me to be questioning if this
was even possible. There is a section in the INSERT SQL reference page
about cardinality violations, so we're certainly talking about
something that a code reader has likely heard of. Also, the nitty-gritty
showing various scenarios on the Wiki is the quickest way to know what
is possible (but is much too long winded for user visible
documentation or code comments).

>> > * Perhaps it has previously been discussed but I'm not convinced by the
>> >   reasoning for not looking at opclasses in infer_unique_index(). This
>> >   seems like it'd prohibit ever having e.g. case insensitive opclasses -
>> >   something surely worthwhile.
>>
>> I don't think anyone gave that idea the thumbs-up. However, I really
>> don't see the problem. Sure, we could have case insensitive opclasses
>> in the future, and you may want to make a unique index using one.
>
> Then the problem suddenly becomes that previous choices of
> indexes/statements aren't possible anymore. It seems much better to
> introduce the syntax now and not have too much of a use case for
> it.

The only way the lack of a way of specifying which opclass to use
could bite us is if there were two *actually* defined unique indexes
on the same column, each with different "equality" operators. That
seems like kind of a funny scenario, even if it were quite possible
(even if non-default opclasses existed that had non-identical
"equality" operators, which is basically not the case today).

I grant that it is a bit odd that we're talking about unique index
definitions affecting semantics, but that is to a certain extent the
nature of the beast. As a compromise, I suggest having the inference
specification optionally accept a named opclass per attribute, in the
style of CREATE INDEX (I'm already reusing a bit of the raw parser
support for CREATE INDEX, you see) - that'll make inference insist on
that opclass, rather than make it a strict matter of costing available
alternatives (not that any alternative is expected with idiomatic
usage). That implies no additional parser overhead, really. If that's
considered ugly, then at least it's an ugly thing that literally no
one will ever use in the foreseeable future...and an ugly thing that
is no more necessary in CREATE INDEX than here (and yet CREATE INDEX
lives with the ugliness).

> It's really not about me understanding it right now, but about longer term.

Sure.

> Can you add an UPSERT test for logical decoding? I doubt it'll work right
> now, even in the repost.

Okay.

>> > * /*
>> > * Immediately VACUUM "super-deleted" tuples
>> > */
>> > if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple),
>> > InvalidTransactionId))
>> > return HEAPTUPLE_DEAD;
>> >   Is that branch really needed? Shouldn't it just be happening as a
>> >   consequence of the already existing code? Same in SatisfiesMVCC. If
>> >   you actually needed that block, it'd need to be done in SatisfiesSelf
>> >   as well, no? You have a comment about a possible loop - but that seems
>> >   wrong to me, implying that HEAP_XMIN_COMMITTED was set invalidly.
>>
>> Indeed, this code is kind of odd. While I think the omission within
>> SatisfiesSelf() may be problematic too, if you really want to know why
>> this code is needed, uncomment it and run Jeff's stress-test. It will
>> reliably break.
>
> Again. I don't care about running some strange tool when reading code
> comments.

Again, I thought you were skeptical about the very need for this code
(and not how it was presented). If that was the case, that tool would
provide you with a pretty quick and easy way of satisfying yourself
that it is needed. The actual reason that it is needed is that if it
isn't, then the system can see a "broken promise" tuple as spuriously
visible. This will cause Jeff's tool to spit out a bunch of errors due
to finding all-NULL values in these tuples. VACUUM could not reclaim
the tuples unless and until they stopped appearing visible for
VACUUM's purposes (which this particular code snippet relates to).
Maybe the comment could be improved, and maybe the code could be
improved, but the code is necessary as things stand.

-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Michael Paquier
Date:
On Tue, Feb 10, 2015 at 9:21 AM, Peter Geoghegan <pg@heroku.com> wrote:
> * There was some squashing of commits, since Andres felt that they
> weren't all useful as separate commits. I've still split out the RTE
> permissions commit, as well as the RLS commit (plus the documentation
> and test commits, FWIW). I hope that this will make it easier to
> review parts of the patch, without being generally excessive.
>
> When documentation and tests are left out, the entire patch series is left at:

Patch moved to next CF.
--
Michael

Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Jim Nasby
Date:
On 2/9/15 6:21 PM, Peter Geoghegan wrote:
> Thanks for taking a look at it. That's somewhat cleaned up in the
> attached patchseries - V2.2.

In patch 1, "sepgsql is also affected by this commit.  Note that this commit
necessitates an initdb, since stored ACLs are broken."

Doesn't that warrant bumping catversion?

Patch 2
+ * When killing a speculatively-inserted tuple, we set xmin to invalid and
+if (!(xlrec->flags & XLOG_HEAP_KILLED_SPECULATIVE_TUPLE))

When doing this, should we also set the HEAP_XMIN_INVALID hint bit?

<reads more of patch>

Ok, I see we're not doing this because the check for a super deleted 
tuple is already cheap. Probably worth mentioning that in the comment...

ExecInsert():
+ * We don't suppress the effects (or, perhaps, side-effects) of
+ * BEFORE ROW INSERT triggers.  This isn't ideal, but then we
+ * cannot proceed with even considering uniqueness violations until
+ * these triggers fire on the one hand, but on the other hand they
+ * have the ability to execute arbitrary user-defined code which
+ * may perform operations entirely outside the system's ability to
+ * nullify.

I'm a bit confused as to why we're calling BEFORE triggers out here... 
hasn't this always been true for both BEFORE *and* AFTER triggers? The 
comment makes me wonder if there's some magic that happens with AFTER...

+spec != SPEC_NONE? HEAP_INSERT_SPECULATIVE:0
Non-standard formatting. Given the size of the patch and the work already
put into it I'd just leave it for the next formatting run; I only mention it
in case someone has some compelling reason not to.

ExecLockUpdateTuple():
+ * Try to lock tuple for update as part of speculative insertion.  If
+ * a qual originating from ON CONFLICT UPDATE is satisfied, update
+ * (but still lock row, even though it may not satisfy estate's
+ * snapshot).

I find this confusing... which row is being locked? The previously 
inserted one? Perhaps a better wording would be "satisfied, update. Lock 
the original row even if it doesn't satisfy estate's snapshot."

infer_unique_index():
+/*
+ * We need not lock the relation since it was already locked, either by
+ * the rewriter or when expand_inherited_rtentry() added it to the query's
+ * rangetable.
+ */
+relationObjectId = rt_fetch(parse->resultRelation, parse->rtable)->relid;
+
+relation = heap_open(relationObjectId, NoLock);

Seems like there should be an Assert here. Also, the comment should 
probably go before the heap_open call.

heapam_xlog.h:
+/* reuse xl_heap_multi_insert-only bit for xl_heap_delete */
I wish this would mention why it's safe to do this. Also, the comment 
mentions xl_heap_delete when the #define is for 
XLOG_HEAP_KILLED_SPECULATIVE_TUPLE; presumably that's wrong. Perhaps:
/*
 * reuse XLOG_HEAP_LAST_MULTI_INSERT bit for
 * XLOG_HEAP_KILLED_SPECULATIVE_TUPLE. This is safe because we never do
 * multi-inserts for INSERT ON CONFLICT.
 */

I'll review the remaining patches later.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 02/10/2015 02:21 AM, Peter Geoghegan wrote:
> On Fri, Feb 6, 2015 at 1:51 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> Other than the locking part, the biggest part of this patch is adjusting
>> things so that an INSERT can change into an UPDATE.
>
> Thanks for taking a look at it. That's somewhat cleaned up in the
> attached patchseries - V2.2. This has been rebased to repair the minor
> bit-rot pointed out by Thom.

I don't really have the energy to review this patch in one batch, so I'd
really like to split this into two:

1. Solve the existing "problem" with exclusion constraints that if two
insertions happen concurrently, one of them might be aborted with
"deadlock detected" error, rather then "conflicting key violation"
error. That really wouldn't be worth fixing otherwise, but it happens to
be a useful stepping stone for this upsert patch.

2. All the rest.

I took a stab at extracting the parts needed to do 1. See attached. I
didn't modify ExecUpdate to use speculative insertions, so the issue
remains for UPDATEs, but that's easy to add.

I did not solve the potential for livelocks (discussed at
http://www.postgresql.org/message-id/CAM3SWZTfTt_fehet3tU3YKCpCYPYnNaUqUZ3Q+NAASnH-60teA@mail.gmail.com).
The patch always super-deletes the already-inserted tuple on conflict,
and then waits for the other inserter. That would be nice to fix, using
one of the ideas from that thread, or something else.

We never really debated the options for how to do the speculative
insertion and super-deletion. This patch is still based on the quick &
dirty demo patches I posted about a year ago, in response to issues you
found with earlier versions. That might be OK - maybe I really hit the
nail on the head designing those things and got them right on the first
try - but more likely there are better alternatives.

Are we happy with acquiring the SpeculativeInsertLock heavy-weight lock
for every insertion? That seems bad for performance reasons. Also, are
we happy with adding the new fields to the proc array? Another
alternative that was discussed was storing the speculative insertion
token on the heap tuple itself. (See
http://www.postgresql.org/message-id/52D00D2D.6030307@vmware.com)

Are we happy with the way super-deletion works? Currently, the xmin
field is set to invalid to mark the tuple as super-deleted. That
required checks in HeapTupleSatisfies* functions. One alternative would
be to set xmax equal to xmin, and teach the code that currently calls
XactLockTableWait() to check if xmax=xmin, and not consider the tuple as
conflicting.

- Heikki

Attachment

Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Sat, Feb 14, 2015 at 2:06 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> Thanks for taking a look at it. That's somewhat cleaned up in the
>> attached patchseries - V2.2. This has been rebased to repair the minor
>> bit-rot pointed out by Thom.
>
> I don't really have the energy to review this patch in one batch, so I'd
> really like to split this into two:

I think we're all feeling worn out at this point, by this patch and by
others. I do appreciate your making the effort.

> 1. Solve the existing "problem" with exclusion constraints that if two
> insertions happen concurrently, one of them might be aborted with "deadlock
> detected" error, rather then "conflicting key violation" error. That really
> wouldn't be worth fixing otherwise, but it happens to be a useful stepping
> stone for this upsert patch.
>
> 2. All the rest.

I think we need a more pragmatic approach to dealing with the
exclusion constraint problems.

> I took a stab at extracting the parts needed to do 1. See attached. I didn't
> modify ExecUpdate to use speculative insertions, so the issue remains for
> UPDATEs, but that's easy to add.

Cool.

> I did not solve the potential for livelocks (discussed at
> http://www.postgresql.org/message-id/CAM3SWZTfTt_fehet3tU3YKCpCYPYnNaUqUZ3Q+NAASnH-60teA@mail.gmail.com).
> The patch always super-deletes the already-inserted tuple on conflict, and
> then waits for the other inserter. That would be nice to fix, using one of
> the ideas from that thread, or something else.

How about we don't super-delete at all with exclusion constraints? I'd
be willing to accept unprincipled deadlocks for exclusion constraints,
because they already exist today for UPSERT/NOSERT type use cases, and
with idiomatic usage seem far less likely for the IGNORE variant
(especially with exclusion constraints). I can see people using ON
CONFLICT UPDATE where they'd almost or actually be better off using a
plain UPDATE - that's quite a different situation. I find livelocks to
be a *very* scary prospect, and I don't think the remediations that
were mentioned are going to fly. It's just too messy, and too much of
a niche use case. TBH I am skeptical of the idea that we can fix
exclusion constraints properly in this way at all, at least not
without the exponential backoff thing, which just seems horrible.

> We never really debated the options for how to do the speculative insertion
> and super-deletion. This patch is still based on the quick & dirty demo
> patches I posted about a year ago, in response to issues you found with
> earlier versions. That might be OK - maybe I really hit the nail on
> the head designing those things and got them right on the first try -
> but more likely there are better alternatives.

Intuitively, it seem likely that you're right here. However, it was
necessary to work through the approach to see what the problems are.
For example, the need for modifications to tqual.c became apparent
only through putting a full implementation of ON CONFLICT UPDATE
through various tests. In general, I've emphasized that the problems
with any given value locking implementation are non-obvious. Anyone
who thinks that he can see all the problems with his approach to value
locking without having a real implementation that is tested for
problems like unprincipled deadlocks is probably wrong.

That's sort of where I'm coming from with suggesting we allow
unprincipled deadlocks with exclusion constraints. I'm not confident
that we can find a perfect solution, and know that it's a perfect
solution. It's too odd, and too niche a requirement. Although you
understood what I was on about when I first talked about unprincipled
deadlocks, I think that acceptance of that idea came much later from
other people, because it's damn complicated. It's not worth it to add
some weird "Dining philosophers" exponential backoff thing to make
sure that the IGNORE variant when used with exclusion constraints can
never deadlock in an unprincipled way, but if it is worth it then it
seems unreasonable to suppose that this patch needs to solve that
pre-existing problem. No?

If we do something like an exponential backoff, which I think might
work, I fear that that kind of yak-shaving will leave us with
something impossible to justify committing; a total mess. Better the
devil you know (possible unprincipled deadlocks with the IGNORE
variant + exclusion constraints).

> Are we happy with acquiring the SpeculativeInsertLock heavy-weight lock for
> every insertion? That seems bad for performance reasons. Also, are we happy
> with adding the new fields to the proc array? Another alternative that was
> discussed was storing the speculative insertion token on the heap tuple
> itself. (See
> http://www.postgresql.org/message-id/52D00D2D.6030307@vmware.com)

Whatever works, really. I can't say that the performance implications
of acquiring that hwlock are at the forefront of my mind. I never
found that to be a big problem on an 8 core box, relative to vanilla
INSERTs, FWIW - lock contention is not normal, and may be where any
heavyweight lock costs would really be encountered. Besides, the update
path won't have to do this at all.

I don't see how storing the promise token in heap tuples buys us not
having to involve heavyweight locking of tokens. (I think you may have
made a thinko in suggesting otherwise)

> Are we happy with the way super-deletion works? Currently, the xmin field is
> set to invalid to mark the tuple as super-deleted. That required checks in
> HeapTupleSatisfies* functions. One alternative would be to set xmax equal to
> xmin, and teach the code that currently calls XactLockTableWait() to check if
> xmax=xmin, and not consider the tuple as conflicting.

That couldn't work without further HeapTupleSatisfiesDirty() logic.
Besides, why should one transaction be entitled to insert a
conflicting value tuple just because a still running transaction
deleted it (having also inserted the tuple itself)? Didn't one
prototype version of value locking #2 have that as a bug [1]? In fact,
originally, didn't the "xmin set to invalid" thing come immediately
from realizing that that wasn't workable?

I too think the tqual.c changes are ugly. I can't see a way around
using a token lock, though - I would only consider (and only consider)
refining the interface by which a waiter becomes aware that it must
wait on the outcome of the inserting xact's speculative
insertion/value lock (and in particular, that it should not wait on
xact outcome). We clearly need something to wait on that isn't an
XID...heavyweight locking of a token value is the obvious way of doing
that.

(thinks about it some more for a while)

It seems like what you're really taking issue with - the real issue -
is that we're introducing this whole new idea of making a tuple
visible for a moment, a moment that is potentially only a slice of its
originating transaction's total duration. Setting xmin to
InvalidTransactionId is one way to do that, and may,
counterintuitively, even be the best way, but the fact that we're
playing new games with visibility is the real point of concern. We
could do something like store the token in the heap tuple, allowing us
to make fewer additions to PGPROC, but that seems kind of pointless
(and a waste of disk space). Playing new games with visibility is the
nature of the beast.

We keep talking about mechanism, but what if we *did* have the
infomask bit space to represent that xmin=xmax is a broken promise
tuple (and not a deleted, self-inserted, combocid-requiring tuple that
may or may not have been a promise tuple at some point in time)? I
think that a minor aesthetic improvement is the best we can hope for,
and maybe even that is too much to expect. Maybe we should just own
the fact that we're playing a new sort of game with visibility, and
keep things as they are. You might consider that setting xmin to
invalidTransactionId is more "honest" than any alternative scheme that
happens to avoid changes to HeapTupleSatisfiesMVCC() and so on.

Jim Nasby said something about setting the HEAP_XMIN_INVALID hint bit.
Maybe he is right...if that can be made to be reliable (always
WAL-logged), it could be marginally better than setting xmin to
InvalidTransactionId. But only marginally; it seems like your
discomfort is basically inherent to playing these new games with
visibility, which is inherent to this design. As I said, I am having a
hard time seeing a way to do anything more than polish what we have
here. That's probably fine, though...it hasn't proven to be too
problematic (exclusion constraints aside).

Not having to change HeapTupleSatisfiesMVCC() and so on (to check if
xmin = InvalidTransactionId) is not obviously a useful goal here,
since ISTM that any alternative would have to *logically* do the same
thing. If I'm off the mark about your thinking here, please correct
me....are you just worried about extra cycles in the
HeapTupleSatisfies* routines?

[1] http://www.postgresql.org/message-id/52D5C74E.6090608@vmware.com
-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Fri, Feb 13, 2015 at 7:22 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> In patch 1, "sepgsql is also affected by this commit.  Note that this commit
> necessitates an initdb, since stored ACLs are broken."
>
> Doesn't that warrant bumping catversion?

Yes, but traditionally that is left until the last minute - when the
patch is committed. That's why it's called out in the commit message
(it isn't otherwise obvious - it's not a common catversion-necessitating
change like an addition to a system catalog).

> Patch 2
> + * When killing a speculatively-inserted tuple, we set xmin to invalid
> and
> +if (!(xlrec->flags & XLOG_HEAP_KILLED_SPECULATIVE_TUPLE))
>
> When doing this, should we also set the HEAP_XMIN_INVALID hint bit?
>
> <reads more of patch>
>
> Ok, I see we're not doing this because the check for a super deleted tuple
> is already cheap. Probably worth mentioning that in the comment...

See my remarks to Heikki on this. I don't think it adds much.

> ExecInsert():
> + * We don't suppress the effects (or, perhaps, side-effects) of
> + * BEFORE ROW INSERT triggers.  This isn't ideal, but then we
> + * cannot proceed with even considering uniqueness violations until
> + * these triggers fire on the one hand, but on the other hand they
> + * have the ability to execute arbitrary user-defined code which
> + * may perform operations entirely outside the system's ability to
> + * nullify.
>
> I'm a bit confused as to why we're calling BEFORE triggers out here...
> hasn't this always been true for both BEFORE *and* AFTER triggers? The
> comment makes me wonder if there's some magic that happens with AFTER...

Yes, but the difference here is that the UPDATE path could be taken
(which is sort of like when a before row insert trigger returns NULL).
What I'm calling out is the dependency on firing before row insert
triggers to decide if the alternative path must be taken. Roughly
speaking, we must perform "part" of the INSERT (firing of before row
insert triggers) in order to decide to do or not do an INSERT. This
is, as I said, similar to when those triggers return NULL, and won't
matter with idiomatic patterns of before trigger usage. Still feels
worth calling out, because sometimes users do foolishly write before
triggers with many external side-effects.

> ExecLockUpdateTuple():
> + * Try to lock tuple for update as part of speculative insertion.  If
> + * a qual originating from ON CONFLICT UPDATE is satisfied, update
> + * (but still lock row, even though it may not satisfy estate's
> + * snapshot).
>
> I find this confusing... which row is being locked? The previously inserted
> one? Perhaps a better wording would be "satisfied, update. Lock the original
> row even if it doesn't satisfy estate's snapshot."

Take a look at the executor README. We're locking the only row that
can be locked when an UPSERT tentatively takes the
UPDATE path: the row that was found during our pre-check. We can only
UPDATE when we find such a tuple, and then lock it without finding a
row-level conflict.

> infer_unique_index():
> +/*
> + * We need not lock the relation since it was already locked, either by
> + * the rewriter or when expand_inherited_rtentry() added it to the query's
> + * rangetable.
> + */
> +relationObjectId = rt_fetch(parse->resultRelation, parse->rtable)->relid;
> +
> +relation = heap_open(relationObjectId, NoLock);
>
> Seems like there should be an Assert here. Also, the comment should probably
> go before the heap_open call.

An Assert() of what? Note that the similar function
get_relation_info() does about the same thing here.

> heapam_xlog.h:
> +/* reuse xl_heap_multi_insert-only bit for xl_heap_delete */
> I wish this would mention why it's safe to do this. Also, the comment
> mentions xl_heap_delete when the #define is for
> XLOG_HEAP_KILLED_SPECULATIVE_TUPLE; presumably that's wrong. Perhaps:
> /*
>  * reuse XLOG_HEAP_LAST_MULTI_INSERT bit for
>  * XLOG_HEAP_KILLED_SPECULATIVE_TUPLE. This is safe because we never do
>  * multi-inserts for INSERT ON CONFLICT.
>  */

It's safe, as the comment indicates, because the former is only used
for xl_heap_multi_insert records, while the latter is only used for
xl_heap_delete records. There is no ambiguity, because naturally we're
always able to establish what type of record is under consideration
before checking whether the bit is set.

The XLOG_HEAP_* format is used for other flags there, despite the fact
that other flags (like XLOG_HEAP_PREFIX_FROM_OLD) can only appear in
certain context-appropriate xl_heap_* records. AFAICT, all that I've
done that's new here is rely on that, since a bunch of those bits were
used up in the last year or two, and the need to even consider bit
reuse here is a new problem.

> I'll review the remaining patches later.

Thanks.

-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 02/16/2015 02:44 AM, Peter Geoghegan wrote:
> On Sat, Feb 14, 2015 at 2:06 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> I did not solve the potential for livelocks (discussed at
>> http://www.postgresql.org/message-id/CAM3SWZTfTt_fehet3tU3YKCpCYPYnNaUqUZ3Q+NAASnH-60teA@mail.gmail.com).
>> The patch always super-deletes the already-inserted tuple on conflict, and
>> then waits for the other inserter. That would be nice to fix, using one of
>> the ideas from that thread, or something else.
>
> How about we don't super-delete at all with exclusion constraints? I'd
> be willing to accept unprincipled deadlocks for exclusion constraints,
> because they already exist today for UPSERT/NOSERT type use cases, and
> with idiomatic usage seem far less likely for the IGNORE variant
> (especially with exclusion constraints).

So INSERT ON CONFLICT IGNORE on a table with an exclusion constraint 
might fail. I don't like that. The point of having the command in the 
first place is to deal with concurrency issues. If it sometimes doesn't 
work, it's broken.

> I can see people using ON
> CONFLICT UPDATE where they'd almost or actually be better off using a
> plain UPDATE - that's quite a different situation. I find livelocks to
> be a *very* scary prospect, and I don't think the remediations that
> were mentioned are going to fly. It's just too messy, and too much of
> a niche use case. TBH I am skeptical of the idea that we can fix
> exclusion constraints properly in this way at all, at least not
> without the exponential backoff thing, which just seems horrible.

The idea of comparing the TIDs of the tuples as a tie-breaker seems most 
promising to me. If the conflicting tuple's TID is smaller than the 
inserted tuple's, super-delete and wait. Otherwise, wait without 
super-deletion. That's really simple. Do you see a problem with that?
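
Roughly, with illustrative names (ItemPointerCompare() is the existing
TID comparator; the two helpers are placeholders, not patch functions):

/*
 * Exactly one of two conflicting speculative inserters backs off, so
 * the pair can neither deadlock nor livelock.
 */
if (ItemPointerCompare(&conflictTuple->t_self, &ourTuple->t_self) < 0)
{
    super_delete(rel, ourTuple);            /* kill our own tuple first */
    wait_for_other_inserter(conflictTuple); /* then wait */
}
else
    wait_for_other_inserter(conflictTuple); /* wait, keeping our tuple */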

> Although you understood what I was on about when I first talked about
> unprincipled deadlocks, I think that acceptance of that idea came
> much later from other people, because it's damn complicated.

BTW, it would be good to explain somewhere in comments or a README the term
"unprincipled deadlock". It's been a useful concept, and hard to grasp. 
If you defined it once, with examples and everything, then you could 
just have "See .../README" in other places that need to refer it.

> It's not worth it to add
> some weird "Dining philosophers" exponential backoff thing to make
> sure that the IGNORE variant when used with exclusion constraints can
> never deadlock in an unprincipled way, but if it is worth it then it
> seems unreasonable to suppose that this patch needs to solve that
> pre-existing problem. No?

The point of solving the existing problem is that it allows us to split 
the patch, for easier review.

>> Are we happy with acquiring the SpeculativeInsertLock heavy-weight lock for
>> every insertion? That seems bad for performance reasons. Also, are we happy
>> with adding the new fields to the proc array? Another alternative that was
>> discussed was storing the speculative insertion token on the heap tuple
>> itself. (See
>> http://www.postgresql.org/message-id/52D00D2D.6030307@vmware.com)
>
> Whatever works, really. I can't say that the performance implications
> of acquiring that hwlock are at the forefront of my mind. I never
> found that to be a big problem on an 8 core box, relative to vanilla
> INSERTs, FWIW - lock contention is not normal, and may be where any
> heavyweight lock costs would really be encountered.

Oh, cool. I guess the fast-path in lmgr.c kicks ass, then :-).

> I don't see how storing the promise token in heap tuples buys us not
> having to involve heavyweight locking of tokens. (I think you may have
> made a thinko in suggesting otherwise)

It wouldn't get rid of heavyweight locks, but it would allow getting rid 
of the procarray changes. The inserter could generate a token, then 
acquire the hw-lock on the token, and lastly insert the heap tuple with 
the correct token.
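
In outline (function names illustrative; nothing of the sort exists in
the posted patches):

uint32 token = SpeculativeTokenNew();       /* 1. generate a token */
SpeculativeTokenLockAcquire(token);         /* 2. hw-lock it first */
HeapTupleSetSpeculativeToken(tup, token);   /* 3. stamp the tuple */
heap_insert(rel, tup, cid, HEAP_INSERT_SPECULATIVE, bistate);
/* on conflict: super-delete, release the token lock, retry */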

>> Are we happy with the way super-deletion works? Currently, the xmin field is
>> set to invalid to mark the tuple as super-deleted. That required checks in
>> HeapTupleSatisfies* functions. One alternative would be to set xmax equal to
>> xmin, and teach the code that currently calls XactLockTableWait() to check if
>> xmax=xmin, and not consider the tuple as conflicting.
>
> That couldn't work without further HeapTupleSatisfiesDirty() logic.

Why not?

> Besides, why should one transaction be entitled to insert a
> conflicting value tuple just because a still running transaction
> deleted it (having also inserted the tuple itself)? Didn't one
> prototype version of value locking #2 have that as a bug [1]? In fact,
> originally, didn't the "xmin set to invalid" thing come immediately
> from realizing that that wasn't workable?

Ah, right. So the problem was that some code might assume that if you 
insert a row, delete it in the same transaction, and then insert the 
same value again, the 2nd insertion will always succeed because the 
previous insertion effectively acquired a value-lock on the key.

Ok, so we can't unconditionally always ignore tuples with xmin==xmax. We 
would need an infomask bit to indicate super-deletion, and only ignore 
it if the bit is set.

I'm starting to think that we should bite the bullet and consume an 
infomask bit for this. The infomask bits are a scarce resource, but we 
should use them when it makes sense. It would be good for forensic 
purposes too, to leave a trace that a super-deletion happened.
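
With such a bit - call it HEAP_SUPER_DELETED here, the name being purely
illustrative - the wait-side test might look something like:

/*
 * Only treat xmin == xmax as "not a conflict" when the tuple is actually
 * marked super-deleted; a plain insert-then-delete within one transaction
 * must still act as a value lock (the bug discussed above).
 */
if ((tuple->t_infomask & HEAP_SUPER_DELETED) != 0 &&
    TransactionIdEquals(HeapTupleHeaderGetXmin(tuple),
                        HeapTupleHeaderGetRawXmax(tuple)))
    ;   /* skip XactLockTableWait(); not a conflicting tuple */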

> I too think the tqual.c changes are ugly. I can't see a way around
> using a token lock, though - I would only consider (and only consider)
> refining the interface by which a waiter becomes aware that it must
> wait on the outcome of the inserting xact's speculative
> insertion/value lock (and in particular, that is should not wait on
> xact outcome). We clearly need something to wait on that isn't an
> XID...heavyweight locking of a token value is the obvious way of doing
> that.

Yeah.

> Jim Nasby said something about setting the HEAP_XMIN_INVALID hint bit.
> Maybe he is right...if that can be made to be reliable (always
> WAL-logged), it could be marginally better than setting xmin to
> InvalidTransactionId.

I'm not a big fan of that. The xmin-invalid bit is currently always just 
a hint.

> But only marginally; it seems like your
> discomfort is basically inherent to playing these new games with
> visibility, which is inherent to this design.

No, I get that we're playing games with visibility. I want to find the 
best way to implement those games.

> As I said, I am having a
> hard time seeing a way to do anything more than polish what we have
> here. That's probably fine, though...it hasn't proven to be too
> problematic (exclusion constraints aside).
>
> Not having to change HeapTupleSatisfiesMVCC() and so on (to check if
> xmin = InvalidTransactionId) is not obviously a useful goal here,
> since ISTM that any alternative would have to *logically* do the same
> thing. If I'm off the mark about your thinking here, please correct
> me....are you just worried about extra cycles in the
> HeapTupleSatisfies* routines?

Extra cycles yes, but even more importantly, code readability and 
maintainability.

- Heikki




Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Andres Freund
Date:
On 2015-02-16 10:00:24 +0200, Heikki Linnakangas wrote:
> On 02/16/2015 02:44 AM, Peter Geoghegan wrote:
> >>Are we happy with acquiring the SpeculativeInsertLock heavy-weight lock for
> >>every insertion? That seems bad for performance reasons. Also, are we happy
> >>with adding the new fields to the proc array? Another alternative that was
> >>discussed was storing the speculative insertion token on the heap tuple
> >>itself. (See
> >>http://www.postgresql.org/message-id/52D00D2D.6030307@vmware.com)
> >
> >Whatever works, really. I can't say that the performance implications
> >of acquiring that hwlock are at the forefront of my mind. I never
> >found that to be a big problem on an 8 core box, relative to vanilla
> >INSERTs, FWIW - lock contention is not normal, and may be where any
> >heavyweight lock costs would really be encountered.
> 
> Oh, cool. I guess the fast-path in lmgr.c kicks ass, then :-).

I don't think it actually uses the fast path, does it? IIRC that's
restricted to LOCKTAG_RELATION when done via LockAcquireExtended + open
coded for the VirtualXactLock table.

I'm not super worried atm either. Worth checking, but probably not worth
obsessing over.

> >Besides, why should one transaction be entitled to insert a
> >conflicting value tuple just because a still running transaction
> >deleted it (having also inserted the tuple itself)? Didn't one
> >prototype version of value locking #2 have that as a bug [1]? In fact,
> >originally, didn't the "xmin set to invalid" thing come immediately
> >from realizing that that wasn't workable?
> 
> Ah, right. So the problem was that some code might assume that if you insert
> a row, delete it in the same transaction, and then insert the same value
> again, the 2nd insertion will always succeed because the previous insertion
> effectively acquired a value-lock on the key.
> 
> Ok, so we can't unconditionally always ignore tuples with xmin==xmax. We
> would need an infomask bit to indicate super-deletion, and only ignore it if
> the bit is set.
> 
> I'm starting to think that we should bite the bullet and consume an infomask
> bit for this. The infomask bits are a scarce resource, but we should use
> them when it makes sense. It would be good for forensic purposes too, to
> leave a trace that a super-deletion happened.

Well, we IIRC don't have any left right now. We could reuse
MOVED_IN|MOVED_OUT, as that never was set in the past. But it'd
essentially use two infomask bits forever...

> >Jim Nasby said something about setting the HEAP_XMIN_INVALID hint bit.
> >Maybe he is right...if that can be made to be reliable (always
> >WAL-logged), it could be marginally better than setting xmin to
> >invalidTransactionId.
> 
> I'm not a big fan of that. The xmin-invalid bit is currently always just a
> hint.

We already rely on XMIN_INVALID being set correctly for
freezing. C.f. HeapTupleHeaderXminFrozen, HeapTupleHeaderXminInvalid, et
al. So it'd not necessarily end up being that bad. And the super
deletion could easily just set it in the course of its WAL logging.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Mon, Feb 16, 2015 at 12:00 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> So INSERT ON CONFLICT IGNORE on a table with an exclusion constraint might
> fail. I don't like that. The point of having the command in the first place
> is to deal with concurrency issues. If it sometimes doesn't work, it's
> broken.

I don't like it either, although I think it wouldn't come up very
often with exclusion constraints - basically, it would rarely be
noticed due to the different use cases. To be honest, in suggesting
this idea I was hedging against us not figuring out a solution to that
problem in time. You didn't like my suggestion of dropping IGNORE
entirely, either. I'll do my best to come up with something, but I'm
uncomfortable that at this late stage, ON CONFLICT IGNORE support for
exclusion constraints seems like a risk to the entire project.

I ask that if push comes to shove you show some flexibility here. I'll
try my best to ensure that you don't have to even consider committing
something with a notable omission. You don't have to give me an answer
to this now.

> The idea of comparing the TIDs of the tuples as a tie-breaker seems most
> promising to me. If the conflicting tuple's TID is smaller than the inserted
> tuple's, super-delete and wait. Otherwise, wait without super-deletion.
> That's really simple. Do you see a problem with that?

No. I'll work on that, and see how it stands up to stress testing.
Come to think of it, that does seem most promising.

> BTW, it would be good to explain somewhere in comments or a README the term
> "unprincipled deadlock". It's been a useful concept, and hard to grasp. If
> you defined it once, with examples and everything, then you could just have
> "See .../README" in other places that need to refer it.

Agreed. I have described those in the revised executor README, in case
you missed that. Do you think they ought to have their own section?
Note that hash indexes have "unprincipled deadlock" issues, but no one
has bothered to fix them.

>> Whatever works, really. I can't say that the performance implications
>> of acquiring that hwlock are at the forefront of my mind. I never
>> found that to be a big problem on an 8 core box, relative to vanilla
>> INSERTs, FWIW - lock contention is not normal, and may be where any
>> heavweight lock costs would really be encountered.
>
> Oh, cool. I guess the fast-path in lmgr.c kicks ass, then :-).

Seems that way. But even if that wasn't true, it wouldn't matter,
since I don't see that we have a choice.

>> I don't see how storing the promise token in heap tuples buys us not
>> having to involve heavyweight locking of tokens. (I think you may have
>> made a thinko in suggesting otherwise)
>
> It wouldn't get rid of heavyweight locks, but it would allow getting rid of
> the procarray changes. The inserter could generate a token, then acquire the
> hw-lock on the token, and lastly insert the heap tuple with the correct
> token.

Do you really think that's worth the disk overhead? I generally agree
with the "zero overhead" principle, which is that anyone not using the
feature shouldn't pay any price for it (or vanishingly close to no
price). Can't say that I have a good sense of the added distributed
cost (if any) to be paid by adding new fields to the PGPROC struct
(since the PGXACT struct was added in 9.2). Is your only concern that
the PGPROC array will now have more fields, making it more
complicated? Surely that's a price worth paying to avoid making these
heap tuples artificially somewhat larger. You're probably left with
tuples that are at least 8 bytes larger, once alignment is taken into
consideration. That doesn't seem great.
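
The arithmetic, as a quick standalone sanity check (assuming 8-byte
MAXALIGN and the nominal 23-byte heap tuple header):

#include <stdio.h>

#define MAXALIGN(LEN)  (((unsigned long) (LEN) + 7) & ~((unsigned long) 7))

int
main(void)
{
    unsigned long hdr = 23;     /* sizeof(HeapTupleHeaderData), nominally */

    printf("aligned header today:      %lu\n", MAXALIGN(hdr));      /* 24 */
    printf("aligned header with token: %lu\n", MAXALIGN(hdr + 4));  /* 32 */
    return 0;
}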

>> That couldn't work without further HeapTupleSatisfiesDirty() logic.
>
> Why not?

Just meant that it wasn't sufficient to check xmin == xmax, while
allowing that something like that could work with extra work (e.g. the
use of infomask bits)...

> Ok, so we can't unconditionally always ignore tuples with xmin==xmax. We
> would need an infomask bit to indicate super-deletion, and only ignore it if
> the bit is set.

...which is what you say here.

> I'm starting to think that we should bite the bullet and consume an infomask
> bit for this. The infomask bits are a scarce resource, but we should use
> them when it makes sense. It would be good for forensic purposes too, to
> leave a trace that a super-deletion happened.
>
>> I too think the tqual.c changes are ugly. I can't see a way around
>> using a token lock, though - I would only consider (and only consider)
>> refining the interface by which a waiter becomes aware that it must
>> wait on the outcome of the inserting xact's speculative
>> insertion/value lock (and in particular, that it should not wait on
>> xact outcome). We clearly need something to wait on that isn't an
>> XID...heavyweight locking of a token value is the obvious way of doing
>> that.
>
>
> Yeah.
>
>> Jim Nasby said something about setting the HEAP_XMIN_INVALID hint bit.
>> Maybe he is right...if that can be made to be reliable (always
>> WAL-logged), it could be marginally better than setting xmin to
>> InvalidTransactionId.
>
>
> I'm not a big fan of that. The xmin-invalid bit is currently always just a
> hint.

Well, Andres makes the point that that isn't quite so. TBH, I have a
hard time justifying the use of MOVED_IN|MOVED_OUT...it's not as if
there is a correctness problem with either setting xmin to
InvalidTransactionId, or setting the xmin-invalid hint bit (with
appropriate precautions so that it's not really just a hint). And as
was pointed out, there is something to be said for preserving tuple
header xmin where possible, for forensic reasons.

>> But only marginally; it seems like your
>> discomfort is basically inherent to playing these new games with
>> visibility, which is inherent to this design.
>
>
> No, I get that we're playing games with visibility. I want to find the best
> way to implement those games.

That's useful information. Thanks.

>> are you just worried about extra cycles in the
>> HeapTupleSatisfies* routines?
>
> Extra cycles yes, but even more importantly, code readability and
> maintainability.

Sure.

-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Mon, Feb 16, 2015 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> Jim Nasby said something about setting the HEAP_XMIN_INVALID hint bit.
>>> Maybe he is right...if that can be made to be reliable (always
>>> WAL-logged), it could be marginally better than setting xmin to
>>> InvalidTransactionId.
>>
>>
>> I'm not a big fan of that. The xmin-invalid bit is currently always just a
>> hint.
>
> Well, Andres makes the point that that isn't quite so.

Hmm. So the tqual.c routines actually check "if
(HeapTupleHeaderXminInvalid(tuple))". Which is:

#define HeapTupleHeaderXminInvalid(tup) \
( \
    ((tup)->t_infomask & (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)) == \
        HEAP_XMIN_INVALID \
)

What appears to happen if I try to only use the HEAP_XMIN_INVALID bit
(and not explicitly set xmin to InvalidTransactionId, and not
explicitly check that xmin isn't InvalidTransactionId in each
HeapTupleSatisfies* routine) is that after a little while, Jeff Janes'
tool turns up spurious unique violations, as if some tuple has become
visible when it shouldn't have. I guess this is because the
HEAP_XMIN_COMMITTED hint bit can still be set, which in effect
invalidates the HEAP_XMIN_INVALID hint bit.

It takes about 2 minutes before this happens for the first time when
fsync = off, following a fresh initdb, which is unacceptable. I'm not
sure if it's worth trying to figure out how HEAP_XMIN_COMMITTED gets
set. Not that I'm 100% sure that that's what this really is; it just
seems very likely.

Attached broken patch (broken_visible.patch) shows the changes made to
reveal this problem. Only the changes to tqual.c and heap_delete()
should matter here, since I did not test recovery.

I also thought about unifying the check for "if
(TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple),
InvalidTransactionId))" with the conventional
HeapTupleHeaderXminInvalid() macro, and leaving everything else as-is.
This is no good either, and for similar reasons - control often won't
reach the macro, which is behind a check of "if
(!HeapTupleHeaderXminCommitted(tuple))".

The best I think we can hope for is having a dedicated "if
(TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple),
InvalidTransactionId))" macro HeapTupleHeaderSuperDeleted() to do the
work each time, which does not need to be checked so often. A second
attached patch (compact_tqual_works.patch - which is non-broken,
AFAICT) shows how this is possible, while also moving the check
further down within each tqual.c routine (which seems more in keeping
with the fact that super deletion is a relatively obscure concept). I
haven't been able to break this variant using my existing test suite,
so this seems promising as a way of reducing tqual.c clutter. However,
as I said the other day, this is basically just polish.
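
The macro itself would presumably be trivial, in terms of the existing
htup_details.h macros (a sketch; the actual patch may differ in detail):

/*
 * True when the tuple was "super-deleted", i.e. its xmin was overwritten
 * with InvalidTransactionId by the speculative inserter.
 */
#define HeapTupleHeaderSuperDeleted(tup) \
    (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tup), \
                         InvalidTransactionId))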

--
Peter Geoghegan

Attachment

Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Mon, Feb 16, 2015 at 6:32 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The best I think we can hope for is having a dedicated "if
> (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple),
> InvalidTransactionId))" macro HeapTupleHeaderSuperDeleted() to do the
> work each time, which does not need to be checked so often. A second
> attached patch (compact_tqual_works.patch - which is non-broken,
> AFAICT) shows how this is possible, while also moving the check
> further down within each tqual.c routine (which seems more in keeping
> with the fact that super deletion is a relatively obscure concept).

I attach the patch series of V2.3. Highlights:

* Overhaul of speculative insertion related changes within tqual.c.
Refactored for readability as outlined in my earlier comments quoted
above. Assertions added, serving to show exactly where super deleted
tuples are and are not expected.

* Formally forbid INSERT ... ON CONFLICT into system catalogs. If
nothing else, this obviates the need for historic snapshots to care
about super deleted tuples.

* Minor setrefs.c tweaks. Minor ExecInitModifyTable() tweaks, too.

* Fix for minor bitrot against master branch.

* Further comments on the speculativeInsertionToken per-backend variable
  (see the sketch just below this list).

* Livelock insurance for exclusion constraints.
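
For reference, the token discipline sketched minimally (assumed shape;
zero is reserved to mean "no speculative insertion in progress", which
is also why wraparound is harmless):

static uint32 speculativeInsertionToken = 0;    /* per-backend */

static uint32
SpeculativeInsertionNextToken(void)
{
    /* 0 is the "no token" sentinel, so skip it on wraparound */
    if (++speculativeInsertionToken == 0)
        speculativeInsertionToken = 1;
    return speculativeInsertionToken;
}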

Importantly, Heikki wanted us to break out the patch to fix the
current problem of theoretical deadlock risks [1] ahead of committing
ON CONFLICT UPDATE/IGNORE. Heikki acknowledged that there were still
theoretical livelock risks in his reworked minimal patch. After
careful consideration, I have not broken out the changes to do this
incrementally along the lines that Heikki suggested.

Heikki seemed to think that the deadlock problems were not really
worth fixing independently of ON CONFLICT UPDATE support, but rather
represented a useful way of committing code incrementally. Do I have
that right? Certainly, anyone would agree that unprincipled deadlocks
(for regular inserters with exclusion constraints) are better than
livelocks. Heikki did not address the livelock risks with his minimal
reworked patch, which I've done here for ON CONFLICT.

The way I chose to break the livelock (what I call "livelock
insurance") does indeed involve comparing XIDs, which Heikki thought
most promising. But it also involves having the oldest XID wait on
another session's speculative token in the second phase, which
ordinarily does not occur in the second
phase/check_exclusion_or_unique_constraint() call. The idea is that
one session is guaranteed to be the waiter that has a second iteration
within its second-phase check_exclusion_or_unique_constraint() call,
where (following the super deletion of conflict TIDs by the other
conflicting session or sessions) it reliably finds that it can proceed
(those other sessions are denied the opportunity to reach their second
phase by our second-phase waiter's still-not-super-deleted tuple).
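
Schematically, with purely illustrative names:

/*
 * Second phase, on finding a conflicting speculative tuple: only the
 * session with the oldest XID waits out the other side's token; everyone
 * else super-deletes their own tuple and restarts from the pre-check.
 */
if (TransactionIdPrecedes(GetCurrentTransactionId(), otherXid))
    SpeculativeTokenWait(otherToken);       /* we are oldest: wait */
else
{
    super_delete_own_tuple(rel, ourTuple);  /* back off... */
    restart_from_precheck();                /* ...and start over */
}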

However, it's difficult to see how to map this on to general exclusion
constraint insertion + enforcement. In Heikki's recent sketch of this
[1], there is no pre-check, since that is considered an "UPSERT thing"
deferred until a later patch, and therefore my scheme here cannot work
(recall that for plain inserts with exclusion constraints, we insert
first and check last). I have a hard time justifying adding the
pre-check for plain exclusion constraint inserters given the total
lack of complaints from the field regarding unprincipled deadlocks,
and given the fact that it would probably make the code more
complicated than it needs to be. It is critical that there be a
pre-check to prevent livelock, though, because otherwise the
restarting sessions can go straight to their "second" phase
(check_exclusion_or_unique_constraint() call), without ever realizing
that they should not do so. Therefore, as I said, I have not broken
out the code in line with Heikki's suggestion.

It's possible that I have it wrong here - I was wrong to dismiss
Heikki's contention that the livelock hazards were fixable without too
much pain - but I don't think so.

It seems like the livelock insurance is pretty close to or actually
free, and doesn't imply that a "second phase wait for token" needs to
happen too often. With heavy contention on a small number of possible
tuples (100), and 8 clients deleting and inserting, I can see it
happening only once every couple of hundred milliseconds on my laptop.
It's not hard to imagine why the code didn't obviously appear to be
broken before now, as the window for an actual livelock must have been
small. Also, deadlocks bring about more deadlocks (since the deadlock
detector kicks in), whereas livelocks do not bring about more
livelocks.

I haven't been able to reproduce earlier apparent bugs with exclusion
constraints [2] recently. I can only speculate that they were fixed.
Does anyone with a big server care to run the procedure outlined for
exclusion constraints in the jjanes_upsert tool [3]? It would be nice
to have additional confidence that the exclusion constraint stuff is
robust.

[1] http://www.postgresql.org/message-id/54DFC6F8.5050108@vmware.com
[2] http://www.postgresql.org/message-id/CAM3SWZTkHOwyA5A9ib=uVf0Vs326yoCBdpp_NYkDjM2_-ScxFA@mail.gmail.com
[3] https://github.com/petergeoghegan/jjanes_upsert
--
Peter Geoghegan

Attachment

Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Tue, Feb 10, 2015 at 12:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Then the problem suddenly becomes that previous choices of
>> indexes/statements aren't possible anymore. It seems much better to
>> introduce the syntax now and not have too much of a usecase for
>> it.
>
> The only way the lack of a way of specifying which opclass to use
> could bite us is if there were two *actually* defined unique indexes
> on the same column, each with different "equality" operators. That
> seems like kind of a funny scenario, even if that were quite possible
> (even if non-default opclasses existed that had non-identical
> "equality" operators, which is basically not the case today).
>
> I grant that is a bit odd that we're talking about unique indexes
> definitions affecting semantics, but that is to a certain extent the
> nature of the beast. As a compromise, I suggest having the inference
> specification optionally accept a named opclass per attribute, in the
> style of CREATE INDEX (I'm already reusing a bit of the raw parser
> support for CREATE INDEX, you see) - that'll make inference insist on
> that opclass, rather than make it a strict matter of costing available
> alternatives (not that any alternative is expected with idiomatic
> usage). That implies no additional parser overhead, really. If that's
> considered ugly, then at least it's an ugly thing that literally no
> one will ever use in the foreseeable future...and an ugly thing that
> is no more necessary in CREATE INDEX than here (and yet CREATE INDEX
> lives with the ugliness).

Any thoughts on this, anyone?

AFAICT, only this and the behavior of logical decoding are open items
at this point. I'd like to close out both of those sooner rather than
later.

Thanks
-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 02/18/2015 11:43 PM, Peter Geoghegan wrote:
> Heikki seemed to think that the deadlock problems were not really
> worth fixing independently of ON CONFLICT UPDATE support, but rather
> represented a useful way of committing code incrementally. Do I have
> that right?

Yes.

> The way I chose to break the livelock (what I call "livelock
> insurance") does indeed involve comparing XIDs, which Heikki thought
> most promising. But it also involves having the oldest XID wait on
> another session's speculative token in the second phase, which
> ordinarily does not occur in the second
> phase/check_exclusion_or_unique_constraint() call. The idea is that
> one session is guaranteed to be the waiter that has a second iteration
> within its second-phase check_exclusion_or_unique_constraint() call,
> where (following the super deletion of conflict TIDs by the other
> conflicting session or sessions) it reliably finds that it can proceed
> (those other sessions are denied the opportunity to reach their second
> phase by our second phase waiter's still-not-super-deleted tuple).
>
> However, it's difficult to see how to map this on to general exclusion
> constraint insertion + enforcement. In Heikki's recent sketch of this
> [1], there is no pre-check, since that is considered an "UPSERT thing"
> deferred until a later patch, and therefore my scheme here cannot work
> (recall that for plain inserts with exclusion constraints, we insert
> first and check last). I have a hard time justifying adding the
> pre-check for plain exclusion constraint inserters given the total
> lack of complaints from the field regarding unprincipled deadlocks,
> and given the fact that it would probably make the code more
> complicated than it needs to be. It is critical that there be a
> pre-check to prevent livelock, though, because otherwise the
> restarting sessions can go straight to their "second" phase
> (check_exclusion_or_unique_constraint() call), without ever realizing
> that they should not do so.

Hmm. I haven't looked at your latest patch, but I don't think you need
to pre-check for this to work. To recap, the situation is that two
backends have already inserted the heap tuple, and then see that the
other backend's tuple conflicts. To avoid a livelock, it's enough that
one backend super-deletes its own tuple first, before waiting for the
other to complete, while the other backend waits without
super-deleting. No?

> It seems like the livelock insurance is pretty close to or actually
> free, and doesn't imply that a "second phase wait for token" needs to
> happen too often. With heavy contention on a small number of possible
> tuples (100), and 8 clients deleting and inserting, I can see it
> happening only once every couple of hundred milliseconds on my laptop.
> It's not hard to imagine why the code didn't obviously appear to be
> broken before now, as the window for an actual livelock must have been
> small. Also, deadlocks bring about more deadlocks (since the deadlock
> detector kicks in), whereas livelocks do not bring about more
> livelocks.

It might be easier to provoke the livelocks with a GiST opclass that's
unusually slow. I wrote the attached opclass for the purpose of testing
this a while ago, but I haven't actually gotten around to do much with
it. It's called "useless_gist", because it's a GiST opclass for
integers, like btree_gist, but the penalty and picksplit functions are
totally dumb. The result is that the tuples are put to the index in
pretty much random order, and every scan has to scan the whole index.
I'm posting it here, in the hope that it happens to be useful, but I
don't really know if it is.
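
To give an idea, the penalty function in an opclass like this can be
as trivial as the following (a hypothetical sketch in the same spirit
as the attachment, not the attached code itself):

#include "postgres.h"
#include "fmgr.h"
#include "access/gist.h"

PG_FUNCTION_INFO_V1(useless_gist_penalty);

/*
 * Deliberately dumb GiST penalty function: report every subtree as an
 * equally good insertion target, so tuples end up in a more or less
 * arbitrary order and scans degrade towards whole-index scans.
 */
Datum
useless_gist_penalty(PG_FUNCTION_ARGS)
{
    /* args 0 and 1 are the existing and new GISTENTRYs; ignore both */
    float      *penalty = (float *) PG_GETARG_POINTER(2);

    *penalty = 0.0f;
    PG_RETURN_POINTER(penalty);
}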

- Heikki


Attachment

Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 02/16/2015 11:31 AM, Andres Freund wrote:
> On 2015-02-16 10:00:24 +0200, Heikki Linnakangas wrote:
>> I'm starting to think that we should bite the bullet and consume an infomask
>> bit for this. The infomask bits are a scarce resource, but we should use
>> them when it makes sense. It would be good for forensic purposes too, to
>> leave a trace that a super-deletion happened.
>
> Well, we IIRC don't have any left right now. We could reuse
> MOVED_IN|MOVED_OUT, as that never was set in the past. But it'd
> essentially use two infomask bits forever...

t_infomask is all used, but t_infomask2 has two bits left:

> /*
>  * information stored in t_infomask2:
>  */
> #define HEAP_NATTS_MASK            0x07FF    /* 11 bits for number of attributes */
> /* bits 0x1800 are available */
> #define HEAP_KEYS_UPDATED        0x2000    /* tuple was updated and key cols
>                                          * modified, or tuple deleted */
> #define HEAP_HOT_UPDATED        0x4000    /* tuple was HOT-updated */
> #define HEAP_ONLY_TUPLE            0x8000    /* this is heap-only tuple */
>
> #define HEAP2_XACT_MASK            0xE000    /* visibility-related bits */
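
A super-deletion flag could in principle claim one of those free bits.
A purely hypothetical sketch (the bit choice and the names are mine,
not from any posted patch):

#define HEAP_SUPER_DELETED        0x0800    /* one of the two free bits */

#define HeapTupleHeaderIsSuperDeleted(tup) \
    (((tup)->t_infomask2 & HEAP_SUPER_DELETED) != 0)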

- Heikki




Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Thu, Feb 19, 2015 at 5:21 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Hmm. I haven't looked at your latest patch, but I don't think you need to
> pre-check for this to work. To recap, the situation is that two backends
> have already inserted the heap tuple, and then see that the other backend's
> tuple conflicts. To avoid a livelock, it's enough that one backend
> super-deletes its own tuple first, before waiting for the other to complete,
> while the other backend waits without super-deleting. No?

I fully agree with your summary here. However, why should we suppose
that while we wait, the other backends don't both delete and then
re-insert their tuple? They need the pre-check to know not to
re-insert their tuple (seeing our tuple, immediately after we wake as
the preferred backend with the older XID) in order to break the race.
But today, exclusion constraints are optimistic in that the insert
happens first, and only then the check. The pre-check turns that the
other way around, in a limited though necessary sense.

Granted, it's unlikely that we'd livelock due to one session always
deleting and then nullifying that by re-inserting in time, but the
theoretical risk seems real. Therefore, I'm not inclined to bother
committing something without a pre-check, particularly since we're not
really interested in fixing unprincipled deadlocks for ordinary
exclusion constraint inserters (useful to know that you also think
that doesn't matter, BTW). Does that make sense?

This is explained in the "livelock insurance" comments (new to V2.3)
in check_exclusion_or_unique_constraint().  (Not that I think that
explanation is any easier to follow than this one.)

> It might be easier to provoke the livelocks with a GiST opclass that's
> unusually slow. I wrote the attached opclass for the purpose of testing this
> a while ago, but I haven't actually gotten around to do much with it. It's
> called "useless_gist", because it's a GiST opclass for integers, like
> btree_gist, but the penalty and picksplit functions are totally dumb. The
> result is that the tuples are put to the index in pretty much random order,
> and every scan has to scan the whole index. I'm posting it here, in the hope
> that it happens to be useful, but I don't really know if it is.

Thanks. I'll try to use this for testing. I haven't been able to
break exclusion constraints with the jjanes_upsert test suite in a
long time now.

-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 02/19/2015 08:16 PM, Peter Geoghegan wrote:
> On Thu, Feb 19, 2015 at 5:21 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Hmm. I haven't looked at your latest patch, but I don't think you need to
>> pre-check for this to work. To recap, the situation is that two backends
>> have already inserted the heap tuple, and then see that the other backend's
>> tuple conflicts. To avoid a livelock, it's enough that one backend
>> super-deletes its own tuple first, before waiting for the other to complete,
>> while the other backend waits without super-deleting. No?
>
> I fully agree with your summary here. However, why should we suppose
> that while we wait, the other backends don't both delete and then
> re-insert their tuple? They need the pre-check to know not to
> re-insert their tuple (seeing our tuple, immediately after we wake as
> the preferred backend with the older XID) in order to break the race.
> But today, exclusion constraints are optimistic in that the insert
> happens first, and only then the check. The pre-check turns that the
> other way around, in a limited though necessary sense.

I'm not sure I understand exactly what you're saying, but AFAICS the 
pre-check doesn't completely solve that either. It's entirely possible 
that the other backend deletes its tuple, our backend then performs the 
pre-check, and the other backend re-inserts its tuple again. Sure, the 
pre-check reduces the chances, but we're talking about a rare condition 
to begin with, so I don't think it makes sense to add much code just to 
reduce the chances further.

- Heikki




Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Thu, Feb 19, 2015 at 11:10 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> I fully agree with your summary here. However, why should we suppose
>> that while we wait, the other backends don't both delete and then
>> re-insert their tuple? They need the pre-check to know not to
>> re-insert their tuple (seeing our tuple, immediately after we wake as
>> the preferred backend with the older XID) in order to break the race.
>> But today, exclusion constraints are optimistic in that the insert
>> happens first, and only then the check. The pre-check turns that the
>> other way around, in a limited though necessary sense.
>
> I'm not sure I understand exactly what you're saying, but AFAICS the
> pre-check doesn't completely solve that either. It's entirely possible that
> the other backend deletes its tuple, our backend then performs the
> pre-check, and the other backend re-inserts its tuple again. Sure, the
> pre-check reduces the chances, but we're talking about a rare condition to
> begin with, so I don't think it makes sense to add much code just to reduce
> the chances further.

But super deletion occurs *before* releasing the token lock, which is
the last thing we do before looping around and starting again. So iff
we're the oldest XID, the one that gets to "win" by unexpectedly
waiting on another's token in our second phase (second call to
check_exclusion_or_unique_constraint()), we will not, in fact, see
anyone else's tuple, because they'll all be forced to go through the
first phase and find our pre-existing, never-deleted tuple, so we
can't see any new tuple from them. And, because they super delete
before releasing their token, they'll definitely have super deleted
when we're woken up, so we can't see any old/existing tuple either. We
have our tuple inserted this whole time - ergo, we do, in fact, "win"
reliably.

The fly in the ointment is regular inserters, perhaps, but we've
agreed that they're not too important here, and even when that happens
we're in "deadlock land", not "livelock land", which is obviously a
nicer place to live.

Does that make sense?
-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 02/19/2015 10:09 PM, Peter Geoghegan wrote:
> On Thu, Feb 19, 2015 at 11:10 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>> I fully agree with your summary here. However, why should we suppose
>>> that while we wait, the other backends don't both delete and then
>>> re-insert their tuple? They need the pre-check to know not to
>>> re-insert their tuple (seeing our tuple, immediately after we wake as
>>> the preferred backend with the older XID) in order to break the race.
>>> But today, exclusion constraints are optimistic in that the insert
>>> happens first, and only then the check. The pre-check turns that the
>>> other way around, in a limited though necessary sense.
>>
>> I'm not sure I understand exactly what you're saying, but AFAICS the
>> pre-check doesn't completely solve that either. It's entirely possible that
>> the other backend deletes its tuple, our backend then performs the
>> pre-check, and the other backend re-inserts its tuple again. Sure, the
>> pre-check reduces the chances, but we're talking about a rare condition to
>> begin with, so I don't think it makes sense to add much code just to reduce
>> the chances further.
>
> But super deletion occurs *before* releasing the token lock, which is
> the last thing we do before looping around and starting again. So iff
> we're the oldest XID, the one that gets to "win" by unexpectedly
> waiting on another's token in our second phase (second call to
> check_exclusion_or_unique_constraint()), we will not, in fact, see
> anyone else's tuple, because they'll all be forced to go through the
> first phase and find our pre-existing, never-deleted tuple, so we
> can't see any new tuple from them. And, because they super delete
> before releasing their token, they'll definitely have super deleted
> when we're woken up, so we can't see any old/existing tuple either. We
> have our tuple inserted this whole time - ergo, we do, in fact, "win"
> reliably.

So, um, are you agreeing that there is no problem? Or did I 
misunderstand? If you see a potential issue here, can you explain it as 
a simple list of steps, please.

- Heikki



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Fri, Feb 20, 2015 at 11:34 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> So, um, are you agreeing that there is no problem? Or did I misunderstand?
> If you see a potential issue here, can you explain it as a simple list of
> steps, please.

Yes. I'm saying that AFAICT, there is no livelock hazard provided
other sessions must do the pre-check (as they must for ON CONFLICT
IGNORE). So I continue to believe that they must pre-check, which you
questioned.

The only real downside here (with not doing the pre-check for regular
inserters with exclusion constraints) is that we can't fix
unprincipled deadlocks for general exclusion constraint inserters
(since we don't want to make them pre-check), but we don't actually
care about that directly. But it also means that it's hard to see how
we can incrementally commit such a fix to break down the patch into
more manageable chunks, which is a little unfortunate.

Hard to break down the problem into steps, since it isn't a problem
that I was able to recreate (as a noticeable livelock). But the point
of what I was saying is that the first-phase pre-check lets the other
sessions notice the tuple of the one session that got stuck waiting in
the second phase (those other conflicters see its tuple, and so don't
insert a new tuple of their own this iteration).

Conventional insertion with exclusion constraints inserts first and
only then looks for conflicts (since there is no pre-check). More
concretely, if you're the transaction that does not "break" here,
within check_exclusion_or_unique_constraint(), and so unexpectedly
waits in the second phase:

+     /*
+      * At this point we have either a conflict or a potential conflict. If
+      * we're not supposed to raise error, just return the fact of the
+      * potential conflict without waiting to see if it's real.
+      */
+     if (violationOK && !wait)
+     {
+         /*
+          * For unique indexes, detecting conflict is coupled with physical
+          * index tuple insertion, so we won't be called for recheck
+          */
+         Assert(!indexInfo->ii_Unique);
+
+         conflict = true;
+         if (conflictTid)
+             *conflictTid = tup->t_self;
+
+         /*
+          * Livelock insurance.
+          *
+          * When doing a speculative insertion pre-check, we cannot have an
+          * "unprincipled deadlock" with another session, fundamentally
+          * because there is no possible mutual dependency, since we only
+          * hold a lock on our token, without attempting to lock anything
+          * else (maybe this is not the first iteration, but no matter;
+          * we'll have super deleted and released insertion token lock if
+          * so, and all locks needed are already held.  Also, our XID lock
+          * is irrelevant.)
+          *
+          * In the second phase, where there is a re-check for conflicts, we
+          * can't deadlock either (we never lock another thing, since we
+          * don't wait in that phase).  However, a theoretical livelock
+          * hazard exists:  Two sessions could each see each other's
+          * conflicting tuple, and each could go and delete, retrying
+          * forever.
+          *
+          * To break the mutual dependency, we may wait on the other xact
+          * here over our caller's request to not do so (in the second
+          * phase).  This does not imply the risk of unprincipled deadlocks
+          * either, because if we end up unexpectedly waiting, the other
+          * session will super delete its own tuple *before* releasing its
+          * token lock and freeing us, and without attempting to wait on us
+          * to release our token lock.  We'll take another iteration here,
+          * after waiting on the other session's token, not find a conflict
+          * this time, and then proceed (assuming we're the oldest XID).
+          *
+          * N.B.:  Unprincipled deadlocks are still theoretically possible
+          * with non-speculative insertion with exclusion constraints, but
+          * this seems inconsequential, since an error was inevitable for
+          * one of the sessions anyway.  We only worry about speculative
+          * insertion's problems, since they're likely with idiomatic usage.
+          */
+         if (TransactionIdPrecedes(xwait, GetCurrentTransactionId()))
+             break;  /* go and super delete/restart speculative insertion */
+     }
+

Then you must successfully insert when you finally "goto retry" and do
another iteration within check_exclusion_or_unique_constraint(). The
other conflicters can't have failed to notice your pre-existing tuple,
and can't have failed to super delete their own tuples before you are
woken (provided you really are the single oldest XID).

Is that any clearer?
-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 02/20/2015 10:39 PM, Peter Geoghegan wrote:
> On Fri, Feb 20, 2015 at 11:34 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> So, um, are you agreeing that there is no problem? Or did I misunderstand?
>> If you see a potential issue here, can you explain it as a simple list of
>> steps, please.
>
> Yes. I'm saying that AFAICT, there is no livelock hazard provided
> other sessions must do the pre-check (as they must for ON CONFLICT
> IGNORE). So I continue to believe that they must pre-check, which you
> questioned.
> ...
> Hard to break down the problem into steps, since it isn't a problem
> that I was able to recreate (as a noticeable livelock).

Then I refuse to believe that the livelock hazard exists, without the 
pre-check. If you have a livelock scenario in mind, it really shouldn't 
be that difficult to write down the list of steps.
- Heikki




Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Fri, Feb 20, 2015 at 1:07 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Then I refuse to believe that the livelock hazard exists, without the
> pre-check. If you have a livelock scenario in mind, it really shouldn't be
> that difficult to write down the list of steps.

I just meant practical, recreatable steps - a test case. I should
emphasize that what I'm saying is not that important. Even if I am
wrong, I'm not suggesting that we do anything that we don't both agree
is needed anyway. If I'm right, that is merely an impediment to
incrementally committing the work by "fixing" exclusion constraints,
AFAICT. Ultimately, that isn't all that important. Anyway, here is how
I think livelocks could happen, in theory, with regular insertion into
a table with exclusion constraints, with your patch [1] applied (which
has no pre-check):

* Session 1 physically inserts, and then checks for a conflict.

* Session 2 physically inserts, and then checks for a conflict.

* Session 1 sees session 2's conflicting TID, then super deletes and
releases token.

* Session 2 sees session 1's conflicting TID, then super deletes and
releases token.

* Session 1 waits or tries to wait on session 2's token. It isn't held
anymore, or is only held for an instant.

* Session 2 waits or tries to wait on session 1's token. It isn't held
anymore, or is only held for an instant.

* Session 1 restarts from scratch, having made no useful progress in
respect of the slot being inserted.

* Session 2 restarts from scratch, having made no useful progress in
respect of the slot being inserted.

(Livelock)

If there is a tie-break on XID (you omitted this from your patch [1]
but acknowledge it as an omission), then that doesn't really change
anything (without adding a pre-check, too). That's because: What do we
actually do or not do when we're the oldest XID, that gets to "win"?
Do we not wait on anything, and just declare that we're done? Then I
think that breaks exclusion constraint enforcement, because we need to
rescan the index to do that (i.e., "goto retry"). Do we wait on their
token, as my most recent revision does, but *without* a pre-check, for
regular inserters? Then I think that our old tuple could keep multiple
other sessions spinning indefinitely. Only by checking for conflicts
*first*, before leaving behind a non-super-deleted physical index
tuple, can these other sessions notice that there is a conflict
*without creating more conflicts*, which is what I believe is really
needed. At the very least it's something I'm much more comfortable
with, and that seems like reason enough to do it that way, given that
we don't actually care about unprincipled deadlocks with regular
inserters with exclusion constraints. Why take the chance of
livelocking such inserters, though?

I hope that we don't get bogged down on this, because, as I said, it
doesn't matter too much. I'm tempted to concede the point for that
reason, since the livelock is probably virtually impossible. I'm just
giving you my opinion on how to make the handling of exclusion
constraints as reliable as possible.

Thanks

[1] http://www.postgresql.org/message-id/54DFC6F8.5050108@vmware.com
-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 02/21/2015 12:15 AM, Peter Geoghegan wrote:
> On Fri, Feb 20, 2015 at 1:07 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Then I refuse to believe that the livelock hazard exists, without the
>> pre-check. If you have a livelock scenario in mind, it really shouldn't be
>> that difficult to write down the list of steps.
>
> I just meant practical, recreatable steps - a test case. I should
> emphasize that what I'm saying is not that important. Even if I am
> wrong, I'm not suggesting that we do anything that we don't both agree
> is needed anyway. If I'm right, that is merely an impediment to
> incrementally committing the work by "fixing" exclusion constraints,
> AFAICT. Ultimately, that isn't all that important. Anyway, here is how
> I think livelocks could happen, in theory, with regular insertion into
> a table with exclusion constraints, with your patch [1] applied (which
> has no pre-check):
>
> * Session 1 physically inserts, and then checks for a conflict.
>
> * Session 2 physically inserts, and then checks for a conflict.
>
> * Session 1 sees session 2's conflicting TID, then super deletes and
> releases token.
>
> * Session 2 sees session 1's conflicting TID, then super deletes and
> releases token.
>
> * Session 1 waits or tries to wait on session 2's token. It isn't held
> anymore, or is only held for an instant.
>
> * Session 2 waits or tries to wait on session 1's token. It isn't held
> anymore, or is only held for an instant.
>
> * Session 1 restarts from scratch, having made no useful progress in
> respect of the slot being inserted.
>
> * Session 2 restarts from scratch, having made no useful progress in
> respect of the slot being inserted.
>
> (Livelock)
>
> If there is a tie-break on XID (you omitted this from your patch [1]
> but acknowledge it as an omission), then that doesn't really change
> anything (without adding a pre-check, too). That's because: What do we
> actually do or not do when we're the oldest XID, that gets to "win"?

Ah, ok, I can see the confusion now.

> Do we not wait on anything, and just declare that we're done? Then I
> think that breaks exclusion constraint enforcement, because we need to
> rescan the index to do that (i.e., "goto retry"). Do we wait on their
> token, as my most recent revision does, but *without* a pre-check, for
> regular inserters? Then I think that our old tuple could keep multiple
> other sessions spinning indefinitely.

What I had in mind is that the "winning" inserter waits on the other 
inserter's token, without super-deleting. Like all inserts do today. So 
the above scenario becomes:

* Session 1 physically inserts, and then checks for a conflict.

* Session 2 physically inserts, and then checks for a conflict.

* Session 1 sees session 2's conflicting TID. Session 1's XID is older, 
so it "wins". It waits for session 2's token, without super-deleting.

* Session 2 sees session 1's conflicting TID. It super deletes,
releases token, and sleeps on session 1's token.

* Session 1 wakes up. It looks at session 2's tuple again and sees that 
it was super-deleted. There are no further conflicts, so the insertion 
is complete, and it releases the token.

* Session 2 wakes up. It looks at session 1's tuple again and sees that 
it's still there. It goes back to sleep, this time on session 1's XID.

* Session 1 commits. Session 2 wakes up, sees that the tuple is still 
there, and throws a "contraint violation" error.

- Heikki




Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Sat, Feb 21, 2015 at 11:15 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Ah, ok, I can see the confusion now.

Cool.

>> Do we not wait on anything, and just declare that we're done? Then I
>> think that breaks exclusion constraint enforcement, because we need to
>> rescan the index to do that (i.e., "goto retry"). Do we wait on their
>> token, as my most recent revision does, but *without* a pre-check, for
>> regular inserters? Then I think that our old tuple could keep multiple
>> other sessions spinning indefinitely.
>
> What I had in mind is that the "winning" inserter waits on the other
> inserter's token, without super-deleting. Like all inserts do today. So the
> above scenario becomes:
>
> * Session 1 physically inserts, and then checks for a conflict.
>
> * Session 2 physically inserts, and then checks for a conflict.
>
> * Session 1 sees session 2's conflicting TID. Session 1's XID is older, so
> it "wins". It waits for session 2's token, without super-deleting.
>
> * Session 2 sees session 1's conflicting TID. It super deletes,
> releases token, and sleeps on session 1's token.
>
> * Session 1 wakes up. It looks at session 2's tuple again and sees that it
> was super-deleted. There are no further conflicts, so the insertion is
> complete, and it releases the token.
>
> * Session 2 wakes up. It looks at session 1's tuple again and sees that it's
> still there. It goes back to sleep, this time on session 1's XID.
>
> * Session 1 commits. Session 2 wakes up, sees that the tuple is still there,
> and throws a "contraint violation" error.

I think we're actually 100% in agreement, then. I just prefer to have
the second-to-last step (the "loser"'s re-check, made without a
promise tuple of its own visible to anyone) occur as part of the
pre-check that happens anyway with ON CONFLICT IGNORE. Otherwise,
you'll end up with some much more complicated control flow that has to
care about not doing that twice for ON CONFLICT IGNORE...and for the
benefit of what? For regular inserters, whose unprincipled deadlocks
we don't actually care about fixing?

Are we on the same page now?
-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 02/21/2015 10:41 PM, Peter Geoghegan wrote:
> On Sat, Feb 21, 2015 at 11:15 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> What I had in mind is that the "winning" inserter waits on the other
>> inserter's token, without super-deleting. Like all inserts do today. So the
>> above scenario becomes:
>>
>> * Session 1 physically inserts, and then checks for a conflict.
>>
>> * Session 2 physically inserts, and then checks for a conflict.
>>
>> * Session 1 sees session 2's conflicting TID. Session 1's XID is older, so
>> it "wins". It waits for session 2's token, without super-deleting.
>>
>> * Session 2 sees session 1's conflicting TID. It super deletes,
>> releases token, and sleeps on session 1's token.
>>
>> * Session 1 wakes up. It looks at session 2's tuple again and sees that it
>> was super-deleted. There are no further conflicts, so the insertion is
>> complete, and it releases the token.
>>
>> * Session 2 wakes up. It looks at session 1's tuple again and sees that it's
>> still there. It goes back to sleep, this time on session 1's XID.
>>
>> * Session 1 commits. Session 2 wakes up, sees that the tuple is still there,
>> and throws a "contraint violation" error.
>
> I think we're actually 100% in agreement, then. I just prefer to have
> the second-to-last step (the "loser"'s re-check, made without a
> promise tuple of its own visible to anyone) occur as part of the
> pre-check that happens anyway with ON CONFLICT IGNORE. Otherwise,
> you'll end up with some much more complicated control flow that has to
> care about not doing that twice for ON CONFLICT IGNORE...and for the
> benefit of what? For regular inserters, whose unprincipled deadlocks
> we don't actually care about fixing?

Right. That will allow me to review and test the locking part of the 
patch separately.

- Heikki




Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 02/17/2015 02:11 AM, Peter Geoghegan wrote:
>
>>> >>Whatever works, really. I can't say that the performance implications
>>> >>of acquiring that hwlock are at the forefront of my mind. I never
>>> >>found that to be a big problem on an 8 core box, relative to vanilla
>>> >>INSERTs, FWIW - lock contention is not normal, and may be where any
>>> >>heavweight lock costs would really be encountered.
>> >
>> >Oh, cool. I guess the fast-path in lmgr.c kicks ass, then :-).
> Seems that way. But even if that wasn't true, it wouldn't matter,
> since I don't see that we have a choice.

I did some quick performance testing on this. For easy testing, I used a 
checkout of git master, and simply added LockAcquire + LockRelease calls 
to ExecInsert, around the heap_insert() call. The test case I used was:

psql -c "create table footest (id serial primary key);"

echo "insert into footest select from generate_series(1, 10000);" > 
inserts.sql

pgbench -n -f inserts.sql postgres -T100 -c4"
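
The change itself amounted to something like this around the
heap_insert() call in ExecInsert() (a rough sketch; the lock tag used
here is an arbitrary choice for illustration, not necessarily what a
real speculative token lock would use):

LOCKTAG     tag;

/*
 * Heavyweight lock held across the heap insertion, purely to measure
 * the lmgr overhead.
 */
SET_LOCKTAG_TRANSACTION(tag, GetCurrentTransactionId());
LockAcquire(&tag, ExclusiveLock, false, false);

newId = heap_insert(resultRelationDesc, tuple,
                    estate->es_output_cid, 0, NULL);

LockRelease(&tag, ExclusiveLock, false);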

With the extra lock calls, I got 56 tps on my laptop. With unpatched git 
master, I got 60 tps. I also looked at the profile with "perf", and 
indeed about 10% of the CPU time was spent in LockAcquire and 
LockRelease together.

So the extra locking incurs about 10% overhead. I think this was pretty
much a worst-case scenario, but not a hugely unrealistic one - many
real-world tables have only a few columns, and few indexes. With more 
CPUs you would probably start to see contention, in addition to just the 
extra overhead.

Are we OK with a 10% overhead, caused by the locking? That's probably 
acceptable if that's what it takes to get UPSERT. But it's not OK just 
to solve the deadlock issue with regular insertions into a table with 
exclusion constraints. Can we find a scheme to eliminate that overhead?

- Heikki



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Mon, Mar 2, 2015 at 11:20 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Are we OK with a 10% overhead, caused by the locking? That's probably
> acceptable if that's what it takes to get UPSERT. But it's not OK just to
> solve the deadlock issue with regular insertions into a table with exclusion
> constraints. Can we find a scheme to eliminate that overhead?

Looks like you tested a B-Tree index here. That doesn't seem
particularly representative of what you'd see with exclusion
constraints.

-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 03/02/2015 09:29 PM, Peter Geoghegan wrote:
> On Mon, Mar 2, 2015 at 11:20 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Are we OK with a 10% overhead, caused by the locking? That's probably
>> acceptable if that's what it takes to get UPSERT. But it's not OK just to
>> solve the deadlock issue with regular insertions into a table with exclusion
>> constraints. Can we find a scheme to eliminate that overhead?
>
> Looks like you tested a B-Tree index here. That doesn't seem
> particularly representative of what you'd see with exclusion
> constraints.

Hmm. I used a b-tree to estimate the effect that the locking would have 
in the UPSERT case, for UPSERT into a table with a b-tree index. But 
you're right that for the question of whether this is acceptable for the 
case of regular insert into a table with exclusion constraints, other 
indexams are more interesting. In a very quick test, the overhead with a 
single GiST index seems to be about 5%. IMHO that's still a noticeable 
slowdown, so let's try to find a way to eliminate it.

- Heikki




Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Mon, Mar 2, 2015 at 12:15 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm. I used a b-tree to estimate the effect that the locking would have in
> the UPSERT case, for UPSERT into a table with a b-tree index. But you're
> right that for the question of whether this is acceptable for the case of
> regular insert into a table with exclusion constraints, other indexams are
> more interesting. In a very quick test, the overhead with a single GiST
> index seems to be about 5%. IMHO that's still a noticeable slowdown, so
> let's try to find a way to eliminate it.

Honestly, I don't know why you're so optimistic that this can be
fixed, even with that heavyweight lock (for regular inserters +
exclusion constraints).

My experimental branch, which I showed you privately, shows big
problems when I simulate upserting with exclusion constraints (so we
insert first, handle exclusion violations using the traditional upsert
subxact pattern that does work with B-Trees). It's much harder to back
out of a heap_update() than it is to back out of a heap_insert(),
since we might well be updating a tuple visible to some other MVCC
snapshot. Whereas we can super delete a tuple knowing that only a
dirty snapshot might have seen it, which bounds the complexity (only
wait sites for B-Trees + exclusion constraints need to worry about
speculative insertion tokens and so on).

My experimental branch works just fine (with a variant jjanes_upsert
with subxact looping), until I need to restart an update after a
"failed" heap_update() that still returned HeapTupleMayBeUpdated
(having super deleted within an ExecUpdate() call). There is no good
way to do that for ExecUpdate() that I can see, because an existing,
visible row is affected (unlike with ExecInsert()). Even if it was
possible, it would be hugely invasive to already very complicated code
paths.

I continue to believe that the best way forward is to incrementally
commit the work by committing ON CONFLICT IGNORE first. That way,
speculative tokens will remain strictly the concern of UPSERTers or
sessions doing INSERT ... ON CONFLICT IGNORE. Unless you're happy to
have UPDATEs still deadlock, and only fix unprincipled deadlocks for
the INSERT case. I don't know why you want to make the patch
incremental by "fixing" unprincipled deadlocks in the first place,
since you've said that you don't really care about it as a real world
problem, and because it now appears to have significant additional
difficulties that were not anticipated.

Please, let's focus on getting ON CONFLICT IGNORE into a committable
state - that's the best way of incrementally committing this work.
Fixing unprincipled deadlocks is not a useful way of incrementally
committing this work. I've spent several days producing a prototype
that shows the exact nature of the problem. If you think I'm mistaken,
and that fixing unprincipled deadlocks first is the right thing to do,
please explain why with reference to that prototype. AFAICT, doing
things your way is going to add significant additional complexity for
*no appreciable benefit*. You've already gotten exactly what you were
looking for in every other regard. In particular, ON CONFLICT UPDATE
could work with exclusion constraints without any of these problems,
because of the way we do the auxiliary update there (we lock the row
ahead of updating/qual evaluation). I've bent over backwards to make
sure that is the case. Please, throw me a bone here.

Thank you
-- 
Peter Geoghegan



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Heikki Linnakangas
Date:
On 03/02/2015 11:21 PM, Peter Geoghegan wrote:
> On Mon, Mar 2, 2015 at 12:15 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Hmm. I used a b-tree to estimate the effect that the locking would have in
>> the UPSERT case, for UPSERT into a table with a b-tree index. But you're
>> right that for the question of whether this is acceptable for the case of
>> regular insert into a table with exclusion constraints, other indexams are
>> more interesting. In a very quick test, the overhead with a single GiST
>> index seems to be about 5%. IMHO that's still a noticeable slowdown, so
>> let's try to find a way to eliminate it.
>
> Honestly, I don't know why you're so optimistic that this can be
> fixed, even with that heavyweight lock (for regular inserters +
> exclusion constraints).

We already concluded that it can be fixed, with the heavyweight lock. 
See http://www.postgresql.org/message-id/54F4A0E0.4070602@iki.fi. Do you 
see some new problem with that that hasn't been discussed yet? To 
eliminate the heavy-weight lock, we'll need some new ideas, but it 
doesn't seem that hard.

> My experimental branch, which I showed you privately, shows big
> problems when I simulate upserting with exclusion constraints (so we
> insert first, handle exclusion violations using the traditional upsert
> subxact pattern that does work with B-Trees). It's much harder to back
> out of a heap_update() than it is to back out of a heap_insert(),
> since we might well be updating a tuple visible to some other MVCC
> snapshot. Whereas we can super delete a tuple knowing that only a
> dirty snapshot might have seen it, which bounds the complexity (only
> wait sites for B-Trees + exclusion constraints need to worry about
> speculative insertion tokens and so on).
>
> My experimental branch works just fine (with a variant jjanes_upsert
> with subxact looping), until I need to restart an update after a
> "failed" heap_update() that still returned HeapTupleMayBeUpdated
> (having super deleted within an ExecUpdate() call). There is no good
> way to do that for ExecUpdate() that I can see, because an existing,
> visible row is affected (unlike with ExecInsert()). Even if it was
> possible, it would be hugely invasive to already very complicated code
> paths.

Ah, so we can't easily use super-deletion to back out an UPDATE. I had 
not considered that.

> I continue to believe that the best way forward is to incrementally
> commit the work by committing ON CONFLICT IGNORE first. That way,
> speculative tokens will remain strictly the concern of UPSERTers or
> sessions doing INSERT ... ON CONFLICT IGNORE.

Ok, let's try that. Can you cut a patch that does just ON CONFLICT 
IGNORE, please?

- Heikki



Re: INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

From
Peter Geoghegan
Date:
On Tue, Mar 3, 2015 at 12:05 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> My experimental branch works just fine (with a variant jjanes_upsert
>> with subxact looping), until I need to restart an update after a
>> "failed" heap_update() that still returned HeapTupleMayBeUpdated
>> (having super deleted within an ExecUpdate() call). There is no good
>> way to do that for ExecUpdate() that I can see, because an existing,
>> visible row is affected (unlike with ExecInsert()). Even if it was
>> possible, it would be hugely invasive to already very complicated code
>> paths.
>
> Ah, so we can't easily use super-deletion to back out an UPDATE. I had not
> considered that.

Yeah. When I got into considering making EvalPlanQualFetch() look at
speculative tokens, it became abundantly clear that that code would
never be committed, even if I could make it work.

>> I continue to believe that the best way forward is to incrementally
>> commit the work by committing ON CONFLICT IGNORE first. That way,
>> speculative tokens will remain strictly the concern of UPSERTers or
>> sessions doing INSERT ... ON CONFLICT IGNORE.
>
>
> Ok, let's try that. Can you cut a patch that does just ON CONFLICT IGNORE,
> please?

Of course. I'll have that for you shortly.

Thanks
-- 
Peter Geoghegan