Thread: INSERT ... ON CONFLICT UPDATE and logical decoding
Andres pointed out that the INSERT ... ON CONFLICT UPDATE patch doesn't work well with logical decoding. Basically, as things stand, there is no means of determining from logical decoding that an INSERT had an ON CONFLICT UPDATE clause. It isn't obvious that this matters - after all, the relevant WAL records are generally consistent with there being a vanilla INSERT (or vanilla UPDATE). If there were no possibility of concurrent conflicts, that might actually be the end of it, but of course there is.

I guess that the best way of fixing this is exposing to output plugins that a "super deletion" is a REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE. This is kind of an internal ReorderBufferChangeType constant, because it means that the output plugin should probably just omit the tuple just inserted and now deleted. I tend to think that it would be best if the fact that a speculative insertion was a speculative insertion is not exposed. We just formalize that a REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE is a variant of REORDER_BUFFER_CHANGE_DELETE...it's implicit that things like triggers are not fired here (which may or may not matter, depending on the use case, I suppose).

Would that be sufficiently flexible for the various logical replication use cases? I guess we are somewhat forced to push that kind of thing into output plugins, because doing so lets them decide how to handle this. It's a little unfortunate that this implementation detail is exposed to output plugins, though, which is why I'd be willing to believe that it'd be better to have transaction reassembly normalize the records such that a super deleted tuple was never exposed to output plugins in the first place...they'd only see a REORDER_BUFFER_CHANGE_INSERT when that was the definitive outcome of an UPSERT, or alternatively a REORDER_BUFFER_CHANGE_UPDATE when that was the definitive outcome. No need for output plugins to consider REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE at all.

Thoughts?
Can't say that I've given conflict resolution for multi-master systems a great deal of thought before now, so I might be quite off the mark here.

--
Peter Geoghegan
Hi,

On 2015-02-18 16:35:14 -0800, Peter Geoghegan wrote:
> Andres pointed out that the INSERT ... ON CONFLICT UPDATE patch
> doesn't work well with logical decoding.

Just to make that clear: I didn't actually test it, but it didn't look good.

> I guess that the best way of fixing this is exposing to output plugins
> that a "super deletion" is a
> REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE.

I'm pretty much dead set against exposing anything like this to output plugins. The output plugin shouldn't care that an insertion was an upsert. For one, they won't get it right, since such events will be really infrequent; for another, it'll be hard to replay such an event on another node.

> This is kind of an
> internal ReorderBufferChangeType constant, because it means that the
> output plugin should probably just omit the tuple just inserted and
> now deleted.

An output plugin can't just go back and recall a tuple that was already sent to a client.

> I tend to think that it would be best if the fact that a
> speculative insertion was a speculative insertion is not exposed. We
> just formalize that a REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE is a
> variant of REORDER_BUFFER_CHANGE_DELETE...it's implicit that things
> like triggers are not fired here (which may or may not matter,
> depending on the use case, I suppose).

I can't see that working. How would such an event - one that obviously causes unique conflicts! - be replayed on another node?

> Would that be sufficiently flexible for the various logical
> replication use cases? I guess we are somewhat forced to push that
> kind of thing into output plugins, because doing so lets them decide
> how to handle this.

I don't see what benefits that'd bring.
> It's a little unfortunate that this implementation
> detail is exposed to output plugins, though, which is why I'd be
> willing to believe that it'd be better to have transaction reassembly
> normalize the records such that a super deleted tuple was never
> exposed to output plugins in the first place...

Yes.

> they'd only see a
> REORDER_BUFFER_CHANGE_INSERT when that was the definitive outcome of
> an UPSERT, or alternatively a REORDER_BUFFER_CHANGE_UPDATE when that
> was the definitive outcome. No need for output plugins to consider
> REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE at all.

Yes. It'd be easiest if only the final insert/update were actually WAL logged as full actions. IIUC we can otherwise end up with a, theoretically, arbitrarily large chain of insertions/super deletions.

> Thoughts? Can't say that I've given conflict resolution for
> multi-master systems a great deal of thought before now, so I might be
> quite off the mark here.

I don't think conflict resolution actually plays a role here. This is about consistency inside a single system, not consistency across systems.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 19, 2015 at 2:11 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-02-18 16:35:14 -0800, Peter Geoghegan wrote:
>> Andres pointed out that the INSERT ... ON CONFLICT UPDATE patch
>> doesn't work well with logical decoding.
>
> Just to make that clear: I didn't actually test it, but it didn't look
> good.

You didn't have to test it - it wasn't considered properly, but should have been. Simple as that.

>> Would that be sufficiently flexible for the various logical
>> replication use cases? I guess we are somewhat forced to push that
>> kind of thing into output plugins, because doing so lets them decide
>> how to handle this.
>
> I don't see what benefits that'd bring.

Hmm. Yeah, I did somewhat talk myself out of it as I wrote the e-mail. :-)

>> It's a little unfortunate that this implementation
>> detail is exposed to output plugins, though, which is why I'd be
>> willing to believe that it'd be better to have transaction reassembly
>> normalize the records such that a super deleted tuple was never
>> exposed to output plugins in the first place...
>
> Yes.
>
>> they'd only see a
>> REORDER_BUFFER_CHANGE_INSERT when that was the definitive outcome of
>> an UPSERT, or alternatively a REORDER_BUFFER_CHANGE_UPDATE when that
>> was the definitive outcome. No need for output plugins to consider
>> REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE at all.
>
> Yes. It'd be easiest if only the final insert/update were actually
> WAL logged as full actions.

Well, that implies that we'd actually know that we'd succeed when WAL logging the speculative heap tuple's insertion. We literally have no way of knowing if that's the case at that point, though - that's just the nature of value locking scheme #2's optimistic approach.

> IIUC we can otherwise end up with a, theoretically, arbitrarily large
> chain of insertions/super deletions.

We can end up with an infinite number of insertions and super deletions, in theory.
However, one session is always guaranteed to make progress, so the lock starvation hazards [1] seem acceptable. You can say the same thing about the subxact looping pattern, too.

I'll look into making it so that a "normalization" process happens during transaction reassembly, thus relieving logical changeset plugins of the burden of having to think of ON CONFLICT UPDATE as a special case at all. I think that's what it will take.

Thanks

[1] https://wiki.postgresql.org/wiki/UPSERT#Theoretical_lock_starvation_hazards

--
Peter Geoghegan
On 2015-02-20 15:44:12 -0800, Peter Geoghegan wrote:
> On Thu, Feb 19, 2015 at 2:11 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Yes. It'd be easiest if only the final insert/update were actually
>> WAL logged as full actions.
>
> Well, that implies that we'd actually know that we'd succeed when WAL
> logging the speculative heap tuple's insertion.

I don't think it does. It'd certainly be possible to simply only emit the final WAL logging action once the insertion has actually non-speculatively succeeded. We might decide against that for efficiency or complexity reasons, but it'd be far from impossible, or even ugly. We could even not log the actual values for the speculative insertion - after all, those aren't needed if we crash halfway through...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 20, 2015 at 3:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-02-20 15:44:12 -0800, Peter Geoghegan wrote:
>> On Thu, Feb 19, 2015 at 2:11 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> Yes. It'd be easiest if only the final insert/update were actually
>>> WAL logged as full actions.
>>
>> Well, that implies that we'd actually know that we'd succeed when WAL
>> logging the speculative heap tuple's insertion.
>
> I don't think it does. It'd certainly be possible to simply only emit
> the final WAL logging action once the insertion has actually
> non-speculatively succeeded. We might decide against that for
> efficiency or complexity reasons, but it'd be far from impossible or
> even ugly. We could even not log the actual values for the speculative
> insertion - after all, those aren't needed if we crash halfway
> through...

I think that that would be prohibitively complex and inefficient, though. No? I will concede that it's probably possible in principle, but that seems like a pretty academic point.

--
Peter Geoghegan
On Fri, Feb 20, 2015 at 3:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> they'd only see a
>>> REORDER_BUFFER_CHANGE_INSERT when that was the definitive outcome of
>>> an UPSERT, or alternatively a REORDER_BUFFER_CHANGE_UPDATE when that
>>> was the definitive outcome. No need for output plugins to consider
>>> REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE at all.
>>
>> Yes. It'd be easiest if only the final insert/update were actually
>> WAL logged as full actions.
>
> Well, that implies that we'd actually know that we'd succeed when WAL
> logging the speculative heap tuple's insertion. We literally have no
> way of knowing if that's the case at that point, though - that's just
> the nature of value locking scheme #2's optimistic approach.

Attached patch does this. This is just a sketch, to provoke a quick discussion (which is hopefully all this needs). I did things this way because it seemed more efficient than rolling this into a whole new revision of the ON CONFLICT patch series.

Ostensibly, this does the right thing: the test_decoding output plugin only ever reports either an INSERT or an UPDATE within a transaction (that contains an INSERT ... ON CONFLICT UPDATE). With jjanes_upsert running, I have not found any problematic cases where this fails to be true (although I have certainly been less than thorough so far - please take that into consideration).

I have two basic concerns:

* Do you think that what I've sketched here is roughly the right approach? Consolidating super deletions and insertions at transaction reassembly seemed like the right thing to do, but obviously I'm not best qualified to judge that. Clearly, it would be possible to optimize this so REORDER_BUFFER_CHANGE_INSERT "peek ahead" happens much more selectively, which I haven't bothered with yet.

* What are the hazards implied by my abuse of the ReorderBufferIterTXNNext() interface?
It looks like it should be okay to "peek ahead" like this, because it has a slightly odd convention around memory management that accidentally works. But honestly, I'm feeling too lazy to make myself grok the code within ReorderBufferIterTXNNext() at the moment, and so have not put concerns about the code being totally broken to rest yet. Could the "current" ReorderBufferChange be freed before it is sent to a logical decoding plugin by way of a call to the rb->apply_change() callback (when the xact is spooled to disk, for example)?

Thanks

--
Peter Geoghegan
Hi,

>> Well, that implies that we'd actually know that we'd succeed when WAL
>> logging the speculative heap tuple's insertion. We literally have no
>> way of knowing if that's the case at that point, though - that's just
>> the nature of value locking scheme #2's optimistic approach.
>
> Attached patch does this. This is just a sketch, to provoke a quick
> discussion (which is hopefully all this needs). I did things this way
> because it seemed more efficient than rolling this into a whole new
> revision of the ON CONFLICT patch series.
>
> Ostensibly, this does the right thing: the test_decoding output plugin
> only ever reports either an INSERT or an UPDATE within a transaction
> (that contains an INSERT ... ON CONFLICT UPDATE). With jjanes_upsert
> running, I have not found any problematic cases where this fails to be
> true (although I have certainly been less than thorough so far -
> please take that into consideration).

I'm pretty sure this will entirely fail if you have a transaction that's large enough to spill to disk. Calling ReorderBufferIterTXNNext() will reuse the memory for the in-memory changes.

It's also borked because it skips a large number of checks for the next change. You're entirely disregarding what the routine does otherwise - it could be a toast record, it could be a change to a system table, it could be a new snapshot...

Also, this seems awfully fragile in the presence of toast rows and such. Can we be entirely sure that the next WAL record logged by that session would be the internal super deletion?

> * Do you think that what I've sketched here is roughly the right
> approach? Consolidating super deletions and insertions at transaction
> reassembly seemed like the right thing to do, but obviously I'm not
> best qualified to judge that. Clearly, it would be possible to
> optimize this so REORDER_BUFFER_CHANGE_INSERT "peek ahead" happens
> much more selectively, which I haven't bothered with yet.
It's far from pretty, but I guess it might end up being OK. I still think the alternative of only emitting a real WAL record once the insertion has actually succeeded deserves attention - you hand-waved it away without really looking at it.

> * What are the hazards implied by my abuse of the
> ReorderBufferIterTXNNext() interface? It looks like it should be okay
> to "peek ahead" like this, because it has a slightly odd convention
> around memory management that accidentally works. But honestly, I'm
> feeling too lazy to make myself grok the code within
> ReorderBufferIterTXNNext() at the moment, and so have not put concerns
> about the code being totally broken to rest yet. Could the "current"
> ReorderBufferChange be freed before it is sent to a logical decoding
> plugin by way of a call to the rb->apply_change() callback (when the
> xact is spooled to disk, for example)?

As explained above, I can't see it working directly this way.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Feb 25, 2015 at 3:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I'm pretty sure this will entirely fail if you have a transaction that's
> large enough to spill to disk. Calling ReorderBufferIterTXNNext() will
> reuse the memory for the in-memory changes.
>
> It's also borked because it skips a large number of checks for the next
> change. You're entirely disregarding what the routine does otherwise -
> it could be a toast record, it could be a change to a system table, it
> could be a new snapshot...
>
> Also, this seems awfully fragile in the presence of toast rows and such.
> Can we be entirely sure that the next WAL record logged by that session
> would be the internal super deletion?

toast_save_datum() is called with a heap_insert() call before heap insertion for the tuple proper. We're relying on the assumption that if there is no immediate super deletion record, things are fine. We cannot speculatively insert into catalogs, BTW. Why should we see a REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID case next, in the case that we need to care about (when the tuple was super deleted immediately afterwards)?

Now, I fully admit that as written, the quick sketch patch is unacceptably fragile (I think it might accidentally work despite this, leaving aside concerns about resource lifespans). The "peek ahead" needs to be more robust, for sure, most obviously by not failing when we spill to disk, but also by being paranoid about super deletion records. Basically, by "more paranoid", I mean that the next thing immediately following a speculatively inserted tuple might not be a super deletion record, but we still look until we conclude that there won't be one (based on something else - some "terminating condition" - happening, including the xact committing). Other examples of terminating conditions would be new, unrelated REORDER_BUFFER_CHANGE_INSERT/REORDER_BUFFER_CHANGE_UPDATE/REORDER_BUFFER_CHANGE_DELETE changes.
We can make some assumptions about when it's safe to conclude that there was no super deletion that are not fragile.

As for the idea of writing WAL separately, I don't think it's worth it. We'd need to go exclusive lock the buffer again, which seems like a non-starter. While what I have here *is* very ugly, I see no better alternative. By doing what you suggest, we'd be finding a special-case way of safely violating the general "WAL-before-data" rule. Performance aside, that seems like a worse, less isolated kludge...no?

--
Peter Geoghegan
On 2/19/15 4:11 AM, Andres Freund wrote:
>> Thoughts? Can't say that I've given conflict resolution for
>> multi-master systems a great deal of thought before now, so I might be
>> quite off the mark here.
>
> I don't think conflict resolution actually plays a role here. This is
> about consistency inside a single system, not consistency across
> systems.

Isn't it possible that a multi-master resolution algorithm would want to know that something was an UPSERT, though? ISTM handling that differently than a plain INSERT/UPDATE is something you might well want to do. For example, if you're using last-wins conflict resolution and UPSERTs on two nodes overlap, wouldn't you want to know that the second UPSERT was an UPSERT and not a plain INSERT?

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2015-02-25 14:04:55 -0800, Peter Geoghegan wrote:
> On Wed, Feb 25, 2015 at 3:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I'm pretty sure this will entirely fail if you have a transaction that's
>> large enough to spill to disk. Calling ReorderBufferIterTXNNext() will
>> reuse the memory for the in-memory changes.
>>
>> It's also borked because it skips a large number of checks for the next
>> change. You're entirely disregarding what the routine does otherwise -
>> it could be a toast record, it could be a change to a system table, it
>> could be a new snapshot...
>>
>> Also, this seems awfully fragile in the presence of toast rows and such.
>> Can we be entirely sure that the next WAL record logged by that session
>> would be the internal super deletion?
>
> toast_save_datum() is called with a heap_insert() call before heap
> insertion for the tuple proper. We're relying on the assumption that
> if there is no immediate super deletion record, things are fine. We
> cannot speculatively insert into catalogs, BTW.

I fail to see what prohibits e.g. another catalog-modifying transaction from committing in between and distributing a new snapshot.

> Why should we see a REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID case
> next, in the case that we need to care about (when the tuple was super
> deleted immediately afterwards)?

It's irrelevant whether you care about it or not. Your ReorderBufferIterTXNNext() call consumes a change that needs to actually be processed. It could very well be something important, like a new snapshot.

> As for the idea of writing WAL separately, I don't think it's worth
> it. We'd need to go exclusive lock the buffer again, which seems like
> a non-starter.

Meh^2. That won't be the most significant cost of an UPSERT.

> While what I have here *is* very ugly, I see no better alternative. By
> doing what you suggest, we'd be finding a special-case way of safely
> violating the general "WAL-before-data" rule.

No, it won't.
We don't WAL log modifications that don't need to persist in a bunch of places. It'd be perfectly fine to start an upsert with an 'acquiring an insertion lock' record that pretty much only contains the item pointer (and an FPI if necessary to prevent torn pages), and then only log the full new record once it's sure that the whole thing needs to survive. Redo can then simply clean out the space touched by the insertion lock.

> Performance aside, that seems like a worse, less isolated kludge...no?

No. I'm not sure it'll come out significantly better, but I don't see it coming out much worse. WAL logging where the result of the WAL-logged action essentially can't be determined until many records later isn't pretty.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Mar 3, 2015 at 3:13 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> toast_save_datum() is called with a heap_insert() call before heap
>> insertion for the tuple proper. We're relying on the assumption that
>> if there is no immediate super deletion record, things are fine. We
>> cannot speculatively insert into catalogs, BTW.
>
> I fail to see what prohibits e.g. another catalog-modifying transaction
> from committing in between and distributing a new snapshot.

SnapBuildDistributeNewCatalogSnapshot()/REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT does look like a problematic case. It's problematic specifically because it involves some xact queuing a change to *some other* xact - we cannot assume that this won't happen between WAL-logging within heap_insert() and heap_delete(). Can you think of any other such cases?

I think that REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID should be fine. REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID is not an issue either. In both cases I believe my assumption does not break, because there won't be writes to system catalogs between the relevant heap_insert() and heap_delete() calls, which are tightly coupled, and also because speculative insertion into catalogs is unsupported. That just leaves the non-*CHANGE_INTERNAL_* (regular DML) cases, which should also be fine.

As for SnapBuildDistributeNewCatalogSnapshot(), I'm open to suggestions. Do you see any opportunity around assuming that heavyweight locking prevents the interleaving of some REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT change between a REORDER_BUFFER_CHANGE_INTERNAL_INSERT change and a REORDER_BUFFER_CHANGE_INTERNAL_DELETE change? AFAICT, that's all this approach really needs to worry about (or the interleaving of something else not under the control of speculative insertion - it doesn't have to be a REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT; that's just the most obvious problematic case).
>> Why should we see a REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID case
>> next, in the case that we need to care about (when the tuple was super
>> deleted immediately afterwards)?
>
> It's irrelevant whether you care about it or not. Your
> ReorderBufferIterTXNNext() consumes a change that needs to actually be
> processed. It could very well be something important like a new
> snapshot.

But it is actually processed, in the next iteration (we avoid calling ReorderBufferIterTXNNext() at the top of the loop if it has already been called for that iteration as part of peeking ahead). AFAICT, all that is at issue is the safety of one particular assumption I've made: that it is safe to assume that there will never be a super deletion in the event of not seeing a super deletion change immediately following a speculative insertion change within some xact when decoding. It's still going to be processed if it's something else. The implementation will, however, fail to consolidate the speculative insertion change and super deletion change if my assumption is ever faulty.

This whole approach doesn't seem to be all that different to how there is currently coordination within ReorderBufferCommit() between TOAST-related changes and changes to the relation proper. In fact, I've now duplicated the way the IsToastRelation() case performs "dlist_delete(&change->node)" in order to avoid chunk reuse in the event of spilling to disk. Stress testing by decreasing "max_changes_in_memory" to 1 does not exhibit broken behavior, I believe (although that does break the test_decoding regression tests on the master branch, FWIW). Also, obviously I have not considered the SnapBuildDistributeNewCatalogSnapshot() case too closely.

I attach an updated patch that I believe at least handles the serialization aspects correctly, FYI.
Note that I assert that a REORDER_BUFFER_CHANGE_INTERNAL_DELETE follows a REORDER_BUFFER_CHANGE_INTERNAL_INSERT, so if you can find a way to break the assumption, it should cause an assertion failure, which is something to look out for during testing.

>> While what I have here *is* very ugly, I see no better alternative. By
>> doing what you suggest, we'd be finding a special-case way of safely
>> violating the general "WAL-before-data" rule.
>
> No, it won't. We don't WAL log modifications that don't need to persist
> in a bunch of places. It'd be perfectly fine to start an upsert with an
> 'acquiring an insertion lock' record that pretty much only contains the
> item pointer (and an FPI if necessary to prevent torn pages), and then
> only log the full new record once it's sure that the whole thing needs
> to survive. Redo can then simply clean out the space touched by the
> insertion lock.

That seems like a lot of work in the general case to avoid something that will only very rarely need to occur anyway. The optimistic approach of value locking scheme #2 has a small race that is detected (which necessitates super deletion), but that will only very rarely be required. Also, there is value in making super deletion (and speculative insertion) as close as possible to regular deletion and insertion.

--
Peter Geoghegan