Thread: INSERT ... ON CONFLICT UPDATE and logical decoding
Andres pointed out that the INSERT ... ON CONFLICT UPDATE patch doesn't work well with logical decoding. Basically, as things stand, there is no means of determining from logical decoding that an INSERT had an ON CONFLICT UPDATE clause. It isn't obvious that this matters - after all, the relevant WAL records are generally consistent with there being a vanilla INSERT (or vanilla UPDATE). If there were no possibility of concurrent conflicts, that might actually be the end of it, but of course there is.

I guess that the best way of fixing this is exposing to output plugins that a "super deletion" is a REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE. This is kind of an internal ReorderBufferChangeType constant, because it means that the output plugin should probably just omit the tuple just inserted and now deleted. I tend to think that it would be best if the fact that a speculative insertion was a speculative insertion is not exposed. We just formalize that a REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE is a variant of REORDER_BUFFER_CHANGE_DELETE...it's implicit that things like triggers are not fired here (which may or may not matter, depending on the use case, I suppose).

Would that be sufficiently flexible for the various logical replication use cases? I guess we are somewhat forced to push that kind of thing into output plugins, because doing so lets them decide how to handle this. It's a little unfortunate that this implementation detail is exposed to output plugins, though, which is why I'd be willing to believe that it'd be better to have transaction reassembly normalize the records such that a super deleted tuple was never exposed to output plugins in the first place...they'd only see a REORDER_BUFFER_CHANGE_INSERT when that was the definitive outcome of an UPSERT, or alternatively a REORDER_BUFFER_CHANGE_UPDATE when that was the definitive outcome. No need for output plugins to consider REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE at all.

Thoughts?
Can't say that I've given conflict resolution for multi-master systems a great deal of thought before now, so I might be quite off the mark here.

--
Peter Geoghegan
Hi,

On 2015-02-18 16:35:14 -0800, Peter Geoghegan wrote:
> Andres pointed out that the INSERT ... ON CONFLICT UPDATE patch
> doesn't work well with logical decoding.

Just to make that clear: I didn't actually test it, but it didn't look good.

> I guess that the best way of fixing this is exposing to output plugins
> that a "super deletion" is a
> REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE.

I'm pretty much dead set against exposing anything like this to output plugins. The output plugin shouldn't care that an insertion was an upsert. For one, they won't get it right, since such events will be really infrequent; for another, it'll be hard to replay such an event on another node.

> This is kind of an
> internal ReorderBufferChangeType constant, because it means that the
> output plugin should probably just omit the tuple just inserted and
> now deleted.

An output plugin can't just go back and recall a tuple that was already sent to a client.

> I tend to think that it would be best if the fact that a
> speculative insertion was a speculative insertion is not exposed. We
> just formalize that a REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE is a
> variant of REORDER_BUFFER_CHANGE_DELETE...it's implicit that things
> like triggers are not fired here (which may or may not matter,
> depending on the use case, I suppose).

I can't see that working. How would such an event - one that obviously causes unique conflicts! - be replayed on another node?

> Would that be sufficiently flexible for the various logical
> replication use cases? I guess we are somewhat forced to push that
> kind of thing into output plugins, because doing so lets them decide
> how to handle this.

I don't see what benefits that'd bring.
> It's a little unfortunate that this implementation
> detail is exposed to output plugins, though, which is why I'd be
> willing to believe that it'd be better to have transaction reassembly
> normalize the records such that a super deleted tuple was never
> exposed to output plugins in the first place...

Yes.

> they'd only see a
> REORDER_BUFFER_CHANGE_INSERT when that was the definitive outcome of
> an UPSERT, or alternatively a REORDER_BUFFER_CHANGE_UPDATE when that
> was the definitive outcome. No need for output plugins to consider
> REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE at all.

Yes. It'd be easiest if only the final insert/update were actually WAL logged as full actions. IIUC we can otherwise end up with a, theoretically, arbitrarily large chain of insertions/super deletions.

> Thoughts? Can't say that I've given conflict resolution for
> multi-master systems a great deal of thought before now, so I might be
> quite off the mark here.

I don't think conflict resolution actually plays a role here. This is about consistency inside a single system, not consistency across systems.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 19, 2015 at 2:11 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-02-18 16:35:14 -0800, Peter Geoghegan wrote:
>> Andres pointed out that the INSERT ... ON CONFLICT UPDATE patch
>> doesn't work well with logical decoding.
>
> Just to make that clear: I didn't actually test it, but it didn't look
> good.

You didn't have to test it - it wasn't considered properly, but should have been. Simple as that.

>> Would that be sufficiently flexible for the various logical
>> replication use cases? I guess we are somewhat forced to push that
>> kind of thing into output plugins, because doing so lets them decide
>> how to handle this.
>
> I don't see what benefits that'd bring.

Hmm. Yeah, I did somewhat talk myself out of it as I wrote the e-mail. :-)

>> It's a little unfortunate that this implementation
>> detail is exposed to output plugins, though, which is why I'd be
>> willing to believe that it'd be better to have transaction reassembly
>> normalize the records such that a super deleted tuple was never
>> exposed to output plugins in the first place...
>
> Yes.
>
>> they'd only see a
>> REORDER_BUFFER_CHANGE_INSERT when that was the definitive outcome of
>> an UPSERT, or alternatively a REORDER_BUFFER_CHANGE_UPDATE when that
>> was the definitive outcome. No need for output plugins to consider
>> REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE at all.
>
> Yes. It'd be easiest if only the final insert/update were actually
> WAL logged as full actions.

Well, that implies that we'd actually know that we'd succeed when WAL logging the speculative heap tuple's insertion. We literally have no way of knowing if that's the case at that point, though - that's just the nature of value locking scheme #2's optimistic approach.

> IIUC we can otherwise end up with a, theoretically, arbitrarily large
> chain of insertions/super deletions.

We can end up with an infinite number of insertions and super deletions, in theory.
However, one session is always guaranteed to make progress, so the lock starvation hazards [1] seem acceptable. You can say the same thing about the subxact looping pattern, too.

I'll look into making it so that a "normalization" process happens during transaction reassembly, thus relieving logical changeset plugins of the burden of having to think of ON CONFLICT UPDATE as a special case at all. I think that's what it will take.

Thanks

[1] https://wiki.postgresql.org/wiki/UPSERT#Theoretical_lock_starvation_hazards

--
Peter Geoghegan
On 2015-02-20 15:44:12 -0800, Peter Geoghegan wrote:
> On Thu, Feb 19, 2015 at 2:11 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Yes. It'd be easiest if only the final insert/update were actually
>> WAL logged as full actions.
>
> Well, that implies that we'd actually know that we'd succeed when WAL
> logging the speculative heap tuple's insertion.

I don't think it does. It'd certainly be possible to simply only emit the final WAL logging action once the insertion has actually non-speculatively succeeded. We might decide against that for efficiency or complexity reasons, but it'd be far from impossible, or even ugly. We could even not log the actual values for the speculative insertion - after all, those aren't needed if we crash halfway through...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 20, 2015 at 3:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-02-20 15:44:12 -0800, Peter Geoghegan wrote:
>> On Thu, Feb 19, 2015 at 2:11 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> Yes. It'd be easiest if only the final insert/update were actually
>>> WAL logged as full actions.
>>
>> Well, that implies that we'd actually know that we'd succeed when WAL
>> logging the speculative heap tuple's insertion.
>
> I don't think it does. It'd certainly be possible to simply only emit
> the final WAL logging action once the insertion has actually
> non-speculatively succeeded. We might decide against that for
> efficiency or complexity reasons, but it'd be far from impossible or
> even ugly. We could even not log the actual values for the speculative
> insertion - after all, those aren't needed if we crash halfway
> through...

I think that that would be prohibitively complex and inefficient, though. No? I will concede that it's probably possible in principle, but that seems like a pretty academic point.

--
Peter Geoghegan
On Fri, Feb 20, 2015 at 3:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> they'd only see a
>>> REORDER_BUFFER_CHANGE_INSERT when that was the definitive outcome of
>>> an UPSERT, or alternatively a REORDER_BUFFER_CHANGE_UPDATE when that
>>> was the definitive outcome. No need for output plugins to consider
>>> REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE at all.
>>
>> Yes. It'd be easiest if only the final insert/update were actually
>> WAL logged as full actions.
>
> Well, that implies that we'd actually know that we'd succeed when WAL
> logging the speculative heap tuple's insertion. We literally have no
> way of knowing if that's the case at that point, though - that's just
> the nature of value locking scheme #2's optimistic approach.

Attached patch does this. This is just a sketch, to provoke a quick discussion (which is hopefully all this needs). I did things this way because it seemed more efficient than rolling this into a whole new revision of the ON CONFLICT patch series.

Ostensibly, this does the right thing: the test_decoding output plugin only ever reports either an INSERT or an UPDATE within a transaction (that contains an INSERT ... ON CONFLICT UPDATE). With jjanes_upsert running, I have not found any problematic cases where this fails to be true (although I have certainly been less than thorough so far - please take that into consideration).

I have two basic concerns:

* Do you think that what I've sketched here is roughly the right approach? Consolidating super deletions and insertions at transaction reassembly seemed like the right thing to do, but obviously I'm not best qualified to judge that. Clearly, it would be possible to optimize this so REORDER_BUFFER_CHANGE_INSERT "peek ahead" happens much more selectively, which I haven't bothered with yet.

* What are the hazards implied by my abuse of the ReorderBufferIterTXNNext() interface?
It looks like it should be okay to "peek ahead" like this, because it has a slightly odd convention around memory management that accidentally works. But honestly, I'm feeling too lazy to make myself grok the code within ReorderBufferIterTXNNext() at the moment, and so have not put concerns about the code being totally broken to rest yet. Could the "current" ReorderBufferChange be freed before it is sent to a logical decoding plugin by way of a call to the rb->apply_change() callback (when the xact is spooled to disk, for example)?

Thanks

--
Peter Geoghegan
Hi,

>> Well, that implies that we'd actually know that we'd succeed when WAL
>> logging the speculative heap tuple's insertion. We literally have no
>> way of knowing if that's the case at that point, though - that's just
>> the nature of value locking scheme #2's optimistic approach.
>
> Attached patch does this. This is just a sketch, to provoke a quick
> discussion (which is hopefully all this needs). I did things this way
> because it seemed more efficient than rolling this into a whole new
> revision of the ON CONFLICT patch series.
>
> Ostensibly, this does the right thing: the test_decoding output plugin
> only ever reports either an INSERT or an UPDATE within a transaction
> (that contains an INSERT ... ON CONFLICT UPDATE). With jjanes_upsert
> running, I have not found any problematic cases where this fails to be
> true (although I have certainly been less than thorough so far -
> please take that into consideration).

I'm pretty sure this will entirely fail if you have a transaction that's large enough to spill to disk. Calling ReorderBufferIterTXNNext() will reuse the memory for the in-memory changes.

It's also borked because it skips a large number of checks for the next change. You're entirely disregarding what the routine does otherwise - it could be a toast record, it could be a change to a system table, it could be a new snapshot...

Also, this seems awfully fragile in the presence of toast rows and such. Can we be entirely sure that the next WAL record logged by that session would be the internal super deletion?

> * Do you think that what I've sketched here is roughly the right
> approach? Consolidating super deletions and insertions at transaction
> reassembly seemed like the right thing to do, but obviously I'm not
> best qualified to judge that. Clearly, it would be possible to
> optimize this so REORDER_BUFFER_CHANGE_INSERT "peek ahead" happens
> much more selectively, which I haven't bothered with yet.
It's far from pretty, but I guess it might end up being OK. I still think the alternative of only emitting a real WAL record once the insertion has actually succeeded deserves attention - you hand-waved it away without really looking at it.

> * What are the hazards implied by my abuse of the
> ReorderBufferIterTXNNext() interface? It looks like it should be okay
> to "peek ahead" like this, because it has a slightly odd convention
> around memory management that accidentally works. But honestly, I'm
> feeling too lazy to make myself grok the code within
> ReorderBufferIterTXNNext() at the moment, and so have not put concerns
> about the code being totally broken to rest yet. Could the "current"
> ReorderBufferChange be freed before it is sent to a logical decoding
> plugin by way of a call to the rb->apply_change() callback (when the
> xact is spooled to disk, for example)?

As explained above, I can't see it working directly this way.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Feb 25, 2015 at 3:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I'm pretty sure this will entirely fail if you have a transaction that's
> large enough to spill to disk. Calling ReorderBufferIterTXNNext() will
> reuse the memory for the in-memory changes.
>
> It's also borked because it skips a large number of checks for the next
> change. You're entirely disregarding what the routine does otherwise -
> it could be a toast record, it could be a change to a system table, it
> could be a new snapshot...
>
> Also, this seems awfully fragile in the presence of toast rows and such.
> Can we be entirely sure that the next WAL record logged by that session
> would be the internal super deletion?

toast_save_datum() is called with a heap_insert() call before heap insertion for the tuple proper. We're relying on the assumption that if there is no immediate super deletion record, things are fine. We cannot speculatively insert into catalogs, BTW. Why should we see a REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID case next, in the case that we need to care about (when the tuple was super deleted immediately afterwards)?

Now, I fully admit that as written, the quick sketch patch is unacceptably fragile (I think it might accidentally work despite this, leaving aside concerns about resource lifespans). The "peek ahead" needs to be more robust, for sure, most obviously by not failing when we spill to disk, but also by being paranoid about super deletion records. Basically, by "more paranoid", I mean that the next thing immediately following a speculatively inserted tuple might not be a super deletion record, but we still look until we conclude that there won't be one (based on something else - some "terminating condition" - happening, including the xact committing). Other examples of terminating conditions would be new, unrelated REORDER_BUFFER_CHANGE_INSERT/REORDER_BUFFER_CHANGE_UPDATE/REORDER_BUFFER_CHANGE_DELETE changes.
We can make some assumptions about when it's safe to conclude that there was no super deletion that are not fragile.

As for the idea of writing WAL separately, I don't think it's worth it. We'd need to go exclusive lock the buffer again, which seems like a non-starter. While what I have here *is* very ugly, I see no better alternative. By doing what you suggest, we'd be finding a special-case way of safely violating the general "WAL-before-data" rule. Performance aside, that seems like a worse, less isolated kludge...no?

--
Peter Geoghegan
On 2/19/15 4:11 AM, Andres Freund wrote:
>> Thoughts? Can't say that I've given conflict resolution for
>> multi-master systems a great deal of thought before now, so I might be
>> quite off the mark here.
>
> I don't think conflict resolution actually plays a role here. This is
> about consistency inside a single system, not consistency across
> systems.

Isn't it possible that a multi-master resolution algorithm would want to know that something was an UPSERT, though? ISTM handling that differently than a plain INSERT/UPDATE is something you might well want to do. For example, if you're using last-wins conflict resolution and UPSERTs on two nodes overlap, wouldn't you want to know that the second UPSERT was an UPSERT and not a plain INSERT?

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2015-02-25 14:04:55 -0800, Peter Geoghegan wrote:
> On Wed, Feb 25, 2015 at 3:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I'm pretty sure this will entirely fail if you have a transaction that's
>> large enough to spill to disk. Calling ReorderBufferIterTXNNext() will
>> reuse the memory for the in-memory changes.
>>
>> It's also borked because it skips a large number of checks for the next
>> change. You're entirely disregarding what the routine does otherwise -
>> it could be a toast record, it could be a change to a system table, it
>> could be a new snapshot...
>>
>> Also, this seems awfully fragile in the presence of toast rows and such.
>> Can we be entirely sure that the next WAL record logged by that session
>> would be the internal super deletion?
>
> toast_save_datum() is called with a heap_insert() call before heap
> insertion for the tuple proper. We're relying on the assumption that
> if there is no immediate super deletion record, things are fine. We
> cannot speculatively insert into catalogs, BTW.

I fail to see what prohibits e.g. another catalog-modifying transaction from committing in between and distributing a new snapshot.

> Why should we see a REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID case
> next, in the case that we need to care about (when the tuple was super
> deleted immediately afterwards)?

It's irrelevant whether you care about it or not. Your ReorderBufferIterTXNNext() call consumes a change that needs to actually be processed. It could very well be something important, like a new snapshot.

> As for the idea of writing WAL separately, I don't think it's worth
> it. We'd need to go exclusive lock the buffer again, which seems like
> a non-starter.

Meh^2. That won't be the most significant cost of an UPSERT.

> While what I have here *is* very ugly, I see no better alternative. By
> doing what you suggest, we'd be finding a special-case way of safely
> violating the general "WAL-before-data" rule.

No, it won't.
We don't WAL log modifications that don't need to persist in a bunch of places. It'd be perfectly fine to start an upsert with an 'acquiring an insertion lock' record that pretty much only contains the item pointer (and an FPI if necessary to prevent torn pages), and then only log the full new record once it's sure that the whole thing needs to survive. Redo can then simply clean out the space touched by the insertion lock.

> Performance aside, that seems like a worse, less isolated kludge...no?

No. I'm not sure it'll come out significantly better, but I don't see it coming out much worse. WAL logging where the result of the WAL-logged action essentially can't be determined until many records later isn't pretty.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Mar 3, 2015 at 3:13 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> toast_save_datum() is called with a heap_insert() call before heap
>> insertion for the tuple proper. We're relying on the assumption that
>> if there is no immediate super deletion record, things are fine. We
>> cannot speculatively insert into catalogs, BTW.
>
> I fail to see what prohibits e.g. another catalog-modifying transaction
> from committing in between and distributing a new snapshot.

SnapBuildDistributeNewCatalogSnapshot()/REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT does look like a problematic case. It's problematic specifically because it involves some xact queuing a change to *some other* xact - we cannot assume that this won't happen between WAL-logging within heap_insert() and heap_delete(). Can you think of any other such cases?

I think that REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID should be fine. REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID is not an issue either. In both cases I believe my assumption does not break, because there won't be writes to system catalogs between the relevant heap_insert() and heap_delete() calls, which are tightly coupled, and also because speculative insertion into catalogs is unsupported. That just leaves the non-*CHANGE_INTERNAL_* (regular DML) cases, which should also be fine.

As for SnapBuildDistributeNewCatalogSnapshot(), I'm open to suggestions. Do you see any opportunity around assuming that heavyweight locking prevents the interleaving of some REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT change between a REORDER_BUFFER_CHANGE_INTERNAL_INSERT change and a REORDER_BUFFER_CHANGE_INTERNAL_DELETE change? AFAICT, that's all this approach really needs to worry about (or the interleaving of something else not under the control of speculative insertion - it doesn't have to be a REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT; that's just the most obvious problematic case).
>> Why should we see a REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID case
>> next, in the case that we need to care about (when the tuple was super
>> deleted immediately afterwards)?
>
> It's irrelevant whether you care about it or not. Your
> ReorderBufferIterTXNNext() consumes a change that needs to actually be
> processed. It could very well be something important like a new
> snapshot.

But it is actually processed, in the next iteration (we avoid calling ReorderBufferIterTXNNext() at the top of the loop if it has already been called for that iteration as part of peeking ahead). AFAICT, all that is at issue is the safety of one particular assumption I've made: that it is safe to assume that there will never be a super deletion in the event of not seeing a super deletion change immediately following a speculative insertion change within some xact when decoding. It's still going to be processed if it's something else. The implementation will, however, fail to consolidate the speculative insertion change and super deletion change if my assumption is ever faulty.

This whole approach doesn't seem to be all that different to how there is currently coordination within ReorderBufferCommit() between TOAST-related changes and changes to the relation proper. In fact, I've now duplicated the way the IsToastRelation() case performs "dlist_delete(&change->node)" in order to avoid chunk reuse in the event of spilling to disk. Stress testing by decreasing "max_changes_in_memory" to 1 does not exhibit broken behavior, I believe (although that does break the test_decoding regression tests on the master branch, FWIW). Also, obviously I have not considered the SnapBuildDistributeNewCatalogSnapshot() case too closely.

I attach an updated patch that I believe at least handles the serialization aspects correctly, FYI.
Note that I assert that a REORDER_BUFFER_CHANGE_INTERNAL_DELETE follows a REORDER_BUFFER_CHANGE_INTERNAL_INSERT, so if you can find a way to break the assumption, it should cause an assertion failure, which is something to look out for during testing.

>> While what I have here *is* very ugly, I see no better alternative. By
>> doing what you suggest, we'd be finding a special-case way of safely
>> violating the general "WAL-before-data" rule.
>
> No, it won't. We don't WAL log modifications that don't need to persist
> in a bunch of places. It'd be perfectly fine to start an upsert with an
> 'acquiring an insertion lock' record that pretty much only contains the
> item pointer (and an FPI if necessary to prevent torn pages), and then
> only log the full new record once it's sure that the whole thing needs
> to survive. Redo can then simply clean out the space touched by the
> insertion lock.

That seems like a lot of work in the general case to avoid something that will only very rarely need to occur anyway. The optimistic approach of value locking scheme #2 has a small race that is detected (which necessitates super deletion), but that will only very rarely be required. Also, there is value in making super deletion (and speculative insertion) as close as possible to regular deletion and insertion.

--
Peter Geoghegan