Thread: Logical insert/update/delete WAL records for custom table AMs
Attached is a WIP patch to add new WAL records to represent a logical insert, update, or delete. These records do not do anything at REDO time; they are only processed during logical decoding/replication.

These are intended to be used by a custom table AM, like my columnar compression extension[0], which currently supports physical replication but can't support logical decoding/replication because decoding is not extensible. Using these new logical records would be redundant, making inserts/updates/deletes less efficient, but at least logical decoding would work (the lack of which is columnar's biggest weakness).

Alternatively, we could support extensible WAL with extensible decoding. I also like this approach, but it takes more work for an AM like columnar to get that right -- it needs to keep additional state in the walsender to track and assemble the compressed columns stored across many blocks. It also requires a lot of care, because mistakes can get you into serious trouble.

This proposal, for new logical records without WAL extensibility, provides a more shallow ramp to get a table AM working (including logical replication/decoding) without the need to invest in the WAL design. Later, of course I'd like the option for extensible WAL as well (to be more efficient), but right now I'd prefer it just worked (inefficiently).

The patch is still very rough, but I tried it in simple insert cases in my columnar[0] extension (which only supports insert, not update/delete). I'm looking for some review on the approach and structure before I polish and test it. Also note that the patch is against v14 (for now).

Regards,
Jeff Davis

[0] https://github.com/citusdata/citus/tree/master/src/backend/columnar

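The split described in this message (logical records that are no-ops at REDO and are consumed only by decoding) can be illustrated with a toy model. This is a minimal sketch in Python, not code from the patch: `WalRecord`, `redo`, `decode`, and the record layout are all hypothetical names chosen for illustration, and real WAL records look nothing like these dictionaries.

```python
from dataclasses import dataclass


@dataclass
class WalRecord:
    kind: str      # "physical" or "logical" (hypothetical tags for this sketch)
    payload: dict


def redo(records):
    """Toy recovery: apply physical records, skip logical ones entirely."""
    pages = {}
    for rec in records:
        if rec.kind == "physical":
            pages[rec.payload["block"]] = rec.payload["image"]
        # Logical records are deliberately ignored here (no-op at REDO).
    return pages


def decode(records):
    """Toy logical decoding: turn logical records into change tuples."""
    return [
        (rec.payload["action"], rec.payload["table"], rec.payload["row"])
        for rec in records
        if rec.kind == "logical"
    ]


# One insert by a hypothetical columnar AM: a physical record for durability
# plus a redundant logical record so that decoding has something to read.
wal = [
    WalRecord("physical", {"block": 7, "image": b"compressed-stripe"}),
    WalRecord("logical", {"action": "INSERT", "table": "t", "row": {"id": 1}}),
]

recovered_pages = redo(wal)
changes = decode(wal)
```

The redundancy mentioned above is visible in the model: the same insert is written twice, once in a form only recovery understands and once in a form only decoding understands.
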
On Sun, Oct 31, 2021 at 11:40 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> Attached is a WIP patch to add new WAL records to represent a logical
> insert, update, or delete. These records do not do anything at REDO
> time, they are only processed during logical decoding/replication.
>
> These are intended to be used by a custom table AM, like my columnar
> compression extension[0], which currently supports physical replication
> but can't support logical decoding/replication because decoding is not
> extensible. Using these new logical records would be redundant, making
> inserts/updates/deletes less efficient, but at least logical decoding
> would work (the lack of which is columnar's biggest weakness).
>
> Alternatively, we could support extensible WAL with extensible
> decoding. I also like this approach, but it takes more work for an AM
> like columnar to get that right -- it needs to keep additional state in
> the walsender to track and assemble the compressed columns stored
> across many blocks. It also requires a lot of care, because mistakes
> can get you into serious trouble.
>
> This proposal, for new logical records without WAL extensibility,
> provides a more shallow ramp to get a table AM working (including
> logical replication/decoding) without the need to invest in the WAL
> design. Later, of course I'd like the option for extensible WAL as well
> (to be more efficient), but right now I'd prefer it just worked
> (inefficiently).
>

You have modeled DecodeLogicalInsert on the current DecodeInsert and it generates the same change message, so I am not sure how exactly these new messages will differ from the current heap_insert/update/delete messages. Also, we have code to deal with different types of messages at various places during decoding, so if they are different then we probably need to deal with them at those places as well.

--
With Regards,
Amit Kapila.

On Wed, 2021-11-03 at 11:25 +0530, Amit Kapila wrote:
> You have modeled DecodeLogicalInsert based on current DecodeInsert and
> it generates the same change message, so not sure how exactly these
> new messages will be different from current heap_insert/update/delete
> messages?

Is there a reason you think the messages should be different for heap versus another table AM? Isn't the table AM a physical implementation detail?

Regards,
Jeff Davis

On Thu, Nov 4, 2021 at 7:09 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2021-11-03 at 11:25 +0530, Amit Kapila wrote:
> > You have modeled DecodeLogicalInsert based on current DecodeInsert and
> > it generates the same change message, so not sure how exactly these
> > new messages will be different from current heap_insert/update/delete
> > messages?
>
> Is there a reason you think the messages should be different for heap
> versus another table AM? Isn't the table AM a physical implementation
> detail?
>

We have special handling for speculative insertions and toast insertions. Can't different table AMs have different representations for toast, or maybe some different concept like speculative insertions? Similarly, I remember that for zheap we didn't have TransactionIds for subtransactions, so we needed to make some changes in logical decoding to compensate for that. I guess similar stuff could be required for other table AMs. Also, a different table AM can have a different tuple format, which won't work for the current change records unless we convert it to heap tuple format before writing WAL for it.

--
With Regards,
Amit Kapila.

On Thu, 2021-11-04 at 14:33 +0530, Amit Kapila wrote:
> Can't different table AMs have different representations
> for toast or maybe some different concept like speculative
> insertions?

The decoding plugin should just use the common interfaces to toast, and if the table AM supports toast, everything should work fine. The only special thing it needs to do is check VARATT_IS_EXTERNAL_ONDISK(), because on-disk toast data can't be decoded (which is true for heap, too).

I haven't looked as much into speculative insertions, but I don't think those are a problem either. Shouldn't they be handled before they make it into the change stream that the plugin sees?

> Similarly, I remember that for zheap we didn't have
> TransactionIds for subtransactions so we needed to make some changes in
> logical decoding to compensate for that. I guess similar stuff could
> be required for other table AMs. Then a different table AM can have
> a different tuple format which won't work for current change records
> unless we convert it to heap tuple format before writing WAL for it.

Columnar certainly has a different format. That makes me wonder whether ReorderBufferChange and/or ReorderBufferTupleBuf are too low-level, and we should have a higher-level representation of a change that is based on slots.

Can you tell me more about the changes you made for zheap? I still don't understand why the decoding plugin would have to know what table AM the change came from.

Regards,
Jeff Davis

On Fri, Nov 5, 2021 at 4:53 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Thu, 2021-11-04 at 14:33 +0530, Amit Kapila wrote:
> > Can't different table AMs have different representations
> > for toast or maybe some different concept like speculative
> > insertions?
>
> The decoding plugin should just use the common interfaces to toast, and
> if the table AM supports toast, everything should work fine.
>

I think that is true only if the table AM uses the same format as heap for toast. It also seems to rely on the heap tuple format.

> The only
> special thing it needs to do is check VARATT_IS_EXTERNAL_ONDISK(),
> because on-disk toast data can't be decoded (which is true for heap,
> too).
>
> I haven't looked as much into speculative insertions, but I don't think
> those are a problem either. Shouldn't they be handled before they make
> it into the change stream that the plugin sees?
>

They will be handled before the plugin sees them, but I was asking what happens if the table AM has some other WAL, similar to the WAL for speculative insertions, that needs special handling.

> > Similarly, I remember that for zheap we didn't have
> > TransactionIds for subtransactions so we needed to make some changes in
> > logical decoding to compensate for that. I guess similar stuff could
> > be required for other table AMs. Then a different table AM can have
> > a different tuple format which won't work for current change records
> > unless we convert it to heap tuple format before writing WAL for it.
>
> Columnar certainly has a different format. That makes me wonder whether
> ReorderBufferChange and/or ReorderBufferTupleBuf are too low-level, and
> we should have a higher-level representation of a change that is based
> on slots.
>
> Can you tell me more about the changes you made for zheap? I still
> don't understand why the decoding plugin would have to know what table
> AM the change came from.
>

I am not talking about the decoding plugin but rather decoding itself: basically, the work we do in reorderbuffer.c, decode.c, etc. The two things I remember were tuple format and transaction ids, as mentioned in my previous email. I think we should try to find a solution for tuple format, as the current decoding code relies on it, if we want decoding to deal with other table AMs transparently.

--
With Regards,
Amit Kapila.

On Fri, 2021-11-05 at 10:00 +0530, Amit Kapila wrote:
> I am not talking about decoding plugin but rather decoding itself,
> basically, the work we do in reorderbuffer.c, decode.c, etc. The two
> things I remember were tuple format and transaction ids as mentioned
> in my previous email.

If it's difficult to come up with something that will work for all table AMs, then it suggests that we might want to go towards fully extensible rmgr, and have a decoding method in the rmgr.

I started a thread (with a patch) here:

https://postgr.es/m/ed1fb2e22d15d3563ae0eb610f7b61bb15999c0a.camel@j-davis.com

> I think we should try to find a solution for
> tuple format as the current decoding code relies on it if we want
> decoding to deal with other table AMs transparently.

Agreed, but it seems like we need basic extensibility first. For now, we'll need to convert to a heap tuple, but later I'd like to support other formats for the decoding plugin to work with.

Regards,
Jeff Davis

On Tue, Nov 9, 2021 at 5:12 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Fri, 2021-11-05 at 10:00 +0530, Amit Kapila wrote:
> > I am not talking about decoding plugin but rather decoding itself,
> > basically, the work we do in reorderbuffer.c, decode.c, etc. The two
> > things I remember were tuple format and transaction ids as mentioned
> > in my previous email.
>
> If it's difficult to come up with something that will work for all
> table AMs, then it suggests that we might want to go towards fully
> extensible rmgr, and have a decoding method in the rmgr.
>
> I started a thread (with a patch) here:
>
> https://postgr.es/m/ed1fb2e22d15d3563ae0eb610f7b61bb15999c0a.camel@j-davis.com
>
> > I think we should try to find a solution for
> > tuple format as the current decoding code relies on it if we want
> > decoding to deal with other table AMs transparently.
>
> Agreed, but it seems like we need basic extensibility first. For now,
> we'll need to convert to a heap tuple,
>

Okay, but that might have a cost because we need to convert it before WAL logging it, and then we probably also need to convert back to the original table AM format during recovery/standby apply.

--
With Regards,
Amit Kapila.

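The conversion cost discussed here can be pictured with a small model: a column-oriented AM holds values per column, while a change record in a common row format needs one tuple per row. The sketch below is in Python with hypothetical names (`columns_to_rows`, `rows_to_columns`, `stripe`); it is not how columnar or the heap tuple format actually work, and it only illustrates that each direction is an extra pass over the data.

```python
def columns_to_rows(columns):
    """Turn a column-major batch into row-major dicts, one per logical change."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of values")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


def rows_to_columns(rows):
    """Inverse conversion, needed if row-format records must be turned back."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# A hypothetical three-row stripe as a column store might hold it.
stripe = {"id": [1, 2, 3], "price": [9.5, 7.0, 3.25]}

rows = columns_to_rows(stripe)      # conversion before writing the records
round_trip = rows_to_columns(rows)  # conversion back to the AM's own format
```

Whether the second conversion is ever needed depends on whether the row-format records are used for anything other than decoding, which is the open question in this part of the thread.
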
On Wed, 10 Nov 2021 at 03:17, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Nov 9, 2021 at 5:12 AM Jeff Davis <pgsql@j-davis.com> wrote:
> >
> > On Fri, 2021-11-05 at 10:00 +0530, Amit Kapila wrote:
> > > I am not talking about decoding plugin but rather decoding itself,
> > > basically, the work we do in reorderbuffer.c, decode.c, etc. The two
> > > things I remember were tuple format and transaction ids as mentioned
> > > in my previous email.
> >
> > If it's difficult to come up with something that will work for all
> > table AMs, then it suggests that we might want to go towards fully
> > extensible rmgr, and have a decoding method in the rmgr.
> >
> > I started a thread (with a patch) here:
> >
> > https://postgr.es/m/ed1fb2e22d15d3563ae0eb610f7b61bb15999c0a.camel@j-davis.com
> >
> > > I think we should try to find a solution for
> > > tuple format as the current decoding code relies on it if we want
> > > decoding to deal with other table AMs transparently.
> >
> > Agreed, but it seems like we need basic extensibility first. For now,
> > we'll need to convert to a heap tuple,
> >
>
> Okay, but that might have a cost because we need to convert it before
> WAL logging it, and then we probably also need to convert back to the
> original table AM format during recovery/standby apply.

I spoke with Jeff in detail about this patch in NYC Dec 21, and I now think it is worth pursuing. It seems much more likely that this would be acceptable than fully extensible rmgr.

Amit asks a good question: should we be writing a message that seems to presume the old heap tuple format? My answer is that we clearly need it to be written in *some* common format, and the heap format currently works, so de facto it is the format we should use. Right now, TAMs have to reformat back into this same format, so it is the way the APIs work. Put another way: I don't see any other format that makes sense to use now, but that could change in the future.

So I'm signing up as a reviewer and we'll see if this is good to go.

--
Simon Riggs                http://www.EnterpriseDB.com/

On Sun, 31 Oct 2021 at 18:10, Jeff Davis <pgsql@j-davis.com> wrote:

> I'm looking for some review on the approach and structure before I
> polish and test it.

Repurposing the logical msg rmgr into a general purpose logical rmgr seems right.

Structure looks obvious, which is good.

Please pursue this and I will review with you as you go.

--
Simon Riggs                http://www.EnterpriseDB.com/

On Wed, 2022-01-05 at 20:19 +0000, Simon Riggs wrote:
> I spoke with Jeff in detail about this patch in NYC Dec 21, and I now
> think it is worth pursuing. It seems much more likely that this would
> be acceptable than fully extensible rmgr.

Thank you. I had some conversations with others who felt this approach is wasteful, which it is. But if this patch still has potential, I'm happy to pursue it along with the extensible rmgr approach.

> So I'm signing up as a reviewer and we'll see if this is good to go.

Great, I attached a rebased version.

Regards,
Jeff Davis

On Mon, 17 Jan 2022 at 09:05, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2022-01-05 at 20:19 +0000, Simon Riggs wrote:
> > I spoke with Jeff in detail about this patch in NYC Dec 21, and I now
> > think it is worth pursuing. It seems much more likely that this would
> > be acceptable than fully extensible rmgr.
>
> Thank you. I had some conversations with others who felt this approach
> is wasteful, which it is. But if this patch still has potential, I'm
> happy to pursue it along with the extensible rmgr approach.

It's self-contained and generally useful for a range of applications, so the code would be long-lived.

Calling it wasteful presumes the way you'd use it. If you make logical log entries you don't need to make physical ones, so it's actually the physical log entries that would be wasteful. Logical log entries don't need to be decoded, so it's actually more efficient, plus we could skip index entries altogether.

I would argue that this is the way we should be doing WAL, with occasional divergence to physical records for performance, rather than the other way around.

> > So I'm signing up as a reviewer and we'll see if this is good to go.
>
> Great, I attached a rebased version.

The approach is perfectly fine, it just needs to be finished and rebased.

--
Simon Riggs                http://www.EnterpriseDB.com/

Hi,

On 2022-01-17 01:05:14 -0800, Jeff Davis wrote:
> Great, I attached a rebased version.

Currently this doesn't apply: http://cfbot.cputube.org/patch_37_3394.log

- Andres

On Thu, 2022-02-24 at 20:35 +0000, Simon Riggs wrote:
> The approach is perfectly fine, it just needs to be finished and
> rebased.

Attached a new version. The scope expanded, so this is likely to slip v15 at this late time. For 15, I'll focus on my extensible rmgr work, which can serve similar purposes.

The main purpose of this patch is to be able to emit logical events for a table (insert/update/delete/truncate) without actually modifying a table or relying on the heap at all. That allows table AMs to support logical decoding/replication, and perhaps allows a few other interesting use cases (maybe foreign tables?).

There are really two advantages of this approach over relying on a custom rmgr:

  1. Easier for extension authors
  2. Doesn't require an extension module to be loaded to start the server

Those are very nice advantages, but they come at the price of flexibility and performance. I think there's room for both, and we can discuss the merits individually.

Changes:

  * Support logical messages for INSERT/UPDATE/DELETE/TRUNCATE (before it only supported INSERT)
  * SQL functions pg_logical_emit_insert/update/delete/truncate (callable by superuser)
  * Tests (using test_decoding)
  * Use replica identities properly
  * In general a lot of cleanup, but still not quite ready

TODO:

  * Should SQL functions be callable by the table owner? I would lean toward superuser-only, because logical replication is used for administrative purposes like upgrades, and I don't think we want table owners to be able to create inconsistencies.
  * Support for multi-insert
  * Docs for SQL functions, and maybe docs in the section on Generic WAL
  * Try to get away from reliance on heap tuples specifically

Regards,
Jeff Davis

On Wed, Mar 30, 2022 at 2:31 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Attached a new version. The scope expanded, so this is likely to slip
> v15 at this late time. For 15, I'll focus on my extensible rmgr work,
> which can serve similar purposes.
>
> The main purpose of this patch is to be able to emit logical events for
> a table (insert/update/delete/truncate) without actually modifying a
> table or relying on the heap at all. That allows table AMs to support
> logical decoding/replication, and perhaps allows a few other
> interesting use cases (maybe foreign tables?). There are really two
> advantages of this approach over relying on a custom rmgr:
>
> 1. Easier for extension authors
> 2. Doesn't require an extension module to be loaded to start the
> server

I'm not sure that I understand exactly how this is intended to be used. I can think of three possibilities:

1. An AM that doesn't care about having anything happening during recovery, but wants to be able to get logical decoding to do some work. Maybe the intention of the AM is that data is available only when the server is not in recovery and all data is lost on shutdown, or maybe the AM has its own separate durability mechanism.

2. An AM that wants things to happen during recovery, but handles that separately. For example, maybe it logs all the data changes via log_newpage() and then separately emits these log records.

3. An AM that wants things to happen during recovery, and expects these records to serve that purpose. That would require getting control during WAL replay as well as during decoding, and would probably only work for an AM whose data is not block-structured (e.g. an in-memory btree that is dumped to disk at every checkpoint).

My guess is that this is intended to meet use cases 1 and 2 but not 3. Is that correct?

--
Robert Haas
EDB: http://www.enterprisedb.com

On Thu, 2022-03-31 at 09:05 -0400, Robert Haas wrote:
> 1. An AM that doesn't care about having anything happening during
> recovery, but wants to be able to get logical decoding to do some
> work. Maybe the intention of the AM is that data is available only
> when the server is not in recovery and all data is lost on shutdown,
> or maybe the AM has its own separate durability mechanism.

This is a speculative use case that is not what I would use it for, but perhaps someone wants to do that with a table AM or maybe an FDW.

> 2. An AM that wants things to happen during recovery, but handles that
> separately. For example, maybe it logs all the data changes via
> log_newpage() and then separately emits these log records.

Yes, or Generic WAL. Generic WAL seems like a half-feature without this Logical WAL patch, because it's hopeless to support logical decoding/replication of whatever data you're logging with Generic WAL. That's probably the strongest argument for this patch.

> 3. An AM that wants things to happen during recovery, and expects
> these records to serve that purpose. That would require getting
> control during WAL replay as well as during decoding, and would
> probably only work for an AM whose data is not block-structured (e.g.
> an in-memory btree that is dumped to disk at every checkpoint).

This patch would not work in this case because the records are ignored during REDO.

Regards,
Jeff Davis

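The difference between use cases 2 and 3 comes down to what recovery can rebuild from the WAL stream. The following is a toy Python model, assuming a WAL made of `(kind, key, value)` tuples; `replay`, `decode`, and both example streams are hypothetical and only illustrate the reasoning in this exchange.

```python
def replay(wal):
    """Toy crash recovery: only physical records rebuild on-disk state."""
    state = {}
    for kind, key, value in wal:
        if kind == "physical":
            state[key] = value
    return state


def decode(wal):
    """Toy logical decoding: only logical records produce changes."""
    return [(key, value) for kind, key, value in wal if kind == "logical"]


# Use case 2: durability is handled separately (for example a page image),
# and a logical record is emitted alongside it for decoding.
wal_case2 = [
    ("physical", "block-0", "page image containing row id=1"),
    ("logical", "t", "INSERT id=1"),
]

# Use case 3: the AM expects the logical record alone to drive recovery.
wal_case3 = [("logical", "t", "INSERT id=1")]

state_case2 = replay(wal_case2)
state_case3 = replay(wal_case3)
```

In this model, use case 2 recovers its page and still decodes the insert, while use case 3 decodes the insert but recovers nothing, which matches the answer that the records are ignored during REDO.
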
On Mon, 2022-03-21 at 17:43 -0700, Andres Freund wrote:
> Currently this doesn't apply:
> http://cfbot.cputube.org/patch_37_3394.log

Withdrawn for now. With custom WAL resource managers this is less important to me. I still think it has value, and I'm willing to work on it if more use cases come forward.

Regards,
Jeff Davis