Thread: Logical insert/update/delete WAL records for custom table AMs

Logical insert/update/delete WAL records for custom table AMs

From
Jeff Davis
Date:
Attached is a WIP patch to add new WAL records to represent a logical
insert, update, or delete. These records do not do anything at REDO
time, they are only processed during logical decoding/replication.

These are intended to be used by a custom table AM, like my columnar
compression extension[0], which currently supports physical replication
but can't support logical decoding/replication because decoding is not
extensible. Using these new logical records would be redundant, making
inserts/updates/deletes less efficient, but at least logical decoding
would work (the lack of which is columnar's biggest weakness).

Alternatively, we could support extensible WAL with extensible
decoding. I also like this approach, but it takes more work for an AM
like columnar to get that right -- it needs to keep additional state in
the walsender to track and assemble the compressed columns stored
across many blocks. It also requires a lot of care, because mistakes
can get you into serious trouble.

This proposal, for new logical records without WAL extensibility,
provides a more shallow ramp to get a table AM working (including
logical replication/decoding) without the need to invest in the WAL
design. Later, of course I'd like the option for extensible WAL as well
(to be more efficient), but right now I'd prefer it just worked
(inefficiently).

The patch is still very rough, but I tried in simple insert cases in my
columnar[0] extension (which only supports insert, not update/delete).
I'm looking for some review on the approach and structure before I
polish and test it. Note that my main test case is columnar, which
doesn't support update/delete. Also note that the patch is against v14
(for now).

Regards,
    Jeff Davis

[0] https://github.com/citusdata/citus/tree/master/src/backend/columnar

Attachment

Re: Logical insert/update/delete WAL records for custom table AMs

From
Amit Kapila
Date:
On Sun, Oct 31, 2021 at 11:40 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> Attached is a WIP patch to add new WAL records to represent a logical
> insert, update, or delete. These records do not do anything at REDO
> time, they are only processed during logical decoding/replication.
>
> These are intended to be used by a custom table AM, like my columnar
> compression extension[0], which currently supports physical replication
> but can't support logical decoding/replication because decoding is not
> extensible. Using these new logical records would be redundant, making
> inserts/updates/deletes less efficient, but at least logical decoding
> would work (the lack of which is columnar's biggest weakness).
>
> Alternatively, we could support extensible WAL with extensible
> decoding. I also like this approach, but it takes more work for an AM
> like columnar to get that right -- it needs to keep additional state in
> the walsender to track and assemble the compressed columns stored
> across many blocks. It also requires a lot of care, because mistakes
> can get you into serious trouble.
>
> This proposal, for new logical records without WAL extensibility,
> provides a more shallow ramp to get a table AM working (including
> logical replication/decoding) without the need to invest in the WAL
> design. Later, of course I'd like the option for extensible WAL as well
> (to be more efficient), but right now I'd prefer it just worked
> (inefficiently).
>

You have modeled DecodeLogicalInsert based on current DecodeInsert and
it generates the same change message, so not sure how exactly these
new messages will be different from current heap_insert/update/delete
messages? Also, we have code to deal with different types of messages
at various places during decoding, so if they will be different then
we need to probably deal at those places as well.

-- 
With Regards,
Amit Kapila.



Re: Logical insert/update/delete WAL records for custom table AMs

From
Jeff Davis
Date:
On Wed, 2021-11-03 at 11:25 +0530, Amit Kapila wrote:
> You have modeled DecodeLogicalInsert based on current DecodeInsert
> and
> it generates the same change message, so not sure how exactly these
> new messages will be different from current heap_insert/update/delete
> messages? 

Is there a reason you think the messages should be different for heap
versus another table AM? Isn't the table AM a physical implementation
detail?

Regards,
    Jeff Davis





Re: Logical insert/update/delete WAL records for custom table AMs

From
Amit Kapila
Date:
On Thu, Nov 4, 2021 at 7:09 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2021-11-03 at 11:25 +0530, Amit Kapila wrote:
> > You have modeled DecodeLogicalInsert based on current DecodeInsert
> > and
> > it generates the same change message, so not sure how exactly these
> > new messages will be different from current heap_insert/update/delete
> > messages?
>
> Is there a reason you think the messages should be different for heap
> versus another table AM? Isn't the table AM a physical implementation
> detail?
>

We have special handling for speculative insertions and toast
insertions. Can't different tableAM's have different representations
for toast or may be some different concept like speculative
insertions? Similarly, I remember that for zheap we didn't had
TransactionIds for subtransactions so we need to make some changes in
logical decoding to compensate for that. I guess similar stuff could
be required for another table AMs. Then a different table AM can have
a different tuple format which won't work for current change records
unless we convert it to heap tuple format before writing WAL for it.

-- 
With Regards,
Amit Kapila.



Re: Logical insert/update/delete WAL records for custom table AMs

From
Jeff Davis
Date:
On Thu, 2021-11-04 at 14:33 +0530, Amit Kapila wrote:
> Can't different tableAM's have different representations
> for toast or may be some different concept like speculative
> insertions?

The decoding plugin should just use the common interfaces to toast, and
if the table AM supports toast, everything should work fine. The only
special thing it needs to do is check VARATT_IS_EXTERNAL_ONDISK(),
because on-disk toast data can't be decoded (which is true for heap,
too).

I haven't looked as much into speculative insertions, but I don't think
those are a problem either. Shouldn't they be handled before they make
it into the change stream that the plugin sees?

> Similarly, I remember that for zheap we didn't had
> TransactionIds for subtransactions so we need to make some changes in
> logical decoding to compensate for that. I guess similar stuff could
> be required for another table AMs. Then a different table AM can have
> a different tuple format which won't work for current change records
> unless we convert it to heap tuple format before writing WAL for it.

Columnar certainly has a different format. That makes me wonder whether
ReorderBufferChange and/or ReorderBufferTupleBuf are too low-level, and
we should have a higher-level representation of a change that is based
on slots.

Can you tell me more about the changes you made for zheap? I still
don't understand why the decoding plugin would have to know what table
AM the change came from.

Regards,
    Jeff Davis





Re: Logical insert/update/delete WAL records for custom table AMs

From
Amit Kapila
Date:
On Fri, Nov 5, 2021 at 4:53 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Thu, 2021-11-04 at 14:33 +0530, Amit Kapila wrote:
> > Can't different tableAM's have different representations
> > for toast or may be some different concept like speculative
> > insertions?
>
> The decoding plugin should just use the common interfaces to toast, and
> if the table AM supports toast, everything should work fine.
>

I think that is true only if table AM uses same format as heap for
toast. It also seems to be relying heap tuple format.

> The only
> special thing it needs to do is check VARATT_IS_EXTERNAL_ONDISK(),
> because on-disk toast data can't be decoded (which is true for heap,
> too).
>
> I haven't looked as much into speculative insertions, but I don't think
> those are a problem either. Shouldn't they be handled before they make
> it into the change stream that the plugin sees?
>

They will be handled before the plugin sees them but I was talking
about what if the table AM has some other WAL similar to WAL of
speculative insertions that needs special handling.

> > Similarly, I remember that for zheap we didn't had
> > TransactionIds for subtransactions so we need to make some changes in
> > logical decoding to compensate for that. I guess similar stuff could
> > be required for another table AMs. Then a different table AM can have
> > a different tuple format which won't work for current change records
> > unless we convert it to heap tuple format before writing WAL for it.
>
> Columnar certainly has a different format. That makes me wonder whether
> ReorderBufferChange and/or ReorderBufferTupleBuf are too low-level, and
> we should have a higher-level representation of a change that is based
> on slots.
>
> Can you tell me more about the changes you made for zheap? I still
> don't understand why the decoding plugin would have to know what table
> AM the change came from.
>

I am not talking about decoding plugin but rather decoding itself,
basically, the work we do in reorderbuffer.c, decode.c, etc. The two
things I remember were tuple format and transaction ids as mentioned
in my previous email. I think we should try to find a solution for
tuple format as the current decoding code relies on it if we want
decoding to deal with another table AMs transparently.

-- 
With Regards,
Amit Kapila.



Re: Logical insert/update/delete WAL records for custom table AMs

From
Jeff Davis
Date:
On Fri, 2021-11-05 at 10:00 +0530, Amit Kapila wrote:
> I am not talking about decoding plugin but rather decoding itself,
> basically, the work we do in reorderbuffer.c, decode.c, etc. The two
> things I remember were tuple format and transaction ids as mentioned
> in my previous email. 

If it's difficult to come up with something that will work for all
table AMs, then it suggests that we might want to go towards fully
extensible rmgr, and have a decoding method in the rmgr.

I started a thread (with a patch) here: 


https://postgr.es/m/ed1fb2e22d15d3563ae0eb610f7b61bb15999c0a.camel@j-davis.com

> I think we should try to find a solution for
> tuple format as the current decoding code relies on it if we want
> decoding to deal with another table AMs transparently.

Agreed, but it seems like we need basic extensibility first. For now,
we'll need to convert to a heap tuple, but later I'd like to support
other formats for the decoding plugin to work with.

Regards,
    Jeff Davis





Re: Logical insert/update/delete WAL records for custom table AMs

From
Amit Kapila
Date:
On Tue, Nov 9, 2021 at 5:12 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Fri, 2021-11-05 at 10:00 +0530, Amit Kapila wrote:
> > I am not talking about decoding plugin but rather decoding itself,
> > basically, the work we do in reorderbuffer.c, decode.c, etc. The two
> > things I remember were tuple format and transaction ids as mentioned
> > in my previous email.
>
> If it's difficult to come up with something that will work for all
> table AMs, then it suggests that we might want to go towards fully
> extensible rmgr, and have a decoding method in the rmgr.
>
> I started a thread (with a patch) here:
>
>
> https://postgr.es/m/ed1fb2e22d15d3563ae0eb610f7b61bb15999c0a.camel@j-davis.com
>
> > I think we should try to find a solution for
> > tuple format as the current decoding code relies on it if we want
> > decoding to deal with another table AMs transparently.
>
> Agreed, but it seems like we need basic extensibility first. For now,
> we'll need to convert to a heap tuple,
>

Okay, but that might have a cost because we need to convert it before
WAL logging it, and then we probably also need to convert back to the
original table AM format during recovery/standby apply.

-- 
With Regards,
Amit Kapila.



Re: Logical insert/update/delete WAL records for custom table AMs

From
Simon Riggs
Date:
On Wed, 10 Nov 2021 at 03:17, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Nov 9, 2021 at 5:12 AM Jeff Davis <pgsql@j-davis.com> wrote:
> >
> > On Fri, 2021-11-05 at 10:00 +0530, Amit Kapila wrote:
> > > I am not talking about decoding plugin but rather decoding itself,
> > > basically, the work we do in reorderbuffer.c, decode.c, etc. The two
> > > things I remember were tuple format and transaction ids as mentioned
> > > in my previous email.
> >
> > If it's difficult to come up with something that will work for all
> > table AMs, then it suggests that we might want to go towards fully
> > extensible rmgr, and have a decoding method in the rmgr.
> >
> > I started a thread (with a patch) here:
> >
> >
> > https://postgr.es/m/ed1fb2e22d15d3563ae0eb610f7b61bb15999c0a.camel@j-davis.com
> >
> > > I think we should try to find a solution for
> > > tuple format as the current decoding code relies on it if we want
> > > decoding to deal with another table AMs transparently.
> >
> > Agreed, but it seems like we need basic extensibility first. For now,
> > we'll need to convert to a heap tuple,
> >
>
> Okay, but that might have a cost because we need to convert it before
> WAL logging it, and then we probably also need to convert back to the
> original table AM format during recovery/standby apply.

I spoke with Jeff in detail about this patch in NYC Dec 21, and I now
think it is worth pursuing. It seems much more likely that this would
be acceptable than fully extensible rmgr.

Amit asks a good question: should we be writing a message that seems
to presume the old heap tuple format? My answer is that we clearly
need it to be written in *some* common format, and the current heap
format currently works, so de facto it is the format we should use.
Right now, TAMs have to reformat back into this same format, so it is
the way the APIs work. Put it another way: I don't see any other
format that makes sense to use, now, but that could change in the
future.

So I'm signing up as a reviewer and we'll see if this is good to go.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Logical insert/update/delete WAL records for custom table AMs

From
Simon Riggs
Date:
On Sun, 31 Oct 2021 at 18:10, Jeff Davis <pgsql@j-davis.com> wrote:

> I'm looking for some review on the approach and structure before I
> polish and test it.

Repurposing the logical msg rmgr into a general purpose logical rmgr
seems right.

Structure looks obvious, which is good.

Please pursue this and I will review with you as you go.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Logical insert/update/delete WAL records for custom table AMs

From
Jeff Davis
Date:
On Wed, 2022-01-05 at 20:19 +0000, Simon Riggs wrote:
> I spoke with Jeff in detail about this patch in NYC Dec 21, and I now
> think it is worth pursuing. It seems much more likely that this would
> be acceptable than fully extensible rmgr.

Thank you. I had some conversations with others who felt this approach
is wasteful, which it is. But if this patch still has potential, I'm
happy to pursue it along with the extensible rmgr approach.

> So I'm signing up as a reviewer and we'll see if this is good to go.

Great, I attached a rebased version.

Regards,
    Jeff Davis


Attachment

Re: Logical insert/update/delete WAL records for custom table AMs

From
Simon Riggs
Date:
On Mon, 17 Jan 2022 at 09:05, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2022-01-05 at 20:19 +0000, Simon Riggs wrote:
> > I spoke with Jeff in detail about this patch in NYC Dec 21, and I now
> > think it is worth pursuing. It seems much more likely that this would
> > be acceptable than fully extensible rmgr.
>
> Thank you. I had some conversations with others who felt this approach
> is wasteful, which it is. But if this patch still has potential, I'm
> happy to pursue it along with the extensible rmgr approach.

It's self-contained and generally useful for a range of applications,
so the code would be long-lived.

Calling it wasteful presumes the way you'd use it. If you make logical
log entries you don't need to make physical ones, so its actually the
physical log entries that would be wasteful.

Logical log entries don't need to be decoded, so it's actually more
efficient, plus we could skip index entries altogether.

I would argue that this is the way we should be doing WAL, with
occasional divergence to physical records for performance, rather than
the other way around.

> > So I'm signing up as a reviewer and we'll see if this is good to go.
>
> Great, I attached a rebased version.

The approach is perfectly fine, it just needs to be finished and rebased.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Logical insert/update/delete WAL records for custom table AMs

From
Andres Freund
Date:
Hi,

On 2022-01-17 01:05:14 -0800, Jeff Davis wrote:
> Great, I attached a rebased version.

Currently this doesn't apply: http://cfbot.cputube.org/patch_37_3394.log

- Andres



Re: Logical insert/update/delete WAL records for custom table AMs

From
Jeff Davis
Date:
On Thu, 2022-02-24 at 20:35 +0000, Simon Riggs wrote:
> The approach is perfectly fine, it just needs to be finished and
> rebased.

Attached a new version. The scope expanded, so this is likely to slip
v15 at this late time. For 15, I'll focus on my extensible rmgr work,
which can serve similar purposes.

The main purpose of this patch is to be able to emit logical events for
a table (insert/update/delete/truncate) without actually modifying a
table or relying on the heap at all. That allows table AMs to support
logical decoding/replication, and perhaps allows a few other
interesting use cases (maybe foreign tables?). There are really two
advantages of this approach over relying on a custom rmgr:

  1. Easier for extension authors
  2. Doesn't require an extension module to be loaded to start the
server

Those are very nice advantages, but they come at the price of
flexibility and performance. I think there's room for both, and we can
discuss the merits individually.

Changes:
  * Support logical messages for INSERT/UPDATE/DELETE/TRUNCATE
    (before it only supported INSERT)
  * SQL functions pg_logical_emit_insert/update/delete/truncate
    (callable by superuser)
  * Tests (using test_decoding)
  * Use replica identities properly
  * In general a lot of cleanup, but still not quite ready

TODO:
  * Should SQL functions be callable by the table owner? I would lean
toward superuser-only, because logical replication is used for
administrative purposes like upgrades, and I don't think we want table
owners to be able to create inconsistencies.
  * Support for multi-insert
  * Docs for SQL functions, and maybe docs in the section on Generic
WAL
  * Try to get away from reliance on heap tuples specifically

Regards,
    Jeff Davis


Attachment

Re: Logical insert/update/delete WAL records for custom table AMs

From
Robert Haas
Date:
On Wed, Mar 30, 2022 at 2:31 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Attached a new version. The scope expanded, so this is likely to slip
> v15 at this late time. For 15, I'll focus on my extensible rmgr work,
> which can serve similar purposes.
>
> The main purpose of this patch is to be able to emit logical events for
> a table (insert/update/delete/truncate) without actually modifying a
> table or relying on the heap at all. That allows table AMs to support
> logical decoding/replication, and perhaps allows a few other
> interesting use cases (maybe foreign tables?). There are really two
> advantages of this approach over relying on a custom rmgr:
>
>   1. Easier for extension authors
>   2. Doesn't require an extension module to be loaded to start the
> server

I'm not sure that I understand exactly how this is intended to be
used. I can think of three possibilities:

1. An AM that doesn't care about having anything happening during
recovery, but wants to be able to get logical decoding to do some
work. Maybe the intention of the AM is that data is available only
when the server is not in recovery and all data is lost on shutdown,
or maybe the AM has its own separate durability mechanism.

2. An AM that wants things to happen during recovery, but handles that
separately. For example, maybe it logs all the data changes via
log_newpage() and then separately emits these log records.

3. An AM that wants things to happen during recovery, and expects
these records to serve that purpose. That would require getting
control during WAL replay as well as during decoding, and would
probably only work for an AM whose data is not block-structured (e.g.
an in-memory btree that is dumped to disk at every checkpoint).

My guess is that this is intended to meet use cases 1 and 2 but not 3.
Is that correct?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Logical insert/update/delete WAL records for custom table AMs

From
Jeff Davis
Date:
On Thu, 2022-03-31 at 09:05 -0400, Robert Haas wrote:
> 1. An AM that doesn't care about having anything happening during
> recovery, but wants to be able to get logical decoding to do some
> work. Maybe the intention of the AM is that data is available only
> when the server is not in recovery and all data is lost on shutdown,
> or maybe the AM has its own separate durability mechanism.

This is a speculative use case that is not what I would use it for, but
perhaps someone wants to do that with a table AM or maybe an FDW.

> 2. An AM that wants things to happen during recovery, but handles
> that
> separately. For example, maybe it logs all the data changes via
> log_newpage() and then separately emits these log records.

Yes, or Generic WAL. Generic WAL seems like a half-feature without this
Logical WAL patch, because it's hopeless to support logical
decoding/replication of whatever data you're logging with Generic WAL.
That's probably the strongest argument for this patch.

> 3. An AM that wants things to happen during recovery, and expects
> these records to serve that purpose. That would require getting
> control during WAL replay as well as during decoding, and would
> probably only work for an AM whose data is not block-structured (e.g.
> an in-memory btree that is dumped to disk at every checkpoint).

This patch would not work in this case because the records are ignored
during REDO.

Regards,
    Jeff Davis






Re: Logical insert/update/delete WAL records for custom table AMs

From
Jeff Davis
Date:
On Mon, 2022-03-21 at 17:43 -0700, Andres Freund wrote:
> Currently this doesn't apply: 
> http://cfbot.cputube.org/patch_37_3394.log

Withdrawn for now. With custom WAL resource managers this is less
important to me.

I still think it has value, and I'm willing to work on it if more use
cases come forward.

Regards,
    Jeff Davis