Thread: Show various offset arrays for heap WAL records

Show various offset arrays for heap WAL records

From
Andres Freund
Date:
Hi,

A couple times when investigating data corruption issues, the last time just
yesterday in [1], I needed to see the offsets affected by PRUNE and VACUUM
records. As that's probably not just me, I think we should make that change
in-tree.

The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
XLOG_HEAP2_FREEZE_PAGE.

The biggest issue I have with the patch is that it's very hard to figure out
what punctuation to use where ;). The existing code is very inconsistent.

I chose to include infomask[2] for the different freeze plans mainly because
it looks odd to see different plans without a visible reason. But I'm not sure
that's the right choice.

Greetings,

Andres Freund

[1] https://postgr.es/m/CANtu0ojby3eBdMXfs4QmS%2BK1avBc7NcRq_Ot5bnzrbwM%2BuQ55w%40mail.gmail.com

Attachment

Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Mon, Jan 9, 2023 at 1:58 PM Andres Freund <andres@anarazel.de> wrote:
> A couple times when investigating data corruption issues, the last time just
> yesterday in [1], I needed to see the offsets affected by PRUNE and VACUUM
> records. As that's probably not just me, I think we should make that change
> in-tree.

I remember how useful this was when we were investigating that early
bug in 14, that turned out to be in parallel VACUUM. So I'm all in
favor of it.

> The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
> XLOG_HEAP2_FREEZE_PAGE.

I'm bound to end up doing the same in index access methods. Might make
sense for the utility routines to live somewhere more centralized, at
least when code reuse is likely. Practically every index AM has WAL
records that include a sorted page offset number array, just like
these ones. It's a very standard thing, obviously.

> I chose to include infomask[2] for the different freeze plans mainly because
> it looks odd to see different plans without a visible reason. But I'm not sure
> that's the right choice.

I don't think that it is particularly necessary to do so in order for
the output to make sense -- pg_waldump is inherently a tool for
experts. What it comes down to for me is whether or not this
information is sufficiently useful to display, and/or can be (or needs
to be) controlled via some kind of verbosity knob.

I think that it easily could be useful, and I also think that it
easily could be a bit annoying. How hard would it be to invent a
general mechanism to control the verbosity of what we'll show for each
WAL record?

-- 
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Andres Freund
Date:
Hi,

On 2023-01-09 19:59:42 -0800, Peter Geoghegan wrote:
> On Mon, Jan 9, 2023 at 1:58 PM Andres Freund <andres@anarazel.de> wrote:
> > A couple times when investigating data corruption issues, the last time just
> > yesterday in [1], I needed to see the offsets affected by PRUNE and VACUUM
> > records. As that's probably not just me, I think we should make that change
> > in-tree.
> 
> I remember how useful this was when we were investigating that early
> bug in 14, that turned out to be in parallel VACUUM. So I'm all in
> favor of it.

Cool.


> > The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
> > XLOG_HEAP2_FREEZE_PAGE.
> 
> I'm bound to end up doing the same in index access methods. Might make
> sense for the utility routines to live somewhere more centralized, at
> least when code reuse is likely. Practically every index AM has WAL
> records that include a sorted page offset number array, just like
> these ones. It's a very standard thing, obviously.

Hm, there doesn't seem to be a great location for them today. I guess we could
add something like src/include/access/rmgrdesc_utils.h? And put the
implementation in src/backend/access/rmgrdesc/rmgrdesc_utils.c? I first was
thinking of just rmgrdesc.[ch], but custom rmgrs added
src/bin/pg_waldump/rmgrdesc.[ch] ...


> > I chose to include infomask[2] for the different freeze plans mainly because
> > it looks odd to see different plans without a visible reason. But I'm not sure
> > that's the right choice.
> 
> I don't think that it is particularly necessary to do so in order for
> the output to make sense -- pg_waldump is inherently a tool for
> experts. What it comes down to for me is whether or not this
> information is sufficiently useful to display, and/or can be (or needs
> to be) controlled via some kind of verbosity knob.

It seemed useful enough to me, but I likely also stare more at this stuff than
most. Compared to the list of offsets it's not that much content.


> How hard would it be to invent a general mechanism to control the verbosity
> of what we'll show for each WAL record?

Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
    void        (*rm_desc) (StringInfo buf, XLogReaderState *record);

so we'd need to patch all of them. That might be worth doing at some point,
but I don't want to tackle it right now.

Greetings,

Andres Freund



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Tue, Jan 10, 2023 at 11:35 AM Andres Freund <andres@anarazel.de> wrote:
> Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
>         void            (*rm_desc) (StringInfo buf, XLogReaderState *record);
>
> so we'd need to patch all of them. That might be worth doing at some point,
> but I don't want to tackle it right now.

Okay. Let's just get the basics in soon, then.

I would like to have a similar capability for index access methods,
but mostly just for investigating performance. Whenever we've really
needed something like this for debugging it seems to have been a
heapam thing, just because there's a lot more that can go wrong with
pruning, which is spread across many different places.

-- 
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Andres Freund
Date:
Hi,

On 2023-01-11 14:53:54 -0800, Peter Geoghegan wrote:
> On Tue, Jan 10, 2023 at 11:35 AM Andres Freund <andres@anarazel.de> wrote:
> > Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
> >         void            (*rm_desc) (StringInfo buf, XLogReaderState *record);
> >
> > so we'd need to patch all of them. That might be worth doing at some point,
> > but I don't want to tackle it right now.
> 
> Okay. Let's just get the basics in soon, then.

> I would like to have a similar capability for index access methods,
> but mostly just for investigating performance. Whenever we've really
> needed something like this for debugging it seems to have been a
> heapam thing, just because there's a lot more that can go wrong with
> pruning, which is spread across many different places.

What are your thoughts about the place for the helper functions? You're ok
with rmgrdesc_utils.[ch]?

Greetings,

Andres Freund



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Wed, Jan 11, 2023 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:
> What are your thoughts about the place for the helper functions? You're ok
> with rmgrdesc_utils.[ch]?

Yeah, that seems okay.

We may well need to put more stuff in that file. We're overdue a big
overhaul of the rmgr output, so that everybody uses the same format
for everything. We made some progress on that for 16 already, by
standardizing on the name snapshotConflictHorizon, but a lot of
annoying inconsistencies still remain. Like the punctuation issue you
mentioned.

Ideally we'd be able to make the output more easy to manipulate via
the SQL interface from pg_walinspect, or perhaps via scripting. That
would require some rules that are imposed top-down, so that consumers
of the data can make certain general assumptions. But that's fairly
natural. It's not like there is just inherently a great deal of
diversity that we need to be considered. For example, the WAL records
used by each individual index access method are all very similar. In
fact the most important index AM WAL records used by each index AM
(e.g. insert, delete, vacuum) have virtually the same format as each
other already.

-- 
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Wed, Jan 11, 2023 at 3:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jan 11, 2023 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:
> > What are your thoughts about the place for the helper functions? You're ok
> > with rmgrdesc_utils.[ch]?
>
> Yeah, that seems okay.

BTW, while playing around with this patch today, I noticed that it
won't display the number of elements in each offset array directly.
Perhaps it's worth including that, too?

-- 
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Melanie Plageman
Date:
Hi,

I have taken a stab at doing some of the tasks listed in this email.

I have made the new files rmgr_utils.c/h.

I have come up with a standard format that I like for the output and
used it in all the heap record types.

Examples below:

snapshotConflictHorizon: 2184, nplans: 2, plans [ { xmax: 0, infomask:
2816, infomask2: 2, ntuples: 5, offsets: [ 10, 11, 12, 18, 71 ] }, {
xmax: 0, infomask: 11008, infomask2: 2, ntuples: 2, offsets: [ 72, 73
] } ]

snapshotConflictHorizon: 2199, nredirected: 4, ndead: 0, nunused: 4,
redirected: [ 1->38, 2->39, 3->40, 4->41 ], dead: [], unused: [ 24,
25, 26, 27, 37 ]

I started documenting it in the rmgr_utils.h header file in a comment,
however it may be worth a README?

I haven't polished this description of the format (or added examples,
etc) or used it in the btree-related functions because I assume the
format and helper function API will need more discussion.

This is still a rough draft, as I anticipate changes will be requested.
I would split it into multiple patches, etc. But I am looking for
feedback on the suggested format and the array formatting helper
function API.

Perhaps there should also be example output of the offset arrays in
pgwalinspect docs?

I've changed the array format helper functions that Andres added to be a
single function with an additional layer of indirection so that any
record with an array can use it regardless of type and format of the
individual elements. The signature is based somewhat off of qsort_r()
and allows the user to pass a function with the the desired format of
the elements.

On a semi-unrelated note, I think it might be nice to have a comment in
heapam_xlog.h about what the infobits fields actually are and why they
exist -- e.g. we only need a subset of infomask[2] bits in these
records.
I put a random comment in the code where I think it should go.
I will delete it later, of course.

On Mon, Jan 9, 2023 at 11:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Jan 9, 2023 at 1:58 PM Andres Freund <andres@anarazel.de> wrote:
> > The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
> > XLOG_HEAP2_FREEZE_PAGE.
>
> I'm bound to end up doing the same in index access methods. Might make
> sense for the utility routines to live somewhere more centralized, at
> least when code reuse is likely. Practically every index AM has WAL
> records that include a sorted page offset number array, just like
> these ones. It's a very standard thing, obviously.

I plan to add these if the format and API I suggested seems like the
right direction.

On Tue, Jan 10, 2023 at 2:35 PM Andres Freund <andres@anarazel.de> wrote:
>
> > > I chose to include infomask[2] for the different freeze plans mainly because
> > > it looks odd to see different plans without a visible reason. But I'm not sure
> > > that's the right choice.
> >
> > I don't think that it is particularly necessary to do so in order for
> > the output to make sense -- pg_waldump is inherently a tool for
> > experts. What it comes down to for me is whether or not this
> > information is sufficiently useful to display, and/or can be (or needs
> > to be) controlled via some kind of verbosity knob.
>
> It seemed useful enough to me, but I likely also stare more at this stuff than
> most. Compared to the list of offsets it's not that much content.
>

Personally, I like having the infomasks for the freeze plans. If we
someday have a more structured input to rmgr_desc, we could then easily
have them in their own column and use functions like
heap_tuple_infomask_flags() on them.

> > How hard would it be to invent a general mechanism to control the verbosity
> > of what we'll show for each WAL record?
>
> Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
>         void            (*rm_desc) (StringInfo buf, XLogReaderState *record);
>
> so we'd need to patch all of them. That might be worth doing at some point,
> but I don't want to tackle it right now.

In terms of a more structured format, it seems like it would make the
most sense to pass a JSON or composite datatype structure to rm_desc
instead of that StringInfo.

I would also like to see functions like XLogRecGetBlockRefInfo() pass
something more useful than a stringinfo buffer so that we could easily
extract out the relfilenode in pgwalinspect.

On Mon, Jan 16, 2023 at 10:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jan 11, 2023 at 3:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Wed, Jan 11, 2023 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:
> > > What are your thoughts about the place for the helper functions? You're ok
> > > with rmgrdesc_utils.[ch]?
> >
> > Yeah, that seems okay.
>
> BTW, while playing around with this patch today, I noticed that it
> won't display the number of elements in each offset array directly.
> Perhaps it's worth including that, too?

I believe I have addressed this in the attached patch.

- Melanie

Attachment

Re: Show various offset arrays for heap WAL records

From
Robert Haas
Date:
On Fri, Jan 27, 2023 at 12:24 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I believe I have addressed this in the attached patch.

I'm not sure what's best in terms of formatting details but I
definitely like the idea of making pg_waldump show more details. I'd
even like to have a way to extract the tuple data, when it's
operations on tuples and we have those tuples in the payload. That'd
be a lot more verbose than what you are doing here, though, and I'm
not saying you should go do it right now or anything like that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Fri, Jan 27, 2023 at 9:24 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I have taken a stab at doing some of the tasks listed in this email.

Cool.

> I have made the new files rmgr_utils.c/h.
>
> I have come up with a standard format that I like for the output and
> used it in all the heap record types.
>
> Examples below:

That seems like a reasonable approach.

> I started documenting it in the rmgr_utils.h header file in a comment,
> however it may be worth a README?
>
> I haven't polished this description of the format (or added examples,
> etc) or used it in the btree-related functions because I assume the
> format and helper function API will need more discussion.

I think that standardization is good, but ISTM that we need clarity on
what the scope is -- what is *not* being standardized? It may or may
not be useful to call the end result an API. Or it may not make sense
to do so in the first committed version, even though we may ultimately
end up as something that deserves to be called an API. The obligation
to not break tools that are scraping the output in whatever way seems
kind of onerous right now -- just not having any gratuitous
inconsistencies (e.g., fixing totally inconsistent punctuation, making
the names for fields across WAL records consistent when they serve
exactly the same purpose) would be a big improvement.

As I mentioned in passing already, I actually don't think that the
B-Tree WAL records are all that special, as far as this stuff goes.
For example, the DELETE Btree record type is very similar to heapam's
PRUNE record type, and practically identical to Btree's VACUUM record
type. All of these record types use the same basic conventions, like a
snapshotConflictHorizon field for recovery conflicts (which is
generated in a very similar way during original execution, and
processed in precisely the same way during REDO), and arrays of page
offset numbers sorted in ascending order.

There are some remaining details where things from an index AM WAL
record aren't directly analogous (or pretty much identical) to some
other heapam WAL records, such as the way that the DELETE Btree record
type deals with deleting a subset of TIDs from a posting list index
tuple (generated by B-Tree deduplication). But even these exceptions
don't require all that much discussion. You could either choose to
only display the array of deleted index tuple page offset numbers, as
well as the similar array of "updated" index tuple page offset numbers
from xl_btree_delete, in which case you just display two arrays of
page offset numbers, in the same standard way. You may or may not want
to also show each individual xl_btree_update entry -- doing so would
be kinda like showing the details of individual freeze plans, except
that you'd probably display something very similar to the page offset
number display here too (even though these aren't page offset numbers,
they're 0-based offsets into the posting list's item pointer data
array).

BTW, there is also a tendency for non-btree index AM WAL records to be
fairly similar or even near-identical to the B-Tree WAL records. While
Hash indexes are very different to B-Tree indexes at a high level, it
is nevertheless the case that xl_hash_vacuum_one_page is directly
based on xl_btree_delete/xl_btree_vacuum, and that xl_hash_insert is
directly based on xl_btree_insert. There are some other WAL record
types that are completely different across hash and B-Tree, which is a
reflection of the fact that the index grows using a totally different
approach in each AM -- but that doesn't seem like something that
throws up any roadblocks for you (these can all be displayed as simple
structs anyway).

Speaking with my B-Tree hat on, I'd just be happy to be able to see
both of the page offset number arrays (the deleted + updated offset
number arrays from xl_btree_delete/xl_btree_vacuum), without also
being able to\ see output for each individual xl_btree_update
item-pointer-array-offset arrays -- just seeing that much is already a
huge improvement. That's why I'm a bit hesitant to use the term API
just yet, because an obligation to be consistent in whatever way seems
like it might block incremental progress.

> Perhaps there should also be example output of the offset arrays in
> pgwalinspect docs?

That would definitely make sense.

> I've changed the array format helper functions that Andres added to be a
> single function with an additional layer of indirection so that any
> record with an array can use it regardless of type and format of the
> individual elements. The signature is based somewhat off of qsort_r()
> and allows the user to pass a function with the the desired format of
> the elements.

That's handy.

> Personally, I like having the infomasks for the freeze plans. If we
> someday have a more structured input to rmgr_desc, we could then easily
> have them in their own column and use functions like
> heap_tuple_infomask_flags() on them.

I agree, in general, though long term the best approach is one that
has a configurable level of verbosity, with some kind of roughly
uniform definition of verbosity (kinda like DEBUG1 - DEBUG5, though
probably with only 2 or 3 distinct levels).

Obviously what you're doing here will lead to a significant increase
in the verbosity of the output for affected WAL records. I don't feel
too bad about that, though. It's really an existing problem, and one
that should be fixed either way. You kind of have to deal with this
already, by having a good psql pager, since record types such as
COMMIT_PREPARED, INVALIDATIONS, and RUNNING_XACTS are already very
verbose in roughly the same way. You only need to have one of these
record types output by a function like pg_get_wal_records_info() to
get absurdly wide output -- it hardly matters that most individual WAL
record types have terse output at that point.

> > > How hard would it be to invent a general mechanism to control the verbosity
> > > of what we'll show for each WAL record?
> >
> > Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
> >         void            (*rm_desc) (StringInfo buf, XLogReaderState *record);
> >
> > so we'd need to patch all of them. That might be worth doing at some point,
> > but I don't want to tackle it right now.
>
> In terms of a more structured format, it seems like it would make the
> most sense to pass a JSON or composite datatype structure to rm_desc
> instead of that StringInfo.
>
> I would also like to see functions like XLogRecGetBlockRefInfo() pass
> something more useful than a stringinfo buffer so that we could easily
> extract out the relfilenode in pgwalinspect.

That does seem particularly important. It's a pain to do this from
SQL. In general I'm okay with focussing on pg_walinspect over
pg_waldump, since it'll become more important over time. Obviously
pg_waldump needs to still work, but I think it's okay to care less
about pg_waldump usability.

> > BTW, while playing around with this patch today, I noticed that it
> > won't display the number of elements in each offset array directly.
> > Perhaps it's worth including that, too?
>
> I believe I have addressed this in the attached patch.

Thanks for taking care of that.

--
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Tue, Jan 31, 2023 at 1:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I would also like to see functions like XLogRecGetBlockRefInfo() pass
> > something more useful than a stringinfo buffer so that we could easily
> > extract out the relfilenode in pgwalinspect.
>
> That does seem particularly important. It's a pain to do this from
> SQL. In general I'm okay with focussing on pg_walinspect over
> pg_waldump, since it'll become more important over time. Obviously
> pg_waldump needs to still work, but I think it's okay to care less
> about pg_waldump usability.

I just realized why you mentioned XLogRecGetBlockRefInfo() -- it
probably shouldn't even be used by pg_walinspect at all (just by
pg_waldump). Using something like XLogRecGetBlockRefInfo() within
pg_walinspect misses out on the opportunity to output information in a
more descriptive tuple format, with real data types. It's not just the
relfilenode, either -- it's the block numbers themselves. And the fork
number.

In other words, I suspect that this is out of scope for this patch,
strictly speaking. We simply shouldn't be using
XLogRecGetBlockRefInfo() in pg_walinspect in the first place. Rather,
pg_walinspect should be calling some other function that ultimately
allows the user to work with (say) an array of int8 from SQL for the
block numbers. There is no great reason not to, AFAICT, since this
information is completely generic -- it's not like the rmgr-specific
output from GetRmgr(), where fine grained type information is just a
nice-to-have, with usability issues of its own (on account of the
details being record type specific).

I've been managing this problem within my own custom pg_walinspect
queries by using my own custom ICU collation. I use ICU's natural sort
order to order based on block_ref, or based on a substring()
expression that extracts something interesting from block_ref, such as
relfilenode. You can create a custom collation for this like so, per
the docs:

CREATE COLLATION IF NOT EXISTS numeric (provider = icu, locale =
'en-u-kn-true');

Obviously this hack of mine works, but hardly anybody else would be
willing to take the time to figure something like this out. Plus it's
error prone when it doesn't really have to be. And it suggests that
the block_ref field isn't record type generic -- that's sort of
misleading IMV.

-- 
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Tue, Jan 31, 2023 at 1:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Obviously what you're doing here will lead to a significant increase
> in the verbosity of the output for affected WAL records. I don't feel
> too bad about that, though. It's really an existing problem, and one
> that should be fixed either way. You kind of have to deal with this
> already, by having a good psql pager, since record types such as
> COMMIT_PREPARED, INVALIDATIONS, and RUNNING_XACTS are already very
> verbose in roughly the same way. You only need to have one of these
> record types output by a function like pg_get_wal_records_info() to
> get absurdly wide output -- it hardly matters that most individual WAL
> record types have terse output at that point.

Actually the really wide output comes from COMMIT records. After I run
the regression tests, and execute some of my own custom pg_walinspect
queries, I see that some individual COMMIT records have a
length(description) of over 10,000 bytes/characters. There is even one
particular COMMIT record whose length(description) is about 46,000
bytes/characters. So *ludicrously* verbose GetRmgr() strings are not
uncommon today. The worst case (or even particularly bad cases) won't
be made any worse by this patch, because there are obviously limits on
the width of the arrays that it outputs details descriptions of, that
don't apply to these COMMIT records.

-- 
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Robert Haas
Date:
On Tue, Jan 31, 2023 at 6:20 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Actually the really wide output comes from COMMIT records. After I run
> the regression tests, and execute some of my own custom pg_walinspect
> queries, I see that some individual COMMIT records have a
> length(description) of over 10,000 bytes/characters. There is even one
> particular COMMIT record whose length(description) is about 46,000
> bytes/characters. So *ludicrously* verbose GetRmgr() strings are not
> uncommon today. The worst case (or even particularly bad cases) won't
> be made any worse by this patch, because there are obviously limits on
> the width of the arrays that it outputs details descriptions of, that
> don't apply to these COMMIT records.

If we're dumping a lot of details out of each WAL record, we might
want to switch to a multi-line format of some kind. No one enjoys a
460-character wide line, let alone 46000.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Wed, Feb 1, 2023 at 5:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> If we're dumping a lot of details out of each WAL record, we might
> want to switch to a multi-line format of some kind. No one enjoys a
> 460-character wide line, let alone 46000.

I generally prefer it when I can use psql without using expanded table
format mode, and without having to use a pager. Of course that isn't
always possible, but it often is. I just don't think that that's going
to become feasible with pg_walinspect queries any time soon, since it
really requires a comprehensive strategy to deal with the issue of
verbosity.

It seems practically mandatory to use a pager when running
pg_walinspect queries in psql right now -- pspg is good for this. I
really can't use expanded table mode here, since it obscures the
relationship between adjoining records. I'm usually looking through
rows/records in LSN order, and want to be able to easily compare the
LSNs (or other details) of groups of adjoining records.

--
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Robert Haas
Date:
On Wed, Feb 1, 2023 at 12:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Feb 1, 2023 at 5:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > If we're dumping a lot of details out of each WAL record, we might
> > want to switch to a multi-line format of some kind. No one enjoys a
> > 460-character wide line, let alone 46000.
>
> I generally prefer it when I can use psql without using expanded table
> format mode, and without having to use a pager. Of course that isn't
> always possible, but it often is. I just don't think that that's going
> to become feasible with pg_walinspect queries any time soon, since it
> really requires a comprehensive strategy to deal with the issue of
> verbosity.

Well, if we're thinking of making the output a lot more verbose, it
seems like we should at least do a bit of brainstorming about what
that strategy could be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Show various offset arrays for heap WAL records

From
Melanie Plageman
Date:
On Tue, Jan 31, 2023 at 5:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Jan 31, 2023 at 1:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > I would also like to see functions like XLogRecGetBlockRefInfo() pass
> > > something more useful than a stringinfo buffer so that we could easily
> > > extract out the relfilenode in pgwalinspect.
> >
> > That does seem particularly important. It's a pain to do this from
> > SQL. In general I'm okay with focussing on pg_walinspect over
> > pg_waldump, since it'll become more important over time. Obviously
> > pg_waldump needs to still work, but I think it's okay to care less
> > about pg_waldump usability.
>
> I just realized why you mentioned XLogRecGetBlockRefInfo() -- it
> probably shouldn't even be used by pg_walinspect at all (just by
> pg_waldump). Using something like XLogRecGetBlockRefInfo() within
> pg_walinspect misses out on the opportunity to output information in a
> more descriptive tuple format, with real data types. It's not just the
> relfilenode, either -- it's the block numbers themselves. And the fork
> number.
>
> In other words, I suspect that this is out of scope for this patch,
> strictly speaking. We simply shouldn't be using
> XLogRecGetBlockRefInfo() in pg_walinspect in the first place. Rather,
> pg_walinspect should be calling some other function that ultimately
> allows the user to work with (say) an array of int8 from SQL for the
> block numbers. There is no great reason not to, AFAICT, since this
> information is completely generic -- it's not like the rmgr-specific
> output from GetRmgr(), where fine grained type information is just a
> nice-to-have, with usability issues of its own (on account of the
> details being record type specific).

Something like the attached?

start_lsn        | 0/19823390
end_lsn          | 0/19824360
prev_lsn         | 0/19821358
xid              | 1355
resource_manager | Heap
record_type      | UPDATE
record_length    | 4021
main_data_length | 14
fpi_length       | 3948
description      | off 11 xmax 1355 flags 0x00 ; new off 109 xmax 0
block_ref        |
[0:1][0:8]={{0,1663,5,17033,0,442,460,4244,0},{1,1663,5,17033,0,0,0,0,0}}

It is a bit annoying not to have information about what each block_ref
item in the array represents (previously in the string), so maybe the
format in the attached shouldn't be a replacement for what is already
displayed by pg_get_wal_records_info() and friends.

It could instead be a new function which returns information in this
format -- perhaps tuples with separate columns for each labeled block
ref field denormalized to repeat the wal record info for every block?

The one piece of information I didn't include in the new block_ref
columns is the compression type (since it is a string). Since I used the
forknum value instead of the forknum name, maybe it is defensible to
also provide a documented int value for the compression type and make
that an int too?

- Melanie

Attachment

Re: Show various offset arrays for heap WAL records

From
Peter Eisentraut
Date:
On 01.03.23 17:11, Melanie Plageman wrote:
> diff --git a/contrib/pg_walinspect/pg_walinspect--1.0.sql b/contrib/pg_walinspect/pg_walinspect--1.0.sql
> index 08b3dd5556..eb8ff82dd8 100644
> --- a/contrib/pg_walinspect/pg_walinspect--1.0.sql
> +++ b/contrib/pg_walinspect/pg_walinspect--1.0.sql
> @@ -17,7 +17,7 @@ CREATE FUNCTION pg_get_wal_record_info(IN in_lsn pg_lsn,
>       OUT main_data_length int4,
>       OUT fpi_length int4,
>       OUT description text,
> -    OUT block_ref text
> +    OUT block_ref int4[][]
>   )
>   AS 'MODULE_PATHNAME', 'pg_get_wal_record_info'
>   LANGUAGE C STRICT PARALLEL SAFE;

A change like this would require a new extension version and an upgrade 
script.

I suppose it's ok to postpone that work while the actual meat of the 
patch is still being worked out, but I figured I'd mention it in case it 
wasn't considered yet.




Re: Show various offset arrays for heap WAL records

From
Melanie Plageman
Date:
Thanks for the various perspectives and feedback.

Attached v2 has additional info for xl_btree_vacuum and xl_btree_delete.

I've quoted various emails by various senders below and replied.

On Fri, Jan 27, 2023 at 3:02 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jan 27, 2023 at 12:24 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I believe I have addressed this in the attached patch.
>
> I'm not sure what's best in terms of formatting details but I
> definitely like the idea of making pg_waldump show more details. I'd
> even like to have a way to extract the tuple data, when it's
> operations on tuples and we have those tuples in the payload. That'd
> be a lot more verbose than what you are doing here, though, and I'm
> not saying you should go do it right now or anything like that.

If I'm not mistaken, this would be quite difficult without changing
rm_desc to return some kind of self-describing data type.

On Tue, Jan 31, 2023 at 4:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jan 27, 2023 at 9:24 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I started documenting it in the rmgr_utils.h header file in a comment,
> > however it may be worth a README?
> >
> > I haven't polished this description of the format (or added examples,
> > etc) or used it in the btree-related functions because I assume the
> > format and helper function API will need more discussion.
>
> I think that standardization is good, but ISTM that we need clarity on
> what the scope is -- what is *not* being standardized? It may or may
> not be useful to call the end result an API. Or it may not make sense
> to do so in the first committed version, even though we may ultimately
> end up as something that deserves to be called an API. The obligation
> to not break tools that are scraping the output in whatever way seems
> kind of onerous right now -- just not having any gratuitous
> inconsistencies (e.g., fixing totally inconsistent punctuation, making
> the names for fields across WAL records consistent when they serve
> exactly the same purpose) would be a big improvement.

So, we can scrap any README or big comment, but are there other changes
to the code or structure you think would avoid it being seen as an
API?

> As I mentioned in passing already, I actually don't think that the
> B-Tree WAL records are all that special, as far as this stuff goes.
> For example, the DELETE Btree record type is very similar to heapam's
> PRUNE record type, and practically identical to Btree's VACUUM record
> type. All of these record types use the same basic conventions, like a
> snapshotConflictHorizon field for recovery conflicts (which is
> generated in a very similar way during original execution, and
> processed in precisely the same way during REDO), and arrays of page
> offset numbers sorted in ascending order.
>
> There are some remaining details where things from an index AM WAL
> record aren't directly analogous (or pretty much identical) to some
> other heapam WAL records, such as the way that the DELETE Btree record
> type deals with deleting a subset of TIDs from a posting list index
> tuple (generated by B-Tree deduplication). But even these exceptions
> don't require all that much discussion. You could either choose to
> only display the array of deleted index tuple page offset numbers, as
> well as the similar array of "updated" index tuple page offset numbers
> from xl_btree_delete, in which case you just display two arrays of
> page offset numbers, in the same standard way. You may or may not want
> to also show each individual xl_btree_update entry -- doing so would
> be kinda like showing the details of individual freeze plans, except
> that you'd probably display something very similar to the page offset
> number display here too (even though these aren't page offset numbers,
> they're 0-based offsets into the posting list's item pointer data
> array).

I have added detail to xl_btree_delete and xl_btree_vacuum. I have added
the updated/deleted target offset numbers and the updated tuples
metadata.

I wondered if there was any reason to do xl_btree_dedup deduplication
intervals.

> BTW, there is also a tendency for non-btree index AM WAL records to be
> fairly similar or even near-identical to the B-Tree WAL records. While
> Hash indexes are very different to B-Tree indexes at a high level, it
> is nevertheless the case that xl_hash_vacuum_one_page is directly
> based on xl_btree_delete/xl_btree_vacuum, and that xl_hash_insert is
> directly based on xl_btree_insert. There are some other WAL record
> types that are completely different across hash and B-Tree, which is a
> reflection of the fact that the index grows using a totally different
> approach in each AM -- but that doesn't seem like something that
> throws up any roadblocks for you (these can all be displayed as simple
> structs anyway).

I chose not to take on any other index types until I saw if this was viable.

> > Perhaps there should also be example output of the offset arrays in
> > pgwalinspect docs?
>
> That would definitely make sense.

I wanted to include at least a minimal example for those following along
with this thread that would cause creation of one of the record types
which I have enhanced, but I had a little trouble making a reliable
example.

Below is my strategy for getting a Heap PRUNE record with redirects, but
it occasionally doesn't end up working and I wasn't sure why (I can do
more investigation if we think that having some kind of test for this is
useful).

CREATE EXTENSION pg_walinspect;
DROP TABLE IF EXISTS lsns;
CREATE TABLE lsns(name TEXT, lsn pg_lsn);

DROP TABLE IF EXISTS baz;
create table baz(a int, b int) with (autovacuum_enabled=false);
insert into baz select i, i % 3 from generate_series(1,100)i;

update baz set b = 0 where b = 1;
update baz set b = 7 where b = 0;
INSERT INTO lsns VALUES('start_lsn', (SELECT pg_current_wal_lsn()));
vacuum baz;
select count(*) from baz;
INSERT INTO lsns VALUES('end_lsn', (SELECT pg_current_wal_lsn()));
SELECT * FROM pg_get_wal_records_info((select lsn from lsns where name
= 'start_lsn'),
        (select lsn from lsns where name = 'end_lsn'))
    WHERE record_type LIKE 'PRUNE%' AND resource_manager = 'Heap2' LIMIT 1;

> > Personally, I like having the infomasks for the freeze plans. If we
> > someday have a more structured input to rmgr_desc, we could then easily
> > have them in their own column and use functions like
> > heap_tuple_infomask_flags() on them.
>
> I agree, in general, though long term the best approach is one that
> has a configurable level of verbosity, with some kind of roughly
> uniform definition of verbosity (kinda like DEBUG1 - DEBUG5, though
> probably with only 2 or 3 distinct levels).

Given this comment and Robert's concern quoted below, I am wondering if
the consensus is that a lack of verbosity control is a dealbreaker for
adding offsets or not.

On Wed, Feb 1, 2023 at 12:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 1, 2023 at 12:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > On Wed, Feb 1, 2023 at 5:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > If we're dumping a lot of details out of each WAL record, we might
> > > want to switch to a multi-line format of some kind. No one enjoys a
> > > 460-character wide line, let alone 46000.
> >
> > I generally prefer it when I can use psql without using expanded table
> > format mode, and without having to use a pager. Of course that isn't
> > always possible, but it often is. I just don't think that that's going
> > to become feasible with pg_walinspect queries any time soon, since it
> > really requires a comprehensive strategy to deal with the issue of
> > verbosity.
>
> Well, if we're thinking of making the output a lot more verbose, it
> seems like we should at least do a bit of brainstorming about what
> that strategy could be.

In terms of strategies for controlling output verbosity, it seems
difficult to do without changing the rmgrdesc function signature. Unless
you are thinking of trying to reparse the rmgrdesc string output on the
pg_walinspect/pg_waldump side?

I think if there was a more structured output of rmgrdesc, then this
would also solve the verbosity level problem. Consumers could decide on
their verbosity level -- in various pg_walinspect function outputs, that
would probably just be column selection. For pg_waldump, I imagine that
some kind of parameter or flag would work.

Unless you are suggesting that we add a verbosity parameter to the
rmgrdesc function API now?

On Thu, Mar 2, 2023 at 3:17 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> On 01.03.23 17:11, Melanie Plageman wrote:
> > diff --git a/contrib/pg_walinspect/pg_walinspect--1.0.sql b/contrib/pg_walinspect/pg_walinspect--1.0.sql
> > index 08b3dd5556..eb8ff82dd8 100644
> > --- a/contrib/pg_walinspect/pg_walinspect--1.0.sql
> > +++ b/contrib/pg_walinspect/pg_walinspect--1.0.sql
> > @@ -17,7 +17,7 @@ CREATE FUNCTION pg_get_wal_record_info(IN in_lsn pg_lsn,
> >       OUT main_data_length int4,
> >       OUT fpi_length int4,
> >       OUT description text,
> > -    OUT block_ref text
> > +    OUT block_ref int4[][]
> >   )
> >   AS 'MODULE_PATHNAME', 'pg_get_wal_record_info'
> >   LANGUAGE C STRICT PARALLEL SAFE;
>
> A change like this would require a new extension version and an upgrade
> script.
>
> I suppose it's ok to postpone that work while the actual meat of the
> patch is still being worked out, but I figured I'd mention it in case it
> wasn't considered yet.

Thanks for letting me know. This pg_walinspect patch ended up being
discussed over in [1].

- Melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_bORebdZmcV8V4cZBzU8M_C6tDDdbiPhCZ6i-iuSXW9TA%40mail.gmail.com

Attachment

Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Mon, Mar 13, 2023 at 4:01 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Fri, Jan 27, 2023 at 3:02 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I'm not sure what's best in terms of formatting details but I
> > definitely like the idea of making pg_waldump show more details.

> If I'm not mistaken, this would be quite difficult without changing
> rm_desc to return some kind of self-describing data type.

I'd say that it would depend on how far you went with it. Basic
information about the tuple wouldn't require any of that. I suggest
leaving this part out for now, though.

> So, we can scrap any README or big comment, but are there other changes
> to the code or structure you think would avoid it being seen as an
> API?

I think that it would be good to try to build something that looks
like an API, while making zero promises about its stability -- at
least until further notice. Kind of like how there are no guarantees
about the stability of internal interfaces within the Linux kernel.

There is no reason to not take a firm position on some things now.
Things like punctuation, and symbol names for generic cross-record
symbols like snapshotConflictHorizon. Many of the differences that
exist now are wholly gratuitous -- just accidents. It would make sense
to standardize-away these clearly unnecessary variations. And to
document the new standard. I'd be surprised if anybody disagreed with
me on this point.

> I have added detail to xl_btree_delete and xl_btree_vacuum. I have added
> the updated/deleted target offset numbers and the updated tuples
> metadata.
>
> I wondered if there was any reason to do xl_btree_dedup deduplication
> intervals.

No reason. It wouldn't be hard to cover xl_btree_dedup deduplication
intervals -- each element is a page offset number, and a corresponding
count of index tuples to merge together in the REDO routine. That's
slightly different to anything else, but not in a way that seems like
it requires very much additional effort.

> I wanted to include at least a minimal example for those following along
> with this thread that would cause creation of one of the record types
> which I have enhanced, but I had a little trouble making a reliable
> example.
>
> Below is my strategy for getting a Heap PRUNE record with redirects, but
> it occasionally doesn't end up working and I wasn't sure why (I can do
> more investigation if we think that having some kind of test for this is
> useful).

I'm not sure, but offhand I think that there could be a number of
annoying little implementation details that make it hard to come up
with a perfectly reliable test case. Perhaps try it while using VACUUM
VERBOSE, with the proviso that we should only expect the revised
example workflow to show a redirect record as intended when the
VERBOSE output confirms that VACUUM actually ran as expected, in
whatever way. For example, VACUUM can't have failed to acquire a
cleanup lock on a heap page due to the current phase of the moon.
VACUUM shouldn't have its "removable cutoff" held back by
who-knows-what when the test case is run, either.

Some of the tests for VACUUM use a temp table, since they conveniently
cannot have their "removable cutoff" held back -- not since commit
a7212be8. Of course, that strategy won't help you here. Getting VACUUM
to behave very predictably for testing purposes has proven tricky at
times.

> > I agree, in general, though long term the best approach is one that
> > has a configurable level of verbosity, with some kind of roughly
> > uniform definition of verbosity (kinda like DEBUG1 - DEBUG5, though
> > probably with only 2 or 3 distinct levels).
>
> Given this comment and Robert's concern quoted below, I am wondering if
> the consensus is that a lack of verbosity control is a dealbreaker for
> adding offsets or not.

There are several different things that seem important to me
personally. These are in tension with each other, to a degree. These
are:

1. Like Andres, I'd really like to have some way of inspecting things
like heapam PRUNE, VACUUM, and FREEZE_PAGE records in significant
detail. These record types happen to be very important in general, and
the ability to see detailed information about the WAL record would
definitely help with some debugging scenarios. I've really missed
stuff like this while debugging serious issues under time pressure.

2. To a lesser extent I would like to see similar detailed information
for nbtree's DELETE, VACUUM, and possibly DEDUPLICATE record types.
They might also come in handy for debugging, in about the same way.

3. More manageable verbosity.

I think that it would be okay to put off coming up with a solution to
3, for now. 1 and 2 seem more important than 3.

> I think if there was a more structured output of rmgrdesc, then this
> would also solve the verbosity level problem. Consumers could decide on
> their verbosity level -- in various pg_walinspect function outputs, that
> would probably just be column selection. For pg_waldump, I imagine that
> some kind of parameter or flag would work.
>
> Unless you are suggesting that we add a verbosity parameter to the
> rmgrdesc function API now?

The verbosity problem will get somewhat worse if we do just my items 1
and 2, so it would be nice if we at least had a strategy in mind that
delivers on item 3/verbosity -- though the implementation can appear
in a later release. Maybe something simple would work, like promising
to output (say) 30 characters or less in terse mode, and making no
such promise otherwise. Terse mode wouldn't just truncate the output
of verbose mode -- it would never display information that could in
principle exceed the 30 character allowance, even with records that
happen to fall under the limit.

I can't feel too bad about putting this part off. A pager like pspg is
already table stakes when using pg_walinspect in any sort of serious
way. As I said upthread, absurdly wide output is already reasonably
common in most cases.

--
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Mon, Mar 13, 2023 at 6:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> There are several different things that seem important to me
> personally. These are in tension with each other, to a degree. These
> are:
>
> 1. Like Andres, I'd really like to have some way of inspecting things
> like heapam PRUNE, VACUUM, and FREEZE_PAGE records in significant
> detail. These record types happen to be very important in general, and
> the ability to see detailed information about the WAL record would
> definitely help with some debugging scenarios. I've really missed
> stuff like this while debugging serious issues under time pressure.

One problem that I often run into when performing analysis of VACUUM
using pg_walinspect is the issue of *who* pruned which heap page, for
any given PRUNE record. Was it VACUUM/autovacuum, or was it
opportunistic pruning? There is no way of knowing for sure right now.
You *cannot* rely on an xid of 0 as an indicator of a given PRUNE
record coming from VACUUM; it could just have been an opportunistic
prune operation that happened to take place when a SELECT query ran,
before any XID was ever allocated.

I think that we should do something like the attached, to completely
avoid this ambiguity. This patch adds a new XLOG_HEAP2 bit that's
similar to XLOG_HEAP_INIT_PAGE -- XLOG_HEAP2_BYVACUUM. This allows all
XLOG_HEAP2 record types to indicate that they took place during
VACUUM, by XOR'ing the flag with the record type/info when
XLogInsert() is called. For now this is only used by PRUNE records.
Tools like pg_walinspect will report a separate "Heap2/PRUNE+BYVACUUM"
record_type, as well as the unadorned Heap2/PRUNE record_type, which
we'll now know must have been opportunistic pruning.

The approach of using a bit in the style of the heapam init bit makes
sense to me, because the bit is available, and works in a way that is
minimally invasive. Also, one can imagine needing to resolve a similar
ambiguity in the future, when (say) opportunistic freezing is added.

I think that it makes sense to treat this within the scope of
Melanie's ongoing work to improve the instrumentation of these records
-- meaning that it's in scope for Postgres 16. Admittedly this is a
slightly creative interpretation, so if others disagree then I won't
argue. This is quite a small patch, though, which makes debugging
significantly easier. I think that there could be a great deal of
utility in being able to easily "pair up" corresponding
"Heap2/PRUNE+BYVACUUM" and "Heap2/VACUUM" records in debugging
scenarios. I can imagine linking these to "Heap2/FREEZE_PAGE" and
"Heap2/VISIBLE" records, too, since they're all closely related record
types.

--
Peter Geoghegan

Attachment

Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Tue, Mar 21, 2023 at 3:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
> One problem that I often run into when performing analysis of VACUUM
> using pg_walinspect is the issue of *who* pruned which heap page, for
> any given PRUNE record. Was it VACUUM/autovacuum, or was it
> opportunistic pruning? There is no way of knowing for sure right now.
> You *cannot* rely on an xid of 0 as an indicator of a given PRUNE
> record coming from VACUUM; it could just have been an opportunistic
> prune operation that happened to take place when a SELECT query ran,
> before any XID was ever allocated.

In case it's unclear how much of a problem this can be, here's an example:

The misc.sql regression test does a bulk update of the table "onek". A
little later, one of the queries that appears under the section "copy"
from the same file SELECTs from "onek". This produces a succession of
opportunistic prune records that look exactly like what you'd expect from
a VACUUM when viewed through pg_walinspect (without this patch). Each
PRUNE record has XID 0. The records appear in ascending heap block
number order, since there is a sequential scan involved (we go through
heapgetpage() to get to heap_page_prune_opt(), where the query prunes
opportunistically).

Another slightly surprising fact revealed by the patch is the ratio of
opportunistic prunes ("Heap2/PRUNE") to prunes run during VACUUM
("Heap2/PRUNE+BYVACUUM") with the regression tests:

│ resource_manager/record_type │ Heap2/PRUNE                 │
│ count                        │ 4,521                       │
│ count_perc                   │ 0.220                       │
│ rec_size                     │ 412,442                     │
│ avg_rec_size                 │ 91                          │
│ rec_size_perc                │ 0.194                       │
│ fpi_size                     │ 632,828                     │
│ fpi_size_perc                │ 1.379                       │
│ combined_size                │ 1,045,270                   │
│ combined_size_perc           │ 0.404                       │
├─[ RECORD 61 ]────────────────┼─────────────────────────────┤
│ resource_manager/record_type │ Heap2/PRUNE+BYVACUUM        │
│ count                        │ 2,784                       │
│ count_perc                   │ 0.135                       │
│ rec_size                     │ 467,057                     │
│ avg_rec_size                 │ 167                         │
│ rec_size_perc                │ 0.219                       │
│ fpi_size                     │ 546,344                     │
│ fpi_size_perc                │ 1.190                       │
│ combined_size                │ 1,013,401                   │
│ combined_size_perc           │ 0.391                       │
├─[ RECORD 62 ]────────────────┼─────────────────────────────┤
│ resource_manager/record_type │ Heap2/VACUUM                │
│ count                        │ 3,463                       │
│ count_perc                   │ 0.168                       │
│ rec_size                     │ 610,038                     │
│ avg_rec_size                 │ 176                         │
│ rec_size_perc                │ 0.286                       │
│ fpi_size                     │ 893,964                     │
│ fpi_size_perc                │ 1.948                       │
│ combined_size                │ 1,504,002                   │
│ combined_size_perc           │ 0.581                       │
├─[ RECORD 63 ]────────────────┼─────────────────────────────┤
│ resource_manager/record_type │ Heap2/VISIBLE               │
│ count                        │ 7,293                       │
│ count_perc                   │ 0.354                       │
│ rec_size                     │ 431,382                     │
│ avg_rec_size                 │ 59                          │
│ rec_size_perc                │ 0.202                       │
│ fpi_size                     │ 1,794,048                   │
│ fpi_size_perc                │ 3.909                       │
│ combined_size                │ 2,225,430                   │
│ combined_size_perc           │ 0.859                       │


--
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Melanie Plageman
Date:
On Mon, Mar 13, 2023 at 9:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Mar 13, 2023 at 4:01 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> > I have added detail to xl_btree_delete and xl_btree_vacuum. I have added
> > the updated/deleted target offset numbers and the updated tuples
> > metadata.
> >
> > I wondered if there was any reason to do xl_btree_dedup deduplication
> > intervals.
>
> No reason. It wouldn't be hard to cover xl_btree_dedup deduplication
> intervals -- each element is a page offset number, and a corresponding
> count of index tuples to merge together in the REDO routine. That's
> slightly different to anything else, but not in a way that seems like
> it requires very much additional effort.

I went to add dedup records and noticed that since the actual
BTDedupInterval struct is what is put in the xlog, I would need access
to that type from nbtdesc.c, however, including nbtree.h doesn't seem to
work because it includes files that cannot be included in frontend code.

I, of course, could make some local struct in nbtdesc.c which has an
OffsetNumber and a uint16, since the BTDedupInterval is pretty
straightforward, but that seems a bit annoying.
I'm probably missing something obvious, but is there a better way to do
this?

On another note, I've thought about how to include some example output
in docs, and, for example we could modify the example output in the
pgwalinspect docs which includes a PRUNE record already for
pg_get_wal_record_info() docs. We'd probably just want to keep it short.

- Melanie



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Mon, Mar 27, 2023 at 2:29 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I went to add dedup records and noticed that since the actual
> BTDedupInterval struct is what is put in the xlog, I would need access
> to that type from nbtdesc.c, however, including nbtree.h doesn't seem to
> work because it includes files that cannot be included in frontend code.

I suppose that the BTDedupInterval struct could have just as easily
gone in nbtxlog.h, next to xl_btree_dedup. There might have been a
moment where I thought about doing it that way, but I guess I found it
slightly preferable to use that symbol name (BTDedupInterval) rather
than (say) xl_btree_dedup_interval in places like the nearby
BTDedupStateData struct.

Actually, I suppose that it's hard to make that alternative work, at
least without
including nbtxlog.h in nbtree.h. Which sounds wrong.

> I, of course, could make some local struct in nbtdesc.c which has an
> OffsetNumber and a uint16, since the BTDedupInterval is pretty
> straightforward, but that seems a bit annoying.
> I'm probably missing something obvious, but is there a better way to do
> this?

It was probably just one of those cases where I settled on the
arrangement that looked least odd overall. Not a particularly
principled approach. But the approach that I'm going to take once more
here.  ;-)

All of the available alternatives are annoying in roughly the same
way, though perhaps to varying degrees. All except one: I'm okay with
just not adding coverage for deduplication records, for the time being
-- just seeing the number of intervals alone is relatively informative
with deduplication records, unlike (say) nbtree delete records. I'm
also okay with having coverage for dedup records if you feel it's
worth having. Your call.

If we're going to have coverage for deduplication records then it
seems to me that we have to have a struct in nbtxlog.h for your code
to work off of. It also seems likely that we'll want to use that same
struct within nbtxlog.c. What's less clear is what that means for the
BTDedupInterval struct. I don't think that we should include nbtxlog.h
in nbtree.h, nor should we do the converse.

I guess maybe two identical structs would be okay. BTDedupInterval,
and xl_btree_dedup_interval, with the former still used in nbtdedup.c,
and the latter used through a pointer at the point that nbtxlog.c
reads a dedup record. Then maybe at a sizeof() static assert beside
the existing btree_xlog_dedup() assertions that check that the dedup
state interval array matches the array taken from the WAL record.
That's still a bit weird, but I find it preferable to any alternative
that I can think of.

> On another note, I've thought about how to include some example output
> in docs, and, for example we could modify the example output in the
> pgwalinspect docs which includes a PRUNE record already for
> pg_get_wal_record_info() docs. We'd probably just want to keep it short.

Yeah. Perhaps a PRUNE record for one of the system catalogs whose
relfilenode is relatively recognizable. Say pg_class. It probably
doesn't matter that much, but there is perhaps some small value in
picking an example that is relatively easy to recreate later on (or to
approximately recreate). I'm certainly not insisting on that, though.

--
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Melanie Plageman
Date:
Attached v3 is cleaned up and includes a pg_walinspect docs update as
well as some edited comments in rmgr_utils.c

On Mon, Mar 27, 2023 at 6:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Mar 27, 2023 at 2:29 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I went to add dedup records and noticed that since the actual
> > BTDedupInterval struct is what is put in the xlog, I would need access
> > to that type from nbtdesc.c, however, including nbtree.h doesn't seem to
> > work because it includes files that cannot be included in frontend code.
>
> I suppose that the BTDedupInterval struct could have just as easily
> gone in nbtxlog.h, next to xl_btree_dedup. There might have been a
> moment where I thought about doing it that way, but I guess I found it
> slightly preferable to use that symbol name (BTDedupInterval) rather
> than (say) xl_btree_dedup_interval in places like the nearby
> BTDedupStateData struct.
>
> Actually, I suppose that it's hard to make that alternative work, at
> least without
> including nbtxlog.h in nbtree.h. Which sounds wrong.
>
> > I, of course, could make some local struct in nbtdesc.c which has an
> > OffsetNumber and a uint16, since the BTDedupInterval is pretty
> > straightforward, but that seems a bit annoying.
> > I'm probably missing something obvious, but is there a better way to do
> > this?
>
> It was probably just one of those cases where I settled on the
> arrangement that looked least odd overall. Not a particularly
> principled approach. But the approach that I'm going to take once more
> here.  ;-)
>
> All of the available alternatives are annoying in roughly the same
> way, though perhaps to varying degrees. All except one: I'm okay with
> just not adding coverage for deduplication records, for the time being
> -- just seeing the number of intervals alone is relatively informative
> with deduplication records, unlike (say) nbtree delete records. I'm
> also okay with having coverage for dedup records if you feel it's
> worth having. Your call.
>
> If we're going to have coverage for deduplication records then it
> seems to me that we have to have a struct in nbtxlog.h for your code
> to work off of. It also seems likely that we'll want to use that same
> struct within nbtxlog.c. What's less clear is what that means for the
> BTDedupInterval struct. I don't think that we should include nbtxlog.h
> in nbtree.h, nor should we do the converse.
>
> I guess maybe two identical structs would be okay. BTDedupInterval,
> and xl_btree_dedup_interval, with the former still used in nbtdedup.c,
> and the latter used through a pointer at the point that nbtxlog.c
> reads a dedup record. Then maybe at a sizeof() static assert beside
> the existing btree_xlog_dedup() assertions that check that the dedup
> state interval array matches the array taken from the WAL record.
> That's still a bit weird, but I find it preferable to any alternative
> that I can think of.

I've omitted enhancements for the dedup record type for now.

> > On another note, I've thought about how to include some example output
> > in docs, and, for example we could modify the example output in the
> > pgwalinspect docs which includes a PRUNE record already for
> > pg_get_wal_record_info() docs. We'd probably just want to keep it short.
>
> Yeah. Perhaps a PRUNE record for one of the system catalogs whose
> relfilenode is relatively recognizable. Say pg_class. It probably
> doesn't matter that much, but there is perhaps some small value in
> picking an example that is relatively easy to recreate later on (or to
> approximately recreate). I'm certainly not insisting on that, though.

I've added such an example to pg_walinspect docs.

On Tue, Mar 21, 2023 at 6:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Mar 13, 2023 at 6:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > There are several different things that seem important to me
> > personally. These are in tension with each other, to a degree. These
> > are:
> >
> > 1. Like Andres, I'd really like to have some way of inspecting things
> > like heapam PRUNE, VACUUM, and FREEZE_PAGE records in significant
> > detail. These record types happen to be very important in general, and
> > the ability to see detailed information about the WAL record would
> > definitely help with some debugging scenarios. I've really missed
> > stuff like this while debugging serious issues under time pressure.
>
> One problem that I often run into when performing analysis of VACUUM
> using pg_walinspect is the issue of *who* pruned which heap page, for
> any given PRUNE record. Was it VACUUM/autovacuum, or was it
> opportunistic pruning? There is no way of knowing for sure right now.
> You *cannot* rely on an xid of 0 as an indicator of a given PRUNE
> record coming from VACUUM; it could just have been an opportunistic
> prune operation that happened to take place when a SELECT query ran,
> before any XID was ever allocated.
>
> I think that we should do something like the attached, to completely
> avoid this ambiguity. This patch adds a new XLOG_HEAP2 bit that's
> similar to XLOG_HEAP_INIT_PAGE -- XLOG_HEAP2_BYVACUUM. This allows all
> XLOG_HEAP2 record types to indicate that they took place during
> VACUUM, by XOR'ing the flag with the record type/info when
> XLogInsert() is called. For now this is only used by PRUNE records.
> Tools like pg_walinspect will report a separate "Heap2/PRUNE+BYVACUUM"
> record_type, as well as the unadorned Heap2/PRUNE record_type, which
> we'll now know must have been opportunistic pruning.
>
> The approach of using a bit in the style of the heapam init bit makes
> sense to me, because the bit is available, and works in a way that is
> minimally invasive. Also, one can imagine needing to resolve a similar
> ambiguity in the future, when (say) opportunistic freezing is added.
>
> I think that it makes sense to treat this within the scope of
> Melanie's ongoing work to improve the instrumentation of these records
> -- meaning that it's in scope for Postgres 16. Admittedly this is a
> slightly creative interpretation, so if others disagree then I won't
> argue. This is quite a small patch, though, which makes debugging
> significantly easier. I think that there could be a great deal of
> utility in being able to easily "pair up" corresponding
> "Heap2/PRUNE+BYVACUUM" and "Heap2/VACUUM" records in debugging
> scenarios. I can imagine linking these to "Heap2/FREEZE_PAGE" and
> "Heap2/VISIBLE" records, too, since they're all closely related record
> types.

I really like this idea and would find it useful. I reviewed the patch
and tried it out and it worked for me and code looked fine as well.

I didn't include it in the attached patchset because I don't feel
confident enough in my own understanding of any potential implications
of splitting up these record types to definitively endorse it. But, if
someone else felt comfortable with it, I would like to see it in the
tree.

- Melanie

Attachment

Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Fri, Apr 7, 2023 at 1:33 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Attached v3 is cleaned up and includes a pg_walinspect docs update as
> well as some edited comments in rmgr_utils.c

Attached v4 has some small tweaks on your v3. Mostly just whitespace
tweaks. Two slightly notable tweaks:

* I changed the approach to globbing in the Makefile, rather than use
your original overwide formulation for the new rmgrdesc_utils.c file.

What do you think of this approach?

* Removed use of the restrict keyword.

While "restrict" is C99, I'm not completely sure that it's totally
supported by Postgres. I'm a bit surprised that you opted to use it in
this particular patch.

I meant to ask you about this earlier...why use restrict in this patch?

> I've added such an example to pg_walinspect docs.

There already was a PRUNE example, though -- for the
pg_get_wal_record_info function (singular, not to be confused with
pg_get_wal_records_info).

v4 makes the example a VACUUM record, which replaces the previous
pg_get_wal_record_info PRUNE example -- that needed to be updated
anyway. This approach has the advantage of not being too verbose,
which still showing some of this kind of detail.

This has the advantage of allowing pg_get_wal_records_info's example
to continue to be an example that lacks a block reference (and so has
a NULL block_ref). This is a useful contrast against the new
pg_get_wal_block_info function.

> I really like this idea and would find it useful. I reviewed the patch
> and tried it out and it worked for me and code looked fine as well.
>
> I didn't include it in the attached patchset because I don't feel
> confident enough in my own understanding of any potential implications
> of splitting up these record types to definitively endorse it. But, if
> someone else felt comfortable with it, I would like to see it in the
> tree.

I'm not going to move on it now for 16, given the lack of feedback about it.

--
Peter Geoghegan

Attachment

Re: Show various offset arrays for heap WAL records

From
Melanie Plageman
Date:
On Fri, Apr 7, 2023 at 5:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Apr 7, 2023 at 1:33 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > Attached v3 is cleaned up and includes a pg_walinspect docs update as
> > well as some edited comments in rmgr_utils.c
>
> Attached v4 has some small tweaks on your v3. Mostly just whitespace
> tweaks. Two slightly notable tweaks:
>
> * I changed the approach to globbing in the Makefile, rather than use
> your original overwide formulation for the new rmgrdesc_utils.c file.
>
> What do you think of this approach?

Seems fine.

> * Removed use of the restrict keyword.
>
> While "restrict" is C99, I'm not completely sure that it's totally
> supported by Postgres. I'm a bit surprised that you opted to use it in
> this particular patch.
>
> I meant to ask you about this earlier...why use restrict in this patch?


So, I think the signature I meant to have was:

void
array_desc(StringInfo buf, void *array, size_t elem_size, int count,
          void (*elem_desc) (StringInfo buf, const void *elem, void *data),
          void *data)

Basically I wanted to indicate that elem was not and should not be
modified and data can be modified but that they should not be the same
element or overlap at all.

> > I've added such an example to pg_walinspect docs.
>
> There already was a PRUNE example, though -- for the
> pg_get_wal_record_info function (singular, not to be confused with
> pg_get_wal_records_info).
>
> v4 makes the example a VACUUM record, which replaces the previous
> pg_get_wal_record_info PRUNE example -- that needed to be updated
> anyway. This approach has the advantage of not being too verbose,
> which still showing some of this kind of detail.
>
> This has the advantage of allowing pg_get_wal_records_info's example
> to continue to be an example that lacks a block reference (and so has
> a NULL block_ref). This is a useful contrast against the new
> pg_get_wal_block_info function.

LGTM

- Melanie



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Fri, Apr 7, 2023 at 4:01 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> LGTM

Pushed, thanks.

--
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Melanie Plageman
Date:
On Fri, Apr 7, 2023 at 7:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Apr 7, 2023 at 4:01 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > LGTM
>
> Pushed, thanks.

It's come to my attention that I forgot to include the btree patch earlier.

PFA

Attachment

Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Fri, Apr 7, 2023 at 4:21 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> It's come to my attention that I forgot to include the btree patch earlier.

Pushed that one too.

Also removed the use of the "restrict" keyword here.

Thanks
--
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Fri, Apr 7, 2023 at 4:46 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Pushed that one too.

I noticed that the nbtree VACUUM and DELETE record types have their
update/xl_btree_update arrays output incorrectly. We cannot use the
generic array_desc() approach with xl_btree_update elements, because
they're variable-width elements. The problem is that array_desc() only deals
with fixed-width elements.

I also changed some of the details around whitespace in arrays in the
fixup patch (though I didn't do the same with objects). It doesn't
seem useful to use so much whitespace for long arrays of integers
(really page offset numbers). And I brought a few nbtree desc routines
that still used ";" characters as punctuation in line with the new
convention.

Finally, the patch revises the guidelines written for rmgr desc
routine authors. I don't think that we need to describe how to handle
outputting whitespace in detail. It'll be quite natural for other
rmgrs to use existing facilities such as array_desc() themselves,
which makes whitespace type inconsistencies unlikely. I've tried to
make the limits of the guidelines clear. The main goal is to avoid
gratuitous inconsistencies, and to provide a standard way of doing
things that many different rmgrs are likely to want to do, again and
again. But individual rmgrs still have a certain amount of discretion,
which seems like a good thing to me (the alternative requires that we
fix at least a couple of things in nbtdesc.c and in heapdesc.c, which
doesn't seem useful to me).

--
Peter Geoghegan

Attachment

Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Sun, Apr 9, 2023 at 5:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I noticed that the nbtree VACUUM and DELETE record types have their
> update/xl_btree_update arrays output incorrectly. We cannot use the
> generic array_desc() approach with xl_btree_update elements, because
> they're variable-width elements. The problem is that array_desc() only deals
> with fixed-width elements.

I pushed this fix just now, though without the updates to the
guidelines (or only minimal updates).

A remaining problem with arrays appears in "infobits" fields for
record types such as LOCK. Here's an example of the problem:

off: 34, xid: 3, flags: 0x00, infobits: [, LOCK_ONLY, EXCL_LOCK ]

Clearly the punctuation from the array is malformed.

A second issue (related to the first) is the name of the key itself,
"infobits". While "infobits" actually seems fine in this particular
example, I don't think that we want to do the same for record types
such as HEAP_UPDATE, since such records require that the description
show information about flags whose underlying field in the WAL record
struct is actually called "old_infobits_set". I think that we should
be outputting "old_infobits: [ ... ] " in the description of
HEAP_UPDATE records, which isn't the case right now.

A third issue is present in the nearby handling of xl_heap_truncate
status flags. It's the same basic array punctuation issue again, so
arguably this is the same issue as the first one.

Attached patch fixes all of these issues, and overhauls the guidelines
in the way originally proposed by the nbtree fix patch (since I didn't
keep that part of the nbtree patch when I pushed it today).

Note that the patch makes many individual (say) HOT_UPDATE records
have descriptions that look like this:

... old_infobits: [], ...

This differs from HEAD, where the output is totally suppressed because
there are no flag bits to show. I think that this behavior is more
logical and consistent overall.

--
Peter Geoghegan

Attachment

Re: Show various offset arrays for heap WAL records

From
Melanie Plageman
Date:
On Sun, Apr 9, 2023 at 8:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Apr 7, 2023 at 4:46 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Pushed that one too.
>
> I noticed that the nbtree VACUUM and DELETE record types have their
> update/xl_btree_update arrays output incorrectly. We cannot use the
> generic array_desc() approach with xl_btree_update elements, because
> they're variable-width elements. The problem is that array_desc() only deals
> with fixed-width elements.

You are right. I'm sorry for the rather egregious oversight.

I took a look at the first patch even though you've pushed the bugfix
part. Any reason you didn't use array_desc() for the inner array (of
"ptids")? I find that following the pattern of using array_desc (when it
is correct, of course!) helps me to quickly identify: "okay, this is an
array of x" without having to stare at the loop too much.

I will say that the prefix of p in "ptid" makes it sound like pointer to
a tid, which I don't believe is what you meant.

> I also changed some of the details around whitespace in arrays in the
> fixup patch (though I didn't do the same with objects). It doesn't
> seem useful to use so much whitespace for long arrays of integers
> (really page offset numbers). And I brought a few nbtree desc routines
> that still used ";" characters as punctuation in line with the new
> convention.

Cool.

> Finally, the patch revises the guidelines written for rmgr desc
> routine authors. I don't think that we need to describe how to handle
> outputting whitespace in detail. It'll be quite natural for other
> rmgrs to use existing facilities such as array_desc() themselves,
> which makes whitespace type inconsistencies unlikely. I've tried to
> make the limits of the guidelines clear. The main goal is to avoid
> gratuitous inconsistencies, and to provide a standard way of doing
> things that many different rmgrs are likely to want to do, again and
> again. But individual rmgrs still have a certain amount of discretion,
> which seems like a good thing to me (the alternative requires that we
> fix at least a couple of things in nbtdesc.c and in heapdesc.c, which
> doesn't seem useful to me).

I like the new guidelines you proposed (in the patch).
They are well-written and clear.


On Mon, Apr 10, 2023 at 3:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sun, Apr 9, 2023 at 5:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I noticed that the nbtree VACUUM and DELETE record types have their
> > update/xl_btree_update arrays output incorrectly. We cannot use the
> > generic array_desc() approach with xl_btree_update elements, because
> > they're variable-width elements. The problem is that array_desc() only deals
> > with fixed-width elements.
>
> I pushed this fix just now, though without the updates to the
> guidelines (or only minimal updates).
>
> A remaining problem with arrays appears in "infobits" fields for
> record types such as LOCK. Here's an example of the problem:
>
> off: 34, xid: 3, flags: 0x00, infobits: [, LOCK_ONLY, EXCL_LOCK ]
>
> Clearly the punctuation from the array is malformed.

So, I did do this on purpose -- because I didn't want to have to do the
gymnastics to determine which flag was hit first (though it looks like I
mistakenly omitted the comma prepending IS_MULTI -- that was not
intentional).
I recognized that the output doesn't look nice, but I hadn't exactly
thought of it as malformed. Perhaps you are right.

I will say and I am still not a fan of the "if (first) else" logic in
your attached patch.

I've put my suggestion for how to do it instead inline with the code
diff below for clarity.

diff --git a/src/backend/access/rmgrdesc/heapdesc.c
b/src/backend/access/rmgrdesc/heapdesc.c
index 3bd083875..a64d14c2c 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -18,29 +18,75 @@
 #include "access/rmgrdesc_utils.h"

 static void
-out_infobits(StringInfo buf, uint8 infobits)
+infobits_desc(StringInfo buf, uint8 infobits, const char *keyname)
...
     if (infobits & XLHL_KEYS_UPDATED)
-        appendStringInfoString(buf, ", KEYS_UPDATED");
+    {
+        if (first)
+            appendStringInfoString(buf, "KEYS_UPDATED");
+        else
+            appendStringInfoString(buf, ", KEYS_UPDATED");
+        first = false;
+    }

How about we have the flags use a trailing comma and space and then
overwrite the last one with something this:

    if (infobits & XLHL_KEYS_UPDATED)
        appendStringInfoString(buf, "KEYS_UPDATED, ");
    buf->data[buf->len -= strlen(", ")] = '\0';


@@ -230,7 +271,9 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
             OffsetNumber *offsets;

I don't prefer this to what I had, which is also correct, right?

             plans = (xl_heap_freeze_plan *)
XLogRecGetBlockData(record, 0, NULL);
-            offsets = (OffsetNumber *) &plans[xlrec->nplans];
+            offsets = (OffsetNumber *) ((char *) plans +
+                                        (xlrec->nplans *
+                                        sizeof(xl_heap_freeze_plan)));
             appendStringInfoString(buf, ", plans:");
             array_desc(buf, plans, sizeof(xl_heap_freeze_plan), xlrec->nplans,
                       &plan_elem_desc, &offsets);

> A second issue (related to the first) is the name of the key itself,
> "infobits". While "infobits" actually seems fine in this particular
> example, I don't think that we want to do the same for record types
> such as HEAP_UPDATE, since such records require that the description
> show information about flags whose underlying field in the WAL record
> struct is actually called "old_infobits_set". I think that we should
> be outputting "old_infobits: [ ... ] " in the description of
> HEAP_UPDATE records, which isn't the case right now.

--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -18,29 +18,75 @@
 #include "access/rmgrdesc_utils.h"

 static void
-out_infobits(StringInfo buf, uint8 infobits)
+infobits_desc(StringInfo buf, uint8 infobits, const char *keyname)

I like the keyname parameter.

> Note that the patch makes many individual (say) HOT_UPDATE records
> have descriptions that look like this:
>
> ... old_infobits: [], ...
>
> This differs from HEAD, where the output is totally suppressed because
> there are no flag bits to show. I think that this behavior is more
> logical and consistent overall.

Yea, I think it is better to include things and show that they are empty
then omit them. I find it more clear.

- Melanie



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Mon, Apr 10, 2023 at 3:04 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I took a look at the first patch even though you've pushed the bugfix
> part. Any reason you didn't use array_desc() for the inner array (of
> "ptids")? I find that following the pattern of using array_desc (when it
> is correct, of course!) helps me to quickly identify: "okay, this is an
> array of x" without having to stare at the loop too much.

It was fairly arbitrary. I was thinking "we can't use array_desc for
this", which wasn't 100% true, but seemed close enough. It helped that
this allowed me to remove uint16_elem_desc(), which likely wouldn't
have been reused later on.

> I will say that the prefix of p in "ptid" makes it sound like pointer to
> a tid, which I don't believe is what you meant.

I was thinking of the symbol name "ptid" from
_bt_delitems_delete_check() (it even appears in code comments). I
intended "posting list TID". But "pointer to a TID" actually kinda
works too, since these are offsets into a posting list (a simple
ItemPointerData array) for those TIDs that we're in the process of
removing/deleted from the tuple.

> I like the new guidelines you proposed (in the patch).
> They are well-written and clear.

Thanks. The guidelines might well become stricter in the future. Right
now I'd be happy if everybody could at least be in rough agreement
about the most basic things.

> I recognized that the output doesn't look nice, but I hadn't exactly
> thought of it as malformed. Perhaps you are right.

It does seem like an annoying thing to have to handle if you actually
want to parse the array. It requires a different approach to every
other array, which seems bad.

> I will say and I am still not a fan of the "if (first) else" logic in
> your attached patch.

I agree that my approach there was pretty ugly.

> How about we have the flags use a trailing comma and space and then
> overwrite the last one with something this:
>
>     if (infobits & XLHL_KEYS_UPDATED)
>         appendStringInfoString(buf, "KEYS_UPDATED, ");
>     buf->data[buf->len -= strlen(", ")] = '\0';

I'll try something like that instead.

> -            offsets = (OffsetNumber *) &plans[xlrec->nplans];
> +            offsets = (OffsetNumber *) ((char *) plans +
> +                                        (xlrec->nplans *
> +                                        sizeof(xl_heap_freeze_plan)));
>              appendStringInfoString(buf, ", plans:");
>              array_desc(buf, plans, sizeof(xl_heap_freeze_plan), xlrec->nplans,
>                        &plan_elem_desc, &offsets);

I thought that it made sense to match the FREEZE_PAGE REDO routine.

Another fairly arbitrary change, to be honest.

> > Note that the patch makes many individual (say) HOT_UPDATE records
> > have descriptions that look like this:
> >
> > ... old_infobits: [], ...
> >
> > This differs from HEAD, where the output is totally suppressed because
> > there are no flag bits to show. I think that this behavior is more
> > logical and consistent overall.
>
> Yea, I think it is better to include things and show that they are empty
> then omit them. I find it more clear.

Right. It makes sense for something like this, because generally
speaking the structures aren't nested in any real sense. They're also
very static -- WAL records have a fixed structure. So it's unlikely
that anybody is going to try to parse the description before knowing
which particular WAL record type (or perhaps types, plural) are
involved.

--
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Melanie Plageman
Date:
On Mon, Apr 10, 2023 at 04:31:44PM -0700, Peter Geoghegan wrote:
> On Mon, Apr 10, 2023 at 3:04 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> > 
> > I will say that the prefix of p in "ptid" makes it sound like pointer to
> > a tid, which I don't believe is what you meant.
> 
> I was thinking of the symbol name "ptid" from
> _bt_delitems_delete_check() (it even appears in code comments). I
> intended "posting list TID". But "pointer to a TID" actually kinda
> works too, since these are offsets into a posting list (a simple
> ItemPointerData array) for those TIDs that we're in the process of
> removing/deleted from the tuple.

If you keep the name, I'd explain it briefly in a comment above the code
then -- for those of us who spend less time with btrees. It is a tool
that will be often used by developers, so it is not unreasonable to
assume they may read the code if they are confused.

- Melanie



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Mon, Apr 10, 2023 at 5:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> If you keep the name, I'd explain it briefly in a comment above the code
> then -- for those of us who spend less time with btrees. It is a tool
> that will be often used by developers, so it is not unreasonable to
> assume they may read the code if they are confused.

Okay, I'll do something about that shortly.

Attached v2 deals with the "trailing comma and space in flags array"
heap desc issue using an approach that's along the same lines as your
suggested approach. What do you think?

--
Peter Geoghegan

Attachment

Re: Show various offset arrays for heap WAL records

From
Melanie Plageman
Date:
Hi,

static void
infobits_desc(StringInfo buf, uint8 infobits, const char *keyname)
{
    appendStringInfo(buf, "%s: [", keyname);

Why can we assume that there will be no space at the end here?

I know we need to be able to avoid doing the comma overwriting if no
flags were set. In general, we expect record description elements to
prepend themselves with commas and spaces, but these infobits, for
example, use a trailing comma and space. If someone adds a description
element with a trailing space, they will trip this assert. We should at
least add a comment explaining this assertion so someone knows what to
do if they trip it.

Otherwise, we can return early if no flags are set. That will probably
make for slightly messier code since we would still have to construct
the empty list.

    Assert(buf->data[buf->len - 1] != ' ');

    if (infobits & XLHL_XMAX_IS_MULTI)
        appendStringInfoString(buf, "IS_MULTI, ");
    if (infobits & XLHL_XMAX_LOCK_ONLY)
        appendStringInfoString(buf, "LOCK_ONLY, ");
    if (infobits & XLHL_XMAX_EXCL_LOCK)
        appendStringInfoString(buf, "EXCL_LOCK, ");
    if (infobits & XLHL_XMAX_KEYSHR_LOCK)
        appendStringInfoString(buf, "KEYSHR_LOCK, ");
    if (infobits & XLHL_KEYS_UPDATED)
        appendStringInfoString(buf, "KEYS_UPDATED, ");

    if (buf->data[buf->len - 1] == ' ')
    {
        /* Truncate-away final unneeded ", "  */
        Assert(buf->data[buf->len - 2] == ',');
        buf->len -= 2;
        buf->data[buf->len] = '\0';
    }

    appendStringInfoString(buf, "]");
}

Also you didn't add the same assert to truncate_flags_desc().

static void
truncate_flags_desc(StringInfo buf, uint8 flags)
{
    appendStringInfoString(buf, "flags: [");

    if (flags & XLH_TRUNCATE_CASCADE)
        appendStringInfoString(buf, "CASCADE, ");
    if (flags & XLH_TRUNCATE_RESTART_SEQS)
        appendStringInfoString(buf, "RESTART_SEQS, ");

    if (buf->data[buf->len - 1] == ' ')
    {
        /* Truncate-away final unneeded ", "  */
        Assert(buf->data[buf->len - 2] == ',');
        buf->len -= 2;
        buf->data[buf->len] = '\0';
    }

    appendStringInfoString(buf, "]");
}

Not the fault of this patch, but I also noticed that heap UPDATE and
HOT_UPDATE records have xmax twice and don't differentiate between new
and old. I think that was probably a mistake.

description      | off: 119, xmax: 1105, flags: 0x00, old_infobits:
[], new off: 100, xmax 0

    else if (info == XLOG_HEAP_UPDATE)
    {
        xl_heap_update *xlrec = (xl_heap_update *) rec;

        appendStringInfo(buf, "off: %u, xmax: %u, flags: 0x%02X, ",
                        xlrec->old_offnum,
                        xlrec->old_xmax,
                        xlrec->flags);
        infobits_desc(buf, xlrec->old_infobits_set, "old_infobits");
        appendStringInfo(buf, ", new off: %u, xmax %u",
                        xlrec->new_offnum,
                        xlrec->new_xmax);
    }
    else if (info == XLOG_HEAP_HOT_UPDATE)
    {
        xl_heap_update *xlrec = (xl_heap_update *) rec;

        appendStringInfo(buf, "off: %u, xmax: %u, flags: 0x%02X, ",
                        xlrec->old_offnum,
                        xlrec->old_xmax,
                        xlrec->flags);
        infobits_desc(buf, xlrec->old_infobits_set, "old_infobits");
        appendStringInfo(buf, ", new off: %u, xmax: %u",
                        xlrec->new_offnum,
                        xlrec->new_xmax);
    }

Also not the fault of this patch, but looking at the output while using
this, I realized truncate record type has a stringified version of its
flags while other record types, like update, don't. Do you think this
makes sense? Perhaps not something we can change now, though...

description      | off: 1, xmax: 1183, flags: 0x00, old_infobits: [],
new off: 119, xmax 0

Also not the fault of this patch, but I noticed that leaftopparent is
often InvalidBlockNumber--which shows up as 4294967295. I wonder if
anyone would be confused by this. Maybe devs know that this value is
InvalidBlockNumber. In the future, perhaps we should interpolate the
string "InvalidBlockNumber"?

description | left: 436, right: 389, level: 0, safexid: 0:1091,
leafleft: 436, leafright: 389, leaftopparent: 4294967295

- Melanie



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Tue, Apr 11, 2023 at 7:40 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> static void
> infobits_desc(StringInfo buf, uint8 infobits, const char *keyname)
> {
>     appendStringInfo(buf, "%s: [", keyname);
>
> Why can we assume that there will be no space at the end here?

I don't think that anybody is going to try that, but if they do then
the assertion will fail reliably.

> I know we need to be able to avoid doing the comma overwriting if no
> flags were set. In general, we expect record description elements to
> prepend themselves with commas and spaces, but these infobits, for
> example, use a trailing comma and space. If someone adds a description
> element with a trailing space, they will trip this assert. We should at
> least add a comment explaining this assertion so someone knows what to
> do if they trip it.

The invariant here is roughly: caller's keyname argument cannot have
trailing spaces or punctuation characters. It looks like it would be
inconvenient to write a precise assertion for that, but it doesn't
feel particularly necessary, given that this is just a static helper
function.

> Otherwise, we can return early if no flags are set. That will probably
> make for slightly messier code since we would still have to construct
> the empty list.

I prefer to keep this as simple as possible for now.

> Also you didn't add the same assert to truncate_flags_desc().

That's because truncate_flags_desc doesn't have a "keyname" argument.
Though it does have an assertion at the end that is almost equivalent:
the "Assert(buf->data[buf->len - 2] == ',') assertion (a matching
assertion appears at the end of infobits_desc).

> Not the fault of this patch, but I also noticed that heap UPDATE and
> HOT_UPDATE records have xmax twice and don't differentiate between new
> and old. I think that was probably a mistake.
>
> description      | off: 119, xmax: 1105, flags: 0x00, old_infobits:
> [], new off: 100, xmax 0

That doesn't seem great to me either. I don't like this ambiguity,
because it seems like it makes the description hard to parse in a way
that flies in the face of what we're trying to do here, in general.
So it seems like it might be worth fixing now, in the scope of this
patch.

> Also not the fault of this patch, but looking at the output while using
> this, I realized truncate record type has a stringified version of its
> flags while other record types, like update, don't. Do you think this
> makes sense? Perhaps not something we can change now, though...

You don't have to look at the truncate record type (which is a
relatively obscure and unimportant record type) to see these kinds of
inconsistencies. You can see the same thing with HEAP_UPDATE and
HEAP_HOT_UPDATE, which have stringified constants for infomask bits,
but not for the xl_heap_update.flags status bits.

I don't see any principled reason why such an inconsistency should
exist -- and we're talking about a pretty glaring inconsistency here.
On the other hand I don't think that we're obligated to do anything
about it for 16.

> Also not the fault of this patch, but I noticed that leaftopparent is
> often InvalidBlockNumber--which shows up as 4294967295. I wonder if
> anyone would be confused by this. Maybe devs know that this value is
> InvalidBlockNumber. In the future, perhaps we should interpolate the
> string "InvalidBlockNumber"?
>
> description | left: 436, right: 389, level: 0, safexid: 0:1091,
> leafleft: 436, leafright: 389, leaftopparent: 4294967295

In my personal opinion (this is a totally subjective question), the
current approach here is okay because (on its own) "leaftopparent:
4294967295" isn't any more or less meaningful than "leaftopparent:
InvalidBlockNumber". It's not as if the REDO routine actually relies
on the value ever being InvalidBlockNumber at all (except in an
assertion).

Plus it's easier to parse as-is. That's what swings it for me, in fact
(as with the "two xmax fields in update records" question).

This is the kind of question that tends to lead to bikeshedding. The
guidelines should avoid taking a firm position on these more
subjective questions, where we must make a subjective trade-off.
Especially a trade-off around how faithfully we represent the physical
WAL record versus readability (whatever "readability" means). I
pondered a similar trade-off in comments added to delvacuum_desc. That
contributed to my feeling that this is best left up to individual rmgr
desc routines.

--
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Tue, Apr 11, 2023 at 10:34 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > description      | off: 119, xmax: 1105, flags: 0x00, old_infobits:
> > [], new off: 100, xmax 0
>
> That doesn't seem great to me either. I don't like this ambiguity,
> because it seems like it makes the description hard to parse in a way
> that flies in the face of what we're trying to do here, in general.
> So it seems like it might be worth fixing now, in the scope of this
> patch.

Attached revision deals with this by spelling out the names in full
(e.g., "old_xmax" and "new_xmax"). It also reorders the output fields
to match the order from the physical UPDATE, HOT_UPDATE, and LOCK WAL
record types, on the theory that those should match the physical
record (unless there is a good reason not to, which doesn't apply
here). I also removed some inconsistencies between
xl_heap_lock_updated and xl_heap_lock, since they're very similar
record types.

The revision also adds an extra sentence to the guidelines, since this
seems like something that we're entitled to take a relatively firm
position on. Finally, it also adds a comment about the rules for
infobits_desc callers in header comments for the function, per your
concern about that.

--
Peter Geoghegan

Attachment

Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Tue, Apr 11, 2023 at 11:48 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached revision deals with this by spelling out the names in full
> (e.g., "old_xmax" and "new_xmax"). It also reorders the output fields
> to match the order from the physical UPDATE, HOT_UPDATE, and LOCK WAL
> record types, on the theory that those should match the physical
> record (unless there is a good reason not to, which doesn't apply
> here).

I just noticed that we don't even show xmax in the case of DELETE
records. Perhaps the original assumption is that it must match the
record's own XID, but that's not true after the MultiXact enhancements
for foreign key locking added to 9.3 (and in any case there is no
reason at all to show xmax in UPDATE but not in DELETE).

Attached revision v4 fixes this, making DELETE, UPDATE, HOT_UPDATE,
LOCK, and LOCK_UPDATED record types consistent with each other in
terms of the key names output by the heap desc routine. The field
order also needed a couple of tweaks for struct consistency (and
cross-record consistency) for v4.

--
Peter Geoghegan

Attachment

Re: Show various offset arrays for heap WAL records

From
Melanie Plageman
Date:
On Tue, Apr 11, 2023 at 1:35 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Apr 11, 2023 at 7:40 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
> > Not the fault of this patch, but I also noticed that heap UPDATE and
> > HOT_UPDATE records have xmax twice and don't differentiate between new
> > and old. I think that was probably a mistake.
> >
> > description      | off: 119, xmax: 1105, flags: 0x00, old_infobits:
> > [], new off: 100, xmax 0
>
> That doesn't seem great to me either. I don't like this ambiguity,
> because it seems like it makes the description hard to parse in a way
> that flies in the face of what we're trying to do here, in general.
> So it seems like it might be worth fixing now, in the scope of this
> patch.

Agreed.

On Tue, Apr 11, 2023 at 3:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Apr 11, 2023 at 11:48 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > Attached revision deals with this by spelling out the names in full
> > (e.g., "old_xmax" and "new_xmax"). It also reorders the output fields
> > to match the order from the physical UPDATE, HOT_UPDATE, and LOCK WAL
> > record types, on the theory that those should match the physical
> > record (unless there is a good reason not to, which doesn't apply
> > here).
>
> I just noticed that we don't even show xmax in the case of DELETE
> records. Perhaps the original assumption is that it must match the
> record's own XID, but that's not true after the MultiXact enhancements
> for foreign key locking added to 9.3 (and in any case there is no
> reason at all to show xmax in UPDATE but not in DELETE).
>
> Attached revision v4 fixes this, making DELETE, UPDATE, HOT_UPDATE,
> LOCK, and LOCK_UPDATED record types consistent with each other in
> terms of the key names output by the heap desc routine. The field
> order also needed a couple of tweaks for struct consistency (and
> cross-record consistency) for v4.

Code in v4 all seems fine to me.
I like the update guidelines comment.

I agree it would be nice for xl_heap_lock->locking_xid to be renamed
xmax for clarity. I would suggest that if you don't intend to put it
in a separate commit, you mention it explicitly in the final commit
message. Its motivation isn't immediately obvious to the reader.

- Melanie



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Tue, Apr 11, 2023 at 2:29 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> > That doesn't seem great to me either. I don't like this ambiguity,
> > because it seems like it makes the description hard to parse in a way
> > that flies in the face of what we're trying to do here, in general.
> > So it seems like it might be worth fixing now, in the scope of this
> > patch.
>
> Agreed.

Great -- pushed a fix for this just now, which included that change.

> I agree it would be nice for xl_heap_lock->locking_xid to be renamed
> xmax for clarity. I would suggest that if you don't intend to put it
> in a separate commit, you mention it explicitly in the final commit
> message. Its motivation isn't immediately obvious to the reader.

What I ended up doing is making that part of a bug fix for a minor
buglet I noticed in passing -- it became part of the "Fix xl_heap_lock
WAL record field's data type" commit from a bit earlier on.

Thanks for your help with the follow-up work. Seems like we're done
with this now.

--
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Heikki Linnakangas
Date:
On 12/04/2023 01:29, Peter Geoghegan wrote:
> Thanks for your help with the follow-up work. Seems like we're done
> with this now.

This is still listed in the July commitfest; is there some work remaining?

I'm late to the party, but regarding commit c03c2eae0a, which added the 
guidelines for writing formatting desc functions:

You moved the comment from rmgrdesc_utils.c into rmgrdesc_utils.h, but I 
don't think that was a good idea. Our usual convention is to have the 
function comment in the .c file, not at the declaration in the header 
file. When I want to know what a function does, I jump to the .c file, 
and might miss the comment in the header entirely.

Let's add a src/backend/access/rmgrdesc/README file. We don't currently 
have any explanation anywhere why the rmgr desc functions are in a 
separate directory. The README would be a good place to explain that, 
and to have the formatting guidelines. See attached.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Mon, Jul 10, 2023 at 12:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> This is still listed in the July commitfest; is there some work remaining?

I don't think so; not in the scope of the original patch series from
Melanie, at least.

> You moved the comment from rmgrdesc_utils.c into rmgrdesc_utils.h, but I
> don't think that was a good idea. Our usual convention is to have the
> function comment in the .c file, not at the declaration in the header
> file. When I want to know what a function does, I jump to the .c file,
> and might miss the comment in the header entirely.

I think that this was a gray area. It wasn't particularly obvious
where this would go. At least not to me.

> Let's add a src/backend/access/rmgrdesc/README file. We don't currently
> have any explanation anywhere why the rmgr desc functions are in a
> separate directory. The README would be a good place to explain that,
> and to have the formatting guidelines. See attached.

I agree that it's better this way, though.

--
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Peter Geoghegan
Date:
On Mon, Jul 10, 2023 at 10:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Let's add a src/backend/access/rmgrdesc/README file. We don't currently
> > have any explanation anywhere why the rmgr desc functions are in a
> > separate directory. The README would be a good place to explain that,
> > and to have the formatting guidelines. See attached.
>
> I agree that it's better this way, though.

Did you forget to follow up here?

--
Peter Geoghegan



Re: Show various offset arrays for heap WAL records

From
Melanie Plageman
Date:
On Mon, Jul 10, 2023 at 3:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm late to the party, but regarding commit c03c2eae0a, which added the
> guidelines for writing formatting desc functions:
>
> You moved the comment from rmgrdesc_utils.c into rmgrdesc_utils.h, but I
> don't think that was a good idea. Our usual convention is to have the
> function comment in the .c file, not at the declaration in the header
> file. When I want to know what a function does, I jump to the .c file,
> and might miss the comment in the header entirely.
>
> Let's add a src/backend/access/rmgrdesc/README file. We don't currently
> have any explanation anywhere why the rmgr desc functions are in a
> separate directory. The README would be a good place to explain that,
> and to have the formatting guidelines. See attached.

diff --git a/src/backend/access/rmgrdesc/README
b/src/backend/access/rmgrdesc/README
new file mode 100644
index 0000000000..abe84b9f11
--- /dev/null
+++ b/src/backend/access/rmgrdesc/README
@@ -0,0 +1,60 @@
+src/backend/access/rmgrdesc/README
+
+WAL resource manager description functions
+==========================================
+
+For debugging purposes, there is a "description function", or rmgrdesc
+function, for each WAL resource manager. The rmgrdesc function parses the WAL
+record and prints the contents of the WAL record in a somewhat human-readable
+format.
+
+The rmgrdesc functions for all resource managers are gathered in this
+directory, because they are also used in the stand-alone pg_waldump program.

"standalone" seems the more common spelling of this adjective in the
codebase today.

+They could potentially be used by out-of-tree debugging tools too, although
+the the functions or the output format should not be considered a stable API.

You have an extra "the".

I might phrase the last bit as "neither the description functions nor
the output format should be considered part of a stable API"

+Guidelines for rmgrdesc output format
+=====================================

I noticed you used === for both headings and wondered if it was
intentional. Other READMEs I looked at in src/backend/access tend to
have a single heading underlined with ==== and then subsequent
headings are underlined with -----. I could see an argument either way
here, but I just thought I would bring it up in case it was not a
conscious choice.

Otherwise, LGTM.

- Melanie



Re: Show various offset arrays for heap WAL records

From
Heikki Linnakangas
Date:
On 04/09/2023 23:02, Melanie Plageman wrote:
> I might phrase the last bit as "neither the description functions nor
> the output format should be considered part of a stable API"
> 
> +Guidelines for rmgrdesc output format
> +=====================================
> 
> I noticed you used === for both headings and wondered if it was
> intentional. Other READMEs I looked at in src/backend/access tend to
> have a single heading underlined with ==== and then subsequent
> headings are underlined with -----. I could see an argument either way
> here, but I just thought I would bring it up in case it was not a
> conscious choice.
> 
> Otherwise, LGTM.

Made these changes and committed. Thank you!

-- 
Heikki Linnakangas
Neon (https://neon.tech)




On Tue, Mar 21, 2023 at 3:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I think that we should do something like the attached, to completely
> avoid this ambiguity. This patch adds a new XLOG_HEAP2 bit that's
> similar to XLOG_HEAP_INIT_PAGE -- XLOG_HEAP2_BYVACUUM. This allows all
> XLOG_HEAP2 record types to indicate that they took place during
> VACUUM, by XOR'ing the flag with the record type/info when
> XLogInsert() is called. For now this is only used by PRUNE records.
> Tools like pg_walinspect will report a separate "Heap2/PRUNE+BYVACUUM"
> record_type, as well as the unadorned Heap2/PRUNE record_type, which
> we'll now know must have been opportunistic pruning.
>
> The approach of using a bit in the style of the heapam init bit makes
> sense to me, because the bit is available, and works in a way that is
> minimally invasive. Also, one can imagine needing to resolve a similar
> ambiguity in the future, when (say) opportunistic freezing is added.

Starting a new, dedicated thread to keep track of this in the CF app.

This patch bitrot. Attached is v2, rebased on top of HEAD.

--
Peter Geoghegan

Attachment
On 09/12/2023 23:48, Peter Geoghegan wrote:
> On Tue, Mar 21, 2023 at 3:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> I think that we should do something like the attached, to completely
>> avoid this ambiguity. This patch adds a new XLOG_HEAP2 bit that's
>> similar to XLOG_HEAP_INIT_PAGE -- XLOG_HEAP2_BYVACUUM. This allows all
>> XLOG_HEAP2 record types to indicate that they took place during
>> VACUUM, by XOR'ing the flag with the record type/info when
>> XLogInsert() is called. For now this is only used by PRUNE records.
>> Tools like pg_walinspect will report a separate "Heap2/PRUNE+BYVACUUM"
>> record_type, as well as the unadorned Heap2/PRUNE record_type, which
>> we'll now know must have been opportunistic pruning.
>>
>> The approach of using a bit in the style of the heapam init bit makes
>> sense to me, because the bit is available, and works in a way that is
>> minimally invasive. Also, one can imagine needing to resolve a similar
>> ambiguity in the future, when (say) opportunistic freezing is added.
> 
> Starting a new, dedicated thread to keep track of this in the CF app.
> 
> This patch bitrot. Attached is v2, rebased on top of HEAD.

I included changes like this in commit f83d709760 ("Merge prune, freeze 
and vacuum WAL record formats"). Marking this as Committed in the 
commitfest.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




On Mon, Mar 25, 2024 at 9:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I included changes like this in commit f83d709760 ("Merge prune, freeze
> and vacuum WAL record formats"). Marking this as Committed in the
> commitfest.

Thanks for making sure that that happened. I suspect that the amount
of pruning performed opportunistically is sometimes much higher than
generally assumed, so having a way of measuring that seems like it
might lead to valuable insights.

--
Peter Geoghegan