Re: Minimal logical decoding on standbys - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Minimal logical decoding on standbys
Date
Msg-id CA+TgmoaVcu_mbxbH=EccvKG6u8+MdQf9zx98uAL9zsStFwrYsQ@mail.gmail.com
Whole thread Raw
In response to Re: Minimal logical decoding on standbys  ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>)
Responses Re: Minimal logical decoding on standbys  (Andres Freund <andres@anarazel.de>)
Re: Minimal logical decoding on standbys  ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>)
List pgsql-hackers
On Wed, Dec 14, 2022 at 8:06 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
> Please find attached v31 with the changes mentioned above (except that I put your wording into the commit message
insteadof a README: I think it helps to make
 
> clear what the "design" for the patch series is).

Thanks, I think that's good clarification.

I read through 0001 again and I noticed this:

 typedef struct xl_heap_prune
 {
     TransactionId snapshotConflictHorizon;
     uint16      nredirected;
     uint16      ndead;
+    bool        onCatalogAccessibleInLogicalDecoding;
     /* OFFSET NUMBERS are in the block reference 0 */
 } xl_heap_prune;

I think this is unsafe on alignment-picky machines. I think it will
cause the offset numbers to be aligned at an odd address.
heap_xlog_prune() doesn't copy the data into aligned memory, so I
think this will result in a misaligned pointer being passed down to
heap_page_prune_execute.

I wonder what the best fix is here. We could (1) have
heap_page_prune_execute copy the data into a newly-palloc'd chunk,
which seems kind of sad on performance grounds, or we could (2) just
make the field here two bytes, or add a second byte as padding, but
that bloats the WAL slightly, or we could (3) try to steal a bit from
ndirected or ndead, if we think that we don't need all the bits. It
seems like the maximum block size is 32kB right now, which means
MaxOffsetNumber can't, I think, be more than 16kB. So maybe we could
think of replacing nredirected and ndead with uint32 flags and then
have accessor macros.

But it looks like we also have a bunch of similar issues elsewhere.
gistxlogDelete looks like it has the same problem. gistxlogPageReuse
is OK because there's no data afterwards. xl_hash_vacuum_one_page
looks like it has the same problem. So does xl_heap_prune.
xl_heap_freeze_page also has the issue: heap_xlog_freeze_page does
memcpy the plans, but not the offsets, and even for the plans, I think
for correctness we would need to treat the "plans" pointer as a void *
or char * because the pointer might be unaligned and the compiler, not
knowing that, could do bad things.

xl_btree_reuse_page is OK because no data follows the main record.
xl_btree_delete appears to have this problem if you just look at the
comments, because it says that offset numbers follow, and thus are
probably presumed aligned. However, they don't really follow, because
commit d2e5e20e57111cca9e14f6e5a99a186d4c66a5b7 moved the data from
the main data to the registered buffer data. However, AIUI, you're not
really supposed to assume that the registered buffer data is aligned.
I think this just happens to work because the length of the main
record is a multiple of the relevant small integer, and the size of
the block header is 4, so the buffer data ends up being accidentally
aligned. That might be worth fixing somehow independently of this
issue.

spgxlogVacuumRedirect is OK because the offsets array is part of the
struct, using FLEXIBLE_ARRAY_MEMBER, which will cause the offsets
field to be aligned properly. It means inserting a padding byte, but
it's not broken. If we don't mind adding padding bytes in some of the
other cases, we could potentially make use of this technique
elsewhere, I think.

Other comments:

+    if RelationIsAccessibleInLogicalDecoding(rel)
+        xlrec.flags |= VISIBILITYMAP_ON_CATALOG_ACCESSIBLE_IN_LOGICAL_DECODING;

This is a few parentheses short of where it should be. Hilariously it
still compiles because there are parentheses in the macro definition.

+            xlrec.onCatalogAccessibleInLogicalDecoding =
RelationIsAccessibleInLogicalDecoding(relation);

These lines are quite long. I think we should consider (1) picking a
shorter name for the xlrec field and, if it's such lines are going to
still routinely exceed 80 characters, (2) splitting them into two
lines, with the second one indented to match pgindent's preferences in
such cases, which I think is something like this:

xlrec.onCatalogAccessibleInLogicalDecoding =
    RelationIsAccessibleInLogicalDecoding(relation);

As far as renaming, I think we could at least remove onCatalog part
from the identifier, as that doesn't seem to be adding much. And maybe
we could even think of changing it to something like
logicalDecodingConflict or even decodingConflict, which would shave
off a bunch more characters.

+    if (heapRelation->rd_options)
+        isusercatalog = ((StdRdOptions *)
(heapRelation)->rd_options)->user_catalog_table;

Couldn't you get rid of the if statement here and also the
initialization at the top of the function and just write isusercatalog
= RelationIsUsedAsCatalogTable(heapRelation)? Or even just get rid of
the variable entirely and pass
RelationIsUsedAsCatalogTable(heapRelation) as the argument to
UpdateIndexRelation directly?

I think this could use some test cases demonstrating that
indisusercatalog gets set correctly in all the relevant cases: table
is created with user_catalog_table = true/false, reloption is changed,
reloptions are reset, new index is added later, etc.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: static assert cleanup
Next
From: Tom Lane
Date:
Subject: Re: Error-safe user functions