Thread: datfrozenxid > relfrozenxid w/ crash before XLOG_HEAP_INPLACE

datfrozenxid > relfrozenxid w/ crash before XLOG_HEAP_INPLACE

From: Noah Misch
https://postgr.es/m/20240512232923.aa.nmisch@google.com wrote:
> Separable, nontrivial things not fixed in the attached patch stack:

> - Trouble is possible, I bet, if the system crashes between the inplace-update
>   memcpy() and XLogInsert().  See the new XXX comment below the memcpy().

That comment:

    /*----------
     * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
     *
     * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
     * ["R" is a VACUUM tbl]
     * D: vac_update_datfrozenxid() -> systable_beginscan(pg_class)
     * D: systable_getnext() returns pg_class tuple of tbl
     * R: memcpy() into pg_class tuple of tbl
     * D: raise pg_database.datfrozenxid, XLogInsert(), finish
     * [crash]
     * [recovery restores datfrozenxid w/o relfrozenxid]
     */

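To make that window concrete, here is a condensed sketch of the pre-fix ordering in heap_inplace_update() (my
paraphrase, with setup, locking, and error handling omitted; not text from the patch):

    START_CRIT_SECTION();

    /* 1. Overwrite the catalog tuple directly in the shared buffer. */
    memcpy((char *) htup + htup->t_hoff,
           (char *) tuple->t_data + tuple->t_data->t_hoff,
           newlen);
    MarkBufferDirty(buffer);

    /*
     * Crash window: session "D" can already read the raised relfrozenxid
     * from this buffer, WAL-log a raised datfrozenxid, and get its record
     * flushed before the record below even exists.  Recovery then restores
     * D's change without R's.
     */

    /* 2. Only afterwards build and insert the XLOG_HEAP_INPLACE record
     *    (XLogBeginInsert(), XLogRegisterBuffer(), XLogInsert()) and set
     *    the page LSN. */

    END_CRIT_SECTION();
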
>   Might solve this by inplace update setting DELAY_CHKPT, writing WAL, and
>   finally issuing memcpy() into the buffer.

That fix worked.  Along with that, I'm attaching a not-for-commit patch with a
test case and one with the fix rebased on that test case.  Apply on top of the
v2 patch stack from https://postgr.es/m/20240617235854.f8.nmisch@google.com.
This gets key testing from 027_stream_regress.pl; when I commented out some
memcpy lines of the heapam.c change, that test caught it.
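
For concreteness, here is a sketch of that reordering (my reading of the approach, not the attached patch itself;
it assumes the DELAY_CHKPT_START flag from storage/proc.h, and reuses the local names from the sketch above):

    START_CRIT_SECTION();

    /*
     * Keep a concurrent checkpoint from establishing its redo pointer
     * between the WAL record below and the memcpy(): recovery from such a
     * checkpoint would start past our record, yet the buffer change might
     * never have reached disk.
     */
    MyProc->delayChkptFlags |= DELAY_CHKPT_START;

    /* 1. WAL-log the new tuple contents, taken from the caller's copy. */
    if (RelationNeedsWAL(relation))
    {
        xl_heap_inplace xlrec;
        XLogRecPtr      recptr;

        xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self);

        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, SizeOfHeapInplace);
        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
        XLogRegisterBufData(0,
                            (char *) tuple->t_data + tuple->t_data->t_hoff,
                            newlen);

        recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);
        PageSetLSN(BufferGetPage(buffer), recptr);
    }

    /* 2. Only now reveal the change to other backends. */
    memcpy((char *) htup + htup->t_hoff,
           (char *) tuple->t_data + tuple->t_data->t_hoff,
           newlen);
    MarkBufferDirty(buffer);

    MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
    END_CRIT_SECTION();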

This resolves the last inplace update defect known to me.

Thanks,
nm


Re: datfrozenxid > relfrozenxid w/ crash before XLOG_HEAP_INPLACE

From: "Andrey M. Borodin"

> On 20 Jun 2024, at 06:29, Noah Misch <noah@leadboat.com> wrote:
>
> This resolves the last inplace update defect known to me.

That’s a huge amount of work, thank you!

Do I get it right that inplace updates are catalog-specific, and that some other OOM corruptions [0] and standby
corruptions [1] are not related to this fix?  We observed both cases on regular tables.
Or might that be an effect of vacuum deepening the corruption after observing a wrong datfrozenxid?


Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE%40yandex-team.ru#1266dd8b898ba02686c2911e0a50ab47
[1] https://www.postgresql.org/message-id/flat/CAFj8pRBEFMxxFSCVOSi-4n0jHzSaxh6Ze_cZid5eG%3Dtsnn49-A%40mail.gmail.com

Re: datfrozenxid > relfrozenxid w/ crash before XLOG_HEAP_INPLACE

From: Noah Misch

On Thu, Jun 20, 2024 at 12:17:44PM +0500, Andrey M. Borodin wrote:
> On 20 Jun 2024, at 06:29, Noah Misch <noah@leadboat.com> wrote:
> > This resolves the last inplace update defect known to me.
> 
> That’s a huge amount of work, thank you!
> 
> Do I get it right that inplace updates are catalog-specific, and that some other OOM corruptions [0] and standby
> corruptions [1] are not related to this fix?  We observed both cases on regular tables.

In core code, inplace updates are specific to pg_class and pg_database.
Among PGXN modules, only the citus extension uses them on some other table.
[0] definitely looks unrelated.

> Or might that be an effect of vacuum deepening the corruption after observing a wrong datfrozenxid?

Wrong datfrozenxid can cause premature clog truncation, which can cause "could
not access status of transaction".  While $SUBJECT could cause that, I think
it would happen on both primary and standby.  [1] seems to be about a standby
lacking clog present on the primary, which is unrelated.
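
To spell that mechanism out (my illustration, not code from this thread; min_datfrozenxid and oldest_datoid are
hypothetical variables standing for what vac_truncate_clog() computes from pg_database):

    /*
     * The clog cutoff is the cluster-wide minimum datfrozenxid.  If a crash
     * left some relfrozenxid behind datfrozenxid, that table can still hold
     * xids older than the cutoff, and checking one later raises the error
     * quoted above.
     */
    TruncateCLOG(min_datfrozenxid, oldest_datoid);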

> [0] https://www.postgresql.org/message-id/flat/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE%40yandex-team.ru#1266dd8b898ba02686c2911e0a50ab47
> [1] https://www.postgresql.org/message-id/flat/CAFj8pRBEFMxxFSCVOSi-4n0jHzSaxh6Ze_cZid5eG%3Dtsnn49-A%40mail.gmail.com