Re: "PANIC: could not open critical system index 2662" - twice - Mailing list pgsql-general

From Thomas Munro
Subject Re: "PANIC: could not open critical system index 2662" - twice
Msg-id CA+hUKGJ5zz8a6-D5ZQcoftZbAeBDs_h_3NL87TndKhW_2_aw1Q@mail.gmail.com
In response to Re: "PANIC: could not open critical system index 2662" - twice  (Tom Lane <tgl@sss.pgh.pa.us>)
On Sun, May 7, 2023 at 1:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Did you previously run this same workload on versions < 15 and never
> > see any problem?  15 gained a new feature CREATE DATABASE ...
> > STRATEGY=WAL_LOG, which is also the default.  I wonder if there is a
> > bug somewhere near that, though I have no specific idea.
>
> Per the release notes I was just writing ...
>
>     <listitem>
> <!--
> Author: Michael Paquier <michael@paquier.xyz>
> Branch: master [8a8661828] 2023-02-22 10:14:52 +0900
> Branch: REL_15_STABLE [fa5dd460c] 2023-02-22 10:14:56 +0900
> -->
>      <para>
>       Fix potential corruption of the template (source) database after
>       <command>CREATE DATABASE</command> with the <literal>STRATEGY
>       WAL_LOG</literal> option (Nathan Bossart, Ryo Matsumura)
>      </para>

Hmm.  That bug seems to have caused corruption (backwards time travel)
of blocks in the *source* DB's pg_class, by failing to write back
changes.  We seem to have zeroed pages in the *target* database, for
all catalogs (apparently everything copied by
RelationCopyStorageUsingBuffer()), even though the template is still
fine.  It is as if RelationCopyStorageUsingBuffer() created the
zero-filled file with smgrextend(), but then the buffer data was never
written out, even though we memcpy'd it into a buffer and set the
buffer dirty.

Bug-in-PostgreSQL explanations could include that we forgot the buffer
was dirty, or that some backend wrote it out to the wrong file; but if
we were forgetting a flag like permanent or dirty, wouldn't the failure
be more systematic?  It could require special rare timing if it is
similar to 8a8661828's confusion about permanence level, i.e. somehow
not setting BM_PERMANENT, but this time on the target blocks; in that
case I think it would require a checkpoint AND a crash.  It doesn't
reproduce for me, but perhaps more unlucky ingredients are needed.

Bug-in-OS/FS explanations could include that a whole lot of writes
were mysteriously lost in some time window, so all those files still
contain the zeroes we write first in smgrextend().  I guess this
previously rare use of sparse-file hole-punching (formerly limited to
hash indexes?) could be a factor in an it's-all-ZFS's-fault
explanation:

openat(AT_FDCWD,"base/16390/2662",O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC,0600) = 36 (0x24)
openat(AT_FDCWD,"base/1/2662",O_RDWR|O_CLOEXEC,00) = 37 (0x25)
lseek(37,0x0,SEEK_END)                           = 32768 (0x8000)
lseek(37,0x0,SEEK_END)                           = 32768 (0x8000)
pwrite(36,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,8192,0x6000) = 8192 (0x2000) <-- smgrextend(final block)
lseek(36,0x0,SEEK_END)                           = 32768 (0x8000)

I was trying to think about how I might go about reproducing the
exact system setup.  Evgeny, do you mind sharing your "zfs get all
/path/to/pgdata" output (curious to see block size, compression
settings, and anything else), your postgresql.conf, and your exact
Ubuntu kernel version and ZFS package versions?


