Re: "PANIC: could not open critical system index 2662" - twice - Mailing list pgsql-general

From Thomas Munro
Subject Re: "PANIC: could not open critical system index 2662" - twice
Msg-id CA+hUKGJDfBEfqatSsV16XosCkr=3k2vb6X1mcbL_kdod_wHQwQ@mail.gmail.com
In response to Re: "PANIC: could not open critical system index 2662" - twice  (Evgeny Morozov <postgresql3@realityexists.net>)
Responses Re: "PANIC: could not open critical system index 2662" - twice  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: "PANIC: could not open critical system index 2662" - twice  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-general
On Sun, May 7, 2023 at 12:29 AM Evgeny Morozov
<postgresql3@realityexists.net> wrote:
> On 6/05/2023 12:34 pm, Thomas Munro wrote:
> > So it does indeed look like something unknown has replaced 32KB of
> > data with 32KB of zeroes underneath us.  Are there more non-empty
> > files that are all-zeroes?  Something like this might find them:
> >
> > for F in base/1414389/*
> > do
> >   # non-empty file whose hex dump contains nothing but 00 bytes
> >   if [ -s "$F" ] && ! xxd -p "$F" | grep -qEv '^(00)*$'
> >   then
> >     echo "$F"
> >   fi
> > done
>
> Yes, a total of 309 files are all-zeroes (and 52 files are not).
>
> I also checked the other DB that reports the same "unexpected zero page
> at block 0" error, "test_behavior_638186280406544656" (OID 1414967) -
> similar story there. I uploaded the lists of zeroed and non-zeroed files
> and the ls -la output for both as
> https://objective.realityexists.net/temp/pgstuff3.zip
>
> I then searched recursively for such all-zeroes files in $PGDATA/base and
> did not find any outside of those two directories (base/1414389 and
> base/1414967). None in $PGDATA/global, either.

So "diff -u zeroed-files-1414967.txt zeroed-files-1414389.txt" shows
that they have the same broken stuff in the range cloned from the
template database by CREATE DATABASE STRATEGY=WAL_LOG, and it looks
like it's *all* the cloned catalogs, and then they have some
non-matching relfilenodes > 1400000, presumably stuff you created
directly in the new database (I'm not sure if I can say for sure that
those files are broken, without knowing what they are).
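
To separate the cloned catalogs from the files created later, the two
lists can be compared by file name rather than by full path.  A rough
sketch, assuming each list has one path per line of the form
base/<dboid>/<relfilenode> (I have not checked the exact format of the
uploaded files); the function names are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helpers; assumes one path per line, e.g. base/1414389/2662.

# Print the sorted, de-duplicated file names (last path component) of a list.
relfilenodes() {
    sed 's|.*/||' "$1" | sort -u
}

# Print file names present in both lists (likely cloned from the template).
zeroed_in_both() {
    a=$(mktemp) && b=$(mktemp) || return 1
    relfilenodes "$1" > "$a"
    relfilenodes "$2" > "$b"
    comm -12 "$a" "$b"
    rm -f "$a" "$b"
}

# Print file names present in only one of the two lists.
zeroed_in_one() {
    a=$(mktemp) && b=$(mktemp) || return 1
    relfilenodes "$1" > "$a"
    relfilenodes "$2" > "$b"
    # comm indents second-column lines with a tab; strip it.
    comm -3 "$a" "$b" | tr -d '\t'
    rm -f "$a" "$b"
}
```

For example, "zeroed_in_both zeroed-files-1414967.txt zeroed-files-1414389.txt"
would list the names common to both databases, and zeroed_in_one the
rest.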

Did you previously run this same workload on versions < 15 and never
see any problem?  15 gained a new feature CREATE DATABASE ...
STRATEGY=WAL_LOG, which is also the default.  I wonder if there is a
bug somewhere near that, though I have no specific idea.  If you
explicitly add STRATEGY=FILE_COPY to your CREATE DATABASE commands,
you'll get the traditional behaviour.  It seems like you have some
kind of high frequency testing workload that creates and tests
databases all day long, and just occasionally detects this corruption.
Would you like to try requesting FILE_COPY for a while and see if it
eventually happens like that too?
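
In case it helps, here is a minimal sketch of how a test harness might
request the strategy explicitly.  The helper name, the database name
and the template name "my_template" are placeholders, not anything
from your setup:

```shell
#!/bin/sh
# Hypothetical helper: print a CREATE DATABASE statement with an explicit
# strategy (PostgreSQL 15 or later).  Identifiers are double-quoted;
# names that themselves contain double quotes are not handled here.
create_db_sql() {
    name=$1 template=$2 strategy=${3:-FILE_COPY}
    case $strategy in
        FILE_COPY|WAL_LOG) ;;
        *) echo "unknown strategy: $strategy" >&2; return 1 ;;
    esac
    printf 'CREATE DATABASE "%s" TEMPLATE "%s" STRATEGY = %s;\n' \
        "$name" "$template" "$strategy"
}
```

The output could then be piped to psql, for example
"create_db_sql test_db my_template | psql -X -v ON_ERROR_STOP=1 -d postgres".
Passing WAL_LOG as the third argument gives the new default behaviour
for comparison.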

My spidey sense is leaning away from filesystem bugs.  We've found
plenty of filesystem bugs on these mailing lists over the years and of
course it's not impossible, but I dunno... it seems quite suspicious
that all the system catalogs have apparently been wiped during or
moments after the creation of a new database that's running new
PostgreSQL 15 code...


