Thread: "PANIC: could not open critical system index 2662" - twice

"PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:

Our PostgreSQL 15.2 instance running on Ubuntu 18.04 has crashed with this error:

2023-04-05 09:24:03.448 UTC [15227] ERROR:  index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 09:24:03.448 UTC [15227] HINT:  Please REINDEX it.
...
2023-04-05 13:05:25.018 UTC [15437] root@test_behavior_638162834106895162 FATAL:  index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 13:05:25.018 UTC [15437] root@test_behavior_638162834106895162 HINT:  Please REINDEX it.
... (same error for a few more DBs)
2023-04-05 13:05:25.144 UTC [16965] root@test_behavior_638162855458823077 FATAL:  index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 13:05:25.144 UTC [16965] root@test_behavior_638162855458823077 HINT:  Please REINDEX it.
...
2023-04-05 13:05:25.404 UTC [17309] root@test_behavior_638162881641031612 PANIC:  could not open critical system index 2662
2023-04-05 13:05:25.405 UTC [9372] LOG:  server process (PID 17309) was terminated by signal 6: Aborted
2023-04-05 13:05:25.405 UTC [9372] LOG:  terminating any other active server processes

We had the same thing happen about a month ago on a different database on the same cluster. For a while PG actually ran OK as long as you didn't access that specific DB, but when trying to back up that DB with pg_dump it would crash every time. At that time one of the disks hosting the ZFS dataset with the PG data directory on it was reporting errors, so we thought it was likely due to that.

Unfortunately, before we could replace the disks, PG crashed completely and would not start again at all, so I had to rebuild the cluster from scratch and restore from pg_dump backups (still onto the old, bad disks). Once the disks were replaced (all of them) I just copied the data to them using zfs send | zfs receive and didn't bother restoring pg_dump backups again - which was perhaps foolish in hindsight.

Well, yesterday it happened again. The server still restarted OK, so I took fresh pg_dump backups of the databases we care about (which ran fine), rebuilt the cluster and restored the pg_dump backups again - now onto the new disks, which are not reporting any problems.

So while everything is up and running now this error has me rather concerned. Could the error we're seeing now have been caused by some corruption in the PG data that's been there for a month (so it could still be attributed to the bad disk), which should now be fixed by having restored from backups onto good disks? Could this be a PG bug? What can I do to figure out why this is happening and prevent it from happening again? Advice appreciated!

Re: "PANIC: could not open critical system index 2662" - twice

From
Laurenz Albe
Date:
On Thu, 2023-04-06 at 16:41 +0000, Evgeny Morozov wrote:
>  Our PostgreSQL 15.2 instance running on Ubuntu 18.04 has crashed with this error:
>
> 2023-04-05 09:24:03.448 UTC [15227] ERROR:  index "pg_class_oid_index" contains unexpected zero page at block 0
> [...]
>
> We had the same thing happen about a month ago on a different database on the same cluster.
> For a while PG actually ran OK as long as you didn't access that specific DB, but when trying
> to back up that DB with pg_dump it would crash every time. At that time one of the disks
> hosting the ZFS dataset with the PG data directory on it was reporting errors, so we thought
> it was likely due to that.
>
> Unfortunately, before we could replace the disks, PG crashed completely and would not start
> again at all, so I had to rebuild the cluster from scratch and restore from pg_dump backups
> (still onto the old, bad disks). Once the disks were replaced (all of them) I just copied
> the data to them using zfs send | zfs receive and didn't bother restoring pg_dump backups
> again - which was perhaps foolish in hindsight.
>
> Well, yesterday it happened again. The server still restarted OK, so I took fresh pg_dump
> backups of the databases we care about (which ran fine), rebuilt the cluster and restored
> the pg_dump backups again - now onto the new disks, which are not reporting any problems.
>
> So while everything is up and running now this error has me rather concerned. Could the
> error we're seeing now have been caused by some corruption in the PG data that's been there
> for a month (so it could still be attributed to the bad disk), which should now be fixed by
> having restored from backups onto good disks?

Yes, that is entirely possible.

> Could this be a PG bug?

It could be, but data corruption caused by bad hardware is much more likely.

> What can I do to figure out why this is happening and prevent it from happening again?

No idea about the former, but bad hardware is a good enough explanation.

As to keeping it from happening: use good hardware.

Yours,
Laurenz Albe



Re: "PANIC: could not open critical system index 2662" - twice

From
Michael Paquier
Date:
On Fri, Apr 07, 2023 at 01:04:34PM +0200, Laurenz Albe wrote:
> On Thu, 2023-04-06 at 16:41 +0000, Evgeny Morozov wrote:
>> Could this be a PG bug?
>
> It could be, but data corruption caused by bad hardware is much more likely.

There is no way to be completely sure here, except if we would be able
to put our hands on a reproducible test case that would break the
cluster so much that we'd get into this state.  I don't recall seeing
this error pattern in recent history, though.
--
Michael


Re: "PANIC: could not open critical system index 2662" - twice

From
"Peter J. Holzer"
Date:
On 2023-04-07 13:04:34 +0200, Laurenz Albe wrote:
> On Thu, 2023-04-06 at 16:41 +0000, Evgeny Morozov wrote:
> > What can I do to figure out why this is happening and prevent it from happening again?
>
> No idea about the former, but bad hardware is a good enough explanation.
>
> As to keeping it from happening: use good hardware.

Also: Use checksums. PostgreSQL offers data checksums[1]. Some filesystems
also offer checksums.

This doesn't prevent corruption but at least it will be detected early
and can't spread.
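
For example, this is roughly how to check whether they are enabled and how to
turn them on after the fact on a cleanly stopped cluster (just a sketch; the
data directory path is a placeholder):

psql -At -c "SHOW data_checksums;"

# offline enablement with pg_checksums (shipped since PostgreSQL 12);
# the cluster must have been shut down cleanly first
pg_ctl -D /path/to/pgdata stop
pg_checksums --enable -D /path/to/pgdata
pg_ctl -D /path/to/pgdata start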

        hp

[1] For some reason I thought the Debian/Ubuntu packages enabled this by
    default. But that doesn't seem to be the case.

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
> No idea about the former, but bad hardware is a good enough explanation.
> As to keeping it from happening: use good hardware.

Alright, thanks, I'll just keep my fingers crossed that it doesn't
happen again then!

> Also: Use checksums. PostgreSQL offers data checksums[1]. Some
> filesystems also offer checksums.

We have data_checksums=on. (It must be on by default, since I cannot
find that in our config files anywhere.) However, the docs say "Only
data pages are protected by checksums; internal data structures and
temporary files are not.", so I guess pg_class_oid_index might be an
"internal data structure"?

We also have checksum=on for the ZFS dataset on which the data is stored
(also the default - we didn't change it). ZFS did detect problems (zpool
status reported read, write and checksum errors for one of the old
disks), but it also said "errors: No known data errors". I understood
that to mean that it recovered from the errors, i.e. wrote the data
to different disk blocks or read it from another disk in the pool.
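
In case it helps, this is roughly how I have been checking the pool, and what I
plan to run to re-verify it (the pool name here is a placeholder):

zpool status -v tank    # per-device read/write/checksum counters and the "errors:" summary
zpool scrub tank        # re-reads every block and repairs from redundancy where it can
zpool status tank       # shows scrub progress and any data errors it found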





Re: "PANIC: could not open critical system index 2662" - twice

From
Michael Paquier
Date:
On Tue, Apr 11, 2023 at 04:44:54PM +0000, Evgeny Morozov wrote:
> We have data_checksums=on. (It must be on by default, since I cannot
> find that in our config files anywhere.)

initdb does not enable checksums by default, requiring -k/--data-checksums,
so this addition likely comes from your environment.
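
Something in your provisioning must be passing it explicitly, along these lines
(the data directory path is only an example):

initdb --data-checksums -D /path/to/pgdata
# or the short form:
initdb -k -D /path/to/pgdata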

> However, the docs say "Only
> data pages are protected by checksums; internal data structures and
> temporary files are not.", so I guess pg_class_oid_index might be an
> "internal data structure"?

pg_class_oid_index is a btree index that relies on 8k on-disk pages
(default size), so it is subject to the same rules as normal relations
regarding checksums for the pages flushed to disk, even if it is on a
catalog.
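
If you want to see whether the checksum machinery has ever actually fired on
that cluster, the per-database counters (tracked since v12) are a quick check,
for example:

psql -p 5434 -At -c "select datname, checksum_failures, checksum_last_failure from pg_stat_database;"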
--
Michael


Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 12/04/2023 2:35 am, Michael Paquier wrote:
> initdb does not enable checksums by default, requiring -k/--data-checksums,
> so this addition likely comes from your environment.

Indeed, turns out we had it in init_db_options.


>> However, the docs say "Only
>> data pages are protected by checksums; internal data structures and
>> temporary files are not.", so I guess pg_class_oid_index might be an
>> "internal data structure"?
> pg_class_oid_index is a btree index that relies on 8k on-disk pages
> (default size), so it is subject to the same rules as normal relations
> regarding checksums for the pages flushed to disk, even if it is on a
> catalog.

OK, so then what does that mean for the error in the subject? At what
point should that problem have been detected by the data checksums?




Re: "PANIC: could not open critical system index 2662" - twice

From
Laurenz Albe
Date:
On Thu, 2023-04-13 at 06:56 +0000, Evgeny Morozov wrote:
> On 12/04/2023 2:35 am, Michael Paquier wrote:
> > initdb does not enable checksums by default, requiring a
> > -k/--data-checksums, so likely this addition comes from from your
> > environment.
>
> OK, so then what does that mean for the error in the subject? At what
> point should that problem have been detected by the data checksums?

It means that if the error is caused by a faulty disk changing your data,
you'll notice as soon as you touch the page.

That would perhaps not have made a lot of difference in your case,
except that the error message would have been different and proof
that the disk was the problem.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com



Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 13/04/2023 5:02 pm, Laurenz Albe wrote:
> It means that if the error is caused by a faulty disk changing your data,
> you'll notice as soon as you touch the page.
>
> That would perhaps not have made a lot of difference in your case,
> except that the error message would have been different and proof
> that the disk was the problem.

OK, but we had data checksums on the whole time. So that means that the
disk was NOT the problem in our case?




Re: "PANIC: could not open critical system index 2662" - twice

From
Laurenz Albe
Date:
On Thu, 2023-04-13 at 19:07 +0000, Evgeny Morozov wrote:
> On 13/04/2023 5:02 pm, Laurenz Albe wrote:
> > It means that if the error is caused by a faulty disk changing your data,
> > you'll notice as soon as you touch the page.
> >
> > That would perhaps not have made a lot of difference in your case,
> > except that the error message would have been different and proof
> > that the disk was the problem.
>
> OK, but we had data checksums on the whole time. So that means that the
> disk was NOT the problem in our case?

Hmm, I am not certain.  Judging from your error message, the block was filled
with zeros, and I think such blocks don't trigger a checksum warning.
So if your disk replaces a valid block with zeros (filesystem check
after crash?), that could explain what you see.
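
If you want to double-check the rest of the cluster, pg_checksums can verify
everything offline, for example (data directory path assumed; the server must
be stopped cleanly, and all-zero "new" pages are simply skipped):

pg_checksums --check -D /path/to/pgdata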

Yours,
Laurenz Albe



Re: "PANIC: could not open critical system index 2662" - twice

From
Thorsten Glaser
Date:
On Fri, 14 Apr 2023, Laurenz Albe wrote:

>So if your disk replaces a valid block with zeros (filesystem check
>after crash?), that could explain what you see.

Oh, I had that happen on a RAID 1 once. One of the two discs had an
intermittent error (write I guess) but didn’t fail out of the RAID,
and some of the reads from there got zero-filled blocks in some
positions. It was a CVS repository so I was able to identify all such
blocks in question and restore them from the rsync slave (whose initial
population predated the HDD issue).

Hm, now that I think about it, it could even have been a read error
with subsequent block reassignment. Oh well.

Filesystem issues (ext3, and ext4 without/predating auto_da_alloc,
in particular) could just as well be the cause, of course.

bye,
//mirabilos
--
15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)



Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
> Hmm, I am not certain. Judging from your error message, the block was filled
> with zeros, and I think such blocks don't trigger a checksum warning.

OK, so data_checksums=on might not have made any difference in this case?


> So if your disk replaces a valid block with zeros (filesystem check
> after crash?), that could explain what you see.

If by "crash" here you mean the OS crashing - we didn't have that
happen. The OS is on separate disks, which have not reported any errors.

When we first ran into this problem the PG data was on a ZFS RAIDZ (i.e.
RAID5) volume of 3 disks, and for one of them `zpool status -v` reported
read, write and checksum error count > 0, but it also said  "errors: No
known data errors" and the disk status remained "online" (it did not
become "faulted" or "offline"). (Now we have the PG data on a ZFS mirror
volume of 2 new disks, which have not reported any errors.)

I don't know whether ZFS zero-fills blocks on disk errors. As I
understood, ZFS should have been able to recover from disk errors (that
were "unrecoverable" at the hardware level) using the data on the other
two disks (which did not report any errors). Thus, PG should not have
seen any corrupted data (if ZFS was working correctly).
https://unix.stackexchange.com/questions/341614/understanding-the-error-reporting-of-zfs-on-linux
seems to confirm this. Am I misunderstanding something?




Re: "PANIC: could not open critical system index 2662" - twice

From
Alban Hertroys
Date:
> On 14 Apr 2023, at 9:38, Evgeny Morozov <postgresql3@realityexists.net> wrote:

(…)

> I don't know whether ZFS zero-fills blocks on disk errors. As I
> understood, ZFS should have been able to recover from disk errors (that
> were "unrecoverable" at the hardware level) using the data on the other
> two disks (which did not report any errors). Thus, PG should not have
> seen any corrupted data (if ZFS was working correctly).
> https://unix.stackexchange.com/questions/341614/understanding-the-error-reporting-of-zfs-on-linux
> seems to confirm this. Am I misunderstanding something?

Your problem coincides with a thread at freebsd-current with very similar data corruption after a
recent OpenZFS import: blocks of all zeroes, but also missing files. So, perhaps these problems are
related?

Apparently, there was a recent fix for a data corruption issue with the block_cloning feature
enabled, but people are still seeing corruption even when they never enabled that feature.

I couldn’t really find the start of the thread in the archives, so this one kind of jumps into the
middle of the thread at a relevant-looking point:

https://lists.freebsd.org/archives/freebsd-current/2023-April/003446.html

Regards,

Alban Hertroys
--
There is always an exception to always.







Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 14/04/2023 10:42 am, Alban Hertroys wrote:
> Your problem coincides with a thread at freebsd-current with very
> similar data corruption after a recent OpenZFS import: blocks of all
> zeroes, but also missing files. So, perhaps these problems are related?
> Apparently, there was a recent fix for a data corruption issue with the block_cloning feature
> enabled, but people are still seeing corruption even when they never enabled that feature.
>
> I couldn’t really find the start of the thread in the archives, so this one kind of jumps into
> the middle of the thread at a relevant-looking point:
>
> https://lists.freebsd.org/archives/freebsd-current/2023-April/003446.html

That thread was a bit over my head, I'm afraid, so I can't say if it's
related. I haven't detected any missing files, anyway.


Well, the problem happened again! Kind of... This time PG has not
crashed with the PANIC error in the subject, but pg_dumping certain DBs
again fails with


pg_dump: error: connection to server on socket
"/var/run/postgresql/.s.PGSQL.5434" failed: FATAL:  index
"pg_class_oid_index" contains unexpected zero page at block 0

PG server log contains:

2023-05-03 04:31:49.903 UTC [14724]
postgres@test_behavior_638186279733138190 FATAL:  index
"pg_class_oid_index" contains unexpected zero page at block 0
2023-05-03 04:31:49.903 UTC [14724]
postgres@test_behavior_638186279733138190 HINT:  Please REINDEX it.

The server PID does not change on such a pg_dump attempt, so it appears
that only the backend process for the pg_dump connection crashes this
time. I don't see any disk errors and there haven't been any OS crashes.

This currently happens for two DBs. Both of them are very small DBs
created by automated .NET tests using Npgsql as client. The code creates
such a test DB from a template DB and tries to drop it at the end of the
test. This times out sometimes and on timeout our test code tries to
drop the DB again (while the first drop command is likely still pending
on the server). This second attempt to drop the DB also timed out:

[12:40:39] Npgsql.NpgsqlException : Exception while reading from stream
 ----> System.TimeoutException : Timeout during reading attempt
   at
Npgsql.NpgsqlConnector.<ReadMessage>g__ReadMessageLong|194_0(NpgsqlConnector
connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean
readingNotifications, Boolean isReadingPrependedMessage)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean
isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlDataReader.NextResult()
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior,
Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior,
Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteNonQuery(Boolean async,
CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteNonQuery()

...
[12:41:41] (same error again for the same DB)

From looking at old logs it seems like the same thing happened last time
(in April) as well. That's quite an unlikely coincidence if a bad disk
or bad filesystem was to blame, isn't it?

I've tried to reproduce this by re-running those tests over and over,
but without success so far. So what can I do about this? Do I just try
to drop those databases again manually? But what about the next time it
happens? How do I figure out the cause and prevent this problem?



Re: "PANIC: could not open critical system index 2662" - twice

From
Laurenz Albe
Date:
On Thu, 2023-05-04 at 15:49 +0000, Evgeny Morozov wrote:
> Well, the problem happened again! Kind of... This time PG has not
> crashed with the PANIC error in the subject, but pg_dumping certain DBs
> again fails with
>
>
> pg_dump: error: connection to server on socket
> "/var/run/postgresql/.s.PGSQL.5434" failed: FATAL:  index
> "pg_class_oid_index" contains unexpected zero page at block 0

If you dumped and restored the database after the last time the error
happened, there must be a systemic problem.

Perhaps you have bad hardware, or a problem with a storage driver,
file system or some other low-level software component.
It might of course be a PostgreSQL bug too, but it is hard to say
without a way to reproduce...

Yours,
Laurenz Albe



Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 4/05/2023 6:42 pm, Laurenz Albe wrote:
> On Thu, 2023-05-04 at 15:49 +0000, Evgeny Morozov wrote:
>> Well, the problem happened again! Kind of... This time PG has not
>> crashed with the PANIC error in the subject, but pg_dumping certain DBs
>> again fails with
>>
>>
>> pg_dump: error: connection to server on socket
>> "/var/run/postgresql/.s.PGSQL.5434" failed: FATAL:  index
>> "pg_class_oid_index" contains unexpected zero page at block 0
> If you dumped and restored the database after the last time the error
> happened, there must be a systemic problem.

I dumped and restored the "real" databases I cared about. The test
databases on which the error now happens are new (created 2 days ago).


> Perhaps you have bad hardware, or a problem with a storage driver,
> file system or some other low-level software component.
> It might of course be a PostgreSQL bug too, but it is hard to say
> without a way to reproduce...

I'm now thinking of setting up a dedicated AWS EC2 instance just for
these little DBs that get created by our automated tests. If the problem
happens there as well then that would strongly point towards a bug in
PostgreSQL, wouldn't it? (And if nothing else, at least it won't affect
the more important DBs!)

Meanwhile, what do I do with the existing server, though? Just try to
drop the problematic DBs again manually?





On 5/4/23 13:10, Evgeny Morozov wrote:
[snip]
> I'm now thinking of setting up a dedicated AWS EC2 instance just for
> these little DBs that get created by our automated tests. If the problem
> happens there as well then that would strongly point towards a bug in
> PostgreSQL, wouldn't it?

Many other people besides you would have noticed regular corruption of system catalogs.

> Meanwhile, what do I do with the existing server, though? Just try to
> drop the problematic DBs again manually?

"Fix the hardware" is what you do.

--
Born in Arizona, moved to Babylonia.

Re: "PANIC: could not open critical system index 2662" - twice

From
Thomas Munro
Date:
On Fri, May 5, 2023 at 6:11 AM Evgeny Morozov
<postgresql3@realityexists.net> wrote:
> Meanwhile, what do I do with the existing server, though? Just try to
> drop the problematic DBs again manually?

That earlier link to a FreeBSD thread is surely about bleeding edge
new ZFS stuff that was briefly broken then fixed, being discovered by
people running code imported from OpenZFS master branch into FreeBSD
main branch (ie it's not exactly released, not following the details
but I think it might soon be 2.2?), but you're talking about an LTS
Ubuntu release from 2018, which shipped "ZFS on Linux" version 0.7.5,
unless you installed a newer version somehow?  So it doesn't sound
like it could be related.

That doesn't mean it couldn't be a different ZFS bug though.  While
looking into file system corruption issues that had similar symptoms
on some other file system (which turned out to be a bug in btrfs) I
did bump into a claim that ZFS could produce unexpected zeroes in some
mmap coherency scenario, OpenZFS issue #14548.  I don't immediately
see how PostgreSQL could get tangled up with that problem though, as
we aren't doing that...

It seems quite interesting that it's always pg_class_oid_index block 0
(the btree meta-page), which feels more like a PostgreSQL bug, unless
the access pattern of that particular file/block is somehow highly
unusual compared to every other block and tickling bugs elsewhere in
the stack.  How does that file look, in terms of size, and how many
pages in it are zero?  I think it should be called base/5/2662.
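
Something like this would answer that (just a sketch, assuming that path is
right and the default 8kB page size):

ls -l base/5/2662
for ((i = 0; i < $(stat -c%s base/5/2662) / 8192; i++))
do
  dd if=base/5/2662 bs=8192 skip=$i count=1 2>/dev/null | xxd -p | grep -qEv '^(00)*$' \
    || echo "block $i is all zeroes"
done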

Oooh, but this is a relation that goes through
RelationMapOidToFilenumber.  What does select
pg_relation_filepath('pg_class_oid_index') show in the corrupted
database, base/5/2662 or something else?  Now *that* is a piece of
logic that changed in PostgreSQL 15.  It changed from sector-based
atomicity assumptions to a directory entry swizzling trick, in commit
d8cd0c6c95c0120168df93aae095df4e0682a08a.  Hmm.



Re: "PANIC: could not open critical system index 2662" - twice

From
Thomas Munro
Date:
On Fri, May 5, 2023 at 11:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Now *that* is a piece of
> logic that changed in PostgreSQL 15.  It changed from sector-based
> atomicity assumptions to a directory entry swizzling trick, in commit
> d8cd0c6c95c0120168df93aae095df4e0682a08a.  Hmm.

I spoke too soon, that only changed in 16.  But still, it means there
are two files that could be corrupted here, pg_filenode.map which
might somehow be pointing to the wrong file, and the relation (index)
main fork file.



Re: "PANIC: could not open critical system index 2662" - twice

From
Thomas Munro
Date:
On Fri, May 5, 2023 at 11:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> What does select
> pg_relation_filepath('pg_class_oid_index') show in the corrupted
> database, base/5/2662 or something else?

Oh, you can't get that far, but perhaps you could share the
pg_filenode.map file?  Or alternatively strace -f PostgreSQL while
it's starting up to see which file it's reading, just to be sure.  One
way to find clues about whether PostgreSQL did something wrong, once
we definitely have the right relfilenode for the index, aside from
examining its contents, would be to search the WAL for references to
that block with pg_waldump.  Maybe you still have enough WAL if it
happened recently?
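
For the strace part, something along these lines should do it (paths are
placeholders; needs root):

# attach to the postmaster, follow forked backends, and watch which relation
# files a fresh connection to the broken database tries to open
sudo strace -f -e trace=openat -p "$(head -1 /path/to/pgdata/postmaster.pid)" 2>&1 \
  | grep -E '2662|pg_filenode\.map' &
psql -p 5434 -d test_behavior_638186279733138190 -c 'select 1'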



Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 5/05/2023 2:02 am, Thomas Munro wrote:
> On Fri, May 5, 2023 at 11:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> What does select
>> pg_relation_filepath('pg_class_oid_index') show in the corrupted
>> database, base/5/2662 or something else?
> Oh, you can't get that far, but perhaps you could share the
> pg_filenode.map file?

Hi Thomas, thanks very much for looking into this!

Indeed, I cannot get that far due to the same error. I read about ignore_system_indexes, but...

# sudo -u postgres psql -w -p 5434 -c "set ignore_system_indexes=on";
ERROR:  parameter "ignore_system_indexes" cannot be set after connection start

I'm not sure how to set it BEFORE connection start, but without restarting the server (which I'd rather avoid if I can).

The OID of the bad DB ('test_behavior_638186279733138190') is 1414389 and I've uploaded base/1414389/pg_filenode.map and also base/5/2662 (in case that's helpful) as https://objective.realityexists.net/temp/pgstuff1.zip

> Maybe you still have enough WAL if it happened recently?

Maybe! What should I do with pg_waldump? I've never used it before.


Re: "PANIC: could not open critical system index 2662" - twice

From
Andrew Gierth
Date:
>>>>> "Evgeny" == Evgeny Morozov <postgresql3@realityexists.net> writes:

 Evgeny> Indeed, I cannot get that far due to the same error. I read
 Evgeny> about ignore_system_indexes, but...

 Evgeny> # sudo -u postgres psql -w -p 5434 -c "set ignore_system_indexes=on";
 Evgeny> ERROR:  parameter "ignore_system_indexes" cannot be set after connection
 Evgeny> start

sudo -u postgres psql -w -p 5434 -d "options='-P'"

(make that -d "dbname=whatever options='-P'"  if you need to specify
some database name; or use PGOPTIONS="-P" in the environment.)

--
Andrew (irc:RhodiumToad)



Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 5/05/2023 10:38 am, Andrew Gierth wrote:
> sudo -u postgres psql -w -p 5434 -d "options='-P'"
> (make that -d "dbname=whatever options='-P'"  if you need to specify
> some database name; or use PGOPTIONS="-P" in the environment.)

Thanks, good to know! Unfortunately that also fails:

# sudo -u postgres psql -w -p 5434 -d "dbname=test_behavior_638186279733138190 options='-P'"
psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5434" failed: FATAL:  could not open relation with OID 2964

(PG server log contains the same error message and no further details.)

select 2964::regclass returns "pg_db_role_setting" FWIW.

Re: "PANIC: could not open critical system index 2662" - twice

From
Thomas Munro
Date:
On Fri, May 5, 2023 at 7:50 PM Evgeny Morozov
<postgresql3@realityexists.net> wrote:
> The OID of the bad DB ('test_behavior_638186279733138190') is 1414389 and I've uploaded base/1414389/pg_filenode.map
> and also base/5/2662 (in case that's helpful) as https://objective.realityexists.net/temp/pgstuff1.zip

Thanks.  That pg_filenode.map looks healthy to me.

tmunro@build1:~/junk $ od -t x1 pg_filenode.map
0000000    17  27  59  00  11  00  00  00  eb  04  00  00  eb  04  00  00
0000020    e1  04  00  00  e1  04  00  00  e7  04  00  00  e7  04  00  00
0000040    df  04  00  00  df  04  00  00  14  0b  00  00  14  0b  00  00
0000060    15  0b  00  00  15  0b  00  00  4b  10  00  00  4b  10  00  00
0000100    4c  10  00  00  4c  10  00  00  82  0a  00  00  82  0a  00  00
0000120    83  0a  00  00  83  0a  00  00  8f  0a  00  00  8f  0a  00  00
0000140    90  0a  00  00  90  0a  00  00  62  0a  00  00  62  0a  00  00
0000160    63  0a  00  00  63  0a  00  00  66  0a  00  00  66  0a  00  00
...

hex(2662) is 0xa66, and we see 63 0a 00 00 followed by 63 0a 00 00 in
that last line as expected, so that rules out the idea that it's
somehow trashed that map file and points to the wrong relation file.

Next can you share the file base/1414389/2662?  ("5" was from the
wrong database.)

> > Maybe you still have enough WAL if it happened recently?
>
> Maybe! What should I do with pg_waldump? I've never used it before.

Try something like:

pg_waldump -R 1663/1414389/2662 -F main 000000010000000000000001
000000010000000000000007

... but change that to the range of files you have in your pg_wal.
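
To see which range you still have, listing the segments in pg_wal is enough,
for example (path assumed):

ls /path/to/pgdata/pg_wal | grep -E '^[0-9A-F]{24}$' | sort | sed -n '1p;$p'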



Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 6/05/2023 1:06 am, Thomas Munro wrote:
> Next can you share the file base/1414389/2662? ("5" was from the wrong
> database.) 

Right - I should have realised that! base/1414389/2662 is indeed all
nulls, 32KB of them. I included the file anyway in
https://objective.realityexists.net/temp/pgstuff2.zip


> Try something like:
> pg_waldump -R 1663/1414389/2662 -F main 000000010000000000000001
> 000000010000000000000007
>
> ... but change that to the range of files you have in your pg_wal.

The PG server had only 2 files left from today, and pg_waldumping them
gave this error:

pg_waldump -R 1663/1414389/2662 -F main 00000001000000650000005E
00000001000000650000005F
pg_waldump: error: error in WAL record at 65/5F629838: invalid record
length at 65/5F62A1E0: wanted 24, got 0

Not sure if that's something to worry about or not!

Then I realised we're actually archiving our WAL files with pgBackRest,
retrieved the WAL files for the time the DB was created and used (~12:39
UTC on 2023-05-02) and re-ran pg_waldump on those.

pg_waldump -R 1663/1414389/2662 -F main 000000010000005B00000000
000000010000005B0000000F

rmgr: XLOG        len (rec/tot):     51/   108, tx:     242382, lsn:
5B/0222BC68, prev 5B/0222BC38, desc: FPI , blkref #0: rel
1663/1414389/2662 blk 0 FPW
rmgr: XLOG        len (rec/tot):     51/  3224, tx:     242382, lsn:
5B/0222BCD8, prev 5B/0222BC68, desc: FPI , blkref #0: rel
1663/1414389/2662 blk 1 FPW
rmgr: XLOG        len (rec/tot):     51/  1544, tx:     242382, lsn:
5B/0222C988, prev 5B/0222BCD8, desc: FPI , blkref #0: rel
1663/1414389/2662 blk 2 FPW
rmgr: XLOG        len (rec/tot):     49/   121, tx:     242382, lsn:
5B/0222CF90, prev 5B/0222C988, desc: FPI , blkref #0: rel
1663/1414389/2662 blk 3 FPW
rmgr: Btree       len (rec/tot):     64/    64, tx:     242384, lsn:
5B/02321870, prev 5B/023217A0, desc: INSERT_LEAF off 132, blkref #0: rel
1663/1414389/2662 blk 2
rmgr: Btree       len (rec/tot):     64/    64, tx:     242384, lsn:
5B/02322640, prev 5B/02322570, desc: INSERT_LEAF off 133, blkref #0: rel
1663/1414389/2662 blk 2
... (many more entries like that)
pg_waldump: error: error in WAL record at 5B/FFFFF38: missing contrecord
at 5B/FFFFF70

The full output is also in
https://objective.realityexists.net/temp/pgstuff2.zip





Re: "PANIC: could not open critical system index 2662" - twice

From
Thomas Munro
Date:
On Sat, May 6, 2023 at 9:58 PM Evgeny Morozov
<postgresql3@realityexists.net> wrote:
> Right - I should have realised that! base/1414389/2662 is indeed all
> nulls, 32KB of them. I included the file anyway in
> https://objective.realityexists.net/temp/pgstuff2.zip

OK so it's not just page 0, you have 32KB or 4 pages of all zeroes.
That's the expected length of that relation when copied from the
initial template, and consistent with the pg_waldump output (it uses
FPIs to copy blocks 0-3).  We can't see the block contents but we know
that block 2 definitely is not all zeroes at that point because there
are various modifications to it, which not only write non-zeroes but
must surely have required a sane page 0.

So it does indeed look like something unknown has replaced 32KB of
data with 32KB of zeroes underneath us.  Are there more non-empty
files that are all-zeroes?  Something like this might find them:

for F in base/1414389/*
do
  if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null
  then
    echo $F
  fi
done



Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 6/05/2023 12:34 pm, Thomas Munro wrote:
> So it does indeed look like something unknown has replaced 32KB of
> data with 32KB of zeroes underneath us.  Are there more non-empty
> files that are all-zeroes?  Something like this might find them:
>
> for F in base/1414389/*
> do
>   if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null
>   then
>     echo $F
>   fi
> done

Yes, a total of 309 files are all-zeroes (and 52 files are not).

I also checked the other DB that reports the same "unexpected zero page
at block 0" error, "test_behavior_638186280406544656" (OID 1414967) -
similar story there. I uploaded the lists of zeroed and non-zeroed files
and the ls -la output for both as
https://objective.realityexists.net/temp/pgstuff3.zip

I then searched recursively for such all-zeroes files in $PGDATA/base and
did not find any outside of those two directories (base/1414389 and
base/1414967). None in $PGDATA/global, either.





Re: "PANIC: could not open critical system index 2662" - twice

From
Thomas Munro
Date:
On Sun, May 7, 2023 at 12:29 AM Evgeny Morozov
<postgresql3@realityexists.net> wrote:
> On 6/05/2023 12:34 pm, Thomas Munro wrote:
> > So it does indeed look like something unknown has replaced 32KB of
> > data with 32KB of zeroes underneath us.  Are there more non-empty
> > files that are all-zeroes?  Something like this might find them:
> >
> > for F in base/1414389/*
> > do
> >   if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null
> >   then
> >     echo $F
> >   fi
> > done
>
> Yes, a total of 309 files are all-zeroes (and 52 files are not).
>
> I also checked the other DB that reports the same "unexpected zero page
> at block 0" error, "test_behavior_638186280406544656" (OID 1414967) -
> similar story there. I uploaded the lists of zeroed and non-zeroed files
> and the ls -la output for both as
> https://objective.realityexists.net/temp/pgstuff3.zip
>
> I then searched recursively such all-zeroes files in $PGDATA/base and
> did not find any outside of those two directories (base/1414389 and
> base/1414967). None in $PGDATA/global, either.

So "diff -u zeroed-files-1414967.txt zeroed-files-1414389.txt" shows
that they have the same broken stuff in the range cloned from the
template database by CREATE DATABASE STRATEGY=WAL_LOG, and it looks
like it's *all* the cloned catalogs, and then they have some
non-matching relfilenodes > 1400000, presumably stuff you created
directly in the new database (I'm not sure if I can say for sure that
those files are broken, without knowing what they are).

Did you previously run this same workload on versions < 15 and never
see any problem?  15 gained a new feature CREATE DATABASE ...
STRATEGY=WAL_LOG, which is also the default.  I wonder if there is a
bug somewhere near that, though I have no specific idea.  If you
explicitly added STRATEGY=FILE_COPY to your CREATE DATABASE commands,
you'll get the traditional behaviour.  It seems like you have some
kind of high frequency testing workload that creates and tests
databases all day long, and just occasionally detects this corruption.
Would you like to try requesting FILE_COPY for a while and see if it
eventually happens like that too?
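
For example (the template and database names here are just placeholders), the
tests could create their throwaway databases with:

createdb --strategy=file_copy --template=your_template_db test_behavior_xyz
# or, in SQL:
# CREATE DATABASE test_behavior_xyz TEMPLATE your_template_db STRATEGY file_copy;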

My spidey sense is leaning away from filesystem bugs.  We've found
plenty of filesystem bugs on these mailing lists over the years and of
course it's not impossible, but I dunno... it seems quite suspicious
that all the system catalogs have apparently been wiped during or
moments after the creation of a new database that's running new
PostgreSQL 15 code...



Re: "PANIC: could not open critical system index 2662" - twice

From
Jeffrey Walton
Date:
On Sat, May 6, 2023 at 6:35 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sat, May 6, 2023 at 9:58 PM Evgeny Morozov
> <postgresql3@realityexists.net> wrote:
> > Right - I should have realised that! base/1414389/2662 is indeed all
> > nulls, 32KB of them. I included the file anyway in
> > https://objective.realityexists.net/temp/pgstuff2.zip
>
> OK so it's not just page 0, you have 32KB or 4 pages of all zeroes.
> That's the expected length of that relation when copied from the
> initial template, and consistent with the pg_waldump output (it uses
> FPIs to copy blocks 0-3).  We can't see the block contents but we know
> that block 2 definitely is not all zeroes at that point because there
> are various modifications to it, which not only write non-zeroes but
> must surely have required a sane page 0.
>
> So it does indeed look like something unknown has replaced 32KB of
> data with 32KB of zeroes underneath us.

This may be related... I seem to recall the GNUlib folks talking about
a cp bug on sparse files. It looks like it may be fixed in coreutils
release 9.2 (2023-03-20):
https://github.com/coreutils/coreutils/blob/master/NEWS#L233

If I recall correctly, it had something to do with the way
copy_file_range worked. (Or maybe, it did not work as expected).

According to the GNUlib docs
(https://www.gnu.org/software/gnulib/manual/html_node/copy_005ffile_005frange.html):

    This function has many problems on Linux
    kernel versions before 5.3

> Are there more non-empty
> files that are all-zeroes?  Something like this might find them:
>
> for F in base/1414389/*
> do
>   if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null
>   then
>     echo $F
>   fi
> done



Re: "PANIC: could not open critical system index 2662" - twice

From
Thomas Munro
Date:
On Sun, May 7, 2023 at 10:23 AM Jeffrey Walton <noloader@gmail.com> wrote:
> This may be related... I seem to recall the GNUlib folks talking about
> a cp bug on sparse files. It looks like it may be fixed in coreutils
> release 9.2 (2023-03-20):
> https://github.com/coreutils/coreutils/blob/master/NEWS#L233
>
> If I recall correctly, it had something to do with the way
> copy_file_range worked. (Or maybe, it did not work as expected).
>
> According to the GNUlib docs
> (https://www.gnu.org/software/gnulib/manual/html_node/copy_005ffile_005frange.html):
>
>     This function has many problems on Linux
>     kernel versions before 5.3

That's quite interesting, thanks (we've been talking about making
direct use of copy_file_range() in a few threads, I'll definitely be
looking into that history), but we don't currently use
copy_file_range() or any coreutils stuff in the relevant code paths
here -- this data is copied by plain old pread() and pwrite().



Re: "PANIC: could not open critical system index 2662" - twice

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Did you previously run this same workload on versions < 15 and never
> see any problem?  15 gained a new feature CREATE DATABASE ...
> STRATEGY=WAL_LOG, which is also the default.  I wonder if there is a
> bug somewhere near that, though I have no specific idea.

Per the release notes I was just writing ...

    <listitem>
<!--
Author: Michael Paquier <michael@paquier.xyz>
Branch: master [8a8661828] 2023-02-22 10:14:52 +0900
Branch: REL_15_STABLE [fa5dd460c] 2023-02-22 10:14:56 +0900
-->
     <para>
      Fix potential corruption of the template (source) database after
      <command>CREATE DATABASE</command> with the <literal>STRATEGY
      WAL_LOG</literal> option (Nathan Bossart, Ryo Matsumura)
     </para>

     <para>
      Improper buffer handling created a risk that any later modification
      of the template's <structname>pg_class</structname> catalog would be
      lost.
     </para>
    </listitem>

The comment about only pg_class being affected is my interpretation
of what the commit message said, but I might have misunderstood.

            regards, tom lane



Re: "PANIC: could not open critical system index 2662" - twice

From
Thomas Munro
Date:
On Sun, May 7, 2023 at 1:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Did you previously run this same workload on versions < 15 and never
> > see any problem?  15 gained a new feature CREATE DATABASE ...
> > STRATEGY=WAL_LOG, which is also the default.  I wonder if there is a
> > bug somewhere near that, though I have no specific idea.
>
> Per the release notes I was just writing ...
>
>     <listitem>
> <!--
> Author: Michael Paquier <michael@paquier.xyz>
> Branch: master [8a8661828] 2023-02-22 10:14:52 +0900
> Branch: REL_15_STABLE [fa5dd460c] 2023-02-22 10:14:56 +0900
> -->
>      <para>
>       Fix potential corruption of the template (source) database after
>       <command>CREATE DATABASE</command> with the <literal>STRATEGY
>       WAL_LOG</literal> option (Nathan Bossart, Ryo Matsumura)
>      </para>

Hmm.  That bug seems to have caused corruption (backwards time travel)
of blocks in the *source* DB's pg_class, by failing to write back
changes.  We seem to have zeroed pages in the *target* database, for
all catalogs (apparently everything copied by
RelationCopyStorageUsingBuffer()), even though the template is still
fine.  It is as if RelationCopyStorageUsingBuffer() created the
zero-filled file with smgrextend(), but then the buffer data was never
written out even though we memcpy'd it into a buffer and set the
buffer dirty.

Bug-in-PostgreSQL explanations could include that we forgot it was
dirty, or some backend wrote it out to the wrong file; but if we were
forgetting something like permanent or dirty, would there be a more
systematic failure?  Oh, it could require special rare timing if it is
similar to 8a8661828's confusion about permanence level or otherwise
somehow not setting BM_PERMANENT, but in the target blocks, so I think
that'd require a checkpoint AND a crash.  It doesn't reproduce for me,
but perhaps more unlucky ingredients are needed.

Bug-in-OS/FS explanations could include that a whole lot of writes
were mysteriously lost in some time window, so all those files still
contain the zeroes we write first in smgrextend().  I guess this
previously rare (previously limited to hash indexes?) use of sparse
file hole-punching could be a factor in an it's-all-ZFS's-fault
explanation:

openat(AT_FDCWD,"base/16390/2662",O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC,0600)
= 36 (0x24)
openat(AT_FDCWD,"base/1/2662",O_RDWR|O_CLOEXEC,00) = 37 (0x25)
lseek(37,0x0,SEEK_END)                           = 32768 (0x8000)
lseek(37,0x0,SEEK_END)                           = 32768 (0x8000)
pwrite(36,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,8192,0x6000) = 8192
(0x2000) <-- smgrextend(final block)
lseek(36,0x0,SEEK_END)                           = 32768 (0x8000)

I was trying to think about how I might go about trying to repro the
exact system setup.  Evgeny, do you mind sharing your "zfs get all
/path/to/pgdata" (curious to see block size, compression settings,
anything else etc) and your postgresql.conf? And your exact Ubuntu
kernel version and ZFS package versions?
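
For example, roughly (dataset name and paths are placeholders):

zfs get all tank/pgdata
psql -p 5434 -Atc "select name, setting from pg_settings where source = 'configuration file';"
uname -srv
dpkg -s zfsutils-linux | grep Version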



Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 6/05/2023 11:13 pm, Thomas Munro wrote:
> Did you previously run this same workload on versions < 15 and never
> see any problem?
Yes, kind of. We have a test suite that creates one test DB and runs a
bunch of tests on it. Two of these tests, however, create another DB
each (also by cloning the same template DB) in order to test copying
data between DBs. It's only these "extra" DBs that were corrupted, at
least on this occasion. (Hard to say about the last time, because that
time it all went south and the whole server crashed, and we may have had
some residual corruption from bad disks then - who knows.) I'm not sure
whether the tests that created the extra DBs existed before we upgraded
to PG 15, but we definitely have not seen such problems on PG 13 or 14.

> It seems like you have some kind of high frequency testing workload that creates and tests
> databases all day long, and just occasionally detects this corruption.
Maybe 10-30 times per day normally, depending on the day. However, I
have tried to repro this by running those two specific tests thousands
of times in one day, without success.
> Would you like to try requesting FILE_COPY for a while and see if it eventually happens like that too?
Sure, we can try that.

On 7/05/2023 12:30 pm, Thomas Munro wrote:
> your "zfs get all /path/to/pgdata"

PROPERTY              VALUE                  SOURCE
type                  filesystem             -
creation              Mon Mar  6 17:07 2023  -
used                  166G                   -
available             2.34T                  -
referenced            166G                   -
compressratio         2.40x                  -
mounted               yes                    -
quota                 none                   default
reservation           none                   default
recordsize            16K                    local
mountpoint            /default
sharenfs              off                    default
checksum              on                     default
compression           lz4                    received
atime                 off                    inherited from pgdata
devices               on                     default
exec                  off                    inherited from pgdata
setuid                off                    inherited from pgdata
readonly              off                    default
zoned                 off                    default
snapdir               hidden                 default
aclinherit            restricted             default
createtxg             90                     -
canmount              on                     received
xattr                 on                     default
copies                1                      default
version               5                      -
utf8only              off                    -
normalization         none                   -
casesensitivity       sensitive              -
vscan                 off                    default
nbmand                off                    default
sharesmb              off                    default
refquota              none                   default
refreservation        none                   default
primarycache          all                    default
secondarycache        all                    default
usedbysnapshots       199M                   -
usedbydataset         166G                   -
usedbychildren        0B                     -
usedbyrefreservation  0B                     -
logbias               latency                default
dedup                 off                    default
mlslabel              none                   default
sync                  standard               default
dnodesize             legacy                 default
refcompressratio      2.40x                  -
written               64.9M                  -
logicalused           397G                   -
logicalreferenced     397G                   -
volmode               default                default
filesystem_limit      none                   default
snapshot_limit        none                   default
filesystem_count      none                   default
snapshot_count        none                   default
snapdev               hidden                 default
acltype               off                    default
context               none                   default
fscontext             none                   default
defcontext            none                   default
rootcontext           none                   default
relatime              off                    default
redundant_metadata    all                    default
overlay               off                    default

> your postgresql.conf?

We have a bunch of config files, so I tried to get the resulting config
using "select name, setting from pg_settings where source =
'configuration file'" - hopefully that gives what you wanted.

            name            |                        setting
----------------------------+-------------------------------------------------------
 archive_command            | pgbackrest --stanza="behavior-pg15" archive-push "%p"
 archive_mode               | on
 archive_timeout            | 900
 cluster_name               | 15/behavior
 DateStyle                  | ISO, MDY
 default_text_search_config | pg_catalog.english
 dynamic_shared_memory_type | posix
 external_pid_file          | /var/run/postgresql/15-behavior.pid
 full_page_writes           | off
 lc_messages                | C
 lc_monetary                | C
 lc_numeric                 | C
 lc_time                    | C
 listen_addresses           | *
 log_checkpoints            | on
 log_connections            | on
 log_disconnections         | on
 log_file_mode              | 0640
 log_line_prefix            | %m [%p] %q%u@%d
 log_lock_waits             | on
 log_min_duration_statement | 1000
 log_temp_files             | 0
 log_timezone               | Etc/UTC
 maintenance_work_mem       | 1048576
 max_connections            | 100
 max_slot_wal_keep_size     | 30000
 max_wal_size               | 1024
 min_wal_size               | 80
 port                       | 5434
 shared_buffers             | 4194304
 ssl                        | on
 ssl_cert_file              | (redacted)
 ssl_ciphers                | TLSv1.2:TLSv1.3:!aNULL
 ssl_dh_params_file         | (redacted)
 ssl_key_file               | (redacted)
 ssl_min_protocol_version   | TLSv1.2
 temp_buffers               | 10240
 TimeZone                   | Etc/UTC
 unix_socket_directories    | /var/run/postgresql
 wal_compression            | pglz
 wal_init_zero              | off
 wal_level                  | replica
 wal_recycle                | off
 work_mem                   | 262144

> And your exact Ubuntu kernel version and ZFS package versions? 

Ubuntu 18.04.6
Kernel 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023
x86_64 x86_64 x86_64 GNU/Linux
zfsutils-linux package version 0.7.5-1ubuntu16.12 amd64





Re: "PANIC: could not open critical system index 2662" - twice

From
Thomas Munro
Date:
On Mon, May 8, 2023 at 4:10 AM Evgeny Morozov
<postgresql3@realityexists.net> wrote:
> On 6/05/2023 11:13 pm, Thomas Munro wrote:
> > Would you like to try requesting FILE_COPY for a while and see if it eventually happens like that too?
> Sure, we can try that.

Maybe you could do some one way and some the other, so that we try to
learn more?

> Ubuntu 18.04.6
> Kernel 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023
> x86_64 x86_64 x86_64 GNU/Linux
> zfsutils-linux package version 0.7.5-1ubuntu16.12 amd64

I tried for a few hours to reproduce this by trying to make as many
things as similar to yours as I could based on info in this thread
(Linux: up-to-date Ubuntu 18.04 in vagrant which has nearly the same
kernel 4.15.0-208-generic and a close but slightly different version
of ancient ZFS 0.7.5-1ubuntu15, not sure why, ZFS: mirror (I used a
pair of loopback files), recordsize=16kB, compression=lz4, PG:
compiled from tag REL_15_2, data_checksums=on, full_page_writes=off,
wal_recycle=off, wal_init_zero=off), with what I thought might be
roughly what you're doing (creating three DBs, two clones of the
first, with various modification at various points, with various
overlapping activities, and then checking for catalog corruption).  No
cigar.  Hrmph.



Re: "PANIC: could not open critical system index 2662" - twice

From
Michael Paquier
Date:
On Sun, May 07, 2023 at 10:30:52PM +1200, Thomas Munro wrote:
> Bug-in-PostgreSQL explanations could include that we forgot it was
> dirty, or some backend wrote it out to the wrong file; but if we were
> forgetting something like permanent or dirty, would there be a more
> systematic failure?  Oh, it could require special rare timing if it is
> similar to 8a8661828's confusion about permanence level or otherwise
> somehow not setting BM_PERMANENT, but in the target blocks, so I think
> that'd require a checkpoint AND a crash.  It doesn't reproduce for me,
> but perhaps more unlucky ingredients are needed.
>
> Bug-in-OS/FS explanations could include that a whole lot of writes
> were mysteriously lost in some time window, so all those files still
> contain the zeroes we write first in smgrextend().  I guess this
> previously rare (previously limited to hash indexes?) use of sparse
> file hole-punching could be a factor in an it's-all-ZFS's-fault
> explanation:

Yes, you would need a bit of all that.

I can reproduce the same backtrace here.  That's just my usual laptop
with ext4, so this would be a Postgres bug.  First, here are the four
things running in parallel so that I can get a failure in loading a
critical index when connecting:
1) Create and drop a database with WAL_LOG as strategy and the
regression database as template:
while true; do
  createdb --template=regression --strategy=wal_log testdb;
  dropdb testdb;
done
2) Feeding more data to pg_class in the middle, while testing the
connection to the database created:
while true;
  do psql -c 'create table popo as select 1 as a;' regression > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
  psql -c 'drop table popo' regression > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
done;
3) Force some checkpoints:
while true; do psql -c 'checkpoint' > /dev/null 2>&1; sleep 4; done
4) Force a few crashes and recoveries:
while true ; do pg_ctl stop -m immediate ; pg_ctl start ; sleep 4 ; done

And here is one backtrace:
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at
./nptl/pthread_kill.c:44
#1  0x00007f8a395cad2f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007f8a3957bef2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f8a39566472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x000055f0b36b95ac in errfinish (filename=0x55f0b38d4f68 "relcache.c", lineno=4335, funcname=0x55f0b38d6500
<__func__.12> "load_critical_index") at elog.c:604
#5  0x000055f0b36a9c64 in load_critical_index (indexoid=2662, heapoid=1259) at relcache.c:4335
#6  0x000055f0b36a9712 in RelationCacheInitializePhase3 () at relcache.c:4110
(gdb) up 5
#5  0x000055f0b36a9c64 in load_critical_index (indexoid=2662, heapoid=1259) at relcache.c:4335
4335            elog(PANIC, "could not open critical system index %u", indexoid);

You can also get failures with btree deduplication because of the CTAS
running above, by the way, but that's just the top of the iceberg once
the state is corrupted:
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at
./nptl/pthread_kill.c:44
#1  0x00007fcae38abd2f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007fcae385cef2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fcae3847472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x00005556a809f584 in ExceptionalCondition (conditionName=0x5556a812b639 "false", fileName=0x5556a812abc8
"heapam.c", lineNumber=7882) at assert.c:66
#5  0x00005556a79e13db in index_delete_sort_cmp (deltid1=0x5556a9682158, deltid2=0x7ffdb62c7088) at heapam.c:7882
#6  0x00005556a79e14f6 in index_delete_sort (delstate=0x7ffdb62c8350) at heapam.c:7923
#7  0x00005556a79e0cd0 in heap_index_delete_tuples (rel=0x7fcae0f1a9a8, delstate=0x7ffdb62c8350) at heapam.c:7580
#8  0x00005556a7a07be4 in table_index_delete_tuples (rel=0x7fcae0f1a9a8, delstate=0x7ffdb62c8350) at
../../../../src/include/access/tableam.h:1337
#9  0x00005556a7a0a7a3 in _bt_delitems_delete_check (rel=0x7fcae0f231c8, buf=179, heapRel=0x7fcae0f1a9a8,
delstate=0x7ffdb62c8350) at nbtpage.c:1541
#10 0x00005556a7a07545 in _bt_simpledel_pass (rel=0x7fcae0f231c8, buffer=179, heapRel=0x7fcae0f1a9a8,
deletable=0x7ffdb62c8430,ndeletable=160, newitem=0x5556a9689260,  

Anyway, you would be able to see that with b3e184a (pretty much the same as
15.2), without 8a86618 included.  Once 8a86618 is included, all these
steps are stable in the backend, at least here.  Or do we have some
low-hanging fruit with the WAL_LOG strategy?  That could always be
possible, of course, but that looks like the same issue to me, just
with a different symptom showing up.
--
Michael

Attachment

Re: "PANIC: could not open critical system index 2662" - twice

From
Thomas Munro
Date:
On Mon, May 8, 2023 at 2:24 PM Michael Paquier <michael@paquier.xyz> wrote:
> I can reproduce the same backtrace here.  That's just my usual laptop
> with ext4, so this would be a Postgres bug.  First, here are the four
> things running in parallel so as I can get a failure in loading a
> critical index when connecting:

That sounds like good news, but I'm still confused: do you see all 0s
in the target database (popo)'s catalogs, as reported (and if so can
you explain how they got there?), or is it regression that is
corrupted in more subtle ways also involving 1s?



Re: "PANIC: could not open critical system index 2662" - twice

From
Michael Paquier
Date:
On Mon, May 08, 2023 at 02:46:37PM +1200, Thomas Munro wrote:
> That sounds like good news, but I'm still confused: do you see all 0s
> in the target database (popo)'s catalogs, as reported (and if so can
> you explain how they got there?), or is it regression that is
> corrupted in more subtle ways also involving 1s?

Nope, I have not been able to confirm that yet without 8a86618.  The
test runs at a high frequency, so that's kind of hard to catch.  I
have not been able to get things in a state where I could look at a
FPW for pg_class or its index, either, in a way similar to Evgeny.
--
Michael

Attachment

Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 8/05/2023 4:24 am, Michael Paquier wrote:
> here are the four things running in parallel so as I can get a failure
> in loading a critical index when connecting

Wow, that is some amazing detective work! We do indeed create tables
during our tests, specifically partitions of tables copied from the
template DB. Checkpoints seem to be happening every few minutes (we
don't force them, but there is a big DB with more writes on the same
instance - it's probably due to that). PG is not crashing in our case,
though - not this time.
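
(For what it's worth, a rough way to confirm how often checkpoints are happening, assuming the cumulative statistics haven't been reset recently - the column names below are those of pg_stat_bgwriter in PostgreSQL 15:)

SELECT checkpoints_timed, checkpoints_req,
       now() - stats_reset AS since_reset
FROM pg_stat_bgwriter;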

Do you have any idea why the "drop database" command would have timed
out (not completed after 30 seconds) for the corrupted DBs?

On 8/05/2023 4:17 am, Thomas Munro wrote:
> Maybe you could do some one way and some the other, so that we try to
> learn more?
Do you still want me to try this given what Michael has found? Or
anything else to help narrow this down?



Re: "PANIC: could not open critical system index 2662" - twice

From
Alvaro Herrera
Date:
On 2023-May-07, Thomas Munro wrote:

> Did you previously run this same workload on versions < 15 and never
> see any problem?  15 gained a new feature CREATE DATABASE ...
> STRATEGY=WAL_LOG, which is also the default.  I wonder if there is a
> bug somewhere near that, though I have no specific idea.  If you
> explicitly added STRATEGY=FILE_COPY to your CREATE DATABASE commands,
> you'll get the traditional behaviour.
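
(For reference, the explicit form mentioned in that quote would look something like the following; the database and template names are only placeholders:)

CREATE DATABASE testdb
    TEMPLATE template_db
    STRATEGY = FILE_COPY;   -- request the pre-15 file-copy behaviour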

Maybe it would be sensible to make STRATEGY=FILE_COPY the default
again, for branch 15, before today's release.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Doing what he did amounts to sticking his fingers under the hood of the
implementation; if he gets his fingers burnt, it's his problem."  (Tom Lane)



Re: "PANIC: could not open critical system index 2662" - twice

From
Dilip Kumar
Date:
On Mon, May 8, 2023 at 7:55 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sun, May 07, 2023 at 10:30:52PM +1200, Thomas Munro wrote:
> > Bug-in-PostgreSQL explanations could include that we forgot it was
> > dirty, or some backend wrote it out to the wrong file; but if we were
> > forgetting something like permanent or dirty, would there be a more
> > systematic failure?  Oh, it could require special rare timing if it is
> > similar to 8a8661828's confusion about permanence level or otherwise
> > somehow not setting BM_PERMANENT, but in the target blocks, so I think
> > that'd require a checkpoint AND a crash.  It doesn't reproduce for me,
> > but perhaps more unlucky ingredients are needed.
> >
> > Bug-in-OS/FS explanations could include that a whole lot of writes
> > were mysteriously lost in some time window, so all those files still
> > contain the zeroes we write first in smgrextend().  I guess this
> > previously rare (previously limited to hash indexes?) use of sparse
> > file hole-punching could be a factor in an it's-all-ZFS's-fault
> > explanation:
>
> Yes, you would need a bit of all that.
>
> I can reproduce the same backtrace here.  That's just my usual laptop
> with ext4, so this would be a Postgres bug.  First, here are the four
> things running in parallel so as I can get a failure in loading a
> critical index when connecting:
> 1) Create and drop a database with WAL_LOG as strategy and the
> regression database as template:
> while true; do
>   createdb --template=regression --strategy=wal_log testdb;
>   dropdb testdb;
> done
> 2) Feeding more data to pg_class in the middle, while testing the
> connection to the database created:
> while true;
>   do psql -c 'create table popo as select 1 as a;' regression > /dev/null 2>&1 ;
>   psql testdb -c "select 1" > /dev/null 2>&1 ;
>   psql -c 'drop table popo' regression > /dev/null 2>&1 ;
>   psql testdb -c "select 1" > /dev/null 2>&1 ;
> done;
> 3) Force some checkpoints:
> while true; do psql -c 'checkpoint' > /dev/null 2>&1; sleep 4; done
> 4) Force a few crashes and recoveries:
> while true ; do pg_ctl stop -m immediate ; pg_ctl start ; sleep 4 ; done
>

I am able to reproduce this using the steps given above, and I am also
trying to analyze it further.  I will send an update once I get
some clue.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: "PANIC: could not open critical system index 2662" - twice

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Maybe it would be sensible to make STRATEGY=FILE_COPY the default
> again, for branch 15, before today's release.

If we had more than one such report, I'd be in favor of that.
But I think it's a bit premature to conclude that the copy
strategy is to blame.

            regards, tom lane



Re: "PANIC: could not open critical system index 2662" - twice

From
Andres Freund
Date:
Hi,

On 2023-05-07 16:10:28 +0000, Evgeny Morozov wrote:
> Yes, kind of. We have a test suite that creates one test DB and runs a
> bunch of tests on it. Two of these tests, however, create another DB
> each (also by cloning the same template DB) in order to test copying
> data between DBs. It's only these "extra" DBs that were corrupted, at
> least on this occasion.

Did you have any occasions where CREATE or DROP DATABASE was interrupted?
Either due to the connection being terminated or a crash?

As described in
https://postgr.es/m/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de
we don't handle that correctly for DROP DATABASE.

I think that might actually fit the symptoms - the DropDatabaseBuffers() will
throw away the dirty buffer contents from the WAL strategy CREATE DATABASE,
but if you then get cancelled at a point before all the files are removed, the
on-disk files with all-zeroes would remain.

Greetings,

Andres Freund



Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 8/05/2023 9:47 pm, Andres Freund wrote:
> Did you have any occasions where CREATE or DROP DATABASE was interrupted?
> Either due to the connection being terminated or a crash?

I've uploaded an edited version of the PG log for the time as
https://objective.realityexists.net/temp/log-extract-2023-05-02.txt
(test_behavior_638186279733138190 and test_behavior_638186280406544656
are the two DBs that got corrupted).

I cannot see any crash in the test logs or the PG logs, but whether it
was interrupted is less clear. I don't know whether the tests ran
successfully up to the point where they tried to drop the DBs (I've
since added logging to show that next time), but DROP DATABASE did not
return after 30 seconds and the client library (Npgsql) then tried to
cancel the requests. We then tried to drop the DB again, with the same
results in both cases. After the second attempts timed out we closed the
connections anyway - so maybe that was the interruption?


> As described in
> https://postgr.es/m/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de
> we don't handle that correctly for DROP DATABASE.
>
> I think that might actually fit the symptoms - the DropDatabaseBuffers() will
> throw away the dirty buffer contents from the WAL strategy CREATE DATABASE,
> but if you then get cancelled at a point before all the files are removed, the
> on-disk files with all-zeroes would remain.
Oooh, that does seem to fit! Thank you for digging that up.



Re: "PANIC: could not open critical system index 2662" - twice

From
Andres Freund
Date:
Hi,

On 2023-05-08 20:27:14 +0000, Evgeny Morozov wrote:
> On 8/05/2023 9:47 pm, Andres Freund wrote:
> > Did you have any occasions where CREATE or DROP DATABASE was interrupted?
> > Either due to the connection being terminated or a crash?
>
> I've uploaded an edited version of the PG log for the time as
> https://objective.realityexists.net/temp/log-extract-2023-05-02.txt
> (test_behavior_638186279733138190 and test_behavior_638186280406544656
> are the two DBs that got corrupted).
>
> I cannot see any crash in the test logs or the PG logs, but whether it
> was interrupted is less clear. I don't know whether the tests ran
> successfully up to the point where they tried to drop the DBs (I've
> since added logging to show that next time), but DROP DATABASE did not
> return after 30 seconds and the client library (Npgsql) then tried to
> cancel the requests. We then tried to drop the DB again, with the same
> results in both cases. After the second attempts timed out we closed the
> connections anyway - so maybe that was the interruption?

Yep, that is the interruption.  I suspect that something was not processing
interrupts, which then led the WaitForProcSignalBarrier() in dropdb() to
block.

Are you using any extensions? Do you have any chance to figure out what
statements were running concurrently with the DROP DATABASE?


Seems we figured out what the problem is... Just need to figure out how to fix
it.

I think the minimal fix will entail moving the commit of the transaction into
dropdb(). We need to commit before executing DropDatabaseBuffers(), as the
database is inconsistent after that point.  Luckily DROP DATABASE already
can't be executed in a transaction, so that's easily possible.
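
(As a quick illustration of that restriction - the exact error wording may differ between versions:)

postgres=# BEGIN;
BEGIN
postgres=# DROP DATABASE testdb;
ERROR:  DROP DATABASE cannot run inside a transaction block
postgres=# ROLLBACK;
ROLLBACK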

That means we'd lose track of the files for the database in case of a crash,
but that's surely better than a corrupt database. And it's already the case
for CREATE DATABASE.


Additionally we probably need to prevent processing cancellations between
DropDatabaseBuffers() and remove_dbtablespaces(). But we can't just stop
accepting interrupts, because that'd break WaitForProcSignalBarrier(). Ah,
what fun.

Unfortunately QueryCancelHoldoffCount is not sufficient for this purpose - we
don't just need to prevent cancellations, we also can't accept terminations
either...

For a bit I thought we might be able to just reorder operations, moving the
WaitForProcSignalBarrier() earlier. Unfortunately I don't think that works,
because until DropDatabaseBuffers() has executed, backends might write such
buffers back; we need to do WaitForProcSignalBarrier() after that.


I've toyed with the idea of splitting DropDatabaseBuffers() into two. In a
first pass, mark all buffers as IO_IN_PROGRESS (plus perhaps some other
flag). That'd prevent new IO from being started for those buffers. Then
WaitForProcSignalBarrier(), to make sure nobody has an open FD for those
files. Then actually drop all those buffers and unlink the files. That now can
happen with interrupts held, without any chance of being blocked, afaict.  In
case of being cancelled during WaitForProcSignalBarrier(), AbortBufferIO()
would remove IO_IN_PROGRESS, and everything would be fine.

I don't like the idea of WaitForProcSignalBarrier() while having buffers
marked as in-progress. I don't *immediately* see a deadlock danger, but I'd be
unsurprised if there were some.

But perhaps a similar approach could be the solution? My gut says that the
rough direction might allow us to keep dropdb() a single transaction.

Greetings,

Andres Freund



Re: "PANIC: could not open critical system index 2662" - twice

From
Michael Paquier
Date:
On Mon, May 08, 2023 at 07:15:20PM +0530, Dilip Kumar wrote:
> I am able to reproduce this using the steps given above, and I am also
> trying to analyze it further.  I will send an update once I get
> some clue.

Have you been able to reproduce this on HEAD or at the top of
REL_15_STABLE, or is that 15.2-ish without fa5dd46?

One thing I was wondering about to improve the odds of a hit is to
be more aggressive with the number of relations created at once, so that
we are much more aggressive with the number of pages extended in
pg_class from the origin database.
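
(One sketch of that from psql, assuming the goal is simply to grow pg_class in the origin database quickly - \gexec runs each generated statement:)

SELECT format('CREATE TABLE IF NOT EXISTS t_%s (id int)', g)
FROM generate_series(1, 1000) AS g
\gexec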
--
Michael

Attachment

Re: "PANIC: could not open critical system index 2662" - twice

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> One thing I was wondering about to improve the odds of a hit is to
> be more aggressive with the number of relations created at once, so that
> we are much more aggressive with the number of pages extended in
> pg_class from the origin database.

Andres seems to think it's a problem with aborting a DROP DATABASE.
Adding more data might serve to make the window wider, perhaps.

            regards, tom lane



Re: "PANIC: could not open critical system index 2662" - twice

From
Thomas Munro
Date:
On Tue, May 9, 2023 at 10:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > One thing I was wondering about to improve the odds of a hit is to
> > be more aggressive with the number of relations created at once, so that
> > we are much more aggressive with the number of pages extended in
> > pg_class from the origin database.
>
> Andres seems to think it's a problem with aborting a DROP DATABASE.
> Adding more data might serve to make the window wider, perhaps.

Here's an easy way:

@@ -1689,6 +1689,14 @@ dropdb(const char *dbname, bool missing_ok, bool force)
        /* Close all smgr fds in all backends. */
        WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

+       /* XXX pretend one of the above steps got interrupted by a statement
+          timeout or ^C */
+       if (random() % 2 == 0)
+       {
+               QueryCancelPending = true;
+               InterruptPending = true;
+               CHECK_FOR_INTERRUPTS();
+       }

postgres=# create database db2;
CREATE DATABASE
postgres=# drop database db2;
ERROR:  canceling statement due to user request

$ psql db2
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432"
failed: PANIC:  could not open critical system index 2662

$ od -t x1 base/111117/2662
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0100000
$ od -t x1 base/111117/2837
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000
$ od -t x1 base/111117/2840
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0100000



Re: "PANIC: could not open critical system index 2662" - twice

From
Michael Paquier
Date:
On Mon, May 08, 2023 at 06:04:23PM -0400, Tom Lane wrote:
> Andres seems to think it's a problem with aborting a DROP DATABASE.
> Adding more data might serve to make the window wider, perhaps.

And the odds indeed get much better once I use these two toys:
CREATE OR REPLACE FUNCTION create_tables(num_tables int)
  RETURNS VOID AS
  $func$
  BEGIN
  FOR i IN 1..num_tables LOOP
    EXECUTE format('
      CREATE TABLE IF NOT EXISTS %I (id int)', 't_' || i);
  END LOOP;
END
$func$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION drop_tables(num_tables int)
  RETURNS VOID AS
  $func$
  BEGIN
  FOR i IN 1..num_tables LOOP
    EXECUTE format('
      DROP TABLE IF EXISTS %I', 't_' || i);
  END LOOP;
END
$func$ LANGUAGE plpgsql;

And then use this loop with the others I have mentioned upthread (just
create origindb and the functions in it):
while true;
  do psql -c 'select create_tables(1000)' origindb > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
  psql -c 'select drop_tables(1000)' origindb > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
done;

On top of that, I have also been able to reproduce the issue on HEAD,
and luckily a pg_class file remained around, full of zeros:
$ hexdump ./base/199634/1259
0000000 0000 0000 0000 0000 0000 0000 0000 0000

The contents of 2662, though, looked OK.

Echoing Alvaro..  Could we, err, revisit the choice of making WAL_LOG
the default strategy even for this set of minor releases?  FWIW, I've
mentioned that this choice was too aggressive in the thread of
8a86618..
--
Michael

Attachment

Re: "PANIC: could not open critical system index 2662" - twice

From
Andres Freund
Date:
Hi,

On 2023-05-08 14:04:00 -0700, Andres Freund wrote:
> But perhaps a similar approach could be the solution? My gut says that the
> rough direction might allow us to keep dropdb() a single transaction.

I started to hack on the basic approach of committing after the catalog
changes. But then I started wondering if we're not tackling this all wrong.


We don't actually want to drop the catalog contents early as that prevents the
database from being dropped again, while potentially leaving behind contents
in the case of a crash / interrupt.

Similarly, the fact that we only commit the transaction at the end of
createdb() leads to interrupts / crashes orphaning the contents of that
[partial] database.  We also have similar issues with movedb(), I think.


This is a non-transactional operation. I think we should copy the approach of
the CONCURRENTLY operations. Namely add a new column to pg_database,
indicating whether the database contents are valid. An invalid database can be
dropped, but not connected to.

Then we could have createdb() commit before starting to create the target
database directory (with invalid = true, of course). After all the filesystem
level stuff is complete, set invalid = false.

For dropping a database we'd use heap_inplace_update() to set invalid = true
just before the DropDatabaseBuffers(), preventing any connections after that
point.

Making movedb() safe is probably a bit harder - I think it'd temporarily
require two pg_database entries?


Of course we can't add a new column in the back branches. IIRC we had a
similar issue with CIC at some point, and just ended up misusing some other
column for the back branches?  We could e.g. use datconnlimit == -2 for that
purpose (but would need to make sure that ALTER DATABASE can't unset it).
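
(If that route is taken, a database left invalid by an interrupted DROP could later be spotted with a simple catalog query - this assumes the datconnlimit == -2 convention described above:)

SELECT datname, datconnlimit
FROM pg_database
WHERE datconnlimit = -2;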


My current gut feeling is that we should use datconnlimit == -2 to prevent
connections after reaching DropDatabaseBuffers() in dropdb(), and use a new
column in 16, for both createdb() and dropdb().  In some ways handling
createdb() properly is a new feature, but it's also arguably a bug that we
leak the space - and I think the code will be better if we work on both
together.

Greetings,

Andres Freund



Re: "PANIC: could not open critical system index 2662" - twice

From
Andres Freund
Date:
Hi,

On 2023-05-08 17:46:37 -0700, Andres Freund wrote:
> My current gut feeling is that we should use datconnlimit == -2 to prevent
> connections after reaching DropDatabaseBuffers() in dropdb(), and use a new
> column in 16, for both createdb() and dropdb().

Attached is a rough prototype of that idea (only using datconnlimit == -2 for
now).

This would need a fair bit more polish. The tests are crappy and output stuff
to stderr and don't validate enough. The error messages are bad. No docs
changes. More comments about why datconnlimit == -2 is used. Etc.

But I think it should be sufficient to discuss whether this is a viable path.


I guess we need to move this to -hackers. Perhaps I'll post subsequent
versions below
https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de ?

Greetings,

Andres Freund

Attachment

Re: "PANIC: could not open critical system index 2662" - twice

From
Dilip Kumar
Date:
On Tue, May 9, 2023 at 3:15 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, May 08, 2023 at 07:15:20PM +0530, Dilip Kumar wrote:
> > I am able to reproduce this using the steps given above, and I am also
> > trying to analyze it further.  I will send an update once I get
> > some clue.
>
> Have you been able to reproduce this on HEAD or at the top of
> REL_15_STABLE, or is that 15.2-ish without fa5dd46?
>

I am able to reproduce this on REL_15_STABLE as well with your test case.
The only difference is that without fa5dd46 the issue reproduces within a
couple of seconds and very consistently, whereas on REL_15_STABLE it
takes 2-3 minutes to reproduce and the reproduction is not very
consistent.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 8/05/2023 11:04 pm, Andres Freund wrote:
> Are you using any extensions?

Only plpgsql.

> Do you have any chance to figure out what statements were running
> concurrently with the DROP DATABASE?
No. Is there some way to log that, other than just logging all
statements (which seems impractical)? Most of the statements running on
that server should be inserts into another database, plus some reads.
Something could be creating or dropping partitions in another DB in
parallel, too, but not frequently.
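
(One lower-volume option than logging all statements would be to log only DDL, which would capture the DROP DATABASE and any concurrent CREATE/DROP TABLE, though not the inserts - assuming that is acceptable:)

ALTER SYSTEM SET log_statement = 'ddl';
SELECT pg_reload_conf();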



Re: "PANIC: could not open critical system index 2662" - twice

From
Kirk Wolak
Date:
On Sun, May 7, 2023 at 10:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, May 8, 2023 at 4:10 AM Evgeny Morozov
<postgresql3@realityexists.net> wrote:
> On 6/05/2023 11:13 pm, Thomas Munro wrote:
> > Would you like to try requesting FILE_COPY for a while and see if it eventually happens like that too?
> Sure, we can try that.

Maybe you could do some one way and some the other, so that we try to
learn more?

> Ubuntu 18.04.6
> Kernel 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023
> x86_64 x86_64 x86_64 GNU/Linux
> zfsutils-linux package version 0.7.5-1ubuntu16.12 amd64

I tried for a few hours to reproduce this by trying to make as many
things as similar to yours as I could based on info in this thread
(Linux: up-to-date Ubuntu 18.04 in vagrant which has nearly the same
kernel 4.15.0-208-generic and a close but slightly different version
of ancient ZFS 0.7.5-1ubuntu15, not sure why, ZFS: mirror (I used a
pair of loopback files), recordsize=16kB, compression=lz4, PG:
compiled from tag REL_15_2, data_checksums=on, full_page_writes=off,
wal_recycle=off, wal_init_zero=off), with what I thought might be
roughly what you're doing (creating three DBs, two clones of the
first, with various modifications at various points, with various
overlapping activities, and then checking for catalog corruption).  No
cigar.  Hrmph.

My first thought on this was along the lines of previous comments.
We are writing the initial pieces of the files for a new DATABASE.
Some flag is set, and then context is lost, and it ends up that a SAVE happens
to an existing DATABASE.

So, my thought was a parallel set of UPDATE "type" commands that would
cause those types of pages for a stable/existing DB to be updated/written.

It could be as simple as creating temp tables in the other database (since I believe pg_class was hit).

From a "rare" standpoint.  That "feels" about right to me.  It would pass serial tests (like you are running).
Just a thought.  Maybe a simple background script creating temp tables in another DB.
Also, not sure if the OP has a set of things done after he creates the DB that may help?

Kirk

 

Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 10/05/2023 6:39 am, Kirk Wolak wrote:
It could be as simple as creating temp tables in the other database (since I believe pg_class was hit).
We do indeed create temp tables, both in other databases and in the ones being tested. (We also create non-temp tables there.)

Also, not sure if the OP has a set of things done after he creates the DB that may help?

Basically we read rows from the source database, create some partitions of tables in the target database, insert into a temp table there using BULK COPY, then using a regular INSERT copy from the temp tables to the new partitions.
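
(Roughly, in SQL terms - the table and partition names below are invented for illustration, and the binary COPY itself is driven from Npgsql:)

-- in the target database
CREATE TABLE data_2023_05 PARTITION OF data
    FOR VALUES FROM ('2023-05-01') TO ('2023-06-01');

CREATE TEMP TABLE data_staging (LIKE data INCLUDING DEFAULTS);

COPY data_staging FROM STDIN (FORMAT binary);  -- Npgsql binary COPY

INSERT INTO data_2023_05 SELECT * FROM data_staging;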


Now that the problem has been reproduced and understood by the PG developers, could anyone explain why PG crashed entirely with the "PANIC" error back in April when only specific databases were corrupted, not any global objects necessary for PG to run? And why did it not crash with the "PANIC" on this occasion?

Re: "PANIC: could not open critical system index 2662" - twice

From
Kirk Wolak
Date:
On Wed, May 10, 2023 at 9:32 AM Evgeny Morozov <postgresql3@realityexists.net> wrote:
On 10/05/2023 6:39 am, Kirk Wolak wrote:
It could be as simple as creating temp tables in the other database (since I believe pg_class was hit).
We do indeed create temp tables, both in other databases and in the ones being tested. (We also create non-temp tables there.)

Also, not sure if the OP has a set of things done after he creates the DB that may help?

Basically we read rows from the source database, create some partitions of tables in the target database, insert into a temp table there using BULK COPY, then using a regular INSERT copy from the temp tables to the new partitions.


Now that the problem has been reproduced and understood by the PG developers, could anyone explain why PG crashed entirely with the "PANIC" error back in April when only specific databases were corrupted, not any global objects necessary for PG to run? And why did it not crash with the "PANIC" on this occasion?

I understand the question as:
Why would it PANIC on non-system data corruption, but not on system data corruption?

To which my guess is:
Because system data corruption on startup is probably a use case, and we want to report it and come up as much as possible.
Whereas the OTHER code did a PANIC simply because it was BOTH unexpected AND not in a place where it could move forward.
Meaning it had no idea if it had read in bad data, or if it had CREATED the bad data.

As a programmer, you will find much more robust code on startup checking than in the middle of doing something else.

But just a guess.  Someone deeper into the code might explain it better.
And you COULD go dig through the source to compare the origination of the error messages?

Kirk...

Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 9/05/2023 3:32 am, Andres Freund wrote:
> Attached is a rough prototype of that idea (only using datconnlimit ==
> -2 for now).
> I guess we need to move this to -hackers. Perhaps I'll post subsequent
> versions below
> https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de ?
So now that a patch is in the works, can I drop the two corrupted
databases? Is there a workaround I can use to reduce the risk of running
into this issue again until a patch is released? (Which I guess would be
in August?)



Re: "PANIC: could not open critical system index 2662" - twice

From
Kirk Wolak
Date:
On Tue, May 16, 2023 at 10:20 AM Evgeny Morozov <postgresql3@realityexists.net> wrote:
On 9/05/2023 3:32 am, Andres Freund wrote:
> Attached is a rough prototype of that idea (only using datconnlimit ==
> -2 for now).
> I guess we need to move this to -hackers. Perhaps I'll post subsequent
> versions below
> https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de ?
So now that a patch is in the works, can I drop the two corrupted
databases? Is there a workaround I can use to reduce the risk of running
into this issue again until a patch is released? (Which I guess would be
in August?)

The only workaround to avoid losing data that I know of is backups and WAL backups.
Plus "hard core testing/validation" that they work.  We settled on pgBackRest and are happy with it.

Technically, based on what I understand of this bug, it did not corrupt the WAL.  If that's true, then if
you had a base backup and all the WAL files, you could have played them back and recovered the data.
At least to some degree.  Assuming I am right.

HTH 

Re: "PANIC: could not open critical system index 2662" - twice

From
Andres Freund
Date:
Hi,

On 2023-05-16 14:20:46 +0000, Evgeny Morozov wrote:
> On 9/05/2023 3:32 am, Andres Freund wrote:
> > Attached is a rough prototype of that idea (only using datconnlimit ==
> > -2 for now).
> > I guess we need to move this to -hackers. Perhaps I'll post subsequent
> > versions below
> > https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de ?
> So now that a patch is in the works, can I drop the two corrupted
> databases?

Yes.


> Is there a workaround I can use to reduce the risk of running into this
> issue again until a patch is released? (Which I guess would be in August?)

Try to prevent the DROP DATABASE from getting cancelled :/. If you want to go
a bit further, you could rename the database to *_dropped before dropping it,
and then try to do the DROP DATABASE. That way you'd at least know that it's
corrupt because of a failed DROP database.
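
(Concretely, that suggestion might look something like this, using one of the database names from earlier in the thread as an example:)

ALTER DATABASE "test_behavior_638186279733138190"
    RENAME TO "test_behavior_638186279733138190_dropped";
DROP DATABASE "test_behavior_638186279733138190_dropped";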

Greetings,

Andres Freund



Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
On 17/05/2023 1:39 am, Andres Freund wrote:
> Try to prevent the DROP DATABASE from getting cancelled :/.

I still don't know why that's happening. I mean, I know why it gets
cancelled (the client timeout we set in Npgsql), but I don't know why
the drop does not succeed within 30 seconds. We could, of course,
increase that timeout, but my gut feeling is that if it hasn't happened
in 30 seconds it won't happen in 5 minutes, either, and the connection
will have to be closed eventually. Can you think of any way to
troubleshoot that?
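
(One way to see what the hanging DROP DATABASE is waiting on would be to query pg_stat_activity from another session while the 30 seconds tick by, e.g.:)

SELECT pid, state, wait_event_type, wait_event, query_start, query
FROM pg_stat_activity
WHERE query ILIKE 'drop database%';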


> If you want to go a bit further, you could rename the database to
> *_dropped before dropping it, and then try to do the DROP DATABASE.
> That way you'd at least know that it's corrupt because of a failed
> DROP database.

That's a good idea and I will do that, but my bigger concern is PG
crashing entirely with the "PANIC" error and refusing to start again. I
still don't know why that happened back in April (but not this time, in
May), so I don't know how to prevent that.




Re: "PANIC: could not open critical system index 2662" - twice

From
Evgeny Morozov
Date:
There haven't been any updates posted to
https://www.postgresql.org/message-id/20230509040203.z6mvijumv7wxcuib%40awork3.anarazel.de
so I just wanted to check if there is any update on the status of the
patch? Can we expect it in PostgreSQL 15.4? Thanks.



Re: "PANIC: could not open critical system index 2662" - twice

From
Andres Freund
Date:
Hi,

On 2023-06-19 10:04:35 +0000, Evgeny Morozov wrote:
> There haven't been any updates posted to
> https://www.postgresql.org/message-id/20230509040203.z6mvijumv7wxcuib%40awork3.anarazel.de
> so I just wanted to check if there is any update on the status of the
> patch? Can we expect it in PostgreSQL 15.4? Thanks.

I pushed the fixes to all branches just now. Thanks for the report!

Greetings,

Andres Freund