Re: "PANIC: could not open critical system index 2662" - twice - Mailing list pgsql-general
From | Evgeny Morozov |
---|---|
Subject | Re: "PANIC: could not open critical system index 2662" - twice |
Date | |
Msg-id | 01020187e773e031-eecea370-714f-4858-82e5-ed1c9091eb6f-000000@eu-west-1.amazonses.com |
In response to | Re: "PANIC: could not open critical system index 2662" - twice (Alban Hertroys <haramrae@gmail.com>) |
Responses | Re: "PANIC: could not open critical system index 2662" - twice |
List | pgsql-general |
On 14/04/2023 10:42 am, Alban Hertroys wrote:
> Your problem coincides with a thread at freebsd-current with very
> similar data corruption after a recent OpenZFS import: blocks of all
> zeroes, but also missing files. So, perhaps these problems are related?
> Apparently, there was a recent fix for a data corruption issue with the block_cloning feature enabled, but people are still seeing corruption even when they never enabled that feature.
>
> I couldn’t really find the start of the thread in the archives, so this one kind of jumps into the middle of the thread at a relevant-looking point:
>
> https://lists.freebsd.org/archives/freebsd-current/2023-April/003446.html

That thread was a bit over my head, I'm afraid, so I can't say if it's related. I haven't detected any missing files, anyway.

Well, the problem happened again! Kind of... This time PG has not crashed with the PANIC error in the subject, but pg_dumping certain DBs again fails with

pg_dump: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5434" failed: FATAL: index "pg_class_oid_index" contains unexpected zero page at block 0

PG server log contains:

2023-05-03 04:31:49.903 UTC [14724] postgres@test_behavior_638186279733138190 FATAL: index "pg_class_oid_index" contains unexpected zero page at block 0
2023-05-03 04:31:49.903 UTC [14724] postgres@test_behavior_638186279733138190 HINT: Please REINDEX it.

The server PID does not change on such a pg_dump attempt, so it appears that only the backend process for the pg_dump connection crashes this time. I don't see any disk errors and there haven't been any OS crashes.

This currently happens for two DBs. Both of them are very small DBs created by automated .NET tests using Npgsql as client. The code creates such a test DB from a template DB and tries to drop it at the end of the test. This times out sometimes and on timeout our test code tries to drop the DB again (while the first drop command is likely still pending on the server). This second attempt to drop the DB also timed out:

[12:40:39] Npgsql.NpgsqlException : Exception while reading from stream
  ----> System.TimeoutException : Timeout during reading attempt
   at Npgsql.NpgsqlConnector.<ReadMessage>g__ReadMessageLong|194_0(NpgsqlConnector connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlDataReader.NextResult()
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteNonQuery(Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteNonQuery()
...
[12:41:41] (same error again for the same DB)

From looking at old logs it seems like the same thing happened last time (in April) as well. That's quite an unlikely coincidence if a bad disk or bad filesystem was to blame, isn't it?

I've tried to reproduce this by re-running those tests over and over, but without success so far.

So what can I do about this? Do I just try to drop those databases again manually? But what about the next time it happens? How do I figure out the cause and prevent this problem?
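
For anyone trying to reproduce this, the drop-and-retry sequence described above can be sketched roughly as follows. This is only an illustration in Python, not the actual .NET/Npgsql test code: `drop_with_retry`, `FakeServer`, and the database name are hypothetical, and the fake connection simply times out on every call, as both drop attempts did in the logs above.

```python
from typing import Callable, List


def drop_with_retry(execute: Callable[[str], None], db_name: str,
                    attempts: int = 2) -> List[str]:
    """Send DROP DATABASE, sending it again if the client-side call times out.

    Returns the statements that were sent. Assumption taken from the post:
    when the client times out, the earlier DROP may still be pending on the
    server, so the retry can overlap with it.
    """
    sent: List[str] = []
    statement = f'DROP DATABASE "{db_name}"'
    for _ in range(attempts):
        sent.append(statement)
        try:
            execute(statement)
            return sent  # the drop finished from the client's point of view
        except TimeoutError:
            continue  # client gave up waiting; try the same statement again
    return sent


class FakeServer:
    """Hypothetical stand-in for a connection whose every call times out."""

    def __init__(self) -> None:
        self.received: List[str] = []

    def execute(self, sql: str) -> None:
        self.received.append(sql)  # record what reached the "server"
        raise TimeoutError("Timeout during reading attempt")


if __name__ == "__main__":
    server = FakeServer()
    sent = drop_with_retry(server.execute, "test_behavior_example")
    print(len(sent), "DROP statements sent;",
          len(server.received), "reached the server")
```

The point of the sketch is only that the same DROP DATABASE statement can be sent twice for one database, which is the pattern that preceded the corruption both times.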