Re: checkpointer: PANIC: could not fsync file: No such file or directory - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: checkpointer: PANIC: could not fsync file: No such file or directory
Msg-id CAMsr+YG+WNf17xRwTZhSKgFP9p-PAxb9s1DqGZGqQ_NiVZTSPA@mail.gmail.com
In response to Re: checkpointer: PANIC: could not fsync file: No such file or directory (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
On Thu, 21 Nov 2019 at 09:07, Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Nov 19, 2019 at 07:22:26PM -0600, Justin Pryzby wrote:
> I was trying to reproduce what was happening:
> set -x
> psql postgres -txc "DROP TABLE IF EXISTS t" -c "CREATE TABLE t(i int unique); INSERT INTO t SELECT generate_series(1,999999)"
> echo "begin;SELECT pg_export_snapshot(); SELECT pg_sleep(9)" | psql postgres -At >/tmp/snapshot &
> sleep 3
> snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`
> PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t ALTER i TYPE bigint" -c CHECKPOINT
> pg_dump -d postgres -t t --snap="$snap" | head -44
>
> Under v12, with or without the CHECKPOINT command, it fails:
> |pg_dump: error: query failed: ERROR:  cache lookup failed for index 0
> But under v9.5.2 (which I found quickly), without CHECKPOINT, it instead fails like:
> |pg_dump: [archiver (db)] query failed: ERROR:  cache lookup failed for index 16391
> With the CHECKPOINT command, 9.5.2 works, but I don't see why it should be
> needed, or why it would behave differently (or if it's related to this crash).

Actually, I think that's at least related to documented behavior:

https://www.postgresql.org/docs/12/mvcc-caveats.html
|Some DDL commands, currently only TRUNCATE and the table-rewriting forms of ALTER TABLE, are not MVCC-safe. This means that after the truncation or rewrite commits, the table will appear empty to concurrent transactions, if they are using a snapshot taken before the DDL command committed.

I don't know why CHECKPOINT allows it to work under 9.5, or if it's even
related to the PANIC ..

The PANIC is a defense against potential corruption that can be caused by some kinds of disk errors. It's likely that we used to just ERROR and retry, and the retry would then succeed without complaint.
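
To make the ERROR-versus-PANIC distinction concrete, here is a minimal C sketch of such a policy, assuming a boolean switch in the spirit of PostgreSQL's data_sync_retry setting (off by default). Every toy_-prefixed name is a hypothetical stand-in for illustration, not the server's code:

```c
#include <stdbool.h>

/* Toy severity levels; the real server defines its own constants. */
enum toy_elevel { TOY_WARNING = 1, TOY_ERROR = 2, TOY_PANIC = 3 };

/* Hypothetical stand-in for a "retry after failed data sync" switch. */
bool toy_data_sync_retry = false;

/*
 * Choose the severity to report for a failed fsync of a data file.
 * With retries disabled, the requested level is promoted to PANIC,
 * because a later fsync could report success even though the earlier
 * dirty data never reached disk.
 */
int toy_data_sync_elevel(int requested)
{
    return toy_data_sync_retry ? requested : TOY_PANIC;
}
```

With the switch off, a caller that asks for TOY_ERROR gets TOY_PANIC back, so the process stops instead of retrying the fsync.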

fsync_fname() is supposed to ignore errors for files that cannot be opened. But that same message may be emitted by a number of other parts of the code, and it looks like you didn't have log_error_verbosity = verbose, so we don't have file/line info.

The only other place I see that emits that error where a relation path could be a valid argument is in rewriteheap.c, in logical_end_heap_rewrite(). That calls the vfd layer's FileSync() and assumes that any failure is an fsync() syscall failure. But FileSync() can also return failure if we fail to reopen the underlying file managed by the vfd, per FileAccess().
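
A rough, self-contained illustration of that failure mode, not taken from the PostgreSQL source: the toy "virtual fd" below remembers a path, lets its kernel descriptor be closed, and reopens it on demand before calling fsync(). All toy_ and TOY_ names and the /tmp path are assumptions made for this sketch. If the file is unlinked while the descriptor is closed, the sync attempt fails with ENOENT in open(), and fsync() is never reached:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for a virtual fd: the path outlives the kernel fd. */
typedef struct {
    char path[256];
    int  fd;                     /* -1 while the kernel fd is closed */
} ToyVfd;

typedef enum {
    TOY_SYNC_OK,
    TOY_SYNC_REOPEN_FAILED,      /* open() failed; fsync() was never called */
    TOY_SYNC_FSYNC_FAILED,       /* fsync() itself reported an error */
    TOY_SETUP_FAILED             /* could not create the scratch file */
} ToySyncResult;

/* Reopen the file if needed, then fsync it, reporting which step failed. */
ToySyncResult toy_file_sync(ToyVfd *v)
{
    assert(v != NULL);
    if (v->fd < 0) {
        v->fd = open(v->path, O_RDWR);
        if (v->fd < 0)
            return TOY_SYNC_REOPEN_FAILED;
    }
    if (fsync(v->fd) != 0)
        return TOY_SYNC_FSYNC_FAILED;
    return TOY_SYNC_OK;
}

/*
 * Create a scratch file, drop its kernel fd, unlink it, then try to sync.
 * *saved_errno is meaningful only when the result is a failure.
 */
ToySyncResult toy_sync_after_unlink(int *saved_errno)
{
    ToyVfd v;

    strcpy(v.path, "/tmp/toy_vfd_XXXXXX");
    v.fd = mkstemp(v.path);
    if (v.fd < 0) {
        *saved_errno = errno;
        return TOY_SETUP_FAILED;
    }
    close(v.fd);
    v.fd = -1;                   /* simulate the fd cache evicting this file */
    unlink(v.path);              /* the file vanishes while no fd is held */

    ToySyncResult r = toy_file_sync(&v);
    *saved_errno = errno;
    return r;
}
```

A caller that treats every non-OK result as an fsync() failure would report "could not fsync file: No such file or directory" here, even though the error actually came from open().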

Would there be a legitimate case where a logical rewrite mapping file could vanish without that being a problem? If so, we should probably be more tolerant there.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise
