Re: checkpointer: PANIC: could not fsync file: No such file or directory - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: checkpointer: PANIC: could not fsync file: No such file or directory
Msg-id 20191120012226.GW30362@telsasoft.com
In response to Re: checkpointer: PANIC: could not fsync file: No such file or directory  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: checkpointer: PANIC: could not fsync file: No such file or directory  (Justin Pryzby <pryzby@telsasoft.com>)
Re: checkpointer: PANIC: could not fsync file: No such file or directory  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
On Tue, Nov 19, 2019 at 04:49:10PM -0600, Justin Pryzby wrote:
> On Wed, Nov 20, 2019 at 09:26:53AM +1300, Thomas Munro wrote:
> > Perhaps we should not panic if we failed to open (not fsync) the file,
> > but it's not the root problem here which is that somehow we thought we
> > should be fsyncing a file that had apparently been removed already
> > (due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).

Note, the ALTER was (I think) building an index in a parallel process:

 2019-11-15 22:16:08.752-05 |  5976 | terminating connection because of crash of another server process
 2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
 2019-11-15 22:16:08.751-05 | 27384 | terminating any other active server processes
 2019-11-15 22:16:07.098-05 | 27388 | could not fsync file "base/16491/1731839470.2": No such file or directory
 2019-11-15 22:15:59.592-05 | 19860 | duration: 220283.831 ms  statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolulscell TYPE integer USING pmradiothpvolulscell::integer
 
 2019-11-15 22:15:59.459-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/1.0", size 5144576
 2019-11-15 22:15:59.458-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/2.0", size 6463488
 2019-11-15 22:15:59.456-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/0.0", size 4612096

FYI, that table is *currently* (5 days later):
ts=# \dti+ child.eric_enodeb_cell_201811*
 Schema |               Name               | Type  |   Owner   |          Table          |  Size   | Description
 child  | eric_enodeb_cell_201811          | table | telsasoft |                         | 2595 MB |
 child  | eric_enodeb_cell_201811_idx      | index | telsasoft | eric_enodeb_cell_201811 | 120 kB  |
 child  | eric_enodeb_cell_201811_site_idx | index | telsasoft | eric_enodeb_cell_201811 | 16 MB   |

I don't know if that table is likely to be the one with relfilenode 1731839470
(but it certainly wasn't its index), or if that was maybe a table (or index)
from an earlier ALTER.  I tentatively think we wouldn't have had any other
tables being dropped, partitions pruned or maintenance commands running.

Checkpoint logs for good measure:
 2019-11-15 22:18:26.168-05 | 10388 | checkpoint complete: wrote 2915 buffers (3.0%); 0 WAL file(s) added, 0 removed, 18 recycled; write=30.022 s, sync=0.472 s, total=32.140 s; sync files=107, longest=0.364 s, average=0.004 s; distance=297471 kB, estimate=297471 kB
 2019-11-15 22:17:54.028-05 | 10388 | checkpoint starting: time
 2019-11-15 22:16:53.753-05 | 10104 | checkpoint complete: wrote 98275 buffers (100.0%); 0 WAL file(s) added, 0 removed, 43 recycled; write=11.040 s, sync=0.675 s, total=11.833 s; sync files=84, longest=0.335 s, average=0.008 s; distance=698932 kB, estimate=698932 kB
 2019-11-15 22:16:41.921-05 | 10104 | checkpoint starting: end-of-recovery immediate
 2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
 2019-11-15 22:15:33.03-05  | 27388 | checkpoint starting: time
 2019-11-15 22:15:03.62-05  | 27388 | checkpoint complete: wrote 5436 buffers (5.5%); 0 WAL file(s) added, 0 removed, 45 recycled; write=28.938 s, sync=0.355 s, total=29.711 s; sync files=22, longest=0.174 s, average=0.016 s; distance=740237 kB, estimate=740237 kB
 2019-11-15 22:14:33.908-05 | 27388 | checkpoint starting: time

I was trying to reproduce what was happening:
set -x
psql postgres -txc "DROP TABLE IF EXISTS t" \
  -c "CREATE TABLE t(i int unique); INSERT INTO t SELECT generate_series(1,999999)"
echo "begin; SELECT pg_export_snapshot(); SELECT pg_sleep(9)" | psql postgres -At >/tmp/snapshot &
sleep 3
snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`
PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t ALTER i TYPE bigint" -c CHECKPOINT
pg_dump -d postgres -t t --snap="$snap" | head -44
 

Under v12, with or without the CHECKPOINT command, it fails:
|pg_dump: error: query failed: ERROR:  cache lookup failed for index 0
But under v9.5.2 (which I found quickly), without CHECKPOINT, it instead fails like:
|pg_dump: [archiver (db)] query failed: ERROR:  cache lookup failed for index 16391
With the CHECKPOINT command, 9.5.2 works, but I don't see why it should be
needed, or why it would behave differently (or if it's related to this crash).


