Re: checkpointer: PANIC: could not fsync file: No such file or directory - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: checkpointer: PANIC: could not fsync file: No such file or directory
Date
Msg-id 20191127065313.GU16647@telsasoft.com
In response to checkpointer: PANIC: could not fsync file: No such file or directory  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: checkpointer: PANIC: could not fsync file: No such file or directory
List pgsql-hackers
This same crash occurred on a 2nd server.
Also qemu/KVM, but this time on a secondary ZFS tablespace, which should (but fails to) include the missing relfilenode.
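
For reference, the tablespace OID in the failing path below (16401) can be mapped back to a tablespace name and location with something like this (just a sketch, run against the affected cluster):

-- Resolve the tablespace OID seen in pg_tblspc/16401/... to its name and location
SELECT oid, spcname, pg_tablespace_location(oid) AS location
FROM pg_tablespace
WHERE oid = 16401;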
Linux database7 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

This is postgresql12-12.1-1PGDG.rhel7.x86_64 (same as first crash), running since:
|30515 Tue Nov 19 10:04:33 2019 S ?        00:09:54 /usr/pgsql-12/bin/postmaster -D /var/lib/pgsql/12/data/

Before that, this server ran v12.0 since Oct 30 (without crashing).

In this case, the pg_dump --snap finished and released its snapshot at 21:50,
and there were no ALTERed tables.  I see a temp file written since the previous
checkpoint, but not by a parallel worker, as in the previous server's crash.

The crash happened while reindexing, though.  The "DROP INDEX CONCURRENTLY" is
from pg_repack -i, and completed successfully, but is followed immediately by
the abort log.  The following "REINDEX toast..." failed.  In this case, I
*guess* that the missing filenode is due to a dropped index (570627937 or
otherwise).  I don't see any other CLUSTER, VACUUM FULL, DROP, TRUNCATE or
ALTER within that checkpoint interval (note, we have 1-minute checkpoints).
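
To help verify that guess, the missing filenode can be checked against the catalogs with something like the following (a sketch; an empty result would be consistent with a dropped or rewritten relation):

-- Does anything still map to the filenode from the fsync failure below?
SELECT pg_filenode_relation(16401, 973123799);

-- Cross-check directly in pg_class.
SELECT oid::regclass AS relation, relfilenode, reltablespace
FROM pg_class
WHERE relfilenode = 973123799;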

Note, I double-checked on the first server which crashed: it definitely wasn't
running pg_repack or the reindex script, since I had removed pg_repack12 from our
servers until 12.1 was installed, to avoid the "concurrently" progress
reporting crash fixed at 1cd5bc3c.  So I think ALTER TABLE TYPE and REINDEX can
both trigger this crash, at least on v12.1.

Note I actually have *full logs*, which I've now saved.  But here's an excerpt:

postgres=# SELECT log_time, message FROM ckpt_crash WHERE log_time BETWEEN '2019-11-26 23:40:20' AND '2019-11-26 23:48:58' AND user_name IS NULL ORDER BY 1;

 2019-11-26 23:40:20.139-05 | checkpoint starting: time
 2019-11-26 23:40:50.069-05 | checkpoint complete: wrote 11093 buffers (5.6%); 0 WAL file(s) added, 0 removed, 12 recycled; write=29.885 s, sync=0.008 s, total=29.930 s; sync files=71, longest=0.001 s, average=0.000 s; distance=193388 kB, estimate=550813 kB
 2019-11-26 23:41:16.234-05 | automatic analyze of table "postgres.public.postgres_log_2019_11_26_2300" system usage: CPU: user: 3.00 s, system: 0.19 s, elapsed: 10.92 s
 2019-11-26 23:41:20.101-05 | checkpoint starting: time
 2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory
 2019-11-26 23:42:04.397-05 | checkpointer process (PID 30560) was terminated by signal 6: Aborted
 2019-11-26 23:42:04.397-05 | terminating any other active server processes
 2019-11-26 23:42:04.397-05 | terminating connection because of crash of another server process
 2019-11-26 23:42:04.42-05  | terminating connection because of crash of another server process
 2019-11-26 23:42:04.493-05 | all server processes terminated; reinitializing
 2019-11-26 23:42:05.651-05 | database system was interrupted; last known up at 2019-11-27 00:40:50 -04
 2019-11-26 23:47:30.404-05 | database system was not properly shut down; automatic recovery in progress
 2019-11-26 23:47:30.435-05 | redo starts at 3450/1B202938
 2019-11-26 23:47:54.501-05 | redo done at 3450/205CE960
 2019-11-26 23:47:54.501-05 | invalid record length at 3450/205CEA18: wanted 24, got 0
 2019-11-26 23:47:54.567-05 | checkpoint starting: end-of-recovery immediate
 2019-11-26 23:47:57.365-05 | checkpoint complete: wrote 3287 buffers (1.7%); 0 WAL file(s) added, 0 removed, 5 recycled; write=2.606 s, sync=0.183 s, total=2.798 s; sync files=145, longest=0.150 s, average=0.001 s; distance=85808 kB, estimate=85808 kB
 2019-11-26 23:47:57.769-05 | database system is ready to accept connections
 2019-11-26 23:48:57.774-05 | checkpoint starting: time

< 2019-11-27 00:42:04.342 -04 postgres >LOG:  duration: 13.028 ms  statement: DROP INDEX CONCURRENTLY "child"."index_570627937"
< 2019-11-27 00:42:04.397 -04  >LOG:  checkpointer process (PID 30560) was terminated by signal 6: Aborted
< 2019-11-27 00:42:04.397 -04  >LOG:  terminating any other active server processes
< 2019-11-27 00:42:04.397 -04  >WARNING:  terminating connection because of crash of another server process
< 2019-11-27 00:42:04.397 -04  >DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
< 2019-11-27 00:42:04.397 -04  >HINT:  In a moment you should be able to reconnect to the database and repeat your command.
...
< 2019-11-27 00:42:04.421 -04 postgres >STATEMENT:  begin; LOCK TABLE child.ericsson_sgsn_ss7_remote_sp_201911 IN SHARE MODE; REINDEX INDEX pg_toast.pg_toast_570627929_index; commit

Here are all the non-default settings which seem plausibly relevant or interesting:
 autovacuum_analyze_scale_factor | 0.005                                  |      | configuration file
 autovacuum_analyze_threshold    | 2                                      |      | configuration file
 checkpoint_timeout              | 60                                     | s    | configuration file
 max_files_per_process           | 1000                                   |      | configuration file
 max_stack_depth                 | 2048                                   | kB   | environment variable
 max_wal_size                    | 4096                                   | MB   | configuration file
 min_wal_size                    | 4096                                   | MB   | configuration file
 shared_buffers                  | 196608                                 | 8kB  | configuration file
 shared_preload_libraries        | pg_stat_statements                     |      | configuration file
 wal_buffers                     | 2048                                   | 8kB  | override
 wal_compression                 | on                                     |      | configuration file
 wal_segment_size                | 16777216                               | B    | override
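
For completeness, that listing comes from pg_settings; something like the following reproduces it (a sketch, not necessarily the exact query I ran):

-- Settings that differ from their built-in defaults, with unit and origin.
SELECT name, setting, unit, source
FROM pg_settings
WHERE source <> 'default'
ORDER BY name;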

(gdb) bt
#0  0x00007f07c0070207 in raise () from /lib64/libc.so.6
#1  0x00007f07c00718f8 in abort () from /lib64/libc.so.6
#2  0x000000000087752a in errfinish (dummy=<optimized out>) at elog.c:552
#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
#4  0x0000000000734dd9 in CheckPointBuffers (flags=flags@entry=256) at bufmgr.c:2588
#5  0x00000000005095e1 in CheckPointGuts (checkPointRedo=57518713529016, flags=flags@entry=256) at xlog.c:9006
#6  0x000000000050ff86 in CreateCheckPoint (flags=flags@entry=256) at xlog.c:8795
#7  0x00000000006e4092 in CheckpointerMain () at checkpointer.c:481
#8  0x000000000051fcd5 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffd82122400) at bootstrap.c:461
#9  0x00000000006ee680 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5392
#10 0x00000000006ef9ca in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2973
#11 <signal handler called>
#12 0x00007f07c012ef53 in __select_nocancel () from /lib64/libc.so.6
#13 0x00000000004833d4 in ServerLoop () at postmaster.c:1668
#14 0x00000000006f106f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x19d4280) at postmaster.c:1377
#15 0x0000000000484cd3 in main (argc=3, argv=0x19d4280) at main.c:228

#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
        path = "pg_tblspc/16401/PG_12_201909212/16460/973123799.10", '\000' <repeats 14 times>, ...
        failures = 1
        sync_in_progress = true
        hstat = {hashp = 0x19fd2f0, curBucket = 1443, curEntry = 0x0}
        entry = 0x1a61260
        absorb_counter = <optimized out>
        processed = 23
        sync_start = {tv_sec = 21582125, tv_nsec = 303557162}
        sync_end = {tv_sec = 21582125, tv_nsec = 303536006}
        sync_diff = <optimized out>
        elapsed = <optimized out>
        longest = 1674
        total_elapsed = 7074
        __func__ = "ProcessSyncRequests"


