Re: checkpointer: PANIC: could not fsync file: No such file or directory - Mailing list pgsql-hackers
From: Justin Pryzby
Subject: Re: checkpointer: PANIC: could not fsync file: No such file or directory
Date:
Msg-id: 20191127065313.GU16647@telsasoft.com
In response to: checkpointer: PANIC: could not fsync file: No such file or directory (Justin Pryzby <pryzby@telsasoft.com>)
Responses: Re: checkpointer: PANIC: could not fsync file: No such file or directory
List: pgsql-hackers
This same crash occurred on a 2nd server. Also qemu/KVM, but this time on a secondary ZFS tablespace, which (fails to) include the missing relfilenode.

Linux database7 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

This is postgresql12-12.1-1PGDG.rhel7.x86_64 (same as the first crash), running since:
|30515 Tue Nov 19 10:04:33 2019 S ? 00:09:54 /usr/pgsql-12/bin/postmaster -D /var/lib/pgsql/12/data/

Before that, this server ran v12.0 since Oct 30 (without crashing).

In this case, the pg_dump --snap finished and released its snapshot at 21:50, and there were no ALTERed tables. I see a temp file written since the previous checkpoint, but not by a parallel worker, as in the previous server's crash.

The crash happened while reindexing, though. The "DROP INDEX CONCURRENTLY" is from pg_repack -i, and completed successfully, but is followed immediately by the abort log. The following "REINDEX toast..." failed.

In this case, I *guess* that the missing filenode is due to a dropped index (570627937 or otherwise). I don't see any other CLUSTER, VACUUM FULL, DROP, TRUNCATE or ALTER within that checkpoint interval (note, we have 1-minute checkpoints).

Note, I double checked on the first server which crashed: it definitely wasn't running pg_repack or the reindex script, since I removed pg_repack12 from our servers until 12.1 was installed, to avoid the "concurrently" progress reporting crash fixed at 1cd5bc3c. So I think ALTER TABLE TYPE and REINDEX can both trigger this crash, at least on v12.1.

Note I actually have *full logs*, which I've now saved. But here's an excerpt:

postgres=# SELECT log_time, message FROM ckpt_crash WHERE log_time BETWEEN '2019-11-26 23:40:20' AND '2019-11-26 23:48:58' AND user_name IS NULL ORDER BY 1;
 2019-11-26 23:40:20.139-05 | checkpoint starting: time
 2019-11-26 23:40:50.069-05 | checkpoint complete: wrote 11093 buffers (5.6%); 0 WAL file(s) added, 0 removed, 12 recycled; write=29.885 s, sync=0.008 s, total=29.930 s; sync files=71, longest=0.001 s, average=0.000 s; distance=193388 kB, estimate=550813 kB
 2019-11-26 23:41:16.234-05 | automatic analyze of table "postgres.public.postgres_log_2019_11_26_2300" system usage: CPU: user: 3.00 s, system: 0.19 s, elapsed: 10.92 s
 2019-11-26 23:41:20.101-05 | checkpoint starting: time
 2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory
 2019-11-26 23:42:04.397-05 | checkpointer process (PID 30560) was terminated by signal 6: Aborted
 2019-11-26 23:42:04.397-05 | terminating any other active server processes
 2019-11-26 23:42:04.397-05 | terminating connection because of crash of another server process
 2019-11-26 23:42:04.42-05  | terminating connection because of crash of another server process
 2019-11-26 23:42:04.493-05 | all server processes terminated; reinitializing
 2019-11-26 23:42:05.651-05 | database system was interrupted; last known up at 2019-11-27 00:40:50 -04
 2019-11-26 23:47:30.404-05 | database system was not properly shut down; automatic recovery in progress
 2019-11-26 23:47:30.435-05 | redo starts at 3450/1B202938
 2019-11-26 23:47:54.501-05 | redo done at 3450/205CE960
 2019-11-26 23:47:54.501-05 | invalid record length at 3450/205CEA18: wanted 24, got 0
 2019-11-26 23:47:54.567-05 | checkpoint starting: end-of-recovery immediate
 2019-11-26 23:47:57.365-05 | checkpoint complete: wrote 3287 buffers (1.7%); 0 WAL file(s) added, 0 removed, 5 recycled; write=2.606 s, sync=0.183 s, total=2.798 s; sync files=145, longest=0.150 s, average=0.001 s; distance=85808 kB, estimate=85808 kB
 2019-11-26 23:47:57.769-05 | database system is ready to accept connections
 2019-11-26 23:48:57.774-05 | checkpoint starting: time

And the same crash as it appears in the log file (these timestamps are in the server's -04 timezone):

< 2019-11-27 00:42:04.342 -04 postgres >LOG: duration: 13.028 ms statement: DROP INDEX CONCURRENTLY "child"."index_570627937"
< 2019-11-27 00:42:04.397 -04 >LOG: checkpointer process (PID 30560) was terminated by signal 6: Aborted
< 2019-11-27 00:42:04.397 -04 >LOG: terminating any other active server processes
< 2019-11-27 00:42:04.397 -04 >WARNING: terminating connection because of crash of another server process
< 2019-11-27 00:42:04.397 -04 >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
< 2019-11-27 00:42:04.397 -04 >HINT: In a moment you should be able to reconnect to the database and repeat your command.
...
< 2019-11-27 00:42:04.421 -04 postgres >STATEMENT: begin; LOCK TABLE child.ericsson_sgsn_ss7_remote_sp_201911 IN SHARE MODE;REINDEX INDEX pg_toast.pg_toast_570627929_index;commit
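(To double-check the claim that nothing else was dropped, truncated or rewritten in that interval, a search of the imported csvlog along the lines below is enough. This is only a sketch: the ckpt_crash table and its log_time/user_name/message columns are the same ones used in the excerpt above, but the time window and the regex are rough and would need adjusting by hand.)

SELECT log_time, user_name, message
FROM ckpt_crash
-- roughly: from the start of the failing checkpoint through the abort
WHERE log_time BETWEEN '2019-11-26 23:41:20' AND '2019-11-26 23:42:05'
AND message ~* '\m(CLUSTER|VACUUM FULL|DROP|TRUNCATE|ALTER|REINDEX)\M'
ORDER BY 1;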
Here's all the nondefault settings which seem plausibly relevant or interesting:

 autovacuum_analyze_scale_factor | 0.005              |     | configuration file
 autovacuum_analyze_threshold    | 2                  |     | configuration file
 checkpoint_timeout              | 60                 | s   | configuration file
 max_files_per_process           | 1000               |     | configuration file
 max_stack_depth                 | 2048               | kB  | environment variable
 max_wal_size                    | 4096               | MB  | configuration file
 min_wal_size                    | 4096               | MB  | configuration file
 shared_buffers                  | 196608             | 8kB | configuration file
 shared_preload_libraries        | pg_stat_statements |     | configuration file
 wal_buffers                     | 2048               | 8kB | override
 wal_compression                 | on                 |     | configuration file
 wal_segment_size                | 16777216           | B   | override
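(That's a hand-picked subset; the full set of nondefault GUCs, with the same name/setting/unit/source columns, can be pulled from pg_settings with something like the query below -- just a sketch, the source filter is a matter of taste.)

SELECT name, setting, unit, source
FROM pg_settings
WHERE source <> 'default'  -- 'configuration file', 'environment variable', 'override', ...
ORDER BY name;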
(gdb) bt
#0  0x00007f07c0070207 in raise () from /lib64/libc.so.6
#1  0x00007f07c00718f8 in abort () from /lib64/libc.so.6
#2  0x000000000087752a in errfinish (dummy=<optimized out>) at elog.c:552
#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
#4  0x0000000000734dd9 in CheckPointBuffers (flags=flags@entry=256) at bufmgr.c:2588
#5  0x00000000005095e1 in CheckPointGuts (checkPointRedo=57518713529016, flags=flags@entry=256) at xlog.c:9006
#6  0x000000000050ff86 in CreateCheckPoint (flags=flags@entry=256) at xlog.c:8795
#7  0x00000000006e4092 in CheckpointerMain () at checkpointer.c:481
#8  0x000000000051fcd5 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffd82122400) at bootstrap.c:461
#9  0x00000000006ee680 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5392
#10 0x00000000006ef9ca in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2973
#11 <signal handler called>
#12 0x00007f07c012ef53 in __select_nocancel () from /lib64/libc.so.6
#13 0x00000000004833d4 in ServerLoop () at postmaster.c:1668
#14 0x00000000006f106f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x19d4280) at postmaster.c:1377
#15 0x0000000000484cd3 in main (argc=3, argv=0x19d4280) at main.c:228

#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
        path = "pg_tblspc/16401/PG_12_201909212/16460/973123799.10", '\000' <repeats 14 times>, ...
        failures = 1
        sync_in_progress = true
        hstat = {hashp = 0x19fd2f0, curBucket = 1443, curEntry = 0x0}
        entry = 0x1a61260
        absorb_counter = <optimized out>
        processed = 23
        sync_start = {tv_sec = 21582125, tv_nsec = 303557162}
        sync_end = {tv_sec = 21582125, tv_nsec = 303536006}
        sync_diff = <optimized out>
        elapsed = <optimized out>
        longest = 1674
        total_elapsed = 7074
        __func__ = "ProcessSyncRequests"
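(For reference, the path in that frame decodes as tablespace OID 16401, database OID 16460, relfilenode 973123799, segment 10. After recovery, from a session connected to that database, something like the query below shows whether the filenode still maps to a relation; if the dropped-index guess above is right, it should return NULL. Sketch only.)

-- NULL means no relation in the current database currently has this filenode
SELECT pg_filenode_relation(16401, 973123799);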