checkpointer: PANIC: could not fsync file: No such file or directory - Mailing list pgsql-hackers
From | Justin Pryzby |
---|---|
Subject | checkpointer: PANIC: could not fsync file: No such file or directory |
Date | |
Msg-id | 20191119115759.GI30362@telsasoft.com Whole thread Raw |
Responses |
Re: checkpointer: PANIC: could not fsync file: No such file or directory
|
List | pgsql-hackers |
I (finally) noticed this morning on a server running PG12.1: < 2019-11-15 22:16:07.098 EST >PANIC: could not fsync file "base/16491/1731839470.2": No such file or directory < 2019-11-15 22:16:08.751 EST >LOG: checkpointer process (PID 27388) was terminated by signal 6: Aborted /dev/vdb on /var/lib/pgsql type ext4 (rw,relatime,seclabel,data=ordered) Centos 7.7 qemu/KVM Linux database 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux There's no added tablespaces. Copying Thomas since I wonder if this is related: 3eb77eba Refactor the fsync queue for wider use. I can't find any relation with filenode nor OID matching 1731839470, nor a file named like that (which is maybe no surprise, since it's exactly the issue checkpointer had last week). A backup job would've started at 22:00 and probably would've run until 22:27, except that its backend was interrupted "because of crash of another server process". That uses pg_dump --snapshot. This shows a gap of OIDs between 1721850297 and 1746136569; the tablenames indicate that would've been between 2019-11-15 01:30:03,192 and 04:31:19,348. |SELECT oid, relname FROM pg_class ORDER BY 1 DESC; Ah, I found a maybe relevant log: |2019-11-15 22:15:59.592-05 | duration: 220283.831 ms statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolul So we altered that table (and 100+ others) with a type-promoting alter, starting at 2019-11-15 21:20:51,942. That involves DETACHing all but the most recent partitions, altering the parent, and then iterating over historic children to ALTER and reATTACHing them. (We do this to avoid locking the table for long periods, and to avoid worst-case disk usage). FYI, that server ran PG12.0 since Oct 7 with no issue. I installed pg12.1 at: $ ps -O lstart 27384 PID STARTED S TTY TIME COMMAND 27384 Fri Nov 15 08:13:08 2019 S ? 00:05:54 /usr/pgsql-12/bin/postmaster -D /var/lib/pgsql/12/data/ Core was generated by `postgres: checkpointer '. Program terminated with signal 6, Aborted. (gdb) bt #0 0x00007efc9d8b3337 in raise () from /lib64/libc.so.6 #1 0x00007efc9d8b4a28 in abort () from /lib64/libc.so.6 #2 0x000000000087752a in errfinish (dummy=<optimized out>) at elog.c:552 #3 0x000000000075c8ec in ProcessSyncRequests () at sync.c:398 #4 0x0000000000734dd9 in CheckPointBuffers (flags=flags@entry=256) at bufmgr.c:2588 #5 0x00000000005095e1 in CheckPointGuts (checkPointRedo=26082542473320, flags=flags@entry=256) at xlog.c:9006 #6 0x000000000050ff86 in CreateCheckPoint (flags=flags@entry=256) at xlog.c:8795 #7 0x00000000006e4092 in CheckpointerMain () at checkpointer.c:481 #8 0x000000000051fcd5 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffda8a24340) at bootstrap.c:461 #9 0x00000000006ee680 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5392 #10 0x00000000006ef9ca in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2973 #11 <signal handler called> #12 0x00007efc9d972933 in __select_nocancel () from /lib64/libc.so.6 #13 0x00000000004833d4 in ServerLoop () at postmaster.c:1668 #14 0x00000000006f106f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1601280) at postmaster.c:1377 #15 0x0000000000484cd3 in main (argc=3, argv=0x1601280) at main.c:228 bt f: #3 0x000000000075c8ec in ProcessSyncRequests () at sync.c:398 path = "base/16491/1731839470.2\000m\000e\001\000\000\000\000hЭи\027\000\000\032\364q\000\000\000\000\000\251\202\214\000\000\000\000\000\004\000\000\000\000\000\000\000\251\202\214\000\000\000\000\000@S`\001\000\000\000\000\000>\242\250\375\177\000\000\002\000\000\000\000\000\000\000\340\211b\001\000\000\000\000\002\346\241\000\000\000\000\000\200>\242\250\375\177\000\000\225ኝ\374~\000\000C\000_US.UTF-8\000\374~\000\000\000\000\000\000\000\000\000\000\306ފ\235\374~\000\000LC_MESSAGES/postgres-12.mo\000\000\000\000\000\000\001\000\000\000\000\000\000\000.ފ\235\374~"... failures = 1 sync_in_progress = true hstat = {hashp = 0x1629e00, curBucket = 122, curEntry = 0x0} entry = 0x1658590 absorb_counter = <optimized out> processed = 43
pgsql-hackers by date: