From: Justin Pryzby
Subject: checkpointer: PANIC: could not fsync file: No such file or directory
Date:
Msg-id: 20191119115759.GI30362@telsasoft.com

I (finally) noticed this morning on a server running PG12.1:

< 2019-11-15 22:16:07.098 EST  >PANIC:  could not fsync file "base/16491/1731839470.2": No such file or directory
< 2019-11-15 22:16:08.751 EST  >LOG:  checkpointer process (PID 27388) was terminated by signal 6: Aborted

/dev/vdb on /var/lib/pgsql type ext4 (rw,relatime,seclabel,data=ordered)
Centos 7.7 qemu/KVM
Linux database 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
There are no added tablespaces.

Copying Thomas since I wonder if this is related:
3eb77eba Refactor the fsync queue for wider use.

I can't find any relation with a filenode or OID matching 1731839470, nor a file named
like that (which is maybe no surprise, since that's exactly the issue the checkpointer
hit last week).
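
The check I mean was along these lines (a sketch, not necessarily the exact queries;
16491 is the database subdirectory the missing file was under, and pg_filenode_relation()
takes 0 for the default tablespace, which base/ corresponds to):

  -- Look for a matching relation by OID or filenode:
  SELECT oid, relfilenode, relname
  FROM pg_class
  WHERE oid = 1731839470 OR relfilenode = 1731839470;

  -- Or, connected to that database, map the filenode back to a relation:
  SELECT pg_filenode_relation(0, 1731839470);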

A backup job would've started at 22:00 and probably would've run until 22:27, except that
its backend was interrupted "because of crash of another server process".  That job uses
pg_dump --snapshot.

The following shows a gap of OIDs between 1721850297 and 1746136569; the table names
indicate that gap would've been created between 2019-11-15 01:30:03,192 and 04:31:19,348:
|SELECT oid, relname FROM pg_class ORDER BY 1 DESC;
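
A query along these lines surfaces such a gap directly (a sketch, not necessarily the
exact query; the ::bigint casts are only there because oid has no subtraction operator):

  SELECT oid, relname,
         oid::bigint - lag(oid::bigint) OVER (ORDER BY oid) AS gap_from_previous
  FROM pg_class
  ORDER BY gap_from_previous DESC NULLS LAST
  LIMIT 10;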

Ah, I found a possibly relevant log entry:
|2019-11-15 22:15:59.592-05 | duration: 220283.831 ms  statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolul

So we had altered that table (and 100+ others) with a type-promoting ALTER, starting at
2019-11-15 21:20:51,942.  That involves DETACHing all but the most recent partitions,
ALTERing the parent, and then iterating over the historic children to ALTER and
re-ATTACH them.  (We do this to avoid locking the table for long periods, and to avoid
worst-case disk usage.)
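
Per table, that sequence looks roughly like this (a sketch with made-up parent,
partition, column, and bound names, not our actual DDL):

  -- 1. Detach all but the most recent partition(s).
  ALTER TABLE parent_tbl DETACH PARTITION parent_tbl_201811;

  -- 2. Promote the column type on the parent; this recurses only into the
  --    partitions still attached, so the rewrite and lock stay short.
  ALTER TABLE parent_tbl ALTER COLUMN some_counter TYPE bigint;

  -- 3. For each historic child: alter it to the matching type, then reattach.
  ALTER TABLE parent_tbl_201811 ALTER COLUMN some_counter TYPE bigint;
  ALTER TABLE parent_tbl ATTACH PARTITION parent_tbl_201811
      FOR VALUES FROM ('2018-11-01') TO ('2018-12-01');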

FYI, that server had been running PG12.0 since Oct 7 with no issues.
I installed PG12.1 at:
$ ps -O lstart 27384
  PID                  STARTED S TTY          TIME COMMAND
27384 Fri Nov 15 08:13:08 2019 S ?        00:05:54 /usr/pgsql-12/bin/postmaster -D /var/lib/pgsql/12/data/

Core was generated by `postgres: checkpointer                                 '.
Program terminated with signal 6, Aborted.

(gdb) bt
#0  0x00007efc9d8b3337 in raise () from /lib64/libc.so.6
#1  0x00007efc9d8b4a28 in abort () from /lib64/libc.so.6
#2  0x000000000087752a in errfinish (dummy=<optimized out>) at elog.c:552
#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
#4  0x0000000000734dd9 in CheckPointBuffers (flags=flags@entry=256) at bufmgr.c:2588
#5  0x00000000005095e1 in CheckPointGuts (checkPointRedo=26082542473320, flags=flags@entry=256) at xlog.c:9006
#6  0x000000000050ff86 in CreateCheckPoint (flags=flags@entry=256) at xlog.c:8795
#7  0x00000000006e4092 in CheckpointerMain () at checkpointer.c:481
#8  0x000000000051fcd5 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffda8a24340) at bootstrap.c:461
#9  0x00000000006ee680 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5392
#10 0x00000000006ef9ca in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2973
#11 <signal handler called>
#12 0x00007efc9d972933 in __select_nocancel () from /lib64/libc.so.6
#13 0x00000000004833d4 in ServerLoop () at postmaster.c:1668
#14 0x00000000006f106f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1601280) at postmaster.c:1377
#15 0x0000000000484cd3 in main (argc=3, argv=0x1601280) at main.c:228

bt f:
#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
        path =
"base/16491/1731839470.2\000m\000e\001\000\000\000\000hЭи\027\000\000\032\364q\000\000\000\000\000\251\202\214\000\000\000\000\000\004\000\000\000\000\000\000\000\251\202\214\000\000\000\000\000@S`\001\000\000\000\000\000>\242\250\375\177\000\000\002\000\000\000\000\000\000\000\340\211b\001\000\000\000\000\002\346\241\000\000\000\000\000\200>\242\250\375\177\000\000\225ኝ\374~\000\000C\000_US.UTF-8\000\374~\000\000\000\000\000\000\000\000\000\000\306ފ\235\374~\000\000LC_MESSAGES/postgres-12.mo\000\000\000\000\000\000\001\000\000\000\000\000\000\000.ފ\235\374~"...
        failures = 1
        sync_in_progress = true
        hstat = {hashp = 0x1629e00, curBucket = 122, curEntry = 0x0}
        entry = 0x1658590
        absorb_counter = <optimized out>
        processed = 43



