Thread: checkpointer: PANIC: could not fsync file: No such file or directory
I (finally) noticed this morning on a server running PG12.1:

< 2019-11-15 22:16:07.098 EST >PANIC: could not fsync file "base/16491/1731839470.2": No such file or directory
< 2019-11-15 22:16:08.751 EST >LOG: checkpointer process (PID 27388) was terminated by signal 6: Aborted

/dev/vdb on /var/lib/pgsql type ext4 (rw,relatime,seclabel,data=ordered)
Centos 7.7 qemu/KVM
Linux database 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
There's no added tablespaces.

Copying Thomas since I wonder if this is related:
3eb77eba Refactor the fsync queue for wider use.

I can't find any relation with filenode nor OID matching 1731839470, nor a file named like that (which is maybe no surprise, since it's exactly the issue checkpointer had last week).

A backup job would've started at 22:00 and probably would've run until 22:27, except that its backend was interrupted "because of crash of another server process". That uses pg_dump --snapshot.

This shows a gap of OIDs between 1721850297 and 1746136569; the tablenames indicate that would've been between 2019-11-15 01:30:03,192 and 04:31:19,348.
|SELECT oid, relname FROM pg_class ORDER BY 1 DESC;

Ah, I found a maybe relevant log:
|2019-11-15 22:15:59.592-05 | duration: 220283.831 ms statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolul

So we altered that table (and 100+ others) with a type-promoting alter, starting at 2019-11-15 21:20:51,942. That involves DETACHing all but the most recent partitions, altering the parent, and then iterating over historic children to ALTER and reATTACHing them. (We do this to avoid locking the table for long periods, and to avoid worst-case disk usage).

FYI, that server ran PG12.0 since Oct 7 with no issue. I installed pg12.1 at:

$ ps -O lstart 27384
  PID                  STARTED S TTY          TIME COMMAND
27384 Fri Nov 15 08:13:08 2019 S ?        00:05:54 /usr/pgsql-12/bin/postmaster -D /var/lib/pgsql/12/data/

Core was generated by `postgres: checkpointer   '.
Program terminated with signal 6, Aborted.
(gdb) bt
#0  0x00007efc9d8b3337 in raise () from /lib64/libc.so.6
#1  0x00007efc9d8b4a28 in abort () from /lib64/libc.so.6
#2  0x000000000087752a in errfinish (dummy=<optimized out>) at elog.c:552
#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
#4  0x0000000000734dd9 in CheckPointBuffers (flags=flags@entry=256) at bufmgr.c:2588
#5  0x00000000005095e1 in CheckPointGuts (checkPointRedo=26082542473320, flags=flags@entry=256) at xlog.c:9006
#6  0x000000000050ff86 in CreateCheckPoint (flags=flags@entry=256) at xlog.c:8795
#7  0x00000000006e4092 in CheckpointerMain () at checkpointer.c:481
#8  0x000000000051fcd5 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffda8a24340) at bootstrap.c:461
#9  0x00000000006ee680 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5392
#10 0x00000000006ef9ca in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2973
#11 <signal handler called>
#12 0x00007efc9d972933 in __select_nocancel () from /lib64/libc.so.6
#13 0x00000000004833d4 in ServerLoop () at postmaster.c:1668
#14 0x00000000006f106f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1601280) at postmaster.c:1377
#15 0x0000000000484cd3 in main (argc=3, argv=0x1601280) at main.c:228

bt f:
#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
        path = "base/16491/1731839470.2\000m\000e\001\000\000\000\000hЭи\027\000\000\032\364q\000\000\000\000\000\251\202\214\000\000\000\000\000\004\000\000\000\000\000\000\000\251\202\214\000\000\000\000\000@S`\001\000\000\000\000\000>\242\250\375\177\000\000\002\000\000\000\000\000\000\000\340\211b\001\000\000\000\000\002\346\241\000\000\000\000\000\200>\242\250\375\177\000\000\225ኝ\374~\000\000C\000_US.UTF-8\000\374~\000\000\000\000\000\000\000\000\000\000\306ފ\235\374~\000\000LC_MESSAGES/postgres-12.mo\000\000\000\000\000\000\001\000\000\000\000\000\000\000.ފ\235\374~"...
        failures = 1
        sync_in_progress = true
        hstat = {hashp = 0x1629e00, curBucket = 122, curEntry = 0x0}
        entry = 0x1658590
        absorb_counter = <optimized out>
        processed = 43
On Wed, Nov 20, 2019 at 12:58 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> < 2019-11-15 22:16:07.098 EST >PANIC: could not fsync file "base/16491/1731839470.2": No such file or directory
> < 2019-11-15 22:16:08.751 EST >LOG: checkpointer process (PID 27388) was terminated by signal 6: Aborted
>
> /dev/vdb on /var/lib/pgsql type ext4 (rw,relatime,seclabel,data=ordered)
> Centos 7.7 qemu/KVM
> Linux database 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
> There's no added tablespaces.
>
> Copying Thomas since I wonder if this is related:
> 3eb77eba Refactor the fsync queue for wider use.

It could be, since it changed some details about the way that queue worked, and another relevant change is:

9ccdd7f6 PANIC on fsync() failure.

Perhaps we should not panic if we failed to open (not fsync) the file, but it's not the root problem here which is that somehow we thought we should be fsyncing a file that had apparently been removed already (due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).

Usually, if a file is in the fsync queue and then is later removed, we handle that by searching the queue for cancellation messages (since one should always be sent before the file is unlinked), and I think your core file with "failures = 1" tells us that it didn't find a cancellation message. So this seems to indicate a problem, somewhere, in that protocol. That could either be a defect in 3eb77eba or it could have been a pre-existing problem that became a bigger problem due to 9ccdd7f6. Looking into it.

> I can't find any relation with filenode nor OID matching 1731839470, nor a file
> named like that (which is maybe no surprise, since it's exactly the issue
> checkpointer had last week).
>
> A backup job would've started at 22:00 and probably would've run until 22:27,
> except that its backend was interrupted "because of crash of another server
> process".  That uses pg_dump --snapshot.
>
> This shows a gap of OIDs between 1721850297 and 1746136569; the tablenames
> indicate that would've been between 2019-11-15 01:30:03,192 and 04:31:19,348.
> |SELECT oid, relname FROM pg_class ORDER BY 1 DESC;

By the way, it's relfilenode, not oid, that is used in these names (though they start out the same). In a rewrite, the relfilenode changes but the oid stays the same.

> Ah, I found a maybe relevant log:
> |2019-11-15 22:15:59.592-05 | duration: 220283.831 ms statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolul
>
> So we altered that table (and 100+ others) with a type-promoting alter,
> starting at 2019-11-15 21:20:51,942.  That involves DETACHing all but the most
> recent partitions, altering the parent, and then iterating over historic
> children to ALTER and reATTACHing them.  (We do this to avoid locking the table
> for long periods, and to avoid worst-case disk usage).

Hmm.
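To make the cancellation protocol described above concrete, here is a toy, self-contained C model of the idea -- not PostgreSQL source; remember_fsync(), forget_fsync() and the file number are purely illustrative stand-ins for the real request queue and relfilenode:

#include <stdbool.h>
#include <stdio.h>

#define MAX_REQ 16

typedef struct { int filenum; bool forget; } SyncRequest;

static SyncRequest queue[MAX_REQ];
static int nreq = 0;

/* Backend side: ask the checkpointer to fsync a file at the next checkpoint. */
static void remember_fsync(int filenum) { queue[nreq++] = (SyncRequest){filenum, false}; }

/* Backend side: sent just before unlinking, to cancel any pending request. */
static void forget_fsync(int filenum) { queue[nreq++] = (SyncRequest){filenum, true}; }

/* Checkpointer side: fsync everything that was not cancelled afterwards. */
static void process_sync_requests(void)
{
    for (int i = 0; i < nreq; i++)
    {
        if (queue[i].forget)
            continue;
        bool cancelled = false;
        for (int j = i + 1; j < nreq; j++)
            if (queue[j].forget && queue[j].filenum == queue[i].filenum)
                cancelled = true;
        if (cancelled)
            printf("skip fsync of %d (cancel message found)\n", queue[i].filenum);
        else
            printf("fsync %d (would PANIC if the file had been unlinked)\n", queue[i].filenum);
    }
}

int main(void)
{
    remember_fsync(1731839470);   /* a backend dirtied and registered the file */
    forget_fsync(1731839470);     /* correct order: cancel first, then unlink */
    /* If the forget message were missing, the checkpointer would try to
     * fsync a file that no longer exists -- the reported PANIC. */
    process_sync_requests();
    return 0;
}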
On Wed, Nov 20, 2019 at 09:26:53AM +1300, Thomas Munro wrote:
> Perhaps we should not panic if we failed to open (not fsync) the file,
> but it's not the root problem here which is that somehow we thought we
> should be fsyncing a file that had apparently been removed already
> (due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).

FYI, I *do* have scripts which CLUSTER and/or VAC FULL (and REINDEX), but they were disabled due to a crash in 12.0 (which was resolved in 12.1), and they wouldn't have run anyway, since they run after the backup script (which failed) in a shell with set -e.

I think TRUNCATE does that too... but I don't think that any processes which do TRUNCATE are running on that server.

Thanks,
Justin
On Tue, Nov 19, 2019 at 04:49:10PM -0600, Justin Pryzby wrote:
> On Wed, Nov 20, 2019 at 09:26:53AM +1300, Thomas Munro wrote:
> > Perhaps we should not panic if we failed to open (not fsync) the file,
> > but it's not the root problem here which is that somehow we thought we
> > should be fsyncing a file that had apparently been removed already
> > (due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).

Note, the ALTER was (I think) building an index in a parallel process:

2019-11-15 22:16:08.752-05 | 5976  | terminating connection because of crash of another server process
2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
2019-11-15 22:16:08.751-05 | 27384 | terminating any other active server processes
2019-11-15 22:16:07.098-05 | 27388 | could not fsync file "base/16491/1731839470.2": No such file or directory
2019-11-15 22:15:59.592-05 | 19860 | duration: 220283.831 ms statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolulscell TYPE integer USING pmradiothpvolulscell::integer
2019-11-15 22:15:59.459-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/1.0", size 5144576
2019-11-15 22:15:59.458-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/2.0", size 6463488
2019-11-15 22:15:59.456-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/0.0", size 4612096

FYI, that table is *currently* (5 days later):

ts=# \dti+ child.eric_enodeb_cell_201811*
 child | eric_enodeb_cell_201811          | table | telsasoft |                         | 2595 MB |
 child | eric_enodeb_cell_201811_idx      | index | telsasoft | eric_enodeb_cell_201811 | 120 kB  |
 child | eric_enodeb_cell_201811_site_idx | index | telsasoft | eric_enodeb_cell_201811 | 16 MB   |

I don't know if that table is likely to be the one with relfilenode 1731839470 (but it certainly wasn't its index), or if that was maybe a table (or index) from an earlier ALTER. I tentatively think we wouldn't have had any other tables being dropped, partitions pruned, or maintenance commands running.
Checkpoint logs for good measure:

2019-11-15 22:18:26.168-05 | 10388 | checkpoint complete: wrote 2915 buffers (3.0%); 0 WAL file(s) added, 0 removed, 18 recycled; write=30.022 s, sync=0.472 s, total=32.140 s; sync files=107, longest=0.364 s, average=0.004 s; distance=297471 kB, estimate=297471 kB
2019-11-15 22:17:54.028-05 | 10388 | checkpoint starting: time
2019-11-15 22:16:53.753-05 | 10104 | checkpoint complete: wrote 98275 buffers (100.0%); 0 WAL file(s) added, 0 removed, 43 recycled; write=11.040 s, sync=0.675 s, total=11.833 s; sync files=84, longest=0.335 s, average=0.008 s; distance=698932 kB, estimate=698932 kB
2019-11-15 22:16:41.921-05 | 10104 | checkpoint starting: end-of-recovery immediate
2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
2019-11-15 22:15:33.03-05  | 27388 | checkpoint starting: time
2019-11-15 22:15:03.62-05  | 27388 | checkpoint complete: wrote 5436 buffers (5.5%); 0 WAL file(s) added, 0 removed, 45 recycled; write=28.938 s, sync=0.355 s, total=29.711 s; sync files=22, longest=0.174 s, average=0.016 s; distance=740237 kB, estimate=740237 kB
2019-11-15 22:14:33.908-05 | 27388 | checkpoint starting: time

I was trying to reproduce what was happening:

set -x; psql postgres -txc "DROP TABLE IF EXISTS t" -c "CREATE TABLE t(i int unique); INSERT INTO t SELECT generate_series(1,999999)"; echo "begin;SELECT pg_export_snapshot(); SELECT pg_sleep(9)" |psql postgres -At >/tmp/snapshot& sleep 3; snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`; PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t ALTER i TYPE bigint" -c CHECKPOINT; pg_dump -d postgres -t t --snap="$snap" |head -44;

Under v12, with or without the CHECKPOINT command, it fails:
|pg_dump: error: query failed: ERROR: cache lookup failed for index 0
But under v9.5.2 (which I found quickly), without CHECKPOINT, it instead fails like:
|pg_dump: [archiver (db)] query failed: ERROR: cache lookup failed for index 16391
With the CHECKPOINT command, 9.5.2 works, but I don't see why it should be needed, or why it would behave differently (or if it's related to this crash).
On Tue, Nov 19, 2019 at 07:22:26PM -0600, Justin Pryzby wrote:
> I was trying to reproduce what was happening:
> set -x; psql postgres -txc "DROP TABLE IF EXISTS t" -c "CREATE TABLE t(i int unique); INSERT INTO t SELECT generate_series(1,999999)"; echo "begin;SELECT pg_export_snapshot(); SELECT pg_sleep(9)" |psql postgres -At >/tmp/snapshot& sleep 3; snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`; PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t ALTER i TYPE bigint" -c CHECKPOINT; pg_dump -d postgres -t t --snap="$snap" |head -44;
>
> Under v12, with or without the CHECKPOINT command, it fails:
> |pg_dump: error: query failed: ERROR: cache lookup failed for index 0
> But under v9.5.2 (which I found quickly), without CHECKPOINT, it instead fails like:
> |pg_dump: [archiver (db)] query failed: ERROR: cache lookup failed for index 16391
> With the CHECKPOINT command, 9.5.2 works, but I don't see why it should be
> needed, or why it would behave differently (or if it's related to this crash).

Actually, I think that's at least related to documented behavior:

https://www.postgresql.org/docs/12/mvcc-caveats.html
|Some DDL commands, currently only TRUNCATE and the table-rewriting forms of ALTER TABLE, are not MVCC-safe. This means that after the truncation or rewrite commits, the table will appear empty to concurrent transactions, if they are using a snapshot taken before the DDL command committed.

I don't know why CHECKPOINT allows it to work under 9.5, or if it's even related to the PANIC..

Justin
On Thu, 21 Nov 2019 at 09:07, Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Tue, Nov 19, 2019 at 07:22:26PM -0600, Justin Pryzby wrote:
> > I was trying to reproduce what was happening:
> > set -x; psql postgres -txc "DROP TABLE IF EXISTS t" -c "CREATE TABLE t(i int unique); INSERT INTO t SELECT generate_series(1,999999)"; echo "begin;SELECT pg_export_snapshot(); SELECT pg_sleep(9)" |psql postgres -At >/tmp/snapshot& sleep 3; snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`; PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t ALTER i TYPE bigint" -c CHECKPOINT; pg_dump -d postgres -t t --snap="$snap" |head -44;
> >
> > Under v12, with or without the CHECKPOINT command, it fails:
> > |pg_dump: error: query failed: ERROR: cache lookup failed for index 0
> > But under v9.5.2 (which I found quickly), without CHECKPOINT, it instead fails like:
> > |pg_dump: [archiver (db)] query failed: ERROR: cache lookup failed for index 16391
> > With the CHECKPOINT command, 9.5.2 works, but I don't see why it should be
> > needed, or why it would behave differently (or if it's related to this crash).
>
> Actually, I think that's at least related to documented behavior:
> https://www.postgresql.org/docs/12/mvcc-caveats.html
> |Some DDL commands, currently only TRUNCATE and the table-rewriting forms of ALTER TABLE, are not MVCC-safe. This means that after the truncation or rewrite commits, the table will appear empty to concurrent transactions, if they are using a snapshot taken before the DDL command committed.
>
> I don't know why CHECKPOINT allows it to work under 9.5, or if it's even
> related to the PANIC ..
The PANIC is a defense against potential corruptions that can be caused by some kinds of disk errors. It's likely that we used to just ERROR and retry, then the retry would succeed without getting upset.
fsync_fname() is supposed to ignore errors for files that cannot be opened. But that same message may be emitted by a number of other parts of the code, and it looks like you didn't have log_error_verbosity = verbose so we don't have file/line info.
The only other place I see that emits that error where a relation path could be a valid argument is in rewriteheap.c in logical_end_heap_rewrite(). That calls the vfd layer's FileSync() and assumes that any failure is a fsync() syscall failure. But FileSync() can return failure if we fail to reopen the underlying file managed by the vfd too, per FileAccess().
Would there be a legitimate case where a logical rewrite file mapping could vanish without that being a problem? If so, we should probably be more tolerant there.
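The ambiguity described here -- an open() failure surfacing as a "could not fsync" error -- can be illustrated with a standalone sketch (plain C, not the vfd code itself; sync_by_path() is a made-up helper, and the path is just the one from the log):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int sync_by_path(const char *path)
{
    /* Analogous to FileAccess() lazily (re)opening a closed vfd. */
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;              /* open failure, but the caller only sees "sync failed" */
    if (fsync(fd) != 0)
    {
        int save_errno = errno;
        close(fd);
        errno = save_errno;
        return -1;              /* genuine fsync() syscall failure */
    }
    return close(fd);
}

int main(void)
{
    /* If the file was already unlinked, open() fails with ENOENT and the
     * error report below is indistinguishable from an fsync() failure. */
    if (sync_by_path("base/16491/1731839470.2") != 0)
        fprintf(stderr, "could not fsync file: %s\n", strerror(errno));
    return 0;
}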
I looked and found a new "hint".

On Tue, Nov 19, 2019 at 05:57:59AM -0600, Justin Pryzby wrote:
> < 2019-11-15 22:16:07.098 EST >PANIC: could not fsync file "base/16491/1731839470.2": No such file or directory
> < 2019-11-15 22:16:08.751 EST >LOG: checkpointer process (PID 27388) was terminated by signal 6: Aborted

An earlier segment of that relation had been opened successfully and was *still* open:

$ sudo grep 1731839470 /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds
63:/var/lib/pgsql/12/data/base/16491/1731839470

For context:

$ sudo grep / /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds |tail -3
61:/var/lib/pgsql/12/data/base/16491/1757077748
62:/var/lib/pgsql/12/data/base/16491/1756223121.2
63:/var/lib/pgsql/12/data/base/16491/1731839470

So this may be an issue only with relations larger than one segment (but that interpretation could also be very naive).
On Tue, Nov 26, 2019 at 5:21 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I looked and found a new "hint".
>
> On Tue, Nov 19, 2019 at 05:57:59AM -0600, Justin Pryzby wrote:
> > < 2019-11-15 22:16:07.098 EST >PANIC: could not fsync file "base/16491/1731839470.2": No such file or directory
> > < 2019-11-15 22:16:08.751 EST >LOG: checkpointer process (PID 27388) was terminated by signal 6: Aborted
>
> An earlier segment of that relation had been opened successfully and was
> *still* open:
>
> $ sudo grep 1731839470 /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds
> 63:/var/lib/pgsql/12/data/base/16491/1731839470
>
> For context:
>
> $ sudo grep / /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds |tail -3
> 61:/var/lib/pgsql/12/data/base/16491/1757077748
> 62:/var/lib/pgsql/12/data/base/16491/1756223121.2
> 63:/var/lib/pgsql/12/data/base/16491/1731839470
>
> So this may be an issue only with relations larger than one segment (but that
> interpretation could also be very naive).

FTR I have been trying to reproduce this but failing so far. I'm planning to dig some more in the next couple of days.

Yeah, it's a .2 file, which means that it's one that would normally be unlinked after you commit your transaction (unlike a no-suffix file, which would normally be dropped at the next checkpoint after the commit, as our strategy to prevent the relfilenode from being reused before the next checkpoint cycle), but it should normally have had a SYNC_FORGET_REQUEST enqueued for it first. So the question is: how did it come to pass that a .2 file was ENOENT but there was no forget request? Difficult, given the definition of mdunlinkfork(). I wondered if something was going wrong in queue compaction or something like that, but I don't see it. I need to dig into the exact flow in the ALTER case to see if there is something I'm missing there, and perhaps try reproducing it with a tiny segment size to exercise some more multisegment-related code paths.
This same crash occurred on a 2nd server. Also qemu/KVM, but this time on a secondary ZFS tablespace which (fails to) include the missing relfilenode.

Linux database7 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

This is postgresql12-12.1-1PGDG.rhel7.x86_64 (same as the first crash), running since:
|30515 Tue Nov 19 10:04:33 2019 S ?        00:09:54 /usr/pgsql-12/bin/postmaster -D /var/lib/pgsql/12/data/
Before that, this server ran v12.0 since Oct 30 (without crashing).

In this case, the pg_dump --snap finished and released its snapshot at 21:50, and there were no ALTERed tables. I see a temp file written since the previous checkpoint, but not by a parallel worker, as in the previous server's crash. The crash happened while reindexing, though. The "DROP INDEX CONCURRENTLY" is from pg_repack -i, and completed successfully, but is followed immediately by the abort log. The following "REINDEX toast..." failed.

In this case, I *guess* that the missing filenode is due to a dropped index (570627937 or otherwise). I don't see any other CLUSTER, VACUUM FULL, DROP, TRUNCATE or ALTER within that checkpoint interval (note, we have 1 minute checkpoints).

Note, I double checked on the first server which crashed: it definitely wasn't running pg_repack or the reindex script, since I removed pg_repack12 from our servers until 12.1 was installed, to avoid the "concurrently" progress reporting crash fixed at 1cd5bc3c. So I think ALTER TABLE ... TYPE and REINDEX can both trigger this crash, at least on v12.1.

Note, I actually have *full logs*, which I've now saved. But here's an excerpt:

postgres=# SELECT log_time, message FROM ckpt_crash WHERE log_time BETWEEN '2019-11-26 23:40:20' AND '2019-11-26 23:48:58' AND user_name IS NULL ORDER BY 1;
2019-11-26 23:40:20.139-05 | checkpoint starting: time
2019-11-26 23:40:50.069-05 | checkpoint complete: wrote 11093 buffers (5.6%); 0 WAL file(s) added, 0 removed, 12 recycled; write=29.885 s, sync=0.008 s, total=29.930 s; sync files=71, longest=0.001 s, average=0.000 s; distance=193388 kB, estimate=550813 kB
2019-11-26 23:41:16.234-05 | automatic analyze of table "postgres.public.postgres_log_2019_11_26_2300" system usage: CPU: user: 3.00 s, system: 0.19 s, elapsed: 10.92 s
2019-11-26 23:41:20.101-05 | checkpoint starting: time
2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory
2019-11-26 23:42:04.397-05 | checkpointer process (PID 30560) was terminated by signal 6: Aborted
2019-11-26 23:42:04.397-05 | terminating any other active server processes
2019-11-26 23:42:04.397-05 | terminating connection because of crash of another server process
2019-11-26 23:42:04.42-05  | terminating connection because of crash of another server process
2019-11-26 23:42:04.493-05 | all server processes terminated; reinitializing
2019-11-26 23:42:05.651-05 | database system was interrupted; last known up at 2019-11-27 00:40:50 -04
2019-11-26 23:47:30.404-05 | database system was not properly shut down; automatic recovery in progress
2019-11-26 23:47:30.435-05 | redo starts at 3450/1B202938
2019-11-26 23:47:54.501-05 | redo done at 3450/205CE960
2019-11-26 23:47:54.501-05 | invalid record length at 3450/205CEA18: wanted 24, got 0
2019-11-26 23:47:54.567-05 | checkpoint starting: end-of-recovery immediate
2019-11-26 23:47:57.365-05 | checkpoint complete: wrote 3287 buffers (1.7%); 0 WAL file(s) added, 0 removed, 5 recycled; write=2.606 s, sync=0.183 s, total=2.798 s; sync files=145, longest=0.150 s, average=0.001 s; distance=85808 kB, estimate=85808 kB
2019-11-26 23:47:57.769-05 | database system is ready to accept connections
2019-11-26 23:48:57.774-05 | checkpoint starting: time

< 2019-11-27 00:42:04.342 -04 postgres >LOG: duration: 13.028 ms statement: DROP INDEX CONCURRENTLY "child"."index_570627937"
< 2019-11-27 00:42:04.397 -04 >LOG: checkpointer process (PID 30560) was terminated by signal 6: Aborted
< 2019-11-27 00:42:04.397 -04 >LOG: terminating any other active server processes
< 2019-11-27 00:42:04.397 -04 >WARNING: terminating connection because of crash of another server process
< 2019-11-27 00:42:04.397 -04 >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
< 2019-11-27 00:42:04.397 -04 >HINT: In a moment you should be able to reconnect to the database and repeat your command.
...
< 2019-11-27 00:42:04.421 -04 postgres >STATEMENT: begin; LOCK TABLE child.ericsson_sgsn_ss7_remote_sp_201911 IN SHARE MODE; REINDEX INDEX pg_toast.pg_toast_570627929_index; commit

Here are all the nondefault settings which seem plausibly relevant or interesting:

 autovacuum_analyze_scale_factor | 0.005              |     | configuration file
 autovacuum_analyze_threshold    | 2                  |     | configuration file
 checkpoint_timeout              | 60                 | s   | configuration file
 max_files_per_process           | 1000               |     | configuration file
 max_stack_depth                 | 2048               | kB  | environment variable
 max_wal_size                    | 4096               | MB  | configuration file
 min_wal_size                    | 4096               | MB  | configuration file
 shared_buffers                  | 196608             | 8kB | configuration file
 shared_preload_libraries        | pg_stat_statements |     | configuration file
 wal_buffers                     | 2048               | 8kB | override
 wal_compression                 | on                 |     | configuration file
 wal_segment_size                | 16777216           | B   | override

(gdb) bt
#0  0x00007f07c0070207 in raise () from /lib64/libc.so.6
#1  0x00007f07c00718f8 in abort () from /lib64/libc.so.6
#2  0x000000000087752a in errfinish (dummy=<optimized out>) at elog.c:552
#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
#4  0x0000000000734dd9 in CheckPointBuffers (flags=flags@entry=256) at bufmgr.c:2588
#5  0x00000000005095e1 in CheckPointGuts (checkPointRedo=57518713529016, flags=flags@entry=256) at xlog.c:9006
#6  0x000000000050ff86 in CreateCheckPoint (flags=flags@entry=256) at xlog.c:8795
#7  0x00000000006e4092 in CheckpointerMain () at checkpointer.c:481
#8  0x000000000051fcd5 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffd82122400) at bootstrap.c:461
#9  0x00000000006ee680 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5392
#10 0x00000000006ef9ca in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2973
#11 <signal handler called>
#12 0x00007f07c012ef53 in __select_nocancel () from /lib64/libc.so.6
#13 0x00000000004833d4 in ServerLoop () at postmaster.c:1668
#14 0x00000000006f106f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x19d4280) at postmaster.c:1377
#15 0x0000000000484cd3 in main (argc=3, argv=0x19d4280) at main.c:228

#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
        path = "pg_tblspc/16401/PG_12_201909212/16460/973123799.10", '\000' <repeats 14 times>, ...
        failures = 1
        sync_in_progress = true
        hstat = {hashp = 0x19fd2f0, curBucket = 1443, curEntry = 0x0}
        entry = 0x1a61260
        absorb_counter = <optimized out>
        processed = 23
        sync_start = {tv_sec = 21582125, tv_nsec = 303557162}
        sync_end = {tv_sec = 21582125, tv_nsec = 303536006}
        sync_diff = <optimized out>
        elapsed = <optimized out>
        longest = 1674
        total_elapsed = 7074
        __func__ = "ProcessSyncRequests"
On Wed, Nov 27, 2019 at 7:53 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> 2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory

I managed to reproduce this (see below). I think I know what the problem is: mdsyncfiletag() uses _mdfd_getseg() to open the segment to be fsync'd, but that function opens all segments up to the one you requested, so if a lower-numbered segment has already been unlinked, it can fail. Usually that's unlikely because it's hard to get the request queue to fill up and therefore hard to split up the cancel requests for all the segments for a relation, but your workload and the repro below do it. In fact, the path it shows in the error message is not even the problem file, that's the one it really wanted, but first it was trying to open lower-numbered ones.

I can see a couple of solutions to the problem (unlink in reverse order, send all the forget messages first before unlinking anything, or go back to using a single atomic "forget everything for this rel" message instead of per-segment messages), but I'll have to think more about that tomorrow.

=== repro ===

Recompile with RELSEG_SIZE 2 in pg_config.h. Run with checkpoint_timeout=30s and shared_buffers=128kB. Then:

create table t (i int primary key);
cluster t using t_pkey;
insert into t select generate_series(1, 10000);

Session 1:

cluster t;
\watch 1

Session 2:

update t set i = i;
\watch 1.1
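The failure mode can be sketched in isolation like this (a self-contained C analogue of the "open every segment up to the requested one" behaviour, not the actual _mdfd_getseg() code; open_segment() and get_segment() are illustrative, and the path is just the relfilenode from the log):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int open_segment(const char *relpath, int segno)
{
    char path[256];
    if (segno == 0)
        snprintf(path, sizeof(path), "%s", relpath);
    else
        snprintf(path, sizeof(path), "%s.%d", relpath, segno);
    return open(path, O_RDWR);
}

/* Analogue of the problem: walk up from segment 0 to reach the target. */
static int get_segment(const char *relpath, int target_segno)
{
    int fd = -1;
    for (int segno = 0; segno <= target_segno; segno++)
    {
        if (fd >= 0)
            close(fd);                       /* drop the previous segment */
        fd = open_segment(relpath, segno);
        if (fd < 0)
            return -1;                       /* a missing lower segment fails the whole lookup */
    }
    return fd;                               /* caller would fsync this */
}

int main(void)
{
    /* Even if the ".10" segment still exists, an already-unlinked lower
     * segment makes the request fail, and the error names the ".10" path. */
    if (get_segment("pg_tblspc/16401/PG_12_201909212/16460/973123799", 10) < 0)
        perror("could not open requested segment");
    return 0;
}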
On Fri, Nov 29, 2019 at 3:13 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Nov 27, 2019 at 7:53 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > 2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory
>
> I managed to reproduce this (see below).  I think I know what the
> problem is: mdsyncfiletag() uses _mdfd_getseg() to open the segment to
> be fsync'd, but that function opens all segments up to the one you
> requested, so if a lower-numbered segment has already been unlinked,
> it can fail.  Usually that's unlikely because it's hard to get the
> request queue to fill up and therefore hard to split up the cancel
> requests for all the segments for a relation, but your workload and
> the repro below do it.  In fact, the path it shows in the error
> message is not even the problem file, that's the one it really wanted,
> but first it was trying to open lower-numbered ones.  I can see a
> couple of solutions to the problem (unlink in reverse order, send all
> the forget messages first before unlinking anything, or go back to
> using a single atomic "forget everything for this rel" message instead
> of per-segment messages), but I'll have to think more about that
> tomorrow.

Here is a patch that fixes the problem by sending all the SYNC_FORGET_REQUEST messages up front.
Attachment
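The patch itself is in the attachment, but the ordering change it describes can be sketched roughly as follows (standalone toy C, not md.c; forget_fsync() and segment_count() are illustrative stubs, and the paths here don't exist, so in this demo unlink() merely reports ENOENT):

#include <stdio.h>
#include <unistd.h>

/* Stubs standing in for the real request queue and segment-count logic. */
static void forget_fsync(const char *path) { printf("forget request: %s\n", path); }
static int  segment_count(void)            { return 3; }   /* illustrative */

static void segment_path(char *buf, size_t len, const char *rel, int segno)
{
    if (segno == 0)
        snprintf(buf, len, "%s", rel);
    else
        snprintf(buf, len, "%s.%d", rel, segno);
}

static void drop_relation(const char *rel)
{
    char path[256];

    /* Pass 1: cancel any queued fsyncs for *every* segment up front. */
    for (int segno = 0; segno < segment_count(); segno++)
    {
        segment_path(path, sizeof(path), rel, segno);
        forget_fsync(path);
    }

    /* Pass 2: only now remove the files, so the checkpointer can never be
     * left holding a request for a segment whose neighbours are gone. */
    for (int segno = 0; segno < segment_count(); segno++)
    {
        segment_path(path, sizeof(path), rel, segno);
        if (unlink(path) != 0)
            perror(path);
    }
}

int main(void)
{
    drop_relation("base/16491/1731839470");   /* illustrative relfilenode */
    return 0;
}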
On Fri, Nov 29, 2019 at 10:50:36AM +1300, Thomas Munro wrote:
> On Fri, Nov 29, 2019 at 3:13 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, Nov 27, 2019 at 7:53 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > 2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory
> >
> > I managed to reproduce this (see below).  I think I know what the
>
> Here is a patch that fixes the problem by sending all the
> SYNC_FORGET_REQUEST messages up front.

I managed to reproduce it too, but my recipe is crummy enough that I'm not even going to send it..

I confirmed that patch also seems to work for my worse recipe.

Thanks,
Justin
On Fri, Nov 29, 2019 at 11:14 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Fri, Nov 29, 2019 at 10:50:36AM +1300, Thomas Munro wrote:
> > On Fri, Nov 29, 2019 at 3:13 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > On Wed, Nov 27, 2019 at 7:53 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > > 2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory
> > >
> > > I managed to reproduce this (see below).  I think I know what the
> >
> > Here is a patch that fixes the problem by sending all the
> > SYNC_FORGET_REQUEST messages up front.
>
> I managed to reproduce it too, but my recipe is crummy enough that I'm not even
> going to send it..
>
> I confirmed that patch also seems to work for my worse recipe.

Thanks. One thing I'm wondering about is what happens if you encounter EPERM, EIO etc. while probing the existing files, so that we give up early and don't deal with some higher-numbered files.

(1) We'll still have unlinked the lower-numbered files, and we'll leave the higher-numbered files where they are, which might confuse matters if the relfilenode number is later recycled, so that the zombie segments appear to belong to the new relfilenode. That is longstanding PostgreSQL behaviour not changed by commit 3eb77eba or this patch IIUC (despite moving a few things around), and I guess it's unlikely to bite you considering all the coincidences required, but if there's a transient EPERM (think Windows virus checker opening files without the shared read flag), it's not inconceivable. One solution to that would be not to queue the unlink request for segment 0 if anything goes wrong while unlinking the other segments (in other words: if you can't unlink all the segments, deliberately leak segment 0 and thus the *whole relfilenode*, not just the problem file(s)).

(2) Even with the fix I just proposed, if you give up early due to EPERM, EIO etc., there might still be sync requests queued for high-numbered segments, so you could reach the PANIC case. It's likely that, once you've hit such an error, the checkpointer is going to be panicking anyway when it starts seeing similar errors, but still, it'd be possible to do better here (especially if the error was transient and short-lived). If we don't accept that behaviour, we could switch (back) to a single cancel message that can whack every request relating to the relation (essentially as it was in 11, though it requires a bit more work for the new design), or stop using _mdfd_getseg() for this so that you can remove segments independently without worrying about sync requests for other segments (it was actually like that in an earlier version of the patch for commit 3eb77eba, but someone complained that it didn't benefit from fd caching).
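For what it's worth, the idea in (1) might look roughly like this as a standalone sketch (illustrative only, not a proposed patch; the path and segment count are made up):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool unlink_higher_segments(const char *rel, int nsegments)
{
    char path[256];
    bool all_gone = true;

    for (int segno = 1; segno < nsegments; segno++)
    {
        snprintf(path, sizeof(path), "%s.%d", rel, segno);
        if (unlink(path) != 0 && errno != ENOENT)
        {
            perror(path);
            all_gone = false;   /* e.g. a transient EPERM from a virus checker */
        }
    }
    return all_gone;
}

static void drop_relation(const char *rel, int nsegments)
{
    if (unlink_higher_segments(rel, nsegments))
        unlink(rel);            /* safe to let segment 0 (and the relfilenode) go */
    else
        fprintf(stderr, "leaking %s so the relfilenode cannot be reused\n", rel);
}

int main(void)
{
    drop_relation("base/16491/1731839470", 3);   /* illustrative */
    return 0;
}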
On Fri, Nov 29, 2019 at 12:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... or stop using
> _mdfd_getseg() for this so that you can remove segments independently
> without worrying about sync requests for other segments (it was
> actually like that in an earlier version of the patch for commit
> 3eb77eba, but someone complained that it didn't benefit from fd
> caching).

Not sure which approach I prefer yet, but here's a patch showing that alternative.
Attachment
On Sat, Nov 30, 2019 at 10:57 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Nov 29, 2019 at 12:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > ... or stop using
> > _mdfd_getseg() for this so that you can remove segments independently
> > without worrying about sync requests for other segments (it was
> > actually like that in an earlier version of the patch for commit
> > 3eb77eba, but someone complained that it didn't benefit from fd
> > caching).
>
> Not sure which approach I prefer yet, but here's a patch showing that
> alternative.

Here's a better version: it uses the existing fd if we have it already in md_seg_fds, but opens and closes a transient one if not.
Attachment
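A rough standalone sketch of that approach (not the committed change; lookup_cached_fd() is a hypothetical stand-in for the md_seg_fds lookup, and the path is the segment from the first crash):

#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for the md_seg_fds lookup; returns -1 when the
 * segment is not already open in this process. */
static int lookup_cached_fd(const char *path) { (void) path; return -1; }

static int sync_one_segment(const char *path)
{
    bool transient = false;
    int  fd = lookup_cached_fd(path);

    if (fd < 0)
    {
        fd = open(path, O_RDWR);    /* open just this segment, nothing below it */
        if (fd < 0)
            return -1;
        transient = true;
    }

    int rc = fsync(fd);
    if (transient)
        close(fd);                  /* don't let the checkpointer bloat the fd cache */
    return rc;
}

int main(void)
{
    if (sync_one_segment("base/16491/1731839470.2") != 0)
        perror("could not fsync file");
    return 0;
}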
On Fri, Dec 13, 2019 at 5:41 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a better version: it uses the existing fd if we have it already
> in md_seg_fds, but opens and closes a transient one if not.

Pushed.
On Sat, Dec 14, 2019 at 4:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Dec 13, 2019 at 5:41 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Here's a better version: it uses the existing fd if we have it already
> > in md_seg_fds, but opens and closes a transient one if not.
>
> Pushed.

Build farm not happy... checking...
On Sat, Dec 14, 2019 at 5:05 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Pushed.
>
> Build farm not happy... checking...

Hrmph. FileGetRawDesc() does not contain a call to FileAccess(), so this is failing on low-fd-limit systems. Looking into a way to fix that...
On Sat, Dec 14, 2019 at 5:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Dec 14, 2019 at 5:05 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > Pushed.
> >
> > Build farm not happy... checking...
>
> Hrmph.  FileGetRawDesc() does not contain a call to FileAccess(), so
> this is failing on low-fd-limit systems.  Looking into a way to fix
> that...

Seemed best not to use FileGetRawDesc(). Rewritten to use only File, and tested with the torture-test mentioned upthread under ulimit -n 50.