Thread: checkpointer: PANIC: could not fsync file: No such file or directory

checkpointer: PANIC: could not fsync file: No such file or directory

From
Justin Pryzby
Date:
I (finally) noticed this morning on a server running PG12.1:

< 2019-11-15 22:16:07.098 EST  >PANIC:  could not fsync file "base/16491/1731839470.2": No such file or directory
< 2019-11-15 22:16:08.751 EST  >LOG:  checkpointer process (PID 27388) was terminated by signal 6: Aborted

/dev/vdb on /var/lib/pgsql type ext4 (rw,relatime,seclabel,data=ordered)
Centos 7.7 qemu/KVM
Linux database 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
There's no added tablespaces.

Copying Thomas since I wonder if this is related:
3eb77eba Refactor the fsync queue for wider use.

I can't find any relation with filenode nor OID matching 1731839470, nor a file
named like that (which is maybe no surprise, since it's exactly the issue
checkpointer had last week).

A backup job would've started at 22:00 and probably would've run until 22:27,
except that its backend was interrupted "because of crash of another server
process".  That uses pg_dump --snapshot.
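
For reference, the backup job's snapshot handling is roughly this (a sketch; the snapshot ID and database name are made up):

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();                 -- returns e.g. '00000003-0000001B-1'
-- the transaction stays open while the dump runs, e.g.:
--   pg_dump --snapshot=00000003-0000001B-1 -d ourdb ...
COMMIT;                                      -- only after pg_dump completes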

This shows a gap of OIDs between 1721850297 and 1746136569; the tablenames
indicate that would've been between 2019-11-15 01:30:03,192 and 04:31:19,348.
|SELECT oid, relname FROM pg_class ORDER BY 1 DESC;

Ah, I found a maybe relevant log:
|2019-11-15 22:15:59.592-05 | duration: 220283.831 ms  statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolul

So we altered that table (and 100+ others) with a type-promoting alter,
starting at 2019-11-15 21:20:51,942.  That involves DETACHing all but the most
recent partitions, altering the parent, and then iterating over historic
children to ALTER and reATTACHing them.  (We do this to avoid locking the table
for long periods, and to avoid worst-case disk usage).
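
To be concrete, each table gets roughly this sequence (a sketch; the parent name and partition bounds are hypothetical, the column is the real one from the log below):

ALTER TABLE child.eric_enodeb_cell DETACH PARTITION child.eric_enodeb_cell_201810;
-- ... detach every historic partition, leaving only the newest attached ...
ALTER TABLE child.eric_enodeb_cell
  ALTER pmradiothpvolulscell TYPE integer USING pmradiothpvolulscell::integer;
-- then, for each historic partition in turn:
ALTER TABLE child.eric_enodeb_cell_201810
  ALTER pmradiothpvolulscell TYPE integer USING pmradiothpvolulscell::integer;
ALTER TABLE child.eric_enodeb_cell ATTACH PARTITION child.eric_enodeb_cell_201810
  FOR VALUES FROM ('2018-10-01') TO ('2018-11-01');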

FYI, that server ran PG12.0 since Oct 7 with no issue.
I installed pg12.1 at:
$ ps -O lstart 27384
  PID                  STARTED S TTY          TIME COMMAND
27384 Fri Nov 15 08:13:08 2019 S ?        00:05:54 /usr/pgsql-12/bin/postmaster -D /var/lib/pgsql/12/data/

Core was generated by `postgres: checkpointer                                 '.
Program terminated with signal 6, Aborted.

(gdb) bt
#0  0x00007efc9d8b3337 in raise () from /lib64/libc.so.6
#1  0x00007efc9d8b4a28 in abort () from /lib64/libc.so.6
#2  0x000000000087752a in errfinish (dummy=<optimized out>) at elog.c:552
#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
#4  0x0000000000734dd9 in CheckPointBuffers (flags=flags@entry=256) at bufmgr.c:2588
#5  0x00000000005095e1 in CheckPointGuts (checkPointRedo=26082542473320, flags=flags@entry=256) at xlog.c:9006
#6  0x000000000050ff86 in CreateCheckPoint (flags=flags@entry=256) at xlog.c:8795
#7  0x00000000006e4092 in CheckpointerMain () at checkpointer.c:481
#8  0x000000000051fcd5 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffda8a24340) at bootstrap.c:461
#9  0x00000000006ee680 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5392
#10 0x00000000006ef9ca in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2973
#11 <signal handler called>
#12 0x00007efc9d972933 in __select_nocancel () from /lib64/libc.so.6
#13 0x00000000004833d4 in ServerLoop () at postmaster.c:1668
#14 0x00000000006f106f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1601280) at postmaster.c:1377
#15 0x0000000000484cd3 in main (argc=3, argv=0x1601280) at main.c:228

bt f:
#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
        path =
"base/16491/1731839470.2\000m\000e\001\000\000\000\000hЭи\027\000\000\032\364q\000\000\000\000\000\251\202\214\000\000\000\000\000\004\000\000\000\000\000\000\000\251\202\214\000\000\000\000\000@S`\001\000\000\000\000\000>\242\250\375\177\000\000\002\000\000\000\000\000\000\000\340\211b\001\000\000\000\000\002\346\241\000\000\000\000\000\200>\242\250\375\177\000\000\225ኝ\374~\000\000C\000_US.UTF-8\000\374~\000\000\000\000\000\000\000\000\000\000\306ފ\235\374~\000\000LC_MESSAGES/postgres-12.mo\000\000\000\000\000\000\001\000\000\000\000\000\000\000.ފ\235\374~"...
        failures = 1
        sync_in_progress = true
        hstat = {hashp = 0x1629e00, curBucket = 122, curEntry = 0x0}
        entry = 0x1658590
        absorb_counter = <optimized out>
        processed = 43




Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Thomas Munro
Date:
On Wed, Nov 20, 2019 at 12:58 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> < 2019-11-15 22:16:07.098 EST  >PANIC:  could not fsync file "base/16491/1731839470.2": No such file or directory
> < 2019-11-15 22:16:08.751 EST  >LOG:  checkpointer process (PID 27388) was terminated by signal 6: Aborted
>
> /dev/vdb on /var/lib/pgsql type ext4 (rw,relatime,seclabel,data=ordered)
> Centos 7.7 qemu/KVM
> Linux database 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
> There's no added tablespaces.
>
> Copying Thomas since I wonder if this is related:
> 3eb77eba Refactor the fsync queue for wider use.

It could be, since it changed some details about the way that queue
worked, and another relevant change is:

9ccdd7f6 PANIC on fsync() failure.

Perhaps we should not panic if we failed to open (not fsync) the file,
but it's not the root problem here which is that somehow we thought we
should be fsyncing a file that had apparently been removed already
(due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).  Usually, if
a file is in the fsync queue and then is later removed, we handle that
by searching the queue for cancellation messages (since one should
always be sent before the file is unlinked), and I think your core
file with "failures = 1" tells us that it didn't find a cancellation
message.  So this seems to indicate a problem, somewhere, in that
protocol.  That could either be a defect in 3eb77eba or it could have
been a pre-existing problem that became a bigger problem due to
9ccdd7f6.

Looking into it.

> I can't find any relation with filenode nor OID matching 1731839470, nor a file
> named like that (which is maybe no surprise, since it's exactly the issue
> checkpointer had last week).
>
> A backup job would've started at 22:00 and probably would've run until 22:27,
> except that its backend was interrupted "because of crash of another server
> process".  That uses pg_dump --snapshot.
>
> This shows a gap of OIDs between 1721850297 and 1746136569; the tablenames
> indicate that would've been between 2019-11-15 01:30:03,192 and 04:31:19,348.
> |SELECT oid, relname FROM pg_class ORDER BY 1 DESC;

By the way, it's relfilenode, not oid, that is used in these names
(though they start out the same).  In a rewrite, the relfilenode
changes but the oid stays the same.
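
For example, this shows the two side by side (a quick sketch using the partition name from your log):

SELECT 'child.eric_enodeb_cell_201811'::regclass::oid,
       pg_relation_filenode('child.eric_enodeb_cell_201811');
-- after a rewriting ALTER / CLUSTER / VACUUM FULL, the first value (oid) is unchanged,
-- while the second (relfilenode, which the on-disk file name follows) changes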

> Ah, I found a maybe relevant log:
> |2019-11-15 22:15:59.592-05 | duration: 220283.831 ms  statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolul
>
> So we altered that table (and 100+ others) with a type-promoting alter,
> starting at 2019-11-15 21:20:51,942.  That involves DETACHing all but the most
> recent partitions, altering the parent, and then iterating over historic
> children to ALTER and reATTACHing them.  (We do this to avoid locking the table
> for long periods, and to avoid worst-case disk usage).

Hmm.



Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Justin Pryzby
Date:
On Wed, Nov 20, 2019 at 09:26:53AM +1300, Thomas Munro wrote:
> Perhaps we should not panic if we failed to open (not fsync) the file,
> but it's not the root problem here which is that somehow we thought we
> should be fsyncing a file that had apparently been removed already
> (due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).

FYI, I *do* have scripts which CLUSTER and/or VACUUM FULL (and REINDEX), but they
were disabled due to a crash in 12.0 (which was resolved in 12.1), and they wouldn't
have run anyway, since they run after the backup script (which failed) in a
shell with set -e.

I think TRUNCATE does that too, but I don't think that any processes which do
TRUNCATE are running on that server.

Thanks,
Justin



Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Justin Pryzby
Date:
On Tue, Nov 19, 2019 at 04:49:10PM -0600, Justin Pryzby wrote:
> On Wed, Nov 20, 2019 at 09:26:53AM +1300, Thomas Munro wrote:
> > Perhaps we should not panic if we failed to open (not fsync) the file,
> > but it's not the root problem here which is that somehow we thought we
> > should be fsyncing a file that had apparently been removed already
> > (due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).

Note, the ALTER was (I think) building an index in a parallel process:

 2019-11-15 22:16:08.752-05 |  5976 | terminating connection because of crash of another server process
 2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
 2019-11-15 22:16:08.751-05 | 27384 | terminating any other active server processes
 2019-11-15 22:16:07.098-05 | 27388 | could not fsync file "base/16491/1731839470.2": No such file or directory
 2019-11-15 22:15:59.592-05 | 19860 | duration: 220283.831 ms  statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolulscell TYPE integer USING pmradiothpvolulscell::integer
 
 2019-11-15 22:15:59.459-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/1.0", size 5144576
 2019-11-15 22:15:59.458-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/2.0", size 6463488
 2019-11-15 22:15:59.456-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/0.0", size 4612096

FYI, that table is *currently* (5 days later):
ts=# \dti+ child.eric_enodeb_cell_201811*
 Schema |               Name               | Type  |   Owner   |          Table          |  Size   | Description
--------+----------------------------------+-------+-----------+-------------------------+---------+-------------
 child  | eric_enodeb_cell_201811          | table | telsasoft |                         | 2595 MB |
 child  | eric_enodeb_cell_201811_idx      | index | telsasoft | eric_enodeb_cell_201811 | 120 kB  |
 child  | eric_enodeb_cell_201811_site_idx | index | telsasoft | eric_enodeb_cell_201811 | 16 MB   |

I don't know if that table is likely to be the one with relfilenode 1731839470
(but it certainly wasn't its index), or if that was maybe a table (or index)
from an earlier ALTER.  I tentatively think we wouldn't have had any other
tables being dropped, partitions pruned or maintenance commands running.
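
For what it's worth, these are the only lookups I know to try (a sketch; the second one presumably returns NULL, since nothing currently maps to that filenode):

SELECT pg_relation_filepath('child.eric_enodeb_cell_201811');  -- the table's *current* file, base/16491/<relfilenode>
SELECT pg_filenode_relation(0, 1731839470);                    -- presumably NULL: no relation uses the filenode from the PANIC any more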

Checkpoint logs for good measure:
 2019-11-15 22:18:26.168-05 | 10388 | checkpoint complete: wrote 2915 buffers (3.0%); 0 WAL file(s) added, 0 removed, 18 recycled; write=30.022 s, sync=0.472 s, total=32.140 s; sync files=107, longest=0.364 s, average=0.004 s; distance=297471 kB, estimate=297471 kB
 2019-11-15 22:17:54.028-05 | 10388 | checkpoint starting: time
 2019-11-15 22:16:53.753-05 | 10104 | checkpoint complete: wrote 98275 buffers (100.0%); 0 WAL file(s) added, 0 removed, 43 recycled; write=11.040 s, sync=0.675 s, total=11.833 s; sync files=84, longest=0.335 s, average=0.008 s; distance=698932 kB, estimate=698932 kB
 2019-11-15 22:16:41.921-05 | 10104 | checkpoint starting: end-of-recovery immediate
 2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
 2019-11-15 22:15:33.03-05  | 27388 | checkpoint starting: time
 2019-11-15 22:15:03.62-05  | 27388 | checkpoint complete: wrote 5436 buffers (5.5%); 0 WAL file(s) added, 0 removed, 45 recycled; write=28.938 s, sync=0.355 s, total=29.711 s; sync files=22, longest=0.174 s, average=0.016 s; distance=740237 kB, estimate=740237 kB
 2019-11-15 22:14:33.908-05 | 27388 | checkpoint starting: time

I was trying to reproduce what was happening:
set -x; psql postgres -txc "DROP TABLE IF EXISTS t" -c "CREATE TABLE t(i int unique); INSERT INTO t SELECT generate_series(1,999999)"; echo "begin;SELECT pg_export_snapshot(); SELECT pg_sleep(9)" |psql postgres -At >/tmp/snapshot& sleep 3; snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`; PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t ALTER i TYPE bigint" -c CHECKPOINT; pg_dump -d postgres -t t --snap="$snap" |head -44;

Under v12, with or without the CHECKPOINT command, it fails:
|pg_dump: error: query failed: ERROR:  cache lookup failed for index 0
But under v9.5.2 (which I found quickly), without CHECKPOINT, it instead fails like:
|pg_dump: [archiver (db)] query failed: ERROR:  cache lookup failed for index 16391
With the CHECKPOINT command, 9.5.2 works, but I don't see why it should be
needed, or why it would behave differently (or if it's related to this crash).



Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Justin Pryzby
Date:
On Tue, Nov 19, 2019 at 07:22:26PM -0600, Justin Pryzby wrote:
> I was trying to reproduce what was happening:
> set -x; psql postgres -txc "DROP TABLE IF EXISTS t" -c "CREATE TABLE t(i int unique); INSERT INTO t SELECT generate_series(1,999999)"; echo "begin;SELECT pg_export_snapshot(); SELECT pg_sleep(9)" |psql postgres -At >/tmp/snapshot& sleep 3; snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`; PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t ALTER i TYPE bigint" -c CHECKPOINT; pg_dump -d postgres -t t --snap="$snap" |head -44;
> 
> Under v12, with or without the CHECKPOINT command, it fails:
> |pg_dump: error: query failed: ERROR:  cache lookup failed for index 0
> But under v9.5.2 (which I found quickly), without CHECKPOINT, it instead fails like:
> |pg_dump: [archiver (db)] query failed: ERROR:  cache lookup failed for index 16391
> With the CHECKPOINT command, 9.5.2 works, but I don't see why it should be
> needed, or why it would behave differently (or if it's related to this crash).

Actually, I think that's at least related to documented behavior:

https://www.postgresql.org/docs/12/mvcc-caveats.html
|Some DDL commands, currently only TRUNCATE and the table-rewriting forms of ALTER TABLE, are not MVCC-safe. This means that after the truncation or rewrite commits, the table will appear empty to concurrent transactions, if they are using a snapshot taken before the DDL command committed.

I don't know why CHECKPOINT allows it to work under 9.5, or if it's even
related to the PANIC ..
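
For reference, that caveat is easy to see with two sessions (a minimal sketch using the table from my recipe above):

-- session A
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT 1;                                  -- forces the snapshot to be taken now
-- session B
ALTER TABLE t ALTER i TYPE bigint;         -- table rewrite, commits
-- session A, still on the old snapshot
SELECT count(*) FROM t;                    -- may report 0 rows: the rewrite is not MVCC-safe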

Justin



Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Craig Ringer
Date:
On Thu, 21 Nov 2019 at 09:07, Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Nov 19, 2019 at 07:22:26PM -0600, Justin Pryzby wrote:
> I was trying to reproduce what was happening:
> set -x; psql postgres -txc "DROP TABLE IF EXISTS t" -c "CREATE TABLE t(i int unique); INSERT INTO t SELECT generate_series(1,999999)"; echo "begin;SELECT pg_export_snapshot(); SELECT pg_sleep(9)" |psql postgres -At >/tmp/snapshot& sleep 3; snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`; PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t ALTER i TYPE bigint" -c CHECKPOINT; pg_dump -d postgres -t t --snap="$snap" |head -44;
>
> Under v12, with or without the CHECKPOINT command, it fails:
> |pg_dump: error: query failed: ERROR:  cache lookup failed for index 0
> But under v9.5.2 (which I found quickly), without CHECKPOINT, it instead fails like:
> |pg_dump: [archiver (db)] query failed: ERROR:  cache lookup failed for index 16391
> With the CHECKPOINT command, 9.5.2 works, but I don't see why it should be
> needed, or why it would behave differently (or if it's related to this crash).

Actually, I think that's at least related to documented behavior:

https://www.postgresql.org/docs/12/mvcc-caveats.html
|Some DDL commands, currently only TRUNCATE and the table-rewriting forms of ALTER TABLE, are not MVCC-safe. This means that after the truncation or rewrite commits, the table will appear empty to concurrent transactions, if they are using a snapshot taken before the DDL command committed.

I don't know why CHECKPOINT allows it to work under 9.5, or if it's even
related to the PANIC ..

The PANIC is a defense against potential corruptions that can be caused by some kinds of disk errors. It's likely that we used to just ERROR and retry, then the retry would succeed without getting upset.

fsync_fname() is supposed to ignore errors for files that cannot be opened. But that same message may be emitted by a number of other parts of the code, and it looks like you didn't have log_error_verbosity = verbose so we don't have file/line info.

The only other place I see that emits that error where a relation path could be a valid argument is in rewriteheap.c  in logical_end_heap_rewrite(). That calls the vfd layer's FileSync() and assumes that any failure is a fsync() syscall failure. But FileSync() can return failure if we fail to reopen the underlying file managed by the vfd too, per FileAccess().

Would there be a legitimate case where a logical rewrite file mapping could vanish without that being a problem? If so, we should probably be more tolerant there.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise

Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Justin Pryzby
Date:
I looked and found a new "hint".

On Tue, Nov 19, 2019 at 05:57:59AM -0600, Justin Pryzby wrote:
> < 2019-11-15 22:16:07.098 EST  >PANIC:  could not fsync file "base/16491/1731839470.2": No such file or directory
> < 2019-11-15 22:16:08.751 EST  >LOG:  checkpointer process (PID 27388) was terminated by signal 6: Aborted

An earlier segment of that relation had been opened successfully and was *still* open:

$ sudo grep 1731839470 /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds 
63:/var/lib/pgsql/12/data/base/16491/1731839470

For context:

$ sudo grep / /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds |tail -3
61:/var/lib/pgsql/12/data/base/16491/1757077748
62:/var/lib/pgsql/12/data/base/16491/1756223121.2
63:/var/lib/pgsql/12/data/base/16491/1731839470

So this may be an issue only with relations larger than one segment (but that
interpretation could also be very naive).
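
If it is segment-related, the relations currently over one 1GB segment are easy to list (my own ad-hoc query, not taken from the crash data):

SELECT oid::regclass AS rel, pg_size_pretty(pg_relation_size(oid)) AS main_fork_size
FROM pg_class
WHERE relkind IN ('r','i','t','m')
  AND pg_relation_size(oid) > 1024*1024*1024   -- i.e. has .1, .2, ... segment files
ORDER BY pg_relation_size(oid) DESC;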




Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Thomas Munro
Date:
On Tue, Nov 26, 2019 at 5:21 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I looked and found a new "hint".
>
> On Tue, Nov 19, 2019 at 05:57:59AM -0600, Justin Pryzby wrote:
> > < 2019-11-15 22:16:07.098 EST  >PANIC:  could not fsync file "base/16491/1731839470.2": No such file or directory
> > < 2019-11-15 22:16:08.751 EST  >LOG:  checkpointer process (PID 27388) was terminated by signal 6: Aborted
>
> An earlier segment of that relation had been opened successfully and was
> *still* open:
>
> $ sudo grep 1731839470 /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds
> 63:/var/lib/pgsql/12/data/base/16491/1731839470
>
> For context:
>
> $ sudo grep / /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds |tail -3
> 61:/var/lib/pgsql/12/data/base/16491/1757077748
> 62:/var/lib/pgsql/12/data/base/16491/1756223121.2
> 63:/var/lib/pgsql/12/data/base/16491/1731839470
>
> So this may be an issue only with relations larger than one segment (but that
> interpretation could also be very naive).

FTR I have been trying to reproduce this but failing so far.  I'm
planning to dig some more in the next couple of days.  Yeah, it's a .2
file, which means that it's one that would normally be unlinked after
you commit your transaction (unlike a no-suffix file, which would
normally be dropped at the next checkpoint after the commit, as our
strategy to prevent the relfilenode from being reused before the next
checkpoint cycle), but should normally have had a SYNC_FORGET_REQUEST
enqueued for it first.  So the question is, how did it come to pass
that a .2 file was ENOENT but there was no forget request?  Difficult,
given the definition of mdunlinkfork().  I wondered if something was
going wrong in queue compaction or something like that, but I don't
see it.  I need to dig into the exact flow of the ALTER case to
see if there is something I'm missing there, and perhaps try
reproducing it with a tiny segment size to exercise some more
multisegment-related code paths.



Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Justin Pryzby
Date:
This same crash occurred on a second server.
Also qemu/KVM, but this time on a secondary ZFS tablespace, which (fails to) include the missing relfilenode.
Linux database7 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

This is postgresql12-12.1-1PGDG.rhel7.x86_64 (same as first crash), running since:
|30515 Tue Nov 19 10:04:33 2019 S ?        00:09:54 /usr/pgsql-12/bin/postmaster -D /var/lib/pgsql/12/data/

Before that, this server ran v12.0 since Oct 30 (without crashing).

In this case, the pg_dump --snap finished and released its snapshot at 21:50,
and there were no ALTERed tables.  I see a temp file written since the previous
checkpoint, but not by a parallel worker, as in the previous server's crash.

The crash happened while reindexing, though.  The "DROP INDEX CONCURRENTLY" is
from pg_repack -i, and completed successfully, but is followed immediately by
the abort log.  The following "REINDEX toast..." failed.  In this case, I
*guess* that the missing filenode is due to a dropped index (570627937 or
otherwise).  I don't see any other CLUSTER, VACUUM FULL, DROP, TRUNCATE or
ALTER within that checkpoint interval (note, we have 1 minute checkpoints).

Note, I double checked on the first server which crashed: it definitely wasn't
running pg_repack or the reindex script, since I removed pg_repack12 from our
servers until 12.1 was installed, to avoid the "concurrently" progress
reporting crash fixed at 1cd5bc3c.  So I think ALTER TABLE TYPE and REINDEX can
both trigger this crash, at least on v12.1.

Note I actually have *full logs*, which I've now saved.  But here's an excerpt:

postgres=# SELECT log_time, message FROM ckpt_crash WHERE log_time BETWEEN '2019-11-26 23:40:20' AND '2019-11-26 23:48:58' AND user_name IS NULL ORDER BY 1;
 
 2019-11-26 23:40:20.139-05 | checkpoint starting: time
 2019-11-26 23:40:50.069-05 | checkpoint complete: wrote 11093 buffers (5.6%); 0 WAL file(s) added, 0 removed, 12 recycled; write=29.885 s, sync=0.008 s, total=29.930 s; sync files=71, longest=0.001 s, average=0.000 s; distance=193388 kB, estimate=550813 kB
 2019-11-26 23:41:16.234-05 | automatic analyze of table "postgres.public.postgres_log_2019_11_26_2300" system usage: CPU: user: 3.00 s, system: 0.19 s, elapsed: 10.92 s
 2019-11-26 23:41:20.101-05 | checkpoint starting: time
 2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory
 2019-11-26 23:42:04.397-05 | checkpointer process (PID 30560) was terminated by signal 6: Aborted
 2019-11-26 23:42:04.397-05 | terminating any other active server processes
 2019-11-26 23:42:04.397-05 | terminating connection because of crash of another server process
 2019-11-26 23:42:04.42-05  | terminating connection because of crash of another server process
 2019-11-26 23:42:04.493-05 | all server processes terminated; reinitializing
 2019-11-26 23:42:05.651-05 | database system was interrupted; last known up at 2019-11-27 00:40:50 -04
 2019-11-26 23:47:30.404-05 | database system was not properly shut down; automatic recovery in progress
 2019-11-26 23:47:30.435-05 | redo starts at 3450/1B202938
 2019-11-26 23:47:54.501-05 | redo done at 3450/205CE960
 2019-11-26 23:47:54.501-05 | invalid record length at 3450/205CEA18: wanted 24, got 0
 2019-11-26 23:47:54.567-05 | checkpoint starting: end-of-recovery immediate
 2019-11-26 23:47:57.365-05 | checkpoint complete: wrote 3287 buffers (1.7%); 0 WAL file(s) added, 0 removed, 5 recycled; write=2.606 s, sync=0.183 s, total=2.798 s; sync files=145, longest=0.150 s, average=0.001 s; distance=85808 kB, estimate=85808 kB
 2019-11-26 23:47:57.769-05 | database system is ready to accept connections
 2019-11-26 23:48:57.774-05 | checkpoint starting: time

< 2019-11-27 00:42:04.342 -04 postgres >LOG:  duration: 13.028 ms  statement: DROP INDEX CONCURRENTLY "child"."index_570627937"
< 2019-11-27 00:42:04.397 -04  >LOG:  checkpointer process (PID 30560) was terminated by signal 6: Aborted
< 2019-11-27 00:42:04.397 -04  >LOG:  terminating any other active server processes
< 2019-11-27 00:42:04.397 -04  >WARNING:  terminating connection because of crash of another server process
< 2019-11-27 00:42:04.397 -04  >DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
< 2019-11-27 00:42:04.397 -04  >HINT:  In a moment you should be able to reconnect to the database and repeat your command.
...
< 2019-11-27 00:42:04.421 -04 postgres >STATEMENT:  begin; LOCK TABLE child.ericsson_sgsn_ss7_remote_sp_201911 IN SHARE MODE;REINDEX INDEX pg_toast.pg_toast_570627929_index;commit

Here are all the nondefault settings which seem plausibly relevant or interesting:
 autovacuum_analyze_scale_factor | 0.005                                  |      | configuration file
 autovacuum_analyze_threshold    | 2                                      |      | configuration file
 checkpoint_timeout              | 60                                     | s    | configuration file
 max_files_per_process           | 1000                                   |      | configuration file
 max_stack_depth                 | 2048                                   | kB   | environment variable
 max_wal_size                    | 4096                                   | MB   | configuration file
 min_wal_size                    | 4096                                   | MB   | configuration file
 shared_buffers                  | 196608                                 | 8kB  | configuration file
 shared_preload_libraries        | pg_stat_statements                     |      | configuration file
 wal_buffers                     | 2048                                   | 8kB  | override
 wal_compression                 | on                                     |      | configuration file
 wal_segment_size                | 16777216                               | B    | override

(gdb) bt
#0  0x00007f07c0070207 in raise () from /lib64/libc.so.6
#1  0x00007f07c00718f8 in abort () from /lib64/libc.so.6
#2  0x000000000087752a in errfinish (dummy=<optimized out>) at elog.c:552
#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
#4  0x0000000000734dd9 in CheckPointBuffers (flags=flags@entry=256) at bufmgr.c:2588
#5  0x00000000005095e1 in CheckPointGuts (checkPointRedo=57518713529016, flags=flags@entry=256) at xlog.c:9006
#6  0x000000000050ff86 in CreateCheckPoint (flags=flags@entry=256) at xlog.c:8795
#7  0x00000000006e4092 in CheckpointerMain () at checkpointer.c:481
#8  0x000000000051fcd5 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffd82122400) at bootstrap.c:461
#9  0x00000000006ee680 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5392
#10 0x00000000006ef9ca in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2973
#11 <signal handler called>
#12 0x00007f07c012ef53 in __select_nocancel () from /lib64/libc.so.6
#13 0x00000000004833d4 in ServerLoop () at postmaster.c:1668
#14 0x00000000006f106f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x19d4280) at postmaster.c:1377
#15 0x0000000000484cd3 in main (argc=3, argv=0x19d4280) at main.c:228

#3  0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
        path = "pg_tblspc/16401/PG_12_201909212/16460/973123799.10", '\000' <repeats 14 times>, ...
        failures = 1
        sync_in_progress = true
        hstat = {hashp = 0x19fd2f0, curBucket = 1443, curEntry = 0x0}
        entry = 0x1a61260
        absorb_counter = <optimized out>
        processed = 23
        sync_start = {tv_sec = 21582125, tv_nsec = 303557162}
        sync_end = {tv_sec = 21582125, tv_nsec = 303536006}
        sync_diff = <optimized out>
        elapsed = <optimized out>
        longest = 1674
        total_elapsed = 7074
        __func__ = "ProcessSyncRequests"



Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Thomas Munro
Date:
On Wed, Nov 27, 2019 at 7:53 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>  2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory

I managed to reproduce this (see below).  I think I know what the
problem is: mdsyncfiletag() uses _mdfd_getseg() to open the segment to
be fsync'd, but that function opens all segments up to the one you
requested, so if a lower-numbered segment has already been unlinked,
it can fail.  Usually that's unlikely because it's hard to get the
request queue to fill up and therefore hard to split up the cancel
requests for all the segments for a relation, but your workload and
the repro below do it.  In fact, the path it shows in the error
message is not even the problem file, that's the one it really wanted,
but first it was trying to open lower-numbered ones.  I can see a
couple of solutions to the problem (unlink in reverse order, send all
the forget messages first before unlinking anything, or go back to
using a single atomic "forget everything for this rel" message instead
of per-segment messages), but I'll have to think more about that
tomorrow.

=== repro ===

Recompile with RELSEG_SIZE 2 in pg_config.h.  Run with
checkpoint_timeout=30s and shared_buffers=128kB.  Then:

create table t (i int primary key);
cluster t using t_pkey;
insert into t select generate_series(1, 10000);

Session 1:
cluster t;
\watch 1

Session 2:
update t set i = i;
\watch 1.1



Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Thomas Munro
Date:
On Fri, Nov 29, 2019 at 3:13 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Nov 27, 2019 at 7:53 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> >  2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory
>
> I managed to reproduce this (see below).  I think I know what the
> problem is: mdsyncfiletag() uses _mdfd_getseg() to open the segment to
> be fsync'd, but that function opens all segments up to the one you
> requested, so if a lower-numbered segment has already been unlinked,
> it can fail.  Usually that's unlikely because it's hard to get the
> request queue to fill up and therefore hard to split up the cancel
> requests for all the segments for a relation, but your workload and
> the repro below do it.  In fact, the path it shows in the error
> message is not even the problem file, that's the one it really wanted,
> but first it was trying to open lower-numbered ones.  I can see a
> couple of solutions to the problem (unlink in reverse order, send all
> the forget messages first before unlinking anything, or go back to
> using a single atomic "forget everything for this rel" message instead
> of per-segment messages), but I'll have to think more about that
> tomorrow.

Here is a patch that fixes the problem by sending all the
SYNC_FORGET_REQUEST messages up front.

Attachment

Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Justin Pryzby
Date:
On Fri, Nov 29, 2019 at 10:50:36AM +1300, Thomas Munro wrote:
> On Fri, Nov 29, 2019 at 3:13 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, Nov 27, 2019 at 7:53 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > >  2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory
> >
> > I managed to reproduce this (see below).  I think I know what the
> 
> Here is a patch that fixes the problem by sending all the
> SYNC_FORGET_REQUEST messages up front.

I managed to reproduce it too, but my recipe is crummy enough that I'm not even
going to send it..

I confirmed that patch also seems to work for my worse recipe.

Thanks,
Justin



Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Thomas Munro
Date:
On Fri, Nov 29, 2019 at 11:14 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Fri, Nov 29, 2019 at 10:50:36AM +1300, Thomas Munro wrote:
> > On Fri, Nov 29, 2019 at 3:13 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > On Wed, Nov 27, 2019 at 7:53 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > >  2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory
> > >
> > > I managed to reproduce this (see below).  I think I know what the
> >
> > Here is a patch that fixes the problem by sending all the
> > SYNC_FORGET_REQUEST messages up front.
>
> I managed to reproduce it too, but my recipe is crummy enough that I'm not even
> going to send it..
>
> I confirmed that patch also seems to work for my worse recipe.

Thanks.

One thing I'm wondering about is what happens if you encounter EPERM,
EIO etc while probing the existing files, so that we give up early and
don't deal with some higher numbered files.

(1)  We'll still have unlinked the lower numbered files, and we'll
leave the higher numbered files where they are, which might confuse
matters if the relfilenode number is later recycled, so the zombie
segments appear to belong to the new relfilenode.  That is
longstanding PostgreSQL behaviour not changed by commit 3eb77eba or
this patch IIUC (despite moving a few things around), and I guess it's
unlikely to bite you considering all the coincidences required, but if
there's a transient EPERM (think Windows virus checker opening files
without the shared read flag), it's not inconceivable.  One solution
to that would be not to queue the unlink request for segment 0 if
anything goes wrong while unlinking the other segments (in other
words: if you can't unlink all the segments, deliberately leak segment
0 and thus the *whole relfilenode*, not just the problem file(s)).

(2)  Even with the fix I just proposed, if you give up early due to
EPERM, EIO etc, there might still be sync requests queued for high
numbered segments, so you could reach the PANIC case.  It's likely
that, once you've hit such an error, the checkpointer is going to be
panicking anyway when it starts seeing similar errors, but still, it'd
be possible to do better here (especially if the error was transient
and short lived).  If we don't accept that behaviour, we could switch
(back) to a single cancel message that can whack every request
relating to the relation (essentially as it was in 11, though it
requires a bit more work for the new design), or stop using
_mdfd_getseg() for this so that you can remove segments independently
without worrying about sync requests for other segments (it was
actually like that in an earlier version of the patch for commit
3eb77eba, but someone complained that it didn't benefit from fd
caching).



Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Thomas Munro
Date:
On Fri, Nov 29, 2019 at 12:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... or stop using
> _mdfd_getseg() for this so that you can remove segments independently
> without worrying about sync requests for other segments (it was
> actually like that in an earlier version of the patch for commit
> 3eb77eba, but someone complained that it didn't benefit from fd
> caching).

Not sure which approach I prefer yet, but here's a patch showing that
alternative.

Attachment

Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Thomas Munro
Date:
On Sat, Nov 30, 2019 at 10:57 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Nov 29, 2019 at 12:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > ... or stop using
> > _mdfd_getseg() for this so that you can remove segments independently
> > without worrying about sync requests for other segments (it was
> > actually like that in an earlier version of the patch for commit
> > 3eb77eba, but someone complained that it didn't benefit from fd
> > caching).
>
> Not sure which approach I prefer yet, but here's a patch showing that
> alternative.

Here's a better version: it uses the existing fd if we have it already
in md_seg_fds, but opens and closes a transient one if not.

Attachment

Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Thomas Munro
Date:
On Fri, Dec 13, 2019 at 5:41 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a better version: it uses the existing fd if we have it already
> in md_seg_fds, but opens and closes a transient one if not.

Pushed.



Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Thomas Munro
Date:
On Sat, Dec 14, 2019 at 4:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Dec 13, 2019 at 5:41 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Here's a better version: it uses the existing fd if we have it already
> > in md_seg_fds, but opens and closes a transient one if not.
>
> Pushed.

Build farm not happy... checking...



Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Thomas Munro
Date:
On Sat, Dec 14, 2019 at 5:05 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Pushed.
>
> Build farm not happy... checking...

Hrmph.  FileGetRawDesc() does not contain a call to FileAccess(), so
this is failing on low-fd-limit systems.  Looking into a way to fix
that...



Re: checkpointer: PANIC: could not fsync file: No such file or directory

From
Thomas Munro
Date:
On Sat, Dec 14, 2019 at 5:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Dec 14, 2019 at 5:05 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > Pushed.
> >
> > Build farm not happy... checking...
>
> Hrmph.  FileGetRawDesc() does not contain a call to FileAccess(), so
> this is failing on low-fd-limit systems.  Looking into a way to fix
> that...

Seemed best not to use FileGetRawDesc().  Rewritten to use only File,
and tested with the torture-test mentioned upthread under ulimit -n
50.