Re: checkpointer: PANIC: could not fsync file: No such file or directory - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: checkpointer: PANIC: could not fsync file: No such file or directory
Msg-id CA+hUKGLbC=+5DS8VOXqZ6peX_H6Zd_mxbV8vHAqS=ajM-x9wSQ@mail.gmail.com
In response to Re: checkpointer: PANIC: could not fsync file: No such file or directory (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
On Tue, Nov 26, 2019 at 5:21 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I looked and found a new "hint".
>
> On Tue, Nov 19, 2019 at 05:57:59AM -0600, Justin Pryzby wrote:
> > < 2019-11-15 22:16:07.098 EST  >PANIC:  could not fsync file "base/16491/1731839470.2": No such file or directory
> > < 2019-11-15 22:16:08.751 EST  >LOG:  checkpointer process (PID 27388) was terminated by signal 6: Aborted
>
> An earlier segment of that relation had been opened successfully and was
> *still* opened:
>
> $ sudo grep 1731839470 /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds
> 63:/var/lib/pgsql/12/data/base/16491/1731839470
>
> For context:
>
> $ sudo grep / /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds |tail -3
> 61:/var/lib/pgsql/12/data/base/16491/1757077748
> 62:/var/lib/pgsql/12/data/base/16491/1756223121.2
> 63:/var/lib/pgsql/12/data/base/16491/1731839470
>
> So this may be an issue only with relations>segment (but, that interpretation
> could also be very naive).

FTR I have been trying to reproduce this but failing so far.  I'm
planning to dig some more in the next couple of days.  Yeah, it's a .2
file, which means that it's one that would normally be unlinked after
you commit your transaction (unlike a no-suffix file, which would
normally be dropped at the next checkpoint after the commit, as our
strategy to prevent the relfilenode from being reused before the next
checkpoint cycle), but should normally have had a SYNC_FORGET_REQUEST
enqueued for it first.  So the question is, how did it come to pass
that a .2 file was ENOENT but there was no forget request?  Difficult,
given the definition of mdunlinkfork().  I wondered if something was
going wrong in queue compaction or something like that, but I don't
see it.  I need to dig into the exact flow with the ALTER case to
see if there is something I'm missing there, and perhaps try
reproducing it with a tiny segment size to exercise some more
multisegment-related code paths.
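
To make the expected ordering concrete, here is a toy model of it.  It
is Python, not PostgreSQL code, and every name in it (ToyCheckpointer,
write, unlink, send_forget, segment_name) is invented for illustration.
It only assumes what is described above: a dirtied segment gets a sync
request queued, and an unlink is normally preceded by a forget request.
The second case shows the only way this model can reach the failure in
the log, which is an unlink with no forget request.

```python
# Toy model only: names mirror ideas from this message, not PostgreSQL's code.
class ToyCheckpointer:
    def __init__(self):
        self.pending = set()  # (relfilenode, segno) pairs awaiting fsync
        self.files = set()    # segment files that currently "exist"

    def write(self, relfilenode, segno):
        """A backend dirties a segment: the file exists, a sync request is queued."""
        self.files.add((relfilenode, segno))
        self.pending.add((relfilenode, segno))

    def unlink(self, relfilenode, segno, send_forget=True):
        """Remove a segment file; normally a forget request is processed first."""
        if send_forget:
            self.pending.discard((relfilenode, segno))
        self.files.discard((relfilenode, segno))

    def checkpoint(self):
        """Process all pending requests; return those whose file is missing."""
        missing = sorted(key for key in self.pending if key not in self.files)
        self.pending.clear()
        return missing


def segment_name(relfilenode, segno):
    """Segment 0 has no suffix; later segments are named relfilenode.N."""
    return str(relfilenode) if segno == 0 else f"{relfilenode}.{segno}"


# Expected path: the forget request cancels the pending fsync.
ok = ToyCheckpointer()
ok.write(1731839470, 2)
ok.unlink(1731839470, 2)
assert ok.checkpoint() == []

# Failure path: the file is gone but its sync request is still queued.
bad = ToyCheckpointer()
bad.write(1731839470, 2)
bad.unlink(1731839470, 2, send_forget=False)
assert [segment_name(*k) for k in bad.checkpoint()] == ["1731839470.2"]
```

The model says nothing about how a forget request could actually go
missing; it only restates the invariant that would have to be violated.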


