Michael Paquier <michael@paquier.xyz> writes:
> Recently, one of the test beds we use has blown up once when doing
> streaming replication like that:
> FATAL: could not seek to end of file "base/16386/19817_fsm": No such
> file or directory
> CONTEXT: WAL redo at 60/8DA22448 for Heap2/CLEAN: remxid 65751197
> LOG: startup process (PID 44886) exited with exit code 1
> All the WAL records have been wiped out since, so I don't know exactly
> what happened, but I could track down that this FSM file got removed
> a couple of hours before as I got my hands on some FS-level logs which
> showed a deletion.

Hm.  AFAICS the immediate issuer of the error must have been
_mdnblocks(); there are other matches to that error string but
they are in places where we can tell which file the seek must
have been applied to, and it wasn't an FSM file.

> Before blaming a lower level of
> the application stack, I am wondering if we have some issues with
> mdfd_vfd meaning that the file has been removed but that it is still
> tracked as opened.

lseek() per se presumably would never return ENOENT.  A more likely
theory is that the file wasn't actually open but only had a leftover
VFD entry, and when FileSize() -> FileAccess() tried to open it,
the open failed with ENOENT --- but _mdnblocks() would still call it
a seek failure.
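
To make that theory concrete, here is a minimal standalone C sketch.  It is
not PostgreSQL code: ToyVfd, toy_file_access(), toy_file_size(),
toy_nblocks() and toy_demo() are hypothetical stand-ins that only loosely
mimic the VFD layer, FileAccess(), FileSize() and _mdnblocks(), and the
block size and temp-file name are assumptions made for the demo.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define TOY_BLCKSZ 8192			/* assumed block size, for illustration only */

/* Hypothetical, much-simplified stand-in for a VFD entry. */
typedef struct ToyVfd
{
	char		path[256];
	int			fd;				/* -1: entry exists, but no kernel fd is open */
} ToyVfd;

/* Loosely models FileAccess(): lazily reopen if the kernel fd was released. */
int
toy_file_access(ToyVfd *v)
{
	if (v->fd < 0)
	{
		v->fd = open(v->path, O_RDWR);
		if (v->fd < 0)
			return -1;			/* errno comes from open(), e.g. ENOENT */
	}
	return 0;
}

/* Loosely models FileSize(): make sure the file is open, then seek to EOF. */
off_t
toy_file_size(ToyVfd *v)
{
	if (toy_file_access(v) < 0)
		return (off_t) -1;
	return lseek(v->fd, 0, SEEK_END);
}

/* Loosely models _mdnblocks(): any failure is reported as a seek failure. */
long
toy_nblocks(ToyVfd *v, char *errbuf, size_t errlen)
{
	off_t		len = toy_file_size(v);

	if (len < 0)
	{
		int			save_errno = errno;

		snprintf(errbuf, errlen, "could not seek to end of file \"%s\": %s",
				 v->path, strerror(save_errno));
		return -1;
	}
	return (long) (len / TOY_BLCKSZ);
}

/*
 * Create a file, keep only a closed "leftover" entry for it, remove the
 * file behind the entry's back, then ask for its size.
 * Returns -1 with errbuf filled when the size lookup fails, -2 on setup error.
 */
long
toy_demo(char *errbuf, size_t errlen)
{
	ToyVfd		v;
	int			fd;

	snprintf(v.path, sizeof(v.path), "toy_vfd_demo_%ld.tmp", (long) getpid());
	fd = open(v.path, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return -2;
	close(fd);
	v.fd = -1;					/* leftover entry, no kernel fd behind it */
	if (unlink(v.path) != 0)
		return -2;
	return toy_nblocks(&v, errbuf, errlen);
}
```

Calling toy_demo() with a buffer yields -1 and a message that says "could
not seek" but carries the errno text for ENOENT, even though lseek() was
never reached; the failure came from the lazy open().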

So I'd opine that this is a pretty high-level failure --- what are
we doing trying to replay WAL against a table that's been dropped?
Or if it wasn't dropped, why was the FSM removed?

			regards, tom lane