"Yurgis Baykshtis" <ybaykshtis@micropat.com> writes:
> I just noticed that the rename panic errors like this one:
> PANIC: rename from /data/pg_xlog/000000030000001F to
> /data/pg_xlog/000000030000002C (initialization of log file 3, segment 44)
> failed: No such file or directory
> come shortly AFTER the following messages
> LOG: recycled transaction log file 000000030000001B
> LOG: recycled transaction log file 000000030000001C
> LOG: recycled transaction log file 000000030000001D
> LOG: recycled transaction log file 000000030000001E
> LOG: removing transaction log file 000000030000001F
> LOG: removing transaction log file 0000000300000020
> LOG: removing transaction log file 0000000300000021
> LOG: removing transaction log file 0000000300000022
> So, you can see that 000000030000001F file was previously deleted by the
> logic in MoveOfflineLogs() function.
Interesting ...
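For reference, the 16-hex-digit names in the log above read as two 8-digit fields, the log file number followed by the segment number. That is how "log file 3, segment 44" corresponds to 000000030000002C (44 decimal is 2C hex). A minimal sketch of that mapping in C follows; the helper names are hypothetical and not taken from the PostgreSQL source:

```c
#include <stdio.h>

/*
 * Hypothetical helper: build the 16-hex-digit segment file name from a
 * log file number and a segment number.
 */
static void
format_segment_name(char *buf, size_t len, unsigned int log, unsigned int seg)
{
    snprintf(buf, len, "%08X%08X", log, seg);
}

/*
 * Hypothetical helper: split a segment file name back into its two
 * fields.  Returns 1 if both fields were parsed, 0 otherwise.
 */
static int
parse_segment_name(const char *name, unsigned int *log, unsigned int *seg)
{
    return sscanf(name, "%8X%8X", log, seg) == 2;
}
```

Read this way, the file being renamed, 000000030000001F, is log file 3, segment 31, the first name in the "removing" list.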
> Now what I can see is that MoveOfflineLogs() does not seem to be
> synchronized between backends.
It's certainly supposed to be, because the only place it is called from
holds the CheckpointLock while it's doing it.  If more than one backend
is able to run MoveOfflineLogs at a time, then the LWLock code is simply
broken. That seems unlikely, as just about nothing would work reliably
if LWLock failed to lock out concurrent operations.
What I suspect at this point is a Cygwin bug: somehow, its
implementation of readdir() is able to retrieve a stale view of a
directory.  I'd suggest pinging the Cygwin developers to see if that
idea strikes a chord or not.
[ thinks for a bit... ] It might be that it isn't even a stale-data
issue, but that readdir() misbehaves if there are concurrent insert,
rename or delete operations carried out in the same directory. (The
renames or deletes would be coming from MoveOfflineLogs itself, the
inserts, if any, from concurrent backends finding that they need more
WAL space.)  Again I would call that a Cygwin bug, as we've not seen
reports of comparable behavior anywhere else.
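One way to sidestep either failure mode, sketched here under assumptions and not taken from the PostgreSQL source, is to read the complete list of names first, close the directory, and only then delete files, treating ENOENT as non-fatal. POSIX leaves it unspecified whether readdir() returns entries that are added or removed after opendir(), so a scan that never overlaps with its own modifications at least removes that variable. The helper names snapshot_dir and remove_listed are hypothetical:

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy every entry name in `path` (except "." and
 * "..") into a malloc'd array before any file is touched.
 * Returns the number of names, or -1 on error.
 */
static int
snapshot_dir(const char *path, char ***names_out)
{
    DIR        *dir = opendir(path);
    struct dirent *de;
    char      **names = NULL;
    int         n = 0, cap = 0, i;

    if (dir == NULL)
        return -1;
    for (;;)
    {
        errno = 0;              /* lets us tell end-of-directory from error */
        de = readdir(dir);
        if (de == NULL)
            break;
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        if (n == cap)
        {
            int         newcap = cap ? cap * 2 : 16;
            char      **tmp = realloc(names, newcap * sizeof(char *));

            if (tmp == NULL)
                goto fail;
            names = tmp;
            cap = newcap;
        }
        if ((names[n] = strdup(de->d_name)) == NULL)
            goto fail;
        n++;
    }
    if (errno != 0)
        goto fail;
    closedir(dir);
    *names_out = names;
    return n;

fail:
    for (i = 0; i < n; i++)
        free(names[i]);
    free(names);
    closedir(dir);
    return -1;
}

/*
 * Hypothetical helper: unlink each listed file and free the list.
 * A file that is already gone (ENOENT) is not counted as a failure.
 * Returns the number of real failures.
 */
static int
remove_listed(const char *path, char **names, int n)
{
    char        full[1024];
    int         failures = 0, i;

    for (i = 0; i < n; i++)
    {
        snprintf(full, sizeof(full), "%s/%s", path, names[i]);
        if (unlink(full) != 0 && errno != ENOENT)
            failures++;
        free(names[i]);
    }
    free(names);
    return failures;
}
```

This does not fix a readdir() that returns stale data to begin with, but it would show whether overlapping the scan with the renames and deletes is what triggers the problem.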
> Also, we have a suspicion that the problem happens even with only one client
> connected to postgres.
Unless the clients are issuing explicit CHECKPOINT operations, that
wouldn't matter, because MoveOfflineLogs is called only from
checkpointing, and the postmaster never creates more than one background
checkpoint process at a time. (So there are actually two levels of
protection in place against concurrent execution of this code.)
regards, tom lane