Thread: pgsql: On Windows, when a file is deleted and another process still has
pgsql: On Windows, when a file is deleted and another process still has
From
heikki@postgresql.org (Heikki Linnakangas)
Date:
Log Message: ----------- On Windows, when a file is deleted and another process still has an open file handle on it, the file goes into "pending deletion" state where it still shows up in directory listing, but isn't accessible otherwise. That confuses RemoveOldXLogFiles(), making it think that the file hasn't been archived yet, while it actually was, and it was deleted along with the .done file. Fix that by renaming the file with ".deleted" extension before deleting it. Also check the return value of rename() and unlink(), so that if the removal fails for any reason (e.g another process is holding the file locked), we don't delete the .done file until the WAL file is really gone. Backpatch to 8.2, which is the oldest version supported on Windows. Modified Files: -------------- pgsql/src/backend/access/transam: xlog.c (r1.351 -> r1.352) (http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/transam/xlog.c?r1=1.351&r2=1.352)
heikki@postgresql.org (Heikki Linnakangas) writes: > Fix that by renaming the file with ".deleted" extension before deleting it. > Also check the return value of rename() and unlink(), so that if the removal > fails for any reason (e.g another process is holding the file locked), we > don't delete the .done file until the WAL file is really gone. This patch seems to me to be a seriously bad idea. It means that some random process holding a file open will be able to cause checkpoints to fail indefinitely. Is it really necessary or advisable to have the error cases be elog(ERROR)? Couldn't we just elog(LOG) and leave the file alone? regards, tom lane
Re: pgsql: On Windows, when a file is deleted and another process still has
From
Heikki Linnakangas
Date:
Tom Lane wrote: > heikki@postgresql.org (Heikki Linnakangas) writes: >> Fix that by renaming the file with ".deleted" extension before deleting it. >> Also check the return value of rename() and unlink(), so that if the removal >> fails for any reason (e.g another process is holding the file locked), we >> don't delete the .done file until the WAL file is really gone. > > This patch seems to me to be a seriously bad idea. It means that some > random process holding a file open will be able to cause checkpoints to > fail indefinitely. Is it really necessary or advisable to have the > error cases be elog(ERROR)? Couldn't we just elog(LOG) and leave the > file alone? I considered that, but didn't bother since InstallXLogFileSegment will throw an error anyway if the file is locked, and that's a more common codepath. On closer look, it looks like the code in InstallXLogFileSegment is trying to check the return code from rename, and not throw an error if it fails because the file is locked, but it doesn't seem to be working properly. When I open a WAL file with a separate program, and then cycle through enough segments to have the file recycled, I get this: ERROR: could not rename file "pg_xlog/0000000100000001000000A0" to "pg_xlog/0000000100000001000000A9" (initialization of log file 1, segment 169): No such file or directory I noted that behavior in my previous mail on the thread on pgsql-bugs, but it didn't strike me as serious then. FWIW, the checkpoint is mostly complete at that point, all that's left is non-critical cleanup. I admit it's still annoying. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > FWIW, the checkpoint is mostly complete at that point, all that's left > is non-critical cleanup. I admit it's still annoying. "Mostly complete" isn't good enough, and it's much more than just an annoyance --- failure to complete checkpoints will lead to disk full, and other nasty things IIRC. We really need to deal with this. I wouldn't be too surprised if that rat's nest in InstallXLogFileSegment isn't right :-(. But we have to treat "file can't be renamed" as a nonfatal condition on Windows. regards, tom lane
Re: pgsql: On Windows, when a file is deleted and another process still has
From
Heikki Linnakangas
Date:
Tom Lane wrote: > I wouldn't be too surprised if that rat's nest in InstallXLogFileSegment > isn't right :-(. But we have to treat "file can't be renamed" as a > nonfatal condition on Windows. I added some debugging code, and I'm getting an ERROR_SHARING_VIOLATION error when another program keeps the file open, while InstallXLogFileSegment is checking for ERROR_ACCESS_DENIED. It seems that check is flat-out wrong, as is the one in pgrename(). A bit of googling suggests that Windows 9x might've returned ERROR_ACCESS_DENIED in that case, so that's probably where that originated. pgwin32_open() correctly checks for ERROR_SHARING_VIOLATION, but also for ERROR_LOCK_VIOLATION. I also note that since pgrename() doesn't set errno, the error message printed in InstallXLogFileSegment if it fails is bogus. pgrename() should set errno, using _dosmaperr(). A completely different approach would be to treat any failure on all platforms as non-fatal. We shouldn't really cut the checkpoint short if recycling a WAL file fails, whatever the reason. That seems like a more robust approach than trying to guess which error codes are OK to ignore. When called from XLogFileInit(), it's not expected that the target file doesn't exist after InstallXLogFileSegment(), but it would just fail to open it and throw an error anyway. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > A completely different approach would be to treat any failure on all > platforms as non-fatal. We shouldn't really cut the checkpoint short if > recycling a WAL file fails, whatever the reason. That seems like a more > robust approach than trying to guess which error codes are OK to ignore. I could live with that, as long as it gets logged. regards, tom lane