Thread: pgsql: On Windows, when a file is deleted and another process still has

pgsql: On Windows, when a file is deleted and another process still has

From
heikki@postgresql.org (Heikki Linnakangas)
Date:
Log Message:
-----------
On Windows, when a file is deleted and another process still has an open
file handle on it, the file goes into "pending deletion" state where it
still shows up in directory listing, but isn't accessible otherwise. That
confuses RemoveOldXLogFiles(), making it think that the file hasn't been
archived yet, while it actually was, and it was deleted along with the .done
file.

Fix that by renaming the file with ".deleted" extension before deleting it.
Also check the return value of rename() and unlink(), so that if the removal
fails for any reason (e.g another process is holding the file locked), we
don't delete the .done file until the WAL file is really gone.

Backpatch to 8.2, which is the oldest version supported on Windows.

Modified Files:
--------------
    pgsql/src/backend/access/transam:
        xlog.c (r1.351 -> r1.352)
        (http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/transam/xlog.c?r1=1.351&r2=1.352)

Re: pgsql: On Windows, when a file is deleted and another process still has

From
Tom Lane
Date:
heikki@postgresql.org (Heikki Linnakangas) writes:
> Fix that by renaming the file with ".deleted" extension before deleting it.
> Also check the return value of rename() and unlink(), so that if the removal
> fails for any reason (e.g another process is holding the file locked), we
> don't delete the .done file until the WAL file is really gone.

This patch seems to me to be a seriously bad idea.  It means that some
random process holding a file open will be able to cause checkpoints to
fail indefinitely.  Is it really necessary or advisable to have the
error cases be elog(ERROR)?  Couldn't we just elog(LOG) and leave the
file alone?

            regards, tom lane

Re: pgsql: On Windows, when a file is deleted and another process still has

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> heikki@postgresql.org (Heikki Linnakangas) writes:
>> Fix that by renaming the file with ".deleted" extension before deleting it.
>> Also check the return value of rename() and unlink(), so that if the removal
>> fails for any reason (e.g another process is holding the file locked), we
>> don't delete the .done file until the WAL file is really gone.
>
> This patch seems to me to be a seriously bad idea.  It means that some
> random process holding a file open will be able to cause checkpoints to
> fail indefinitely.  Is it really necessary or advisable to have the
> error cases be elog(ERROR)?  Couldn't we just elog(LOG) and leave the
> file alone?

I considered that, but didn't bother since InstallXLogFileSegment will
throw an error anyway if the file is locked, and that's a more common
codepath.

On closer look, it looks like the code in InstallXLogFileSegment is
trying to check the return code from rename, and not throw an error if
it fails because the file is locked, but it doesn't seem to be working
properly. When I open a WAL file with a separate program, and then cycle
through enough segments to have the file recycled, I get this:

ERROR:  could not rename file "pg_xlog/0000000100000001000000A0" to
"pg_xlog/0000000100000001000000A9" (initialization of log file 1,
segment 169): No such file or directory

I noted that behavior in my previous mail on the thread on pgsql-bugs,
but it didn't strike me as serious then.

FWIW, the checkpoint is mostly complete at that point, all that's left
is non-critical cleanup. I admit it's still annoying.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: pgsql: On Windows, when a file is deleted and another process still has

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> FWIW, the checkpoint is mostly complete at that point, all that's left
> is non-critical cleanup. I admit it's still annoying.

"Mostly complete" isn't good enough, and it's much more than just an
annoyance --- failure to complete checkpoints will lead to disk full,
and other nasty things IIRC.  We really need to deal with this.

I wouldn't be too surprised if that rat's nest in InstallXLogFileSegment
isn't right :-(.  But we have to treat "file can't be renamed" as a
nonfatal condition on Windows.

            regards, tom lane

Re: pgsql: On Windows, when a file is deleted and another process still has

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> I wouldn't be too surprised if that rat's nest in InstallXLogFileSegment
> isn't right :-(.  But we have to treat "file can't be renamed" as a
> nonfatal condition on Windows.

I added some debugging code, and I'm getting an ERROR_SHARING_VIOLATION
error when another program keeps the file open, while
InstallXLogFileSegment is checking for ERROR_ACCESS_DENIED. It seems
that check is flat-out wrong, as is the one in pgrename(). A bit of
googling suggests that Windows 9x might've returned ERROR_ACCESS_DENIED
in that case, so that's probably where that originated. pgwin32_open()
correctly checks for ERROR_SHARING_VIOLATION, but also for
ERROR_LOCK_VIOLATION.

I also note that since pgrename() doesn't set errno, the error message
printed in InstallXLogFileSegment if it fails is bogus. pgrename()
should set errno, using _dosmaperr().

A completely different approach would be to treat any failure on all
platforms as non-fatal. We shouldn't really cut the checkpoint short if
recycling a WAL file fails, whatever the reason. That seems like a more
robust approach than trying to guess which error codes are OK to ignore.

When called from XLogFileInit(), it's not expected that the target file
doesn't exist after InstallXLogFileSegment(), but it would just fail to
open it and throw an error anyway.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: pgsql: On Windows, when a file is deleted and another process still has

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> A completely different approach would be to treat any failure on all
> platforms as non-fatal. We shouldn't really cut the checkpoint short if
> recycling a WAL file fails, whatever the reason. That seems like a more
> robust approach than trying to guess which error codes are OK to ignore.

I could live with that, as long as it gets logged.

            regards, tom lane