Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving. - Mailing list pgsql-bugs

From Luke Koops
Subject Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
Msg-id A3144629B5AC714A8BF27806EBFA7057514623F2@sottexch7.corp.ad.entrust.com
In response to Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-bugs
I picked up the patch and verified both fixes on 8.3.7.

In one test, handles to two different WAL files were being held by two
different backends. The WAL files were renamed to .deleted after I forced
an xlog switch. Eventually the .deleted files disappeared. In one case the
backend exited. In the other, the backend moved on to the latest WAL file.

In another test, I opened a WAL file so that it could not be renamed or
deleted. The appropriate error was logged and the .done file remained. The
error is logged quite frequently. When I released the WAL file, it was soon
deleted.
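
The ordering exercised by these two tests can be sketched roughly as
follows. This is an illustrative Python model, not the committed C code:
the function name remove_old_segment and the way the directory is passed in
are assumptions made for the example, and a raised OSError stands in for
the ERROR that the server logs.

```python
import os


def remove_old_segment(xlog_dir, segment):
    """Hypothetical model of the rename-then-unlink removal order.

    Raises OSError if the rename or the unlink fails. In that case the
    matching .done file is left in place, so the removal can be retried.
    """
    path = os.path.join(xlog_dir, segment)
    doomed = path + ".deleted"
    done = os.path.join(xlog_dir, "archive_status", segment + ".done")

    # Step 1: move the segment out of the way. On Windows this is expected
    # to fail while another process holds the file open without
    # FILE_SHARE_DELETE.
    os.rename(path, doomed)

    # Step 2: remove the renamed file.
    os.unlink(doomed)

    # Step 3: only reached when both calls above succeeded.
    if os.path.exists(done):
        os.unlink(done)
```

Because the .done file is touched last, a failure in either of the first
two steps leaves it behind, which matches what I saw while the WAL file was
held open.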

If you get into a case where the rename works but the unlink fails (I don't
see how this could happen in real life, except possibly for a race
condition with AV software), you will have a situation where there is a
.done file that does not match any WAL file, and you will have a .deleted
file that won't get cleaned up.

I couldn't reproduce this, so I faked it by adding a .done file back into
the archive_status folder after it was deleted. The orphaned .done file
doesn't cause any trouble. It doesn't get cleaned up, it doesn't generate
any log messages, and it doesn't interfere with WAL file recycling or
removal (unlike the trouble that is caused by orphaned .ready files).
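
Since neither kind of leftover is reported anywhere, checking for them has
to be done by hand. A minimal sketch of such a check is below. The helper
name find_orphans is hypothetical and not part of PostgreSQL; it only
assumes the layout described above, with .done files under archive_status
inside the xlog directory.

```python
import os


def find_orphans(xlog_dir):
    """Hypothetical diagnostic: list leftover status and .deleted files.

    Returns a pair of sorted lists:
      - .done files in archive_status with no matching file in xlog_dir
      - *.deleted files still present in xlog_dir
    """
    status_dir = os.path.join(xlog_dir, "archive_status")
    names = set(os.listdir(xlog_dir))

    suffix = ".done"
    orphaned_done = sorted(
        f for f in os.listdir(status_dir)
        if f.endswith(suffix) and f[:-len(suffix)] not in names
    )
    leftover_deleted = sorted(f for f in names if f.endswith(".deleted"))
    return orphaned_done, leftover_deleted
```

Run against a copy of the data directory, or with the server stopped, so
that files do not come and go while the listing is taken.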

The patch looks good.

Thank you,

-Luke

> -----Original Message-----
> From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com]
> Sent: Thursday, September 10, 2009 5:44 AM
> Cc: Tom Lane; Luke Koops; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
>
> Heikki Linnakangas wrote:
> > Tom Lane wrote:
> >> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> >>> No, it's a backend that's holding the file open, with
> >>> FILE_SHARE_DELETE.
> >> If that's the only case we care about covering, then rename might be
> >> enough.  I was just wondering what it would take to solve the more
> >> general problem of something holding it open with the wrong flags at
> >> the time we want to get rid of it.
> >
> > Yes, that's a separate problem, and I think we should address that too.
> > That's what I thought was going on in OP's case at first, the patch I
> > posted in my first reply should address that.
> >
> > I'll try to reproduce that case too, and verify that the patch fixes it.
>
> Ok, I've committed a patch along those lines. The file is now renamed
> before unlinking (on Windows), and the return code of rename() and
> unlink() is checked, so that we don't delete the .done file if the WAL
> file deletion failed. This fixes both scenarios, the one OP reported
> with another backend keeping the file open, and the one where a
> different process keeps a file open without FILE_SHARE_DELETE.
>
> I considered making failure to rename or delete a WARNING instead of
> ERROR, so that RemoveOldXLogFiles() would still clean up any other old
> WAL files. However, when a file is recycled, we throw an error anyway
> if the rename fails in InstallXLogFileSegment(), so it doesn't seem
> like it would buy us much.
>
> BTW, it seems that errno is not set on Windows when rename fails, but
> we still try to print the OS error message in InstallXLogFileSegment().
> When I tested the case where another process is keeping the file
> locked, for example, I got this:
>
> ERROR:  could not rename file "pg_xlog/000000010000000100000073" to
> "pg_xlog/000000010000000100000092" (initialization of log file 1,
> segment 146): No such file or directory
>
> even though the file clearly exists, it's just locked. I'm not sure
> where errno is coming from in that case, and if we should do something
> about that, but that exceeds my appetite for fixing Windows issues
> right now.
>
> --
>   Heikki Linnakangas
>   EnterpriseDB   http://www.enterprisedb.com
>
