Thread: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
"Luke Koops"
Date:
The following bug has been logged online: Bug reference: 5038 Logged by: Luke Koops Email address: luke.koops@entrust.com PostgreSQL version: 8.3.7 Operating system: Windows 2003 Server Enterprise Edition Description: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving. Details: On my system, one of the WAL files is pending deletion. The handle is being held by one of the postgres backend processes, but that is another potential bug. At first, the unlink worked, and the .ready and .done files were deleted. But the WAL file still shows up in the pg_xlog directory listing. Note: the WAL file did get archived properly. There was no error reported at the time. When it comes time to recycle the log files, RemoveOldXLogFiles() calls ReadDir() to get the list of files, then it calls XLogArchiveCheckDone() which, if it cannot find a .done or a .ready file, calls XLogArchiveNotify(). XLogArchiveNotify() creates the .ready file again. This causes the archiver to call the archive command on the old WAL file that is pending deletion. The copy command will fail and all subsequent archive attempts will keep trying to copy the old WAL file that is pending deletion. At this point, none of the WAL files will get shipped and the pg_xlog folder will start filling up. Before calling XLogArchiveCheckDone(), RemoveOldXLogFiles() makes a number of tests to make sure the name is for a legitimate XLOG. This would be a good time to make sure the file is real, not pending deletion. That would prevent the creation of the .ready file and WAL archiving would continue to work. It might be a good idea to log something at the DEBUG level if a directory entry is encoutered that matches the naming conventions but is not a real file. You could probably reproduce this behaviour by changing the permissions on a WAL file, although you wouldn't be able to test a fix in the same way. I have not reliably reproduced the WAL file handle "leak" in the postgres back end. I believe may be related to statements timing out. My system currently has statement_timeout=1min, but that will be removed. I will report the "leak" when I have a better handle (no pun) on the situation. -Luke
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Heikki Linnakangas
Date:
Luke Koops wrote: > The following bug has been logged online: > > Bug reference: 5038 > Logged by: Luke Koops > Email address: luke.koops@entrust.com > PostgreSQL version: 8.3.7 > Operating system: Windows 2003 Server Enterprise Edition > Description: WAL file is pending deletion in pg_xlog folder, this > interferes with WAL archiving. > Details: > > On my system, one of the WAL files is pending deletion. The handle is being > held by one of the postgres backend processes, but that is another potential > bug. Hmm. Under normal Unix filesystem semantics, that doesn't matter much since the file can still be unlinked. It will still consume space, but that's not a big issue. On Windows, however, the open handle keeps the file locked, so it can't be deleted. If I'm reading the code correctly, when a backend opens a WAL file for writing to it, it stays open until the backend needs to write to another WAL file. If the backend only writes once to a file, and then doesn't create WAL records anymore (= executes read-only queries only), the file is kept open indefinitely. Perhaps we should try to close the old WAL file sooner. It's easy to check if the current open log segment is old and close it if so, but I'm not sure what the check should be hooked into. > At first, the unlink worked, and the .ready and .done files were deleted. > But the WAL file still shows up in the pg_xlog directory listing. If the file didn't go away, it seems like the unlink didn't work. We don't check the return code in RemoveOldXlogFiles(); I suspect that we're getting EBUSY in that scenario. We should check for that if we're going to delete the .ready and .done files. Patch attached, but I haven't tested it. I don't have Windows environment at hand, but I'll try to find one.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 9b64578..f54dd3b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3106,7 +3106,12 @@ RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr) ereport(DEBUG2, (errmsg("removing transaction log file \"%s\"", xlde->d_name))); - unlink(path); + + if (unlink(path) != 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove old transaction log file \"%s\": %m", + path))); CheckpointStats.ckpt_segs_removed++; }
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Perhaps we should try to close the old WAL file sooner. There is zero hope of making that work. What we probably need to do is fix the code that scans pg_xlog so that it ignores files that are pending deletion. (I assume there's some way to find that out on Windows.) regards, tom lane
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Luke Koops
Date:
> -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: Monday, September 07, 2009 1:17 PM > To: Heikki Linnakangas > Cc: Luke Koops; pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion > in pg_xlog folder, this interferes with WAL archiving. > > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > Perhaps we should try to close the old WAL file sooner. > > There is zero hope of making that work. What we probably > need to do is fix the code that scans pg_xlog so that it > ignores files that are pending deletion. (I assume there's > some way to find that out on > Windows.) On *nux systems, unlink removes the link from the directory. When other pr= ocesses get a directory listing, the file will no longer be listed. On Win= dows, the file name continues to show up in directory listings. The file i= s in a state called pending deletion. Windows documentation doesn't give a= specific test for this state. Perhaps you could use _access(). =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =46rom http://support.microsoft.com/kb/159199 This file is in a state known as pending deletion. This file has been delet= ed, but there are still handles open to it. NTFS will wait until all handle= s to this file are closed before updating the index. If an attempt is made = to access the file, however, NTFS will deny the attempt. Because the file i= s listed in the index, but is effectively deleted, you can see the file but= you cannot access it. Windows NT returns an "Access Denied" error message when you attempt to man= ipulate the file. You are not able to view the permissions, the owner, or t= he contents of the file. The file does, however, show up in a DIR listing i= n File Manager and in Explorer. This occurs even though the user trying to = access the file has permissions to the file. Even an administrator will be = unable to take ownership of this file. =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > regards, tom lane >
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Luke Koops
Date:
For those of you who are still looking at this, I tried to reproduce the is= sue by holding one of the WAL files open with another program (just opened = it with the cygwin build of less.exe for windows). That didn't do the tric= k. It prevented unlink or rename from working at all. I wrote a program (= open.exe) that opens the file using pgwin32_open() and passed in the same p= arameters that postgres uses when opening a WAL file. That allowed the fil= e to be renamed. And, when deleted, the open file went into the pending de= letion state. I used open.exe to hold onto a WAL file that was going to be recycled. The= recycling worked, but what is going to happen down the road when the handl= e is released, leaving a gap in the WAL file sequence. Or if it is not rel= eased, when a backend tries to open the WAL file and does not have access t= o it? When open.exe was holding onto a WAL file that was going to be deleted, the= deletion worked. The file went into the deletion pending state. The arch= ive status for the WAL file went through the .ready =3D=3D> .done =3D=3D> {= no status file} =3D=3D> .ready sequence. At that point Postgres repeatedly= tries to archive the WAL file. I reported earlier that I believe postgres leaked the file handle to the WA= L file. I don't believe that is the case. We have a process that only che= cks data in the database for integrity. It is only reading. I think it op= ened the WAL file initially, perhaps the backend had some maintenance work = to do when that session started and had to write something to the WAL and t= hen never moved on to a new one. Now that I can reproduce the pending deletion case, I'm working on code to = detect it reliably and, hopefully, efficiently. -Luke > -----Original Message----- > From: pgsql-bugs-owner@postgresql.org > [mailto:pgsql-bugs-owner@postgresql.org] On Behalf Of Luke Koops > Sent: Monday, September 07, 2009 4:30 PM > To: 'Tom Lane'; Heikki Linnakangas > Cc: pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion > in pg_xlog folder, this interferes with WAL archiving. > > > -----Original Message----- > > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > > Sent: Monday, September 07, 2009 1:17 PM > > To: Heikki Linnakangas > > Cc: Luke Koops; pgsql-bugs@postgresql.org > > Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion > in pg_xlog > > folder, this interferes with WAL archiving. > > > > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > > Perhaps we should try to close the old WAL file sooner. > > > > There is zero hope of making that work. What we probably > need to do > > is fix the code that scans pg_xlog so that it ignores files > that are > > pending deletion. (I assume there's some way to find that out on > > Windows.) > On *nux systems, unlink removes the link from the directory. > When other processes get a directory listing, the file will > no longer be listed. On Windows, the file name continues to > show up in directory listings. The file is in a state called > pending deletion. Windows documentation doesn't give a > specific test for this state. Perhaps you could use _access(). > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > From http://support.microsoft.com/kb/159199 > > This file is in a state known as pending deletion. This file > has been deleted, but there are still handles open to it. > NTFS will wait until all handles to this file are closed > before updating the index. If an attempt is made to access > the file, however, NTFS will deny the attempt. Because the > file is listed in the index, but is effectively deleted, you > can see the file but you cannot access it. > > Windows NT returns an "Access Denied" error message when you > attempt to manipulate the file. You are not able to view the > permissions, the owner, or the contents of the file. The file > does, however, show up in a DIR listing in File Manager and > in Explorer. This occurs even though the user trying to > access the file has permissions to the file. Even an > administrator will be unable to take ownership of this file. > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > > > > regards, tom lane > > > > -- > Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-bugs >
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Heikki Linnakangas
Date:
Luke Koops wrote: > For those of you who are still looking at this, I tried to reproduce the issue by holding one of the WAL files open withanother program (just opened it with the cygwin build of less.exe for windows). That didn't do the trick. It preventedunlink or rename from working at all. I wrote a program (open.exe) that opens the file using pgwin32_open() andpassed in the same parameters that postgres uses when opening a WAL file. That allowed the file to be renamed. And,when deleted, the open file went into the pending deletion state. Yeah, it's the FILE_SHARE_DELETE flag that allows the deletion. > I used open.exe to hold onto a WAL file that was going to be recycled. The recycling worked, but what is going to happendown the road when the handle is released, leaving a gap in the WAL file sequence. Or if it is not released, whena backend tries to open the WAL file and does not have access to it? When the file is recycled, I believe we're fine. The file is not deleted, only renamed, so it won't be deleted when open.exe closes it. No gap in WAL sequence is created. > When open.exe was holding onto a WAL file that was going to be deleted, the deletion worked. The file went into the deletionpending state. The archive status for the WAL file went through the .ready ==> .done ==> {no status file} ==> .readysequence. At that point Postgres repeatedly tries to archive the WAL file. > I reported earlier that I believe postgres leaked the file handle to the WAL file. I don't believe that is the case. We have a process that only checks data in the database for integrity. It is only reading. I think it opened the WAL fileinitially, perhaps the backend had some maintenance work to do when that session started and had to write something tothe WAL and then never moved on to a new one. > > Now that I can reproduce the pending deletion case, I'm working on code to detect it reliably and, hopefully, efficiently. I got hold of a Windows virtual machine as well, and could reproduce the issue. It was a bit tricky to coerce the file to be deleted instead of recycled, but setting "max_advance = 0" in RemoveOldXlogFiles() finally did the trick. I googled around, and saw some discussion that suggest that when a file is in "pending deletion" state, it's implemented by setting a "delete-on-close" flag on the existing file handle. The upshot of that is that if you pull the power plug, the file won't be deleted after all. One option is to rename the file before deleting it. For all practical purposes, that's the same as if the file no longer exists. Seems like the simplest solution to me. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > One option is to rename the file before deleting it. I was under the impression that renaming would fail in the same cases where deletion fails, ie something else holding the file open ... regards, tom lane
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Luke Koops
Date:
Tom, If you first delete the file and it goes into deletion pending, the= n subsequent attempts to rename or delete will fail. Heikki is proposing that the file be renamed before it is first del= eted. -Luke > -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: Wednesday, September 09, 2009 3:06 PM > To: Heikki Linnakangas > Cc: Luke Koops; pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion > in pg_xlog folder, this interferes with WAL archiving. > > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > One option is to rename the file before deleting it. > > I was under the impression that renaming would fail in the > same cases where deletion fails, ie something else holding > the file open ... > > regards, tom lane >
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Tom Lane
Date:
Luke Koops <luke.koops@entrust.com> writes: > Heikki is proposing that the file be renamed before it is first deleted. Indeed, and what I'm saying is that that will still fail if some unhelpful program is holding the file open without FILE_SHARE_DELETE. Which is what I understood your case to be ... regards, tom lane
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Luke Koops
Date:
My case is that a backend was holding the file open. It was not some other= unhelpful program. That's why I had to write a program that held the file= open with the same flags that postgres uses (including FILE_SHARE_DELETE). We have a process that wrote to the DB, and then for many days it only read= from the DB, so it didn't move on to another WAL file. -Luke > -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: Wednesday, September 09, 2009 3:20 PM > To: Luke Koops > Cc: Heikki Linnakangas; pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion > in pg_xlog folder, this interferes with WAL archiving. > > Luke Koops <luke.koops@entrust.com> writes: > > Heikki is proposing that the file be renamed before > it is first deleted. > > Indeed, and what I'm saying is that that will still fail if > some unhelpful program is holding the file open without > FILE_SHARE_DELETE. > Which is what I understood your case to be ... > > regards, tom lane >
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Heikki Linnakangas
Date:
Tom Lane wrote: > Luke Koops <luke.koops@entrust.com> writes: >> Heikki is proposing that the file be renamed before it is first deleted. > > Indeed, and what I'm saying is that that will still fail if some > unhelpful program is holding the file open without FILE_SHARE_DELETE. > Which is what I understood your case to be ... No, it's a backend that's holding the file open, with FILE_SHARE_DELETE. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > No, it's a backend that's holding the file open, with FILE_SHARE_DELETE. If that's the only case we care about covering, then rename might be enough. I was just wondering what it would take to solve the more general problem of something holding it open with the wrong flags at the time we want to get rid of it. regards, tom lane
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Heikki Linnakangas
Date:
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> No, it's a backend that's holding the file open, with FILE_SHARE_DELETE. > > If that's the only case we care about covering, then rename might be > enough. I was just wondering what it would take to solve the more > general problem of something holding it open with the wrong flags > at the time we want to get rid of it. Yes, that's a separate problem, and I think we should address that too. That's what I thought was going on in OP's case at first, the patch I posted in my first reply should address that. I'll try to reproduce that case too, and verify that the patch fixes it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote: > Tom Lane wrote: >> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >>> No, it's a backend that's holding the file open, with FILE_SHARE_DELETE. >> If that's the only case we care about covering, then rename might be >> enough. I was just wondering what it would take to solve the more >> general problem of something holding it open with the wrong flags >> at the time we want to get rid of it. > > Yes, that's a separate problem, and I think we should address that too. > That's what I thought was going on in OP's case at first, the patch I > posted in my first reply should address that. > > I'll try to reproduce that case too, and verify that the patch fixes it. Ok, I've committed a patch along those lines. The file is now renamed before unlinking (on Windows), and the return code of rename() and unlink() is checked, so that we don't delete the .done file if the WAL file deletion failed. This fixes both scenarios, the one OP reported with another backend keeping the file open, and the one where a different process keeps a file open without FILE_SHARE_DELETE. I considered making failure to rename or delete a WARNING instead of ERROR, so that RemoveOldXLogFiles() would still clean up any other old WAL files. However, when a file is recycled, we throw an error anyway if the rename fails in InstallXLogFileSegment(), so it doesn't seem like it would buy us much. BTW, it seems that errno is not set on Windows when rename fails, but we still try to print the OS error message in InstallXLogFileSegment(). When I tested the case where another process is keeping the file locked, for example, I got this: ERROR: could not rename file "pg_xlog/000000010000000100000073" to "pg_xlog/000000010000000100000092" (initialization of log file 1, segment 146): No such file or directory even though the file clearly exists, it's just locked. I'm not sure where errno is coming from in that case, and if we should do something about that, but that exceeds my appetite for fixing Windows issues right now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: BUG #5038: WAL file is pending deletion in pg_xlog folder, this interferes with WAL archiving.
From
Luke Koops
Date:
I picked up the patch and verified both fixes on 8.3.7. In one test, Handles to two different WAL files were being held by two diff= erent backends. The WAL files were renamed to .deleted after I forced a sw= itch xlog. Eventually the .deleted files disappeared. In one case the bac= kend exited. In the other, the backend moved on to the latest WAL file. In another test, I opened a WAL file so that it could not be renamed or del= eted. The appropriate error was logged and the .done file remained. The e= rror is logged quite frequently. When released the WAL file it was soon de= leted. If you get into a case where the rename works but the unlink fails (I don't= see how this could happen in real life, except possibly for a race conditi= on with AV software), you will have a situation where there is a .done file= that does not match any WAL logs, and you will have a .deleted file that w= on't get cleaned up. I couldn't reproduce this, so I faked it by adding a .done file back into t= he archive_status folder after it was deleted. The orphaned .done file doe= sn't cause any trouble. It doesn't get cleaned up, it doesn't generate any= log messages, and it doesn't interfere with WAL file recycling or removal = (unlike the trouble that is caused by orphaned .ready files). The patch looks good. Thank-you, -Luke > -----Original Message----- > From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com] > Sent: Thursday, September 10, 2009 5:44 AM > Cc: Tom Lane; Luke Koops; pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #5038: WAL file is pending deletion > in pg_xlog folder, this interferes with WAL archiving. > > Heikki Linnakangas wrote: > > Tom Lane wrote: > >> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > >>> No, it's a backend that's holding the file open, with > FILE_SHARE_DELETE. > >> If that's the only case we care about covering, then > rename might be > >> enough. I was just wondering what it would take to solve the more > >> general problem of something holding it open with the > wrong flags at > >> the time we want to get rid of it. > > > > Yes, that's a separate problem, and I think we should > address that too. > > That's what I thought was going on in OP's case at first, > the patch I > > posted in my first reply should address that. > > > > I'll try to reproduce that case too, and verify that the > patch fixes it. > > Ok, I've committed a patch along those lines. The file is now > renamed before unlinking (on Windows), and the return code of > rename() and > unlink() is checked, so that we don't delete the .done file > if the WAL file deletion failed. This fixes both scenarios, > the one OP reported with another backend keeping the file > open, and the one where a different process keeps a file open > without FILE_SHARE_DELETE. > > I considered making failure to rename or delete a WARNING > instead of ERROR, so that RemoveOldXLogFiles() would still > clean up any other old WAL files. However, when a file is > recycled, we throw an error anyway if the rename fails in > InstallXLogFileSegment(), so it doesn't seem like it would > buy us much. > > BTW, it seems that errno is not set on Windows when rename > fails, but we still try to print the OS error message in > InstallXLogFileSegment(). > When I tested the case where another process is keeping the > file locked, for example, I got this: > > ERROR: could not rename file > "pg_xlog/000000010000000100000073" to > "pg_xlog/000000010000000100000092" (initialization of log > file 1, segment 146): No such file or directory > > even though the file clearly exists, it's just locked. I'm > not sure where errno is coming from in that case, and if we > should do something about that, but that exceeds my appetite > for fixing Windows issues right now. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com >