Thread: Cascading replication on Windows bug
Starting with 9.2, when a WAL segment is restored from the archive, it is copied over any existing file in pg_xlog with the same name. This is done in two steps: first the file is restored from the archive to a temporary file called RECOVERYXLOG, then the old file is deleted and the temporary file is renamed into place. After that, a flag is set in shared memory for each WAL sender, to tell them to close the old file if they still have it open.

That doesn't work on Windows. As long as a walsender is keeping the old file open, the unlink() on it fails. You get an error like this in the startup process:

FATAL: could not rename file "pg_xlog/RECOVERYXLOG" to "pg_xlog/00000001000000000000000D": Permission denied

Not sure how to fix that. Perhaps we could copy the data over the old file, rather than unlink and rename it. Or signal the walsenders and retry if the unlink() fails with EACCES.

Now, another question is, do we need to delay the release because of this? The impact is basically that cascading replication sometimes causes the standby to die, if a WAL archive is used together with streaming replication.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> That doesn't work on Windows. As long as a walsender is keeping the old
> file open, the unlink() on it fails. You get an error like this in the
> startup process:
>
> FATAL: could not rename file "pg_xlog/RECOVERYXLOG" to
> "pg_xlog/00000001000000000000000D": Permission denied

I thought we had some workaround for that problem. Otherwise, you'd be seeing this type of failure every time a checkpoint tries to drop or rename files.

regards, tom lane
On 05.09.2012 14:28, Tom Lane wrote:
> Heikki Linnakangas<hlinnaka@iki.fi> writes:
>> That doesn't work on Windows. As long as a walsender is keeping the old
>> file open, the unlink() on it fails. You get an error like this in the
>> startup process:
>>
>> FATAL: could not rename file "pg_xlog/RECOVERYXLOG" to
>> "pg_xlog/00000001000000000000000D": Permission denied
>
> I thought we had some workaround for that problem. Otherwise, you'd be
> seeing this type of failure every time a checkpoint tries to drop or
> rename files.

Hmm, now that I look at the error message more carefully, what happens is that the unlink() succeeds, but when the startup process tries to rename the new file into place, the rename() fails. The comment in RemoveOldXLogFiles() explains that, and also shows how to work around it:

> /*
>  * On Windows, if another process (e.g another backend)
>  * holds the file open in FILE_SHARE_DELETE mode, unlink
>  * will succeed, but the file will still show up in
>  * directory listing until the last handle is closed. To
>  * avoid confusing the lingering deleted file for a live
>  * WAL file that needs to be archived, rename it before
>  * deleting it.
>  *
>  * If another process holds the file open without
>  * FILE_SHARE_DELETE flag, rename will fail. We'll try
>  * again at the next checkpoint.
>  */

I think we need the same trick here: rename the old file first, then unlink() it, and then rename the new file into place. I'll try that out.

- Heikki
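For readers following along, the rename-then-unlink-then-rename ordering proposed above can be sketched roughly as below. This is only an illustration under stated assumptions, not the patch that was committed: the function name install_restored_segment() and the ".deleted" suffix are hypothetical, and the sketch uses plain POSIX rename() and unlink() rather than whatever portability layer the server actually goes through.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL's actual code): put the freshly
 * restored file at final_path, moving any existing file out of the way
 * first. Returns 0 on success, -1 on failure with errno set.
 */
int
install_restored_segment(const char *restored_path, const char *final_path)
{
    char doomed_path[1024];
    int  n;

    assert(restored_path != NULL && final_path != NULL);

    /* The ".deleted" suffix is an assumption made for this sketch. */
    n = snprintf(doomed_path, sizeof(doomed_path), "%s.deleted", final_path);
    if (n < 0 || (size_t) n >= sizeof(doomed_path))
    {
        errno = ENAMETOOLONG;
        return -1;
    }

    /*
     * Step 1: rename the old file aside. This frees up the final name
     * even if the deleted file lingers in the directory until another
     * process closes its handle.
     */
    if (rename(final_path, doomed_path) == 0)
    {
        /* Step 2: unlink the old file under its temporary name. */
        if (unlink(doomed_path) != 0)
            return -1;
    }
    else if (errno != ENOENT)
    {
        /* An old file exists but could not be moved; give up. */
        return -1;
    }

    /* Step 3: the final name is free, so move the new file into place. */
    if (rename(restored_path, final_path) != 0)
        return -1;

    return 0;
}
```

If no old file exists, the first rename() fails with ENOENT and the sketch goes straight to step 3. A caller would pass something like "pg_xlog/RECOVERYXLOG" as restored_path and the segment's final name as final_path; what to do when step 1 fails for another reason (retry, error out) is left open here.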
On 05.09.2012 16:45, Heikki Linnakangas wrote:
> On 05.09.2012 14:28, Tom Lane wrote:
>> Heikki Linnakangas<hlinnaka@iki.fi> writes:
>>> That doesn't work on Windows. As long as a walsender is keeping the old
>>> file open, the unlink() on it fails. You get an error like this in the
>>> startup process:
>>>
>>> FATAL: could not rename file "pg_xlog/RECOVERYXLOG" to
>>> "pg_xlog/00000001000000000000000D": Permission denied
>>
>> I thought we had some workaround for that problem. Otherwise, you'd be
>> seeing this type of failure every time a checkpoint tries to drop or
>> rename files.
>
> Hmm, now that I look at the error message more carefully, what happens
> is that the unlink() succeeds, but when the startup process tries to
> rename the new file into place, the rename() fails. The comment in
> RemoveOldXLogFiles() explains that, and also shows how to work around it:
>
>> /*
>>  * On Windows, if another process (e.g another backend)
>>  * holds the file open in FILE_SHARE_DELETE mode, unlink
>>  * will succeed, but the file will still show up in
>>  * directory listing until the last handle is closed. To
>>  * avoid confusing the lingering deleted file for a live
>>  * WAL file that needs to be archived, rename it before
>>  * deleting it.
>>  *
>>  * If another process holds the file open without
>>  * FILE_SHARE_DELETE flag, rename will fail. We'll try
>>  * again at the next checkpoint.
>>  */
>
> I think we need the same trick here: rename the old file first, then
> unlink() it, and then rename the new file into place. I'll try that out.

Ok, committed a patch to do that.

- Heikki