Thread: PostgreSQL occasionally unable to rename WAL files (NTFS)
Hello,
Running 13.1 on Windows Server 2019, I am getting the following log entries occasionally:
2021-02-11 12:34:10.149 NZDT [6072] LOG: could not rename file "pg_wal/0000000100000099000000D3": Permission denied
2021-02-11 12:40:31.377 NZDT [6072] LOG: could not rename file "pg_wal/0000000100000099000000D3": Permission denied
2021-02-11 12:46:06.294 NZDT [6072] LOG: could not rename file "pg_wal/0000000100000099000000D3": Permission denied
2021-02-11 12:46:16.502 NZDT [6072] LOG: could not rename file "pg_wal/0000000100000099000000DA": Permission denied
2021-02-11 12:50:20.917 NZDT [6072] LOG: could not rename file "pg_wal/0000000100000099000000D3": Permission denied
2021-02-11 12:50:31.098 NZDT [6072] LOG: could not rename file "pg_wal/0000000100000099000000DA": Permission denied
What appears to be happening is the affected WAL files (which is usually only 2 or 3 WAL files at a time) are somehow "losing" their NTFS permissions, so the PG process can't rename them - though of course the PG process created them. Even running icacls as admin gives "Access is denied" on those files. A further oddity is the affected files do end up disappearing after a while.
The NTFS permissions on the pg_wal directory are correct, and most WAL files are unaffected. Chkdsk reports no problems, and the database is working fine otherwise. Have tried disabling antivirus software in case that was doing something but no difference.
I found another recent report of similar behaviour here: https://stackoverflow.com/questions/65405479/postgresql-13-log-could-not-rename-file-pg-wal-0000000100000001000000c6
WAL config as follows:
wal_level = replica
fsync = on
synchronous_commit = on
wal_sync_method = fsync
full_page_writes = on
wal_compression = off
wal_log_hints = off
wal_init_zero = on
wal_recycle = on
wal_buffers = -1
wal_writer_delay = 200ms
wal_writer_flush_after = 1MB
wal_skip_threshold = 2MB
commit_delay = 0
commit_siblings = 5
checkpoint_timeout = 5min
max_wal_size = 2GB
min_wal_size = 256MB
checkpoint_completion_target = 0.7
checkpoint_flush_after = 0
checkpoint_warning = 30s
archive_mode = off
I'm thinking of disabling wal_recycle as a first step to see if that makes any difference, but thought I'd seek some comments first.
Not sure how much of a problem this is - the database is running fine otherwise - but any thoughts would be appreciated.
Thanks & regards,
Guy
Hi, Le 11/02/2021 à 01:21, Guy Burgess a écrit : > > Hello, > > Running 13.1 on Windows Server 2019, I am getting the following log > entries occasionally: > > 2021-02-11 12:34:10.149 NZDT [6072] LOG: could not rename file > "pg_wal/0000000100000099000000D3": Permission denied > 2021-02-11 12:40:31.377 NZDT [6072] LOG: could not rename file > "pg_wal/0000000100000099000000D3": Permission denied > 2021-02-11 12:46:06.294 NZDT [6072] LOG: could not rename file > "pg_wal/0000000100000099000000D3": Permission denied > 2021-02-11 12:46:16.502 NZDT [6072] LOG: could not rename file > "pg_wal/0000000100000099000000DA": Permission denied > 2021-02-11 12:50:20.917 NZDT [6072] LOG: could not rename file > "pg_wal/0000000100000099000000D3": Permission denied > 2021-02-11 12:50:31.098 NZDT [6072] LOG: could not rename file > "pg_wal/0000000100000099000000DA": Permission denied > > What appears to be happening is the affected WAL files (which is > usually only 2 or 3 WAL files at a time) are somehow "losing" their > NTFS permissions, so the PG process can't rename them - though of > course the PG process created them. Even running icacls as admin gives > "Access is denied" on those files. A further oddity is the affected > files do end up disappearing after a while. > > The NTFS permissions on the pg_wal directory are correct, and most WAL > files are unaffected. Chkdsk reports no problems, and the database is > working fine otherwise. Have tried disabling antivirus software in > case that was doing something but no difference. > I haven't dealt with a Windows environment for quite some time, but from what I remember an antivirus installs a driver intercepting file accesses and these drivers are still active even if you disable the antivirus (I suppose they just call a noop instead of content analysis code) and can still interfere with your system. For example some years ago I've seen what looked like a race condition involving rename for MySQL on Windows that could not be fixed by disabling the antivirus but could by uninstalling it completely. You might want to uninstall the antivirus temporarily to check this. Best regards, -- Lionel Bouton gérant de JTEK SARL https://www.linkedin.com/in/lionelbouton/
Guten Tag Guy Burgess, am Donnerstag, 11. Februar 2021 um 01:21 schrieben Sie: > What appears to be happening is the affected WAL files (which is > usually only 2 or 3 WAL files at a time) are somehow "losing" their > NTFS permissions, so the PG process can't rename them - though of > course the PG process created them. Even running icacls as admin > gives "Access is denied" on those files. A further oddity is the > affected files do end up disappearing after a while. If you see that somewhat frequently, use Process Monitor and Process Explorer to see who accesses those files how. ProcExp easily allows you to find all open handles per file. If it's not AV, it might be something like Windows Search Indexer as well, if that is enabled by default in Server 2019 at all. Though, even with my Windows 10 and Search Indexer enabled I didn't run into such problems yet. And especially when rewriting large parts of my databases with lots of created WAL files, I see the Indexer working on those files, but NOT conflicting with Postgres yet. The behaviour you describe happens exactly when two processes e.g. concurrently hold HANDLEs on the same file and one of those deletes the file then. Windows keeps file names until all open HANDLEs are closed and depending on how those HANDLEs have been opened by the first app, concurrent deletion is perferctly fine for Windows. Though, a such deleted file can't be opened easily anymore and looks like it has lost permissions only. But that's not the case, it's deleted already. It might be that this happens for Postgres to itself somehow when some other app has an open HANDLE. I don't think that some other app is deleting that file by purpose instead, reading it for some reason seems more likely to me. > dwShareMode > FILE_SHARE_DELETE > Enables subsequent open operations on a file or device to request > delete access. Otherwise, other processes cannot open the file or > device if they request delete access. https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew Mit freundlichen Grüßen Thorsten Schöning -- Thorsten Schöning AM-SoFT IT-Service - Bitstore Hameln GmbH i.G. ist ein Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister fürIT und TK E-Mail: Thorsten.Schoening@AM-SoFT.de Web: http://www.AM-SoFT.de/ Telefon: 05151- 9468- 0 Telefon: 05151- 9468-55 Fax: 05151- 9468-88 Mobil: 0178-8 9468-04 AM-SoFT IT-Service - Bitstore Hameln GmbH i.G., Brandenburger Str. 7c, 31789 Hameln AG Hannover HRB neu - Geschäftsführer: Janine Galonska Für Rückfragen stehe ich Ihnen sehr gerne zur Verfügung. Mit freundlichen Grüßen Thorsten Schöning Tel: 05151 9468 0 Fax: 05151 9468 88 Mobil: Webseite: https://www.am-soft.de AM-Soft IT-Service - Bitstore Hameln GmbH i.G. ist ein Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister fürIT und TK AM-Soft IT-Service - Bitstore Hameln GmbH i.G. Brandenburger Str. 7c 31789 Hameln Tel: 05151 9468 0 Bitstore IT-Consulting GmbH Zentrale - Berlin Lichtenberg Frankfurter Allee 285 10317 Berlin Tel: 030 453 087 80 CBS IT-Service - Bitstore Kaulsdorf UG Tel: 030 453 087 880 1 Büro Dallgow-Döberitz Tel: 03322 507 020 Büro Kloster Lehnin Tel: 033207 566 530 PCE IT-Service - Bitstore Darmstadt UG Darmstadt Tel: 06151 392 973 0 Büro Neuruppin Tel: 033932 606 090 ACI EDV Systemhaus Dresden GmbH Dresden Tel: 0351 254 410 Das Systemhaus - Bitstore Magdeburg GmbH Magdeburg Tel: 0391 636 651 0 Allerdata.IT - Bitstore Wittenberg GmbH Wittenberg Tel: 03491 876 735 7 Büro Liebenwalde Tel: 033054 810 00 HSA - das Büro - Bitstore Altenburg UG Altenburg Tel: 0344 784 390 97 Bitstore IT – Consulting GmbH NL Piesteritz Piesteritz Tel: 03491 644 868 6 Solltec IT-Services - Bitstore Braunschweig UG Braunschweig Tel: 0531 206 068 0 MF Computer Service - Bitstore Gütersloh GmbH Gütersloh Tel: 05245 920 809 3 Firmensitz: MF Computer Service - Bitstore Gütersloh GmbH, Gildestraße 25, 33442 Herzebrock-Clarholz Geschäftsführer Janine Galonska
On 12/02/2021 12:31 am, Lionel Bouton wrote: > I haven't dealt with a Windows environment for quite some time, but from > what I remember an antivirus installs a driver intercepting file > accesses and these drivers are still active even if you disable the > antivirus (I suppose they just call a noop instead of content analysis > code) and can still interfere with your system. For example some years > ago I've seen what looked like a race condition involving rename for > MySQL on Windows that could not be fixed by disabling the antivirus but > could by uninstalling it completely. > > You might want to uninstall the antivirus temporarily to check this. Thanks Lionel for this suggestion. Its the built-in Windows Defender AV which I believe can't be uninstalled, so am limited to 'disabling' it (plus it has exclusions for the PG directories & processes already). Using Procmon I can't see any AV file activity when it is disabled. Will keep monitoring it though. Kind regards Guy
On 12/02/2021 4:33 am, Thorsten Schöning wrote: > If you see that somewhat frequently, use Process Monitor and Process > Explorer to see who accesses those files how. ProcExp easily allows > you to find all open handles per file. If it's not AV, it might be > something like Windows Search Indexer as well, if that is enabled by > default in Server 2019 at all. > Thanks Thorsten, that's a great idea - I'm using Procmon and Process Explorer and able to monitor activity on the WAL files, so hopefully that sheds some light. The Indexer service is disabled, so can rule that out. Kind regards Guy
The behaviour you describe happens exactly when two processes e.g. concurrently hold HANDLEs on the same file and one of those deletes the file then. Windows keeps file names until all open HANDLEs are closed and depending on how those HANDLEs have been opened by the first app, concurrent deletion is perferctly fine for Windows. Though, a such deleted file can't be opened easily anymore and looks like it has lost permissions only. But that's not the case, it's deleted already. It might be that this happens for Postgres to itself somehow when some other app has an open HANDLE. I don't think that some other app is deleting that file by purpose instead, reading it for some reason seems more likely to me.
Using Process Monitor, Thorsten's explanation above appears to correctly diagnose what is happening. ProcMon data shows postgres.exe performing "CreateFile" operations on the affected WAL files, with the result status "DELETE PENDING". Which according to https://stackoverflow.com/a/29892104 means:
"Windows allows a process to delete a file, even though it is still opened by another process (e.g. Windows indexing service or Antivirus). It gets internally marked as "delete pending". The file does not actually get removed from the file system, it is still there after the File.Delete call. Anybody that tries to open the file after that gets an access denied error. The file doesn't actually get removed until the last handle to the file object gets closed"
which is the same behaviour Thorsten describes above (great info, thanks Thorsten).
The mystery now is that the only process logged as touching the affected WAL files is postgres.exe (of which there are many separate processes). Could it be that one of the postgres.exe instances is holding the affected WAL files in use after another postgres.exe instance has flagged the file as deleted? (or to put it the other way, a postgres.exe instance is flagging the file as deleted while another instance still has an open handle to the file)? If it is some other process such as the indexer (disabled) or AV (excluded from pgdata) is obtaining a handle on the WAL files, it isn't being logged by ProcMon.
Kind regards,
Guy
Guten Tag Guy Burgess, am Montag, 15. Februar 2021 um 11:52 schrieben Sie: > The mystery now is that the only process logged as touching the > affected WAL files is postgres.exe (of which there are many separate > processes). Could it be that one of the postgres.exe instances is > holding the affected WAL files in use after another postgres.exe > instance has flagged the file as deleted?[...] I suggest checking your WAL-related and archive/backup settings for Postgres again. There's e.g. "archive_command" optionally copying WALs to some other place and postgres.exe would wait until that process has finished, maybe locking the file to copy itself as well. Or "archive_timeout" interfering with some other operations or alike. Mit freundlichen Grüßen Thorsten Schöning -- Thorsten Schöning AM-SoFT IT-Service - Bitstore Hameln GmbH i.G. ist ein Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister fürIT und TK E-Mail: Thorsten.Schoening@AM-SoFT.de Web: http://www.AM-SoFT.de/ Telefon: 05151- 9468- 0 Telefon: 05151- 9468-55 Fax: 05151- 9468-88 Mobil: 0178-8 9468-04 AM-SoFT IT-Service - Bitstore Hameln GmbH i.G., Brandenburger Str. 7c, 31789 Hameln AG Hannover HRB neu - Geschäftsführer: Janine Galonska Für Rückfragen stehe ich Ihnen sehr gerne zur Verfügung. Mit freundlichen Grüßen Thorsten Schöning Tel: 05151 9468 0 Fax: 05151 9468 88 Mobil: Webseite: https://www.am-soft.de AM-Soft IT-Service - Bitstore Hameln GmbH i.G. ist ein Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister fürIT und TK AM-Soft IT-Service - Bitstore Hameln GmbH i.G. Brandenburger Str. 7c 31789 Hameln Tel: 05151 9468 0 Bitstore IT-Consulting GmbH Zentrale - Berlin Lichtenberg Frankfurter Allee 285 10317 Berlin Tel: 030 453 087 80 CBS IT-Service - Bitstore Kaulsdorf UG Tel: 030 453 087 880 1 Büro Dallgow-Döberitz Tel: 03322 507 020 Büro Kloster Lehnin Tel: 033207 566 530 PCE IT-Service - Bitstore Darmstadt UG Darmstadt Tel: 06151 392 973 0 Büro Neuruppin Tel: 033932 606 090 ACI EDV Systemhaus Dresden GmbH Dresden Tel: 0351 254 410 Das Systemhaus - Bitstore Magdeburg GmbH Magdeburg Tel: 0391 636 651 0 Allerdata.IT - Bitstore Wittenberg GmbH Wittenberg Tel: 03491 876 735 7 Büro Liebenwalde Tel: 033054 810 00 HSA - das Büro - Bitstore Altenburg UG Altenburg Tel: 0344 784 390 97 Bitstore IT – Consulting GmbH NL Piesteritz Piesteritz Tel: 03491 644 868 6 Solltec IT-Services - Bitstore Braunschweig UG Braunschweig Tel: 0531 206 068 0 MF Computer Service - Bitstore Gütersloh GmbH Gütersloh Tel: 05245 920 809 3 Firmensitz: MF Computer Service - Bitstore Gütersloh GmbH, Gildestraße 25, 33442 Herzebrock-Clarholz Geschäftsführer Janine Galonska
On 16/02/2021 12:23 am, Thorsten Schöning wrote: >> The mystery now is that the only process logged as touching the >> affected WAL files is postgres.exe (of which there are many separate >> processes). Could it be that one of the postgres.exe instances is >> holding the affected WAL files in use after another postgres.exe >> instance has flagged the file as deleted?[...] > I suggest checking your WAL-related and archive/backup settings for > Postgres again. There's e.g. "archive_command" optionally copying WALs > to some other place and postgres.exe would wait until that process has > finished, maybe locking the file to copy itself as well. Or > "archive_timeout" interfering with some other operations or alike. Thanks Thorsten. The WAL archive settings are out-of-the-box defaults, i.e. disabled: archive_mode = off; archive_command = ''; archive_timeout = 0. I'm not sure there is anything else I can check at this time. The good thing is it doesn't seem to cause any problem other than logging "could not rename file" warnings, so I might have to park this for now. If I find anything else that might offer a new lead I will report back. Kind regards, Guy
Guy Burgess schrieb am 15.02.2021 um 11:52: > The mystery now is that the only process logged as touching the > affected WAL files is postgres.exe (of which there are many separate > processes). Could it be that one of the postgres.exe instances is > holding the affected WAL files in use after another postgres.exe > instance has flagged the file as deleted? (or to put it the other > way, a postgres.exe instance is flagging the file as deleted while > another instance still has an open handle to the file)? If it is some > other process such as the indexer (disabled) or AV (excluded from > pgdata) is obtaining a handle on the WAL files, it isn't being logged > by ProcMon. My first suspect is always the anti-virus on Windows when things like that happen with Postgres. Maybe the AV configuration was changed and pgdata is no longer excluded. Thomas
On Tue, Feb 16, 2021 at 12:22:36PM +0100, Thomas Kellerer wrote: > My first suspect is always the anti-virus on Windows when things like > that happen with Postgres. Or maybe not. 13 has introduced a regression in this area AFAIK, and 909b449 should have taken care of it (available in 13.3~). -- Michael