Bug report - pg_upgrade tool seems to have a race condition when trying to delete a pg_wal file - Mailing list pgsql-bugs

From Waka Ranai
Subject Bug report - pg_upgrade tool seems to have a race condition when trying to delete a pg_wal file
Date
Msg-id CAP8Vo=9o0FE6gzqZJ3XdeGPNqi=eNV3cM_6v-thE640YcYoWog@mail.gmail.com
Responses Re: Bug report - pg_upgrade tool seems to have a race condition when trying to delete a pg_wal file
List pgsql-bugs

Hello,

 

I tested the pg_upgrade tool many times on different servers (always Windows Server 2019, though the exact build may differ), upgrading an existing database from Postgres 9.6 to Postgres 15 (I tried both 15.4.2 and 15.7). Almost every time, I ran into this issue during the step “Setting next transaction ID and epoch for new cluster”.

Here is the version of one of the servers on which it failed at least three times (see image.png).

 

The command I ran, after setting PGPASSWORD to the correct password, was:

    "C:\Program Files\PostgreSQL\15\bin\pg_upgrade.exe" -d "C:\Program Files\PostgreSQL\9.6\data" -D "C:\Program Files\PostgreSQL\15\data" -b "C:\Program Files\PostgreSQL\9.6\bin" -B "C:\Program Files\PostgreSQL\15\bin" -U postgres

 

The error was either “pg_resetwal: error: could not delete file "pg_wal/000000010000000000000001": Permission denied” or, sometimes, a message saying that the file could not be found instead of Permission denied. When I watch the directory while pg_upgrade is running, I can see that the file is there beforehand, and it is always gone after pg_upgrade crashes. I used Process Explorer to check which processes were using the file: they were always postgres processes. After a fresh install of Postgres 15 there was only one, but during the execution of pg_upgrade I sometimes saw two processes using it.

 

I suspect that there is some sort of race condition: one process sees that the file exists, does something with it and deletes it, while another process also saw that the file existed but can no longer find it when it tries to delete it. I had a look at the code and I believe it happens in the function KillExistingXLOG, at line 973 of pg_resetwal.c (https://github.com/postgres/postgres/blob/master/src/bin/pg_resetwal/pg_resetwal.c#L973), though I cannot be entirely sure of the cause.
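
To illustrate what I mean, here is a minimal sketch of a delete that tolerates losing such a race. It is not taken from pg_resetwal.c: the helper name remove_if_present is hypothetical, and I am only assuming that treating "file already gone" as success would be acceptable at this step. It would not help with the Permission denied variant, which looks like another process still holding the file open.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of PostgreSQL): delete a file, treating
 * "already gone" as success so that losing a race with another deleter
 * is not reported as an error.
 *
 * Returns 0 if the file was removed or did not exist, -1 on any other
 * failure (errno is left as set by remove()).
 */
int
remove_if_present(const char *path)
{
    if (remove(path) == 0)
        return 0;               /* we deleted it ourselves */

    if (errno == ENOENT)
        return 0;               /* someone else deleted it first */

    return -1;                  /* e.g. EACCES: still a real error */
}
```

With a helper like this, the "could not be found" case would be silently accepted, while any other failure would still be reported as it is today.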

 

You can find the logs produced by the pg_upgrade tool attached, with the verbose option.

 

Thanks in advance for the investigation. I hope to better understand the problem and to see a fix soon, as it is complicating the deployment of a major upgrade of our product.

 

Have a great day,

 

Thomas

Attachment
