Thread: Random pg_upgrade 004_subscription test failure on drongo
Hi, In one of the recent buildfarm runs, the test pg_upgrade/004_subscription test fails at [1]. It has failed with the following: Restoring database schemas in the new cluster *failure* Consult the last few lines of "C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20250310T194018.517/log/pg_upgrade_dump_1.log" for the probable cause of the failure. Failure, exiting Unfortunately we don't have pg_upgrade_output.d contents in buildfarm to see what is the exact reason. I checked old buildfarm failures and found a similar failure in drongo for pg_upgarde/003_logical_slots at [2]. The test failure can happen because of the reason mentioned at [3] and [6]. I'm just requoting the contents from there, if you have checked [3] and [6] skip the below contents: --------------------------------------------- The failure reason for the earlier failure was found to be: According to it, the TRUNCATE command seemed to be failed due to a "File exists" error. 2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR: could not create file "base/1/2683": File exists 2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT: -- For binary upgrade, preserve pg_largeobject and index relfilenodes SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid); SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid); TRUNCATE pg_catalog.pg_largeobject; ... ## Analysis I think it was caused due to the STATUS_DELETE_PENDING failure, not related with recent updates for pg_upgrade. The file "base/1/2683" is an index file for pg_largeobject_loid_pn_index, and the output meant that file creation failed. Below is a backtrace. ``` pgwin32_open() // <-- this returns -1 open() BasicOpenFilePerm() PathNameOpenFilePerm() PathNameOpenFile() mdcreate() smgrcreate() RelationCreateStorage() RelationSetNewRelfilenumber() ExecuteTruncateGuts() ExecuteTruncate() ``` But this is strange. Before calling mdcreate(), we surely unlink the file which have the same name. Below is a trace until unlink. ``` pgunlink() unlink() mdunlinkfork() mdunlink() smgrdounlinkall() RelationSetNewRelfilenumber() // common path with above ExecuteTruncateGuts() ExecuteTruncate() ``` I found Thomas said that [4] pgunlink sometimes could not remove a file even if it returns OK, at that time NTSTATUS is STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle() mentions the same thing: ``` /* * ERROR_ACCESS_DENIED is returned if the file is deleted but not yet * gone (Windows NT status code is STATUS_DELETE_PENDING). In that * case, we'd better ask for the NT status too so we can translate it * to a more Unix-like error. We hope that nothing clobbers the NT * status in between the internal NtCreateFile() call and CreateFile() * returning. * ``` The definition of STATUS_DELETE_PENDING can be seen in [5]. Based on that, indeed, open() would be able to fail with STATUS_DELETE_PENDING if the deletion is pending but it is trying to open. --------------------------------------------- This was fixed by the following change in the target upgrade nodes: bgwriter_lru_maxpages = 0 checkpoint_timeout = 1h Attached is a patch in similar lines for 004_subscription. [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-10%2019%3A26%3A35 [2] - https://www.postgresql.org/message-id/flat/TYAPR01MB5866AB7FD922CE30A2565B8BF5A8A%40TYAPR01MB5866.jpnprd01.prod.outlook.com [3] - https://www.postgresql.org/message-id/TYAPR01MB5866A4E7342088E91362BEF0F5BBA%40TYAPR01MB5866.jpnprd01.prod.outlook.com [4] - https://www.postgresql.org/message-id/CA%2BhUKGKsdzw06c5nnb%3DKYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q%40mail.gmail.com [5] - https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55 [6]- https://www.postgresql.org/message-id/f0d303f1-e380-5988-91c7-74b755abd4bb%40gmail.com Regards, Vignesh
Attachment
On 2025-03-13 Th 5:04 AM, vignesh C wrote: > Hi, > > In one of the recent buildfarm runs, the test > pg_upgrade/004_subscription test fails at [1]. > It has failed with the following: > Restoring database schemas in the new cluster > *failure* > > Consult the last few lines of > "C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20250310T194018.517/log/pg_upgrade_dump_1.log" > for > the probable cause of the failure. > Failure, exiting > > Unfortunately we don't have pg_upgrade_output.d contents in buildfarm > to see what is the exact reason. That's not supposed to happen. I am testing a fix to see if I can make it collect the logs, but for now we'll have to wait till the next failure .. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 13/03/2025 11:04, vignesh C wrote: > ## Analysis > I think it was caused due to the STATUS_DELETE_PENDING failure, not > related with recent > updates for pg_upgrade. > > The file "base/1/2683" is an index file for > pg_largeobject_loid_pn_index, and the > output meant that file creation failed. Below is a backtrace. > > ``` > pgwin32_open() // <-- this returns -1 > open() > BasicOpenFilePerm() > PathNameOpenFilePerm() > PathNameOpenFile() > mdcreate() > smgrcreate() > RelationCreateStorage() > RelationSetNewRelfilenumber() > ExecuteTruncateGuts() > ExecuteTruncate() > ``` > > But this is strange. Before calling mdcreate(), we surely unlink the > file which have the same name. Below is a trace until unlink. > > ``` > pgunlink() > unlink() > mdunlinkfork() > mdunlink() > smgrdounlinkall() > RelationSetNewRelfilenumber() // common path with above > ExecuteTruncateGuts() > ExecuteTruncate() > ``` > > I found Thomas said that [4] pgunlink sometimes could not remove a > file even if it returns OK, at that time NTSTATUS is > STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle() > mentions the same thing: > > ``` > /* > * ERROR_ACCESS_DENIED is returned if the file is deleted but not yet > * gone (Windows NT status code is STATUS_DELETE_PENDING). In that > * case, we'd better ask for the NT status too so we can translate it > * to a more Unix-like error. We hope that nothing clobbers the NT > * status in between the internal NtCreateFile() call and CreateFile() > * returning. > * > ``` > > The definition of STATUS_DELETE_PENDING can be seen in [5]. Based on > that, indeed, open() would be able to fail with STATUS_DELETE_PENDING > if the deletion is pending but it is trying to open. > --------------------------------------------- > > This was fixed by the following change in the target upgrade nodes: > bgwriter_lru_maxpages = 0 > checkpoint_timeout = 1h > > Attached is a patch in similar lines for 004_subscription. Hmm, this problem isn't limited to this one pg_upgrade test, right? It could happen with any pg_upgrade invocation. And perhaps in a running server too, if a relfilenumber is reused quickly. In dropdb() and DropTableSpace() we do this: WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE)); Should we do the same here? Not sure where exactly to put that; perhaps in mdcreate(), if the creation fails with STATUS_DELETE_PENDING. -- Heikki Linnakangas Neon (https://neon.tech)
On Thu, 13 Mar 2025 at 18:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > Hmm, this problem isn't limited to this one pg_upgrade test, right? It > could happen with any pg_upgrade invocation. And perhaps in a running > server too, if a relfilenumber is reused quickly. Yes, it can happen with these scenarios. > In dropdb() and DropTableSpace() we do this: > > WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE)); > > Should we do the same here? Not sure where exactly to put that; perhaps > in mdcreate(), if the creation fails with STATUS_DELETE_PENDING. I felt mdcreate would be a good place to fix this issue. Regards, Vignesh
On Thu, 13 Mar 2025 at 18:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > Hmm, this problem isn't limited to this one pg_upgrade test, right? It > could happen with any pg_upgrade invocation. And perhaps in a running > server too, if a relfilenumber is reused quickly. In dropdb() and > DropTableSpace() we do this: > > WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE)); > > Should we do the same here? Not sure where exactly to put that; perhaps > in mdcreate(), if the creation fails with STATUS_DELETE_PENDING. How about a patch similar to the attached one? I have run pg_upgrade tests multiple times, but unfortunately, I was unable to reproduce the issue or verify these changes. Regards, Vignesh