Thread: Random pg_upgrade 004_subscription test failure on drongo

Random pg_upgrade 004_subscription test failure on drongo

From
vignesh C
Date:
Hi,

In one of the recent buildfarm runs, the test
pg_upgrade/004_subscription test fails at [1].
It has failed with the following:
Restoring database schemas in the new cluster
*failure*

Consult the last few lines of

"C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20250310T194018.517/log/pg_upgrade_dump_1.log"
for
the probable cause of the failure.
Failure, exiting

Unfortunately we don't have pg_upgrade_output.d contents in buildfarm
to see what is the exact reason.
I checked old buildfarm failures and found a similar failure in drongo
for pg_upgarde/003_logical_slots at [2].

The test failure can happen because of the reason mentioned at [3] and [6].
I'm just requoting the contents from there, if you have checked [3]
and [6] skip the below contents:
---------------------------------------------
The failure reason for the earlier failure was found to be:
According to it, the TRUNCATE command seemed to be failed due to a
"File exists" error.
2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR:
could not create file "base/1/2683": File exists
2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT:
-- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
...

## Analysis
I think it was caused due to the STATUS_DELETE_PENDING failure, not
related with recent
updates for pg_upgrade.

The file "base/1/2683" is an index file for
pg_largeobject_loid_pn_index, and the
output meant that file creation failed. Below is a backtrace.

```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```

But this is strange. Before calling mdcreate(), we surely unlink the
file which have the same name. Below is a trace until unlink.

```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```

I found Thomas said that [4] pgunlink sometimes could not remove a
file even if it returns OK, at that time NTSTATUS is
STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle()
mentions the same thing:

```
/*
* ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
* gone (Windows NT status code is STATUS_DELETE_PENDING).  In that
* case, we'd better ask for the NT status too so we can translate it
* to a more Unix-like error.  We hope that nothing clobbers the NT
* status in between the internal NtCreateFile() call and CreateFile()
* returning.
*
```

The definition of STATUS_DELETE_PENDING can be seen in [5]. Based on
that, indeed, open() would be able to fail with STATUS_DELETE_PENDING
if the deletion is pending but it is trying to open.
---------------------------------------------

This was fixed by the following change in the target upgrade nodes:
bgwriter_lru_maxpages = 0
checkpoint_timeout = 1h

Attached is a patch in similar lines for 004_subscription.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-10%2019%3A26%3A35
[2] -
https://www.postgresql.org/message-id/flat/TYAPR01MB5866AB7FD922CE30A2565B8BF5A8A%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[3] -
https://www.postgresql.org/message-id/TYAPR01MB5866A4E7342088E91362BEF0F5BBA%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[4] - https://www.postgresql.org/message-id/CA%2BhUKGKsdzw06c5nnb%3DKYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q%40mail.gmail.com
[5] - https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
[6]- https://www.postgresql.org/message-id/f0d303f1-e380-5988-91c7-74b755abd4bb%40gmail.com

Regards,
Vignesh

Attachment

Re: Random pg_upgrade 004_subscription test failure on drongo

From
Andrew Dunstan
Date:
On 2025-03-13 Th 5:04 AM, vignesh C wrote:
> Hi,
>
> In one of the recent buildfarm runs, the test
> pg_upgrade/004_subscription test fails at [1].
> It has failed with the following:
> Restoring database schemas in the new cluster
> *failure*
>
> Consult the last few lines of
>
"C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20250310T194018.517/log/pg_upgrade_dump_1.log"
> for
> the probable cause of the failure.
> Failure, exiting
>
> Unfortunately we don't have pg_upgrade_output.d contents in buildfarm
> to see what is the exact reason.



That's not supposed to happen. I am testing a fix to see if I can make 
it collect the logs, but for now we'll have to wait till the next failure ..



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Random pg_upgrade 004_subscription test failure on drongo

From
Heikki Linnakangas
Date:
On 13/03/2025 11:04, vignesh C wrote:
> ## Analysis
> I think it was caused due to the STATUS_DELETE_PENDING failure, not
> related with recent
> updates for pg_upgrade.
> 
> The file "base/1/2683" is an index file for
> pg_largeobject_loid_pn_index, and the
> output meant that file creation failed. Below is a backtrace.
> 
> ```
> pgwin32_open() // <-- this returns -1
> open()
> BasicOpenFilePerm()
> PathNameOpenFilePerm()
> PathNameOpenFile()
> mdcreate()
> smgrcreate()
> RelationCreateStorage()
> RelationSetNewRelfilenumber()
> ExecuteTruncateGuts()
> ExecuteTruncate()
> ```
> 
> But this is strange. Before calling mdcreate(), we surely unlink the
> file which have the same name. Below is a trace until unlink.
> 
> ```
> pgunlink()
> unlink()
> mdunlinkfork()
> mdunlink()
> smgrdounlinkall()
> RelationSetNewRelfilenumber() // common path with above
> ExecuteTruncateGuts()
> ExecuteTruncate()
> ```
> 
> I found Thomas said that [4] pgunlink sometimes could not remove a
> file even if it returns OK, at that time NTSTATUS is
> STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle()
> mentions the same thing:
> 
> ```
> /*
> * ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
> * gone (Windows NT status code is STATUS_DELETE_PENDING).  In that
> * case, we'd better ask for the NT status too so we can translate it
> * to a more Unix-like error.  We hope that nothing clobbers the NT
> * status in between the internal NtCreateFile() call and CreateFile()
> * returning.
> *
> ```
> 
> The definition of STATUS_DELETE_PENDING can be seen in [5]. Based on
> that, indeed, open() would be able to fail with STATUS_DELETE_PENDING
> if the deletion is pending but it is trying to open.
> ---------------------------------------------
> 
> This was fixed by the following change in the target upgrade nodes:
> bgwriter_lru_maxpages = 0
> checkpoint_timeout = 1h
> 
> Attached is a patch in similar lines for 004_subscription.

Hmm, this problem isn't limited to this one pg_upgrade test, right? It 
could happen with any pg_upgrade invocation. And perhaps in a running 
server too, if a relfilenumber is reused quickly. In dropdb() and 
DropTableSpace() we do this:

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

Should we do the same here? Not sure where exactly to put that; perhaps 
in mdcreate(), if the creation fails with STATUS_DELETE_PENDING.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Random pg_upgrade 004_subscription test failure on drongo

From
vignesh C
Date:
On Thu, 13 Mar 2025 at 18:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Hmm, this problem isn't limited to this one pg_upgrade test, right? It
> could happen with any pg_upgrade invocation. And perhaps in a running
> server too, if a relfilenumber is reused quickly.

Yes, it can happen with these scenarios.

> In dropdb() and DropTableSpace() we do this:
>
> WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
>
> Should we do the same here? Not sure where exactly to put that; perhaps
> in mdcreate(), if the creation fails with STATUS_DELETE_PENDING.

I felt mdcreate would be a good place to fix this issue.

Regards,
Vignesh



Re: Random pg_upgrade 004_subscription test failure on drongo

From
vignesh C
Date:
On Thu, 13 Mar 2025 at 18:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>
> Hmm, this problem isn't limited to this one pg_upgrade test, right? It
> could happen with any pg_upgrade invocation. And perhaps in a running
> server too, if a relfilenumber is reused quickly. In dropdb() and
> DropTableSpace() we do this:
>
> WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
>
> Should we do the same here? Not sure where exactly to put that; perhaps
> in mdcreate(), if the creation fails with STATUS_DELETE_PENDING.

How about a patch similar to the attached one? I have run pg_upgrade
tests multiple times, but unfortunately, I was unable to reproduce the
issue or verify these changes.

Regards,
Vignesh

Attachment