Re: Random pg_upgrade 004_subscription test failure on drongo - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Random pg_upgrade 004_subscription test failure on drongo
Msg-id cd3189f8-aaed-4ef6-a6b6-da72c1251f34@iki.fi
In response to Random pg_upgrade 004_subscription test failure on drongo  (vignesh C <vignesh21@gmail.com>)
List pgsql-hackers
On 13/03/2025 11:04, vignesh C wrote:
> ## Analysis
> I think it was caused by the STATUS_DELETE_PENDING failure and is not
> related to the recent updates to pg_upgrade.
> 
> The file "base/1/2683" is the index file for pg_largeobject_loid_pn_index,
> and the output means that creating the file failed. Below is a backtrace.
> 
> ```
> pgwin32_open() // <-- this returns -1
> open()
> BasicOpenFilePerm()
> PathNameOpenFilePerm()
> PathNameOpenFile()
> mdcreate()
> smgrcreate()
> RelationCreateStorage()
> RelationSetNewRelfilenumber()
> ExecuteTruncateGuts()
> ExecuteTruncate()
> ```
> 
> But this is strange. Before calling mdcreate(), we always unlink the
> file that has the same name. Below is the trace down to unlink().
> 
> ```
> pgunlink()
> unlink()
> mdunlinkfork()
> mdunlink()
> smgrdounlinkall()
> RelationSetNewRelfilenumber() // common path with above
> ExecuteTruncateGuts()
> ExecuteTruncate()
> ```
> 
> I found that Thomas said [4] that pgunlink() sometimes cannot remove a
> file even though it returns OK; in that case the NTSTATUS is
> STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle()
> mentions the same thing:
> 
> ```
> /*
> * ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
> * gone (Windows NT status code is STATUS_DELETE_PENDING).  In that
> * case, we'd better ask for the NT status too so we can translate it
> * to a more Unix-like error.  We hope that nothing clobbers the NT
> * status in between the internal NtCreateFile() call and CreateFile()
> * returning.
> *
> ```
> 
> The definition of STATUS_DELETE_PENDING can be seen in [5]. Based on
> that, open() can indeed fail with STATUS_DELETE_PENDING if it tries to
> open a file whose deletion is still pending.
> ---------------------------------------------
> 
> This was fixed by the following change in the target upgrade nodes:
> bgwriter_lru_maxpages = 0
> checkpoint_timeout = 1h
> 
> Attached is a patch along similar lines for 004_subscription.

Hmm, this problem isn't limited to this one pg_upgrade test, right? It 
could happen with any pg_upgrade invocation. And perhaps in a running 
server too, if a relfilenumber is reused quickly. In dropdb() and 
DropTableSpace() we do this:

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

Should we do the same here? Not sure where exactly to put that; perhaps 
in mdcreate(), if the creation fails with STATUS_DELETE_PENDING.
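
To make the idea concrete, here is a minimal, self-contained sketch of the "emit a barrier and retry the create" pattern. It is a simulation only, not PostgreSQL code: every `sim_*` name and both result codes are hypothetical, and the barrier stub simply assumes that once all backends have processed PROCSIGNAL_BARRIER_SMGRRELEASE they have closed their handles, so the pending delete can complete.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes; these are not PostgreSQL or Windows values. */
enum { SIM_FD = 3, SIM_ERR_DELETE_PENDING = -1 };

/* Simulated state: true while some process still holds the unlinked file open. */
static bool sim_delete_pending;
static int	sim_barriers_emitted;

/* Stand-in for the file creation done under mdcreate(). */
static int
sim_create_file(void)
{
	return sim_delete_pending ? SIM_ERR_DELETE_PENDING : SIM_FD;
}

/*
 * Stand-in for
 * WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE)).
 * Assumption: after the barrier, no handle keeps the old file alive.
 */
static void
sim_smgr_release_barrier(void)
{
	sim_barriers_emitted++;
	sim_delete_pending = false;
}

/* Try to create; on a delete-pending failure, run the barrier and retry once. */
static int
sim_create_with_retry(void)
{
	int			fd = sim_create_file();

	if (fd == SIM_ERR_DELETE_PENDING)
	{
		sim_smgr_release_barrier();
		fd = sim_create_file();
	}
	return fd;
}

/* Runs one scenario and reports how many barriers the create needed. */
static int
sim_barriers_needed(bool delete_pending)
{
	sim_delete_pending = delete_pending;
	sim_barriers_emitted = 0;

	int			fd = sim_create_with_retry();

	assert(fd == SIM_FD);
	return sim_barriers_emitted;
}
```

Calling `sim_barriers_needed(true)` exercises the retry path, while `sim_barriers_needed(false)` shows that the common case pays no barrier cost, which is the appeal of reacting only when creation fails rather than emitting the barrier unconditionally.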

-- 
Heikki Linnakangas
Neon (https://neon.tech)



