Thread: pg_upgrade test failure
Hi,

After my last rebase of the meson tree I encountered the following test failure:

https://cirrus-ci.com/task/5532444261613568

[20:23:04.171] ------------------------------------- 8< -------------------------------------
[20:23:04.171] stderr:
[20:23:04.171] # Failed test 'pg_upgrade_output.d/ not removed after pg_upgrade --check success'
[20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 249.
[20:23:04.171] # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success'
[20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 261.
[20:23:04.171] # Looks like you failed 2 tests of 13.

regress_log:
https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

The pg_upgrade output contains these potentially relevant warnings:

...
*Clusters are compatible*
pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log": Directory not empty
pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511": Directory not empty
...

I don't know if actually related to the commit below, but there've been a lot of runs of the pg_upgrade tests in the meson branch, and this is the first failure of this kind. Unfortunately the error seems to be transient - rerunning the tests succeeded.

On 2022-09-13 01:39:59 +0000, Michael Paquier wrote:
> Move any remaining files generated by pg_upgrade into an internal subdir
>
> This change concerns a couple of .txt files (for internal state checks)
> that were still written in the path where the binary is executed, and
> not in the subdirectory located in the target cluster.
> Like the other .txt files doing already so (like loadable_libraries.txt),
> these are saved in the base output directory. Note that on failure, the
> logs report the full path to the .txt file generated, so these are easy
> to find.
>
> Oversight in 38bfae3.
>
> Author: Daniel Gustafsson
> Reviewed-by: Michael Paquier, Justin Pryzby
> Discussion: https://postgr.es/m/181A6DA8-3B7F-4B71-82D5-363FF0146820@yesql.se
> Backpatch-through: 15

Greetings,

Andres Freund
On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> I don't know if actually related to the commit below, but there've been a
> lot of runs of the pg_upgrade tests in the meson branch, and this is the
> first failure of this kind. Unfortunately the error seems to be transient -
> rerunning the tests succeeded.

This smells to me like a race condition in pg_upgrade (or even pg_ctl for SERVER_LOG_FILE) where the code still has handles on some of the files in the log/ subdirectory, preventing its removal from finishing. If this proves to be rather easy to reproduce, a list of the files still present in this path would give an easy hint to follow. Does this reproduce with a good frequency?
--
Michael
Attachment
Hi,

On 2022-09-20 10:08:41 +0900, Michael Paquier wrote:
> On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> > I don't know if actually related to the commit below, but there've been a
> > lot of runs of the pg_upgrade tests in the meson branch, and this is the
> > first failure of this kind. Unfortunately the error seems to be transient -
> > rerunning the tests succeeded.
>
> This smells to me like a race condition in pg_upgrade (or even pg_ctl
> for SERVER_LOG_FILE) where the code still has handles on some of the
> files in the log/ subdirectory, preventing its removal from finishing.

I don't really see what'd race with what here? pg_upgrade has precise control over what's happening here, no?

> If this proves to be rather easy to reproduce, a list of the files
> still present in this path would give an easy hint to follow. Does this
> reproduce with a good frequency?

I've only seen it once so far, but there haven't been many CI runs of the meson branch since rebasing on top of the last changes to pg_upgrade.

Greetings,

Andres Freund
On Mon, Sep 19, 2022 at 06:13:17PM -0700, Andres Freund wrote:
> I don't really see what'd race with what here? pg_upgrade has precise
> control over what's happening here, no?

A code path could have forgotten an fclose() for example, but this code is rather old and close-proof as far as I know. Most of the log files are used with redirections for external calls, though I don't see how these could still be held after pg_upgrade finishes :/ Could the use of meson somehow influence when running tests on Windows?

> I've only seen it once so far, but there haven't been many CI runs of the
> meson branch since rebasing on top of the last changes to pg_upgrade.

Hmm, okay. Is that a specific branch in one of your public repos?
--
Michael
Attachment
On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> Hi,
>
> After my last rebase of the meson tree I encountered the following test
> failure:
>
> https://cirrus-ci.com/task/5532444261613568
>
> [20:23:04.171] ------------------------------------- 8< -------------------------------------
> [20:23:04.171] stderr:
> [20:23:04.171] # Failed test 'pg_upgrade_output.d/ not removed after pg_upgrade --check success'
> [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 249.
> [20:23:04.171] # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success'
> [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 261.
> [20:23:04.171] # Looks like you failed 2 tests of 13.
>
> regress_log:
> https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
>
> The pg_upgrade output contains these potentially relevant warnings:
>
> ...
> *Clusters are compatible*
> pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log": Directory not empty
> pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511": Directory not empty
> ...

It looks like it failed to remove a *.log file under windows, which caused rmtree to fail.

src/bin/pg_upgrade/pg_upgrade.h-#define DB_DUMP_LOG_FILE_MASK "pg_upgrade_dump_%u.log"
src/bin/pg_upgrade/pg_upgrade.h-#define SERVER_LOG_FILE "pg_upgrade_server.log"
src/bin/pg_upgrade/pg_upgrade.h-#define UTILITY_LOG_FILE "pg_upgrade_utility.log"
src/bin/pg_upgrade/pg_upgrade.h:#define INTERNAL_LOG_FILE "pg_upgrade_internal.log"

ee5353abb only changed .txt files used for errors so can't be the cause, but the original commit 38bfae3 might be related.
I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so didn't warn about the file itself, but then failed one moment later in rmdir.
--
Justin
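[For readers following along, the retry loop under discussion can be pictured with a rough, portable sketch. The function name and retry budget here are invented for illustration; the real loop lives in pgunlink() in src/port/dirmod.c, where Windows sharing violations and delete-pending states surface as EACCES. The point is that EACCES keeps the loop alive while a true ENOENT exits immediately, so an ENOENT reported for a delete-pending file would skip the wait entirely:]

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical sketch of a sleep+retry unlink loop in the spirit of
 * pgunlink().  EACCES (which on Windows can mean "someone still has a
 * handle open") triggers a retry; any other error, including a true
 * ENOENT, returns immediately.
 */
static int
retry_unlink(const char *path)
{
	int			loops = 0;

	while (unlink(path) < 0)
	{
		if (errno != EACCES)
			return -1;			/* real error, or ENOENT: stop immediately */
		if (++loops > 100)		/* ~10 seconds at 100ms per iteration */
			return -1;			/* give up, still EACCES */
		usleep(100 * 1000);
	}
	return 0;
}
```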
On Tue, Sep 20, 2022 at 7:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> > Hi,
> >
> > After my last rebase of the meson tree I encountered the following test
> > failure:
> >
> > https://cirrus-ci.com/task/5532444261613568
> >
> > [20:23:04.171] ------------------------------------- 8< -------------------------------------
> > [20:23:04.171] stderr:
> > [20:23:04.171] # Failed test 'pg_upgrade_output.d/ not removed after pg_upgrade --check success'
> > [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 249.
> > [20:23:04.171] # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success'
> > [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 261.
> > [20:23:04.171] # Looks like you failed 2 tests of 13.
> >
> > regress_log:
> > https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
> >
> > The pg_upgrade output contains these potentially relevant warnings:
> >
> > ...
> > *Clusters are compatible*
> > pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log": Directory not empty
> > pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511": Directory not empty
> > ...

Just for the records - the same issue was also seen here [1], [2].

[1] https://cirrus-ci.com/task/5709014662119424?logs=check_world#L82
[2] https://api.cirrus-ci.com/v1/artifact/task/5709014662119424/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On 2022-09-27 11:47:37 +0530, Bharath Rupireddy wrote:
> Just for the records - the same issue was also seen here [1], [2].
>
> [1] https://cirrus-ci.com/task/5709014662119424?logs=check_world#L82
> [2] https://api.cirrus-ci.com/v1/artifact/task/5709014662119424/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

Yea, this is at the moment one of the top sources of spurious test failures for cfbot. Just manually looking at http://cfbot.cputube.org/ for tasks that recently changed state on windows:

https://cirrus-ci.com/task/6422687231770624?logs=check_world#L60
https://cirrus-ci.com/task/6408332243107840?logs=check_world#L60
https://cirrus-ci.com/task/6202259712245760?logs=check_world#L60
https://cirrus-ci.com/task/6150885981028352?logs=check_world#L60
https://cirrus-ci.com/task/5361597290905600?logs=check_world#L60
https://cirrus-ci.com/task/5177327624650752?logs=check_world#L60
https://cirrus-ci.com/task/4862503887831040?logs=check_world#L60
https://cirrus-ci.com/task/4576362479484928?logs=check_world#L60

Something needs to happen here.

Greetings,

Andres Freund
On Tue, Sep 20, 2022 at 1:31 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so
> didn't warn about the file itself, but then failed one moment later in
> rmdir.

Yeah, I think this is my fault. In commit f357233c the new lstat() call might return ENOENT for STATUS_DELETE_PENDING, and then we don't enter pgunlink()'s 10 second sleep-retry loop. Let me think about how best to fix that, and how to write a regression test program that would exercise stuff like this. Might take a couple of days as I am away from computers until mid-week.
On Mon, Oct 3, 2022 at 9:07 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Sep 20, 2022 at 1:31 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so
> > didn't warn about the file itself, but then failed one moment later in
> > rmdir.
>
> Yeah, I think this is my fault. In commit f357233c the new lstat()
> call might return ENOENT for STATUS_DELETE_PENDING, and then we don't
> enter pgunlink()'s 10 second sleep-retry loop. Let me think about how
> best to fix that, and how to write a regression test program that
> would exercise stuff like this. Might take a couple of days as I am
> away from computers until mid-week.

I think something like the attached should do the right thing for STATUS_DELETE_PENDING (sort of "ENOENT-in-progress"). unlink() goes back to being blocking (sleep+retry until eventually we reach ENOENT or we time out and give up with EACCES), but we still distinguish it from true ENOENT so we have a fast exit in that case. This is passing CI, but not tested yet.

One ugly thing in this patch is that it has to deal with our historical mistake (?) of including Windows headers in this file in Cygwin builds for no reason and thus getting WIN32 defined on a non-WIN32 build, as I've complained about before[1] but not yet tidied up.

Remembering why we do any of this weird looking stuff that we don't need on Unix, the general idea is that things that scan directories to unlink everything before unlinking the parent directory need to block for a while on STATUS_DELETE_PENDING to increase their probability of success, while things that do anything else probably just want to skip such zombie files completely.
To recap, we have:

* readdir() sees files that are ENOENT-in-progress (so recursive unlinks can see them)
* unlink() waits for ENOENT-in-progress to reach ENOENT (what broke here)
* stat() and lstat() report ENOENT-in-progress as ENOENT (done to fix eg pg_basebackup, which used to fail at random on Windows)
* open() reports ENOENT-in-progress as either ENOENT or EEXIST depending on O_CREAT (because used by stat())

Clearly this set of kludges isn't perfect and other kludge-sets would be possible too. One thought is that we could hide ENOENT-in-progress from readdir(), and add a new rmdir() wrapper instead. If it gets a directory-not-empty error from the kernel, it could at that point wait for zombie files to go away (perhaps registering for file system events with some local equivalent of KQ_FILTER_VNODE if there is one, to be less sloppy than the current sleep() nonsense, but sleep would work too).

When I'm back at my real keyboard I'll try to come up with tests for this stuff, but I'm not sure how solid we can really make a test for this particular case -- I think you'd need to have another thread open the file and then close it after different periods of time, to demonstrate that the retry loop works but also gives up, and that's exactly the sort of timing-dependent stuff we try to avoid. But I think I'll try that anyway, because it's essential infrastructure to allow Unix-only hackers to work on this stuff. Once we have that, we might be able to make some more progress with the various FILE_DISPOSITION_POSIX_SEMANTICS proposals, if it helps, because we'll have reproducible evidence for what it really does.

[1] https://commitfest.postgresql.org/39/3781/
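[The rmdir()-wrapper idea above could look roughly like this portable sketch. The name and retry budget are hypothetical; the real thing would need the Windows-specific error mapping, and might register for file system events rather than sleeping:]

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/stat.h>

/*
 * Hypothetical rmdir() wrapper: if the kernel says the directory is not
 * empty (which on Windows could mean it still contains zombie
 * ENOENT-in-progress entries), sleep and retry for a bounded time instead
 * of failing straight away.  Any other error is reported immediately.
 */
static int
rmdir_retry(const char *path, int max_tries)
{
	for (int i = 0; i < max_tries; i++)
	{
		if (rmdir(path) == 0)
			return 0;
		if (errno != ENOTEMPTY && errno != EEXIST)
			return -1;			/* some other error: report it */
		usleep(100 * 1000);		/* wait for pending deletes to complete */
	}
	return -1;					/* still not empty after the grace period */
}
```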
Attachment
On Mon, Oct 03, 2022 at 12:10:06PM +1300, Thomas Munro wrote:
> I think something like the attached should do the right thing for
> STATUS_DELETE_PENDING (sort of "ENOENT-in-progress"). unlink() goes
> back to being blocking (sleep+retry until eventually we reach ENOENT
> or we time out and give up with EACCES), but we still distinguish it
> from true ENOENT so we have a fast exit in that case. This is passing
> CI, but not tested yet.

    if (lstat(path, &st) < 0)
-       return -1;
+   {
+       if (lstat_error_was_status_delete_pending())
+           is_lnk = false;
+       else
+           return -1;
+   }
+   else
+       is_lnk = S_ISLNK(st.st_mode);

Sorry, I don't remember all the details in this area, but a directory can never be marked as STATUS_DELETE_PENDING with some of its contents still inside, right? If it has some contents, forcing unlink() all the time would be fine?

> One ugly thing in this patch is that it has to deal with our
> historical mistake (?) of including Windows headers in this file in
> Cygwin builds for no reason and thus getting WIN32 defined on a
> non-WIN32 build, as I've complained about before[1] but not yet tidied
> up.

Your proposal remains local to dirmod.c, so that does not sound like a big deal to me for the time being.
--
Michael
Attachment
On Mon, Oct 3, 2022 at 1:40 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Oct 03, 2022 at 12:10:06PM +1300, Thomas Munro wrote:
> > I think something like the attached should do the right thing for
> > STATUS_DELETE_PENDING (sort of "ENOENT-in-progress"). unlink() goes
> > back to being blocking (sleep+retry until eventually we reach ENOENT
> > or we time out and give up with EACCES), but we still distinguish it
> > from true ENOENT so we have a fast exit in that case. This is passing
> > CI, but not tested yet.
>
> Sorry, I don't remember all the details in this area, but a directory
> can never be marked as STATUS_DELETE_PENDING with some of its contents
> still inside, right?

That's my understanding, yes: just like Unix, you can't remove a directory with something in it. Unlike Unix, that includes files that have been unlinked but are still open somewhere. (Note that in this case it's not exactly a real directory, it's a junction point, which is a directory but it doesn't have contents, it is a reparse point pointing somewhere else, so I suspect that it can't really suffer from ENOTEMPTY, but it probably can suffer from 'someone has it open for a short time because they are concurrently stat-ing it'.)

> If it has some contents, forcing unlink() all
> the time would be fine?

Here's why I think it's probably OK to use unlink() unconditionally after detecting STATUS_DELETE_PENDING. AFAICT there is no way to even find out if it's a file or a junction in this state, but it doesn't matter: we are not waiting for rmdir() or unlink() to succeed, we are waiting for it to fail with an error other than EACCES, most likely ENOENT (or to time out, perhaps because someone held the file open for 11 seconds, or because EACCES was actually telling us about a permissions problem).
EACCES is the errno for many things including STATUS_DELETE_PENDING and also "you called unlink() but it's a directory" (should be EPERM according to POSIX, or EISDIR according to Linux). Both of those reasons imply that the zombie directory entry still exists, and we don't care which of those reasons triggered it. So I think that setting is_lnk = false is good enough here. Do you see a hole in it?
On Mon, Oct 03, 2022 at 04:03:12PM +1300, Thomas Munro wrote:
> So I think that setting is_lnk = false is good enough here. Do
> you see a hole in it?

I cannot think of one, off the top of my head. Thanks for the explanation.
--
Michael
Attachment
On Mon, Oct 3, 2022 at 7:29 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Oct 03, 2022 at 04:03:12PM +1300, Thomas Munro wrote:
> > So I think that setting is_lnk = false is good enough here. Do
> > you see a hole in it?
>
> I cannot think of one, off the top of my head. Thanks for the
> explanation.

Some things I learned while trying to understand how I managed to introduce that bug despite reading and testing:

* the code in pgunlink() has comments saying that its retry loop is to handle sharing violations
* in fact that retry loop also comes into play for STATUS_DELETE_PENDING
* that case is fairly well hidden, because to reach it you need to unlink(pathname) twice! the second call will wait up to 10 seconds for handles to close and then report ENOENT, allowing rmdir(parent) to succeed
* I guess this code is relying on that double-unlink to block until the directory is empty?
* you wouldn't notice any of this if you were testing on Windows 10 on a desktop/laptop/VM, because it now uses POSIX semantics for unlink on NTFS, so the first unlink truly (synchronously) unlinks (no more STATUS_DELETE_PENDING)
* Server 2019, as used on CI, still uses the traditional NT semantics (unlink is asynchronous, completing when all handles close)
* the fix I proposed has the right effect (I will follow up with tests to demonstrate)

I'll post my tests for this and a bunch more things I figured out shortly in a new Windows-filesystem-semantics-omnibus thread.
On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:
> * Server 2019, as used on CI, still uses the traditional NT semantics
> (unlink is asynchronous, completing when all handles close)
> * the fix I proposed has the right effect (I will follow up with tests
> to demonstrate)

Wow, nice investigation. And cirrus does not offer a newer option either.. Do you think that Windows Server 2022 (successor of 2019) is able to use POSIX semantics for unlink()? It looks like we are a few years away from being able to do that, assuming that cirrus offers a newer version than Server 2019, but I guess that the code could mention this possibility in a comment, at least..
--
Michael
Attachment
On Tue, Oct 18, 2022 at 01:06:15PM +0900, Michael Paquier wrote:
> On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:
> > * Server 2019, as used on CI, still uses the traditional NT semantics
> > (unlink is asynchronous, completing when all handles close)
> > * the fix I proposed has the right effect (I will follow up with tests
> > to demonstrate)
>
> Wow, nice investigation. And cirrus does not offer a newer option
> either..

Currently Andres builds images based on cirrus's 2019 image, but I think we could use any windows docker image.

> Do you think that Windows Server 2022 (successor of 2019) is
> able to use POSIX semantics for unlink()?

I think it's possible to use it now, like what's done here.

https://commitfest.postgresql.org/40/3347/

The only caveat is that it's done conditionally.
Hi,

On 2022-10-17 23:31:44 -0500, Justin Pryzby wrote:
> On Tue, Oct 18, 2022 at 01:06:15PM +0900, Michael Paquier wrote:
> > On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:
> > > * Server 2019, as used on CI, still uses the traditional NT semantics
> > > (unlink is asynchronous, completing when all handles close)
> > > * the fix I proposed has the right effect (I will follow up with tests
> > > to demonstrate)
> >
> > Wow, nice investigation. And cirrus does not offer a newer option
> > either..
>
> Currently Andres builds images based on cirrus's 2019 image, but I think
> we could use any windows docker image.

You unfortunately can't run newer containers than the host OS :(, just use older ones. And if you use mismatching containers the startup gets slower because it switches to use full virtualization rather than containers.

I think we need to switch to use full VMs rather than containers. The performance of the windows containers is just atrocious (build times on a local VM with the same number of cores is 1/2 of what we see in CI, test times 1/3), they're slow to start due to pulling all files and decompressing them, and they're fragile.

I've asked Bilal (CCed) to work on generating both containers and images.

Greetings,

Andres Freund
So [1] on its own didn't fix this. My next guess is that the attached might help.

Hmm. Following Michael's clue that this might involve log files and pg_ctl, I noticed one thing: pg_ctl implements wait_for_postmaster_stop() by waiting for kill(pid, 0) to fail, and our kill emulation does CallNamedPipe(). If the server is in the process of exiting and the kernel is cleaning up all the handles we didn't close, is there any reason to expect the signal pipe to be closed after the log file?

[1] https://www.postgresql.org/message-id/flat/20221025213055.GA8537%40telsasoft.com#9030de6c4c5e544d2b057b793a5b42af
Attachment
On Tue, Nov 08, 2022 at 01:16:09AM +1300, Thomas Munro wrote:
> So [1] on its own didn't fix this. My next guess is that the attached
> might help.

I took the liberty of adding a CF entry for this:

https://commitfest.postgresql.org/41/4011/

And afterwards figured I could be a little bit wasteful and run the tests using meson test --repeat, rather than let cfbot do it over the course of a month.

https://cirrus-ci.com/task/5115893722644480

So I didn't find evidence that it doesn't resolve the issue (but this also doesn't prove that it will work).
--
Justin
Hi,

On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:
> So [1] on its own didn't fix this. My next guess is that the attached
> might help.
>
> Hmm. Following Michael's clue that this might involve log files and
> pg_ctl, I noticed one thing: pg_ctl implements
> wait_for_postmaster_stop() by waiting for kill(pid, 0) to fail, and
> our kill emulation does CallNamedPipe(). If the server is in the
> process of exiting and the kernel is cleaning up all the handles we
> didn't close, is there any reason to expect the signal pipe to be
> closed after the log file?

What is our plan here? This afaict is the most common "false positive" for cfbot in the last weeks. E.g.:

https://api.cirrus-ci.com/v1/artifact/task/5462686092230656/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

...
[00:02:58.761](93.859s) ok 10 - run of pg_upgrade for new instance
[00:02:58.808](0.047s) not ok 11 - pg_upgrade_output.d/ removed after pg_upgrade success
[00:02:58.815](0.007s) # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success'
# at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 288.

Michael: Why does 002_pg_upgrade.pl try to filter the list of files in pg_upgrade_output.d for files ending in .log? And why does it print those only after starting the new node? How about moving the iteration through pg_upgrade_output.d to before the ->start and printing all the files, but only slurp_file() if the filename ends with *.log?

Minor nit: It seems off to have quite so many copies of $newnode->data_dir . "/pg_upgrade_output.d", particularly where the test defines $log_path, but then still builds it from scratch after (line 304).

Greetings,

Andres Freund
On Wed, Dec 7, 2022 at 7:15 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:
> > So [1] on its own didn't fix this. My next guess is that the attached
> > might help.
>
> What is our plan here? This afaict is the most common "false positive" for
> cfbot in the last weeks.

That branch hasn't failed on cfbot[1], except once due to one of the other known flapping races we have to fix. Which doesn't prove anything, of course, but it is encouraging. I wish we knew why the test does this, though....

Here's a better version that works harder to avoid opening more than one fd at a time (like the pgfnames()-based code it replaces), and also uses fd.c facilities in the backend version (unlike pgfnames(), which looks like it could leak a descriptor if palloc() threw, and also doesn't know how to handle file descriptor pressure).

[1] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/41/4011
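[The one-descriptor-at-a-time shape described here can be pictured with a small portable sketch: read all the entry names first and close the directory, then do the unlinks, then remove the directory itself. This is illustrative only — the real rmtree() also recurses into subdirectories, maps Windows errors, and has no fixed caps:]

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

/*
 * Sketch of removing a flat directory while holding at most one
 * descriptor open at a time.  Fixed-size arrays keep the sketch simple.
 */
static int
remove_flat_dir(const char *dirpath)
{
	char		names[256][256];
	int			nnames = 0;
	DIR		   *dir;
	struct dirent *de;
	int			result = 0;

	dir = opendir(dirpath);
	if (dir == NULL)
		return -1;
	while ((de = readdir(dir)) != NULL && nnames < 256)
	{
		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(names[nnames++], sizeof(names[0]), "%s", de->d_name);
	}
	closedir(dir);				/* directory handle released before unlinking */

	for (int i = 0; i < nnames; i++)
	{
		char		path[4096];

		snprintf(path, sizeof(path), "%s/%s", dirpath, names[i]);
		if (unlink(path) < 0)
			result = -1;		/* keep going; report failure at the end */
	}

	if (result == 0 && rmdir(dirpath) < 0)
		result = -1;
	return result;
}
```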
Attachment
On Thu, Jan 5, 2023 at 4:11 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Dec 7, 2022 at 7:15 AM Andres Freund <andres@anarazel.de> wrote: > > On 2022-11-08 01:16:09 +1300, Thomas Munro wrote: > > > So [1] on its own didn't fix this. My next guess is that the attached > > > might help. > > > What is our plan here? This afaict is the most common "false positive" for > > cfbot in the last weeks. I pushed the rmtree() change. Let's see if that helps, or tells us something new. Michael: There were some questions from Andres above. FWIW I think if you wanted to investigate this properly on a local Windows system to chase down who's got the file open (shutdown sequence problem or whatever), you'd probably have to install Server 2019, or maybe use an old 8.1 VM if you still have such a thing, based on the suspicion that typical 10 and 11 systems won't exhibit the problem. But then I could be wrong about what's going on...
On Tue, Jan 31, 2023 at 02:00:05PM +1300, Thomas Munro wrote:
> On Thu, Jan 5, 2023 at 4:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, Dec 7, 2022 at 7:15 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:
> > > > So [1] on its own didn't fix this. My next guess is that the attached
> > > > might help.
> > >
> > > What is our plan here? This afaict is the most common "false positive" for
> > > cfbot in the last weeks.
>
> I pushed the rmtree() change. Let's see if that helps, or tells us
> something new.

I found a few failures since then:

https://api.cirrus-ci.com/v1/artifact/task/6696942420361216/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720": Directory not empty

https://api.cirrus-ci.com/v1/artifact/task/5119776607961088/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

same

I verified that those both include your 54e72b66e, which is pretty strange, since the patch passed tests 10s of times on CI until it was merged, when it started/kept failing.
--
Justin
On Wed, Feb 1, 2023 at 6:28 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I pushed the rmtree() change. Let's see if that helps, or tells us
> > something new.
>
> I found a few failures since then:
>
> https://api.cirrus-ci.com/v1/artifact/task/6696942420361216/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
>
> pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720/log": Directory not empty
> pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720": Directory not empty

So no change: we didn't see "could not unlink file ...". So I think that means that it was rmtree() that unlinked the file for the *first* time, but someone else has it open.

Even though Windows is at this point eroding my love of computers and making me consider a new career in, I dunno, carrot farming or something, I have one more idea. Check out this kluge in src/bin/pg_upgrade/exec.c:

    /*
     * "pg_ctl -w stop" might have reported that the server has stopped
     * because the postmaster.pid file has been removed, but "pg_ctl -w
     * start" might still be in the process of closing and might still be
     * holding its stdout and -l log file descriptors open.  Therefore,
     * try to open the log file a few more times.
     */

I'm not sure about anything, but if that's what's happening here, then maybe the attached would help. In short, it would make the previous theory true (the idea of a second unlink() saving the day).
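[The exec.c kluge quoted above amounts to something like this sketch — the function name and retry budget are invented for the example, and the real code has pg_upgrade-specific error reporting:]

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical sketch of "try to open the log file a few more times":
 * keep retrying fopen() for a short while, in case a just-exited process
 * still holds the file open and the OS hasn't released the handle yet.
 */
static FILE *
fopen_with_retries(const char *path, const char *mode, int max_tries)
{
	FILE	   *fp = NULL;

	for (int i = 0; i < max_tries; i++)
	{
		fp = fopen(path, mode);
		if (fp != NULL)
			break;				/* got it */
		usleep(100 * 1000);		/* wait for the other handle to go away */
	}
	return fp;
}
```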
Attachment
Hi,

On January 31, 2023 12:54:42 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
>On Wed, Feb 1, 2023 at 6:28 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>> > I pushed the rmtree() change. Let's see if that helps, or tells us
>> > something new.
>>
>> I found a few failures since then:
>>
>> https://api.cirrus-ci.com/v1/artifact/task/6696942420361216/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
>>
>> pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720/log": Directory not empty
>> pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720": Directory not empty
>
>So no change: we didn't see "could not unlink file ...". So I think
>that means that it was rmtree() that unlinked the file for the *first*
>time, but someone else has it open.
>
>Even though Windows is at this point eroding my love of computers and
>making me consider a new career in, I dunno, carrot farming or
>something, I have one more idea. Check out this kluge in
>src/bin/pg_upgrade/exec.c:
>
>    /*
>     * "pg_ctl -w stop" might have reported that the server has stopped
>     * because the postmaster.pid file has been removed, but "pg_ctl -w
>     * start" might still be in the process of closing and might still be
>     * holding its stdout and -l log file descriptors open.  Therefore,
>     * try to open the log file a few more times.
>     */
>
>I'm not sure about anything, but if that's what's happening here, then
>maybe the attached would help. In short, it would make the previous
>theory true (the idea of a second unlink() saving the day).

Maybe we should just handle it by sleeping and retrying, if on windows? Sad to even propose...
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Feb 1, 2023 at 10:04 AM Andres Freund <andres@anarazel.de> wrote:
> On January 31, 2023 12:54:42 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
> >I'm not sure about anything, but if that's what's happening here, then
> >maybe the attached would help. In short, it would make the previous
> >theory true (the idea of a second unlink() saving the day).
>
> Maybe we should just handle it by sleeping and retrying, if on windows? Sad to even propose...

Yeah, that's what that code I posted would do automatically, though it's a bit hidden.  The second attempt to unlink() would see delete already pending, and activate its secret internal sleep/retry loop.
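[Editor's note: the sleep-and-retry behavior being discussed can be sketched in portable C. This is an illustrative sketch only, not the actual pgunlink()/rmtree() code from the patch; the function name, retry count, and sleep interval are invented for the example.]

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Illustrative sketch (not the real pgunlink()): retry unlink() a few
 * times with a short sleep, to ride out the window where Windows maps a
 * pending delete to a transient error such as EACCES.  Treating ENOENT
 * as success matches the rmtree() use case: if someone else finished
 * deleting the file, our job is done.
 */
static int
unlink_with_retry(const char *path, int max_attempts)
{
	for (int attempt = 0; attempt < max_attempts; attempt++)
	{
		if (unlink(path) == 0)
			return 0;
		if (errno == ENOENT)
			return 0;			/* already gone: good enough */
		if (errno != EACCES)
			return -1;			/* a hard error: give up */
		usleep(100 * 1000);		/* wait 100 ms, then try again */
	}
	return -1;
}
```

On a Unix system the retry path is rarely exercised, which is exactly the point of the thread: the race only bites where open handles block deletion.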
On Wed, Feb 1, 2023 at 9:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... I have one more idea ...

I also had a second idea, barely good enough to mention and probably just paranoia.  In a nearby thread I learned that process exit does not release Windows advisory file locks synchronously, which surprised this Unix hacker; it made me wonder what else might be released lazily after process exit.  Handles?!

However, as previously mentioned, it's possible that even with fully Unix-like resource cleanup on process exit, we could be confused if we are using "the process that was on the end of this pipe has closed it" as a proxy for "the process is gone, *all* its handles are closed".  In any case, the previous kluge should help wallpaper over any of that too, for this test anyway.
On Wed, Feb 1, 2023 at 10:08 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Feb 1, 2023 at 10:04 AM Andres Freund <andres@anarazel.de> wrote:
> > Maybe we should just handle it by sleeping and retrying, if on windows? Sad to even propose...
>
> Yeah, that's what that code I posted would do automatically, though
> it's a bit hidden. The second attempt to unlink() would see delete
> already pending, and activate its secret internal sleep/retry loop.

OK, I pushed that.  Third time lucky?

I tracked down the discussion of that existing comment about pg_ctl, which comes from the 9.2 days:

https://www.postgresql.org/message-id/flat/5044DE59.5020500%40dunslane.net

I guess maybe back then fopen() was Windows' own fopen() that wouldn't allow two handles to a file at the same time?  These days we redirect it to a wrapper with the magic "shared" flags, so the kluge installed by commit f8c81c5dde2 may not even be needed anymore.  It does demonstrate that there are long-standing timing races around log files, process exit and wait-for-shutdown logic, though.  Someone who develops for Windows could probably chase this right down, and make sure that we do certain things in the right order, and/or find better kernel facilities; at a wild guess, something like OpenProcess() before you initiate shutdown, so you can then wait on its handle, for example.  The docs for ExitProcess() make it clear that handles are synchronously closed, so I think it's probably just that our tests for when processes have exited are too fuzzy.
On Wed, Feb 1, 2023 at 2:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> OK, I pushed that. Third time lucky?

I pulled down logs for a week of Windows CI, just over ~1k builds.  The failure rate was a few per day before, but there are no failures like that after that went in.  There are logs that contain the "Directory not empty" warning, but clearly the try-again-and-this-time-wait-for-the-other-process logic must be working (as horrible as it is) because then the test checks that the directory is gone, and succeeds.  Hooray.  So that's one of our biggest CI flappers fixed.  Unfortunately without treating the root cause, really.

Next up: the new "running" tests, spuriously failing around 8.8% of CI builds on FreeBSD.  I'll go and ping that thread...
Hi,

On February 6, 2023 1:51:20 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
>Next up: the new "running" tests, spuriously failing around 8.8% of CI
>builds on FreeBSD. I'll go and ping that thread...

Is that rate unchanged?  I thought I fixed the main issue last week?

Greetings,

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Tue, Feb 7, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
> On February 6, 2023 1:51:20 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
> >Next up: the new "running" tests, spuriously failing around 8.8% of CI
> >builds on FreeBSD. I'll go and ping that thread...
>
> Is that rate unchanged? I thought I fixed the main issue last week?

Unfortunately my cfbot database only holds a week's history.  What I see is that there were 1254 FreeBSD tasks run in that window, of which 163 failed, and (more interestingly) 111 of those failures succeeded on every other platform.  And clicking on a few on cfbot's page reveals that it's the new running stuff, and I'm still trying to find the interesting logs...
On Tue, Feb 7, 2023 at 11:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Feb 7, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
> > On February 6, 2023 1:51:20 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
> > >Next up: the new "running" tests, spuriously failing around 8.8% of CI
> > >builds on FreeBSD. I'll go and ping that thread...
> >
> > Is that rate unchanged? I thought I fixed the main issue last week?
>
> Unfortunately my cfbot database only holds a week's history. What I
> see is that there were 1254 FreeBSD tasks run in that window, of which
> 163 failed, and (more interestingly) 111 of those failures succeeded
> on every other platform. And clicking on a few on cfbot's page
> reveals that it's the new running stuff, and I'm still trying to find
> the interesting logs...

Ah, that number might include some other problems, including in subscription (#2900).  That's the problem with flapping tests: you get desensitised and stop looking closely and miss things...
Hi,

On 2023-02-07 11:03:18 +1300, Thomas Munro wrote:
> On Tue, Feb 7, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
> > On February 6, 2023 1:51:20 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
> > >Next up: the new "running" tests, spuriously failing around 8.8% of CI
> > >builds on FreeBSD. I'll go and ping that thread...
> >
> > Is that rate unchanged? I thought I fixed the main issue last week?
>
> Unfortunately my cfbot database only holds a week's history.

Would be interesting to increase that to a considerably longer time.  I can't imagine that that'd take all that much resources?

> What I see is that there were 1254 FreeBSD tasks run in that window, of
> which 163 failed, and (more interestingly) 111 of those failures succeeded
> on every other platform. And clicking on a few on cfbot's page reveals that
> it's the new running stuff, and I'm still trying to find the interesting
> logs...

I think I figured out why the logs frequently fail to upload - the server is still running, so the size changes during the upload, causing the upload to fail with errors like:

[12:46:43.552] Failed to upload artifacts: Put
"https://storage.googleapis.com/cirrus-ci-5309429912436736-3271c9/artifacts/postgresql-cfbot/postgresql/6729936359129088/testrun/build/testrun/runningcheck.log?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=cirrus-ci%40cirrus-ci-community.iam.gserviceaccount.com%2F20230206%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20230206T124536Z&X-Goog-Expires=600&X-Goog-SignedHeaders=host%3Bx-goog-content-length-range%3Bx-goog-meta-created_by_task&X-Goog-Signature=8e84192cbc754180b8baa6c00c41b463f580fe7183f0e7113c253aac13cc2458b835caef7940b91e102e96d54cff2b5714c77390e74244e2fb88c00c9957a801e33cbee2ac960e0db8a01fe08ee945bedf4616881e6beafa3a162c22948ac0b9a9359d93e1f461fc9f49385b784b75d633f1b01805b987d9d53bc7fb55263917ec85180a2140659d50990f066160f03e8bb8984e8d2aadb64c875c253167cf24da152a18d69fcd3d941edce145931e4feb23dc8cf43de7b7bbfc565786c1c692406f2a0a127f30385a8c4b66f96709b51d26d3c71617991c731b0e7206ee3906338dedf6359412edd024f8c76bd33400f4c9320c2bde9512fa8bcd6289e54d52": http2:request body larger than specified content length I'm testing adding a pg_ctl stop to the on_failure right now. Greetings, Andres Freund
On 2023-02-06 14:14:22 -0800, Andres Freund wrote: > On 2023-02-07 11:03:18 +1300, Thomas Munro wrote: > > What I see is that there were 1254 FreeBSD tasks run in that window, of > > which 163 failed, and (more interestingly) 111 of those failures succeeded > > on every other platform. And clicking on a few on cfbot's page reveals that > > it's the new running stuff, and I'm still trying to find the interesting > > logs... > > I think I figured out why the logs frequently fail to upload - the server is > still running, so the size changes during the upload, causing the upload to > fail with errors like: > > [12:46:43.552] Failed to upload artifacts: Put "https://storage.googleapis.com/cirrus-ci-5309429912436736-3271c9/artifacts/postgresql-cfbot/postgresql/6729936359129088/testrun/build/testrun/runningcheck.log?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=cirrus-ci%40cirrus-ci-community.iam.gserviceaccount.com%2F20230206%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20230206T124536Z&X-Goog-Expires=600&X-Goog-SignedHeaders=host%3Bx-goog-content-length-range%3Bx-goog-meta-created_by_task&X-Goog-Signature=8e84192cbc754180b8baa6c00c41b463f580fe7183f0e7113c253aac13cc2458b835caef7940b91e102e96d54cff2b5714c77390e74244e2fb88c00c9957a801e33cbee2ac960e0db8a01fe08ee945bedf4616881e6beafa3a162c22948ac0b9a9359d93e1f461fc9f49385b784b75d633f1b01805b987d9d53bc7fb55263917ec85180a2140659d50990f066160f03e8bb8984e8d2aadb64c875c253167cf24da152a18d69fcd3d941edce145931e4feb23dc8cf43de7b7bbfc565786c1c692406f2a0a127f30385a8c4b66f96709b51d26d3c71617991c731b0e7206ee3906338dedf6359412edd024f8c76bd33400f4c9320c2bde9512fa8bcd6289e54d52": http2:request body larger than specified content length > > I'm testing adding a pg_ctl stop to the on_failure right now. Pushed the fix doing so.
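[Editor's note: the shape of the fix described above can be sketched as a Cirrus CI fragment. The step name and data directory path below are illustrative placeholders, not the actual .cirrus.yml change that was pushed.]

```yaml
on_failure:
  # Illustrative: stop the server left running by the "running" tests
  # before Cirrus collects artifacts, so runningcheck.log stops changing
  # size mid-upload.  The -D path is a placeholder, not the real one.
  stop_running_server_script: |
    pg_ctl -D build/testrun/runningcheck/data stop -m fast || true
```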
Dear Andres,

While tracking BF failures related to pg_upgrade, I found that the same failure has still happened [1] - [4].  According to the log, the output directory remained even after a successful upgrade [5].  I analyzed it and attached a fix patch; below is my analysis... what do you think?

=====

lstat() seemed to fail while doing the second try of rmtree().  This error message is output from get_dirent_type().  Unlike pgunlink(), get_dirent_type() does not have a retry mechanism for the case where lstat()->_pglstat64() detects STATUS_DELETE_PENDING.  Therefore, I think rmtree() may not wait for the file to be really removed if that status is detected in get_dirent_type().  One solution is to retry stat() or lstat() even in get_dirent_type(), like the attached.

[1]: 2023-07-21 02:21:53 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-07-21%2002%3A21%3A53
[2]: 2023-10-21 13:39:15 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-10-21%2013%3A39%3A15
[3]: 2023-10-23 09:03:07 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-10-23%2009%3A03%3A07
[4]: 2023-10-27 23:06:17 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-10-27%2023%3A06%3A17

[5]
```
...
*Clusters are compatible*
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867": Directory not empty
pg_upgrade: warning: could not stat file "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867/log/pg_upgrade_internal.log": No such file or directory
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867": Directory not empty
[23:46:07.585](17.106s) ok 12 - run of pg_upgrade --check for new instance
[23:46:07.587](0.002s) not ok 13 - pg_upgrade_output.d/ removed after pg_upgrade --check success
...
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
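[Editor's note: the proposed fix can be sketched portably, ignoring the Windows-specific STATUS_DELETE_PENDING mapping, as an lstat() wrapper that retries briefly before reporting failure. The name and parameters here are invented for illustration; they are not the actual get_dirent_type() patch.]

```c
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Illustrative sketch of the proposed get_dirent_type() fix: when
 * lstat() fails in a way that could mean "delete still pending" (on
 * Windows, _pglstat64() can surface STATUS_DELETE_PENDING this way),
 * sleep briefly and retry instead of failing on the first attempt, so a
 * concurrent delete has time to complete before rmtree() gives up.
 */
static int
lstat_with_retry(const char *path, struct stat *st, int max_attempts)
{
	for (int attempt = 1;; attempt++)
	{
		if (lstat(path, st) == 0)
			return 0;
		if (attempt >= max_attempts)
			return -1;			/* give up; errno is set by lstat() */
		usleep(100 * 1000);		/* wait 100 ms, then try again */
	}
}
```

The design point is the same as for the earlier unlink() retry: the caller (rmtree()) only needs a consistent answer, "file is here" or "file is gone", and a short wait converts the in-between delete-pending state into one of those.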
Attachment
On Sun, 29 Oct 2023 at 11:14, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Andres,
>
> While tracking BF failures related to pg_upgrade, I found that the same failure has still happened [1] - [4].
> According to the log, the output directory remained even after a successful upgrade [5].
> I analyzed it and attached a fix patch; below is my analysis... what do you think?
>

The same failure occurs randomly at [1] for a newly added test too:

pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796": Directory not empty
pg_upgrade: warning: could not stat file "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log/pg_upgrade_internal.log": No such file or directory
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796": Directory not empty

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-01-04%2019%3A56%3A20

Regards,
Vignesh