Re: pg_upgrade test failure - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: pg_upgrade test failure
Date
Msg-id CA+hUKG+H-q78LxD20UOVyJbP1_idyocj6va_v1TF4RnN9JZa+w@mail.gmail.com
Whole thread Raw
In response to Re: pg_upgrade test failure  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: pg_upgrade test failure  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Mon, Oct 3, 2022 at 9:07 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Sep 20, 2022 at 1:31 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so
> > didn't warn about the file itself, but then failed one moment later in
> > rmdir.
>
> Yeah, I think this is my fault.  In commit f357233c the new lstat()
> call might return ENOENT for STATUS_DELETE_PENDING, and then we don't
> enter pgunlink()'s 10 second sleep-retry loop.  Let me think about how
> best to fix that, and how to write a regression test program that
> would exercise stuff like this.  Might take a couple of days as I am
> away from computers until mid-week.

I think something like the attached should do the right thing for
STATUS_DELETE_PENDING (sort of "ENOENT-in-progress").  unlink() goes
back to being blocking (sleep+retry until eventually we reach ENOENT
or we time out and give up with EACCES), but we still distinguish it
from true ENOENT so we have a fast exit in that case.  This is passing
CI, but not tested yet.

One ugly thing in this patch is that it has to deal with our
historical mistake (?) of including Windows headers in this file in
Cygwin builds for no reason and thus getting WIN32 defined on a
non-WIN32 build, as I've complained about before[1] but not yet tidied
up.

Remembering why we do any of this weird looking stuff that we don't
need on Unix, the general idea is that things that scan directories to
unlink everything before unlinking the parent directory need to block
for a while on STATUS_DELETE_PENDING to increase their probability of
success, while things that do anything else probably just want to skip
such zombie files completely.  To recap, we have:

 * readdir() sees files that are ENOENT-in-progress (so recursive
unlinks can see them)
 * unlink() waits for ENOENT-in-progress to reach ENOENT (what broke here)
 * stat() and lstat() report ENOENT-in-progress as ENOENT (done to fix
eg pg_basebackup, which used to fail at random on Windows)
 * open() reports ENOENT-in-progress as either ENOENT or EEXIST
depending on O_CREAT (because used by stat())

Clearly this set of kludges isn't perfect and other kludge-sets would
be possible too.  One thought is that we could hide ENOENT-in-progress
from readdir(), and add a new rmdir() wrapper instead.  If it gets a
directory-not-empty error from the kernel, it could at that point wait
for zombie files to go away (perhaps registering for file system
events with some local equivalent of KQ_FILTER_VNODE if there is one,
to be less sloppy that the current sleep() nonsense, but sleep would
work too).

When I'm back at my real keyboard I'll try to come up with tests for
this stuff, but I'm not sure how solid we can really make a test for
this particular case -- I think you'd need to have another thread open
the file and then close it after different periods of time, to
demonstrate that the retry loop works but also gives up, and that's
exactly the sort of timing-dependent stuff we try to avoid.  But I
think I'll try that anyway, because it's essential infrastructure to
allow Unix-only hackers to work only this stuff.  Once we have that,
we might be able to make some more progress with the various
FILE_DISPOSITION_POSIX_SEMANTICS proposals, if it helps, because we'll
have reproducible evidence for what it really does.

[1] https://commitfest.postgresql.org/39/3781/

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: missing indexes in indexlist with partitioned tables
Next
From: David Rowley
Date:
Subject: Re: Reducing the chunk header sizes on all memory context types