Thread: pg_upgrade test failure
Hi,

After my last rebase of the meson tree I encountered the following test failure:

https://cirrus-ci.com/task/5532444261613568

[20:23:04.171] ------------------------------------- 8< -------------------------------------
[20:23:04.171] stderr:
[20:23:04.171] # Failed test 'pg_upgrade_output.d/ not removed after pg_upgrade --check success'
[20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 249.
[20:23:04.171] # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success'
[20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 261.
[20:23:04.171] # Looks like you failed 2 tests of 13.

regress_log:
https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

The pg_upgrade output contains these potentially relevant warnings:

...
*Clusters are compatible*
pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log": Directory not empty
pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511": Directory not empty
...

I don't know if actually related to the commit below, but there've been a lot of runs of the pg_upgrade tests in the meson branch, and this is the first failure of this kind. Unfortunately the error seems to be transient - rerunning the tests succeeded.

On 2022-09-13 01:39:59 +0000, Michael Paquier wrote:
> Move any remaining files generated by pg_upgrade into an internal subdir
>
> This change concerns a couple of .txt files (for internal state checks)
> that were still written in the path where the binary is executed, and
> not in the subdirectory located in the target cluster.
> Like the other .txt files doing already so (like loadable_libraries.txt),
> these are saved in the base output directory. Note that on failure, the
> logs report the full path to the .txt file generated, so these are easy
> to find.
>
> Oversight in 38bfae3.
>
> Author: Daniel Gustafsson
> Reviewed-by: Michael Paquier, Justin Pryzby
> Discussion: https://postgr.es/m/181A6DA8-3B7F-4B71-82D5-363FF0146820@yesql.se
> Backpatch-through: 15

Greetings,

Andres Freund
On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> I don't know if actually related to the commit below, but there've been a
> lot of runs of the pg_upgrade tests in the meson branch, and this is the
> first failure of this kind. Unfortunately the error seems to be transient -
> rerunning the tests succeeded.

This smells to me like a race condition in pg_upgrade (or even pg_ctl for SERVER_LOG_FILE) where the code still has handles on some of the files in the log/ subdirectory, preventing its removal from finishing. If this proves to be rather easy to reproduce, a list of the files still present in this path would give an easy hint to follow. Does this reproduce with a good frequency?
--
Michael
Attachment
Hi,

On 2022-09-20 10:08:41 +0900, Michael Paquier wrote:
> On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> > I don't know if actually related to the commit below, but there've been a
> > lot of runs of the pg_upgrade tests in the meson branch, and this is the
> > first failure of this kind. Unfortunately the error seems to be transient -
> > rerunning the tests succeeded.
>
> This smells to me like a race condition in pg_upgrade (or even pg_ctl
> for SERVER_LOG_FILE) where the code still has handles on some of the
> files in the log/ subdirectory, preventing its removal from finishing.

I don't really see what'd race with what here? pg_upgrade has precise control over what's happening here, no?

> If this proves to be rather easy to reproduce, a list of the files
> still present in this path would give an easy hint to follow. Does this
> reproduce with a good frequency?

I've only seen it once so far, but there haven't been many CI runs of the meson branch since rebasing on top of the last changes to pg_upgrade.

Greetings,

Andres Freund
On Mon, Sep 19, 2022 at 06:13:17PM -0700, Andres Freund wrote:
> I don't really see what'd race with what here? pg_upgrade has precise
> control over what's happening here, no?

A code path could have forgotten an fclose() for example, but this code is rather old and close-proof as far as I know. Most of the log files are used with redirections for external calls, though I don't see how these could still be held after pg_upgrade finishes :/ Could the use of meson somehow influence when running tests on Windows?

> I've only seen it once so far, but there haven't been many CI runs of the
> meson branch since rebasing on top of the last changes to pg_upgrade.

Hmm, okay. Is that a specific branch in one of your public repos?
--
Michael
Attachment
On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> Hi,
>
> After my last rebase of the meson tree I encountered the following test
> failure:
>
> https://cirrus-ci.com/task/5532444261613568
>
> [20:23:04.171] ------------------------------------- 8< -------------------------------------
> [20:23:04.171] stderr:
> [20:23:04.171] # Failed test 'pg_upgrade_output.d/ not removed after pg_upgrade --check success'
> [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 249.
> [20:23:04.171] # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success'
> [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 261.
> [20:23:04.171] # Looks like you failed 2 tests of 13.
>
> regress_log:
> https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
>
> The pg_upgrade output contains these potentially relevant warnings:
>
> ...
> *Clusters are compatible*
> pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log": Directory not empty
> pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511": Directory not empty
> ...

It looks like it failed to remove a *.log file under windows, which caused rmtree to fail.

src/bin/pg_upgrade/pg_upgrade.h-#define DB_DUMP_LOG_FILE_MASK "pg_upgrade_dump_%u.log"
src/bin/pg_upgrade/pg_upgrade.h-#define SERVER_LOG_FILE "pg_upgrade_server.log"
src/bin/pg_upgrade/pg_upgrade.h-#define UTILITY_LOG_FILE "pg_upgrade_utility.log"
src/bin/pg_upgrade/pg_upgrade.h:#define INTERNAL_LOG_FILE "pg_upgrade_internal.log"

ee5353abb only changed .txt files used for errors so can't be the cause, but the original commit 38bfae3 might be related.
I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so didn't warn about the file itself, but then failed one moment later in rmdir.
--
Justin
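[For readers following along, the retry loop under discussion can be pictured with a rough, portable sketch. The function name and retry budget here are invented for illustration; the real loop lives in pgunlink() in src/port/dirmod.c, where Windows sharing violations and delete-pending states surface as EACCES. The point is that EACCES keeps the loop alive while a true ENOENT exits immediately, so an ENOENT reported for a delete-pending file would skip the wait entirely:]

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical sketch of a sleep+retry unlink loop in the spirit of
 * pgunlink().  EACCES (which on Windows can mean "someone still has a
 * handle open") triggers a retry; any other error, including a true
 * ENOENT, returns immediately.
 */
static int
retry_unlink(const char *path)
{
	int			loops = 0;

	while (unlink(path) < 0)
	{
		if (errno != EACCES)
			return -1;			/* real error, or ENOENT: stop immediately */
		if (++loops > 100)		/* ~10 seconds at 100ms per iteration */
			return -1;			/* give up, still EACCES */
		usleep(100 * 1000);
	}
	return 0;
}
```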
On Tue, Sep 20, 2022 at 7:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> > Hi,
> >
> > After my last rebase of the meson tree I encountered the following test
> > failure:
> >
> > https://cirrus-ci.com/task/5532444261613568
> >
> > [20:23:04.171] ------------------------------------- 8< -------------------------------------
> > [20:23:04.171] stderr:
> > [20:23:04.171] # Failed test 'pg_upgrade_output.d/ not removed after pg_upgrade --check success'
> > [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 249.
> > [20:23:04.171] # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success'
> > [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 261.
> > [20:23:04.171] # Looks like you failed 2 tests of 13.
> >
> > regress_log:
> > https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
> >
> > The pg_upgrade output contains these potentially relevant warnings:
> >
> > ...
> > *Clusters are compatible*
> > pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log": Directory not empty
> > pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511": Directory not empty
> > ...

Just for the records - the same issue was also seen here [1], [2].

[1] https://cirrus-ci.com/task/5709014662119424?logs=check_world#L82
[2] https://api.cirrus-ci.com/v1/artifact/task/5709014662119424/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On 2022-09-27 11:47:37 +0530, Bharath Rupireddy wrote:
> Just for the records - the same issue was also seen here [1], [2].
>
> [1] https://cirrus-ci.com/task/5709014662119424?logs=check_world#L82
> [2] https://api.cirrus-ci.com/v1/artifact/task/5709014662119424/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

Yea, this is at the moment one of the top sources of spurious test failures for cfbot. Just manually looking at http://cfbot.cputube.org/ for tasks that recently changed state on windows:

https://cirrus-ci.com/task/6422687231770624?logs=check_world#L60
https://cirrus-ci.com/task/6408332243107840?logs=check_world#L60
https://cirrus-ci.com/task/6202259712245760?logs=check_world#L60
https://cirrus-ci.com/task/6150885981028352?logs=check_world#L60
https://cirrus-ci.com/task/5361597290905600?logs=check_world#L60
https://cirrus-ci.com/task/5177327624650752?logs=check_world#L60
https://cirrus-ci.com/task/4862503887831040?logs=check_world#L60
https://cirrus-ci.com/task/4576362479484928?logs=check_world#L60

Something needs to happen here.

Greetings,

Andres Freund
On Tue, Sep 20, 2022 at 1:31 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so
> didn't warn about the file itself, but then failed one moment later in
> rmdir.

Yeah, I think this is my fault. In commit f357233c the new lstat() call might return ENOENT for STATUS_DELETE_PENDING, and then we don't enter pgunlink()'s 10 second sleep-retry loop. Let me think about how best to fix that, and how to write a regression test program that would exercise stuff like this. Might take a couple of days as I am away from computers until mid-week.
On Mon, Oct 3, 2022 at 9:07 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Sep 20, 2022 at 1:31 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so
> > didn't warn about the file itself, but then failed one moment later in
> > rmdir.
>
> Yeah, I think this is my fault. In commit f357233c the new lstat()
> call might return ENOENT for STATUS_DELETE_PENDING, and then we don't
> enter pgunlink()'s 10 second sleep-retry loop. Let me think about how
> best to fix that, and how to write a regression test program that
> would exercise stuff like this. Might take a couple of days as I am
> away from computers until mid-week.

I think something like the attached should do the right thing for STATUS_DELETE_PENDING (sort of "ENOENT-in-progress"). unlink() goes back to being blocking (sleep+retry until eventually we reach ENOENT or we time out and give up with EACCES), but we still distinguish it from true ENOENT so we have a fast exit in that case. This is passing CI, but not tested yet.

One ugly thing in this patch is that it has to deal with our historical mistake (?) of including Windows headers in this file in Cygwin builds for no reason and thus getting WIN32 defined on a non-WIN32 build, as I've complained about before[1] but not yet tidied up.

Remembering why we do any of this weird looking stuff that we don't need on Unix, the general idea is that things that scan directories to unlink everything before unlinking the parent directory need to block for a while on STATUS_DELETE_PENDING to increase their probability of success, while things that do anything else probably just want to skip such zombie files completely.
To recap, we have:

* readdir() sees files that are ENOENT-in-progress (so recursive unlinks can see them)
* unlink() waits for ENOENT-in-progress to reach ENOENT (what broke here)
* stat() and lstat() report ENOENT-in-progress as ENOENT (done to fix eg pg_basebackup, which used to fail at random on Windows)
* open() reports ENOENT-in-progress as either ENOENT or EEXIST depending on O_CREAT (because used by stat())

Clearly this set of kludges isn't perfect and other kludge-sets would be possible too. One thought is that we could hide ENOENT-in-progress from readdir(), and add a new rmdir() wrapper instead. If it gets a directory-not-empty error from the kernel, it could at that point wait for zombie files to go away (perhaps registering for file system events with some local equivalent of KQ_FILTER_VNODE if there is one, to be less sloppy than the current sleep() nonsense, but sleep would work too).

When I'm back at my real keyboard I'll try to come up with tests for this stuff, but I'm not sure how solid we can really make a test for this particular case -- I think you'd need to have another thread open the file and then close it after different periods of time, to demonstrate that the retry loop works but also gives up, and that's exactly the sort of timing-dependent stuff we try to avoid. But I think I'll try that anyway, because it's essential infrastructure to allow Unix-only hackers to work on this stuff. Once we have that, we might be able to make some more progress with the various FILE_DISPOSITION_POSIX_SEMANTICS proposals, if it helps, because we'll have reproducible evidence for what it really does.

[1] https://commitfest.postgresql.org/39/3781/
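[The rmdir()-wrapper idea above could look roughly like this portable sketch. The name and retry budget are hypothetical; the real thing would need the Windows-specific error mapping, and might register for file system events rather than sleeping:]

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/stat.h>

/*
 * Hypothetical rmdir() wrapper: if the kernel says the directory is not
 * empty (which on Windows could mean it still contains zombie
 * ENOENT-in-progress entries), sleep and retry for a bounded time instead
 * of failing straight away.  Any other error is reported immediately.
 */
static int
rmdir_retry(const char *path, int max_tries)
{
	for (int i = 0; i < max_tries; i++)
	{
		if (rmdir(path) == 0)
			return 0;
		if (errno != ENOTEMPTY && errno != EEXIST)
			return -1;			/* some other error: report it */
		usleep(100 * 1000);		/* wait for pending deletes to complete */
	}
	return -1;					/* still not empty after the grace period */
}
```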
Attachment
On Mon, Oct 03, 2022 at 12:10:06PM +1300, Thomas Munro wrote:
> I think something like the attached should do the right thing for
> STATUS_DELETE_PENDING (sort of "ENOENT-in-progress"). unlink() goes
> back to being blocking (sleep+retry until eventually we reach ENOENT
> or we time out and give up with EACCES), but we still distinguish it
> from true ENOENT so we have a fast exit in that case. This is passing
> CI, but not tested yet.

    if (lstat(path, &st) < 0)
-       return -1;
+   {
+       if (lstat_error_was_status_delete_pending())
+           is_lnk = false;
+       else
+           return -1;
+   }
+   else
+       is_lnk = S_ISLNK(st.st_mode);

Sorry, I don't remember all the details in this area, but a directory can never be marked as STATUS_DELETE_PENDING with some of its contents still inside, right? If it has some contents, forcing unlink() all the time would be fine?

> One ugly thing in this patch is that it has to deal with our
> historical mistake (?) of including Windows headers in this file in
> Cygwin builds for no reason and thus getting WIN32 defined on a
> non-WIN32 build, as I've complained about before[1] but not yet tidied
> up.

Your proposal remains local to dirmod.c, so that does not sound like a big deal to me for the time being.
--
Michael
Attachment
On Mon, Oct 3, 2022 at 1:40 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Oct 03, 2022 at 12:10:06PM +1300, Thomas Munro wrote:
> > I think something like the attached should do the right thing for
> > STATUS_DELETE_PENDING (sort of "ENOENT-in-progress"). unlink() goes
> > back to being blocking (sleep+retry until eventually we reach ENOENT
> > or we time out and give up with EACCES), but we still distinguish it
> > from true ENOENT so we have a fast exit in that case. This is passing
> > CI, but not tested yet.
>
> Sorry, I don't remember all the details in this area, but a directory
> can never be marked as STATUS_DELETE_PENDING with some of its contents
> still inside, right?

That's my understanding, yes: just like Unix, you can't remove a directory with something in it. Unlike Unix, that includes files that have been unlinked but are still open somewhere. (Note that in this case it's not exactly a real directory, it's a junction point, which is a directory but it doesn't have contents, it is a reparse point pointing somewhere else, so I suspect that it can't really suffer from ENOTEMPTY, but it probably can suffer from 'someone has it open for a short time because they are concurrently stat-ing it'.)

> If it has some contents, forcing unlink() all
> the time would be fine?

Here's why I think it's probably OK to use unlink() unconditionally after detecting STATUS_DELETE_PENDING. AFAICT there is no way to even find out if it's a file or a junction in this state, but it doesn't matter: we are not waiting for rmdir() or unlink() to succeed, we are waiting for it to fail with an error other than EACCES, most likely ENOENT (or to time out, perhaps because someone held the file open for 11 seconds, or because EACCES was actually telling us about a permissions problem).
EACCES is the errno for many things including STATUS_DELETE_PENDING and also "you called unlink() but it's a directory" (should be EPERM according to POSIX, or EISDIR according to Linux). Both of those reasons imply that the zombie directory entry still exists, and we don't care which of those reasons triggered it. So I think that setting is_lnk = false is good enough here. Do you see a hole in it?
On Mon, Oct 03, 2022 at 04:03:12PM +1300, Thomas Munro wrote:
> So I think that setting is_lnk = false is good enough here. Do
> you see a hole in it?

I cannot think of one, off the top of my head. Thanks for the explanation.
--
Michael
Attachment
On Mon, Oct 3, 2022 at 7:29 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Oct 03, 2022 at 04:03:12PM +1300, Thomas Munro wrote:
> > So I think that setting is_lnk = false is good enough here. Do
> > you see a hole in it?
>
> I cannot think of one, off the top of my head. Thanks for the
> explanation.

Some things I learned while trying to understand how I managed to introduce that bug despite reading and testing:

* the code in pgunlink() has comments saying that its retry loop is to handle sharing violations
* in fact that retry loop also comes into play for STATUS_DELETE_PENDING
* that case is fairly well hidden, because to reach it you need to unlink(pathname) twice! the second call will wait up to 10 seconds for handles to close and then report ENOENT, allowing rmdir(parent) to succeed
* I guess this code is relying on that double-unlink to block until the directory is empty?
* you wouldn't notice any of this if you were testing on Windows 10 on a desktop/laptop/VM, because it now uses POSIX semantics for unlink on NTFS, so the first unlink truly (synchronously) unlinks (no more STATUS_DELETE_PENDING)
* Server 2019, as used on CI, still uses the traditional NT semantics (unlink is asynchronous, completing when all handles close)
* the fix I proposed has the right effect (I will follow up with tests to demonstrate)

I'll post my tests for this and a bunch more things I figured out shortly in a new Windows-filesystem-semantics-omnibus thread.
On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:
> * Server 2019, as used on CI, still uses the traditional NT semantics
> (unlink is asynchronous, completing when all handles close)
> * the fix I proposed has the right effect (I will follow up with tests
> to demonstrate)

Wow, nice investigation. And cirrus does not offer a newer option either.. Do you think that Windows Server 2022 (successor of 2019) is able to use POSIX semantics for unlink()? It looks like we are a few years away from being able to do that, assuming that cirrus offers a newer version than Server 2019, but I guess that the code could mention this possibility in a comment, at least..
--
Michael
Attachment
On Tue, Oct 18, 2022 at 01:06:15PM +0900, Michael Paquier wrote:
> On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:
> > * Server 2019, as used on CI, still uses the traditional NT semantics
> > (unlink is asynchronous, completing when all handles close)
> > * the fix I proposed has the right effect (I will follow up with tests
> > to demonstrate)
>
> Wow, nice investigation. And cirrus does not offer a newer option
> either..

Currently Andres builds images based on cirrus's 2019 image, but I think we could use any windows docker image.

> Do you think that Windows Server 2022 (successor of 2019) is
> able to use POSIX semantics for unlink()?

I think it's possible to use it now, like what's done here.

https://commitfest.postgresql.org/40/3347/

The only caveat is that it's done conditionally.
Hi,

On 2022-10-17 23:31:44 -0500, Justin Pryzby wrote:
> On Tue, Oct 18, 2022 at 01:06:15PM +0900, Michael Paquier wrote:
> > On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:
> > > * Server 2019, as used on CI, still uses the traditional NT semantics
> > > (unlink is asynchronous, completing when all handles close)
> > > * the fix I proposed has the right effect (I will follow up with tests
> > > to demonstrate)
> >
> > Wow, nice investigation. And cirrus does not offer a newer option
> > either..
>
> Currently Andres builds images based on cirrus's 2019 image, but I think
> we could use any windows docker image.

You unfortunately can't run newer containers than the host OS :(, just use older ones. And if you use mismatching containers the startup gets slower because it switches to use full virtualization rather than containers.

I think we need to switch to use full VMs rather than containers. The performance of the windows containers is just atrocious (build times on a local VM with the same number of cores is 1/2 of what we see in CI, test times 1/3), they're slow to start due to pulling all files and decompressing them, and they're fragile.

I've asked Bilal (CCed) to work on generating both containers and images.

Greetings,

Andres Freund
So [1] on its own didn't fix this. My next guess is that the attached might help.

Hmm. Following Michael's clue that this might involve log files and pg_ctl, I noticed one thing: pg_ctl implements wait_for_postmaster_stop() by waiting for kill(pid, 0) to fail, and our kill emulation does CallNamedPipe(). If the server is in the process of exiting and the kernel is cleaning up all the handles we didn't close, is there any reason to expect the signal pipe to be closed after the log file?

[1] https://www.postgresql.org/message-id/flat/20221025213055.GA8537%40telsasoft.com#9030de6c4c5e544d2b057b793a5b42af
Attachment
On Tue, Nov 08, 2022 at 01:16:09AM +1300, Thomas Munro wrote:
> So [1] on its own didn't fix this. My next guess is that the attached
> might help.

I took the liberty of adding a CF entry for this:

https://commitfest.postgresql.org/41/4011/

And afterwards figured I could be a little bit wasteful and run the tests using meson test --repeat, rather than let cfbot do it over the course of a month.

https://cirrus-ci.com/task/5115893722644480

So I didn't find evidence that it doesn't resolve the issue (but this also doesn't prove that it will work).
--
Justin
Hi,

On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:
> So [1] on its own didn't fix this. My next guess is that the attached
> might help.
>
> Hmm. Following Michael's clue that this might involve log files and
> pg_ctl, I noticed one thing: pg_ctl implements
> wait_for_postmaster_stop() by waiting for kill(pid, 0) to fail, and
> our kill emulation does CallNamedPipe(). If the server is in the
> process of exiting and the kernel is cleaning up all the handles we
> didn't close, is there any reason to expect the signal pipe to be
> closed after the log file?

What is our plan here? This afaict is the most common "false positive" for cfbot in the last weeks. E.g.:

https://api.cirrus-ci.com/v1/artifact/task/5462686092230656/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

...
[00:02:58.761](93.859s) ok 10 - run of pg_upgrade for new instance
[00:02:58.808](0.047s) not ok 11 - pg_upgrade_output.d/ removed after pg_upgrade success
[00:02:58.815](0.007s) # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success'
# at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 288.

Michael: Why does 002_pg_upgrade.pl try to filter the list of files in pg_upgrade_output.d for files ending in .log? And why does it print those only after starting the new node? How about moving the iteration through pg_upgrade_output.d to before the ->start and printing all the files, but only slurp_file() if the filename ends with *.log?

Minor nit: It seems off to have quite so many copies of $newnode->data_dir . "/pg_upgrade_output.d", particularly where the test defines $log_path, but then still builds it from scratch after (line 304).

Greetings,

Andres Freund
On Wed, Dec 7, 2022 at 7:15 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:
> > So [1] on its own didn't fix this. My next guess is that the attached
> > might help.
>
> What is our plan here? This afaict is the most common "false positive" for
> cfbot in the last weeks.

That branch hasn't failed on cfbot[1], except once due to one of the other known flapping races we have to fix. Which doesn't prove anything, of course, but it is encouraging. I wish we knew why the test does this, though....

Here's a better version that works harder to avoid opening more than one fd at a time (like the pgfnames()-based code it replaces), and also uses fd.c facilities in the backend version (unlike pgfnames(), which looks like it could leak a descriptor if palloc() threw, and also doesn't know how to handle file descriptor pressure).

[1] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/41/4011
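[The one-descriptor-at-a-time shape described here can be pictured with a small portable sketch: read all the entry names first and close the directory, then do the unlinks, then remove the directory itself. This is illustrative only — the real rmtree() also recurses into subdirectories, maps Windows errors, and has no fixed caps:]

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

/*
 * Sketch of removing a flat directory while holding at most one
 * descriptor open at a time.  Fixed-size arrays keep the sketch simple.
 */
static int
remove_flat_dir(const char *dirpath)
{
	char		names[256][256];
	int			nnames = 0;
	DIR		   *dir;
	struct dirent *de;
	int			result = 0;

	dir = opendir(dirpath);
	if (dir == NULL)
		return -1;
	while ((de = readdir(dir)) != NULL && nnames < 256)
	{
		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(names[nnames++], sizeof(names[0]), "%s", de->d_name);
	}
	closedir(dir);				/* directory handle released before unlinking */

	for (int i = 0; i < nnames; i++)
	{
		char		path[4096];

		snprintf(path, sizeof(path), "%s/%s", dirpath, names[i]);
		if (unlink(path) < 0)
			result = -1;		/* keep going; report failure at the end */
	}

	if (result == 0 && rmdir(dirpath) < 0)
		result = -1;
	return result;
}
```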
Attachment
On Thu, Jan 5, 2023 at 4:11 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Dec 7, 2022 at 7:15 AM Andres Freund <andres@anarazel.de> wrote: > > On 2022-11-08 01:16:09 +1300, Thomas Munro wrote: > > > So [1] on its own didn't fix this. My next guess is that the attached > > > might help. > > > What is our plan here? This afaict is the most common "false positive" for > > cfbot in the last weeks. I pushed the rmtree() change. Let's see if that helps, or tells us something new. Michael: There were some questions from Andres above. FWIW I think if you wanted to investigate this properly on a local Windows system to chase down who's got the file open (shutdown sequence problem or whatever), you'd probably have to install Server 2019, or maybe use an old 8.1 VM if you still have such a thing, based on the suspicion that typical 10 and 11 systems won't exhibit the problem. But then I could be wrong about what's going on...
On Tue, Jan 31, 2023 at 02:00:05PM +1300, Thomas Munro wrote:
> On Thu, Jan 5, 2023 at 4:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, Dec 7, 2022 at 7:15 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:
> > > > So [1] on its own didn't fix this. My next guess is that the attached
> > > > might help.
> > >
> > > What is our plan here? This afaict is the most common "false positive" for
> > > cfbot in the last weeks.
>
> I pushed the rmtree() change. Let's see if that helps, or tells us
> something new.

I found a few failures since then:

https://api.cirrus-ci.com/v1/artifact/task/6696942420361216/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720": Directory not empty

https://api.cirrus-ci.com/v1/artifact/task/5119776607961088/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

same

I verified that those both include your 54e72b66e, which is pretty strange, since the patch passed tests 10s of times on CI until it was merged, when it started/kept failing.
--
Justin
On Wed, Feb 1, 2023 at 6:28 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I pushed the rmtree() change. Let's see if that helps, or tells us
> > something new.
>
> I found a few failures since then:
>
> https://api.cirrus-ci.com/v1/artifact/task/6696942420361216/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
>
> pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720/log": Directory not empty
> pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720": Directory not empty

So no change: we didn't see "could not unlink file ...". So I think that means that it was rmtree() that unlinked the file for the *first* time, but someone else has it open.

Even though Windows is at this point eroding my love of computers and making me consider a new career in, I dunno, carrot farming or something, I have one more idea. Check out this kluge in src/bin/pg_upgrade/exec.c:

    /*
     * "pg_ctl -w stop" might have reported that the server has stopped
     * because the postmaster.pid file has been removed, but "pg_ctl -w
     * start" might still be in the process of closing and might still be
     * holding its stdout and -l log file descriptors open.  Therefore,
     * try to open the log file a few more times.
     */

I'm not sure about anything, but if that's what's happening here, then maybe the attached would help. In short, it would make the previous theory true (the idea of a second unlink() saving the day).
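[The exec.c kluge quoted above amounts to something like this sketch — the function name and retry budget are invented for the example, and the real code has pg_upgrade-specific error reporting:]

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical sketch of "try to open the log file a few more times":
 * keep retrying fopen() for a short while, in case a just-exited process
 * still holds the file open and the OS hasn't released the handle yet.
 */
static FILE *
fopen_with_retries(const char *path, const char *mode, int max_tries)
{
	FILE	   *fp = NULL;

	for (int i = 0; i < max_tries; i++)
	{
		fp = fopen(path, mode);
		if (fp != NULL)
			break;				/* got it */
		usleep(100 * 1000);		/* wait for the other handle to go away */
	}
	return fp;
}
```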
Attachment
Hi,

On January 31, 2023 12:54:42 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
>On Wed, Feb 1, 2023 at 6:28 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>> > I pushed the rmtree() change. Let's see if that helps, or tells us
>> > something new.
>>
>> I found a few failures since then:
>>
>> https://api.cirrus-ci.com/v1/artifact/task/6696942420361216/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
>>
>> pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720/log": Directory not empty
>> pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720": Directory not empty
>
>So no change: we didn't see "could not unlink file ...". So I think
>that means that it was rmtree() that unlinked the file for the *first*
>time, but someone else has it open.
>
>Even though Windows is at this point eroding my love of computers and
>making me consider a new career in, I dunno, carrot farming or
>something, I have one more idea. Check out this kluge in
>src/bin/pg_upgrade/exec.c:
>
>    /*
>     * "pg_ctl -w stop" might have reported that the server has stopped
>     * because the postmaster.pid file has been removed, but "pg_ctl -w
>     * start" might still be in the process of closing and might still be
>     * holding its stdout and -l log file descriptors open.  Therefore,
>     * try to open the log file a few more times.
>     */
>
>I'm not sure about anything, but if that's what's happening here, then
>maybe the attached would help. In short, it would make the previous
>theory true (the idea of a second unlink() saving the day).

Maybe we should just handle it by sleeping and retrying, if on windows? Sad to even propose...
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Feb 1, 2023 at 10:04 AM Andres Freund <andres@anarazel.de> wrote:
> On January 31, 2023 12:54:42 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
> >I'm not sure about anything, but if that's what's happening here, then
> >maybe the attached would help. In short, it would make the previous
> >theory true (the idea of a second unlink() saving the day).
>
> Maybe we should just handle it by sleeping and retrying, if on windows? Sad to even propose...

Yeah, that's what that code I posted would do automatically, though it's a bit hidden.  The second attempt to unlink() would see delete already pending, and activate its secret internal sleep/retry loop.
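[Editor's note: the sleep-and-retry behavior being discussed can be sketched in portable C. This is an illustrative sketch only, not the actual pgunlink()/rmtree() code from the patch; the function name, retry count, and sleep interval are invented for the example.]

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Illustrative sketch (not the real pgunlink()): retry unlink() a few
 * times with a short sleep, to ride out the window where Windows maps a
 * pending delete to a transient error such as EACCES.  Treating ENOENT
 * as success matches the rmtree() use case: if someone else finished
 * deleting the file, our job is done.
 */
static int
unlink_with_retry(const char *path, int max_attempts)
{
	for (int attempt = 0; attempt < max_attempts; attempt++)
	{
		if (unlink(path) == 0)
			return 0;
		if (errno == ENOENT)
			return 0;			/* already gone: good enough */
		if (errno != EACCES)
			return -1;			/* a hard error: give up */
		usleep(100 * 1000);		/* wait 100 ms, then try again */
	}
	return -1;
}
```

On a Unix system the retry path is rarely exercised, which is exactly the point of the thread: the race only bites where open handles block deletion.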
On Wed, Feb 1, 2023 at 9:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... I have one more idea ...

I also had a second idea, barely good enough to mention and probably just paranoia.  In a nearby thread I learned that process exit does not release Windows advisory file locks synchronously, which surprised this Unix hacker; it made me wonder what else might be released lazily after process exit.  Handles?!

However, as previously mentioned, it's possible that even with fully Unix-like resource cleanup on process exit, we could be confused if we are using "the process that was on the end of this pipe has closed it" as a proxy for "the process is gone, *all* its handles are closed".  In any case, the previous kluge should help wallpaper over any of that too, for this test anyway.
On Wed, Feb 1, 2023 at 10:08 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Feb 1, 2023 at 10:04 AM Andres Freund <andres@anarazel.de> wrote:
> > Maybe we should just handle it by sleeping and retrying, if on windows? Sad to even propose...
>
> Yeah, that's what that code I posted would do automatically, though
> it's a bit hidden. The second attempt to unlink() would see delete
> already pending, and activate its secret internal sleep/retry loop.

OK, I pushed that.  Third time lucky?

I tracked down the discussion of that existing comment about pg_ctl, which comes from the 9.2 days:

https://www.postgresql.org/message-id/flat/5044DE59.5020500%40dunslane.net

I guess maybe back then fopen() was Windows' own fopen() that wouldn't allow two handles to a file at the same time?  These days we redirect it to a wrapper with the magic "shared" flags, so the kluge installed by commit f8c81c5dde2 may not even be needed anymore.  It does demonstrate that there are long-standing timing races around log files, process exit and wait-for-shutdown logic, though.  Someone who develops for Windows could probably chase this right down, and make sure that we do certain things in the right order, and/or find better kernel facilities; at a wild guess, something like OpenProcess() before you initiate shutdown, so you can then wait on its handle, for example.  The docs for ExitProcess() make it clear that handles are synchronously closed, so I think it's probably just that our tests for when processes have exited are too fuzzy.
On Wed, Feb 1, 2023 at 2:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> OK, I pushed that. Third time lucky?

I pulled down logs for a week of Windows CI, just over ~1k builds.  The failure rate was a few per day before, but there are no failures like that after that went in.  There are logs that contain the "Directory not empty" warning, but clearly the try-again-and-this-time-wait-for-the-other-process logic must be working (as horrible as it is) because then the test checks that the directory is gone, and succeeds.  Hooray.  So that's one of our biggest CI flappers fixed.  Unfortunately without treating the root cause, really.

Next up: the new "running" tests, spuriously failing around 8.8% of CI builds on FreeBSD.  I'll go and ping that thread...
Hi,

On February 6, 2023 1:51:20 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
>Next up: the new "running" tests, spuriously failing around 8.8% of CI
>builds on FreeBSD. I'll go and ping that thread...

Is that rate unchanged?  I thought I fixed the main issue last week?

Greetings,

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Tue, Feb 7, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
> On February 6, 2023 1:51:20 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
> >Next up: the new "running" tests, spuriously failing around 8.8% of CI
> >builds on FreeBSD. I'll go and ping that thread...
>
> Is that rate unchanged? I thought I fixed the main issue last week?

Unfortunately my cfbot database only holds a week's history.  What I see is that there were 1254 FreeBSD tasks run in that window, of which 163 failed, and (more interestingly) 111 of those failures succeeded on every other platform.  And clicking on a few on cfbot's page reveals that it's the new running stuff, and I'm still trying to find the interesting logs...
On Tue, Feb 7, 2023 at 11:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Feb 7, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
> > On February 6, 2023 1:51:20 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
> > >Next up: the new "running" tests, spuriously failing around 8.8% of CI
> > >builds on FreeBSD. I'll go and ping that thread...
> >
> > Is that rate unchanged? I thought I fixed the main issue last week?
>
> Unfortunately my cfbot database only holds a week's history. What I
> see is that there were 1254 FreeBSD tasks run in that window, of which
> 163 failed, and (more interestingly) 111 of those failures succeeded
> on every other platform. And clicking on a few on cfbot's page
> reveals that it's the new running stuff, and I'm still trying to find
> the interesting logs...

Ah, that number might include some other problems, including in subscription (#2900).  That's the problem with flapping tests: you get desensitised and stop looking closely and miss things...
Hi,

On 2023-02-07 11:03:18 +1300, Thomas Munro wrote:
> On Tue, Feb 7, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
> > On February 6, 2023 1:51:20 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
> > >Next up: the new "running" tests, spuriously failing around 8.8% of CI
> > >builds on FreeBSD. I'll go and ping that thread...
> >
> > Is that rate unchanged? I thought I fixed the main issue last week?
>
> Unfortunately my cfbot database only holds a week's history.

Would be interesting to increase that to a considerably longer time.  I can't imagine that that'd take all that much resources?

> What I see is that there were 1254 FreeBSD tasks run in that window, of
> which 163 failed, and (more interestingly) 111 of those failures succeeded
> on every other platform. And clicking on a few on cfbot's page reveals that
> it's the new running stuff, and I'm still trying to find the interesting
> logs...

I think I figured out why the logs frequently fail to upload - the server is still running, so the size changes during the upload, causing the upload to fail with errors like:

[12:46:43.552] Failed to upload artifacts: Put
"https://storage.googleapis.com/cirrus-ci-5309429912436736-3271c9/artifacts/postgresql-cfbot/postgresql/6729936359129088/testrun/build/testrun/runningcheck.log?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=cirrus-ci%40cirrus-ci-community.iam.gserviceaccount.com%2F20230206%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20230206T124536Z&X-Goog-Expires=600&X-Goog-SignedHeaders=host%3Bx-goog-content-length-range%3Bx-goog-meta-created_by_task&X-Goog-Signature=8e84192cbc754180b8baa6c00c41b463f580fe7183f0e7113c253aac13cc2458b835caef7940b91e102e96d54cff2b5714c77390e74244e2fb88c00c9957a801e33cbee2ac960e0db8a01fe08ee945bedf4616881e6beafa3a162c22948ac0b9a9359d93e1f461fc9f49385b784b75d633f1b01805b987d9d53bc7fb55263917ec85180a2140659d50990f066160f03e8bb8984e8d2aadb64c875c253167cf24da152a18d69fcd3d941edce145931e4feb23dc8cf43de7b7bbfc565786c1c692406f2a0a127f30385a8c4b66f96709b51d26d3c71617991c731b0e7206ee3906338dedf6359412edd024f8c76bd33400f4c9320c2bde9512fa8bcd6289e54d52": http2:request body larger than specified content length I'm testing adding a pg_ctl stop to the on_failure right now. Greetings, Andres Freund
On 2023-02-06 14:14:22 -0800, Andres Freund wrote: > On 2023-02-07 11:03:18 +1300, Thomas Munro wrote: > > What I see is that there were 1254 FreeBSD tasks run in that window, of > > which 163 failed, and (more interestingly) 111 of those failures succeeded > > on every other platform. And clicking on a few on cfbot's page reveals that > > it's the new running stuff, and I'm still trying to find the interesting > > logs... > > I think I figured out why the logs frequently fail to upload - the server is > still running, so the size changes during the upload, causing the upload to > fail with errors like: > > [12:46:43.552] Failed to upload artifacts: Put "https://storage.googleapis.com/cirrus-ci-5309429912436736-3271c9/artifacts/postgresql-cfbot/postgresql/6729936359129088/testrun/build/testrun/runningcheck.log?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=cirrus-ci%40cirrus-ci-community.iam.gserviceaccount.com%2F20230206%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20230206T124536Z&X-Goog-Expires=600&X-Goog-SignedHeaders=host%3Bx-goog-content-length-range%3Bx-goog-meta-created_by_task&X-Goog-Signature=8e84192cbc754180b8baa6c00c41b463f580fe7183f0e7113c253aac13cc2458b835caef7940b91e102e96d54cff2b5714c77390e74244e2fb88c00c9957a801e33cbee2ac960e0db8a01fe08ee945bedf4616881e6beafa3a162c22948ac0b9a9359d93e1f461fc9f49385b784b75d633f1b01805b987d9d53bc7fb55263917ec85180a2140659d50990f066160f03e8bb8984e8d2aadb64c875c253167cf24da152a18d69fcd3d941edce145931e4feb23dc8cf43de7b7bbfc565786c1c692406f2a0a127f30385a8c4b66f96709b51d26d3c71617991c731b0e7206ee3906338dedf6359412edd024f8c76bd33400f4c9320c2bde9512fa8bcd6289e54d52": http2:request body larger than specified content length > > I'm testing adding a pg_ctl stop to the on_failure right now. Pushed the fix doing so.
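[Editor's note: the shape of the fix described above can be sketched as a Cirrus CI fragment. The step name and data directory path below are illustrative placeholders, not the actual .cirrus.yml change that was pushed.]

```yaml
on_failure:
  # Illustrative: stop the server left running by the "running" tests
  # before Cirrus collects artifacts, so runningcheck.log stops changing
  # size mid-upload.  The -D path is a placeholder, not the real one.
  stop_running_server_script: |
    pg_ctl -D build/testrun/runningcheck/data stop -m fast || true
```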
Dear Andres,

While tracking BF failures related to pg_upgrade, I found that the same failure has still happened [1] - [4].  According to the log, the output directory remained even after a successful upgrade [5].  I analyzed it and attached a fix patch; below is my analysis... what do you think?

=====

lstat() seemed to fail while doing the second try of rmtree().  This error message is output from get_dirent_type().  Unlike pgunlink(), get_dirent_type() does not have a retry mechanism for the case where lstat()->_pglstat64() detects STATUS_DELETE_PENDING.  Therefore, I think rmtree() may not wait for the file to be really removed if that status is detected in get_dirent_type().  One solution is to retry stat() or lstat() even in get_dirent_type(), like the attached.

[1]: 2023-07-21 02:21:53 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-07-21%2002%3A21%3A53
[2]: 2023-10-21 13:39:15 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-10-21%2013%3A39%3A15
[3]: 2023-10-23 09:03:07 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-10-23%2009%3A03%3A07
[4]: 2023-10-27 23:06:17 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-10-27%2023%3A06%3A17

[5]
```
...
*Clusters are compatible*
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867": Directory not empty
pg_upgrade: warning: could not stat file "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867/log/pg_upgrade_internal.log": No such file or directory
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867": Directory not empty
[23:46:07.585](17.106s) ok 12 - run of pg_upgrade --check for new instance
[23:46:07.587](0.002s) not ok 13 - pg_upgrade_output.d/ removed after pg_upgrade --check success
...
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
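[Editor's note: the proposed fix can be sketched portably, ignoring the Windows-specific STATUS_DELETE_PENDING mapping, as an lstat() wrapper that retries briefly before reporting failure. The name and parameters here are invented for illustration; they are not the actual get_dirent_type() patch.]

```c
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Illustrative sketch of the proposed get_dirent_type() fix: when
 * lstat() fails in a way that could mean "delete still pending" (on
 * Windows, _pglstat64() can surface STATUS_DELETE_PENDING this way),
 * sleep briefly and retry instead of failing on the first attempt, so a
 * concurrent delete has time to complete before rmtree() gives up.
 */
static int
lstat_with_retry(const char *path, struct stat *st, int max_attempts)
{
	for (int attempt = 1;; attempt++)
	{
		if (lstat(path, st) == 0)
			return 0;
		if (attempt >= max_attempts)
			return -1;			/* give up; errno is set by lstat() */
		usleep(100 * 1000);		/* wait 100 ms, then try again */
	}
}
```

The design point is the same as for the earlier unlink() retry: the caller (rmtree()) only needs a consistent answer, "file is here" or "file is gone", and a short wait converts the in-between delete-pending state into one of those.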
Attachment
On Sun, 29 Oct 2023 at 11:14, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Andres,
>
> While tracking BF failures related to pg_upgrade, I found that the same failure has still happened [1] - [4].
> According to the log, the output directory remained even after a successful upgrade [5].
> I analyzed it and attached a fix patch; below is my analysis... what do you think?
>

The same failure occurs randomly at [1] for a newly added test too:

pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796": Directory not empty
pg_upgrade: warning: could not stat file "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log/pg_upgrade_internal.log": No such file or directory
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796": Directory not empty

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-01-04%2019%3A56%3A20

Regards,
Vignesh