Thread: Cygwin cleanup
Hi, Continuing a discussion started over at [1]. Moving this to a new thread so that other thread can focus on Unix cleanup, and both threads can get CI coverage... 1. In a few places, it is alleged that both __CYGWIN__ and WIN32 might be defined at the same time. Do you think we should try to get rid of that possibility? I understand that we have to have a few tests for __CYGWIN__ here and there, because eg file permissions don't work quite right and there's not much we can do about that. But it seems a bit unhelpful if we also have to worry about a more-or-less POSIX-ish build taking WIN32 paths at uncertain times if we forget to defend against that, or wonder why some places are not consistent. A quick recap of the three flavours of Windows platform we have to handle, as I understand it: * MSVC: Windowsy toolchain, Windowsy C * custom perl scripts instead of configure * msbuild instead of make * MSVC compiler * Windows C APIs * we provide our own emulation of some POSIX C APIs on top * MSYS: Unixy toolchain, Windowsy C * configure (portname = "win32") * make * GCC compiler * Windows C APIs * we provide our own emulation of some POSIX C APIs on top * Cygwin: Unixy toolchain, Unixy C * configure (portname = "cygwin") * make * GCC compiler * POSIX C APIs (emulations provided by the Cygwin runtime libraries) (The configure/make part will be harmonised by the Meson project.) The macro WIN32 is visibly defined by something in/near msbuild in MSVC builds: /D WIN32 is right here in the build transcripts (whereas the compiler defines _WIN32; good compiler). I am not sure how exactly it is first defined in MSYS builds; I suspect that MSYS gcc might define it itself, but I don't have access to MSYS to check. As for Cygwin, the only translation unit where I could find both __CYGWIN__ and WIN32 defined is dirmod.c, and that results from including <windows.h> and ultimately <minwindef.h> (even though WIN32 isn't defined yet at that time). I couldn't understand why we do that, but I probably didn't read enough commit history. The purpose of dirmod.c on Cygwin today is only to wrap otherwise pure POSIX code in retry loops to handle those spurious EACCES errors due to NT sharing violations, so there is no need for that. Proposal: let's make it a programming rule that we don't allow definitions from Windows headers to leak into Cygwin translation units, preferably by never including them, or if we really must, let's grant specific exemptions in an isolated and well documented way. We don't seem to need any such exemptions currently. Places where we currently worry about the contradictory macros could become conditional #error directives instead. 2. To make it possible to test any of that, you either need a working Windows+Cygwin setup, or working CI. I'm a salty old Unix hacker so I opted for the latter, and I also hope this will eventually be useful to others. Unfortunately I haven't figured out how to get everything working yet, so some of the check-world tests are failing. Clues most welcome! The version I'm posting here is set to run always, so that cfbot will show it alongside others. But I would imagine that if we got a committable-quality version of this, it'd probably be opt-in, so you'd have to say "ci-os-only: cygwin", or "ci-os-only: cygwin, windows" etc in a commit to your private github account to ask for it (or maybe we'll come up with a way to tell cfbot we want the full works of CI checks; the same decision will come up for MSYS, OpenBSD and NetBSD CI support that my colleague is working on). There are other things to fix too, including abysmal performance; see commit message. 3. You can't really run PostgreSQL on Cygwin for real, because its implementation of signals does not have reliable signal masking, so unsubtle and probably also subtle breakage occurs. That was reported upstream by Noah years ago, but they aren't working on a fix. lorikeet shows random failures, and presumably any CI system will do the same... I even wondered about putting our own magic entry/exit macros into signal handlers, that would use atomics to implement a second level of signal masking (?!) but that's an uncommonly large bandaid for a defective platform... and trying to fix Cygwin itself is a rabbithole too far for me. 4. When building with Cygwin GCC 11.3 you get a bunch of warnings that don't show up on other platforms, seemingly indicating that it interprets -Wimplicit-fallthrough=3 differently. Huh? [1] https://www.postgresql.org/message-id/CA%2BhUKGKZ_FjkBnjGADk%2Bpa2g4oKDcG8%3DSE5V23sPTP0EELfyzQ%40mail.gmail.com
Attachment
Thomas Munro <thomas.munro@gmail.com> writes: > 3. You can't really run PostgreSQL on Cygwin for real, because its > implementation of signals does not have reliable signal masking, so > unsubtle and probably also subtle breakage occurs. That was reported > upstream by Noah years ago, but they aren't working on a fix. > lorikeet shows random failures, and presumably any CI system will do > the same... If that's an accurate statement, shouldn't we just drop Cygwin support? Now that we have a native Windows build, it's hard to see how any live user would prefer to use the Cygwin build. regards, tom lane
On Tue, Jul 26, 2022 at 4:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > 3. You can't really run PostgreSQL on Cygwin for real, because its > > implementation of signals does not have reliable signal masking, so > > unsubtle and probably also subtle breakage occurs. That was reported > > upstream by Noah years ago, but they aren't working on a fix. > > lorikeet shows random failures, and presumably any CI system will do > > the same... > > If that's an accurate statement, shouldn't we just drop Cygwin support? This thread rejected the idea last time around: https://www.postgresql.org/message-id/flat/136712b0-0619-5619-4634-0f0286acaef7%402ndQuadrant.com lorikeet still shows the issue. Failures often involve assertions about PMSignalState or mq->mq_sender. Hmm, it's running Cygwin 3.2.0 (March 2021) and the latest release is 3.3.5, so it's remotely possible that it's been fixed recently. Maybe that'd be somewhere in here, but it's not jumping out: https://github.com/cygwin/cygwin/commits/master/winsup/cygwin/signal.cc (Oooh, another implementation of signalfd...)
Thomas Munro <thomas.munro@gmail.com> writes: > On Tue, Jul 26, 2022 at 4:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If that's an accurate statement, shouldn't we just drop Cygwin support? > This thread rejected the idea last time around: > https://www.postgresql.org/message-id/flat/136712b0-0619-5619-4634-0f0286acaef7%402ndQuadrant.com I think maybe we should re-open the discussion. I've certainly reached the stage of fed-up-ness. That platform seems seriously broken, upstream is making no progress on fixing it, and there doesn't seem to be any real-world use-case. The only positive argument for it is that Readline doesn't work in the other Windows builds --- but we've apparently not rechecked that statement in eighteen years, so maybe things are better now. If we could just continue to blithely ignore lorikeet's failures, I wouldn't mind so much; but doing any significant amount of new code development work for the platform seems like throwing away developer time. regards, tom lane
On Tue, Jul 26, 2022 at 7:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think maybe we should re-open the discussion. I've certainly > reached the stage of fed-up-ness. That platform seems seriously > broken, upstream is making no progress on fixing it, and there > doesn't seem to be any real-world use-case. The only positive > argument for it is that Readline doesn't work in the other > Windows builds --- but we've apparently not rechecked that > statement in eighteen years, so maybe things are better now. > > If we could just continue to blithely ignore lorikeet's failures, > I wouldn't mind so much; but doing any significant amount of new > code development work for the platform seems like throwing away > developer time. I agree with that. All things being equal, I like the idea of supporting a bunch of different platforms, and Cygwin doesn't really look that dead. It has recent releases. But if blocking signals doesn't actually work on that platform, making PostgreSQL work reliably there seems really difficult. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jul 26, 2022 at 04:24:25PM +1200, Thomas Munro wrote: > 3. You can't really run PostgreSQL on Cygwin for real, because its > implementation of signals does not have reliable signal masking, so > unsubtle and probably also subtle breakage occurs. That was reported > upstream by Noah years ago, but they aren't working on a fix. > lorikeet shows random failures, and presumably any CI system will do > the same... Reference: https://www.postgresql.org/message-id/20170321034703.GB2097809%40tornado.leadboat.com On my 2nd try: https://cirrus-ci.com/task/5311911574110208 TRAP: FailedAssertion("mq->mq_sender == NULL", File: "shm_mq.c", Line: 230, PID: 16370) 2022-07-26 06:32:35.525 PDT [15538][postmaster] LOG: background worker "parallel worker" (PID 16370) was terminated by signal6: Aborted > XXX Doesn't get all the way through yet... Mainly because getopt was causing all tap tests to fail. I tried to fix that in configure, but ended up changing the callers. This is getting close, but I don't think has actually managed to pass all tests yet.. https://cirrus-ci.com/task/5274721116749824 > 4. When building with Cygwin GCC 11.3 you get a bunch of warnings > that don't show up on other platforms, seemingly indicating that it > interprets -Wimplicit-fallthrough=3 differently. Huh? Evidently due to the same getopt issues. > XXX This should use a canned Docker image with all the right packages > installed Has anyone tried using non-canned images ? It sounds like this could reduce the 4min startup time for windows. https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment > XXX configure is soooo slooow, can we cache it?! Compiling is also > insanely slow, but ccache gets it down to a couple of minutes if you're > lucky One reason compiling was slow is because you ended up with -O2. You can cache configure as long as you're willing to re-run it whenever options were changed. That also applies to the existing headerscheck. > XXX I don't know how to put variables like BUILD_JOBS into the scripts WDYM ? If it's outside of bash and in windows shell it's like %var%, right ? https://cirrus-ci.org/guide/writing-tasks/#environment-variables I just noticed that cirrus is misbehaving: if there's a variable called CI (which there is), then it expands $CI_FOO like ${CI}_FOO rather than ${CI_FOO}. I've also seen weirdness when variable names or operators appear in the commit message... > XXX Needs some --with-X options Done > XXX We would never want this to run by default in CI, but it'd be nice > to be able to ask for it with ci-os-only! (See commented out line) > only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*' Doesn't this already do what's needed? As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only', the task will runs only on request. > XXX I have no idea if crash dump works, and if this should share > elements with the msys work in commitfest #3575 Based on the crash above, it wasn't working. And after some changes ... it still doesn't work. windows_os is probably skipping too many things. -- Justin
Attachment
On Wed, Jul 27, 2022 at 6:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Tue, Jul 26, 2022 at 04:24:25PM +1200, Thomas Munro wrote: > > 3. You can't really run PostgreSQL on Cygwin for real, because its > > implementation of signals does not have reliable signal masking, so > > unsubtle and probably also subtle breakage occurs. That was reported > > upstream by Noah years ago, but they aren't working on a fix. > > lorikeet shows random failures, and presumably any CI system will do > > the same... > > Reference: https://www.postgresql.org/message-id/20170321034703.GB2097809%40tornado.leadboat.com > > On my 2nd try: > > https://cirrus-ci.com/task/5311911574110208 > TRAP: FailedAssertion("mq->mq_sender == NULL", File: "shm_mq.c", Line: 230, PID: 16370) > 2022-07-26 06:32:35.525 PDT [15538][postmaster] LOG: background worker "parallel worker" (PID 16370) was terminated bysignal 6: Aborted Thanks for working on this! Huh, that Cygwin being shipped by Choco is quite old, older than lorikeet's, but not old enough to not have the bug: [04:33:55.234] Starting cygwin install, version 2.918 Based on clues in Noah's emails in the archives, I think versions from maybe somewhere around 2015 didn't have the bug, and then the bug appeared, and AFAIK it's still here. I wonder if you can tell Choco to install an ancient version, but even if that's possible you'd be dealing with other stupid problems and bugs. > > XXX Doesn't get all the way through yet... > > Mainly because getopt was causing all tap tests to fail. > I tried to fix that in configure, but ended up changing the callers. > > This is getting close, but I don't think has actually managed to pass all tests > yet.. https://cirrus-ci.com/task/5274721116749824 Woo. > > 4. When building with Cygwin GCC 11.3 you get a bunch of warnings > > that don't show up on other platforms, seemingly indicating that it > > interprets -Wimplicit-fallthrough=3 differently. Huh? > > Evidently due to the same getopt issues. Ahh, nice detective work. > > XXX This should use a canned Docker image with all the right packages > > installed > > Has anyone tried using non-canned images ? It sounds like this could reduce > the 4min startup time for windows. > > https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment Yeah, I had that working once. Not sure what the pros and cons would be for us. > > XXX configure is soooo slooow, can we cache it?! Compiling is also > > insanely slow, but ccache gets it down to a couple of minutes if you're > > lucky > > One reason compiling was slow is because you ended up with -O2. Ah, right. > You can cache configure as long as you're willing to re-run it whenever options > were changed. That also applies to the existing headerscheck. > > > XXX I don't know how to put variables like BUILD_JOBS into the scripts > > WDYM ? If it's outside of bash and in windows shell it's like %var%, right ? > https://cirrus-ci.org/guide/writing-tasks/#environment-variables Right. I should have taken the clue from the %cd% (I got a few ideas about how to do this from libarchive's CI scripting[1]). > I just noticed that cirrus is misbehaving: if there's a variable called CI > (which there is), then it expands $CI_FOO like ${CI}_FOO rather than ${CI_FOO}. > I've also seen weirdness when variable names or operators appear in the commit > message... > > > XXX Needs some --with-X options > > Done Neat. > > XXX We would never want this to run by default in CI, but it'd be nice > > to be able to ask for it with ci-os-only! (See commented out line) > > only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*' > > Doesn't this already do what's needed? > As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only', > the task will runs only on request. Yeah I was just trying to say that I was sharing the script in a way that always runs, but for commit we'd want that. This is all far too slow for cfbot to have to deal with on every build. Looks like we can expect to be able to build and test fast on Windows soonish, though, so maybe one day we'd just turn Cygwin and MSYS on? [1] https://github.com/libarchive/libarchive/blob/master/build/ci/cirrus_ci/ci.cmd
On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote: > Thanks for working on this! > > Huh, that Cygwin being shipped by Choco is quite old, older than > lorikeet's, but not old enough to not have the bug: > > [04:33:55.234] Starting cygwin install, version 2.918 Hm, I think that's the version of "cygwinsetup" but not cygwin.. It also says this: [13:16:36.014] Cygwin v3.3.4.20220408 [Approved] > I wonder if you can tell Choco > to install an ancient version, but even if that's possible you'd be > dealing with other stupid problems and bugs. Yes: choco install -y --no-progress --version 4.6.1 ccache > > > XXX This should use a canned Docker image with all the right packages > > > installed > > > > Has anyone tried using non-canned images ? It sounds like this could reduce > > the 4min startup time for windows. > > > > https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment > > Yeah, I had that working once. Not sure what the pros and cons would be for us. I think it could be a lot faster to start, since cirrus caches the generated docker image locally. Rather than (I gather) pulling the image every time. > > > XXX We would never want this to run by default in CI, but it'd be nice > > > to be able to ask for it with ci-os-only! (See commented out line) > > > only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*' > > > > Doesn't this already do what's needed? > > As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only', > > the task will runs only on request. > > Yeah I was just trying to say that I was sharing the script in a way > that always runs, but for commit we'd want that. This is all far too > slow for cfbot to have to deal with on every build. It occurred to me today that if cfbot preserved the original patch series, and commit messages, that would allow patch authors to write things like "ci-os-only: docs" for a doc only patch. I've never gotten cirrus' changesOnly() stuff to work... > Looks like we can expect to be able to build and test fast on Windows > soonish, though, Do you mean with meson ? > so maybe one day we'd just turn Cygwin and MSYS on? I didn't understand this ? -- Justin
On Fri, Jul 29, 2022 at 10:23 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote: > > [04:33:55.234] Starting cygwin install, version 2.918 > > Hm, I think that's the version of "cygwinsetup" but not cygwin.. > It also says this: [13:16:36.014] Cygwin v3.3.4.20220408 [Approved] Oops. Ok so we're testing the very latest then, and it definitely still has the bug as we thought. > It occurred to me today that if cfbot preserved the original patch series, and > commit messages, that would allow patch authors to write things like > "ci-os-only: docs" for a doc only patch. I've never gotten cirrus' > changesOnly() stuff to work... Maybe it's time to switch to "git am -3 ..." and reject patches that don't apply that way. > > Looks like we can expect to be able to build and test fast on Windows > > soonish, though, > > Do you mean with meson ? Yeah. Also there are some other things we can do to speed up testing on Windows (and elsewhere), like not running every test query with new psql + backend process pair, which takes at least a few hundred ms and sometimes up to several seconds on this platform; I have some patches I need to finish... > > so maybe one day we'd just turn Cygwin and MSYS on? > > I didn't understand this ? I mean, if, some sunny day, we can compile and test on Windows at non-glacial speeds, then it would become possible to contemplate having cfbot run these tasks for every patch every time.
On Wed, Jul 27, 2022 at 5:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jul 26, 2022 at 7:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I think maybe we should re-open the discussion. I've certainly > > reached the stage of fed-up-ness. That platform seems seriously > > broken, upstream is making no progress on fixing it, and there > > doesn't seem to be any real-world use-case. The only positive > > argument for it is that Readline doesn't work in the other > > Windows builds --- but we've apparently not rechecked that > > statement in eighteen years, so maybe things are better now. > > > > If we could just continue to blithely ignore lorikeet's failures, > > I wouldn't mind so much; but doing any significant amount of new > > code development work for the platform seems like throwing away > > developer time. > > I agree with that. All things being equal, I like the idea of > supporting a bunch of different platforms, and Cygwin doesn't really > look that dead. It has recent releases. But if blocking signals > doesn't actually work on that platform, making PostgreSQL work > reliably there seems really difficult. It's one thing to drop old dead Unixes but I don't think anyone would enjoy dropping support for an active open source project. The best outcome would be for people who have an interest in seeing PostgreSQL work correctly on Cygwin to help get the bug fixed. Here are the threads I'm aware of: https://cygwin.com/pipermail/cygwin/2017-August/234001.html https://cygwin.com/pipermail/cygwin/2017-August/234097.html I wonder if these problems would go away as a nice incidental side-effect if we used latches for postmaster wakeups. I don't know... maybe, if the problem is just with the postmaster's pattern of blocking/unblocking? Maybe backend startup is simple enough that it doesn't hit the bug? From a quick glance, I think the assertion failures that occur in regular backends can possibly be blamed on the postmaster getting confused about its children due to unexpected handler re-entry.
On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote: > > > XXX We would never want this to run by default in CI, but it'd be nice > > > to be able to ask for it with ci-os-only! (See commented out line) > > > only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*' > > > > Doesn't this already do what's needed? > > As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only', > > the task will runs only on request. > > Yeah I was just trying to say that I was sharing the script in a way > that always runs, but for commit we'd want that. That makes more sense after noticing that you created a cf entry (for which cfbot has been skipping my patch due to my "only_if" line). There's still a few persistent issues: This fails ~50% of the time in recovery 010-truncate I hacked around this by setting data_sync_retry. https://cirrus-ci.com/task/5289444063313920 I found these, not sure if they're relevant. https://www.postgresql.org/message-id/flat/CAA4eK1Kft05mwNuZbTVRmz8SNS3r%2BuriuCT8DxL5KJy5btoS-A%40mail.gmail.com https://www.postgresql.org/message-id/flat/CAFiTN-uGxgo5258hZy2QJoz%3Ds7_Cs7v9%3Db8Z2GgFV7qmQUOwxw%40mail.gmail.com And an fsync abort in 013 which seems similar to this other one. data_sync_retry also avoids this issue. https://cirrus-ci.com/task/6283023745286144?logs=cores#L34 https://www.postgresql.org/message-id/flat/CAMVYW_4QhjZ-19Xpr2x1B19soRCNu1BXHM8g1mOnAVtd5VViDw%40mail.gmail.com And sometimes various assertions failing in regress parallel_select (and then times out) https://api.cirrus-ci.com/v1/artifact/task/5537540282253312/log/src/test/regress/log/postmaster.log https://api.cirrus-ci.com/v1/artifact/task/6108746773430272/log/src/test/regress/log/postmaster.log Or "could not map dynamic shared memory segment" (actually in 027-stream-regress): https://cirrus-ci.com/task/6168860746317824 And segfault in vacuum parallel https://api.cirrus-ci.com/v1/artifact/task/5404589569605632/log/src/test/regress/log/postmaster.log Sometimes semctl() failed: Resource temporarily unavailable https://api.cirrus-ci.com/v1/artifact/task/5027860623654912/log/src/test/subscription/tmp_check/log/014_binary_publisher.log https://api.cirrus-ci.com/v1/artifact/task/5027860623654912/log/src/bin/pg_rewind/tmp_check/log/001_basic_standby_local.log Some more https://cirrus-ci.com/task/6468927780814848 If you're lucky, there's only 1 or 2 problems, of which those are different symptoms.. Maybe for now this needs to disable tap tests :( This shows that it *can* pass, if slowly, and infrequently: https://cirrus-ci.com/task/6546858536337408 This fixes my changes to configure for getopt. And simplifies the changes to *.pl (the .exe changes weren't necessary at all). And removes the changes for implicit-fallthrough; I realized that configure was just deciding that it didn't work and not using it at all. And adds support for backtraces. And remove kerberos and and add libxml Why did you write "|| exit /b 1" in all the bash invocations ? I think cirrus handles that automatically. -- Justin
Attachment
Hi, On 2022-07-28 17:23:19 -0500, Justin Pryzby wrote: > On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote: > > > > XXX This should use a canned Docker image with all the right packages > > > > installed > > > > > > Has anyone tried using non-canned images ? It sounds like this could reduce > > > the 4min startup time for windows. > > > > > > https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment > > > > Yeah, I had that working once. Not sure what the pros and cons would be for us. > > I think it could be a lot faster to start, since cirrus caches the generated > docker image locally. Rather than (I gather) pulling the image every time. I'm quite certain that is not true. All the docker images built are just uploaded to the google container registry and then downloaded onto a *separate* windows host. The dockerfile: stuff generates a separate task running on a separate machine... It's a bit better for non-windows containers, because there google has some optimization for pulling image (pieces) on demand or such. Greetings, Andres Freund
On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > [train wreck] Oh my, so I'm getting the impression we might actually be totally unstable on Cygwin. Which surprises me because ... wait a minute ... lorikeet isn't even running most of the tests. So... we don't really know the degree to which any of this works at all? > This shows that it *can* pass, if slowly, and infrequently: > https://cirrus-ci.com/task/6546858536337408 Ok, that's slightly reassuring, so maybe we *can* fix this, but I'm one step closer to what Tom said, re wasting developer time... > [lots of improvements] Cool. > Why did you write "|| exit /b 1" in all the bash invocations ? I think cirrus > handles that automatically. Cargo-culted from libarchive.
Hi, On 2022-08-04 16:16:06 +1200, Thomas Munro wrote: > Ok, that's slightly reassuring, so maybe we *can* fix this, but I'm > one step closer to what Tom said, re wasting developer time... It might be worth checking whether the cygwin installer, which at some point at least allowed installing postgres, has download numbers available anywhere. It's possible we could e.g. get away with just allowing libpq to be built. Greetings, Andres Freund
On Thu, Aug 4, 2022 at 4:16 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > [train wreck] > > Oh my, so I'm getting the impression we might actually be totally > unstable on Cygwin. Which surprises me because ... wait a minute ... > lorikeet isn't even running most of the tests. So... we don't really > know the degree to which any of this works at all? Hmm, it's possible that all these failures are just new-to-me effects of the known bug. Certainly the assertion failures are of the usual type, and I think it might be possible for the weird parallel query failure to be explained by the postmaster forking extra phantom child processes. It may be madness to try to work around this, but I wonder if we could use a static local variable that we update with atomic compare exhange, inside PG_SIGNAL_HANDLER_ENTRY(), and PG_SIGNAL_HANDLER_EXIT() macros that do nothing on every other system. On entry, if you can do 0->1 it means you are allowed to run the function. If it's non-zero, set n->n+1 and return immediately: signal blocked, but queued for later. On exit, you CAS n->0. If n was > 1, then you have to jump back to the top and run the function body again.
Thomas Munro <thomas.munro@gmail.com> writes: > It may be madness to try to work around this, but I wonder if we could > use a static local variable that we update with atomic compare > exhange, inside PG_SIGNAL_HANDLER_ENTRY(), and > PG_SIGNAL_HANDLER_EXIT() macros that do nothing on every other system. > On entry, if you can do 0->1 it means you are allowed to run the > function. If it's non-zero, set n->n+1 and return immediately: signal > blocked, but queued for later. On exit, you CAS n->0. If n was > 1, > then you have to jump back to the top and run the function body again. And ... we're expending all this effort for what exactly? regards, tom lane
On Thu, Aug 4, 2022 at 5:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > It may be madness to try to work around this, but I wonder if we could > > use a static local variable that we update with atomic compare > > exhange, inside PG_SIGNAL_HANDLER_ENTRY(), and > > PG_SIGNAL_HANDLER_EXIT() macros that do nothing on every other system. > > On entry, if you can do 0->1 it means you are allowed to run the > > function. If it's non-zero, set n->n+1 and return immediately: signal > > blocked, but queued for later. On exit, you CAS n->0. If n was > 1, > > then you have to jump back to the top and run the function body again. > > And ... we're expending all this effort for what exactly? I'd be almost as happy if we ripped it all out, shut down lorikeet and added it to the list of fallen platforms. I'd feel a bit like a vandal, though. My suggestion is a last-ditch idea for Noah (CCd) and/or Andrew to consider, who (respectively) blocked this last time and run lorikeet. No plans to write that patch myself...
On Thu, Aug 04, 2022 at 04:16:06PM +1200, Thomas Munro wrote: > On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > [train wreck] > > Oh my, so I'm getting the impression we might actually be totally > unstable on Cygwin. Which surprises me because ... wait a minute ... > lorikeet isn't even running most of the tests. So... we don't really > know the degree to which any of this works at all? Right. Maybe it's of limited interest, but .. This updates the patch to build and test with meson. Which first requires patching some meson.builds. I guess that's needed for some current BF members, too. Unfortunately, ccache+PCH causes gcc to crash :( -- Justin
Attachment
On Thu, Oct 20, 2022 at 10:40:40PM -0500, Justin Pryzby wrote: > On Thu, Aug 04, 2022 at 04:16:06PM +1200, Thomas Munro wrote: > > On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > [train wreck] > > > > Oh my, so I'm getting the impression we might actually be totally > > unstable on Cygwin. Which surprises me because ... wait a minute ... > > lorikeet isn't even running most of the tests. So... we don't really > > know the degree to which any of this works at all? > > Right. > > Maybe it's of limited interest, but .. > > This updates the patch to build and test with meson. > Which first requires patching some meson.builds. > I guess that's needed for some current BF members, too. > Unfortunately, ccache+PCH causes gcc to crash :( Resending with the 'only-if' line commented (doh). And some fixes to 001 as Andres pointed out by on other thread. -- Justin
Attachment
Hi, On 2022-11-08 19:04:37 -0600, Justin Pryzby wrote: > From 2741472080eceac5cb6d002c39eaf319d7f72b50 Mon Sep 17 00:00:00 2001 > From: Justin Pryzby <pryzbyj@telsasoft.com> > Date: Fri, 30 Sep 2022 13:39:43 -0500 > Subject: [PATCH 1/3] meson: other fixes for cygwin > > XXX: what about HAVE_BUGGY_STRTOF ? What about it? As noted in another thread, HAVE_BUGGY_STRTOF is defined in a header, and shouldn't be affected by the buildsystem. Pushed this commit. > XXX This should use a canned Docker image with all the right packages > installed? But if the larger image is slower to start, then maybe not... I think once we convert the windows containers to windows VMs we can just install both cygwin and mingw in the same image. The overhead of installing too much seems far far smaller there. > + CONFIGURE_FLAGS: --enable-cassert --enable-debug --with-ldap --with-ssl=openssl --with-libxml > + # --enable-tap-tests I assume this is disabled as tap tests fail? > + C:\tools\cygwin\bin\bash.exe --login -c "cygserver-config -y" I'd copy the approach used for mingw of putting most of this in an environment variable. > +findargs='' > case $os in > freebsd|linux|macos) > - ;; > + ;; > + > + cygwin) > + # XXX Evidently I don't know how to write two arguments here without pathname expansion later, other than eval. > + #findargs='-name "*.stackdump"' > + for stack in $(find "$directory" -type f -name "*.stackdump") ; do > + binary=`basename "$stack" .stackdump` > + echo;echo; > + echo "dumping ${stack} for ${binary}" > + awk '/^0/{print $2}' $stack |addr2line -f -i -e ./src/backend/postgres.exe > + #awk '/^0/{print $2}' $stack |addr2line -f -i -e "./src/backend/$binary.exe" > + done > + exit 0 > + ;; Is this stuff actually needed? Could we use the infrastructure we use for backtraces with msvc instead? Or use something that understands .stackdump files? > +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm > [...] > +++ b/src/test/perl/PostgreSQL/Test/Utils.pm > [...] > +++ b/src/test/recovery/t/020_archive_status.pl > [...] I think these should be in a separate commit, they're not actually about CI. Greetings, Andres Freund
On Fri, Jul 29, 2022 at 10:57 AM Thomas Munro <thomas.munro@gmail.com> wrote: > I wonder if these problems would go away as a nice incidental > side-effect if we used latches for postmaster wakeups. I don't > know... maybe, if the problem is just with the postmaster's pattern of > blocking/unblocking? Maybe backend startup is simple enough that it > doesn't hit the bug? From a quick glance, I think the assertion > failures that occur in regular backends can possibly be blamed on the > postmaster getting confused about its children due to unexpected > handler re-entry. Just to connect the dots, that's what this patch does: https://www.postgresql.org/message-id/flat/CA+hUKG+Z-HpOj1JsO9eWUP+ar7npSVinsC_npxSy+jdOMsx=Gg@mail.gmail.com (There may be other places that break under Cygwin's flaky sa_mask implementation, I don't know and haven't seen any clues about that.)
On Wed, 9 Nov 2022 at 06:34, Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Thu, Oct 20, 2022 at 10:40:40PM -0500, Justin Pryzby wrote: > > On Thu, Aug 04, 2022 at 04:16:06PM +1200, Thomas Munro wrote: > > > On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > > [train wreck] > > > > > > Oh my, so I'm getting the impression we might actually be totally > > > unstable on Cygwin. Which surprises me because ... wait a minute ... > > > lorikeet isn't even running most of the tests. So... we don't really > > > know the degree to which any of this works at all? > > > > Right. > > > > Maybe it's of limited interest, but .. > > > > This updates the patch to build and test with meson. > > Which first requires patching some meson.builds. > > I guess that's needed for some current BF members, too. > > Unfortunately, ccache+PCH causes gcc to crash :( > > Resending with the 'only-if' line commented (doh). > And some fixes to 001 as Andres pointed out by on other thread. Is there still some work pending for this thread as Andres had committed some part, if so, can you post an updated patch for the same. Regards, Vignesh
On Tue, Jan 03, 2023 at 05:54:56PM +0530, vignesh C wrote: > > On Thu, Oct 20, 2022 at 10:40:40PM -0500, Justin Pryzby wrote: > > > On Thu, Aug 04, 2022 at 04:16:06PM +1200, Thomas Munro wrote: > > > > On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > > > [train wreck] > > > > > > > > Oh my, so I'm getting the impression we might actually be totally > > > > unstable on Cygwin. Which surprises me because ... wait a minute ... > > > > lorikeet isn't even running most of the tests. So... we don't really > > > > know the degree to which any of this works at all? > > > > > > Right. > > > > > > Maybe it's of limited interest, but .. > > > > > > This updates the patch to build and test with meson. > > > Which first requires patching some meson.builds. > > > I guess that's needed for some current BF members, too. > > > Unfortunately, ccache+PCH causes gcc to crash :( > > > > Resending with the 'only-if' line commented (doh). > > And some fixes to 001 as Andres pointed out by on other thread. > > Is there still some work pending for this thread as Andres had > committed some part, if so, can you post an updated patch for the > same. Thomas, what's your opinion ?
On Wed, Jan 4, 2023 at 3:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Tue, Jan 03, 2023 at 05:54:56PM +0530, vignesh C wrote: > > Is there still some work pending for this thread as Andres had > > committed some part, if so, can you post an updated patch for the > > same. > > Thomas, what's your opinion ? One observation is that your CI patch *nearly* succeeds, even if hacked to turn on the full TAP tests, if applied on top of the WaitEventSet-for-postmaster patch: https://cirrus-ci.com/task/4533371804581888 No cigar though, it still failed a few times for me in the subscription tests with EAGAIN, when accessing semaphores: semctl(24576010, 14, SETVAL, 0) failed: Resource temporarily unavailable That isn't an error I expect from semctl(), but from some cursory research it seems like that system call is actually talking to the cygserver process over a pipe (?) to implement SysV semaphores. Maybe it couldn't keep up, but doesn't like to block? Perhaps we could try to tune that server, but let's try the POSIX kind of semaphores instead. From a quick peek at the source, they are implemented some other way on direct native NT voodoo, no cygserver involved. https://cirrus-ci.com/task/5142810819559424 [still running at time of writing] Gotta run, but I'll check again in the morning to see if that does better...
On Fri, Jan 6, 2023 at 1:22 AM Thomas Munro <thomas.munro@gmail.com> wrote: > https://cirrus-ci.com/task/5142810819559424 [still running at time of writing] > > Gotta run, but I'll check again in the morning to see if that does better... Yes! Two successful runs with all TAP tests so far. So it looks like we can probably stop lorikeet's spurious failures, by happy coincidence due to other work, and we could seriously consider committing this optional CI test for it, much like we have the optional MSYS build. Any interest in producing a tidied up version of the patch, Justin? Or I can, but I'll go and work on other things first. I pushed a change to switch the semaphore implementation. I haven't personally seen that failure mode on lorikeet, but I would guess that's because (1) it's only running a tiny subset of the tests, (2) it crashes for the other reason with higher likelihood, and/or (3) it's not using much concurrency yet because the build farm doesn't use meson yet.
On Wed, Nov 9, 2022 at 2:04 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > +data_sync_retry = on Sharing with the list some clues that Justin and I figured out about what that part is doing. Without it, you get failures like: PANIC: could not open file "pg_logical/snapshots/0-14FE6B0.snap": No such file or directory That's been seen before: https://www.postgresql.org/message-id/flat/17827.1549866683%40sss.pgh.pa.us That thread concluded that the operating system must have a non-atomic rename(), ie a kernel bug. I don't know why Cygwin would display that behaviour and our native Windows build not; maybe timing, or maybe our own open() and rename() wrappers for Windows do something important differently than Cygwin's open() and rename(). On reflection, that seems a bit too flimsy to have in-tree without more investigation, which I won't have time for myself, so I'm going to withdraw this entry.
On 2023-01-05 Th 16:39, Thomas Munro wrote: > On Fri, Jan 6, 2023 at 1:22 AM Thomas Munro <thomas.munro@gmail.com> wrote: >> https://cirrus-ci.com/task/5142810819559424 [still running at time of writing] >> >> Gotta run, but I'll check again in the morning to see if that does better... > Yes! Two successful runs with all TAP tests so far. So it looks like > we can probably stop lorikeet's spurious failures, by happy > coincidence due to other work, and we could seriously consider > committing this optional CI test for it, much like we have the > optional MSYS build. Any interest in producing a tidied up version of > the patch, Justin? Or I can, but I'll go and work on other things > first. > > I pushed a change to switch the semaphore implementation. I haven't > personally seen that failure mode on lorikeet, but I would guess > that's because (1) it's only running a tiny subset of the tests, (2) > it crashes for the other reason with higher likelihood, and/or (3) > it's not using much concurrency yet because the build farm doesn't use > meson yet. OK, should I now try re-enabling TAP tests on lorikeet? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Sat, Jan 7, 2023 at 3:40 AM Andrew Dunstan <andrew@dunslane.net> wrote: > OK, should I now try re-enabling TAP tests on lorikeet? Not before https://commitfest.postgresql.org/41/4032/ is committed. After that, it might be worth a try? I have no idea if the PANIC problem I mentioned last night would apply to lorikeet's kernel too. To summarise the kinds of failure we have analysed in this thread: 1. SysV semaphores are buggy; fixed, I hope, by recent commit (= just don't use them). 2. The regular crashes we already knew about from other threads due to signal masking being buggy seem to be fixed, coincidentally, by CF #4032, not yet committed (= don't rely on sa_mask for correctness). 3. PANIC apparently caused by non-atomic rename(), based on analysis of similar failures seen on other old buggy OSes back in 2018. If lorikeet has problem #3 (which it may not; we know from CF #3951 that kernel versions differ in related respects and Server 2019 as used on CI seems to have the most conservative/old Windows behaviour) then it might fail in the TAP tests just like the proposed CI-for-Cygwin patch, unless you also do data_sync_retry=on, which seems like a pretty ugly workaround to me. I don't know what else might be broken by non-atomic rename(), and I'd rather not find out :-D I finished up here by trying to tidy up some weird looking nonsense in our code while working on general portability cleanup, since I needed a way to check if __CYGWIN__ stuff still works, but what we found out is that it's more broken than anyone realised, and now I have to pull the emergency rabbit hole ejection cord because I have less than zero time for or interest in debugging Cygwin.
On Sat, Jan 07, 2023 at 12:39:11AM +1300, Thomas Munro wrote: > On Wed, Nov 9, 2022 at 2:04 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > +data_sync_retry = on > > Sharing with the list some clues that Justin and I figured out about > what that part is doing. Without it, you get failures like: > > PANIC: could not open file "pg_logical/snapshots/0-14FE6B0.snap": > No such file or directory > > That's been seen before: > > https://www.postgresql.org/message-id/flat/17827.1549866683%40sss.pgh.pa.us > > That thread concluded that the operating system must have a non-atomic > rename(), ie a kernel bug. I don't know why Cygwin would display that > behaviour and our native Windows build not; maybe timing, or maybe our > own open() and rename() wrappers for Windows do something important > differently than Cygwin's open() and rename(). > > On reflection, that seems a bit too flimsy to have in-tree without > more investigation, which I won't have time for myself, so I'm going > to withdraw this entry. Not so fast :) Here's my latest copy of the patch. Most recently, rather than setting data_sync_retry=no, I changed to call fsync_fname_ext() rather than fsync_fname(), which uses PANIC (except when data_sync_retry is disabled). That seems to work, showing that the problem is limited to SnapBuildSerialize(), and not a problem with all fsync()... https://cirrus-ci.com/task/5990885733695488 Thomas raised a good question, which was how the tests were passing when SnapBuildSerialize() was raising an error, which is what it would've been doing when I used data_sync_retry=no. So .. why is wal_sync_method being used to control fsync for things other than WAL? See 6dc7760ac (c. 2005) which added wal_fsync_writethrough, at which point (since 9b178555f, c. 2004) wal_sync_method was already being used for SLOG. Now, it's also being used for logical decoding (since b89e1510 and 858ec1185, c. 2014) in rewriteheap.c/snapbuild.c. And pidfiles (since ee0e525bf, 2010). And the control file (8b938d36f7, 2019). Note that data_sync_retry wasn't added until 9ccdd7f66 (c. 2018) It looks like logical decoding may be the "most wrong" place that wal_sync_method is being used, so maybe my change is reasonable to consider, and not just a workaround. I'm going to re-open the CF entry to let this run for a while to see how it works out. -- Justin
Attachment
On Wed, Jan 11, 2023 at 10:39:49PM -0600, Justin Pryzby wrote: > Here's my latest copy of the patch. > + # due to resource constraints we don't run this task by default for now > + trigger_type: manual Now, with trigger_type commented, so Thomas doesn't have to click "trigger" for me.
Attachment
Hi, On 2023-01-11 22:39:49 -0600, Justin Pryzby wrote: > Thomas raised a good question, which was how the tests were passing when > SnapBuildSerialize() was raising an error, which is what it would've > been doing when I used data_sync_retry=no. Presumably some test not checking for failures in a part of the test. > So .. why is wal_sync_method being used to control fsync for things > other than WAL? Historical raisins, I think. The problem is that macOS lies about fsync, and one needs special magic to make it behave like a real fsync. Somebody thought that instead of inventing a separate GUC to control whether the "real" fsync is used for other subsystems, it'd be better to reuse wal_sync_method. Note that this isn't the function that is actually used for WAL (that's issue_xlog_fsync()), and that pg_fsync() only uses the GUC to know whether to use pg_fsync_writethrough() or pg_fsync_no_writethrough(fd). > See 6dc7760ac (c. 2005) which added wal_fsync_writethrough, at which > point (since 9b178555f, c. 2004) wal_sync_method was already being used > for SLOG. > > Now, it's also being used for logical decoding (since b89e1510 and > 858ec1185, c. 2014) in rewriteheap.c/snapbuild.c. And pidfiles (since > ee0e525bf, 2010). And the control file (8b938d36f7, 2019). Note that > data_sync_retry wasn't added until 9ccdd7f66 (c. 2018) > > It looks like logical decoding may be the "most wrong" place that > wal_sync_method is being used, so maybe my change is reasonable to > consider, and not just a workaround. I don't follow. What does using fsync_fname() vs fsync_fname_ext() have to do with pg_fsync() using wal_sync_method? fsync_fname() is just a wrapper around fsync_fname_ext(). Both end up in pg_fsync(). Are you actually proposing that we don't PANIC after an fsync for the category of files that you list here, even with data_sync_retry set? Greetings, Andres Freund
On Thu, Jan 12, 2023 at 06:43:54PM -0800, Andres Freund wrote: > > It looks like logical decoding may be the "most wrong" place that > > wal_sync_method is being used, so maybe my change is reasonable to > > consider, and not just a workaround. > > I don't follow. What does using fsync_fname() vs fsync_fname_ext() have to do > with pg_fsync() using wal_sync_method? fsync_fname() is just a wrapper around > fsync_fname_ext(). Both end up in pg_fsync(). My patch used fsync_fname_ext() which would cause an ERROR rather than a PANIC when failing to fsync the logical decoding pathname. > Are you actually proposing that we don't PANIC after an fsync for the category > of files that you list here, even with data_sync_retry set? Yes, but I'm referring only to my change to SnapBuildSerialize(). The rest of the verbage was me trying to figure out the history/evolution of pg_fsync usage. -- Justin
Hi, On 2023-01-12 22:17:55 -0600, Justin Pryzby wrote: > On Thu, Jan 12, 2023 at 06:43:54PM -0800, Andres Freund wrote: > > Are you actually proposing that we don't PANIC after an fsync for the category > > of files that you list here, even with data_sync_retry set? > > Yes, but I'm referring only to my change to SnapBuildSerialize(). I can't see how that change could be correct? Greetings, Andres Freund
On Thu, Jan 12, 2023 at 10:17:55PM -0600, Justin Pryzby wrote: > On Thu, Jan 12, 2023 at 06:43:54PM -0800, Andres Freund wrote: > > > It looks like logical decoding may be the "most wrong" place that > > > wal_sync_method is being used, so maybe my change is reasonable to > > > consider, and not just a workaround. > > > > I don't follow. What does using fsync_fname() vs fsync_fname_ext() have to do > > with pg_fsync() using wal_sync_method? fsync_fname() is just a wrapper around > > fsync_fname_ext(). Both end up in pg_fsync(). > > My patch used fsync_fname_ext() which would cause an ERROR rather than a > PANIC when failing to fsync the logical decoding pathname. > > > Are you actually proposing that we don't PANIC after an fsync for the category > > of files that you list here, even with data_sync_retry set? > > Yes, but I'm referring only to my change to SnapBuildSerialize(). > > The rest of the verbage was me trying to figure out the > history/evolution of pg_fsync usage. Also note the existing comment (originating from Thomas' "fsync-gate" commit, which introduced data_sync_retry): + * It's safe to just ERROR on fsync() here because we'll retry the whole + * operation including the writes. Also, it seems to work fine if one calls pg_fsync() again, rather than calling fsync_fname(), which re-opens the file. It also seems to work fine if one omits the initial call to fsync_fname("pg_logical/snapshots", true); Since SnapBuildSerialize() isn't atomic (the system could crash at any point), I'm not seeing why these wouldn't be adequately safe. But also hoping Thomas will comment on that. -- Justin
Hi, On 2023-01-23 17:28:14 -0600, Justin Pryzby wrote: > On Thu, Jan 12, 2023 at 10:17:55PM -0600, Justin Pryzby wrote: > > On Thu, Jan 12, 2023 at 06:43:54PM -0800, Andres Freund wrote: > > > > It looks like logical decoding may be the "most wrong" place that > > > > wal_sync_method is being used, so maybe my change is reasonable to > > > > consider, and not just a workaround. > > > > > > I don't follow. What does using fsync_fname() vs fsync_fname_ext() have to do > > > with pg_fsync() using wal_sync_method? fsync_fname() is just a wrapper around > > > fsync_fname_ext(). Both end up in pg_fsync(). > > > > My patch used fsync_fname_ext() which would cause an ERROR rather than a > > PANIC when failing to fsync the logical decoding pathname. > > > > > Are you actually proposing that we don't PANIC after an fsync for the category > > > of files that you list here, even with data_sync_retry set? > > > > Yes, but I'm referring only to my change to SnapBuildSerialize(). > > > > The rest of the verbage was me trying to figure out the > > history/evolution of pg_fsync usage. > > Also note the existing comment (originating from Thomas' "fsync-gate" > commit, which introduced data_sync_retry): > > + * It's safe to just ERROR on fsync() here because we'll retry the whole > + * operation including the writes. > > Also, it seems to work fine if one calls pg_fsync() again, rather than > calling fsync_fname(), which re-opens the file. I don't think that'd achieve the same thing necessarily. But it's notoriously hard to know what which OS requires in this area. > It also seems to work fine if one omits the initial call to > fsync_fname("pg_logical/snapshots", true); I don't think it's a good idea randomly weaken individual fsyncs just because it somehow, without any theory as to how, fixes tests on cygwin. > Since SnapBuildSerialize() isn't atomic (the system could crash at any > point), I'm not seeing why these wouldn't be adequately safe. I'm not sure what you mean by that. It's atomic from a crash safety view: It first writes into a tempfile, fsyncs that + directory, then renames the file into place, fsyncs new filename + directory again. Tempfiles are removed after a crash. In case of a crash you can either end up with an "old" or a "new" file. Greetings, Andres Freund
Note that cirrus failed like this: https://api.cirrus-ci.com/v1/artifact/task/4881596411543552/testrun/build/testrun/subscription/010_truncate/log/010_truncate_publisher.log 2023-01-25 23:17:10.417 GMT [29821][walsender] [sub1][3/0:0] ERROR: could not open file "pg_logical/snapshots/0-14F2060.snap":Is a directory 2023-01-25 23:17:10.417 GMT [29821][walsender] [sub1][3/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version'4', origin 'any', publication_names '"pub1"') 2023-01-25 23:17:10.418 GMT [29850][walsender] [pg_16413_sync_16394_7192732880582452157][6/0:0] PANIC: could not open file"pg_logical/snapshots/0-14F2060.snap": No such file or directory 2023-01-25 23:17:10.418 GMT [29850][walsender] [pg_16413_sync_16394_7192732880582452157][6/0:0] STATEMENT: START_REPLICATIONSLOT "pg_16413_sync_16394_7192732880582452157" LOGICAL 0/14F2060 (proto_version '4', origin 'any', publication_names'"pub3"') I don't understand how "Is a directory" happened .. It looks like maybe the call stack would've been: SnapBuildSerializationPoint() xlog_decode() or standby_decode() ? LogicalDecodingProcessRecord() XLogSendLogical() WalSndLoop() StartLogicalReplication() -- Justin
On Fri, Jan 13, 2023 at 5:17 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > My patch used fsync_fname_ext() which would cause an ERROR rather than a > PANIC when failing to fsync the logical decoding pathname. FTR While analysing a lot of CI logs trying to debug something else I came across a plain Windows/MSVC (not Cygwin) build that panicked like this: https://cirrus-ci.com/task/6689224833892352 https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/testrun/build/testrun/subscription/013_partition/log/013_partition_publisher.log https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/crashlog/crashlog-postgres.exe_0af4_2023-02-05_21-53-20-018.txt
On Wed, Feb 8, 2023 at 8:06 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Jan 13, 2023 at 5:17 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > My patch used fsync_fname_ext() which would cause an ERROR rather than a > > PANIC when failing to fsync the logical decoding pathname. > > FTR While analysing a lot of CI logs trying to debug something else I > came across a plain Windows/MSVC (not Cygwin) build that panicked like > this: > > https://cirrus-ci.com/task/6689224833892352 > https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/testrun/build/testrun/subscription/013_partition/log/013_partition_publisher.log > https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/crashlog/crashlog-postgres.exe_0af4_2023-02-05_21-53-20-018.txt Here are some more flapping CI failures due to this phenomenon (nothing to do with Cygwin, this is just regular Windows): 4509011781877760 | Windows - Server 2019, VS 2019 - Meson & ninja 4525770962370560 | Windows - Server 2019, VS 2019 - Meson & ninja 5664518341132288 | Windows - Server 2019, VS 2019 - Meson & ninja 5689846694412288 | Windows - Server 2019, VS 2019 - Meson & ninja 5853025126842368 | Windows - Server 2019, VS 2019 - Meson & ninja 6639943179567104 | Windows - Server 2019, VS 2019 - Meson & ninja 6727728217456640 | Windows - Server 2019, VS 2019 - Meson & ninja 6740158104469504 | Windows - Server 2019, VS 2019 - Meson & ninja They all say something like 'PANIC: could not open file "pg_logical/snapshots/0-1597938.snap": No such file or directory', because they all do rename(some_temporary_file, that_name), then try to re-open and sync it, but rename() on Windows fails to be atomic so a concurrent process can see an intermediate ENOENT state. I see a few 'local' workarounds we could do to fix that, but ... there seems to be a much better idea staring us in the face in the comments! I think this would be fixed as a happy by-product of this TODO in SnapBuildSerialize(): * TODO: Do the fsync() via checkpoints/restartpoints, doing it here has * some noticeable overhead since it's performed synchronously during * decoding? I have done no analysis myself of whether that is sound, but assuming it is, I think the way to achieve that is to tweak FileTag so that it can describe the file to be fsync'd, and use the sync.c machinery to fsync the file in the background. Presumably that would provide a huge speed up for logical decoding, and people would rejoice. Some other topics that came up in this thread: * Now that PostgreSQL seems to be stable enough on Cygwin to get through the basic regression tests reliably, lorikeet might as well run the full TAP test suite? * Justin complained about the weird effects of wal_sync_method, and I finally got around to showing how I think that should be untangled, in https://commitfest.postgresql.org/44/4453/