Thread: Cygwin cleanup

Cygwin cleanup

From
Thomas Munro
Date:
Hi,

Continuing a discussion started over at [1].  Moving this to a new
thread so that other thread can focus on Unix cleanup, and both
threads can get CI coverage...

1.  In a few places, it is alleged that both __CYGWIN__ and WIN32
might be defined at the same time.  Do you think we should try to get
rid of that possibility?  I understand that we have to have a few
tests for __CYGWIN__ here and there, because eg file permissions don't
work quite right and there's not much we can do about that.  But it
seems a bit unhelpful if we also have to worry about a more-or-less
POSIX-ish build taking WIN32 paths at uncertain times if we forget to
defend against that, or wonder why some places are not consistent.

A quick recap of the three flavours of Windows platform we have to
handle, as I understand it:

 * MSVC: Windowsy toolchain, Windowsy C
   * custom perl scripts instead of configure
   * msbuild instead of make
   * MSVC compiler
   * Windows C APIs
   * we provide our own emulation of some POSIX C APIs on top
 * MSYS: Unixy toolchain, Windowsy C
   * configure (portname = "win32")
   * make
   * GCC compiler
   * Windows C APIs
   * we provide our own emulation of some POSIX C APIs on top
 * Cygwin: Unixy toolchain, Unixy C
   * configure (portname = "cygwin")
   * make
   * GCC compiler
   * POSIX C APIs (emulations provided by the Cygwin runtime libraries)

(The configure/make part will be harmonised by the Meson project.)

The macro WIN32 is visibly defined by something in/near msbuild in
MSVC builds: /D WIN32 is right here in the build transcripts (whereas
the compiler defines _WIN32; good compiler).  I am not sure how
exactly it is first defined in MSYS builds; I suspect that MSYS gcc
might define it itself, but I don't have access to MSYS to check.  As
for Cygwin, the only translation unit where I could find both
__CYGWIN__ and WIN32 defined is dirmod.c, and that results from
including <windows.h> and ultimately <minwindef.h> (even though WIN32
isn't defined yet at that time).  I couldn't understand why we do
that, but I probably didn't read enough commit history.  The purpose
of dirmod.c on Cygwin today is only to wrap otherwise pure POSIX code
in retry loops to handle those spurious EACCES errors due to NT
sharing violations, so there is no need for that.

Proposal: let's make it a programming rule that we don't allow
definitions from Windows headers to leak into Cygwin translation
units, preferably by never including them, or if we really must, let's
grant specific exemptions in an isolated and well documented way.  We
don't seem to need any such exemptions currently.  Places where we
currently worry about the contradictory macros could become
conditional #error directives instead.

2.  To make it possible to test any of that, you either need a working
Windows+Cygwin setup, or working CI.  I'm a salty old Unix hacker so I
opted for the latter, and I also hope this will eventually be useful
to others.  Unfortunately I haven't figured out how to get everything
working yet, so some of the check-world tests are failing.  Clues most
welcome!

The version I'm posting here is set to run always, so that cfbot will
show it alongside others.  But I would imagine that if we got a
committable-quality version of this, it'd probably be opt-in, so you'd
have to say "ci-os-only: cygwin", or "ci-os-only: cygwin, windows" etc
in a commit to your private github account to ask for it (or maybe
we'll come up with a way to tell cfbot we want the full works of CI
checks; the same decision will come up for MSYS, OpenBSD and NetBSD CI
support that my colleague is working on).  There are other things to
fix too, including abysmal performance; see commit message.

3.  You can't really run PostgreSQL on Cygwin for real, because its
implementation of signals does not have reliable signal masking, so
unsubtle and probably also subtle breakage occurs.  That was reported
upstream by Noah years ago, but they aren't working on a fix.
lorikeet shows random failures, and presumably any CI system will do
the same...  I even wondered about putting our own magic entry/exit
macros into signal handlers, that would use atomics to implement a
second level of signal masking (?!) but that's an uncommonly large
bandaid for a defective platform...  and trying to fix Cygwin itself
is a rabbithole too far for me.

4.  When building with Cygwin GCC 11.3 you get a bunch of warnings
that don't show up on other platforms, seemingly indicating that it
interprets -Wimplicit-fallthrough=3 differently.  Huh?

[1] https://www.postgresql.org/message-id/CA%2BhUKGKZ_FjkBnjGADk%2Bpa2g4oKDcG8%3DSE5V23sPTP0EELfyzQ%40mail.gmail.com

Attachment

Re: Cygwin cleanup

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> 3.  You can't really run PostgreSQL on Cygwin for real, because its
> implementation of signals does not have reliable signal masking, so
> unsubtle and probably also subtle breakage occurs.  That was reported
> upstream by Noah years ago, but they aren't working on a fix.
> lorikeet shows random failures, and presumably any CI system will do
> the same...

If that's an accurate statement, shouldn't we just drop Cygwin support?
Now that we have a native Windows build, it's hard to see how any live
user would prefer to use the Cygwin build.

            regards, tom lane



Re: Cygwin cleanup

From
Thomas Munro
Date:
On Tue, Jul 26, 2022 at 4:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > 3.  You can't really run PostgreSQL on Cygwin for real, because its
> > implementation of signals does not have reliable signal masking, so
> > unsubtle and probably also subtle breakage occurs.  That was reported
> > upstream by Noah years ago, but they aren't working on a fix.
> > lorikeet shows random failures, and presumably any CI system will do
> > the same...
>
> If that's an accurate statement, shouldn't we just drop Cygwin support?

This thread rejected the idea last time around:

https://www.postgresql.org/message-id/flat/136712b0-0619-5619-4634-0f0286acaef7%402ndQuadrant.com

lorikeet still shows the issue.  Failures often involve assertions
about PMSignalState or mq->mq_sender.  Hmm, it's running Cygwin 3.2.0
(March 2021) and the latest release is 3.3.5, so it's remotely
possible that it's been fixed recently.  Maybe that'd be somewhere in
here, but it's not jumping out:

https://github.com/cygwin/cygwin/commits/master/winsup/cygwin/signal.cc

(Oooh, another implementation of signalfd...)



Re: Cygwin cleanup

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, Jul 26, 2022 at 4:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If that's an accurate statement, shouldn't we just drop Cygwin support?

> This thread rejected the idea last time around:
> https://www.postgresql.org/message-id/flat/136712b0-0619-5619-4634-0f0286acaef7%402ndQuadrant.com

I think maybe we should re-open the discussion.  I've certainly
reached the stage of fed-up-ness.  That platform seems seriously
broken, upstream is making no progress on fixing it, and there
doesn't seem to be any real-world use-case.  The only positive
argument for it is that Readline doesn't work in the other
Windows builds --- but we've apparently not rechecked that
statement in eighteen years, so maybe things are better now.

If we could just continue to blithely ignore lorikeet's failures,
I wouldn't mind so much; but doing any significant amount of new
code development work for the platform seems like throwing away
developer time.

            regards, tom lane



Re: Cygwin cleanup

From
Robert Haas
Date:
On Tue, Jul 26, 2022 at 7:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think maybe we should re-open the discussion.  I've certainly
> reached the stage of fed-up-ness.  That platform seems seriously
> broken, upstream is making no progress on fixing it, and there
> doesn't seem to be any real-world use-case.  The only positive
> argument for it is that Readline doesn't work in the other
> Windows builds --- but we've apparently not rechecked that
> statement in eighteen years, so maybe things are better now.
>
> If we could just continue to blithely ignore lorikeet's failures,
> I wouldn't mind so much; but doing any significant amount of new
> code development work for the platform seems like throwing away
> developer time.

I agree with that. All things being equal, I like the idea of
supporting a bunch of different platforms, and Cygwin doesn't really
look that dead. It has recent releases. But if blocking signals
doesn't actually work on that platform, making PostgreSQL work
reliably there seems really difficult.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Cygwin cleanup

From
Justin Pryzby
Date:
On Tue, Jul 26, 2022 at 04:24:25PM +1200, Thomas Munro wrote:
> 3.  You can't really run PostgreSQL on Cygwin for real, because its
> implementation of signals does not have reliable signal masking, so
> unsubtle and probably also subtle breakage occurs.  That was reported
> upstream by Noah years ago, but they aren't working on a fix.
> lorikeet shows random failures, and presumably any CI system will do
> the same...

Reference: https://www.postgresql.org/message-id/20170321034703.GB2097809%40tornado.leadboat.com

On my 2nd try:

https://cirrus-ci.com/task/5311911574110208
TRAP: FailedAssertion("mq->mq_sender == NULL", File: "shm_mq.c", Line: 230, PID: 16370)
2022-07-26 06:32:35.525 PDT [15538][postmaster] LOG:  background worker "parallel worker" (PID 16370) was terminated by
signal6: Aborted
 

> XXX Doesn't get all the way through yet...

Mainly because getopt was causing all tap tests to fail.
I tried to fix that in configure, but ended up changing the callers.

This is getting close, but I don't think has actually managed to pass all tests
yet..  https://cirrus-ci.com/task/5274721116749824

> 4.  When building with Cygwin GCC 11.3 you get a bunch of warnings
> that don't show up on other platforms, seemingly indicating that it
> interprets -Wimplicit-fallthrough=3 differently.  Huh?

Evidently due to the same getopt issues.

> XXX This should use a canned Docker image with all the right packages
> installed

Has anyone tried using non-canned images ?  It sounds like this could reduce
the 4min startup time for windows.

https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment

> XXX configure is soooo slooow, can we cache it?!  Compiling is also
> insanely slow, but ccache gets it down to a couple of minutes if you're
> lucky

One reason compiling was slow is because you ended up with -O2.

You can cache configure as long as you're willing to re-run it whenever options
were changed.  That also applies to the existing headerscheck.

> XXX I don't know how to put variables like BUILD_JOBS into the scripts

WDYM ?  If it's outside of bash and in windows shell it's like %var%, right ?
https://cirrus-ci.org/guide/writing-tasks/#environment-variables

I just noticed that cirrus is misbehaving: if there's a variable called CI
(which there is), then it expands $CI_FOO like ${CI}_FOO rather than ${CI_FOO}.
I've also seen weirdness when variable names or operators appear in the commit
message...

> XXX Needs some --with-X options

Done

> XXX We would never want this to run by default in CI, but it'd be nice
> to be able to ask for it with ci-os-only!  (See commented out line)
>  only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*'

Doesn't this already do what's needed?  
As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only',
the task will runs only on request.

> XXX I have no idea if crash dump works, and if this should share
> elements with the msys work in commitfest #3575

Based on the crash above, it wasn't working.  And after some changes ... it
still doesn't work.

windows_os is probably skipping too many things.

-- 
Justin

Attachment

Re: Cygwin cleanup

From
Thomas Munro
Date:
On Wed, Jul 27, 2022 at 6:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Tue, Jul 26, 2022 at 04:24:25PM +1200, Thomas Munro wrote:
> > 3.  You can't really run PostgreSQL on Cygwin for real, because its
> > implementation of signals does not have reliable signal masking, so
> > unsubtle and probably also subtle breakage occurs.  That was reported
> > upstream by Noah years ago, but they aren't working on a fix.
> > lorikeet shows random failures, and presumably any CI system will do
> > the same...
>
> Reference: https://www.postgresql.org/message-id/20170321034703.GB2097809%40tornado.leadboat.com
>
> On my 2nd try:
>
> https://cirrus-ci.com/task/5311911574110208
> TRAP: FailedAssertion("mq->mq_sender == NULL", File: "shm_mq.c", Line: 230, PID: 16370)
> 2022-07-26 06:32:35.525 PDT [15538][postmaster] LOG:  background worker "parallel worker" (PID 16370) was terminated
bysignal 6: Aborted
 

Thanks for working on this!

Huh, that Cygwin being shipped by Choco is quite old, older than
lorikeet's, but not old enough to not have the bug:

[04:33:55.234] Starting cygwin install, version 2.918

Based on clues in Noah's emails in the archives, I think versions from
maybe somewhere around 2015 didn't have the bug, and then the bug
appeared, and AFAIK it's still here.  I wonder if you can tell Choco
to install an ancient version, but even if that's possible you'd be
dealing with other stupid problems and bugs.

> > XXX Doesn't get all the way through yet...
>
> Mainly because getopt was causing all tap tests to fail.
> I tried to fix that in configure, but ended up changing the callers.
>
> This is getting close, but I don't think has actually managed to pass all tests
> yet..  https://cirrus-ci.com/task/5274721116749824

Woo.

> > 4.  When building with Cygwin GCC 11.3 you get a bunch of warnings
> > that don't show up on other platforms, seemingly indicating that it
> > interprets -Wimplicit-fallthrough=3 differently.  Huh?
>
> Evidently due to the same getopt issues.

Ahh, nice detective work.

> > XXX This should use a canned Docker image with all the right packages
> > installed
>
> Has anyone tried using non-canned images ?  It sounds like this could reduce
> the 4min startup time for windows.
>
> https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment

Yeah, I had that working once.  Not sure what the pros and cons would be for us.

> > XXX configure is soooo slooow, can we cache it?!  Compiling is also
> > insanely slow, but ccache gets it down to a couple of minutes if you're
> > lucky
>
> One reason compiling was slow is because you ended up with -O2.

Ah, right.

> You can cache configure as long as you're willing to re-run it whenever options
> were changed.  That also applies to the existing headerscheck.
>
> > XXX I don't know how to put variables like BUILD_JOBS into the scripts
>
> WDYM ?  If it's outside of bash and in windows shell it's like %var%, right ?
> https://cirrus-ci.org/guide/writing-tasks/#environment-variables

Right.  I should have taken the clue from the %cd% (I got a few ideas
about how to do this from libarchive's CI scripting[1]).

> I just noticed that cirrus is misbehaving: if there's a variable called CI
> (which there is), then it expands $CI_FOO like ${CI}_FOO rather than ${CI_FOO}.
> I've also seen weirdness when variable names or operators appear in the commit
> message...
>
> > XXX Needs some --with-X options
>
> Done

Neat.

> > XXX We would never want this to run by default in CI, but it'd be nice
> > to be able to ask for it with ci-os-only!  (See commented out line)
> >  only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*'
>
> Doesn't this already do what's needed?
> As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only',
> the task will runs only on request.

Yeah I was just trying to say that I was sharing the script in a way
that always runs, but for commit we'd want that.  This is all far too
slow for cfbot to have to deal with on every build.  Looks like we can
expect to be able to build and test fast on Windows soonish, though,
so maybe one day we'd just turn Cygwin and MSYS on?

[1] https://github.com/libarchive/libarchive/blob/master/build/ci/cirrus_ci/ci.cmd



Re: Cygwin cleanup

From
Justin Pryzby
Date:
On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote:
> Thanks for working on this!
> 
> Huh, that Cygwin being shipped by Choco is quite old, older than
> lorikeet's, but not old enough to not have the bug:
> 
> [04:33:55.234] Starting cygwin install, version 2.918

Hm, I think that's the version of "cygwinsetup" but not cygwin..
It also says this: [13:16:36.014] Cygwin v3.3.4.20220408 [Approved]

> I wonder if you can tell Choco
> to install an ancient version, but even if that's possible you'd be
> dealing with other stupid problems and bugs.

Yes: choco install -y --no-progress --version 4.6.1 ccache

> > > XXX This should use a canned Docker image with all the right packages
> > > installed
> >
> > Has anyone tried using non-canned images ?  It sounds like this could reduce
> > the 4min startup time for windows.
> >
> > https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment
> 
> Yeah, I had that working once.  Not sure what the pros and cons would be for us.

I think it could be a lot faster to start, since cirrus caches the generated
docker image locally.  Rather than (I gather) pulling the image every time.

> > > XXX We would never want this to run by default in CI, but it'd be nice
> > > to be able to ask for it with ci-os-only!  (See commented out line)
> > >  only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*'
> >
> > Doesn't this already do what's needed?
> > As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only',
> > the task will runs only on request.
> 
> Yeah I was just trying to say that I was sharing the script in a way
> that always runs, but for commit we'd want that.  This is all far too
> slow for cfbot to have to deal with on every build.

It occurred to me today that if cfbot preserved the original patch series, and
commit messages, that would allow patch authors to write things like
"ci-os-only: docs" for a doc only patch.  I've never gotten cirrus'
changesOnly() stuff to work...

> Looks like we can expect to be able to build and test fast on Windows
> soonish, though,

Do you mean with meson ?

> so maybe one day we'd just turn Cygwin and MSYS on?

I didn't understand this ?

-- 
Justin



Re: Cygwin cleanup

From
Thomas Munro
Date:
On Fri, Jul 29, 2022 at 10:23 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote:
> > [04:33:55.234] Starting cygwin install, version 2.918
>
> Hm, I think that's the version of "cygwinsetup" but not cygwin..
> It also says this: [13:16:36.014] Cygwin v3.3.4.20220408 [Approved]

Oops.  Ok so we're testing the very latest then, and it definitely
still has the bug as we thought.

> It occurred to me today that if cfbot preserved the original patch series, and
> commit messages, that would allow patch authors to write things like
> "ci-os-only: docs" for a doc only patch.  I've never gotten cirrus'
> changesOnly() stuff to work...

Maybe it's time to switch to "git am -3 ..." and reject patches that
don't apply that way.

> > Looks like we can expect to be able to build and test fast on Windows
> > soonish, though,
>
> Do you mean with meson ?

Yeah.  Also there are some other things we can do to speed up testing
on Windows (and elsewhere), like not running every test query with new
psql + backend process pair, which takes at least a few hundred ms and
sometimes up to several seconds on this platform; I have some patches
I need to finish...

> > so maybe one day we'd just turn Cygwin and MSYS on?
>
> I didn't understand this ?

I mean, if, some sunny day, we can compile and test on Windows at
non-glacial speeds, then it would become possible to contemplate
having cfbot run these tasks for every patch every time.



Re: Cygwin cleanup

From
Thomas Munro
Date:
On Wed, Jul 27, 2022 at 5:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jul 26, 2022 at 7:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I think maybe we should re-open the discussion.  I've certainly
> > reached the stage of fed-up-ness.  That platform seems seriously
> > broken, upstream is making no progress on fixing it, and there
> > doesn't seem to be any real-world use-case.  The only positive
> > argument for it is that Readline doesn't work in the other
> > Windows builds --- but we've apparently not rechecked that
> > statement in eighteen years, so maybe things are better now.
> >
> > If we could just continue to blithely ignore lorikeet's failures,
> > I wouldn't mind so much; but doing any significant amount of new
> > code development work for the platform seems like throwing away
> > developer time.
>
> I agree with that. All things being equal, I like the idea of
> supporting a bunch of different platforms, and Cygwin doesn't really
> look that dead. It has recent releases. But if blocking signals
> doesn't actually work on that platform, making PostgreSQL work
> reliably there seems really difficult.

It's one thing to drop old dead Unixes but I don't think anyone would
enjoy dropping support for an active open source project.  The best
outcome would be for people who have an interest in seeing PostgreSQL
work correctly on Cygwin to help get the bug fixed.  Here are the
threads I'm aware of:

https://cygwin.com/pipermail/cygwin/2017-August/234001.html
https://cygwin.com/pipermail/cygwin/2017-August/234097.html

I wonder if these problems would go away as a nice incidental
side-effect if we used latches for postmaster wakeups.  I don't
know... maybe, if the problem is just with the postmaster's pattern of
blocking/unblocking?  Maybe backend startup is simple enough that it
doesn't hit the bug?  From a quick glance, I think the assertion
failures that occur in regular backends can possibly be blamed on the
postmaster getting confused about its children due to unexpected
handler re-entry.



Re: Cygwin cleanup

From
Justin Pryzby
Date:
On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote:
> > > XXX We would never want this to run by default in CI, but it'd be nice
> > > to be able to ask for it with ci-os-only!  (See commented out line)
> > >  only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*'
> >
> > Doesn't this already do what's needed?
> > As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only',
> > the task will runs only on request.
> 
> Yeah I was just trying to say that I was sharing the script in a way
> that always runs, but for commit we'd want that.

That makes more sense after noticing that you created a cf entry (for which
cfbot has been skipping my patch due to my "only_if" line).  There's still a
few persistent issues:

This fails ~50% of the time in recovery 010-truncate
I hacked around this by setting data_sync_retry.
https://cirrus-ci.com/task/5289444063313920
I found these, not sure if they're relevant.
https://www.postgresql.org/message-id/flat/CAA4eK1Kft05mwNuZbTVRmz8SNS3r%2BuriuCT8DxL5KJy5btoS-A%40mail.gmail.com
https://www.postgresql.org/message-id/flat/CAFiTN-uGxgo5258hZy2QJoz%3Ds7_Cs7v9%3Db8Z2GgFV7qmQUOwxw%40mail.gmail.com

And an fsync abort in 013 which seems similar to this other one.
data_sync_retry also avoids this issue.
https://cirrus-ci.com/task/6283023745286144?logs=cores#L34
https://www.postgresql.org/message-id/flat/CAMVYW_4QhjZ-19Xpr2x1B19soRCNu1BXHM8g1mOnAVtd5VViDw%40mail.gmail.com

And sometimes various assertions failing in regress parallel_select (and then times out)
https://api.cirrus-ci.com/v1/artifact/task/5537540282253312/log/src/test/regress/log/postmaster.log
https://api.cirrus-ci.com/v1/artifact/task/6108746773430272/log/src/test/regress/log/postmaster.log
Or "could not map dynamic shared memory segment" (actually in 027-stream-regress):
https://cirrus-ci.com/task/6168860746317824

And segfault in vacuum parallel
https://api.cirrus-ci.com/v1/artifact/task/5404589569605632/log/src/test/regress/log/postmaster.log

Sometimes semctl() failed: Resource temporarily unavailable

https://api.cirrus-ci.com/v1/artifact/task/5027860623654912/log/src/test/subscription/tmp_check/log/014_binary_publisher.log

https://api.cirrus-ci.com/v1/artifact/task/5027860623654912/log/src/bin/pg_rewind/tmp_check/log/001_basic_standby_local.log

Some more
https://cirrus-ci.com/task/6468927780814848

If you're lucky, there's only 1 or 2 problems, of which those are different
symptoms..  Maybe for now this needs to disable tap tests :(

This shows that it *can* pass, if slowly, and infrequently:
https://cirrus-ci.com/task/6546858536337408

This fixes my changes to configure for getopt.
And simplifies the changes to *.pl (the .exe changes weren't necessary at all).
And removes the changes for implicit-fallthrough; I realized that configure was
  just deciding that it didn't work and not using it at all.
And adds support for backtraces.
And remove kerberos and and add libxml

Why did you write "|| exit /b 1" in all the bash invocations ?  I think cirrus
handles that automatically.

-- 
Justin

Attachment

Re: Cygwin cleanup

From
Andres Freund
Date:
Hi,

On 2022-07-28 17:23:19 -0500, Justin Pryzby wrote:
> On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote:
> > > > XXX This should use a canned Docker image with all the right packages
> > > > installed
> > >
> > > Has anyone tried using non-canned images ?  It sounds like this could reduce
> > > the 4min startup time for windows.
> > >
> > > https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment
> >
> > Yeah, I had that working once.  Not sure what the pros and cons would be for us.
>
> I think it could be a lot faster to start, since cirrus caches the generated
> docker image locally.  Rather than (I gather) pulling the image every time.

I'm quite certain that is not true. All the docker images built are just
uploaded to the google container registry and then downloaded onto a
*separate* windows host. The dockerfile: stuff generates a separate task
running on a separate machine...

It's a bit better for non-windows containers, because there google has some
optimization for pulling image (pieces) on demand or such.

Greetings,

Andres Freund



Re: Cygwin cleanup

From
Thomas Munro
Date:
On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> [train wreck]

Oh my, so I'm getting the impression we might actually be totally
unstable on Cygwin.  Which surprises me because ... wait a minute ...
lorikeet isn't even running most of the tests.  So... we don't really
know the degree to which any of this works at all?

> This shows that it *can* pass, if slowly, and infrequently:
> https://cirrus-ci.com/task/6546858536337408

Ok, that's slightly reassuring, so maybe we *can* fix this, but I'm
one step closer to what Tom said, re wasting developer time...

> [lots of improvements]

Cool.

> Why did you write "|| exit /b 1" in all the bash invocations ?  I think cirrus
> handles that automatically.

Cargo-culted from libarchive.



Re: Cygwin cleanup

From
Andres Freund
Date:
Hi,

On 2022-08-04 16:16:06 +1200, Thomas Munro wrote:
> Ok, that's slightly reassuring, so maybe we *can* fix this, but I'm
> one step closer to what Tom said, re wasting developer time...

It might be worth checking whether the cygwin installer, which at some point
at least allowed installing postgres, has download numbers available anywhere.

It's possible we could e.g. get away with just allowing libpq to be built.

Greetings,

Andres Freund



Re: Cygwin cleanup

From
Thomas Munro
Date:
On Thu, Aug 4, 2022 at 4:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > [train wreck]
>
> Oh my, so I'm getting the impression we might actually be totally
> unstable on Cygwin.  Which surprises me because ... wait a minute ...
> lorikeet isn't even running most of the tests.  So... we don't really
> know the degree to which any of this works at all?

Hmm, it's possible that all these failures are just new-to-me effects
of the known bug.  Certainly the assertion failures are of the usual
type, and I think it might be possible for the weird parallel query
failure to be explained by the postmaster forking extra phantom child
processes.

It may be madness to try to work around this, but I wonder if we could
use a static local variable that we update with atomic compare
exhange, inside PG_SIGNAL_HANDLER_ENTRY(), and
PG_SIGNAL_HANDLER_EXIT() macros that do nothing on every other system.
On entry, if you can do 0->1 it means you are allowed to run the
function.  If it's non-zero, set n->n+1 and return immediately: signal
blocked, but queued for later.  On exit, you CAS n->0.  If n was > 1,
then you have to jump back to the top and run the function body again.



Re: Cygwin cleanup

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> It may be madness to try to work around this, but I wonder if we could
> use a static local variable that we update with atomic compare
> exhange, inside PG_SIGNAL_HANDLER_ENTRY(), and
> PG_SIGNAL_HANDLER_EXIT() macros that do nothing on every other system.
> On entry, if you can do 0->1 it means you are allowed to run the
> function.  If it's non-zero, set n->n+1 and return immediately: signal
> blocked, but queued for later.  On exit, you CAS n->0.  If n was > 1,
> then you have to jump back to the top and run the function body again.

And ... we're expending all this effort for what exactly?

            regards, tom lane



Re: Cygwin cleanup

From
Thomas Munro
Date:
On Thu, Aug 4, 2022 at 5:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > It may be madness to try to work around this, but I wonder if we could
> > use a static local variable that we update with atomic compare
> > exhange, inside PG_SIGNAL_HANDLER_ENTRY(), and
> > PG_SIGNAL_HANDLER_EXIT() macros that do nothing on every other system.
> > On entry, if you can do 0->1 it means you are allowed to run the
> > function.  If it's non-zero, set n->n+1 and return immediately: signal
> > blocked, but queued for later.  On exit, you CAS n->0.  If n was > 1,
> > then you have to jump back to the top and run the function body again.
>
> And ... we're expending all this effort for what exactly?

I'd be almost as happy if we ripped it all out, shut down lorikeet and
added it to the list of fallen platforms.  I'd feel a bit like a
vandal, though.  My suggestion is a last-ditch idea for Noah (CCd)
and/or Andrew to consider, who (respectively) blocked this last time
and run lorikeet.  No plans to write that patch myself...



Re: Cygwin cleanup

From
Justin Pryzby
Date:
On Thu, Aug 04, 2022 at 04:16:06PM +1200, Thomas Munro wrote:
> On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > [train wreck]
> 
> Oh my, so I'm getting the impression we might actually be totally
> unstable on Cygwin.  Which surprises me because ... wait a minute ...
> lorikeet isn't even running most of the tests.  So... we don't really
> know the degree to which any of this works at all?

Right.

Maybe it's of limited interest, but ..

This updates the patch to build and test with meson.
Which first requires patching some meson.builds.
I guess that's needed for some current BF members, too.
Unfortunately, ccache+PCH causes gcc to crash :(

-- 
Justin

Attachment

Re: Cygwin cleanup

From
Justin Pryzby
Date:
On Thu, Oct 20, 2022 at 10:40:40PM -0500, Justin Pryzby wrote:
> On Thu, Aug 04, 2022 at 04:16:06PM +1200, Thomas Munro wrote:
> > On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > [train wreck]
> > 
> > Oh my, so I'm getting the impression we might actually be totally
> > unstable on Cygwin.  Which surprises me because ... wait a minute ...
> > lorikeet isn't even running most of the tests.  So... we don't really
> > know the degree to which any of this works at all?
> 
> Right.
> 
> Maybe it's of limited interest, but ..
> 
> This updates the patch to build and test with meson.
> Which first requires patching some meson.builds.
> I guess that's needed for some current BF members, too.
> Unfortunately, ccache+PCH causes gcc to crash :(

Resending with the 'only-if' line commented (doh).
And some fixes to 001 as Andres pointed out by on other thread.

-- 
Justin

Attachment

Re: Cygwin cleanup

From
Andres Freund
Date:
Hi,

On 2022-11-08 19:04:37 -0600, Justin Pryzby wrote:
> From 2741472080eceac5cb6d002c39eaf319d7f72b50 Mon Sep 17 00:00:00 2001
> From: Justin Pryzby <pryzbyj@telsasoft.com>
> Date: Fri, 30 Sep 2022 13:39:43 -0500
> Subject: [PATCH 1/3] meson: other fixes for cygwin
> 
> XXX: what about HAVE_BUGGY_STRTOF ?

What about it? As noted in another thread, HAVE_BUGGY_STRTOF is defined in a
header, and shouldn't be affected by the buildsystem.


Pushed this commit.


> XXX This should use a canned Docker image with all the right packages
> installed?  But if the larger image is slower to start, then maybe not...

I think once we convert the windows containers to windows VMs we can just
install both cygwin and mingw in the same image. The overhead of installing
too much seems far far smaller there.


> +    CONFIGURE_FLAGS: --enable-cassert --enable-debug --with-ldap --with-ssl=openssl --with-libxml
> +    # --enable-tap-tests

I assume this is disabled as tap tests fail?


> +    C:\tools\cygwin\bin\bash.exe --login -c "cygserver-config -y"

I'd copy the approach used for mingw of putting most of this in an environment
variable.


> +findargs=''
>  case $os in
>      freebsd|linux|macos)
> -    ;;
> +        ;;
> +
> +    cygwin)
> +        # XXX Evidently I don't know how to write two arguments here without pathname expansion later, other than
eval.
> +        #findargs='-name "*.stackdump"'
> +        for stack in $(find "$directory" -type f -name "*.stackdump") ; do
> +            binary=`basename "$stack" .stackdump`
> +            echo;echo;
> +            echo "dumping ${stack} for ${binary}"
> +            awk '/^0/{print $2}' $stack |addr2line -f -i -e ./src/backend/postgres.exe
> +            #awk '/^0/{print $2}' $stack |addr2line -f -i -e "./src/backend/$binary.exe"
> +        done
> +        exit 0
> +        ;;

Is this stuff actually needed? Could we use the infrastructure we use for
backtraces with msvc instead? Or use something that understands .stackdump
files?


> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
> [...]
> +++ b/src/test/perl/PostgreSQL/Test/Utils.pm
> [...]
> +++ b/src/test/recovery/t/020_archive_status.pl
> [...]

I think these should be in a separate commit, they're not actually about CI.


Greetings,

Andres Freund



Re: Cygwin cleanup

From
Thomas Munro
Date:
On Fri, Jul 29, 2022 at 10:57 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I wonder if these problems would go away as a nice incidental
> side-effect if we used latches for postmaster wakeups.  I don't
> know... maybe, if the problem is just with the postmaster's pattern of
> blocking/unblocking?  Maybe backend startup is simple enough that it
> doesn't hit the bug?  From a quick glance, I think the assertion
> failures that occur in regular backends can possibly be blamed on the
> postmaster getting confused about its children due to unexpected
> handler re-entry.

Just to connect the dots, that's what this patch does:

https://www.postgresql.org/message-id/flat/CA+hUKG+Z-HpOj1JsO9eWUP+ar7npSVinsC_npxSy+jdOMsx=Gg@mail.gmail.com

(There may be other places that break under Cygwin's flaky sa_mask
implementation, I don't know and haven't seen any clues about that.)



Re: Cygwin cleanup

From
vignesh C
Date:
On Wed, 9 Nov 2022 at 06:34, Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Thu, Oct 20, 2022 at 10:40:40PM -0500, Justin Pryzby wrote:
> > On Thu, Aug 04, 2022 at 04:16:06PM +1200, Thomas Munro wrote:
> > > On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > > [train wreck]
> > >
> > > Oh my, so I'm getting the impression we might actually be totally
> > > unstable on Cygwin.  Which surprises me because ... wait a minute ...
> > > lorikeet isn't even running most of the tests.  So... we don't really
> > > know the degree to which any of this works at all?
> >
> > Right.
> >
> > Maybe it's of limited interest, but ..
> >
> > This updates the patch to build and test with meson.
> > Which first requires patching some meson.builds.
> > I guess that's needed for some current BF members, too.
> > Unfortunately, ccache+PCH causes gcc to crash :(
>
> Resending with the 'only-if' line commented (doh).
> And some fixes to 001 as Andres pointed out by on other thread.

Is there still some work pending for this thread as Andres had
committed some part, if so, can you post an updated patch for the
same.

Regards,
Vignesh



Re: Cygwin cleanup

From
Justin Pryzby
Date:
On Tue, Jan 03, 2023 at 05:54:56PM +0530, vignesh C wrote:
> > On Thu, Oct 20, 2022 at 10:40:40PM -0500, Justin Pryzby wrote:
> > > On Thu, Aug 04, 2022 at 04:16:06PM +1200, Thomas Munro wrote:
> > > > On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > > > [train wreck]
> > > >
> > > > Oh my, so I'm getting the impression we might actually be totally
> > > > unstable on Cygwin.  Which surprises me because ... wait a minute ...
> > > > lorikeet isn't even running most of the tests.  So... we don't really
> > > > know the degree to which any of this works at all?
> > >
> > > Right.
> > >
> > > Maybe it's of limited interest, but ..
> > >
> > > This updates the patch to build and test with meson.
> > > Which first requires patching some meson.builds.
> > > I guess that's needed for some current BF members, too.
> > > Unfortunately, ccache+PCH causes gcc to crash :(
> >
> > Resending with the 'only-if' line commented (doh).
> > And some fixes to 001 as Andres pointed out by on other thread.
> 
> Is there still some work pending for this thread as Andres had
> committed some part, if so, can you post an updated patch for the
> same.

Thomas, what's your opinion ?



Re: Cygwin cleanup

From
Thomas Munro
Date:
On Wed, Jan 4, 2023 at 3:25 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Tue, Jan 03, 2023 at 05:54:56PM +0530, vignesh C wrote:
> > Is there still some work pending for this thread as Andres had
> > committed some part, if so, can you post an updated patch for the
> > same.
>
> Thomas, what's your opinion ?

One observation is that your CI patch *nearly* succeeds, even if
hacked to turn on the full TAP tests, if applied on top of the
WaitEventSet-for-postmaster patch:

https://cirrus-ci.com/task/4533371804581888

No cigar though, it still failed a few times for me in the
subscription tests with EAGAIN, when accessing semaphores:

semctl(24576010, 14, SETVAL, 0) failed: Resource temporarily unavailable

That isn't an error I expect from semctl(), but from some cursory
research it seems like that system call is actually talking to the
cygserver process over a pipe (?) to implement SysV semaphores.  Maybe
it couldn't keep up, but doesn't like to block?  Perhaps we could try
to tune that server, but let's try the POSIX kind of semaphores
instead.  From a quick peek at the source, they are implemented some
other way on direct native NT voodoo, no cygserver involved.

https://cirrus-ci.com/task/5142810819559424 [still running at time of writing]

Gotta run, but I'll check again in the morning to see if that does better...



Re: Cygwin cleanup

From
Thomas Munro
Date:
On Fri, Jan 6, 2023 at 1:22 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> https://cirrus-ci.com/task/5142810819559424 [still running at time of writing]
>
> Gotta run, but I'll check again in the morning to see if that does better...

Yes!  Two successful runs with all TAP tests so far.  So it looks like
we can probably stop lorikeet's spurious failures, by happy
coincidence due to other work, and we could seriously consider
committing this optional CI test for it, much like we have the
optional MSYS build.  Any interest in producing a tidied up version of
the patch, Justin?  Or I can, but I'll go and work on other things
first.

I pushed a change to switch the semaphore implementation.  I haven't
personally seen that failure mode on lorikeet, but I would guess
that's because (1) it's only running a tiny subset of the tests, (2)
it crashes for the other reason with higher likelihood, and/or (3)
it's not using much concurrency yet because the build farm doesn't use
meson yet.



Re: Cygwin cleanup

From
Thomas Munro
Date:
On Wed, Nov 9, 2022 at 2:04 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> +data_sync_retry = on

Sharing with the list some clues that Justin and I figured out about
what that part is doing.  Without it, you get failures like:

  PANIC:  could not open file "pg_logical/snapshots/0-14FE6B0.snap":
No such file or directory

That's been seen before:

  https://www.postgresql.org/message-id/flat/17827.1549866683%40sss.pgh.pa.us

That thread concluded that the operating system must have a non-atomic
rename(), ie a kernel bug.  I don't know why Cygwin would display that
behaviour and our native Windows build not; maybe timing, or maybe our
own open() and rename() wrappers for Windows do something important
differently than Cygwin's open() and rename().

On reflection, that seems a bit too flimsy to have in-tree without
more investigation, which I won't have time for myself, so I'm going
to withdraw this entry.



Re: Cygwin cleanup

From
Andrew Dunstan
Date:
On 2023-01-05 Th 16:39, Thomas Munro wrote:
> On Fri, Jan 6, 2023 at 1:22 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> https://cirrus-ci.com/task/5142810819559424 [still running at time of writing]
>>
>> Gotta run, but I'll check again in the morning to see if that does better...
> Yes!  Two successful runs with all TAP tests so far.  So it looks like
> we can probably stop lorikeet's spurious failures, by happy
> coincidence due to other work, and we could seriously consider
> committing this optional CI test for it, much like we have the
> optional MSYS build.  Any interest in producing a tidied up version of
> the patch, Justin?  Or I can, but I'll go and work on other things
> first.
>
> I pushed a change to switch the semaphore implementation.  I haven't
> personally seen that failure mode on lorikeet, but I would guess
> that's because (1) it's only running a tiny subset of the tests, (2)
> it crashes for the other reason with higher likelihood, and/or (3)
> it's not using much concurrency yet because the build farm doesn't use
> meson yet.


OK, should I now try re-enabling TAP tests on lorikeet?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Cygwin cleanup

From
Thomas Munro
Date:
On Sat, Jan 7, 2023 at 3:40 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> OK, should I now try re-enabling TAP tests on lorikeet?

Not before https://commitfest.postgresql.org/41/4032/ is committed.
After that, it might be worth a try?  I have no idea if the PANIC
problem I mentioned last night would apply to lorikeet's kernel too.
To summarise the kinds of failure we have analysed in this thread:

1.  SysV semaphores are buggy; fixed, I hope, by recent commit (= just
don't use them).
2.  The regular crashes we already knew about from other threads due
to signal masking being buggy seem to be fixed, coincidentally, by CF
#4032, not yet committed (= don't rely on sa_mask for correctness).
3.  PANIC apparently caused by non-atomic rename(), based on analysis
of similar failures seen on other old buggy OSes back in 2018.

If lorikeet has problem #3 (which it may not; we know from CF #3951
that kernel versions differ in related respects and Server 2019 as
used on CI seems to have the most conservative/old Windows behaviour)
then it might fail in the TAP tests just like the proposed
CI-for-Cygwin patch, unless you also do data_sync_retry=on, which
seems like a pretty ugly workaround to me.  I don't know what else
might be broken by non-atomic rename(), and I'd rather not find out
:-D  I finished up here by trying to tidy up some weird looking
nonsense in our code while working on general portability cleanup,
since I needed a way to check if __CYGWIN__ stuff still works, but
what we found out is that it's more broken than anyone realised, and
now I have to pull the emergency rabbit hole ejection cord because I
have less than zero time for or interest in debugging Cygwin.



Re: Cygwin cleanup

From
Justin Pryzby
Date:
On Sat, Jan 07, 2023 at 12:39:11AM +1300, Thomas Munro wrote:
> On Wed, Nov 9, 2022 at 2:04 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > +data_sync_retry = on
> 
> Sharing with the list some clues that Justin and I figured out about
> what that part is doing.  Without it, you get failures like:
> 
>   PANIC:  could not open file "pg_logical/snapshots/0-14FE6B0.snap":
> No such file or directory
> 
> That's been seen before:
> 
>   https://www.postgresql.org/message-id/flat/17827.1549866683%40sss.pgh.pa.us
> 
> That thread concluded that the operating system must have a non-atomic
> rename(), ie a kernel bug.  I don't know why Cygwin would display that
> behaviour and our native Windows build not; maybe timing, or maybe our
> own open() and rename() wrappers for Windows do something important
> differently than Cygwin's open() and rename().
> 
> On reflection, that seems a bit too flimsy to have in-tree without
> more investigation, which I won't have time for myself, so I'm going
> to withdraw this entry.

Not so fast :)

Here's my latest copy of the patch.  Most recently, rather than setting
data_sync_retry=no, I changed to call fsync_fname_ext() rather than
fsync_fname(), which uses PANIC (except when data_sync_retry is
disabled).  That seems to work, showing that the problem is limited to
SnapBuildSerialize(), and not a problem with all fsync()...

https://cirrus-ci.com/task/5990885733695488

Thomas raised a good question, which was how the tests were passing when
SnapBuildSerialize() was raising an error, which is what it would've
been doing when I used data_sync_retry=no.

So .. why is wal_sync_method being used to control fsync for things
other than WAL?

See 6dc7760ac (c. 2005) which added wal_fsync_writethrough, at which
point (since 9b178555f, c. 2004) wal_sync_method was already being used
for SLOG.

Now, it's also being used for logical decoding (since b89e1510 and
858ec1185, c. 2014) in rewriteheap.c/snapbuild.c.  And pidfiles (since
ee0e525bf, 2010).  And the control file (8b938d36f7, 2019).  Note that
data_sync_retry wasn't added until 9ccdd7f66 (c.  2018)

It looks like logical decoding may be the "most wrong" place that
wal_sync_method is being used, so maybe my change is reasonable to
consider, and not just a workaround.

I'm going to re-open the CF entry to let this run for a while to see how
it works out.

-- 
Justin

Attachment

Re: Cygwin cleanup

From
Justin Pryzby
Date:
On Wed, Jan 11, 2023 at 10:39:49PM -0600, Justin Pryzby wrote:
> Here's my latest copy of the patch.
> +  # due to resource constraints we don't run this task by default for now
> +  trigger_type: manual

Now, with trigger_type commented, so Thomas doesn't have to click
"trigger" for me.

Attachment

Re: Cygwin cleanup

From
Andres Freund
Date:
Hi,

On 2023-01-11 22:39:49 -0600, Justin Pryzby wrote:
> Thomas raised a good question, which was how the tests were passing when
> SnapBuildSerialize() was raising an error, which is what it would've
> been doing when I used data_sync_retry=no.

Presumably some test not checking for failures in a part of the test.


> So .. why is wal_sync_method being used to control fsync for things
> other than WAL?

Historical raisins, I think. The problem is that macOS lies about fsync, and
one needs special magic to make it behave like a real fsync. Somebody thought
that instead of inventing a separate GUC to control whether the "real" fsync
is used for other subsystems, it'd be better to reuse wal_sync_method.

Note that this isn't the function that is actually used for WAL (that's
issue_xlog_fsync()), and that pg_fsync() only uses the GUC to know whether to
use pg_fsync_writethrough() or pg_fsync_no_writethrough(fd).



> See 6dc7760ac (c. 2005) which added wal_fsync_writethrough, at which
> point (since 9b178555f, c. 2004) wal_sync_method was already being used
> for SLOG.
>
> Now, it's also being used for logical decoding (since b89e1510 and
> 858ec1185, c. 2014) in rewriteheap.c/snapbuild.c.  And pidfiles (since
> ee0e525bf, 2010).  And the control file (8b938d36f7, 2019).  Note that
> data_sync_retry wasn't added until 9ccdd7f66 (c.  2018)
>
> It looks like logical decoding may be the "most wrong" place that
> wal_sync_method is being used, so maybe my change is reasonable to
> consider, and not just a workaround.

I don't follow. What does using fsync_fname() vs fsync_fname_ext() have to do
with pg_fsync() using wal_sync_method? fsync_fname() is just a wrapper around
fsync_fname_ext(). Both end up in pg_fsync().

Are you actually proposing that we don't PANIC after an fsync for the category
of files that you list here, even with data_sync_retry set?


Greetings,

Andres Freund



Re: Cygwin cleanup

From
Justin Pryzby
Date:
On Thu, Jan 12, 2023 at 06:43:54PM -0800, Andres Freund wrote:
> > It looks like logical decoding may be the "most wrong" place that
> > wal_sync_method is being used, so maybe my change is reasonable to
> > consider, and not just a workaround.
> 
> I don't follow. What does using fsync_fname() vs fsync_fname_ext() have to do
> with pg_fsync() using wal_sync_method? fsync_fname() is just a wrapper around
> fsync_fname_ext(). Both end up in pg_fsync().

My patch used fsync_fname_ext() which would cause an ERROR rather than a
PANIC when failing to fsync the logical decoding pathname.

> Are you actually proposing that we don't PANIC after an fsync for the category
> of files that you list here, even with data_sync_retry set?

Yes, but I'm referring only to my change to SnapBuildSerialize().

The rest of the verbage was me trying to figure out the
history/evolution of pg_fsync usage.

-- 
Justin



Re: Cygwin cleanup

From
Andres Freund
Date:
Hi,

On 2023-01-12 22:17:55 -0600, Justin Pryzby wrote:
> On Thu, Jan 12, 2023 at 06:43:54PM -0800, Andres Freund wrote:
> > Are you actually proposing that we don't PANIC after an fsync for the category
> > of files that you list here, even with data_sync_retry set?
> 
> Yes, but I'm referring only to my change to SnapBuildSerialize().

I can't see how that change could be correct?

Greetings,

Andres Freund



Re: Cygwin cleanup

From
Justin Pryzby
Date:
On Thu, Jan 12, 2023 at 10:17:55PM -0600, Justin Pryzby wrote:
> On Thu, Jan 12, 2023 at 06:43:54PM -0800, Andres Freund wrote:
> > > It looks like logical decoding may be the "most wrong" place that
> > > wal_sync_method is being used, so maybe my change is reasonable to
> > > consider, and not just a workaround.
> > 
> > I don't follow. What does using fsync_fname() vs fsync_fname_ext() have to do
> > with pg_fsync() using wal_sync_method? fsync_fname() is just a wrapper around
> > fsync_fname_ext(). Both end up in pg_fsync().
> 
> My patch used fsync_fname_ext() which would cause an ERROR rather than a
> PANIC when failing to fsync the logical decoding pathname.
> 
> > Are you actually proposing that we don't PANIC after an fsync for the category
> > of files that you list here, even with data_sync_retry set?
> 
> Yes, but I'm referring only to my change to SnapBuildSerialize().
> 
> The rest of the verbage was me trying to figure out the
> history/evolution of pg_fsync usage.

Also note the existing comment (originating from Thomas' "fsync-gate"
commit, which introduced data_sync_retry):

+        * It's safe to just ERROR on fsync() here because we'll retry the whole
+        * operation including the writes.

Also, it seems to work fine if one calls pg_fsync() again, rather than
calling fsync_fname(), which re-opens the file.

It also seems to work fine if one omits the initial call to
fsync_fname("pg_logical/snapshots", true);

Since SnapBuildSerialize() isn't atomic (the system could crash at any
point), I'm not seeing why these wouldn't be adequately safe.  But also
hoping Thomas will comment on that.

-- 
Justin



Re: Cygwin cleanup

From
Andres Freund
Date:
Hi,

On 2023-01-23 17:28:14 -0600, Justin Pryzby wrote:
> On Thu, Jan 12, 2023 at 10:17:55PM -0600, Justin Pryzby wrote:
> > On Thu, Jan 12, 2023 at 06:43:54PM -0800, Andres Freund wrote:
> > > > It looks like logical decoding may be the "most wrong" place that
> > > > wal_sync_method is being used, so maybe my change is reasonable to
> > > > consider, and not just a workaround.
> > > 
> > > I don't follow. What does using fsync_fname() vs fsync_fname_ext() have to do
> > > with pg_fsync() using wal_sync_method? fsync_fname() is just a wrapper around
> > > fsync_fname_ext(). Both end up in pg_fsync().
> > 
> > My patch used fsync_fname_ext() which would cause an ERROR rather than a
> > PANIC when failing to fsync the logical decoding pathname.
> > 
> > > Are you actually proposing that we don't PANIC after an fsync for the category
> > > of files that you list here, even with data_sync_retry set?
> > 
> > Yes, but I'm referring only to my change to SnapBuildSerialize().
> > 
> > The rest of the verbage was me trying to figure out the
> > history/evolution of pg_fsync usage.
> 
> Also note the existing comment (originating from Thomas' "fsync-gate"
> commit, which introduced data_sync_retry):
> 
> +        * It's safe to just ERROR on fsync() here because we'll retry the whole
> +        * operation including the writes.
> 
> Also, it seems to work fine if one calls pg_fsync() again, rather than
> calling fsync_fname(), which re-opens the file.

I don't think that'd achieve the same thing necessarily. But it's notoriously
hard to know what which OS requires in this area.


> It also seems to work fine if one omits the initial call to
> fsync_fname("pg_logical/snapshots", true);

I don't think it's a good idea randomly weaken individual fsyncs just because
it somehow, without any theory as to how, fixes tests on cygwin.


> Since SnapBuildSerialize() isn't atomic (the system could crash at any
> point), I'm not seeing why these wouldn't be adequately safe.

I'm not sure what you mean by that. It's atomic from a crash safety view: It
first writes into a tempfile, fsyncs that + directory, then renames the file
into place, fsyncs new filename + directory again. Tempfiles are removed after
a crash.  In case of a crash you can either end up with an "old" or a "new"
file.

Greetings,

Andres Freund



Re: Cygwin cleanup

From
Justin Pryzby
Date:
Note that cirrus failed like this:


https://api.cirrus-ci.com/v1/artifact/task/4881596411543552/testrun/build/testrun/subscription/010_truncate/log/010_truncate_publisher.log

2023-01-25 23:17:10.417 GMT [29821][walsender] [sub1][3/0:0] ERROR:  could not open file
"pg_logical/snapshots/0-14F2060.snap":Is a directory
 
2023-01-25 23:17:10.417 GMT [29821][walsender] [sub1][3/0:0] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/0
(proto_version'4', origin 'any', publication_names '"pub1"')
 
2023-01-25 23:17:10.418 GMT [29850][walsender] [pg_16413_sync_16394_7192732880582452157][6/0:0] PANIC:  could not open
file"pg_logical/snapshots/0-14F2060.snap": No such file or directory
 
2023-01-25 23:17:10.418 GMT [29850][walsender] [pg_16413_sync_16394_7192732880582452157][6/0:0] STATEMENT:
START_REPLICATIONSLOT "pg_16413_sync_16394_7192732880582452157" LOGICAL 0/14F2060 (proto_version '4', origin 'any',
publication_names'"pub3"')
 

I don't understand how "Is a directory" happened ..

It looks like maybe the call stack would've been:

SnapBuildSerializationPoint()
xlog_decode() or standby_decode() ?
LogicalDecodingProcessRecord()
XLogSendLogical()
WalSndLoop()
StartLogicalReplication()

-- 
Justin



Re: Cygwin cleanup

From
Thomas Munro
Date:
On Fri, Jan 13, 2023 at 5:17 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> My patch used fsync_fname_ext() which would cause an ERROR rather than a
> PANIC when failing to fsync the logical decoding pathname.

FTR While analysing a lot of CI logs trying to debug something else I
came across a plain Windows/MSVC (not Cygwin) build that panicked like
this:

https://cirrus-ci.com/task/6689224833892352

https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/testrun/build/testrun/subscription/013_partition/log/013_partition_publisher.log

https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/crashlog/crashlog-postgres.exe_0af4_2023-02-05_21-53-20-018.txt



Re: Cygwin cleanup

From
Thomas Munro
Date:
On Wed, Feb 8, 2023 at 8:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jan 13, 2023 at 5:17 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > My patch used fsync_fname_ext() which would cause an ERROR rather than a
> > PANIC when failing to fsync the logical decoding pathname.
>
> FTR While analysing a lot of CI logs trying to debug something else I
> came across a plain Windows/MSVC (not Cygwin) build that panicked like
> this:
>
> https://cirrus-ci.com/task/6689224833892352
>
https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/testrun/build/testrun/subscription/013_partition/log/013_partition_publisher.log
>
https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/crashlog/crashlog-postgres.exe_0af4_2023-02-05_21-53-20-018.txt

Here are some more flapping CI failures due to this phenomenon
(nothing to do with Cygwin, this is just regular Windows):

 4509011781877760 | Windows - Server 2019, VS 2019 - Meson & ninja
 4525770962370560 | Windows - Server 2019, VS 2019 - Meson & ninja
 5664518341132288 | Windows - Server 2019, VS 2019 - Meson & ninja
 5689846694412288 | Windows - Server 2019, VS 2019 - Meson & ninja
 5853025126842368 | Windows - Server 2019, VS 2019 - Meson & ninja
 6639943179567104 | Windows - Server 2019, VS 2019 - Meson & ninja
 6727728217456640 | Windows - Server 2019, VS 2019 - Meson & ninja
 6740158104469504 | Windows - Server 2019, VS 2019 - Meson & ninja

They all say something like 'PANIC:  could not open file
"pg_logical/snapshots/0-1597938.snap": No such file or directory',
because they all do rename(some_temporary_file, that_name), then try
to re-open and sync it, but rename() on Windows fails to be atomic so
a concurrent process can see an intermediate ENOENT state.  I see a
few 'local' workarounds we could do to fix that, but ... there seems
to be a much better idea staring us in the face in the comments!

I think this would be fixed as a happy by-product of this TODO in
SnapBuildSerialize():

     * TODO: Do the fsync() via checkpoints/restartpoints, doing it here has
     * some noticeable overhead since it's performed synchronously during
     * decoding?

I have done no analysis myself of whether that is sound, but assuming
it is, I think the way to achieve that is to tweak FileTag so that it
can describe the file to be fsync'd, and use the sync.c machinery to
fsync the file in the background.  Presumably that would provide a
huge speed up for logical decoding, and people would rejoice.

Some other topics that came up in this thread:
 * Now that PostgreSQL seems to be stable enough on Cygwin to get
through the basic regression tests reliably, lorikeet might as well
run the full TAP test suite?
 * Justin complained about the weird effects of wal_sync_method, and I
finally got around to showing how I think that should be untangled, in
https://commitfest.postgresql.org/44/4453/