Re: Cygwin cleanup - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: Cygwin cleanup
Date
Msg-id 20230112043949.GQ9837@telsasoft.com
Whole thread Raw
In response to Re: Cygwin cleanup  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Cygwin cleanup
Re: Cygwin cleanup
List pgsql-hackers
On Sat, Jan 07, 2023 at 12:39:11AM +1300, Thomas Munro wrote:
> On Wed, Nov 9, 2022 at 2:04 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > +data_sync_retry = on
> 
> Sharing with the list some clues that Justin and I figured out about
> what that part is doing.  Without it, you get failures like:
> 
>   PANIC:  could not open file "pg_logical/snapshots/0-14FE6B0.snap":
> No such file or directory
> 
> That's been seen before:
> 
>   https://www.postgresql.org/message-id/flat/17827.1549866683%40sss.pgh.pa.us
> 
> That thread concluded that the operating system must have a non-atomic
> rename(), ie a kernel bug.  I don't know why Cygwin would display that
> behaviour and our native Windows build not; maybe timing, or maybe our
> own open() and rename() wrappers for Windows do something important
> differently than Cygwin's open() and rename().
> 
> On reflection, that seems a bit too flimsy to have in-tree without
> more investigation, which I won't have time for myself, so I'm going
> to withdraw this entry.

Not so fast :)

Here's my latest copy of the patch.  Most recently, rather than setting
data_sync_retry=no, I changed to call fsync_fname_ext() rather than
fsync_fname(), which uses PANIC (except when data_sync_retry is
disabled).  That seems to work, showing that the problem is limited to
SnapBuildSerialize(), and not a problem with all fsync()...

https://cirrus-ci.com/task/5990885733695488

Thomas raised a good question, which was how the tests were passing when
SnapBuildSerialize() was raising an error, which is what it would've
been doing when I used data_sync_retry=no.

So .. why is wal_sync_method being used to control fsync for things
other than WAL?

See 6dc7760ac (c. 2005) which added wal_fsync_writethrough, at which
point (since 9b178555f, c. 2004) wal_sync_method was already being used
for SLOG.

Now, it's also being used for logical decoding (since b89e1510 and
858ec1185, c. 2014) in rewriteheap.c/snapbuild.c.  And pidfiles (since
ee0e525bf, 2010).  And the control file (8b938d36f7, 2019).  Note that
data_sync_retry wasn't added until 9ccdd7f66 (c.  2018)

It looks like logical decoding may be the "most wrong" place that
wal_sync_method is being used, so maybe my change is reasonable to
consider, and not just a workaround.

I'm going to re-open the CF entry to let this run for a while to see how
it works out.

-- 
Justin

Attachment

pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Perform streaming logical transactions by background workers and parallel apply
Next
From: Michael Paquier
Date:
Subject: Re: Spinlock is missing when updating two_phase of ReplicationSlot