On Sat, Jan 07, 2023 at 12:39:11AM +1300, Thomas Munro wrote:
> On Wed, Nov 9, 2022 at 2:04 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > +data_sync_retry = on
>
> Sharing with the list some clues that Justin and I figured out about
> what that part is doing. Without it, you get failures like:
>
> PANIC: could not open file "pg_logical/snapshots/0-14FE6B0.snap":
> No such file or directory
>
> That's been seen before:
>
> https://www.postgresql.org/message-id/flat/17827.1549866683%40sss.pgh.pa.us
>
> That thread concluded that the operating system must have a non-atomic
> rename(), ie a kernel bug. I don't know why Cygwin would display that
> behaviour and our native Windows build not; maybe timing, or maybe our
> own open() and rename() wrappers for Windows do something important
> differently than Cygwin's open() and rename().
>
> On reflection, that seems a bit too flimsy to have in-tree without
> more investigation, which I won't have time for myself, so I'm going
> to withdraw this entry.
Not so fast :)
Here's my latest copy of the patch. Most recently, rather than setting
data_sync_retry=no, I changed to call fsync_fname_ext() rather than
fsync_fname(), which uses PANIC (except when data_sync_retry is
disabled). That seems to work, showing that the problem is limited to
SnapBuildSerialize(), and not a problem with all fsync()...
https://cirrus-ci.com/task/5990885733695488
Thomas raised a good question, which was how the tests were passing when
SnapBuildSerialize() was raising an error, which is what it would've
been doing when I used data_sync_retry=no.
So .. why is wal_sync_method being used to control fsync for things
other than WAL?
See 6dc7760ac (c. 2005) which added wal_fsync_writethrough, at which
point (since 9b178555f, c. 2004) wal_sync_method was already being used
for SLOG.
Now, it's also being used for logical decoding (since b89e1510 and
858ec1185, c. 2014) in rewriteheap.c/snapbuild.c. And pidfiles (since
ee0e525bf, 2010). And the control file (8b938d36f7, 2019). Note that
data_sync_retry wasn't added until 9ccdd7f66 (c. 2018)
It looks like logical decoding may be the "most wrong" place that
wal_sync_method is being used, so maybe my change is reasonable to
consider, and not just a workaround.
I'm going to re-open the CF entry to let this run for a while to see how
it works out.
--
Justin