Re: Introduce wait_for_subscription_sync for TAP tests - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Introduce wait_for_subscription_sync for TAP tests
Msg-id CA+hUKG+WKsZpdoryeqM8_rk5uQPCqS2HGY92WiMGFsK2wVkcig@mail.gmail.com
In response to Re: Introduce wait_for_subscription_sync for TAP tests  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Mon, Sep 12, 2022 at 10:42 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Sep 10, 2022 at 10:00 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Sat, Sep 10, 2022 at 9:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Masahiko Sawada <sawada.mshk@gmail.com> writes:
> > > > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef
> > > > is relevant.
> > >
> > > Noting that the errors have only appeared in the past couple of
> > > days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f
> > > (Fix recovery_prefetch with low maintenance_io_concurrency).
> >
> > Yeah, I also just spotted the coincidence of those failures while
> > monitoring the build farm.  I'll look into this later today.  My
> > initial suspicion is that there was pre-existing code here that was
> > (incorrectly?) relying on the lack of error reporting in that case.
> > But maybe I misunderstood and it was incorrect to report the error for
> > some reason that was not robustly covered with tests.
>
> After I wrote that I saw Sawada-san's message and waited for more
> information, and I see there was now a commit.  I noticed that
> peripatus was already logging the 'missing contrecord' error even when
> it didn't fail the test, and still does.  I'm still looking into that
> (ie whether I need to take that new report_invalid_record() call out
> and replace it with errormsg_deferred = true so that XLogReadRecord()
> returns NULL with no error message in this case).

I will go ahead and remove the new error message added by adb46615.
The message was correlated with the problem on peripatus fixed by
88f48831, but was not the cause of it.  It's also not terribly
helpful and might be confusing.  It might be reported: (1) in
pg_waldump when you hit the end of the segment with a missing
contrecord after the end, arguably rightfully, but then perhaps
someone might complain that they expect an error from pg_waldump only
on the final live segment at the end of the WAL, (2) in a
walsender that is asked to shut down while between reads of pages with
a spanning contrecord, reported by logical_read_xlog_page() with a
messageless error (presumably peripatus's case?), and (3) in crash
recovery with wal_recycle off (whereas normally you'd expect to see
the page_addr message from a recycled file), maybe more legitimately
than the above.  The problem I needed to solve can be solved without
the message just by setting errormsg_deferred = true as mentioned
above, so I'll do that to remove the new noise.
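
To illustrate the difference between the two approaches, here is a toy
model in C.  It is only a sketch and not the xlogreader code: ToyReader,
toy_fail_with_message(), toy_fail_silently() and toy_read_record() are
made-up names, and the message text is purely illustrative.  The only
idea taken from the discussion above is that a read can fail either with
a message or with just a deferred-error flag set, in which case the
caller gets NULL and no message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for a WAL reader state; NOT the real XLogReaderState. */
typedef struct ToyReader
{
    bool errormsg_deferred;   /* a failure is pending for the next read */
    char errormsg_buf[128];   /* empty string means "no message" */
} ToyReader;

/* Noisy variant: record a message and mark the failure as pending. */
void
toy_fail_with_message(ToyReader *r, unsigned long lsn)
{
    snprintf(r->errormsg_buf, sizeof(r->errormsg_buf),
             "missing contrecord at %lX", lsn);
    r->errormsg_deferred = true;
}

/* Quiet variant: only set the flag and leave the message empty. */
void
toy_fail_silently(ToyReader *r)
{
    r->errormsg_buf[0] = '\0';
    r->errormsg_deferred = true;
}

/*
 * Toy read: returns NULL when a failure is pending.  *errormsg points at
 * the message, or is NULL when the failure carries no message.
 */
const char *
toy_read_record(ToyReader *r, const char **errormsg)
{
    assert(r != NULL && errormsg != NULL);
    *errormsg = NULL;

    if (r->errormsg_deferred)
    {
        r->errormsg_deferred = false;
        if (r->errormsg_buf[0] != '\0')
            *errormsg = r->errormsg_buf;
        return NULL;
    }
    return "record";
}
```

In this model a caller that logs only when errormsg is non-NULL stays
quiet after toy_fail_silently() but still sees the read fail, which is
the behaviour I'm after.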


