RE: Add two missing tests in 035_standby_logical_decoding.pl - Mailing list pgsql-hackers

From Yu Shi (Fujitsu)
Subject RE: Add two missing tests in 035_standby_logical_decoding.pl
Date
Msg-id OSZPR01MB6310F24417BDD04731BD7978FD659@OSZPR01MB6310.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Add two missing tests in 035_standby_logical_decoding.pl  ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>)
Responses Re: Add two missing tests in 035_standby_logical_decoding.pl
List pgsql-hackers
On Mon, Apr 24, 2023 8:07 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote:
> 
> On 4/24/23 11:45 AM, Amit Kapila wrote:
> > On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >>
> >> On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
> >> <bertranddrouvot.pg@gmail.com> wrote:
> >>>
> >>
> >> Few comments:
> >> ============
> >>
> >
> > +# We can not test if the WAL file still exists immediately.
> > +# We need to let some time to the standby to actually "remove" it.
> > +my $i = 0;
> > +while (1)
> > +{
> > + last if !-f $standby_walfile;
> > + if ($i++ == 10 * $default_timeout)
> > + {
> > + die
> > +   "could not determine if WAL file has been retained or not, can't continue";
> > + }
> > + usleep(100_000);
> > +}
> >
> > Is this adhoc wait required because we can't guarantee that the
> > checkpoint is complete on standby even after using wait_for_catchup?
> 
> Yes, the restart point on the standby is not necessary completed even after
> wait_for_catchup is done.
> 

I think that's because when replaying a checkpoint record, the startup process
of standby only saves the information of the checkpoint, and we need to wait for
the checkpointer to perform a restartpoint (see RecoveryRestartPoint), right? If
so, could we force a checkpoint on standby? After this, the standby should have
completed the restartpoint and we don't need to wait.

Besides, would it be better to wait for the cascading standby? If the wal log
file needed for cascading standby is removed on the standby, the subsequent test
will fail. Do we need to consider this scenario? I saw the following error
message after setting recovery_min_apply_delay to 5s on the cascading standby,
and the test failed due to a timeout while waiting for cascading standby.

Log of cascading standby node:
FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000000000003 has already been
removed

Regards,
Shi Yu

pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: pg_stat_io for the startup process
Next
From: Andreas 'ads' Scherbaum
Date:
Subject: Find dangling membership roles in pg_dumpall