Re: Add two missing tests in 035_standby_logical_decoding.pl - Mailing list pgsql-hackers

From Drouvot, Bertrand
Subject Re: Add two missing tests in 035_standby_logical_decoding.pl
Date
Msg-id 7e5f41be-ec0f-b6f2-2e2a-60fd5648ee84@gmail.com
Whole thread Raw
In response to Re: Add two missing tests in 035_standby_logical_decoding.pl  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Add two missing tests in 035_standby_logical_decoding.pl
List pgsql-hackers
Hi,

On 5/6/23 3:28 PM, Amit Kapila wrote:
> On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> There is 2 runs with this extra info in place:
>>
>> A successful one: https://cirrus-ci.com/task/6528745436086272
>> A failed one: https://cirrus-ci.com/task/4558139312308224
>>
> 
> Thanks, I think I got some clue as to why this test is failing
> randomly. 

Great, thanks!

> Observations:
> ============
> 1. In the failed run, the KeepLogSeg(), reduced the _logSegNo to 3
> which is the reason for the failure because now the standby won't be
> able to remove/recycle the WAL file corresponding to segment number 3
> which the test was expecting.

Agree.

> 2.  We didn't expect the KeepLogSeg() to reduce the _logSegNo because
> all logical slots were invalidated. However, I think we forgot that
> both standby and primary have physical slots which might also
> influence the XLogGetReplicationSlotMinimumLSN() calculation in
> KeepLogSeg().

Oh right...

> Next steps:
> =========
> 1. The first thing is we should verify this theory by adding some LOG
> in KeepLogSeg() to see if the _logSegNo is reduced due to the value
> returned by XLogGetReplicationSlotMinimumLSN().

Yeah, will do that early next week.

> 2. The reason for the required file not being removed in the primary
> is also that it has a physical slot which prevents the file removal.

Yeah, agree. But this one is not an issue as we are not
checking for the WAL file removal on the primary, do you agree?

> 3. If the above theory is correct then I see a few possibilities to
> fix this test (a) somehow ensure that restart_lsn of the physical slot
> on standby is advanced up to the point where we can safely remove the
> required files; (b) just create a separate test case by initializing a
> fresh node for primary and standby where we only have logical slots on
> standby. This will be a bit costly but probably less risky. (c) any
> better ideas?
> 

(c): Since, I think, the physical slot on the primary is not a concern for
the reason mentioned above, then instead of (b):

What about postponing the physical slot creation on the standby and the
cascading standby node initialization after this test?

That way, this test would be done without a physical slot on the standby and
we could also get rid of the "Wait for the cascading standby to catchup before
removing the WAL file(s)" part.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Shinya Kato
Date:
Subject: Remove duplicates of membership from results of \du
Next
From: Melanie Plageman
Date:
Subject: Re: pg_stat_io not tracking smgrwriteback() is confusing