From: Amit Kapila
Subject: Re: Add two missing tests in 035_standby_logical_decoding.pl
Msg-id: CAA4eK1KD4pmEjqPkPjNdB5rTegJftzJjsZ3xuBndbPANQrueEw@mail.gmail.com
In response to: Re: Add two missing tests in 035_standby_logical_decoding.pl ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>)
List: pgsql-hackers
On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> There are two runs with this extra info in place:
>
> A successful one: https://cirrus-ci.com/task/6528745436086272
> A failed one: https://cirrus-ci.com/task/4558139312308224
>

Thanks, I think I have got some clue as to why this test is failing
randomly. Following is a comparison of the standby logs from the
successful and the failed run:

Success case
============
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] DEBUG:  00000: write 0/4000148 flush
0/4000000 apply 0/4000000 reply_time 2023-05-06 07:23:05.496365+00
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] LOCATION:  ProcessStandbyReplyMessage,
walsender.c:2101
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] DEBUG:  00000: write 0/4000148 flush
0/4000148 apply 0/4000000 reply_time 2023-05-06 07:23:05.4964+00
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] LOCATION:  ProcessStandbyReplyMessage,
walsender.c:2101
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] DEBUG:  00000: write 0/4000148 flush
0/4000148 apply 0/4000148 reply_time 2023-05-06 07:23:05.496531+00
2023-05-06 07:23:05.496 UTC [17617][walsender]
[cascading_standby][3/0:0] LOCATION:  ProcessStandbyReplyMessage,
walsender.c:2101
2023-05-06 07:23:05.500 UTC [17706][client backend]
[035_standby_logical_decoding.pl][2/12:0] LOG:  00000: statement:
checkpoint;
2023-05-06 07:23:05.500 UTC [17706][client backend]
[035_standby_logical_decoding.pl][2/12:0] LOCATION:
exec_simple_query, postgres.c:1074
2023-05-06 07:23:05.500 UTC [17550][checkpointer] LOG:  00000:
restartpoint starting: immediate wait
...
...
2023-05-06 07:23:05.500 UTC [17550][checkpointer] LOCATION:
CheckPointReplicationSlots, slot.c:1576
2023-05-06 07:23:05.501 UTC [17550][checkpointer] DEBUG:  00000:
updated min recovery point to 0/4000148 on timeline 1
2023-05-06 07:23:05.501 UTC [17550][checkpointer] LOCATION:
UpdateMinRecoveryPoint, xlog.c:2500
2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOG:  00000:
CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098,
_logSegNo is 4
2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOCATION:
CreateRestartPoint, xlog.c:7271
2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOG:  00000:
CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098, endptr
is 0/4000148, _logSegNo is 4

Failed case
===========
2023-05-06 07:53:19.657 UTC [17914][walsender]
[cascading_standby][3/0:0] DEBUG:  00000: write 0/3D1A000 flush
0/3CFA000 apply 0/4000000 reply_time 2023-05-06 07:53:19.65207+00
2023-05-06 07:53:19.657 UTC [17914][walsender]
[cascading_standby][3/0:0] LOCATION:  ProcessStandbyReplyMessage,
walsender.c:2101
2023-05-06 07:53:19.657 UTC [17914][walsender]
[cascading_standby][3/0:0] DEBUG:  00000: write 0/3D1A000 flush
0/3D1A000 apply 0/4000000 reply_time 2023-05-06 07:53:19.656471+00
2023-05-06 07:53:19.657 UTC [17914][walsender]
[cascading_standby][3/0:0] LOCATION:  ProcessStandbyReplyMessage,
walsender.c:2101
...
...
2023-05-06 07:53:19.686 UTC [17881][checkpointer] DEBUG:  00000:
updated min recovery point to 0/4000148 on timeline 1
2023-05-06 07:53:19.686 UTC [17881][checkpointer] LOCATION:
UpdateMinRecoveryPoint, xlog.c:2500
2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOG:  00000:
CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098,
_logSegNo is 4
2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOCATION:
CreateRestartPoint, xlog.c:7271
2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOG:  00000:
CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098, endptr
is 0/4000148, _logSegNo is 3

Observations:
============
1. In the failed run, KeepLogSeg() reduced _logSegNo to 3, which is
the reason for the failure, because now the standby won't be able to
remove/recycle the WAL file corresponding to segment number 3, which
is what the test was expecting.
2. We didn't expect KeepLogSeg() to reduce _logSegNo because all the
logical slots were invalidated. However, I think we forgot that both
the standby and the primary have physical slots, which can also
influence the XLogGetReplicationSlotMinimumLSN() calculation in
KeepLogSeg().
3. Now, the reason for its success in some of the runs is that the
restart_lsn of the physical slots had also moved ahead by the time the
checkpoint happened. You can see the difference in the LSNs reported
by ProcessStandbyReplyMessage in the failed and successful cases; the
resulting segment arithmetic is sketched right after this list.
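
To make the segment arithmetic behind points 1 and 3 concrete, here is
a minimal standalone C sketch (not PostgreSQL source). It assumes the
default 16MB wal_segment_size and that the slot minimum returned by
XLogGetReplicationSlotMinimumLSN() roughly tracks the flush positions
in the ProcessStandbyReplyMessage lines above; it only mimics the
XLByteToSeg() computation and the final clamp in KeepLogSeg(), using
the LSNs from the logs:

#include <stdio.h>
#include <stdint.h>

/* Assumed default wal_segment_size of 16MB. */
#define WAL_SEGMENT_SIZE	(16u * 1024 * 1024)

/* Rough equivalent of the XLByteToSeg() macro for a fixed segment size. */
static uint64_t
lsn_to_segno(uint64_t lsn)
{
	return lsn / WAL_SEGMENT_SIZE;
}

/* Rough equivalent of KeepLogSeg()'s final clamp of the caller's segno. */
static uint64_t
keep_log_seg(uint64_t slot_min_lsn, uint64_t logSegNo)
{
	uint64_t	segno = lsn_to_segno(slot_min_lsn);

	return (segno < logSegNo) ? segno : logSegNo;
}

int
main(void)
{
	/* RedoRecPtr is 0/4000098 in both runs, i.e. segment 4. */
	uint64_t	logSegNo = lsn_to_segno(0x4000098);

	/* Failed run: physical slot still around 0/3D1A000 at checkpoint time. */
	printf("failed run:  _logSegNo %llu -> %llu\n",
		   (unsigned long long) logSegNo,
		   (unsigned long long) keep_log_seg(0x3D1A000, logSegNo));

	/* Successful run: the slot had already advanced to 0/4000148. */
	printf("success run: _logSegNo %llu -> %llu\n",
		   (unsigned long long) logSegNo,
		   (unsigned long long) keep_log_seg(0x4000148, logSegNo));

	return 0;
}

This prints "4 -> 3" for the failed run and "4 -> 4" for the successful
one, matching the CreateRestartPoint log lines above.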

Next steps:
=========
1. The first thing is that we should verify this theory by adding some
LOG messages in KeepLogSeg() to see whether _logSegNo is reduced due
to the value returned by XLogGetReplicationSlotMinimumLSN() (a
possible form of such a message is sketched after this list).
2. The reason the required file is not removed on the primary is
likewise that it has a physical slot, which prevents the file removal.
3. If the above theory is correct, then I see a few possibilities to
fix this test: (a) somehow ensure that the restart_lsn of the physical
slot on the standby is advanced up to the point where we can safely
remove the required files; (b) just create a separate test case that
initializes fresh primary and standby nodes where we only have logical
slots on the standby, which will be a bit costly but probably less
risky; (c) any better ideas?
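
For step 1, something along the lines of the following could be added
near the end of KeepLogSeg() in xlog.c. This is only a hypothetical,
untested sketch; the variable names (keep, segno, *logSegNo) follow
the ones currently used in that function and may need adjusting for
the branch it is applied to:

	/*
	 * Hypothetical debugging aid (untested): report the slot minimum and
	 * the segment numbers involved, so we can see whether the slot
	 * minimum is what pulls _logSegNo back.
	 */
	elog(LOG, "KeepLogSeg: slot minimum %X/%X gives segno " UINT64_FORMAT
		 ", caller's _logSegNo is " UINT64_FORMAT,
		 LSN_FORMAT_ARGS(keep), segno, *logSegNo);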

--
With Regards,
Amit Kapila.


