Re: Add two missing tests in 035_standby_logical_decoding.pl - Mailing list pgsql-hackers

From: Amit Kapila
Subject: Re: Add two missing tests in 035_standby_logical_decoding.pl
Date:
Msg-id: CAA4eK1KD4pmEjqPkPjNdB5rTegJftzJjsZ3xuBndbPANQrueEw@mail.gmail.com
In response to: Re: Add two missing tests in 035_standby_logical_decoding.pl ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>)
Responses: Re: Add two missing tests in 035_standby_logical_decoding.pl
List: pgsql-hackers
On Sat, May 6, 2023 at 1:52 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> There is 2 runs with this extra info in place:
>
> A successful one: https://cirrus-ci.com/task/6528745436086272
> A failed one: https://cirrus-ci.com/task/4558139312308224
>

Thanks, I think I got some clue as to why this test is failing
randomly. Following is the comparison of successful and failed run
logs for standby:

Success case
============
2023-05-06 07:23:05.496 UTC [17617][walsender] [cascading_standby][3/0:0] DEBUG: 00000: write 0/4000148 flush 0/4000000 apply 0/4000000 reply_time 2023-05-06 07:23:05.496365+00
2023-05-06 07:23:05.496 UTC [17617][walsender] [cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage, walsender.c:2101
2023-05-06 07:23:05.496 UTC [17617][walsender] [cascading_standby][3/0:0] DEBUG: 00000: write 0/4000148 flush 0/4000148 apply 0/4000000 reply_time 2023-05-06 07:23:05.4964+00
2023-05-06 07:23:05.496 UTC [17617][walsender] [cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage, walsender.c:2101
2023-05-06 07:23:05.496 UTC [17617][walsender] [cascading_standby][3/0:0] DEBUG: 00000: write 0/4000148 flush 0/4000148 apply 0/4000148 reply_time 2023-05-06 07:23:05.496531+00
2023-05-06 07:23:05.496 UTC [17617][walsender] [cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage, walsender.c:2101
2023-05-06 07:23:05.500 UTC [17706][client backend] [035_standby_logical_decoding.pl][2/12:0] LOG: 00000: statement: checkpoint;
2023-05-06 07:23:05.500 UTC [17706][client backend] [035_standby_logical_decoding.pl][2/12:0] LOCATION: exec_simple_query, postgres.c:1074
2023-05-06 07:23:05.500 UTC [17550][checkpointer] LOG: 00000: restartpoint starting: immediate wait
...
...
2023-05-06 07:23:05.500 UTC [17550][checkpointer] LOCATION: CheckPointReplicationSlots, slot.c:1576
2023-05-06 07:23:05.501 UTC [17550][checkpointer] DEBUG: 00000: updated min recovery point to 0/4000148 on timeline 1
2023-05-06 07:23:05.501 UTC [17550][checkpointer] LOCATION: UpdateMinRecoveryPoint, xlog.c:2500
2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOG: 00000: CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098, _logSegNo is 4
2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7271
2023-05-06 07:23:05.515 UTC [17550][checkpointer] LOG: 00000: CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098, endptr is 0/4000148, _logSegNo is 4

Failed case:
==========
2023-05-06 07:53:19.657 UTC [17914][walsender] [cascading_standby][3/0:0] DEBUG: 00000: write 0/3D1A000 flush 0/3CFA000 apply 0/4000000 reply_time 2023-05-06 07:53:19.65207+00
2023-05-06 07:53:19.657 UTC [17914][walsender] [cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage, walsender.c:2101
2023-05-06 07:53:19.657 UTC [17914][walsender] [cascading_standby][3/0:0] DEBUG: 00000: write 0/3D1A000 flush 0/3D1A000 apply 0/4000000 reply_time 2023-05-06 07:53:19.656471+00
2023-05-06 07:53:19.657 UTC [17914][walsender] [cascading_standby][3/0:0] LOCATION: ProcessStandbyReplyMessage, walsender.c:2101
...
...
2023-05-06 07:53:19.686 UTC [17881][checkpointer] DEBUG: 00000: updated min recovery point to 0/4000148 on timeline 1
2023-05-06 07:53:19.686 UTC [17881][checkpointer] LOCATION: UpdateMinRecoveryPoint, xlog.c:2500
2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOG: 00000: CreateRestartPoint: After XLByteToSeg RedoRecPtr is 0/4000098, _logSegNo is 4
2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOCATION: CreateRestartPoint, xlog.c:7271
2023-05-06 07:53:19.707 UTC [17881][checkpointer] LOG: 00000: CreateRestartPoint: After KeepLogSeg RedoRecPtr is 0/4000098, endptr is 0/4000148, _logSegNo is 3

Observations:
============
1. In the failed run, KeepLogSeg() reduced _logSegNo to 3, which is the
reason for the failure: the standby can then no longer remove/recycle
the WAL file corresponding to segment number 3, which is what the test
was expecting.
2. We didn't expect KeepLogSeg() to reduce _logSegNo because all the
logical slots were invalidated. However, I think we forgot that both
the standby and the primary also have physical slots, which can
influence the XLogGetReplicationSlotMinimumLSN() calculation in
KeepLogSeg() (a simplified sketch of that logic is below).
3. Now, the reason for its success in some of the runs is that the
restart_lsn of the physical slots had also moved ahead by the time the
checkpoint happened. You can see the difference in LSNs for
ProcessStandbyReplyMessage between the failed and successful cases.

Next steps:
=========
1. The first thing is that we should verify this theory by adding some
LOG in KeepLogSeg() to see whether _logSegNo is reduced because of the
value returned by XLogGetReplicationSlotMinimumLSN() (see the
illustrative LOG below).
2. The reason the required file is not removed on the primary as well
is that it, too, has a physical slot which prevents the file removal.
3. If the above theory is correct, then I see a few possibilities to
fix this test: (a) somehow ensure that the restart_lsn of the physical
slot on the standby has advanced up to the point where we can safely
remove the required files; (b) just create a separate test case by
initializing a fresh node for primary and standby where we only have
logical slots on the standby. This will be a bit costly but probably
less risky. (c) any better ideas?

--
With Regards,
Amit Kapila.
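For reference, the slot-based clamping inside KeepLogSeg() that I
suspect is at play works roughly like the sketch below. This is only a
simplified paraphrase of the xlog.c logic (wal_keep_size handling and
other details are omitted), not the exact code:

/*
 * Simplified sketch of the slot-related part of KeepLogSeg() in
 * xlog.c; it relies on the declarations already present in that file.
 */
static void
KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
{
    XLogSegNo   currSegNo;
    XLogSegNo   segno;
    XLogRecPtr  keep;

    XLByteToSeg(recptr, currSegNo, wal_segment_size);
    segno = currSegNo;

    /*
     * Minimum restart_lsn across *all* slots, physical ones included,
     * not just the logical slots that the test invalidates.
     */
    keep = XLogGetReplicationSlotMinimumLSN();
    if (keep != InvalidXLogRecPtr && keep < recptr)
    {
        XLByteToSeg(keep, segno, wal_segment_size);

        /* Cap slot-based retention by max_slot_wal_keep_size, if set. */
        if (max_slot_wal_keep_size_mb >= 0)
        {
            XLogSegNo   slot_keep_segs;

            slot_keep_segs =
                ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);
            if (currSegNo - segno > slot_keep_segs)
                segno = currSegNo - slot_keep_segs;
        }
    }

    /*
     * A lagging physical slot can pull the computed segno below the
     * caller's value; this is where _logSegNo drops from 4 to 3 in
     * the failed run.
     */
    if (segno < *logSegNo)
        *logSegNo = segno;
}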
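And for step 1, the kind of extra LOG I have in mind is something along
these lines (illustrative only, placed right after the
XLogGetReplicationSlotMinimumLSN() call in KeepLogSeg()):

    /* Illustrative debugging aid, not a proposed code change. */
    keep = XLogGetReplicationSlotMinimumLSN();
    elog(LOG, "KeepLogSeg: slot minimum LSN is %X/%X, _logSegNo is " UINT64_FORMAT,
         LSN_FORMAT_ARGS(keep), *logSegNo);

Together with the existing "CreateRestartPoint: After KeepLogSeg ...
_logSegNo is ..." LOG shown above, that should tell us whether the
physical slot's restart_lsn is what pulls _logSegNo back to 3.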