On Thu, May 13, 2021 at 3:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I noticed $SUBJECT in curculio.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23
>
> The failure test report:
> t/020_messages.pl (Wstat: 7424 Tests: 1 Failed: 0)
> Non-zero exit status: 29
> Parse errors: Bad plan. You planned 5 tests but ran 1.
> Files=21, Tests=149, 79 wallclock secs ( 0.03 usr 0.03 sys + 22.36
> cusr 9.47 csys = 31.89 CPU)
> Result: FAIL
> Makefile:21: recipe for target 'check' failed
> gmake: *** [check] Error 1
>
> As far as I can understand the failure happens due to error:
> 2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR:
> replication slot "tap_sub" is active for PID 7354
> 2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT:
> SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape')
> FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
> 'proto_version', '1',
> 'publication_names', 'tap_pub',
> 'messages', 'true')
> OFFSET 1 LIMIT 1
>
> As per my initial analysis, this happens because in the test after
> creating a subscription, we are not waiting for the subscriber to
> catch up. I think there is a narrow window where it is possible that
> after creating a subscription, the apply worker starts replication
> even after we have done Alter Subscription .. DISABLE. This is also
> visible in buildfarm logs:
>
> 2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT:
> SELECT get_byte(data, 0)
> FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
> 'proto_version', '1',
> 'publication_names', 'tap_pub',
> 'messages', 'true')
> ...
>
> 2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG: received
> replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
> (proto_version '2', publication_names '"tap_pub"')
>
> Notice that we have checked the messages before receiving the start
> replication command from the subscriber. Now, if we would have waited
> for the subscription to catch up then our check in pg_stat_replication
> wouldn't have given a false positive. I have noticed that we do have
> such wait in other subscription tests but somehow missed it here.
I agree with your analysis, apart from this issue there is one other
small problem. Currently after disabling subscription
pg_stat_replication is used to verify if the walsender is exited. The
steps of the walsender exit includes cleaning up of walsender and then
releasing a replication slot. There will be random test failure if
pg_logical_slot_peek_binary_changes is called in this narrow time
window. I think it is better to fix it by checking the active column
in pg_replication_slot instead of pg_stat_replication which is more
reliable.
Attached patch which has the fixes for the same.
Thoughts?
Regards,
Vignesh