Re: subscriptioncheck failure - Mailing list pgsql-hackers

From vignesh C
Subject Re: subscriptioncheck failure
Date
Msg-id CALDaNm02JV5bfOWWKLPJN3wCMvFRL-AdzyUEQ3CyQqDUfygneg@mail.gmail.com
Whole thread Raw
In response to subscriptioncheck failure  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: subscriptioncheck failure
List pgsql-hackers
On Thu, May 13, 2021 at 3:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I noticed $SUBJECT in curculio.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23
>
> The failure test report:
> t/020_messages.pl                (Wstat: 7424 Tests: 1 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: Bad plan.  You planned 5 tests but ran 1.
> Files=21, Tests=149, 79 wallclock secs ( 0.03 usr  0.03 sys + 22.36
> cusr  9.47 csys = 31.89 CPU)
> Result: FAIL
> Makefile:21: recipe for target 'check' failed
> gmake: *** [check] Error 1
>
> As far as I can understand the failure happens due to error:
> 2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR:
> replication slot "tap_sub" is active for PID 7354
> 2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT:
> SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape')
> FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
> 'proto_version', '1',
> 'publication_names', 'tap_pub',
> 'messages', 'true')
> OFFSET 1 LIMIT 1
>
> As per my initial analysis, this happens because in the test after
> creating a subscription, we are not waiting for the subscriber to
> catch up. I think there is a narrow window where it is possible that
> after creating a subscription, the apply worker starts replication
> even after we have done Alter Subscription .. DISABLE. This is also
> visible in buildfarm logs:
>
> 2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT:
> SELECT get_byte(data, 0)
> FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
> 'proto_version', '1',
> 'publication_names', 'tap_pub',
> 'messages', 'true')
> ...
>
> 2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG:  received
> replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
> (proto_version '2', publication_names '"tap_pub"')
>
> Notice that we have checked the messages before receiving the start
> replication command from the subscriber. Now, if we would have waited
> for the subscription to catch up then our check in pg_stat_replication
> wouldn't have given a false positive. I have noticed that we do have
> such wait in other subscription tests but somehow missed it here.

I agree with your analysis, apart from this issue there is one other
small problem. Currently after disabling subscription
pg_stat_replication is used to verify if the walsender is exited. The
steps of the walsender exit includes cleaning up of walsender and then
releasing a replication slot. There will be random test failure if
pg_logical_slot_peek_binary_changes is called in this narrow time
window. I think it is better to fix it by checking the active column
in pg_replication_slot instead of pg_stat_replication which is more
reliable.
Attached patch which has the fixes for the same.
Thoughts?

Regards,
Vignesh

Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Forget close an open relation in ReorderBufferProcessTXN()
Next
From: Bharath Rupireddy
Date:
Subject: Re: RFC: Logging plan of the running query