Thread: subscription/026_stats test is intermittently slow?

subscription/026_stats test is intermittently slow?

From

Robert Haas

Date:

19 April, 20:57:41

I just did a run of the regression test where this test was the last
one to finish by quite a lot. Key log entries:

[13:35:48.583](0.039s) # initializing database system by copying initdb template
...
[13:35:52.397](0.108s) ok 5 - Check reset timestamp for
'test_tab1_sub' is newer after second reset.

#### Begin standard error

psql:<stdin>:1: NOTICE:  created replication slot "test_tab2_sub" on publisher

#### End standard error

Waiting for replication conn test_tab2_sub's replay_lsn to pass
0/151E8C8 on publisher

done

[13:38:53.706](181.310s) ok 6 - Check that table 'test_tab2' now has 1 row.
...
[13:38:54.344](0.294s) 1..13

I reran the test and it looks very different:

[13:54:01.703](0.090s) ok 5 - Check reset timestamp for
'test_tab1_sub' is newer after second reset.
...
Waiting for replication conn test_tab2_sub's replay_lsn to pass
0/151E900 on publisher
...
[13:54:03.006](1.303s) ok 6 - Check that table 'test_tab2' now has 1 row.

It looks to me like in the first run it took 3 minutes for the
replay_lsn to catch up to the desired value, and in the second run,
two seconds. I think I have seen previous instances where something
similar happened, although in those cases I did not stop to record any
details. Have others seen this? Is there something we can/should do
about it?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: subscription/026_stats test is intermittently slow?

From

Michael Paquier

Date:

20 April, 05:57:45

On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote:
> It looks to me like in the first run it took 3 minutes for the
> replay_lsn to catch up to the desired value, and in the second run,
> two seconds. I think I have seen previous instances where something
> similar happened, although in those cases I did not stop to record any
> details. Have others seen this? Is there something we can/should do
> about it?

FWIW, I've also seen delays as well with this test on a few occasions.
Thanks for mentioning it.
--
Michael

Attachment

signature.asc

Re: subscription/026_stats test is intermittently slow?

From

Alexander Lakhin

Date:

20 April, 08:00:00

Hello Michael and Robert,

20.04.2024 05:57, Michael Paquier wrote:
> On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote:
>> It looks to me like in the first run it took 3 minutes for the
>> replay_lsn to catch up to the desired value, and in the second run,
>> two seconds. I think I have seen previous instances where something
>> similar happened, although in those cases I did not stop to record any
>> details. Have others seen this? Is there something we can/should do
>> about it?
> FWIW, I've also seen delays as well with this test on a few occasions.
> Thanks for mentioning it.

It reminds me of
https://www.postgresql.org/message-id/858a7622-2c81-1687-d1df-1322dfcb2e72%40gmail.com

At least, I could reproduce such a delay with the attached patch applied.

Best regards,
Alexander

Attachment

026_stats-180.patch

Re: subscription/026_stats test is intermittently slow?

From

vignesh C

Date:

22 April, 18:32:10

On Sat, 20 Apr 2024 at 10:30, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Michael and Robert,
>
> 20.04.2024 05:57, Michael Paquier wrote:
> > On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote:
> >> It looks to me like in the first run it took 3 minutes for the
> >> replay_lsn to catch up to the desired value, and in the second run,
> >> two seconds. I think I have seen previous instances where something
> >> similar happened, although in those cases I did not stop to record any
> >> details. Have others seen this? Is there something we can/should do
> >> about it?
> > FWIW, I've also seen delays as well with this test on a few occasions.
> > Thanks for mentioning it.
>
> It reminds me of
> https://www.postgresql.org/message-id/858a7622-2c81-1687-d1df-1322dfcb2e72%40gmail.com

Thanks Alexander for the test, I was able to reproduce the issue with
the test you shared and also verify that the patch at [1] fixes the
same. This is the same issue where the apply worker for test_tab2_sub
was getting started after 180 seconds because the common latch (which
is used for worker attached, subscription creation/modification and
apply worker process exit) was getting reset when the other
subscription test_tab1_sub's worker gets started. The same can be seen
from the logs:
2024-04-22 20:47:52.009 IST [323280] 026_stats.pl LOG:  statement: BEGIN;
2024-04-22 20:47:52.009 IST [323280] 026_stats.pl LOG:  statement:
SELECT pg_sleep(0.5);
2024-04-22 20:47:52.426 IST [323281] LOG:  logical replication apply
worker for subscription "test_tab1_sub" has started
2024-04-22 20:47:52.511 IST [323280] 026_stats.pl LOG:  statement:
CREATE TABLE test_tab2(a int primary key);
2024-04-22 20:47:52.518 IST [323280] 026_stats.pl LOG:  statement:
INSERT INTO test_tab2 VALUES (1);
2024-04-22 20:47:52.519 IST [323280] 026_stats.pl LOG:  statement: COMMIT;
2024-04-22 20:47:52.540 IST [323286] 026_stats.pl LOG:  statement:
CREATE SUBSCRIPTION test_tab2_sub CONNECTION 'port=56685
host=/tmp/RwzpQrVMYH dbname=postgres' PUBLICATION test_tab2_pub
2024-04-22 20:50:52.658 IST [326265] LOG:  logical replication apply
worker for subscription "test_tab2_sub" has started
2024-04-22 20:50:52.668 IST [326267] LOG:  logical replication table
synchronization worker for subscription "test_tab2_sub", table
"test_tab2" has started

[1] - https://www.postgresql.org/message-id/CALDaNm10R7L0Dxq%2B-J%3DPp3AfM_yaokpbhECvJ69QiGH8-jQquw%40mail.gmail.com

Regards,
Vignesh

Re: subscription/026_stats test is intermittently slow?

From

Andres Freund

Date:

22 April, 20:55:08

Hi,

On 2024-04-19 13:57:41 -0400, Robert Haas wrote:
> Have others seen this? Is there something we can/should do about it?

Yes, I've also seen this - but never quite reproducible enough to properly
tackle it.

The first thing I'd like to do is to make the wait_for_catchup routine
regularly log the current state, so we can in retrospect analyze e.g. whether
there was continual, but slow, replay progress, or whether replay was entirely
stuck.  wait_for_catchup() not being debuggable has been a problem in many
different tests, so I think it's high time to fix that.

Greetings,

Andres Freund

Re: subscription/026_stats test is intermittently slow?

From

Amit Kapila

Date:

23 April, 15:10:10

On Tue, Apr 23, 2024 at 11:49 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Sat, 20 Apr 2024 at 10:30, Alexander Lakhin <exclusion@gmail.com> wrote:
> >
> > Hello Michael and Robert,
> >
> > 20.04.2024 05:57, Michael Paquier wrote:
> > > On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote:
> > >> It looks to me like in the first run it took 3 minutes for the
> > >> replay_lsn to catch up to the desired value, and in the second run,
> > >> two seconds. I think I have seen previous instances where something
> > >> similar happened, although in those cases I did not stop to record any
> > >> details. Have others seen this? Is there something we can/should do
> > >> about it?
> > > FWIW, I've also seen delays as well with this test on a few occasions.
> > > Thanks for mentioning it.
> >
> > It reminds me of
> > https://www.postgresql.org/message-id/858a7622-2c81-1687-d1df-1322dfcb2e72%40gmail.com
>
> Thanks Alexander for the test, I was able to reproduce the issue with
> the test you shared and also verify that the patch at [1] fixes the
> same.
>

One of the issues reported in the thread you referred to has the same
symptoms [1]. I'll review and analyze your proposal.

[1] - https://www.postgresql.org/message-id/858a7622-2c81-1687-d1df-1322dfcb2e72%40gmail.com

--
With Regards,
Amit Kapila.