Re: Replication slot stats misgivings - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Replication slot stats misgivings
Date
Msg-id CAA4eK1K6-8pyt6prAP44tjnw4m4KP=1_EUMOH97nfP0TqUOJkg@mail.gmail.com
Whole thread Raw
In response to Re: Replication slot stats misgivings  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Replication slot stats misgivings  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Tue, Apr 27, 2021 at 5:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Apr 27, 2021 at 8:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I have pushed this patch and seeing one buildfarm failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-27%2009%3A23%3A14
>
>   starting permutation: s1_init s1_begin s1_insert_tbl1 s1_insert_tbl2
> s2_alter_tbl1_char s1_commit s2_get_changes
> + isolationtester: canceling step s1_init after 314 seconds
>   step s1_init: SELECT 'init' FROM
> pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
>   ?column?
>
> I am analyzing this.
>

After checking below logs corresponding to this test, it seems test
has been executed and create_slot was successful:
2021-04-27 11:06:43.770 UTC [17694956:52] isolation/concurrent_ddl_dml
STATEMENT:  SELECT 'init' FROM
pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
2021-04-27 11:07:11.748 UTC [5243096:9] LOG:  checkpoint starting: time
2021-04-27 11:09:24.332 UTC [5243096:10] LOG:  checkpoint complete:
wrote 14 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=0.716 s, sync=0.001 s, total=132.584 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=198 kB, estimate=406 kB
2021-04-27 11:09:40.116 UTC [6226046:1] [unknown] LOG:  connection
received: host=[local]
2021-04-27 11:09:40.117 UTC [17694956:53] isolation/concurrent_ddl_dml
LOG:  statement: BEGIN;
2021-04-27 11:09:40.117 UTC [17694956:54] isolation/concurrent_ddl_dml
LOG:  statement: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
2021-04-27 11:09:40.118 UTC [17694956:55] isolation/concurrent_ddl_dml
LOG:  statement: INSERT INTO tbl2 (val1, val2) VALUES (1, 1);
2021-04-27 11:09:40.119 UTC [10944636:49] isolation/concurrent_ddl_dml
LOG:  statement: ALTER TABLE tbl1 ALTER COLUMN val2 TYPE character
varying;

I am not sure but there is some possibility that even though create
slot is successful, the isolation tester got successful in canceling
it, maybe because create_slot is just finished at the same time. As we
can see from logs, during this test checkpoint also happened which
could also lead to the slowness of this particular command.

Also, I see a lot of messages like below which indicate stats
collector is also quite slow:
2021-04-27 10:57:59.385 UTC [18743536:1] LOG:  using stale statistics
instead of current ones because stats collector is not responding

I am not sure if the timeout happened because the machine is slow or
is it in any way related to code. I am seeing some previous failures
due to timeout on this machine [1][2]. In those failures, I see the
"using stale stats...." message. Also, I am not able to see why it can
fail due to this patch?

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-02-23%2004%3A23%3A56
[2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2020-12-24%2005%3A31%3A43

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Should we document the behaviour of TRUNCATE skipping repeated relations?
Next
From: Tom Lane
Date:
Subject: Re: Better sanity checking for message length words