Re: Replication slot stats misgivings - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Replication slot stats misgivings
Date
Msg-id CAD21AoD8DvPDFKtZJOV1YkeTeemAREQ+ppuQL7U3LJcqFi1V-w@mail.gmail.com
Whole thread Raw
In response to Re: Replication slot stats misgivings  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Replication slot stats misgivings
List pgsql-hackers
On Tue, Apr 27, 2021 at 11:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Apr 27, 2021 at 5:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Apr 27, 2021 at 8:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I have pushed this patch and seeing one buildfarm failure:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-27%2009%3A23%3A14
> >
> >   starting permutation: s1_init s1_begin s1_insert_tbl1 s1_insert_tbl2
> > s2_alter_tbl1_char s1_commit s2_get_changes
> > + isolationtester: canceling step s1_init after 314 seconds
> >   step s1_init: SELECT 'init' FROM
> > pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
> >   ?column?
> >
> > I am analyzing this.
> >
>
> After checking below logs corresponding to this test, it seems test
> has been executed and create_slot was successful:

The pg_create_logical_replication_slot() was executed at 11:04:25:

2021-04-27 11:04:25.494 UTC [17694956:49] isolation/concurrent_ddl_dml
LOG:  statement: SELECT 'init' FROM
pg_create_logical_replication_slot('isolation_slot', 'test_decoding');

Therefore this command took 314 sec that matches the number the
isolation test reported. And the folling logs follow:

2021-04-27 11:06:43.770 UTC [17694956:50] isolation/concurrent_ddl_dml
LOG:  logical decoding found consistent point at 0/17F9078
2021-04-27 11:06:43.770 UTC [17694956:51] isolation/concurrent_ddl_dml
DETAIL:  There are no running transactions.

> 2021-04-27 11:06:43.770 UTC [17694956:52] isolation/concurrent_ddl_dml
> STATEMENT:  SELECT 'init' FROM
> pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
> 2021-04-27 11:07:11.748 UTC [5243096:9] LOG:  checkpoint starting: time
> 2021-04-27 11:09:24.332 UTC [5243096:10] LOG:  checkpoint complete:
> wrote 14 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=0.716 s, sync=0.001 s, total=132.584 s; sync files=0,
> longest=0.000 s, average=0.000 s; distance=198 kB, estimate=406 kB
> 2021-04-27 11:09:40.116 UTC [6226046:1] [unknown] LOG:  connection
> received: host=[local]
> 2021-04-27 11:09:40.117 UTC [17694956:53] isolation/concurrent_ddl_dml
> LOG:  statement: BEGIN;
> 2021-04-27 11:09:40.117 UTC [17694956:54] isolation/concurrent_ddl_dml
> LOG:  statement: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
> 2021-04-27 11:09:40.118 UTC [17694956:55] isolation/concurrent_ddl_dml
> LOG:  statement: INSERT INTO tbl2 (val1, val2) VALUES (1, 1);
> 2021-04-27 11:09:40.119 UTC [10944636:49] isolation/concurrent_ddl_dml
> LOG:  statement: ALTER TABLE tbl1 ALTER COLUMN val2 TYPE character
> varying;
>
> I am not sure but there is some possibility that even though create
> slot is successful, the isolation tester got successful in canceling
> it, maybe because create_slot is just finished at the same time.

Yeah, we see the test log "canceling step s1_init after 314 seconds"
but don't see any log indicating canceling query.

>  As we
> can see from logs, during this test checkpoint also happened which
> could also lead to the slowness of this particular command.

Yes. I also think the checkpoint could somewhat lead to the slowness.
And since create_slot() took 2min to find a consistent snapshot the
system might have already been busy.

>
> Also, I see a lot of messages like below which indicate stats
> collector is also quite slow:
> 2021-04-27 10:57:59.385 UTC [18743536:1] LOG:  using stale statistics
> instead of current ones because stats collector is not responding
>
> I am not sure if the timeout happened because the machine is slow or
> is it in any way related to code. I am seeing some previous failures
> due to timeout on this machine [1][2]. In those failures, I see the
> "using stale stats...." message.

It seems like a time-dependent issue but I'm wondering why the logical
decoding test failed at this time.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Better sanity checking for message length words
Next
From: Masahiko Sawada
Date:
Subject: Re: Performance degradation of REFRESH MATERIALIZED VIEW