Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Date
Msg-id 912614.1601502758@sss.pgh.pa.us
In response to Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-bugs
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2020-Sep-30, Tom Lane wrote:
>> The question that this raises is how the heck did that get past
>> our test suites?  It seems like the error should have been obvious
>> to even the most minimal testing.

> ... yeah, that's indeed an important question.  I'm going to guess that
> the TAP suites are too forgiving :-(

One thing I noticed while trying to trace this down is that while the
initial table sync is happening, we have *both* a regular
walsender/walreceiver pair and a "sync" pair, eg

postgres  905650  0.0  0.0 186052 11888 ?        Ss   17:12   0:00 postgres: logical replication worker for subscription 16398
postgres  905651 50.1  0.0 173704 13496 ?        Ss   17:12   0:09 postgres: walsender postgres [local] idle
postgres  905652  104  0.4 186832 148608 ?       Rs   17:12   0:19 postgres: logical replication worker for subscription 16398 sync 16393
postgres  905653 12.2  0.0 174380 15524 ?        Ss   17:12   0:02 postgres: walsender postgres [local] COPY

Is it supposed to be like that?  Notice also that the regular walsender
has consumed significant CPU time; it's not pinning a CPU like the sync
walreceiver is, but it's eating maybe 20% of a CPU according to "top".
I wonder whether in cases with only small tables (which is likely all
that our tests test), the regular walreceiver manages to complete the
table sync despite repeated(?) failures of the sync worker.

            regards, tom lane
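The situation described above, where a subscription's regular apply worker keeps running while a table-sync worker for one of its tables is also active, can be spotted mechanically in a `ps` listing. The following is a minimal Python sketch, not something from the thread. It assumes the PG13 process-title formats visible in the listing ("logical replication worker for subscription <oid>", optionally followed by "sync <relid>", and "walsender <user> <host> <state>") and that each process title sits on a single line. The function names and the sample list are hypothetical illustrations.

```python
import re
from collections import Counter

# Assumed PG13 process-title formats (taken from the ps listing above).
# \s* tolerates listings where the space before the OID was lost.
SYNC_RE = re.compile(r"logical replication worker for subscription\s*(\d+) sync (\d+)$")
APPLY_RE = re.compile(r"logical replication worker for subscription\s*(\d+)$")
WALSENDER_RE = re.compile(r"walsender \S+ \S+ (\S+)$")


def process_titles(ps_lines):
    """Yield the process title (text after 'postgres: ') from each ps line."""
    for line in ps_lines:
        _, sep, title = line.partition("postgres: ")
        if sep:
            yield title.strip()


def concurrent_sync_workers(ps_lines):
    """Map subscription OID -> table OIDs being synced while the apply worker also runs."""
    apply_subs = set()
    syncing = {}
    for title in process_titles(ps_lines):
        m = SYNC_RE.search(title)
        if m:
            syncing.setdefault(m.group(1), []).append(m.group(2))
            continue
        m = APPLY_RE.search(title)
        if m:
            apply_subs.add(m.group(1))
    # Keep only subscriptions that have both an apply worker and a sync worker.
    return {sub: rels for sub, rels in syncing.items() if sub in apply_subs}


def walsender_states(ps_lines):
    """Count walsender processes by the state word at the end of their title."""
    counts = Counter()
    for title in process_titles(ps_lines):
        m = WALSENDER_RE.search(title)
        if m:
            counts[m.group(1)] += 1
    return counts


# Hypothetical sample: the four processes from the listing above, one per line.
SAMPLE = [
    "postgres  905650  0.0  0.0 186052 11888 ?        Ss   17:12   0:00 postgres: logical replication worker for subscription 16398",
    "postgres  905651 50.1  0.0 173704 13496 ?        Ss   17:12   0:09 postgres: walsender postgres [local] idle",
    "postgres  905652  104  0.4 186832 148608 ?       Rs   17:12   0:19 postgres: logical replication worker for subscription 16398 sync 16393",
    "postgres  905653 12.2  0.0 174380 15524 ?        Ss   17:12   0:02 postgres: walsender postgres [local] COPY",
]

if __name__ == "__main__":
    print(concurrent_sync_workers(SAMPLE))
    print(walsender_states(SAMPLE))
```

On the sample this reports subscription 16398 syncing table 16393 alongside its apply worker, plus one idle and one COPY walsender, which is the "two pairs" picture from the listing. Parsing `ps` text is fragile; the sketch only illustrates the shape of the check.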
