
From: Alvaro Herrera
Subject: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Date: 2020-10-10 19:06:37
Msg-id: 20201010190637.GA5774@alvherre.pgsql
In response to: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-bugs
On 2020-Sep-30, Tom Lane wrote:

> After some digging around, I realize that that commit actually
> resulted in a protocol break.  libpqwalreceiver is expecting to
> get an additional CommandComplete message after COPY OUT finishes,
> per libpqrcv_endstreaming(), and it's no longer getting one.
> 
> (I have not read the protocol document to see if this is per spec;
> but spec or no, that's what libpqwalreceiver is expecting.)
> 
> The question that this raises is how the heck did that get past
> our test suites?  It seems like the error should have been obvious
> to even the most minimal testing.

So I think I can answer this now.  Petr, as the intellectual author of
this code, I would appreciate your thoughts on this -- you probably
understand it much better.  Same goes for Peter as committer.

I think the logical replication feature is missing a detailed
architectural diagram.

What is happening is that the tablesync code, which is in charge of the
initial syncing of tables, has two main code paths.  The first one is an
optimization (?) in LogicalRepSyncTableStart(): if the sync takes little
time, the process exits immediately (via finish_sync_worker()) without
returning control to the main logical replication apply worker code.  In
the TAP tests all tables are pretty small, so that optimization always
fires.
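
In rough outline, the tail of that function does this (my condensed
paraphrase, not the verbatim tablesync.c code):

    copy_table(rel);    /* the actual data copy, via COPY / CopyFrom() */

    /* tell the apply worker we are ready, then wait for its verdict */
    wait_for_worker_state_change(SUBREL_STATE_CATCHUP);

    if (*origin_startpos >= MyLogicalRepWorker->relstate_lsn)
    {
        /*
         * Sync is already caught up with apply: mark the rel SYNCDONE
         * and exit the process right here -- control never returns to
         * the caller.
         */
        UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
                                   MyLogicalRepWorker->relid,
                                   SUBREL_STATE_SYNCDONE,
                                   *origin_startpos);
        finish_sync_worker();   /* ends in proc_exit(0) */
    }

    return slotname;            /* reached only when catchup is needed */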

So in that case we never get control back to walrcv_endstreaming (the
complainant in Henry's test case), which is why replication never fails
in the tests.  You can verify this because the final "return slotname"
line in LogicalRepSyncTableStart has zero coverage.
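
For reference, at the libpq level what walrcv_endstreaming is waiting
for is ordinary, documented COPY-end handling -- something like the
sketch below.  This is plain libpq usage, not the actual
libpqwalreceiver.c code, and expect_copy_end() is a name I made up:

    #include <stdio.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    /*
     * Drain a finished COPY stream, then insist on the CommandComplete
     * message, which libpq surfaces as PGRES_COMMAND_OK.  That trailing
     * message is what the server stopped sending, so a client doing
     * this blocks or errors out at PQgetResult().
     */
    static void
    expect_copy_end(PGconn *conn)
    {
        char   *buf;
        int     len;

        /* read and discard COPY data rows until the stream ends (-1) */
        while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
            PQfreemem(buf);

        if (len == -2)
        {
            fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
            exit(1);
        }

        /* len == -1: COPY is done; CommandComplete must follow */
        PGresult   *res = PQgetResult(conn);

        if (PQresultStatus(res) != PGRES_COMMAND_OK)
        {
            fprintf(stderr, "expected CommandComplete, got %s\n",
                    PQresStatus(PQresultStatus(res)));
            exit(1);
        }
        PQclear(res);
    }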


The other code path returns the slot name to ApplyWorkerMain, which
does the normal walrcv_startstreaming / LogicalRepApplyLoop /
endstreaming dance.  Whenever that code path is taken, all the
src/test/subscription tests fail with the missing command tag, and they
work correctly once the command tag is restored -- which is what we
wanted.
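
Again paraphrasing (from memory, so variable names are approximate),
that dance is:

    slotname = LogicalRepSyncTableStart(&origin_startpos);
    ...
    walrcv_startstreaming(wrconn, &options);
    LogicalRepApplyLoop(origin_startpos);

and for a tablesync worker the apply loop eventually reaches
process_syncing_tables_for_sync(), which calls
walrcv_endstreaming(wrconn, &tli) -- the call that trips over the
missing CommandComplete -- and then finish_sync_worker().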

My proposal at this stage is to remove the optimization in
LogicalRepSyncTableStart, per 0001, if only because the code becomes
simpler (0002 reindents code).  With this, we get a coverage increase
for tablesync.c not only in the final "return slotname" but also in
process_syncing_tables_for_sync(), which is currently uncovered:
https://coverage.postgresql.org/src/backend/replication/logical/tablesync.c.gcov.html#268

However, and this is one reason why I'd welcome thoughts from Petr and
Peter on this, I don't really understand what happens in
LogicalRepApplyLoop afterwards with a tablesync worker; are we actually
doing anything useful there, considering that the actual data copy seems
to have occurred in the CopyFrom() call in copy_table()?  In other
words, by the time we return control to ApplyWorkerMain with a slot
name, isn't the work all done, and the only thing we need to do is
synchronize the protocol and close the connection?

Or is it?


Another thing I noticed is that if I crank up the number of rows in the
new t/100_bugs.pl test case (*), the two sync workers run in sequence --
not in parallel.  What's going on with that?  Shouldn't there be two
simultaneous workers?

(*) In the submitted patch, the test uses 10 million rows.  I intend to
reduce that to, say, 5000 rows in the committed version.  Strictly
speaking, there's no need for this test at all, since with the
LogicalRepSyncTableStart changes I propose even t/001_rep_changes.pl
fails.  But since my new test case fails even without that change, I
prefer to keep it anyway.

