Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop - Mailing list pgsql-bugs
| From | Alvaro Herrera |
|---|---|
| Subject | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop |
| Date | |
| Msg-id | 20201010190637.GA5774@alvherre.pgsql |
| In response to | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop |
| List | pgsql-bugs |
On 2020-Sep-30, Tom Lane wrote:

> After some digging around, I realize that that commit actually
> resulted in a protocol break.  libpqwalreceiver is expecting to
> get an additional CommandComplete message after COPY OUT finishes,
> per libpqrcv_endstreaming(), and it's no longer getting one.
>
> (I have not read the protocol document to see if this is per spec;
> but spec or no, that's what libpqwalreceiver is expecting.)
>
> The question that this raises is how the heck did that get past
> our test suites?  It seems like the error should have been obvious
> to even the most minimal testing.

So I think I can answer this now.  Petr, as intellectual author of this code I would appreciate your thoughts on this -- you probably understand it much better than I do.  Same goes for Peter as committer.  I think the logical replication feature is missing a detailed architectural diagram.

What is happening is that the tablesync code, which is in charge of the initial syncing of tables, has two main code paths.  The first one is an optimization (?) in LogicalRepSyncTableStart(): if the sync takes little time, the process exits immediately (via finish_sync_worker()) without returning control to the main logical replication apply worker code.  In the TAP tests all tables are pretty small, so that optimization always fires; in that case we never get control back to walrcv_endstreaming (the complainant in Henry's test case), which is why replication never fails in the tests.  You can verify this because the final "return slotname" line in LogicalRepSyncTableStart has zero coverage.

The other code path returns the slot name to ApplyWorkerMain, which does the normal walrcv_startstreaming / LogicalRepApplyLoop / endstreaming dance.  Whenever that code path is taken, all the src/test/subscription tests fail because of the missing command tag, and they work correctly once the command tag is restored, which is what we wanted.

My proposal at this stage is to remove the optimization in LogicalRepSyncTableStart, per 0001, if only because the code becomes simpler (0002 reindents the code).  With this, we get a coverage increase for tablesync.c, not only for the final "return slotname" but also for process_syncing_tables_for_sync(), which is currently uncovered:
https://coverage.postgresql.org/src/backend/replication/logical/tablesync.c.gcov.html#268

However, and this is one reason why I'd welcome Petr's and Peter's thoughts on this, I don't really understand what happens in LogicalRepApplyLoop afterwards with a tablesync worker.  Are we actually doing anything useful there, considering that the actual data copy seems to have occurred in the CopyFrom() call in copy_table?  In other words, by the time we return control to ApplyWorkerMain with a slot name, isn't the work all done, and the only thing we need is to synchronize protocol and close the connection?  Or is it?

Another thing I noticed is that if I crank up the number of rows in the new t/100_bugs.pl test case (*), the two sync workers run in sequence -- not in parallel.  What's going on with that?  Shouldn't there be two simultaneous workers?

(*) In the submitted patch, the test uses 10 million rows.  I intend to reduce that to, say, 5000 rows in the committed version.  Strictly speaking, there's no need for this test at all, since with the LogicalRepSyncTableStart changes I propose even t/001_rep_changes.pl fails.  But since my new test case fails even without that change, I prefer to keep it anyway.
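For anyone following along, this is roughly the exchange the client side expects once the COPY stream ends.  It is a simplified libpq-level sketch of the documented COPY OUT flow, not the actual libpqrcv_endstreaming() code, and the helper name is made up: after the last copy-data message, PQgetResult() should yield a CommandComplete (PGRES_COMMAND_OK).  If the server never sends that message, the client sits waiting for it, which is the symptom reported here.

```c
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

/*
 * Illustrative only: drain the remaining COPY data, then insist on the
 * CommandComplete that should follow the end of the COPY.
 */
static void
drain_copy_and_expect_command_complete(PGconn *conn)
{
	char	   *buf;
	int			len;
	PGresult   *res;

	/* Consume COPY data until the server ends the copy (-1). */
	while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
		PQfreemem(buf);

	if (len == -2)
	{
		fprintf(stderr, "error reading COPY data: %s", PQerrorMessage(conn));
		exit(1);
	}

	/* COPY is done; now we expect CommandComplete for the command. */
	res = PQgetResult(conn);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
		fprintf(stderr, "expected CommandComplete after COPY, got: %s",
				PQerrorMessage(conn));
		exit(1);
	}
	PQclear(res);

	/* A NULL result then marks the end of the command. */
	res = PQgetResult(conn);
	if (res != NULL)
	{
		fprintf(stderr, "unexpected extra result after COPY\n");
		PQclear(res);
		exit(1);
	}
}
```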