Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop - Mailing list pgsql-bugs

From Dilip Kumar
Subject Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Msg-id CAFiTN-sn5odfWKAB2UM14NbtWx_bn6RXSJpeMXaezc+ANf0Png@mail.gmail.com
In response to Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-bugs
On Sat, Nov 7, 2020 at 9:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Nov 7, 2020 at 5:31 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2020-Nov-05, Amit Kapila wrote:
> >
> > > On Wed, Nov 4, 2020 at 7:19 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > > >
> > > > On 2020-Nov-04, Amit Kapila wrote:
> > > >
> > > > > On Thu, Oct 15, 2020 at 8:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > > >
> > > > > > * STREAM COMMIT bug?
> > > > > >   In apply_handle_stream_commit, we do CommitTransactionCommand, but
> > > > > >   apparently in a tablesync worker we shouldn't do it.
> > > > >
> > > > > In the tablesync stage, we don't allow streaming. See pgoutput_startup
> > > > > where we disable streaming for the init phase. As far as I understand,
> > > > > for tablesync we create the initial slot during which streaming will
> > > > > be disabled then we will copy the table (here logical decoding won't
> > > > > be used) and then allow the apply worker to get any other data which
> > > > > is inserted in the meantime. Now, I might be missing something here
> > > > > but if you can explain it a bit more or share some test to show how we
> > > > > can reach here via tablesync worker then we can discuss the possible
> > > > > solution.
> > > >
> > > > Hmm, okay, that sounds like there would be no bug then.  Maybe what we
> > > > need is just an assert in apply_handle_stream_commit that
> > > > !am_tablesync_worker(), as in the attached patch.  Passes tests.
> > > >
> > >
> > > +1. But do we want to have this Assert only in stream_commit API or
> > > all stream APIs as well?
> >
> > Well, the only reason I care about this is that apply_handle_commit
> > contains a comment that we must not do CommitTransactionCommand in the
> > syncworker case; so if you look at apply_handle_stream_commit and note
> > that it doesn't concern it about that, you become concerned that it
> > might be broken.  I don't think the other routines handling the "stream"
> > thing have that issue.
> >
>
> Fair enough, as mentioned in my previous email, I think we need to
> confirm once that after copy how the decoding happens on upstream for
> transactions during the phase where tablesync workers is moving to
> state SUBREL_STATE_SYNCDONE from SUBREL_STATE_CATCHUP. I'll try to
> come up (in next few days) with some test case to debug and test this
> particular scenario and share my findings.

IIUC, the table sync worker does the initial copy using a consistent
snapshot.  After that, if the main apply worker is behind it, the
table sync worker waits for the apply worker to reach the table sync
worker's start point, and from there the apply worker can continue
applying the changes.  OTOH, if the apply worker has already moved
ahead in processing the WAL after launching the table sync worker,
then the apply worker would have skipped that many transactions for
this table, because the table was not yet in the SYNCDONE state.  So
the table sync worker needs to cover this gap by applying the WAL
using the normal apply path, and it can be moved to the SYNCDONE state
once it catches up with the apply worker.  After putting the table
sync worker in the CATCHUP state, the apply worker waits for the table
sync worker to catch up.
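
To make the two cases concrete, here is a rough toy model in Python.
It is only a sketch of my understanding, not PostgreSQL code: LSNs are
plain integers, and the names SyncState, catchup_target and run_catchup
are made up for illustration.  It assumes the table sync worker has to
replay exactly the commits between its own start point and the apply
worker's position.

```python
from enum import Enum


class SyncState(Enum):
    # Labels follow the SUBREL_STATE_* names used in this thread.
    # Illustrative only; this is not the catalog representation.
    CATCHUP = "catchup"
    SYNCDONE = "syncdone"


def catchup_target(sync_start_lsn, apply_lsn):
    """LSN the table sync worker must reach before the table is SYNCDONE.

    If the apply worker is behind, the target is the table sync worker's
    own start point; if it is ahead, the target is the apply position.
    """
    return max(sync_start_lsn, apply_lsn)


def run_catchup(sync_start_lsn, apply_lsn, commit_lsns):
    """Simulate the table sync worker after its initial copy.

    Returns (final_state, replayed), where replayed lists the commits the
    table sync worker applies itself through the normal apply path.
    """
    state = SyncState.CATCHUP
    target = catchup_target(sync_start_lsn, apply_lsn)
    # Commits at or before the start point are already in the copied data;
    # commits after the target are left to the apply worker.
    replayed = [lsn for lsn in sorted(commit_lsns)
                if sync_start_lsn < lsn <= target]
    state = SyncState.SYNCDONE
    return state, replayed


if __name__ == "__main__":
    # Apply worker behind the table sync worker: nothing to replay.
    print(run_catchup(100, 80, [90, 110, 120]))
    # Apply worker ahead: the gap (100, 150] is replayed by table sync.
    print(run_catchup(100, 150, [90, 110, 120, 160]))
```

In the second call the table sync worker replays the commits at 110
and 120, and the commit at 160 is left to the apply worker.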

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


