Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop - Mailing list pgsql-bugs
From | Dilip Kumar |
---|---|
Subject | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop |
Date | |
Msg-id | CAFiTN-uJL6uB2tZN0CH9f9hJgBHzu=rMRaCzvzk-txSM9R=+kQ@mail.gmail.com |
In response to | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop (Amit Kapila <amit.kapila16@gmail.com>) |
Responses | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop |
List | pgsql-bugs |
On Fri, Nov 20, 2020 at 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Nov 20, 2020 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Nov 20, 2020 at 10:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Wed, Nov 18, 2020 at 2:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Wed, Nov 18, 2020 at 11:19 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > > > >
> > > > > On Wed, Nov 18, 2020 at 3:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > > > To cut a long story short, a tablesync worker CAN in fact end up
> > > > > > > processing (e.g. apply_dispatch) streaming messages.
> > > > > > > So the tablesync worker CAN get into the apply_handle_stream_commit.
> > > > > > > And this scenario, albeit rare, will crash.
> > > > > > >
> > > > > >
> > > > > > Thank you for reproducing this issue. Dilip, Peter, is anyone of you
> > > > > > interested in writing a fix for this?
> > > > >
> > > > > Hi Amit.
> > > > >
> > > > > FYI - Sorry, I am away/offline for the next 5 days.
> > > > >
> > > > > However, if this bug still remains unfixed after next Tuesday then I
> > > > > can look at it then.
> > > > >
> > > >
> > > > Fair enough. Let's see if Dilip or I can get a chance to look into
> > > > this before that.
> > > >
> > > > > ---
> > > > >
> > > > > IIUC there are 2 options:
> > > > > a) Disallow streaming for the tablesync worker.
> > > > > b) Make streaming work for the tablesync worker.
> > > > >
> > > > > I prefer option (a) not only because of the KISS principle, but also
> > > > > because this is how the tablesync worker was previously thought to
> > > > > behave anyway. I expect this fix may be like the code that Dilip
> > > > > already posted [1]
> > > > > [1] https://www.postgresql.org/message-id/CAFiTN-uUgKpfdbwSGnn3db3mMQAeviOhQvGWE_pC9icZF7VDKg%40mail.gmail.com
> > > > >
> > > > > OTOH, option (b) fix may or may not be possible (I don't know), but I
> > > > > have doubts that it is worthwhile to consider making a special fix for
> > > > > a scenario which so far has never been reproduced outside of the
> > > > > debugger.
> > > > >
> > > >
> > > > I would prefer option (b) unless the fix is not possible due to design
> > > > constraints. I don't think it is a good idea to make tablesync workers
> > > > behave differently unless we have a reason for doing so.
> > > >
> > >
> > > Okay, I will analyze this and try to submit my finding today.
> >
> > I have done my analysis, basically, the table sync worker is applying
> > all the changes (for multiple transactions from upstream) under the
> > single transaction (on downstream). Now for normal changes, we can
> > just avoid committing in apply_handle_commit and everything is fine
> > for streaming changes we also have the transaction at the stream level
> > which we need to manage the buffiles for storing the streaming
> > changes. So if we want to do that for the streaming transaction then
> > we need to avoid commit transactions on apply_handle_stream_commit as
> > apply_handle_stream_stop for the table sync worker.
> >
>
> And what about apply_handle_stream_abort? And, I guess we need to
> avoid other related things like update of
> replorigin_session_origin_lsn, replorigin_session_origin_timestamp,
> etc. in apply_handle_stream_commit() as we are apply_handle_commit().

Yes, we need to change these as well. I have tested this with the POC
patch and it is working fine. I will send the patch after some more
testing.

>
> I think it is difficult to have a reliable test case for this but feel
> free to propose if you have any ideas on the same.

I am not sure how to write an automated test case for this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com