RE: Perform streaming logical transactions by background workers and parallel apply - Mailing list pgsql-hackers
From | houzj.fnst@fujitsu.com |
---|---|
Subject | RE: Perform streaming logical transactions by background workers and parallel apply |
Date | |
Msg-id | OS0PR01MB57162495C97CAEA88AC8E9DB94FA9@OS0PR01MB5716.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: Perform streaming logical transactions by background workers and parallel apply (Dilip Kumar <dilipbalaut@gmail.com>) |
Responses |
Re: Perform streaming logical transactions by background workers and parallel apply
|
List | pgsql-hackers |
On Thursday, January 5, 2023 4:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jan 5, 2023 at 9:07 AM houzj.fnst@fujitsu.com > <houzj.fnst@fujitsu.com> wrote: > > > > On Wednesday, January 4, 2023 9:29 PM Dilip Kumar > <dilipbalaut@gmail.com> wrote: > > > > I think this looks good to me. > > > > Thanks for the comments. > > Attach the new version patch set which changed the comments as > suggested. > > Thanks for the updated patch, while testing this I see one strange > behavior which seems like bug to me, here is the step to reproduce > > 1. start 2 servers(config: logical_decoding_work_mem=64kB) > ./pg_ctl -D data/ -c -l pub_logs start > ./pg_ctl -D data1/ -c -l sub_logs start > > 2. Publisher: > create table t(a int PRIMARY KEY ,b text); > CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS > 'select array_agg(md5(g::text))::text from generate_series(1, 256) g'; > create publication test_pub for table t > with(PUBLISH='insert,delete,update,truncate'); > alter table t replica identity FULL ; > insert into t values (generate_series(1,2000),large_val()) ON CONFLICT > (a) DO UPDATE SET a=EXCLUDED.a*300; > > 3. Subscription Server: > create table t(a int,b text); > create subscription test_sub CONNECTION 'host=localhost port=5432 > dbname=postgres' PUBLICATION test_pub WITH ( slot_name = > test_slot_sub1,streaming=parallel); > > 4. Publication Server: > begin ; > savepoint a; > delete from t; > savepoint b; > insert into t values (generate_series(1,5000),large_val()) ON CONFLICT > (a) DO UPDATE SET a=EXCLUDED.a*30000; -- (while executing this start > publisher in 2-3 secs) > > Restart the publication server, while the transaction is still in an > uncommitted state. > ./pg_ctl -D data/ -c -l pub_logs stop -mi > ./pg_ctl -D data/ -c -l pub_logs start -mi > > after this, the parallel apply worker stuck in waiting on stream lock > forever (even after 10 mins) -- see below, from subscriber logs I can > see one of the parallel apply worker [75677] started but never > finished [no error], after that I have performed more operation [same > insert] which got applied by new parallel apply worked which got > started and finished within 1 second. > Thanks for reporting the problem. After analyzing the behavior, I think it's a bug on publisher side which is not directly related to parallel apply. I think the root reason is that we didn't try to send a stream end(stream abort) message to subscriber for the crashed transaction which was streamed before. The behavior is that, after restarting, the publisher will start to decode the transaction that aborted due to crash, and when try to stream the first change of that transaction, it will send a stream start message but then it realizes that the transaction was aborted, so it will enter the PG_CATCH block of ReorderBufferProcessTXN() and call ReorderBufferResetTXN() which send the stream stop message. And in this case, there would be a parallel apply worker started on subscriber waiting for stream end message which will never come. I think the same behavior happens for the non-parallel mode which will cause a stream file left on subscriber and will not be cleaned until the apply worker is restarted. To fix it, I think we need to send a stream abort message when we are cleaning up crashed transaction on publisher(e.g., in ReorderBufferAbortOld()). And here is a tiny patch which change the same. I have confirmed that the bug is fixed and all regression tests pass. What do you think ? I will start a new thread and try to write a testcase if possible after reaching a consensus. Best regards, Hou zj
Attachment
pgsql-hackers by date: