Thread: Notify downstream to discard the streamed transaction which was aborted due to crash.

Notify downstream to discard the streamed transaction which was aborted due to crash.

From
"houzj.fnst@fujitsu.com"
Date:
Hi,

When developing another feature, I find an existing bug which was reported from Dilip[1].

Currently, it's possible that we only send a streaming block without sending a
end of stream message(stream abort) when decoding and streaming a transaction
that was aborted due to crash because we might not WAL log a XLOG_XACT_ABORT
for such a crashed transaction. This will cause the subscriber(with
streaming=on) create a stream file but won't delete it until the apply
worker restart.

BUG repro(borrowed from Dilip):
---
1. start 2 servers(config: logical_decoding_work_mem=64kB)
./pg_ctl -D data/ -c -l pub_logs start
./pg_ctl -D data1/ -c -l sub_logs start

2. Publisher:
create table t(a int PRIMARY KEY ,b text);
create publication test_pub for table t
with(PUBLISH='insert,delete,update,truncate');
alter table t replica identity FULL ;

3. Subscription Server:
create table t(a int,b text);
create subscription test_sub CONNECTION 'host=localhost port=10000
dbname=postgres' PUBLICATION test_pub WITH ( slot_name =
test_slot_sub1,streaming=on);

4. Publication Server:
begin ;
insert into t values (generate_series(1,50000), 'zzzzzzzzz');  -- (while executing this restart publisher in 2-3 secs)

Restart the publication server, while the transaction is still in an
uncommitted state.
./pg_ctl -D data/ -c -l pub_logs restart -mi
---

After restarting the publisher, we can see the subscriber receive a streaming
block and create a stream file(/base/pgsql_tmp/xxx.fileset).

To fix it, One idea is to send a stream abort message when we are cleaning up
crashed transaction on publisher(e.g. in ReorderBufferAbortOld()). And here is
a tiny patch which changes the same. I have confirmed that the bug is fixed and
all regression tests pass. I didn't add a testcase because we need to make sure
the crash happens before all the WAL logged transactions data are decoded which
doesn't seem easy to write a stable test for this.

Thoughts ?

[1] https://www.postgresql.org/message-id/CAFiTN-sTYk%3Dh75Jn1a7ee%2B5hOcdQFjKGDvF_0NWQQXmoBv4A%2BA%40mail.gmail.com

Best regards,
Hou zj

Attachment
On Fri, Jan 6, 2023 at 9:25 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
>
> To fix it, One idea is to send a stream abort message when we are cleaning up
> crashed transaction on publisher(e.g. in ReorderBufferAbortOld()). And here is
> a tiny patch which changes the same. I have confirmed that the bug is fixed and
> all regression tests pass. I didn't add a testcase because we need to make sure
> the crash happens before all the WAL logged transactions data are decoded which
> doesn't seem easy to write a stable test for this.
>

Your fix looks good to me. Have you tried this in PG-14 where it was introduced?

-- 
With Regards,
Amit Kapila.



RE: Notify downstream to discard the streamed transaction which was aborted due to crash.

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, January 6, 2023 1:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Jan 6, 2023 at 9:25 AM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
> wrote:
> >
> >
> > To fix it, One idea is to send a stream abort message when we are
> > cleaning up crashed transaction on publisher(e.g. in
> > ReorderBufferAbortOld()). And here is a tiny patch which changes the
> > same. I have confirmed that the bug is fixed and all regression tests
> > pass. I didn't add a testcase because we need to make sure the crash
> > happens before all the WAL logged transactions data are decoded which
> doesn't seem easy to write a stable test for this.
> >
> 
> Your fix looks good to me. Have you tried this in PG-14 where it was
> introduced?

Yes, I have confirmed that PG-14 has the same problem and can be fixed after
applying the patch.

Best regards,
Hou zj

On Fri, Jan 6, 2023 at 9:25 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> Hi,
>
> When developing another feature, I find an existing bug which was reported from Dilip[1].
>
> Currently, it's possible that we only send a streaming block without sending a
> end of stream message(stream abort) when decoding and streaming a transaction
> that was aborted due to crash because we might not WAL log a XLOG_XACT_ABORT
> for such a crashed transaction. This will cause the subscriber(with
> streaming=on) create a stream file but won't delete it until the apply
> worker restart.
>
> BUG repro(borrowed from Dilip):
> ---
> 1. start 2 servers(config: logical_decoding_work_mem=64kB)
> ./pg_ctl -D data/ -c -l pub_logs start
> ./pg_ctl -D data1/ -c -l sub_logs start
>
> 2. Publisher:
> create table t(a int PRIMARY KEY ,b text);
> create publication test_pub for table t
> with(PUBLISH='insert,delete,update,truncate');
> alter table t replica identity FULL ;
>
> 3. Subscription Server:
> create table t(a int,b text);
> create subscription test_sub CONNECTION 'host=localhost port=10000
> dbname=postgres' PUBLICATION test_pub WITH ( slot_name =
> test_slot_sub1,streaming=on);
>
> 4. Publication Server:
> begin ;
> insert into t values (generate_series(1,50000), 'zzzzzzzzz');  -- (while executing this restart publisher in 2-3
secs)
>
> Restart the publication server, while the transaction is still in an
> uncommitted state.
> ./pg_ctl -D data/ -c -l pub_logs restart -mi
> ---
>
> After restarting the publisher, we can see the subscriber receive a streaming
> block and create a stream file(/base/pgsql_tmp/xxx.fileset).
>
> To fix it, One idea is to send a stream abort message when we are cleaning up
> crashed transaction on publisher(e.g. in ReorderBufferAbortOld()). And here is
> a tiny patch which changes the same. I have confirmed that the bug is fixed and
> all regression tests pass. I didn't add a testcase because we need to make sure
> the crash happens before all the WAL logged transactions data are decoded which
> doesn't seem easy to write a stable test for this.
>
> Thoughts ?

Fix looks good to me.  Thanks for working on this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Fri, Jan 6, 2023 at 11:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> >
> > To fix it, One idea is to send a stream abort message when we are cleaning up
> > crashed transaction on publisher(e.g. in ReorderBufferAbortOld()). And here is
> > a tiny patch which changes the same. I have confirmed that the bug is fixed and
> > all regression tests pass. I didn't add a testcase because we need to make sure
> > the crash happens before all the WAL logged transactions data are decoded which
> > doesn't seem easy to write a stable test for this.
> >
> > Thoughts ?
>
> Fix looks good to me.  Thanks for working on this.
>

Pushed!

-- 
With Regards,
Amit Kapila.