Thread: Logical WAL sender unresponsive during decoding commit

Logical WAL sender unresponsive during decoding commit

From
Andrey Borodin
Date:
Hi hackers!

Some time ago I've seen a hanging logical replication that was trying to send transaction commit after doing table
pg_repack.
I understand that those things do not mix well. Yet walsender was ignoring pg_terminate_backend() and I think this
worthfixing. 
Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace?  Full session is attaches as file.

#0  pfree (pointer=0x561850bbee40) at ./build/../src/backend/utils/mmgr/mcxt.c:1032
#1  0x00005617712530d6 in ReorderBufferReturnTupleBuf (tuple=<optimized out>, rb=<optimized out>) at
./build/../src/backend/replication/logical/reorderbuffer.c:469
#2  ReorderBufferReturnChange (rb=<optimized out>, change=0x561772456048) at
./build/../src/backend/replication/logical/reorderbuffer.c:398
#3  0x0000561771253da1 in ReorderBufferRestoreChanges (rb=rb@entry=0x561771c14e10, txn=0x561771c0b078,
file=file@entry=0x561771c15168,segno=segno@entry=0x561771c15178) at
./build/../src/backend/replication/logical/reorderbuffer.c:2570
#4  0x00005617712553ba in ReorderBufferIterTXNNext (state=0x561771c15130, rb=0x561771c14e10) at
./build/../src/backend/replication/logical/reorderbuffer.c:1146
#5  ReorderBufferCommit (rb=0x561771c14e10, xid=xid@entry=2976347782, commit_lsn=79160378448744, end_lsn=<optimized
out>,commit_time=commit_time@entry=686095734290578, origin_id=origin_id@entry=0, origin_lsn=0) at
./build/../src/backend/replication/logical/reorderbuffer.c:1523
#6  0x000056177124a30a in DecodeCommit (xid=2976347782, parsed=0x7ffc3cb4c240, buf=0x7ffc3cb4c400, ctx=0x561771b10850)
at./build/../src/backend/replication/logical/decode.c:640 
#7  DecodeXactOp (ctx=0x561771b10850, buf=buf@entry=0x7ffc3cb4c400) at
./build/../src/backend/replication/logical/decode.c:248
#8  0x000056177124a6a9 in LogicalDecodingProcessRecord (ctx=0x561771b10850, record=0x561771b10ae8) at
./build/../src/backend/replication/logical/decode.c:117
#9  0x000056177125d1e5 in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2893
#10 0x000056177125f5f2 in WalSndLoop (send_data=send_data@entry=0x56177125d180 <XLogSendLogical>) at
./build/../src/backend/replication/walsender.c:2242
#11 0x0000561771260125 in StartLogicalReplication (cmd=<optimized out>) at
./build/../src/backend/replication/walsender.c:1179
#12 exec_replication_command (cmd_string=cmd_string@entry=0x561771abe590 "START_REPLICATION SLOT dttsjtaa66crdhbm015h
LOGICAL0/0 ( \"include-timestamp\" '1', \"include-types\" '1', \"include-xids\" '1', \"write-in-chunks\" '1',
\"add-tables\"'/* sanitized */.claim_audit,public.__consu"...) at ./build/../src/backend/replication/walsender.c:1612 
#13 0x00005617712b2334 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x561771b2a438, dbname=<optimized out>,
username=<optimizedout>) at ./build/../src/backend/tcop/postgres.c:4267 
#14 0x000056177123857c in BackendRun (port=0x561771b0d7a0, port=0x561771b0d7a0) at
./build/../src/backend/postmaster/postmaster.c:4484
#15 BackendStartup (port=0x561771b0d7a0) at ./build/../src/backend/postmaster/postmaster.c:4167
#16 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#17 0x000056177123954b in PostmasterMain (argc=9, argv=0x561771ab70e0) at
./build/../src/backend/postmaster/postmaster.c:1398
#18 0x0000561770fae8b6 in main (argc=9, argv=0x561771ab70e0) at ./build/../src/backend/main/main.c:228

What do you think?

Thank you!


Best regards, Andrey Borodin.

Attachment

Re: Logical WAL sender unresponsive during decoding commit

From
Amit Kapila
Date:
On Tue, Aug 16, 2022 at 9:28 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers!
>
> Some time ago I've seen a hanging logical replication that was trying to send transaction commit after doing table
pg_repack.
> I understand that those things do not mix well. Yet walsender was ignoring pg_terminate_backend() and I think this
worthfixing.
 
> Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace?
>

I think if we want to do this in this code path then it may be it is
better to add it in ReorderBufferProcessTXN where we are looping to
process each change.

-- 
With Regards,
Amit Kapila.



Re: Logical WAL sender unresponsive during decoding commit

From
Masahiko Sawada
Date:
On Tue, Aug 16, 2022 at 2:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 16, 2022 at 9:28 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > Hi hackers!
> >
> > Some time ago I've seen a hanging logical replication that was trying to send transaction commit after doing table
pg_repack.
> > I understand that those things do not mix well. Yet walsender was ignoring pg_terminate_backend() and I think this
worthfixing.
 
> > Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace?
> >
>
> I think if we want to do this in this code path then it may be it is
> better to add it in ReorderBufferProcessTXN where we are looping to
> process each change.

+1

The same issue is recently reported[1] on -bugs and I proposed the
patch that adds CHECK_FOR_INTERRUPTS() to the loop in
ReorderBufferProcessTXN(). I think it should be backpatched.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoD%2BaNfLje%2B9JOqWbTiq1GL4BOp9_f7FxLADm8rS8cDhCQ%40mail.gmail.com

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Logical WAL sender unresponsive during decoding commit

From
Amit Kapila
Date:
On Tue, Aug 16, 2022 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Aug 16, 2022 at 2:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Aug 16, 2022 at 9:28 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > >
> > > Hi hackers!
> > >
> > > Some time ago I've seen a hanging logical replication that was trying to send transaction commit after doing
tablepg_repack.
 
> > > I understand that those things do not mix well. Yet walsender was ignoring pg_terminate_backend() and I think
thisworth fixing.
 
> > > Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace?
> > >
> >
> > I think if we want to do this in this code path then it may be it is
> > better to add it in ReorderBufferProcessTXN where we are looping to
> > process each change.
>
> +1
>
> The same issue is recently reported[1] on -bugs and I proposed the
> patch that adds CHECK_FOR_INTERRUPTS() to the loop in
> ReorderBufferProcessTXN(). I think it should be backpatched.
>

I agree that it is better to backpatch this as well. Would you like to
verify if your patch works for all branches or if it need some tweaks?

-- 
With Regards,
Amit Kapila.



Re: Logical WAL sender unresponsive during decoding commit

From
Masahiko Sawada
Date:
On Tue, Aug 16, 2022 at 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 16, 2022 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Aug 16, 2022 at 2:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Aug 16, 2022 at 9:28 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > > >
> > > > Hi hackers!
> > > >
> > > > Some time ago I've seen a hanging logical replication that was trying to send transaction commit after doing
tablepg_repack.
 
> > > > I understand that those things do not mix well. Yet walsender was ignoring pg_terminate_backend() and I think
thisworth fixing.
 
> > > > Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace?
> > > >
> > >
> > > I think if we want to do this in this code path then it may be it is
> > > better to add it in ReorderBufferProcessTXN where we are looping to
> > > process each change.
> >
> > +1
> >
> > The same issue is recently reported[1] on -bugs and I proposed the
> > patch that adds CHECK_FOR_INTERRUPTS() to the loop in
> > ReorderBufferProcessTXN(). I think it should be backpatched.
> >
>
> I agree that it is better to backpatch this as well. Would you like to
> verify if your patch works for all branches or if it need some tweaks?
>

Yes, I've confirmed v10 and master but will do that for other branches
and send patches for all supported branches.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Logical WAL sender unresponsive during decoding commit

From
Andrey Borodin
Date:

> On 16 Aug 2022, at 10:25, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> The same issue is recently reported[1] on -bugs
Oh, I missed that thread.

> and I proposed the
> patch that adds CHECK_FOR_INTERRUPTS() to the loop in
> ReorderBufferProcessTXN().
I agree that it's a good place for check.

> I think it should be backpatched.
Yes, I think so too.

> [1] https://www.postgresql.org/message-id/CAD21AoD%2BaNfLje%2B9JOqWbTiq1GL4BOp9_f7FxLADm8rS8cDhCQ%40mail.gmail.com

The patch in this thread looks good to me.


Thank you!

Best regards, Andrey Borodin.


Re: Logical WAL sender unresponsive during decoding commit

From
Masahiko Sawada
Date:
On Tue, Aug 16, 2022 at 2:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Aug 16, 2022 at 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Aug 16, 2022 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Aug 16, 2022 at 2:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Tue, Aug 16, 2022 at 9:28 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > > > >
> > > > > Hi hackers!
> > > > >
> > > > > Some time ago I've seen a hanging logical replication that was trying to send transaction commit after doing
tablepg_repack.
 
> > > > > I understand that those things do not mix well. Yet walsender was ignoring pg_terminate_backend() and I think
thisworth fixing.
 
> > > > > Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace?
> > > > >
> > > >
> > > > I think if we want to do this in this code path then it may be it is
> > > > better to add it in ReorderBufferProcessTXN where we are looping to
> > > > process each change.
> > >
> > > +1
> > >
> > > The same issue is recently reported[1] on -bugs and I proposed the
> > > patch that adds CHECK_FOR_INTERRUPTS() to the loop in
> > > ReorderBufferProcessTXN(). I think it should be backpatched.
> > >
> >
> > I agree that it is better to backpatch this as well. Would you like to
> > verify if your patch works for all branches or if it need some tweaks?
> >
>
> Yes, I've confirmed v10 and master but will do that for other branches
> and send patches for all supported branches.
>

I've attached patches for all supported branches.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

Re: Logical WAL sender unresponsive during decoding commit

From
Amit Kapila
Date:
On Tue, Aug 16, 2022 at 2:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've attached patches for all supported branches.
>

LGTM. I'll push this tomorrow unless there are comments/suggestions.

-- 
With Regards,
Amit Kapila.



Re: Logical WAL sender unresponsive during decoding commit

From
Amit Kapila
Date:
On Mon, Aug 22, 2022 at 4:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 16, 2022 at 2:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I've attached patches for all supported branches.
> >
>
> LGTM. I'll push this tomorrow unless there are comments/suggestions.
>

Pushed.

-- 
With Regards,
Amit Kapila.



Re: Logical WAL sender unresponsive during decoding commit

From
Robert Haas
Date:
On Tue, Aug 23, 2022 at 4:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Aug 22, 2022 at 4:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Aug 16, 2022 at 2:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > I've attached patches for all supported branches.
> >
> > LGTM. I'll push this tomorrow unless there are comments/suggestions.
>
> Pushed.

I think this was a good change, but there's at least one other problem
here: within ReorderBufferRestoreChanges, the while (restored <
max_changes_in_memory && *segno <= last_segno) doesn't seem to contain
a CFI. Note that this can loop either by repeatedly failing to open a
file, or by repeatedly reading from a file and passing the data read
to ReorderBufferRestoreChange. So I think there should just be a CFI
at the top of this loop to make sure both cases are covered.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Logical WAL sender unresponsive during decoding commit

From
Amit Kapila
Date:
On Thu, Oct 20, 2022 at 5:17 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > Pushed.
>
> I think this was a good change, but there's at least one other problem
> here: within ReorderBufferRestoreChanges, the while (restored <
> max_changes_in_memory && *segno <= last_segno) doesn't seem to contain
> a CFI. Note that this can loop either by repeatedly failing to open a
> file, or by repeatedly reading from a file and passing the data read
> to ReorderBufferRestoreChange. So I think there should just be a CFI
> at the top of this loop to make sure both cases are covered.
>

Agreed. The failures due to file operations can make this loop
unpredictable in terms of time, so it is a good idea to have CFI at
the top of this loop.

I can take care of this unless there are any objections or you want to
do it. We have backpatched the previous similar change, so I think we
should backpatch this as well. What do you think?

-- 
With Regards,
Amit Kapila.



Re: Logical WAL sender unresponsive during decoding commit

From
Robert Haas
Date:
On Thu, Oct 20, 2022 at 1:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Oct 20, 2022 at 5:17 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > Pushed.
> >
> > I think this was a good change, but there's at least one other problem
> > here: within ReorderBufferRestoreChanges, the while (restored <
> > max_changes_in_memory && *segno <= last_segno) doesn't seem to contain
> > a CFI. Note that this can loop either by repeatedly failing to open a
> > file, or by repeatedly reading from a file and passing the data read
> > to ReorderBufferRestoreChange. So I think there should just be a CFI
> > at the top of this loop to make sure both cases are covered.
>
> Agreed. The failures due to file operations can make this loop
> unpredictable in terms of time, so it is a good idea to have CFI at
> the top of this loop.
>
> I can take care of this unless there are any objections or you want to
> do it. We have backpatched the previous similar change, so I think we
> should backpatch this as well. What do you think?

Please go ahead. +1 for back-patching.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Logical WAL sender unresponsive during decoding commit

From
Amit Kapila
Date:
On Thu, Oct 20, 2022 at 7:17 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Oct 20, 2022 at 1:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Oct 20, 2022 at 5:17 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > > Pushed.
> > >
> > > I think this was a good change, but there's at least one other problem
> > > here: within ReorderBufferRestoreChanges, the while (restored <
> > > max_changes_in_memory && *segno <= last_segno) doesn't seem to contain
> > > a CFI. Note that this can loop either by repeatedly failing to open a
> > > file, or by repeatedly reading from a file and passing the data read
> > > to ReorderBufferRestoreChange. So I think there should just be a CFI
> > > at the top of this loop to make sure both cases are covered.
> >
> > Agreed. The failures due to file operations can make this loop
> > unpredictable in terms of time, so it is a good idea to have CFI at
> > the top of this loop.
> >
> > I can take care of this unless there are any objections or you want to
> > do it. We have backpatched the previous similar change, so I think we
> > should backpatch this as well. What do you think?
>
> Please go ahead. +1 for back-patching.
>

Yesterday, I have pushed this change.

-- 
With Regards,
Amit Kapila.