Thread: Logical WAL sender unresponsive during decoding commit
Hi hackers! Some time ago I've seen a hanging logical replication that was trying to send transaction commit after doing table pg_repack. I understand that those things do not mix well. Yet walsender was ignoring pg_terminate_backend() and I think this worthfixing. Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace? Full session is attaches as file. #0 pfree (pointer=0x561850bbee40) at ./build/../src/backend/utils/mmgr/mcxt.c:1032 #1 0x00005617712530d6 in ReorderBufferReturnTupleBuf (tuple=<optimized out>, rb=<optimized out>) at ./build/../src/backend/replication/logical/reorderbuffer.c:469 #2 ReorderBufferReturnChange (rb=<optimized out>, change=0x561772456048) at ./build/../src/backend/replication/logical/reorderbuffer.c:398 #3 0x0000561771253da1 in ReorderBufferRestoreChanges (rb=rb@entry=0x561771c14e10, txn=0x561771c0b078, file=file@entry=0x561771c15168,segno=segno@entry=0x561771c15178) at ./build/../src/backend/replication/logical/reorderbuffer.c:2570 #4 0x00005617712553ba in ReorderBufferIterTXNNext (state=0x561771c15130, rb=0x561771c14e10) at ./build/../src/backend/replication/logical/reorderbuffer.c:1146 #5 ReorderBufferCommit (rb=0x561771c14e10, xid=xid@entry=2976347782, commit_lsn=79160378448744, end_lsn=<optimized out>,commit_time=commit_time@entry=686095734290578, origin_id=origin_id@entry=0, origin_lsn=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1523 #6 0x000056177124a30a in DecodeCommit (xid=2976347782, parsed=0x7ffc3cb4c240, buf=0x7ffc3cb4c400, ctx=0x561771b10850) at./build/../src/backend/replication/logical/decode.c:640 #7 DecodeXactOp (ctx=0x561771b10850, buf=buf@entry=0x7ffc3cb4c400) at ./build/../src/backend/replication/logical/decode.c:248 #8 0x000056177124a6a9 in LogicalDecodingProcessRecord (ctx=0x561771b10850, record=0x561771b10ae8) at ./build/../src/backend/replication/logical/decode.c:117 #9 0x000056177125d1e5 in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2893 #10 0x000056177125f5f2 in WalSndLoop (send_data=send_data@entry=0x56177125d180 <XLogSendLogical>) at ./build/../src/backend/replication/walsender.c:2242 #11 0x0000561771260125 in StartLogicalReplication (cmd=<optimized out>) at ./build/../src/backend/replication/walsender.c:1179 #12 exec_replication_command (cmd_string=cmd_string@entry=0x561771abe590 "START_REPLICATION SLOT dttsjtaa66crdhbm015h LOGICAL0/0 ( \"include-timestamp\" '1', \"include-types\" '1', \"include-xids\" '1', \"write-in-chunks\" '1', \"add-tables\"'/* sanitized */.claim_audit,public.__consu"...) at ./build/../src/backend/replication/walsender.c:1612 #13 0x00005617712b2334 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x561771b2a438, dbname=<optimized out>, username=<optimizedout>) at ./build/../src/backend/tcop/postgres.c:4267 #14 0x000056177123857c in BackendRun (port=0x561771b0d7a0, port=0x561771b0d7a0) at ./build/../src/backend/postmaster/postmaster.c:4484 #15 BackendStartup (port=0x561771b0d7a0) at ./build/../src/backend/postmaster/postmaster.c:4167 #16 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725 #17 0x000056177123954b in PostmasterMain (argc=9, argv=0x561771ab70e0) at ./build/../src/backend/postmaster/postmaster.c:1398 #18 0x0000561770fae8b6 in main (argc=9, argv=0x561771ab70e0) at ./build/../src/backend/main/main.c:228 What do you think? Thank you! Best regards, Andrey Borodin.
Attachment
On Tue, Aug 16, 2022 at 9:28 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > > Hi hackers! > > Some time ago I've seen a hanging logical replication that was trying to send transaction commit after doing table pg_repack. > I understand that those things do not mix well. Yet walsender was ignoring pg_terminate_backend() and I think this worthfixing. > Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace? > I think if we want to do this in this code path then it may be it is better to add it in ReorderBufferProcessTXN where we are looping to process each change. -- With Regards, Amit Kapila.
On Tue, Aug 16, 2022 at 2:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Aug 16, 2022 at 9:28 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > > > > Hi hackers! > > > > Some time ago I've seen a hanging logical replication that was trying to send transaction commit after doing table pg_repack. > > I understand that those things do not mix well. Yet walsender was ignoring pg_terminate_backend() and I think this worthfixing. > > Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace? > > > > I think if we want to do this in this code path then it may be it is > better to add it in ReorderBufferProcessTXN where we are looping to > process each change. +1 The same issue is recently reported[1] on -bugs and I proposed the patch that adds CHECK_FOR_INTERRUPTS() to the loop in ReorderBufferProcessTXN(). I think it should be backpatched. Regards, [1] https://www.postgresql.org/message-id/CAD21AoD%2BaNfLje%2B9JOqWbTiq1GL4BOp9_f7FxLADm8rS8cDhCQ%40mail.gmail.com -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Tue, Aug 16, 2022 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Aug 16, 2022 at 2:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Aug 16, 2022 at 9:28 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > > > > > > Hi hackers! > > > > > > Some time ago I've seen a hanging logical replication that was trying to send transaction commit after doing tablepg_repack. > > > I understand that those things do not mix well. Yet walsender was ignoring pg_terminate_backend() and I think thisworth fixing. > > > Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace? > > > > > > > I think if we want to do this in this code path then it may be it is > > better to add it in ReorderBufferProcessTXN where we are looping to > > process each change. > > +1 > > The same issue is recently reported[1] on -bugs and I proposed the > patch that adds CHECK_FOR_INTERRUPTS() to the loop in > ReorderBufferProcessTXN(). I think it should be backpatched. > I agree that it is better to backpatch this as well. Would you like to verify if your patch works for all branches or if it need some tweaks? -- With Regards, Amit Kapila.
On Tue, Aug 16, 2022 at 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Aug 16, 2022 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Aug 16, 2022 at 2:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Aug 16, 2022 at 9:28 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > > > > > > > > Hi hackers! > > > > > > > > Some time ago I've seen a hanging logical replication that was trying to send transaction commit after doing tablepg_repack. > > > > I understand that those things do not mix well. Yet walsender was ignoring pg_terminate_backend() and I think thisworth fixing. > > > > Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace? > > > > > > > > > > I think if we want to do this in this code path then it may be it is > > > better to add it in ReorderBufferProcessTXN where we are looping to > > > process each change. > > > > +1 > > > > The same issue is recently reported[1] on -bugs and I proposed the > > patch that adds CHECK_FOR_INTERRUPTS() to the loop in > > ReorderBufferProcessTXN(). I think it should be backpatched. > > > > I agree that it is better to backpatch this as well. Would you like to > verify if your patch works for all branches or if it need some tweaks? > Yes, I've confirmed v10 and master but will do that for other branches and send patches for all supported branches. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
> On 16 Aug 2022, at 10:25, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > The same issue is recently reported[1] on -bugs Oh, I missed that thread. > and I proposed the > patch that adds CHECK_FOR_INTERRUPTS() to the loop in > ReorderBufferProcessTXN(). I agree that it's a good place for check. > I think it should be backpatched. Yes, I think so too. > [1] https://www.postgresql.org/message-id/CAD21AoD%2BaNfLje%2B9JOqWbTiq1GL4BOp9_f7FxLADm8rS8cDhCQ%40mail.gmail.com The patch in this thread looks good to me. Thank you! Best regards, Andrey Borodin.
On Tue, Aug 16, 2022 at 2:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Aug 16, 2022 at 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Aug 16, 2022 at 10:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Aug 16, 2022 at 2:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Tue, Aug 16, 2022 at 9:28 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > > > > > > > > > > Hi hackers! > > > > > > > > > > Some time ago I've seen a hanging logical replication that was trying to send transaction commit after doing tablepg_repack. > > > > > I understand that those things do not mix well. Yet walsender was ignoring pg_terminate_backend() and I think thisworth fixing. > > > > > Can we add CHECK_FOR_INTERRUPTS(); somewhere in this backtrace? > > > > > > > > > > > > > I think if we want to do this in this code path then it may be it is > > > > better to add it in ReorderBufferProcessTXN where we are looping to > > > > process each change. > > > > > > +1 > > > > > > The same issue is recently reported[1] on -bugs and I proposed the > > > patch that adds CHECK_FOR_INTERRUPTS() to the loop in > > > ReorderBufferProcessTXN(). I think it should be backpatched. > > > > > > > I agree that it is better to backpatch this as well. Would you like to > > verify if your patch works for all branches or if it need some tweaks? > > > > Yes, I've confirmed v10 and master but will do that for other branches > and send patches for all supported branches. > I've attached patches for all supported branches. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachment
- REL11_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
- REL15_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
- REL14_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
- REL13_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
- REL12_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
- REL10_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
- master_v2-0001-Add-CHECK_FOR_INTERRUPTS-in-logical-decoding-s-pr.patch
On Tue, Aug 16, 2022 at 2:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've attached patches for all supported branches. > LGTM. I'll push this tomorrow unless there are comments/suggestions. -- With Regards, Amit Kapila.
On Mon, Aug 22, 2022 at 4:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Aug 16, 2022 at 2:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I've attached patches for all supported branches. > > > > LGTM. I'll push this tomorrow unless there are comments/suggestions. > Pushed. -- With Regards, Amit Kapila.
On Tue, Aug 23, 2022 at 4:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Aug 22, 2022 at 4:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Aug 16, 2022 at 2:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've attached patches for all supported branches. > > > > LGTM. I'll push this tomorrow unless there are comments/suggestions. > > Pushed. I think this was a good change, but there's at least one other problem here: within ReorderBufferRestoreChanges, the while (restored < max_changes_in_memory && *segno <= last_segno) doesn't seem to contain a CFI. Note that this can loop either by repeatedly failing to open a file, or by repeatedly reading from a file and passing the data read to ReorderBufferRestoreChange. So I think there should just be a CFI at the top of this loop to make sure both cases are covered. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Oct 20, 2022 at 5:17 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > Pushed. > > I think this was a good change, but there's at least one other problem > here: within ReorderBufferRestoreChanges, the while (restored < > max_changes_in_memory && *segno <= last_segno) doesn't seem to contain > a CFI. Note that this can loop either by repeatedly failing to open a > file, or by repeatedly reading from a file and passing the data read > to ReorderBufferRestoreChange. So I think there should just be a CFI > at the top of this loop to make sure both cases are covered. > Agreed. The failures due to file operations can make this loop unpredictable in terms of time, so it is a good idea to have CFI at the top of this loop. I can take care of this unless there are any objections or you want to do it. We have backpatched the previous similar change, so I think we should backpatch this as well. What do you think? -- With Regards, Amit Kapila.
On Thu, Oct 20, 2022 at 1:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Oct 20, 2022 at 5:17 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > Pushed. > > > > I think this was a good change, but there's at least one other problem > > here: within ReorderBufferRestoreChanges, the while (restored < > > max_changes_in_memory && *segno <= last_segno) doesn't seem to contain > > a CFI. Note that this can loop either by repeatedly failing to open a > > file, or by repeatedly reading from a file and passing the data read > > to ReorderBufferRestoreChange. So I think there should just be a CFI > > at the top of this loop to make sure both cases are covered. > > Agreed. The failures due to file operations can make this loop > unpredictable in terms of time, so it is a good idea to have CFI at > the top of this loop. > > I can take care of this unless there are any objections or you want to > do it. We have backpatched the previous similar change, so I think we > should backpatch this as well. What do you think? Please go ahead. +1 for back-patching. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Oct 20, 2022 at 7:17 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Oct 20, 2022 at 1:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Oct 20, 2022 at 5:17 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > Pushed. > > > > > > I think this was a good change, but there's at least one other problem > > > here: within ReorderBufferRestoreChanges, the while (restored < > > > max_changes_in_memory && *segno <= last_segno) doesn't seem to contain > > > a CFI. Note that this can loop either by repeatedly failing to open a > > > file, or by repeatedly reading from a file and passing the data read > > > to ReorderBufferRestoreChange. So I think there should just be a CFI > > > at the top of this loop to make sure both cases are covered. > > > > Agreed. The failures due to file operations can make this loop > > unpredictable in terms of time, so it is a good idea to have CFI at > > the top of this loop. > > > > I can take care of this unless there are any objections or you want to > > do it. We have backpatched the previous similar change, so I think we > > should backpatch this as well. What do you think? > > Please go ahead. +1 for back-patching. > Yesterday, I have pushed this change. -- With Regards, Amit Kapila.