Thread: walsender vs. XLogBackgroundFlush during shutdown
Hi,

I see there's an ongoing discussion about race conditions in walreceiver blocking shutdown, so let me start a new thread about a race condition in walsender during shutdown.

The symptoms are rather simple - 'pg_ctl -m fast shutdown' gets stuck, waiting for walsender processes to catch up and terminate indefinitely.

The reason for that is pretty simple - the walsenders are doing logical decoding, and during shutdown they're waiting for WalSndCaughtUp=true. But per XLogSendLogical() this only happens if

    if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
    {
        WalSndCaughtUp = true;
        ...
    }

That is, we need to get beyond GetFlushRecPtr(). But that may never happen, because there may be an incomplete (and unflushed) page in WAL buffers, not forced out by any transaction. So if there's a WAL record overflowing to the unflushed page, the walsender will never catch up.

Now, this situation is apparently expected, because WalSndWaitForWal() does this:

    /*
     * If we're shutting down, trigger pending WAL to be written out,
     * otherwise we'd possibly end up waiting for WAL that never gets
     * written, because walwriter has shut down already.
     */
    if (got_STOPPING)
        XLogBackgroundFlush();

but unfortunately that does not actually do anything, because right at the very beginning XLogBackgroundFlush() does this:

    /* back off to last completed page boundary */
    WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;

That is, it intentionally ignores the incomplete page, which means the walsender can't read the record and reach GetFlushRecPtr().

XLogBackgroundFlush() has been doing this since (at least) 2007, so how come we never had issues with this before?

Firstly, walsenders used for physical replication don't have this issue, because they don't need to decode the WAL record overflowing to the incomplete/unflushed page. So it seems only walsenders used for logical decoding are vulnerable to this.

Secondly, this seems to happen only when a bit of WAL is generated at just the right moment during shutdown. I initially ran into this issue with the failover slots (i.e. the patch that was committed and reverted in the 9.6 cycle, IIRC), which does exactly this - it writes the slot positions into WAL during shutdown - which made it pretty trivial to trigger the issue.

But I don't think we're safe without the failover slots patch, because any output plugin can do something similar - say, LogLogicalMessage() or something like that. I'm not aware of a plugin doing such things, but I don't think it's illegal / prohibited either. (Of course, plugins that generate WAL won't be useful for decoding on standby in the future.)

So what I think we should do is to tweak XLogBackgroundFlush() so that during shutdown it skips the back-off to page boundary, and flushes even the last piece of WAL. There are only two callers anyway, so something like XLogBackgroundFlush(bool) would be simple enough.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
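For illustration, a small self-contained toy model of the situation described above - this is not PostgreSQL code; LSNs are treated as plain byte offsets, the values are invented, and only XLOG_BLCKSZ matches the default 8 kB - mimicking the page-boundary back-off in XLogBackgroundFlush() and the catch-up test in XLogSendLogical():

    /*
     * Toy model of the stuck-walsender scenario. Not PostgreSQL code:
     * LSNs are plain byte offsets and the numbers are made up.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define XLOG_BLCKSZ 8192

    int
    main(void)
    {
        /* WAL insertion stopped 100 bytes into the fourth page */
        uint64_t insert_ptr = 3 * XLOG_BLCKSZ + 100;

        /* XLogBackgroundFlush() backs off to the last completed page boundary */
        uint64_t flush_ptr = insert_ptr - insert_ptr % XLOG_BLCKSZ;

        /* a record starting on a flushed page but spilling onto the unflushed one */
        uint64_t record_start = flush_ptr - 40;
        uint64_t record_end = flush_ptr + 60;

        /*
         * The reader cannot decode a record whose end lies beyond the flushed
         * WAL, so EndRecPtr stays at the record start and never reaches the
         * flush pointer - WalSndCaughtUp is never set.
         */
        uint64_t end_rec_ptr = (record_end <= flush_ptr) ? record_end : record_start;

        printf("flush=%llu EndRecPtr=%llu caught up: %s\n",
               (unsigned long long) flush_ptr,
               (unsigned long long) end_rec_ptr,
               end_rec_ptr >= flush_ptr ? "yes" : "no (stuck)");
        return 0;
    }

Flushing up to insert_ptr instead - which is what the proposed XLogBackgroundFlush(bool) variant would do during shutdown - makes the catch-up test succeed.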
Hi Tomas,

On Wed, 1 May 2019 at 02:28, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> I see there's an ongoing discussion about race conditions in walreceiver
> blocking shutdown, so let me start a new thread about a race condition in
> walsender during shutdown.
>
> The symptoms are rather simple - 'pg_ctl -m fast shutdown' gets stuck,
> waiting for walsender processes to catch up and terminate indefinitely.

I can confirm - during the past couple of years we have observed such a problem a few times, and it is really annoying.

> The reason for that is pretty simple - the walsenders are doing logical
> decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
> But per XLogSendLogical() this only happens if
>
>     if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
>     {
>         WalSndCaughtUp = true;
>         ...
>     }

After a couple of days of investigating and debugging I came to a slightly different conclusion: WalSndCaughtUp is set to true in my case.

Since I am mostly a Python person, I decided to use psycopg2 for my investigation. I took an example from http://initd.org/psycopg/docs/advanced.html#logical-replication-quick-start as a starting point, created a slot and started the script. I wasn't running any transactions, therefore DemoConsumer.__call__ was never executed and cursor.send_feedback(flush_lsn=msg.data_start) was never called either. Basically, all the psycopg2 internals were doing was periodically sending keepalive messages, or replying to keepalives sent by postgres. In the postgres debug log they are visible as:

2019-05-01 12:58:32.785 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0

If you try to do a fast shutdown of postgres while the script is running, it will never finish, and in the postgres log you will get an indefinite stream of messages:

2019-05-01 13:00:02.880 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.880 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.880 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.880 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0

Actually, the same problem happens even when the consumer script does receive some messages, just not very frequently - it is only a bit harder to reproduce.
If you attach to the walsender with gdb, you'll see the following picture:

(gdb) bt
#0  0x00007fd6623d296a in __libc_send (fd=8, buf=0x55cb958dca08, len=94, flags=0) at ../sysdeps/unix/sysv/linux/send.c:28
#1  0x000055cb93aa7ce9 in secure_raw_write (port=0x55cb958d71e0, ptr=0x55cb958dca08, len=94) at be-secure.c:318
#2  0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0, ptr=0x55cb958dca08, len=94) at be-secure.c:265
#3  0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433
#4  0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409
#5  0x000055cb93dac30b in send_message_to_frontend (edata=0x55cb942b4380 <errordata>) at elog.c:3317
#6  0x000055cb93da8973 in EmitErrorReport () at elog.c:1481
#7  0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481
#8  0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3 "sending replication keepalive") at elog.c:1376
#9  0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at walsender.c:3358
#10 0x000055cb93bca062 in WalSndDone (send_data=0x55cb93bc9e29 <XLogSendLogical>) at walsender.c:2872
#11 0x000055cb93bc9155 in WalSndLoop (send_data=0x55cb93bc9e29 <XLogSendLogical>) at walsender.c:2194
#12 0x000055cb93bc7b11 in StartLogicalReplication (cmd=0x55cb95931cc0) at walsender.c:1109
#13 0x000055cb93bc83d6 in exec_replication_command (cmd_string=0x55cb958b2360 "START_REPLICATION SLOT \"test\" LOGICAL 0/00000000") at walsender.c:1541
#14 0x000055cb93c31653 in PostgresMain (argc=1, argv=0x55cb958deb68, dbname=0x55cb958deb48 "postgres", username=0x55cb958deb28 "akukushkin") at postgres.c:4178
#15 0x000055cb93b95185 in BackendRun (port=0x55cb958d71e0) at postmaster.c:4361
#16 0x000055cb93b94824 in BackendStartup (port=0x55cb958d71e0) at postmaster.c:4033
#17 0x000055cb93b90ccd in ServerLoop () at postmaster.c:1706
#18 0x000055cb93b90463 in PostmasterMain (argc=3, argv=0x55cb958ac710) at postmaster.c:1379
#19 0x000055cb93abb08e in main (argc=3, argv=0x55cb958ac710) at main.c:228

(gdb) f 10
#10 0x000055cb93bca062 in WalSndDone (send_data=0x55cb93bc9e29 <XLogSendLogical>) at walsender.c:2872
2872                            WalSndKeepalive(true);
(gdb) p WalSndCaughtUp
$1 = true
(gdb) p *MyWalSnd
$2 = {pid = 21845, state = WALSNDSTATE_STREAMING, sentPtr = 23586168, needreload = false, write = 0, flush = 23586112, apply = 0, writeLag = -1, flushLag = -1, applyLag = -1, mutex = 0 '\000', latch = 0x7fd66096b594, sync_standby_priority = 0}

As you can see, the value of WalSndCaughtUp is set to true! The shutdown never finishes because the value of sentPtr is higher than the values of MyWalSnd->flush or MyWalSnd->write:

(gdb) l 2858
2853            /*
2854             * To figure out whether all WAL has successfully been replicated, check
2855             * flush location if valid, write otherwise. Tools like pg_receivewal will
2856             * usually (unless in synchronous mode) return an invalid flush location.
2857             */
2858            replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
2859                    MyWalSnd->write : MyWalSnd->flush;
2860
2861            if (WalSndCaughtUp && sentPtr == replicatedPtr &&
2862                    !pq_is_send_pending())
2863            {
2864                    /* Inform the standby that XLOG streaming is done */
2865                    EndCommand("COPY 0", DestRemote);
2866                    pq_flush();
2867
2868                    proc_exit(0);
2869            }

What is more interesting: if one uses pg_recvlogical, it is not possible to reproduce the issue. That is because pg_recvlogical responds to keepalive messages with a flush location equal to the walEnd it received in the keepalive. As the next logical step I tried to do the same in Python with psycopg2.
Unfortunately, the keepalive functionality is hidden in the C code and it is not possible to change it without recompiling psycopg2, but there is an asynchronous interface available: http://initd.org/psycopg/docs/extras.html#psycopg2.extras.ReplicationCursor.wal_end. I had to make just one minor adjustment:

         try:
             sel = select([cur], [], [], max(0, timeout))
             if not any(sel):
-                cur.send_feedback()  # timed out, send keepalive message
+                cur.send_feedback(flush_lsn=cur.wal_end)  # timed out, send keepalive message
         except InterruptedError:
             pass  # recalculate timeout and continue

wal_end is a property of the cursor object, and it is updated not only for every message received, but also from keepalive messages. Basically, this little change made the Python example behave very similarly to pg_recvlogical.

All the text above probably looks like a brain dump, but I don't think it conflicts with Tomas's findings - rather, it complements them. I am very glad that now I know how to mitigate the problem on the client side, but IMHO it is also very important to fix the server behavior if at all possible.

Regards,

--
Alexander Kukushkin
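For illustration, a minimal libpq-level sketch of the client behaviour described above - replying to a primary keepalive with the flush position set to the walEnd it carries. This is not psycopg2 or pg_recvlogical code: the helper names are invented, error handling is trimmed, and the connection is assumed to already be in COPY BOTH mode after START_REPLICATION.

    #include <stdint.h>
    #include <time.h>
    #include <libpq-fe.h>

    /* big-endian helpers for the replication protocol's Int64 fields */
    static void
    put_be64(char *p, uint64_t v)
    {
        for (int i = 0; i < 8; i++)
            p[i] = (char) (v >> (56 - 8 * i));
    }

    static uint64_t
    get_be64(const char *p)
    {
        uint64_t v = 0;
        for (int i = 0; i < 8; i++)
            v = (v << 8) | (uint8_t) p[i];
        return v;
    }

    /* microseconds since the PostgreSQL epoch (2000-01-01); coarse is fine here */
    static uint64_t
    now_us(void)
    {
        return ((uint64_t) time(NULL) - 946684800ULL) * 1000000ULL;
    }

    /*
     * Handle one CopyData message received with PQgetCopyData(). On a
     * primary keepalive ('k'), send a standby status update ('r') whose
     * write and flush positions are the keepalive's walEnd.
     */
    int
    reply_to_keepalive(PGconn *conn, const char *buf, int len)
    {
        if (len >= 18 && buf[0] == 'k')
        {
            uint64_t wal_end = get_be64(buf + 1);
            char     reply[1 + 4 * 8 + 1];

            reply[0] = 'r';
            put_be64(reply + 1, wal_end);   /* written up to walEnd */
            put_be64(reply + 9, wal_end);   /* flushed up to walEnd */
            put_be64(reply + 17, 0);        /* apply position: not tracked here */
            put_be64(reply + 25, now_us());
            reply[33] = 0;                  /* don't request an immediate reply */

            if (PQputCopyData(conn, reply, sizeof(reply)) != 1 || PQflush(conn) != 0)
                return -1;
        }
        /* 'w' (XLogData) messages would be confirmed here once processed */
        return 0;
    }

In the psycopg2 example above, cur.send_feedback(flush_lsn=cur.wal_end) achieves the same effect without touching the C internals.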
On Wed, May 01, 2019 at 02:13:10PM +0200, Alexander Kukushkin wrote:
>Hi Tomas,
>
>On Wed, 1 May 2019 at 02:28, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>> I see there's an ongoing discussion about race conditions in walreceiver
>> blocking shutdown, so let me start a new thread about a race condition in
>> walsender during shutdown.
>>
>> The symptoms are rather simple - 'pg_ctl -m fast shutdown' gets stuck,
>> waiting for walsender processes to catch up and terminate indefinitely.
>
>I can confirm - during the past couple of years we have observed such a
>problem a few times, and it is really annoying.
>
>> The reason for that is pretty simple - the walsenders are doing logical
>> decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
>> But per XLogSendLogical() this only happens if
>>
>>     if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
>>     {
>>         WalSndCaughtUp = true;
>>         ...
>>     }
>
>After a couple of days of investigating and debugging I came to a
>slightly different conclusion: WalSndCaughtUp is set to true in my
>case.
>Since I am mostly a Python person, I decided to use psycopg2 for my
>investigation. I took an example from
>http://initd.org/psycopg/docs/advanced.html#logical-replication-quick-start
>as a starting point, created a slot and started the script.
>I wasn't running any transactions, therefore DemoConsumer.__call__
>was never executed and cursor.send_feedback(flush_lsn=msg.data_start)
>was never called either. Basically, all the psycopg2 internals were
>doing was periodically sending keepalive messages, or replying to
>keepalives sent by postgres.

OK, so that seems like a separate issue, somewhat unrelated to the issue I reported. And I'm not sure it's a walsender issue - it seems it might be a psycopg2 issue in not reporting the flush properly, no?

>Actually, the same problem happens even when the consumer script does
>receive some messages, just not very frequently - it is only a bit
>harder to reproduce.
>
>If you attach to the walsender with gdb, you'll see the following picture:
>(gdb) bt
>#0 0x00007fd6623d296a in __libc_send (fd=8, buf=0x55cb958dca08, len=94, flags=0) at ../sysdeps/unix/sysv/linux/send.c:28
>#1 0x000055cb93aa7ce9 in secure_raw_write (port=0x55cb958d71e0, ptr=0x55cb958dca08, len=94) at be-secure.c:318
>#2 0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0, ptr=0x55cb958dca08, len=94) at be-secure.c:265
>#3 0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433
>#4 0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409
>#5 0x000055cb93dac30b in send_message_to_frontend (edata=0x55cb942b4380 <errordata>) at elog.c:3317
>#6 0x000055cb93da8973 in EmitErrorReport () at elog.c:1481
>#7 0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481
>#8 0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3 "sending replication keepalive") at elog.c:1376
>#9 0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at walsender.c:3358

Is it stuck in the send() call forever, or did you happen to grab this backtrace?

>All the text above probably looks like a brain dump, but I don't think
>it conflicts with Tomas's findings - rather, it complements them.
>I am very glad that now I know how to mitigate the problem on the
>client side, but IMHO it is also very important to fix the server
>behavior if at all possible.

I think having a report of an issue, with a way to reproduce it, is a first (and quite important) step. So thanks for doing that.
That being said, I think those are two separate issues, with different causes and likely different fixes. I don't think fixing the xlog flush will resolve your issue, and vice versa.

FWIW, attached is a patch that I used to reliably trigger the xlog flush issue - it simply adds a LogLogicalMessage() call to the commit handler in the built-in output plugin. So all you need to do is create a subscription, start generating commits, and trigger a restart. The message is 8kB, so it's definitely long enough to overflow to the next xlog page.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
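For reference, a sketch of roughly what such a reproduction patch looks like (the actual attachment is not reproduced here, so treat the details as assumptions): a ~8 kB non-transactional logical message emitted from the commit callback of the built-in output plugin in src/backend/replication/pgoutput/pgoutput.c. The prefix string is made up, and the callback signature matches the PostgreSQL 11/12-era sources.

    #include "postgres.h"

    #include "replication/logical.h"
    #include "replication/message.h"

    /* Sketch only: the rest of pgoutput.c is unchanged. */
    static void
    pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                        XLogRecPtr commit_lsn)
    {
        char    payload[8192];

        /* ... the original commit handling (sending the COMMIT message) ... */

        /*
         * Emit a non-transactional 8 kB message, so a WAL record reliably
         * spills past the current XLOG_BLCKSZ page boundary, leaving an
         * incomplete page that XLogBackgroundFlush() refuses to flush.
         */
        memset(payload, 'x', sizeof(payload));
        LogLogicalMessage("shutdown-test", payload, sizeof(payload), false);
    }

Per the description above, generating commits with this in place and then triggering a restart is enough to reproduce the hang.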
Hi,

On 2019-05-01 02:28:45 +0200, Tomas Vondra wrote:
> The reason for that is pretty simple - the walsenders are doing logical
> decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
> But per XLogSendLogical() this only happens if
>
>     if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
>     {
>         WalSndCaughtUp = true;
>         ...
>     }
>
> That is, we need to get beyond GetFlushRecPtr(). But that may never
> happen, because there may be an incomplete (and unflushed) page in WAL
> buffers, not forced out by any transaction. So if there's a WAL record
> overflowing to the unflushed page, the walsender will never catch up.
>
> Now, this situation is apparently expected, because WalSndWaitForWal()
> does this:
>
>     /*
>      * If we're shutting down, trigger pending WAL to be written out,
>      * otherwise we'd possibly end up waiting for WAL that never gets
>      * written, because walwriter has shut down already.
>      */
>     if (got_STOPPING)
>         XLogBackgroundFlush();
>
> but unfortunately that does not actually do anything, because right at
> the very beginning XLogBackgroundFlush() does this:
>
>     /* back off to last completed page boundary */
>     WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
>
> That is, it intentionally ignores the incomplete page, which means the
> walsender can't read the record and reach GetFlushRecPtr().
>
> XLogBackgroundFlush() has been doing this since (at least) 2007, so how
> come we never had issues with this before?

I assume that's because of the following logic:

    /* if we have already flushed that far, consider async commit records */
    if (WriteRqst.Write <= LogwrtResult.Flush)
    {
        SpinLockAcquire(&XLogCtl->info_lck);
        WriteRqst.Write = XLogCtl->asyncXactLSN;
        SpinLockRelease(&XLogCtl->info_lck);
        flexible = false;   /* ensure it all gets written */
    }

and various pieces of the code doing XLogSetAsyncXactLSN() to force flushing. I wonder if the issue is that we're better at avoiding writing unnecessary WAL due to 6ef2eba3f57f17960b7cd4958e18aa79e357de2f.

> But I don't think we're safe without the failover slots patch, because
> any output plugin can do something similar - say, LogLogicalMessage() or
> something like that. I'm not aware of a plugin doing such things, but I
> don't think it's illegal / prohibited either. (Of course, plugins that
> generate WAL won't be useful for decoding on standby in the future.)

FWIW, I'd consider such an output plugin outright broken.

> So what I think we should do is to tweak XLogBackgroundFlush() so that
> during shutdown it skips the back-off to page boundary, and flushes even
> the last piece of WAL. There are only two callers anyway, so something
> like XLogBackgroundFlush(bool) would be simple enough.

I think it then just ought to be a normal XLogFlush(). I.e. something along the lines of:

    /*
     * If we're shutting down, trigger pending WAL to be written out,
     * otherwise we'd possibly end up waiting for WAL that never gets
     * written, because walwriter has shut down already.
     */
    if (got_STOPPING && !RecoveryInProgress())
        XLogFlush(GetXLogInsertRecPtr());

    /* Update our idea of the currently flushed position. */
    if (!RecoveryInProgress())
        RecentFlushPtr = GetFlushRecPtr();
    else
        RecentFlushPtr = GetXLogReplayRecPtr(NULL);

Greetings,

Andres Freund
On Wed, May 01, 2019 at 08:53:15AM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-05-01 02:28:45 +0200, Tomas Vondra wrote:
>> The reason for that is pretty simple - the walsenders are doing logical
>> decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
>> But per XLogSendLogical() this only happens if
>>
>>     if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
>>     {
>>         WalSndCaughtUp = true;
>>         ...
>>     }
>>
>> That is, we need to get beyond GetFlushRecPtr(). But that may never
>> happen, because there may be an incomplete (and unflushed) page in WAL
>> buffers, not forced out by any transaction. So if there's a WAL record
>> overflowing to the unflushed page, the walsender will never catch up.
>>
>> Now, this situation is apparently expected, because WalSndWaitForWal()
>> does this:
>>
>>     /*
>>      * If we're shutting down, trigger pending WAL to be written out,
>>      * otherwise we'd possibly end up waiting for WAL that never gets
>>      * written, because walwriter has shut down already.
>>      */
>>     if (got_STOPPING)
>>         XLogBackgroundFlush();
>>
>> but unfortunately that does not actually do anything, because right at
>> the very beginning XLogBackgroundFlush() does this:
>>
>>     /* back off to last completed page boundary */
>>     WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
>>
>> That is, it intentionally ignores the incomplete page, which means the
>> walsender can't read the record and reach GetFlushRecPtr().
>>
>> XLogBackgroundFlush() has been doing this since (at least) 2007, so how
>> come we never had issues with this before?
>
>I assume that's because of the following logic:
>
>    /* if we have already flushed that far, consider async commit records */
>    if (WriteRqst.Write <= LogwrtResult.Flush)
>    {
>        SpinLockAcquire(&XLogCtl->info_lck);
>        WriteRqst.Write = XLogCtl->asyncXactLSN;
>        SpinLockRelease(&XLogCtl->info_lck);
>        flexible = false;   /* ensure it all gets written */
>    }
>
>and various pieces of the code doing XLogSetAsyncXactLSN() to force
>flushing. I wonder if the issue is that we're better at avoiding
>writing unnecessary WAL due to
>6ef2eba3f57f17960b7cd4958e18aa79e357de2f.
>

I don't think so, because (a) there are no async commits involved, and (b) we originally ran into the issue on 9.6 and 6ef2eba3f57f1 is only in 10+.

>> But I don't think we're safe without the failover slots patch, because
>> any output plugin can do something similar - say, LogLogicalMessage() or
>> something like that. I'm not aware of a plugin doing such things, but I
>> don't think it's illegal / prohibited either. (Of course, plugins that
>> generate WAL won't be useful for decoding on standby in the future.)
>
>FWIW, I'd consider such an output plugin outright broken.
>

Why? Is that prohibited somewhere, either explicitly or implicitly? I do see obvious issues with generating WAL from a plugin (an infinite cycle and so on), but I suppose that's more a regular coding issue than something that would make all plugins doing that broken.

FWIW I don't see any particular need to generate WAL from output plugins - I mentioned it mostly just as a convenient way to trigger the issue. I suppose there are other ways to generate a bit of WAL during shutdown.

>> So what I think we should do is to tweak XLogBackgroundFlush() so that
>> during shutdown it skips the back-off to page boundary, and flushes even
>> the last piece of WAL. There are only two callers anyway, so something
>> like XLogBackgroundFlush(bool) would be simple enough.
>
>I think it then just ought to be a normal XLogFlush(). I.e. something
>along the lines of:
>
>    /*
>     * If we're shutting down, trigger pending WAL to be written out,
>     * otherwise we'd possibly end up waiting for WAL that never gets
>     * written, because walwriter has shut down already.
>     */
>    if (got_STOPPING && !RecoveryInProgress())
>        XLogFlush(GetXLogInsertRecPtr());
>
>    /* Update our idea of the currently flushed position. */
>    if (!RecoveryInProgress())
>        RecentFlushPtr = GetFlushRecPtr();
>    else
>        RecentFlushPtr = GetXLogReplayRecPtr(NULL);
>

Perhaps. That would work too, I guess.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On Wed, 1 May 2019 at 17:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> OK, so that seems like a separate issue, somewhat unrelated to the issue I
> reported. And I'm not sure it's a walsender issue - it seems it might be a
> psycopg2 issue in not reporting the flush properly, no?

Agree, it is a different issue, but I am unsure what to blame, postgres or psycopg2.

Right now in psycopg2 we confirm more or less every XLogData message, but at the same time the LSN on the primary is moving forward and we get updates with keepalive messages. I perfectly understand the need to periodically send updates of flush = walEnd (which comes from the keepalive). It might happen that there is no transaction activity but WAL is still generated, and as a result the replication slot will prevent WAL from being cleaned up.

From the client side perspective, it confirmed everything that it should, but from the postgres side, this is not enough to shut down cleanly. Maybe it is possible to change the check (sentPtr == replicatedPtr) to something like (lastMsgSentPtr <= replicatedPtr), or would that be unsafe?

> >Actually, the same problem happens even when the consumer script does
> >receive some messages, just not very frequently - it is only a bit
> >harder to reproduce.
> >
> >If you attach to the walsender with gdb, you'll see the following picture:
> >(gdb) bt
> >#0 0x00007fd6623d296a in __libc_send (fd=8, buf=0x55cb958dca08, len=94, flags=0) at ../sysdeps/unix/sysv/linux/send.c:28
> >#1 0x000055cb93aa7ce9 in secure_raw_write (port=0x55cb958d71e0, ptr=0x55cb958dca08, len=94) at be-secure.c:318
> >#2 0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0, ptr=0x55cb958dca08, len=94) at be-secure.c:265
> >#3 0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433
> >#4 0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409
> >#5 0x000055cb93dac30b in send_message_to_frontend (edata=0x55cb942b4380 <errordata>) at elog.c:3317
> >#6 0x000055cb93da8973 in EmitErrorReport () at elog.c:1481
> >#7 0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481
> >#8 0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3 "sending replication keepalive") at elog.c:1376
> >#9 0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at walsender.c:3358
>
> Is it stuck in the send() call forever, or did you happen to grab
> this backtrace?

No, it is not stuck there. During the shutdown postgres starts sending a few thousand keepalive messages per second and receives back just as many feedback messages, therefore the chances of interrupting somewhere in the send are quite high. The loop never terminates because psycopg2 always replies with the same flush value as the last time, which was set while processing an XLogData message.

> I think having a report of an issue, with a way to reproduce it, is a first
> (and quite important) step. So thanks for doing that.
>
> That being said, I think those are two separate issues, with different
> causes and likely different fixes. I don't think fixing the xlog flush
> will resolve your issue, and vice versa.

Agree, these are different issues.

Regards,

--
Alexander Kukushkin
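To make the idea concrete, here is a hypothetical sketch against the WalSndDone() fragment quoted earlier. lastMsgSentPtr does not exist in walsender.c - it stands for the end position of the last XLogData message actually sent, which would have to be tracked separately - and whether relaxing the test like this is safe is exactly the open question here:

    replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
        MyWalSnd->write : MyWalSnd->flush;

    /*
     * Hypothetical: consider the walsender done once everything it has
     * actually sent as XLogData has been confirmed, instead of requiring
     * confirmation of sentPtr, which can be ahead of anything the client
     * ever received as XLogData.
     */
    if (WalSndCaughtUp && lastMsgSentPtr <= replicatedPtr &&
        !pq_is_send_pending())
    {
        /* Inform the standby that XLOG streaming is done */
        EndCommand("COPY 0", DestRemote);
        pq_flush();

        proc_exit(0);
    }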
On Wed, May 01, 2019 at 07:12:52PM +0200, Alexander Kukushkin wrote:
>Hi,
>
>On Wed, 1 May 2019 at 17:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>> OK, so that seems like a separate issue, somewhat unrelated to the issue I
>> reported. And I'm not sure it's a walsender issue - it seems it might be a
>> psycopg2 issue in not reporting the flush properly, no?
>
>Agree, it is a different issue, but I am unsure what to blame,
>postgres or psycopg2.
>Right now in psycopg2 we confirm more or less every XLogData
>message, but at the same time the LSN on the primary is moving forward and
>we get updates with keepalive messages.
>I perfectly understand the need to periodically send updates of flush
>= walEnd (which comes from the keepalive). It might happen that there is
>no transaction activity but WAL is still generated, and as a result
>the replication slot will prevent WAL from being cleaned up.
>From the client side perspective, it confirmed everything that it
>should, but from the postgres side, this is not enough to shut down
>cleanly. Maybe it is possible to change the check (sentPtr ==
>replicatedPtr) to something like (lastMsgSentPtr <= replicatedPtr), or
>would that be unsafe?
>

I don't know.

In general I think it's a bit strange that we're waiting for walsender processes to catch up even in fast shutdown mode, instead of just aborting them like other backends. But I assume there are reasons for that. OTOH it makes us vulnerable to issues like this, when a (presumably) misbehaving downstream prevents a shutdown.

>> >ptr=0x55cb958dca08, len=94) at be-secure.c:318
>> >#2 0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0,
>> >ptr=0x55cb958dca08, len=94) at be-secure.c:265
>> >#3 0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433
>> >#4 0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409
>> >#5 0x000055cb93dac30b in send_message_to_frontend
>> >(edata=0x55cb942b4380 <errordata>) at elog.c:3317
>> >#6 0x000055cb93da8973 in EmitErrorReport () at elog.c:1481
>> >#7 0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481
>> >#8 0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3
>> >"sending replication keepalive") at elog.c:1376
>> >#9 0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at
>> >walsender.c:3358
>>
>> Is it stuck in the send() call forever, or did you happen to grab
>> this backtrace?
>
>No, it is not stuck there. During the shutdown postgres starts sending
>a few thousand keepalive messages per second and receives back just as many
>feedback messages, therefore the chances of interrupting somewhere in
>the send are quite high.

Uh, that seems a bit broken, perhaps?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 1 May 2019 at 01:28, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Now, this situation is apparently expected, because WalSndWaitForWal()
> does this:
>
>     /*
>      * If we're shutting down, trigger pending WAL to be written out,
>      * otherwise we'd possibly end up waiting for WAL that never gets
>      * written, because walwriter has shut down already.
>      */
>     if (got_STOPPING)
>         XLogBackgroundFlush();
>
> but unfortunately that does not actually do anything, because right at
> the very beginning XLogBackgroundFlush() does this:
>
>     /* back off to last completed page boundary */
>     WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
>
> That is, it intentionally ignores the incomplete page, which means the
> walsender can't read the record and reach GetFlushRecPtr().
>
> XLogBackgroundFlush() has been doing this since (at least) 2007, so how
> come we never had issues with this before?
Yeh, not quite what I originally wrote for that.
I think the confusion is that XLogBackgroundFlush() doesn't do quite what it says.
XLogWrite() kinda does with its "flexible" parameter. So I suggest we do the same on XLogBackgroundFlush() so callers can indicate whether they want it to be flexible or not.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
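For illustration, a sketch (against the xlog.c and walsender.c fragments quoted upthread, not a complete patch) of the shape this suggestion implies: XLogBackgroundFlush() gaining a flexible flag analogous to XLogWrite()'s, with the shutdown path asking for the incomplete page to be flushed too. The parameter name and the caller changes are assumptions.

    /* in XLogBackgroundFlush(), now taking a "flexible" flag */
    void
    XLogBackgroundFlush(bool flexible)
    {
        ...
        /* back off to last completed page boundary only if we may stop short */
        if (flexible)
            WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
        ...
    }

    /* caller in WalSndWaitForWal(), per the snippet quoted above */
    if (got_STOPPING)
        XLogBackgroundFlush(false);    /* flush even the last incomplete page */

    /* the other caller (walwriter's periodic flush) keeps the current behaviour */
    XLogBackgroundFlush(true);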
On Thu, 2 May 2019 at 14:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> >From the client side perspective, it confirmed everything that it
> >should, but from the postgres side, this is not enough to shut down
> >cleanly. Maybe it is possible to change the check (sentPtr ==
> >replicatedPtr) to something like (lastMsgSentPtr <= replicatedPtr), or
> >would that be unsafe?
>
> I don't know.
>
> In general I think it's a bit strange that we're waiting for walsender
> processes to catch up even in fast shutdown mode, instead of just aborting
> them like other backends. But I assume there are reasons for that. OTOH it
> makes us vulnerable to issues like this, when a (presumably) misbehaving
> downstream prevents a shutdown.

IMHO waiting until the remote side has received and flushed all changes is the right strategy, but physical and logical replication should be handled slightly differently.

For physical replication we want to make sure that the remote side has received and flushed all changes, otherwise in case of a switchover we won't be able to join the former primary as a new standby.

The logical replication case is a bit different. I think we can safely shut down the walsender when the client has confirmed the last XLogData message, while now we are waiting until the client confirms the wal_end it received in the keepalive message. If we shut down the walsender too early and do a switchover, the client might miss some events, because logical slots are not replicated :(

> >No, it is not stuck there. During the shutdown postgres starts sending
> >a few thousand keepalive messages per second and receives back just as many
> >feedback messages, therefore the chances of interrupting somewhere in
> >the send are quite high.
>
> Uh, that seems a bit broken, perhaps?

Indeed, this is broken psycopg2 behavior :( I am thinking about submitting a patch to fix it.

Actually, I quickly skimmed through the pgjdbc logical replication source code and the example at https://jdbc.postgresql.org/documentation/head/replication.html and I think that it will also cause problems with the shutdown.

Regards,

--
Alexander Kukushkin