Re: allow online change primary_conninfo - Mailing list pgsql-hackers
From | Kyotaro Horiguchi |
---|---|
Subject | Re: allow online change primary_conninfo |
Date | |
Msg-id | 20190925.141917.160008163.horikyota.ntt@gmail.com Whole thread Raw |
In response to | Re: allow online change primary_conninfo (Sergei Kornilov <sk@zsrv.org>) |
Responses |
Re: allow online change primary_conninfo
|
List | pgsql-hackers |
Hello. At Sat, 21 Sep 2019 13:06:25 +0300, Sergei Kornilov <sk@zsrv.org> wrote in <41171569060385@myt5-b646bde4b8f3.qloud-c.yandex.net> > Hello > > Thank you for review! Can you please also check v4 version? v5 implements design suggested by Kyotaro Horiguchi-san, whilev4 has another design. Which one do you prefer? Or are both wrong? > > > I can't parse that comment. What does "skipping to starting" mean? I > > assume it's just about avoiding wal_retrieve_retry_interval, but I think > > the comment ought to be rephrased. > > Design idea is to rewrite current state from working XLOG_FROM_STREAM to failed XLOG_FROM_ARCHIVE (without actually trythis method on this iteration) and immediately go to next iteration to advance the state machine. Next iteration afterfailed archive recovery is walreceiver. So walreceiver will be stopped just before this lines and started on next iteration.Walreceiver will be restarted, we do not call restore_command Sorry, it's my bad. It meant "If wal receiver is requested to restart, change state to XLOG_FROM_STREAM immediately skipping the next XLOG_FROM_ARCHIVE state.". > > Also, should we really check this before scanning for new timelines? > > Hmm, on the next day... No, this is not really necessary. No problem when timeline is not changed when explicit restart of wal receiver. But if it happened just after the standby received new hisotry file, we suffer an extra restart of wal receiver. It seems better that we check that in the case. > > Why is it the right thing to change to XLOG_FROM_ARCHIVE when we're just > > restarting walreceiver? The server might unnecessarily get stuck in > > archive based recovery for a long time this way? It seems to me that > > we'd need to actually trigger RequestXLogStreaming() in this case. > > I hope I clarified this in design idea description above. My suggestion was just to pretend that the next XLOG_FROM_STREAM failed, but the outcome actually looks wierd as Andres commented. I think that comes from the structure of the state machine. WAL receiver is started not at the beginning of XLOG_FROM_STREAM state but at the end of XLOG_FROM_ARCHIVE. So we need to switch once to XLOG_FROM_ARCHIVE in order to start wal receiver. So, I'd like to propose to move the stuff to the second switch(). (See the attached incomplete patch.) This is rather similar to Sergei's previous proposal, but the structure of the state machine is kept. Some other comments are below. In ProcessStartupSigHup, conninfo and slotname don't need to be retained until the end of the function. The log message in the function seems to be too detailed. On the other hand, if we changed primary_conninfo to '' (stop) or vise versa (start), the message (restart) looks strange. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 6c69eb6dd7..dba0042db6 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -11755,12 +11755,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, for (;;) { int oldSource = currentSource; + bool start_wal_receiver = false; /* - * First check if we failed to read from the current source, and + * First check if we failed to read from the current source or if + * we want to restart wal receiver, and * advance the state machine if so. The failure to read might've * happened outside this function, e.g when a CRC check fails on a * record, or within this loop. + * Restart wal receiver if explicitly requested, too. */ if (lastSourceFailed) { @@ -11788,53 +11791,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, if (!StandbyMode) return false; - /* - * If primary_conninfo is set, launch walreceiver to try - * to stream the missing WAL. - * - * If fetching_ckpt is true, RecPtr points to the initial - * checkpoint location. In that case, we use RedoStartLSN - * as the streaming start position instead of RecPtr, so - * that when we later jump backwards to start redo at - * RedoStartLSN, we will have the logs streamed already. - */ - if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0) - { - XLogRecPtr ptr; - TimeLineID tli; - - if (fetching_ckpt) - { - ptr = RedoStartLSN; - tli = ControlFile->checkPointCopy.ThisTimeLineID; - } - else - { - ptr = RecPtr; - - /* - * Use the record begin position to determine the - * TLI, rather than the position we're reading. - */ - tli = tliOfPointInHistory(tliRecPtr, expectedTLEs); - - if (curFileTLI > 0 && tli < curFileTLI) - elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previousrecovered WAL file came from timeline %u", - (uint32) (tliRecPtr >> 32), - (uint32) tliRecPtr, - tli, curFileTLI); - } - curFileTLI = tli; - RequestXLogStreaming(tli, ptr, PrimaryConnInfo, - PrimarySlotName); - receivedUpto = 0; - } - /* * Move to XLOG_FROM_STREAM state in either case. We'll * get immediate failure if we didn't launch walreceiver, * and move on to the next state. */ + start_wal_receiver = true; currentSource = XLOG_FROM_STREAM; break; @@ -11864,8 +11826,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, ShutdownWalRcv(); /* - * Before we sleep, re-scan for possible new timelines if - * we were requested to recover to the latest timeline. + * Re-scan for possible new timelines if we were requested + * to recover to the latest timeline. */ if (recoveryTargetTimeLineGoal == RECOVERY_TARGET_TIMELINE_LATEST) { @@ -11929,8 +11891,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, lastSourceFailed ? "failure" : "success"); /* - * We've now handled possible failure. Try to read from the chosen - * source. + * We've now handled possible failure and configuration change. Try to + * read from the chosen source. */ lastSourceFailed = false; @@ -11969,9 +11931,71 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, bool havedata; /* - * Check if WAL receiver is still active. + * shutdown wal receiver if restart is requested. */ - if (!WalRcvStreaming()) + if (!start_wal_receiver && pendingWalRcvRestart) + { + if (WalRcvRunning()) + ShutdownWalRcv(); + + /* + * Re-scan for possible new timelines if we were + * requested to recover to the latest timeline. + */ + if (recoveryTargetTimeLineGoal == + RECOVERY_TARGET_TIMELINE_LATEST) + rescanLatestTimeLine(); + + start_wal_receiver = true; + } + pendingWalRcvRestart = false; + + /* + * Launch walreceiver if needed. + * + * If fetching_ckpt is true, RecPtr points to the initial + * checkpoint location. In that case, we use RedoStartLSN + * as the streaming start position instead of RecPtr, so + * that when we later jump backwards to start redo at + * RedoStartLSN, we will have the logs streamed already. + */ + if (start_wal_receiver && + PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0) + { + XLogRecPtr ptr; + TimeLineID tli; + + if (fetching_ckpt) + { + ptr = RedoStartLSN; + tli = ControlFile->checkPointCopy.ThisTimeLineID; + } + else + { + ptr = RecPtr; + + /* + * Use the record begin position to determine the + * TLI, rather than the position we're reading. + */ + tli = tliOfPointInHistory(tliRecPtr, expectedTLEs); + + if (curFileTLI > 0 && tli < curFileTLI) + elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previousrecovered WAL file came from timeline %u", + (uint32) (tliRecPtr >> 32), + (uint32) tliRecPtr, + tli, curFileTLI); + } + curFileTLI = tli; + RequestXLogStreaming(tli, ptr, PrimaryConnInfo, + PrimarySlotName); + receivedUpto = 0; + } + + /* + * Else, check if WAL receiver is still active. + */ + else if (!WalRcvStreaming()) { lastSourceFailed = true; break;
pgsql-hackers by date: