Thread: Keepalive for max_standby_delay
Patch adds a keepalive message to ensure max_standby_delay is useful. No WAL format changes, no libpq changes. Just an additional message type for the streaming replication protocol, sent once per main loop in WALsender. Plus docs. Comments? -- Simon Riggs www.2ndQuadrant.com
Attachment
Simon Riggs <simon@2ndQuadrant.com> writes: > Patch adds a keepalive message to ensure max_standby_delay is useful. The proposed placement of the keepalive-send is about the worst it could possibly be. It needs to be done right before pq_flush to ensure minimum transfer delay. Otherwise any attempt to measure clock skew using the timestamp will be seriously off. Where you've placed it guarantees maximum delay not minimum. I'm also extremely dubious that it's a good idea to set recoveryLastXTime from this. Using both that and the timestamps from the wal log flies in the face of everything I remember about control theory. We should be doing only one or only the other, I think. regards, tom lane
On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > Patch adds a keepalive message to ensure max_standby_delay is useful. > > The proposed placement of the keepalive-send is about the worst it could > possibly be. It needs to be done right before pq_flush to ensure > minimum transfer delay. Otherwise any attempt to measure clock skew > using the timestamp will be seriously off. Where you've placed it > guarantees maximum delay not minimum. I don't understand. WalSndKeepAlive() contains a pq_flush() immediately after the timestamp is set. I did it that way for exactly the same reasons you've said. Do you mean you only want to see one pq_flush()? I used two so that the actual data is never delayed by a keepalive. WAL Sender was going to sleep anyway, so it shouldn't be a problem. > I'm also extremely dubious that it's a good idea to set > recoveryLastXTime from this. Using both that and the timestamps from > the wal log flies in the face of everything I remember about control > theory. We should be doing only one or only the other, I think. I can change it so that the recoveryLastXTime will not be updated if we are using the value from the keepalives. So we have one, or the other. Remember that replication can switch backwards and forwards between modes, so it seems sensible to have a common timing value whichever mode we're in. -- Simon Riggs www.2ndQuadrant.com
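[A minimal, self-contained sketch of the ordering under debate: stamp the keepalive immediately before flushing, so the timestamp reflects send time as closely as possible. The 'k' message type and the send/flush helpers below are invented stand-ins for the real pq_* machinery, not the actual patch code.]

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

typedef int64_t TimestampTz;            /* microseconds, as in PostgreSQL */

static TimestampTz
current_timestamp(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (TimestampTz) tv.tv_sec * 1000000 + tv.tv_usec;
}

/* stand-in for pq_putmessage(): buffer one protocol message */
static void
send_message(char type, const void *data, size_t len)
{
    (void) data;                        /* a real sender copies this into the buffer */
    printf("queued '%c' message, %zu bytes\n", type, len);
}

/* stand-in for pq_flush(): push buffered output onto the socket */
static void
flush_output(void)
{
    printf("flushed\n");
}

static void
WalSndKeepAlive(void)
{
    /* stamp immediately before flushing, so transfer delay stays minimal */
    TimestampTz now = current_timestamp();

    send_message('k', &now, sizeof(now));
    flush_output();
}

int
main(void)
{
    WalSndKeepAlive();
    return 0;
}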
Simon Riggs wrote: > On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote: >> I'm also extremely dubious that it's a good idea to set >> recoveryLastXTime from this. Using both that and the timestamps from >> the wal log flies in the face of everything I remember about control >> theory. We should be doing only one or only the other, I think. > > I can change it so that the recoveryLastXTime will not be updated if we > are using the value from the keepalives. So we have one, or the other. > Remember that replication can switch backwards and forwards between > modes, so it seems sensible to have a common timing value whichever mode > we're in. That means that recoveryLastXTime can jump forwards and backwards. Doesn't feel right to me either. If you want to expose the keepalive-time to queries, it should be a separate field, something like lastMasterKeepaliveTime and a pg_last_master_keepalive() function to read it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote: > >> I'm also extremely dubious that it's a good idea to set > >> recoveryLastXTime from this. Using both that and the timestamps from > >> the wal log flies in the face of everything I remember about control > >> theory. We should be doing only one or only the other, I think. > > > > I can change it so that the recoveryLastXTime will not be updated if we > > are using the value from the keepalives. So we have one, or the other. > > Remember that replication can switch backwards and forwards between > > modes, so it seems sensible to have a common timing value whichever mode > > we're in. > > That means that recoveryLastXTime can jump forwards and backwards. That behaviour would be bad, so why not just prevent that from happening? > Doesn't feel right to me either. If you want to expose the > keepalive-time to queries, it should be a separate field, something like > lastMasterKeepaliveTime and a pg_last_master_keepalive() function to > read it. That wouldn't be good because then you couldn't easily monitor the delay. You'd have to run two different functions depending on the state of replication (for which we would need yet another function). Users would just wrap that back up into a single function. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote: >> Doesn't feel right to me either. If you want to expose the >> keepalive-time to queries, it should be a separate field, something like >> lastMasterKeepaliveTime and a pg_last_master_keepalive() function to >> read it. > > That wouldn't be good because then you couldn't easily monitor the > delay? You'd have to run two different functions depending on the state > of replication (for which we would need yet another function). Users > would just wrap that back up into a single function. What exactly is the user trying to monitor? If it's "how far behind is the standby", the difference between pg_current_xlog_insert_location() in the master and pg_last_xlog_replay_location() in the standby seems more robust and well-defined to me. It's a measure of XLOG location (ie. bytes) instead of time, but time is a complicated concept. Also note that as the patch stands, if you receive a keep-alive from the master at point X, it doesn't mean that the standby is fully up-to-date. It's possible that the walsender just finished sending a huge batch of accumulated WAL, say 1 GB, and it took 1 hour for the batch to be sent. During that time, a lot more WAL has accumulated, yet walsender sends a keep-alive with the current timestamp. In general, the purpose of a keep-alive is to keep the connection alive, but you're trying to accomplish something else too, and I don't fully understand what it is. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
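[As an aside on the byte-based measure suggested above, a sketch of how the gap between two XLOG locations reduces to bytes, using the two-field XLogRecPtr layout of this era. The XLOG_FILE_SIZE constant is an assumption matching a default build, and the helper is for illustration only.]

#include <stdint.h>
#include <stdio.h>

/* two-field WAL pointer of this era: log file id plus byte offset within it */
typedef struct XLogRecPtr
{
    uint32_t xlogid;
    uint32_t xrecoff;
} XLogRecPtr;

/* usable bytes per logical xlog file; 0xFF000000 in a default build (assumption) */
#define XLOG_FILE_SIZE 0xFF000000UL

/*
 * Lag in bytes between the master's insert location and the standby's
 * replay location; clamps to zero if the standby is somehow ahead.
 */
static uint64_t
xlog_lag_bytes(XLogRecPtr master, XLogRecPtr standby)
{
    uint64_t m = (uint64_t) master.xlogid * XLOG_FILE_SIZE + master.xrecoff;
    uint64_t s = (uint64_t) standby.xlogid * XLOG_FILE_SIZE + standby.xrecoff;

    return (m > s) ? (m - s) : 0;
}

int
main(void)
{
    XLogRecPtr master = { 1, 0x2000 };
    XLogRecPtr standby = { 0, 0xFE000000UL };

    printf("standby is %llu bytes behind\n",
           (unsigned long long) xlog_lag_bytes(master, standby));
    return 0;
}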
On Sat, 2010-05-15 at 20:05 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote: > >> Doesn't feel right to me either. If you want to expose the > >> keepalive-time to queries, it should be a separate field, something like > >> lastMasterKeepaliveTime and a pg_last_master_keepalive() function to > >> read it. > > > > That wouldn't be good because then you couldn't easily monitor the > > delay? You'd have to run two different functions depending on the state > > of replication (for which we would need yet another function). Users > > would just wrap that back up into a single function. > > What exactly is the user trying to monitor? If it's "how far behind is > the standby", the difference between pg_current_xlog_insert_location() > in the master and pg_last_xlog_replay_location() in the standby seems > more robust and well-defined to me. It's a measure of XLOG location (ie. > bytes) instead of time, but time is a complicated concept. Maybe, but it's meaningful to users and that is the point. > Also note that as the patch stands, if you receive a keep-alive from the > master at point X, it doesn't mean that the standby is fully up-to-date. > It's possible that the walsender just finished sending a huge batch of > accumulated WAL, say 1 GB, and it took 1 hour for the batch to be sent. > During that time, a lot more WAL has accumulated, yet walsender sends a > keep-alive with the current timestamp. Not at all. The timestamp for the keepalive is calculated after the pq_flush for the main WAL data. So even if it takes 10 years to send the next blob of WAL data, the timestamp will be current. However, a point you made in an earlier thread is still true here. It sounds like it would be best if the startup process didn't re-access shared memory for the next location until it had fully replayed all the WAL up to the point it last accessed. That way the changing value of the shared timestamp would have no effect on the calculated value at any point in time. I will recode using that concept. > In general, the purpose of a keep-alive is to keep the connection alive, > but you're trying to accomplish something else too, and I don't fully > understand what it is. That surprises me. If nothing else, it's in the title of the thread, though since you personally added this to the Hot Standby todo more than 6 months ago I'd hope you of all people would understand the purpose. -- Simon Riggs www.2ndQuadrant.com
On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote: > I will recode using that concept. There are some behaviours that aren't helpful here: Startup gets a new pointer when it runs out of data to replay. That might or might not include an updated keepalive timestamp, since there's no exact relationship between chunks sent and chunks received. Startup might ask for a new chunk when half a chunk has been received, or when multiple chunks have been received. WALSender doesn't chunk up what it sends, though libpq does at a lower level. So there's no way to make a keepalive happen every X bytes without doing this from within libpq. WALSender sleeps even when it might have more WAL to send; it doesn't check, it just unconditionally sleeps. At least WALReceiver loops until it has no more to receive. I just can't imagine why that's useful behaviour. All in all, I think we should be calling this "burst replication" not streaming replication. That does cause an issue in trying to monitor what's going on cos there's so much jitter. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > WALSender sleeps even when it might have more WAL to send; it doesn't > check, it just unconditionally sleeps. At least WALReceiver loops until > it has no more to receive. I just can't imagine why that's useful > behaviour. Good catch. That should be fixed. I also note that walsender doesn't respond to signals, while it's sending a large batch. That's analogous to the issue that was addressed recently in the archiver process. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Simon Riggs wrote: >> WALSender sleeps even when it might have more WAL to send; it doesn't >> check, it just unconditionally sleeps. At least WALReceiver loops until >> it has no more to receive. I just can't imagine why that's useful >> behaviour. > > Good catch. That should be fixed. > > I also note that walsender doesn't respond to signals, while it's > sending a large batch. That's analogous to the issue that was addressed > recently in the archiver process. Attached patch rearranges the walsender loops slightly to fix the above. XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / 2) in one round and returns to the main loop after that even if there's unsent WAL, and the main loop no longer sleeps if there's unsent WAL. That way the main loop gets to respond to signals quickly, and we also get to update the shared memory status and PS display more often when there's a lot of catching up to do. Comments, have I screwed up anything? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 100,106 ****
  static void InitWalSnd(void);
  static void WalSndHandshake(void);
  static void WalSndKill(int code, Datum arg);
  static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
! static bool XLogSend(StringInfo outMsg);
  static void CheckClosedConnection(void);
  
  /*
--- 100,106 ----
  static void WalSndHandshake(void);
  static void WalSndKill(int code, Datum arg);
  static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
! static bool XLogSend(StringInfo outMsg, bool *caughtup);
  static void CheckClosedConnection(void);
  
  /*
***************
*** 360,365 ****
--- 360,366 ----
  static int
  WalSndLoop(void)
  {
      StringInfoData output_message;
+     bool        caughtup = false;
  
      initStringInfo(&output_message);
*************** WalSndLoop(void)
*** 387,393 ****
       */
      if (ready_to_stop)
      {
!         XLogSend(&output_message);
          shutdown_requested = true;
      }
  
--- 388,394 ----
       */
      if (ready_to_stop)
      {
!         XLogSend(&output_message, &caughtup);
          shutdown_requested = true;
      }
  
*************** WalSndLoop(void)
*** 402,432 ****
      }
  
      /*
!      * Nap for the configured time or until a message arrives.
       *
       * On some platforms, signals won't interrupt the sleep. To ensure we
       * respond reasonably promptly when someone signals us, break down the
       * sleep into NAPTIME_PER_CYCLE increments, and check for
       * interrupts after each nap.
       */
!     remain = WalSndDelay * 1000L;
!     while (remain > 0)
      {
!         if (got_SIGHUP || shutdown_requested || ready_to_stop)
!             break;
  
!         /*
!          * Check to see whether a message from the standby or an interrupt
!          * from other processes has arrived.
!          */
!         pg_usleep(remain > NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain);
!         CheckClosedConnection();
  
!         remain -= NAPTIME_PER_CYCLE;
      }
  
      /* Attempt to send the log once every loop */
!     if (!XLogSend(&output_message))
          goto eof;
  }
--- 403,434 ----
      }
  
      /*
!      * If we had sent all accumulated WAL in last round, nap for the
!      * configured time before retrying.
       *
       * On some platforms, signals won't interrupt the sleep. To ensure we
       * respond reasonably promptly when someone signals us, break down the
       * sleep into NAPTIME_PER_CYCLE increments, and check for
       * interrupts after each nap.
       */
!     if (caughtup)
      {
!         remain = WalSndDelay * 1000L;
!         while (remain > 0)
!         {
!             /* Check for interrupts */
!             if (got_SIGHUP || shutdown_requested || ready_to_stop)
!                 break;
  
!             /* Sleep and check that the connection is still alive */
!             pg_usleep(remain > NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain);
!             CheckClosedConnection();
  
!             remain -= NAPTIME_PER_CYCLE;
!         }
      }
  
      /* Attempt to send the log once every loop */
!     if (!XLogSend(&output_message, &caughtup))
          goto eof;
  }
*************** XLogRead(char *buf, XLogRecPtr recptr, Size nbytes)
*** 623,637 ****
  }
  
  /*
!  * Read all WAL that's been written (and flushed) since last cycle, and send
!  * it to client.
   *
   * Returns true if OK, false if trouble.
   */
  static bool
! XLogSend(StringInfo outMsg)
  {
      XLogRecPtr  SendRqstPtr;
      char        activitymsg[50];
  
      /* use volatile pointer to prevent code rearrangement */
--- 625,644 ----
  }
  
  /*
!  * Read up to MAX_SEND_SIZE bytes of WAL that's been written (and flushed),
!  * but not yet sent to the client, and send it. If there is no unsent WAL,
!  * *caughtup is set to true and nothing is sent, otherwise *caughtup is set
!  * to false.
   *
   * Returns true if OK, false if trouble.
   */
  static bool
! XLogSend(StringInfo outMsg, bool *caughtup)
  {
      XLogRecPtr  SendRqstPtr;
+     XLogRecPtr  startptr;
+     XLogRecPtr  endptr;
+     Size        nbytes;
      char        activitymsg[50];
  
      /* use volatile pointer to prevent code rearrangement */
*************** XLogSend(StringInfo outMsg)
*** 642,725 ****
      /* Quick exit if nothing to do */
      if (!XLByteLT(sentPtr, SendRqstPtr))
          return true;
  
      /*
!      * We gather multiple records together by issuing just one XLogRead() of a
!      * suitable size, and send them as one CopyData message. Repeat until
!      * we've sent everything we can.
       */
!     while (XLByteLT(sentPtr, SendRqstPtr))
      {
-         XLogRecPtr  startptr;
-         XLogRecPtr  endptr;
-         Size        nbytes;
- 
          /*
!          * Figure out how much to send in one message. If there's less than
!          * MAX_SEND_SIZE bytes to send, send everything. Otherwise send
!          * MAX_SEND_SIZE bytes, but round to page boundary.
!          *
!          * The rounding is not only for performance reasons. Walreceiver
!          * relies on the fact that we never split a WAL record across two
!          * messages. Since a long WAL record is split at page boundary into
!          * continuation records, page boundary is always a safe cut-off point.
!          * We also assume that SendRqstPtr never points in the middle of a WAL
!          * record.
          */
!         startptr = sentPtr;
!         if (startptr.xrecoff >= XLogFileSize)
!         {
!             /*
!              * crossing a logid boundary, skip the non-existent last log
!              * segment in previous logical log file.
!              */
!             startptr.xlogid += 1;
!             startptr.xrecoff = 0;
!         }
  
!         endptr = startptr;
!         XLByteAdvance(endptr, MAX_SEND_SIZE);
!         /* round down to page boundary. */
!         endptr.xrecoff -= (endptr.xrecoff % XLOG_BLCKSZ);
!         /* if we went beyond SendRqstPtr, back off */
!         if (XLByteLT(SendRqstPtr, endptr))
!             endptr = SendRqstPtr;
  
!         /*
!          * OK to read and send the slice.
!          *
!          * We don't need to convert the xlogid/xrecoff from host byte order to
!          * network byte order because the both server can be expected to have
!          * the same byte order. If they have different byte order, we don't
!          * reach here.
!          */
!         pq_sendbyte(outMsg, 'w');
!         pq_sendbytes(outMsg, (char *) &startptr, sizeof(startptr));
  
!         if (endptr.xlogid != startptr.xlogid)
!         {
!             Assert(endptr.xlogid == startptr.xlogid + 1);
!             nbytes = endptr.xrecoff + XLogFileSize - startptr.xrecoff;
!         }
!         else
!             nbytes = endptr.xrecoff - startptr.xrecoff;
  
!         sentPtr = endptr;
  
!         /*
!          * Read the log directly into the output buffer to prevent extra
!          * memcpy calls.
!          */
!         enlargeStringInfo(outMsg, nbytes);
!         XLogRead(&outMsg->data[outMsg->len], startptr, nbytes);
!         outMsg->len += nbytes;
!         outMsg->data[outMsg->len] = '\0';
  
!         pq_putmessage('d', outMsg->data, outMsg->len);
!         resetStringInfo(outMsg);
!     }
  
      /* Update shared memory status */
      SpinLockAcquire(&walsnd->mutex);
--- 649,730 ----
      /* Quick exit if nothing to do */
      if (!XLByteLT(sentPtr, SendRqstPtr))
+     {
+         *caughtup = true;
          return true;
+     }
+ 
+     /*
+      * Otherwise let the caller know that we're not fully caught up. Unless
+      * there's a huge backlog, we'll be caught up to the current WriteRecPtr
+      * after we've sent everything below, but more WAL could accumulate while
+      * we're busy sending.
+      */
+     *caughtup = false;
  
      /*
!      * Figure out how much to send in one message. If there's less than
!      * MAX_SEND_SIZE bytes to send, send everything. Otherwise send
!      * MAX_SEND_SIZE bytes, but round to page boundary.
!      *
!      * The rounding is not only for performance reasons. Walreceiver
!      * relies on the fact that we never split a WAL record across two
!      * messages. Since a long WAL record is split at page boundary into
!      * continuation records, page boundary is always a safe cut-off point.
!      * We also assume that SendRqstPtr never points in the middle of a WAL
!      * record.
       */
!     startptr = sentPtr;
!     if (startptr.xrecoff >= XLogFileSize)
      {
          /*
!          * crossing a logid boundary, skip the non-existent last log
!          * segment in previous logical log file.
           */
!         startptr.xlogid += 1;
!         startptr.xrecoff = 0;
!     }
  
!     endptr = startptr;
!     XLByteAdvance(endptr, MAX_SEND_SIZE);
!     /* round down to page boundary. */
!     endptr.xrecoff -= (endptr.xrecoff % XLOG_BLCKSZ);
!     /* if we went beyond SendRqstPtr, back off */
!     if (XLByteLT(SendRqstPtr, endptr))
!         endptr = SendRqstPtr;
  
!     /*
!      * OK to read and send the slice.
!      *
!      * We don't need to convert the xlogid/xrecoff from host byte order to
!      * network byte order because the both server can be expected to have
!      * the same byte order. If they have different byte order, we don't
!      * reach here.
!      */
!     pq_sendbyte(outMsg, 'w');
!     pq_sendbytes(outMsg, (char *) &startptr, sizeof(startptr));
  
!     if (endptr.xlogid != startptr.xlogid)
!     {
!         Assert(endptr.xlogid == startptr.xlogid + 1);
!         nbytes = endptr.xrecoff + XLogFileSize - startptr.xrecoff;
!     }
!     else
!         nbytes = endptr.xrecoff - startptr.xrecoff;
  
!     sentPtr = endptr;
  
!     /*
!      * Read the log directly into the output buffer to prevent extra
!      * memcpy calls.
!      */
!     enlargeStringInfo(outMsg, nbytes);
!     XLogRead(&outMsg->data[outMsg->len], startptr, nbytes);
!     outMsg->len += nbytes;
!     outMsg->data[outMsg->len] = '\0';
  
!     pq_putmessage('d', outMsg->data, outMsg->len);
!     resetStringInfo(outMsg);
  
      /* Update shared memory status */
      SpinLockAcquire(&walsnd->mutex);
On Sun, 2010-05-16 at 00:05 +0300, Heikki Linnakangas wrote: > Heikki Linnakangas wrote: > > Simon Riggs wrote: > >> WALSender sleeps even when it might have more WAL to send; it doesn't > >> check, it just unconditionally sleeps. At least WALReceiver loops until > >> it has no more to receive. I just can't imagine why that's useful > >> behaviour. > > > > Good catch. That should be fixed. > > > > I also note that walsender doesn't respond to signals, while it's > > sending a large batch. That's analogous to the issue that was addressed > > recently in the archiver process. > > Attached patch rearranges the walsender loops slightly to fix the above. > XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / > 2) in one round and returns to the main loop after that even if there's > unsent WAL, and the main loop no longer sleeps if there's unsent WAL. > That way the main loop gets to respond to signals quickly, and we also > get to update the shared memory status and PS display more often when > there's a lot of catching up to do. > > Comments 8MB at a time still seems like a large batch to me. libpq is going to send it in smaller chunks anyway, so I don't see the importance of trying to keep the batch so large. It just introduces delay into the sending process. We should be sending chunks that match libpq better. -- Simon Riggs www.2ndQuadrant.com
On Sat, 2010-05-15 at 19:50 +0100, Simon Riggs wrote: > On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote: > > > I will recode using that concept. > Startup gets new pointer when it runs out of data to replay. That might > or might not include an updated keepalive timestamp, since there's no > exact relationship between chunks sent and chunks received. Startup > might ask for a new chunk when half a chunk has been received, or when > multiple chunks have been received. New version, with some other cleanup of wait processing. New logic is that when Startup asks for next applychunk of WAL it saves the lastChunkTimestamp. That is then the base time used by WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later. Since multiple receivechunks can arrive from primary before Startup asks for next applychunk we use the oldest receivechunk timestamp, not the latest. Doing it this way means the lastChunkTimestamp doesn't change when new keepalives arrive, so we have a stable and reasonably accurate recordSendTimestamp for each WAL record. The keepalive is sent as the first part of a new message, if any. So partial chunks of data always have an accurate timestamp, even if that is slightly older as a result. Doesn't make much difference except with very large chunks. I think that addresses the points raised on this thread and others. -- Simon Riggs www.2ndQuadrant.com
Attachment
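[A sketch of the wait test described in the mail above, using the variable names from that description; the attached patch may differ in detail. The base time is the timestamp saved when the startup process took the current apply chunk, except when latestXLogTime is later.]

#include <stdbool.h>
#include <stdint.h>
#include <sys/time.h>

typedef int64_t TimestampTz;            /* microseconds */

static TimestampTz lastChunkTimestamp;  /* oldest unapplied receive-chunk time */
static TimestampTz latestXLogTime;      /* newest commit/abort time replayed */
static int MaxStandbyDelay = 30000;     /* milliseconds; -1 = wait forever */

static TimestampTz
GetCurrentTimestamp(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (TimestampTz) tv.tv_sec * 1000000 + tv.tv_usec;
}

static bool
WaitExceedsMaxStandbyDelay(void)
{
    TimestampTz base;

    if (MaxStandbyDelay < 0)
        return false;                   /* never cancel; wait forever */

    /* base time is the chunk timestamp, except when latestXLogTime is later */
    base = (latestXLogTime > lastChunkTimestamp) ?
            latestXLogTime : lastChunkTimestamp;

    return GetCurrentTimestamp() - base >
           (TimestampTz) MaxStandbyDelay * 1000;
}

int
main(void)
{
    lastChunkTimestamp = GetCurrentTimestamp();
    return WaitExceedsMaxStandbyDelay() ? 1 : 0;
}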
On Sun, May 16, 2010 at 6:05 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> WALSender sleeps even when it might have more WAL to send; it doesn't >>> check, it just unconditionally sleeps. At least WALReceiver loops until >>> it has no more to receive. I just can't imagine why that's useful >>> behaviour. >> >> Good catch. That should be fixed. >> >> I also note that walsender doesn't respond to signals, while it's >> sending a large batch. That's analogous to the issue that was addressed >> recently in the archiver process. > > Attached patch rearranges the walsender loops slightly to fix the above. > XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / > 2) in one round and returns to the main loop after that even if there's > unsent WAL, and the main loop no longer sleeps if there's unsent WAL. > That way the main loop gets to respond to signals quickly, and we also > get to update the shared memory status and PS display more often when > there's a lot of catching up to do. > > Comments Doesn't the main loop also need to respond quickly to the 'X' message from the standby? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, May 17, 2010 at 1:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sat, 2010-05-15 at 19:50 +0100, Simon Riggs wrote: >> On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote: >> >> > I will recode using that concept. > >> Startup gets a new pointer when it runs out of data to replay. That might >> or might not include an updated keepalive timestamp, since there's no >> exact relationship between chunks sent and chunks received. Startup >> might ask for a new chunk when half a chunk has been received, or when >> multiple chunks have been received. > > New version, with some other cleanup of wait processing. > > New logic is that when Startup asks for next applychunk of WAL it saves > the lastChunkTimestamp. That is then the base time used by > WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later. > Since multiple receivechunks can arrive from primary before Startup asks > for next applychunk we use the oldest receivechunk timestamp, not the > latest. Doing it this way means the lastChunkTimestamp doesn't change > when new keepalives arrive, so we have a stable and reasonably accurate > recordSendTimestamp for each WAL record. > > The keepalive is sent as the first part of a new message, if any. So > partial chunks of data always have an accurate timestamp, even if that > is slightly older as a result. Doesn't make much difference except with > very large chunks. > > I think that addresses the points raised on this thread and others. Is it OK that this keepalive message cannot be used by HS in file-based log-shipping? Even in SR, the startup process cannot use the keepalive until walreceiver has been started up. WalSndKeepAlive() always calls initStringInfo(), which seems to cause a memory leak. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, 2010-05-17 at 11:51 +0900, Fujii Masao wrote: > Is it OK that this keepalive message cannot be used by HS in file-based > log-shipping? Even in SR, the startup process cannot use the keepalive > until walreceiver has been started up. Yes, I see those things. We already have archive_timeout to handle the keepalive case in file-based log shipping. When starting up, the delay is high anyway, so accuracy doesn't really matter much - though we do use latestXLogTime in that case. > WalSndKeepAlive() always calls initStringInfo(), which seems to cause > a memory leak. Thanks. -- Simon Riggs www.2ndQuadrant.com
On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote: > > > > Attached patch rearranges the walsender loops slightly to fix the above. > > XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / > > 2) in one round and returns to the main loop after that even if there's > > unsent WAL, and the main loop no longer sleeps if there's unsent WAL. > > That way the main loop gets to respond to signals quickly, and we also > > get to update the shared memory status and PS display more often when > > there's a lot of catching up to do. > > > > Comments > > 8MB at a time still seems like a large batch to me. > > libpq is going to send it in smaller chunks anyway, so I don't see the > importance of trying to keep the batch so large. It just introduces > delay into the sending process. We should be sending chunks that match > libpq better. More to the point the logic will fail if XLOG_BLCKSZ > PQ_BUFFER_SIZE because it will send partial pages. Having MAX_SEND_SIZE > PQ_BUFFER_SIZE is pointless, as libpq currently stands. -- Simon Riggs www.2ndQuadrant.com
On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote: > What exactly is the user trying to monitor? If it's "how far behind is > the standby", the difference between pg_current_xlog_insert_location() > in the master and pg_last_xlog_replay_location() in the standby seems > more robust and well-defined to me. It's a measure of XLOG location (ie. > bytes) instead of time, but time is a complicated concept. I can tell you that end users *will* want a time-based indication of how far behind we are. DBAs will understand "we're this many transactions behind", but managers and end users won't. Unless it's unreasonable to provide that info, we should do so. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On 17/05/10 04:40, Simon Riggs wrote: > On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote: >>> >>> Attached patch rearranges the walsender loops slightly to fix the above. >>> XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / >>> 2) in one round and returns to the main loop after that even if there's >>> unsent WAL, and the main loop no longer sleeps if there's unsent WAL. >>> That way the main loop gets to respond to signals quickly, and we also >>> get to update the shared memory status and PS display more often when >>> there's a lot of catching up to do. >>> >>> Comments >> >> 8MB at a time still seems like a large batch to me. >> >> libpq is going to send it in smaller chunks anyway, so I don't see the >> importance of trying to keep the batch so large. It just introduces >> delay into the sending process. We should be sending chunks that match >> libpq better. > > More to the point the logic will fail if XLOG_BLCKSZ > PQ_BUFFER_SIZE > because it will send partial pages. I don't see a failure. We rely on not splitting WAL records across messages, but we're talking about libpq-level CopyData messages, not TCP messages. > Having MAX_SEND_SIZE > PQ_BUFFER_SIZE is pointless, as libpq currently > stands. Well, it does affect the size of the read() in walsender, and I'm sure there's some overhead in setting the ps display and the other small stuff we do once per message. But you're probably right that we could easily make MAX_SEND_SIZE much smaller with no noticeable effect on performance, while making walsender more responsive to signals. I'll decrease it to, say, 512 kB. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 17/05/10 12:36, Jim Nasby wrote: > On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote: >> What exactly is the user trying to monitor? If it's "how far behind is >> the standby", the difference between pg_current_xlog_insert_location() >> in the master and pg_last_xlog_replay_location() in the standby seems >> more robust and well-defined to me. It's a measure of XLOG location (ie. >> bytes) instead of time, but time is a complicated concept. > > I can tell you that end users *will* want a time-based indication of how far behind we are. DBAs will understand "we're this many transactions behind", but managers and end users won't. Unless it's unreasonable to provide that info, we should do so. No doubt about that; the problem is that it's hard to provide a reliable time-based indication. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2010-05-18 at 17:06 -0400, Heikki Linnakangas wrote: > On 17/05/10 04:40, Simon Riggs wrote: > > On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote: > >>> > >>> Attached patch rearranges the walsender loops slightly to fix the above. > >>> XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE / > >>> 2) in one round and returns to the main loop after that even if there's > >>> unsent WAL, and the main loop no longer sleeps if there's unsent WAL. > >>> That way the main loop gets to respond to signals quickly, and we also > >>> get to update the shared memory status and PS display more often when > >>> there's a lot of catching up to do. > >>> > >>> Comments > >> > >> 8MB at a time still seems like a large batch to me. > >> > >> libpq is going to send it in smaller chunks anyway, so I don't see the > >> importance of trying to keep the batch so large. It just introduces > >> delay into the sending process. We should be sending chunks that match > >> libpq better. > > > > More to the point the logic will fail if XLOG_BLCKSZ > PQ_BUFFER_SIZE > > because it will send partial pages. > > I don't see a failure. We rely on not splitting WAL records across > messages, but we're talking about libpq-level CopyData messages, not TCP > messages. OK > > Having MAX_SEND_SIZE > PQ_BUFFER_SIZE is pointless, as libpq currently > > stands. > > Well, it does affect the size of the read() in walsender, and I'm sure > there's some overhead in setting the ps display and the other small > stuff we do once per message. But you're probably right that we could > easily make MAX_SEND_SIZE much smaller with no noticeable effect on > performance, while making walsender more responsive to signals. I'll > decrease it to, say, 512 kB. I'm pretty certain we don't need to set the ps display once per message. ps doesn't need an update 5 times per second on average. There's no reason that the buffer size we use for XLogRead() should be the same as the send buffer, if you're worried about that. My point is that pq_putmessage contains internal flushes so at the libpq level you gain nothing by big batches. The read() will be buffered anyway with readahead so not sure what the issue is. We'll have to do this for sync rep anyway, so what's the big deal? Just do it now, once. Do we really want 9.1 code to differ here? -- Simon Riggs www.2ndQuadrant.com
On Tue, 2010-05-18 at 17:08 -0400, Heikki Linnakangas wrote: > On 17/05/10 12:36, Jim Nasby wrote: > > On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote: > >> What exactly is the user trying to monitor? If it's "how far behind is > >> the standby", the difference between pg_current_xlog_insert_location() > >> in the master and pg_last_xlog_replay_location() in the standby seems > >> more robust and well-defined to me. It's a measure of XLOG location (ie. > >> bytes) instead of time, but time is a complicated concept. > > > > I can tell you that end users *will* want a time-based indication of how far behind we are. DBAs will understand "we'rethis many transactions behind", but managers and end users won't. Unless it's unreasonable to provide that info, weshould do so. > > No doubt about that, the problem is that it's hard to provide a reliable > time-based indication. I think I have one now. -- Simon Riggs www.2ndQuadrant.com
On 18/05/10 17:17, Simon Riggs wrote: > There's no reason that the buffer size we use for XLogRead() should be > the same as the send buffer, if you're worried about that. My point is > that pq_putmessage contains internal flushes so at the libpq level you > gain nothing by big batches. The read() will be buffered anyway with > readahead so not sure what the issue is. We'll have to do this for sync > rep anyway, so what's the big deal? Just do it now, once. Do we really > want 9.1 code to differ here? Do what? What exactly is it that you want instead, then? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote: > On 18/05/10 17:17, Simon Riggs wrote: > > There's no reason that the buffer size we use for XLogRead() should be > > the same as the send buffer, if you're worried about that. My point is > > that pq_putmessage contains internal flushes so at the libpq level you > > gain nothing by big batches. The read() will be buffered anyway with > > readahead so not sure what the issue is. We'll have to do this for sync > > rep anyway, so what's the big deal? Just do it now, once. Do we really > > want 9.1 code to differ here? > > Do what? What exactly is it that you want instead, then? Read and write smaller messages, so the latency is minimised. Libpq will send in 8192 byte packets, so writing anything larger gains nothing when the WAL data is also chunked at exactly the same size. -- Simon Riggs www.2ndQuadrant.com
On 19/05/10 00:37, Simon Riggs wrote: > On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote: >> On 18/05/10 17:17, Simon Riggs wrote: >>> There's no reason that the buffer size we use for XLogRead() should be >>> the same as the send buffer, if you're worried about that. My point is >>> that pq_putmessage contains internal flushes so at the libpq level you >>> gain nothing by big batches. The read() will be buffered anyway with >>> readahead so not sure what the issue is. We'll have to do this for sync >>> rep anyway, so what's the big deal? Just do it now, once. Do we really >>> want 9.1 code to differ here? >> >> Do what? What exactly is it that you want instead, then? > > Read and write smaller messages, so the latency is minimised. Libpq will > send in 8192 byte packets, so writing anything larger gains nothing when > the WAL data is also chunked at exactly the same size. Committed with chunk size of 128 kB. I hope that's a reasonable compromise, in the absence of any performance test data either way. I'm wary of setting it as low as 8k, as there is some per-message overhead. Some of that could be avoided by rearranging the loops so that the ps display is not updated at every message as you suggested, but I don't feel like doing any extra rearranging at this point. It would not be hard, but it also certainly wouldn't make the code simpler. I believe in practice 128kB is just as good as 8k from the responsiveness point of view. If a standby is not responding, walsender will be stuck trying to send no matter what the block size is. If it is responding, no matter how slowly, 128kB should get transferred pretty quickly. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> Committed with chunk size of 128 kB. I hope that's a reasonable > compromise, in the absence of any performance test data either way. So where are we with max_standby_delay? Status check? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Thu, 2010-05-27 at 01:34 +0300, Heikki Linnakangas wrote: > Committed with chunk size of 128 kB. Thanks. I'm sure that's fine. -- Simon Riggs www.2ndQuadrant.com
On Sun, 2010-05-16 at 17:11 +0100, Simon Riggs wrote: > New version, with some other cleanup of wait processing. > > New logic is that when Startup asks for next applychunk of WAL it saves > the lastChunkTimestamp. That is then the base time used by > WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later. > Since multiple receivechunks can arrive from primary before Startup asks > for next applychunk we use the oldest receivechunk timestamp, not the > latest. Doing it this way means the lastChunkTimestamp doesn't change > when new keepalives arrive, so we have a stable and reasonably accurate > recordSendTimestamp for each WAL record. > > The keepalive is sent as the first part of a new message, if any. So > partial chunks of data always have an accurate timestamp, even if that > is slightly older as a result. Doesn't make much difference except with > very large chunks. > > I think that addresses the points raised on this thread and others. Was about to post v3 after your last commit, but just found a minor bug in my v2->v3 changes, even though they were fairly light. Too tired to fix now. The general thinking underlying this patch is still the same though and is worth discussing over next few days. -- Simon Riggs www.2ndQuadrant.com
On Wed, 2010-05-26 at 15:45 -0700, Josh Berkus wrote: > > Committed with chunk size of 128 kB. I hope that's a reasonable > > compromise, in the absence of any performance test data either way. > > So where are we with max_standby_delay? Status check? Just this second posted about that, as it turns out. I have a v3 *almost* ready of the keepalive patch. It still makes sense to me after a few days reflection, so is worth discussion and review. In or out, I want this settled within a week. Definitely need some R&R here. -- Simon Riggs www.2ndQuadrant.com
> Just this second posted about that, as it turns out. > > I have a v3 *almost* ready of the keepalive patch. It still makes sense > to me after a few days reflection, so is worth discussion and review. In > or out, I want this settled within a week. Definitely need some R&R > here. Does the keepalive fix all the issues with max_standby_delay? Tom? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote: > > Just this second posted about that, as it turns out. > > > > I have a v3 *almost* ready of the keepalive patch. It still makes sense > > to me after a few days reflection, so is worth discussion and review. In > > or out, I want this settled within a week. Definitely need some R&R > > here. > > Does the keepalive fix all the issues with max_standby_delay? Tom? OK, here's v4. Summary * WALSender adds a timestamp onto the header of every WAL chunk sent. * Each WAL record now has a conceptual "send timestamp" that remains constant while that record is replayed. This is used as the basis from which max_standby_delay is calculated when required during replay. * Send timestamp is calculated as the later of the timestamp of chunk in which WAL record was sent and the latest XLog time. * WALSender sends an empty message as a keepalive when nothing else to send. (No longer a special message type for the keepalive). I think it's close, but if there's a gaping hole here somewhere then I'll punt for this release. -- Simon Riggs www.2ndQuadrant.com
Attachment
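[A sketch of the wire format the v4 summary describes: every chunk carries a send timestamp in its header, and a header-only message doubles as the keepalive. The exact field order, the helper names, and the 8k cap are assumptions for illustration, not the patch's actual layout.]

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

typedef int64_t TimestampTz;
typedef struct XLogRecPtr { uint32_t xlogid; uint32_t xrecoff; } XLogRecPtr;

#define MAX_CHUNK 8192                  /* illustration only */

static TimestampTz
current_timestamp(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (TimestampTz) tv.tv_sec * 1000000 + tv.tv_usec;
}

/* stand-in for pq_putmessage(): ship one CopyData message */
static void
put_message(const char *buf, size_t len)
{
    (void) buf;
    printf("'w' chunk, %zu bytes\n", len);
}

/*
 * Send one WAL chunk with a send timestamp in its header.  With
 * nbytes == 0 this degenerates into the keepalive: header, no payload.
 */
static void
send_wal_chunk(XLogRecPtr startptr, const char *wal, size_t nbytes)
{
    char        buf[sizeof(XLogRecPtr) + sizeof(TimestampTz) + MAX_CHUNK];
    size_t      off = 0;
    TimestampTz sendtime = current_timestamp();

    memcpy(buf + off, &startptr, sizeof(startptr));
    off += sizeof(startptr);
    memcpy(buf + off, &sendtime, sizeof(sendtime));
    off += sizeof(sendtime);
    if (nbytes > 0)
    {
        memcpy(buf + off, wal, nbytes);
        off += nbytes;
    }
    put_message(buf, off);
}

int
main(void)
{
    XLogRecPtr start = { 0, 0 };

    send_wal_chunk(start, NULL, 0);     /* keepalive: empty payload */
    return 0;
}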
Uh, we have three days before we package 9.0beta2. It would be good if we could decide on the max_standby_delay issue soon. --------------------------------------------------------------------------- Simon Riggs wrote: > On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote: > > > Just this second posted about that, as it turns out. > > > > > > I have a v3 *almost* ready of the keepalive patch. It still makes sense > > > to me after a few days reflection, so is worth discussion and review. In > > > or out, I want this settled within a week. Definitely need some R&R > > > here. > > > > Does the keepalive fix all the issues with max_standby_delay? Tom? > > OK, here's v4. > > Summary > > * WALSender adds a timestamp onto the header of every WAL chunk sent. > > * Each WAL record now has a conceptual "send timestamp" that remains > constant while that record is replayed. This is used as the basis from > which max_standby_delay is calculated when required during replay. > > * Send timestamp is calculated as the later of the timestamp of chunk in > which WAL record was sent and the latest XLog time. > > * WALSender sends an empty message as a keepalive when nothing else to > send. (No longer a special message type for the keepalive). > > I think it's close, but if there's a gaping hole here somewhere then I'll > punt for this release. > > -- > Simon Riggs www.2ndQuadrant.com [ Attachment, skipping... ] > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
On 27/05/10 20:26, Simon Riggs wrote: > On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote: >>> Just this second posted about that, as it turns out. >>> >>> I have a v3 *almost* ready of the keepalive patch. It still makes sense >>> to me after a few days reflection, so is worth discussion and review. In >>> or out, I want this settled within a week. Definitely need some R&R >>> here. >> >> Does the keepalive fix all the issues with max_standby_delay? Tom? > > OK, here's v4. > > Summary > > * WALSender adds a timestamp onto the header of every WAL chunk sent. > > * Each WAL record now has a conceptual "send timestamp" that remains > constant while that record is replayed. This is used as the basis from > which max_standby_delay is calculated when required during replay. > > * Send timestamp is calculated as the later of the timestamp of chunk in > which WAL record was sent and the latest XLog time. > > * WALSender sends an empty message as a keepalive when nothing else to > send. (No longer a special message type for the keepalive). > > I think it's close, but if there's a gaping hole here somewhere then I'll > punt for this release. This certainly alleviates some of the problems. You still need to ensure that master and standby have synchronized clocks, and you still get zero grace time after a long period of inactivity when not using streaming replication, however. Sending a keep-alive message every 100ms seems overly aggressive to me. If we really want to try to salvage max_standby_delay with a meaning similar to what it has now, I think we should go with the idea some people bashed around earlier and define the grace period as the difference between a WAL record becoming available to the standby for replay and the standby actually replaying it. An approximation of that is to do "lastIdle = gettimeofday()" in XLogPageRead() whenever it needs to wait for new WAL to arrive, whether that's via streaming replication or by a success return code from restore_command, and compare the difference of that with current timestamp in WaitExceedsMaxStandbyDelay(). That's very simple, doesn't require synchronized clocks, and works the same with file- and stream-based setups. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
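[A sketch of the alternative definition proposed here, assuming a single lastIdle variable shared between XLogPageRead() and the wait test. Only slave-local timestamps are ever compared, so clock synchronization never enters into it.]

#include <stdbool.h>
#include <stdint.h>
#include <sys/time.h>

typedef int64_t TimestampTz;            /* microseconds */

static TimestampTz lastIdle;            /* set whenever we had to wait for WAL */
static int MaxStandbyDelay = 30000;     /* milliseconds; -1 = wait forever */

static TimestampTz
current_time_us(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (TimestampTz) tv.tv_sec * 1000000 + tv.tv_usec;
}

/* called from XLogPageRead() whenever replay ran out of WAL and waited */
static void
note_replay_idle(void)
{
    lastIdle = current_time_us();
}

/*
 * Grace period is measured from the moment the WAL being replayed became
 * available, approximated by the last time replay went idle.
 */
static bool
WaitExceedsMaxStandbyDelay(void)
{
    if (MaxStandbyDelay < 0)
        return false;                   /* wait forever */
    return current_time_us() - lastIdle >
           (TimestampTz) MaxStandbyDelay * 1000;
}

int
main(void)
{
    note_replay_idle();
    return WaitExceedsMaxStandbyDelay() ? 1 : 0;
}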
Thanks for the review. On Tue, 2010-06-01 at 13:36 +0300, Heikki Linnakangas wrote: > If we really want to try to salvage max_standby_delay with a meaning > similar to what it has now, I think we should go with the idea some > people bashed around earlier and define the grace period as the > difference between a WAL record becoming available to the standby for > replay and the standby actually replaying it. An approximation of that is to do > "lastIdle = gettimeofday()" in XLogPageRead() whenever it needs to wait > for new WAL to arrive, whether that's via streaming replication or by a > success return code from restore_command, and compare the difference of > that with current timestamp in WaitExceedsMaxStandbyDelay(). That wouldn't cope with a continuous stream of records arriving, unless you also include the second half of the patch. > That's very simple, doesn't require synchronized clocks, and works the > same with file- and stream-based setups. Nor does it provide a mechanism for monitoring of SR. standby_delay is explicitly defined in terms of the gap between two servers, so is a useful real world concept. apply_delay is somewhat less interesting. I'm sure most people would rather have monitoring and therefore the requirement for synchronised-ish clocks, than no monitoring. If you think no monitoring is OK, I don't, but there are other ways, so it's not a point to fight about. > This certainly alleviates some of the problems. You still need to ensure > that master and standby have synchronized clocks, and you still get zero > grace time after a long period of inactivity when not using streaming > replication, however. Handling for the second issue can be added once we approve the rest of this, if you like. > Sending a keep-alive message every 100ms seems overly aggressive to me. It's sent every wal_sender_delay. Why is that a negative? -- Simon Riggs www.2ndQuadrant.com
On Mon, 2010-05-31 at 14:40 -0400, Bruce Momjian wrote: > Uh, we have three days before we package 9.0beta2. It would be good if > we could decide on the max_standby_delay issue soon. I've heard something from Heikki, not from anyone else. Those comments amount to "lets replace max_standby_delay with max_apply_delay". Got a beta idea? -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> wrote: > On Mon, 2010-05-31 at 14:40 -0400, Bruce Momjian wrote: > >> Uh, we have three days before we package 9.0beta2. It would be >> good if we could decide on the max_standby_delay issue soon. > > I've heard something from Heikki, not from anyone else. Those > comments amount to "lets replace max_standby_delay with > max_apply_delay". > > Got a beta idea? Given the incessant ticking of the clock, I have a hard time believing we have any real options besides max_standby_delay or a boolean which corresponds to the -1 and 0 settings of max_standby_delay. I think it's pretty clear that there's a use case for the positive values, although there are bound to be some who try it and are surprised by behavior at transition from idle to active. The whole debate seems to boil down to how important a middle ground is versus how damaging the surprise factor is. (I don't really buy the argument that we won't be able to remove it later if we replace it with something better.) I know there were initially some technical problems, too; have those been resolved? -Kevin
Simon Riggs <simon@2ndQuadrant.com> writes: > OK, here's v4. I've been trying to stay out of this thread, but with beta2 approaching and no resolution in sight, I'm afraid I have to get involved. This patch seems to me to be going in fundamentally the wrong direction. It's adding complexity and overhead (far more than is needed), and it's failing utterly to resolve the objections that I raised to start with. In particular, Simon seems to be basically refusing to do anything about the complaint that the code fails unless master and standby clocks are in close sync. I do not believe that this is acceptable, and since he won't fix it, I guess I'll have to. The other basic problem that I had with the current code, which this does nothing about, is that it believes that timestamps pulled from WAL archive should be treated on the same footing as timestamps of "live" actions. That might work in certain limited scenarios but it's going to be a disaster in general. I believe that the motivation for treating archived timestamps as live is, essentially, to force rapid catchup if a slave falls behind so far that it's reading from archive instead of SR. There are certainly use-cases where that's appropriate (though, again, not all); but even when you do want it it's a pretty inflexible implementation. For realistic values of max_standby_delay the behavior is going to pretty much be "instant kill whenever we're reading from archive". I have backed off my original suggestion that we should reduce max_standby_delay to a boolean: that was based on the idea that delays would only occur in response to DDL on the master, but since vacuum and btree page splits can also trigger delays, it seems clear that a "kill queries immediately" policy isn't really very useful in practice. So we need to make max_standby_delay work rather than just get rid of it. What I think might be a realistic compromise is this: 1. Separate max_standby_delay into two GUCs, say "max_streaming_delay" and "max_archive_delay". 2. When applying WAL that came across SR, use max_streaming_delay and let the time measurement be current time minus time of receipt of the current WAL send chunk. 3. When applying WAL that came from archive, use max_archive_delay and let the time measurement be current time minus time of acquisition of the current WAL segment from the archive. The current code's behavior in the latter case could effectively be modeled by setting max_archive_delay to zero, but that isn't the only plausible setting. More likely DBAs would set max_archive_delay to something smaller than max_streaming_delay, but still positive so as to not kill conflicting queries instantly. An important property of this design is that all relevant timestamps are taken on the slave, so clock skew isn't an issue. I'm still inclined to apply the part of Simon's patch that adds a transmit timestamp to each SR send chunk. That would actually be completely unused by the slave given my proposal above, but I think that it is an important step to take to future-proof the SR protocol against possible changes in the slave-side timing logic. I don't however see the value of transmitting "keepalive" records when we'd otherwise not have anything to send. The value of adding timestamps to the SR protocol is really to let the slave determine the current amount of clock skew between it and the master; which is a number that should not change so rapidly that it has to be updated every 100ms even in an idle system. Comments? regards, tom lane
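[A sketch of the timing rule in points 1-3 above, with hypothetical names for the two GUCs and the two receipt instants. Everything is measured on the slave's clock, which is the point of the design.]

#include <stdbool.h>
#include <stdint.h>
#include <sys/time.h>

typedef int64_t TimestampTz;                /* microseconds */
typedef enum { WAL_FROM_STREAM, WAL_FROM_ARCHIVE } WalSource;

/* the two proposed GUCs, in milliseconds; -1 = wait forever */
static int max_streaming_delay = 30000;
static int max_archive_delay = 1000;

/* receipt instants, both taken on the slave's clock */
static TimestampTz chunkReceiptTime;        /* current SR send chunk arrived */
static TimestampTz segmentRestoreTime;      /* current segment restored from archive */

static TimestampTz
now_us(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (TimestampTz) tv.tv_sec * 1000000 + tv.tv_usec;
}

static bool
WaitExceedsDelay(WalSource source)
{
    TimestampTz base = (source == WAL_FROM_STREAM) ?
                        chunkReceiptTime : segmentRestoreTime;
    int         limit = (source == WAL_FROM_STREAM) ?
                        max_streaming_delay : max_archive_delay;

    if (limit < 0)
        return false;                       /* wait forever */
    /* all timestamps are slave-local, so clock skew never enters into it */
    return now_us() - base > (TimestampTz) limit * 1000;
}

int
main(void)
{
    chunkReceiptTime = segmentRestoreTime = now_us();
    return WaitExceedsDelay(WAL_FROM_STREAM) ? 1 : 0;
}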
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > An important property of this design is that all relevant timestamps > are taken on the slave, so clock skew isn't an issue. I agree that this is important, and I do run NTP on all my servers and even monitor it using Nagios. It's still not a cure-all for time skew issues. > Comments? I'm not really a huge fan of adding another GUC, to be honest. I'm more inclined to say we treat 'max_archive_delay' as '0', and turn max_streaming_delay into what you've described. If we fall back so far that we have to go back to reading WALs, then we need to hurry up and catch up and damn the torpedoes. I'd also prefer that we only wait the delay time once until we're fully caught up again (and have gotten back around to waiting for new data). Thanks, Stephen
Tom Lane wrote: > I'm still inclined to apply the part of Simon's patch that adds a > transmit timestamp to each SR send chunk. That would actually be > completely unused by the slave given my proposal above, but I think that > it is an important step to take to future-proof the SR protocol against > possible changes in the slave-side timing logic. > +1. From a radically different perspective, I had to do something similar in the buildfarm years ago to protect us from machines reporting with grossly inaccurate timestamps. This was part of the solution. The client adds its current timestamp setting just before transmitting the data to the server. cheers andrew
Stephen Frost <sfrost@snowman.net> writes: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> Comments? > I'm not really a huge fan of adding another GUC, to be honest. I'm more > inclined to say we treat 'max_archive_delay' as '0', and turn > max_streaming_delay into what you've described. If we fall back so far > that we have to go back to reading WALs, then we need to hurry up and > catch up and damn the torpedoes. If I thought that 0 were a generally acceptable value, I'd still be pushing the "simplify it to a boolean" agenda ;-). The problem is that that will sometimes kill standby queries even when they are quite short and doing nothing objectionable. > I'd also prefer that we only wait the > delay time once until we're fully caught up again (and have gotten > back around to waiting for new data). The delays will be measured from a receipt instant to current time, which means that the longer it takes to apply a WAL segment or WAL send chunk, the less grace period there will be. (Which is the same as what CVS HEAD does --- I'm just arguing about where we get the start time from.) I believe this does what you suggest and more. regards, tom lane
On Wed, 2010-06-02 at 13:45 -0400, Tom Lane wrote: > Stephen Frost <sfrost@snowman.net> writes: > > * Tom Lane (tgl@sss.pgh.pa.us) wrote: > >> Comments? > > > I'm not really a huge fan of adding another GUC, to be honest. I'm more > > inclined to say we treat 'max_archive_delay' as '0', and turn > > max_streaming_delay into what you've described. If we fall back so far > > that we have to go back to reading WALs, then we need to hurry up and > > catch-up and damn the torpedos. > > If I thought that 0 were a generally acceptable value, I'd still be > pushing the "simplify it to a boolean" agenda ;-). The problem is that > that will sometimes kill standby queries even when they are quite short > and doing nothing objectionable. OK, now I understand. I was just thinking the same as Stephen, but now I agree we need a second parameter. -- Simon Riggs www.2ndQuadrant.com
On Wed, Jun 2, 2010 at 2:03 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2010-06-02 at 13:45 -0400, Tom Lane wrote: >> Stephen Frost <sfrost@snowman.net> writes: >> > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> >> Comments? >> >> > I'm not really a huge fan of adding another GUC, to be honest. I'm more >> > inclined to say we treat 'max_archive_delay' as '0', and turn >> > max_streaming_delay into what you've described. If we fall back so far >> > that we have to go back to reading WALs, then we need to hurry up and >> > catch-up and damn the torpedos. >> >> If I thought that 0 were a generally acceptable value, I'd still be >> pushing the "simplify it to a boolean" agenda ;-). The problem is that >> that will sometimes kill standby queries even when they are quite short >> and doing nothing objectionable. > > OK, now I understand. I was just thinking the same as Stephen, but now I > agree we need a second parameter. I too was thinking the same as Stephen, but now I also agree we need a second parameter. I think the whole design Tom proposed seems good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I believe that the motivation for treating archived timestamps as live > is, essentially, to force rapid catchup if a slave falls behind so far > that it's reading from archive instead of SR. Huh, I think this is the first mention of this that I've seen. I always assumed the motivation was just that you wanted to control how much data loss could occur on failover and how long recovery would take. I think separating the two delays is an interesting idea but I don't see how it counts as a simplification. This also still allows a slave to become arbitrarily far behind the master. The master might take a long time to fill up a log file, and if the slave is blocked for a few minutes, then requests the next bit of log file and blocks for a few more minutes, then requests the next log file and blocks for a few more minutes, etc. As long as the slave is reading more slowly than the master is filling those segments it will fall further and further behind but never be more than a few minutes behind receiving the logs. I propose an alternate way out of the problem of syncing two clocks. Instead of comparing timestamps compare time intervals. So as it reads xlog records it only ever compares the master timestamps with previous master timestamps to determine how much time has elapsed on the master. It compares that time elapsed with the time elapsed on the slave to determine if it's falling behind. I'm assuming it periodically catches up and can throw out its accumulated time elapsed and start fresh. If not then the accumulated error might be a problem, but hopefully one we can find a solution to. -- greg
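[A sketch of the interval comparison proposed here, assuming the replay loop feeds it master timestamps as it sees them in the WAL and resets the baseline whenever it catches up. The function names are hypothetical.]

#include <stdint.h>

typedef int64_t TimestampTz;    /* microseconds */

/* baseline captured the last time the standby was known to be caught up */
static TimestampTz masterBase;  /* master timestamp seen in WAL at that point */
static TimestampTz slaveBase;   /* slave clock at the same moment */

/* reset the baselines whenever replay catches up, limiting accumulated error */
static void
note_caught_up(TimestampTz masterNow, TimestampTz slaveNow)
{
    masterBase = masterNow;
    slaveBase = slaveNow;
}

/*
 * Lag estimate that never compares the two clocks directly: how much more
 * time has elapsed on the slave than on the master since the baseline.
 */
static TimestampTz
estimated_lag(TimestampTz masterNow, TimestampTz slaveNow)
{
    return (slaveNow - slaveBase) - (masterNow - masterBase);
}

int
main(void)
{
    note_caught_up(1000, 5000);             /* arbitrary example instants */
    return estimated_lag(2000, 6500) > 0;   /* 500us behind in this example */
}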
On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote: > This patch seems to me to be going in fundamentally the wrong direction. > It's adding complexity and overhead (far more than is needed), and it's > failing utterly to resolve the objections that I raised to start with. Having read your proposal, it seems changing from time-on-sender to time-on-receiver is a one-line change to the patch. What else are you thinking of removing, if anything? Adding an extra parameter obviously adds more, but it is something I now agree with. > In particular, Simon seems to be basically refusing to do anything about > the complaint that the code fails unless master and standby clocks are > in close sync. I do not believe that this is acceptable, and since he > won't fix it, I guess I'll have to. Syncing two servers in replication is common practice, as has been explained here; I'm still surprised people think otherwise. Measuring the time between two servers is the very purpose of the patch, so the synchronisation is not a design flaw, it is its raison d'etre. There have been a few spleens emptied on that topic, not all of them mine, and certainly no consensus on that. So I'm not refusing to do anything that's been agreed... -- Simon Riggs www.2ndQuadrant.com
On Wed, Jun 2, 2010 at 2:27 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Syncing two servers in replication is common practice, as has been > explained here; I'm still surprised people think otherwise. Measuring > the time between two servers is the very purpose of the patch, so the > synchronisation is not a design flaw, it is its raison d'etre. I think the purpose of the patch should not be to measure the time difference between servers, but rather the replication delay. While I don't necessarily agree with Tom's statement that this is must-fix, I do agree that it would be nicer if we could avoid depending on time sync. Yeah, I keep my servers time synced, too. But, shit happens. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
I wrote: > I'm still inclined to apply the part of Simon's patch that adds a > transmit timestamp to each SR send chunk. That would actually be > completely unused by the slave given my proposal above, but I think that > it is an important step to take to future-proof the SR protocol against > possible changes in the slave-side timing logic. On further contemplation, it seems like the protocol needs another field besides that: each record should also carry a boolean indicating whether walsender.c thinks it is currently "caught up", ie the record carries all WAL data up to the current end of WAL. If the sender is not caught up, then the receiver should apply max_archive_delay not max_streaming_delay. In this way, WAL chunks that are a little bit behind current time will be treated the same way whether they come across the SR link or via the archive. In conjunction with that, I think there's a logic bug in XLogSend; it ought to be changed like so:

  /* if we went beyond SendRqstPtr, back off */
  if (XLByteLT(SendRqstPtr, endptr))
+ {
      endptr = SendRqstPtr;
+     *caughtup = true;
+ }

In the current coding, the effect of not setting *caughtup here is just that we uselessly call XLogSend an extra time for each transmission (because the main loop won't ever delay immediately after a transmission). But without this, we'd never send caughtup = true to the slave. regards, tom lane
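On the slave side, the proposed boolean would presumably be consumed along these lines; a sketch only, with illustrative names:

    #include <stdbool.h>

    /*
     * Pick the grace period for a conflicting standby query based on the
     * caught-up flag carried by the last received chunk.  Hypothetical
     * helper, not part of the posted patch.
     */
    static int
    applicable_max_delay(bool sender_caught_up,
                         int max_streaming_delay,
                         int max_archive_delay)
    {
        /*
         * A caught-up chunk carries WAL right up to the sender's end of
         * WAL, so the lenient streaming delay applies; otherwise we are
         * replaying backlog and the archive delay governs.
         */
        return sender_caught_up ? max_streaming_delay : max_archive_delay;
    }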
Greg Stark <gsstark@mit.edu> writes: > On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I believe that the motivation for treating archived timestamps as live >> is, essentially, to force rapid catchup if a slave falls behind so far >> that it's reading from archive instead of SR. > Huh, I think this is the first mention of this that I've seen. I > always assumed the motivation was just that you wanted to control how > much data loss could occur on failover and how long recovery would > take. I think separating the two delays is an interesting idea but I > don't see how it counts as a simplification. Well, it isn't a simplification: it's bringing it up to the minimum complication level where it'll actually work sanely. The current implementation doesn't work sanely because it confuses stale timestamps read from WAL with real live time. > This also still allows a slave to become arbitrarily far behind the > master. Indeed, but nothing we do can prevent that, if the slave is just plain slower than the master. You have to assume that the slave is capable of keeping up in the absence of query-caused delays, or you're hosed. The real reason this is at issue is the fact that the max_standby_delay kill mechanism applies to certain buffer-level locking operations. On the master we just wait, and it's not a problem, because in practice the conflicting queries almost always release these locks pretty quick. On the slave, though, instant kill as a result of a buffer-level lock conflict would result in a very serious degradation in standby query reliability (while also doing practically nothing for the speed of WAL application, most of the time). This morning on the phone Bruce and I were seriously discussing the idea of ripping the max_standby_delay mechanism out of the buffer-level locking paths, and just let them work like they do on the master, ie, wait forever. If we did that then simplifying max_standby_delay to a boolean would be reasonable again (because it really would only come into play for DDL on the master). The sticky point is that once in a blue moon you do have a conflicting query sitting on a buffer lock for a long time, or even more likely a series of queries keeping the WAL replay process from obtaining buffer cleanup lock. So it seems that we have to have max_standby_delay-like logic for those locks, and also that zero grace period before kill isn't a very practical setting. However, there isn't a lot of point in obsessing over exactly how long the grace period ought to be, as long as it's more than a few milliseconds. It *isn't* going to have any real effect on whether the slave can stay caught up. You could make a fairly decent case for just measuring the grace period from when the replay process starts to wait, as I think I proposed awhile back. The value of measuring delay from a receipt time is that if you do happen to have a bunch of delays within a short interval you'll get more willing to kill queries --- but I really believe that that is a corner case and will have nothing to do with ordinary performance. > I propose an alternate way out of the problem of syncing two clocks. > Instead of comparing timestamps compare time intervals. So as it reads > xlog records it only ever compares the master timestamps with previous > master timestamps to determine how much time has elapsed on the > master. It compares that time elapsed with the time elapsed on the > slave to determine if it's falling behind. 
I think this would just add complexity and uncertainty, to deal with something that won't be much of a problem in practice. regards, tom lane
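The alternative Tom floats above, measuring the grace period from when the replay process starts to wait, would amount to something like this sketch (hypothetical names):

    #include <stdbool.h>
    #include <time.h>

    static time_t wait_start;   /* set when WAL replay first blocks */

    /* Call when the startup process begins waiting on a conflicting query. */
    static void
    replay_begins_waiting(void)
    {
        wait_start = time(NULL);
    }

    /* Cancel the conflicting query once the grace period has run out. */
    static bool
    grace_period_expired(double grace_seconds)
    {
        return difftime(time(NULL), wait_start) > grace_seconds;
    }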
On 02/06/10 21:44, Tom Lane wrote: > In conjunction with that, I think there's a logic bug in XLogSend; > it ought to be changed like so: > > /* if we went beyond SendRqstPtr, back off */ > if (XLByteLT(SendRqstPtr, endptr)) > + { > endptr = SendRqstPtr; > + *caughtup = true; > + } > > In the current coding, the effect of not setting *caughtup here is just > that we uselessly call XLogSend an extra time for each transmission > (because the main loop won't ever delay immediately after a > transmission). But without this, we'd never send caughtup = true > to the slave. That's intentional. It could take some time for the WAL to be sent, if the network is busy, so by the time XLogSend returns you might well not be caught up anymore. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 02/06/10 21:44, Tom Lane wrote: >> In the current coding, the effect of not setting *caughtup here is just >> that we uselessly call XLogSend an extra time for each transmission >> (because the main loop won't ever delay immediately after a >> transmission). But without this, we'd never send caughtup = true >> to the slave. > That's intentional. It could take some time for the WAL to be sent, if > the network is busy, so by the time XLogSend returns you might well not > be caught up anymore. It may have been intentional, but it's still wrong. If you were able to pull all of WAL into the record-to-be-sent, you should sleep afterwards, not send an extra record containing a few more bytes. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > I wrote: >> I'm still inclined to apply the part of Simon's patch that adds a >> transmit timestamp to each SR send chunk. That would actually be >> completely unused by the slave given my proposal above, but I think that >> it is an important step to take to future-proof the SR protocol against >> possible changes in the slave-side timing logic. > > On further contemplation, it seems like the protocol needs another field > besides that: each record should also carry a boolean indicating whether > walsender.c thinks it is currently "caught up", ie the record carries > all WAL data up to the current end of WAL. If the sender is not caught > up, then the receiver should apply max_archive_delay not > max_streaming_delay. In this way, WAL chunks that are a little bit > behind current time will be treated the same way whether they come > across the SR link or via the archive. Then we'd have max_catchup_delay and max_streaming_delay, wouldn't we? I'm still trying to understand the implications of the proposal, which sounds good but… given just enough load on the slave, wouldn't it be playing catchup all the time? Wouldn't enough load mean one read query per write commit on the master? Regards, -- dim
On 02/06/10 20:14, Tom Lane wrote: > For realistic values of max_standby_delay ... Hang on right there. What do you consider a realistic value for max_standby_delay? Because I'm not sure I have a grip on that myself. 5 seconds? 5 minutes? 5 hours? I can see use cases for all of those... > What I think might be a realistic compromise is this: > > 1. Separate max_standby_delay into two GUCs, say "max_streaming_delay" > and "max_archive_delay". > > 2. When applying WAL that came across SR, use max_streaming_delay and > let the time measurement be current time minus time of receipt of the > current WAL send chunk. > > 3. When applying WAL that came from archive, use max_archive_delay and > let the time measurement be current time minus time of acquisition of > the current WAL segment from the archive. > > The current code's behavior in the latter case could effectively be > modeled by setting max_archive_delay to zero, but that isn't the only > plausible setting. More likely DBAs would set max_archive_delay to > something smaller than max_streaming_delay, but still positive so as to > not kill conflicting queries instantly. The problem with defining max_archive_delay that way is again that you can fall behind indefinitely. If you set it to 5 minutes, it means that you'll wait a maximum of 5 minutes *per WAL segment*, even if WAL is being generated faster. I don't understand why you want to use a different delay when you're restoring from archive vs. when you're streaming (what about existing WAL files found in pg_xlog, BTW?). The source of WAL shouldn't make a difference. If it's because you assume that restoring from archive is a sign that you've fallen behind a lot, surely you've exceeded max_standby_delay then and I still don't see a need for a separate GUC. I stand by my suggestion from yesterday: Let's define max_standby_delay as the difference between a piece of WAL becoming available in the standby, and applying it. To approximate "piece of WAL becoming available" for SR, we can use the mechanism with send/applyChunks from Simon's latest patch, or go with the simpler scheme of just resetting a "last caughtup timestamp" to current time whenever we have to wait for new WAL to arrive. When restoring from archive, likewise reset "last caughtup timestamp" whenever restore_command returns non-0, i.e. we have to wait for the next WAL file to arrive. That works the same for both SR and file-based log shipping; there's only one knob to set, it's simple to implement, and it doesn't require synchronized clocks. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
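A sketch of that single-knob scheme, assuming both the SR and archive paths can report when they had to wait for WAL (names illustrative):

    #include <stdbool.h>
    #include <time.h>

    static time_t last_caughtup_time;   /* last moment we had no WAL waiting */

    /*
     * Call whenever we must wait for WAL to become available: walreceiver
     * has nothing new, or restore_command returned non-zero.
     */
    static void
    note_waiting_for_wal(void)
    {
        last_caughtup_time = time(NULL);
    }

    /*
     * The standby delay is then the time since WAL last became available
     * faster than we could apply it, regardless of where it came from.
     */
    static bool
    standby_delay_exceeded(double max_standby_delay)
    {
        return difftime(time(NULL), last_caughtup_time) > max_standby_delay;
    }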
Dimitri Fontaine <dfontaine@hi-media.com> writes: > I'm still trying to understand the implications of the proposal, which > sounds good but… given just enough load on the slave, wouldn't it be > playing catchup all the time? Uh, if the slave is overloaded, *any* implementation will be playing catchup all the time. Not sure about your point here. regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > The problem with defining max_archive_delay that way is again that you > can fall behind indefinitely. See my response to Greg Stark --- I don't think this is really an issue. It's certainly far less of an issue than the current situation that query grace periods go to zero under not-at-all-extreme conditions. > I don't understand why you want to use a different delay when you're > restoring from archive vs. when you're streaming (what about existing > WAL files found in pg_xlog, BTW?). You're missing the point. I want the DBA to be able to control what happens in those two cases. In the current implementation he has no control over what happens while restoring from archive: it's going to effectively act like max_archive_delay = 0 all the time. If you're of the opinion that's good, you can set the parameter that way and be happy. I'm of the opinion you'll soon find out it isn't so good, because it'll kill standby queries too easily. > I stand by my suggestion from yesterday: Let's define max_standby_delay > as the difference between a piece of WAL becoming available in the > standby, and applying it. My proposal is essentially the same as yours plus allowing the DBA to choose different max delays for the caught-up and not-caught-up cases. Maybe everybody will end up setting the two delays the same, but I think we don't have enough experience to decide that for them now. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Uh, if the slave is overloaded, *any* implementation will be playing > catchup all the time. Not sure about your point here. Well, sorry, I missed the part where the catchup is measured on the walsender side of things. Now that means that an uninterrupted flow of queries quicker than the wal sender delay (100ms, right?) on the slave would certainly leave enough room for it to stay current. Well, that also depends on the time it takes to replay the WALs. I'm trying to decide if exposing this 100ms magic setup (or something else) would allow for a lot more control with respect to what it means to be overloaded. Say you set the 100ms delay to any value that you know (from testing and log_min_duration_statement, say) is just a little higher than the slowest query you typically run on the slave. If that reduces the chances of ever playing catch-up (as long as there's no unexpected slow query), that would be great. If you can manage to make sense of this, I'm interested in hearing how far it is from reality. Regards, -- dim
On Wed, Jun 2, 2010 at 2:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> I'm still inclined to apply the part of Simon's patch that adds a >> transmit timestamp to each SR send chunk. That would actually be >> completely unused by the slave given my proposal above, but I think that >> it is an important step to take to future-proof the SR protocol against >> possible changes in the slave-side timing logic. > > On further contemplation, it seems like the protocol needs another field > besides that: each record should also carry a boolean indicating whether > walsender.c thinks it is currently "caught up", ie the record carries > all WAL data up to the current end of WAL. If the sender is not caught > up, then the receiver should apply max_archive_delay not > max_streaming_delay. In this way, WAL chunks that are a little bit > behind current time will be treated the same way whether they come > across the SR link or via the archive. I'm not sure that makes sense. I thought the point of separating the archive and streaming cases was that the same timeout wouldn't necessarily be correct for a 16MB WAL file as it would for a 16-byte WAL record you've just received. IOW, you might want max_archive_delay > max_streaming_delay. The original proposal also seems somewhat easier to understand, to me at least. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, Jun 2, 2010 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Indeed, but nothing we do can prevent that, if the slave is just plain > slower than the master. You have to assume that the slave is capable of > keeping up in the absence of query-caused delays, or you're hosed. I was assuming the walreceiver only requests more wal in relatively small chunks and only when replay has caught up and needs more data. I haven't actually read this code so if that assumption is wrong then I'm off-base. But if my assumption is right then it's not merely the master running faster than the slave that can cause you to fall arbitrarily far behind. The "receive time" is delayed by how long the slave waited on a previous lock, so if we wait for a few minutes, then proceed and find we need more wal data, we'll read data from the master that it could have generated those few minutes ago. > The sticky point is that once in a blue moon you do have a conflicting > query sitting on a buffer lock for a long time, or even more likely a > series of queries keeping the WAL replay process from obtaining buffer > cleanup lock. So I think this isn't necessarily such a blue moon event. As I understand it, all it would take is a single long-running report and a vacuum or HOT cleanup occurring on the master. If I want to set max_standby_delay to 60min to allow reports to run for up to an hour then any random HOT cleanup on the master will propagate to the slave and cause a WAL stall until the transactions which have that xid in their snapshot end. Even the buffer locks could potentially be blocked for a long time if you happen to run the right kind of query (say a nested loop with the buffer in question on the outside and a large table on the inside). That's a rarer event though; is that what you were thinking of? -- greg
Greg Stark <gsstark@mit.edu> writes: > I was assuming the walreceiver only requests more wal in relatively > small chunks and only when replay has caught up and needs more data. I > haven't actually read this code so if that assumption is wrong then > I'm off-base. It is off-base. The receiver does not "request" data, the sender is what determines how much WAL is sent when. > So I think this isn't necessarily such a blue moon event. As I > understand it, all it would take is a single long-running report and a > vacuum or HOT cleanup occurring on the master. I think this is mostly FUD too. How often do you see vacuum blocked for an hour now? It probably can happen, which is why we need to be able to kick queries off the locks with max_standby_delay, but it's far from common. What we're concerned about here is how long a buffer lock on a single page is held, not how long heavyweight locks are held. The normal hold times are measured in microseconds. regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Greg Stark <gsstark@mit.edu> writes: > > So I think this isn't necessarily such a blue moon event. As I > > understand it, all it would take is a single long-running report and a > > vacuum or HOT cleanup occurring on the master. > > I think this is mostly FUD too. How often do you see vacuum blocked for > an hour now? It probably can happen, which is why we need to be able to > kick queries off the locks with max_standby_delay, but it's far from > common. What we're concerned about here is how long a buffer lock on a > single page is held, not how long heavyweight locks are held. The > normal hold times are measured in microseconds. Maybe I'm missing something, but I think Greg's point was that if you have a long-running query running against the standby/slave/whatever, which is holding locks on various relations to implement that report, and then a vacuum or HOT update happens on the master, the long-running report query will get killed off unless you bump max_streaming_delay up pretty high (eg: 60 mins). That being said, I'm not sure that there's really another solution. Yes, in this case, the slave can end up being an hour behind, but that's the trade-off you have to make if you want to run an hour-long query on the slave. The other answer is to make the master not update those tuples, etc, which might be possible by starting a transaction on the master which grabs things enough to prevent the vacuum/hot/etc update from happening. That may be possible manually, but it's not fun and it certainly isn't something we'll have built-in support for in 9.0. Thanks, Stephen
On Wed, 2010-06-02 at 16:00 -0400, Tom Lane wrote: > the current situation that query grace periods go to zero Possibly a better way to handle this concern is to make the second parameter: min_standby_grace_period - the minimum time a query will be given in which to execute, even if max_standby_delay has been reached or exceeded. Would that more directly address your concerns? min_standby_grace_period (ms) SIGHUP -- Simon Riggs www.2ndQuadrant.com
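As a sketch, the two limits would combine so that whichever deadline falls later wins; the helper and its millisecond units are illustrative, not from any posted patch:

    /*
     * Deadline for cancelling a conflicting query when both limits are in
     * force: the query always gets at least min_standby_grace_period from
     * the start of the wait, even if max_standby_delay measured from WAL
     * receipt is already used up.  Hypothetical helper.
     */
    static long
    query_cancel_deadline(long receipt_time, long max_standby_delay,
                          long wait_start, long min_standby_grace_period)
    {
        long    by_delay = receipt_time + max_standby_delay;
        long    by_grace = wait_start + min_standby_grace_period;

        return (by_delay > by_grace) ? by_delay : by_grace;
    }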
On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > I don't understand why you want to use a different delay when you're > restoring from archive vs. when you're streaming (what about existing WAL > files found in pg_xlog, BTW?). The source of WAL shouldn't make a > difference. Yes. The pace of a recovery has nothing to do with that of log shipping. So hurrying up recovery when restoring from archive seems useless. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2010-06-03 at 17:56 +0900, Fujii Masao wrote: > On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > > I don't understand why you want to use a different delay when you're > > restoring from archive vs. when you're streaming (what about existing WAL > > files found in pg_xlog, BTW?). The source of WAL shouldn't make a > > difference. > > Yes. The pace of a recovery has nothing to do with that of log shipping. > So to hurry up a recovery when restoring from archive seems to be useless. When streaming drops for some reason we revert to scanning the archive for files. There are clearly two modes of operation. So it makes sense that you might want to set different times for the parameter in each case. We might think of those modes as "connected" and "degraded" modes rather than streaming and file shipping. -- Simon Riggs www.2ndQuadrant.com
On Thu, Jun 3, 2010 at 4:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> On 02/06/10 21:44, Tom Lane wrote: >>> In the current coding, the effect of not setting *caughtup here is just >>> that we uselessly call XLogSend an extra time for each transmission >>> (because the main loop won't ever delay immediately after a >>> transmission). But without this, we'd never send caughtup = true >>> to the slave. > >> That's intentional. It could take some time for the WAL to be sent, if >> the network is busy, so by the time XLogSend returns you might well not >> be caught up anymore. > > It may have been intentional, but it's still wrong. If you were able to > pull all of WAL into the record-to-be-sent, you should sleep afterwards, > not send an extra record containing a few more bytes. For reducing the workload of walsender? This seems OK in 9.0 since only asynchronous replication is supported. But when we'll implement synchronous replication in the future, we might have to revert that change. Since a transaction commit might wait for such an extra record to be replicated, walsender should aggressively send all sendable WAL. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Jun 3, 2010 at 6:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, 2010-06-03 at 17:56 +0900, Fujii Masao wrote: >> On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >> > I don't understand why you want to use a different delay when you're >> > restoring from archive vs. when you're streaming (what about existing WAL >> > files found in pg_xlog, BTW?). The source of WAL shouldn't make a >> > difference. >> >> Yes. The pace of a recovery has nothing to do with that of log shipping. >> So to hurry up a recovery when restoring from archive seems to be useless. > > When streaming drops for some reason we revert to scanning the archive > for files. There is clearly two modes of operation. Yes. > So it makes sense > that you might want to set different times for the parameter in each > case. What purpose would that serve? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2010-06-03 at 18:39 +0900, Fujii Masao wrote: > What purpose would that serve? Tom has already explained this and it makes sense for me. -- Simon Riggs www.2ndQuadrant.com
On Thu, Jun 3, 2010 at 5:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I stand by my suggestion from yesterday: Let's define max_standby_delay >> as the difference between a piece of WAL becoming available in the >> standby, and applying it. > > My proposal is essentially the same as yours plus allowing the DBA to > choose different max delays for the caught-up and not-caught-up cases. > Maybe everybody will end up setting the two delays the same, but I think > we don't have enough experience to decide that for them now. Applying WAL that arrived via SR is not always a sign that we're caught up, and applying WAL restored from archive is not always a sign that we're not. So isn't it nonsense to separate the delay in order to control the behavior of recovery in those cases? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes: > On Thu, Jun 3, 2010 at 4:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >>> That's intentional. It could take some time for the WAL to be sent, if >>> the network is busy, so by the time XLogSend returns you might well not >>> be caught up anymore. >> >> It may have been intentional, but it's still wrong. If you were able to >> pull all of WAL into the record-to-be-sent, you should sleep afterwards, >> not send an extra record containing a few more bytes. > For reducing the workload of walsender? > This seems OK in 9.0 since only asynchronous replication is supported. > But when we'll implement synchronous replication in the future, we > might have to revert that change. Since a transaction commit might wait > for such an extra record to be replicated, walsender should aggressively > send all sendable WAL. It *is* aggressively sending all sendable WAL. The ideal steady state behavior of this loop ought to be that once per sleep interval, we send out one record containing all new WAL since the last time. We do not want it sending 10000 bytes, then another record with 100 bytes, then another record with 10 bytes, etc etc. That's inefficient and ultimately pointless. You'll always be behind again a millisecond later. regards, tom lane
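The steady state Tom describes reduces to a loop of roughly this shape; the functions are stand-ins, not the actual walsender code:

    #include <stdbool.h>
    #include <unistd.h>

    /* Stand-in: send one message containing all WAL accumulated since the
     * last send; returns true if that reached the current end of WAL. */
    static bool
    send_all_new_wal(void)
    {
        return true;
    }

    static void
    walsender_loop_sketch(void)
    {
        for (;;)
        {
            bool    caughtup = send_all_new_wal();

            /*
             * If the message covered everything, sleep for the polling
             * interval; otherwise loop at once and ship the backlog in the
             * next full message rather than dribbling out tiny records.
             */
            if (caughtup)
                sleep(1);   /* stand-in for the walsender sleep interval */
        }
    }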
On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: >> I was assuming the walreceiver only requests more wal in relatively >> small chunks and only when replay has caught up and needs more data. I >> haven't actually read this code so if that assumption is wrong then >> I'm off-base. > > It is off-base. The receiver does not "request" data, the sender is > what determines how much WAL is sent when. Hm, so what happens if the slave blocks, doesn't the sender block when the kernel buffers fill up? >> So I think this isn't necessarily such a blue moon event. As I >> understand it, all it would take is a single long-running report and a >> vacuum or HOT cleanup occurring on the master. > > I think this is mostly FUD too. How often do you see vacuum blocked for > an hour now? No, that's not comparable. On the master vacuum will just ignore tuples that are still visible to live transactions. On the slave it doesn't have a choice, it sees the cleanup record and must pause recovery until those transactions finish. -- greg
I wrote: > On further contemplation, it seems like the protocol needs another field > besides that: each record should also carry a boolean indicating whether > walsender.c thinks it is currently "caught up", ie the record carries > all WAL data up to the current end of WAL. Actually, there's a better way to do that: let's have the record carry not just a boolean but the actual current end-of-WAL LSN. The receiver could then not just determine "am I behind" but find out *how far* behind it is, and thereby perhaps adjust its behavior in more subtle ways than just a binary on/off fashion. (Actually doing anything like that is material for future work, of course, but I think we should try to get the SR protocol right now.) regards, tom lane
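With the sender's end-of-WAL LSN in each message, the receiver could compute an actual byte lag. A sketch using the 9.0-era two-part LSN layout; the helper itself is hypothetical:

    #include <stdint.h>

    typedef struct XLogRecPtr
    {
        uint32_t    xlogid;     /* log file #, each 0xFF000000 bytes long */
        uint32_t    xrecoff;    /* byte offset within the log file */
    } XLogRecPtr;

    #define XLOG_FILE_SIZE ((uint64_t) 0xFF000000)

    /* Bytes of WAL the sender has that this standby has not yet received. */
    static uint64_t
    wal_bytes_behind(XLogRecPtr received, XLogRecPtr sender_end)
    {
        uint64_t    recv = (uint64_t) received.xlogid * XLOG_FILE_SIZE
                           + received.xrecoff;
        uint64_t    end  = (uint64_t) sender_end.xlogid * XLOG_FILE_SIZE
                           + sender_end.xrecoff;

        return (end > recv) ? end - recv : 0;
    }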
Greg Stark <gsstark@mit.edu> writes: >> Well, if the slave can't keep up, that's a separate problem. It will >> not fail to keep up as a result of the transmission mechanism. > No, I mean if the slave is paused due to a conflict. Does it stop > reading data from the master or does it buffer it up on disk? If it > stops reading it from the master then the effect is the same as if the > slave stopped "requesting" data even if there's no actual request. The data keeps coming in and getting dumped into the slave's pg_xlog. walsender/walreceiver are not at all tied to the slave's application of WAL. In principle we could have the code around max_standby_delay notice just how far behind it's gotten and adjust the delay tolerance based on that; but I think designing a feedback loop for that is 9.1 material. regards, tom lane
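Such a feedback loop might, for example, shave grace time off as the backlog grows. Entirely hypothetical, and as noted, 9.1 material at the earliest:

    #include <stdint.h>

    /*
     * Shrink the grace period as the apply position falls further behind
     * the receive position: here, one millisecond less per 16MB segment
     * of unapplied WAL.  Illustrative only.
     */
    static int
    adjusted_grace_ms(uint64_t bytes_behind, int base_grace_ms)
    {
        int     penalty = (int) (bytes_behind / (16 * 1024 * 1024));

        return (base_grace_ms > penalty) ? base_grace_ms - penalty : 0;
    }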
On Thu, Jun 3, 2010 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The data keeps coming in and getting dumped into the slave's pg_xlog. > walsender/walreceiver are not at all tied to the slave's application > of WAL. In principle we could have the code around max_standby_delay > notice just how far behind it's gotten and adjust the delay tolerance > based on that; but I think designing a feedback loop for that is 9.1 > material. Er, no. In that case my first concern was misguided. I'm happy there's no feedback loop -- my fear was that there was one, and that it would mean the "time received" could be decoupled from the time the wal was generated. But as you describe it, the time received might be slightly delayed from the time the wal was generated, but only to some constant degree -- not in a way that will be influenced by the log application being blocked on the slave. -- greg
Simon Riggs <simon@2ndQuadrant.com> writes: > On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote: >> This patch seems to me to be going in fundamentally the wrong direction. >> It's adding complexity and overhead (far more than is needed), and it's >> failing utterly to resolve the objections that I raised to start with. > Having read your proposal, it seems changing from time-on-sender to > time-on-receiver is a one line change to the patch. > What else are you thinking of removing, if anything? Basically, we need to get rid of everything that feeds timestamps from the WAL content into the kill-delay logic. >> In particular, Simon seems to be basically refusing to do anything about >> the complaint that the code fails unless master and standby clocks are >> in close sync. I do not believe that this is acceptable, and since he >> won't fix it, I guess I'll have to. > Syncing two servers in replication is common practice, as has been > explained here; I'm still surprised people think otherwise. Doesn't affect the complaint in the least: I do not find it acceptable to have that be *mandated* in order for our code to work sensibly. I would be OK with having something approaching what you want as a non-default optional behavior (with a clearly-documented dependency on having synchronized clocks). But in any case the current behavior is still quite broken as regards reading stale timestamps from WAL. regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes: > On Wed, 2010-06-02 at 16:00 -0400, Tom Lane wrote: >> the current situation that query grace periods go to zero > Possibly a better way to handle this concern is to make the second > parameter: min_standby_grace_period - the minimum time a query will be > given in which to execute, even if max_standby_delay has been reached or > exceeded. > Would that more directly address you concerns? > min_standby_grace_period (ms) SIGHUP A minimum grace period seems like a good idea to me, but I think it's somewhat orthogonal to the core problem here. I think we all intuitively feel that there should be a way to dial back the grace period when a slave is "far behind" on applying WAL. The problem is first how to figure out what "far behind" means, and second how to adjust the grace period in a way that doesn't have surprising misbehaviors. A minimum grace period would prevent some of the very worst misbehaviors but it's not really addressing the core problem. regards, tom lane
On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote: > >> This patch seems to me to be going in fundamentally the wrong direction. > >> It's adding complexity and overhead (far more than is needed), and it's > >> failing utterly to resolve the objections that I raised to start with. > > > Having read your proposal, it seems changing from time-on-sender to > > time-on-receiver is a one line change to the patch. > > > What else are you thinking of removing, if anything? > > Basically, we need to get rid of everything that feeds timestamps from > the WAL content into the kill-delay logic. I understand your wish to do this, though it isn't always accurate to do that in the case where there is a backlog. The patch already does that *mostly* for the streaming case. The only time it does use the WAL content timestamp is when the WAL content timestamp is later than the oldest receive time, so in that case it is used as a correction. (see code comments and comments below also) > >> In particular, Simon seems to be basically refusing to do anything about > >> the complaint that the code fails unless master and standby clocks are > >> in close sync. I do not believe that this is acceptable, and since he > >> won't fix it, I guess I'll have to. > > > Syncing two servers in replication is common practice, as has been > > explained here; I'm still surprised people think otherwise. > > Doesn't affect the complaint in the least: I do not find it acceptable > to have that be *mandated* in order for our code to work sensibly. OK, accepted. > I would be OK with having something approaching what you want as a > non-default optional behavior (with a clearly-documented dependency > on having synchronized clocks). Yes, that's what I'd been thinking also. So that gives us a way forwards. standby_delay_base = apply (default) | send determines whether the standby_delay is calculated solely with reference to the standby server (apply) or whether times from the sending server are used (send). Use of send implies that the clocks on primary and standby are synchronised to within a useful accuracy, in which case it is usual to enforce that with time synchronisation software such as ntp. > But in any case the current behavior is > still quite broken as regards reading stale timestamps from WAL. Agreed. That wasn't the objective of this patch or a priority. If you're reading from an archive, you need to set max_standby_delay appropriately, current recommendation is -1. We can address that concern once the main streaming case is covered, or we can add that now. Are you planning to work on these things now as you said? How about I apply my patch now, then you do another set of changes afterwards to add the other items you mentioned, since that is now 2 additional parameters and related code? -- Simon Riggs www.2ndQuadrant.com
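A sketch of what the proposed standby_delay_base switch would select between; the GUC was never committed, and all names here are illustrative:

    typedef enum
    {
        DELAY_BASE_APPLY,   /* measure against the standby's own clock only */
        DELAY_BASE_SEND     /* use the sender's timestamps; needs synced clocks */
    } StandbyDelayBase;

    /* Reference time from which the standby delay would be measured. */
    static long
    standby_delay_reference(StandbyDelayBase base,
                            long receipt_time_on_standby,
                            long send_time_from_master)
    {
        return (base == DELAY_BASE_APPLY) ? receipt_time_on_standby
                                          : send_time_from_master;
    }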
Simon Riggs <simon@2ndQuadrant.com> writes: > On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote: >> But in any case the current behavior is >> still quite broken as regards reading stale timestamps from WAL. > Agreed. That wasn't the objective of this patch or a priority. > If you're reading from an archive, you need to set max_standby_delay > appropriately, current recommendation is -1. That's not a fix; the problem is that there *is no* appropriate setting for max_standby_delay when you're reading from archive. It is unlikely that the values of the timestamps you're reading should be considered to have any bearing on what grace period to allow; but nonetheless it's desirable to be able to have a non-infinite grace time. Basically what I think is that we want what you called "apply" semantics always for reading from archive (and I still think the DBA should be able to set that grace period separately from the one that applies in SR operation). Paying attention to the master's timestamps is only reasonable in the SR context. And yes, I want the dependency on WAL timestamps to be gone completely, not just mostly. I think that driving the delay logic off of SR receipt time (and/or the timestamp we agreed to add to the SR protocol) is the appropriate and sufficient way to deal with the problem. Looking at the embedded timestamps just confuses the feedback loop behavior. regards, tom lane
On Thu, 2010-06-03 at 13:32 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote: > >> But in any case the current behavior is > >> still quite broken as regards reading stale timestamps from WAL. > > > Agreed. That wasn't the objective of this patch or a priority. > > > If you're reading from an archive, you need to set max_standby_delay > > appropriately, current recommendation is -1. > > That's not a fix; the problem is that there *is no* appropriate setting > for max_standby_delay when you're reading from archive. It is unlikely > that the values of the timestamps you're reading should be considered to > have any bearing on what grace period to allow; but nonetheless it's > desirable to be able to have a non-infinite grace time. I'm not saying that's a fix; I agree we should change that also. > Basically what I think is that we want what you called "apply" semantics > always for reading from archive (and I still think the DBA should be > able to set that grace period separately from the one that applies in > SR operation). Paying attention to the master's timestamps is only > reasonable in the SR context. Agreed. > And yes, I want the dependency on WAL timestamps to be gone completely, > not just mostly. I think that driving the delay logic off of SR receipt > time (and/or the timestamp we agreed to add to the SR protocol) is the > appropriate and sufficient way to deal with the problem. Looking at the > embedded timestamps just confuses the feedback loop behavior. I disagree with "sufficient", with good reason. Please look at the code comments and see what will happen if we don't make the correction suggested. If after that you still disagree, then do it your way and we'll wait for comments in the beta; I think there will be some, but I am tired and prone to agreement to allow this to go through. -- Simon Riggs www.2ndQuadrant.com
On Thu, Jun 3, 2010 at 4:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: >> On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> It is off-base. The receiver does not "request" data, the sender is >>> what determines how much WAL is sent when. > >> Hm, so what happens if the slave blocks, doesn't the sender block when >> the kernel buffers fill up? > > Well, if the slave can't keep up, that's a separate problem. It will > not fail to keep up as a result of the transmission mechanism. No, I mean if the slave is paused due to a conflict. Does it stop reading data from the master or does it buffer it up on disk? If it stops reading it from the master then the effect is the same as if the slave stopped "requesting" data even if there's no actual request. -- greg
Greg Stark <gsstark@mit.edu> writes: > On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It is off-base. The receiver does not "request" data, the sender is >> what determines how much WAL is sent when. > Hm, so what happens if the slave blocks, doesn't the sender block when > the kernel buffers fill up? Well, if the slave can't keep up, that's a separate problem. It will not fail to keep up as a result of the transmission mechanism. regards, tom lane
On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote: > Are you planning to work on these things now as you said? Are you? Or do you want me to? -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote: >> Are you planning to work on these things now as you said? > Are you? Or do you want me to? I decided there wasn't time to get anything useful done on it before the beta2 deadline (which is, more or less, right now). I will take another look over the next few days. regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes: > On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote: >> I decided there wasn't time to get anything useful done on it before the >> beta2 deadline (which is, more or less, right now). I will take another >> look over the next few days. > We all really need you to fix up max_standby_delay, or, let me do it. Yes, I'll get with it ... regards, tom lane
On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote: > >> Are you planning to work on these things now as you said? > > > Are you? Or do you want me to? > > I decided there wasn't time to get anything useful done on it before the > beta2 deadline (which is, more or less, right now). I will take another > look over the next few days. Esteemed colleague Tom, We all really need you to fix up max_standby_delay, or, let me do it. It appears we both agree that more testing would be beneficial. The only way we are going to get that is by finishing this off and declaring a new HS-beta. Please let me know how you wish to proceed. (No answer means I do it.) -- Simon Riggs www.2ndQuadrant.com
On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote: >>> I decided there wasn't time to get anything useful done on it before the >>> beta2 deadline (which is, more or less, right now). I will take another >>> look over the next few days. > >> We all really need you to fix up max_standby_delay, or, let me do it. > > Yes, I'll get with it ... Tom, Any update on this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Yes, I'll get with it ... > Any update on this? Sorry, I've been a bit distracted by other responsibilities (libtiff security issues for Red Hat, if you must know). I'll get on it shortly. regards, tom lane
On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Yes, I'll get with it ... > >> Any update on this? > > Sorry, I've been a bit distracted by other responsibilities (libtiff > security issues for Red Hat, if you must know). I'll get on it shortly. What? You have other things to do besides hack on PostgreSQL? Shocking! :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas wrote: > On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Sorry, I've been a bit distracted by other responsibilities (libtiff >> security issues for Red Hat, if you must know). I'll get on it shortly. > > What? You have other things to do besides hack on PostgreSQL? Shocking! :-) I suspect you're kidding, but in case some on the list didn't realize, Tom's probably as famous (if not more so) in the image compression community as he is in the database community: http://www.jpeg.org/jpeg/index.html "Probably the largest and most important contribution however was the work of the Independent JPEG Group (IJG), and Tom Lane in particular." http://www.w3.org/TR/PNG-Credits.html , http://www.w3.org/TR/PNG/ "PNG (Portable Network Graphics) Specification, Version 1.0 ... Contributing Editor: Tom Lane, tgl@sss.pgh.pa.us" http://www.fileformat.info/format/tiff/egff.htm "... by Dr. Tom Lane of the Independent JPEG Group, a member of the TIFF Advisory Committee"
On Mon, Jun 21, 2010 at 12:20 AM, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote: > Robert Haas wrote: >> On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Sorry, I've been a bit distracted by other responsibilities (libtiff >>> security issues for Red Hat, if you must know). I'll get on it shortly. >> >> What? You have other things to do besides hack on PostgreSQL? Shocking! :-) > > I suspect you're kidding, but in case some on the list didn't realize, > Tom's probably as famous (if not moreso) in the image compression > community as he is in the database community: > > http://www.jpeg.org/jpeg/index.html > "Probably the largest and most important contribution however was the work > of the Independent JPEG Group (IJG), and Tom Lane in particular." > > http://www.w3.org/TR/PNG-Credits.html , http://www.w3.org/TR/PNG/ > "PNG (Portable Network Graphics) Specification > Version 1.0 > ... > Contributing Editor > Tom Lane, tgl@sss.pgh.pa.us" > > http://www.fileformat.info/format/tiff/egff.htm > "... by Dr. Tom Lane of the Independent JPEG Group, a member of the > TIFF Advisory Committee" Yes, I was joking, hence the smiley. I did know he was involved in the above, although I confess I didn't know to what degree... or that he had a doctorate. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Yes, I'll get with it ... > > > Any update on this? > > Sorry, I've been a bit distracted by other responsibilities (libtiff > security issues for Red Hat, if you must know). I'll get on it shortly. I don't think the PostgreSQL project should wait any longer on this. If it does we risk loss of quality in the final release, assuming no slippage. From here, I will rework my patch of 31 May to * use arrival time on standby as base for max_standby_delay * make delay apply to both streaming and file cases * min_standby_grace_period - min grace on every query, default 0 Decision time, so thoughts please? -- Simon Riggs www.2ndQuadrant.com
On Mon, Jun 28, 2010 at 3:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >> > On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Yes, I'll get with it ... >> >> > Any update on this? >> >> Sorry, I've been a bit distracted by other responsibilities (libtiff >> security issues for Red Hat, if you must know). I'll get on it shortly. > > I don't think the PostgreSQL project should wait any longer on this. If > it does we risk loss of quality in final release, assuming no slippage. I agree, and had actually been intending to post a similar message in the next day or two. We've been waiting for this for nearly a month. > From here, I will rework my patch of 31 May to > * use arrival time on standby as base for max_standby_delay Assuming that by this you mean, in the case of SR, the time of receipt of the current WAL chunk, and in the case of the archive, the time of its acquisition from the archive, +1. > * make delay apply to both streaming and file cases +1. > * min_standby_grace_period - min grace on every query, default 0 I could go either way on this. +0.5, I guess. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Simon Riggs <simon@2ndQuadrant.com> writes: > On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote: >> Sorry, I've been a bit distracted by other responsibilities (libtiff >> security issues for Red Hat, if you must know). I'll get on it shortly. > I don't think the PostgreSQL project should wait any longer on this. If > it does we risk loss of quality in final release, assuming no slippage. It will get done. It is not the very first thing on my to-do list. regards, tom lane
> It will get done. It is not the very first thing on my to-do list. ??? What is, then? If it's not the first thing on your priority list, with 9.0 getting later by the day, maybe we should leave it to Robert and Simon, who *do* seem to have it first on *their* list? I swear, when Simon was keeping his branch to himself in August everyone was on his case. It sure seems like Tom is doing exactly the same thing. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Mon, Jun 28, 2010 at 10:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote: >>> Sorry, I've been a bit distracted by other responsibilities (libtiff >>> security issues for Red Hat, if you must know). I'll get on it shortly. > >> I don't think the PostgreSQL project should wait any longer on this. If >> it does we risk loss of quality in final release, assuming no slippage. > > It will get done. It is not the very first thing on my to-do list. That's apparent. On June 9th, you wrote "Yes, I'll get with it ..."; on June 16th, you wrote "I'll get on it shortly." Two weeks later you're backing off from "shortly" to "eventually". It is unfair to the community to assert a vigorous opinion of how something should be handled in the code and then refuse to commit to a time frame for providing a patch. It is even more unreasonable to commit to providing a timely patch (twice) and then fail to do so. We are trying to finalize a release here, and you've made it clear you think this code needs revision before then. I respect your opinion, but not your right to make the project release timetable dependent on your own schedule, and not your right to shut other people out of working on the issues you've raised. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > ... It is even more unreasonable to commit to > providing a timely patch (twice) and then fail to do so. We are > trying to finalize a release here, and you've made it clear you think > this code needs revision before then. I respect your opinion, but not > your right to make the project release timetable dependent on your own > schedule, and not your right to shut other people out of working on > the issues you've raised. Since nobody has put forward a proposed beta3 release date, I don't feel that I'm holding anything up. In the meantime, I have many responsibilities and am facing Red Hat internal deadlines. regards, tom lane
On Mon, Jun 28, 2010 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> ... It is even more unreasonable to commit to >> providing a timely patch (twice) and then fail to do so. We are >> trying to finalize a release here, and you've made it clear you think >> this code needs revision before then. I respect your opinion, but not >> your right to make the project release timetable dependent on your own >> schedule, and not your right to shut other people out of working on >> the issues you've raised. > > Since nobody has put forward a proposed beta3 release date, I don't feel > that I'm holding anything up. In the meantime, I have many > responsibilities and am facing Red Hat internal deadlines. See here, last paragraph: http://archives.postgresql.org/pgsql-hackers/2010-06/msg01093.php On a related note, this list is getting pretty darn short: http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items ...partly because, in my desire to get another beta out, I have been devoting a lot of time to clearing it out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Mon, 2010-06-28 at 10:09 -0700, Josh Berkus wrote: > > It will get done. It is not the very first thing on my to-do list. > > ??? What is then? > > If it's not the first thing on your priority list, with 9.0 getting > later by the day, maybe we should leave it to Robert and Simon, who *do* > seem to have it first on *their* list? > > I swear, when Simon was keeping his branch to himself in August everyone > was on his case. It sure seems like Tom is doing exactly the same thing. Hmmm, yes, looks that way. At that time I was actively working on the code, not just locking it to prevent other activity. The only urgency on my part here was to fulfil my responsibility to the project. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Mon, 2010-06-28 at 10:09 -0700, Josh Berkus wrote: > > > It will get done. It is not the very first thing on my to-do list. > > > > ??? What is then? > > > > If it's not the first thing on your priority list, with 9.0 getting > > later by the day, maybe we should leave it to Robert and Simon, who *do* > > seem to have it first on *their* list? > > > > I swear, when Simon was keeping his branch to himself in August everyone > > was on his case. It sure seems like Tom is doing exactly the same thing. > > Hmmm, yes, looks that way. At that time I was actively working on the > code, not just locking it to prevent other activity. > > The only urgency on my part here was to fulfil my responsibility to the > project. Simon, you have a very legitimate concern. I phoned Tom and he is planning to start working on the max_standby_delay tomorrow. I am unclear how it is different from your version, but I hope once Tom is done we can review his work and decide how to proceed. The fact that we allowed Tom this huge amount of time to submit an alternative patch is unusual and hopefully rare. FYI, Tom and I are hoping to work through all the outstanding issues before we package up 9.0 beta3 on Thursday, July 8. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
[ Apologies for the very slow turnaround on this --- I got hit with another batch of non-postgres security issues this week. ] Attached is a draft patch for revising the max_standby_delay behavior into something I think is a bit saner. There is some unfinished business: * I haven't touched the documentation yet * The code in xlog.c needs to be reverted to its former behavior so that recoveryLastXTime means what it's supposed to mean, ie just the last commit/abort timestamp. However neither of these affects the testability of the patch. Basically the way that it works is that the standby delay is computed with reference to XLogReceiptTime rather than any timestamp obtained from WAL. XLogReceiptTime is set to current time whenever we obtain a WAL segment from the archives or when we begin processing a fresh batch of WAL from walreceiver. There's a subtlety in the streaming case: we don't advance XLogReceiptTime if we are not caught up, that is if the startup process is more than one flush cycle behind walreceiver. In the normal case we'll advance XLogReceiptTime on each flush cycle, but once we start falling behind, it doesn't move so the grace time alloted to conflicting queries begins to decrease. I split max_standby_delay into two GUC variables, as previously threatened: max_standby_archive_delay and max_standby_streaming_delay. The former applies when processing data read from archive, the latter when processing data received from walreceiver. I think this is really quite important given the new behavior, because max_standby_archive_delay ought to be set with reference to the expected time to process one WAL segment, whereas max_standby_streaming_delay doesn't depend on that value at all. I'm not sure what good defaults are for these values, so I left them at 30 seconds for the moment. I'm inclined to think max_standby_archive_delay ought to be quite a bit less though. It might be worth adding a minimum-grace-time limit as we previously discussed, but this patch doesn't attempt to do that. Comments? regards, tom lane Index: src/backend/access/transam/xlog.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v retrieving revision 1.427 diff -c -r1.427 xlog.c *** src/backend/access/transam/xlog.c 28 Jun 2010 19:46:19 -0000 1.427 --- src/backend/access/transam/xlog.c 2 Jul 2010 20:01:24 -0000 *************** *** 72,78 **** bool XLogArchiveMode = false; char *XLogArchiveCommand = NULL; bool EnableHotStandby = false; - int MaxStandbyDelay = 30 * 1000; bool fullPageWrites = true; bool log_checkpoints = false; int sync_method = DEFAULT_SYNC_METHOD; --- 72,77 ---- *************** *** 487,493 **** * Keeps track of which sources we've tried to read the current WAL * record from and failed. */ ! static int failedSources = 0; /* Buffer for currently read page (XLOG_BLCKSZ bytes) */ static char *readBuf = NULL; --- 487,502 ---- * Keeps track of which sources we've tried to read the current WAL * record from and failed. */ ! static int failedSources = 0; /* OR of XLOG_FROM_* codes */ ! ! /* ! * These variables track when we last obtained some WAL data to process, ! * and where we got it from. (XLogReceiptSource is initially the same as ! * readSource, but readSource gets reset to zero when we don't have data ! * to process right now.) ! */ ! static TimestampTz XLogReceiptTime = 0; ! 
! static int	XLogReceiptSource = 0;	/* XLOG_FROM_* code */
  
  /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
***************
*** 2626,2632 ****
   * Open a logfile segment for reading (during recovery).
   *
   * If source = XLOG_FROM_ARCHIVE, the segment is retrieved from archive.
!  * If source = XLOG_FROM_PG_XLOG, it's read from pg_xlog.
   */
  static int
  XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
--- 2635,2641 ----
   * Open a logfile segment for reading (during recovery).
   *
   * If source = XLOG_FROM_ARCHIVE, the segment is retrieved from archive.
!  * Otherwise, it's assumed to be already available in pg_xlog.
   */
  static int
  XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
***************
*** 2655,2660 ****
--- 2664,2670 ----
  			break;
  
  		case XLOG_FROM_PG_XLOG:
+ 		case XLOG_FROM_STREAM:
  			XLogFilePath(path, tli, log, seg);
  			restoredFromArchive = false;
  			break;
***************
*** 2674,2680 ****
--- 2684,2696 ----
  				 xlogfname);
  		set_ps_display(activitymsg, false);
  
+ 		/* Track source of data in assorted state variables */
  		readSource = source;
+ 		XLogReceiptSource = source;
+ 		/* In FROM_STREAM case, caller tracks receipt time, not me */
+ 		if (source != XLOG_FROM_STREAM)
+ 			XLogReceiptTime = GetCurrentTimestamp();
+ 
  		return fd;
  	}
  	if (errno != ENOENT || !notfoundOk)	/* unexpected failure? */
***************
*** 5568,5574 ****
  /*
   * Returns timestamp of last recovered commit/abort record.
   */
! TimestampTz
  GetLatestXLogTime(void)
  {
  	/* use volatile pointer to prevent code rearrangement */
--- 5584,5590 ----
  /*
   * Returns timestamp of last recovered commit/abort record.
   */
! static TimestampTz
  GetLatestXLogTime(void)
  {
  	/* use volatile pointer to prevent code rearrangement */
***************
*** 5582,5587 ****
--- 5598,5620 ----
  }
  
  /*
+  * Returns time of receipt of current chunk of XLOG data, as well as
+  * whether it was received from streaming replication or from archives.
+  */
+ void
+ GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream)
+ {
+ 	/*
+ 	 * This must be executed in the startup process, since we don't export
+ 	 * the relevant state to shared memory.
+ 	 */
+ 	Assert(InRecovery);
+ 
+ 	*rtime = XLogReceiptTime;
+ 	*fromStream = (XLogReceiptSource == XLOG_FROM_STREAM);
+ }
+ 
+ /*
   * Note that text field supplied is a parameter name and does not require
   * translation
   */
***************
*** 6060,6065 ****
--- 6093,6101 ----
  		xlogctl->recoveryLastRecPtr = ReadRecPtr;
  		SpinLockRelease(&xlogctl->info_lck);
  
+ 		/* Also ensure XLogReceiptTime has a sane value */
+ 		XLogReceiptTime = GetCurrentTimestamp();
+ 
  		/*
  		 * Let postmaster know we've started redo now, so that it can
  		 * launch bgwriter to perform restartpoints.  We don't bother
***************
*** 7647,7653 ****
  		XLogRecPtr	endptr;
  
  		/* Get the current (or recent) end of xlog */
! 		endptr = GetWalRcvWriteRecPtr();
  
  		PrevLogSeg(_logId, _logSeg);
  		RemoveOldXlogFiles(_logId, _logSeg, endptr);
--- 7683,7689 ----
  		XLogRecPtr	endptr;
  
  		/* Get the current (or recent) end of xlog */
! 		endptr = GetWalRcvWriteRecPtr(NULL);
  
  		PrevLogSeg(_logId, _logSeg);
  		RemoveOldXlogFiles(_logId, _logSeg, endptr);
***************
*** 8757,8763 ****
  	XLogRecPtr	recptr;
  	char		location[MAXFNAMELEN];
  
! 	recptr = GetWalRcvWriteRecPtr();
  
  	if (recptr.xlogid == 0 && recptr.xrecoff == 0)
  		PG_RETURN_NULL();
--- 8793,8799 ----
  	XLogRecPtr	recptr;
  	char		location[MAXFNAMELEN];
  
! 	recptr = GetWalRcvWriteRecPtr(NULL);
  
  	if (recptr.xlogid == 0 && recptr.xrecoff == 0)
  		PG_RETURN_NULL();
***************
*** 9272,9277 ****
--- 9308,9315 ----
  		{
  			if (WalRcvInProgress())
  			{
+ 				bool		havedata;
+ 
  				/*
  				 * If we find an invalid record in the WAL streamed from
  				 * master, something is seriously wrong. There's little
***************
*** 9289,9316 ****
  				}
  
  				/*
! 				 * While walreceiver is active, wait for new WAL to arrive
! 				 * from primary.
  				 */
- 				receivedUpto = GetWalRcvWriteRecPtr();
  				if (XLByteLT(*RecPtr, receivedUpto))
  				{
  					/*
  					 * Great, streamed far enough.  Open the file if it's
! 					 * not open already.
  					 */
  					if (readFile < 0)
  					{
  						readFile = XLogFileRead(readId, readSeg, PANIC,
  												recoveryTargetTLI,
! 												XLOG_FROM_PG_XLOG, false);
  						switched_segment = true;
  						readSource = XLOG_FROM_STREAM;
  					}
  					break;
  				}
  
  				if (CheckForStandbyTrigger())
  					goto triggered;
--- 9327,9388 ----
  				}
  
  				/*
! 				 * Walreceiver is active, so see if new data has arrived.
! 				 *
! 				 * We only advance XLogReceiptTime when we obtain fresh
! 				 * WAL from walreceiver and observe that we had already
! 				 * processed everything before the most recent "chunk"
! 				 * that it flushed to disk.  In steady state where we are
! 				 * keeping up with the incoming data, XLogReceiptTime
! 				 * will be updated on each cycle.  When we are behind,
! 				 * XLogReceiptTime will not advance, so the grace time
! 				 * alloted to conflicting queries will decrease.
  				 */
  				if (XLByteLT(*RecPtr, receivedUpto))
+ 					havedata = true;
+ 				else
+ 				{
+ 					XLogRecPtr	latestChunkStart;
+ 
+ 					receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart);
+ 					if (XLByteLT(*RecPtr, receivedUpto))
+ 					{
+ 						havedata = true;
+ 						if (!XLByteLT(*RecPtr, latestChunkStart))
+ 							XLogReceiptTime = GetCurrentTimestamp();
+ 					}
+ 					else
+ 						havedata = false;
+ 				}
+ 				if (havedata)
  				{
  					/*
  					 * Great, streamed far enough.  Open the file if it's
! 					 * not open already.  Use XLOG_FROM_STREAM so that
! 					 * source info is set correctly and XLogReceiptTime
! 					 * isn't changed.
  					 */
  					if (readFile < 0)
  					{
  						readFile = XLogFileRead(readId, readSeg, PANIC,
  												recoveryTargetTLI,
! 												XLOG_FROM_STREAM, false);
! 						Assert(readFile >= 0);
  						switched_segment = true;
+ 					}
+ 					else
+ 					{
+ 						/* just make sure source info is correct... */
  						readSource = XLOG_FROM_STREAM;
+ 						XLogReceiptSource = XLOG_FROM_STREAM;
  					}
  					break;
  				}
  
+ 				/*
+ 				 * Data not here yet, so check for trigger then sleep.
+ 				 */
  				if (CheckForStandbyTrigger())
  					goto triggered;
Index: src/backend/replication/walreceiver.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/replication/walreceiver.c,v
retrieving revision 1.14
diff -c -r1.14 walreceiver.c
*** src/backend/replication/walreceiver.c	9 Jun 2010 15:04:07 -0000	1.14
--- src/backend/replication/walreceiver.c	2 Jul 2010 20:01:24 -0000
***************
*** 524,529 ****
--- 524,530 ----
  
  	/* Update shared-memory status */
  	SpinLockAcquire(&walrcv->mutex);
+ 	walrcv->latestChunkStart = walrcv->receivedUpto;
  	walrcv->receivedUpto = LogstreamResult.Flush;
  	SpinLockRelease(&walrcv->mutex);
Index: src/backend/replication/walreceiverfuncs.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/replication/walreceiverfuncs.c,v
retrieving revision 1.5
diff -c -r1.5 walreceiverfuncs.c
*** src/backend/replication/walreceiverfuncs.c	28 Apr 2010 16:54:15 -0000	1.5
--- src/backend/replication/walreceiverfuncs.c	2 Jul 2010 20:01:24 -0000
***************
*** 199,214 ****
  	walrcv->startTime = now;
  
  	walrcv->receivedUpto = recptr;
  
  	SpinLockRelease(&walrcv->mutex);
  
  	SendPostmasterSignal(PMSIGNAL_START_WALRECEIVER);
  }
  
  /*
!  * Returns the byte position that walreceiver has written
   */
  XLogRecPtr
! GetWalRcvWriteRecPtr(void)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
--- 200,221 ----
  	walrcv->startTime = now;
  
  	walrcv->receivedUpto = recptr;
+ 	walrcv->latestChunkStart = recptr;
+ 
  	SpinLockRelease(&walrcv->mutex);
  
  	SendPostmasterSignal(PMSIGNAL_START_WALRECEIVER);
  }
  
  /*
!  * Returns the last+1 byte position that walreceiver has written.
!  *
!  * Optionally, returns the previous chunk start, that is the first byte
!  * written in the most recent walreceiver flush cycle.  Callers not
!  * interested in that value may pass NULL for latestChunkStart.
   */
  XLogRecPtr
! GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile WalRcvData *walrcv = WalRcv;
***************
*** 216,221 ****
--- 223,230 ----
  
  	SpinLockAcquire(&walrcv->mutex);
  	recptr = walrcv->receivedUpto;
+ 	if (latestChunkStart)
+ 		*latestChunkStart = walrcv->latestChunkStart;
  	SpinLockRelease(&walrcv->mutex);
  
  	return recptr;
Index: src/backend/storage/ipc/standby.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/storage/ipc/standby.c,v
retrieving revision 1.25
diff -c -r1.25 standby.c
*** src/backend/storage/ipc/standby.c	14 Jun 2010 00:49:24 -0000	1.25
--- src/backend/storage/ipc/standby.c	2 Jul 2010 20:01:24 -0000
***************
*** 30,36 ****
--- 30,39 ----
  #include "storage/standby.h"
  #include "utils/ps_status.h"
  
+ /* User-settable GUC parameters */
  int			vacuum_defer_cleanup_age;
+ int			max_standby_archive_delay = 30 * 1000;
+ int			max_standby_streaming_delay = 30 * 1000;
  
  static List *RecoveryLockList;
  
***************
*** 113,118 ****
--- 117,152 ----
   * -----------------------------------------------------
   */
  
+ /*
+  * Determine the cutoff time at which we want to start canceling conflicting
+  * transactions.  Returns zero (a time safely in the past) if we are willing
+  * to wait forever.
+  */
+ static TimestampTz
+ GetStandbyLimitTime(void)
+ {
+ 	TimestampTz rtime;
+ 	bool		fromStream;
+ 
+ 	/*
+ 	 * The cutoff time is the last WAL data receipt time plus the appropriate
+ 	 * delay variable.  Delay of -1 means wait forever.
+ 	 */
+ 	GetXLogReceiptTime(&rtime, &fromStream);
+ 	if (fromStream)
+ 	{
+ 		if (max_standby_streaming_delay < 0)
+ 			return 0;			/* wait forever */
+ 		return TimestampTzPlusMilliseconds(rtime, max_standby_streaming_delay);
+ 	}
+ 	else
+ 	{
+ 		if (max_standby_archive_delay < 0)
+ 			return 0;			/* wait forever */
+ 		return TimestampTzPlusMilliseconds(rtime, max_standby_archive_delay);
+ 	}
+ }
+ 
  #define STANDBY_INITIAL_WAIT_US  1000
  static int	standbyWait_us = STANDBY_INITIAL_WAIT_US;
***************
*** 124,133 ****
  static bool
  WaitExceedsMaxStandbyDelay(void)
  {
! 	/* Are we past max_standby_delay? */
! 	if (MaxStandbyDelay >= 0 &&
! 		TimestampDifferenceExceeds(GetLatestXLogTime(), GetCurrentTimestamp(),
! 								   MaxStandbyDelay))
  		return true;
  
  	/*
--- 158,168 ----
  static bool
  WaitExceedsMaxStandbyDelay(void)
  {
! 	TimestampTz ltime;
! 
! 	/* Are we past the limit time? */
! 	ltime = GetStandbyLimitTime();
! 	if (ltime && GetCurrentTimestamp() >= ltime)
  		return true;
  
  	/*
***************
*** 368,433 ****
   * Startup is sleeping and the query waits on a lock. We protect against
   * only the former sequence here, the latter sequence is checked prior to
   * the query sleeping, in CheckRecoveryConflictDeadlock().
   */
  void
  ResolveRecoveryConflictWithBufferPin(void)
  {
  	bool		sig_alarm_enabled = false;
  
  	Assert(InHotStandby);
  
! 	if (MaxStandbyDelay == 0)
! 	{
! 		/*
! 		 * We don't want to wait, so just tell everybody holding the pin to
! 		 * get out of town.
! 		 */
! 		SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
! 	}
! 	else if (MaxStandbyDelay < 0)
! 	{
! 		TimestampTz now = GetCurrentTimestamp();
  
  		/*
! 		 * Set timeout for deadlock check (only)
  		 */
  		if (enable_standby_sig_alarm(now, now, true))
  			sig_alarm_enabled = true;
  		else
  			elog(FATAL, "could not set timer for process wakeup");
  	}
  	else
  	{
! 		TimestampTz then = GetLatestXLogTime();
! 		TimestampTz now = GetCurrentTimestamp();
! 
! 		/* Are we past max_standby_delay? */
! 		if (TimestampDifferenceExceeds(then, now, MaxStandbyDelay))
! 		{
! 			/*
! 			 * We're already behind, so clear a path as quickly as possible.
! 			 */
! 			SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
! 		}
  		else
! 		{
! 			TimestampTz max_standby_time;
! 
! 			/*
! 			 * At what point in the future do we hit MaxStandbyDelay?
! 			 */
! 			max_standby_time = TimestampTzPlusMilliseconds(then, MaxStandbyDelay);
! 			Assert(max_standby_time > now);
! 
! 			/*
! 			 * Wake up at MaxStandby delay, and check for deadlocks as well
! 			 * if we will be waiting longer than deadlock_timeout
! 			 */
! 			if (enable_standby_sig_alarm(now, max_standby_time, false))
! 				sig_alarm_enabled = true;
! 			else
! 				elog(FATAL, "could not set timer for process wakeup");
! 		}
  	}
  
  	/* Wait to be signaled by UnpinBuffer() */
--- 402,452 ----
   * Startup is sleeping and the query waits on a lock. We protect against
   * only the former sequence here, the latter sequence is checked prior to
   * the query sleeping, in CheckRecoveryConflictDeadlock().
+  *
+  * Deadlocks are extremely rare, and relatively expensive to check for,
+  * so we don't do a deadlock check right away ... only if we have had to wait
+  * at least deadlock_timeout.  Most of the logic about that is in proc.c.
   */
  void
  ResolveRecoveryConflictWithBufferPin(void)
  {
  	bool		sig_alarm_enabled = false;
+ 	TimestampTz ltime;
+ 	TimestampTz now;
  
  	Assert(InHotStandby);
  
! 	ltime = GetStandbyLimitTime();
! 	now = GetCurrentTimestamp();
  
+ 	if (!ltime)
+ 	{
  		/*
! 		 * We're willing to wait forever for conflicts, so set timeout for
! 		 * deadlock check (only)
  		 */
  		if (enable_standby_sig_alarm(now, now, true))
  			sig_alarm_enabled = true;
  		else
  			elog(FATAL, "could not set timer for process wakeup");
  	}
+ 	else if (now >= ltime)
+ 	{
+ 		/*
+ 		 * We're already behind, so clear a path as quickly as possible.
+ 		 */
+ 		SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+ 	}
  	else
  	{
! 		/*
! 		 * Wake up at ltime, and check for deadlocks as well if we will be
! 		 * waiting longer than deadlock_timeout
! 		 */
! 		if (enable_standby_sig_alarm(now, ltime, false))
! 			sig_alarm_enabled = true;
  		else
! 			elog(FATAL, "could not set timer for process wakeup");
  	}
  
  	/* Wait to be signaled by UnpinBuffer() */
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.557
diff -c -r1.557 guc.c
*** src/backend/utils/misc/guc.c	25 Jun 2010 13:11:25 -0000	1.557
--- src/backend/utils/misc/guc.c	2 Jul 2010 20:01:25 -0000
***************
*** 57,62 ****
--- 57,63 ----
  #include "postmaster/walwriter.h"
  #include "replication/walsender.h"
  #include "storage/bufmgr.h"
+ #include "storage/standby.h"
  #include "storage/fd.h"
  #include "tcop/tcopprot.h"
  #include "tsearch/ts_cache.h"
***************
*** 116,122 ****
  extern char *temp_tablespaces;
  extern bool synchronize_seqscans;
  extern bool fullPageWrites;
- extern int	vacuum_defer_cleanup_age;
  extern int	ssl_renegotiation_limit;
  
  #ifdef TRACE_SORT
--- 117,122 ----
***************
*** 1373,1378 ****
--- 1373,1398 ----
  		1000, 1, INT_MAX / 1000, NULL, NULL
  	},
  
+ 	{
+ 		{"max_standby_archive_delay", PGC_SIGHUP, WAL_STANDBY_SERVERS,
+ 			gettext_noop("Sets the maximum delay before canceling queries when a hot standby server is processing archived WAL data."),
+ 			NULL,
+ 			GUC_UNIT_MS
+ 		},
+ 		&max_standby_archive_delay,
+ 		30 * 1000, -1, INT_MAX / 1000, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"max_standby_streaming_delay", PGC_SIGHUP, WAL_STANDBY_SERVERS,
+ 			gettext_noop("Sets the maximum delay before canceling queries when a hot standby server is processing streamed WAL data."),
+ 			NULL,
+ 			GUC_UNIT_MS
+ 		},
+ 		&max_standby_streaming_delay,
+ 		30 * 1000, -1, INT_MAX / 1000, NULL, NULL
+ 	},
+ 
  	/*
  	 * Note: MaxBackends is limited to INT_MAX/4 because some places compute
  	 * 4*MaxBackends without any overflow check.  This check is made in
***************
*** 1393,1408 ****
  	},
  
  	{
- 		{"max_standby_delay", PGC_SIGHUP, WAL_STANDBY_SERVERS,
- 			gettext_noop("Sets the maximum delay to avoid conflict processing on hot standby servers."),
- 			NULL,
- 			GUC_UNIT_MS
- 		},
- 		&MaxStandbyDelay,
- 		30 * 1000, -1, INT_MAX / 1000, NULL, NULL
- 	},
- 
- 	{
  		{"superuser_reserved_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
  			gettext_noop("Sets the number of connection slots reserved for superusers."),
  			NULL
--- 1413,1418 ----
Index: src/include/access/xlog.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/access/xlog.h,v
retrieving revision 1.113
diff -c -r1.113 xlog.h
*** src/include/access/xlog.h	17 Jun 2010 16:41:25 -0000	1.113
--- src/include/access/xlog.h	2 Jul 2010 20:01:25 -0000
***************
*** 193,199 ****
  extern bool XLogArchiveMode;
  extern char *XLogArchiveCommand;
  extern bool EnableHotStandby;
- extern int	MaxStandbyDelay;
  extern bool log_checkpoints;
  
  /* WAL levels */
--- 197,202 ----
***************
*** 279,285 ****
  extern bool RecoveryInProgress(void);
  extern bool XLogInsertAllowed(void);
  
! extern TimestampTz GetLatestXLogTime(void);
  
  extern void UpdateControlFile(void);
  extern uint64 GetSystemIdentifier(void);
--- 282,288 ----
  extern bool RecoveryInProgress(void);
  extern bool XLogInsertAllowed(void);
  
! extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
  
  extern void UpdateControlFile(void);
  extern uint64 GetSystemIdentifier(void);
Index: src/include/replication/walreceiver.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/replication/walreceiver.h,v
retrieving revision 1.9
diff -c -r1.9 walreceiver.h
*** src/include/replication/walreceiver.h	3 Jun 2010 22:17:32 -0000	1.9
--- src/include/replication/walreceiver.h	2 Jul 2010 20:01:25 -0000
***************
*** 41,65 ****
  typedef struct
  {
  	/*
! 	 * connection string; is used for walreceiver to connect with the primary.
! 	 */
! 	char		conninfo[MAXCONNINFO];
! 
! 	/*
! 	 * PID of currently active walreceiver process, and the current state.
  	 */
  	pid_t		pid;
  	WalRcvState walRcvState;
  	pg_time_t	startTime;
  
  	/*
! 	 * receivedUpto-1 is the last byte position that has been already
! 	 * received.  When startup process starts the walreceiver, it sets this to
! 	 * the point where it wants the streaming to begin. After that,
! 	 * walreceiver updates this whenever it flushes the received WAL.
  	 */
  	XLogRecPtr	receivedUpto;
  
  	slock_t		mutex;			/* locks shared variables shown above */
  } WalRcvData;
--- 41,75 ----
  typedef struct
  {
  	/*
! 	 * PID of currently active walreceiver process, its current state and
! 	 * start time (actually, the time at which it was requested to be started).
  	 */
  	pid_t		pid;
  	WalRcvState walRcvState;
  	pg_time_t	startTime;
  
  	/*
! 	 * receivedUpto-1 is the last byte position that has already been
! 	 * received.  When startup process starts the walreceiver, it sets
! 	 * receivedUpto to the point where it wants the streaming to begin.
! 	 * After that, walreceiver updates this whenever it flushes the received
! 	 * WAL to disk.
  	 */
  	XLogRecPtr	receivedUpto;
  
+ 	/*
+ 	 * latestChunkStart is the starting byte position of the current "batch"
+ 	 * of received WAL.  It's actually the same as the previous value of
+ 	 * receivedUpto before the last flush to disk.  Startup process can use
+ 	 * this to detect whether it's keeping up or not.
+ 	 */
+ 	XLogRecPtr	latestChunkStart;
+ 
+ 	/*
+ 	 * connection string; is used for walreceiver to connect with the primary.
+ 	 */
+ 	char		conninfo[MAXCONNINFO];
+ 
  	slock_t		mutex;			/* locks shared variables shown above */
  } WalRcvData;
***************
*** 83,88 ****
  extern bool WalRcvInProgress(void);
  extern XLogRecPtr WaitNextXLogAvailable(XLogRecPtr recptr, bool *finished);
  extern void RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo);
! extern XLogRecPtr GetWalRcvWriteRecPtr(void);
  
  #endif   /* _WALRECEIVER_H */
--- 93,98 ----
  extern bool WalRcvInProgress(void);
  extern XLogRecPtr WaitNextXLogAvailable(XLogRecPtr recptr, bool *finished);
  extern void RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo);
! extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart);
  
  #endif   /* _WALRECEIVER_H */
Index: src/include/storage/standby.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/storage/standby.h,v
retrieving revision 1.10
diff -c -r1.10 standby.h
*** src/include/storage/standby.h	13 May 2010 11:15:38 -0000	1.10
--- src/include/storage/standby.h	2 Jul 2010 20:01:25 -0000
***************
*** 19,25 ****
--- 19,28 ----
  #include "storage/procsignal.h"
  #include "storage/relfilenode.h"
  
+ /* User-settable GUC parameters */
  extern int	vacuum_defer_cleanup_age;
+ extern int	max_standby_archive_delay;
+ extern int	max_standby_streaming_delay;
  
  extern void InitRecoveryTransactionEnvironment(void);
  extern void ShutdownRecoveryTransactionEnvironment(void);
On Fri, Jul 2, 2010 at 4:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > [ Apologies for the very slow turnaround on this --- I got hit with > another batch of non-postgres security issues this week. ] > > Attached is a draft patch for revising the max_standby_delay behavior into > something I think is a bit saner. There is some unfinished business: > > * I haven't touched the documentation yet > > * The code in xlog.c needs to be reverted to its former behavior so that > recoveryLastXTime means what it's supposed to mean, ie just the last > commit/abort timestamp. > > However neither of these affects the testability of the patch. > > Basically the way that it works is that the standby delay is computed with > reference to XLogReceiptTime rather than any timestamp obtained from WAL. > XLogReceiptTime is set to current time whenever we obtain a WAL segment > from the archives or when we begin processing a fresh batch of WAL from > walreceiver. There's a subtlety in the streaming case: we don't advance > XLogReceiptTime if we are not caught up, that is if the startup process > is more than one flush cycle behind walreceiver. In the normal case > we'll advance XLogReceiptTime on each flush cycle, but once we start > falling behind, it doesn't move so the grace time alloted to conflicting > queries begins to decrease. > > I split max_standby_delay into two GUC variables, as previously > threatened: max_standby_archive_delay and max_standby_streaming_delay. > The former applies when processing data read from archive, the latter > when processing data received from walreceiver. I think this is really > quite important given the new behavior, because max_standby_archive_delay > ought to be set with reference to the expected time to process one WAL > segment, whereas max_standby_streaming_delay doesn't depend on that value > at all. I'm not sure what good defaults are for these values, so I left > them at 30 seconds for the moment. I'm inclined to think > max_standby_archive_delay ought to be quite a bit less though. > > It might be worth adding a minimum-grace-time limit as we previously > discussed, but this patch doesn't attempt to do that. > > Comments? I haven't been able to wrap my head around why the delay should be LESS in the archive case than in the streaming case. Can you attempt to hit me with the clue-by-four? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > I haven't been able to wrap my head around why the delay should be > LESS in the archive case than in the streaming case. Can you attempt > to hit me with the clue-by-four? In the archive case, you're presumably trying to catch up, and so it makes sense to kill queries faster so you can catch up. The existing code essentially forces instant kill when reading from archive, for any reasonable value of max_standby_delay (because the archived timestamps will probably be older than that). That's overenthusiastic in my view, but you can get that behavior if you want it with this patch by setting max_standby_archive_delay to zero. If you don't want it, you can use something larger. If you don't think that max_standby_archive_delay should be less than max_standby_streaming_delay, you can set them the same, too. regards, tom lane
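[ To make the three interesting settings concrete, here is a standalone C paraphrase of the patch's GetStandbyLimitTime(); the typedef and the hard-coded GUC values below are illustrative stand-ins, not backend code. A delay of -1 waits forever, 0 cancels conflicting queries immediately, and any positive value grants that much grace measured from XLogReceiptTime. ]

#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampTz;	/* microseconds, as in the backend */

/* stand-ins for the real GUC variables (milliseconds; -1 = wait forever) */
static int	max_standby_archive_delay = 0;	/* instant kill from archive */
static int	max_standby_streaming_delay = 30 * 1000;	/* 30 s when streaming */

/*
 * Paraphrase of GetStandbyLimitTime(): pick the delay according to the
 * source of the current WAL chunk and turn it into a cancel cutoff.
 * Returning 0 is the sentinel for "wait forever", matching the patch.
 */
static TimestampTz
standby_limit_time(TimestampTz receipt_time, bool from_stream)
{
	int			delay_ms = from_stream ? max_standby_streaming_delay
									   : max_standby_archive_delay;

	if (delay_ms < 0)
		return 0;				/* wait forever */
	return receipt_time + (TimestampTz) delay_ms * 1000;
}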
On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I haven't been able to wrap my head around why the delay should be >> LESS in the archive case than in the streaming case. Can you attempt >> to hit me with the clue-by-four? > > In the archive case, you're presumably trying to catch up, and so it > makes sense to kill queries faster so you can catch up. On the flip side, the timeout for the WAL segment is for 16MB of WAL, whereas the timeout for SR is normally going to be for a much smaller chunk (right?). So even with the same value for both, it seems like queries will be killed more aggressively during archive recovery. Even so, it seems useful to have both. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> In the archive case, you're presumably trying to catch up, and so it >> makes sense to kill queries faster so you can catch up. > On the flip side, the timeout for the WAL segment is for 16MB of WAL, > whereas the timeout for SR is normally going to be for a much smaller > chunk (right?). So even with the same value for both, it seems like > queries will be killed more aggressively during archive recovery. True, but I suspect useful settings will be significantly larger than those times anyway, so it kind of comes out in the wash. For walreceiver the expected time to process each new chunk is less than wal_sender_delay (if it's not, you better get a faster slave machine). For the archive case, it probably takes more than 200ms to process a 16MB segment, but still only a few seconds. So if you have both the max-delay parameters set to 30 seconds, the minimum "normal" grace periods are 29.8 sec vs maybe 25 sec, not all that different. regards, tom lane
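[ A toy program to check that arithmetic; the 200 ms and 5 s chunk-processing figures are the illustrative numbers from the message above, not measurements. ]

#include <stdio.h>

/*
 * Minimum "normal" grace period: a conflicting query that appears just as
 * the startup process finishes replaying the current chunk has only
 * max_delay_ms minus the chunk's replay time left before the cutoff.
 */
static int
min_normal_grace_ms(int max_delay_ms, int processing_ms)
{
	return max_delay_ms - processing_ms;
}

int
main(void)
{
	/* streaming: one flush cycle, roughly wal_sender_delay-sized (~200 ms) */
	printf("streaming: %d ms\n", min_normal_grace_ms(30 * 1000, 200));
	/* archive: one 16MB segment, say ~5 s to replay */
	printf("archive:   %d ms\n", min_normal_grace_ms(30 * 1000, 5000));
	return 0;					/* prints 29800 and 25000, i.e. 29.8 s vs 25 s */
}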
On Fri, Jul 2, 2010 at 4:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> In the archive case, you're presumably trying to catch up, and so it >>> makes sense to kill queries faster so you can catch up. > >> On the flip side, the timeout for the WAL segment is for 16MB of WAL, >> whereas the timeout for SR is normally going to be for a much smaller >> chunk (right?). So even with the same value for both, it seems like >> queries will be killed more aggressively during archive recovery. > > True, but I suspect useful settings will be significantly larger than > those times anyway, so it kind of comes out in the wash. For > walreceiver the expected time to process each new chunk is less than > wal_sender_delay (if it's not, you better get a faster slave machine). > For the archive case, it probably takes more than 200ms to process a > 16MB segment, but still only a few seconds. So if you have both the > max-delay parameters set to 30 seconds, the minimum "normal" grace > periods are 29.8 sec vs maybe 25 sec, not all that different. I guess I was thinking more in terms of conflicts-per-WAL-record than actual replay time. If we assume that every 100th WAL record produces a conflict, SR will pause every once in a while, and then catch up (either because it canceled some queries or because they finished within the allowable grace period), whereas archive recovery will be patient for a bit and then start nuking 'em from orbit. Maybe my mental model is all wrong though. Anyway, I don't object to having two separate settings, and maybe we'll know more about how to tune them after this gets out in the field. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 02/07/10 23:36, Tom Lane wrote: > Robert Haas<robertmhaas@gmail.com> writes: >> I haven't been able to wrap my head around why the delay should be >> LESS in the archive case than in the streaming case. Can you attempt >> to hit me with the clue-by-four? > > In the archive case, you're presumably trying to catch up, and so it > makes sense to kill queries faster so you can catch up. The existing > code essentially forces instant kill when reading from archive, for any > reasonable value of max_standby_delay (because the archived timestamps > will probably be older than that). That's overenthusiastic in my view, > but you can get that behavior if you want it with this patch by setting > max_standby_archive_delay to zero. If you don't want it, you can use > something larger. If you don't think that max_standby_archive_delay > should be less than max_standby_streaming_delay, you can set them the > same, too. It would seem logical to use the same logic for archive recovery as we do for streaming replication, and only set XLogReceiptTime when you have to wait for a WAL segment to arrive into the archive, ie. when restore_command fails. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
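[ In code form, that suggestion amounts to roughly the following sketch. Every name below is an invented placeholder for the archive-polling path in xlog.c, so read it as a statement of the rule rather than an implementation. ]

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

typedef int64_t TimestampTz;	/* microseconds, as in the backend */

static TimestampTz XLogReceiptTime = 0;

/* placeholder stubs standing in for the real recovery machinery */
static bool
run_restore_command(const char *xlogfname)
{
	(void) xlogfname;
	return true;				/* pretend the segment was copied */
}

static TimestampTz
current_time_us(void)
{
	return 0;					/* placeholder clock */
}

/*
 * The proposed rule: advance XLogReceiptTime only when the segment was
 * NOT already sitting in the archive, i.e. only when we had to poll for
 * it.  Replaying a pre-existing backlog leaves XLogReceiptTime alone,
 * so the grace time for conflicting queries keeps shrinking.
 */
static void
fetch_segment_from_archive(const char *xlogfname)
{
	bool		had_to_wait = false;

	while (!run_restore_command(xlogfname))
	{
		had_to_wait = true;		/* not there yet: we must be caught up */
		sleep(1);				/* poll until the segment arrives */
	}

	if (had_to_wait)
		XLogReceiptTime = current_time_us();
}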
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > It would seem logical to use the same logic for archive recovery as we > do for streaming replication, and only set XLogReceiptTime when you have > to wait for a WAL segment to arrive into the archive, ie. when > restore_command fails. That would not do what you want at all in the case where you're recovering from archive --- XLogReceiptTime would never advance at all for the duration of the recovery. It might be useful if you knew that it was a standby-with-log-shipping situation, but we have no way to tell the difference. The restore_command might know the difference, though. Is it worth modifying its API somehow so that the command could tell us whether to advance XLogReceiptTime? How would we do that? regards, tom lane
On 03/07/10 18:32, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> It would seem logical to use the same logic for archive recovery as we >> do for streaming replication, and only set XLogReceiptTime when you have >> to wait for a WAL segment to arrive into the archive, ie. when >> restore_command fails. > > That would not do what you want at all in the case where you're > recovering from archive --- XLogReceiptTime would never advance > at all for the duration of the recovery. Do you mean when using something like pg_standby, which does the waiting itself? > It might be useful if you knew that it was a standby-with-log-shipping > situation, but we have no way to tell the difference. With pg_standby etc. you use standby_mode=off. Same with traditional archive recovery. In standby mode, it's on. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 03/07/10 18:32, Tom Lane wrote: >> That would not do what you want at all in the case where you're >> recovering from archive --- XLogReceiptTime would never advance >> at all for the duration of the recovery. > Do you mean when using something like pg_standby, which does the waiting > itself? No, I'm thinking about recovery starting from an old base backup, or any situation where you're trying to process a significant number of archived WAL segments as quickly as possible. >> It might be useful if you knew that it was a standby-with-log-shipping >> situation, but we have no way to tell the difference. > With pg_standby etc. you use standby_mode=off. Same with traditional > archive recovery. In standby mode, it's on. Uh, no, that parameter is not what I'm talking about. What I tried to say is that if you're using log shipping for replication instead of walsender/walreceiver, you might want to treat data as live even though the database thinks it is being pulled "from archive". But we'd need a way for the database to tell the cases apart ... right now it cannot. regards, tom lane
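[ Purely as a strawman for that missing signal; no such convention exists today, and the marker file is invented for illustration. A log-shipping restore_command that knows its data is live could drop a flag file next to the restored segment, and recovery could advance XLogReceiptTime only when it sees one. ]

#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical freshness check: "<segment>.fresh" would be created by a
 * restore_command that is really doing live log shipping; a plain
 * archive restore would leave it absent.
 */
static bool
segment_is_fresh(const char *marker_path)
{
	struct stat st;

	return stat(marker_path, &st) == 0; /* marker present => live data */
}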