Thread: Change pg_last_xlog_receive_location not to move backwards
Hi, In the case where there are multiple standbys, when a failover happens we usually determine the most advanced standby by using pg_last_xlog_receive_location or pg_last_xlog_replay_location, and promote it to the new master. The problem is that neither function necessarily returns the true last location of WAL available in the standby, so we cannot use them to check which standby is furthest ahead. Since pg_last_xlog_receive_location moves backwards when the standby attempts to reconnect to the primary, it might fall behind the last location of WAL actually available. OTOH, pg_last_xlog_replay_location might also fall behind because of Hot Standby query conflicts. So I'm thinking to change pg_last_xlog_receive_location not to move backwards. There is no need to move it backwards when the standby reconnects to the primary, so we can do that. BTW, the related discussion was done before: http://archives.postgresql.org/pgsql-hackers/2010-06/msg00576.php Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
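[Editorial illustration] The locations these functions return are text of the form "X/Y", so a failover script has to parse and compare them numerically to pick the most advanced standby. A minimal standalone sketch of that comparison, with made-up values and names (purely illustrative, not part of any patch):
----------------------
#include <stdint.h>
#include <stdio.h>

/* Parse a WAL location of the form "X/Y", as returned by
 * pg_last_xlog_receive_location(), into a single 64-bit number.
 * Error handling is omitted for brevity. */
static uint64_t
parse_location(const char *text)
{
    unsigned int hi = 0, lo = 0;

    sscanf(text, "%X/%X", &hi, &lo);
    return ((uint64_t) hi << 32) | lo;
}

int
main(void)
{
    /* Hypothetical values reported by two standbys. */
    const char *standby_a = "0/3000020";
    const char *standby_b = "0/2FE0058";

    printf("promote standby %s\n",
           parse_location(standby_a) >= parse_location(standby_b) ? "A" : "B");
    return 0;
}
----------------------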
On Thu, Jan 13, 2011 at 4:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > So I'm thinking to change pg_last_xlog_receive_location not to > move backwards. The attached patch does that. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Jan 13, 2011 at 2:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Thu, Jan 13, 2011 at 4:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> So I'm thinking to change pg_last_xlog_receive_location not to >> move backwards. > > The attached patch does that. It looks to me like this is changing more than just the return value of pg_last_xlog_receive_location. receivedUpto isn't only used for that one function, and there's no explanation in your email or in the patch of why the new behavior is correct and/or better for the other places where it's used. This email from Heikki seems to indicate that there's a reason for the current behavior: http://archives.postgresql.org/pgsql-hackers/2010-06/msg00586.php -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 25, 2011 at 4:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 13, 2011 at 2:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Thu, Jan 13, 2011 at 4:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> So I'm thinking to change pg_last_xlog_receive_location not to >>> move backwards. >> >> The attached patch does that. > > It looks to me like this is changing more than just the return value > of pg_last_xlog_receive_location. receivedUpto isn't only used for > that one function, and there's no explanation in your email or in the > patch of why the new behavior is correct and/or better for the other > places where it's used. > > This email from Heikki seems to indicate that there's a reason for the > current behavior: > > http://archives.postgresql.org/pgsql-hackers/2010-06/msg00586.php Yes, so I didn't change that behavior, i.e., even with the patch, SR still always starts from the head of the WAL segment, not the middle of that. What I changed is only the return value of pg_last_xlog_receive_location. Am I missing something? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Jan 24, 2011 at 11:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 13, 2011 at 2:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Thu, Jan 13, 2011 at 4:24 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> So I'm thinking to change pg_last_xlog_receive_location not to >>> move backwards. >> >> The attached patch does that. > > It looks to me like this is changing more than just the return value > of pg_last_xlog_receive_location. receivedUpto isn't only used for > that one function, and there's no explanation in your email or in the > patch of why the new behavior is correct and/or better for the other > places where it's used. I believe the new walrcv->receiveStart was introduced to divide up those behaviors that should go backwards from those that should not. The non-retreating value is used in 3 places (via GetWalRcvWriteRecPtr) in xlog.c. (And some other places I haven't examined yet.) One place is to decide whether to remove/recycle XLog files, and I think the non-retreating value is at least as suitable for this usage. One is to feed the pg_last_xlog_receive_location() function. The third seems more problematic. In XLogPageRead, it checks to see if more records have been received beyond what has been applied. By using the non-retreating value here, it seems like the xlog replay could start replaying records that the wal receiver is in the process of overwriting. Now, I've argued to myself that this is not a problem, because the receiver is overwriting them with identical data to that which is already there. But by that logic, why does any part of it (walrcv->receiveStart in the patch, walrcv->receivedUpto unpatched) need to retreat? From the previous discussion, I understand that the concern is that we don't want to retrieve and write out partial xlog files. What I don't understand is how we could get ourselves into the position in which we are doing that, other than by someone going in and doing impermissible things to the PGDATA directory behind our backs. I've been spinning my wheels on that part for a while. Since I don't understand what we are defending against in the original code, I can't evaluate if the patch maintains that defense. Cheers, Jeff
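[Editorial illustration] To make the split Jeff describes concrete, here is a small self-contained sketch of the idea: a streaming-start position that can rewind to a segment boundary on reconnect, alongside a received position that only ever advances. All names, types, and values below are invented for illustration and are not taken from the patch:
----------------------
#include <stdint.h>
#include <stdio.h>

#define SEG_SIZE ((uint64_t) 16 * 1024 * 1024)  /* default 16 MB WAL segment */

/* Invented stand-ins for the walreceiver's bookkeeping. */
typedef struct
{
    uint64_t receive_start;  /* where streaming (re)starts; may rewind         */
    uint64_t received_upto;  /* high-water mark of received WAL; never rewinds */
} recv_state;

/* On reconnect, restart streaming at the boundary of the current segment so
 * the new file's header can be validated, but leave received_upto alone. */
static void
restart_streaming(recv_state *st)
{
    st->receive_start = st->received_upto - (st->received_upto % SEG_SIZE);
}

/* As WAL is received and flushed, only ever move received_upto forwards. */
static void
advance(recv_state *st, uint64_t flushed_to)
{
    if (flushed_to > st->received_upto)
        st->received_upto = flushed_to;
}

int
main(void)
{
    recv_state st = { 0, 3 * SEG_SIZE + 1234 };

    restart_streaming(&st);
    advance(&st, 3 * SEG_SIZE + 400);   /* re-streamed data: no move backwards */
    printf("restart streaming at %llu, but still report %llu\n",
           (unsigned long long) st.receive_start,
           (unsigned long long) st.received_upto);
    return 0;
}
----------------------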
Thanks for the review! On Wed, Jan 26, 2011 at 3:40 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > I believe the new walrcv->receiveStart was introduced to divide up those > behaviors that should go backwards from those that should not. Yes. > The non-retreating value is used in 3 places (via GetWalRcvWriteRecPtr) > in xlog.c. (And some other places I haven't examined yet.) Yes. > The third seems more problematic. In the XLogPageRead, > it checks to see if more records have been received beyond what > has been applied. By using the non-retreating value here, it seems > like the xlog replay could start replaying records that the wal > receiver is in the process of overwriting. Now, I've argued to myself > that this is not a problem, because the receiver is overwriting them > with identical data to that which is already there. Yes. I don't think that it's a problem, either. > But by that logic, why does any part of it (walrcv->receiveStart in > the patch, walrcv->receivedUpto unpatched) need to retreat? From > the previous discussion, I understand that the concern is that we don't > want to retrieve and write out partial xlog files. What I don't understand > is how we could get our selves into the position in which we are doing > that, other than by someone going in and doing impermissible things to > the PGDATA directory behind our backs. That logic exists because we'd like to check that newly-received WAL data is consistent with previous one by validating the header of new WAL file. So since we need the header of new WAL file, we retreat the replication starting location to the beginning of the WAL file when reconnecting to the primary. The following code (in XLogPageRead) validates the header of the new WAL file.

----------------------
if (switched_segment && targetPageOff != 0)
{
    /*
     * Whenever switching to a new WAL segment, we read the first page of
     * the file and validate its header, even if that's not where the
     * target record is. This is so that we can check the additional
     * identification info that is present in the first page's "long"
     * header.
     */
    readOff = 0;
    if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
    {
        ereport(emode_for_corrupt_record(emode, *RecPtr),
                (errcode_for_file_access(),
                 errmsg("could not read from log file %u, segment %u, offset %u: %m",
                        readId, readSeg, readOff)));
        goto next_record_is_invalid;
    }
    if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
        goto next_record_is_invalid;
}
----------------------

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jan 25, 2011 at 6:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> The third seems more problematic. In the XLogPageRead, >> it checks to see if more records have been received beyond what >> has been applied. By using the non-retreating value here, it seems >> like the xlog replay could start replaying records that the wal >> receiver is in the process of overwriting. Now, I've argued to myself >> that this is not a problem, because the receiver is overwriting them >> with identical data to that which is already there. > > Yes. I don't think that it's a problem, too. > >> But by that logic, why does any part of it (walrcv->receiveStart in >> the patch, walrcv->receivedUpto unpatched) need to retreat? From >> the previous discussion, I understand that the concern is that we don't >> want to retrieve and write out partial xlog files. What I don't understand >> is how we could get our selves into the position in which we are doing >> that, other than by someone going in and doing impermissible things to >> the PGDATA directory behind our backs. > > That logic exists because we'd like to check that newly-received WAL > data is consistent with previous one by validating the header of new > WAL file. I do not understand what doing so gets us. Say we previously received 2/3 of a WAL file, and replayed most of it. So now the shared buffers have data that has been synced to that WAL file already, and some of those dirty shared buffers have been written to disk and some have not. At this point, we need the data in the first 2/3 of the WAL file in order to reach a consistent state. But now we lose the connection to the master, and then we restore it. Now we request the entire file from the start rather than from where it left off. Either of two things happens. Most likely, the newly received WAL file matches the file it is overwriting, in which case there was no point in asking for it. Less likely, the master is feeding us gibberish. By requesting the full WAL file, we check the header and detect that the master is feeding us gibberish. Unfortunately, we only detect that fact *after* we have replaced a critical part of our own (previously good) copy of the WAL file with said gibberish. The standby is now in an unrecoverable state. With a bit of malicious engineering, I have created this situation. I don't know how likely it is that something like that could happen accidentally, say with a corrupted file system. I have been unable to engineer a situation where checking the header actually does any good. It has either done nothing, or done harm. > So since we need the header of new WAL file, we retreat the > replication starting location to the beginning of the WAL file when > reconnecting to the primary. > > The following code (in XLogPageRead) validates the header of new > WAL file.
>
> ----------------------
> if (switched_segment && targetPageOff != 0)
> {
>     /*
>      * Whenever switching to a new WAL segment, we read the first page of
>      * the file and validate its header, even if that's not where the
>      * target record is. This is so that we can check the additional
>      * identification info that is present in the first page's "long"
>      * header.
>      */
>     readOff = 0;
>     if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
>     {
>         ereport(emode_for_corrupt_record(emode, *RecPtr),
>                 (errcode_for_file_access(),
>                  errmsg("could not read from log file %u, segment %u, offset %u: %m",
>                         readId, readSeg, readOff)));
>         goto next_record_is_invalid;
>     }
>     if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
>         goto next_record_is_invalid;
> }

OK, thanks for the explanation. Is there a race condition here? That is, it seems that with your patch this part of the code could get executed after streaming is restarted, but before the streaming ever actually received and wrote anything. So it would be checking the old header, not the newly about-to-be received header. Cheers, Jeff
On Sun, Jan 30, 2011 at 10:44 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > I do not understand what doing so gets us. > > Say we previously received 2/3 of a WAL file, and replayed most of it. > So now the shared buffers have data that has been synced to that WAL > file already, and some of those dirty shared buffers have been written > to disk and some have not. At this point, we need the data in the first > 2/3 of the WAL file in order to reach a consistent state. But now we > lose the connection to the master, and then we restore it. Now we > request the entire file from the start rather than from where it > left off. > > Either of two things happens. Most likely, the newly received WAL file > matches the file it is overwriting, in which case there was no > point in asking for it. > > Less likely, the master is feeding us gibberish. By requesting the > full WAL file, we check the header and detect that the master is feeding > us gibberish. Unfortunately, we only detect that fact *after* we have > replaced a critical part of our own (previously good) copy of the WAL > file with said gibberish. The standby is now in an unrecoverable state. Right. To avoid this problem completely, IMO, walreceiver should validate the received WAL data before writing it. Or, walreceiver should write the WAL to the transient file, and the startup process should rename it to the correct name after replaying it. We should do something like the above? > With a bit of malicious engineering, I have created this situation. > I don't know how likely it is that something like that could happen > accidentally, say with a corrupted file system. I have been unable > to engineer a situation where checking the header actually does > any good. It has either done nothing, or done harm. OK, I seem to have to consider again why the code which retreats the replication starting location exists. At first, I added the code to prevent a half-baked WAL file. For example, please imagine the case where you start the standby server with no WAL files in pg_xlog. In this case, if replication starts from the middle of WAL file, the received WAL file is obviously broken (i.e., with no WAL data in the first half of file). This broken WAL file might cause trouble when we restart the standby and even when we just promote it (because the last applied WAL file is re-fetched at the end of recovery). OTOH, we can start replication from the middle of WAL file if we can *ensure* that the first half of WAL file already exists. At least, when the standby reconnects to the master, we might be able to ensure that and start from the middle. > OK, thanks for the explanation. Is there a race condition here? That is, > it seems that with your patch this part of the code could get executed > after streaming is restarted, but before the streaming ever actually received > and wrote anything. So it would be checking the old header, not the newly > about-to-be received header. As far as I read the code again, there is no such race condition. When the received WAL is read just after streaming is restarted, that part of the code seems to always be executed. Maybe I'm missing something. Could you explain the problematic scenario? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
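[Editorial illustration] The transient-file idea floated above is essentially the usual write-then-rename pattern (in Fujii's proposal the rename would happen later, after replay, and by the startup process rather than immediately). A generic, self-contained sketch of that pattern, with simplified error handling and made-up file names, not walreceiver code:
----------------------
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write data to "<final_path>.tmp", fsync it, then rename it into place, so
 * the final name only ever refers to a completely written file. */
int
write_then_rename(const char *final_path, const char *data, size_t len)
{
    char tmp_path[1024];
    int  fd;

    snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", final_path);

    fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* Atomically replace (or create) the file under its final name. */
    return rename(tmp_path, final_path);
}

int
main(void)
{
    /* Demo with an invented file name. */
    return write_then_rename("demo.wal", "hello", 5);
}
----------------------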
Fujii, all, * Fujii Masao (masao.fujii@gmail.com) wrote: > That logic exists because we'd like to check that newly-received WAL > data is consistent with previous one by validating the header of new > WAL file. So since we need the header of new WAL file, we retreat the > replication starting location to the beginning of the WAL file when > reconnecting to the primary. Thanks for that explanation, but I can't help but wonder why it doesn't make more sense to introduce a new variable to track the value you want rather than reusing an existing one and then adding a variable to represent what the old variable was already doing. In other words, why not invent XLogRecPtr newestReceivedUpto; /* or something */ and update that as necessary and have the function you want changed return that, and leave receivedUpto alone..? It seems like it'd be a lot easier to prove to ourselves that your patch didn't break anything, presuming the function we're talking about changing the return value of isn't called anywhere (seems rather unlikely to me..). Also, you really should reformat the docs properly when you change them, rather than leaving the lines ragged.. Thanks, Stephen
On Fri, Feb 11, 2011 at 11:52 AM, Stephen Frost <sfrost@snowman.net> wrote: > Fujii, all, > > * Fujii Masao (masao.fujii@gmail.com) wrote: >> That logic exists because we'd like to check that newly-received WAL >> data is consistent with previous one by validating the header of new >> WAL file. So since we need the header of new WAL file, we retreat the >> replication starting location to the beginning of the WAL file when >> reconnecting to the primary. > > Thanks for that explanation, but I can't help but wonder why it doesn't > make more sense to introduce a new variable to track the value you want > rather than reusing an existing one and then adding a variable to > represent what the old variable was already doing. +1. It seems like there may be some more significant changes in this area needed; however, for now I think the best fix is the one with the least chance of breaking anything. > Also, you really should reformat the docs properly when you change them, > rather than leaving the lines ragged.. It's OK to leave them a little ragged, I think. It eases back-patching. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 11, 2011 at 12:25 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Feb 11, 2011 at 11:52 AM, Stephen Frost <sfrost@snowman.net> wrote: >> Fujii, all, >> >> * Fujii Masao (masao.fujii@gmail.com) wrote: >>> That logic exists because we'd like to check that newly-received WAL >>> data is consistent with previous one by validating the header of new >>> WAL file. So since we need the header of new WAL file, we retreat the >>> replication starting location to the beginning of the WAL file when >>> reconnecting to the primary. >> >> Thanks for that explanation, but I can't help but wonder why it doesn't >> make more sense to introduce a new variable to track the value you want >> rather than reusing an existing one and then adding a variable to >> represent what the old variable was already doing. > > +1. > > It seems like there may be some more significant changes in this area > needed; however, for now I think the best fix is the one with the > least chance of breaking anything. Actually... wait a minute. Now that I'm thinking about this a little more, I'm not so convinced. If we leave this the way it is, and just paper over the problem using the method Stephen and I both had in mind, then we could be streaming from a position that precedes the so-called "current" position. And worse, the newly introduced replies to the master will still show the position going backward, so the master will reflect a position that is earlier than the position the slave reports for itself, not because of any real difference but because of a reporting difference. That sure doesn't seem good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote: > Actually... wait a minute. Now that I'm thinking about this a little > more, I'm not so convinced. If we leave this the way is, and just > paper over the problem using the method Stephen and I both had in > mind, then we could be streaming from a position that precedes the > so-called "current" position. And worse, the newly introduced replies > to the master will still show the position going backward, so the > master will reflect a position that is earlier that the position the > slave reports for itself, not because of any real difference but > because of a reporting difference. That sure doesn't seem good. I'm really not sure it's as bad as all that... The slave and the master are only going to be "out of sync" wrt position until the slave sends the request for update to the master, gets back the result, and checks the XLOG header, right? Here's my question- we're talking about if the master dies here, as I understand it. If the slave comes up and can't get to the master, is he going to be reporting the older position, or his current one? The answer to that should be, I believe, the *current* one. He'd only "go backwards" *if* the master is up and he gets his request over to the master to get the old log. In fact, if he comes up and gets his request off to the master and the master never replies, in my view, he should be able to be restarted into 'master' mode and come all the way up to 'current' (which would be later than what he was trying to ask the master for). *That* is the value, I think, that Fujii is looking for- if this slave started up as a master, where would he be? That should always be increasing. The fact that we "back up" a little, as a double-check to make sure we didn't get wildly out of sync, and because we want the master to only be producing full XLOGs, is an implementation detail, really, that probably doesn't really need to be exposed.. That may mean that we want to change what the slave is reporting for itself though, if it's currently allowed to be seen "going backwards", but I don't think that would be terribly difficult to change... I haven't really dug into the SR/HS stuff, so I could be way off too.. Thanks, Stephen
On Fri, Feb 11, 2011 at 12:52 PM, Stephen Frost <sfrost@snowman.net> wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: >> Actually... wait a minute. Now that I'm thinking about this a little >> more, I'm not so convinced. If we leave this the way is, and just >> paper over the problem using the method Stephen and I both had in >> mind, then we could be streaming from a position that precedes the >> so-called "current" position. And worse, the newly introduced replies >> to the master will still show the position going backward, so the >> master will reflect a position that is earlier that the position the >> slave reports for itself, not because of any real difference but >> because of a reporting difference. That sure doesn't seem good. > > I'm really not sure it's as bad as all that... The slave and the master > are only going to be "out of sync" wrt position until the slave sends > the request for update to the master, gets back the result, and checks > the XLOG header, right? It'll restream the whole segment up to wherever it was, but basically, yes. I think a big part of the problem here is that we have wildly inconsistent terminology for no especially good reason. The standby reports three XLOG positions to the master, currently called write, flush, and apply. The walreceiver reports positions called receivedUpto and lastChunkStart. receivedUpto is actually the FLUSH position, and lastChunkStart is the previous flush position, except when the walreceiver first starts, when it's the position at which streaming is to begin. So, what if we did some renaming? I'd be inclined to start by renaming "receivedUpto" to Flush, and add a new position called Stream. When walreceiver is started, we set Stream to the position at which streaming is going to begin (which can rewind) and leave Flush alone (so it never rewinds). We then change the walreceiver feedback mechanism to use the term stream_location rather than write_location; and we could consider replacing pg_last_xlog_receive_location() with pg_last_xlog_stream_location() and pg_last_xlog_flush_location(). That's a backward compatibility break, but maybe it's worth it for the sake of terminological consistency, not to mention accuracy. I'd also be inclined to go to the walreceiver code and rename the apply_location to replay_location, so that it matches pg_last_xlog_replay_location(). The latter is in 9.0, but the former is new to 9.1, so we can still fix it to be consistent without a backward compatibility break. Thoughts, comments? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote: > I think a big part of the problem here is that we have wildly > inconsistent terminology for no especially good reason. Agreed. > Thoughts, comments? I thought about this for a bit and agree w/ your suggestions. So, +1 from me. Thanks, Stephen
On Sat, 2011-02-12 at 09:32 -0500, Robert Haas wrote: > On Fri, Feb 11, 2011 at 12:52 PM, Stephen Frost <sfrost@snowman.net> wrote: > > * Robert Haas (robertmhaas@gmail.com) wrote: > >> Actually... wait a minute. Now that I'm thinking about this a little > >> more, I'm not so convinced. If we leave this the way is, and just > >> paper over the problem using the method Stephen and I both had in > >> mind, then we could be streaming from a position that precedes the > >> so-called "current" position. And worse, the newly introduced replies > >> to the master will still show the position going backward, so the > >> master will reflect a position that is earlier that the position the > >> slave reports for itself, not because of any real difference but > >> because of a reporting difference. That sure doesn't seem good. > > > > I'm really not sure it's as bad as all that... The slave and the master > > are only going to be "out of sync" wrt position until the slave sends > > the request for update to the master, gets back the result, and checks > > the XLOG header, right? > > It'll restream the whole segment up to wherever it was, but basically, yes. > > I think a big part of the problem here is that we have wildly > inconsistent terminology for no especially good reason. The standby > reports three XLOG positions to the master, currently called write, > flush, and apply. The walreceiver reports positions called > recievedUpTo and lastChunkStart. receivedUpTo is actually the FLUSH > position, and lastChunkStart is the previous flush position, except > when the walreceiver first starts, when it's the position at which > streaming is to begin. > > So, what if we did some renaming? I'd be inclined to start by > renaming "receivedUpTo" to Flush, and add a new position called > Stream. When walreciever is started, we set Stream to the position at > which streaming is going to begin (which can rewind) and leave Flush > alone (so it never rewinds). OK > We then change the walreceiver feedback > mechanism to use the term stream_location rather than write_location; OK > and we could consider replacing pg_last_xlog_receive_location() with > pg_last_xlog_stream_location() and pg_last_xlog_flush_location(). > That's a backward compatibility break, but maybe it's worth it for the > sake of terminological consistency, not to mention accuracy. Don't see a need to break anything. OK with two new function names. > I'd also be inclined to go to the walreceiver code and and rename the > apply_location to replay_location, so that it matches > pg_last_xlog_replay_location(). The latter is in 9.0, but the former > is new to 9.1, so we can still fix it to be consistent without a > backward compatibility break. Any renaming of things should be done as cosmetic fixes after important patches are in. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Mon, 2011-01-31 at 16:12 +0900, Fujii Masao wrote: > On Sun, Jan 30, 2011 at 10:44 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > > I do not understand what doing so gets us. > > > > Say we previously received 2/3 of a WAL file, and replayed most of it. > > So now the shared buffers have data that has been synced to that WAL > > file already, and some of those dirty shared buffers have been written > > to disk and some have not. At this point, we need the data in the first > > 2/3 of the WAL file in order to reach a consistent state. But now we > > lose the connection to the master, and then we restore it. Now we > > request the entire file from the start rather than from where it > > left off. > > > > Either of two things happens. Most likely, the newly received WAL file > > matches the file it is overwriting, in which case there was no > > point in asking for it. > > > > Less likely, the master is feeding us gibberish. By requesting the > > full WAL file, we check the header and detect that the master is feeding > > us gibberish. Unfortunately, we only detect that fact *after* we have > > replaced a critical part of our own (previously good) copy of the WAL > > file with said gibberish. The standby is now in an unrecoverable state. > > Right. To avoid this problem completely, IMO, walreceiver should validate > the received WAL data before writing it. Or, walreceiver should write the > WAL to the transient file, and the startup process should rename it to the > correct name after replaying it. > > We should do something like the above? > > > With a bit of malicious engineering, I have created this situation. > > I don't know how likely it is that something like that could happen > > accidentally, say with a corrupted file system. I have been unable > > to engineer a situation where checking the header actually does > > any good. It has either done nothing, or done harm. > > OK, I seem to have to consider again why the code which retreats the > replication starting location exists. > > At first, I added the code to prevent a half-baked WAL file. For example, > please imagine the case where you start the standby server with no WAL > files in pg_xlog. In this case, if replication starts from the middle of WAL > file, the received WAL file is obviously broken (i.e., with no WAL data in > the first half of file). This broken WAL file might cause trouble when we > restart the standby and even when we just promote it (because the last > applied WAL file is re-fetched at the end of recovery). > > OTOH, we can start replication from the middle of WAL file if we can > *ensure* that the first half of WAL file already exists. At least, when the > standby reconnects to the master, we might be able to ensure that and > start from the middle. Some important points here, but it seems complex. AFAICS we need to do two things * check header of WAL file matches when we start streaming * start streaming from last received location So why not do them separately, rather than rewinding the received location and kludging things? Seems easier to send all required info in a separate data packet, then validate the existing WAL file against that. Then start streaming from last received location. That way we don't have any "going backwards" logic at all. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sun, Jan 30, 2011 at 11:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Sun, Jan 30, 2011 at 10:44 AM, Jeff Janes <jeff.janes@gmail.com> wrote: >> I do not understand what doing so gets us. >> >> Say we previously received 2/3 of a WAL file, and replayed most of it. >> So now the shared buffers have data that has been synced to that WAL >> file already, and some of those dirty shared buffers have been written >> to disk and some have not. At this point, we need the data in the first >> 2/3 of the WAL file in order to reach a consistent state. But now we >> lose the connection to the master, and then we restore it. Now we >> request the entire file from the start rather than from where it >> left off. >> >> Either of two things happens. Most likely, the newly received WAL file >> matches the file it is overwriting, in which case there was no >> point in asking for it. >> >> Less likely, the master is feeding us gibberish. By requesting the >> full WAL file, we check the header and detect that the master is feeding >> us gibberish. Unfortunately, we only detect that fact *after* we have >> replaced a critical part of our own (previously good) copy of the WAL >> file with said gibberish. The standby is now in an unrecoverable state. > > Right. To avoid this problem completely, IMO, walreceiver should validate > the received WAL data before writing it. Or, walreceiver should write the > WAL to the transient file, and the startup process should rename it to the > correct name after replaying it. > > We should do something like the above? I don't think we should overwrite local WAL that has already been applied, unless we already know for sure that it is bad (and suspect re-streaming might make it right.) I think the better approach would be to check the existence, and the segment header, of the WAL segment we already (think we) have in pg_xlog. And then request re-streaming of that file from the beginning only if it is either missing or has the wrong header in the local copy. That might make the code a lot more complex, though. At least I don't see a simple way to implement it. > >> With a bit of malicious engineering, I have created this situation. >> I don't know how likely it is that something like that could happen >> accidentally, say with a corrupted file system. I have been unable >> to engineer a situation where checking the header actually does >> any good. It has either done nothing, or done harm. > > OK, I seem to have to consider again why the code which retreats the > replication starting location exists. > > At first, I added the code to prevent a half-baked WAL file. For example, > please imagine the case where you start the standby server with no WAL > files in pg_xlog. In this case, if replication starts from the middle of WAL > file, the received WAL file is obviously broken (i.e., with no WAL data in > the first half of file). This broken WAL file might cause trouble when we > restart the standby and even when we just promote it (because the last > applied WAL file is re-fetched at the end of recovery). > > OTOH, we can start replication from the middle of WAL file if we can > *ensure* that the first half of WAL file already exists. At least, when the > standby reconnects to the master, we might be able to ensure that and > start from the middle. > >> OK, thanks for the explanation. Is there a race condition here? That is, >> it seems that with your patch this part of the code could get executed >> after streaming is restarted, but before the streaming ever actually received >> and wrote anything. So it would be checking the old header, not the newly >> about-to-be received header. > > As far as I read the code again, there is no such a race condition. When > the received WAL is read just after streaming is restarted, that part of the > code seems to always be executed. > > Maybe I'm missing something. Could you explain about the problematic > scenario? OK, I think you are right. I was thinking that if apply is well behind receive when the connection is lost, that a new connection could immediately pass the "havedata" test and so proceed to the header check perhaps before anything was received. But now I see that we never try to re-connect until after apply has caught up with receive, so that race condition cannot happen. Sorry for the false alarm. By the way, the patch no longer applies cleanly to HEAD, as of commit d309acf201ab2c5 (Feb 11th). But it looks like a trivial conflict (Two different patches both correct the same doc typo). Cheers, Jeff
On Sat, Feb 12, 2011 at 11:32 PM, Robert Haas <robertmhaas@gmail.com> wrote: > So, what if we did some renaming? I'd be inclined to start by > renaming "receivedUpTo" to Flush, and add a new position called > Stream. When walreciever is started, we set Stream to the position at > which streaming is going to begin (which can rewind) and leave Flush > alone (so it never rewinds). We then change the walreceiver feedback > mechanism to use the term stream_location rather than write_location; > and we could consider replacing pg_last_xlog_receive_location() with > pg_last_xlog_stream_location() and pg_last_xlog_flush_location(). You suggest that the shared variable Stream tracks the WAL write location, after it's set to the replication starting position? I don't think that the write location needs to be tracked in the shmem because other processes than walreceiver don't use it. What I proposed is:

Walreceiver-local variables
==================
1. LogstreamResult.Write
 - Indicates the location of recently written WAL record
 - Can rewind
 - pg_stat_replication.write_location returns this
2. LogstreamResult.Flush
 - Indicates the location of recently flushed WAL record
 - Can rewind
 - pg_stat_replication.flush_location returns this

Shmem variables
===========
3. WalRcv->receiveStart
 - Indicates the replication starting location
 - Updated only when walreceiver is started
 - Doesn't exist at the moment, so I propose to add this
4. WalRcv->receivedUpto
 - Indicates the latest location of all the flushed WAL records
 - Never rewinds (Can rewind at the moment, so I propose to prevent the rewind)
 - pg_last_xlog_receive_location returns this

You propose to rename LogstreamResult.Write to .Stream, and merge it and receiveStart? > I'd also be inclined to go to the walreceiver code and and rename the > apply_location to replay_location, so that it matches > pg_last_xlog_replay_location(). The latter is in 9.0, but the former > is new to 9.1, so we can still fix it to be consistent without a > backward compatibility break. +1 Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
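[Editorial illustration] Restating the proposal above as a rough sketch may help; the types are simplified stand-ins (plain 64-bit integers instead of XLogRecPtr) and the comments are paraphrases, so this is an illustration rather than code from the patch:
----------------------
#include <stdint.h>

/* Walreceiver-local bookkeeping: both values may rewind when streaming
 * restarts at a segment boundary. */
typedef struct
{
    uint64_t Write;   /* recently written WAL -> pg_stat_replication.write_location */
    uint64_t Flush;   /* recently flushed WAL -> pg_stat_replication.flush_location */
} LogstreamResultSketch;

/* Shared-memory bookkeeping. */
typedef struct
{
    uint64_t receiveStart;   /* where streaming starts; set when walreceiver starts */
    uint64_t receivedUpto;   /* latest flushed location; never rewinds,
                              * reported by pg_last_xlog_receive_location() */
} WalRcvSketch;
----------------------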
On Tue, Feb 15, 2011 at 12:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > You suggest that the shared variable Stream tracks the WAL write location, > after it's set to the replication starting position? I don't think > that the write > location needs to be tracked in the shmem because other processes than > walreceiver don't use it. Well, my proposal was to expose it, on the theory that it's useful. As we stream the WAL, we write it, so I think for all intents and purposes write == stream. But using it to convey the starting position makes more sense if you call it stream than it does if you call it write. > You propose to rename LogstreamResult.Write to .Stream, and > merge it and receiveStart? Yeah, or probably change receiveStart to be called Stream. It's representing the same thing, just in shmem instead of backend-local, so why name it differently? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Feb 15, 2011 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Feb 15, 2011 at 12:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> You suggest that the shared variable Stream tracks the WAL write location, >> after it's set to the replication starting position? I don't think >> that the write >> location needs to be tracked in the shmem because other processes than >> walreceiver don't use it. > > Well, my proposal was to expose it, on the theory that it's useful. > As we stream the WAL, we write it, so I think for all intents and > purposes write == stream. But using it to convey the starting > position makes more sense if you call it stream than it does if you > call it write. Umm.. I could not find any use case to expose the WAL write location besides flush one. So I'm not sure if it's really useful to track the write location in the shmem besides the walreceiver-local memory. What use case do you think of? Personally the term "stream" sounds more ambiguous than "write". I cannot imagine what location the pg_last_xlog_stream_location or stream_location actually returns, from the function name; WAL location that has been received? written? flushed? replayed? Since the "write" sounds cleaner, I like it. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Feb 16, 2011 at 12:59 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Tue, Feb 15, 2011 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Feb 15, 2011 at 12:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> You suggest that the shared variable Stream tracks the WAL write location, >>> after it's set to the replication starting position? I don't think >>> that the write >>> location needs to be tracked in the shmem because other processes than >>> walreceiver don't use it. >> >> Well, my proposal was to expose it, on the theory that it's useful. >> As we stream the WAL, we write it, so I think for all intents and >> purposes write == stream. But using it to convey the starting >> position makes more sense if you call it stream than it does if you >> call it write. > > Umm.. I could not find any use case to expose the WAL write location > besides flush one. So I'm not sure if it's really useful to track the > write location in the shmem besides the walreceiver-local memory. > What use case do you think of? Well, we're currently exposing that on the master via pg_stat_replication. I guess we could rip that out, but I think that if nothing else we're imagining eventually supporting a sync rep mode where the standby acknowledges WAL upon receipt rather than upon write. And the lag between the write and flush positions can be up to 16MB, so it doesn't seem entirely academic. Basically, the write position is the most WAL that could be on disk on standby and the flush is the most WAL that we're SURE is on disk on the standby. > Personally the term "stream" sounds more ambiguous than "write". > I cannot imagine what location the pg_last_xlog_stream_location or > stream_location actually returns, from the function name; WAL > location that has been received? written? flushed? replayed? > Since the "write" sounds cleaner, I like it. Well, the problem with receivedUpto is that it's really being used for two different things, neither of which is how much WAL has been received. One is where streaming is to start (hence, stream) and the other is how much we've flushed to disk (hence, flush). So you might think there were four positions: streaming start, write, flush, apply. But I think the first two are really the same: once you've received at least one byte, the position that you're streaming from and the write position are the same, so I think the name stream can span both concepts. OTOH, stream-start and flush are clearly NOT the same - there is a small but potentially significant delay between stream/write and flush. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
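[Editorial illustration] As a toy illustration of the write/flush distinction described above — "write" advances as soon as bytes are handed to the kernel, "flush" only after an fsync — with invented names and no error handling (this is not walreceiver code):
----------------------
#include <stdint.h>
#include <unistd.h>

/* Toy receive step: write_pos can run ahead of flush_pos by up to the
 * amount of WAL written since the last fsync. */
void
receive_chunk(int fd, const char *buf, size_t len, int sync_now,
              uint64_t *write_pos, uint64_t *flush_pos)
{
    if (write(fd, buf, len) == (ssize_t) len)
        *write_pos += len;            /* WAL that might be on disk */

    if (sync_now && fsync(fd) == 0)
        *flush_pos = *write_pos;      /* WAL that is known to be on disk */
}
----------------------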
On Wed, Feb 16, 2011 at 7:06 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Feb 16, 2011 at 12:59 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Tue, Feb 15, 2011 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Tue, Feb 15, 2011 at 12:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> You suggest that the shared variable Stream tracks the WAL write location, >>>> after it's set to the replication starting position? I don't think >>>> that the write >>>> location needs to be tracked in the shmem because other processes than >>>> walreceiver don't use it. >>> >>> Well, my proposal was to expose it, on the theory that it's useful. >>> As we stream the WAL, we write it, so I think for all intents and >>> purposes write == stream. But using it to convey the starting >>> position makes more sense if you call it stream than it does if you >>> call it write. >> >> Umm.. I could not find any use case to expose the WAL write location >> besides flush one. So I'm not sure if it's really useful to track the >> write location in the shmem besides the walreceiver-local memory. >> What use case do you think of? > > Well, we're currently exposing that on the master via > pg_stat_replication. I guess we could rip that out, but I think that > if nothing else we're imagining eventually supporting a sync rep mode > where the standby acknowledges WAL upon receipt rather than upon > write. And the lag between the write and flush positions can be up to > 16MB, so it doesn't seem entirely academic. Basically, the write > position is the most WAL that could be on disk on standby and the > flush is the most WAL that we're SURE is on disk on the standby. > >> Personally the term "stream" sounds more ambiguous than "write". >> I cannot imagine what location the pg_last_xlog_stream_location or >> stream_location actually returns, from the function name; WAL >> location that has been received? written? flushed? replayed? >> Since the "write" sounds cleaner, I like it. > > Well, the problem with receivedUpto is that it's really being used for > two different things, neither of which is how much WAL has been > received. One is where streaming is to start (hence, stream) and the > other is how much we've flushed to disk (hence, flush). So you might > think there were four positions: streaming start, write, flush, apply. > But I think the first two are really the same: once you've received > at least one byte, the position that you're streaming from and the > write position are the same, so I think the name stream can span both > concepts. OTOH, stream-start and flush are clearly NOT the same - > there is a small but potentially significant delay between > stream/write and flush. It sounds like the only thing we have definite agreement about from all this is that apply_location should be renamed to replay_location in pg_stat_replication, a point fairly incidental to what this patch is about. It seems fairly unsatisfying to just change that and punt the rest of this, but I'm not sure what the alternative is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 26.02.2011 16:58, Robert Haas wrote: > It sounds like the only thing we have definite agreement about from > all this is that apply_location should be renamed to replay_location > in pg_stat_replication, a point fairly incidental to the what this > patch is about. It seems fairly unsatisfying to just change that and > punt the rest of this, but I'm not sure what the alternative is. After reading the discussions, I don't see any actual opposition to Fujii-san's patch. And I think it makes sense: the new definition fits the purpose Fujii mentioned in the mail that started this thread, determining which standby is most up-to-date. There have been a lot of good suggestions, like verifying the received WAL before overwriting existing WAL. But we're not going to start doing bigger code changes this late in the release cycle. And even if we did, this patch would still make sense - we still wouldn't want pg_last_xlog_receive_location() to move backwards. So, committed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com