Thread: page corruption on 8.3+ that makes it to standby
I reported a problem here: http://archives.postgresql.org/pgsql-bugs/2010-07/msg00173.php Perhaps I used a poor subject line, but I believe it's a serious issue. That reproducible sequence seems like an obvious bug to me on 8.3+, and what's worse, the corruption propagates to the standby as I found out today (through a test, fortunately). The only mitigating factor is that it doesn't actually lose data, and you can fix it (I believe) with zero_damaged_pages (or careful use of dd). There are two fixes that I can see: 1. Have log_newpage() and heap_xlog_newpage() only call PageSetLSN() and PageSetTLI() if the page is not new. This seems slightly awkward because most WAL replay stuff doesn't have to worry about zero pages, but in this case I think it does. 2. Have copy_relation_data() initialize new pages. I don't like this because (a) it's not really the job of SET TABLESPACE to clean up zero pages; and (b) it could be an index with different special size, etc., and it doesn't seem like a good place to figure that out. Comments? Regards,Jeff Davis
On Tue, Jul 27, 2010 at 2:06 PM, Jeff Davis <pgsql@j-davis.com> wrote: > I reported a problem here: > > http://archives.postgresql.org/pgsql-bugs/2010-07/msg00173.php > > Perhaps I used a poor subject line, but I believe it's a serious issue. > That reproducible sequence seems like an obvious bug to me on 8.3+, and > what's worse, the corruption propagates to the standby as I found out > today (through a test, fortunately). I think that the problem is not so much your choice of subject line as your misfortune to discover this bug when Tom and Heikki were both on vacation. > The only mitigating factor is that it doesn't actually lose data, and > you can fix it (I believe) with zero_damaged_pages (or careful use of > dd). > > There are two fixes that I can see: > > 1. Have log_newpage() and heap_xlog_newpage() only call PageSetLSN() and > PageSetTLI() if the page is not new. This seems slightly awkward because > most WAL replay stuff doesn't have to worry about zero pages, but in > this case I think it does. > > 2. Have copy_relation_data() initialize new pages. I don't like this > because (a) it's not really the job of SET TABLESPACE to clean up zero > pages; and (b) it could be an index with different special size, etc., > and it doesn't seem like a good place to figure that out. It appears to me that all of the callers of log_newpage() other than copy_relation_data() do so with pages that they've just constructed, and which therefore can't be new. So maybe we could just modify copy_relation_data to check PageIsNew(buf), or something like that, and only call log_newpage() if that returns true. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Tue, 2010-07-27 at 15:50 -0400, Robert Haas wrote: > > 1. Have log_newpage() and heap_xlog_newpage() only call PageSetLSN() and > > PageSetTLI() if the page is not new. This seems slightly awkward because > > most WAL replay stuff doesn't have to worry about zero pages, but in > > this case I think it does. > > > > 2. Have copy_relation_data() initialize new pages. I don't like this > > because (a) it's not really the job of SET TABLESPACE to clean up zero > > pages; and (b) it could be an index with different special size, etc., > > and it doesn't seem like a good place to figure that out. > > It appears to me that all of the callers of log_newpage() other than > copy_relation_data() do so with pages that they've just constructed, > and which therefore can't be new. So maybe we could just modify > copy_relation_data to check PageIsNew(buf), or something like that, > and only call log_newpage() if that returns true. My first concern with that idea was that it may create an inconsistency between the primary and the standby. The primary could have a bunch of zero pages that never make it to the standby. However, it looks like all WAL recovery stuff passes true for "init" when calling XLogReadBuffer(), so I think it's safe. I'll test it and submit a patch. Regards,Jeff Davis
On Tue, Jul 27, 2010 at 5:08 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Tue, 2010-07-27 at 15:50 -0400, Robert Haas wrote: >> > 1. Have log_newpage() and heap_xlog_newpage() only call PageSetLSN() and >> > PageSetTLI() if the page is not new. This seems slightly awkward because >> > most WAL replay stuff doesn't have to worry about zero pages, but in >> > this case I think it does. >> > >> > 2. Have copy_relation_data() initialize new pages. I don't like this >> > because (a) it's not really the job of SET TABLESPACE to clean up zero >> > pages; and (b) it could be an index with different special size, etc., >> > and it doesn't seem like a good place to figure that out. >> >> It appears to me that all of the callers of log_newpage() other than >> copy_relation_data() do so with pages that they've just constructed, >> and which therefore can't be new. So maybe we could just modify >> copy_relation_data to check PageIsNew(buf), or something like that, >> and only call log_newpage() if that returns true. > > My first concern with that idea was that it may create an inconsistency > between the primary and the standby. The primary could have a bunch of > zero pages that never make it to the standby. Maybe I'm slow on the uptake here, but don't the pages start out all-zeroes on the standby just as they do on the primary? The only way it seems like this would be a problem is if a page that previously contained data on the primary was subsequently zeroed without writing a WAL record - or am I confused? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Tue, 2010-07-27 at 17:18 -0400, Robert Haas wrote: > > My first concern with that idea was that it may create an inconsistency > > between the primary and the standby. The primary could have a bunch of > > zero pages that never make it to the standby. > > Maybe I'm slow on the uptake here, but don't the pages start out > all-zeroes on the standby just as they do on the primary? The only way > it seems like this would be a problem is if a page that previously > contained data on the primary was subsequently zeroed without writing > a WAL record - or am I confused? The case I was concerned about is when you have a table on the primary with a bunch of zero pages at the end. Then you SET TABLESPACE, and none of the copied pages (or even the fact that they exist) would be sent to the standby, but they would exist on the primary. And later pages may have data, so the standby may see page N but not N-1. Generally, most of the code is not expecting to read or write past the end of the file, unless it's doing an extension. However, I think everything is fine during recovery, because it looks like it's designed to create zero pages as needed. So your idea seems safe to me, although I do still have some doubts because of my lack of knowledge in this area; particularly hot standby conflict detection/resolution. My idea was different: still log the zero page, just don't set LSN or TLI for a zero page in log_newpage() or heap_xlog_newpage(). This isn't as clean as your idea, but I'm a little more confident that it is correct. Regards,Jeff Davis
On Tue, 2010-07-27 at 15:23 -0700, Jeff Davis wrote: > On Tue, 2010-07-27 at 17:18 -0400, Robert Haas wrote: > > > My first concern with that idea was that it may create an inconsistency > > > between the primary and the standby. The primary could have a bunch of > > > zero pages that never make it to the standby. > > > > Maybe I'm slow on the uptake here, but don't the pages start out > > all-zeroes on the standby just as they do on the primary? The only way > > it seems like this would be a problem is if a page that previously > > contained data on the primary was subsequently zeroed without writing > > a WAL record - or am I confused? > > The case I was concerned about is when you have a table on the primary > with a bunch of zero pages at the end. Then you SET TABLESPACE, and none > of the copied pages (or even the fact that they exist) would be sent to > the standby, but they would exist on the primary. And later pages may > have data, so the standby may see page N but not N-1. > > Generally, most of the code is not expecting to read or write past the > end of the file, unless it's doing an extension. > > However, I think everything is fine during recovery, because it looks > like it's designed to create zero pages as needed. So your idea seems > safe to me, although I do still have some doubts because of my lack of > knowledge in this area; particularly hot standby conflict > detection/resolution. > > My idea was different: still log the zero page, just don't set LSN or > TLI for a zero page in log_newpage() or heap_xlog_newpage(). This isn't > as clean as your idea, but I'm a little more confident that it is > correct. > Both potential fixes attached and both appear to work. fix1 -- Only call PageSetLSN/TLI inside log_newpage() and heap_xlog_newpage() if the page is not zeroed. fix2 -- Don't call log_newpage() at all if the page is not zeroed. Please review. I don't have a strong opinion about which one should be applied. Regards, Jeff Davis
Attachment
On Wed, Jul 28, 2010 at 12:23 AM, Jeff Davis <pgsql@j-davis.com> wrote: > On Tue, 2010-07-27 at 15:23 -0700, Jeff Davis wrote: >> On Tue, 2010-07-27 at 17:18 -0400, Robert Haas wrote: >> > > My first concern with that idea was that it may create an inconsistency >> > > between the primary and the standby. The primary could have a bunch of >> > > zero pages that never make it to the standby. >> > >> > Maybe I'm slow on the uptake here, but don't the pages start out >> > all-zeroes on the standby just as they do on the primary? The only way >> > it seems like this would be a problem is if a page that previously >> > contained data on the primary was subsequently zeroed without writing >> > a WAL record - or am I confused? >> >> The case I was concerned about is when you have a table on the primary >> with a bunch of zero pages at the end. Then you SET TABLESPACE, and none >> of the copied pages (or even the fact that they exist) would be sent to >> the standby, but they would exist on the primary. And later pages may >> have data, so the standby may see page N but not N-1. >> >> Generally, most of the code is not expecting to read or write past the >> end of the file, unless it's doing an extension. >> >> However, I think everything is fine during recovery, because it looks >> like it's designed to create zero pages as needed. So your idea seems >> safe to me, although I do still have some doubts because of my lack of >> knowledge in this area; particularly hot standby conflict >> detection/resolution. >> >> My idea was different: still log the zero page, just don't set LSN or >> TLI for a zero page in log_newpage() or heap_xlog_newpage(). This isn't >> as clean as your idea, but I'm a little more confident that it is >> correct. >> > > Both potential fixes attached and both appear to work. > > fix1 -- Only call PageSetLSN/TLI inside log_newpage() and > heap_xlog_newpage() if the page is not zeroed. > > fix2 -- Don't call log_newpage() at all if the page is not zeroed. > > Please review. I don't have a strong opinion about which one should be > applied. Looks good. I still prefer fix2; it seems cleaner to me. It has another advantage, too, which is that copy_relation_data() is used ONLY by ALTER TABLE SET TABLESPACE. So if I stick to patching that function, that's the only thing I can possibly break, whereas log_newpage() is used in a bunch of other places. I don't think either fix is going to break anything at all, but considering that this is going to need back-patching, I'd rather be conservative. Speaking of back-patching, the subject line describes this as an 8.3+ problem, but it looks to me like this goes all the way back to 8.0. The code is a little different in 8.2 and prior, but ISTM it's vulnerable to the same issue. Don't we need a modified version of this patch for the 8.0 - 8.2 branches? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Tue, 2010-07-27 at 21:23 -0700, Jeff Davis wrote: > Both potential fixes attached and both appear to work. > > fix1 -- Only call PageSetLSN/TLI inside log_newpage() and > heap_xlog_newpage() if the page is not zeroed. > > fix2 -- Don't call log_newpage() at all if the page is not zeroed. > > Please review. I don't have a strong opinion about which one should be > applied. ISTM we should just fix an uninitialized page first, using code from VACUUM similar to if (PageIsNew(page)) { ereport(WARNING,(errmsg("relation \"%s\" page %u is uninitialized --- fixing", relname, blkno))); PageInit(page, BufferGetPageSize(buf), 0); } then continue as before. We definitely shouldn't do anything that leaves standby different to primary. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Wed, Jul 28, 2010 at 7:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Tue, 2010-07-27 at 21:23 -0700, Jeff Davis wrote: > >> Both potential fixes attached and both appear to work. >> >> fix1 -- Only call PageSetLSN/TLI inside log_newpage() and >> heap_xlog_newpage() if the page is not zeroed. >> >> fix2 -- Don't call log_newpage() at all if the page is not zeroed. >> >> Please review. I don't have a strong opinion about which one should be >> applied. > > ISTM we should just fix an uninitialized page first, using code from > VACUUM similar to > > if (PageIsNew(page)) > { > ereport(WARNING, > (errmsg("relation \"%s\" page %u is uninitialized --- fixing", > relname, blkno))); > PageInit(page, BufferGetPageSize(buf), 0); > } > > then continue as before. As Jeff Davis pointed out upthread, you don't know that 0 is the correct special size for the relation being copied. In the VACUUM path, that code is only applied to heaps, but that's not necessarily the case here. > We definitely shouldn't do anything that leaves standby different to > primary. Obviously. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, 2010-07-28 at 06:40 -0400, Robert Haas wrote: > > fix1 -- Only call PageSetLSN/TLI inside log_newpage() and > > heap_xlog_newpage() if the page is not zeroed. > > > > fix2 -- Don't call log_newpage() at all if the page is not zeroed. > > > > Please review. I don't have a strong opinion about which one should be > > applied. > > Looks good. I still prefer fix2; it seems cleaner to me. It has > another advantage, too, which is that copy_relation_data() is used > ONLY by ALTER TABLE SET TABLESPACE. So if I stick to patching that > function, that's the only thing I can possibly break, whereas > log_newpage() is used in a bunch of other places. I don't think > either fix is going to break anything at all, but considering that > this is going to need back-patching, I'd rather be conservative. > Sounds good to me. However, when Simon said "We definitely shouldn't do anything that leaves standby different to primary." you said "obviously". Fix2 can leave a difference between the two, because zeroed pages at the end of the heap file on the primary will not be sent to the standby (the standby will only create the zeroed pages if a higher block number is sent; which won't be the case if the zeroed pages are at the end). As we discussed before, that looks inconsequential, but I just want to make sure that it's understood. > Speaking of back-patching, the subject line describes this as an 8.3+ > problem, but it looks to me like this goes all the way back to 8.0. > The code is a little different in 8.2 and prior, but ISTM it's > vulnerable to the same issue. Don't we need a modified version of > this patch for the 8.0 - 8.2 branches? That sounds right. I just saw that the code in question was introduced in 8.3. Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > However, when Simon said "We definitely shouldn't do anything that > leaves standby different to primary." you said "obviously". Fix2 can > leave a difference between the two, because zeroed pages at the end of > the heap file on the primary will not be sent to the standby (the > standby will only create the zeroed pages if a higher block number is > sent; which won't be the case if the zeroed pages are at the end). > As we discussed before, that looks inconsequential, but I just want to > make sure that it's understood. I understand it, and I don't like it one bit. I haven't caught up on this thread yet, but I think the only acceptable solution is one that leaves the slave in the *same* state as the master. Not a state that we hope will behave equivalently. I can think of too many corner cases where it might not. (In fact, having a zeroed page in a relation is already a corner case in itself, so the amount of testing you'd get for such behaviors is epsilon squared. You don't want to take that bet.) regards, tom lane
On Wed, 2010-07-28 at 12:36 -0400, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > However, when Simon said "We definitely shouldn't do anything that > > leaves standby different to primary." you said "obviously". Fix2 can > > leave a difference between the two, because zeroed pages at the end of > > the heap file on the primary will not be sent to the standby (the > > standby will only create the zeroed pages if a higher block number is > > sent; which won't be the case if the zeroed pages are at the end). > > > As we discussed before, that looks inconsequential, but I just want to > > make sure that it's understood. > > I understand it, and I don't like it one bit. I haven't caught up on > this thread yet, but I think the only acceptable solution is one that > leaves the slave in the *same* state as the master. Not a state that > we hope will behave equivalently. I can think of too many corner cases > where it might not. (In fact, having a zeroed page in a relation is > already a corner case in itself, so the amount of testing you'd get for > such behaviors is epsilon squared. You don't want to take that bet.) > Ok, sounds like my original fix (fix1) is the way to go then. Log zero pages, but don't set LSN/TLI if it's a zero page (in log_newpage or heap_xlog_newpage). Regards,Jeff Davis
On Wed, Jul 28, 2010 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jeff Davis <pgsql@j-davis.com> writes: >> However, when Simon said "We definitely shouldn't do anything that >> leaves standby different to primary." you said "obviously". Fix2 can >> leave a difference between the two, because zeroed pages at the end of >> the heap file on the primary will not be sent to the standby (the >> standby will only create the zeroed pages if a higher block number is >> sent; which won't be the case if the zeroed pages are at the end). > >> As we discussed before, that looks inconsequential, but I just want to >> make sure that it's understood. > > I understand it, and I don't like it one bit. I haven't caught up on > this thread yet, but I think the only acceptable solution is one that > leaves the slave in the *same* state as the master. Not a state that > we hope will behave equivalently. I can think of too many corner cases > where it might not. (In fact, having a zeroed page in a relation is > already a corner case in itself, so the amount of testing you'd get for > such behaviors is epsilon squared. You don't want to take that bet.) I might be missing something here, but I don't see how you're going to manage that. In Jeff's original example, he crashes the database after extending the relation but before initializing and writing the new page. I believe that at that point no XLOG has been written yet, so the relation has been extended but there is no WAL to be sent to the standby. So now you have the exact situation you're concerned about - the relation has been extended on the master but not on the standby. As far as I can see, this is an unavoidable consequence of the fact that we don't XLOG the act of extending the relation. Worrying about it only in the specific context of ALTER TABLE .. SET TABLESPACE seems backwards; if there are any bugs there, we're in for it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, 2010-07-28 at 13:18 -0400, Robert Haas wrote: > In Jeff's original example, he crashes the database > after extending the relation but before initializing and writing the > new page. I believe that at that point no XLOG has been written yet, > so the relation has been extended but there is no WAL to be sent to > the standby. So now you have the exact situation you're concerned > about - the relation has been extended on the master but not on the > standby. As far as I can see, this is an unavoidable consequence of > the fact that we don't XLOG the act of extending the relation. > Worrying about it only in the specific context of ALTER TABLE .. SET > TABLESPACE seems backwards; if there are any bugs there, we're in for > it. That's a very good point. Now I'm leaning more toward your fix. Regards,Jeff Davis
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Jul 28, 2010 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I understand it, and I don't like it one bit. �I haven't caught up on >> this thread yet, but I think the only acceptable solution is one that >> leaves the slave in the *same* state as the master. > I might be missing something here, but I don't see how you're going to > manage that. In Jeff's original example, he crashes the database > after extending the relation but before initializing and writing the > new page. I believe that at that point no XLOG has been written yet, > so the relation has been extended but there is no WAL to be sent to > the standby. So now you have the exact situation you're concerned > about - the relation has been extended on the master but not on the > standby. You're right that we cannot prevent that situation --- or at least, the cure would be worse than the disease. (The cure would be to XLOG the extension action, obviously, but then out-of-disk-space has to be a PANIC condition.) However, it doesn't follow that it's a good idea to make copy_relation_data *intentionally* make the slave and master different. I've caught up on the thread now, and I think that fix2 (skip logging the page) is extremely dangerous and has little if anything in its favor. fix1 seems reasonable given the structure of the page validity checks. However, what about Jeff's original comment : On second thought, why are PageSetLSN and PageSetTLI being called from : log_newpage(), anyway? I think it is appropriate to be setting the LSN/TLI in the case of a page that's been constructed by the caller as part of the WAL-logged action, but doing so in copy_relation_data seems rather questionable. We certainly didn't change the source page so changing its LSN seems rather wrong --- wouldn't it be better to just copy the source pages with their original LSNs? So perhaps the best fix is to add a bool parameter to log_newpage telling it whether to update LSN/TLI, and have copy_relation_data pass false while the other callers pass true. (Although I guess we'd need to propagate that flag in the WAL record, so maybe this is more trouble than its worth.) regards, tom lane
On Wed, Jul 28, 2010 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Jul 28, 2010 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I understand it, and I don't like it one bit. I haven't caught up on >>> this thread yet, but I think the only acceptable solution is one that >>> leaves the slave in the *same* state as the master. > >> I might be missing something here, but I don't see how you're going to >> manage that. In Jeff's original example, he crashes the database >> after extending the relation but before initializing and writing the >> new page. I believe that at that point no XLOG has been written yet, >> so the relation has been extended but there is no WAL to be sent to >> the standby. So now you have the exact situation you're concerned >> about - the relation has been extended on the master but not on the >> standby. > > You're right that we cannot prevent that situation --- or at least, > the cure would be worse than the disease. (The cure would be to > XLOG the extension action, obviously, but then out-of-disk-space > has to be a PANIC condition.) Not to mention that performance would probably be atrocious. > However, it doesn't follow that it's > a good idea to make copy_relation_data *intentionally* make the slave > and master different. > > I've caught up on the thread now, and I think that fix2 (skip logging > the page) is extremely dangerous and has little if anything in its > favor. Why do you think that? They will be different only in terms of whether the uninitialized bytes are before or after the nominal EOF, and we know we have to be indifferent to that case anyway. > fix1 seems reasonable given the structure of the page validity > checks. > > However, what about Jeff's original comment > > : On second thought, why are PageSetLSN and PageSetTLI being called from > : log_newpage(), anyway? > > I think it is appropriate to be setting the LSN/TLI in the case of a > page that's been constructed by the caller as part of the WAL-logged > action, but doing so in copy_relation_data seems rather questionable. > We certainly didn't change the source page so changing its LSN seems > rather wrong --- wouldn't it be better to just copy the source pages > with their original LSNs? So perhaps the best fix is to add a bool > parameter to log_newpage telling it whether to update LSN/TLI, and > have copy_relation_data pass false while the other callers pass true. > (Although I guess we'd need to propagate that flag in the WAL record, > so maybe this is more trouble than its worth.) It seems like if log_newpage() were to set the LSN/TLI before calling XLogInsert() - or optionally not - then it wouldn't be necessary to set them also in heap_xlog_newpage(); the memcpy operation would by definition have copied the right information onto the page. That seems like it would be a cleaner design, but back-patching a change to the interpretation of WAL records that might already be on someone's disk seems dicey at best. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, 2010-07-28 at 14:50 -0400, Robert Haas wrote: > It seems like if log_newpage() were to set the LSN/TLI before calling > XLogInsert() - or optionally not - then it wouldn't be necessary to > set them also in heap_xlog_newpage(); the memcpy operation would by > definition have copied the right information onto the page. That > seems like it would be a cleaner design, but back-patching a change to > the interpretation of WAL records that might already be on someone's > disk seems dicey at best. How do you set the LSN before XLogInsert()? Regards,Jeff Davis
On Wed, Jul 28, 2010 at 3:08 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2010-07-28 at 14:50 -0400, Robert Haas wrote: >> It seems like if log_newpage() were to set the LSN/TLI before calling >> XLogInsert() - or optionally not - then it wouldn't be necessary to >> set them also in heap_xlog_newpage(); the memcpy operation would by >> definition have copied the right information onto the page. That >> seems like it would be a cleaner design, but back-patching a change to >> the interpretation of WAL records that might already be on someone's >> disk seems dicey at best. > > How do you set the LSN before XLogInsert()? Details, details... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Jul 28, 2010 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I've caught up on the thread now, and I think that fix2 (skip logging >> the page) is extremely dangerous and has little if anything in its >> favor. > Why do you think that? They will be different only in terms of > whether the uninitialized bytes are before or after the nominal EOF, > and we know we have to be indifferent to that case anyway. (1) You're assuming that the page will be zeroes on the slave without having forced it to be so. A really obvious case where this fails is where we're doing crash-and-restart on the master: a later action could have modified the page away from the all-zero state. (In principle that's OK but I think this might break torn-page protection.) (2) On filesystems that support holes, the page will not have storage, whereas it (probably) does on the master. This could lead to a divergence in behavior later, ie slave runs out of disk space at a different point than the master. (3) The position of the nominal EOF can drive choices about which page to put new tuples in, specifically thats where RelationGetBufferForTuple will go if FSM has no information. This could result in unexpected divergence in behavior after the slave goes live compared to what the master would have done. Maybe that's OK but it seems better to avoid it if we can, especially when you think about crash-and-restart on the master as opposed to a separate slave. Now as I said earlier, these are all tiny corners of a corner case, and they *probably* shouldn't matter. But I see no good reason to expose ourselves to the possibility that there's some cases where they do matter. Especially when your argument for fix2 is a purely aesthetic judgment that I don't agree with anyway. >> I think it is appropriate to be setting the LSN/TLI in the case of a >> page that's been constructed by the caller as part of the WAL-logged >> action, but doing so in copy_relation_data seems rather questionable. >> We certainly didn't change the source page so changing its LSN seems >> rather wrong --- wouldn't it be better to just copy the source pages >> with their original LSNs? > It seems like if log_newpage() were to set the LSN/TLI before calling > XLogInsert() - or optionally not - then it wouldn't be necessary to > set them also in heap_xlog_newpage(); the memcpy operation would by > definition have copied the right information onto the page. Not possible because it is only after you've done XLogInsert that you know what LSN was assigned to the WAL record. regards, tom lane
I wrote: >>> I think it is appropriate to be setting the LSN/TLI in the case of a >>> page that's been constructed by the caller as part of the WAL-logged >>> action, but doing so in copy_relation_data seems rather questionable. BTW, I thought of an argument that explains why that's sane: it marks the copied page as having been recently WAL-logged. If we do some action on the copied relation shortly after completing the copy_relation_data transaction, we will see that its LSN is later than the last checkpoint and know that we don't need to emit a full-page WAL image for it, which is correct because in case of crash+restart the HEAP_NEWPAGE record will provide the full-page image. If we left the source relation's page's LSN in there, we would frequently make the wrong decision and emit an unnecessary extra full-page image. So nevermind that distraction. I'm back to thinking that fix1 is the way to go. regards, tom lane
On Wed, Jul 28, 2010 at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > (1) You're assuming that the page will be zeroes on the slave without > having forced it to be so. A really obvious case where this fails > is where we're doing crash-and-restart on the master: a later action > could have modified the page away from the all-zero state. (In > principle that's OK but I think this might break torn-page protection.) Hmm, yeah, that does seem like it has the potential to be bad. I think this is sufficient reason to go with fix #1. > (2) On filesystems that support holes, the page will not have storage, > whereas it (probably) does on the master. This could lead to a > divergence in behavior later, ie slave runs out of disk space at a > different point than the master. I can't get excited about this one. > (3) The position of the nominal EOF can drive choices about which page > to put new tuples in, specifically thats where RelationGetBufferForTuple > will go if FSM has no information. This could result in unexpected > divergence in behavior after the slave goes live compared to what the > master would have done. Maybe that's OK but it seems better to avoid > it if we can, especially when you think about crash-and-restart on the > master as opposed to a separate slave. You're still going to have that in the "normal" (not altering the tablespace) extension case, which is presumably far more common. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Wed, 2010-07-28 at 15:37 -0400, Tom Lane wrote: > So nevermind that distraction. I'm back to thinking that fix1 is > the way to go. Agreed. It's uncontroversial to have a simple guard against corrupting an uninitialized page, and uncontroversial is good for things that will be back-patched. Regards,Jeff Davis
On Wed, 2010-07-28 at 14:22 -0700, Jeff Davis wrote: > On Wed, 2010-07-28 at 15:37 -0400, Tom Lane wrote: > > So nevermind that distraction. I'm back to thinking that fix1 is > > the way to go. > > Agreed. > > It's uncontroversial to have a simple guard against corrupting an > uninitialized page, and uncontroversial is good for things that will be > back-patched. Still don't understand why we would not initialize such pages. If we're copying a relation we must know enough about it to init a page. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Thu, Jul 29, 2010 at 4:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2010-07-28 at 14:22 -0700, Jeff Davis wrote: >> On Wed, 2010-07-28 at 15:37 -0400, Tom Lane wrote: >> > So nevermind that distraction. I'm back to thinking that fix1 is >> > the way to go. >> >> Agreed. >> >> It's uncontroversial to have a simple guard against corrupting an >> uninitialized page, and uncontroversial is good for things that will be >> back-patched. > > Still don't understand why we would not initialize such pages. If we're > copying a relation we must know enough about it to init a page. Well, I don't see why we'd want to do that. As Jeff Davis pointed out, if someone asks to move a table to a different tablespace, changing the contents as we go along seems a bit off-topic. But the bigger problem is you haven't explained how you think we could determine what initialization ought to be performed. There's no index-AM API that says "initialize this page". I suppose we could invent one if there were some benefit, but we couldn't very well back-patch such a thing to 8.0. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jul 29, 2010 at 4:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Still don't understand why we would not initialize such pages. If we're >> copying a relation we must know enough about it to init a page. > Well, I don't see why we'd want to do that. As Jeff Davis pointed > out, if someone asks to move a table to a different tablespace, > changing the contents as we go along seems a bit off-topic. But the > bigger problem is you haven't explained how you think we could > determine what initialization ought to be performed. There's no > index-AM API that says "initialize this page". I suppose we could > invent one if there were some benefit, but we couldn't very well > back-patch such a thing to 8.0. Yeah. And you really would have to get the AM involved. Even if you were willing to assume that you knew the special-space size for a particular index type, it would not fly to assume that the special space doesn't require initialization to some nonzero content. regards, tom lane
On Wed, Jul 28, 2010 at 5:22 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2010-07-28 at 15:37 -0400, Tom Lane wrote: >> So nevermind that distraction. I'm back to thinking that fix1 is >> the way to go. > > Agreed. > > It's uncontroversial to have a simple guard against corrupting an > uninitialized page, and uncontroversial is good for things that will be > back-patched. Here's a version of Jeff's fix1 patch (with a trivial change to the comment) that applies to HEAD, REL9_0_STABLE, REL8_4_STABLE, and REL8_3_STABLE; a slightly modified version that applies to REL8_2_STABLE; and another slightly modified version that applies to REL8_1_STABLE and REL8_0_STABLE. REL7_4_STABLE doesn't have tablespaces, so the problem can't manifest there, I think. I'm currently compiling and testing all of these. When that's done, should I go ahead and check this in, or wait until after beta4 wraps? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes: > Here's a version of Jeff's fix1 patch (with a trivial change to the > comment) that applies to HEAD, REL9_0_STABLE, REL8_4_STABLE, and > REL8_3_STABLE; a slightly modified version that applies to > REL8_2_STABLE; and another slightly modified version that applies to > REL8_1_STABLE and REL8_0_STABLE. REL7_4_STABLE doesn't have > tablespaces, so the problem can't manifest there, I think. Looks sane to the eyeball. I'm not sure if the oldest versions have the same page-read-time header sanity checks that we have now, so it may be that there is not a need for this patch all the way back; but it can't hurt anything. > I'm currently compiling and testing all of these. When that's done, > should I go ahead and check this in, or wait until after beta4 wraps? Go ahead and commit, please. regards, tom lane
On Thu, Jul 29, 2010 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Here's a version of Jeff's fix1 patch (with a trivial change to the >> comment) that applies to HEAD, REL9_0_STABLE, REL8_4_STABLE, and >> REL8_3_STABLE; a slightly modified version that applies to >> REL8_2_STABLE; and another slightly modified version that applies to >> REL8_1_STABLE and REL8_0_STABLE. REL7_4_STABLE doesn't have >> tablespaces, so the problem can't manifest there, I think. > > Looks sane to the eyeball. I'm not sure if the oldest versions have the > same page-read-time header sanity checks that we have now, so it may be > that there is not a need for this patch all the way back; but it can't > hurt anything. It looks like they do. I am able to reproduce the problem even on 8.0, and the patch does fix it. >> I'm currently compiling and testing all of these. When that's done, >> should I go ahead and check this in, or wait until after beta4 wraps? > > Go ahead and commit, please. Done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company