Thread: All-zero page in GIN index causes assertion failure
This is a continuation of the discussion at http://www.postgresql.org/message-id/CAMkU=1zUc=h0oCZntaJaqqW7gxxVxCWsYq8DD2t7oHgsgVEsgA@mail.gmail.com, I'm starting a new thread as this is a separate issue than the original LWLock bug. > On Thu, Jul 16, 2015 at 12:03 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > >> On Wed, Jul 15, 2015 at 8:44 AM, Heikki Linnakangas <hlinnaka@iki.fi> >> wrote: >> >> I don't see how this is related to the LWLock issue, but I didn't see it >> without your patch. Perhaps the system just didn't survive long enough to >> uncover it without the patch (although it shows up pretty quickly). It >> could just be an overzealous Assert, since the casserts off didn't show >> problems. > >> bt and bt full are shown below. >> >> Cheers, >> >> Jeff >> >> #0 0x0000003dcb632625 in raise () from /lib64/libc.so.6 >> #1 0x0000003dcb633e05 in abort () from /lib64/libc.so.6 >> #2 0x0000000000930b7a in ExceptionalCondition ( >> conditionName=0x9a1440 "!(((PageHeader) (page))->pd_special >= >> (__builtin_offsetof (PageHeaderData, pd_linp)))", errorType=0x9a12bc >> "FailedAssertion", >> fileName=0x9a12b0 "ginvacuum.c", lineNumber=713) at assert.c:54 >> #3 0x00000000004947cf in ginvacuumcleanup (fcinfo=0x7fffee073a90) at >> ginvacuum.c:713 >> > > It now looks like this *is* unrelated to the LWLock issue. The assert that > it is tripping over was added just recently (302ac7f27197855afa8c) and so I > had not been testing under its presence until now. It looks like it is > finding all-zero pages (index extended but then a crash before initializing > the page?) and it doesn't like them. > > (gdb) f 3 > (gdb) p *(char[8192]*)(page) > $11 = '\000' <repeats 8191 times> > > Presumably before this assert, such pages would just be permanently > orphaned. Yeah, so it seems. It's normal to have all-zero pages in the index, if you crash immediately after the relation has been extended, but before the new page has been WAL-logged. What is your test case like; did you do crash-testing? ISTM ginvacuumcleanup should check for PageIsNew, and put the page to the FSM. That's what btvacuumpage() gistvacuumcleanup() do. spgvacuumpage() seems to also check for PageIsNew(), but it seems broken in a different way: it initializes the page and marks the page as dirty, but it is not WAL-logged. That is a problem at least if checksums are enabled: if you crash you might have a torn page on disk, with invalid checksum. - Heikki
On 07/20/2015 11:14 AM, Heikki Linnakangas wrote: > ISTM ginvacuumcleanup should check for PageIsNew, and put the page to > the FSM. That's what btvacuumpage() gistvacuumcleanup() do. > spgvacuumpage() seems to also check for PageIsNew(), but it seems broken > in a different way: it initializes the page and marks the page as dirty, > but it is not WAL-logged. That is a problem at least if checksums are > enabled: if you crash you might have a torn page on disk, with invalid > checksum. Looking closer, heap vacuum does a similar thing, but there are two mitigating factors that make it safe in practice, I think: 1. An empty heap page is all-zeroes except for the small page header in the beginning of the page. For a torn page to matter, the page would need to be torn in the header, but we rely elsewhere (control file) anyway that a 512-byte sector update is atomic, so that shouldn't happen. Note that this hinges on the fact that there is no special area on heap pages, so you cannot rely on this for index pages. 2. The redo of the first insert/update on a heap page will always re-initialize the page, even when full-page-writes are disabled. This is the XLOG_HEAP_INIT_PAGE optimization. So it's not just an optimization, it's required for correctness. Heap update can also leave behind a page in the buffer cache that's been initialized by RelationGetBufferForTuple but not yet WAL-logged. However, it doesn't mark the buffer dirty, so the torn-page problem cannot happen because the page will not be flushed to disk if nothing else touches it. The XLOG_HEAP_INIT_PAGE optimization is needed in that case too, however. B-tree, GiST, and SP-GiST's relation extension work similarly, but they have other mitigating factors. If a newly-initialized B-tree page is left behind in the relation, it won't be reused for anything, and vacuum will ignore it (by accident, I think; there is no explicit comment on what will happen to such pages, but it will be treated like an internal page and ignored). Eventually the buffer will be evicted from cache, and because it's not marked as dirty, it will not be flushed to disk, and will later be read back as all-zeros and vacuum will recycle it. BRIN update is not quite right, however. brin_getinsertbuffer() can initialize a page, but the caller might bail out without using the page and WAL-logging the change. If that happens, the next update that uses the same page will WAL-log the change but it will not use the XLOG_BRIN_INIT_PAGE option, and redo will not initialize the page. Redo will fail. BTW, shouldn't there be a step in BRIN vacuum that scans all the BRIN pages? If an empty page is missing from the FSM for any reason, there's nothing to add it there. This is all very subtle. The whole business of leaving behind an already-initialized page in the buffer cache, without marking the buffer as dirty, is pretty ugly. I wish we had a more robust pattern to handle all-zero pages and relation extension. Thoughts? As a minimal backpatchable fix, I think we should add the check in ginvacuumpage() to initialize any all-zeros pages it encounters. That needs to be WAL-logged, and WAL-logging needs to be added to the page initialization in spgvacuumpage too. - Heikki
Heikki Linnakangas wrote: > Looking closer, heap vacuum does a similar thing, but there are two > mitigating factors that make it safe in practice, I think: > > 1. An empty heap page is all-zeroes except for the small page header in the > beginning of the page. For a torn page to matter, the page would need to be > torn in the header, but we rely elsewhere (control file) anyway that a > 512-byte sector update is atomic, so that shouldn't happen. Note that this > hinges on the fact that there is no special area on heap pages, so you > cannot rely on this for index pages. > > 2. The redo of the first insert/update on a heap page will always > re-initialize the page, even when full-page-writes are disabled. This is the > XLOG_HEAP_INIT_PAGE optimization. So it's not just an optimization, it's > required for correctness. Hmm, I guess this is a case of an optimization hiding a bug :-( I mean, if we didn't have that, we would have noticed this a lot sooner, I imagine. > BRIN update is not quite right, however. brin_getinsertbuffer() can > initialize a page, but the caller might bail out without using the page and > WAL-logging the change. If that happens, the next update that uses the same > page will WAL-log the change but it will not use the XLOG_BRIN_INIT_PAGE > option, and redo will not initialize the page. Redo will fail. Hmm, interesting. > BTW, shouldn't there be a step in BRIN vacuum that scans all the BRIN pages? > If an empty page is missing from the FSM for any reason, there's nothing to > add it there. Probably, yeah, makes sense. If you recall, the original design I proposed had a scan of the whole index (though for other reasons I am thankful we got rid of that). > This is all very subtle. No kidding ... -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 07/20/2015 07:11 PM, Alvaro Herrera wrote: > Heikki Linnakangas wrote: > >> Looking closer, heap vacuum does a similar thing, but there are two >> mitigating factors that make it safe in practice, I think: >> >> 1. An empty heap page is all-zeroes except for the small page header in the >> beginning of the page. For a torn page to matter, the page would need to be >> torn in the header, but we rely elsewhere (control file) anyway that a >> 512-byte sector update is atomic, so that shouldn't happen. Note that this >> hinges on the fact that there is no special area on heap pages, so you >> cannot rely on this for index pages. >> >> 2. The redo of the first insert/update on a heap page will always >> re-initialize the page, even when full-page-writes are disabled. This is the >> XLOG_HEAP_INIT_PAGE optimization. So it's not just an optimization, it's >> required for correctness. > > Hmm, I guess this is a case of an optimization hiding a bug :-( I mean, > if we didn't have that, we would have noticed this a lot sooner, I > imagine. Yeah, probably. I came up with the attached, for GIN and SP-GiST. Instead of WAL-logging the page initialization in SP-GiST vacuum, I changed it so that it simply leaves the page as all-zeroes, and adds it to the FSM. The code that grabs a target page from the FSM handles that, and initializes the page anyway, so that was simpler. This made the SP-GiST is-deleted flag obsolete, it's no longer set, although the code still checks for it for backwards-compatibility. (even that may actually be unnecessary, as a page that's marked as deleted must also be empty, and wherever we check for the deleted-flag, we also check for PageIsEmpty())) - Heikki
Attachment
On Mon, Jul 20, 2015 at 1:23 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I came up with the attached, for GIN and SP-GiST. Instead of WAL-logging the page initialization in SP-GiST vacuum, I changed it so that it simply leaves the page as all-zeroes, and adds it to the FSM. The code that grabs a target page from the FSM handles that, and initializes the page anyway, so that was simpler. This made the SP-GiST is-deleted flag obsolete, it's no longer set, although the code still checks for it for backwards-compatibility. (even that may actually be unnecessary, as a page that's marked as deleted must also be empty, and wherever we check for the deleted-flag, we also check for PageIsEmpty()))
This patch, in conjunction with the LWLock deadlock patch, fixes all the issues I was having with GIN indexes in 9.5.
I haven't tried SP-GiST.
Cheers,
Jeff
Heikki Linnakangas wrote: > BRIN update is not quite right, however. brin_getinsertbuffer() can > initialize a page, but the caller might bail out without using the page and > WAL-logging the change. If that happens, the next update that uses the same > page will WAL-log the change but it will not use the XLOG_BRIN_INIT_PAGE > option, and redo will not initialize the page. Redo will fail. There's even a worse case. In brin_getinsertbuffer, if an old buffer is passed, and the function extends the index, and then brin_getinsertbuffer finds that the old buffer has been used to extend the revmap, then we don't even return the newly extended page. We do enter it into the FSM, though, and the FSM is vacuumed. So next time around somebody requests a page from FSM, it might return that page; but since it wasn't initialized, the insertion will fail. Everybody expects the page to have been initialized previously, but this will not be the case here. I guess that function needs some restructuring, but it's pretty hairy already ... -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 07/23/2015 07:43 PM, Jeff Janes wrote: > On Mon, Jul 20, 2015 at 1:23 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >> I came up with the attached, for GIN and SP-GiST. Instead of WAL-logging >> the page initialization in SP-GiST vacuum, I changed it so that it simply >> leaves the page as all-zeroes, and adds it to the FSM. The code that grabs >> a target page from the FSM handles that, and initializes the page anyway, >> so that was simpler. This made the SP-GiST is-deleted flag obsolete, it's >> no longer set, although the code still checks for it for >> backwards-compatibility. (even that may actually be unnecessary, as a page >> that's marked as deleted must also be empty, and wherever we check for the >> deleted-flag, we also check for PageIsEmpty())) > > This patch, in conjunction with the LWLock deadlock patch, fixes all the > issues I was having with GIN indexes in 9.5. Thanks, I've pushed this, as well a fix to similar failure from B-tree vacuum that Amit Langote reported in the other thread. - Heikki
On Tue, Jul 28, 2015 at 12:21 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > Thanks, I've pushed this, as well a fix to similar failure from B-tree > vacuum that Amit Langote reported in the other thread. > Thanks Heikki and sorry I didn't notice this new thread. Regards, Amit
Heikki Linnakangas wrote: > BRIN update is not quite right, however. brin_getinsertbuffer() can > initialize a page, but the caller might bail out without using the page and > WAL-logging the change. If that happens, the next update that uses the same > page will WAL-log the change but it will not use the XLOG_BRIN_INIT_PAGE > option, and redo will not initialize the page. Redo will fail. Here's a fix for BRIN: when brin_getinsertbuffer extends the relation and doesn't return the page, it initializes and logs the page as an empty regular page before returning, and also records it into the FSM. That way, some future process that gets a page from FSM will use it, preventing this type of problem from bloating the index forever. Also, this function no longer initializes the page when it returns it to the caller: instead the caller (brin_doinsert or brin_doupdate) must do that. (Previously, the code was initializing the page outside the critical section that WAL-logs this action). > BTW, shouldn't there be a step in BRIN vacuum that scans all the BRIN pages? > If an empty page is missing from the FSM for any reason, there's nothing to > add it there. Probably. I didn't change this part yet. There are two things to fix: 1. since we use log_newpage_buffer(), we log the initialization but not the recording into FSM, so the page would be forgotten about. This can be tested with PageIsEmpty(). An alternative to the vacuum scan is to use our own WAL record that not only logs the initialization itself but also the FSM update. Not sure this is worth the trouble. 2. additionally, if brin_getinsertbuffer extends the relation but we crash before the caller initializes it, the page would be detected by PageIsNew instead and would also need initialization. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Alvaro Herrera wrote: > > BTW, shouldn't there be a step in BRIN vacuum that scans all the BRIN pages? > > If an empty page is missing from the FSM for any reason, there's nothing to > > add it there. > > Probably. I didn't change this part yet. There are two things to fix: > 1. since we use log_newpage_buffer(), we log the initialization but not > the recording into FSM, so the page would be forgotten about. This can > be tested with PageIsEmpty(). An alternative to the vacuum scan is to > use our own WAL record that not only logs the initialization itself but > also the FSM update. Not sure this is worth the trouble. > > 2. additionally, if brin_getinsertbuffer extends the relation but we > crash before the caller initializes it, the page would be detected by > PageIsNew instead and would also need initialization. Added this part. It's using log_newpage_buffer too. The vacuum scan fixes the whole FSM, though, so after vacuum the FSM is up to date. I think we could shave off a few bytes by using a separate WAL record, but I'm not sure it's worth the trouble. I intend to push this tomorrow. I now think the free space calculations are broken, but I'll leave that for later. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services