Thread: GiST insert algorithm rewrite
Following the discussions here (http://archives.postgresql.org/message-id/12375.1289429390@sss.pgh.pa.us), here's a draft version of a rewrite of the GiST insertion algorithm and WAL-logging. The goal is to get rid of the tracking of incomplete splits and inserts in WAL replay, and the fixup activities at the end of recovery.

There are two key changes to the insert algorithm:

1. When we walk down the tree searching for a suitable leaf page to insert the new tuple into, we update the nodes on the way down so that they are consistent with the new key we're inserting. The old GiST algorithm adjusted all the parent pages only after inserting the tuple on the leaf. Updating them on the way down ensures that the tree is self-consistent at all times, even if we crash just after inserting the new tuple on the leaf page.

2. When a page is split, we mark the new left page with a flag to indicate that the downlink for the page to the right hasn't been inserted yet. When the downlink is inserted, the flag is cleared. Again, the purpose is to ensure that the tree is self-consistent at all times. If we crash just after a page split, before the downlink is inserted, scans will find the tuples on the right page by following the rightlink. It's slightly less performant, but correct.

The WAL replay code has been simplified accordingly: we no longer need any fixup code to finish incomplete splits or inserts, and we no longer need the "invalid" pointers in the tree.

There are no on-disk format changes, except for the additional flag in the page headers, so this does not affect pg_upgrade. However, if there are any "invalid" keys in the old index because of an incomplete insertion, the new code will not understand them. So you should run vacuum to ensure that there are no such invalid keys in the index before upgrading. Vacuum will print a message in the log if it finds any, and you will have to reindex. But that's what it suggests you do anyway.

I haven't done much testing yet, although it passes the regression tests. I'll do more testing, and crash testing, and I'll have to re-review the patch myself before committing.

Comments?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Attachment
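To make point 2 above concrete, here is a minimal sketch of the scan-side rule, with made-up placeholder types rather than PostgreSQL's real page structures: a scan remembers the parent page's LSN when it reads the downlink, and on the child it also follows the rightlink if the crash-recovery flag is still set, or if the child's NSN shows that it was split after the parent was read.

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch only; simplified stand-ins, not PostgreSQL's real
 * page layout or the actual patch code.
 */
typedef struct SketchPage
{
    uint64_t lsn;           /* LSN of the last change to this page */
    uint64_t nsn;           /* LSN of the parent update that inserted the
                             * downlink for this page's right sibling */
    bool     follow_right;  /* set on the left half of a split until the
                             * downlink for the right half is in the parent */
    int      rightlink;     /* block number of the right sibling, or -1 */
} SketchPage;

static bool
scan_must_follow_rightlink(const SketchPage *child, uint64_t parent_lsn_when_read)
{
    if (child->rightlink < 0)
        return false;       /* no right sibling at all */

    if (child->follow_right)
        return true;        /* crash case: the split was never completed */

    if (parent_lsn_when_read < child->nsn)
        return true;        /* concurrency case: the parent was read before
                             * the downlink for the right half existed */

    return false;
}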
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > There are two key changes to the insert algorithm: > 1. When we walk down the tree searching for a suitable leaf page to > insert the new tuple to, we update the nodes on the way down so that > they are consistent with the new key we're inserting. The old GiST > algorithm adjusted all the parent pages only after inserting the tuple > on the leaf. Updating them on the way down ensures that the tree is > self-consistent at all times, even if we crash just after inserting the > new tuple on the leaf page. > 2. When a page is split, we mark the new left page with a flag to > indicate that the downlink for the page to the right hasn't been > inserted yet. When the downlink is inserted, the flag is cleared. Again > the purpose is to ensure that the tree is self-consistent at all times. > If we crash just after a page split, before the downlink is inserted, > scans will find the tuples on the right page by following the rightlink. > It's slightly less performant, but correct. The one thought that comes to mind is how does the flag business work after multiple splittings? That is, assume we have a page that has the flag set because of a previous crash. If we have to split either that page or its right sibling, what do we do with the flags? I think that it can be made to work, so long as searches keep moving right as long as the flag is set. But this needs to be thought through, and documented in the README file. I'm particularly worried whether the after-the-fact fixup that you mention in README might fail in such scenarios. regards, tom lane
On 16.11.2010 20:01, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> 2. When a page is split, we mark the new left page with a flag to >> indicate that the downlink for the page to the right hasn't been >> inserted yet. When the downlink is inserted, the flag is cleared. Again >> the purpose is to ensure that the tree is self-consistent at all times. >> If we crash just after a page split, before the downlink is inserted, >> scans will find the tuples on the right page by following the rightlink. >> It's slightly less performant, but correct. > > The one thought that comes to mind is how does the flag business work > after multiple splittings? That is, assume we have a page that has the > flag set because of a previous crash. If we have to split either that > page or its right sibling, what do we do with the flags? As I mentioned in the README, the insertion algorithm finishes any incomplete splits it sees before proceeding. AFAICS that should ensure that the situation never arises where you try to split a page that already has the flag set. Or its right sibling; such a page can only be reached via the rightlink, so you would see the page with the flag set first. Hmm, there is one corner-case that I didn't think of before: One backend splits a leaf page, and another backend concurrently splits the parent of that leaf page. If for some reason the backend operating on the parent page dies, releasing the locks, the other backend will see the incomplete split when it walks up to insert the downlink. Although it should be possible to handle that, I think we can simply give up on inserting the downlink in that case, and leave that split incomplete as well. It's still correct, and the next insert that comes along will complete the splits, from top to bottom. BTW, I don't try to fix incomplete splits during vacuum in the patch. That's perhaps a bit surprising, and probably would be easy to add, but I left it out for now as it's not strictly necessary. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
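Sketched in toy form (simplified stand-in types and an in-memory "parent", not the patch code), the finish-incomplete-splits-before-proceeding rule looks roughly like this:

#include <stdbool.h>
#include <stdint.h>

/* Toy model: pages are just slots in an array, keys are ignored. */
typedef struct ToyPage
{
    bool     follow_right;  /* downlink for right sibling missing in parent */
    int      rightlink;     /* array index of the right sibling, or -1 */
    uint64_t nsn;
} ToyPage;

typedef struct ToyParent
{
    int downlinks[64];      /* child indexes this parent currently points to */
    int ndownlinks;
} ToyParent;

/*
 * Before splitting or descending further, finish any split that a crashed
 * (or dead) backend left incomplete, walking right for as long as the flag
 * is set.
 */
static void
finish_incomplete_splits(ToyParent *parent, ToyPage *pages, int child,
                         uint64_t new_parent_lsn)
{
    while (pages[child].follow_right && pages[child].rightlink >= 0)
    {
        int right = pages[child].rightlink;

        /* Insert the missing downlink (key/union computation omitted). */
        parent->downlinks[parent->ndownlinks++] = right;

        /* Mark this split complete, as the original inserter would have. */
        pages[child].follow_right = false;
        pages[child].nsn = new_parent_lsn;

        /* The right sibling may itself carry an unfinished split. */
        child = right;
    }
}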
On Tue, Nov 16, 2010 at 1:22 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > BTW, I don't try to fix incomplete splits during vacuum in the patch. That's > perhaps a bit surprising, and probably would be easy to add, but I left it > out for now as it's not strictly necessary. Seems like it would be good to have this; otherwise, the split might stay incomplete indefinitely? Would that be bad? If we start to enlarge the bounding boxes on the higher levels of the tree and then crash before inserting the key, is there any mechanism for getting them back down to the minimal size? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 16.11.2010 20:46, Robert Haas wrote: > On Tue, Nov 16, 2010 at 1:22 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> BTW, I don't try to fix incomplete splits during vacuum in the patch. That's >> perhaps a bit surprising, and probably would be easy to add, but I left it >> out for now as it's not strictly necessary. > > Seems like it would be good to have this; otherwise, the split might > stay incompletely indefinitely? Would that be bad? Nothing bad should happen. Scans that need to traverse the incompletely split page would just be marginally slower. > If we start to enlarge the bounding boxes on the higher levels of the > tree and then crash before inserting the key, is there any mechanism > for getting them back down to the minimal size? No. There's also no mechanism for trimming the bounding boxes if a tuple is deleted. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Nov 16, 2010 at 1:50 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> If we start to enlarge the bounding boxes on the higher levels of the >> tree and then crash before inserting the key, is there any mechanism >> for getting them back down to the minimal size? > > No. There's also no mechanism for trimming the bounding boxes if a tuple is > deleted. Oh. So do the indexes just degrade over time until they eventually need to be REINDEX'd? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> they are consistent with the new key we're inserting. The old GiST > algorithm adjusted all the parent pages only after inserting the tuple > on the leaf. Updating them on the way down ensures that the tree is Hm, what about performance? While you traverse down to the leaf page, on each inner page you will need to call the unionFn/equalFn methods to decide whether or not to update the key on that inner page. Now we stop doing that after the first positive result of equalFn while walking up. Next, when a child page splits, you need to update the parent twice - the first time while walking down, and again while walking up. As I see it, you are trying to implement an algorithm very close to the original one, but that was rather slow. http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=0ad7db4be4b1f0208271c49fc1c8348f11ebc5b3 > self-consistent at all times, even if we crash just after inserting the > new tuple on the leaf page. > > 2. When a page is split, we mark the new left page with a flag to > indicate that the downlink for the page to the right hasn't been > inserted yet. When the downlink is inserted, the flag is cleared. Again Again, the new children are written twice (and there could be several of them, because of the user-defined picksplit implementation). -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Robert Haas <robertmhaas@gmail.com> writes: > Oh. So do the indexes just degrade over time until they eventually > need to be REINDEX'd? At some point you might reach a state where a reindex would be helpful. But the same is true of btrees. I don't think this is a serious objection, at least not unless backed by evidence that the tree often degrades rapidly. Anyway fixing it would be material for a different patch. regards, tom lane
On 16.11.2010 21:20, Teodor Sigaev wrote: >> they are consistent with the new key we're inserting. The old GiST >> algorithm adjusted all the parent pages only after inserting the tuple >> on the leaf. Updating them on the way down ensures that the tree is > Hm, performance? while you traverse to leaf page, on each inner page you > will need to call unionFn/equalFn methods to decide update or not key on > inner page. Now we stops do that after first positive result of equalFn > while walking up. Next, when child page splits then you need to update > parent twice - first time while walk down, and second while walk up. Hmm, will have to do some benchmarking on that. I'm using the Consistent function when walking down to check if the downlink needs to be updated, and assumed that it would be insignificant compared to the cost of calling Penalty on all the keys on the page. > As I see, you try to implement algorithm very close to original, but it > was rather slow. > http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=0ad7db4be4b1f0208271c49fc1c8348f11ebc5b3 Thanks for that, I'll have to read that through as well. >> self-consistent at all times, even if we crash just after inserting the >> new tuple on the leaf page. >> >> 2. When a page is split, we mark the new left page with a flag to >> indicate that the downlink for the page to the right hasn't been >> inserted yet. When the downlink is inserted, the flag is cleared. Again > Again, twice write of new children (it could be several because of > implementation of user-defined picksplit method). There should be no difference in performance here AFAICS. The children need to be updated a second time to clear the flag, but we don't release the locks on them in the middle, and we're only talking about setting a single flag, so it should make no difference. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Nov 16, 2010 at 3:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Oh. So do the indexes just degrade over time until they eventually >> need to be REINDEX'd? > > At some point you might reach a state where a reindex would be helpful. > But the same is true of btrees. I don't think this is a serious > objection, at least not unless backed by evidence that the tree often > degrades rapidly. Anyway fixing it would be material for a different > patch. Oh, I agree it's not for this patch to fix it, if it's already that way. I was just curious. I think index maintenance is going to be a problem we have to devote some cycles to down the road, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> Hmm, will have to do some benchmarking on that. I'm using the Consistent > function when walking down to check if the downlink needs to be updated, > and assumed that it would be insignificant compared to the cost of > calling Penalty on all the keys on the page. Why Consistent?! That's impossible - you don't know the right strategy number, and the index's storage type (or the type it's built over) might not accept the same type as the query. An index over tsvector is an example. > There should be no difference in performance here AFAICS. The children > need to be updated a second time to clear the flag, but we don't release > the locks on them in the middle, and we're only talking about setting a > single flag, so it should make no difference. I agree with that. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Sorry, I missed the beginning of the discussion on GiST, so I read it in the web mail archive. You wrote (http://archives.postgresql.org/pgsql-hackers/2010-11/msg00939.php):

[skip]
0. (the child page is locked)
1. The parent page is locked.
2. The child page is split. The original page becomes the left half, and new buffers are allocated for the right halves.
3. The downlink is inserted on the parent page (and the original downlink is updated to reflect only the keys that stayed on the left page). While keeping the child pages locked, the NSN field on the children are updated with the new LSN of the parent page.
...
The scan checks that by comparing the LSN it saw on the parent page with the NSN on the child page. If parent LSN < NSN, we saw the parent before the downlink was inserted.

Now, the problem with crash recovery is that the above algorithm depends on the split to keep the parent and child locked until the downlink is inserted in the parent. If you crash between steps 2 and 3, the locks are gone. If a later insert then updates the parent page, because of a split on some unrelated child page, that will bump the LSN of the parent above the NSN on the child. Scans will see that the parent LSN > child NSN, and will no longer follow the rightlink.
[skip]

I disagree with that opinion: if we crash between 2 and 3, why would somebody update the parent before WAL replay? In this case the WAL replay process should complete the child split by inserting an "invalid" pointer, and the tree becomes correct again, although it later needs to repair the "invalid" pointers. It's the same situation as with b-tree: WAL replay repairs an incomplete split before any other processing.

Or am I missing something important?

-- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On 17.11.2010 19:46, Teodor Sigaev wrote: > I disagree with that opinion: if we crash between 2 and 3 then why will > somebody update parent before WAL replay? WAL replay process in this > case should complete child split by inserting "invalid" pointer and tree > become correct again, although it needs to repair "invalid" pointers. > The same situation with b-tree: WAL replay repairs incomplete split > before any other processing. > > Or do I miss something important? Yeah, see the thread that started this: http://archives.postgresql.org/pgsql-hackers/2010-11/msg00052.php http://archives.postgresql.org/message-id/12375.1289429390@sss.pgh.pa.us The code currently relies on the end-of-recovery processing to finish the incomplete splits and inserts, but I'm trying to get rid of that end-of-recovery processing altogether. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 17.11.2010 19:36, Teodor Sigaev wrote: >> Hmm, will have to do some benchmarking on that. I'm using the Consistent >> function when walking down to check if the downlink needs to be updated, >> and assumed that it would be insignificant compared to the cost of >> calling Penalty on all the keys on the page. > Why consistent?! It's impossible - you don't know right strategy number, > index with storage type/over type could do not accept the same type as > query. Index over tsvector is an example. Sorry, I was confused. I'm calling the gistgetadjusted() function, which uses the Union function. I.e., I'm doing the same thing we did before when propagating the changes up the tree; I'm just doing it on the way down instead. I ran some quick performance tests on my laptop, and couldn't see any measurable difference with the patch. So I think we're good on performance. I used the attached scripts, with \timing. Have you had a chance to look at the patch yet? I'm hesitant to commit before you take a look at it, though I still have to proofread it myself as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
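For a concrete picture of that check, here is a hand-rolled sketch using plain 2D bounding boxes as the keys; it is illustrative only, not the opclass code or gistgetadjusted() itself. The downlink is rewritten (and WAL-logged) only if the union with the new key actually differs from what is already stored, which is why the extra work on the way down is usually cheap:

#include <stdbool.h>

/* Toy 2D bounding-box key, standing in for whatever the opclass stores. */
typedef struct BoxKey
{
    double xmin, ymin, xmax, ymax;
} BoxKey;

static bool
box_contains(const BoxKey *outer, const BoxKey *inner)
{
    return outer->xmin <= inner->xmin && outer->ymin <= inner->ymin &&
           outer->xmax >= inner->xmax && outer->ymax >= inner->ymax;
}

/*
 * Analogue of the check done while walking down: compute the union of the
 * stored downlink key and the key being inserted, and report whether the
 * downlink actually has to be rewritten on this inner page.
 */
static bool
adjust_downlink_key(BoxKey *stored, const BoxKey *newkey)
{
    if (box_contains(stored, newkey))
        return false;               /* no change needed, nothing to WAL-log */

    if (newkey->xmin < stored->xmin) stored->xmin = newkey->xmin;
    if (newkey->ymin < stored->ymin) stored->ymin = newkey->ymin;
    if (newkey->xmax > stored->xmax) stored->xmax = newkey->xmax;
    if (newkey->ymax > stored->ymax) stored->ymax = newkey->ymax;
    return true;                    /* page modified; must be WAL-logged */
}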
On 18.11.2010 14:58, Heikki Linnakangas wrote: > On 17.11.2010 19:36, Teodor Sigaev wrote: >>> Hmm, will have to do some benchmarking on that. I'm using the Consistent >>> function when walking down to check if the downlink needs to be updated, >>> and assumed that it would be insignificant compared to the cost of >>> calling Penalty on all the keys on the page. >> Why consistent?! It's impossible - you don't know right strategy number, >> index with storage type/over type could do not accept the same type as >> query. Index over tsvector is an example. > > Sorry, I was confused. I'm calling the gistgetadjusted() function, which > uses the Union function. Ie. I'm doing the same we did before when > propagating the changes up the tree. I'm just doing it on the way down > instead. > > I ran some quick performance tests on my laptop, and couldn't see any > measurable difference with the patch. So I think we're good on > performance. I used the attached scripts, with \timing. > > Have you had a chance to look at the patch yet? I'm hesitant to commit > before you take a look at it, though I still have to proofread it myself > as well. Here's an updated version with some minor fixes. I'd appreciate review, as well as pointers to good test cases for this. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
Heikki Linnakangas wrote: > There's no on-disk format changes, except for the additional flag in the > page headers, so this does not affect pg_upgrade. However, if there's > any "invalid" keys in the old index because of an incomplete insertion, > the new code will not understand that. So you should run vacuum to > ensure that there's no such invalid keys in the index before upgrading. > Vacuum will print a message in the log if it finds any, and you will > have to reindex. But that's what it suggests you to do anyway. OK, pg_upgrade has code to report invalid gin and hash indexes because of changes between PG 8.3 and 8.4. Is this something we would do for 9.0 to 9.1? You are saying it would have to be run before the upgrade. Can it not be run after? I can output a script to VACUUM all such indexes, and tell users to manually REINDEX any index that generates a warning message. I don't have any way to automate an optional REINDEX step. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 27.11.2010 21:31, Bruce Momjian wrote: > Heikki Linnakangas wrote: >> There's no on-disk format changes, except for the additional flag in the >> page headers, so this does not affect pg_upgrade. However, if there's >> any "invalid" keys in the old index because of an incomplete insertion, >> the new code will not understand that. So you should run vacuum to >> ensure that there's no such invalid keys in the index before upgrading. >> Vacuum will print a message in the log if it finds any, and you will >> have to reindex. But that's what it suggests you to do anyway. > > OK, pg_upgrade has code to report invalid gin and hash indexes because > of changes between PG 8.3 and 8.4. Is this something we would do for > 9.0 to 9.1? 9.1. The problem that started this whole thing is there in older versions, but given the lack of real-life reports and the scale of the changes required it doesn't seem wise to backport. > You are saying it would have to be run before the upgrade. Can it not > be run after? After would work too. > I can output a script to VACUUM all such indexes, and tell users to > manually REINDEX any index that generates a warning messasge. I don't > have any way to automate an optional REINDEX step. That seems good enough. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 30.11.2010 11:55, Heikki Linnakangas wrote: > On 27.11.2010 21:31, Bruce Momjian wrote: >> Heikki Linnakangas wrote: >>> There's no on-disk format changes, except for the additional flag in the >>> page headers, so this does not affect pg_upgrade. However, if there's >>> any "invalid" keys in the old index because of an incomplete insertion, >>> the new code will not understand that. So you should run vacuum to >>> ensure that there's no such invalid keys in the index before upgrading. >>> Vacuum will print a message in the log if it finds any, and you will >>> have to reindex. But that's what it suggests you to do anyway. >> >> OK, pg_upgrade has code to report invalid gin and hash indexes because >> of changes between PG 8.3 and 8.4. Is this something we would do for >> 9.0 to 9.1? > > 9.1. The problem that started this whole thing is there in older > versions, but given the lack of real-life reports and the scale of the > changes required it doesn't seem wise to backport. Oh sorry, I read your question as "9.0 *or* 9.1". Only GiST indexes that have any "invalid" tuples in them, as a result of a crash, need to be reindexed. That's very rare in practice, so we shouldn't invalidate all GiST indexes. I don't think there's any simple way to check whether reindex is required, so I think we have to just document this. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Nov 30, 2010 at 5:02 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 30.11.2010 11:55, Heikki Linnakangas wrote: >> >> On 27.11.2010 21:31, Bruce Momjian wrote: >>> >>> Heikki Linnakangas wrote: >>>> >>>> There's no on-disk format changes, except for the additional flag in the >>>> page headers, so this does not affect pg_upgrade. However, if there's >>>> any "invalid" keys in the old index because of an incomplete insertion, >>>> the new code will not understand that. So you should run vacuum to >>>> ensure that there's no such invalid keys in the index before upgrading. >>>> Vacuum will print a message in the log if it finds any, and you will >>>> have to reindex. But that's what it suggests you to do anyway. >>> >>> OK, pg_upgrade has code to report invalid gin and hash indexes because >>> of changes between PG 8.3 and 8.4. Is this something we would do for >>> 9.0 to 9.1? >> >> 9.1. The problem that started this whole thing is there in older >> versions, but given the lack of real-life reports and the scale of the >> changes required it doesn't seem wise to backport. > > Oh sorry, I read your question as "9.0 *or* 9.1". > > Only GiST indexes that have any "invalid" tuples in them n, as a result of a > crash, need to be reindexed. That's very rare in practice, so we shouldn't > invalidate all GiST indexes. I don't think there's any simple way to check > whether reindex is required, so I think we have to just document this. It seems odd to say, the indexes are corrupted, but they're probably not, so let's not worry about it. I assume there's no way to make the new code cope with any pre-existing corruption? Does the current code cope with the corruption? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 30.11.2010 16:23, Robert Haas wrote: > On Tue, Nov 30, 2010 at 5:02 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> On 30.11.2010 11:55, Heikki Linnakangas wrote: >>> >>> On 27.11.2010 21:31, Bruce Momjian wrote: >>>> >>>> Heikki Linnakangas wrote: >>>>> >>>>> There's no on-disk format changes, except for the additional flag in the >>>>> page headers, so this does not affect pg_upgrade. However, if there's >>>>> any "invalid" keys in the old index because of an incomplete insertion, >>>>> the new code will not understand that. So you should run vacuum to >>>>> ensure that there's no such invalid keys in the index before upgrading. >>>>> Vacuum will print a message in the log if it finds any, and you will >>>>> have to reindex. But that's what it suggests you to do anyway. >>>> >>>> OK, pg_upgrade has code to report invalid gin and hash indexes because >>>> of changes between PG 8.3 and 8.4. Is this something we would do for >>>> 9.0 to 9.1? >>> >>> 9.1. The problem that started this whole thing is there in older >>> versions, but given the lack of real-life reports and the scale of the >>> changes required it doesn't seem wise to backport. >> >> Oh sorry, I read your question as "9.0 *or* 9.1". >> >> Only GiST indexes that have any "invalid" tuples in them, as a result of a >> crash, need to be reindexed. That's very rare in practice, so we shouldn't >> invalidate all GiST indexes. I don't think there's any simple way to check >> whether reindex is required, so I think we have to just document this. > > It seems odd to say, the indexes are corrupted, but they're probably > not, so let's not worry about it. > > I assume there's no way to make the new code cope with any > pre-existing corruption? > > Does the current code cope with the corruption? It's not corruption, but "intended degradation". Yes, the current code copes with it, that's how GiST survives a crash. However, even with the current code, VACUUM will nag if it finds any invalid tuples with this message: ereport(NOTICE,(errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash recovery", That's harmless, in the sense that all scans and inserts work fine, but scans might need to do more work than if the invalid tuple wasn't there. I don't think we need to go out of our way to support such degraded indexes in 9.1. If you see such notices in your logs, you should REINDEX anyway, before or after pg_upgrade. Let's just make sure that you get a reasonable error message in 9.1 if a scan or insert encounters such a tuple. There is a section on this in the docs, BTW: http://www.postgresql.org/docs/9.0/static/gist-recovery.html -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > > Does the current code cope with the corruption? > > It's not corruption, but "intended degradation". Yes, the current code > copes with it, that's how GiST survives a crash. However, even with the > current code, VACUUM will nag if it finds any invalid tuples with this > message: > > ereport(NOTICE, > (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash > recovery", > > That's harmless, in the sense that all scans and inserts work fine, but > scans might need to do more work than if the invalid tuple wasn't there. > > I don't think we need to go out of our way to support such degraded > indexes in 9.1. If you see such notices in your logs, you should REINDEX > anyway, before of after pg_upgrade. Let's just make sure that you get a > reasonable error message in 9.1 if a scan or insert encounters such a tuple. > > There is a section on this in the docs, BTW: > http://www.postgresql.org/docs/9.0/static/gist-recovery.html OK, administrators will be prompted during normal operation --- seems there is nothing extra for pg_upgrade to do here. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> Does the current code cope with the corruption? > > It's not corruption, but "intended degradation". Yes, the current code copes > with it, that's how GiST survives a crash. However, even with the current > code, VACUUM will nag if it finds any invalid tuples with this message: > > ereport(NOTICE, > (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash > recovery", > > That's harmless, in the sense that all scans and inserts work fine, but > scans might need to do more work than if the invalid tuple wasn't there. > > I don't think we need to go out of our way to support such degraded indexes > in 9.1. If you see such notices in your logs, you should REINDEX anyway, > before of after pg_upgrade. Let's just make sure that you get a reasonable > error message in 9.1 if a scan or insert encounters such a tuple. I just don't want to take a risk of giving people unexpected wrong answers. It's not clear to me whether that's a risk here or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 01.12.2010 04:10, Robert Haas wrote: > On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >>> Does the current code cope with the corruption? >> >> It's not corruption, but "intended degradation". Yes, the current code copes >> with it, that's how GiST survives a crash. However, even with the current >> code, VACUUM will nag if it finds any invalid tuples with this message: >> >> ereport(NOTICE, >> (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash >> recovery", >> >> That's harmless, in the sense that all scans and inserts work fine, but >> scans might need to do more work than if the invalid tuple wasn't there. >> >> I don't think we need to go out of our way to support such degraded indexes >> in 9.1. If you see such notices in your logs, you should REINDEX anyway, >> before of after pg_upgrade. Let's just make sure that you get a reasonable >> error message in 9.1 if a scan or insert encounters such a tuple. > > I just don't want to take a risk of giving people unexpected wrong > answers. It's not clear to me whether that's a risk here or not. You'll get an error if a scan encounters an invalid tuple. In the patch I posted, I just ripped out everything related to invalid tuples altogether. But we should add a check and ereport for that before commit. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Dec 1, 2010 at 4:00 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 01.12.2010 04:10, Robert Haas wrote: >> >> On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>>> >>>> Does the current code cope with the corruption? >>> >>> It's not corruption, but "intended degradation". Yes, the current code >>> copes >>> with it, that's how GiST survives a crash. However, even with the current >>> code, VACUUM will nag if it finds any invalid tuples with this message: >>> >>> ereport(NOTICE, >>> (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash >>> recovery", >>> >>> That's harmless, in the sense that all scans and inserts work fine, but >>> scans might need to do more work than if the invalid tuple wasn't there. >>> >>> I don't think we need to go out of our way to support such degraded >>> indexes >>> in 9.1. If you see such notices in your logs, you should REINDEX anyway, >>> before of after pg_upgrade. Let's just make sure that you get a >>> reasonable >>> error message in 9.1 if a scan or insert encounters such a tuple. >> >> I just don't want to take a risk of giving people unexpected wrong >> answers. It's not clear to me whether that's a risk here or not. > > You'll get an error if a scan encounters an invalid tuple. > > In the patch I posted, I just ripped out everything related to invalid > tuples altogether. But we should add a check and ereport for that before > commit. All right, that seems like a reasonable backstop, if we're fairly sure this won't be a common scenario. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 01.12.2010 11:00, Heikki Linnakangas wrote: > On 01.12.2010 04:10, Robert Haas wrote: >> On Tue, Nov 30, 2010 at 10:26 AM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>>> Does the current code cope with the corruption? >>> >>> It's not corruption, but "intended degradation". Yes, the current >>> code copes >>> with it, that's how GiST survives a crash. However, even with the >>> current >>> code, VACUUM will nag if it finds any invalid tuples with this message: >>> >>> ereport(NOTICE, >>> (errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash >>> recovery", >>> >>> That's harmless, in the sense that all scans and inserts work fine, but >>> scans might need to do more work than if the invalid tuple wasn't there. >>> >>> I don't think we need to go out of our way to support such degraded >>> indexes >>> in 9.1. If you see such notices in your logs, you should REINDEX anyway, >>> before of after pg_upgrade. Let's just make sure that you get a >>> reasonable >>> error message in 9.1 if a scan or insert encounters such a tuple. >> >> I just don't want to take a risk of giving people unexpected wrong >> answers. It's not clear to me whether that's a risk here or not. > > You'll get an error if a scan encounters an invalid tuple. > > In the patch I posted, I just ripped out everything related to invalid > tuples altogether. But we should add a check and ereport for that before > commit. Here's an updated patch. On closer look, supporting the invalid tuples in scans was trivial, so I kept that after all. So you can still query an index with invalid tuples. If an insert encounters one, you get an error, and VACUUM emits a LOG message on any such tuples. There's one bug remaining that I found during testing. If you crash, leaving an incomplete split behind, and then vacuum the table removing all the aborted tuples from the pages, it's possible that you end up with a completely empty page that has no downlink yet. The code to complete incomplete splits doesn't cope with that at the moment - it doesn't know how to construct a parent key for a child that has no tuples. The nicest way to handle that would be to recycle the empty page instead of trying to finish the page split, but I think there might be a race condition there if the page gets quickly reused while a scan is just about to visit it through the rightlink. GiST doesn't seem to ever reuse pages in normal operation, which conveniently avoids that problem. Simply abandoning the page forever is certainly one way to handle it, it shouldn't happen that often. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
On Dec 3, 2010, at 4:54 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Here's an updated patch. How carefully have you perf-tested this? > On closer look, supporting the invalid tuples in scans was trivial, so I kept that after all. So you can still query an index with invalid tuples. If an insert encounters one, you get an error, and VACUUM emits a LOG message on any such tuples. Cool. > There's one bug remaining that I found during testing. If you crash, leaving an incomplete split behind, and then vacuum the table removing all the aborted tuples from the pages, it's possible that you end up with a completely empty page that has no downlink yet. The code to complete incomplete splits doesn't cope with that at the moment - it doesn't know how to construct a parent key for a child that has no tuples. I think we can live with this. ...Robert
On 03.12.2010 23:54, Heikki Linnakangas wrote: > There's one bug remaining that I found during testing. If you crash, > leaving an incomplete split behind, and then vacuum the table removing > all the aborted tuples from the pages, it's possible that you end up > with a completely empty page that has no downlink yet. The code to > complete incomplete splits doesn't cope with that at the moment - it > doesn't know how to construct a parent key for a child that has no tuples. > > The nicest way to handle that would be to recycle the empty page instead > of trying to finish the page split, but I think there might be a race > condition there if the page gets quickly reused while a scan is just > about to visit it through the rightlink. GiST doesn't seem to ever reuse > pages in normal operation, which conveniently avoids that problem. > Simply abandoning the page forever is certainly one way to handle it, it > shouldn't happen that often. I fixed that by simply abandoning pages. That seems acceptable, given that you shouldn't crash with incomplete splits that often. GiST never tries to reuse empty pages anyway, so some leakage at a crash seems like the least of our worries on that front. I also added a check for the F_FOLLOW_RIGHT flag in the gist scan code. Tom Lane pointed out offlist that it was missing earlier. I realized that the way I was setting the NSN and clearing the flag on child pages, when the downlink is inserted into the parent, was not safe. We need to update the LSN and take a full-page image per the usual rules. But that creates a new problem: There's a maximum of three backup blocks per WAL record, but a GiST page can be split into any number of child pages as one operation. You might run out of backup block slots. Attached is an updated patch, but that issue with the limited number of backup blocks needs to be resolved. The straightforward way would be to change the WAL format to increase the limit. Another option is to refactor the GiST insertion code some more, to insert the downlink pointers into the parent one-by-one, instead of as one big operation, when a page is split into more than two halves. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
On Mon, Dec 13, 2010 at 7:09 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Attached is an updated patch, but that issue with limited number of backup > blocks needs to be resolved. The straightforward way would be to change the > WAL format to increase the limit. Eh, is that going to bloat WAL? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 13.12.2010 15:04, Robert Haas wrote: > On Mon, Dec 13, 2010 at 7:09 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Attached is an updated patch, but that issue with limited number of backup >> blocks needs to be resolved. The straightforward way would be to change the >> WAL format to increase the limit. > > Eh, is that going to bloat WAL?

Nah. The way a split now works is:

1. Split the page. Write a WAL record with the contents of the page halves.

2. Insert the downlink pointers in the parent, and set the NSN and clear the F_FOLLOW_RIGHT flags on the child pages. A 2nd WAL record is written for this.

In this new patch version, at step 2, the 2nd WAL record updates the LSNs and takes full-page images of the child pages if necessary. Previous patches overlooked that. Usually a full page image won't be necessary, because we just wrote the page-split WAL record at step 1 for those pages. It's only needed if a checkpoint started in the small window between steps 1 and 2. So this should have no effect on performance, but it is necessary for correctness.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
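In rough pseudo-C (assumed field names on a simplified page descriptor, not the actual GiST code), step 2 amounts to something like this for the pages involved, with record_lsn being the LSN of that second WAL record:

#include <stdbool.h>
#include <stdint.h>

/* Simplified page descriptor, for illustration only. */
typedef struct SplitPage
{
    uint64_t lsn;
    uint64_t nsn;
    bool     follow_right;
} SplitPage;

static void
complete_split(SplitPage *parent, SplitPage *left, SplitPage *right,
               uint64_t record_lsn)
{
    parent->lsn = record_lsn;       /* the parent got the new downlink */

    /*
     * Stamp the children so that concurrent scans which read the parent
     * before this record still follow the rightlink (parent LSN < NSN).
     */
    left->nsn = record_lsn;
    right->nsn = record_lsn;

    left->follow_right = false;     /* the downlink now exists */

    /*
     * The children were modified too, so they carry the record's LSN; this
     * is also the point where a full-page image is taken if a checkpoint
     * started between steps 1 and 2.
     */
    left->lsn = record_lsn;
    right->lsn = record_lsn;
}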
Increasing max # of backup blocks (was Re: GiST insert algorithm rewrite)
On 13.12.2010 15:20, Heikki Linnakangas wrote: > On 13.12.2010 15:04, Robert Haas wrote: >> On Mon, Dec 13, 2010 at 7:09 AM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> Attached is an updated patch, but that issue with limited number of >>> backup >>> blocks needs to be resolved. The straightforward way would be to >>> change the >>> WAL format to increase the limit. >> >> Eh, is that going to bloat WAL? > > Nah. Or did you mean whether changing the WAL format to raise the limit would bloat WAL? That depends on how it's done, of course. Assuming MAXALIGN=4, there are 2 wasted bytes in XLogRecord at the moment that we could take into use without making the header bigger. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > But that creates a new problem: There's a maximum of three backup blocks > per WAL record, but a GiST page can be split into any number of child > pages as one operation. You might run out of backup block slots. > Attached is an updated patch, but that issue with limited number of > backup blocks needs to be resolved. The straightforward way would be to > change the WAL format to increase the limit. I don't think you can fix it that way. If a page can be split into any number of child pages, then no fixed number of pages in a WAL record will be enough. Even if you were willing to suppose that ~16 would be enough, it would be a bad idea because of the extra overhead added into XLogInsert, which'd be paid by *every* WAL insert operation. I think you need to refactor the operation so that there's one WAL record per child page, or something along that line. I concede this might be difficult :-( regards, tom lane
On Mon, Dec 13, 2010 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think you need to refactor the operation so that there's one WAL > record per child page, or something along that line. I concede this > might be diffcult :-( If it's only the backup blocks that matter, couldn't you generate noop WAL records with just the full page image in them? Once all those are generated, then generate the actual split operation; since all the pages have been written to WAL since the last checkpoint, they won't need any backup block slots. This would require suppressing any checkpoints between writing the first backup block and the final operation record. That might be pretty hard to do cleanly. -- greg
On 13.12.2010 19:19, Greg Stark wrote: > On Mon, Dec 13, 2010 at 3:14 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> I think you need to refactor the operation so that there's one WAL >> record per child page, or something along that line. I concede this >> might be diffcult :-( > > If it's only the backup blocks that matter couldn't you generate noop > WAL records with just the full page image in them. Once all those are > generated then generate the actual split operation and since all the > pages have been written to wal since the last checkpoint they won't > need any backup block slots. > > This would require surpressing any checkpoints between writing the > first backup block and the final operation record. That might be > pretty hard to do cleanly. That would work, but it brings us back to square one (http://archives.postgresql.org/message-id/4CCFEE61.2090702@enterprisedb.com). It's not necessarily a bad idea. A capability to hold off checkpoints might be the easiest way to do this, and it could be useful for other things in the future. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 13.12.2010 19:19, Greg Stark wrote: >> If it's only the backup blocks that matter couldn't you generate noop >> WAL records with just the full page image in them. Once all those are >> generated then generate the actual split operation and since all the >> pages have been written to wal since the last checkpoint they won't >> need any backup block slots. >> >> This would require surpressing any checkpoints between writing the >> first backup block and the final operation record. That might be >> pretty hard to do cleanly. > That would work, but it brings us back to square one Yeah. Wouldn't the original page-split record have been carrying full page images already? (And if so, why didn't we have this problem in the previous implementation?) regards, tom lane
On 13.12.2010 19:48, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> On 13.12.2010 19:19, Greg Stark wrote: >>> If it's only the backup blocks that matter couldn't you generate noop >>> WAL records with just the full page image in them. Once all those are >>> generated then generate the actual split operation and since all the >>> pages have been written to wal since the last checkpoint they won't >>> need any backup block slots. >>> >>> This would require surpressing any checkpoints between writing the >>> first backup block and the final operation record. That might be >>> pretty hard to do cleanly. > >> That would work, but it brings us back to square one > > Yeah. Wouldn't the original page-split record have been carrying full > page images already? Yes. BTW, the original split record doesn't run into the limit because it doesn't use the backup-block mechanism; it contains all the tuples for all the pages in the main payload. > (And if so, why didn't we have this problem in the > previous implementation?) In the previous implementation, the NSN was updated immediately in the page split record, and there was no follow-right flag to clear. So the child pages didn't need to be updated when the downlinks were inserted. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
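As a rough picture of the difference (an assumed, illustration-only layout, not the real GiST WAL record structs): the split record's main payload describes every half in full, so redo can rebuild those pages directly from the payload without consuming any backup-block slots; it is only the later downlink-insertion record, which touches pages it did not just rewrite, that runs into the three-slot limit.

#include <stdint.h>

/*
 * Assumed layout, for illustration only.  Because the tuples of every half
 * travel in the main payload, redo reinitializes each half from scratch;
 * no full-page images, and therefore no backup-block slots, are needed for
 * the split record itself.
 */
typedef struct SplitPayloadSketch
{
    uint32_t orig_blkno;    /* the page that was split */
    uint16_t npages;        /* how many halves it was split into */
    /*
     * Followed, for each half: its block number, its tuple count, and the
     * tuples themselves, appended back to back in the payload.
     */
} SplitPayloadSketch;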
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 13.12.2010 19:48, Tom Lane wrote: >> Yeah. Wouldn't the original page-split record have been carrying full >> page images already? > Yes. > BTW, the original split record doesn't run into the limit because it > doesn't use the backup-block mechanism, it contains all the tuples for > all the pages in the main payload. I see. >> (And if so, why didn't we have this problem in the >> previous implementation?) > In the previous implementation, the NSN was updated immediately in the > page split record, and there was no follow-right flag to clear. So the > child pages didn't need to be updated when the downlinks are inserted. Can we fix it so that each child page is updated, and its downlink inserted, as a separate atomic action? That'd require each intermediate state to be consistent and crash-safe, but I think you really need the intermediate states to be consistent anyway because of concurrent scans. regards, tom lane
On 13.12.2010 20:30, Tom Lane wrote: > Can we fix it so that each child page is updated, and its downlink > inserted, as a separate atomic action? That'd require each intermediate > state to be consistent and crash-safe, but I think you really need the > intermediate states to be consistent anyway because of concurrent scans. Yes, all the intermediate states are consistent. I'm looking at that approach as we speak. The logic to track what we've done and what needs to be done as the changes are propagated gets quite hairy, but in principle it should work. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 13.12.2010 20:30, Tom Lane wrote: > Can we fix it so that each child page is updated, and its downlink > inserted, as a separate atomic action? That'd require each intermediate > state to be consistent and crash-safe, but I think you really need the > intermediate states to be consistent anyway because of concurrent scans. Here's an updated patch, using that idea. If a page is split into more than two pages, the downlinks for the pages are inserted into the parent one-by-one, right-to-left, until only two remain. Finally, the downlink for the last remaining right page is inserted and the downlink for the original page is updated as one atomic operation. It was a pretty big rewrite again, but seems to work now. I tested splits that span more than two pages by rigging the btree_gist picksplit function to choose very bad split points, and the fix-split logic by adding elog(ERROR) in strategic places to sometimes leave splits incomplete. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
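One way to picture that ordering (a toy model that only tracks the flag bookkeeping; keys, NSN updates and WAL logging are all omitted, and this is not the patch code): for a split into pages[0..npages-1] chained left to right via rightlinks, the downlinks are added right to left, so at every intermediate state each page that has no downlink yet is still reachable through a rightlink chain starting at a page the parent does point to.

#include <stdbool.h>

typedef struct ChainPage
{
    bool follow_right;      /* right sibling's downlink not in the parent yet */
    int  rightlink;         /* index of the right sibling, or -1 */
} ChainPage;

typedef struct ChainParent
{
    int downlinks[64];
    int ndownlinks;
} ChainParent;

/*
 * pages[0] is the original page (its downlink already exists in the parent);
 * pages[1..npages-1] are the new right halves.  Pages 0..npages-2 start out
 * with follow_right set.  Each insertion below stands for one atomic,
 * WAL-logged action.
 */
static void
insert_downlinks_right_to_left(ChainParent *parent, ChainPage *pages, int npages)
{
    if (npages < 2)
        return;

    for (int i = npages - 1; i >= 2; i--)
    {
        parent->downlinks[parent->ndownlinks++] = i;    /* one atomic action */
        pages[i - 1].follow_right = false;  /* its right sibling is now
                                             * reachable from the parent */
    }

    /*
     * Final atomic action: insert the downlink for pages[1] and, in the same
     * operation, replace the downlink key for pages[0] with its post-split
     * key.
     */
    parent->downlinks[parent->ndownlinks++] = 1;
    pages[0].follow_right = false;
}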
On 16.12.2010 15:52, Heikki Linnakangas wrote: > On 13.12.2010 20:30, Tom Lane wrote: >> Can we fix it so that each child page is updated, and its downlink >> inserted, as a separate atomic action? That'd require each intermediate >> state to be consistent and crash-safe, but I think you really need the >> intermediate states to be consistent anyway because of concurrent scans. > > Here's an updated patch, using that idea.If a page split into more than > two pages, the downlinks for the pages are inserted to the parent > one-by-one, right-to-left, until there's only two remaining. Finally the > downlink for the last remaining right page is inserted and the downlink > for the original page is updated as one atomic operation. > > It was a pretty big rewrite again, but seems to work now. I tested > splits that span more than two pages by rigging the btree_gist picksplit > function to choose very bad split points, and the fix-split logic by > adding elog(ERROR) in strategic places to sometimes leave splits > incomplete. One final version, with a bug fix wrt. root page split and some cleanup. I'm planning to commit this before Christmas. It's a big patch, so review would be much appreciated. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
On 21.12.2010 20:00, Heikki Linnakangas wrote: > One final version, with a bug fix wrt. root page split and some cleanup. > I'm planning to commit this before Christmas. It's a big patch, so > review would be much appreciated. Committed. Phew! Review & testing is of course still appreciated, given how big and complicated this was. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 21.12.2010 20:00, Heikki Linnakangas wrote: >> One final version, with a bug fix wrt. root page split and some cleanup. >> I'm planning to commit this before Christmas. It's a big patch, so >> review would be much appreciated. > Committed. Phew! Review & testing is of course still appreciated, given > how big and complicated this was. I just found out that the "benchmark" test script in contrib/intarray/bench/ crashes HEAD in gistdoinsert() --- it looks like it's trying to pop to a stack entry that isn't there. Run it per the instructions in the intarray documentation:

$ createdb TEST
$ psql TEST < ../_int.sql
...
$ ./create_test.pl | psql TEST
CREATE TABLE
CREATE TABLE
CREATE INDEX
CREATE INDEX
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost

The script generates randomized data, so possibly it won't fail every time, but it failed three out of three times for me. The changes I'm about to commit in intarray don't seem to make any difference. regards, tom lane
On 09.01.2011 07:05, Tom Lane wrote: > I just found out that the "benchmark" test script in > contrib/intarray/bench/ crashes HEAD in gistdoinsert() --- it looks like > it's trying to pop to a stack entry that isn't there. > > Run it per the instructions in the intarray documentation: > > $ createdb TEST > $ psql TEST< ../_int.sql > ... > $ ./create_test.pl | psql TEST > CREATE TABLE > CREATE TABLE > CREATE INDEX > CREATE INDEX > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > connection to server was lost > > The script generates randomized data, so possibly it won't fail every > time, but it failed three out of three times for me. The changes I'm > about to commit in intarray don't seem to make any difference. Thanks, fixed. Apparently my testing never hit the case where an update, rather than an insert, into the root page causes it to split. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com