Thread: Bug in new buffering GiST build code

Bug in new buffering GiST build code

From
Heikki Linnakangas
Date:
I bumped into a bug in the new buffering GiST build code. I did this:

create table gisttest (t text);
insert into gisttest select
a||'fooooooooooooooooooooooooooooooooooooooooooooooo' from
generate_series(1,10000000) a;

create index i_gisttest on gisttest using gist (t collate "C") WITH
(fillfactor=10);

After a while, this segfaults.

I debugged this, and traced this into a bug in the
gistRelocateBuildBuffersOnSplit() function. It splits a node buffer into
two (or more) node buffers, when the corresponding index page is split.
It first makes a copy of the old GISTNodeBuffer struct, and then
repurposes the struct to hold the new buffer for the new leftmost page
of the split. The temporary copy of the old page is only needed while
the function moves all the tuples from the old buffer into the new
buffers, after that it can be discarded. The temporary copy of the
struct is kept in the stack. However, the temporary copy can find its
way into the list of "loaded buffers". After the function exits, and
it's time to unload all the currently loaded buffers, you get a segfault
because the pointer now points to garbage. I think the reason this
doesn't always crash is that the copy in the stack usually still happens
to be valid enough that gistUnloadNodeBuffer() just skips it.

I'll commit the attached fix for that, marking the temporary copy
explicitly, so that we can avoid leaking it into any long-lived data
structures.

After fixing that, however, I'm now getting another error, much later in
the build process:

ERROR:  failed to re-find parent for block 123002
STATEMENT:  create index i_gisttest on gisttest using gist (t collate
"C") WITH (fillfactor=10);

I'll continue debugging that, but it seems to be another, unrelated, bug.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: Bug in new buffering GiST build code

From
Alexander Korotkov
Date:
On Fri, May 18, 2012 at 8:27 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: 
After fixing that, however, I'm now getting another error, much later in the build process:

ERROR:  failed to re-find parent for block 123002
STATEMENT:  create index i_gisttest on gisttest using gist (t collate "C") WITH (fillfactor=10);

I'll continue debugging that, but it seems to be another, unrelated, bug.
Thanks for debugging and fixing that. I'm going to take a look on the remaining bug.

------
With best regards,
Alexander Korotkov.

Re: Bug in new buffering GiST build code

From
Heikki Linnakangas
Date:
On 18.05.2012 20:34, Alexander Korotkov wrote:
> On Fri, May 18, 2012 at 8:27 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>>
>> After fixing that, however, I'm now getting another error, much later in
>> the build process:
>>
>> ERROR:  failed to re-find parent for block 123002
>> STATEMENT:  create index i_gisttest on gisttest using gist (t collate "C")
>> WITH (fillfactor=10);
>>
>> I'll continue debugging that, but it seems to be another, unrelated, bug.
>
> Thanks for debugging and fixing that. I'm going to take a look on the
> remaining bug.

After staring at graphs built from gist trees for the whole day, I think 
I finally understand what's wrong:

There's a thinko in the way we maintain the parent paths during 
insertions. It boils down to the fact that in a GiST index, the 
left-to-right ordering as determined by the right-links on the upper 
level does not necessarily match the left-to-right ordering at a lower 
level. I'm afraid we've inadvertently made that assumption in the code.

This can happen:

1. Let's imagine that we have a tree that looks like this:
   root    |   ...    |    A   (internal node at upper level)    |    |    B    |    |    C   (internal node at a lower
level)   |   ...
 

2. While we descend down the tree to insert a tuple, we memorize the 
path A..B..C. This is stored in the node buffer associated with node C.

3. More tuples are inserted to another subtree below B (not shown), 
until node B needs to be split. This produces tree:
    A    |\    | \    B->B2       |       |       C

We still have the path A..B..C memorized in C's node buffer. The 
downlink for C is now actually in B2, but that's ok, because we have the 
code to follow the right links if we can't find the downlink for a node 
in the memorized parent.

4. More tuples are added to another subtree of A, until A has to be 
split. Picksplit decides to keep the downlink for B2 on the original 
page, and moves the downlink for B on the new page, A2:
    A->A2     \ /      X     / \    B->B2       |       |       C

Remember that we still have the path A..B..C memorized in C's node buffer.

5. More tuples are buffered, and we traverse down the tree along the 
path A2->B->... When we look up the node buffer for page B, we update 
the path stored there. It's now A2..B. This fragment of the path is 
shared by the path in C's node buffer.

6. At this point, the path memorized in C's node buffer is A2..B..C. 
This is where things go wrong. While it's true that A2 is the parent of 
B, and it's true that the parent of C can be found by following the 
rightlink from B, A2 is *not* a grandparent of C.

7. More tuples are added below C, and C has to be split. To insert the 
downlink for the new sibling, we re-find the parent for C. The memorized 
path is A2..B..C. We begin by searching for the downlink for C in page 
B. It's not there, so we move right, and find it in B2. The path we're 
working with is now A2..B2..C. When we insert the new downlink into B2, 
it also fills up and has to be split, so recurse up and have to refind 
the parent of B2. We begin looking in the memorized parent, A2. The 
downlink is not there, so we move right. But the downlink for B2 is to 
the left from A2, so we never find it. We walk right until we fall off 
the cliff, and you get the "failed to re-find parent" error.


I think the minimal fix is that after moving right, look up the node 
buffer of the page we moved onto, and use the path memorized for that if 
we have to recurse further up the tree. So in the above example, at step 
7 after we've moved right to node B2, we should look up the memorized 
path for B2 in B2's node buffer. That would give us the correct path, 
A..B2..C.

The management of the path stacks is a bit complicated, anyway. I'll 
think about this some more tomorrow, maybe we can make it simpler, 
knowing that we have to do those extra lookups.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Bug in new buffering GiST build code

From
Heikki Linnakangas
Date:
On 18.05.2012 20:34, Alexander Korotkov wrote:
> On Fri, May 18, 2012 at 8:27 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>>
>> After fixing that, however, I'm now getting another error, much later in
>> the build process:
>>
>> ERROR:  failed to re-find parent for block 123002
>> STATEMENT:  create index i_gisttest on gisttest using gist (t collate "C")
>> WITH (fillfactor=10);
>>
>> I'll continue debugging that, but it seems to be another, unrelated, bug.
>
> Thanks for debugging and fixing that. I'm going to take a look on the
> remaining bug.

After staring at graphs built from gist trees for the whole day, I think 
I finally understand what's wrong:

There's a thinko in the way we maintain the parent paths during 
insertions. It boils down to the fact that in a GiST index, the 
left-to-right ordering as determined by the right-links on the upper 
level does not necessarily match the left-to-right ordering at a lower 
level. I'm afraid we've inadvertently made that assumption in the code.

This can happen:

1. Let's imagine that we have a tree that looks like this:
   root    |   ...    |    A   (internal node at upper level)    |    |    B    |    |    C   (internal node at a lower
level)   |   ...
 

2. While we descend down the tree to insert a tuple, we memorize the 
path A..B..C. This is stored in the node buffer associated with node C.

3. More tuples are inserted to another subtree below B (not shown), 
until node B needs to be split. This produces tree:
    A    |\    | \    B->B2       |       |       C

We still have the path A..B..C memorized in C's node buffer. The 
downlink for C is now actually in B2, but that's ok, because we have the 
code to follow the right links if we can't find the downlink for a node 
in the memorized parent.

4. More tuples are added to another subtree of A, until A has to be 
split. Picksplit decides to keep the downlink for B2 on the original 
page, and moves the downlink for B on the new page, A2:
    A->A2     \ /      X     / \    B->B2       |       |       C

Remember that we still have the path A..B..C memorized in C's node buffer.

5. More tuples are buffered, and we traverse down the tree along the 
path A2->B->... When we look up the node buffer for page B, we update 
the path stored there. It's now A2..B. This fragment of the path is 
shared by the path in C's node buffer.

6. At this point, the path memorized in C's node buffer is A2..B..C. 
This is where things go wrong. While it's true that A2 is the parent of 
B, and it's true that the parent of C can be found by following the 
rightlink from B, A2 is *not* a grandparent of C.

7. More tuples are added below C, and C has to be split. To insert the 
downlink for the new sibling, we re-find the parent for C. The memorized 
path is A2..B..C. We begin by searching for the downlink for C in page 
B. It's not there, so we move right, and find it in B2. The path we're 
working with is now A2..B2..C. When we insert the new downlink into B2, 
it also fills up and has to be split, so recurse up and have to refind 
the parent of B2. We begin looking in the memorized parent, A2. The 
downlink is not there, so we move right. But the downlink for B2 is to 
the left from A2, so we never find it. We walk right until we fall off 
the cliff, and you get the "failed to re-find parent" error.


I think the minimal fix is that after moving right, look up the node 
buffer of the page we moved onto, and use the path memorized for that if 
we have to recurse further up the tree. So in the above example, at step 
7 after we've moved right to node B2, we should look up the memorized 
path for B2 in B2's node buffer. That would give us the correct path, 
A..B2..C.

The management of the path stacks is a bit complicated, anyway. I'll 
think about this some more tomorrow, maybe we can make it simpler, 
knowing that we have to do those extra lookups.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Bug in new buffering GiST build code

From
Alexander Korotkov
Date:
<div class="gmail_quote">Hi!</div><div class="gmail_quote"><br /></div><div class="gmail_quote">On Tue, May 22, 2012 at
12:56AM, Heikki Linnakangas <span dir="ltr"><<a href="mailto:heikki.linnakangas@enterprisedb.com"
target="_blank">heikki.linnakangas@enterprisedb.com</a>></span>wrote:<br /><blockquote class="gmail_quote"
style="margin:00 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">After staring
atgraphs built from gist trees for the whole day, I think I finally understand what's wrong:</div></div><br /> There's
athinko in the way we maintain the parent paths during insertions. It boils down to the fact that in a GiST index, the
left-to-rightordering as determined by the right-links on the upper level does not necessarily match the left-to-right
orderingat a lower level. I'm afraid we've inadvertently made that assumption in the code.<br /><br /> This can
happen:<br/><br /> 1. Let's imagine that we have a tree that looks like this:<br /><br />   root<br />    |<br />  
...<br/>    |<br />    A   (internal node at upper level)<br />    |<br />    |<br />    B<br />    |<br />    |<br />
  C   (internal node at a lower level)<br />    |<br />   ...<br /><br /> 2. While we descend down the tree to insert a
tuple,we memorize the path A..B..C. This is stored in the node buffer associated with node C.<br /><br /> 3. More
tuplesare inserted to another subtree below B (not shown), until node B needs to be split. This produces tree:<br /><br
/>   A<br />    |\<br />    | \<br />    B->B2<br />       |<br />       |<br />       C<br /><br /> We still have
thepath A..B..C memorized in C's node buffer. The downlink for C is now actually in B2, but that's ok, because we have
thecode to follow the right links if we can't find the downlink for a node in the memorized parent.<br /><br /> 4. More
tuplesare added to another subtree of A, until A has to be split. Picksplit decides to keep the downlink for B2 on the
originalpage, and moves the downlink for B on the new page, A2:<br /><br />    A->A2<br />     \ /<br />      X<br
/>    / \<br />    B->B2<br />       |<br />       |<br />       C<br /><br /> Remember that we still have the path
A..B..Cmemorized in C's node buffer.<br /><br /> 5. More tuples are buffered, and we traverse down the tree along the
pathA2->B->... When we look up the node buffer for page B, we update the path stored there. It's now A2..B. This
fragmentof the path is shared by the path in C's node buffer.<br /><br /> 6. At this point, the path memorized in C's
nodebuffer is A2..B..C. This is where things go wrong. While it's true that A2 is the parent of B, and it's true that
theparent of C can be found by following the rightlink from B, A2 is *not* a grandparent of C.<br /><br /> 7. More
tuplesare added below C, and C has to be split. To insert the downlink for the new sibling, we re-find the parent for
C.The memorized path is A2..B..C. We begin by searching for the downlink for C in page B. It's not there, so we move
right,and find it in B2. The path we're working with is now A2..B2..C. When we insert the new downlink into B2, it also
fillsup and has to be split, so recurse up and have to refind the parent of B2. We begin looking in the memorized
parent,A2. The downlink is not there, so we move right. But the downlink for B2 is to the left from A2, so we never
findit. We walk right until we fall off the cliff, and you get the "failed to re-find parent" error.<br /><br /><br />
Ithink the minimal fix is that after moving right, look up the node buffer of the page we moved onto, and use the path
memorizedfor that if we have to recurse further up the tree. So in the above example, at step 7 after we've moved right
tonode B2, we should look up the memorized path for B2 in B2's node buffer. That would give us the correct path,
A..B2..C.<br/><br /> The management of the path stacks is a bit complicated, anyway. I'll think about this some more
tomorrow,maybe we can make it simpler, knowing that we have to do those extra lookups.</blockquote><div
class="gmail_quote"><br/></div>WOW! You did enormous work on exploring that!</div><div class="gmail_quote">I just
arrivedfrom PGCon and start looking at it when find you've already done comprehensive research of this
problem.</div><divclass="gmail_quote"><div class="gmail_quote">On the step 5 if we've NSN in GISTBufferingInsertStack
structure,we could detect situation of changing parent of splitted page. Using this we could save copy
of GISTBufferingInsertStackfor B2 with original parent A, because we know split of B to occur after
creating GISTBufferingInsertStackbut before split of A. The question is how to find this copy from C, hash?</div><br
/>------<br/>With best regards,<br />Alexander Korotkov.<br /></div> 

Re: Bug in new buffering GiST build code

From
Heikki Linnakangas
Date:
On 22.05.2012 01:09, Alexander Korotkov wrote:
> Hi!
>
> On Tue, May 22, 2012 at 12:56 AM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>> The management of the path stacks is a bit complicated, anyway. I'll think
>> about this some more tomorrow, maybe we can make it simpler, knowing that
>> we have to do those extra lookups.
>
> WOW! You did enormous work on exploring that!
> I just arrived from PGCon and start looking at it when find you've already
> done comprehensive research of this problem.
> On the step 5 if we've NSN in GISTBufferingInsertStack structure, we could
> detect situation of changing parent of splitted page. Using this we could
> save copy of GISTBufferingInsertStack for B2 with original parent A,
> because we know split of B to occur after creating GISTBufferingInsertStack
> but before split of A. The question is how to find this copy from C, hash?

I tested a patch that adds the extra getNodeBuffer() call after
refinding the parent, as discussed. However, I'm still getting a "failed
to-refind parent" error later in the build, so I think we're still
missing some corner case.

I think we should rewrite the way we track the parents completely.
Rather than keep a path stack attached to every node buffer, let's just
maintain a second hash table that contains the parent of every internal
node. Whenever a downlink is moved to another page, update the hash
table with the new information. That way we always have up-to-date
information about the parent of every internal node.

That's much easier to understand than the path stack structures we have
now. I think the overall memory consumption will be about the same too.
Although we need the extra hash table with one entry for every internal
node, we get rid of the path stack structs, which are also one per every
internal node at the moment.

I believe it is faster too. I added some instrumentation to the existing
gist code (with the additional getNodeBuffer() call added to fix this
bug), to measure the time spent moving right, when refinding the parent
of a page. I added gettimeofday() calls before and after moving right,
and summed the total. In my test case, the final index size was about
19GB, and the index build took 3545 seconds (59 minutes). Of that time,
580 seconds (~ 10 minutes) was spent moving right to refind parents.
That's a lot. I also printed a line whenever a refind operation had to
move right 20 pages or more. That happened 2482 times during the build,
in the worst case we moved right over 40000 pages.

Attached is a patch to replace the path stacks with a hash table. With
this patch, the index build time in my test case dropped from 59 minutes
to about 55 minutes. I don'ẗ know how representative or repeatable this
test case is, but this definitely seems very worthwhile, not only
because it fixes the bug and makes the code simpler, but also on
performance grounds.

Alexander, do you still have the test environments and data lying around
that you used for GiST buffering testing last summer? Could you rerun
some of those tests with this patch?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: Bug in new buffering GiST build code

From
Alexander Korotkov
Date:
On Sat, May 26, 2012 at 12:33 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
I think we should rewrite the way we track the parents completely. Rather than keep a path stack attached to every node buffer, let's just maintain a second hash table that contains the parent of every internal node. Whenever a downlink is moved to another page, update the hash table with the new information. That way we always have up-to-date information about the parent of every internal node.

That's much easier to understand than the path stack structures we have now. I think the overall memory consumption will be about the same too. Although we need the extra hash table with one entry for every internal node, we get rid of the path stack structs, which are also one per every internal node at the moment.

I believe it is faster too. I added some instrumentation to the existing gist code (with the additional getNodeBuffer() call added to fix this bug), to measure the time spent moving right, when refinding the parent of a page. I added gettimeofday() calls before and after moving right, and summed the total. In my test case, the final index size was about 19GB, and the index build took 3545 seconds (59 minutes). Of that time, 580 seconds (~ 10 minutes) was spent moving right to refind parents. That's a lot. I also printed a line whenever a refind operation had to move right 20 pages or more. That happened 2482 times during the build, in the worst case we moved right over 40000 pages.

Attached is a patch to replace the path stacks with a hash table. With this patch, the index build time in my test case dropped from 59 minutes to about 55 minutes. I don'ẗ know how representative or repeatable this test case is, but this definitely seems very worthwhile, not only because it fixes the bug and makes the code simpler, but also on performance grounds.

Cool, seems that we've both simplier and faster implementation of finding parent. Thanks!
 
Alexander, do you still have the test environments and data lying around that you used for GiST buffering testing last summer? Could you rerun some of those tests with this patch?

I think I can restore test environment and data. Will rerun tests soon.

------
With best regards,
Alexander Korotkov.

Re: Bug in new buffering GiST build code

From
Heikki Linnakangas
Date:
On 28.05.2012 00:46, Alexander Korotkov wrote:
> On Sat, May 26, 2012 at 12:33 AM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>> Attached is a patch to replace the path stacks with a hash table. With
>> this patch, the index build time in my test case dropped from 59 minutes to
>> about 55 minutes. I don'ẗ know how representative or repeatable this test
>> case is, but this definitely seems very worthwhile, not only because it
>> fixes the bug and makes the code simpler, but also on performance grounds.
>
> Cool, seems that we've both simplier and faster implementation of finding
> parent. Thanks!

Ok, I committed this now.

I also spotted and fixed another little oversight: the temporary file 
didn't get deleted after the index build.

>> Alexander, do you still have the test environments and data lying around
>> that you used for GiST buffering testing last summer? Could you rerun some
>> of those tests with this patch?
>
> I think I can restore test environment and data. Will rerun tests soon.

Thanks!

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Bug in new buffering GiST build code

From
Alexander Korotkov
Date:
On Wed, May 30, 2012 at 1:01 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
I also spotted and fixed another little oversight: the temporary file didn't get deleted after the index build.

I've one note not directly related to buffering build. While I debugging buffering GiST index build, backend was frequently crashed. After recovery partially built index file was remain. Do we have some tool to detect such "dead" files? If not, probably we need some?

------
With best regards,
Alexander Korotkov.

Re: Bug in new buffering GiST build code

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> I've one note not directly related to buffering build. While I debugging
> buffering GiST index build, backend was frequently crashed. After recovery
> partially built index file was remain. Do we have some tool to detect such
> "dead" files? If not, probably we need some?

Well, it's not that hard to check for orphan files (I think
contrib/oid2name can do that, or perhaps could be extended to).  I don't
like the idea of the postmaster automatically removing such files, if
that's where you were headed.  Too much risk of deleting important data.
        regards, tom lane


Re: Bug in new buffering GiST build code

From
Alexander Korotkov
Date:
On Mon, May 28, 2012 at 1:46 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Sat, May 26, 2012 at 12:33 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Alexander, do you still have the test environments and data lying around that you used for GiST buffering testing last summer? Could you rerun some of those tests with this patch?

I think I can restore test environment and data. Will rerun tests soon.

I rerun some of tests. There are index build times in seconds for old way of parent refind and new way of it.

                  old      new
usnoa2           2385     2452
usnoa2_shuffled  8131     8055
uniform          8327     8359

I thinks difference can be described by round error.
Indexes seem to be exactly same. It's predictable because changing algorithm of parent refind shouldn't change the result.

------
With best regards,
Alexander Korotkov.

Re: Bug in new buffering GiST build code

From
Alexander Korotkov
Date:
On Tue, Jun 5, 2012 at 2:00 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Mon, May 28, 2012 at 1:46 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Sat, May 26, 2012 at 12:33 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Alexander, do you still have the test environments and data lying around that you used for GiST buffering testing last summer? Could you rerun some of those tests with this patch?

I think I can restore test environment and data. Will rerun tests soon.

I rerun some of tests. There are index build times in seconds for old way of parent refind and new way of it.

                  old      new
usnoa2           2385     2452
usnoa2_shuffled  8131     8055
uniform          8327     8359

I thinks difference can be described by round error.

Oh, I mean not "round" error, but "random". I.e. not exactly same state of shared buffers at index build start and so on.
 
------
With best regards,
Alexander Korotkov.

Re: Bug in new buffering GiST build code

From
Heikki Linnakangas
Date:
On 05.06.2012 09:45, Alexander Korotkov wrote:
> On Tue, Jun 5, 2012 at 2:00 AM, Alexander Korotkov<aekorotkov@gmail.com>wrote:
>
>> On Mon, May 28, 2012 at 1:46 AM, Alexander Korotkov<aekorotkov@gmail.com>wrote:
>>
>>> On Sat, May 26, 2012 at 12:33 AM, Heikki Linnakangas<
>>> heikki.linnakangas@enterprisedb.com>  wrote:
>>>
>>>> Alexander, do you still have the test environments and data lying around
>>>> that you used for GiST buffering testing last summer? Could you rerun some
>>>> of those tests with this patch?
>>>
>>> I think I can restore test environment and data. Will rerun tests soon.
>>
>> I rerun some of tests. There are index build times in seconds for old way
>> of parent refind and new way of it.
>>
>>                    old      new
>> usnoa2           2385     2452
>> usnoa2_shuffled  8131     8055
>> uniform          8327     8359
>>
>> I thinks difference can be described by round error.
>
> Oh, I mean not "round" error, but "random". I.e. not exactly same state of
> shared buffers at index build start and so on.

Thanks. I was expecting a small performance gain from the new approach, 
but I guess not. Oh well, that would've just been a bonus, the important 
thing is that it now works.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com