Re: Bug in new buffering GiST build code - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Bug in new buffering GiST build code
Date
Msg-id CAPpHfdtVeLkeT0Je1ihbOphSjf0b5mg8A8Um2zUVvaomSqM_Lw@mail.gmail.com
Whole thread Raw
In response to Re: Bug in new buffering GiST build code  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: Bug in new buffering GiST build code
List pgsql-hackers
<div class="gmail_quote">Hi!</div><div class="gmail_quote"><br /></div><div class="gmail_quote">On Tue, May 22, 2012 at
12:56AM, Heikki Linnakangas <span dir="ltr"><<a href="mailto:heikki.linnakangas@enterprisedb.com"
target="_blank">heikki.linnakangas@enterprisedb.com</a>></span>wrote:<br /><blockquote class="gmail_quote"
style="margin:00 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">After staring
atgraphs built from gist trees for the whole day, I think I finally understand what's wrong:</div></div><br /> There's
athinko in the way we maintain the parent paths during insertions. It boils down to the fact that in a GiST index, the
left-to-rightordering as determined by the right-links on the upper level does not necessarily match the left-to-right
orderingat a lower level. I'm afraid we've inadvertently made that assumption in the code.<br /><br /> This can
happen:<br/><br /> 1. Let's imagine that we have a tree that looks like this:<br /><br />   root<br />    |<br />  
...<br/>    |<br />    A   (internal node at upper level)<br />    |<br />    |<br />    B<br />    |<br />    |<br />
  C   (internal node at a lower level)<br />    |<br />   ...<br /><br /> 2. While we descend down the tree to insert a
tuple,we memorize the path A..B..C. This is stored in the node buffer associated with node C.<br /><br /> 3. More
tuplesare inserted to another subtree below B (not shown), until node B needs to be split. This produces tree:<br /><br
/>   A<br />    |\<br />    | \<br />    B->B2<br />       |<br />       |<br />       C<br /><br /> We still have
thepath A..B..C memorized in C's node buffer. The downlink for C is now actually in B2, but that's ok, because we have
thecode to follow the right links if we can't find the downlink for a node in the memorized parent.<br /><br /> 4. More
tuplesare added to another subtree of A, until A has to be split. Picksplit decides to keep the downlink for B2 on the
originalpage, and moves the downlink for B on the new page, A2:<br /><br />    A->A2<br />     \ /<br />      X<br
/>    / \<br />    B->B2<br />       |<br />       |<br />       C<br /><br /> Remember that we still have the path
A..B..Cmemorized in C's node buffer.<br /><br /> 5. More tuples are buffered, and we traverse down the tree along the
pathA2->B->... When we look up the node buffer for page B, we update the path stored there. It's now A2..B. This
fragmentof the path is shared by the path in C's node buffer.<br /><br /> 6. At this point, the path memorized in C's
nodebuffer is A2..B..C. This is where things go wrong. While it's true that A2 is the parent of B, and it's true that
theparent of C can be found by following the rightlink from B, A2 is *not* a grandparent of C.<br /><br /> 7. More
tuplesare added below C, and C has to be split. To insert the downlink for the new sibling, we re-find the parent for
C.The memorized path is A2..B..C. We begin by searching for the downlink for C in page B. It's not there, so we move
right,and find it in B2. The path we're working with is now A2..B2..C. When we insert the new downlink into B2, it also
fillsup and has to be split, so recurse up and have to refind the parent of B2. We begin looking in the memorized
parent,A2. The downlink is not there, so we move right. But the downlink for B2 is to the left from A2, so we never
findit. We walk right until we fall off the cliff, and you get the "failed to re-find parent" error.<br /><br /><br />
Ithink the minimal fix is that after moving right, look up the node buffer of the page we moved onto, and use the path
memorizedfor that if we have to recurse further up the tree. So in the above example, at step 7 after we've moved right
tonode B2, we should look up the memorized path for B2 in B2's node buffer. That would give us the correct path,
A..B2..C.<br/><br /> The management of the path stacks is a bit complicated, anyway. I'll think about this some more
tomorrow,maybe we can make it simpler, knowing that we have to do those extra lookups.</blockquote><div
class="gmail_quote"><br/></div>WOW! You did enormous work on exploring that!</div><div class="gmail_quote">I just
arrivedfrom PGCon and start looking at it when find you've already done comprehensive research of this
problem.</div><divclass="gmail_quote"><div class="gmail_quote">On the step 5 if we've NSN in GISTBufferingInsertStack
structure,we could detect situation of changing parent of splitted page. Using this we could save copy
of GISTBufferingInsertStackfor B2 with original parent A, because we know split of B to occur after
creating GISTBufferingInsertStackbut before split of A. The question is how to find this copy from C, hash?</div><br
/>------<br/>With best regards,<br />Alexander Korotkov.<br /></div> 

pgsql-hackers by date:

Previous
From: Florian Pflug
Date:
Subject: Re: read() returns ERANGE in Mac OS X
Next
From: Alexander Korotkov
Date:
Subject: Re: Patch: add conversion from pg_wchar to multibyte