Thread: GiST range-contained-by searches versus empty ranges
I started to wonder why the test in range_gist_consistent_int() for RANGESTRAT_CONTAINED_BY was "return true" (ie, search the entire index) rather than range_overlaps, which is what is tested in the comparable case in rtree_internal_consistent(). The regression tests showed me how come: an empty-range index entry should be reported as being contained by anything, but range_overlaps won't necessarily find such entries. (The rtree case is all right because there is no equivalent of an empty range in boxes, circles, or polygons.) I am not satisfied with this state of affairs though: range_contained_by really ought to be efficiently indexable, but with the current coding an index search is nearly useless. Also, so far as I can see, the current penalty function allows empty-range entries to be scattered basically anywhere in the index, making a search for "= empty" pretty inefficient too. The first solution that comes to mind is to make the penalty and picksplit functions forcibly segregate empty ranges from others, that is a split will never put empty ranges together with non-empty ones. Then, we can assume that a non-empty internal node doesn't represent any empty leaf entries, and avoid descending to it when it doesn't overlap the target range. Then the equality-search case could be improved too. Thoughts, better ideas? regards, tom lane
I wrote: > I started to wonder why the test in range_gist_consistent_int() for > RANGESTRAT_CONTAINED_BY was "return true" (ie, search the entire index) > rather than range_overlaps, which is what is tested in the comparable > case in rtree_internal_consistent(). The regression tests showed me > how come: an empty-range index entry should be reported as being > contained by anything, but range_overlaps won't necessarily find such > entries. (The rtree case is all right because there is no equivalent of > an empty range in boxes, circles, or polygons.) > I am not satisfied with this state of affairs though: range_contained_by > really ought to be efficiently indexable, but with the current coding > an index search is nearly useless. Also, so far as I can see, the > current penalty function allows empty-range entries to be scattered > basically anywhere in the index, making a search for "= empty" pretty > inefficient too. > The first solution that comes to mind is to make the penalty and > picksplit functions forcibly segregate empty ranges from others, that is > a split will never put empty ranges together with non-empty ones. Then, > we can assume that a non-empty internal node doesn't represent any empty > leaf entries, and avoid descending to it when it doesn't overlap the > target range. Then the equality-search case could be improved too. After looking through the code a bit, it seems that this isn't possible to solve with a localized fix after all. Yes, the opclass-specific picksplit function can positively guarantee to separate empty ranges from others when it splits a page ... but it doesn't have control over what happens when an item is added to an index without a page split occurring. Consider a scenario where we have a multi-page GiST index containing no empty ranges, and an empty range item needs to be added. As the code is currently written, the central GiST code will choose one of the existing root-page items to add the new item to; there is no way to persuade it that a new top-level item would be a better idea. What's more, if we try to do so by having the penalty function return infinity for adding an empty range to a nonempty item, we'll end up with empty ranges being randomly added to any/all of the items, since we'll get the exact same penalty for all the nonempty items. So, far from segregating the empties, we'll just end up with them cluttered all over the index, just like now. This is not really a new problem. There is code in there that is trying to segregate NULLs from non-NULL entries, and it is equally broken/ineffective. I'm inclined to propose that we should add some logic to say that merging a new item into an existing one is forbidden if the penalty function returns plus-infinity for the case. If all existing items on a page return infinity, a new item must be added to the page (possibly causing a page split) instead of inserting into any existing one. (Of course, gistpenalty() should be fixed to return infinity, not just a randomly chosen large value, for the null-and-not-null case.) Comments? regards, tom lane
The first solution that comes to mind is to make the penalty and
picksplit functions forcibly segregate empty ranges from others, that is
a split will never put empty ranges together with non-empty ones. Then,
we can assume that a non-empty internal node doesn't represent any empty
leaf entries, and avoid descending to it when it doesn't overlap the
target range. Then the equality-search case could be improved too.
Thoughts, better ideas?
Have you seen my patch about GiST for range types?
There mentioned problem is solved by introduction of RANGE_CONTAIN_EMPTY bit in range flags. This bit is only used in GiST index and means that there are underlying empty ranges.
------
With best regards,
Alexander Korotkov.
Alexander Korotkov <aekorotkov@gmail.com> writes: >> The first solution that comes to mind is to make the penalty and >> picksplit functions forcibly segregate empty ranges from others, that is >> a split will never put empty ranges together with non-empty ones. > Have you seen my patch about GiST for range types? > https://commitfest.postgresql.org/action/patch_view?id=682 > There mentioned problem is solved by introduction of RANGE_CONTAIN_EMPTY > bit in range flags. This bit is only used in GiST index and means that > there are underlying empty ranges. Yeah, I had been coming around to the idea that we'd need some kluge like that. The forcible-segregation idea falls apart as soon as you start thinking about multiple-column indexes --- if you have several columns each demanding to segregate a certain kind of data, you can easily overrun the space available in the root page, whereupon it's no longer possible to guarantee anything about the contents of child pages. I think this is a "must fix" for 9.2, because once we release it will be too late to do anything without invalidating existing indexes. regards, tom lane
On Sat, 2011-11-26 at 19:26 -0500, Tom Lane wrote: > I'm inclined to propose that we should add some logic to say that > merging a new item into an existing one is forbidden if the penalty > function returns plus-infinity for the case. If all existing items on a > page return infinity, a new item must be added to the page (possibly > causing a page split) instead of inserting into any existing one. > (Of course, gistpenalty() should be fixed to return infinity, not just a > randomly chosen large value, for the null-and-not-null case.) This seems less important now that you've committed the flag for "contains empty ranges". However, it still sounds like a useful improvement to me. Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Sat, 2011-11-26 at 19:26 -0500, Tom Lane wrote: >> I'm inclined to propose that we should add some logic to say that >> merging a new item into an existing one is forbidden if the penalty >> function returns plus-infinity for the case. If all existing items on a >> page return infinity, a new item must be added to the page (possibly >> causing a page split) instead of inserting into any existing one. >> (Of course, gistpenalty() should be fixed to return infinity, not just a >> randomly chosen large value, for the null-and-not-null case.) > This seems less important now that you've committed the flag for > "contains empty ranges". > However, it still sounds like a useful improvement to me. Actually, I'd pretty much decided that it was unworkable because of the fact that GiST supports multi-column indexes. You can't have N columns all trying to segregate their own special values without running out of room on the root page. It might well be that the GiST code should consider "add a new node to the current internal page" in addition to "push the new value into one of the existing nodes". But we can't let opclasses assume that the division they'd like to have is guaranteed. regards, tom lane