Thread: GiST range-contained-by searches versus empty ranges

GiST range-contained-by searches versus empty ranges

From
Tom Lane
Date:
I started to wonder why the test in range_gist_consistent_int() for
RANGESTRAT_CONTAINED_BY was "return true" (ie, search the entire index)
rather than range_overlaps, which is what is tested in the comparable
case in rtree_internal_consistent().  The regression tests showed me
how come: an empty-range index entry should be reported as being
contained by anything, but range_overlaps won't necessarily find such
entries.  (The rtree case is all right because there is no equivalent of
an empty range in boxes, circles, or polygons.)

I am not satisfied with this state of affairs though: range_contained_by
really ought to be efficiently indexable, but with the current coding
an index search is nearly useless.  Also, so far as I can see, the
current penalty function allows empty-range entries to be scattered
basically anywhere in the index, making a search for "= empty" pretty
inefficient too.

The first solution that comes to mind is to make the penalty and
picksplit functions forcibly segregate empty ranges from others, that is
a split will never put empty ranges together with non-empty ones.  Then,
we can assume that a non-empty internal node doesn't represent any empty
leaf entries, and avoid descending to it when it doesn't overlap the
target range.  Then the equality-search case could be improved too.

Thoughts, better ideas?
        regards, tom lane


Re: GiST range-contained-by searches versus empty ranges

From
Tom Lane
Date:
I wrote:
> I started to wonder why the test in range_gist_consistent_int() for
> RANGESTRAT_CONTAINED_BY was "return true" (ie, search the entire index)
> rather than range_overlaps, which is what is tested in the comparable
> case in rtree_internal_consistent().  The regression tests showed me
> how come: an empty-range index entry should be reported as being
> contained by anything, but range_overlaps won't necessarily find such
> entries.  (The rtree case is all right because there is no equivalent of
> an empty range in boxes, circles, or polygons.)

> I am not satisfied with this state of affairs though: range_contained_by
> really ought to be efficiently indexable, but with the current coding
> an index search is nearly useless.  Also, so far as I can see, the
> current penalty function allows empty-range entries to be scattered
> basically anywhere in the index, making a search for "= empty" pretty
> inefficient too.

> The first solution that comes to mind is to make the penalty and
> picksplit functions forcibly segregate empty ranges from others, that is
> a split will never put empty ranges together with non-empty ones.  Then,
> we can assume that a non-empty internal node doesn't represent any empty
> leaf entries, and avoid descending to it when it doesn't overlap the
> target range.  Then the equality-search case could be improved too.

After looking through the code a bit, it seems that this isn't possible
to solve with a localized fix after all.  Yes, the opclass-specific
picksplit function can positively guarantee to separate empty ranges
from others when it splits a page ... but it doesn't have control over
what happens when an item is added to an index without a page split
occurring.  Consider a scenario where we have a multi-page GiST index
containing no empty ranges, and an empty range item needs to be added.
As the code is currently written, the central GiST code will choose one
of the existing root-page items to add the new item to; there is no way
to persuade it that a new top-level item would be a better idea.  What's
more, if we try to do so by having the penalty function return infinity
for adding an empty range to a nonempty item, we'll end up with empty
ranges being randomly added to any/all of the items, since we'll get the
exact same penalty for all the nonempty items.  So, far from segregating
the empties, we'll just end up with them cluttered all over the index,
just like now.

This is not really a new problem.  There is code in there that is trying
to segregate NULLs from non-NULL entries, and it is equally
broken/ineffective.

I'm inclined to propose that we should add some logic to say that
merging a new item into an existing one is forbidden if the penalty
function returns plus-infinity for the case.  If all existing items on a
page return infinity, a new item must be added to the page (possibly
causing a page split) instead of inserting into any existing one.
(Of course, gistpenalty() should be fixed to return infinity, not just a
randomly chosen large value, for the null-and-not-null case.)

Comments?
        regards, tom lane


Re: GiST range-contained-by searches versus empty ranges

From
Alexander Korotkov
Date:
The first solution that comes to mind is to make the penalty and
picksplit functions forcibly segregate empty ranges from others, that is
a split will never put empty ranges together with non-empty ones.  Then,
we can assume that a non-empty internal node doesn't represent any empty
leaf entries, and avoid descending to it when it doesn't overlap the
target range.  Then the equality-search case could be improved too.

Thoughts, better ideas?

Have you seen my patch about GiST for range types?
There mentioned problem is solved by introduction of RANGE_CONTAIN_EMPTY bit in range flags. This bit is only used in GiST index and means that there are underlying empty ranges.

------
With best regards, 
Alexander Korotkov. 

Re: GiST range-contained-by searches versus empty ranges

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
>> The first solution that comes to mind is to make the penalty and
>> picksplit functions forcibly segregate empty ranges from others, that is
>> a split will never put empty ranges together with non-empty ones.

> Have you seen my patch about GiST for range types?
> https://commitfest.postgresql.org/action/patch_view?id=682
> There mentioned problem is solved by introduction of RANGE_CONTAIN_EMPTY
> bit in range flags. This bit is only used in GiST index and means that
> there are underlying empty ranges.

Yeah, I had been coming around to the idea that we'd need some kluge
like that.  The forcible-segregation idea falls apart as soon as you
start thinking about multiple-column indexes --- if you have several
columns each demanding to segregate a certain kind of data, you can
easily overrun the space available in the root page, whereupon it's no
longer possible to guarantee anything about the contents of child
pages.

I think this is a "must fix" for 9.2, because once we release it will
be too late to do anything without invalidating existing indexes.
        regards, tom lane


Re: GiST range-contained-by searches versus empty ranges

From
Jeff Davis
Date:
On Sat, 2011-11-26 at 19:26 -0500, Tom Lane wrote:
> I'm inclined to propose that we should add some logic to say that
> merging a new item into an existing one is forbidden if the penalty
> function returns plus-infinity for the case.  If all existing items on a
> page return infinity, a new item must be added to the page (possibly
> causing a page split) instead of inserting into any existing one.
> (Of course, gistpenalty() should be fixed to return infinity, not just a
> randomly chosen large value, for the null-and-not-null case.)

This seems less important now that you've committed the flag for
"contains empty ranges".

However, it still sounds like a useful improvement to me.

Regards,Jeff Davis



Re: GiST range-contained-by searches versus empty ranges

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2011-11-26 at 19:26 -0500, Tom Lane wrote:
>> I'm inclined to propose that we should add some logic to say that
>> merging a new item into an existing one is forbidden if the penalty
>> function returns plus-infinity for the case.  If all existing items on a
>> page return infinity, a new item must be added to the page (possibly
>> causing a page split) instead of inserting into any existing one.
>> (Of course, gistpenalty() should be fixed to return infinity, not just a
>> randomly chosen large value, for the null-and-not-null case.)

> This seems less important now that you've committed the flag for
> "contains empty ranges".

> However, it still sounds like a useful improvement to me.

Actually, I'd pretty much decided that it was unworkable because of the
fact that GiST supports multi-column indexes.  You can't have N columns
all trying to segregate their own special values without running out of
room on the root page.

It might well be that the GiST code should consider "add a new node to
the current internal page" in addition to "push the new value into one
of the existing nodes".  But we can't let opclasses assume that the
division they'd like to have is guaranteed.
        regards, tom lane