Re: GiST for range types (was Re: Range Types - typo + NULL string constructor) - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: GiST for range types (was Re: Range Types - typo + NULL string constructor)
Date
Msg-id 1324540006.7608.79.camel@jdavis
Whole thread Raw
In response to Re: GiST for range types (was Re: Range Types - typo + NULL string constructor)  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: GiST for range types (was Re: Range Types - typo + NULL string constructor)  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
On Wed, 2011-12-14 at 01:04 +0400, Alexander Korotkov wrote:
> Hi!
>
Thank you! Attached a few changes:

* Change "ordinal" to "normal" for clarity (at least to me).
* Some comment cleanup
* Change classes_groups to be an enum of SPLIT_LEFT and SPLIT_RIGHT,
rather than using 1 and 2.
* Changed the "bounds_lower" and "bounds_upper" variables into
"by_lower" and "by_upper" to indicate that the arrays are distinguished
by sort order. It was confusing me to read it otherwise.

A few comments:

* In range_gist_picksplit, it would be nice to have a little bit more
intuitive description of what's going on with the nonEmptyCount and
nonInfCount numbers. For instance, it appears to depend on the fact that
a range must either be in nonEmptyCount or in nonInfCount. Also, can you
describe the reason you're multiplying by two and taking the absolute
value? It seems to work, but I am missing the intuition behind those
operations.

* The penalty function is fairly hard to read still. At a high level, I
think we're trying to accomplish a few things (in order from most to
least important):
  - Keep normal ranges separate.
  - Avoid broadening the class of the original predicate (e.g. turning
single-infinite into double-infinite).
  - Avoid broadening (as determined by subtype_diff) the original
predicate.
  - Favor adding ranges to narrower original predicates.

Do you agree? If so, perhaps we can attack those more directly and it
might be a little more readable.

Additionally, the arbitrary numbers might become a problem. Can we
choose better constants there? They would still be arbitrary when
compared with real numbers derived from subtype_diff, but maybe we can
still do better than what's there.

* Regarding the leftover "common" entries that can go to either side:
what is the "delta" measure trying to accomplish? When I work through
some examples, it seems to favor putting larger common ranges on the
left (small delta) and smaller common ranges on the right (smaller
delta). Why is that good? Or did I misread the code?

Intuitively, I would think that we'd want to push the ranges with lower
upper bounds to the left and higher lower bounds to the right -- in
other words, recurse. Obviously, we'd need to make sure it terminated at
some point, but splitting the common entries does seem like a smaller
version of the original problem. Thoughts?

Thank you for the helpful comments! It took me a while to work through
the logic, but I would have been lost completely without the comments
around the double sorting split.

Regards,
    Jeff Davis

Attachment

pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Page Checksums + Double Writes
Next
From: Jeff Davis
Date:
Subject: Re: GiST for range types (was Re: Range Types - typo + NULL string constructor)