Thread: GiST: PickSplit and multi-attr indexes

GiST: PickSplit and multi-attr indexes

From

Neil Conway

Date:

14 November 2004, 09:22:03

Oleg & Teodor,

If I understand the code correctly, GiST will only pass the first 
attribute of each index tuple to the user-defined PickSplit method when 
it wants to split a node. (see circa line 1269 of gist.c)

Is this a wise design decision? Granted, in many situations the first 
attribute in the index is sufficient to make a reasonable decision about 
how to divide the node into two halves, but I don't think that is 
universally true. For example, consider a two column index whose first 
attribute has a small number of distinct values. It could well be that 
all the first attribute values in a node-to-be-split would be the same. 
Only passing the first attribute to PickSplit would result in an 
essentially random distribution of tuples among the split nodes, rather 
than allowing the GiST extension to make use of the second attribution 
to partition the nodes. That's an extreme example, but it is easy to 
construct more realistic scenarios (basically, any situation in which 
the cardinality of the first index attribute is low -- a reasonably 
common occurrence with a multi-column index, I believe).

I'm not sure whether this would be a problem in practice. Speculation: 
repeated invocations of PickSplit are one of the main factors in 
deriving the ultimate shape of the GiST tree. Poor distribution of keys 
resulting from PickSplit would eventually result in unnecessarily loose 
bounding predicates in internal nodes, which would hurt performance.

Comments welcome,

Neil

Re: GiST: PickSplit and multi-attr indexes

From

Tom Lane

Date:

14 November 2004, 23:55:20

Neil Conway <neilc@samurai.com> writes:
> If I understand the code correctly, GiST will only pass the first 
> attribute of each index tuple to the user-defined PickSplit method when 
> it wants to split a node. (see circa line 1269 of gist.c)

> Is this a wise design decision?

It's probably just a hangover from the days when GiST didn't support
multi-column indexes at all.  I agree it should be changed.  But note
you will then have to cope with NULL values.
        regards, tom lane

Re: GiST: PickSplit and multi-attr indexes

From

Neil Conway

Date:

15 November 2004, 04:15:44

On Sun, 2004-11-14 at 18:54 -0500, Tom Lane wrote:
> It's probably just a hangover from the days when GiST didn't support
> multi-column indexes at all.  I agree it should be changed.

I'm not sure the right way to fix it (at least without significant
changes to the GiST API). At present, the PickSplit() method is passed a
vector of GISTENTRYs and fills in a GIST_SPLITVEC. The GISTENTRYs
correspond to the first attributes of all the tuples in the node to be
split. There is no provision for the GiST extension to be informed about
any additional attributes in the index tuples. Even if we changed the
API to allow that, the GiST extension would have a hard time making a
reasonable decision in the multi-attribute case: the additional index
attributes could well be implemented using a different GiST extension.

> But note you will then have to cope with NULL values.

Yes -- I'm not sure offhand why GiST does not allow leading NULL values
in index attributes, but that ought to be fixed anyway.

-Neil

Re: GiST: PickSplit and multi-attr indexes

From

Oleg Bartunov

Date:

15 November 2004, 12:43:29

Neil,

I think we should not touch this right now unless you certainly know
algorithm proven to be good in general case. Did you already got some idea ?
    Oleg

On Mon, 15 Nov 2004, Neil Conway wrote:

> On Sun, 2004-11-14 at 18:54 -0500, Tom Lane wrote:
>> It's probably just a hangover from the days when GiST didn't support
>> multi-column indexes at all.  I agree it should be changed.
>
> I'm not sure the right way to fix it (at least without significant
> changes to the GiST API). At present, the PickSplit() method is passed a
> vector of GISTENTRYs and fills in a GIST_SPLITVEC. The GISTENTRYs
> correspond to the first attributes of all the tuples in the node to be
> split. There is no provision for the GiST extension to be informed about
> any additional attributes in the index tuples. Even if we changed the
> API to allow that, the GiST extension would have a hard time making a
> reasonable decision in the multi-attribute case: the additional index
> attributes could well be implemented using a different GiST extension.
>
>> But note you will then have to cope with NULL values.
>
> Yes -- I'm not sure offhand why GiST does not allow leading NULL values
> in index attributes, but that ought to be fixed anyway.
>
> -Neil
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
>               http://archives.postgresql.org
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: GiST: PickSplit and multi-attr indexes

From

Tom Lane

Date:

15 November 2004, 15:19:56

Neil Conway <neilc@samurai.com> writes:
> I'm not sure the right way to fix it (at least without significant
> changes to the GiST API). At present, the PickSplit() method is passed a
> vector of GISTENTRYs and fills in a GIST_SPLITVEC. The GISTENTRYs
> correspond to the first attributes of all the tuples in the node to be
> split. There is no provision for the GiST extension to be informed about
> any additional attributes in the index tuples. Even if we changed the
> API to allow that, the GiST extension would have a hard time making a
> reasonable decision in the multi-attribute case: the additional index
> attributes could well be implemented using a different GiST extension.

Right, when you consider attributes of different types it gets tough.

I'm not familiar with the details of the GiST code, but would it work to
generalize PickSplit to return a three-way classification?  That is,
instead of actually splitting the node, have it identify each item as
"definitely left", "definitely right", or "don't care" (think "less",
"greater", or "equal to" the desired split value).  If there is another
index column, you then invoke its PickSplit on just the "don't care" set
to try to classify them; and so on.  Once you are out of columns, any
remaining "don't care" set is split using some standardized method like
trying to equalize the physical space.

Of course this requires API changes for PickSplit (not only a more
complex result, but it will have to deal with null inputs).  But it
seems like it would be localized otherwise.  Also it might allow removal
of duplicated logic to deal with splitting equal values sanely.
        regards, tom lane

Re: GiST: PickSplit and multi-attr indexes

From

Neil Conway

Date:

16 November 2004, 04:13:39

On Mon, 2004-11-15 at 10:19 -0500, Tom Lane wrote:
> I'm not familiar with the details of the GiST code, but would it work to
> generalize PickSplit to return a three-way classification?  That is,
> instead of actually splitting the node, have it identify each item as
> "definitely left", "definitely right", or "don't care" (think "less",
> "greater", or "equal to" the desired split value).

I had thought about this solution, but I was worried that it will
actually make GiST a less general framework, because it requires that
the PickSplit() method effectively order the leaves of the tree. For
btree that is a reasonable assumption to make, but I'm not sure if it
can or should be made in general.

That said, I haven't actually implemented any indexes using GiST;
perhaps Oleg or Teodor can comment?

-Neil

Re: GiST: PickSplit and multi-attr indexes

From

Tom Lane

Date:

16 November 2004, 14:40:25

Neil Conway <neilc@samurai.com> writes:
> On Mon, 2004-11-15 at 10:19 -0500, Tom Lane wrote:
>> I'm not familiar with the details of the GiST code, but would it work to
>> generalize PickSplit to return a three-way classification?  That is,
>> instead of actually splitting the node, have it identify each item as
>> "definitely left", "definitely right", or "don't care" (think "less",
>> "greater", or "equal to" the desired split value).

> I had thought about this solution, but I was worried that it will
> actually make GiST a less general framework, because it requires that
> the PickSplit() method effectively order the leaves of the tree. For
> btree that is a reasonable assumption to make, but I'm not sure if it
> can or should be made in general.

If there are no don't-care cases, then you're effectively saying that
the first column's PickSplit has sole control over the tree shape,
which is where we're at now.  ISTM the entire point of a multi-column
index is that the first column has duplicates, or at least values that
are similar enough to qualify as don't-cares.
        regards, tom lane

Re: GiST: PickSplit and multi-attr indexes

From

Greg Stark

Date:

16 November 2004, 20:58:11

Tom Lane <tgl@sss.pgh.pa.us> writes:

> If there are no don't-care cases, then you're effectively saying that
> the first column's PickSplit has sole control over the tree shape,
> which is where we're at now.  ISTM the entire point of a multi-column
> index is that the first column has duplicates, or at least values that
> are similar enough to qualify as don't-cares.

I'm not sure that GiST indexes behave the same way as btree indexes for the
multi-column case. 

In a btree index the second column is entirely subordinate to the first
column. In a GiST index the data is multi-dimensional, and all dimensions are
equally important.

For a concrete example, say you wanted to index a 2d cartesian coordinate. If
you stored it in a single column using the point data type (or a 2 element
array) then the PickSplit function can use both coordinates to determine its
split. A GiST index on two separate integer columns really shouldn't be
prevented from operating in the same way.

-- 
greg

Re: GiST: PickSplit and multi-attr indexes

From

Greg Stark

Date:

16 November 2004, 21:13:14

Greg Stark <gsstark@MIT.EDU> writes:

> I'm not sure that GiST indexes behave the same way as btree indexes for the
> multi-column case. 
> 
> In a btree index the second column is entirely subordinate to the first
> column. In a GiST index the data is multi-dimensional, and all dimensions are
> equally important.

In fact on further consideration I do have a proposal.

If you look at the GiST implementations for various index types you'll see
that many (all?) take the same approach for PickSplit. In fact they pretty
much all have the same code copy/pasted to handle it.

The approach they take is to have a function which calculates an abstract
"distance" between any two entries. There's an algorithm that they use to pick
the split based on this distance function.

If you abandoned "PickSplit" and instead exposed this distance function as the
external API then the behaviour for multi-column indexes is clear. You
calculate the distance along all the axes and calculate the diagonal distance.

I think abandoning PickSplit for the distance function might also mean you
don't need a separate function for Penalty either, but I'm not sure on that.

-- 
greg

Re: GiST: PickSplit and multi-attr indexes

From

Tom Lane

Date:

16 November 2004, 21:37:42

Greg Stark <gsstark@MIT.EDU> writes:
> The approach they take is to have a function which calculates an
> abstract "distance" between any two entries. There's an algorithm that
> they use to pick the split based on this distance function.

> If you abandoned "PickSplit" and instead exposed this distance
> function as the external API then the behaviour for multi-column
> indexes is clear. You calculate the distance along all the axes and
> calculate the diagonal distance.

Hmm ... the problem with that is the assumption that different opclasses
will compute similarly-scaled distances.  If opclass A generates
distances in the range (0,1e6) while B generates in the range (0,1),
combining them with Euclidean distance won't work well at all.  OTOH you
can't blindly normalize, because in some cases maybe the data is such
that a massive difference in distances is truly appropriate.

I'm also a bit leery of the assumption that every GiST application can
reduce its PickSplit logic to Euclidean distances.
        regards, tom lane