Re: Double sorting split patch - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Double sorting split patch
Date
Msg-id 4E8AD603.80506@enterprisedb.com
Whole thread Raw
In response to Re: Double sorting split patch  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: Double sorting split patch
List pgsql-hackers
On 04.10.2011 11:51, Alexander Korotkov wrote:
> On Tue, Oct 4, 2011 at 12:12 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>> Can you elaborate the consider-split algorithm? The criteria to select the
>> new split over the previously selected one is this:
>>
>>> !               /*
>>> !                * If ratio is acceptable, we should compare current split
>>> with
>>> !                * previously selected one. If no split was selected then
>>> we select
>>> !                * current anyway. Between splits of one dimension we
>>> search for
>>> !                * minimal overlap (allowing negative values) and minimal
>>> ration
>>> !                * (between same overlaps. We switch dimension if find
>>> less overlap
>>> !                * (non-negative) or less range with same overlap.
>>> !                */
>>> !               range = diminfo->upper - diminfo->lower;
>>> !               overlap = ((leftUpper) - (rightLower)) / range;
>>> !               if (context->first ||
>>> !                       (context->dim == dimNum&&
>>> !                        (overlap<  context->overlap ||
>>> !                         (overlap == context->overlap&&  ratio>
>>> context->ratio))) ||
>>> !                       (context->dim != dimNum&&
>>> !                        ((range>  context->range&&
>>> !                          non_negative(overlap)<=
>>> non_negative(context->overlap)**) ||
>>> !                         non_negative(overlap)<
>>> non_negative(context->overlap)**))
>>> !                       )
>>> !               {
>>>
>>
>> Why are negative overlaps handled differently across dimensions and within
>> the same dimension? Your considerSplit algorithm in the SYRCoSE 2011 paper
>> doesn't seem to make that distinction.
>
> Yes, I've changed this behaviour after experiments on real-life datasets. On
> the datasets where data don't overlap themselfs (points are always such
> datasets), non-overlapping splits are frequently possible. In this case
> splits tends to be along one dimension, because most distant non-overlapping
> splits (i.e. having lowest negative overlap) appears to be in the same
> dimension as previous split. Therefore MBRs appear to be very prolonged
> along another dimension, that's bad for search. In order to prevent this
> behavour I've made transition to another dimension by non-nagative part of
> overlap and range. Using range as the split criteria makes MBRs more
> quadratic, and using non-negative part of overlap as priority criteria give
> to range chance to matter.

Ok. Could you phrase that as a code comment?

Here's a version of the patch I've been working on. There's no
functional changes, just a lot of moving things around, comment changes,
etc. to hopefully make it more readable.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Separating bgwriter and checkpointer
Next
From: Florian Pflug
Date:
Subject: Re: restoring an object to a different name