On 04.10.2011 11:51, Alexander Korotkov wrote:
> On Tue, Oct 4, 2011 at 12:12 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com> wrote:
>
>> Can you elaborate the consider-split algorithm? The criteria to select the
>> new split over the previously selected one is this:
>>
>>> ! /*
>>> ! * If ratio is acceptable, we should compare current split
>>> with
>>> ! * previously selected one. If no split was selected then
>>> we select
>>> ! * current anyway. Between splits of one dimension we
>>> search for
>>> ! * minimal overlap (allowing negative values) and minimal
>>> ration
>>> ! * (between same overlaps. We switch dimension if find
>>> less overlap
>>> ! * (non-negative) or less range with same overlap.
>>> ! */
>>> ! range = diminfo->upper - diminfo->lower;
>>> ! overlap = ((leftUpper) - (rightLower)) / range;
>>> ! if (context->first ||
>>> ! (context->dim == dimNum&&
>>> ! (overlap< context->overlap ||
>>> ! (overlap == context->overlap&& ratio>
>>> context->ratio))) ||
>>> ! (context->dim != dimNum&&
>>> ! ((range> context->range&&
>>> ! non_negative(overlap)<=
>>> non_negative(context->overlap)**) ||
>>> ! non_negative(overlap)<
>>> non_negative(context->overlap)**))
>>> ! )
>>> ! {
>>>
>>
>> Why are negative overlaps handled differently across dimensions and within
>> the same dimension? Your considerSplit algorithm in the SYRCoSE 2011 paper
>> doesn't seem to make that distinction.
>
> Yes, I've changed this behaviour after experiments on real-life datasets. On
> the datasets where data don't overlap themselfs (points are always such
> datasets), non-overlapping splits are frequently possible. In this case
> splits tends to be along one dimension, because most distant non-overlapping
> splits (i.e. having lowest negative overlap) appears to be in the same
> dimension as previous split. Therefore MBRs appear to be very prolonged
> along another dimension, that's bad for search. In order to prevent this
> behavour I've made transition to another dimension by non-nagative part of
> overlap and range. Using range as the split criteria makes MBRs more
> quadratic, and using non-negative part of overlap as priority criteria give
> to range chance to matter.
Ok. Could you phrase that as a code comment?
Here's a version of the patch I've been working on. There's no
functional changes, just a lot of moving things around, comment changes,
etc. to hopefully make it more readable.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com