On Tue, Oct 4, 2011 at 12:12 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Can you elaborate the consider-split algorithm? The criteria to select the new split over the previously selected one is this:
! /*
! * If ratio is acceptable, we should compare current split with
! * previously selected one. If no split was selected then we select
! * current anyway. Between splits of one dimension we search for
! * minimal overlap (allowing negative values) and minimal ration
! * (between same overlaps. We switch dimension if find less overlap
! * (non-negative) or less range with same overlap.
! */
! range = diminfo->upper - diminfo->lower;
! overlap = ((leftUpper) - (rightLower)) / range;
! if (context->first ||
! (context->dim == dimNum &&
! (overlap < context->overlap ||
! (overlap == context->overlap && ratio > context->ratio))) ||
! (context->dim != dimNum &&
! ((range > context->range &&
! non_negative(overlap) <= non_negative(context->overlap)) ||
! non_negative(overlap) < non_negative(context->overlap)))
! )
! {
Why are negative overlaps handled differently across dimensions and within the same dimension? Your considerSplit algorithm in the SYRCoSE 2011 paper doesn't seem to make that distinction.
Yes, I've changed this behaviour after experiments on real-life datasets. On the datasets where data don't overlap themselfs (points are always such datasets), non-overlapping splits are frequently possible. In this case splits tends to be along one dimension, because most distant non-overlapping splits (i.e. having lowest negative overlap) appears to be in the same dimension as previous split. Therefore MBRs appear to be very prolonged along another dimension, that's bad for search. In order to prevent this behavour I've made transition to another dimension by non-nagative part of overlap and range. Using range as the split criteria makes MBRs more quadratic, and using non-negative part of overlap as priority criteria give to range chance to matter.
------
With best regards,
Alexander Korotkov.