Re: Double sorting split patch - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Double sorting split patch
Date
Msg-id CAPpHfds333E_V28nRkDY2CBSDC+TDHykZ1HBdPpYkGAjb9w6dg@mail.gmail.com
Whole thread Raw
In response to Re: Double sorting split patch  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: Double sorting split patch
List pgsql-hackers
On Tue, Oct 4, 2011 at 12:12 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Can you elaborate the consider-split algorithm? The criteria to select the new split over the previously selected one is this:
!               /*
!                * If ratio is acceptable, we should compare current split with
!                * previously selected one. If no split was selected then we select
!                * current anyway. Between splits of one dimension we search for
!                * minimal overlap (allowing negative values) and minimal ration
!                * (between same overlaps. We switch dimension if find less overlap
!                * (non-negative) or less range with same overlap.
!                */
!               range = diminfo->upper - diminfo->lower;
!               overlap = ((leftUpper) - (rightLower)) / range;
!               if (context->first ||
!                       (context->dim == dimNum &&
!                        (overlap < context->overlap ||
!                         (overlap == context->overlap && ratio > context->ratio))) ||
!                       (context->dim != dimNum &&
!                        ((range > context->range &&
!                          non_negative(overlap) <= non_negative(context->overlap)) ||
!                         non_negative(overlap) < non_negative(context->overlap)))
!                       )
!               {

Why are negative overlaps handled differently across dimensions and within the same dimension? Your considerSplit algorithm in the SYRCoSE 2011 paper doesn't seem to make that distinction.

Yes, I've changed this behaviour after experiments on real-life datasets. On the datasets where data don't overlap themselfs (points are always such datasets), non-overlapping splits are frequently possible. In this case splits tends to be along one dimension, because most distant non-overlapping splits (i.e. having lowest negative overlap) appears to be in the same dimension as previous split. Therefore MBRs appear to be very prolonged along another dimension, that's bad for search. In order to prevent this behavour I've made transition to another dimension by non-nagative part of overlap and range. Using range as the split criteria makes MBRs more quadratic, and using non-negative part of overlap as priority criteria give to range chance to matter.

------
With best regards,
Alexander Korotkov.

pgsql-hackers by date:

Previous
From: Alex Hunsaker
Date:
Subject: Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.
Next
From: Amit Khandekar
Date:
Subject: Re: Re: [COMMITTERS] pgsql: Force strings passed to and from plperl to be in UTF8 encoding.