Thread: Cube extension split algorithm fix

Cube extension split algorithm fix

From: Stas Kelvich
Hello, hackers.

I've fixed the split algorithm implemented in the cube extension. I've changed it to follow the original Guttman
paper (the old version used a simpler algorithm) and also ported Alexander Korotkov's algorithm from box datatype
indexing, which works faster and better in low dimensions.
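For reference, the seed-picking step of Guttman's quadratic split goes roughly like this. This is only a minimal
standalone sketch of the idea, not code from the patch; the Cube struct, DIM, and the function names below are made
up for the example. It picks the pair of entries whose common bounding box wastes the most space and uses them as
seeds for the two groups.

#include <stdio.h>

#define DIM 3                      /* illustrative dimensionality */

/* Hypothetical cube: lower-left and upper-right corners. */
typedef struct {
    double ll[DIM];
    double ur[DIM];
} Cube;

/* Volume of one cube. */
static double cube_volume(const Cube *c)
{
    double v = 1.0;
    for (int d = 0; d < DIM; d++)
        v *= c->ur[d] - c->ll[d];
    return v;
}

/* Volume of the bounding box of two cubes. */
static double union_volume(const Cube *a, const Cube *b)
{
    double v = 1.0;
    for (int d = 0; d < DIM; d++)
    {
        double lo = a->ll[d] < b->ll[d] ? a->ll[d] : b->ll[d];
        double hi = a->ur[d] > b->ur[d] ? a->ur[d] : b->ur[d];
        v *= hi - lo;
    }
    return v;
}

/*
 * Guttman's PickSeeds: choose the two entries whose common bounding box
 * leaves the most dead space; they become the seeds of the two groups.
 */
static void pick_seeds(const Cube *entries, int n, int *seed1, int *seed2)
{
    double worst = -1.0;

    *seed1 = 0;
    *seed2 = 1;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
        {
            double waste = union_volume(&entries[i], &entries[j])
                - cube_volume(&entries[i]) - cube_volume(&entries[j]);

            if (waste > worst)
            {
                worst = waste;
                *seed1 = i;
                *seed2 = j;
            }
        }
}

int main(void)
{
    Cube entries[3] = {
        {{0.0, 0.0, 0.0}, {1.0, 1.0, 1.0}},
        {{0.5, 0.5, 0.5}, {1.5, 1.5, 1.5}},
        {{10.0, 10.0, 10.0}, {11.0, 11.0, 11.0}}
    };
    int s1, s2;

    pick_seeds(entries, 3, &s1, &s2);
    printf("seeds: %d and %d\n", s1, s2);      /* expect 0 and 2 */
    return 0;
}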

On my test dataset (1M records, 7 dimensions, real-world database of goods) the numbers were as follows:

Building index over table (on expression):
old: 67.296058 seconds
new: 48.842391 seconds

Cube point search, mean over 100 queries:
old: 0.001025 seconds
new: 0.000427 seconds

Index on field size:
old: 562 MB
new: 283 MB

Stas.



Re: Cube extension split algorithm fix

From: Peter Eisentraut
On 9/25/13 7:14 AM, Stas Kelvich wrote:
> I've fixed split algorithm that was implemented in cube extension.

This patch creates a bunch of new compiler warnings.  Please fix those.



Re: Cube extension split algorithm fix

From: Alexander Korotkov
Hi!

On Wed, Sep 25, 2013 at 3:14 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
I've fixed the split algorithm implemented in the cube extension. I've changed it to follow the original Guttman paper (the old version used a simpler algorithm) and also ported Alexander Korotkov's algorithm from box datatype indexing, which works faster and better in low dimensions.

On my test dataset (1M records, 7 dimensions, real-world database of goods) the numbers were as follows:

Building index over table (on expression):
old: 67.296058 seconds
new: 48.842391 seconds

Cube point search, mean over 100 queries:
old: 0.001025 seconds
new: 0.000427 seconds

Index on field size:
old: 562 MB
new: 283 MB

Thanks for the patch! Here is my review.

1) After points for cubes were committed, this patch no longer applies to head.
 
patching file contrib/cube/cube.c
Hunk #1 FAILED at 131.
Hunk #2 succeeded at 482 (offset 17 lines).
Hunk #3 succeeded at 494 (offset 17 lines).
Hunk #4 succeeded at 501 (offset 17 lines).
Hunk #5 succeeded at 611 (offset 17 lines).
Hunk #6 FAILED at 681.
Hunk #7 succeeded at 790 with fuzz 1 (offset 30 lines).
Hunk #8 succeeded at 808 (offset 29 lines).
Hunk #9 succeeded at 888 (offset 29 lines).
Hunk #10 succeeded at 1090 (offset 29 lines).
Hunk #11 succeeded at 1160 (offset 29 lines).
Hunk #12 succeeded at 1272 (offset 27 lines).
Hunk #13 succeeded at 1560 with fuzz 1 (offset 95 lines).
2 out of 13 hunks FAILED -- saving rejects to file contrib/cube/cube.c.rej
(Stripping trailing CRs from patch.)
patching file contrib/cube/cubedata.h
Hunk #1 succeeded at 47 (offset 35 lines). 

2) The split algorithm is chosen based on the number of dimensions in the first cube. This is generally a weak idea :) At the very least, cubes in the same index can have different numbers of dimensions, so this rule would give different answers depending on the order of the cubes. Also, as a corner case, (n+1)-dimensional cubes whose (n+1)th dimension is confluent (degenerate) are essentially the same as n-dimensional cubes, and the choice of split algorithm shouldn't depend on that. We should try to take this principle a bit further and detect dimensions that are useless for the split, in order to make a better choice. Also, something as important as the choice of split algorithm needs to be documented (with references to the papers if needed).
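To illustrate the degenerate-dimension corner case, here is a minimal hedged sketch (the Cube struct and the helper name are made up for the example, not code from the patch): before choosing a split strategy, one could count only the dimensions in which the entries actually differ, so that a confluent dimension that is identical in every entry does not affect the decision.

#include <stdbool.h>
#include <stdio.h>

#define DIM 3                      /* illustrative dimensionality */

/* Hypothetical cube: lower-left and upper-right corners. */
typedef struct {
    double ll[DIM];
    double ur[DIM];
} Cube;

/*
 * Count dimensions that are useful for splitting.  A dimension in which
 * every entry covers exactly the same interval (including the degenerate
 * case where that interval is a single point) cannot separate the entries
 * and should not influence the choice of split algorithm.
 */
static int effective_dimensions(const Cube *entries, int n)
{
    int useful = 0;

    for (int d = 0; d < DIM; d++)
    {
        bool varies = false;

        for (int i = 1; i < n && !varies; i++)
            if (entries[i].ll[d] != entries[0].ll[d] ||
                entries[i].ur[d] != entries[0].ur[d])
                varies = true;

        if (varies)
            useful++;
    }
    return useful;
}

int main(void)
{
    /* Third dimension is the same degenerate point [5,5] in every entry. */
    Cube entries[3] = {
        {{0.0, 0.0, 5.0}, {1.0, 1.0, 5.0}},
        {{2.0, 0.0, 5.0}, {3.0, 1.0, 5.0}},
        {{0.0, 2.0, 5.0}, {1.0, 3.0, 5.0}}
    };

    printf("effective dimensions: %d\n",
           effective_dimensions(entries, 3)); /* prints 2, not 3 */
    return 0;
}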

3) We need more comprehensive testing. One dataset is not enough; we need much more to justify the rule for choosing the split algorithm. The description of the 'Cube point search' query set is not detailed. Can you share the dataset you used? We need tests that anyone can reproduce.

------
With best regards,
Alexander Korotkov.