Re: Cube extension split algorithm fix - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Cube extension split algorithm fix
Date
Msg-id CAPpHfdvvmmPtdNhH40sG0-mnjgZaFDKg98CeBVdxPiO63SjbzA@mail.gmail.com
Whole thread Raw
In response to Cube extension split algorithm fix  (Stas Kelvich <stas.kelvich@gmail.com>)
List pgsql-hackers
Hi!

On Wed, Sep 25, 2013 at 3:14 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
I've fixed split algorithm that was implemented in cube extension. I've changed it according to the original Guttman paper (old version was more simple algorithm) and also ported Alexander Korotkov's algorithm from box datatype indexing that work faster and better on low dimensions.

On my test dataset (1M records, 7 dimensions, real world database of goods) numbers was following:

Building index over table (on expression):
old: 67.296058 seconds
new: 48.842391 seconds

Cube point search, mean, 100 queries
old: 0.001025 seconds
new: 0.000427 seconds

Index on field size:
old: 562 MB
new: 283 MB

Thanks for patch! Here is my review.

1) After points for cubes were committed, this patch doesn't applies anymore to head.
 
patching file contrib/cube/cube.c
Hunk #1 FAILED at 131.
Hunk #2 succeeded at 482 (offset 17 lines).
Hunk #3 succeeded at 494 (offset 17 lines).
Hunk #4 succeeded at 501 (offset 17 lines).
Hunk #5 succeeded at 611 (offset 17 lines).
Hunk #6 FAILED at 681.
Hunk #7 succeeded at 790 with fuzz 1 (offset 30 lines).
Hunk #8 succeeded at 808 (offset 29 lines).
Hunk #9 succeeded at 888 (offset 29 lines).
Hunk #10 succeeded at 1090 (offset 29 lines).
Hunk #11 succeeded at 1160 (offset 29 lines).
Hunk #12 succeeded at 1272 (offset 27 lines).
Hunk #13 succeeded at 1560 with fuzz 1 (offset 95 lines).
2 out of 13 hunks FAILED -- saving rejects to file contrib/cube/cube.c.rej
(Stripping trailing CRs from patch.)
patching file contrib/cube/cubedata.h
Hunk #1 succeeded at 47 (offset 35 lines). 

2) Split algorithms are choosing by number of dimensions in first cube. This is generally weak idea :) At least, cubes could have different number of dimensions in the same index. This rule would give different answers for different orders of cubes. Also, as corner case n+1 dimensional cubes with confluent (n+1)th dimension are generally same as n dimensional cubes. Choosing split algorithm shouldn't depend on it. We should try to extrapolate this principle little more and detect useless for split dimensions to do a better choice. Also, as responsible part as choosing split algorithm needs to be documented (with references to the papers if needed).

3) We need more comprehensive testing. One dataset is not enough. We need much more to prove the rule of choosing split algorithm. Description of set of queries 'Cube point search' is not detail. Can you share dataset you use? We need the tests that anyone can reproduce.

------
With best regards,
Alexander Korotkov.

pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: ToDo: fast update of arrays with fixed length fields for PL/pgSQL
Next
From: Tom Lane
Date:
Subject: Re: ToDo: fast update of arrays with fixed length fields for PL/pgSQL