Cube extension improvement, GSoC - Mailing list pgsql-hackers

From Stas Kelvich
Subject Cube extension improvement, GSoC
Date
Msg-id 6A7E75B1-64DD-4F5F-A991-435E3E5A24FB@gmail.com
Whole thread Raw
Responses Re: Cube extension improvement, GSoC  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
Hello,

some time ago I started working on the data search system (about 100-200M of records) with queries consisted of several
diapasonand equality conditions, e.g.: 
 WHERE dim1 BETWEEN 128 AND 137 AND WHERE dim2 BETWEEN 4815 AND 162342 AND WHERE dim3 = 42 ORDER BY dim1 ASC

There are 6 or 7 search criteria in my task. In order to avoid full table scan I started using R-Tree over cube data
type:
 CREATE INDEX ON my_table USING GiST(cube(array[dim1, dim2, dim3]))

For fast sorting I used Alexander Korotkov's patch for knngist
(http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg153676.html),which changes metric for nearest neighbors
searchand allows to obtain cubes ordered by a specific coordinate. Having changed some data types and function/operator
definitionsI ported this to 9.2 (https://github.com/kelvich/postgres/commit/bb372). Then, after this I could select
orderedrecords right from the index: 
 SELECT * FROM my_table WHERE cube(array[dim1, dim2, dim3]) <@ cube '(128,4815,42),(137,162342,42)' ORDER BY
cube(array[dim1,dim2, dim3]) <*> 5; 

The main issue of such approach is the index size. In my case it was about 100GB for 100M of records. Therefore index
buildingbecomes very expensive disk-related operation. For the same reason reading a large number of records from the
indexis slow too. 

I came to Oleg Bartunov, Theodor Sigaev and after a while to Alexander Korotkov for any help. (I'm very thankful to
themand glad that they agreed to meet, listen to me and give useful advices). Having discussed it we decided that there
wasseveral possible improvements for the cube extension: 
 * Adding point data type support to the cube extension in order to avoid storing of coincident upper left and lower
rightvertices, which may reduce the volume that leaf nodes occupy almost twice. * Checking the split algorithm with big
datasetsand, if possible, improving it. * Learning cube extension to store dimensions with different data types. Such
indexwould be good alternative to compound key B-Tree multi-index (suitable for diapason queries and data ordering). *
Providingsupport for KNN with metrics induced by the different norms: euclidean, taxicab norm, p-norm. This can be
usefulin fields where we can extract signature: similar images search, similar audio search. 

I'd like to participate in GSoC with this improvements, and I'm very interested in any comments or suggestions about
thisfeature list. 




pgsql-hackers by date:

Previous
From: Daniel Farina
Date:
Subject: Re: postgres_fdw vs data formatting GUCs (was Re: [v9.3] writable foreign tables)
Next
From: Alvaro Herrera
Date:
Subject: Re: DROP OWNED BY fails to drop privileges granted by non-owners (was Re: [GENERAL] Bug, Feature, or what else?)