Thread: Cube extension improvement, GSoC
Hi, Jey!
I remember I've started a couple of threads related to cube extension:
Could you provide some feedback on Stas's GSoC proposal?
Hello,
some time ago I started working on a data search system (about 100-200M records) with queries consisting of several range and equality conditions, e.g.:
WHERE dim1 BETWEEN 128 AND 137
  AND dim2 BETWEEN 4815 AND 162342
  AND dim3 = 42
ORDER BY dim1 ASC
There are 6 or 7 search criteria in my task. In order to avoid a full table scan, I started using an R-Tree over the cube data type:
CREATE INDEX ON my_table USING GiST(cube(array[dim1, dim2, dim3]))
For fast sorting I used Alexander Korotkov's patch for knngist (http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg153676.html), which changes the metric for nearest-neighbor search and makes it possible to obtain cubes ordered by a specific coordinate. Having changed some data types and function/operator definitions, I ported this to 9.2 (https://github.com/kelvich/postgres/commit/bb372). After that I could select ordered records right from the index:
SELECT * FROM my_table
WHERE cube(array[dim1, dim2, dim3]) <@ cube '(128,4815,42),(137,162342,42)'
ORDER BY cube(array[dim1, dim2, dim3]) <*> 5;
The main issue with this approach is the index size. In my case it was about 100GB for 100M records, so index building becomes a very expensive disk-bound operation. For the same reason, reading a large number of records from the index is slow too.
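For reference, figures like these can be checked with the standard size functions; a minimal sketch, where my_table_cube_idx stands in for whatever name the CREATE INDEX statement above actually produced:

SELECT pg_size_pretty(pg_total_relation_size('my_table'))    AS table_total, -- heap plus all indexes
       pg_size_pretty(pg_relation_size('my_table_cube_idx')) AS cube_index;  -- the GiST index alone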
I came to Oleg Bartunov, Teodor Sigaev and, after a while, to Alexander Korotkov for help. (I'm very thankful to them and glad that they agreed to meet, listen to me and give useful advice.) Having discussed it, we decided that there were several possible improvements to the cube extension:
* Adding point data type support to the cube extension in order to avoid storing coincident upper-left and lower-right vertices, which may almost halve the space that leaf nodes occupy.
* Checking the split algorithm on big datasets and, if possible, improving it.
* Teaching the cube extension to store dimensions with different data types. Such an index would be a good alternative to a compound-key B-Tree index (suitable for range queries and data ordering).
* Providing support for KNN with metrics induced by different norms: Euclidean, taxicab, p-norm. This can be useful in fields where we can extract a signature: similar-image search, similar-audio search. (The norms in question are illustrated in plain SQL below.)
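For concreteness, here are the three norms named in the last item, computed in plain SQL for the distance between the origin and the point (3,4); the KNN item would push such computations into the index, so this is only an illustration:

SELECT sqrt(power(3, 2) + power(4, 2))           AS euclidean, -- 5
       abs(3) + abs(4)                           AS taxicab,   -- 7
       power(power(3, 3) + power(4, 3), 1.0/3.0) AS p_norm;    -- p-norm with p = 3, ~4.5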
I'd like to participate in GSoC with these improvements, and I'm very interested in any comments or suggestions about this feature list.
With best regards,
Alexander Korotkov.
On Sat, Mar 30, 2013 at 3:55 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
------
With best regards,
Alexander Korotkov.
Hi, Jey!
I remember I've started a couple of threads related to cube extension:
Oh, it's a typo. I remember you've started those threads.
Could you provide some feedback on Stas's GSoC proposal?
With best regards,
Alexander Korotkov.
Hi Stas, and sorry for the late response.

On 23.03.2013 01:10, Stas Kelvich wrote:
> * Adding point data type support to the cube extension in order to avoid storing coincident upper-left and lower-right vertices, which may almost halve the space that leaf nodes occupy.

Makes sense. I think it would be possible to implement this as an optimization to the existing cube data type, so that a cube that is actually a point is simply stored more efficiently. I.e. set a flag in the NDBOX struct indicating that the cube is a point, and omit the second set of coordinates if that's set.

> * Teaching the cube extension to store dimensions with different data types. Such an index would be a good alternative to a compound-key B-Tree index (suitable for range queries and data ordering).

You mean, a cube containing something other than floats? I don't think we want to explode the number of datatypes by doing that; casting between them could be problematic. But I wonder if you could add cube-like operators for arrays. We already have support for arrays of any datatype, and any number of dimensions. That seems an awful lot like what the cube extension does.

- Heikki
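Heikki's point is visible from plain SQL: without such an optimization a point cube carries both corners, so it is as big as a box. A minimal sketch, assuming the cube extension is installed (exact byte counts vary with version and alignment):

SELECT pg_column_size(cube(array[1,2,3]))               AS point_cube,
       pg_column_size(cube(array[1,2,3], array[4,5,6])) AS box_cube;
-- without the point optimization both columns report the same size;
-- with it, the point cube shrinks to roughly half the box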
On Thu, May 2, 2013 at 4:19 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> > * Teaching the cube extension to store dimensions with different data types. Such an index would be a good alternative to a compound-key B-Tree index (suitable for range queries and data ordering).
>
> You mean, a cube containing something other than floats? I don't think we want to explode the number of datatypes by doing that; casting between them could be problematic.
At least an option for having a float4 cube instead of a float8 cube seems reasonable to me: space economy paid for by less accuracy.
> But I wonder if you could add cube-like operators for arrays. We already have support for arrays of any datatype, and any number of dimensions. That seems an awful lot like what the cube extension does.
I think we have at least 3 data types more or less similar to cube.
1) array of ranges
2) range of arrays
3) 2d arrays
Semantically, cube is closest to an array of ranges. However, an array of ranges has huge storage overhead.
Also, we could declare cube as a domain over 2d arrays and define the operators on that domain.
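The storage overhead of the array-of-ranges representation can be eyeballed directly; a sketch assuming the cube extension is installed, using the built-in numrange type (numeric-based, so not a perfect comparison, but the gap is clear):

SELECT pg_column_size(ARRAY[numrange(128,137),
                            numrange(4815,162342),
                            numrange(42,42,'[]')])  AS array_of_ranges,
       pg_column_size(cube(array[128,4815,42],
                           array[137,162342,42]))   AS cube_box;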
------
With best regards,
Alexander Korotkov.
Hello.

> * Teaching the cube extension to store dimensions with different data types. Such an index would be a good alternative to a compound-key B-Tree index (suitable for range queries and data ordering).
>
> You mean, a cube containing something other than floats? I don't think we want to explode the number of datatypes by doing that; casting between them could be problematic.
>
> At least an option for having a float4 cube instead of a float8 cube seems reasonable to me: space economy paid for by less accuracy.

The main idea was to reduce the size of the index when it is possible. This can be very important when we have many dimensions with a low number of distinct values.

> But I wonder if you could add cube-like operators for arrays. We already have support for arrays of any datatype, and any number of dimensions. That seems an awful lot like what the cube extension does.

All the cube machinery assumes a fixed number of dimensions, which does not fit arrays well, since the number of elements can easily vary. One can define an index on an expression over an array, i.e. CREATE INDEX ON table USING GIST(cube(array[a,b,c])), and the cube-like operations will work, but only while the array has a fixed size. And again, if the array elements are smallints, then such a cube uses 8 times more space than is actually needed: four times from converting to float8 and two times from storing coincident cube bounds.

> I think we have at least 3 data types more or less similar to cube.
> 1) array of ranges
> 2) range of arrays
> 3) 2d arrays
> Semantically, cube is closest to an array of ranges. However, an array of ranges has huge storage overhead.
> Also, we could declare cube as a domain over 2d arrays and define the operators on that domain.

But what should we do when arrays in different records have different numbers of elements?

Stas Kelvich.
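A rough check of the "8 times" arithmetic for three smallint dimensions, assuming the cube extension is installed and a server without the point-storage optimization discussed above:

SELECT 3 * pg_column_size(1::smallint)    AS native_payload, -- 3 coords * 2 bytes = 6
       pg_column_size(cube(array[1,2,3])) AS as_cube;        -- 2 corners * 3 coords * 8 bytes, plus header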
On Sat, May 4, 2013 at 11:19 PM, Stas Kelvich <stanconn@gmail.com> wrote:
------
With best regards,
Alexander Korotkov.
> > I think we have at least 3 data types more or less similar to cube.
> > 1) array of ranges
> > 2) range of arrays
> > 3) 2d arrays
> > Semantically, cube is closest to an array of ranges. However, an array of ranges has huge storage overhead.
> > Also, we could declare cube as a domain over 2d arrays and define the operators on that domain.
>
> But what should we do when arrays in different records have different numbers of elements?
We can face exactly the same situation with cube:
test=# create table cube_test (v cube);
CREATE TABLE
test=# insert into cube_test values (cube(array[1,2])), (cube(array[1,2,3]));
INSERT 0 2
In order to force all cubes to have the same number of dimensions, an explicit CHECK constraint on the table is required.
As I remember, cube treats absent dimensions as zeros.
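A sketch of such a constraint, using cube_dim() from the extension; note that it has to be in place before rows of other dimensionalities are inserted (adding it after the two inserts above would fail validation):

ALTER TABLE cube_test ADD CONSTRAINT v_is_2d CHECK (cube_dim(v) = 2);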
With best regards,
Alexander Korotkov.
Hi.

Thanks, Heikki, for the answer on google-melange. For some reason I didn't receive an email notification, so I saw this answer only today.

> Do you have access to a server you can use to perform those tests? (...)

Yes, I have. I maintain an MPI cluster at my university, so that is not a problem; the tests for this proposal were actually made on servers from that cluster. But, anyway, thanks for offering help.

There is an open question about supporting different data types in the cube extension. As I understand it, we have the following ideas:

* Add cube-like operators for arrays. We already have support for arrays of any datatype and any number of dimensions. If we want to use tree-like data structures for these operators, we will run into the same problems with trees and types. And we can always cast an array to a cube and use the existing operators. Or perhaps I misunderstand this idea.

* Create support for storing cube coordinates with different data types (2-, 4-, 8-byte integers; 4-, 8-byte floats). The main goal is to reduce the index size, so in order not to break a large amount of code we can store data according to the data type size (i.e. | smallint | real | real | double | double |) and, when we load data from disk or cache, cast it to float8 so the existing code keeps working. To achieve this behavior two steps should be performed:
1) Store information about the coordinate types when the index is created. A good question is where to store this data structure, but I believe it can be done.
2) Change the functions that read and write data to disk, so that they can cast types to/from float8 using the information from the previous step.

* Don't make cube type-aware. Eventually, there are different ways of reducing R-Tree size. For example, we can store relative coordinates with dynamically sized MBRs (VRMBR) instead of absolute coordinates with fixed-size MBRs. There is some evidence that this can significantly reduce the size: http://link.springer.com/chapter/10.1007/11427865_13

On May 8, 2013, at 2:35 PM, Alexander Korotkov wrote:
> On Sat, May 4, 2013 at 11:19 PM, Stas Kelvich <stanconn@gmail.com> wrote:
> > But what should we do when arrays in different records have different numbers of elements?
>
> We can face exactly the same situation with cube:
>
> test=# create table cube_test (v cube);
> CREATE TABLE
> test=# insert into cube_test values (cube(array[1,2])), (cube(array[1,2,3]));
> INSERT 0 2
>
> In order to force all cubes to have the same number of dimensions, an explicit CHECK constraint on the table is required.
> As I remember, cube treats absent dimensions as zeros.
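The per-coordinate sizes behind the mixed | smallint | real | real | double | double | layout sketched in the second item are simply the built-in type sizes, which can be checked directly:

SELECT pg_column_size(1::smallint) AS int2_bytes,   -- 2
       pg_column_size(1::real)     AS float4_bytes, -- 4
       pg_column_size(1::float8)   AS float8_bytes; -- 8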
On Tue, May 14, 2013 at 4:30 PM, Stas Kelvich <stanconn@gmail.com> wrote:
------
With best regards,
Alexander Korotkov.
> * Don't make cube type-aware.
> Eventually, there are different ways of reducing R-Tree size. For example, we can store relative coordinates with dynamically sized MBRs (VRMBR) instead of absolute coordinates with fixed-size MBRs. There is some evidence that this can significantly reduce the size: http://link.springer.com/chapter/10.1007/11427865_13
Sounds promising. Did you examine how this technique could fit into GiST? In the current GiST interface, methods don't have access to parent entries.
With best regards,
Alexander Korotkov.
On May 14, 2013, at 4:53 PM, Alexander Korotkov wrote:
> Sounds promising. Did you examine how this technique could fit into GiST? In the current GiST interface, methods don't have access to parent entries.

No, I didn't examine it yet. In any case, this technique requires lots of changes to both the cube GiST support and the storage methods. We can start by hacking on datatype support, which only affects the storage methods, and after that begin investigating VRMBR support.