Thread: Cube extension improvement, GSoC

Cube extension improvement, GSoC

From
Stas Kelvich
Date:
Hello,

Some time ago I started working on a data search system (about 100-200M records) with queries consisting of several range and equality conditions, e.g.:

  WHERE dim1 BETWEEN 128 AND 137
    AND dim2 BETWEEN 4815 AND 162342
    AND dim3 = 42
  ORDER BY dim1 ASC

There are 6 or 7 search criteria in my task. In order to avoid a full table scan I started using an R-Tree over the cube data type:

  CREATE INDEX ON my_table USING GiST(cube(array[dim1, dim2, dim3]))

For fast sorting I used Alexander Korotkov's patch for knngist
(http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg153676.html), which changes the metric for nearest neighbor search and allows obtaining cubes ordered by a specific coordinate. Having changed some data types and function/operator definitions, I ported this to 9.2 (https://github.com/kelvich/postgres/commit/bb372). After that I could select ordered records directly from the index:

  SELECT * FROM my_table
  WHERE cube(array[dim1, dim2, dim3]) <@ cube '(128,4815,42),(137,162342,42)'
  ORDER BY cube(array[dim1, dim2, dim3]) <*> 5;

The main issue with this approach is the index size. In my case it was about 100GB for 100M records, so index building becomes a very expensive disk-bound operation. For the same reason, reading a large number of records from the index is slow too.

I came to Oleg Bartunov, Theodor Sigaev and after a while to Alexander Korotkov for help. (I'm very thankful to them and glad that they agreed to meet, listen to me and give useful advice.) Having discussed it, we decided there were several possible improvements for the cube extension:

  * Adding point data type support to the cube extension in order to avoid storing coincident upper left and lower right vertices, which may reduce the volume that leaf nodes occupy almost twice.
  * Checking the split algorithm with big datasets and, if possible, improving it.
  * Teaching the cube extension to store dimensions with different data types. Such an index would be a good alternative to a compound-key B-Tree multicolumn index (suitable for range queries and data ordering).
  * Providing support for KNN with metrics induced by different norms: Euclidean, taxicab, p-norm. This can be useful in fields where we can extract a signature: similar image search, similar audio search.

I'd like to participate in GSoC with these improvements, and I'm very interested in any comments or suggestions about this feature list.




Re: Cube extension improvement, GSoC

From
Alexander Korotkov
Date:
Hi, Jey!

Could you provide some feedback to GSoC proposal of Stas?

On Sat, Mar 23, 2013 at 3:10 AM, Stas Kelvich <stanconn@gmail.com> wrote:
Hello,

Some time ago I started working on a data search system (about 100-200M records) with queries consisting of several range and equality conditions, e.g.:

  WHERE dim1 BETWEEN 128 AND 137
    AND dim2 BETWEEN 4815 AND 162342
    AND dim3 = 42
  ORDER BY dim1 ASC

There are 6 or 7 search criteria in my task. In order to avoid a full table scan I started using an R-Tree over the cube data type:

  CREATE INDEX ON my_table USING GiST(cube(array[dim1, dim2, dim3]))

For fast sorting I used Alexander Korotkov's patch for knngist (http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg153676.html), which changes the metric for nearest neighbor search and allows obtaining cubes ordered by a specific coordinate. Having changed some data types and function/operator definitions, I ported this to 9.2 (https://github.com/kelvich/postgres/commit/bb372). After that I could select ordered records directly from the index:

  SELECT * FROM my_table
  WHERE cube(array[dim1, dim2, dim3]) <@ cube '(128,4815,42),(137,162342,42)'
  ORDER BY cube(array[dim1, dim2, dim3]) <*> 5;

The main issue with this approach is the index size. In my case it was about 100GB for 100M records, so index building becomes a very expensive disk-bound operation. For the same reason, reading a large number of records from the index is slow too.

I came to Oleg Bartunov, Theodor Sigaev and after a while to Alexander Korotkov for help. (I'm very thankful to them and glad that they agreed to meet, listen to me and give useful advice.) Having discussed it, we decided there were several possible improvements for the cube extension:

  * Adding point data type support to the cube extension in order to avoid storing coincident upper left and lower right vertices, which may reduce the volume that leaf nodes occupy almost twice.
  * Checking the split algorithm with big datasets and, if possible, improving it.
  * Teaching the cube extension to store dimensions with different data types. Such an index would be a good alternative to a compound-key B-Tree multicolumn index (suitable for range queries and data ordering).
  * Providing support for KNN with metrics induced by different norms: Euclidean, taxicab, p-norm. This can be useful in fields where we can extract a signature: similar image search, similar audio search.

I'd like to participate in GSoC with these improvements, and I'm very interested in any comments or suggestions about this feature list.

------
With best regards,
Alexander Korotkov. 

Re: Cube extension improvement, GSoC

From
Alexander Korotkov
Date:
On Sat, Mar 30, 2013 at 3:55 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi, Jey!

I remember I've started a couple of threads related to the cube extension:

Oh, it's a typo. I remember you've started those threads.

------
With best regards,
Alexander Korotkov.

Re: Cube extension improvement, GSoC

From
Heikki Linnakangas
Date:
Hi Stas, and sorry for the late response.

On 23.03.2013 01:10, Stas Kelvich wrote:
>    * Adding point data type support to the cube extension in order to avoid storing coincident upper left and lower right vertices, which may reduce the volume that leaf nodes occupy almost twice.

Makes sense. I think it would be possible to implement this as an 
optimization to the existing cube data type, so that a cube that is actually 
a point is simply stored more efficiently. I.e. set a flag in the NDBOX 
struct indicating that the cube is a point, and omit the second set of 
coordinates if that's set.
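A minimal sketch of that idea (the header layout and flag bit here are illustrative assumptions, not the actual NDBOX definition: a 4-byte dimension word followed by float8 coordinates, with the high bit marking a point):

```python
import struct

POINT_FLAG = 0x80000000  # hypothetical flag bit in the dimension header word

def serialize_cube(lower, upper):
    """Serialize a cube entry; when lower == upper, set the point flag
    and store only one set of coordinates."""
    dim = len(lower)
    if list(lower) == list(upper):
        # Point: flag it and omit the duplicated second coordinate set.
        return struct.pack('<I', dim | POINT_FLAG) + struct.pack(f'<{dim}d', *lower)
    return (struct.pack('<I', dim)
            + struct.pack(f'<{dim}d', *lower)
            + struct.pack(f'<{dim}d', *upper))

point = serialize_cube([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
box   = serialize_cube([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(len(point), len(box))  # 28 52 -- leaf point entries shrink almost twice
```

With this layout a 3-d point takes 28 bytes instead of 52, which is where the "almost twice" for leaf nodes comes from.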

>    * Teaching the cube extension to store dimensions with different data types. Such an index would be a good alternative to a compound-key B-Tree multicolumn index (suitable for range queries and data ordering).

You mean, a cube containing something other than floats? I don't think we 
want to explode the number of datatypes by doing that; casting between 
them could be problematic. But I wonder if you could add cube-like 
operators for arrays. We already have support for arrays of any 
datatype, and any number of dimensions. That seems an awful lot like what 
the cube extension does.

- Heikki



Re: Cube extension improvement, GSoC

From
Alexander Korotkov
Date:
On Thu, May 2, 2013 at 4:19 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
   * Teaching the cube extension to store dimensions with different data types. Such an index would be a good alternative to a compound-key B-Tree multicolumn index (suitable for range queries and data ordering).

You mean, a cube containing something other than floats? I don't think we want to explode the number of datatypes by doing that; casting between them could be problematic.
 
At least an option for having a float4 cube instead of a float8 cube seems reasonable to me, because of the space economy paid for by lower accuracy.

But I wonder if you could add cube-like operators for arrays. We already have support for arrays of any datatype, and any number of dimensions. That seems an awful lot like what the cube extension does.

I think we have at least 3 data types more or less similar to cube:
1) array of ranges
2) range of arrays
3) 2d arrays
Semantically cube is closest to an array of ranges. However an array of ranges has a huge storage overhead.
Also we can declare cube as a domain on 2d arrays and declare operations on that domain.

------
With best regards,
Alexander Korotkov.

Re: Cube extension improvement, GSoC

From
Stas Kelvich
Date:
Hello.

>    * Teaching the cube extension to store dimensions with different data types. Such an index would be a good alternative to a compound-key B-Tree multicolumn index (suitable for range queries and data ordering).
>
> You mean, a cube containing something other than floats? I don't think we want to explode the number of datatypes by doing that; casting between them could be problematic.
>
> At least an option for having a float4 cube instead of a float8 cube seems reasonable to me, because of the space economy paid for by lower accuracy.

The main idea was to reduce the size of the index where possible. This can be very important when we have many dimensions with a low number of distinct values.

> But I wonder if you could add cube-like operators for arrays. We already have support for arrays of any datatype, and any number of dimensions. That seems an awful lot like what the cube extension does.

All the cube machinery assumes a fixed number of dimensions, which is not very useful for arrays, where we can easily change the number of elements. One can define an index on an expression over an array, i.e. CREATE INDEX ON table USING GIST(cube(array[a,b,c])), and cube-like operations will work. But that only works when we have a fixed-size array. And again, if the array elements are smallints, such a cube uses 8 times more space than actually needed: four times from converting to float8, and two times from storing coincident cube bounds.
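To make that factor explicit (a back-of-the-envelope check that ignores per-tuple headers):

```python
# Payload needed vs. payload stored for a 3-dimensional point of smallints.
dims = 3
needed = 2 * dims      # three 2-byte smallint coordinates: 6 bytes
stored = 8 * dims * 2  # float8 coordinates with duplicated bounds: 48 bytes

print(stored // needed)  # 8 -- four times from float8, two times from duplication
```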

> I think we have at least 3 data types more or less similar to cube:
> 1) array of ranges
> 2) range of arrays
> 3) 2d arrays
> Semantically cube is closest to an array of ranges. However an array of ranges has a huge storage overhead.
> Also we can declare cube as a domain on 2d arrays and declare operations on that domain.

But what should we do when arrays in different records have different numbers of elements?

Stas Kelvich.


Re: Cube extension improvement, GSoC

From
Alexander Korotkov
Date:
On Sat, May 4, 2013 at 11:19 PM, Stas Kelvich <stanconn@gmail.com> wrote:
> I think we have at least 3 data types more or less similar to cube:
> 1) array of ranges
> 2) range of arrays
> 3) 2d arrays
> Semantically cube is closest to an array of ranges. However an array of ranges has a huge storage overhead.
> Also we can declare cube as a domain on 2d arrays and declare operations on that domain.

But what should we do when arrays in different records have different numbers of elements?

We can face exactly the same situation with cube.

test=# create table cube_test (v cube);
CREATE TABLE

test=# insert into cube_test values (cube(array[1,2])), (cube(array[1,2,3]));
INSERT 0 2

In order to force all cubes to have the same number of dimensions, an explicit CHECK constraint on the table is required.
As I remember, cube treats absent dimensions as zeros.

------
With best regards,
Alexander Korotkov.

Re: Cube extension improvement, GSoC

From
Stas Kelvich
Date:
Hi.

Thanks, Heikki, for the answer on google-melange. For some reason I didn't receive an email notification, so I saw this answer only today.

> Do you have access to a server you can use to perform those tests? (...)

Yes, I have. I maintain an MPI cluster at my university, so it is not a problem. Actually, the tests for this proposal were made on servers from this cluster. But, anyway, thanks for offering help.

There is an open question about supporting different data types in the cube extension. As I understand it, we have the following ideas:

* Add cube-like operators for arrays. We already have support for arrays of any datatype and any number of dimensions. If we want to use tree-like data structures for these operators, we will run into the same problems with trees and types. And we can always cast an array to a cube and use the existing operators. Or maybe I misunderstand this.

* Create support for storing cube coordinates with different data types (2-, 4-, 8-byte integers; 4-, 8-byte floats). The main goal of this is reducing the index size, so in order not to break a large amount of code we can store data according to the data type size (i.e. | smallint | real | real | double | double |) and, when we load data from disk or cache, cast it to float8 so the existing code will work. To achieve this behavior two steps should be performed:
  1) Store information about coordinate types when the index is created. A good question is where to store this data structure, but I believe it can be done.
  2) Change the functions that read and write data to disk, so they can cast types to/from float8 using the information from the previous step.

* Don't do cube with type support. After all, there are other ways of reducing R-Tree size. For example we can store relative coordinates with a dynamically sized MBR (VRMBR), instead of absolute coordinates with a fixed-size MBR. There is some evidence that this can significantly reduce the size. http://link.springer.com/chapter/10.1007/11427865_13
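The mixed-type storage scheme above can be sketched as follows (a hypothetical on-disk layout using Python's struct module; the type tags, their placement, and the little-endian packing are assumptions for illustration, not cube's actual format):

```python
import struct

# Hypothetical per-dimension type tags recorded once at index creation time.
FORMATS = {'smallint': 'h', 'real': 'f', 'double': 'd'}

def pack_coords(types, values):
    """Write coordinates to their declared on-disk types."""
    fmt = '<' + ''.join(FORMATS[t] for t in types)
    return struct.pack(fmt, *values)

def unpack_coords(types, data):
    """Read coordinates back, widening everything to float (float8),
    so code operating on the in-memory form stays unchanged."""
    fmt = '<' + ''.join(FORMATS[t] for t in types)
    return [float(v) for v in struct.unpack(fmt, data)]

types = ['smallint', 'real', 'real', 'double', 'double']
raw = pack_coords(types, [42, 1.5, 2.5, 3.25, 4.125])
print(len(raw))                   # 26 bytes instead of 40 for five float8s
print(unpack_coords(types, raw))  # [42.0, 1.5, 2.5, 3.25, 4.125]
```

The point of the round-trip is that only the read/write functions know about the narrow types; everything above them sees float8.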

On May 8, 2013, at 2:35 PM, Alexander Korotkov wrote:

> On Sat, May 4, 2013 at 11:19 PM, Stas Kelvich <stanconn@gmail.com> wrote:
> > I think we have at least 3 data types more or less similar to cube:
> > 1) array of ranges
> > 2) range of arrays
> > 3) 2d arrays
> > Semantically cube is closest to an array of ranges. However an array of ranges has a huge storage overhead.
> > Also we can declare cube as a domain on 2d arrays and declare operations on that domain.
>
> But what should we do when arrays in different records have different numbers of elements?
>
> We can face exactly the same situation with cube.
>
> test=# create table cube_test (v cube);
> CREATE TABLE
>
> test=# insert into cube_test values (cube(array[1,2])), (cube(array[1,2,3]));
> INSERT 0 2
>
> In order to force all cubes to have the same number of dimensions, an explicit CHECK constraint on the table is required.
> As I remember, cube treats absent dimensions as zeros.
>
> ------
> With best regards,
> Alexander Korotkov.
>




Re: Cube extension improvement, GSoC

From
Alexander Korotkov
Date:
On Tue, May 14, 2013 at 4:30 PM, Stas Kelvich <stanconn@gmail.com> wrote:
* Don't do cube with type support
        After all, there are other ways of reducing R-Tree size. For example we can store relative coordinates with a dynamically sized MBR (VRMBR), instead of absolute coordinates with a fixed-size MBR. There is some evidence that this can significantly reduce the size. http://link.springer.com/chapter/10.1007/11427865_13

Sounds promising. Did you examine how this technique could fit into GiST? In the current GiST interface, methods don't have access to parent entries.
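For a rough picture of the relative-coordinate idea being discussed, a sketch along these lines (the 16-bit quantization, the fixed scale factor, and the encoding are assumptions for illustration, not the paper's exact VRMBR scheme):

```python
import struct

def encode_child_mbr(parent_lo, child_lo, child_hi, scale=1000):
    """Store a child MBR as quantized 16-bit offsets from the parent's
    lower corner instead of absolute float8 coordinates."""
    offsets = []
    for p, lo, hi in zip(parent_lo, child_lo, child_hi):
        offsets += [round((lo - p) * scale), round((hi - p) * scale)]
    return struct.pack(f'<{len(offsets)}h', *offsets)

def decode_child_mbr(parent_lo, data, scale=1000):
    """Recover absolute bounds by adding the offsets back to the parent."""
    vals = struct.unpack(f'<{len(data) // 2}h', data)
    lo = [p + vals[2 * i] / scale for i, p in enumerate(parent_lo)]
    hi = [p + vals[2 * i + 1] / scale for i, p in enumerate(parent_lo)]
    return lo, hi

parent_lo = [100.0, 200.0]
raw = encode_child_mbr(parent_lo, [100.5, 201.25], [102.0, 203.0])
print(len(raw))  # 8 bytes instead of 32 for four absolute float8 coordinates
print(decode_child_mbr(parent_lo, raw))
```

The catch, as noted, is that decoding requires the parent entry, which current GiST interface methods don't have access to.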

------
With best regards,
Alexander Korotkov.

Re: Cube extension improvement, GSoC

From
Stas Kelvich
Date:
On May 14, 2013, at 4:53 PM, Alexander Korotkov wrote:

> Sounds promising. Did you examine how this technique could fit into GiST? In the current GiST interface, methods don't have access to parent entries.

No, I didn't examine it yet. Anyway, with this technique lots of changes would be needed in both the cube GiST support and the storage methods. We can start by hacking on datatype support, which only affects the storage methods, and after that begin investigating VRMBR support.