Thread: Cube extension kNN support
Hello, hackers.

Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.

The following distance operators are introduced:

  <#>  taxicab distance
  <->  euclidean distance
  <=>  chebyshev distance

For example:

  SELECT * FROM objects ORDER BY objects.coord <-> '(137,42,314)'::cube LIMIT 10;

There is also an operator "->" for selecting ordered rows directly from the index. This query selects rows ordered ascending by the 3rd coordinate:

  SELECT * FROM objects ORDER BY objects.coord->3 LIMIT 10;

For descending ordering, the suggested syntax puts a minus before the coordinate. This query selects rows ordered descending by the 4th coordinate:

  SELECT * FROM objects ORDER BY objects.coord->-4 LIMIT 10;

Stas Kelvich.
Attachment
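For readers who want to try this out, a minimal end-to-end sketch (the table and data are made up for illustration; it assumes the patch is applied so the cube GiST opclass supports the new ordering operators):

  CREATE EXTENSION cube;
  CREATE TABLE objects (id serial, coord cube);
  -- 100k random 3-d points
  INSERT INTO objects (coord)
      SELECT cube(array[random()*100, random()*100, random()*100])
      FROM generate_series(1, 100000);
  CREATE INDEX objects_coord_idx ON objects USING gist (coord);
  -- 10 nearest neighbors by euclidean distance, driven by the index
  SELECT * FROM objects ORDER BY coord <-> '(50,50,50)'::cube LIMIT 10;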
<div dir="ltr">Do you have any benchmarks ?<br /></div><div class="gmail_extra"><br /><br /><div class="gmail_quote">On Mon,Sep 23, 2013 at 3:38 AM, Stas Kelvich <span dir="ltr"><<a href="mailto:stas.kelvich@gmail.com" target="_blank">stas.kelvich@gmail.com</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px#ccc solid;padding-left:1ex">Hello, hackers.<br /><br /> Here is the patch that introduces kNN searchfor cubes with euclidean, taxicab and chebyshev distances.<br /><br /> Following distance operators introduced:<br/><br /> <#> taxicab distance<br /> <-> euclidean distance<br /> <=> chebyshev distance<br/><br /> For example:<br /> SELECT * FROM objects ORDER BY objects.coord <-> '(137,42,314)'::cube LIMIT10;<br /><br /> Also there is operator "->" for selecting ordered rows directly from index.<br /> This request selectsrows ordered ascending by 3rd coordinate:<br /><br /> SELECT * FROM objects ORDER BY objects.coord->3 LIMIT 10;<br/><br /> For descendent ordering suggested syntax with minus before coordinate.<br /> This request selects rows ordereddescending by 4th coordinate:<br /><br /> SELECT * FROM objects ORDER BY objects.coord->-4 LIMIT 10;<br /><br />Stas Kelvich.<br /><br /><br /><br /><br /> --<br /> Sent via pgsql-hackers mailing list (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/> To make changes to your subscription:<br/><a href="http://www.postgresql.org/mailpref/pgsql-hackers" target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/><br /></blockquote></div><br /></div>
On 9/22/13 7:38 PM, Stas Kelvich wrote:
> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.

cube and earthdistance regression tests fail.
On 9/22/13 4:38 PM, Stas Kelvich wrote:
> Hello, hackers.
>
> Here is the patch that introduces kNN search for cubes with
> euclidean, taxicab and chebyshev distances.

Thanks for this! I decided to give the patch a try at the bleeding edge with some high-dimensional vectors, specifically the 1.4 million 1000-dimensional Freebase entity vectors from the Google 'word2vec' project:

https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming

Unfortunately, here's what I found:

* with 1000-dimensional vectors, the index size on disk grows to many times (1000x or more) the size of the indexed data, making other tests of the index's helpfulness impractical. (Perhaps other parameter-tuning can help?)

* with 500-dimensional or 100-dimensional vectors, the index size is more manageable -- 2x to 6x the data size -- but using the index significantly hurts performance on top-10 nearest-neighbor searches, making it much slower than a full table scan. (The planner still tries to use the index, even though it's hurting.)

Much more detail for the curious is below...

Regarding the dataset & hoped-for results:

The word2vec project's example scripts do their synonym/analogy demonstrations by loading the entire 5GB+ dataset into main memory (~3min), then doing a full scan of all vectors (~40sec) to find those nearest a target vector. My motivating question was: could the data be loaded into Postgresql using the cube type, kNN-GIST indexed using this patch, and then used for speedy index-assisted ranked-nearest-neighbor queries from the database?

(While the patch's distance_euclid is not the same cosine-distance the word2vec examples use, the freebase entity vectors are all unit vectors, and even additions of vectors can be scaled to unit length. My intuition is that euclidean distances on the unit sphere will be in the same rank order as cosine-distance, so the cube distance_euclid/<-> should enable the same sort of synonym/analogy demos.)

Regarding the 1000-dimensional case:

It was necessary to change three compiled-in implementation limits, in four steps, in the order the need for each change became evident:

(1) in contrib/cube/cubedata.h, increase CUBE_MAX_DIM (to accept vectors with more than 100 dimensions)

(2) in src/include/pg_config.h, increase BLCKSZ to 16384 (otherwise 1000 64-bit floats in a single column gave an <ERROR: row is too big>, mentioning the 8160 limit - large cube values not TOASTable?)

(3) in src/include/access/itup.h, increase INDEX_SIZE_MASK to 0x3fff (otherwise encountering an <ERROR: index row requires 16016 bytes, maximum size is 8191> when attempting to create the index)

(4) in src/include/pg_config.h, again increase BLCKSZ, now to 32768 (otherwise encountering an <ERROR: index row size 16416 exceeds maximum 5440 for index "pg_class_relname_nsp_index"> when attempting to create the index)

With the cube-kNN patch applied and these other changes, I was able to import the 1.4M freebase vectors and do a full-scan nearest-neighbors query. (My starting postgresql codebase was the github mirror of 9.4dev as of about a week ago.) The psql transcript:

word2vec=# \timing
Timing is on.
word2vec=# CREATE EXTENSION cube;
CREATE EXTENSION
Time: 42.141 ms
word2vec=# CREATE TABLE word2vec ( word varchar(200), vec cube );
CREATE TABLE
Time: 8.532 ms
word2vec=# COPY word2vec FROM PROGRAM 'zcat /tmp/pgvectors/freebase-vectors-skipgram1000.pgtxt.gz';
COPY 1422903
Time: 12399065.498 ms
word2vec=# SELECT word, dist FROM (SELECT word, distance_euclid(vec,(SELECT vec FROM word2vec WHERE word='geoffrey hinton')) AS dist FROM word2vec) AS subquery ORDER BY dist LIMIT 11;
          word           |       dist
-------------------------+------------------
 geoffrey hinton         |                0
 marvin minsky           | 1.03892498287268
 paul corkum             | 1.05221701690288
 william richard peltier | 1.06244397334495
 brenda milner           | 1.06387762685894
 john charles polanyi    | 1.07444446452295
 leslie valiant          | 1.07735786596934
 hava siegelmann         | 1.08148623006629
 hans moravec            |  1.0862034591185
 david rumelhart         | 1.08748431130477
 godel prize             | 1.08774264379264
(11 rows)

Time: 310753.976 ms

That's 3.5 hours to do the import and 5 minutes to do the query; this is on a 2009 MacBook Pro with 8GB RAM and SSD. Confirming the intuition above, these 10-nearest are the same entities in the same order as on the word2vec project page's example output, though the euclidean distances are of course different than the cosine distances.

The DATA directory is 23GB after the import of 1.4 million rows. In the word2vec uncompressed binary format, this dataset is about 5.4GB, so this word2vec-table cube-column representation involves about 4X expansion.

So, to the main question: can that query be sped up by building a kNN-GIST index? Here the problems start.

word2vec=# CREATE INDEX word2vec_index ON word2vec USING gist (vec);

This attempt ran for hours, consuming another 80GB+ before failing due to disk full. In fact, the largest table with 1000-dimensional vectors for which I was able to build a gist index was a mere 100 rows. That index-build took about 14 minutes and grew the DATA directory like so:

 73MB pgdata  # empty word2vec table
 75MB pgdata  # after COPY FROM of 100 vectors
8.4GB pgdata  # after CREATE INDEX

Of course with just 100 rows the index isn't practically needed or helpful for query speed-up. Even trying just 500 rows, the CREATE INDEX command ran for hours before failing by consuming all available disk space (about 90GB more). It seems the knn-GIST index overhead for 1000-dimensional cubes grows faster than linearly in the number of rows.

I was able to complete index builds with fewer dimensions.

100 dimensions, most-frequent 850K entities:

The dataset was trimmed to the first (most-frequent) 850K entities and each vector truncated to its first 100 dimensions. (Even though this would fit in the usual CUBE_MAX_DIM/BLCKSZ/INDEX_SIZE_MASK limits, I still used the customized build with limits expanded for the 1000d case.) Observed data sizes and operation times were:

 73MB pgdata  # empty word2vec table
2.6GB pgdata  # after COPY FROM of 850K vectors, taking 216s
              # before indexing: nearest-11 by distance_euclid(): ~2s
              # before indexing: nearest-11 by <-> operator: ~2s
5.0GB pgdata  # after CREATE INDEX, taking 1344s
              # after indexing: nearest-11 by distance_euclid(): ~2s
              # after indexing: nearest-11 by <-> operator: ~57s  # "Index Scan using word2vec_index…"

So the availability of the index causes a significant slowdown... and the planner does not learn to choose the faster full sequential scan.

500 dimensions, most-frequent 100K entities:

The dataset was trimmed to the first (most-frequent) 100K entities and each vector truncated to its first 500 dimensions. (Again, still using the custom build with upped CUBE_MAX_DIM/BLCKSZ/INDEX_SIZE_MASK.) Observed data sizes and operation times were:

 73MB pgdata  # empty word2vec table
1.6GB pgdata  # after COPY FROM of 100K vectors, taking 266s
              # before indexing: nearest-11 by distance_euclid(): ~2s
              # before indexing: nearest-11 by <-> operator: ~2s
4.8GB pgdata  # after CREATE INDEX, taking 977s
              # after indexing: nearest-11 by distance_euclid(): ~2s
              # after indexing: nearest-11 by <-> operator: ~46s  # "Index Scan using word2vec_index…"

Dropping the index makes the <-> query fast again.

Open questions and tentative conclusions:

The massive knn-GIST index overhead for 1000-dimensional vectors makes it hard to evaluate whether a fully-built index could be useful on large datasets. Perhaps some aspects of the knn-GIST support implicitly assume a low dimensionality (2-4) in the data, and large numbers of dimensions cause pathological index sizes? Or something specific about this dataset (all vectors on the unit sphere) is a challenging case?

In the truncated 100d or 500d cases, the index can be built, but slows rather than speeds nearest-neighbor queries that use it. The evaluated patch's definitions of distance_euclid and other support functions seem straightforward, and give proper results on simple test cases... so the problem, if any, would appear to be in the general distance-driven knn-GIST indexing and query-planning.

If there is a type of cube-type-represented dimensional data where this indexing helps, it may only be with far fewer dimensions than 100, and/or far more rows than 100K/850K.

If a knn-GIST implementor/expert has suggestions for tuning the index overhead and behavior -- perhaps a different penalty or picksplit function? -- I'd be happy to try those out and report back. Otherwise, I hope this write-up will help evaluate the patch or save time for others tempted to try similar knn-GIST indexing of higher-dimensional data.

- Gordon Mohr
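For anyone reproducing the "index hurts but still gets picked" behavior above, a small self-contained sketch (synthetic 3-d data rather than the word2vec set; assumes the patch's <-> operator) that compares the index-assisted plan against a forced sequential scan:

  CREATE TABLE vecs AS
      SELECT cube(array[random(), random(), random()]) AS vec
      FROM generate_series(1, 100000);
  CREATE INDEX vecs_knn_idx ON vecs USING gist (vec);
  ANALYZE vecs;

  -- plan A: index-assisted kNN ordering
  EXPLAIN ANALYZE
      SELECT * FROM vecs ORDER BY vec <-> '(0.5,0.5,0.5)'::cube LIMIT 11;

  -- plan B: disable index scans to get the seqscan + top-N sort plan
  SET enable_indexscan = off;
  EXPLAIN ANALYZE
      SELECT * FROM vecs ORDER BY vec <-> '(0.5,0.5,0.5)'::cube LIMIT 11;
  RESET enable_indexscan;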
Gordon Mohr wrote:
> Thanks for this! I decided to give the patch a try at the bleeding
> edge with some high-dimensional vectors, specifically the 1.4
> million 1000-dimensional Freebase entity vectors from the Google
> 'word2vec' project:
>
> https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming
>
> Unfortunately, here's what I found:

I wonder if these results would improve with this patch:
http://www.postgresql.org/message-id/EFEDC2BF-AB35-4E2C-911F-FC88DA6473D7@gmail.com

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/23/13 9:05 PM, Alvaro Herrera wrote:
> Gordon Mohr wrote:
>> Thanks for this! I decided to give the patch a try at the bleeding
>> edge with some high-dimensional vectors, specifically the 1.4
>> million 1000-dimensional Freebase entity vectors from the Google
>> 'word2vec' project:
>>
>> https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming
>>
>> Unfortunately, here's what I found:
>
> I wonder if these results would improve with this patch:
> http://www.postgresql.org/message-id/EFEDC2BF-AB35-4E2C-911F-FC88DA6473D7@gmail.com

Thanks for the pointer; I'd missed that relevant update from Stas Kelvich.

I applied that patch, and reindexed.

On the 100-dimension, 850K vector set:
indexing: 1137s (vs. 1344s)
DATA size: 4.7G (vs. 5.0G)
top-11-nearest-neighbor query: 32s (vs. ~57s)

On the 500-dimension, 100K vector set:
indexing: 756s (vs. 977s)
DATA size: 4.5G (vs. 4.8G)
top-11-nearest-neighbor query: 18s (vs. ~46s)

So, moderate (5-20%) improvements in indexing time and size, and larger (40-60%) speedups in index-assisted (<->) queries... but those index-assisted queries are still ~10X+ slower than the sequential-scan (distance_euclid()) queries, so the existence of the knn-GIST index is still harming rather than helping performance.

Will update if my understanding changes; still interested to hear if I've missed a key factor/switch needed for these indexes to work well.

- Gordon Mohr
On Thu, Oct 24, 2013 at 3:50 AM, Gordon Mohr <gojomo-pgsql@xavvy.com> wrote:
> On 9/22/13 4:38 PM, Stas Kelvich wrote:
>> Hello, hackers.
>>
>> Here is the patch that introduces kNN search for cubes with
>> euclidean, taxicab and chebyshev distances.
>
> Thanks for this! I decided to give the patch a try at the bleeding edge with some high-dimensional vectors, specifically the 1.4 million 1000-dimensional Freebase entity vectors from the Google 'word2vec' project:
I believe the curse of dimensionality is affecting you here. I think it is impossible to get an improvement over a sequential scan for 1000-dimensional vectors. Read here:
Regards
Marcin Mańk
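The concentration-of-distances effect behind this can be seen directly in SQL. A hedged sketch (it assumes the patch's cube <-> cube operator; dimensionality is kept at 100 to stay under the default CUBE_MAX_DIM): for random points, the ratio between the farthest and nearest pairwise distance collapses toward 1 as dimensionality grows, leaving an R-tree-style index nothing to prune on.

  -- max/min pairwise euclidean distance over 200 random points
  WITH pts AS (
      SELECT i, cube(array_agg(random())) AS v
      FROM generate_series(1, 200) AS i,
           generate_series(1, 100) AS dim   -- dimensionality; compare 2 vs 100
      GROUP BY i
  )
  SELECT min(d) AS nearest, max(d) AS farthest, max(d) / min(d) AS spread
  FROM (SELECT a.v <-> b.v AS d FROM pts a JOIN pts b ON a.i < b.i) t;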
Hi.

> cube and earthdistance regression tests fail.

Code updated to work with current HEAD. Also added tests to cover the new functionality.

> Do you have any benchmarks?

This patch just introduces the functionality of calculating distances between cubes, so this code doesn't interfere much with kNN search speed. I think it's better to publish such benchmarks in the neighbor patch about the split algorithm. Anyway, we can compare kNN with b-tree and full scan:

  create table test(a1 float, a2 float, a3 float);
  insert into test (select 100*random(), 100*random(), 100*random() from generate_series(1,1000000) as s(a));
  create index on test using gist(cube(array[a1,a2,a3]));
  select * from test order by a1 limit 15;                          -- 227.658 ms
  select * from test order by cube(array[a1,a2,a3])->1 limit 15;    -- 1.275 ms
  create index on test(a1);
  select * from test order by a1 limit 15;                          -- 0.103 ms

As we can see, kNN ordering is 10 times slower than B-tree (on a silly request for an R-Tree, just as an example), but still 100+ times faster than a full scan on this table.

Stas.

On Sep 25, 2013, at 5:25 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 9/22/13 7:38 PM, Stas Kelvich wrote:
>> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.
>
> cube and earthdistance regression tests fail.
Attachment
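The same expression index should also be able to drive a true nearest-point query via the new distance operator; a hypothetical continuation of the benchmark above (assuming the patch is applied):

  -- 15 rows nearest to (50,50,50), ordered from the functional GiST index
  SELECT * FROM test
  ORDER BY cube(array[a1,a2,a3]) <-> '(50,50,50)'::cube
  LIMIT 15;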
cube.c: In function ‘g_cube_distance’:
cube.c:1453:2: warning: ‘retval’ may be used uninitialized in this function [-Wmaybe-uninitialized]
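A hedged sketch of the usual fix for this class of warning, not the actual patch code: initialize the result and give the strategy switch an erroring default, so the compiler can prove retval is assigned on every path. (The strategy macro and helper names here are hypothetical, and the real GiST support function takes fcinfo-style arguments rather than this simplified signature.)

  /* sketch only: hypothetical strategy macros and distance helpers */
  static double
  g_cube_distance_sketch(NDBOX *cube, NDBOX *query, StrategyNumber strategy)
  {
      double  retval = 0.0;   /* initialized: silences -Wmaybe-uninitialized */

      switch (strategy)
      {
          case KNN_STRATEGY_TAXICAB:
              retval = distance_taxicab(cube, query);
              break;
          case KNN_STRATEGY_EUCLID:
              retval = distance_euclid(cube, query);
              break;
          case KNN_STRATEGY_CHEBYSHEV:
              retval = distance_chebyshev(cube, query);
              break;
          default:
              elog(ERROR, "unrecognized cube kNN strategy: %d", strategy);
      }
      return retval;
  }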
Hi everyone,

On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.

What is the status of this patch?

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA
http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com
On Thu, Mar 27, 2014 at 3:26 PM, Sergey Konoplev <gray.ru@gmail.com> wrote:
> On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
>> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.
>
> What is the status of this patch?

Referring to our private conversation with Alexander Korotkov, the patch is in a WIP state currently and, hopefully, will be ready by 9.5. I'm ready to actively participate in its testing on a real-world production set of data.

I'm not sure if it is doable at all, but is there any possibility to implement it here, or, what would be just great, are there any ready or half-ready solutions for a Hamming-distance-based kNN search?

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA
http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com
On Mon, Mar 31, 2014 at 10:01 PM, Sergey Konoplev <gray.ru@gmail.com> wrote:
> On Thu, Mar 27, 2014 at 3:26 PM, Sergey Konoplev <gray.ru@gmail.com> wrote:
>> On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
>>> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.
>>
>> What is the status of this patch?
>
> Referring to our private conversation with Alexander Korotkov, the
> patch is in a WIP state currently and, hopefully, will be ready by 9.5.
> I'm ready to actively participate in its testing on a real-world
> production set of data.
>
> I'm not sure if it is doable at all, but is there any possibility to
> implement it here, or, what would be just great, are there any ready or
> half-ready solutions for a Hamming-distance-based kNN search?
Cube deals with float8 numbers. There is another patch making it work with other number types. But Hamming distance is for bit vectors, isn't it?
----
With best regards,
Alexander Korotkov.
On Mon, Mar 31, 2014 at 12:09 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> I'm not sure if it is doable at all, but is there any possibility to
>> implement it here, or, what would be just great, are there any ready or
>> half-ready solutions for a Hamming-distance-based kNN search?
>
> Cube deals with float8 numbers. There is another patch making it work with
> other number types. But Hamming distance is for bit vectors, isn't it?

It can be generalized to character vectors. Though, I agree, that was an off-topic question to some extent. I was wondering if there were any Postgres-related indexable Hamming/Manhattan distance experiments/thoughts/discussions, and whether kNN can be used here or not, because from my understanding it can be represented as spatial (I might be very wrong here).

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA
http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com
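As an aside on the Hamming question: for bit strings the distance itself is easy to express with stock built-ins, though without any index support, so this is only a sequential-scan helper. A sketch:

  -- Hamming distance = number of set bits in the XOR of two
  -- equal-length bit strings ('#' is the built-in bit-string XOR)
  CREATE FUNCTION hamming(a bit varying, b bit varying) RETURNS int
      LANGUAGE sql IMMUTABLE STRICT
      AS $$ SELECT length(replace((a # b)::text, '0', '')) $$;

  SELECT hamming(B'10110', B'11100');  -- returns 2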
Hi!

I had updated the old patch with kNN operators for cube data structures. Copying the description from the old message:

The following distance operators are introduced:

  <#>  taxicab distance
  <->  euclidean distance
  <=>  chebyshev distance

For example:

  SELECT * FROM objects ORDER BY objects.coord <-> '(137,42,314)'::cube LIMIT 10;

There is also an operator "->" for selecting ordered rows directly from the index. This query selects rows ordered ascending by the 3rd coordinate:

  SELECT * FROM objects ORDER BY objects.coord->3 LIMIT 10;

For descending ordering, the suggested syntax puts a minus before the coordinate. This query selects rows ordered descending by the 4th coordinate:

  SELECT * FROM objects ORDER BY objects.coord->-4 LIMIT 10;

Stas Kelvich.
Attachment
Hi!
On Sat, Feb 7, 2015 at 12:45 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
> I had updated the old patch with kNN operators for cube data structures. Copying the description from the old message:
>
> The following distance operators are introduced:
>
> <#>  taxicab distance
> <->  euclidean distance
> <=>  chebyshev distance
>
> For example:
> SELECT * FROM objects ORDER BY objects.coord <-> '(137,42,314)'::cube LIMIT 10;
>
> There is also an operator "->" for selecting ordered rows directly from the index.
> This query selects rows ordered ascending by the 3rd coordinate:
>
> SELECT * FROM objects ORDER BY objects.coord->3 LIMIT 10;
>
> For descending ordering, the suggested syntax puts a minus before the coordinate.
> This query selects rows ordered descending by the 4th coordinate:
>
> SELECT * FROM objects ORDER BY objects.coord->-4 LIMIT 10;
I've checked the patch. My first notes:
1) Check the coding style, in particular braces: Postgres coding style requires them for multiline statements (see the short illustration after this message).
2) Update documentation according to new features.
With best regards,
Alexander Korotkov.
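For reference, a generic illustration of the brace convention the review asks for (an illustrative fragment, not code from the patch; DIM and LL_COORD are the macros cube already uses):

  /* without braces: a multiline controlled statement */
  for (i = 0; i < DIM(a); i++)
      sum += fabs(LL_COORD(a, i) -
                  LL_COORD(b, i));

  /* with braces, per the project coding style asked for above */
  for (i = 0; i < DIM(a); i++)
  {
      sum += fabs(LL_COORD(a, i) -
                  LL_COORD(b, i));
  }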
Documentation along with style fix.

> On 08 Feb 2015, at 00:32, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Hi!
>
> On Sat, Feb 7, 2015 at 12:45 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
>> I had updated the old patch with kNN operators for cube data structures. [...]
>
> I've checked the patch. My first notes:
> 1) Check the coding style, in particular braces: Postgres coding style requires them for multiline statements.
> 2) Update documentation according to new features.
>
> ------
> With best regards,
> Alexander Korotkov.
Attachment
On Thu, Mar 12, 2015 at 8:43 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
Documentation along with style fix.
Since we change the interface of the extension, we have to bump its version and create a migration script.
E.g. you have to rename cube--1.0.sql to cube--1.1.sql and create cube--1.0--1.1.sql to migrate the old version.
+ -- Alias for backword compatibility
CREATE FUNCTION cube_distance(cube, cube)
RETURNS float8
+ AS 'MODULE_PATHNAME', 'distance_euclid'
+ LANGUAGE C IMMUTABLE STRICT;
For backward compatibility it would be better to keep the old C-function name cube_distance, so that an extension with the old SQL definition could still work with the new binary.
With best regards,
Alexander Korotkov.
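For concreteness, a hedged sketch of the migration layout being requested (file contents illustrative, not the actual patch):

  -- in cube.control, bump the default version:
  --     default_version = '1.1'

  -- cube--1.0--1.1.sql: only the objects added between 1.0 and 1.1
  \echo Use "ALTER EXTENSION cube UPDATE TO '1.1'" to load this file. \quit

  CREATE FUNCTION distance_taxicab(cube, cube) RETURNS float8
      AS 'MODULE_PATHNAME', 'distance_taxicab'
      LANGUAGE C IMMUTABLE STRICT;

  CREATE OPERATOR <#> (
      LEFTARG = cube, RIGHTARG = cube,
      PROCEDURE = distance_taxicab, COMMUTATOR = '<#>'
  );

Existing installations would then pick up the new objects with ALTER EXTENSION cube UPDATE TO '1.1';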
Sergey Konoplev wrote:
> On Thu, Mar 27, 2014 at 3:26 PM, Sergey Konoplev <gray.ru@gmail.com> wrote:
>> On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
>>> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.
>>
>> What is the status of this patch?
>
> Referring to our private conversation with Alexander Korotkov, the
> patch is in a WIP state currently and, hopefully, will be ready by 9.5.
> I'm ready to actively participate in its testing on a real world
> production set of data.

This patch doesn't seem to have received an updated version. Should we just punt on it? The assumption would be that Stas or Alexander will be re-submitting this for 9.6.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi!

The patch is pretty ready; the last issue was about the changed extension interface, so there should be a migration script and a version bump. Attaching a version with all the migration stuff.

Stas.

> On 09 May 2015, at 05:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Sergey Konoplev wrote:
>> [...]
>
> This patch doesn't seem to have received an updated version. Should we
> just punt on it? The assumption would be that Stas or Alexander will be
> re-submitting this for 9.6.
Attachment
Hi!
On Sat, May 9, 2015 at 6:53 AM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
> The patch is pretty ready; the last issue was about the changed extension interface, so there should be a migration script and a version bump.
> Attaching a version with all the migration stuff.
I can't see cube--1.0--1.1.sql in the patch. Did you forget to include it?
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hello.

Here is the updated version of the patch with proper update scripts.

Also, I've noted that the documentation states the wrong thing:

"It does not matter which order the opposite corners of a cube are entered in. The cube functions automatically swap values if needed to create a uniform 'lower left — upper right' internal representation."

But in practice cubes are stored "as is", and that leads to problems with getting cubes sorted along a specific dimension directly from the index.

As the simplest workaround I've deleted that sentence from the docs and implemented two coordinate getters, -> and ~>. The first one returns the coordinate of the cube as stored, and the second returns the coordinate of the cube normalized to (LL,UR)-form.

The other way to fix this is to force 'normalization' while creating the cube. But that could produce wrong sorts with already existing data.

> On 09 Jul 2015, at 16:40, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Hi!
>
> On Sat, May 9, 2015 at 6:53 AM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
>> The patch is pretty ready; the last issue was about the changed extension interface, so there should be a migration script and a version bump.
>> Attaching a version with all the migration stuff.
>
> I can't see cube--1.0--1.1.sql in the patch. Did you forget to include it?
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
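A small sketch of the difference between the two getters on a non-normalized cube, with the expected results following the description above (it assumes the updated patch, and that ~> numbers lower bounds 1..DIM, upper bounds DIM+1..2*DIM):

  SELECT c ->  2 AS stored_coord2,    -- 3: 2nd coordinate as stored
         c ~>  2 AS llur_coord2       -- 1: lower-left bound of dimension 2
  FROM (SELECT '(1,3),(3,1)'::cube AS c) t;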
The patch looks good, but there are some review notes:

1) gmake installcheck fails:

*** /.../pgsql/contrib/cube/expected/cube_1.out 2015-12-01 17:49:01.768764000 +0300
--- /.../pgsql/contrib/cube/results/cube.out 2015-12-01 17:49:12.190818000 +0300
***************
*** 1382,1388 ****
  (1 row)

  -- Test of distances
! --
  SELECT cube_distance('(1,1)'::cube, '(4,5)'::cube);
  cube_distance
---------------
--- 1382,1388 ----
  (1 row)

  -- Test of distances
! --
  SELECT cube_distance('(1,1)'::cube, '(4,5)'::cube);
  cube_distance

Seems there is an extra space at the end of the string.

2) Please don't use magic constants like 'case 16:' in C code. Use macros to define human-readable names (g_cube_distance()).

3) The default switch path in g_cube_distance() should raise an error. It just simplifies debugging, maybe in the future.

4) Docs: please don't use lines of unlimited length.

Stas Kelvich wrote:
> Hello.
>
> Here is the updated version of the patch with proper update scripts.
>
> Also, I've noted that the documentation states the wrong thing:
>
> "It does not matter which order the opposite corners of a cube are entered in. The cube functions automatically swap values if needed to create a uniform 'lower left — upper right' internal representation."
>
> But in practice cubes are stored "as is", and that leads to problems with getting cubes sorted along a specific dimension directly from the index.
> As the simplest workaround I've deleted that sentence from the docs and implemented two coordinate getters, -> and ~>. The first one returns the coordinate of the cube as stored, and the second returns the coordinate of the cube normalized to (LL,UR)-form.
>
> The other way to fix this is to force 'normalization' while creating the cube. But that could produce wrong sorts with already existing data.

--
Teodor Sigaev    E-mail: teodor@sigaev.ru
                 WWW: http://www.sigaev.ru/
Hello, fixed.

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

> On 01 Dec 2015, at 17:52, Teodor Sigaev <teodor@sigaev.ru> wrote:
>
> The patch looks good, but there are some review notes:
>
> 1) gmake installcheck fails (seems there is an extra space at the end of the string). [diff hunk trimmed, see above]
>
> 2) Please don't use magic constants like 'case 16:' in C code. Use macros to define human-readable names (g_cube_distance()).
>
> 3) The default switch path in g_cube_distance() should raise an error. It just simplifies debugging, maybe in the future.
>
> 4) Docs: please don't use lines of unlimited length.
>
> Stas Kelvich wrote:
>> Here is the updated version of the patch with proper update scripts. [...]
>
> --
> Teodor Sigaev    E-mail: teodor@sigaev.ru
>                  WWW: http://www.sigaev.ru/
Attachment
Hi,

On 12/07/2015 03:47 PM, Stas Kelvich wrote:
> Hello, fixed.

I've looked at the patch today; it seems mostly fine to me. Three comments though:

1) (nitpicking) There seem to be some minor whitespace issues, i.e. trailing spaces, empty lines being added/removed, etc.

2) One of the regression tests started to fail:

SELECT '-1e-700'::cube AS cube;

This used to return (0) but now I get (-0). As this query existed in 1.0, it's probably due to a change in the C code. Not sure where.

3) I wonder why the docs were changed like this:

 <para>
- It does not matter which order the opposite corners of a cube are
- entered in. The <type>cube</> functions
- automatically swap values if needed to create a uniform
- <quote>lower left — upper right</> internal representation.
+ When corners coincide cube stores only one corner along with a
+ special flag in order to reduce size wasted.
 </para>

Was the old behavior removed? I don't think so - it seems to behave as before, so why remove this information? Maybe it's not useful? But then why add the bit about optimizing storage of points?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, thanks for the review.

> 1) (nitpicking) There seem to be some minor whitespace issues, i.e.
> trailing spaces, empty lines being added/removed, etc.

Fixed, I think.

> 2) One of the regression tests started to fail:
>
> SELECT '-1e-700'::cube AS cube;
>
> This used to return (0) but now I get (-0).

Actually that problem emerged because of the first problem. I had extra whitespace in the sql file and removed that whitespace from one of the answer files (cube_1.sql), so the diff with both cube.sql and cube_1.sql was one line long and you saw the diff with cube.sql.

On all systems available to me (osx/linux/freebsd) I saw that the right answers file is cube_1.sql. But on other OSes you can get +/-0 or e27/e027. I edited those answer files manually, so there could be some other typos.

> 3) I wonder why the docs were changed like this:
> [doc hunk trimmed, see above]
>
> Was the old behavior removed? I don't think so - it seems to behave
> as before, so why remove this information? Maybe it's not useful?
> But then why add the bit about optimizing storage of points?

I've edited it because the statement was misleading (or at least ambiguous): the cube_in function doesn't swap coordinates. A simple way to see it:

> select '(1,3),(3,1)'::cube;
     cube
---------------
 (1, 3),(3, 1)

But the LowerLeft-UpperRight representation should be (1,1),(3,3).

Updated patch attached.

> On 15 Dec 2015, at 21:46, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> [...]
Attachment
Hi,

On 12/16/2015 01:26 PM, Stas Kelvich wrote:
> Hi, thanks for the review.
>
>> 1) (nitpicking) There seem to be some minor whitespace issues, i.e.
>> trailing spaces, empty lines being added/removed, etc.
>
> Fixed, I think.
>
>> 2) One of the regression tests started to fail:
>>
>> SELECT '-1e-700'::cube AS cube;
>>
>> This used to return (0) but now I get (-0).
>
> Actually that problem emerged because of the first problem. I had extra whitespace in the sql file and removed that whitespace from one of the answer files (cube_1.sql), so the diff with both cube.sql and cube_1.sql was one line long and you saw the diff with cube.sql.
> On all systems available to me (osx/linux/freebsd) I saw that the right answers file is cube_1.sql. But on other OSes you can get +/-0 or e27/e027. I edited those answer files manually, so there could be some other typos.

Ah! So that's why I couldn't quickly find the issue in the C code ...

>> 3) I wonder why the docs were changed like this:
>> [...]
>
> I've edited it because the statement was misleading (or at least ambiguous): the cube_in function doesn't swap coordinates.
> A simple way to see it:
>> select '(1,3),(3,1)'::cube;
>      cube
> ---------------
>  (1, 3),(3, 1)
>
> But the LowerLeft-UpperRight representation should be (1,1),(3,3).

I don't think that's what the comment says, actually. It rather refers to code like this:

result = Min(LL_COORD(c, n - 1), UR_COORD(c, n - 1));

i.e. if you specifically ask for a particular corner (ll, in this case), you'll get the proper value.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> I don't think that's what the comment says, actually. It rather refers to code like this:
>
> result = Min(LL_COORD(c, n - 1), UR_COORD(c, n - 1));
>
> i.e. if you specifically ask for a particular corner (ll, in this case), you'll get the proper value.

Hmm, I was confused by the phrase "create a uniform _internal_ representation", since internally cubes are actually stored "as is". But probably I just misinterpreted that. So here is the updated version with the old documentation restored.

> On 16 Dec 2015, at 16:46, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> [...]
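Tomas's reading can be checked from SQL with the long-standing accessor functions, which normalize on read even though the corners are stored as entered (stock cube behavior, independent of this patch):

  SELECT '(1,3),(3,1)'::cube AS stored,
         cube_ll_coord('(1,3),(3,1)'::cube, 2) AS ll_dim2,   -- 1
         cube_ur_coord('(1,3),(3,1)'::cube, 2) AS ur_dim2;   -- 3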