Thread: Cube extension kNN support
Hello, hackers.

Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.

The following distance operators are introduced:

  <#>  taxicab distance
  <->  euclidean distance
  <=>  chebyshev distance

For example:

  SELECT * FROM objects ORDER BY objects.coord <-> '(137,42,314)'::cube LIMIT 10;

There is also an operator "->" for selecting ordered rows directly from the index. This query selects rows ordered ascending by the 3rd coordinate:

  SELECT * FROM objects ORDER BY objects.coord->3 LIMIT 10;

For descending ordering, the suggested syntax puts a minus before the coordinate. This query selects rows ordered descending by the 4th coordinate:

  SELECT * FROM objects ORDER BY objects.coord->-4 LIMIT 10;

Stas Kelvich.
Attachment
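For readers who want to try this out, a minimal end-to-end sketch (the table and data are made up for illustration; it assumes the patch is applied so the cube GiST opclass supports the new ordering operators):

  CREATE EXTENSION cube;
  CREATE TABLE objects (id serial, coord cube);
  -- 100k random 3-d points
  INSERT INTO objects (coord)
      SELECT cube(array[random()*100, random()*100, random()*100])
      FROM generate_series(1, 100000);
  CREATE INDEX objects_coord_idx ON objects USING gist (coord);
  -- 10 nearest neighbors by euclidean distance, driven by the index
  SELECT * FROM objects ORDER BY coord <-> '(50,50,50)'::cube LIMIT 10;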
<div dir="ltr">Do you have any benchmarks ?<br /></div><div class="gmail_extra"><br /><br /><div class="gmail_quote">On Mon,Sep 23, 2013 at 3:38 AM, Stas Kelvich <span dir="ltr"><<a href="mailto:stas.kelvich@gmail.com" target="_blank">stas.kelvich@gmail.com</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px#ccc solid;padding-left:1ex">Hello, hackers.<br /><br /> Here is the patch that introduces kNN searchfor cubes with euclidean, taxicab and chebyshev distances.<br /><br /> Following distance operators introduced:<br/><br /> <#> taxicab distance<br /> <-> euclidean distance<br /> <=> chebyshev distance<br/><br /> For example:<br /> SELECT * FROM objects ORDER BY objects.coord <-> '(137,42,314)'::cube LIMIT10;<br /><br /> Also there is operator "->" for selecting ordered rows directly from index.<br /> This request selectsrows ordered ascending by 3rd coordinate:<br /><br /> SELECT * FROM objects ORDER BY objects.coord->3 LIMIT 10;<br/><br /> For descendent ordering suggested syntax with minus before coordinate.<br /> This request selects rows ordereddescending by 4th coordinate:<br /><br /> SELECT * FROM objects ORDER BY objects.coord->-4 LIMIT 10;<br /><br />Stas Kelvich.<br /><br /><br /><br /><br /> --<br /> Sent via pgsql-hackers mailing list (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/> To make changes to your subscription:<br/><a href="http://www.postgresql.org/mailpref/pgsql-hackers" target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/><br /></blockquote></div><br /></div>
On 9/22/13 7:38 PM, Stas Kelvich wrote:
> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.

cube and earthdistance regression tests fail.
On 9/22/13 4:38 PM, Stas Kelvich wrote:
> Hello, hackers.
>
> Here is the patch that introduces kNN search for cubes with
> euclidean, taxicab and chebyshev distances.

Thanks for this! I decided to give the patch a try at the bleeding edge with some high-dimensional vectors, specifically the 1.4 million 1000-dimensional Freebase entity vectors from the Google 'word2vec' project:

https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming

Unfortunately, here's what I found:

* with 1000-dimensional vectors, the index size on disk grows to many times (1000x or more) the size of the indexed data, making other tests of the index's helpfulness impractical. (Perhaps other parameter-tuning can help?)

* with 500-dimensional or 100-dimensional vectors, the index size is more manageable -- 2x to 6x the data size -- but using the index significantly hurts performance on top-10 nearest-neighbor searches, making it much slower than a full table scan. (The planner still tries to use the index, even though it's hurting.)

Much more detail for the curious is below...

Regarding the dataset & hoped-for results:

The word2vec project's example scripts do their synonym/analogy demonstrations by loading the entire 5GB+ dataset into main memory (~3min), then doing a full scan of all vectors (~40sec) to find those nearest a target vector. My motivating question was: could the data be loaded into Postgresql using the cube type, kNN-GIST indexed using this patch, and then used for speedy index-assisted ranked-nearest-neighbor queries from the database?

(While the patch's distance_euclid is not the same cosine-distance the word2vec examples use, the freebase entity vectors are all unit vectors, and even additions of vectors can be scaled to unit length. My intuition is that euclidean distances on the unit sphere will be in the same rank order as cosine-distance, so the cube distance_euclid/<-> should enable the same sort of synonym/analogy demos.)

Regarding the 1000-dimensional case:

It was necessary to change three compiled-in implementation limits, in four steps, in the order the need for each change became evident:

(1) in contrib/cube/cubedata.h, increase CUBE_MAX_DIM (to accept vectors with more than 100 dimensions)

(2) in src/include/pg_config.h, increase BLCKSZ to 16384 (otherwise 1000 64-bit floats in a single column gave an <ERROR: row is too big>, mentioning the 8160 limit - large cube values not TOASTable?)

(3) in src/include/access/itup.h, increase INDEX_SIZE_MASK to 0x3fff (otherwise encountering an <ERROR: index row requires 16016 bytes, maximum size is 8191> when attempting to create the index)

(4) in src/include/pg_config.h, again increase BLCKSZ, now to 32768 (otherwise encountering an <ERROR: index row size 16416 exceeds maximum 5440 for index "pg_class_relname_nsp_index"> when attempting to create the index)

With the cube-kNN patch applied and these other changes, I was able to import the 1.4M freebase vectors and do a full-scan nearest-neighbors query. (My starting postgresql codebase was the github mirror of 9.4dev as of about a week ago.) The psql transcript:

word2vec=# \timing
Timing is on.
word2vec=# CREATE EXTENSION cube;
CREATE EXTENSION
Time: 42.141 ms
word2vec=# CREATE TABLE word2vec ( word varchar(200), vec cube );
CREATE TABLE
Time: 8.532 ms
word2vec=# COPY word2vec FROM PROGRAM 'zcat /tmp/pgvectors/freebase-vectors-skipgram1000.pgtxt.gz';
COPY 1422903
Time: 12399065.498 ms
word2vec=# SELECT word, dist FROM (SELECT word, distance_euclid(vec,(SELECT vec FROM word2vec WHERE word='geoffrey hinton')) AS dist FROM word2vec) AS subquery ORDER BY dist LIMIT 11;
          word           |       dist
-------------------------+------------------
 geoffrey hinton         |                0
 marvin minsky           | 1.03892498287268
 paul corkum             | 1.05221701690288
 william richard peltier | 1.06244397334495
 brenda milner           | 1.06387762685894
 john charles polanyi    | 1.07444446452295
 leslie valiant          | 1.07735786596934
 hava siegelmann         | 1.08148623006629
 hans moravec            |  1.0862034591185
 david rumelhart         | 1.08748431130477
 godel prize             | 1.08774264379264
(11 rows)

Time: 310753.976 ms

That's 3.5 hours to do the import and 5 minutes to do the query; this is on a 2009 MacBook Pro with 8GB RAM and SSD. Confirming the intuition above, these 10-nearest are the same entities in the same order as on the word2vec project page's example output, though the euclidean distances are of course different than the cosine distances.

The DATA directory is 23GB after the import of 1.4 million rows. In the word2vec uncompressed binary format, this dataset is about 5.4GB, so this word2vec-table cube-column representation involves about 4X expansion.

So, to the main question: can that query be sped up by building a kNN-GIST index? Here the problems start.

word2vec=# CREATE INDEX word2vec_index ON word2vec USING gist (vec);

This attempt ran for hours, consuming another 80GB+ before failing due to disk full. In fact, the largest table with 1000-dimensional vectors for which I was able to build a gist index was a mere 100 rows. That index-build took about 14 minutes and grew the DATA directory like so:

 73MB pgdata  # empty word2vec table
 75MB pgdata  # after COPY FROM of 100 vectors
8.4GB pgdata  # after CREATE INDEX

Of course with just 100 rows the index isn't practically needed or helpful for query speed-up. Even trying just 500 rows, the CREATE INDEX command ran for hours before failing by consuming all available disk space (about 90GB more). It seems the knn-GIST index overhead for 1000-dimensional cubes grows faster than linearly in the number of rows.

I was able to complete index builds with fewer dimensions.

100 dimensions, most-frequent 850K entities:

The dataset was trimmed to the first (most-frequent) 850K entities and each vector truncated to its first 100 dimensions. (Even though this would fit in the usual CUBE_MAX_DIM/BLCKSZ/INDEX_SIZE_MASK limits, I still used the customized build with limits expanded for the 1000d case.) Observed data sizes and operation times were:

 73MB pgdata  # empty word2vec table
2.6GB pgdata  # after COPY FROM of 850K vectors, taking 216s
              # before indexing: nearest-11 by distance_euclid(): ~2s
              # before indexing: nearest-11 by <-> operator: ~2s
5.0GB pgdata  # after CREATE INDEX, taking 1344s
              # after indexing: nearest-11 by distance_euclid(): ~2s
              # after indexing: nearest-11 by <-> operator: ~57s  # "Index Scan using word2vec_index…"

So the availability of the index causes a significant slowdown... and the planner does not learn to choose the faster full sequential scan.

500 dimensions, most-frequent 100K entities:

The dataset was trimmed to the first (most-frequent) 100K entities and each vector truncated to its first 500 dimensions. (Again, still using the custom build with upped CUBE_MAX_DIM/BLCKSZ/INDEX_SIZE_MASK.) Observed data sizes and operation times were:

 73MB pgdata  # empty word2vec table
1.6GB pgdata  # after COPY FROM of 100K vectors, taking 266s
              # before indexing: nearest-11 by distance_euclid(): ~2s
              # before indexing: nearest-11 by <-> operator: ~2s
4.8GB pgdata  # after CREATE INDEX, taking 977s
              # after indexing: nearest-11 by distance_euclid(): ~2s
              # after indexing: nearest-11 by <-> operator: ~46s  # "Index Scan using word2vec_index…"

Dropping the index makes the <-> query fast again.

Open questions and tentative conclusions:

The massive knn-GIST index overhead for 1000-dimensional vectors makes it hard to evaluate whether a fully-built index could be useful on large datasets. Perhaps some aspects of the knn-GIST support implicitly assume a low dimensionality (2-4) in the data, and large numbers of dimensions cause pathological index sizes? Or something specific about this dataset (all vectors on the unit sphere) is a challenging case?

In the truncated 100d or 500d cases, the index can be built, but slows rather than speeds nearest-neighbor queries that use it. The evaluated patch's definitions of distance_euclid and other support functions seem straightforward, and give proper results on simple test cases... so the problem, if any, would appear to be in the general distance-driven knn-GIST indexing and query-planning.

If there is a type of cube-type-represented dimensional data where this indexing helps, it may only be with far fewer dimensions than 100, and/or far more rows than 100K/850K.

If a knn-GIST implementor/expert has suggestions for tuning the index overhead and behavior -- perhaps a different penalty or picksplit function? -- I'd be happy to try those out and report back. Otherwise, I hope this write-up will help evaluate the patch or save time for others tempted to try similar knn-GIST indexing of higher-dimensional data.

- Gordon Mohr
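For anyone reproducing the "index hurts but still gets picked" behavior above, a small self-contained sketch (synthetic 3-d data rather than the word2vec set; assumes the patch's <-> operator) that compares the index-assisted plan against a forced sequential scan:

  CREATE TABLE vecs AS
      SELECT cube(array[random(), random(), random()]) AS vec
      FROM generate_series(1, 100000);
  CREATE INDEX vecs_knn_idx ON vecs USING gist (vec);
  ANALYZE vecs;

  -- plan A: index-assisted kNN ordering
  EXPLAIN ANALYZE
      SELECT * FROM vecs ORDER BY vec <-> '(0.5,0.5,0.5)'::cube LIMIT 11;

  -- plan B: disable index scans to get the seqscan + top-N sort plan
  SET enable_indexscan = off;
  EXPLAIN ANALYZE
      SELECT * FROM vecs ORDER BY vec <-> '(0.5,0.5,0.5)'::cube LIMIT 11;
  RESET enable_indexscan;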
Gordon Mohr wrote:
> Thanks for this! I decided to give the patch a try at the bleeding
> edge with some high-dimensional vectors, specifically the 1.4
> million 1000-dimensional Freebase entity vectors from the Google
> 'word2vec' project:
>
> https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming
>
> Unfortunately, here's what I found:

I wonder if these results would improve with this patch:
http://www.postgresql.org/message-id/EFEDC2BF-AB35-4E2C-911F-FC88DA6473D7@gmail.com

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/23/13 9:05 PM, Alvaro Herrera wrote:
> Gordon Mohr wrote:
>> Thanks for this! I decided to give the patch a try at the bleeding
>> edge with some high-dimensional vectors, specifically the 1.4
>> million 1000-dimensional Freebase entity vectors from the Google
>> 'word2vec' project:
>>
>> https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming
>>
>> Unfortunately, here's what I found:
>
> I wonder if these results would improve with this patch:
> http://www.postgresql.org/message-id/EFEDC2BF-AB35-4E2C-911F-FC88DA6473D7@gmail.com

Thanks for the pointer; I'd missed that relevant update from Stas Kelvich.

I applied that patch, and reindexed.

On the 100-dimension, 850K vector set:
indexing: 1137s (vs. 1344s)
DATA size: 4.7G (vs. 5.0G)
top-11-nearest-neighbor query: 32s (vs. ~57s)

On the 500-dimension, 100K vector set:
indexing: 756s (vs. 977s)
DATA size: 4.5G (vs. 4.8G)
top-11-nearest-neighbor query: 18s (vs. ~46s)

So, moderate (5-20%) improvements in indexing time and size, and larger (40-60%) speedups in index-assisted (<->) queries... but those index-assisted queries are still ~10X+ slower than the sequential-scan (distance_euclid()) queries, so the existence of the knn-GIST index is still harming rather than helping performance.

Will update if my understanding changes; still interested to hear if I've missed a key factor/switch needed for these indexes to work well.

- Gordon Mohr
On Thu, Oct 24, 2013 at 3:50 AM, Gordon Mohr <gojomo-pgsql@xavvy.com> wrote:
> On 9/22/13 4:38 PM, Stas Kelvich wrote:
>> Hello, hackers.
>>
>> Here is the patch that introduces kNN search for cubes with
>> euclidean, taxicab and chebyshev distances.
>
> Thanks for this! I decided to give the patch a try at the bleeding edge with some high-dimensional vectors, specifically the 1.4 million 1000-dimensional Freebase entity vectors from the Google 'word2vec' project:
I believe the curse of dimensionality is affecting you here. I think it is impossible to get an improvement over a sequential scan for 1000-dimensional vectors. Read here:
Regards
Marcin Mańk
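The concentration-of-distances effect behind this can be seen directly in SQL. A hedged sketch (it assumes the patch's cube <-> cube operator; dimensionality is kept at 100 to stay under the default CUBE_MAX_DIM): for random points, the ratio between the farthest and nearest pairwise distance collapses toward 1 as dimensionality grows, leaving an R-tree-style index nothing to prune on.

  -- max/min pairwise euclidean distance over 200 random points
  WITH pts AS (
      SELECT i, cube(array_agg(random())) AS v
      FROM generate_series(1, 200) AS i,
           generate_series(1, 100) AS dim   -- dimensionality; compare 2 vs 100
      GROUP BY i
  )
  SELECT min(d) AS nearest, max(d) AS farthest, max(d) / min(d) AS spread
  FROM (SELECT a.v <-> b.v AS d FROM pts a JOIN pts b ON a.i < b.i) t;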
Hi.

> cube and earthdistance regression tests fail.

Code updated to work with current HEAD. Also added tests to cover the new functionality.

> Do you have any benchmarks?

This patch just introduces the functionality of calculating distances between cubes, so this code doesn't interfere much with kNN search speed. I think it's better to publish such benchmarks in the neighbor patch about the split algorithm. Anyway, we can compare kNN with b-tree and full scan:

  create table test(a1 float, a2 float, a3 float);
  insert into test (select 100*random(), 100*random(), 100*random() from generate_series(1,1000000) as s(a));
  create index on test using gist(cube(array[a1,a2,a3]));
  select * from test order by a1 limit 15;                          -- 227.658 ms
  select * from test order by cube(array[a1,a2,a3])->1 limit 15;    -- 1.275 ms
  create index on test(a1);
  select * from test order by a1 limit 15;                          -- 0.103 ms

As we can see, kNN ordering is 10 times slower than B-tree (on a silly request for an R-Tree, just as an example), but still 100+ times faster than a full scan on this table.

Stas.

On Sep 25, 2013, at 5:25 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 9/22/13 7:38 PM, Stas Kelvich wrote:
>> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.
>
> cube and earthdistance regression tests fail.
Attachment
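The same expression index should also be able to drive a true nearest-point query via the new distance operator; a hypothetical continuation of the benchmark above (assuming the patch is applied):

  -- 15 rows nearest to (50,50,50), ordered from the functional GiST index
  SELECT * FROM test
  ORDER BY cube(array[a1,a2,a3]) <-> '(50,50,50)'::cube
  LIMIT 15;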
cube.c: In function ‘g_cube_distance’:
cube.c:1453:2: warning: ‘retval’ may be used uninitialized in this function [-Wmaybe-uninitialized]
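A hedged sketch of the usual fix for this class of warning, not the actual patch code: initialize the result and give the strategy switch an erroring default, so the compiler can prove retval is assigned on every path. (The strategy macro and helper names here are hypothetical, and the real GiST support function takes fcinfo-style arguments rather than this simplified signature.)

  /* sketch only: hypothetical strategy macros and distance helpers */
  static double
  g_cube_distance_sketch(NDBOX *cube, NDBOX *query, StrategyNumber strategy)
  {
      double  retval = 0.0;   /* initialized: silences -Wmaybe-uninitialized */

      switch (strategy)
      {
          case KNN_STRATEGY_TAXICAB:
              retval = distance_taxicab(cube, query);
              break;
          case KNN_STRATEGY_EUCLID:
              retval = distance_euclid(cube, query);
              break;
          case KNN_STRATEGY_CHEBYSHEV:
              retval = distance_chebyshev(cube, query);
              break;
          default:
              elog(ERROR, "unrecognized cube kNN strategy: %d", strategy);
      }
      return retval;
  }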
Hi everyone,

On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.

What is the status of this patch?

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA
http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com
On Thu, Mar 27, 2014 at 3:26 PM, Sergey Konoplev <gray.ru@gmail.com> wrote:
> On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
>> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.
>
> What is the status of this patch?

Referring to our private conversation with Alexander Korotkov, the patch is in a WIP state currently and, hopefully, will be ready by 9.5. I'm ready to actively participate in its testing on a real-world production set of data.

I'm not sure if it is doable at all, but is there any possibility to implement it here, or, what would be just great, are there any ready or half-ready solutions for a Hamming-distance-based kNN search?

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA
http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com
On Mon, Mar 31, 2014 at 10:01 PM, Sergey Konoplev <gray.ru@gmail.com> wrote:
> On Thu, Mar 27, 2014 at 3:26 PM, Sergey Konoplev <gray.ru@gmail.com> wrote:
>> On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
>>> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.
>>
>> What is the status of this patch?
>
> Referring to our private conversation with Alexander Korotkov, the
> patch is in a WIP state currently and, hopefully, will be ready by 9.5.
> I'm ready to actively participate in its testing on a real-world
> production set of data.
>
> I'm not sure if it is doable at all, but is there any possibility to
> implement it here, or, what would be just great, are there any ready or
> half-ready solutions for a Hamming-distance-based kNN search?
Cube deals with float8 numbers. There is another patch making it work with other number types. But Hamming distance is for bit vectors, isn't it?
----
With best regards,
Alexander Korotkov.
On Mon, Mar 31, 2014 at 12:09 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> I'm not sure if it is doable at all, but is there any possibility to
>> implement it here, or, what would be just great, are there any ready or
>> half-ready solutions for a Hamming-distance-based kNN search?
>
> Cube deals with float8 numbers. There is another patch making it work with
> other number types. But Hamming distance is for bit vectors, isn't it?

It can be generalized to character vectors. Though, I agree, that was an off-topic question to some extent. I was wondering if there were any Postgres-related indexable Hamming/Manhattan distance experiments/thoughts/discussions, and whether kNN can be used here or not, because from my understanding it can be represented as spatial (I might be very wrong here).

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA
http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com
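As an aside on the Hamming question: for bit strings the distance itself is easy to express with stock built-ins, though without any index support, so this is only a sequential-scan helper. A sketch:

  -- Hamming distance = number of set bits in the XOR of two
  -- equal-length bit strings ('#' is the built-in bit-string XOR)
  CREATE FUNCTION hamming(a bit varying, b bit varying) RETURNS int
      LANGUAGE sql IMMUTABLE STRICT
      AS $$ SELECT length(replace((a # b)::text, '0', '')) $$;

  SELECT hamming(B'10110', B'11100');  -- returns 2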
Hi!

I had updated the old patch with kNN operators for cube data structures. Copying the description from the old message:

The following distance operators are introduced:

  <#>  taxicab distance
  <->  euclidean distance
  <=>  chebyshev distance

For example:

  SELECT * FROM objects ORDER BY objects.coord <-> '(137,42,314)'::cube LIMIT 10;

There is also an operator "->" for selecting ordered rows directly from the index. This query selects rows ordered ascending by the 3rd coordinate:

  SELECT * FROM objects ORDER BY objects.coord->3 LIMIT 10;

For descending ordering, the suggested syntax puts a minus before the coordinate. This query selects rows ordered descending by the 4th coordinate:

  SELECT * FROM objects ORDER BY objects.coord->-4 LIMIT 10;

Stas Kelvich.
Attachment
Hi!
On Sat, Feb 7, 2015 at 12:45 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
> I had updated the old patch with kNN operators for cube data structures. Copying the description from the old message:
>
> The following distance operators are introduced:
>
> <#>  taxicab distance
> <->  euclidean distance
> <=>  chebyshev distance
>
> For example:
> SELECT * FROM objects ORDER BY objects.coord <-> '(137,42,314)'::cube LIMIT 10;
>
> There is also an operator "->" for selecting ordered rows directly from the index.
> This query selects rows ordered ascending by the 3rd coordinate:
>
> SELECT * FROM objects ORDER BY objects.coord->3 LIMIT 10;
>
> For descending ordering, the suggested syntax puts a minus before the coordinate.
> This query selects rows ordered descending by the 4th coordinate:
>
> SELECT * FROM objects ORDER BY objects.coord->-4 LIMIT 10;
I've checked the patch. My first notes:
1) Check the coding style, in particular braces: Postgres coding style requires them for multiline statements (see the short illustration after this message).
2) Update documentation according to new features.
With best regards,
Alexander Korotkov.
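For reference, a generic illustration of the brace convention the review asks for (an illustrative fragment, not code from the patch; DIM and LL_COORD are the macros cube already uses):

  /* without braces: a multiline controlled statement */
  for (i = 0; i < DIM(a); i++)
      sum += fabs(LL_COORD(a, i) -
                  LL_COORD(b, i));

  /* with braces, per the project coding style asked for above */
  for (i = 0; i < DIM(a); i++)
  {
      sum += fabs(LL_COORD(a, i) -
                  LL_COORD(b, i));
  }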
Documentation along with style fix.

> On 08 Feb 2015, at 00:32, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Hi!
>
> On Sat, Feb 7, 2015 at 12:45 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
>> I had updated the old patch with kNN operators for cube data structures. [...]
>
> I've checked the patch. My first notes:
> 1) Check the coding style, in particular braces: Postgres coding style requires them for multiline statements.
> 2) Update documentation according to new features.
>
> ------
> With best regards,
> Alexander Korotkov.
Attachment
On Thu, Mar 12, 2015 at 8:43 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
Documentation along with style fix.
Since we change the interface of the extension, we have to bump its version and create a migration script.
E.g. you have to rename cube--1.0.sql to cube--1.1.sql and create cube--1.0--1.1.sql to migrate the old version.
+ -- Alias for backword compatibility
CREATE FUNCTION cube_distance(cube, cube)
RETURNS float8
+ AS 'MODULE_PATHNAME', 'distance_euclid'
+ LANGUAGE C IMMUTABLE STRICT;
For backward compatibility it would be better to keep the old C-function name cube_distance, so that an extension with the old SQL definition could still work with the new binary.
With best regards,
Alexander Korotkov.
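For concreteness, a hedged sketch of the migration layout being requested (file contents illustrative, not the actual patch):

  -- in cube.control, bump the default version:
  --     default_version = '1.1'

  -- cube--1.0--1.1.sql: only the objects added between 1.0 and 1.1
  \echo Use "ALTER EXTENSION cube UPDATE TO '1.1'" to load this file. \quit

  CREATE FUNCTION distance_taxicab(cube, cube) RETURNS float8
      AS 'MODULE_PATHNAME', 'distance_taxicab'
      LANGUAGE C IMMUTABLE STRICT;

  CREATE OPERATOR <#> (
      LEFTARG = cube, RIGHTARG = cube,
      PROCEDURE = distance_taxicab, COMMUTATOR = '<#>'
  );

Existing installations would then pick up the new objects with ALTER EXTENSION cube UPDATE TO '1.1';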
Sergey Konoplev wrote:
> On Thu, Mar 27, 2014 at 3:26 PM, Sergey Konoplev <gray.ru@gmail.com> wrote:
>> On Sun, Sep 22, 2013 at 4:38 PM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
>>> Here is the patch that introduces kNN search for cubes with euclidean, taxicab and chebyshev distances.
>>
>> What is the status of this patch?
>
> Referring to our private conversation with Alexander Korotkov, the
> patch is in a WIP state currently and, hopefully, will be ready by 9.5.
> I'm ready to actively participate in its testing on a real world
> production set of data.

This patch doesn't seem to have received an updated version. Should we just punt on it? The assumption would be that Stas or Alexander will be re-submitting this for 9.6.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi!

The patch is pretty ready; the last issue was about the changed extension interface, so there should be a migration script and a version bump. Attaching a version with all the migration stuff.

Stas.

> On 09 May 2015, at 05:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Sergey Konoplev wrote:
>> [...]
>
> This patch doesn't seem to have received an updated version. Should we
> just punt on it? The assumption would be that Stas or Alexander will be
> re-submitting this for 9.6.
Attachment
Hi!
On Sat, May 9, 2015 at 6:53 AM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
> The patch is pretty ready; the last issue was about the changed extension interface, so there should be a migration script and a version bump.
> Attaching a version with all the migration stuff.
I can't see cube--1.0--1.1.sql in the patch. Did you forget to include it?
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hello.

Here is the updated version of the patch with proper update scripts.

Also, I've noted that the documentation states the wrong thing:

"It does not matter which order the opposite corners of a cube are entered in. The cube functions automatically swap values if needed to create a uniform 'lower left — upper right' internal representation."

But in practice cubes are stored "as is", and that leads to problems with getting cubes sorted along a specific dimension directly from the index.

As the simplest workaround I've deleted that sentence from the docs and implemented two coordinate getters, -> and ~>. The first one returns the coordinate of the cube as stored, and the second returns the coordinate of the cube normalized to (LL,UR)-form.

The other way to fix this is to force 'normalization' while creating the cube. But that could produce wrong sorts with already existing data.

> On 09 Jul 2015, at 16:40, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Hi!
>
> On Sat, May 9, 2015 at 6:53 AM, Stas Kelvich <stas.kelvich@gmail.com> wrote:
>> The patch is pretty ready; the last issue was about the changed extension interface, so there should be a migration script and a version bump.
>> Attaching a version with all the migration stuff.
>
> I can't see cube--1.0--1.1.sql in the patch. Did you forget to include it?
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
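A small sketch of the difference between the two getters on a non-normalized cube, with the expected results following the description above (it assumes the updated patch, and that ~> numbers lower bounds 1..DIM, upper bounds DIM+1..2*DIM):

  SELECT c ->  2 AS stored_coord2,    -- 3: 2nd coordinate as stored
         c ~>  2 AS llur_coord2       -- 1: lower-left bound of dimension 2
  FROM (SELECT '(1,3),(3,1)'::cube AS c) t;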
The patch looks good, but there are some review notes:

1) gmake installcheck fails:

*** /.../pgsql/contrib/cube/expected/cube_1.out 2015-12-01 17:49:01.768764000 +0300
--- /.../pgsql/contrib/cube/results/cube.out 2015-12-01 17:49:12.190818000 +0300
***************
*** 1382,1388 ****
  (1 row)

  -- Test of distances
! --
  SELECT cube_distance('(1,1)'::cube, '(4,5)'::cube);
  cube_distance
---------------
--- 1382,1388 ----
  (1 row)

  -- Test of distances
! --
  SELECT cube_distance('(1,1)'::cube, '(4,5)'::cube);
  cube_distance

Seems there is an extra space at the end of the string.

2) Please don't use magic constants like 'case 16:' in C code. Use macros to define human-readable names (g_cube_distance()).

3) The default switch path in g_cube_distance() should raise an error. It just simplifies debugging, maybe in the future.

4) Docs: please don't use lines of unlimited length.

Stas Kelvich wrote:
> Hello.
>
> Here is the updated version of the patch with proper update scripts.
>
> Also, I've noted that the documentation states the wrong thing:
>
> "It does not matter which order the opposite corners of a cube are entered in. The cube functions automatically swap values if needed to create a uniform 'lower left — upper right' internal representation."
>
> But in practice cubes are stored "as is", and that leads to problems with getting cubes sorted along a specific dimension directly from the index.
> As the simplest workaround I've deleted that sentence from the docs and implemented two coordinate getters, -> and ~>. The first one returns the coordinate of the cube as stored, and the second returns the coordinate of the cube normalized to (LL,UR)-form.
>
> The other way to fix this is to force 'normalization' while creating the cube. But that could produce wrong sorts with already existing data.

--
Teodor Sigaev    E-mail: teodor@sigaev.ru
                 WWW: http://www.sigaev.ru/
Hello, fixed.

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

> On 01 Dec 2015, at 17:52, Teodor Sigaev <teodor@sigaev.ru> wrote:
>
> The patch looks good, but there are some review notes:
>
> 1) gmake installcheck fails (seems there is an extra space at the end of the string). [diff hunk trimmed, see above]
>
> 2) Please don't use magic constants like 'case 16:' in C code. Use macros to define human-readable names (g_cube_distance()).
>
> 3) The default switch path in g_cube_distance() should raise an error. It just simplifies debugging, maybe in the future.
>
> 4) Docs: please don't use lines of unlimited length.
>
> Stas Kelvich wrote:
>> Here is the updated version of the patch with proper update scripts. [...]
>
> --
> Teodor Sigaev    E-mail: teodor@sigaev.ru
>                  WWW: http://www.sigaev.ru/
Attachment
Hi,

On 12/07/2015 03:47 PM, Stas Kelvich wrote:
> Hello, fixed.

I've looked at the patch today; it seems mostly fine to me. Three comments though:

1) (nitpicking) There seem to be some minor whitespace issues, i.e. trailing spaces, empty lines being added/removed, etc.

2) One of the regression tests started to fail:

SELECT '-1e-700'::cube AS cube;

This used to return (0) but now I get (-0). As this query existed in 1.0, it's probably due to a change in the C code. Not sure where.

3) I wonder why the docs were changed like this:

 <para>
- It does not matter which order the opposite corners of a cube are
- entered in. The <type>cube</> functions
- automatically swap values if needed to create a uniform
- <quote>lower left — upper right</> internal representation.
+ When corners coincide cube stores only one corner along with a
+ special flag in order to reduce size wasted.
 </para>

Was the old behavior removed? I don't think so - it seems to behave as before, so why remove this information? Maybe it's not useful? But then why add the bit about optimizing storage of points?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, thanks for the review.

> 1) (nitpicking) There seem to be some minor whitespace issues, i.e.
> trailing spaces, empty lines being added/removed, etc.

Fixed, I think.

> 2) One of the regression tests started to fail:
>
> SELECT '-1e-700'::cube AS cube;
>
> This used to return (0) but now I get (-0).

Actually that problem emerged because of the first problem. I had extra whitespace in the sql file and removed that whitespace from one of the answer files (cube_1.sql), so the diff with both cube.sql and cube_1.sql was one line long and you saw the diff with cube.sql.

On all systems available to me (osx/linux/freebsd) I saw that the right answers file is cube_1.sql. But on other OSes you can get +/-0 or e27/e027. I edited those answer files manually, so there could be some other typos.

> 3) I wonder why the docs were changed like this:
> [doc hunk trimmed, see above]
>
> Was the old behavior removed? I don't think so - it seems to behave
> as before, so why remove this information? Maybe it's not useful?
> But then why add the bit about optimizing storage of points?

I've edited it because the statement was misleading (or at least ambiguous): the cube_in function doesn't swap coordinates. A simple way to see it:

> select '(1,3),(3,1)'::cube;
     cube
---------------
 (1, 3),(3, 1)

But the LowerLeft-UpperRight representation should be (1,1),(3,3).

Updated patch attached.

> On 15 Dec 2015, at 21:46, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> [...]
Attachment
Hi,

On 12/16/2015 01:26 PM, Stas Kelvich wrote:
> Hi, thanks for the review.
>
>> 1) (nitpicking) There seem to be some minor whitespace issues, i.e.
>> trailing spaces, empty lines being added/removed, etc.
>
> Fixed, I think.
>
>> 2) One of the regression tests started to fail:
>>
>> SELECT '-1e-700'::cube AS cube;
>>
>> This used to return (0) but now I get (-0).
>
> Actually that problem emerged because of the first problem. I had extra whitespace in the sql file and removed that whitespace from one of the answer files (cube_1.sql), so the diff with both cube.sql and cube_1.sql was one line long and you saw the diff with cube.sql.
> On all systems available to me (osx/linux/freebsd) I saw that the right answers file is cube_1.sql. But on other OSes you can get +/-0 or e27/e027. I edited those answer files manually, so there could be some other typos.

Ah! So that's why I couldn't quickly find the issue in the C code ...

>> 3) I wonder why the docs were changed like this:
>> [...]
>
> I've edited it because the statement was misleading (or at least ambiguous): the cube_in function doesn't swap coordinates.
> A simple way to see it:
>> select '(1,3),(3,1)'::cube;
>      cube
> ---------------
>  (1, 3),(3, 1)
>
> But the LowerLeft-UpperRight representation should be (1,1),(3,3).

I don't think that's what the comment says, actually. It rather refers to code like this:

result = Min(LL_COORD(c, n - 1), UR_COORD(c, n - 1));

i.e. if you specifically ask for a particular corner (ll, in this case), you'll get the proper value.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> I don't think that's what the comment says, actually. It rather refers to code like this:
>
> result = Min(LL_COORD(c, n - 1), UR_COORD(c, n - 1));
>
> i.e. if you specifically ask for a particular corner (ll, in this case), you'll get the proper value.

Hmm, I was confused by the phrase "create a uniform _internal_ representation", since internally cubes are actually stored "as is". But probably I just misinterpreted that. So here is the updated version with the old documentation restored.

> On 16 Dec 2015, at 16:46, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> [...]
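Tomas's reading can be checked from SQL with the long-standing accessor functions, which normalize on read even though the corners are stored as entered (stock cube behavior, independent of this patch):

  SELECT '(1,3),(3,1)'::cube AS stored,
         cube_ll_coord('(1,3),(3,1)'::cube, 2) AS ll_dim2,   -- 1
         cube_ur_coord('(1,3),(3,1)'::cube, 2) AS ur_dim2;   -- 3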