high-dimensional knn-GIST tests (was Re: Cube extension kNN support) - Mailing list pgsql-hackers

From Gordon Mohr
Subject high-dimensional knn-GIST tests (was Re: Cube extension kNN support)
Date
Msg-id 52687CEA.9060903@xavvy.com
In response to Cube extension kNN support  (Stas Kelvich <stas.kelvich@gmail.com>)
Responses Re: high-dimensional knn-GIST tests (was Re: Cube extension kNN support)  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: high-dimensional knn-GIST tests (was Re: Cube extension kNN support)  (Marcin Mańk <marcin.mank@gmail.com>)
List pgsql-hackers
On 9/22/13 4:38 PM, Stas Kelvich wrote:
> Hello, hackers.
>
> Here is the patch that introduces kNN search for cubes with
> euclidean, taxicab and chebyshev distances.

Thanks for this! I decided to give the patch a try at the bleeding edge 
with some high-dimensional vectors, specifically the 1.4 million 
1000-dimensional Freebase entity vectors from the Google 'word2vec' project:

https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming

Unfortunately, here's what I found:

* with 1000-dimensional vectors, the index size on disk grows to many 
times (1000x or more) the size of the indexed data, making other tests 
of the index's helpfulness impractical. (Perhaps other parameter tuning 
could help?)

* with 500-dimensional or 100-dimensional vectors, the index size is 
more manageable -- 2x to 6x the data size -- but using the index 
significantly hurts performance on top-10 nearest-neighbor searches, 
making it much slower than a full table scan. (The planner still tries 
to use the index, even though it's hurting.)

Much more detail for the curious is below...

Regarding the dataset & hoped-for results:

The word2vec project's example scripts do their synonym/analogy 
demonstrations by loading the entire 5GB+ dataset into main memory 
(~3min), then doing a full scan of all vectors (~40sec) to find those 
nearest a target vector.

My motivating question was: could the data be loaded into PostgreSQL 
using the cube type and kNN-GiST indexed using this patch, so that 
speedy index-assisted ranked-nearest-neighbor queries could be run 
from the database?
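Concretely, the hoped-for query shape was the kNN form that GiST 
distance support is meant to accelerate, roughly (a sketch; the table 
itself is defined in the transcript below):

SELECT word
FROM word2vec
ORDER BY vec <-> (SELECT vec FROM word2vec WHERE word = 'geoffrey hinton')
LIMIT 10;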

(While the patch's distance_euclid is not the same cosine-distance the 
word2vec examples use, the freebase entity vectors are all unit vectors, 
and even sums of vectors can be rescaled to unit length. My intuition 
is that euclidean distances on the unit sphere will be in the same rank 
order as cosine distances, so the cube distance_euclid/<-> should enable 
the same sort of synonym/analogy demos.)
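(The reasoning behind that intuition: for unit vectors a and b,

   ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a . b) = 2 - 2*cos(a, b)

so euclidean distance on the unit sphere is a monotone decreasing 
function of cosine similarity, and a smallest-euclidean-distance 
ranking matches a largest-cosine-similarity ranking.)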

Regarding the 1000-dimensional case:

It was necessary to change three compiled-in implementation limits, in 
the four steps at which the need for each change became evident:

(1) in contrib/cube/cubedata.h, increase CUBE_MAX_DIM (to accept vectors 
with more than 100 dimensions)

(2) in src/include/pg_config.h, increase BLCKSZ to 16384 (otherwise 1000 
64-bit floats in a single column gave an <ERROR: row is too big>, 
mentioning the 8160 limit - large cube values not TOASTable?)

(3) in src/include/access/itup.h, increase INDEX_SIZE_MASK to 0x3fff 
(otherwise encountering an <ERROR:  index row requires 16016 bytes, 
maximum size is 8191> when attempting to create the index)

(4) in src/include/pg_config.h, again increase BLCKSZ, now to 32768 
(otherwise encountering an <ERROR:  index row size 16416 exceeds maximum 
5440 for index "pg_class_relname_nsp_index"> when attempting to create 
the index)

With the cube-kNN patch applied and these other changes, I was able to 
import the 1.4M freebase vectors and do a full-scan nearest-neighbors 
query. (My starting postgresql codebase was the github mirror of 9.4dev 
as of about a week ago.)

The psql transcript:

word2vec=# \timing
Timing is on.
word2vec=# CREATE EXTENSION cube;
CREATE EXTENSION
Time: 42.141 ms
word2vec=# CREATE TABLE word2vec ( word varchar(200), vec cube );
CREATE TABLE
Time: 8.532 ms
word2vec=# COPY word2vec FROM PROGRAM 'zcat 
/tmp/pgvectors/freebase-vectors-skipgram1000.pgtxt.gz';
COPY 1422903
Time: 12399065.498 ms
word2vec=# SELECT word, dist FROM (SELECT word,
distance_euclid(vec,(SELECT vec FROM word2vec WHERE word='geoffrey hinton'))
AS dist FROM word2vec) AS subquery ORDER BY dist LIMIT 11;
          word           |       dist
-------------------------+------------------
 geoffrey hinton         |                0
 marvin minsky           | 1.03892498287268
 paul corkum             | 1.05221701690288
 william richard peltier | 1.06244397334495
 brenda milner           | 1.06387762685894
 john charles polanyi    | 1.07444446452295
 leslie valiant          | 1.07735786596934
 hava siegelmann         | 1.08148623006629
 hans moravec            |  1.0862034591185
 david rumelhart         | 1.08748431130477
 godel prize             | 1.08774264379264
(11 rows)

Time: 310753.976 ms

That's 3.5 hours to do the import and 5 minutes to do the query; this is 
on a 2009 MacBook Pro with 8GB RAM and SSD.

Confirming the intuition above, these 10-nearest are the same entities 
in the same order as on the word2vec project page's example output, 
though the euclidean distances are of course different than the cosine 
distances.

The DATA directory is 23GB after the import of 1.4 million rows. In the 
word2vec uncompressed binary format, this dataset is about 5.4GB, so 
this word2vec-table cube-column representation involves about 4X expansion.
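(That roughly 4X factor is consistent with each 1000-d value costing 
about 16kB on disk -- as the 16016-byte index-row error above also 
suggests -- versus word2vec's ~4000 bytes of packed float4 per vector; 
presumably cube stores float8 coordinates and, being a box type, keeps 
two corners even for a point-like value. A quick way to confirm the 
per-value footprint would be to check the stored size of one cube value 
directly:

SELECT word, pg_column_size(vec) FROM word2vec LIMIT 1;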

So, to the main question: can that query be sped up by building a 
kNN-GIST index? Here the problems start.

word2vec=# CREATE INDEX word2vec_index ON word2vec USING gist (vec);

This attempt ran for hours, consuming another 80GB+ of disk before 
failing when the disk filled up.

In fact, the largest table with 1000-dimensional vectors for which I was 
able to build a gist index was a mere 100 rows. That index-build took 
about 14 minutes and grew the DATA directory like so:
 73MB pgdata # empty word2vec table
 75MB pgdata # after COPY FROM of 100 vectors
8.4GB pgdata # after CREATE INDEX

Of course with just 100 rows the index isn't practically needed or 
helpful for query speed-up.

Even trying just 500 rows, the CREATE INDEX command ran for hours before 
failing by consuming all available disk space (about 90GB more). It 
seems the knn-GIST index overhead for 1000-dimensional cubes grows 
faster than linearly in the number of rows.

I was able to complete index builds with fewer dimensions.

100-dimensions, most-frequent 850K entities:

The dataset was trimmed to the first (most-frequent) 850K entities and 
each vector truncated to its first 100 dimensions. (Even though this 
would fit within the usual CUBE_MAX_DIM/BLCKSZ/INDEX_SIZE_MASK limits, I 
still used the customized build with the limits expanded for the 1000-d 
case.)

Observed data sizes and operation times were:
 73MB pgdata # empty word2vec table
2.6GB pgdata # after COPY FROM of 850K vectors, taking 216s

before indexing:
nearest-11 by distance_euclid(): ~2s
nearest-11 by <-> operator: ~2s

5.0GB pgdata # after CREATE INDEX, taking 1344s

after indexing:
nearest-11 by distance_euclid(): ~2s
nearest-11 by <-> operator: ~57s  # "Index Scan using word2vec_index…"

So the availability of the index causes a significant slowdown... and 
the planner does not choose the faster full sequential scan on its own.
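As an aside: to get the fast plan back without dropping the index, the 
planner can be steered away from the index scan for a session with a 
stock PostgreSQL setting (nothing from the patch); a sketch, using the 
earlier example target:

SET enable_indexscan = off;   -- fall back to the sequential-scan-plus-sort plan
SELECT word
FROM word2vec
ORDER BY vec <-> (SELECT vec FROM word2vec WHERE word = 'geoffrey hinton')
LIMIT 11;
RESET enable_indexscan;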

500-dimensions, most-frequent 100K entities:

The dataset was trimmed to the first (most-frequent) 100K entities and 
each vector truncated to its first 500 dimensions. (Again, still using 
the custom build with the upped CUBE_MAX_DIM/BLCKSZ/INDEX_SIZE_MASK.)

Observed data sizes and operation times were:
 73MB pgdata # empty word2vec table
1.6GB pgdata # after COPY FROM of 100K vectors, taking 266s

before indexing:
nearest-11 by distance_euclid(): ~2s
nearest-11 by <-> operator: ~2s

4.8GB pgdata # after CREATE INDEX, taking 977s

after indexing:
nearest-11 by distance_euclid(): ~2s
nearest-11 by <-> operator: ~46s  # "Index Scan using word2vec_index…"

Dropping the index makes the <-> query fast again.

Open questions and tentative conclusions:

The massive knn-GIST index overhead for 1000-dimensional vectors makes 
it hard to evaluate whether a fully-built index could be useful on large 
datasets.

Perhaps some aspects of the knn-GIST support implicitly assume 
low-dimensional (2-4) data, and large numbers of dimensions cause 
pathological index sizes?

Or perhaps something specific about this dataset (all vectors on the 
unit sphere) makes it a challenging case?

In the truncated 100d or 500d cases, the index can be built, but slows 
rather than speeds nearest-neighbor queries that use the index.

The evaluated patch's definitions of distance_euclid and the other 
support functions seem straightforward, and give proper results on 
simple test cases... so the problem, if any, would appear to be in the 
general distance-driven knn-GIST indexing and query-planning.

If there is a kind of dimensional data, represented with the cube type, 
for which this indexing helps, it may only be data with far fewer than 
100 dimensions, and/or far more rows than the 100K/850K tried here.

If a knn-GIST implementor/expert has suggestions for tuning the index 
overhead and behavior -- perhaps a different penalty or picksplit 
function? -- I'd be happy to try those out and report back.
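For reference, the GiST support functions currently wired up for the 
cube opclass (penalty is support function 5, picksplit is 6, and the 
distance function this patch adds is 8) can be listed from the 
catalogs -- a sketch, assuming the opclass keeps its usual name 
gist_cube_ops:

SELECT amprocnum, amproc
FROM pg_amproc
WHERE amprocfamily = (SELECT opcfamily FROM pg_opclass
                      WHERE opcname = 'gist_cube_ops')
ORDER BY amprocnum;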

Otherwise, I hope this write-up will help evaluate the patch or save 
time for others tempted to try similar knn-GIST indexing of 
higher-dimensional data.

- Gordon Mohr




