Thread: GiST limits on contrib/cube with dimension > 100?
I've been using cube extension recompiled with
#define MAX_DIM 256.
But with version 11.3 I'm getting the following error:
failed to add item to index page in <index_name>
There's a regression unit test in contrib/cube/expected/cube.out:
CREATE TABLE test_cube (c cube);
\copy test_cube from 'data/test_cube.data'
CREATE INDEX test_cube_ix ON test_cube USING gist (c);
SELECT * FROM test_cube WHERE c && '(3000,1000),(0,0)' ORDER BY c;
I've created the GiST index in the same way, i.e. create index <index_name> on <table_name> using gist(<column_name>);
With MAX_DIM set to 512, a btree index complains:
index row size 4112 exceeds maximum 2712 for index <index_name>
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
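The 4112 in that message is consistent with a 512-dimensional point cube plus per-tuple overhead. A back-of-envelope check (a sketch; the 8- and 4-byte header sizes are assumptions based on the NDBOX layout and typical index tuple overhead, not exact on-disk accounting):

```python
# Rough index-row-size estimate for a cube of the given dimension.
def index_row_size(dim, point=True):
    index_tuple_header = 8                  # assumed per-entry index overhead
    varlena_header = 4                      # NDBOX vl_len_
    ndbox_header = 4                        # dimension count + point flag
    coords = dim * 8 * (1 if point else 2)  # 8-byte doubles
    return index_tuple_header + varlena_header + ndbox_header + coords

BTREE_MAX = 2712  # roughly 1/3 of a buffer page, as the HINT says

print(index_row_size(512))               # 4112 -- the size in the error
print(index_row_size(256) <= BTREE_MAX)  # True -- why MAX_DIM 256 still worked
```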
That's why MAX_DIM has been set to 256.
But GiST doesn't give any explanation with its error.
These are the places where the message might have been generated:
src/backend/access/gist/gist.c:418: elog(ERROR, "failed to add item to index page in \"%s\"", RelationGetRelationName(rel));
src/backend/access/gist/gist.c:540: elog(ERROR, "failed to add item to index page in \"%s\"",
The question is: what prevents setting MAX_DIM higher than 100 in a custom recompiled version of the cube extension?
In practice the error messages are too cryptic.
contrib/cube/cube.c declares the following GiST support functions:
/*
** GiST support methods
*/
PG_FUNCTION_INFO_V1(g_cube_consistent);
PG_FUNCTION_INFO_V1(g_cube_compress);
PG_FUNCTION_INFO_V1(g_cube_decompress);
PG_FUNCTION_INFO_V1(g_cube_penalty);
PG_FUNCTION_INFO_V1(g_cube_picksplit);
PG_FUNCTION_INFO_V1(g_cube_union);
PG_FUNCTION_INFO_V1(g_cube_same);
PG_FUNCTION_INFO_V1(g_cube_distance);
g_cube_compress has the following body:
PG_RETURN_DATUM(PG_GETARG_DATUM(0));
Does it just return the input datum unchanged, i.e. a pointer to the underlying x array?
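That body is indeed the identity: the compress support function hands back the same varlena datum, so the whole NDBOX (header plus x array) goes into the index entry uncompressed. A plain-Python sketch of that behaviour (illustrative only, these are not the extension's real signatures):

```python
# Identity "compression": the index stores the cube value as-is, which is
# why large MAX_DIM values run straight into page-size limits.
def g_cube_compress(datum):
    return datum

def g_cube_decompress(datum):
    return datum

box = [0.0, 1.0, 2.0, 3.0]           # stand-in for an NDBOX
print(g_cube_compress(box) is box)   # True: same object back, nothing copied
```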
cube data structure:
typedef struct NDBOX
{
/* varlena header (do not touch directly!) */
int32 vl_len_;
/*----------
* Header contains info about NDBOX. For binary compatibility with old
* versions, it is defined as "unsigned int".
*
* Following information is stored:
*
* bits 0-7 : number of cube dimensions;
* bits 8-30 : unused, initialize to zero;
* bit 31 : point flag. If set, the upper right coordinates are not
* stored, and are implicitly the same as the lower left
* coordinates.
*----------
*/
unsigned int header;
/*
* The lower left coordinates for each dimension come first, followed by
* upper right coordinates unless the point flag is set.
*/
double x[FLEXIBLE_ARRAY_MEMBER];
} NDBOX;
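Following the header comment, the fields can be unpacked like this (a Python sketch; the bit positions are taken from the comment above, not from cube.h's actual macros):

```python
POINT_BIT = 0x80000000   # bit 31: point flag
DIM_MASK = 0xff          # bits 0-7: number of dimensions

def ndbox_dim(header):
    return header & DIM_MASK

def ndbox_is_point(header):
    return (header & POINT_BIT) != 0

h = POINT_BIT | 128          # header of a 128-dimensional point cube
print(ndbox_dim(h))          # 128
print(ndbox_is_point(h))     # True
```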
Could the problem be some size limit being exceeded when building or updating a GiST index on cubes with MAX_DIM > 100?
> On 9 Jun 2019, at 20:05, Siarhei Siniak <siarheisiniak@yahoo.com> wrote:
>
> I've been using cube extension recompiled with
> #define MAX_DIM 256.
>
> But with a version 11.3 I'm getting the following error:
> failed to add item to index page in <index_name>

This sounds like a variant of the issue reported on -bugs in
AM6PR06MB57318C9882C021879DD4101EA3100@AM6PR06MB5731.eurprd06.prod.outlook.com
and is also reproducible on HEAD.

cheers ./daniel
Can you point out a failing unit test in the codebase?
P.S. Sorry for the late reply, this message landed in my spam folder.
On Monday, June 10, 2019 at 14:57:32 UTC+3, Daniel Gustafsson <daniel@yesql.se> wrote:
> On 9 Jun 2019, at 20:05, Siarhei Siniak <siarheisiniak@yahoo.com> wrote:
>
> I've been using cube extension recompiled with
> #define MAX_DIM 256.
>
> But with a version 11.3 I'm getting the following error:
> failed to add item to index page in <index_name>
This sounds like a variant of the issue reported on -bugs in
AM6PR06MB57318C9882C021879DD4101EA3100@AM6PR06MB5731.eurprd06.prod.outlook.com
and is also reproducible on HEAD.
cheers ./daniel
I've added debug prints to the cube extension.
g_custom_cube_a_f8 and g_custom_cube_picksplit
are the only methods called.
After that it prints:
ERROR: failed to add item to index page in "lc_idx"
import psycopg2
import logging
import numpy
import unittest

import python.utils.logging
import python.custom_db.backends
import python.custom_db.backends.postgresql


class TestPostgresql(unittest.TestCase):
    def test_gist(self):
        b = python.custom_db.backends.postgresql.Postgresql(
            databases=dict(
                test=dict(
                    minconn=1,
                    maxconn=1
                )
            )
        )
        b.connect()
        try:
            c = b.get_connection(use='test')
            c2 = c[0]
            with c2.cursor() as cur:
                cur.execute(r'''
                    drop table if exists test;
                    create table test(image_id integer primary key, latent_code custom_cube);
                    create index lc_idx on test using gist(latent_code);
                ''')
                c2.commit()
                with self.assertRaises(psycopg2.errors.InternalError_):
                    for k in range(10):
                        logging.info('test_postgresql.test_gist, k = %d' % k)
                        cur.execute(
                            r'''
                                insert into test (image_id, latent_code)
                                values (%s, custom_cube(%s))
                            ''',
                            [
                                k,
                                [float(x) for x in numpy.random.uniform(0, 1, 512)],
                            ]
                        )
                        c2.commit()
        finally:
            b.put_connection(c2, 'test')
I've added debug prints to the cube extension; g_custom_cube_a_f8 and g_custom_cube_picksplit are the only methods called, after which it prints:
ERROR: failed to add item to index page in "lc_idx"
Cube extension modifications:
#define MAX_DIM (512)
The Python test source code is attached to this letter.
P.S. Sorry for the previous letter, I hadn't configured plain text composition.
Hi!

> On 9 June 2019, at 23:05, Siarhei Siniak <siarheisiniak@yahoo.com> wrote:
>
> I've been using cube extension recompiled with
> #define MAX_DIM 256.
>
> But with a version 11.3 I'm getting the following error:
> failed to add item to index page in <index_name>

So, you have changed the source code (removing the dim constraint) and get a cryptic error after that. In some way this is expected...

Though, the reason why cube has this limit is not physical. R-trees (cube + GiST) just do not work effectively with more than 10 non-correlated dimensions. If you have some correlated dimensions, you should probably invent something more clever than just cube - a plain array of D*2 doubles for each tuple. 100 is an upper bound for sane data that can be indexed in the case of cube...

Nevertheless, we can improve the AddTuple messages. But there are no such strict guidelines as with B-tree. Probably, tuples should not be bigger than half of the usable page space.

Best regards, Andrey Borodin.
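The half-page rule of thumb already rules out MAX_DIM = 512 for GiST, as a quick sketch of the arithmetic shows (assuming 8-byte doubles and the default 8 kB page):

```python
PAGE_SIZE = 8192

def cube_datum_size(dim, point=True):
    # 4-byte varlena header + 4-byte NDBOX header + coordinates
    return 4 + 4 + dim * 8 * (1 if point else 2)

print(cube_datum_size(512))                   # 4104: a 512-d point cube
print(cube_datum_size(512) > PAGE_SIZE // 2)  # True: more than half a page
print(cube_datum_size(512, point=False))      # 8200: a full box won't fit at all
```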
A uniform set of 50 ** 3 points of dimension 128, stored as type cube, can be queried for a nearest neighbor at about 10 queries per second with LIMIT varying from 1 to 25.
That works better than no index at all, so GiST does give a speedup here.
The PostgreSQL documentation doesn't state the complexity of such an index, so I was confused about its speed.
Does PostgreSQL allow for approximate nearest neighbor search?
https://github.com/erikbern/ann-benchmarks lists a lot of efficient implementations.
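For reference, the KNN ordering measured above is driven by cube's <-> (Euclidean distance) operator. In plain Python the quantity it orders by looks like this (a sketch; the SQL in the comment is hypothetical, matching the earlier test table):

```python
import math

def cube_point_distance(a, b):
    # Euclidean distance -- what cube's <-> operator computes for points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# A KNN query over the earlier test table might look like:
#   SELECT image_id FROM test
#   ORDER BY latent_code <-> cube(%s)
#   LIMIT 25;

print(cube_point_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```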
> On 12 June 2019, at 15:11, Siarhei Siniak <siarheisiniak@yahoo.com> wrote:
>
> A uniform set of points with a dimension 128 and type cube. That has a size of 50 ** 3. Can be queried for a nearest neighbor at a speed of 10 queries per second with limit varying from 1 to 25.
> It works better than when no index used at all. So gist gives here a speed up.

Then, I think, the data is correlated. I believe it is possible to implement something better for high-dimensional KNN in GiST than cube.

> The documentation of postgresql doesn't mention complexity of such an index. I've got confused as to its speed.
>
> Does postgresql allow for an approximate nearest neighbor search?
> https://github.com/erikbern/ann-benchmarks has a lot of efficient implementations.

ANN is beyond the concepts of the SQL standard: a database with an index must return the same results as without the index. I can add ANN to github.com/x4m/ags, which is a fork of GiST. But PostgreSQL adds a lot of OLTP overhead:
1. it is crash-safe
2. it supports concurrent operations
2a. in a very unexpected way; for example, in serializable mode it guarantees that if you were looking for a nearest neighbor, no new closer neighbor will appear until the end of your transaction
3. it allows extensibility and has abstraction layers
4. it has a declarative query language
etc., etc.

All this comes at the cost of a database that can do anything and be anything. It is very hard to compete with specialized indexes that only do ANN. Yet, as far as I know, no one has really pursued the idea of fast high-dimensional ANN in PG.

Best regards, Andrey Borodin.
>ANN is beyond concepts of SQL standard: database with index must return same results as without index.
>I can add ANN to github.com/x4m/ags which is a fork of GiST.
How can I recompile that extension without getting a name conflict with the standard one?
I had to rename everything when I forked the cube extension.
But that's impractical.
Andrey Borodin <x4mmm@yandex-team.ru> writes:
>> On 9 June 2019, at 23:05, Siarhei Siniak <siarheisiniak@yahoo.com> wrote:
>> I've been using cube extension recompiled with
>> #define MAX_DIM 256.
>> But with a version 11.3 I'm getting the following error:
>> failed to add item to index page in <index_name>

> So, you have changed source code (removing dim constraint) and get cryptic error after that. In some way this is expected...

Yeah. There is necessarily a limit on the size of index entries, and it's getting exceeded.

> Nevertheless, we can improve AddTuple messages. But there is not such strict guidelines as with B-tree. Probably, tuples should not be bigger than half of usable page space.

I do not think "improve AddTuple messages" is going to be enough to fix this. Daniel was correct that this is the same problem seen in
https://www.postgresql.org/message-id/flat/AM6PR06MB57318C9882C021879DD4101EA3100%40AM6PR06MB5731.eurprd06.prod.outlook.com
See my reply there. I think we should continue this discussion on that thread, since it has seniority.

regards, tom lane