Thread: BUG #6399: knngist sometimes returns tuples in incorrect order

BUG #6399: knngist sometimes returns tuples in incorrect order

From
yamt@mwd.biglobe.ne.jp
Date:
The following bug has been logged on the website:

Bug reference:      6399
Logged by:          YAMAMOTO Takashi
Email address:      yamt@mwd.biglobe.ne.jp
PostgreSQL version: Unsupported/Unknown
Operating system:   NetBSD
Description:=20=20=20=20=20=20=20=20

9.2devel
(01d83ffdcae92f75dbfd41de0b4213d241edd394)

knngist seems to assume that any distances can be represented with float8.
at least distances between int8 values can not.

the following example uses btree_gist extension.
results should be the same regardless of the existance of the index.

create temp table t (a int8);
insert into t values (9223372036854775806),(9223372036854775807);
select *,0<->a as dist from t order by dist;
create index on t using gist (a);
set enable_seqscan=3Doff;
select *,0<->a as dist from t order by dist;


CREATE TABLE
INSERT 0 2
          a          |        dist=20=20=20=20=20=20=20=20=20
---------------------+---------------------
 9223372036854775806 | 9223372036854775806
 9223372036854775807 | 9223372036854775807
(2 rows)

CREATE INDEX
SET
          a          |        dist=20=20=20=20=20=20=20=20=20
---------------------+---------------------
 9223372036854775807 | 9223372036854775807
 9223372036854775806 | 9223372036854775806
(2 rows)

Re: BUG #6399: knngist sometimes returns tuples in incorrect order

From
Heikki Linnakangas
Date:
On 16.01.2012 11:32, yamt@mwd.biglobe.ne.jp wrote:
> The following bug has been logged on the website:
>
> Bug reference:      6399
> Logged by:          YAMAMOTO Takashi
> Email address:      yamt@mwd.biglobe.ne.jp
> PostgreSQL version: Unsupported/Unknown
> Operating system:   NetBSD
> Description:
>
> 9.2devel
> (01d83ffdcae92f75dbfd41de0b4213d241edd394)
>
> knngist seems to assume that any distances can be represented with float8.
> at least distances between int8 values can not.

Yeah. That seems like a bad assumption. It might theoretically be
possible to somehow map all int8s to float8s, but e.g numerics will not be.

> the following example uses btree_gist extension.
> results should be the same regardless of the existance of the index.
>
> create temp table t (a int8);
> insert into t values (9223372036854775806),(9223372036854775807);
> select *,0<->a as dist from t order by dist;
> create index on t using gist (a);
> set enable_seqscan=off;
> select *,0<->a as dist from t order by dist;
>
>
> CREATE TABLE
> INSERT 0 2
>            a          |        dist
> ---------------------+---------------------
>   9223372036854775806 | 9223372036854775806
>   9223372036854775807 | 9223372036854775807
> (2 rows)
>
> CREATE INDEX
> SET
>            a          |        dist
> ---------------------+---------------------
>   9223372036854775807 | 9223372036854775807
>   9223372036854775806 | 9223372036854775806
> (2 rows)

Yep, that's wrong. Both rows have the same float8 distance value, and
get outputted in wrong order. For 9.2, I think we should change gist so
that the user-defined distance function can return any scalar data type,
not just float8 (as long as it has b-tree comparison operators).

For 9.1, I'm afraid it's too late to do that. Or is it? The other option
is to hack gist to re-check tuples with equal distance values.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: BUG #6399: knngist sometimes returns tuples in incorrect order

From
Heikki Linnakangas
Date:
On 18.01.2012 14:07, Heikki Linnakangas wrote:
> For 9.2, I think we should change gist so
> that the user-defined distance function can return any scalar data type,
> not just float8 (as long as it has b-tree comparison operators).

I took a shot at doing that. Patch attached. It needs some cleanup; I
think we'll need to bump the version of the btree_gist extension, and I
only changed the btree_int8 distance function to do that, even though it
would now make a lot of sense to adjust the others, too. Also, I think
it'll leak memory (or crash..) if the distance data type is pass-by-ref.

If someone has a test suite at hand for performance testing this, that
would be good. Now that I'm calling the b-tree comparison function for
every comparison operation in the red-black tree, instead of a direct
float8 <= instruction, this could be quite a bit slower. That could be
mitigated somewhat by using the sort-support stuff Tom committed recently.

If we bend things a bit, this might be back-patchable to 9.1. We can't
add a new am support function easily, that would require a catalog
update to increment pg_am.amsupport entry, but we could hardwire the
support for int8 distances, like I hardwired the knowledge of float8's
comparsion function in the attached patch.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: BUG #6399: knngist sometimes returns tuples in incorrect order

From
Heikki Linnakangas
Date:
On 19.01.2012 06:35, YAMAMOTO Takashi wrote:
>> On 18.01.2012 14:07, Heikki Linnakangas wrote:
>>> For 9.2, I think we should change gist so
>>> that the user-defined distance function can return any scalar data type,
>>> not just float8 (as long as it has b-tree comparison operators).
>>
>> I took a shot at doing that. Patch attached. It needs some cleanup; I
>> think we'll need to bump the version of the btree_gist extension, and I
>> only changed the btree_int8 distance function to do that, even though it
>> would now make a lot of sense to adjust the others, too. Also, I think
>> it'll leak memory (or crash..) if the distance data type is pass-by-ref.
>>
>> If someone has a test suite at hand for performance testing this, that
>> would be good. Now that I'm calling the b-tree comparison function for
>> every comparison operation in the red-black tree, instead of a direct
>> float8<= instruction, this could be quite a bit slower. That could be
>> mitigated somewhat by using the sort-support stuff Tom committed recently.
>>
>> If we bend things a bit, this might be back-patchable to 9.1. We can't
>> add a new am support function easily, that would require a catalog
>> update to increment pg_am.amsupport entry, but we could hardwire the
>> support for int8 distances, like I hardwired the knowledge of float8's
>> comparsion function in the attached patch.
>
> thanks for taking a look quickly.
>
> i have some questions:
>
> doesn't gbt_int8_distance overflow?

Yes, good point.

> probably result += INT64_MIN so that it fits to int8 keeping the order?

It can still overflow. The maximum distance between two int64s is (2^63
- 1) - (-2^63) = 2^64 - 1. That's the distance between the smallest
possible int64 (-2^63) and the greatest one (2^63 - 1)

. The int8_dist function, which implements the <-> operator, can also
overflow, but it checks for it and throws an error:

postgres=# SELECT 9223372036854775807 <-> -9223372036854775808;
ERROR:  bigint out of range

We could add such an overflow check to gbt_int8_distance(), but I wonder
if it would have some surprising effects. GiST might calculate distance
to a value that's not in the final result set, so you'd get an error
that you wouldn't get without the index. Then again, the alternative
plan is to fetch the matching rows and sort, and the distance calculated
for the sort will also overflow.

Another solution for gbt_int8_distance() would be to treat the distance
as an unsigned value, changing the comparison operation accordingly.

I see that the distance functions for int4 and int2 can also overflow.
The gist support function doesn't have the same problem with float8
representation, but I think it would be good idea to change int4_dist to
return a bigint, so that it couldn't overflow (and int2_dist to return
an int4).

> isn't strategy number necessary to find out the sorting semantics for
> the operator's sortfamily?  (your patch doesn't change it, but...)

Sorry, I didn't understand that. Can you elaborate?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: BUG #6399: knngist sometimes returns tuples in incorrect order

From
yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
Date:
hi,

> On 18.01.2012 14:07, Heikki Linnakangas wrote:
>> For 9.2, I think we should change gist so
>> that the user-defined distance function can return any scalar data type,
>> not just float8 (as long as it has b-tree comparison operators).
>
> I took a shot at doing that. Patch attached. It needs some cleanup; I
> think we'll need to bump the version of the btree_gist extension, and I
> only changed the btree_int8 distance function to do that, even though it
> would now make a lot of sense to adjust the others, too. Also, I think
> it'll leak memory (or crash..) if the distance data type is pass-by-ref.
>
> If someone has a test suite at hand for performance testing this, that
> would be good. Now that I'm calling the b-tree comparison function for
> every comparison operation in the red-black tree, instead of a direct
> float8 <= instruction, this could be quite a bit slower. That could be
> mitigated somewhat by using the sort-support stuff Tom committed recently.
>
> If we bend things a bit, this might be back-patchable to 9.1. We can't
> add a new am support function easily, that would require a catalog
> update to increment pg_am.amsupport entry, but we could hardwire the
> support for int8 distances, like I hardwired the knowledge of float8's
> comparsion function in the attached patch.

thanks for taking a look quickly.

i have some questions:

doesn't gbt_int8_distance overflow?
probably result += INT64_MIN so that it fits to int8 keeping the order?

isn't strategy number necessary to find out the sorting semantics for
the operator's sortfamily?  (your patch doesn't change it, but...)

YAMAMOTO Takashi

>
> --
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com

Re: BUG #6399: knngist sometimes returns tuples in incorrect order

From
yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
Date:
hi,

> On 19.01.2012 06:35, YAMAMOTO Takashi wrote:
>>> On 18.01.2012 14:07, Heikki Linnakangas wrote:
>>>> For 9.2, I think we should change gist so
>>>> that the user-defined distance function can return any scalar data type,
>>>> not just float8 (as long as it has b-tree comparison operators).
>>>
>>> I took a shot at doing that. Patch attached. It needs some cleanup; I
>>> think we'll need to bump the version of the btree_gist extension, and I
>>> only changed the btree_int8 distance function to do that, even though it
>>> would now make a lot of sense to adjust the others, too. Also, I think
>>> it'll leak memory (or crash..) if the distance data type is pass-by-ref.
>>>
>>> If someone has a test suite at hand for performance testing this, that
>>> would be good. Now that I'm calling the b-tree comparison function for
>>> every comparison operation in the red-black tree, instead of a direct
>>> float8<= instruction, this could be quite a bit slower. That could be
>>> mitigated somewhat by using the sort-support stuff Tom committed recently.
>>>
>>> If we bend things a bit, this might be back-patchable to 9.1. We can't
>>> add a new am support function easily, that would require a catalog
>>> update to increment pg_am.amsupport entry, but we could hardwire the
>>> support for int8 distances, like I hardwired the knowledge of float8's
>>> comparsion function in the attached patch.
>>
>> thanks for taking a look quickly.
>>
>> i have some questions:
>>
>> doesn't gbt_int8_distance overflow?
>
> Yes, good point.
>
>> probably result += INT64_MIN so that it fits to int8 keeping the order?
>
> It can still overflow. The maximum distance between two int64s is (2^63
> - 1) - (-2^63) = 2^64 - 1. That's the distance between the smallest
> possible int64 (-2^63) and the greatest one (2^63 - 1)

max distance: 2^64-1
min distance: 0

if you offset them by INT64_MIN,

max: 2^64-1+INT64_MIN = INT64_MAX  (well, for many of C implementations)
min: INT64_MIN

so it should fit to int8.

but anyway, i agree that being compatible with the <-> operator is
more important.

>
> . The int8_dist function, which implements the <-> operator, can also
> overflow, but it checks for it and throws an error:
>
> postgres=# SELECT 9223372036854775807 <-> -9223372036854775808;
> ERROR:  bigint out of range
>
> We could add such an overflow check to gbt_int8_distance(), but I wonder
> if it would have some surprising effects. GiST might calculate distance
> to a value that's not in the final result set, so you'd get an error
> that you wouldn't get without the index. Then again, the alternative
> plan is to fetch the matching rows and sort, and the distance calculated
> for the sort will also overflow.
>
> Another solution for gbt_int8_distance() would be to treat the distance
> as an unsigned value, changing the comparison operation accordingly.

it sounds cleaner.

>
> I see that the distance functions for int4 and int2 can also overflow.
> The gist support function doesn't have the same problem with float8
> representation, but I think it would be good idea to change int4_dist to
> return a bigint, so that it couldn't overflow (and int2_dist to return
> an int4).

do you mean to changing the type of <-> operators so that they can't overflow?
i think it's good if compatibility is not a problem.
although it isn't possible for "biggest" types like numeric, it isn't a
regression as they can't be supported by the current knn gist anyway.

>
>> isn't strategy number necessary to find out the sorting semantics for
>> the operator's sortfamily?  (your patch doesn't change it, but...)
>
> Sorry, I didn't understand that. Can you elaborate?

the following is my understanding, which can be entirely wrong as i'm just
learning the code.

- an index should yield an order compatible with the result of
  amopopr sorted with amopsortfamily.
- the result of the distance support function is normally an equivalent of
  the result of amopopr.  in that case, its result should be sorted in
  a semantically compatible way with amopsortfamily.
- an index can have more than one ordering ops.  their amopsortfamily need
  not to be the same.  in that case, support functions need to check the
  strategy number to distinguish them.

so i thought your distance_cmp support function might want to take a strategy
number as a distance support function does.

YAMAMOTO Takashi

>
> --
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com