Re: BUG #6399: knngist sometimes returns tuples in incorrect order - Mailing list pgsql-bugs

From yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
Subject Re: BUG #6399: knngist sometimes returns tuples in incorrect order
Date
Msg-id 20120120034009.EA32014A16D@mail.netbsd.org
Whole thread Raw
In response to Re: BUG #6399: knngist sometimes returns tuples in incorrect order  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-bugs
hi,

> On 19.01.2012 06:35, YAMAMOTO Takashi wrote:
>>> On 18.01.2012 14:07, Heikki Linnakangas wrote:
>>>> For 9.2, I think we should change gist so
>>>> that the user-defined distance function can return any scalar data type,
>>>> not just float8 (as long as it has b-tree comparison operators).
>>>
>>> I took a shot at doing that. Patch attached. It needs some cleanup; I
>>> think we'll need to bump the version of the btree_gist extension, and I
>>> only changed the btree_int8 distance function to do that, even though it
>>> would now make a lot of sense to adjust the others, too. Also, I think
>>> it'll leak memory (or crash..) if the distance data type is pass-by-ref.
>>>
>>> If someone has a test suite at hand for performance testing this, that
>>> would be good. Now that I'm calling the b-tree comparison function for
>>> every comparison operation in the red-black tree, instead of a direct
>>> float8<= instruction, this could be quite a bit slower. That could be
>>> mitigated somewhat by using the sort-support stuff Tom committed recently.
>>>
>>> If we bend things a bit, this might be back-patchable to 9.1. We can't
>>> add a new am support function easily, that would require a catalog
>>> update to increment pg_am.amsupport entry, but we could hardwire the
>>> support for int8 distances, like I hardwired the knowledge of float8's
>>> comparsion function in the attached patch.
>>
>> thanks for taking a look quickly.
>>
>> i have some questions:
>>
>> doesn't gbt_int8_distance overflow?
>
> Yes, good point.
>
>> probably result += INT64_MIN so that it fits to int8 keeping the order?
>
> It can still overflow. The maximum distance between two int64s is (2^63
> - 1) - (-2^63) = 2^64 - 1. That's the distance between the smallest
> possible int64 (-2^63) and the greatest one (2^63 - 1)

max distance: 2^64-1
min distance: 0

if you offset them by INT64_MIN,

max: 2^64-1+INT64_MIN = INT64_MAX  (well, for many of C implementations)
min: INT64_MIN

so it should fit to int8.

but anyway, i agree that being compatible with the <-> operator is
more important.

>
> . The int8_dist function, which implements the <-> operator, can also
> overflow, but it checks for it and throws an error:
>
> postgres=# SELECT 9223372036854775807 <-> -9223372036854775808;
> ERROR:  bigint out of range
>
> We could add such an overflow check to gbt_int8_distance(), but I wonder
> if it would have some surprising effects. GiST might calculate distance
> to a value that's not in the final result set, so you'd get an error
> that you wouldn't get without the index. Then again, the alternative
> plan is to fetch the matching rows and sort, and the distance calculated
> for the sort will also overflow.
>
> Another solution for gbt_int8_distance() would be to treat the distance
> as an unsigned value, changing the comparison operation accordingly.

it sounds cleaner.

>
> I see that the distance functions for int4 and int2 can also overflow.
> The gist support function doesn't have the same problem with float8
> representation, but I think it would be good idea to change int4_dist to
> return a bigint, so that it couldn't overflow (and int2_dist to return
> an int4).

do you mean to changing the type of <-> operators so that they can't overflow?
i think it's good if compatibility is not a problem.
although it isn't possible for "biggest" types like numeric, it isn't a
regression as they can't be supported by the current knn gist anyway.

>
>> isn't strategy number necessary to find out the sorting semantics for
>> the operator's sortfamily?  (your patch doesn't change it, but...)
>
> Sorry, I didn't understand that. Can you elaborate?

the following is my understanding, which can be entirely wrong as i'm just
learning the code.

- an index should yield an order compatible with the result of
  amopopr sorted with amopsortfamily.
- the result of the distance support function is normally an equivalent of
  the result of amopopr.  in that case, its result should be sorted in
  a semantically compatible way with amopsortfamily.
- an index can have more than one ordering ops.  their amopsortfamily need
  not to be the same.  in that case, support functions need to check the
  strategy number to distinguish them.

so i thought your distance_cmp support function might want to take a strategy
number as a distance support function does.

YAMAMOTO Takashi

>
> --
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com

pgsql-bugs by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Documentation bug: reference to checkpoint activity in bgwriter
Next
From: mark.phillips@mophilly.com
Date:
Subject: BUG #6404: postgres account not created during unattended install