Thread: Review: GiST support for UUIDs

Review: GiST support for UUIDs

From
Teodor Sigaev
Date:
http://www.postgresql.org/message-id/flat/CA+renyVepHxTO1c7dFbVjP1GYMUc0-3qDNWPN30-noo5MPyaVQ@mail.gmail.com#CA+renyVepHxTO1c7dFbVjP1GYMUc0-3qDNWPN30-noo5MPyaVQ@mail.gmail.com

Patch looks perfect but it's still needed some work.

0) rebase to current HEAD (done, in attach)
1) UUIDSIZE ->  UUID_LEN (it's defined in utils/uuid.h, done)
2)
    static double
    uuid2num(const pg_uuid_t *i)
    {
        return *((uint64 *)i);
    }
    It isn't looked as correct transformation for me. May be, it's better
    to transform to numeric type (UUID looks like a 16-digit hexademical number)
    and follow  gbt_numeric_penalty() logic (or even call directly).




--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: Review: GiST support for UUIDs

From
Andres Freund
Date:
Please post reviews to the original thread unless that's already humongously large. Otherwise somebody needs to
manuallylink up the threads in the CF application.
 

It also makes it much harder to follow the development because there'll likely be a new version of the patch after a
review.Which then cab either pushed in a thread titled review, or in an old one, without context.
 

Andres
--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: Review: GiST support for UUIDs

From
Teodor Sigaev
Date:

Andres Freund wrote:
> Please post reviews to the original thread unless that's already humongously large.
I've lost original mail. Sorry.

> Otherwise somebody needs to manually link up the threads in the CF application.

Of course, I did it

>
> It also makes it much harder to follow the development because there'll likely be a new version of the patch
> after a review. Which then cab either pushed in a thread titled review, or in an old one, without context.

Sorry, I tried to fix that as possible.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: Review: GiST support for UUIDs

From
Paul Jungwirth
Date:
> 2)
>      static double
>      uuid2num(const pg_uuid_t *i)
>      {
>          return *((uint64 *)i);
>      }
>     It isn't looked as correct transformation for me. May be, it's better
>     to transform to numeric type (UUID looks like a 16-digit hexademical
> number)
>     and follow  gbt_numeric_penalty() logic (or even call directly).

Thanks for the review! A UUID is actually not stored as a string of 
hexadecimal digits. (It is normally displayed that way, but with 32 
digits, not 16.) Rather it is stored as an unstructured 128-bit value 
(which in C is 16 unsigned chars). Here is the easy-to-misread 
declaration from src/backend/utils/adt/uuid.c:

#define UUID_LEN 16
struct pg_uuid_t
{
unsigned char data[UUID_LEN];
};

I would love to just cast this to a 128-bit unsigned int. But it looks 
like Postgres doesn't always have 128-bit ints available, so my code 
takes the top half and uses that for penalty calculations. It seemed to 
me that was "good enough" for this purpose.

The only other 128-bit type I found in btree_gist was Interval. For that 
type we convert to a double using INTERVAL_TO_SEC, then call 
penalty_num. By my read that accepts a similar loss of precision.

If I'm mistaken about 128-bit integer support, let me know, and maybe we 
can do the penalty computation on the whole UUID. Or maybe I should just 
convert the uint64 to a double before calling penalty_num? I don't 
completely understand what the penalty calculation is all about, so I 
welcome suggestions here.

Thanks again,
Paul












Re: Review: GiST support for UUIDs

From
Teodor Sigaev
Date:

Paul Jungwirth wrote:
>> 2)
>>      static double
>>      uuid2num(const pg_uuid_t *i)
>>      {
>>          return *((uint64 *)i);
>>      }
>>     It isn't looked as correct transformation for me. May be, it's better
>>     to transform to numeric type (UUID looks like a 16-digit hexademical
>> number)
>>     and follow  gbt_numeric_penalty() logic (or even call directly).
>
> Thanks for the review! A UUID is actually not stored as a string of
> hexadecimal digits. (It is normally displayed that way, but with 32
> digits, not 16.) Rather it is stored as an unstructured 128-bit value
> (which in C is 16 unsigned chars). Here is the easy-to-misread
> declaration from src/backend/utils/adt/uuid.c:
Missed number of digit, but nevertheless it doesn't matter for idea. 
Original coding uses only 8 bytes from 16 to compute penalty which could 
cause a problem with index performance. Simple way is just printing each 
4bits  with %02d modifier into string and then make a numeric value with 
a help of numeric_in.

Or something like this in pseudocode:

numeric = int8_numeric(*(uint64 *)(&i->data[0])) * 
int8_numeric(MAX_INT64) + int8_numeric(*(uint64 *)(&i->data[8]))

> The only other 128-bit type I found in btree_gist was Interval. For that
> type we convert to a double using INTERVAL_TO_SEC, then call
> penalty_num. By my read that accepts a similar loss of precision.
Right, but precision of double  is enough to represent 1 century 
interval with 0.00001 seconds accuracy which is enough for  practical 
usage. In UUID case you will take into account only half of value. Of 
course, GiST will work even with penalty function returning constant but 
each scan could become full-index-scan.

>
> If I'm mistaken about 128-bit integer support, let me know, and maybe we
> can do the penalty computation on the whole UUID. Or maybe I should just
> convert the uint64 to a double before calling penalty_num? I don't
> completely understand what the penalty calculation is all about, so I
> welcome suggestions here.

Penalty method calculates how union key will be enlarged if insert will 
be produced in current subtree. It directly affects selectivity of subtree.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru                                      WWW:
http://www.sigaev.ru/



Re: Review: GiST support for UUIDs

From
Paul Jungwirth
Date:
> Or something like this in pseudocode:
>
> numeric = int8_numeric(*(uint64 *)(&i->data[0])) *
> int8_numeric(MAX_INT64) + int8_numeric(*(uint64 *)(&i->data[8]))

This is more like what I was hoping for, rather than converting to a 
string and back. Would you mind confirming for me: int8_numeric turns an 
int64 into an arbitrary-precision numeric Datum? So there is no overflow 
risk here?

But it looks like int8_numeric takes a *signed* integer. Isn't that a 
problem? I suppose I could get it working though by jumping through some 
hoops.

> In UUID case you will take into account only half of value. Of
> course, GiST will work even with penalty function returning constant but
> each scan could become full-index-scan.

Yes, but that seems like an unrealistic concern. Even "only" 2^64 is 
18446744073709551616 records. Or another way of thinking about it, if 64 
bits (or 32) is a good enough penalty input for all the other data 
types, why not for UUIDs? Keep in mind type 3-5 UUIDs should be evenly 
distributed. Perhaps we could use the bottom half (instead of the top) 
to ensure even distribution for type 1 and 2 too.

It seems to me that using only the top half should be okay, but if you 
believe it's not I'll go with the int8_numeric approach (in three chunks 
instead of two to work around signed-vs-unsigned).

Thanks,
Paul




Re: Review: GiST support for UUIDs

From
Teodor Sigaev
Date:

Paul Jungwirth wrote:
>> Or something like this in pseudocode:
>>
>> numeric = int8_numeric(*(uint64 *)(&i->data[0])) *
>> int8_numeric(MAX_INT64) + int8_numeric(*(uint64 *)(&i->data[8]))
>
> This is more like what I was hoping for, rather than converting to a
> string and back. Would you mind confirming for me: int8_numeric turns an
> int64 into an arbitrary-precision numeric Datum? So there is no overflow
> risk here?
Sure, no risk. Numeric precision is limited 1000 digits with magnitude 1000


>
> But it looks like int8_numeric takes a *signed* integer. Isn't that a
> problem? I suppose I could get it working though by jumping through some
> hoops.
signed vs unsigned problem does not exist actually, because of precision 
of numeric is much better than we need and presence of numeric_abs.


> Yes, but that seems like an unrealistic concern. Even "only" 2^64 is
> 18446744073709551616 records. Or another way of thinking about it, if 64
:) "only"

> bits (or 32) is a good enough penalty input for all the other data
> types, why not for UUIDs? Keep in mind type 3-5 UUIDs should be evenly
> distributed. Perhaps we could use the bottom half (instead of the top)
> to ensure even distribution for type 1 and 2 too.
it must be. But UUID could be taken for unknown source and we can't 
predict distribution. I believe pg generates them correctly, but other 
generators could be not so good.

> It seems to me that using only the top half should be okay, but if you
> believe it's not I'll go with the int8_numeric approach (in three chunks
> instead of two to work around signed-vs-unsigned).
Yes, I believe. It is not good case when we can ruin index performance 
with special set of value.

Some difficulty  which I see is how to transform numeric penalty to 
double as it requires by GiST. May be, penalty/(INT_MAX64*INT_MAX64 + 
INT_MAX64)? Keep in mind, that penalty is how range will be enlarged 
after range union, so minimal penalty is 0 and maximum is 
0xffffffffffffffffffffffffffffffff

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru                                      WWW:
http://www.sigaev.ru/



Re: Review: GiST support for UUIDs

From
Michael Paquier
Date:
On Tue, Sep 15, 2015 at 3:03 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>
>
> Paul Jungwirth wrote:
>>>
>>> Or something like this in pseudocode:
>>>
>>> numeric = int8_numeric(*(uint64 *)(&i->data[0])) *
>>> int8_numeric(MAX_INT64) + int8_numeric(*(uint64 *)(&i->data[8]))
>>
>>
>> This is more like what I was hoping for, rather than converting to a
>> string and back. Would you mind confirming for me: int8_numeric turns an
>> int64 into an arbitrary-precision numeric Datum? So there is no overflow
>> risk here?
>
> Sure, no risk. Numeric precision is limited 1000 digits with magnitude 1000
>
>
>>
>> But it looks like int8_numeric takes a *signed* integer. Isn't that a
>> problem? I suppose I could get it working though by jumping through some
>> hoops.
>
> signed vs unsigned problem does not exist actually, because of precision of
> numeric is much better than we need and presence of numeric_abs.
>
>
>> Yes, but that seems like an unrealistic concern. Even "only" 2^64 is
>> 18446744073709551616 records. Or another way of thinking about it, if 64
>
> :) "only"
>
>> bits (or 32) is a good enough penalty input for all the other data
>> types, why not for UUIDs? Keep in mind type 3-5 UUIDs should be evenly
>> distributed. Perhaps we could use the bottom half (instead of the top)
>> to ensure even distribution for type 1 and 2 too.
>
> it must be. But UUID could be taken for unknown source and we can't predict
> distribution. I believe pg generates them correctly, but other generators
> could be not so good.
>
>> It seems to me that using only the top half should be okay, but if you
>> believe it's not I'll go with the int8_numeric approach (in three chunks
>> instead of two to work around signed-vs-unsigned).
>
> Yes, I believe. It is not good case when we can ruin index performance with
> special set of value.
>
> Some difficulty  which I see is how to transform numeric penalty to double
> as it requires by GiST. May be, penalty/(INT_MAX64*INT_MAX64 + INT_MAX64)?
> Keep in mind, that penalty is how range will be enlarged after range union,
> so minimal penalty is 0 and maximum is 0xffffffffffffffffffffffffffffffff

This patch has been marked as "returned with feedback" for CF 2015-11
because of the presence of a review and a certain lack of activity..
Please free to move it to next CF if you think that's more adapted.
-- 
Michael



Re: Review: GiST support for UUIDs

From
Ildus Kurbangaliev
Date:
On Tue, 15 Sep 2015 09:03:00 +0300
Teodor Sigaev <teodor@sigaev.ru> wrote:

> Paul Jungwirth wrote:
> >> Or something like this in pseudocode:
> >>
> >> numeric = int8_numeric(*(uint64 *)(&i->data[0])) *
> >> int8_numeric(MAX_INT64) + int8_numeric(*(uint64 *)(&i->data[8]))
> >
> > This is more like what I was hoping for, rather than converting to a
> > string and back. Would you mind confirming for me: int8_numeric
> > turns an int64 into an arbitrary-precision numeric Datum? So there
> > is no overflow risk here?
> Sure, no risk. Numeric precision is limited 1000 digits with
> magnitude 1000
>
>
> >
> > But it looks like int8_numeric takes a *signed* integer. Isn't that
> > a problem? I suppose I could get it working though by jumping
> > through some hoops.
> signed vs unsigned problem does not exist actually, because of
> precision of numeric is much better than we need and presence of
> numeric_abs.
>
>
> > Yes, but that seems like an unrealistic concern. Even "only" 2^64 is
> > 18446744073709551616 records. Or another way of thinking about it,
> > if 64
> :) "only"
>
> > bits (or 32) is a good enough penalty input for all the other data
> > types, why not for UUIDs? Keep in mind type 3-5 UUIDs should be
> > evenly distributed. Perhaps we could use the bottom half (instead
> > of the top) to ensure even distribution for type 1 and 2 too.
> it must be. But UUID could be taken for unknown source and we can't
> predict distribution. I believe pg generates them correctly, but
> other generators could be not so good.
>
> > It seems to me that using only the top half should be okay, but if
> > you believe it's not I'll go with the int8_numeric approach (in
> > three chunks instead of two to work around signed-vs-unsigned).
> Yes, I believe. It is not good case when we can ruin index
> performance with special set of value.
>
> Some difficulty  which I see is how to transform numeric penalty to
> double as it requires by GiST. May be, penalty/(INT_MAX64*INT_MAX64 +
> INT_MAX64)? Keep in mind, that penalty is how range will be enlarged
> after range union, so minimal penalty is 0 and maximum is
> 0xffffffffffffffffffffffffffffffff
>

There is a more improved version of the patch. Main idea is to present
uuid as two uint64 values, and make comparisons and penalty calculation
based on these values. This approach is much faster than using memcmp
for uuid comparisons.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: Review: GiST support for UUIDs

From
Paul Jungwirth
Date:
On 12/23/2015 08:10 AM, Ildus Kurbangaliev wrote:
> There is a more improved version of the patch. Main idea is to present
> uuid as two uint64 values, and make comparisons and penalty calculation
> based on these values. This approach is much faster than using memcmp
> for uuid comparisons.

Thank you for picking this up! I'm sorry I was not able to work on it 
the last few months. I'm very glad to see someone wrapping it up. I'm 
not a reviewer, but personally it looks like a good change to me.

Happy holidays,

Paul







Re: Review: GiST support for UUIDs

From
Ildus Kurbangaliev
Date:
On Wed, 23 Dec 2015 16:36:23 -0800
Paul Jungwirth <pj@illuminatedcomputing.com> wrote:

> On 12/23/2015 08:10 AM, Ildus Kurbangaliev wrote:
> > There is a more improved version of the patch. Main idea is to
> > present uuid as two uint64 values, and make comparisons and penalty
> > calculation based on these values. This approach is much faster
> > than using memcmp for uuid comparisons.
>
> Thank you for picking this up! I'm sorry I was not able to work on it
> the last few months. I'm very glad to see someone wrapping it up. I'm
> not a reviewer, but personally it looks like a good change to me.
>
> Happy holidays,
>
> Paul
>
>
>
>

Thanks! The patch was almost done and ready. I attached new version of
the patch with compability changes.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: Review: GiST support for UUIDs

From
Teodor Sigaev
Date:
Thank you, but I have some notices about
static float
uuid_parts_distance(pg_uuid_t *a, pg_uuid_t *b)
{    pg_uuid_t ua, ub;    const double mp = pow(2, -64);
    uuid_cnv(a, &ua);    uuid_cnv(b, &ub);
    Assert(ua.v64[0] > ub.v64[0]);    uint64 high = ua.v64[0] - ub.v64[0];    uint64 low = ua.v64[1] - ub.v64[1];    if
(low> ua.v64[1])        high--;
 
    return (float) (ldexp(high, 64) + (double) low * mp);
}

First, variables (high and low) should not be declared in the middle of code. 
Second, assert will fail if ua.v64[0] == ub.v64[0] and
ua.v64[1] > ub.v64[1] although it's a possible and allowed case. Third, actually 
you multiply high value by 2^64 anf low by 2^-64. Seems, it's needed to do only 
one multiplication.


Ildus Kurbangaliev wrote:
> On Wed, 23 Dec 2015 16:36:23 -0800
> Paul Jungwirth <pj@illuminatedcomputing.com> wrote:
>
>> On 12/23/2015 08:10 AM, Ildus Kurbangaliev wrote:
>>> There is a more improved version of the patch. Main idea is to
>>> present uuid as two uint64 values, and make comparisons and penalty
>>> calculation based on these values. This approach is much faster
>>> than using memcmp for uuid comparisons.
>>
>> Thank you for picking this up! I'm sorry I was not able to work on it
>> the last few months. I'm very glad to see someone wrapping it up. I'm
>> not a reviewer, but personally it looks like a good change to me.
>>
>> Happy holidays,
>>
>> Paul
>>
>>
>>
>>
>
> Thanks! The patch was almost done and ready. I attached new version of
> the patch with compability changes.
>

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: Review: GiST support for UUIDs

From
Ildus Kurbangaliev
Date:
On Fri, 25 Dec 2015 13:34:25 +0300
Teodor Sigaev <teodor@sigaev.ru> wrote:

> Thank you, but I have some notices about
> static float
> uuid_parts_distance(pg_uuid_t *a, pg_uuid_t *b)
> {
>      pg_uuid_t ua, ub;
>      const double mp = pow(2, -64);
>
>      uuid_cnv(a, &ua);
>      uuid_cnv(b, &ub);
>
>      Assert(ua.v64[0] > ub.v64[0]);
>      uint64 high = ua.v64[0] - ub.v64[0];
>      uint64 low = ua.v64[1] - ub.v64[1];
>      if (low > ua.v64[1])
>          high--;
>
>      return (float) (ldexp(high, 64) + (double) low * mp);
> }
>
> First, variables (high and low) should not be declared in the middle
> of code. Second, assert will fail if ua.v64[0] == ub.v64[0] and
> ua.v64[1] > ub.v64[1] although it's a possible and allowed case.
> Third, actually you multiply high value by 2^64 anf low by 2^-64.
> Seems, it's needed to do only one multiplication.

Thank you for review. Fixed.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: Review: GiST support for UUIDs

From
Paul Jungwirth
Date:
On 12/25/2015 03:23 AM, Ildus Kurbangaliev wrote:
> On Fri, 25 Dec 2015 13:34:25 +0300
> Teodor Sigaev <teodor@sigaev.ru> wrote:
>> First, variables (high and low) should not be declared in the middle
>> of code. Second, assert will fail if ua.v64[0] == ub.v64[0] and
>> ua.v64[1] > ub.v64[1] although it's a possible and allowed case.
>> Third, actually you multiply high value by 2^64 anf low by 2^-64.
>> Seems, it's needed to do only one multiplication.
>
> Thank you for review. Fixed.

Just wanted to follow up on this and see about getting it added to the 
next commitfest. It looks like Iluds's latest patch 
(btree_gist_uuid_5.patch) doesn't appear on the commitfest page here:

https://commitfest.postgresql.org/7/332/

Also the last few emails of the thread don't show up, although you can 
read them here:

http://www.postgresql.org/message-id/20151225142340.46e577dd@lp

I'm not sure the next step here. Do I click "Move to next CF" in the 
"Status" dropdown? Apologies for not being familiar with the protocol. 
I'd be sad to see this work just get ignored.

Thanks,
Paul




Re: Review: GiST support for UUIDs

From
Paul A Jungwirth
Date:
On Fri, Feb 5, 2016 at 4:45 PM, Paul Jungwirth
<pj@illuminatedcomputing.com> wrote:
> On 12/25/2015 03:23 AM, Ildus Kurbangaliev wrote:
>> Thank you for review. Fixed.
> Just wanted to follow up on this and see about getting it added to the next
> commitfest. It looks like Iluds's latest patch (btree_gist_uuid_5.patch)
> doesn't appear on the commitfest page here:
> https://commitfest.postgresql.org/7/332/

I took the btree_gist_uuid_5.patch file from Ildus and updated it so
it would apply on the latest commit. My version is attached as
btree_gist_uuid_6.patch. Since Tom recently changed the SQL signatures
of a few functions (gbt_*_union and gbt_*_same), I did the same with
the new uuid functions. I'd love to see this get accepted next time
around (in September it looks like)!

Thanks,
Paul

Attachment