Thread: TODO: GUID datatype

TODO: GUID datatype

From

Gevik Babakhani

Date:

06 September 2006, 14:46:54

I would like to know if anyone is working on the GUID datatype.
If not, I am going to work on it. Please let me know.

Regards,
Gevik.

UUID/GUID discussion leading to request for hexstring bytea? (was: Re: TODO: GUID datatype)

From

mark@mark.mielke.cc

Date:

06 September 2006, 18:36:17

On Wed, Sep 06, 2006 at 07:49:09PM +0200, Gevik Babakhani wrote:
> I would like to know if anyone is working on the GUID datatype.
> If not, I am going to work on it. Please let me know.

Assuming you mean GUID in the same sense as UUID, there are many
non-core developers who would like to see it, or who have worked
on one.

There are at least three floating around that I am aware of, one of
which I contributed to, that works well, and one that I imagine works
well (the OSSP UUID library includes a PostgreSQL binding) both of
which include support for a UUID type, including the desired index
support and plpgsql function calls for generating, and manipulating
the UUID values.

I'm a little stuck at the moment, as I have time sapped away from me
by things like real work, and I'm trying to prepare a submission that
would be acceptable for core. I've read the thoughts of some of the
core developers on this subject, and agree with many of them.

The UUID type itself has value, however, the value it provides is
limited. Generation of a UUID doesn't have to occur with the database.
The application inserting the row can generate the UUID. The UUID type
itself has limited value, in that the difference between a 36 bytes +
4 bytes = 40 bytes as a fully expanded char(40), or the packed value
using the hexstring encode/decode SQL functions of a bytea type at 16
+ 4 bytes = 20 bytes compared to a tightly packed UUID type of 16
bytes, are very close. The argument can easily be made that if space
(disk space, index size, I/O bandwidth) is your first priority, than
a UUID is the wrong type to use. A 64-bit integer may suffice.

I'm also having trouble with the idea that a UUID deserves special
treatment. I currently have a desire to store both UUID and MD5
checksum in my rows. They are both 128 bits = 16 bytes, and fit all
of the same requirements above.

I would really like to have an MD5 checksum type now for the same
reason. It has a reasonable use that few could deny. Perhaps an MD5
checksum type would be more frequently used than a UUID type? More
systems these days are using the MD5 checksum as a unique identifier
for content. It has a few clever advantages. Assuming it really is
well distributed, and extremely unlikely for overlap to occur within
a system, the MD5 checksum has the advantage of automatically dealing
with duplicate values. In my case, I have chosen to identify uploaded
jpeg images by their MD5 checksum.

This makes it seem as if a generic 128-bit data type would be
desirable. They both have a compatible representation of a hexadecimal
string. The extra '-' characters in the UUID can be easily added when
necessary by a HEXSTRING2UUID() sort of function in plpgsql or in
the claling application.

But - MD5 isn't the only checksum that is frequently used. Some argue
that the MD5 can be shown to be weak in some regards, and that perhaps
other checksum algorithms such as SHA-1 provide a better guarantee of
uniqueness. SHA-1 isn't 128-bits. It's 160-bits.

This is where I start to buy Tom Lane's argument that the 4-byte
prefix is no big deal. I find it more desirable to have a binary data
type with a hexadecimal string input and output function. The
flexibility of being able to use 128-bits or 160-bits is worth this
4-byte header to me. What I don't want to do is store double size
fields, stored as hexadecimal.

This leads to a few options:
1) Create specific types as necessary, with associated functions. No overhead. - uuid, md5sum,
sha1sum,...

2) Create semi-generic types with common bitlengths. Associated functions work on these semi-generic types. No
overhead. - hexstring128, hexstring160, ...

3) Create a new bytea type that has ascii input and output formats, probably based around hexstrings. Overhead
of4 bytes.

4) Use varchar to store these types, and provide associated functions that return character strings in the
rightformat. This follows the existing md5sum() PostgreSQL function. Overhead is double the size of the
data.
5) Use bytea to store these types, and the encode/decode functions are passed character strings in the right
format.Possibly complicated for the application to deal with, as well as a user typing SQL commands. Overhead
of4 bytes.

As I said - I'm stuck. Not sure which way to go. I currently use a uuid
type that I and another person on this list wrote against the OSSP UUID
library. For the MD5 checksum, I use char(32).

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness
bindthem...

http://mark.mielke.cc/

Re: UUID/GUID discussion leading to request for hexstring bytea? (was: Re: TODO: GUID datatype)

From

Martijn van Oosterhout

Date:

07 September 2006, 06:35:53

On Wed, Sep 06, 2006 at 05:05:47PM -0400, mark@mark.mielke.cc wrote:
>     2) Create semi-generic types with common bitlengths. Associated
>        functions work on these semi-generic types. No overhead.
>            - hexstring128, hexstring160, ...
>
>     3) Create a new bytea type that has ascii input and output formats,
>        probably based around hexstrings. Overhead of 4 bytes.

I think 3) is worthwhile for core, it would have many uses. But you
don't actually need to have a new type for that, just new I/O
functions.

As for 2) I think would be acceptable for contrib to contain some code
that demonstrates how to make fixed length types. It would be fairly
straightforward to make a script where you give it a type name and a
length and it'll spit out the code for that type.

I don't think UUID specific stuff needs to be in core, though you could
make an argument that the hex input/output functions should ignore
dashes, to make it straightforward to store UUIDs directly in there.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: UUID/GUID discussion leading to request for hexstring bytea?

From

Gevik Babakhani

Date:

07 September 2006, 08:25:00

On Wed, 2006-09-06 at 17:05 -0400, mark@mark.mielke.cc wrote:

> The UUID type itself has value, however, the value it provides is
> limited. Generation of a UUID doesn't have to occur with the database.
> The application inserting the row can generate the UUID. The UUID type
> itself has limited value, in that the difference between a 36 bytes +
> 4 bytes = 40 bytes as a fully expanded char(40), or the packed value
> using the hexstring encode/decode SQL functions of a bytea type at 16
> + 4 bytes = 20 bytes compared to a tightly packed UUID type of 16
> bytes, are very close. The argument can easily be made that if space
> (disk space, index size, I/O bandwidth) is your first priority, than
> a UUID is the wrong type to use. A 64-bit integer may suffice.

Agreed.

> 
> I'm also having trouble with the idea that a UUID deserves special
> treatment. I currently have a desire to store both UUID and MD5
> checksum in my rows. They are both 128 bits = 16 bytes, and fit all
> of the same requirements above.

Agreed.

> This makes it seem as if a generic 128-bit data type would be
> desirable. They both have a compatible representation of a hexadecimal
> string. The extra '-' characters in the UUID can be easily added when
> necessary by a HEXSTRING2UUID() sort of function in plpgsql or in
> the claling application.

Agreed

> 
> But - MD5 isn't the only checksum that is frequently used. Some argue
> that the MD5 can be shown to be weak in some regards, and that perhaps
> other checksum algorithms such as SHA-1 provide a better guarantee of
> uniqueness. SHA-1 isn't 128-bits. It's 160-bits.

To my knowledge most apps use MD5 checksum because it is easier to
implement and use.  
> 
> This is where I start to buy Tom Lane's argument that the 4-byte
> prefix is no big deal. I find it more desirable to have a binary data
> type with a hexadecimal string input and output function. The
> flexibility of being able to use 128-bits or 160-bits is worth this
> 4-byte header to me. What I don't want to do is store double size
> fields, stored as hexadecimal.

Agreed.

I have experimented with GUID/UUID type for a while and here are my
findings when having the need to store GUID/UUID/MD5 in the database.

1. Almost always these values are created outside the database. So to my
opinion having all kinds of functions in the database to create such
values becomes second priority except when wanting to use GUIDs auto
generated for PK like the uuid datatype in MS SQL

2. Yet I haven't seen any kind of arithmetic operations (+ - * / %) on
stored GUIDs hence these operations would be meaningless. I also never
did see the need to change specific parts of a GUID 

To my opinion only some of relational/compare operations like == and !=
apply to such values. comparing guid >= guid or md5 < md5 is also
meaningless.

3. Almost always a GUID/MD5 is not changed when it is generated and
stored except for cases like resetting md5sum of a password that is
changed.

To my opinion GUIDs type need to provide the following in the database.

1. GUID type must accept the correct string format(s), with of without
extra '-'
2. GUID type must internally be stored as small as possible.
3. GUID type must be comparable with == , != , LIKE and (NOT) IS NULL
4. GUID type must have the ability to be indexed, grouped, ordered,
DISTINCT... but not MAX(), MIN() or SUM()....

I think option three would be a good one to experiment with. I am
thinking of a 16 length struct to hold the GUID value and corresponding
functionality to achieve the above however possible.

I also think we should have the GUID/UUID as a datatype and not just
functions handling hexstring.

Re: UUID/GUID discussion leading to request for hexstring bytea?

From

Martijn van Oosterhout

Date:

07 September 2006, 09:49:31

On Thu, Sep 07, 2006 at 01:27:21PM +0200, Gevik Babakhani wrote:
> To my opinion only some of relational/compare operations like == and !=
> apply to such values. comparing guid >= guid or md5 < md5 is also
> meaningless.

<snip>

> 4. GUID type must have the ability to be indexed, grouped, ordered,
> DISTINCT... but not MAX(), MIN() or SUM()....

Err, for "ordered" you need to define <, >, >=, <=, which means you're
going to get MAX, and MIN for free...

> I also think we should have the GUID/UUID as a datatype and not just
> functions handling hexstring.

Sure, but that will be the I/O format, right?

Have a nice day,

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: UUID/GUID discussion leading to request for

From

Gevik Babakhani

Date:

07 September 2006, 10:34:25

On Thu, 2006-09-07 at 14:46 +0200, Martijn van Oosterhout wrote:
> On Thu, Sep 07, 2006 at 01:27:21PM +0200, Gevik Babakhani wrote:
> > To my opinion only some of relational/compare operations like == and !=
> > apply to such values. comparing guid >= guid or md5 < md5 is also
> > meaningless.
> 
> <snip>
> 
> > 4. GUID type must have the ability to be indexed, grouped, ordered,
> > DISTINCT... but not MAX(), MIN() or SUM()....
> 
> Err, for "ordered" you need to define <, >, >=, <=, which means you're
> going to get MAX, and MIN for free...
> 

Yes indeed, I just want to address that guid > guid or any other
operation like that has no meaning. 

> > I also think we should have the GUID/UUID as a datatype and not just
> > functions handling hexstring.
> 
> Sure, but that will be the I/O format, right?

Yes, I/O format as part of the datatype and not a separate function.

> 
> Have a nice day,
>

Looking at Postgres 8.2

From

"Strong, David"

Date:

07 September 2006, 14:30:54

As part of our ongoing research into Postgres performance and
scalability, we recently downloaded version 8.2 from CVS and we wanted
to pass on some observations.

When comparing 8.2 against 8.1.4, we see that there is roughly a 20%
increase in throughput. We credit most of this improvement to the
modifications made to the way in which the BufMappingLock and
LockMgrLock locks are now handled. Locking sections (partitions) of the
Shared Buffer and Locking hash tables certainly seems to pay off.

We had also come to the same conclusion and added similar code into a
local copy of 8.1.4. Although, we used SpinLocks rather than LWLocks to
lock sections of the hash table and we used an LWLock to lock the hash
table for critical operations.

Against 8.1.4, we saw that LWLockAcquire was taking the majority of the
run time, so we added some monitoring code to track the amount of time
spent to acquire and release each lock. The BufMappingLock and
LockMgrLock were, of course, on the top of that list.

We moved some of our 8.1.4 lock monitoring code over to 8.2 to analyze
the lock distribution for the partitions. In doing so, we noted that
WALInsertLock had now become a bottleneck, absorbing most of the time
freed up by the BufMappingLock and LockMgrLock changes.

We took a bold move and made XLogInsert a NOP. The next lock to rise to
the top of the list was SInvalLock. However, the time increase in
SInvalLock did not seem as much as in WALInsertLock. Although we have
not taken steps to do so, we assume that removing SInvalLock may reveal
the next lock that might impede scalability.

Outside of LWLockAcquire (~10%), the next functions that really seem to
take most of the time are HeapTupleSatisfiesSnapshot (~21%) and
PinBuffer (~14%). We're currently profiling and monitoring those
functions.

Re: UUID/GUID discussion leading to request for hexstring bytea?

From

Thomas Hallgren

Date:

18 September 2006, 05:40:48

Gevik Babakhani wrote:
>  To my opinion GUIDs type need to provide the following in the database.
> 
> 1. GUID type must accept the correct string format(s), with of without
> extra '-'
> 2. GUID type must internally be stored as small as possible.
> 3. GUID type must be comparable with == , != , LIKE and (NOT) IS NULL
> 4. GUID type must have the ability to be indexed, grouped, ordered,
> DISTINCT... but not MAX(), MIN() or SUM()....
> 
Where do you see a need for LIKE on a GUID?

Regards,
Thomas Hallgren

Re: UUID/GUID discussion leading to request for hexstring bytea?

From

Gevik Babakhani

Date:

18 September 2006, 05:46:15

LIKE could come handy if someone wants to abuse the uuid datatype to
store MD5 hash values. However I am not going to implement it if there
is no need for that (assuming it will pass the acceptance test)

On Mon, 2006-09-18 at 10:06 +0200, Thomas Hallgren wrote:
> Gevik Babakhani wrote:
> >  To my opinion GUIDs type need to provide the following in the database.
> > 
> > 1. GUID type must accept the correct string format(s), with of without
> > extra '-'
> > 2. GUID type must internally be stored as small as possible.
> > 3. GUID type must be comparable with == , != , LIKE and (NOT) IS NULL
> > 4. GUID type must have the ability to be indexed, grouped, ordered,
> > DISTINCT... but not MAX(), MIN() or SUM()....
> > 
> Where do you see a need for LIKE on a GUID?
> 
> Regards,
> Thomas Hallgren
> 
>

Re: UUID/GUID discussion leading to request for hexstring bytea?

From

Thomas Hallgren

Date:

18 September 2006, 05:59:20

Gevik Babakhani wrote:
> LIKE could come handy if someone wants to abuse the uuid datatype to
> store MD5 hash values. However I am not going to implement it if there
> is no need for that (assuming it will pass the acceptance test)
>
>   
Perhaps providing LIKE just to encourage abuse is not such a good idea? 
IMHO, a GUID should be comparable for equality and NULL only, not LIKE. 
I also think that ordering is feasible only when looking at parts of the 
GUID, i.e. order by the result of a function that extracts a timestamp 
or a node-address. Magnitude comparison on the GUID as a whole makes no 
sense to me.

Regards,
Thomas Hallgren

> On Mon, 2006-09-18 at 10:06 +0200, Thomas Hallgren wrote:
>   
>> Gevik Babakhani wrote:
>>     
>>>  To my opinion GUIDs type need to provide the following in the database.
>>>
>>> 1. GUID type must accept the correct string format(s), with of without
>>> extra '-'
>>> 2. GUID type must internally be stored as small as possible.
>>> 3. GUID type must be comparable with == , != , LIKE and (NOT) IS NULL
>>> 4. GUID type must have the ability to be indexed, grouped, ordered,
>>> DISTINCT... but not MAX(), MIN() or SUM()....
>>>
>>>       
>> Where do you see a need for LIKE on a GUID?
>>
>> Regards,
>> Thomas Hallgren
>>
>>
>>     
>
>

Re: UUID/GUID discussion leading to request for hexstring bytea?

From

Gevik Babakhani

Date:

18 September 2006, 06:13:09

> Magnitude comparison on the GUID as a whole makes no 
> sense to me.

I agree. Any kind of comparison except equality has no meaning for the
GUID. (And this is discussed before) I rather have the option to sort
and group for the sake of consistency and compatibility.

Re: UUID/GUID discussion leading to request for hexstring bytea?

From

mark@mark.mielke.cc

Date:

18 September 2006, 13:14:29

On Mon, Sep 18, 2006 at 11:12:54AM +0200, Gevik Babakhani wrote:
> > Magnitude comparison on the GUID as a whole makes no 
> > sense to me.
> I agree. Any kind of comparison except equality has no meaning for the
> GUID. (And this is discussed before) I rather have the option to sort
> and group for the sake of consistency and compatibility.

Thomas: The ability to sort / comparison is required for use with
B-Tree index. I prefer a fast comparison over one with more meaning.
memcmp() is fine with me and it is how I implement it in my UUID
PostgreSQL library.

Gevik: Once you have your patch in a ready state, I'll compare it
against what I have and see if there is anything missing, or if I
did anything better. Possibly not, but it's worth the check.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all                      and in the darkness
bindthem...

                          http://mark.mielke.cc/

Re: UUID/GUID discussion leading to request for hexstring bytea?

From

Thomas Hallgren

Date:

18 September 2006, 13:24:13

mark@mark.mielke.cc wrote:
> On Mon, Sep 18, 2006 at 11:12:54AM +0200, Gevik Babakhani wrote:
>>> Magnitude comparison on the GUID as a whole makes no 
>>> sense to me.
>> I agree. Any kind of comparison except equality has no meaning for the
>> GUID. (And this is discussed before) I rather have the option to sort
>> and group for the sake of consistency and compatibility.
> 
> Thomas: The ability to sort / comparison is required for use with
> B-Tree index. I prefer a fast comparison over one with more meaning.
> memcmp() is fine with me and it is how I implement it in my UUID
> PostgreSQL library.
> 
Fair enough. Although the magnitudes as such makes little sense, the ability to order will 
make it possible to compare results from different queries etc. Very difficult to do with 
random order.

Regards,
Thomas Hallgren