UUID/GUID discussion leading to request for hexstring bytea? (was: Re: TODO: GUID datatype) - Mailing list pgsql-hackers

From mark@mark.mielke.cc
Subject UUID/GUID discussion leading to request for hexstring bytea? (was: Re: TODO: GUID datatype)
Date
Msg-id 20060906210547.GA26661@mark.mielke.cc
Whole thread Raw
In response to TODO: GUID datatype  (Gevik Babakhani <pgdev@xs4all.nl>)
Responses Re: UUID/GUID discussion leading to request for hexstring bytea? (was: Re: TODO: GUID datatype)  (Martijn van Oosterhout <kleptog@svana.org>)
Re: UUID/GUID discussion leading to request for hexstring bytea?  (Gevik Babakhani <pgdev@xs4all.nl>)
List pgsql-hackers
On Wed, Sep 06, 2006 at 07:49:09PM +0200, Gevik Babakhani wrote:
> I would like to know if anyone is working on the GUID datatype.
> If not, I am going to work on it. Please let me know.

Assuming you mean GUID in the same sense as UUID, there are many
non-core developers who would like to see it, or who have worked
on one.

There are at least three floating around that I am aware of, one of
which I contributed to, that works well, and one that I imagine works
well (the OSSP UUID library includes a PostgreSQL binding) both of
which include support for a UUID type, including the desired index
support and plpgsql function calls for generating, and manipulating
the UUID values.

I'm a little stuck at the moment, as I have time sapped away from me
by things like real work, and I'm trying to prepare a submission that
would be acceptable for core. I've read the thoughts of some of the
core developers on this subject, and agree with many of them.

The UUID type itself has value, however, the value it provides is
limited. Generation of a UUID doesn't have to occur with the database.
The application inserting the row can generate the UUID. The UUID type
itself has limited value, in that the difference between a 36 bytes +
4 bytes = 40 bytes as a fully expanded char(40), or the packed value
using the hexstring encode/decode SQL functions of a bytea type at 16
+ 4 bytes = 20 bytes compared to a tightly packed UUID type of 16
bytes, are very close. The argument can easily be made that if space
(disk space, index size, I/O bandwidth) is your first priority, than
a UUID is the wrong type to use. A 64-bit integer may suffice.

I'm also having trouble with the idea that a UUID deserves special
treatment. I currently have a desire to store both UUID and MD5
checksum in my rows. They are both 128 bits = 16 bytes, and fit all
of the same requirements above.

I would really like to have an MD5 checksum type now for the same
reason. It has a reasonable use that few could deny. Perhaps an MD5
checksum type would be more frequently used than a UUID type? More
systems these days are using the MD5 checksum as a unique identifier
for content. It has a few clever advantages. Assuming it really is
well distributed, and extremely unlikely for overlap to occur within
a system, the MD5 checksum has the advantage of automatically dealing
with duplicate values. In my case, I have chosen to identify uploaded
jpeg images by their MD5 checksum.

This makes it seem as if a generic 128-bit data type would be
desirable. They both have a compatible representation of a hexadecimal
string. The extra '-' characters in the UUID can be easily added when
necessary by a HEXSTRING2UUID() sort of function in plpgsql or in
the claling application.

But - MD5 isn't the only checksum that is frequently used. Some argue
that the MD5 can be shown to be weak in some regards, and that perhaps
other checksum algorithms such as SHA-1 provide a better guarantee of
uniqueness. SHA-1 isn't 128-bits. It's 160-bits.

This is where I start to buy Tom Lane's argument that the 4-byte
prefix is no big deal. I find it more desirable to have a binary data
type with a hexadecimal string input and output function. The
flexibility of being able to use 128-bits or 160-bits is worth this
4-byte header to me. What I don't want to do is store double size
fields, stored as hexadecimal.

This leads to a few options:
   1) Create specific types as necessary, with associated functions.      No overhead.          - uuid, md5sum,
sha1sum,...
 
   2) Create semi-generic types with common bitlengths. Associated      functions work on these semi-generic types. No
overhead.         - hexstring128, hexstring160, ...
 
   3) Create a new bytea type that has ascii input and output formats,      probably based around hexstrings. Overhead
of4 bytes.
 
   4) Use varchar to store these types, and provide associated      functions that return character strings in the
rightformat.      This follows the existing md5sum() PostgreSQL function. Overhead      is double the size of the
data.
   5) Use bytea to store these types, and the encode/decode functions      are passed character strings in the right
format.Possibly      complicated for the application to deal with, as well as a user      typing SQL commands. Overhead
of4 bytes.
 

As I said - I'm stuck. Not sure which way to go. I currently use a uuid
type that I and another person on this list wrote against the OSSP UUID
library. For the MD5 checksum, I use char(32).

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all                      and in the darkness
bindthem...
 
                          http://mark.mielke.cc/



pgsql-hackers by date:

Previous
From: Martijn van Oosterhout
Date:
Subject: Re: ECPG/OpenBSD buildfarm failures, take I
Next
From: Tom Lane
Date:
Subject: Problems with extended-Query logging code