Home > mailing lists

UUID/GUID discussion leading to request for hexstring bytea? (was: Re: TODO: GUID datatype) - Mailing list pgsql-hackers

From	mark@mark.mielke.cc
Subject	UUID/GUID discussion leading to request for hexstring bytea? (was: Re: TODO: GUID datatype)
Date	September 6, 2006 21:36:17
Msg-id	20060906210547.GA26661@mark.mielke.cc Whole thread Raw
In response to	TODO: GUID datatype (Gevik Babakhani <pgdev@xs4all.nl>)
Responses	Re: UUID/GUID discussion leading to request for hexstring bytea? (was: Re: TODO: GUID datatype) (Martijn van Oosterhout <kleptog@svana.org>) Re: UUID/GUID discussion leading to request for hexstring bytea? (Gevik Babakhani <pgdev@xs4all.nl>)
List	pgsql-hackers

Tree view

On Wed, Sep 06, 2006 at 07:49:09PM +0200, Gevik Babakhani wrote:
> I would like to know if anyone is working on the GUID datatype.
> If not, I am going to work on it. Please let me know.

Assuming you mean GUID in the same sense as UUID, there are many
non-core developers who would like to see it, or who have worked
on one.

There are at least three floating around that I am aware of, one of
which I contributed to, that works well, and one that I imagine works
well (the OSSP UUID library includes a PostgreSQL binding) both of
which include support for a UUID type, including the desired index
support and plpgsql function calls for generating, and manipulating
the UUID values.

I'm a little stuck at the moment, as I have time sapped away from me
by things like real work, and I'm trying to prepare a submission that
would be acceptable for core. I've read the thoughts of some of the
core developers on this subject, and agree with many of them.

The UUID type itself has value, however, the value it provides is
limited. Generation of a UUID doesn't have to occur with the database.
The application inserting the row can generate the UUID. The UUID type
itself has limited value, in that the difference between a 36 bytes +
4 bytes = 40 bytes as a fully expanded char(40), or the packed value
using the hexstring encode/decode SQL functions of a bytea type at 16
+ 4 bytes = 20 bytes compared to a tightly packed UUID type of 16
bytes, are very close. The argument can easily be made that if space
(disk space, index size, I/O bandwidth) is your first priority, than
a UUID is the wrong type to use. A 64-bit integer may suffice.

I'm also having trouble with the idea that a UUID deserves special
treatment. I currently have a desire to store both UUID and MD5
checksum in my rows. They are both 128 bits = 16 bytes, and fit all
of the same requirements above.

I would really like to have an MD5 checksum type now for the same
reason. It has a reasonable use that few could deny. Perhaps an MD5
checksum type would be more frequently used than a UUID type? More
systems these days are using the MD5 checksum as a unique identifier
for content. It has a few clever advantages. Assuming it really is
well distributed, and extremely unlikely for overlap to occur within
a system, the MD5 checksum has the advantage of automatically dealing
with duplicate values. In my case, I have chosen to identify uploaded
jpeg images by their MD5 checksum.

This makes it seem as if a generic 128-bit data type would be
desirable. They both have a compatible representation of a hexadecimal
string. The extra '-' characters in the UUID can be easily added when
necessary by a HEXSTRING2UUID() sort of function in plpgsql or in
the claling application.

But - MD5 isn't the only checksum that is frequently used. Some argue
that the MD5 can be shown to be weak in some regards, and that perhaps
other checksum algorithms such as SHA-1 provide a better guarantee of
uniqueness. SHA-1 isn't 128-bits. It's 160-bits.

This is where I start to buy Tom Lane's argument that the 4-byte
prefix is no big deal. I find it more desirable to have a binary data
type with a hexadecimal string input and output function. The
flexibility of being able to use 128-bits or 160-bits is worth this
4-byte header to me. What I don't want to do is store double size
fields, stored as hexadecimal.

This leads to a few options:
1) Create specific types as necessary, with associated functions. No overhead. - uuid, md5sum,
sha1sum,...

2) Create semi-generic types with common bitlengths. Associated functions work on these semi-generic types. No
overhead. - hexstring128, hexstring160, ...

3) Create a new bytea type that has ascii input and output formats, probably based around hexstrings. Overhead
of4 bytes.

4) Use varchar to store these types, and provide associated functions that return character strings in the
rightformat. This follows the existing md5sum() PostgreSQL function. Overhead is double the size of the
data.
5) Use bytea to store these types, and the encode/decode functions are passed character strings in the right
format.Possibly complicated for the application to deal with, as well as a user typing SQL commands. Overhead
of4 bytes.

As I said - I'm stuck. Not sure which way to go. I currently use a uuid
type that I and another person on this list wrote against the OSSP UUID
library. For the MD5 checksum, I use char(32).

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness
bindthem...

http://mark.mielke.cc/

pgsql-hackers by date:

From: Martijn van Oosterhout
Date: 06 September 2006, 21:21:05
Subject: Re: ECPG/OpenBSD buildfarm failures, take I

From: Tom Lane
Date: 06 September 2006, 22:01:59
Subject: Problems with extended-Query logging code

UUID/GUID discussion leading to request for hexstring bytea? (was: Re: TODO: GUID datatype) - Mailing list pgsql-hackers

Previous

Next