Thread: TODO: GUID datatype
I would like to know if anyone is working on the GUID datatype. If not, I am going to work on it. Please let me know. Regards, Gevik.
UUID/GUID discussion leading to request for hexstring bytea? (was: Re: TODO: GUID datatype)
From
mark@mark.mielke.cc
Date:
On Wed, Sep 06, 2006 at 07:49:09PM +0200, Gevik Babakhani wrote: > I would like to know if anyone is working on the GUID datatype. > If not, I am going to work on it. Please let me know. Assuming you mean GUID in the same sense as UUID, there are many non-core developers who would like to see it, or who have worked on one. There are at least three floating around that I am aware of, one of which I contributed to, that works well, and one that I imagine works well (the OSSP UUID library includes a PostgreSQL binding) both of which include support for a UUID type, including the desired index support and plpgsql function calls for generating, and manipulating the UUID values. I'm a little stuck at the moment, as I have time sapped away from me by things like real work, and I'm trying to prepare a submission that would be acceptable for core. I've read the thoughts of some of the core developers on this subject, and agree with many of them. The UUID type itself has value, however, the value it provides is limited. Generation of a UUID doesn't have to occur with the database. The application inserting the row can generate the UUID. The UUID type itself has limited value, in that the difference between a 36 bytes + 4 bytes = 40 bytes as a fully expanded char(40), or the packed value using the hexstring encode/decode SQL functions of a bytea type at 16 + 4 bytes = 20 bytes compared to a tightly packed UUID type of 16 bytes, are very close. The argument can easily be made that if space (disk space, index size, I/O bandwidth) is your first priority, than a UUID is the wrong type to use. A 64-bit integer may suffice. I'm also having trouble with the idea that a UUID deserves special treatment. I currently have a desire to store both UUID and MD5 checksum in my rows. They are both 128 bits = 16 bytes, and fit all of the same requirements above. I would really like to have an MD5 checksum type now for the same reason. It has a reasonable use that few could deny. Perhaps an MD5 checksum type would be more frequently used than a UUID type? More systems these days are using the MD5 checksum as a unique identifier for content. It has a few clever advantages. Assuming it really is well distributed, and extremely unlikely for overlap to occur within a system, the MD5 checksum has the advantage of automatically dealing with duplicate values. In my case, I have chosen to identify uploaded jpeg images by their MD5 checksum. This makes it seem as if a generic 128-bit data type would be desirable. They both have a compatible representation of a hexadecimal string. The extra '-' characters in the UUID can be easily added when necessary by a HEXSTRING2UUID() sort of function in plpgsql or in the claling application. But - MD5 isn't the only checksum that is frequently used. Some argue that the MD5 can be shown to be weak in some regards, and that perhaps other checksum algorithms such as SHA-1 provide a better guarantee of uniqueness. SHA-1 isn't 128-bits. It's 160-bits. This is where I start to buy Tom Lane's argument that the 4-byte prefix is no big deal. I find it more desirable to have a binary data type with a hexadecimal string input and output function. The flexibility of being able to use 128-bits or 160-bits is worth this 4-byte header to me. What I don't want to do is store double size fields, stored as hexadecimal. This leads to a few options: 1) Create specific types as necessary, with associated functions. No overhead. - uuid, md5sum, sha1sum,... 2) Create semi-generic types with common bitlengths. Associated functions work on these semi-generic types. No overhead. - hexstring128, hexstring160, ... 3) Create a new bytea type that has ascii input and output formats, probably based around hexstrings. Overhead of4 bytes. 4) Use varchar to store these types, and provide associated functions that return character strings in the rightformat. This follows the existing md5sum() PostgreSQL function. Overhead is double the size of the data. 5) Use bytea to store these types, and the encode/decode functions are passed character strings in the right format.Possibly complicated for the application to deal with, as well as a user typing SQL commands. Overhead of4 bytes. As I said - I'm stuck. Not sure which way to go. I currently use a uuid type that I and another person on this list wrote against the OSSP UUID library. For the MD5 checksum, I use char(32). Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bindthem... http://mark.mielke.cc/
Re: UUID/GUID discussion leading to request for hexstring bytea? (was: Re: TODO: GUID datatype)
From
Martijn van Oosterhout
Date:
On Wed, Sep 06, 2006 at 05:05:47PM -0400, mark@mark.mielke.cc wrote: > 2) Create semi-generic types with common bitlengths. Associated > functions work on these semi-generic types. No overhead. > - hexstring128, hexstring160, ... > > 3) Create a new bytea type that has ascii input and output formats, > probably based around hexstrings. Overhead of 4 bytes. I think 3) is worthwhile for core, it would have many uses. But you don't actually need to have a new type for that, just new I/O functions. As for 2) I think would be acceptable for contrib to contain some code that demonstrates how to make fixed length types. It would be fairly straightforward to make a script where you give it a type name and a length and it'll spit out the code for that type. I don't think UUID specific stuff needs to be in core, though you could make an argument that the hex input/output functions should ignore dashes, to make it straightforward to store UUIDs directly in there. Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
On Wed, 2006-09-06 at 17:05 -0400, mark@mark.mielke.cc wrote: > The UUID type itself has value, however, the value it provides is > limited. Generation of a UUID doesn't have to occur with the database. > The application inserting the row can generate the UUID. The UUID type > itself has limited value, in that the difference between a 36 bytes + > 4 bytes = 40 bytes as a fully expanded char(40), or the packed value > using the hexstring encode/decode SQL functions of a bytea type at 16 > + 4 bytes = 20 bytes compared to a tightly packed UUID type of 16 > bytes, are very close. The argument can easily be made that if space > (disk space, index size, I/O bandwidth) is your first priority, than > a UUID is the wrong type to use. A 64-bit integer may suffice. Agreed. > > I'm also having trouble with the idea that a UUID deserves special > treatment. I currently have a desire to store both UUID and MD5 > checksum in my rows. They are both 128 bits = 16 bytes, and fit all > of the same requirements above. Agreed. > This makes it seem as if a generic 128-bit data type would be > desirable. They both have a compatible representation of a hexadecimal > string. The extra '-' characters in the UUID can be easily added when > necessary by a HEXSTRING2UUID() sort of function in plpgsql or in > the claling application. Agreed > > But - MD5 isn't the only checksum that is frequently used. Some argue > that the MD5 can be shown to be weak in some regards, and that perhaps > other checksum algorithms such as SHA-1 provide a better guarantee of > uniqueness. SHA-1 isn't 128-bits. It's 160-bits. To my knowledge most apps use MD5 checksum because it is easier to implement and use. > > This is where I start to buy Tom Lane's argument that the 4-byte > prefix is no big deal. I find it more desirable to have a binary data > type with a hexadecimal string input and output function. The > flexibility of being able to use 128-bits or 160-bits is worth this > 4-byte header to me. What I don't want to do is store double size > fields, stored as hexadecimal. Agreed. I have experimented with GUID/UUID type for a while and here are my findings when having the need to store GUID/UUID/MD5 in the database. 1. Almost always these values are created outside the database. So to my opinion having all kinds of functions in the database to create such values becomes second priority except when wanting to use GUIDs auto generated for PK like the uuid datatype in MS SQL 2. Yet I haven't seen any kind of arithmetic operations (+ - * / %) on stored GUIDs hence these operations would be meaningless. I also never did see the need to change specific parts of a GUID To my opinion only some of relational/compare operations like == and != apply to such values. comparing guid >= guid or md5 < md5 is also meaningless. 3. Almost always a GUID/MD5 is not changed when it is generated and stored except for cases like resetting md5sum of a password that is changed. To my opinion GUIDs type need to provide the following in the database. 1. GUID type must accept the correct string format(s), with of without extra '-' 2. GUID type must internally be stored as small as possible. 3. GUID type must be comparable with == , != , LIKE and (NOT) IS NULL 4. GUID type must have the ability to be indexed, grouped, ordered, DISTINCT... but not MAX(), MIN() or SUM().... I think option three would be a good one to experiment with. I am thinking of a 16 length struct to hold the GUID value and corresponding functionality to achieve the above however possible. I also think we should have the GUID/UUID as a datatype and not just functions handling hexstring.
On Thu, Sep 07, 2006 at 01:27:21PM +0200, Gevik Babakhani wrote: > To my opinion only some of relational/compare operations like == and != > apply to such values. comparing guid >= guid or md5 < md5 is also > meaningless. <snip> > 4. GUID type must have the ability to be indexed, grouped, ordered, > DISTINCT... but not MAX(), MIN() or SUM().... Err, for "ordered" you need to define <, >, >=, <=, which means you're going to get MAX, and MIN for free... > I also think we should have the GUID/UUID as a datatype and not just > functions handling hexstring. Sure, but that will be the I/O format, right? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
On Thu, 2006-09-07 at 14:46 +0200, Martijn van Oosterhout wrote: > On Thu, Sep 07, 2006 at 01:27:21PM +0200, Gevik Babakhani wrote: > > To my opinion only some of relational/compare operations like == and != > > apply to such values. comparing guid >= guid or md5 < md5 is also > > meaningless. > > <snip> > > > 4. GUID type must have the ability to be indexed, grouped, ordered, > > DISTINCT... but not MAX(), MIN() or SUM().... > > Err, for "ordered" you need to define <, >, >=, <=, which means you're > going to get MAX, and MIN for free... > Yes indeed, I just want to address that guid > guid or any other operation like that has no meaning. > > I also think we should have the GUID/UUID as a datatype and not just > > functions handling hexstring. > > Sure, but that will be the I/O format, right? Yes, I/O format as part of the datatype and not a separate function. > > Have a nice day, >
As part of our ongoing research into Postgres performance and scalability, we recently downloaded version 8.2 from CVS and we wanted to pass on some observations. When comparing 8.2 against 8.1.4, we see that there is roughly a 20% increase in throughput. We credit most of this improvement to the modifications made to the way in which the BufMappingLock and LockMgrLock locks are now handled. Locking sections (partitions) of the Shared Buffer and Locking hash tables certainly seems to pay off. We had also come to the same conclusion and added similar code into a local copy of 8.1.4. Although, we used SpinLocks rather than LWLocks to lock sections of the hash table and we used an LWLock to lock the hash table for critical operations. Against 8.1.4, we saw that LWLockAcquire was taking the majority of the run time, so we added some monitoring code to track the amount of time spent to acquire and release each lock. The BufMappingLock and LockMgrLock were, of course, on the top of that list. We moved some of our 8.1.4 lock monitoring code over to 8.2 to analyze the lock distribution for the partitions. In doing so, we noted that WALInsertLock had now become a bottleneck, absorbing most of the time freed up by the BufMappingLock and LockMgrLock changes. We took a bold move and made XLogInsert a NOP. The next lock to rise to the top of the list was SInvalLock. However, the time increase in SInvalLock did not seem as much as in WALInsertLock. Although we have not taken steps to do so, we assume that removing SInvalLock may reveal the next lock that might impede scalability. Outside of LWLockAcquire (~10%), the next functions that really seem to take most of the time are HeapTupleSatisfiesSnapshot (~21%) and PinBuffer (~14%). We're currently profiling and monitoring those functions.
Gevik Babakhani wrote: > To my opinion GUIDs type need to provide the following in the database. > > 1. GUID type must accept the correct string format(s), with of without > extra '-' > 2. GUID type must internally be stored as small as possible. > 3. GUID type must be comparable with == , != , LIKE and (NOT) IS NULL > 4. GUID type must have the ability to be indexed, grouped, ordered, > DISTINCT... but not MAX(), MIN() or SUM().... > Where do you see a need for LIKE on a GUID? Regards, Thomas Hallgren
LIKE could come handy if someone wants to abuse the uuid datatype to store MD5 hash values. However I am not going to implement it if there is no need for that (assuming it will pass the acceptance test) On Mon, 2006-09-18 at 10:06 +0200, Thomas Hallgren wrote: > Gevik Babakhani wrote: > > To my opinion GUIDs type need to provide the following in the database. > > > > 1. GUID type must accept the correct string format(s), with of without > > extra '-' > > 2. GUID type must internally be stored as small as possible. > > 3. GUID type must be comparable with == , != , LIKE and (NOT) IS NULL > > 4. GUID type must have the ability to be indexed, grouped, ordered, > > DISTINCT... but not MAX(), MIN() or SUM().... > > > Where do you see a need for LIKE on a GUID? > > Regards, > Thomas Hallgren > >
Gevik Babakhani wrote: > LIKE could come handy if someone wants to abuse the uuid datatype to > store MD5 hash values. However I am not going to implement it if there > is no need for that (assuming it will pass the acceptance test) > > Perhaps providing LIKE just to encourage abuse is not such a good idea? IMHO, a GUID should be comparable for equality and NULL only, not LIKE. I also think that ordering is feasible only when looking at parts of the GUID, i.e. order by the result of a function that extracts a timestamp or a node-address. Magnitude comparison on the GUID as a whole makes no sense to me. Regards, Thomas Hallgren > On Mon, 2006-09-18 at 10:06 +0200, Thomas Hallgren wrote: > >> Gevik Babakhani wrote: >> >>> To my opinion GUIDs type need to provide the following in the database. >>> >>> 1. GUID type must accept the correct string format(s), with of without >>> extra '-' >>> 2. GUID type must internally be stored as small as possible. >>> 3. GUID type must be comparable with == , != , LIKE and (NOT) IS NULL >>> 4. GUID type must have the ability to be indexed, grouped, ordered, >>> DISTINCT... but not MAX(), MIN() or SUM().... >>> >>> >> Where do you see a need for LIKE on a GUID? >> >> Regards, >> Thomas Hallgren >> >> >> > >
> Magnitude comparison on the GUID as a whole makes no > sense to me. I agree. Any kind of comparison except equality has no meaning for the GUID. (And this is discussed before) I rather have the option to sort and group for the sake of consistency and compatibility.
On Mon, Sep 18, 2006 at 11:12:54AM +0200, Gevik Babakhani wrote: > > Magnitude comparison on the GUID as a whole makes no > > sense to me. > I agree. Any kind of comparison except equality has no meaning for the > GUID. (And this is discussed before) I rather have the option to sort > and group for the sake of consistency and compatibility. Thomas: The ability to sort / comparison is required for use with B-Tree index. I prefer a fast comparison over one with more meaning. memcmp() is fine with me and it is how I implement it in my UUID PostgreSQL library. Gevik: Once you have your patch in a ready state, I'll compare it against what I have and see if there is anything missing, or if I did anything better. Possibly not, but it's worth the check. Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bindthem... http://mark.mielke.cc/
mark@mark.mielke.cc wrote: > On Mon, Sep 18, 2006 at 11:12:54AM +0200, Gevik Babakhani wrote: >>> Magnitude comparison on the GUID as a whole makes no >>> sense to me. >> I agree. Any kind of comparison except equality has no meaning for the >> GUID. (And this is discussed before) I rather have the option to sort >> and group for the sake of consistency and compatibility. > > Thomas: The ability to sort / comparison is required for use with > B-Tree index. I prefer a fast comparison over one with more meaning. > memcmp() is fine with me and it is how I implement it in my UUID > PostgreSQL library. > Fair enough. Although the magnitudes as such makes little sense, the ability to order will make it possible to compare results from different queries etc. Very difficult to do with random order. Regards, Thomas Hallgren