Thread: Enums patch v2
Hi all Here is an updated version of the enums patch. It has been brought up to date and applies against current CVS HEAD. The original email is at [1], and describes the implementation. This version contains sgml documentation, and contains the missing bounds checks and error codes noted in the last email. As mentioned before, the only part that I'm not super keen on is the hack required to pass the type oid in to the text-to-enum cast function, since normally those take type mods but not type oids. I wasn't sure how else to get a cast working though, short of allowing another type of cast function which accepts type oids as well. That seemed overkill for just one case, though, and was getting a bit beyond the realms of what I wanted to get done with this patch. Anyway, it's reasonably feature complete and should be appropriately documented now, so feedback gratefully accepted. Cheers Tom [1] http://archives.postgresql.org/pgsql-patches/2006-09/msg00000.php
Attachment
Tom Dunstan wrote: > Here is an updated version of the enums patch. It has been brought up to > date and applies against current CVS HEAD. The original email is at [1], > and describes the implementation. I'm sorry I missed the original discussions, but I have to ask: Why do we want enums in core? The only potential advantage I can see over using a look-up table and FK references is performance. And I'd rather spend time improving the performance of FK checks than add extra machinery to do the same thing in a different way. Ignoring my general dislike of enums, I have a few issues with the patch as it is: 1. What's the point of having comparison operators for enums? For most use cases, there's no natural ordering of enum values. 2. The comparison routine compares oids, right? If the oids wrap around when the enum values are created, the ordering isn't what the user expects. 3. 4 bytes per value is wasteful if you're storing simple status codes etc. Especially if you're doing this for performance, let's do no harm by wasting space. One byte seems enough for the typical use cases. I'd even argue that having a high upper limit on the number of enum values encourages misuse of the feature. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Ignoring my general dislike of enums, I have a few issues with the patch > as it is: > 1. What's the point of having comparison operators for enums? For most > use cases, there's no natural ordering of enum values. If you would like to be able to index enum columns, or even GROUP BY one, you need those; whether the ordering is arbitrary or not is irrelevant. > 2. The comparison routine compares oids, right? If the oids wrap around > when the enum values are created, the ordering isn't what the user expects. This is a fair point --- it'd be better if the ordering were not dependent on chance OID assignments. Not sure what we are willing to pay to have that though. > 3. 4 bytes per value is wasteful if you're storing simple status codes > etc. I've forgotten exactly which design Tom is proposing to implement here, but at least one of the contenders involved storing an OID that would be unique across all enum types. 1 byte is certainly not enough for that and even 2 bytes would be pretty marginal. I'm unconvinced by arguments about 2 bytes being so much better than 4 anyway --- in the majority of real table layouts, the hoped-for savings would disappear into alignment padding. regards, tom lane
On Tue, Dec 19, 2006 at 08:09:47AM +0000, Heikki Linnakangas wrote: > Tom Dunstan wrote: > >Here is an updated version of the enums patch. It has been brought up to > >date and applies against current CVS HEAD. The original email is at [1], > >and describes the implementation. > > I'm sorry I missed the original discussions, but I have to ask: Why do > we want enums in core? The only potential advantage I can see over using > a look-up table and FK references is performance. A natural ordering is another. I'd love to be able to make a type color that has Red Orange Yellow Green Blue Indigo Violet and then be able to do an ORDER BY color; > And I'd rather spend time improving the performance of FK checks > than add extra machinery to do the same thing in a different way. Not the same thing. > Ignoring my general dislike of enums, I have a few issues with the patch > as it is: > > 1. What's the point of having comparison operators for enums? For most > use cases, there's no natural ordering of enum values. A natural ordering is precisely the use case for enums. Otherwise, you just use a FK to a one-column table and have done. Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
Heikki Linnakangas wrote: > I'm sorry I missed the original discussions, but I have to ask: Why > do we want enums in core? The only potential advantage I can see over > using a look-up table and FK references is performance. The difference is that foreign-key-referenced data is part of your data whereas enums would be part of the type system used to model the data. An objection to enums on the ground that foreign keys can accomplish the same thing could be extended to object to any data type with a finite domain. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: > >> 1. What's the point of having comparison operators for enums? For most >> use cases, there's no natural ordering of enum values. >> > > If you would like to be able to index enum columns, or even GROUP BY one, > you need those; whether the ordering is arbitrary or not is irrelevant. > Heikki's assertion is wrong in any case. The enumeration definition defines the ordering, and I can think of plenty of use cases where it does matter. We do not use an arbitrary ordering. An enum type is an *ordered* set of string labels. Without this the feature would be close to worthless. But if a particular application doesn't need them ordered, it need not use the comparison operators. Leaving aside the uses for GROUP BY and indexes, I would ask what the justification would be for leaving off comparison operators? > >> 2. The comparison routine compares oids, right? If the oids wrap around >> when the enum values are created, the ordering isn't what the user expects. >> > > This is a fair point --- it'd be better if the ordering were not > dependent on chance OID assignments. Not sure what we are willing > to pay to have that though. > This is a non-issue. The code sorts the oids before assigning them: /* allocate oids */ oids = (Oid *) palloc(sizeof(Oid) * n); for(i = 0; i < n; i++) { oids[i] = GetNewOid(pg_enum); } /* wraparound is unlikely, but just to be safe...*/ qsort(oids, n, sizeof(Oid), oid_cmp); > >> 3. 4 bytes per value is wasteful if you're storing simple status codes >> etc. >> > > I've forgotten exactly which design Tom is proposing to implement here, > but at least one of the contenders involved storing an OID that would be > unique across all enum types. 1 byte is certainly not enough for that > and even 2 bytes would be pretty marginal. I'm unconvinced by arguments > about 2 bytes being so much better than 4 anyway --- in the majority of > real table layouts, the hoped-for savings would disappear into alignment > padding. > > > Globally unique is the design adopted, after much on-list discussion. That was a way of getting it *down* to 4 bytes. The problem is that the output routines need enough info from just the internal representation of the type value to do their work. The original suggestions was for 8 bytes - type oid + offset in value set. Having them globally unique lets us get down to 4. As for efficiency, I agree with what Tom says about alignment and padding dissolving away any perceived advantage in most cases. If we ever get around to optimising record layout we could revisit it. cheers andrew
Andrew Dunstan wrote: > As for efficiency, I agree with what Tom says about alignment and > padding dissolving away any perceived advantage in most cases. If we > ever get around to optimising record layout we could revisit it. I don't, because there are always those that are knowledgeable enough to know how to reduce space lost to padding. So it would be nice to have 2-byte enums on-disk, and resolve them based on the column's typid. But then, I'm not familiar with the patch at all so I'm not sure if it's possible. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Andrew Dunstan wrote: > > >> As for efficiency, I agree with what Tom says about alignment and >> padding dissolving away any perceived advantage in most cases. If we >> ever get around to optimising record layout we could revisit it. >> > > I don't, because there are always those that are knowledgeable enough to > know how to reduce space lost to padding. So it would be nice to have > 2-byte enums on-disk, and resolve them based on the column's typid. But > then, I'm not familiar with the patch at all so I'm not sure if it's > possible. > > The trouble is that we have one output routine for all enum types. See previous discussions about disallowing extra params to output routines. So if all we have is a 2 byte offset into the list of values for the given type, we do not have enough info to allow the output routine to deduce which particular enum type it is dealing with. With the globally unique oid approach it doesn't even need to care - it just looks up the corresponding value. Note that this was a reduction from the previously suggested (by TGL) 8 bytes. I'm not a big fan of ordering columns to optimise record layout, except in the most extreme cases (massive DW type apps). I think visible column order should be logical, not governed by physical considerations. cheers andrew
Alvaro Herrera <alvherre@commandprompt.com> writes: > I don't, because there are always those that are knowledgeable enough to > know how to reduce space lost to padding. So it would be nice to have > 2-byte enums on-disk, and resolve them based on the column's typid. But > then, I'm not familiar with the patch at all so I'm not sure if it's > possible. Remember that the value has to be decodable by the output routine. So the only way we could do that would be by creating a separate output function for each enum type. (That is, a separate pg_proc entry ... they could all point at the same C function, which would have to check which OID it was called as and work backward to determine the enum type.) While this is doubtless doable, it's slow, it bloats pg_proc, and frankly no argument has been offered that's compelling enough to require it. The alignment issue takes enough air out of the space-saving argument that it doesn't seem sufficient to me. regards, tom lane
"Andrew Dunstan" <andrew@dunslane.net> writes: > I'm not a big fan of ordering columns to optimise record layout, except in the > most extreme cases (massive DW type apps). I think visible column order should > be logical, not governed by physical considerations. Well as long as we're talking "should"s the database should take care of this for you anyways. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > I'm sorry I missed the original discussions, but I have to ask: Why do > we want enums in core? The only potential advantage I can see over using > a look-up table and FK references is performance. Well, there are a few things. Sometimes its tidiness, sometimes integrity... I've seen more than one system with hundreds of these things, and they've either gone down the table-per-enum solution, with LOTS of extra tables whose values never change, or the EAV solution, with one or two globally referenced tables to which everything in your system has as a FK, and an integrity check in a trigger if you're very lucky. Yuck on both accounts. Enums hit a sweet spot in the middle and provide data integrity and performance for non-changing values. > 1. What's the point of having comparison operators for enums? For most > use cases, there's no natural ordering of enum values. Well, there are a number of cases where ordering IS important, and indeed, enums provide a way to do it easily where many of the alternative solutions do not. It's one of the key benefits. > 2. The comparison routine compares oids, right? If the oids wrap around > when the enum values are created, the ordering isn't what the user expects. As has been pointed out by others quicker on the draw than me, I do sort the OIDs at enum creation time, for exactly this reason. > 3. 4 bytes per value is wasteful if you're storing simple status codes > etc. Especially if you're doing this for performance, let's do no harm > by wasting space. One byte seems enough for the typical use cases. I'd > even argue that having a high upper limit on the number of enum values > encourages misuse of the feature. I'd really love to have these fit into a 1 or 2 byte value on disk, but AFAIK there's simply no way to do it currently in postgresql. If we ever move to a solution where on-disk representation is noticeably different from in-memory representation, then it might be doable. If that does happen, we might benefit from other improvements such as being able to order columns in a tuple on disk so as to minimize alignment padding, not having to store a composite type's oid, etc. Until that happens, though, if it ever does, this is probably the tightest on-disk representation we're likely to get, unless we're happy to impose some pretty severe restrictions, like 8 bits per enum, and only 256 enums in total (for a 2 byte total). I was already shot down trying to make similar restrictions when I first brought it up. :) The OID solution seems to offend the least. :) We did discuss this somewhat earlier, and I'm happy to take alternative suggestions, but AFAIK this is about as good as it's going to get right now. Cheers Tom
Alvaro Herrera wrote: > I don't, because there are always those that are knowledgeable enough to > know how to reduce space lost to padding. So it would be nice to have > 2-byte enums on-disk, and resolve them based on the column's typid. But > then, I'm not familiar with the patch at all so I'm not sure if it's > possible. Not with this patch, and AFAIK not possible generally, without writing separate I/O functions for each type. I'd love to be able to do that, but I don't think it's possible currently. The main stumbling block is the output function (and cast-to-text function), because output functions do not get provided the oid of the type that they're dealing with, for security reasons IIRC. It was never clear to me why I/O functions should ever be directly callable by a user (and hence open to security issues), but apparently it was enough to purge any that were designed like that from the system, so I wasn't going to go down that road with the patch. Cheers Tom
Peter Eisentraut wrote: > An objection to enums on the ground that foreign keys can accomplish the > same thing could be extended to object to any data type with a finite > domain. Exactly. The extreme case is the boolean type, which could easily be represented by a two-value enum. Or, if you were feeling masochistic, a FK to a separate table. Which is easier? People regularly do stuff like having domains over finite text values, or having a FK to a separate (static) table, or using some sort of EAV. Enums are type-safe, easily ordered, relatively efficient and don't leave zillions of little static tables all over the place, a combination of attributes that none of the alternative solutions in this space present. Cheers Tom
On Wed, Dec 20, 2006 at 01:39:58AM +0000, Tom Dunstan wrote: > Not with this patch, and AFAIK not possible generally, without writing > separate I/O functions for each type. I'd love to be able to do that, > but I don't think it's possible currently. The main stumbling block is > the output function (and cast-to-text function), because output > functions do not get provided the oid of the type that they're dealing > with, for security reasons IIRC. It was never clear to me why I/O > functions should ever be directly callable by a user (and hence open to > security issues), but apparently it was enough to purge any that were > designed like that from the system, so I wasn't going to go down that > road with the patch. I worked around this in taggedtypes by indeed creating seperate copies of the i/o functions on demand and at execution time looking up the required type from the function signiture. The only solution indeed is to change the calling convention if the I/O functions so that the relevent datatype oid stored in a safe place, that isn't set for normal function calls. BTW, being able to call type i/o functions directly is very useful. For example date_in(text_out(blah)) is a form of cast between types that don't usually have a cast. If you change the calling convention as indicated, that trick will still work, just not for types with the restricted i/o functions. Also, it's not just I/O functions that are the issue, consider the enum-to-integer cast. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Attachment
Martijn van Oosterhout wrote: > Also, it's not just I/O functions that are the issue, consider the > enum-to-integer cast. > er, what cast? :-) IIRC Tom hasn't provided one. If you don't break the enum abstraction there should be no need for one, and given the implementation it's not quite trivial - probably the best way if this is needed would be to precalculate it at type creation time and store the value in an extra column in pg_enum. cheers andrew
Where are we on this? --------------------------------------------------------------------------- Tom Dunstan wrote: > Hi all > > Here is an updated version of the enums patch. It has been brought up to > date and applies against current CVS HEAD. The original email is at [1], > and describes the implementation. > > This version contains sgml documentation, and contains the missing > bounds checks and error codes noted in the last email. > > As mentioned before, the only part that I'm not super keen on is the > hack required to pass the type oid in to the text-to-enum cast function, > since normally those take type mods but not type oids. I wasn't sure how > else to get a cast working though, short of allowing another type of > cast function which accepts type oids as well. That seemed overkill for > just one case, though, and was getting a bit beyond the realms of what I > wanted to get done with this patch. > > Anyway, it's reasonably feature complete and should be appropriately > documented now, so feedback gratefully accepted. > > Cheers > > Tom > > [1] http://archives.postgresql.org/pgsql-patches/2006-09/msg00000.php [ application/x-gzip is not supported, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Thu, 2007-02-01 at 22:50 -0500, Bruce Momjian wrote: > Where are we on this? I can commit to reviewing this. The patch looked fairly solid on a quick glance through, but I won't have the cycles to review it properly for a week or two. -Neil
Neil Conway wrote: > On Thu, 2007-02-01 at 22:50 -0500, Bruce Momjian wrote: >> Where are we on this? > > I can commit to reviewing this. The patch looked fairly solid on a quick > glance through, but I won't have the cycles to review it properly for a > week or two. I've brought this up to date with the operator family stuff now, and the attached patch once more applies cleanly against HEAD. Hopefully that'll make life a little easier when reviewing it. Thanks. I got sick of manually shifting the new OID values every time someone had added something new to the catalogs, so I moved all of my OIDs up to the 9000 range. Attached is a hacky bash script which can move them back to something sane. The idea is to run unused_oids first, see where the main block of unused OIDs starts, and then run shift_oids on the (unzipped) patch file before applying the patch. We discussed something similar a while ago, but no-one ever got around to implementing the script. I've attached the script and left the OIDs in my patch in the upper range rather than just running it myself before submitting the patch as I reckon that this might be a useful convention for authors of patches which add a non-trivial number of OIDs to the catalogs. If there's agreement then anyone can feel free to commit the script to CVS or put it on the wiki or whatever. Cheers Tom #!/bin/sh start=$1 end=$2 new_start=$3 filename=$4 if [ -z "$filename" ] ; then echo Usage: $0 start-oid end-oid new-start-oid filename exit 1 fi if [ ! -f $filename ] ; then echo $0: $filename is not a file exit 1 fi if [ $end -le $start ] ; then echo $0: End of OID range must be greater than or equal to the start exit 1 fi start_len=`echo -n $start | wc -c` end_len=`echo -n $end | wc -c` if [ $start_len -ne 4 -o $end_len -ne 4 ] ; then echo $0: Source OID range must have 4 digits exit 1 fi let new_end=$new_start+$end-$start if [ $start -le $new_start -a $end -ge $new_start -o $new_start -le $start -a $new_end -ge $start ] ; then echo $0: OID ranges may not overlap exit 1 fi i=$start j=$new_start while [ $i -le $end ] ; do #echo $i $j sed -i "s/$i/$j/g" $filename let i=i+1 let j=j+1 done
Attachment
I am putting this in the patches queue so it is not lost. I believe Neil is working applying this. Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- Tom Dunstan wrote: > Neil Conway wrote: > > On Thu, 2007-02-01 at 22:50 -0500, Bruce Momjian wrote: > >> Where are we on this? > > > > I can commit to reviewing this. The patch looked fairly solid on a quick > > glance through, but I won't have the cycles to review it properly for a > > week or two. > > I've brought this up to date with the operator family stuff now, and the > attached patch once more applies cleanly against HEAD. Hopefully that'll > make life a little easier when reviewing it. Thanks. > > I got sick of manually shifting the new OID values every time someone > had added something new to the catalogs, so I moved all of my OIDs up to > the 9000 range. Attached is a hacky bash script which can move them back > to something sane. The idea is to run unused_oids first, see where the > main block of unused OIDs starts, and then run shift_oids on the > (unzipped) patch file before applying the patch. We discussed something > similar a while ago, but no-one ever got around to implementing the script. > > I've attached the script and left the OIDs in my patch in the upper > range rather than just running it myself before submitting the patch as > I reckon that this might be a useful convention for authors of patches > which add a non-trivial number of OIDs to the catalogs. If there's > agreement then anyone can feel free to commit the script to CVS or put > it on the wiki or whatever. > > Cheers > > Tom [ application/x-gzip is not supported, skipping... ] > #!/bin/sh > > start=$1 > end=$2 > new_start=$3 > filename=$4 > > if [ -z "$filename" ] ; then > echo Usage: $0 start-oid end-oid new-start-oid filename > exit 1 > fi > > if [ ! -f $filename ] ; then > echo $0: $filename is not a file > exit 1 > fi > > if [ $end -le $start ] ; then > echo $0: End of OID range must be greater than or equal to the start > exit 1 > fi > > start_len=`echo -n $start | wc -c` > end_len=`echo -n $end | wc -c` > > if [ $start_len -ne 4 -o $end_len -ne 4 ] ; then > echo $0: Source OID range must have 4 digits > exit 1 > fi > > let new_end=$new_start+$end-$start > if [ $start -le $new_start -a $end -ge $new_start -o $new_start -le $start -a $new_end -ge $start ] ; then > echo $0: OID ranges may not overlap > exit 1 > fi > > > i=$start > j=$new_start > while [ $i -le $end ] ; do > #echo $i $j > sed -i "s/$i/$j/g" $filename > let i=i+1 > let j=j+1 > done > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +