Thread: Enums patch v2

Enums patch v2

From
Tom Dunstan
Date:
Hi all

Here is an updated version of the enums patch. It has been brought up to
date and applies against current CVS HEAD. The original email is at [1],
and describes the implementation.

This version contains sgml documentation, and contains the missing
bounds checks and error codes noted in the last email.

As mentioned before, the only part that I'm not super keen on is the
hack required to pass the type oid in to the text-to-enum cast function,
since normally those take type mods but not type oids. I wasn't sure how
else to get a cast working though, short of allowing another type of
cast function which accepts type oids as well. That seemed overkill for
just one case, though, and was getting a bit beyond the realms of what I
wanted to get done with this patch.

Anyway, it's reasonably feature complete and should be appropriately
documented now, so feedback gratefully accepted.

Cheers

Tom

[1] http://archives.postgresql.org/pgsql-patches/2006-09/msg00000.php

Attachment

Re: Enums patch v2

From
Heikki Linnakangas
Date:
Tom Dunstan wrote:
> Here is an updated version of the enums patch. It has been brought up to
> date and applies against current CVS HEAD. The original email is at [1],
> and describes the implementation.

I'm sorry I missed the original discussions, but I have to ask: Why do
we want enums in core? The only potential advantage I can see over using
a look-up table and FK references is performance. And I'd rather spend
time improving the performance of FK checks than add extra machinery to
do the same thing in a different way.

Ignoring my general dislike of enums, I have a few issues with the patch
as it is:

1. What's the point of having comparison operators for enums? For most
use cases, there's no natural ordering of enum values.

2. The comparison routine compares oids, right? If the oids wrap around
when the enum values are created, the ordering isn't what the user expects.

3. 4 bytes per value is wasteful if you're storing simple status codes
etc. Especially if you're doing this for performance, let's do no harm
by wasting space. One byte seems enough for the typical use cases. I'd
even argue that having a high upper limit on the number of enum values
encourages misuse of the feature.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Enums patch v2

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Ignoring my general dislike of enums, I have a few issues with the patch
> as it is:

> 1. What's the point of having comparison operators for enums? For most
> use cases, there's no natural ordering of enum values.

If you would like to be able to index enum columns, or even GROUP BY one,
you need those; whether the ordering is arbitrary or not is irrelevant.

> 2. The comparison routine compares oids, right? If the oids wrap around
> when the enum values are created, the ordering isn't what the user expects.

This is a fair point --- it'd be better if the ordering were not
dependent on chance OID assignments.  Not sure what we are willing
to pay to have that though.

> 3. 4 bytes per value is wasteful if you're storing simple status codes
> etc.

I've forgotten exactly which design Tom is proposing to implement here,
but at least one of the contenders involved storing an OID that would be
unique across all enum types.  1 byte is certainly not enough for that
and even 2 bytes would be pretty marginal.  I'm unconvinced by arguments
about 2 bytes being so much better than 4 anyway --- in the majority of
real table layouts, the hoped-for savings would disappear into alignment
padding.

            regards, tom lane

Re: Enums patch v2

From
David Fetter
Date:
On Tue, Dec 19, 2006 at 08:09:47AM +0000, Heikki Linnakangas wrote:
> Tom Dunstan wrote:
> >Here is an updated version of the enums patch. It has been brought up to
> >date and applies against current CVS HEAD. The original email is at [1],
> >and describes the implementation.
>
> I'm sorry I missed the original discussions, but I have to ask: Why do
> we want enums in core? The only potential advantage I can see over using
> a look-up table and FK references is performance.

A natural ordering is another.  I'd love to be able to make a type
color that has

Red
Orange
Yellow
Green
Blue
Indigo
Violet

and then be able to do an ORDER BY color;

> And I'd rather spend time improving the performance of FK checks
> than add extra machinery to do the same thing in a different way.

Not the same thing.

> Ignoring my general dislike of enums, I have a few issues with the patch
> as it is:
>
> 1. What's the point of having comparison operators for enums? For most
> use cases, there's no natural ordering of enum values.

A natural ordering is precisely the use case for enums.  Otherwise,
you just use a FK to a one-column table and have done.

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!

Re: Enums patch v2

From
Peter Eisentraut
Date:
Heikki Linnakangas wrote:
> I'm sorry I missed the original discussions, but I have to ask: Why
> do we want enums in core? The only potential advantage I can see over
> using a look-up table and FK references is performance.

The difference is that foreign-key-referenced data is part of your data
whereas enums would be part of the type system used to model the data.

An objection to enums on the ground that foreign keys can accomplish the
same thing could be extended to object to any data type with a finite
domain.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Enums patch v2

From
Andrew Dunstan
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>
>> 1. What's the point of having comparison operators for enums? For most
>> use cases, there's no natural ordering of enum values.
>>
>
> If you would like to be able to index enum columns, or even GROUP BY one,
> you need those; whether the ordering is arbitrary or not is irrelevant.
>


Heikki's assertion is wrong in any case. The enumeration definition
defines the ordering, and I can think of plenty of use cases where it
does matter. We do not use an arbitrary ordering. An enum type is an
*ordered* set of string labels. Without this the feature would be close
to worthless. But if a particular application doesn't need them ordered,
it need not use the comparison operators. Leaving aside the uses for
GROUP BY and indexes, I would ask what the justification would be for
leaving off comparison operators?

>
>> 2. The comparison routine compares oids, right? If the oids wrap around
>> when the enum values are created, the ordering isn't what the user expects.
>>
>
> This is a fair point --- it'd be better if the ordering were not
> dependent on chance OID assignments.  Not sure what we are willing
> to pay to have that though.
>

This is a non-issue. The code sorts the oids before assigning them:

    /* allocate oids */
    oids = (Oid *) palloc(sizeof(Oid) * n);
    for(i = 0; i < n; i++)
    {
        oids[i] = GetNewOid(pg_enum);
    }
    /* wraparound is unlikely, but just to be safe...*/
    qsort(oids, n, sizeof(Oid), oid_cmp);


>
>> 3. 4 bytes per value is wasteful if you're storing simple status codes
>> etc.
>>
>
> I've forgotten exactly which design Tom is proposing to implement here,
> but at least one of the contenders involved storing an OID that would be
> unique across all enum types.  1 byte is certainly not enough for that
> and even 2 bytes would be pretty marginal.  I'm unconvinced by arguments
> about 2 bytes being so much better than 4 anyway --- in the majority of
> real table layouts, the hoped-for savings would disappear into alignment
> padding.
>
>
>

Globally unique is the design adopted, after much on-list discussion.
That was a way of getting it *down* to 4 bytes. The problem is that the
output routines need enough info from just the internal representation
of the type value to do their work. The original suggestions was for 8
bytes - type oid + offset in value set. Having them globally unique lets
us get down to 4.

As for efficiency, I agree with what Tom says about alignment and
padding dissolving away any perceived advantage in most cases. If we
ever get around to optimising record layout we could revisit it.

cheers

andrew


Re: [HACKERS] Enums patch v2

From
Alvaro Herrera
Date:
Andrew Dunstan wrote:

> As for efficiency, I agree with what Tom says about alignment and
> padding dissolving away any perceived advantage in most cases. If we
> ever get around to optimising record layout we could revisit it.

I don't, because there are always those that are knowledgeable enough to
know how to reduce space lost to padding.  So it would be nice to have
2-byte enums on-disk, and resolve them based on the column's typid.  But
then, I'm not familiar with the patch at all so I'm not sure if it's
possible.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: [HACKERS] Enums patch v2

From
Andrew Dunstan
Date:
Alvaro Herrera wrote:
> Andrew Dunstan wrote:
>
>
>> As for efficiency, I agree with what Tom says about alignment and
>> padding dissolving away any perceived advantage in most cases. If we
>> ever get around to optimising record layout we could revisit it.
>>
>
> I don't, because there are always those that are knowledgeable enough to
> know how to reduce space lost to padding.  So it would be nice to have
> 2-byte enums on-disk, and resolve them based on the column's typid.  But
> then, I'm not familiar with the patch at all so I'm not sure if it's
> possible.
>
>

The trouble is that we have one output routine for all enum types. See
previous discussions about disallowing extra params to output routines.
So if all we have is a 2 byte offset into the list of values for the
given type, we do not have enough info to allow the output routine to
deduce which particular enum type it is dealing with. With the globally
unique oid approach it doesn't even  need to care - it just looks up the
corresponding value. Note that this was a reduction from the previously
suggested (by TGL) 8 bytes.

I'm not a big fan of ordering columns to optimise record layout, except
in the most extreme cases (massive DW type apps). I think visible column
order should be logical, not governed by physical considerations.

cheers

andrew



Re: [HACKERS] Enums patch v2

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I don't, because there are always those that are knowledgeable enough to
> know how to reduce space lost to padding.  So it would be nice to have
> 2-byte enums on-disk, and resolve them based on the column's typid.  But
> then, I'm not familiar with the patch at all so I'm not sure if it's
> possible.

Remember that the value has to be decodable by the output routine.
So the only way we could do that would be by creating a separate output
function for each enum type.  (That is, a separate pg_proc entry
... they could all point at the same C function, which would have to
check which OID it was called as and work backward to determine the enum
type.)

While this is doubtless doable, it's slow, it bloats pg_proc, and
frankly no argument has been offered that's compelling enough to
require it.  The alignment issue takes enough air out of the
space-saving argument that it doesn't seem sufficient to me.

            regards, tom lane

Re: [HACKERS] Enums patch v2

From
Gregory Stark
Date:
"Andrew Dunstan" <andrew@dunslane.net> writes:

> I'm not a big fan of ordering columns to optimise record layout, except in the
> most extreme cases (massive DW type apps). I think visible column order should
> be logical, not governed by physical considerations.

Well as long as we're talking "should"s the database should take care of this
for you anyways.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: Enums patch v2

From
Tom Dunstan
Date:
Heikki Linnakangas wrote:
> I'm sorry I missed the original discussions, but I have to ask: Why do
> we want enums in core? The only potential advantage I can see over using
> a look-up table and FK references is performance.

Well, there are a few things. Sometimes its tidiness, sometimes
integrity... I've seen more than one system with hundreds of these
things, and they've either gone down the table-per-enum solution, with
LOTS of extra tables whose values never change, or the EAV solution,
with one or two globally referenced tables to which everything in your
system has as a FK, and an integrity check in a trigger if you're very
lucky. Yuck on both accounts. Enums hit a sweet spot in the middle and
provide data integrity and performance for non-changing values.

> 1. What's the point of having comparison operators for enums? For most
> use cases, there's no natural ordering of enum values.

Well, there are a number of cases where ordering IS important, and
indeed, enums provide a way to do it easily where many of the
alternative solutions do not. It's one of the key benefits.

> 2. The comparison routine compares oids, right? If the oids wrap around
> when the enum values are created, the ordering isn't what the user expects.

As has been pointed out by others quicker on the draw than me, I do sort
the OIDs at enum creation time, for exactly this reason.

> 3. 4 bytes per value is wasteful if you're storing simple status codes
> etc. Especially if you're doing this for performance, let's do no harm
> by wasting space. One byte seems enough for the typical use cases. I'd
> even argue that having a high upper limit on the number of enum values
> encourages misuse of the feature.

I'd really love to have these fit into a 1 or 2 byte value on disk, but
AFAIK there's simply no way to do it currently in postgresql. If we ever
move to a solution where on-disk representation is noticeably different
from in-memory representation, then it might be doable. If that does
happen, we might benefit from other improvements such as being able to
order columns in a tuple on disk so as to minimize alignment padding,
not having to store a composite type's oid, etc. Until that happens,
though, if it ever does, this is probably the tightest on-disk
representation we're likely to get, unless we're happy to impose some
pretty severe restrictions, like 8 bits per enum, and only 256 enums in
total (for a 2 byte total). I was already shot down trying to make
similar restrictions when I first brought it up. :) The OID solution
seems to offend the least. :)

We did discuss this somewhat earlier, and I'm happy to take alternative
suggestions, but AFAIK this is about as good as it's going to get right now.

Cheers

Tom

Re: [HACKERS] Enums patch v2

From
Tom Dunstan
Date:
Alvaro Herrera wrote:
> I don't, because there are always those that are knowledgeable enough to
> know how to reduce space lost to padding.  So it would be nice to have
> 2-byte enums on-disk, and resolve them based on the column's typid.  But
> then, I'm not familiar with the patch at all so I'm not sure if it's
> possible.

Not with this patch, and AFAIK not possible generally, without writing
separate I/O functions for each type. I'd love to be able to do that,
but I don't think it's possible currently. The main stumbling block is
the output function (and cast-to-text function), because output
functions do not get provided the oid of the type that they're dealing
with, for security reasons IIRC. It was never clear to me why I/O
functions should ever be directly callable by a user (and hence open to
security issues), but apparently it was enough to purge any that were
designed like that from the system, so I wasn't going to go down that
road with the patch.

Cheers

Tom



Re: Enums patch v2

From
Tom Dunstan
Date:
Peter Eisentraut wrote:
> An objection to enums on the ground that foreign keys can accomplish the
> same thing could be extended to object to any data type with a finite
> domain.

Exactly. The extreme case is the boolean type, which could easily be
represented by a two-value enum. Or, if you were feeling masochistic, a
FK to a separate table. Which is easier?

People regularly do stuff like having domains over finite text values,
or having a FK to a separate (static) table, or using some sort of EAV.
Enums are type-safe, easily ordered, relatively efficient and don't
leave zillions of little static tables all over the place, a combination
of attributes that none of the alternative solutions in this space present.

Cheers

Tom


Re: [HACKERS] Enums patch v2

From
Martijn van Oosterhout
Date:
On Wed, Dec 20, 2006 at 01:39:58AM +0000, Tom Dunstan wrote:
> Not with this patch, and AFAIK not possible generally, without writing
> separate I/O functions for each type. I'd love to be able to do that,
> but I don't think it's possible currently. The main stumbling block is
> the output function (and cast-to-text function), because output
> functions do not get provided the oid of the type that they're dealing
> with, for security reasons IIRC. It was never clear to me why I/O
> functions should ever be directly callable by a user (and hence open to
> security issues), but apparently it was enough to purge any that were
> designed like that from the system, so I wasn't going to go down that
> road with the patch.

I worked around this in taggedtypes by indeed creating seperate copies
of the i/o functions on demand and at execution time looking up the
required type from the function signiture.

The only solution indeed is to change the calling convention if the I/O
functions so that the relevent datatype oid stored in a safe place,
that isn't set for normal function calls.

BTW, being able to call type i/o functions directly is very useful. For
example date_in(text_out(blah)) is a form of cast between types that
don't usually have a cast. If you change the calling convention as
indicated, that trick will still work, just not for types with the
restricted i/o functions.

Also, it's not just I/O functions that are the issue, consider the
enum-to-integer cast.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

Re: [HACKERS] Enums patch v2

From
Andrew Dunstan
Date:
Martijn van Oosterhout wrote:
> Also, it's not just I/O functions that are the issue, consider the
> enum-to-integer cast.
>


er, what cast? :-)

IIRC Tom hasn't provided one. If you don't break the enum abstraction
there should be no need for one, and given the implementation it's not
quite trivial - probably the best way if this is needed would be to
precalculate it at type creation time and store the value in an extra
column in pg_enum.

cheers

andrew

Re: Enums patch v2

From
Bruce Momjian
Date:
Where are we on this?

---------------------------------------------------------------------------

Tom Dunstan wrote:
> Hi all
>
> Here is an updated version of the enums patch. It has been brought up to
> date and applies against current CVS HEAD. The original email is at [1],
> and describes the implementation.
>
> This version contains sgml documentation, and contains the missing
> bounds checks and error codes noted in the last email.
>
> As mentioned before, the only part that I'm not super keen on is the
> hack required to pass the type oid in to the text-to-enum cast function,
> since normally those take type mods but not type oids. I wasn't sure how
> else to get a cast working though, short of allowing another type of
> cast function which accepts type oids as well. That seemed overkill for
> just one case, though, and was getting a bit beyond the realms of what I
> wanted to get done with this patch.
>
> Anyway, it's reasonably feature complete and should be appropriately
> documented now, so feedback gratefully accepted.
>
> Cheers
>
> Tom
>
> [1] http://archives.postgresql.org/pgsql-patches/2006-09/msg00000.php

[ application/x-gzip is not supported, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Enums patch v2

From
Neil Conway
Date:
On Thu, 2007-02-01 at 22:50 -0500, Bruce Momjian wrote:
> Where are we on this?

I can commit to reviewing this. The patch looked fairly solid on a quick
glance through, but I won't have the cycles to review it properly for a
week or two.

-Neil



Re: Enums patch v2

From
Tom Dunstan
Date:
Neil Conway wrote:
> On Thu, 2007-02-01 at 22:50 -0500, Bruce Momjian wrote:
>> Where are we on this?
>
> I can commit to reviewing this. The patch looked fairly solid on a quick
> glance through, but I won't have the cycles to review it properly for a
> week or two.

I've brought this up to date with the operator family stuff now, and the
attached patch once more applies cleanly against HEAD. Hopefully that'll
make life a little easier when reviewing it. Thanks.

I got sick of manually shifting the new OID values every time someone
had added something new to the catalogs, so I moved all of my OIDs up to
the 9000 range. Attached is a hacky bash script which can move them back
to something sane. The idea is to run unused_oids first, see where the
main block of unused OIDs starts, and then run shift_oids on the
(unzipped) patch file before applying the patch. We discussed something
similar a while ago, but no-one ever got around to implementing the script.

I've attached the script and left the OIDs in my patch in the upper
range rather than just running it myself before submitting the patch as
I reckon that this might be a useful convention for authors of patches
which add a non-trivial number of OIDs to the catalogs. If there's
agreement then anyone can feel free to commit the script to CVS or put
it on the wiki or whatever.

Cheers

Tom
#!/bin/sh

start=$1
end=$2
new_start=$3
filename=$4

if [ -z "$filename" ] ; then
    echo Usage: $0 start-oid end-oid new-start-oid filename
    exit 1
fi

if [ ! -f $filename ] ; then
    echo $0: $filename is not a file
    exit 1
fi

if [ $end -le $start ] ; then
    echo $0: End of OID range must be greater than or equal to the start
    exit 1
fi

start_len=`echo -n $start | wc -c`
end_len=`echo -n $end | wc -c`

if [ $start_len -ne 4 -o $end_len -ne 4 ] ; then
    echo $0: Source OID range must have 4 digits
    exit 1
fi

let new_end=$new_start+$end-$start
if [ $start -le $new_start -a $end -ge $new_start -o $new_start -le $start -a $new_end -ge $start ] ; then
    echo $0: OID ranges may not overlap
    exit 1
fi


i=$start
j=$new_start
while [ $i -le $end ] ; do
    #echo $i $j
    sed -i "s/$i/$j/g" $filename
    let i=i+1
    let j=j+1
done

Attachment

Re: Enums patch v2

From
Bruce Momjian
Date:
I am putting this in the patches queue so it is not lost.  I believe
Neil is working applying this.

Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Tom Dunstan wrote:
> Neil Conway wrote:
> > On Thu, 2007-02-01 at 22:50 -0500, Bruce Momjian wrote:
> >> Where are we on this?
> >
> > I can commit to reviewing this. The patch looked fairly solid on a quick
> > glance through, but I won't have the cycles to review it properly for a
> > week or two.
>
> I've brought this up to date with the operator family stuff now, and the
> attached patch once more applies cleanly against HEAD. Hopefully that'll
> make life a little easier when reviewing it. Thanks.
>
> I got sick of manually shifting the new OID values every time someone
> had added something new to the catalogs, so I moved all of my OIDs up to
> the 9000 range. Attached is a hacky bash script which can move them back
> to something sane. The idea is to run unused_oids first, see where the
> main block of unused OIDs starts, and then run shift_oids on the
> (unzipped) patch file before applying the patch. We discussed something
> similar a while ago, but no-one ever got around to implementing the script.
>
> I've attached the script and left the OIDs in my patch in the upper
> range rather than just running it myself before submitting the patch as
> I reckon that this might be a useful convention for authors of patches
> which add a non-trivial number of OIDs to the catalogs. If there's
> agreement then anyone can feel free to commit the script to CVS or put
> it on the wiki or whatever.
>
> Cheers
>
> Tom

[ application/x-gzip is not supported, skipping... ]

> #!/bin/sh
>
> start=$1
> end=$2
> new_start=$3
> filename=$4
>
> if [ -z "$filename" ] ; then
>     echo Usage: $0 start-oid end-oid new-start-oid filename
>     exit 1
> fi
>
> if [ ! -f $filename ] ; then
>     echo $0: $filename is not a file
>     exit 1
> fi
>
> if [ $end -le $start ] ; then
>     echo $0: End of OID range must be greater than or equal to the start
>     exit 1
> fi
>
> start_len=`echo -n $start | wc -c`
> end_len=`echo -n $end | wc -c`
>
> if [ $start_len -ne 4 -o $end_len -ne 4 ] ; then
>     echo $0: Source OID range must have 4 digits
>     exit 1
> fi
>
> let new_end=$new_start+$end-$start
> if [ $start -le $new_start -a $end -ge $new_start -o $new_start -le $start -a $new_end -ge $start ] ; then
>     echo $0: OID ranges may not overlap
>     exit 1
> fi
>
>
> i=$start
> j=$new_start
> while [ $i -le $end ] ; do
>     #echo $i $j
>     sed -i "s/$i/$j/g" $filename
>     let i=i+1
>     let j=j+1
> done

>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +