Thread: extensible enum types
A recent discussion with a developer convinced me that enums would be much more useful if new values could be added to the label list easily without changing the stored values. Given the current representation of enums as a globally unique oid, I think this is just about impossible if the new label is to go somewhere other than on the end of the list (assuming that's where the next oid would get allocated). So I have been thinking about a new type family which for the sake of argument I will call varenum (please hold the bikeshedding till later). Staying with 4 bytes as the stored value, the idea is that we partition it into two 16 bit pieces. The high-order piece is the varenum type identifier. The low-order piece would uniquely identify the label within the set for that varenum. Initial values would be allocated like this: calculate the space between values p as 2**16 / (16 + number_of_labels). Then set the first value at 8 * p, then next at 9* p and so on. This is designed to allow more space to add labels at the beginning and end of the list, where this is more likely. Adding a label would be a matter of finding the labels adjacent to the position where we want to add the new label, and placing it half way between them, possibly with special rules for the list ends. If we want to add the label between two labels having values n and n+1 the addition would fail. All this would mean a) we can't have more than 65536 varenum types in a system, and no varenum type could have more than 65536 values (and possibly less, depending on how they are added). In practice I think these are quite reasonable restrictions, and 99.99% of users would never come close to bumping up against them. Given all this, we could then allow things like: ALTER TYPE varenumtype ADD 'newlabel' [ BEFORE | AFTER 'existinglabel' ] I haven't looked at how we'd set this up in the catalog, but I assume that by analogy with pg_enum it should be fairly straightforward. We could actually allow something like the above for existing enum types without the optional clause, which would just add the label wherever the next oid happened to fall, which would commonly be at the end of the list. That would handle the common case where the application doesn't care about the label ordering. That should be a much simpler change and is probably worth doing first. There was some discussion of this here: <http://archives.postgresql.org/message-id/20080425182718.GD5888@alvh.no-ip.org> But I'm fairly reluctant to do things which will require table rewrites. I'm also less excited about the idea of removing values - that is something I don't ever recall being asked about, but I have often been asked about adding labels. For people prepared to rewrite tables, there is currently a workaround: create a new enum type, alter the table to use the new enum type, drop the old enum type, rename the new enum type to the old enum type. Thoughts? cheers andrew
On Fri, Jun 18, 2010 at 11:50 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > Then set the > first value at 8 * p, then next at 9* p and so on. This is designed to > allow more space to add labels at the beginning and end of the list, where > this is more likely. Adding a label would be a matter of finding the labels > adjacent to the position where we want to add the new label, and placing it > half way between them, possibly with special rules for the list ends. If we > want to add the label between two labels having values n and n+1 the > addition would fail. I like the idea of being able to modify enums on the fly, but I'm skeptical of an implementation that won't always work. Maybe it's still better than what we have now, but it seems grotty. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas wrote: > On Fri, Jun 18, 2010 at 11:50 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > >> Then set the >> first value at 8 * p, then next at 9* p and so on. This is designed to >> allow more space to add labels at the beginning and end of the list, where >> this is more likely. Adding a label would be a matter of finding the labels >> adjacent to the position where we want to add the new label, and placing it >> half way between them, possibly with special rules for the list ends. If we >> want to add the label between two labels having values n and n+1 the >> addition would fail. >> > > I like the idea of being able to modify enums on the fly, but I'm > skeptical of an implementation that won't always work. Maybe it's > still better than what we have now, but it seems grotty. > > I'd be perfectly happy to hear a reasonable alternative. Assuming we use some integer representation, given two labels represented by n and n+1, we can't add a label between them without rewriting the tables that use the type, whether it's my representation scheme or some other. Maybe we could have a FORCE option which would rewrite if necessary. cheers andrew
On Jun 18, 2010, at 9:07 AM, Robert Haas wrote: >> Then set the >> first value at 8 * p, then next at 9* p and so on. This is designed to >> allow more space to add labels at the beginning and end of the list, where >> this is more likely. Adding a label would be a matter of finding the labels >> adjacent to the position where we want to add the new label, and placing it >> half way between them, possibly with special rules for the list ends. If we >> want to add the label between two labels having values n and n+1 the >> addition would fail. > > I like the idea of being able to modify enums on the fly, but I'm > skeptical of an implementation that won't always work. Maybe it's > still better than what we have now, but it seems grotty. Yes, other than that I fully endorse the idea. What's the likelihood of a failure? And would the position of the new label(represented by its internal number) be predictive? IOW, would updating the same varenumtype in two databases (or ontwo servers) yield the same underlying positional value? Best, David
On Jun 18, 2010, at 9:34 AM, Andrew Dunstan wrote: > I'd be perfectly happy to hear a reasonable alternative. Assuming we use some integer representation, given two labelsrepresented by n and n+1, we can't add a label between them without rewriting the tables that use the type, whetherit's my representation scheme or some other. Maybe we could have a FORCE option which would rewrite if necessary. People would likely always use it. Why not use a decimal number? Best, David
On Fri, 2010-06-18 at 12:34 -0400, Andrew Dunstan wrote: > > Robert Haas wrote: > > On Fri, Jun 18, 2010 at 11:50 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > > > >> Then set the > >> first value at 8 * p, then next at 9* p and so on. This is designed to > >> allow more space to add labels at the beginning and end of the list, where > >> this is more likely. Adding a label would be a matter of finding the labels > >> adjacent to the position where we want to add the new label, and placing it > >> half way between them, possibly with special rules for the list ends. If we > >> want to add the label between two labels having values n and n+1 the > >> addition would fail. > >> > > > > I like the idea of being able to modify enums on the fly, but I'm > > skeptical of an implementation that won't always work. Maybe it's > > still better than what we have now, but it seems grotty. > > > > > > I'd be perfectly happy to hear a reasonable alternative. Assuming we use > some integer representation, given two labels represented by n and n+1, > we can't add a label between them without rewriting the tables that use > the type, whether it's my representation scheme or some other. Maybe we > could have a FORCE option which would rewrite if necessary. I apologize as this thread has already moved past the initial proposal. I had mail problems so I didn't see any part of the thread in my client until now. My standard response to enums is, don't use them. Specifically because you can't modify them. When I talk to developers and they see we have enums they get excited and then I tell them you can't modify them and they get very downtrodden. I tell them just to use a look up table. Anyway, Andrew if you can find a reasonable way to make them modifiable, I am all for it :) On that note and yes this would be weird but have we considered some kind of wrapper around just a lookup table? I mean what if there was a system table that listed : nspname, relation, column, value the "type" enum would know to look in that table for valid values Then we just add various syntactical sugar to manage the values in the table via that specific enum: ALTER TABLE foo ALTER COLUMN bar VALUES (bar,baz,bing,foo) ALTER TABLE would know its an enum and would perform proper operations on the system table. Sincerely, Joshua D. Drake > > cheers > > andrew > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering
Andrew Dunstan <andrew@dunslane.net> writes: > Robert Haas wrote: >> I like the idea of being able to modify enums on the fly, but I'm >> skeptical of an implementation that won't always work. Maybe it's >> still better than what we have now, but it seems grotty. > I'd be perfectly happy to hear a reasonable alternative. Insert a sort order column into pg_enum, and rearrange the values in that whenever the user wants to add a new value in a particular place. You give up cheap comparisons in exchange for flexibility. I think lots of people would accept that tradeoff, especially if they could make it per-datatype. One point here is that you'd have to restrict the rearrangements so that the relative sort order of existing values never changes, else you break (for example) indexes on columns of that type. regards, tom lane
David E. Wheeler wrote: > What's the likelihood of a failure? Constructing a failure case would be simple. In practice, I suspect it would be very low. > And would the position of the new label (represented by its internal number) be predictive? IOW, would updating the samevarenumtype in two databases (or on two servers) yield the same underlying positional value? > > > The algorithm I outlined is deterministic, so the same sequence of operations on the type would yield the same set of values on the low-order 16 bits. But that doesn't mean they would have the same high order 16 bits. That would depend on the history of the system. But more importantly, why do you care? the stored value is an implementation artefact that should be totally invisible to users. There would be no guarantee of the same underlying values on dump/reload either, just as there is not now for enums. cheers andrew
David E. Wheeler wrote: > On Jun 18, 2010, at 9:34 AM, Andrew Dunstan wrote: > > >> I'd be perfectly happy to hear a reasonable alternative. Assuming we use some integer representation, given two labelsrepresented by n and n+1, we can't add a label between them without rewriting the tables that use the type, whetherit's my representation scheme or some other. Maybe we could have a FORCE option which would rewrite if necessary. >> > > People would likely always use it. > > Why not use a decimal number? > > > You are just bumping up the storage cost. Part of the attraction of enums is their efficiency. cheers andrew
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Robert Haas wrote: >> >>> I like the idea of being able to modify enums on the fly, but I'm >>> skeptical of an implementation that won't always work. Maybe it's >>> still better than what we have now, but it seems grotty. >>> > > >> I'd be perfectly happy to hear a reasonable alternative. >> > > Insert a sort order column into pg_enum, and rearrange the values in > that whenever the user wants to add a new value in a particular place. > You give up cheap comparisons in exchange for flexibility. I think lots > of people would accept that tradeoff, especially if they could make it > per-datatype. > > One point here is that you'd have to restrict the rearrangements so that > the relative sort order of existing values never changes, else you break > (for example) indexes on columns of that type. > > > Hmm. Yes, that could work. The assumption in my proposal was that existing values would not be reordered anyway. But I'm not happy about giving up cheap comparison. And how would it be per data-type? That part isn't clear to me. Would we mark a given enum type as having its oids in order? It would also be sensible to quantify how much more expensive comparisons would become. If the sort order data were kept in the syscache the extra cost might get very small. What I actually like most about this suggestion is that we would be able to apply it cleanly to existing enum types without inventing anything much new. cheers cheers
On Fri, Jun 18, 2010 at 12:59 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > You are just bumping up the storage cost. Part of the attraction of enums is > their efficiency. What's efficient about them? Aren't we using 4 bytes to store a value that will nearly always fit in 2, if not 1? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas wrote: > On Fri, Jun 18, 2010 at 12:59 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > >> You are just bumping up the storage cost. Part of the attraction of enums is >> their efficiency. >> > > What's efficient about them? Aren't we using 4 bytes to store a value > that will nearly always fit in 2, if not 1? > > This was debated when we implemented enums. As between 1,2 and 4 there is often not much to choose, as alignment padding makes it pretty much the same. But any of them are more efficient than storing a numeric value or the label itself. Anyway, it might well be moot. cheers andrew
On Fri, Jun 18, 2010 at 1:59 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > This was debated when we implemented enums. As between 1,2 and 4 there is > often not much to choose, as alignment padding makes it pretty much the > same. But any of them are more efficient than storing a numeric value or the > label itself. I was assuming the alternative was an integer, rather than a numeric... but yeah, a numeric or the label itself would definitely be larger. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, Jun 18, 2010 at 6:17 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > > > Tom Lane wrote: >> >> Insert a sort order column into pg_enum, and rearrange the values in >> that whenever the user wants to add a new value in a particular place. +1 I was going to say exactly the same thing. >> You give up cheap comparisons in exchange for flexibility. I think lots >> of people would accept that tradeoff, especially if they could make it >> per-datatype. > Hmm. Yes, that could work. The assumption in my proposal was that existing > values would not be reordered anyway. > > But I'm not happy about giving up cheap comparison. And how would it be per > data-type? That part isn't clear to me. Would we mark a given enum type as > having its oids in order? It would also be sensible to quantify how much > more expensive comparisons would become. If the sort order data were kept in > the syscache the extra cost might get very small. I think you would need a syscache or something like it. My first instinct was to load the whole enum value->sort order mapping into a hash table the first time you're asked to compare two values in a given type. Then your comparison operator amounts to "look up/initialize hash table for this enum type, look up both sort orders in hash table, return comparison". You might need something like a syscache for the hash tables so that you don't keep the hash tables around forever. Using a syscache for the individual sort values would be slower to initially load if you're sorting a list since you would be doing a lot of retail lookups of individual values. But then perhaps it's a cheap way to protect against very large enums which using a hash table per enum type would be fragile against. -- greg
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> Insert a sort order column into pg_enum, and rearrange the values in >> that whenever the user wants to add a new value in a particular place. >> You give up cheap comparisons in exchange for flexibility. I think lots >> of people would accept that tradeoff, especially if they could make it >> per-datatype. > But I'm not happy about giving up cheap comparison. I don't think it would be all that bad. We could teach typcache.c to cache the ordering data for any type that's in active use. It'd certainly be a lot more expensive than OID comparison, but perhaps not worse than NUMERIC comparisons. > And how would it be per data-type? Well, there'd be two kinds of enums, just as you were saying before. I'm not sure how we'd expose that to users exactly, or whether there could be provisions for switching a type's behavior after creation. regards, tom lane
On Fri, Jun 18, 2010 at 1:59 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > > > Robert Haas wrote: >> >> On Fri, Jun 18, 2010 at 12:59 PM, Andrew Dunstan <andrew@dunslane.net> >> wrote: >> >>> >>> You are just bumping up the storage cost. Part of the attraction of enums >>> is >>> their efficiency. >>> >> >> What's efficient about them? Aren't we using 4 bytes to store a value >> that will nearly always fit in 2, if not 1? >> >> > > This was debated when we implemented enums. As between 1,2 and 4 there is > often not much to choose, as alignment padding makes it pretty much the > same. But any of them are more efficient than storing a numeric value or the > label itself. > > Anyway, it might well be moot. > > cheers > > andrew Something I don't understand in all this is: why can't the type of an enum be determined statically rather than stored in every single value? For instance, if we have: CREATE TYPE number AS ENUM ('one', 'two', 'three'); CREATE TYPE color AS ENUM ('red', 'green', 'blue'); PostgreSQL won't allow a comparison between two different enum types, e.g.: > SELECT 'one'::number = 'red'::color; ERROR: operator does not exist: number = color However, when we say: SELECT 'one'::number = 'one'::number Couldn't enum_eq just use get_fn_expr_argtype to determine the type of enum input rather than rely on it being stored in the value (either implicitly via OID or explicitly as a word half)? Also, I can't seem to find the original debates from when enums were implemented. Does anyone have a link to that thread in the archives? Thanks. Joey Adams
Joseph Adams wrote: > Also, I can't seem to find the original debates from when enums were > implemented. Does anyone have a link to that thread in the archives? > Thanks. > Start here <http://archives.postgresql.org/pgsql-hackers/2006-08/msg00979.php> cheers andrew
Excerpts from Joseph Adams's message of vie jun 18 18:17:50 -0400 2010: > Couldn't enum_eq just use get_fn_expr_argtype to determine the type of > enum input rather than rely on it being stored in the value (either > implicitly via OID or explicitly as a word half)? > > Also, I can't seem to find the original debates from when enums were > implemented. Does anyone have a link to that thread in the archives? Probably around here http://archives.postgresql.org/pgsql-patches/2006-12/msg00099.php -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Joseph Adams <joeyadams3.14159@gmail.com> writes: > Couldn't enum_eq just use get_fn_expr_argtype to determine the type of > enum input rather than rely on it being stored in the value No. Support functions have to work in many contexts where there is no "side channel" such as get_fn_expr_argtype. What's more, it's very difficult to provide a side channel without creating security holes. We used to think it was OK for output functions to rely on a type OID passed separately from the actual value, but that's insecure: http://archives.postgresql.org/pgsql-hackers/2005-04/msg00998.php regards, tom lane
On Fri, 2010-06-18 at 12:34 -0400, Andrew Dunstan wrote: > > Robert Haas wrote: > > On Fri, Jun 18, 2010 at 11:50 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > > > >> Then set the > >> first value at 8 * p, then next at 9* p and so on. This is designed to > >> allow more space to add labels at the beginning and end of the list, where > >> this is more likely. Adding a label would be a matter of finding the labels > >> adjacent to the position where we want to add the new label, and placing it > >> half way between them, possibly with special rules for the list ends. If we > >> want to add the label between two labels having values n and n+1 the > >> addition would fail. > >> > > > > I like the idea of being able to modify enums on the fly, but I'm > > skeptical of an implementation that won't always work. Maybe it's > > still better than what we have now, but it seems grotty. > > > > > > I'd be perfectly happy to hear a reasonable alternative. Assuming we use > some integer representation, given two labels represented by n and n+1, > we can't add a label between them without rewriting the tables that use > the type, whether it's my representation scheme or some other. Maybe we > could have a FORCE option which would rewrite if necessary. I apologize as this thread has already moved past the initial proposal. I had mail problems so I didn't see any part of the thread in my client until now. My standard response to enums is, don't use them. Specifically because you can't modify them. When I talk to developers and they see we have enums they get excited and then I tell them you can't modify them and they get very downtrodden. I tell them just to use a look up table. Anyway, Andrew if you can find a reasonable way to make them modifiable, I am all for it :) On that note and yes this would be weird but have we considered some kind of wrapper around just a lookup table? I mean what if there was a system table that listed : nspname, relation, column, value the "type" enum would know to look in that table for valid values Then we just add various syntactical sugar to manage the values in the table via that specific enum: ALTER TABLE foo ALTER COLUMN bar VALUES (bar,baz,bing,foo) ALTER TABLE would know its an enum and would perform proper operations on the system table. Sincerely, Joshua D. Drake > > cheers > > andrew > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering
Tom Lane wrote: >> And how would it be per data-type? >> > > Well, there'd be two kinds of enums, just as you were saying before. > I'm not sure how we'd expose that to users exactly, or whether there > could be provisions for switching a type's behavior after creation. > > > I'd be a whole lot happier if we didn't have to do that. I've been trying to wrack my brain for some clever way to avoid it (so far without success). cheers andrew
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>> Insert a sort order column into pg_enum, and rearrange the values in >>> that whenever the user wants to add a new value in a particular place. >>> You give up cheap comparisons in exchange for flexibility. I think lots >>> of people would accept that tradeoff, especially if they could make it >>> per-datatype. >>> > > >> But I'm not happy about giving up cheap comparison. >> > > I don't think it would be all that bad. We could teach typcache.c to > cache the ordering data for any type that's in active use. It'd > certainly be a lot more expensive than OID comparison, but perhaps not > worse than NUMERIC comparisons. > > > Another thought: could we add a column to pg_type with a flag that's true if the oids are in sort order? Then the comparison routines could just look that up in the type cache and if it's true (as it often will be) just return the oid comparison. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Another thought: could we add a column to pg_type with a flag that's > true if the oids are in sort order? Then the comparison routines could > just look that up in the type cache and if it's true (as it often will > be) just return the oid comparison. Well, having to do a cache lookup already makes it a couple orders of magnitude more expensive than an OID comparison. However, it's hard to say how much that matters in terms of total application performance. We really could do with a bit of performance testing here ... regards, tom lane
On Fri, Jun 18, 2010 at 12:59 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
Probably it'd be the same as the decimal suggestion above, but we can use floating-point data type.
It will allow injection of a new label at any stage.
CREATE leads to
Label1 -> 1.0
Label2 -> 2.0
Label3 -> 3.0
ALTER ... ADD Label4 AFTER Label2; leads to
Label1 -> 1.0
Label2 -> 2.0
Label4 -> 2.5
Label3 -> 3.0
ALTER ... ADD Label5 AFTER Label2; leads to
Label1 -> 1.0
Label2 -> 2.0
Label5 -> 2.25
Label4 -> 2.5
Label3 -> 3.0
Since floating-point implementation is architecture dependent, the ALTER command should check that the injected value does not equate to any value around it (eg. comparisons of (2.5 == 2) and (2.25 == 2.5) should not yield 0); and if it does, then throw an error and ask the user to use the rewrite-the-table version of the command.
And since it is still 32 bit, and comparisons done by machine, performance should be acceptably close to current integer comparisons, and much faster that the cache lookups etc. being proposed.
This is very similar to Andrew's original suggestion of splitting 32 bits into 16+16, but managed by the machine hence no complicated comparison algos needed on our part. Also, since this is all transparent to the SQL interface, our dump-reload cycle or Slony replication, etc. should not be affected either.
Regards,
-- You are just bumping up the storage cost. Part of the attraction of enums is their efficiency.
David E. Wheeler wrote:On Jun 18, 2010, at 9:34 AM, Andrew Dunstan wrote:
I'd be perfectly happy to hear a reasonable alternative. Assuming we use some integer representation, given two labels represented by n and n+1, we can't add a label between them without rewriting the tables that use the type, whether it's my representation scheme or some other. Maybe we could have a FORCE option which would rewrite if necessary.
People would likely always use it.
Why not use a decimal number?
Probably it'd be the same as the decimal suggestion above, but we can use floating-point data type.
It will allow injection of a new label at any stage.
CREATE leads to
Label1 -> 1.0
Label2 -> 2.0
Label3 -> 3.0
ALTER ... ADD Label4 AFTER Label2; leads to
Label1 -> 1.0
Label2 -> 2.0
Label4 -> 2.5
Label3 -> 3.0
ALTER ... ADD Label5 AFTER Label2; leads to
Label1 -> 1.0
Label2 -> 2.0
Label5 -> 2.25
Label4 -> 2.5
Label3 -> 3.0
Since floating-point implementation is architecture dependent, the ALTER command should check that the injected value does not equate to any value around it (eg. comparisons of (2.5 == 2) and (2.25 == 2.5) should not yield 0); and if it does, then throw an error and ask the user to use the rewrite-the-table version of the command.
And since it is still 32 bit, and comparisons done by machine, performance should be acceptably close to current integer comparisons, and much faster that the cache lookups etc. being proposed.
This is very similar to Andrew's original suggestion of splitting 32 bits into 16+16, but managed by the machine hence no complicated comparison algos needed on our part. Also, since this is all transparent to the SQL interface, our dump-reload cycle or Slony replication, etc. should not be affected either.
Regards,
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com
singh.gurjeet@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet
Mail sent from my BlackLaptop device
Gurjeet Singh wrote: > > > This is very similar to Andrew's original suggestion of splitting 32 > bits into 16+16, but managed by the machine hence no complicated > comparison algos needed on our part. Also, since this is all > transparent to the SQL interface, our dump-reload cycle or Slony > replication, etc. should not be affected either. > > It would break the on-disk representation, though. That's not something we want to do any more if it can possibly be avoided. We want to keep pg_upgrade working. cheers andrew
On Sat, Jun 19, 2010 at 4:55 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > Gurjeet Singh wrote: >> >> >> This is very similar to Andrew's original suggestion of splitting 32 bits >> into 16+16, but managed by the machine hence no complicated comparison algos >> needed on our part. Also, since this is all transparent to the SQL >> interface, our dump-reload cycle or Slony replication, etc. should not be >> affected either. >> >> > > It would break the on-disk representation, though. That's not something we > want to do any more if it can possibly be avoided. We want to keep > pg_upgrade working. I was partial to your original idea -- i thought it was quite clever actually. enums are a performance side of a tradeoff already so I think any improvement for them should be looked at through that lens. 16 bits is IMO enough to pick a reasonable bucket size that gives you enough play to handle the vast majority of cases that are appropriate for enums. your workaround in the rare case you actually hit the limitations (most of these would fall under the 'oops, i used the wrong tool' category) seems perfectly ok imo. merlin
On Fri, 2010-06-18 at 11:50 -0400, Andrew Dunstan wrote: > Thoughts? enum types exist as an optimisation-by-avoidance of referential integrity. We're a relational database, so IMHO we should spend time performance tuning RI. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs wrote: > On Fri, 2010-06-18 at 11:50 -0400, Andrew Dunstan wrote: > > >> Thoughts? >> > > enum types exist as an optimisation-by-avoidance of referential > integrity. > > We're a relational database, so IMHO we should spend time performance > tuning RI. > > I don't accept your initial assertion at all. But in any case, these are not mutually exclusive. Your work tuning RI will not obstruct mine in making enums more useful, nor vice versa. cheers andrew
>> Thoughts? > > enum types exist as an optimisation-by-avoidance of referential > integrity. > > We're a relational database, so IMHO we should spend time performance > tuning RI. I take the view that they exist as a way of representing enumerations of application/domain values - if it's hard coded in the application, it's hard coded in the database by using an enum. This is why it isn't that big a problem that they cannot be amended - they ought to be very static, immutable values in the first place. I still think it would be handy to be able to append new values though, and not have to ALTER COLUMN USING to a new enum type. Besides, using enums in place of lookup tables doesn't really make much sense as an optimisation. It's very cool to be able to write queries like SELECT * FROM payments WHERE payment_type = 'cash', rather than having to recall time and again what the PK of cash is within your lookup table. -- Regards, Peter Geoghegan
On Sun, 2010-06-20 at 03:42 +0100, Peter Geoghegan wrote: > >> Thoughts? > It's very cool to be able to write queries like SELECT * FROM payments > WHERE payment_type = 'cash', rather than having to recall time and > again what the PK of cash is within your lookup table. Ahem. That is what a natural key is for :) Joshua D. Drake > > -- > Regards, > Peter Geoghegan > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering
On Sun, 2010-06-20 at 03:42 +0100, Peter Geoghegan wrote: > >> Thoughts? > It's very cool to be able to write queries like SELECT * FROM payments > WHERE payment_type = 'cash', rather than having to recall time and > again what the PK of cash is within your lookup table. Ahem. That is what a natural key is for :) Joshua D. Drake > > -- > Regards, > Peter Geoghegan > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering
> Ahem. That is what a natural key is for :) Well, they have their own drawbacks that don't make them particularly appealing to use with lookup tables to ape enums. How many lookup tables have you seen in the wild with a natural key? People sometimes represent things like US states as enums. This is probably a mistake, because you cannot control or predict if there'll be a new US state, unlikely though that me be. You *can* control, for example, what types of payment your application can deal with, and you'll probably have to hardcode differences in dealing with each inside your application, which makes enums a good choice. In my earlier example, in addition to 'cash', there is a value for payment_type of 'credit_card' . There is a separate column in the payments table that references a credit_cards table, because credit cards are considered transitory. A check constraint enforces that credit_cards_id is null or not null as appropriate. I don't like the idea of having values in a table that aren't so much data as an integral part of your application/database. I think it's wrong-headed. That's why I am not in favour of your enums as a lookup table wrapper suggestion. -- Regards, Peter Geoghegan
Peter Geoghegan wrote: > How many lookup tables have you seen in the wild with a natural > key? Me? Personally? A few hundred. > People sometimes represent things like US states as enums. This is > probably a mistake, because you cannot control or predict if > there'll be a new US state, unlikely though that me be. More importantly, you're likely to need to associate properties with the state. Sales tax info, maybe a sales manager, etc. A state table can be a handy place to store things like that. > I don't like the idea of having values in a table that aren't so > much data as an integral part of your application/database. Yep, exactly why natural keys should be used when possible. -Kevin
>> People sometimes represent things like US states as enums. This is >> probably a mistake, because you cannot control or predict if >> there'll be a new US state, unlikely though that me be. > > More importantly, you're likely to need to associate properties with > the state. Sales tax info, maybe a sales manager, etc. A state > table can be a handy place to store things like that. That's probably true, but if there was any question of needing to associate such values with US states, it ought to be perfectly obvious to everyone that enums are totally inappropriate. If that wasn't the case, then their use is only highly questionable, at least IMHO. What you're describing isn't really a lookup table as I understand the term. It's just a table. Lookup tables typically have things in them like the various possible states of another table's tuples. In my experience, lookup tables generally have two columns, an integer PK and a description/state. >> I don't like the idea of having values in a table that aren't so >> much data as an integral part of your application/database. > > Yep, exactly why natural keys should be used when possible. The "not having to remember lookup value PK" point I made was very much ancillary to my main point. Ideally, if you restore a schema-only dump of your database, you shouldn't be missing anything that is schema. Things like the possible states of a table's tuples are often schema, not data, and should be treated as such. -- Regards, Peter Geoghegan
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Another thought: could we add a column to pg_type with a flag that's >> true if the oids are in sort order? Then the comparison routines could >> just look that up in the type cache and if it's true (as it often will >> be) just return the oid comparison. >> > > Well, having to do a cache lookup already makes it a couple orders of > magnitude more expensive than an OID comparison. However, it's hard to > say how much that matters in terms of total application performance. > We really could do with a bit of performance testing here ... > > > I have done some. The performance hit is fairly horrible. Adding cache lookups for the enum rows to the comarison routines made a REINDEX on a 1m row table where the index is on an enum column (the enum has 500 randomly ordered labels) jump from around 10s to around 70s. I think that probably rules out doing anything like this for the existing enum types. I think the most we can reasonably do there is to allow adding a label to the end of the enum list. I'm fairly resistant to doing something which will have a major performance impact, as I know there are users who are relying on enums for performce reasons. I'm also fairly resistant to doing things which will require table rewriting. So the question then is: do we want to allow lots of flexibility for positioning new labels with significant degradation in comparison performace for a new enum variant, or have a new variant with some restrictions which probably won't impact most users but would have equivalent performance to the current enum family, or do nothing? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> Well, having to do a cache lookup already makes it a couple orders of >> magnitude more expensive than an OID comparison. However, it's hard to >> say how much that matters in terms of total application performance. >> We really could do with a bit of performance testing here ... > I have done some. The performance hit is fairly horrible. Adding cache > lookups for the enum rows to the comarison routines made a REINDEX on a > 1m row table where the index is on an enum column (the enum has 500 > randomly ordered labels) jump from around 10s to around 70s. Hmmm... that's bad, but I bet it's still less than the cost of comparing NUMERICs. Also, did you make any attempt to avoid repetitive cache lookups by storing a pointer in fn_extra (cf array comparisons)? regards, tom lane
Tom Lane wrote: >> Adding cache >> lookups for the enum rows to the comarison routines made a REINDEX on a >> 1m row table where the index is on an enum column (the enum has 500 >> randomly ordered labels) jump from around 10s to around 70s. >> > > Hmmm... that's bad, but I bet it's still less than the cost of comparing > NUMERICs. Also, did you make any attempt to avoid repetitive cache > lookups by storing a pointer in fn_extra (cf array comparisons)? > > > No. Will work on that. Thanks. cheers andrew
Peter Geoghegan <peter.geoghegan86@gmail.com> wrote: > In my experience, lookup tables generally have two columns, an > integer PK and a description/state. Eek. If that's what you consider a lookup table, I wouldn't advocate their use for anything. Ever. Period. -Kevin
On Mon, 2010-06-21 at 12:04 -0500, Kevin Grittner wrote: > Peter Geoghegan <peter.geoghegan86@gmail.com> wrote: > > > In my experience, lookup tables generally have two columns, an > > integer PK and a description/state. > > Eek. If that's what you consider a lookup table, I wouldn't > advocate their use for anything. Ever. Period. Do you mean you don't use relational modelling, or do you mean you would never implement your physical database that way because of the performance impact of RI on PostgreSQL? Or? -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>> Well, having to do a cache lookup already makes it a couple orders of >>> magnitude more expensive than an OID comparison. However, it's hard to >>> say how much that matters in terms of total application performance. >>> We really could do with a bit of performance testing here ... >>> > > >> I have done some. The performance hit is fairly horrible. Adding cache >> lookups for the enum rows to the comarison routines made a REINDEX on a >> 1m row table where the index is on an enum column (the enum has 500 >> regards, tom lane >> >> randomly ordered labels) jump from around 10s to around 70s. >> > > Hmmm... that's bad, but I bet it's still less than the cost of comparing > NUMERICs. Also, did you make any attempt to avoid repetitive cache > lookups by storing a pointer in fn_extra (cf array comparisons)? > > > OK, making a bit of progress. Attached is a sort of proof of concept patch that does that. It stores a bsearchable list of {enum, sort_order} pairs in fn_extra, along with a flag that indicates if the oids are in fact ordered. This flag, which would be maintained in and populated from pg_type, would allow avoidance of any significant performance penalty in such cases by relying on straight Oid comparison. We'd probably need to keep a count of labels in pg_type too so we could size the cache appropriately. This approach just about buys the best of both worlds. The execution time for the test mentioned above is down from around 70s to around 20s. I think for a worst case that's not too bad, especially when it is completely avoided unless we have perturbed the sort order. If anyone wants to play along, my test set is available at <http://developer.postgresql.org/~adunstan/enumtest.dmp> It's about 8.5Mb. cheers andrew Index: enum.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/adt/enum.c,v retrieving revision 1.11 diff -c -r1.11 enum.c *** enum.c 26 Feb 2010 02:01:08 -0000 1.11 --- enum.c 22 Jun 2010 02:16:48 -0000 *************** *** 14,19 **** --- 14,20 ---- #include "postgres.h" #include "catalog/pg_enum.h" + #include "catalog/pg_type.h" #include "fmgr.h" #include "utils/array.h" #include "utils/builtins.h" *************** *** 22,27 **** --- 23,52 ---- #include "libpq/pqformat.h" #include "miscadmin.h" + typedef struct + { + Oid enum_oid; + uint32 sort_order; + } enum_sort; + + typedef struct + { + bool oids_are_sorted; + int sort_list_length; + enum_sort sort_order_list[1024]; + } enum_sort_cache; + + + static int + enum_sort_cmp(void * es1, void * es2) + { + enum_sort *p1, *p2; + p1 = (enum_sort *)es1; + p2 = (enum_sort *)es2; + return p1->enum_oid - p2->enum_oid; + } + + static ArrayType *enum_range_internal(Oid enumtypoid, Oid lower, Oid upper); static int enum_elem_cmp(const void *left, const void *right); *************** *** 155,167 **** /* Comparison functions and related */ Datum enum_lt(PG_FUNCTION_ARGS) { Oid a = PG_GETARG_OID(0); Oid b = PG_GETARG_OID(1); ! PG_RETURN_BOOL(a < b); } Datum --- 180,283 ---- /* Comparison functions and related */ + static inline int + enum_ccmp(Oid arg1, Oid arg2, FunctionCallInfo fcinfo) + { + + enum_sort_cache * mycache; + enum_sort *es1, *es2; + int sort1, sort2; + bool added = false; + HeapTuple tup; + Form_pg_enum en; + Oid typeoid; + Form_pg_type typ; + + if (arg1 == arg2) + return 0; + + mycache = (enum_sort_cache *) fcinfo->flinfo->fn_extra; + if (mycache == NULL ) + { + fcinfo->flinfo->fn_extra = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt, + sizeof(enum_sort_cache)); + mycache = (enum_sort_cache *) fcinfo->flinfo->fn_extra; + mycache->sort_list_length = 1; + tup = SearchSysCache1(ENUMOID, ObjectIdGetDatum(arg1)); + en = (Form_pg_enum) GETSTRUCT(tup); + mycache->sort_order_list[0].enum_oid = arg1; + mycache->sort_order_list[0].sort_order = arg1; + typeoid = en->enumtypid; + ReleaseSysCache(tup); + tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typeoid)); + typ = (Form_pg_type) GETSTRUCT(tup); + if (typ->typtype != 'e') + elog(ERROR,"wrong type for oid %u",typeoid); + /* XXX TODO fill in oids_are_sorted property from type tuple here */ + mycache->oids_are_sorted = false; + ReleaseSysCache(tup); + } + + if (mycache->oids_are_sorted) + return arg1 - arg2; + + es1 = bsearch(&arg1,mycache->sort_order_list,mycache->sort_list_length, + sizeof(enum_sort),enum_sort_cmp); + es2 = bsearch(&arg2,mycache->sort_order_list,mycache->sort_list_length, + sizeof(enum_sort),enum_sort_cmp); + + if (es1 == NULL) + { + + tup = SearchSysCache1(ENUMOID, ObjectIdGetDatum(arg1)); + en = (Form_pg_enum) GETSTRUCT(tup); + mycache->sort_order_list[mycache->sort_list_length].enum_oid = arg1; + sort1 = arg1; /* XXX should be en->enumsort */ + mycache->sort_order_list[mycache->sort_list_length].sort_order = + sort1; + ReleaseSysCache(tup); + mycache->sort_list_length ++; + added = true; + } + else + { + /* already in cache */ + sort1 = es1->sort_order; + } + + if (es2 == NULL) + { + + tup = SearchSysCache1(ENUMOID, ObjectIdGetDatum(arg2)); + en = (Form_pg_enum) GETSTRUCT(tup); + sort2 = arg2; /* XXX should be en->enumsort */ + mycache->sort_order_list[mycache->sort_list_length].enum_oid = arg2; + mycache->sort_order_list[mycache->sort_list_length].sort_order = + sort2; + ReleaseSysCache(tup); + mycache->sort_list_length ++; + added = true; + } + else + { + /* already in cache */ + sort2 = es2->sort_order; + } + + if (added) + qsort(mycache->sort_order_list,mycache->sort_list_length, + sizeof(enum_sort),enum_sort_cmp); + + return sort1 - sort2; + } + Datum enum_lt(PG_FUNCTION_ARGS) { Oid a = PG_GETARG_OID(0); Oid b = PG_GETARG_OID(1); ! PG_RETURN_BOOL(enum_ccmp(a,b,fcinfo) < 0); } Datum *************** *** 170,176 **** Oid a = PG_GETARG_OID(0); Oid b = PG_GETARG_OID(1); ! PG_RETURN_BOOL(a <= b); } Datum --- 286,292 ---- Oid a = PG_GETARG_OID(0); Oid b = PG_GETARG_OID(1); ! PG_RETURN_BOOL(enum_ccmp(a,b,fcinfo) <= 0); } Datum *************** *** 197,203 **** Oid a = PG_GETARG_OID(0); Oid b = PG_GETARG_OID(1); ! PG_RETURN_BOOL(a >= b); } Datum --- 313,319 ---- Oid a = PG_GETARG_OID(0); Oid b = PG_GETARG_OID(1); ! PG_RETURN_BOOL(enum_ccmp(a,b,fcinfo) >= 0); } Datum *************** *** 206,212 **** Oid a = PG_GETARG_OID(0); Oid b = PG_GETARG_OID(1); ! PG_RETURN_BOOL(a > b); } Datum --- 322,328 ---- Oid a = PG_GETARG_OID(0); Oid b = PG_GETARG_OID(1); ! PG_RETURN_BOOL(enum_ccmp(a,b,fcinfo) > 0); } Datum *************** *** 233,242 **** Oid a = PG_GETARG_OID(0); Oid b = PG_GETARG_OID(1); ! if (a > b) ! PG_RETURN_INT32(1); ! else if (a == b) PG_RETURN_INT32(0); else PG_RETURN_INT32(-1); } --- 349,358 ---- Oid a = PG_GETARG_OID(0); Oid b = PG_GETARG_OID(1); ! if (a == b) PG_RETURN_INT32(0); + else if (enum_ccmp(a,b,fcinfo) > 0) + PG_RETURN_INT32(1); else PG_RETURN_INT32(-1); }
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > Robert Haas wrote: > >> I like the idea of being able to modify enums on the fly, but I'm > >> skeptical of an implementation that won't always work. Maybe it's > >> still better than what we have now, but it seems grotty. > > > I'd be perfectly happy to hear a reasonable alternative. > > Insert a sort order column into pg_enum, and rearrange the values in > that whenever the user wants to add a new value in a particular place. > You give up cheap comparisons in exchange for flexibility. I think lots > of people would accept that tradeoff, especially if they could make it > per-datatype. > > One point here is that you'd have to restrict the rearrangements so that > the relative sort order of existing values never changes, else you break > (for example) indexes on columns of that type. Sorry to be commenting late, but don't most people want to add to the end or beginning of the enum list, rather than in the middle, and can't we support that already? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
Bruce Momjian <bruce@momjian.us> writes: > Sorry to be commenting late, but don't most people want to add to the > end or beginning of the enum list, rather than in the middle, and can't > we support that already? We could allow adding a value, but we couldn't guarantee where it would appear in the type's sort ordering. Depending on the current OID counter it would usually show up either at the end or the beginning. I think the general feeling is that this is too implementation-dependent to be acceptable. regards, tom lane
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Sorry to be commenting late, but don't most people want to add to the > > end or beginning of the enum list, rather than in the middle, and can't > > we support that already? > > We could allow adding a value, but we couldn't guarantee where it would > appear in the type's sort ordering. Depending on the current OID > counter it would usually show up either at the end or the beginning. > I think the general feeling is that this is too implementation-dependent > to be acceptable. Well, we don't need the enum value to map into the entire oid range. Can't we just add one to the top-most value and see if there is a conflict? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
Bruce Momjian <bruce@momjian.us> writes: > Well, we don't need the enum value to map into the entire oid range. > Can't we just add one to the top-most value and see if there is a > conflict? If you don't use the OID counter to generate the new value, you're going to have problems with race conditions. There's also that small chance that there is no free value before 2^32. The bottom line here is not wanting a feature that "usually" works but fails once in awhile on the basis of conditions the user can't control. regards, tom lane
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > >> Well, we don't need the enum value to map into the entire oid range. >> Can't we just add one to the top-most value and see if there is a >> conflict? >> > > If you don't use the OID counter to generate the new value, you're going > to have problems with race conditions. There's also that small chance > that there is no free value before 2^32. > > The bottom line here is not wanting a feature that "usually" works but > fails once in awhile on the basis of conditions the user can't control. > > Yeah, what I'm now hoping to be able to do, following good suggestions from Tom, is to provide a way to get virtually no degradation in bulk comparison performance in the common case where any additions have been made at the end of the list with no oid wraparound, and acceptable performance otherwise, but provide for an ability to add new values at arbitrary places in the ordering, with no limit. If we can do that why would we want less? cheers andrew