Thread: VARIANT / ANYTYPE datatype
Hello, A customer came to us with this request: a way to store "any" data in a column. We've gone back and forth trying to determine reasonable implementation restrictions, safety and useful semantics for them. I note that this has been requested in the past: http://archives.postgresql.org/pgsql-general/2004-02/msg01266.php and both Oracle and MS-SQL have it and apparently people find them useful. I didn't find any indication that SQL contains anything resembling this. The main idea is to be able to store column values in an audit table like this: change_time timestamptz table_name name column_name name old_value variant new_value variant So per-column changes, which is much more convenient than the regular idea of storing the whole NEW and/or OLD record(s). Currently, they use text for old_value and new_value, but this is, of course, not very satisfactory. My thought was that a variant type would store the datum as <typid><data> so that it would be possible to identify the datatype stored in each column/row and interpret adequately, calling the appropriate output function etc. On input it would be limited to come only from inside the system, not from the outside world, as that would have obvious security implications; so it'd be similar to pg_node_tree in that regard. Now this has obvious limitations: first, any query that tries to extract data would need to include a cast of the variant value to the appropriate type, so that the type can be resolved early. Thus, trying to extract rows of different types would be forbidden. Also, there would be a security problem with a datum storing something whose datatype later changes (consider a user-defined record type or things like that). My first reaction was to do something like CREATE TYPE foo VARIANT OF (int, text, timestamptz); and then you could declare old_value with type foo, which would only allow values of the declared types. This makes it easy to catalogue used types in any variant, and thus easy to restrict modifying or dropping types that are used in some variant. However, this idea was rejected by the customer due to the unusability: you would have to remember to edit the variant to add the new type anytime you added a new column to a table, which would be cumbersome. What the customer suggested was to have a new fork, which stores type OIDs of datatypes used in the variant. Then when a type is to be altered, all forks would be scanned to determine if the type is used, and raise an error if so. I rejected that idea as unworkable. However, as a middle ground we agreed that we could allow a declared variant to store any pinned type without restrictions; those can't be changed or dropped so there's no safety concern. Storing other types (user-defined types, records and so on) would require some safety net. Before spending too much time exploring a detailed design, I'd like to hear some opinions on the basic idea. -- Álvaro Herrera <alvherre@alvh.no-ip.org>
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > A customer came to us with this request: a way to store "any" data > in a column. +1 More than once (usually in replication, interface, or audit situations) I've had to create a table with one column each of a number of different data types, only one of which will be used in each row. It would make life much simpler for programming such things if an "unknown" type could be stored. I could either cast the types in the rows based on the related identifier column, but it would be even nicer if they could be read "unknown" and pg_typeof or something similar could be used to control flow based on the type. As one example, only *last night* I created the following table as part of the development of our shop's next generation of software. (Yes I know about Slony, Londiste, and others -- we need to roll something out which integrates with existing systems, without production disruption, over the next few years. This does require a custom solution.) Table "public.DbTranOpValue" Column | Type | Modifiers ----------------+-------------------+-----------countyNo | "CountyNoT" | not nullbackendPid | integer | not nulltranStart | "TimestampT" | not nulllogRecordSeqNo | "LogRecordSeqNoT" | not nullcolumnName | "ColumnNameT" | not nullisAfter | "BooleanT" | not nulltextValue | text |intValue | bigint |numericValue | numeric |binaryValue | bytea |booleanValue | "BooleanT" |dateValue | "DateT" |timeValue | "TimeT" |timestampValue| "TimestampT" | Indexes: "DbTranOpValue_pkey" PRIMARY KEY, btree ("backendPid", "tranStart", "logRecordSeqNo", "columnName") I would much rather have had a "value" column of unknown type. -Kevin
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > A customer came to us with this request: a way to store "any" data in a > column. We've gone back and forth trying to determine reasonable > implementation restrictions, safety and useful semantics for them. Yes, it seems rather messy. > The main idea is to be able to store column values in an audit table > like this: > old_value variant > new_value variant > Currently, they use text for old_value and new_value, but this is, of > course, not very satisfactory. Just out of curiosity, what actual functionality gain would ensue over just using text? It seems like doing anything useful with the audit table contents would still require casting the column to text, or the moral equivalent of that. If we did have a concept of ANYTYPE, or really we'd need ANYTYPE[], it could perhaps be used in pg_statistic, in place of the rather ugly anyarray hack that's there now. But I note that nothing above the level of C code can do anything very useful with the contents of pg_statistic, and I'm not sure that having an official type would change that. > However, as a middle ground we agreed that we could allow a declared > variant to store any pinned type without restrictions; those can't be > changed or dropped so there's no safety concern. If you're going to accept that there are restrictions, I don't see that there is a good argument against your thought of a declared "union" type. At least then it's clear what the restrictions are. I am firmly against exposing the notion of "pinned" types in any user-visible SQL semantics, and even more firmly against the idea of creating fundamental functionality that only works for built-in types and can't be extended to add-on types. regards, tom lane
Alvaro Herrera wrote: > A customer came to us with this request: a way to store "any" data in a > column. We've gone back and forth trying to determine reasonable > implementation restrictions, safety and useful semantics for them. > I note that this has been requested in the past: > http://archives.postgresql.org/pgsql-general/2004-02/msg01266.php > and both Oracle and MS-SQL have it and apparently people find them > useful. I didn't find any indication that SQL contains anything > resembling this. I see VARIANT/ANYTYPE as the most general case of supporting union types, which, say, could have more specific examples of "allow any number or date here but nothing else". If VARIANT is supported, unions in general ought to be also. The most effective way of supporting VARIANT or union types in general is having an implementation where in the general case each value in the database knows its own data type rather than the data type being provided by a context such as what table column it is in. For example, if rather than storing a data value directly we store a 2-attribute struct naming a data type and pointing to or holding the data value. See how SQLite works as an example of how VARIANTs or unions could work, although that on its own would need to be made more comprehensive for Pg. I claim ignorance as to how Pg currently implements these matters. Where VARIANT/union types are supported by default, declaring more specific types is just a type constraint and an optimization. Of course, when we know the type of a column/etc isn't going to be VARIANT or some other union type, then a simple optimization allows us to just store the value and have its type provided by context rather than the struct. -- Darren Duncan
On Wed, May 4, 2011 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: >> A customer came to us with this request: a way to store "any" data in a >> column. We've gone back and forth trying to determine reasonable >> implementation restrictions, safety and useful semantics for them. > > Yes, it seems rather messy. > >> The main idea is to be able to store column values in an audit table >> like this: >> old_value variant >> new_value variant >> Currently, they use text for old_value and new_value, but this is, of >> course, not very satisfactory. > > Just out of curiosity, what actual functionality gain would ensue over > just using text? It seems like doing anything useful with the audit > table contents would still require casting the column to text, or the > moral equivalent of that. The problem with text is that for composite/table types you lose the ability to cast back when the structure changes. Most of the auditing implementations I've done cast new/old to text and drop it directly into a single column on the audit record. It works ok, but you have to messily update the text later when the table changes. For non-variant composites you can add columns down the line and it works ok in dependent records without too much fuss. I think though that getting this to work such that type dependency is via row/column instead of just column is going to be tough. Outside of composites, I don't see much advantages vs the text route, performance maybe? merlin
On 05/04/2011 01:36 PM, Tom Lane wrote: > >> The main idea is to be able to store column values in an audit table >> like this: >> old_value variant >> new_value variant >> Currently, they use text for old_value and new_value, but this is, of >> course, not very satisfactory. > Just out of curiosity, what actual functionality gain would ensue over > just using text? It seems like doing anything useful with the audit > table contents would still require casting the column to text, or the > moral equivalent of that. > Yeah, I've been down this road once or twice, and I think that's the $64 question. I wrote a custom audit app two or three years ago. After several iterations the customer and I found that using an hstore for the old/new (or old record / changeset, which is what we actually use) was the most suitable for our use. I think if we did this we'd need to add some sort of is_type() and typeof() functions for variant objects. cheers andrew
On Wed, May 4, 2011 at 2:55 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > On 05/04/2011 01:36 PM, Tom Lane wrote: >> >>> The main idea is to be able to store column values in an audit table >>> like this: >>> old_value variant >>> new_value variant >>> Currently, they use text for old_value and new_value, but this is, of >>> course, not very satisfactory. >> >> Just out of curiosity, what actual functionality gain would ensue over >> just using text? It seems like doing anything useful with the audit >> table contents would still require casting the column to text, or the >> moral equivalent of that. >> > > Yeah, I've been down this road once or twice, and I think that's the $64 > question. > > I wrote a custom audit app two or three years ago. After several iterations > the customer and I found that using an hstore for the old/new (or old record > / changeset, which is what we actually use) was the most suitable for our > use. yeah -- +1 on that method. I think it's really the right way to go with the recent hstore enhancements. merlin
Excerpts from Tom Lane's message of mié may 04 14:36:44 -0300 2011: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > The main idea is to be able to store column values in an audit table > > like this: > > old_value variant > > new_value variant > > Currently, they use text for old_value and new_value, but this is, of > > course, not very satisfactory. > > Just out of curiosity, what actual functionality gain would ensue over > just using text? It seems like doing anything useful with the audit > table contents would still require casting the column to text, or the > moral equivalent of that. Storage efficiency. These people have really huge databases; small changes in how tight things are packed makes a large difference for them. (For example, we developed a type to store SHA-2 digests in a more compact way than bytea mainly because of this reason. Also, at some time they also wanted to apply compression to hstore keys and values.) As someone commented downthread, they also want to have things such as a "typeof" operator. It could be used in (say) a plpgsql function to choose different branches of code. Things go wild when you think about using VARIANT as a function parameter type. Given how complex are our current function resolution rules I'm not really sold on making this work at all, so initially I'm aiming at just raising an error in that case. > If we did have a concept of ANYTYPE, or really we'd need ANYTYPE[], > it could perhaps be used in pg_statistic, in place of the rather ugly > anyarray hack that's there now. But I note that nothing above the level > of C code can do anything very useful with the contents of pg_statistic, > and I'm not sure that having an official type would change that. Well, we could get rid of ugly hacks that are in various places in the backend code to cope with this. Nor sure how useful it'd be for doing things at the SQL level. > > However, as a middle ground we agreed that we could allow a declared > > variant to store any pinned type without restrictions; those can't be > > changed or dropped so there's no safety concern. > > If you're going to accept that there are restrictions, I don't see that > there is a good argument against your thought of a declared "union" > type. At least then it's clear what the restrictions are. I am firmly > against exposing the notion of "pinned" types in any user-visible SQL > semantics, and even more firmly against the idea of creating fundamental > functionality that only works for built-in types and can't be extended > to add-on types. The argument was "it's too cumbersome to use" (which makes sense: users are certain to forget to add the new type to the declared union when they add a new column to the table, possibly causing a run-time error if the audit trigger is in place and attempts to load a new datum into the log table.) I understand the reluctancy to use pinned-ness in a user-visible way. Back to the drawing board. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Excerpts from Tom Lane's message of mié may 04 14:36:44 -0300 2011: >> Just out of curiosity, what actual functionality gain would ensue over >> just using text? It seems like doing anything useful with the audit >> table contents would still require casting the column to text, or the >> moral equivalent of that. > Storage efficiency. These people have really huge databases; small > changes in how tight things are packed makes a large difference for > them. (For example, we developed a type to store SHA-2 digests in a > more compact way than bytea mainly because of this reason. Also, at > some time they also wanted to apply compression to hstore keys and > values.) Hmm. The prototypical case for this would probably be a 4-byte int, which if you add an OID to it so you can resolve the type is going to take 8 bytes, plus you are going to need a length word because there is really no alternative to the "VARIANT" type being varlena overall, which makes it 9 bytes if you're lucky on alignment and up to 16 if you're not. That is not shorter than the average length of the text representation of an int. The numbers don't seem a lot better for 8-byte quantities like int8, float8, or timestamp. It might be marginally worthwhile for timestamp, but surely this is a huge amount of effort to substitute for thinking of a more compact text representation for timestamps. Pardon me for being unconvinced. regards, tom lane
On May 4, 2011, at 3:04 PM, Alvaro Herrera wrote: > As someone commented downthread, they also want to have things such as a > "typeof" operator. It could be used in (say) a plpgsql function to > choose different branches of code. FWIW, pg_typeof(any) has been in core since 9.0. Best, David
On 05/04/2011 07:05 PM, Tom Lane wrote: > Alvaro Herrera<alvherre@commandprompt.com> writes: >> Excerpts from Tom Lane's message of mié may 04 14:36:44 -0300 2011: >>> Just out of curiosity, what actual functionality gain would ensue over >>> just using text? It seems like doing anything useful with the audit >>> table contents would still require casting the column to text, or the >>> moral equivalent of that. >> Storage efficiency. These people have really huge databases; small >> changes in how tight things are packed makes a large difference for >> them. (For example, we developed a type to store SHA-2 digests in a >> more compact way than bytea mainly because of this reason. Also, at >> some time they also wanted to apply compression to hstore keys and >> values.) > Hmm. The prototypical case for this would probably be a 4-byte int, > which if you add an OID to it so you can resolve the type is going to > take 8 bytes, plus you are going to need a length word because there is > really no alternative to the "VARIANT" type being varlena overall, which > makes it 9 bytes if you're lucky on alignment and up to 16 if you're > not. That is not shorter than the average length of the text > representation of an int. The numbers don't seem a lot better for > 8-byte quantities like int8, float8, or timestamp. It might be > marginally worthwhile for timestamp, but surely this is a huge amount of > effort to substitute for thinking of a more compact text representation > for timestamps. > > Pardon me for being unconvinced. > > I'm far from convinced that storing deltas per column rather than per record is a win anyway. I don't have hard numbers to hand, but my vague recollection is that my tests showed it to be a design that used more space. cheers andrew
"David E. Wheeler" <david@kineticode.com> writes: > On May 4, 2011, at 3:04 PM, Alvaro Herrera wrote: >> As someone commented downthread, they also want to have things such as a >> "typeof" operator. It could be used in (say) a plpgsql function to >> choose different branches of code. > FWIW, pg_typeof(any) has been in core since 9.0. But note that pg_typeof() would surely say "VARIANT" if applied to such a datatype. You'd need some other, VARIANT-specific function that knew enough to dig into the value at runtime and get the OID therein. regards, tom lane
Excerpts from Tom Lane's message of mié may 04 20:05:54 -0300 2011: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Excerpts from Tom Lane's message of mié may 04 14:36:44 -0300 2011: > >> Just out of curiosity, what actual functionality gain would ensue over > >> just using text? It seems like doing anything useful with the audit > >> table contents would still require casting the column to text, or the > >> moral equivalent of that. > > > Storage efficiency. These people have really huge databases; small > > changes in how tight things are packed makes a large difference for > > them. (For example, we developed a type to store SHA-2 digests in a > > more compact way than bytea mainly because of this reason. Also, at > > some time they also wanted to apply compression to hstore keys and > > values.) > > Hmm. The prototypical case for this would probably be a 4-byte int, > which if you add an OID to it so you can resolve the type is going to > take 8 bytes, plus you are going to need a length word because there is > really no alternative to the "VARIANT" type being varlena overall, which > makes it 9 bytes if you're lucky on alignment and up to 16 if you're > not. Hmm, I was hoping that we could skip making it varlena at least in some cases ... perhaps that's hopeless, in which case we'll have to reassess the whole idea. Still there's the new functionality argument, though. As a followup idea there exists the desire to store records as records and not text representation of same (given differing record types, of course), for which it'd be more worthwhile. > It might be > marginally worthwhile for timestamp, but surely this is a huge amount of > effort to substitute for thinking of a more compact text representation > for timestamps. Surely if it's just for timestamps it'd be a waste of time. > Pardon me for being unconvinced. :-) -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > As a followup idea there exists the desire to store records as records > and not text representation of same (given differing record types, of > course), for which it'd be more worthwhile. Maybe. The conventional wisdom is that text representation of data is more compact than PG's internal representation by a significant factor --- our FAQ says up to 5x, in fact. I know that that's including row overhead and indexes and so on, but I still don't find it to be a given that you're going to win on space with this sort of trick. Another point here is that once the values in question get to be any significant number of bytes apiece, both text and the proposed VARIANT representation could be subject to compression, which would greatly reduce any nominal advantage of one over the other. Any representational inefficiency in either would just be grist for the compressor. regards, tom lane
On Wed, May 4, 2011 at 8:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: >> As a followup idea there exists the desire to store records as records >> and not text representation of same (given differing record types, of >> course), for which it'd be more worthwhile. > > Maybe. The conventional wisdom is that text representation of data is > more compact than PG's internal representation by a significant factor > --- our FAQ says up to 5x, in fact. I know that that's including row > overhead and indexes and so on, but I still don't find it to be a given > that you're going to win on space with this sort of trick. I've done a lot of testing of the text vs binary format on the wire format...not exactly the same set of issues, but pretty close since you have to send all the oids, lengths, etc. Conventional wisdom is correct although overstated for this topic. Even in truly pathological cases for text, for example in sending multiple levels of redundant escaping in complex structures, the text format will almost always be smaller. For 'typical' data it can be significantly smaller. Two exceptions most people will run into are bytea obviously and the timestamp family of types where binary style manipulation is a huge win both in terms of space and performance. For complex data (say 3+ levels of composites stacked in arrays), binary type formats are much *faster*, albeit larger, via binary as long as you are not bandwidth constrained, and presumably they would be as well for variants. Perhaps even more so, because some of the manipulations made converting tuple storage to binary wire formats don't have to happen. That said, while there are use cases for sending highly structured data over the wire, I can't think of any for direct storage on a table in variant type scenarios, at least not yet :-). merlin
On May 4, 2011, at 6:24 PM, Andrew Dunstan wrote: > I'm far from convinced that storing deltas per column rather than per record is a win anyway. I don't have hard numbersto hand, but my vague recollection is that my tests showed it to be a design that used more space. It depends on how many fields you're changing in one go and how wide the table is. It's also a PITA to identify what fieldsactually changed if you're storing everything. In the case of logging, I'd say that what's really needed is a way tostore a table record that has an indicator of what fields actually changed (and possibly not storing anything for fieldsthat didn't change). That table record would need to also deal with changes to the underlying table structure. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On 05/05/2011 01:00 PM, Jim Nasby wrote: > On May 4, 2011, at 6:24 PM, Andrew Dunstan wrote: >> I'm far from convinced that storing deltas per column rather than per record is a win anyway. I don't have hard numbersto hand, but my vague recollection is that my tests showed it to be a design that used more space. > It depends on how many fields you're changing in one go and how wide the table is. It's also a PITA to identify what fieldsactually changed if you're storing everything. No it's not. Instead of storing OLD/NEW, store a base record and a delta record (an hstore with just the changed fields) for an update. This saves space and means you only have to calculate what changed once. cheers andrew
Excerpts from Darren Duncan's message of mié may 04 15:33:33 -0300 2011: > I see VARIANT/ANYTYPE as the most general case of supporting union types, which, > say, could have more specific examples of "allow any number or date here but > nothing else". If VARIANT is supported, unions in general ought to be also. Okay, so aside from the performance (storage reduction) gained, there's this argument for having variant/union types. It seems to me that this is indeed possible to build. Completely general VARIANT, though, is rather complex. A declared union, where you specify exactly which types can be part of the union, can be catalogued, so that the system knows exactly where to look when a type needs to be modified. A general VARIANT however looks complex to me to solve. The problem is this: if an user attempts to drop a type, and this type is used in a variant somewhere, we would lose the stored data. So the drop needs to be aborted. Similarly, if we alter a type (easy example: a composite type) used in a variant, we need to cascade to modify all rows using that composite. If the unions that use a certain type are catalogued, we at least know what tables to scan to cascade. In a general variant, the system catalogs do not have the information of what type each variant masquerades as. We would need to examine the variant's masqueraded types on each insert; if the current type is not found, add it. This seems a bit expensive. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 05/06/2011 04:08 PM, Alvaro Herrera wrote: > Excerpts from Darren Duncan's message of mié may 04 15:33:33 -0300 2011: > >> I see VARIANT/ANYTYPE as the most general case of supporting union types, which, >> say, could have more specific examples of "allow any number or date here but >> nothing else". If VARIANT is supported, unions in general ought to be also. > Okay, so aside from the performance (storage reduction) gained, there's > this argument for having variant/union types. It seems to me that this > is indeed possible to build. Completely general VARIANT, though, is > rather complex. A declared union, where you specify exactly which types > can be part of the union, can be catalogued, so that the system knows > exactly where to look when a type needs to be modified. A general > VARIANT however looks complex to me to solve. > > The problem is this: if an user attempts to drop a type, and this type > is used in a variant somewhere, we would lose the stored data. So the > drop needs to be aborted. Similarly, if we alter a type (easy example: > a composite type) used in a variant, we need to cascade to modify all > rows using that composite. > > If the unions that use a certain type are catalogued, we at least know > what tables to scan to cascade. > > In a general variant, the system catalogs do not have the information of > what type each variant masquerades as. We would need to examine the > variant's masqueraded types on each insert; if the current type is not > found, add it. This seems a bit expensive. > So how is a declared union going to look and operate? Something like this? CREATE TYPE foo AS UNION (ival int, tval text, tsval timestamptz): CREATE TABLE bar (myunion foo); INSERT INTObar (myunion) VALUES (ival=>1), (tval=>'some text'); UPDATE bar SET myunion.tsval = CURRENT_TIMESTAMP; Something like that could actually be quite nice for a number of purposes. cheers andrew
Alvaro Herrera <alvherre@commandprompt.com> writes: > Excerpts from Darren Duncan's message of mié may 04 15:33:33 -0300 2011: >> I see VARIANT/ANYTYPE as the most general case of supporting union types, which, >> say, could have more specific examples of "allow any number or date here but >> nothing else". If VARIANT is supported, unions in general ought to be also. > Okay, so aside from the performance (storage reduction) gained, there's > this argument for having variant/union types. It seems to me that this > is indeed possible to build. Completely general VARIANT, though, is > rather complex. Yeah. I have no particular objection to a UNION over a specified set of types, but am not very happy with the idea of an unconstrained union. Also, a UNION declaration would allow attaching a field name to each containable type, as I see Andrew just posted about. That seems like potentially a good idea. > In a general variant, the system catalogs do not have the information of > what type each variant masquerades as. We would need to examine the > variant's masqueraded types on each insert; if the current type is not > found, add it. This seems a bit expensive. Not to mention race-condition-prone. How do you ensure someone is not inserting another instance of the variant, with some previously not used content type, while this is going on? regards, tom lane
On Fri, May 6, 2011 at 3:18 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 05/06/2011 04:08 PM, Alvaro Herrera wrote: >> >> Excerpts from Darren Duncan's message of mié may 04 15:33:33 -0300 2011: >> >>> I see VARIANT/ANYTYPE as the most general case of supporting union types, >>> which, >>> say, could have more specific examples of "allow any number or date here >>> but >>> nothing else". If VARIANT is supported, unions in general ought to be >>> also. >> >> Okay, so aside from the performance (storage reduction) gained, there's >> this argument for having variant/union types. It seems to me that this >> is indeed possible to build. Completely general VARIANT, though, is >> rather complex. A declared union, where you specify exactly which types >> can be part of the union, can be catalogued, so that the system knows >> exactly where to look when a type needs to be modified. A general >> VARIANT however looks complex to me to solve. >> >> The problem is this: if an user attempts to drop a type, and this type >> is used in a variant somewhere, we would lose the stored data. So the >> drop needs to be aborted. Similarly, if we alter a type (easy example: >> a composite type) used in a variant, we need to cascade to modify all >> rows using that composite. >> >> If the unions that use a certain type are catalogued, we at least know >> what tables to scan to cascade. >> >> In a general variant, the system catalogs do not have the information of >> what type each variant masquerades as. We would need to examine the >> variant's masqueraded types on each insert; if the current type is not >> found, add it. This seems a bit expensive. > > So how is a declared union going to look and operate? Something like this? > > CREATE TYPE foo AS UNION (ival int, tval text, tsval timestamptz): > CREATE TABLE bar (myunion foo); > INSERT INTO bar (myunion) VALUES (ival=>1), (tval=>'some text'); > UPDATE bar SET myunion.tsval = CURRENT_TIMESTAMP; > > > Something like that could actually be quite nice for a number of purposes. using your hypothetical example, could you cast types to the union? select 1::int::foo; record types would presumably work as well? you could do a lot of *really* neat stuff that way... merlin
Thanks for picking up on my mentioning union types; I wasn't sure if anyone did. Merlin Moncure wrote: > On Fri, May 6, 2011 at 3:18 PM, Andrew Dunstan <andrew@dunslane.net> wrote: >> >> On 05/06/2011 04:08 PM, Alvaro Herrera wrote: >>> Excerpts from Darren Duncan's message of mié may 04 15:33:33 -0300 2011: >>> >>>> I see VARIANT/ANYTYPE as the most general case of supporting union types, >>>> which, >>>> say, could have more specific examples of "allow any number or date here >>>> but >>>> nothing else". If VARIANT is supported, unions in general ought to be >>>> also. >>> Okay, so aside from the performance (storage reduction) gained, there's >>> this argument for having variant/union types. It seems to me that this >>> is indeed possible to build. Completely general VARIANT, though, is >>> rather complex. A declared union, where you specify exactly which types >>> can be part of the union, can be catalogued, so that the system knows >>> exactly where to look when a type needs to be modified. A general >>> VARIANT however looks complex to me to solve. >>> >>> The problem is this: if an user attempts to drop a type, and this type >>> is used in a variant somewhere, we would lose the stored data. So the >>> drop needs to be aborted. Similarly, if we alter a type (easy example: >>> a composite type) used in a variant, we need to cascade to modify all >>> rows using that composite. >>> >>> If the unions that use a certain type are catalogued, we at least know >>> what tables to scan to cascade. >>> >>> In a general variant, the system catalogs do not have the information of >>> what type each variant masquerades as. We would need to examine the >>> variant's masqueraded types on each insert; if the current type is not >>> found, add it. This seems a bit expensive. >> So how is a declared union going to look and operate? Something like this? >> >> CREATE TYPE foo AS UNION (ival int, tval text, tsval timestamptz): >> CREATE TABLE bar (myunion foo); >> INSERT INTO bar (myunion) VALUES (ival=>1), (tval=>'some text'); >> UPDATE bar SET myunion.tsval = CURRENT_TIMESTAMP; >> >> >> Something like that could actually be quite nice for a number of purposes. > > using your hypothetical example, could you cast types to the union? > > select 1::int::foo; > > record types would presumably work as well? you could do a lot of > *really* neat stuff that way... Like with other respondents to this topic, I consider it much more useful and important, as well as practical, to support explicitly defined type unions than a wide-open ANYTYPE. As for how it would operate: The example above implies a union type implementation that is like C's concept of such, where you have to explicitly state how you want the value bit pattern to be interpreted, by naming ival/etc, rather than the DBMS just knowing that a particular value is of a specific type, because per-value stored metadata says so (like with SQLite). While that might be best in practice for implementation, I had envisioned something more like set unions, so you could instead do it like this: CREATE TYPE foo AS UNION (int, text, timestamptz): CREATE TABLE bar (myunion foo); INSERT INTO bar (myunion) VALUES(1), ('some text'); UPDATE bar SET myunion = CURRENT_TIMESTAMP; Unlike a record type, where multiple attributes may have the same time, presumably with a union, they would all be distinct, and so you could use the type name itself to refer to each option; you don't have to make up "ival" or whatever ... unless there are situations where types don't have names. When doing operations that are type-generic, such as equality test or assignment, especially assignment, between 2 things that are both declared to be of type foo, you could just do it with no verbosity same as if you were doing 2 int or text or whatever. When doing operations specific to int or text or whatever, or if you are assigning a foo-declared thing to an int/text-declared thing, then you'd want an explicit cast or assertion, such as "select myunion::int + 3 as answer from bar where is_a(myunion,'int')". Whether you want an explicit cast to go the other way, I would assume you don't need to, like when you have DOMAINs; eg, I would expect the 4th line above to just work, because the system knows the type of CURRENT_TIMESTAMP and it knows that this is a member of the union type of myunion. I see a UNION type as being like a DOMAIN type in reverse. -- Darren Duncan
To follow-up, an additional feature that would be useful and resembles union types is the variant where you could declare a union type first and then separately other types could declare they are a member of the union. I'm talking about loosely what mixins or type-roles or interfaces etc are in other languages. The most trivial example would be declaring an ENUM-alike first and then separately declaring the component values where the latter declare they are part of the ENUM, and this could make it easier to add or change ENUM values. But keep in mind that this is a distinct concept from what we're otherwise talking about as being union types. -- Darren Duncan
Darren Duncan wrote: > To follow-up, an additional feature that would be useful and resembles union > types is the variant where you could declare a union type first and then > separately other types could declare they are a member of the union. I'm > talking about loosely what mixins or type-roles or interfaces etc are in other > languages. The most trivial example would be declaring an ENUM-alike first and > then separately declaring the component values where the latter declare they are > part of the ENUM, and this could make it easier to add or change ENUM values. > But keep in mind that this is a distinct concept from what we're otherwise > talking about as being union types. -- Darren Duncan Should this be a TODO item? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Excerpts from Bruce Momjian's message of mar may 10 16:21:36 -0400 2011: > Darren Duncan wrote: > > To follow-up, an additional feature that would be useful and resembles union > > types is the variant where you could declare a union type first and then > > separately other types could declare they are a member of the union. I'm > > talking about loosely what mixins or type-roles or interfaces etc are in other > > languages. The most trivial example would be declaring an ENUM-alike first and > > then separately declaring the component values where the latter declare they are > > part of the ENUM, and this could make it easier to add or change ENUM values. > > But keep in mind that this is a distinct concept from what we're otherwise > > talking about as being union types. -- Darren Duncan > > Should this be a TODO item? The general idea of C-style unions, sure. Mixin-style stuff ... not sure. Seems like it'd be pretty painful. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Excerpts from Bruce Momjian's message of mar may 10 16:21:36 -0400 2011: >> Darren Duncan wrote: >>> To follow-up, an additional feature that would be useful and resembles union >>> types is the variant where you could declare a union type first and then >>> separately other types could declare they are a member of the union. I'm >>> talking about loosely what mixins or type-roles or interfaces etc are in other >>> languages. The most trivial example would be declaring an ENUM-alike first and >>> then separately declaring the component values where the latter declare they are >>> part of the ENUM, and this could make it easier to add or change ENUM values. >>> But keep in mind that this is a distinct concept from what we're otherwise >>> talking about as being union types. -- Darren Duncan >> Should this be a TODO item? > > The general idea of C-style unions, sure. Mixin-style stuff ... not sure. > Seems like it'd be pretty painful. From the perspective of users, the single greatest distinction between these 2 kinds of unions is being closed versus being open, and that is the primary reason to choose one over the other. A closed union is the C-style, where the union type declares what other types or values it ranges over. The closed union is best when the union definer can reasonably assume that the union won't either ever or would rarely be changed, and in particular can assume that application or database code would have knowledge of the parts that it deals specially with, so it can be assumed that if the closed union type ever is changed then any code designed to use it may be changed at the same time. A good example for a closed union would be a boolean type which just ranges over the two singletons false and true or an order type which ranges just over the three singletons decrease, same, increase. Or a type which enumerates the 7 days of the week, as this is unlikely to change in the life of a system. An open union is the mixin style, where the component types declare they are part of the union. The open union is best when it is likely that there would be either user-defined or extension-defined new types for the union to come along later, and we want to have code that can be generic or polymorphic for any types that can be used in particular ways. Examples of open union types could be number, which all the numeric types compose, and so you can know say that you can use the generic numeric operators on values you have simply if their types compose the number union type, and it still works if more numeric types appear later. Likewise, the string open union could include both text and blob, as both support catenation and substring matches or extraction, for example. This would aid to operator overloading in a generic way, letting you use the same syntax for different types, but allowing types to mix is optional; eg, you could support "add(int,int)" and "add(real,real)" without supporting "add(int,real)" etc but the syntax "add(x,y)" is shared, and you do this while still having a strong type system; allowing the mixing is optional case-by-case. Supporting the open union is closer to supporting ANYTYPE while the closed union isn't so much. -- Darren Duncan
On Wed, May 4, 2011 at 5:58 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > A customer came to us with this request: a way to store "any" data in a > column. We've gone back and forth trying to determine reasonable > implementation restrictions, safety and useful semantics for them. > I note that this has been requested in the past: > http://archives.postgresql.org/pgsql-general/2004-02/msg01266.php I think its a reasonably common use case. Would it be possible to do this with a "typed" hstore? Seems easier to add something there than it would be to add the VARIANT type as discussed here. > both Oracle and MS-SQL have it Do they? What types are they called? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, May 10, 2011 at 3:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Wed, May 4, 2011 at 5:58 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:I think its a reasonably common use case.
> A customer came to us with this request: a way to store "any" data in a
> column. We've gone back and forth trying to determine reasonable
> implementation restrictions, safety and useful semantics for them.
> I note that this has been requested in the past:
> http://archives.postgresql.org/pgsql-general/2004-02/msg01266.php
Would it be possible to do this with a "typed" hstore? Seems easier to
add something there than it would be to add the VARIANT type as
discussed here.Do they? What types are they called?
> both Oracle and MS-SQL have it
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
MS SQL Server calls it's variant type 'sql_variant', but it's limited to a subset of the data types they support. Basically, it can store any numeric type, or any binary or text type with a constrained length. No timestamps, geometry, XML, user-defined types, etc. allowed. So it's not really as much of an "any value" type as it might look on the surface. Don't know any details of Oracle's implementation.
-Eric
It seems to me a reasonable way to implement VARIANT would be to have a data type called VARIANT that stores an OID of the inner type at the beginning, followed by the binary data. When you say pg_typeof(somevariant), you'll get 'variant'. Instead, you'd use a function like this: variant_typeof(VARIANT) returns REGTYPE VARIANT could then be casted to/from other types. I think the rules for converting to/from variant should be a little stronger than normal PostgreSQL casting rules. For example: SELECT '12345'::TEXT::INT; -- allowed SELECT '12345'::TEXT::VARIANT::INT; -- not allowed SELECT '12345'::TEXT::VARIANT::TEXT::INT;-- allowed I'm not sure how conversions to/from VARIANT could be implemented other than creating conversion functions for every type. It'd be nice if we could avoid that. A more permissive way to convert out of VARIANT might be to have a function like this: variant_unwrap(VARIANT) returns TEXT -- user casts to desired type I suppose the in/out functions could prefix the value with the type name and a colon: > SELECT '12345'::INT::VARIANT; variant ------------- integer:12345 The VARIANT type, or similar, would be useful for the JSON data type I've been intermittently working on, as it would allow us to create a function like this: from_json(JSON) returns VARIANT from_json would unwrap a JSON string/number/bool/null/array, converting it to a VARIANT whose inner type is TEXT / (INT or BIGINT or DOUBLE or NUMERIC) / BOOL / [null] / ARRAY. In the [null] case, from_json would actually return NULL (I see no need for VARIANT to wrap nulls). This is rather type-safe compared to what I currently have: from_json(JSON) returns TEXT -- user casts to desired type By returning variant, we can then cast to the desired type, and if the cast is invalid, a type error will occur. Joey Adams
On Tue, May 10, 2011 at 10:29 PM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: > The VARIANT type, or similar, would be useful for the JSON data type > I've been intermittently working on, as it would allow us to create a > function like this: > > from_json(JSON) returns VARIANT This occurred to me: if PostgreSQL functions could return ANYELEMENT, then we could just say: from_json(JSON) returns ANYELEMENT -- no intermediate VARIANT value Likewise, the variant conversion functions (which could be invoked automatically) could be done this way: to_variant(ANYELEMENT) returns VARIANT from_variant(VARIANT) returns ANYELEMENT However, I'm not familiar enough with the innards of PostgreSQL's type system to know if returning ANYELEMENT would be possible and make sense.
2011/5/11 Joseph Adams <joeyadams3.14159@gmail.com>: > On Tue, May 10, 2011 at 10:29 PM, Joseph Adams > <joeyadams3.14159@gmail.com> wrote: >> The VARIANT type, or similar, would be useful for the JSON data type >> I've been intermittently working on, as it would allow us to create a >> function like this: >> >> from_json(JSON) returns VARIANT > > This occurred to me: if PostgreSQL functions could return ANYELEMENT, > then we could just say: > > from_json(JSON) returns ANYELEMENT -- no intermediate VARIANT value > > Likewise, the variant conversion functions (which could be invoked > automatically) could be done this way: > > to_variant(ANYELEMENT) returns VARIANT > from_variant(VARIANT) returns ANYELEMENT > > However, I'm not familiar enough with the innards of PostgreSQL's type > system to know if returning ANYELEMENT would be possible and make > sense. > it isn't possible. You can have a ANYELEMENT on input and output, or ANYELEMENT on input, but you ca't have ANYELEMENT just on output. The analogy for ANYELEMENT is C++ template. Regards Pavel Stehule VARIANT is maybe good idea - it can simplify a PL/pgSQL code - but though anybody about using this type in planner? > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
On Tue, May 10, 2011 at 10:29 PM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: > It seems to me a reasonable way to implement VARIANT would be to have > a data type called VARIANT that stores an OID of the inner type at the > beginning, followed by the binary data. That's likely to be how it gets implemented, but you seem to have missed the point of some of the discussion upthread: the big problem with that is that someone might type "DROP TYPE foo", and when they do, you need an efficient way to figure out whether foo is in use inside an instance of the variant type anywhere in the system. The devil is in the details... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 05/11/2011 07:53 AM, Robert Haas wrote: > On Tue, May 10, 2011 at 10:29 PM, Joseph Adams > <joeyadams3.14159@gmail.com> wrote: >> It seems to me a reasonable way to implement VARIANT would be to have >> a data type called VARIANT that stores an OID of the inner type at the >> beginning, followed by the binary data. > That's likely to be how it gets implemented, but you seem to have > missed the point of some of the discussion upthread: the big problem > with that is that someone might type "DROP TYPE foo", and when they > do, you need an efficient way to figure out whether foo is in use > inside an instance of the variant type anywhere in the system. The > devil is in the details... > Actually, I thought that was the variant (pun intended) Tom objected to, and wanted declared variant types instead, precisely so we could know where we'd need to look if a type was dropped. Using that you might still use the type oid at the beginning of the field, but there are other possibilities too (e.g. a single byte indicating which known variant is in use). cheers andrew
On Wed, May 11, 2011 at 7:53 AM, Robert Haas <robertmhaas@gmail.com> wrote: > That's likely to be how it gets implemented, but you seem to have > missed the point of some of the discussion upthread: the big problem > with that is that someone might type "DROP TYPE foo", and when they > do, you need an efficient way to figure out whether foo is in use > inside an instance of the variant type anywhere in the system. The > devil is in the details... Sorry, I missed that. That in mind, I think I would lean more toward the union proposal as well. Can anyone think of a case where VARIANT would be more useful? As for using one or two bytes to store the type of a UNION, that creates a problem when you want to extend the union in the future. That is, if a UNION is simply a collection of possible types values of the UNION type can hold. If UNION is implemented more like a tagged union: CREATE TYPE token AS TAGGED UNION (identifier TEXT, keyword TEXT, number INT); Then the problem of altering it is much like the problem of altering an ENUM. On Tue, May 10, 2011 at 5:19 PM, Darren Duncan <darren@darrenduncan.net> wrote: > Examples of open union types could be number, which all the numeric types > compose, and so you can know say that you can use the generic numeric > operators on values you have simply if their types compose the number union > type, and it still works if more numeric types appear later. Likewise, the > string open union could include both text and blob, as both support > catenation and substring matches or extraction, for example. > > This would aid to operator overloading in a generic way, letting you use the > same syntax for different types, but allowing types to mix is optional; eg, > you could support "add(int,int)" and "add(real,real)" without supporting > "add(int,real)" etc but the syntax "add(x,y)" is shared, and you do this > while still having a strong type system; allowing the mixing is optional > case-by-case. Coming from a Haskell perspective, this is a great idea, but I don't think the "union" feature should be used to implement it. Closed unions correspond to algebraic data types in Haskell, e.g.: data Ordering = LT | EQ | GT while open unions are better-suited to type classes: (+) :: (Num a) => a -> a -> a I, for one, would like to see PostgreSQL steal some features from Haskell's type system. PostgreSQL seems to implement a subset of Haskell's system, without type classes and where functions can have only one type variable (anyelement). To express the (+) example in PostgreSQL, it would be tempting to simply say: add(real, real) returns real However, what if each real is a different type (e.g. INT and FLOAT). Is that allowed? In the Haskell example above, (+) constraints both of its arguments to the same type. In ad-hoc syntax, it would look like this in PostgreSQL: real anyelement => add(anyelement, anyelement) returns anyelement Another thing to consider: attempting to use a type class as a column type, e.g.: CREATE TABLE foo (n real); Normally in Haskell, type information is passed implicitly as parameters (hence the term "parametric polymorphism"), rather than carried alongside values (like in object-oriented languages). In the case above, the type information would have to be carried with each value. Haskell actually supports this, but under a somewhat-weird extension called "Existential types" (see http://www.haskell.org/haskellwiki/Existential_type#Examples for an example). It isn't terribly useful in Haskell, and I don't think it will be in PostgreSQL either. Joey Adams
Excerpts from Simon Riggs's message of mar may 10 17:57:20 -0400 2011: > On Wed, May 4, 2011 at 5:58 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > both Oracle and MS-SQL have it > > Do they? What types are they called? ANYTYPE -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, May 11, 2011 at 11:43 AM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: > On Tue, May 10, 2011 at 5:19 PM, Darren Duncan <darren@darrenduncan.net> wrote: >> Examples of open union types could be number, which all the numeric types >> compose, and so you can know say that you can use the generic numeric >> operators on values you have simply if their types compose the number union >> type, and it still works if more numeric types appear later. Likewise, the >> string open union could include both text and blob, as both support >> catenation and substring matches or extraction, for example. >> >> This would aid to operator overloading in a generic way, letting you use the >> same syntax for different types, but allowing types to mix is optional; eg, >> you could support "add(int,int)" and "add(real,real)" without supporting >> "add(int,real)" etc but the syntax "add(x,y)" is shared, and you do this >> while still having a strong type system; allowing the mixing is optional >> case-by-case. > > Coming from a Haskell perspective, this is a great idea, but I don't > think the "union" feature should be used to implement it. I'm unclear what the point of such a feature would be. A union of all the common numeric types is not much different from the existing type "numeric". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Wed, May 11, 2011 at 11:43 AM, Joseph Adams > <joeyadams3.14159@gmail.com> wrote: >> On Tue, May 10, 2011 at 5:19 PM, Darren Duncan <darren@darrenduncan.net> wrote: >>> Examples of open union types could be number, which all the numeric types >>> compose, and so you can know say that you can use the generic numeric >>> operators on values you have simply if their types compose the number union >>> type, and it still works if more numeric types appear later. Likewise, the >>> string open union could include both text and blob, as both support >>> catenation and substring matches or extraction, for example. >>> >>> This would aid to operator overloading in a generic way, letting you use the >>> same syntax for different types, but allowing types to mix is optional; eg, >>> you could support "add(int,int)" and "add(real,real)" without supporting >>> "add(int,real)" etc but the syntax "add(x,y)" is shared, and you do this >>> while still having a strong type system; allowing the mixing is optional >>> case-by-case. >> Coming from a Haskell perspective, this is a great idea, but I don't >> think the "union" feature should be used to implement it. > > I'm unclear what the point of such a feature would be. A union of all > the common numeric types is not much different from the existing type > "numeric". In the case of the "open union" numeric, one point is that users or extensions could come up with new types that add themselves to the union, for example say a type for complex numbers (I didn't see a built-in such). But I'm just citing numeric as an example; there would be a lot more in practice, potentially one for every individual type, so for example if operators were defined for the open union rather than for the base type, then users/extensions could define their own types and easily declare "you can use it like this type" but its different in some important way, which may just be an implementation difference. Operations that don't care about the differences can just be written against the open union type where they just work and those that do care can be more specific. Joseph Adams said: > Coming from a Haskell perspective, this is a great idea, but I don't > think the "union" feature should be used to implement it. Closed > unions correspond to algebraic data types in Haskell, e.g.: > > data Ordering = LT | EQ | GT > > while open unions are better-suited to type classes: > > (+) :: (Num a) => a -> a -> a While closed unions would often be used for your first examlpe, I see they're still useful for type classes as well. Especially in the face of open unions being available, the closed unions let users say, no, I don't really want the meaning of this union type to change just because someone else declares a new type in scope (that adds itself to the union). For example, one could declare a system_numeric closed union type that only includes Pg built-in numerics and users of that can be confident that nothing about it will change later unless the definition of system_numeric itself or the types it unions are changed. But open unions would be preferred in places they wouldn't cause trouble, where you want to allow easier user extensibility. > I, for one, would like to see PostgreSQL steal some features from > Haskell's type system. PostgreSQL seems to implement a subset of > Haskell's system, without type classes and where functions can have > only one type variable (anyelement). I think that Haskell and other functional languages have a lot to teach relational DBMSs and I see them as being highly compatible. -- Darren Duncan
To follow-up my earlier comments ... I suspect for practical purposes we may want to limit the scope of some type features. For example, the greatest benefits for "open union" / "mixin" types is with routines/operators, not so much with tables. So, Pg could choose to support open unions but only for routines, where the declared types of table columns must still be other types. Then you don't have to worry about searching your data for where one might be used, but only search your system catalog for routines/operators that use it. But closed unions should be supported in tables too. -- Darren Duncan
Darren Duncan <darren@darrenduncan.net> writes: > But I'm just citing numeric as an example; there would be a lot more > in practice, potentially one for every individual type, so for example > if operators were defined for the open union rather than for the base > type, then users/extensions could define their own types and easily > declare "you can use it like this type" but its different in some > important way, which may just be an implementation difference. > Operations that don't care about the differences can just be written > against the open union type where they just work and those that do > care can be more specific. I'm just an old-school abstract data type hacker, but I don't see anything in what you're saying that doesn't work today in our existing type system: with overloaded and/or polymorphic operators and functions you can get all those effects. Maybe there would be some small gain in ability to share code for tasks that fall between single-data-type and works-for-anything cases, but it looks like a pretty marginal improvement from here; probably not worth the cost and compatibility implications of a major overhaul of the type system. regards, tom lane