Thread: New SQL Datatype RECURRINGCHAR
Idea for a new SQL Data Type: RECURRINGCHAR The idea with RECURRINGCHAR is treated exactly like a VARCHAR in it's usage. However, it's designed for table columns that store a small set of repeated values (<=256 values). This allows for a great deal of savings in the storage of the data. Example: Query: select count(*) from order Returns: 100,000 Query: select distinct status from order Returns: OPEN REWORK PLANNED RELEASED FINISHED SHIPPED It's apparent that there is a lot of duplicate space used in the storage of this information. The idea is if order.status was stored as a RECURRINGCHAR then the only data stored for the row would be a reference to the value of the column. The actual values would be stored in a separate lookup table. Advantages: - Storage space is optimized. - a query like: select distinct {RECURRINGCHAR} from {table} can be radically optimized - Eliminates use of joins and extended knowledge of data relationships for adhoc users. This datatype could be extended to allow for larger sets of repeated values: RECURRINGCHAR1 (8-bit) up to 256 unique column valuesRECURRINGCHAR2 (16-bit) up to 65536 unique column values Reasoning behind using 'long reference values': It is often an advantage to actually store an entire word representing a business meaning as the value of a column (as opposed to a reference number or mnemonic abbreviation ). This helps to make the system 'self documenting' and adds value to users who are performing adhoc queries on the database. ---- David Bennett President - Bensoft 912 Baltimore, Suite 200 Kansas City, MO 64105
This is rather like MySQL's enum. I still opt for the join, and if you like make a view for those who don't want to know the data structure. -- Rod Taylor Your eyes are weary from staring at the CRT. You feel sleepy. Notice how restful it is to watch the cursor blink. Close your eyes. The opinions stated above are yours. You cannot imagine why you ever felt otherwise. ----- Original Message ----- From: <dbennett@jade.bensoft.com> To: <pgsql-hackers@postgresql.org> Sent: Friday, June 29, 2001 6:05 PM Subject: [HACKERS] New SQL Datatype RECURRINGCHAR > Idea for a new SQL Data Type: > > RECURRINGCHAR > > The idea with RECURRINGCHAR is treated exactly like a VARCHAR in it's > usage. However, it's designed for table columns that store a small set of > repeated values (<=256 values). This allows for a great deal of savings in > the storage of the data. > > Example: > > Query: > select count(*) from order > Returns: > 100,000 > > Query: > select distinct status from order > Returns: > OPEN > REWORK > PLANNED > RELEASED > FINISHED > SHIPPED > > It's apparent that there is a lot of duplicate space used in the storage > of this information. The idea is if order.status was stored as a > RECURRINGCHAR > then the only data stored for the row would be a reference to the value of > the column. The actual values would be stored in a separate lookup table. > > Advantages: > > - Storage space is optimized. > > - a query like: > > select distinct {RECURRINGCHAR} from {table} > > can be radically optimized > > - Eliminates use of joins and extended knowledge of data relationships > for adhoc users. > > This datatype could be extended to allow for larger sets of repeated > values: > > RECURRINGCHAR1 (8-bit) up to 256 unique column values > RECURRINGCHAR2 (16-bit) up to 65536 unique column values > > Reasoning behind using 'long reference values': > > It is often an advantage to actually store an entire word representing a > business meaning as the value of a column (as opposed to a reference > number or mnemonic abbreviation ). This helps to make the system > 'self documenting' and adds value to users who are performing adhoc > queries on the database. > > ---- > David Bennett > President - Bensoft > 912 Baltimore, Suite 200 > Kansas City, MO 64105 > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster >
This is not a good idea. You are probably coming from mysql background (no offense :). See comments inline. On Fri, 29 Jun 2001 dbennett@jade.bensoft.com wrote: > Idea for a new SQL Data Type: > > It's apparent that there is a lot of duplicate space used in the storage > of this information. The idea is if order.status was stored as a > RECURRINGCHAR > then the only data stored for the row would be a reference to the value of > the column. The actual values would be stored in a separate lookup table. You should instead have another table with two columns, order_status_id and order_status_desc, and join with it to get your data. > > Advantages: > > - Storage space is optimized. > > - a query like: > > select distinct {RECURRINGCHAR} from {table} > > can be radically optimized select distinct order_status_desc from order_status_lookup > - Eliminates use of joins and extended knowledge of data relationships > for adhoc users. For adhoc users, you can create a view so they won't be aware of joins. > It is often an advantage to actually store an entire word representing a > business meaning as the value of a column (as opposed to a reference > number or mnemonic abbreviation ). This helps to make the system > 'self documenting' and adds value to users who are performing adhoc > queries on the database. No, that is against good database design and any database normalization. -alex
"Rod Taylor" <rbt@barchord.com> writes: > This is rather like MySQL's enum. Yes. If we were going to do anything like this, I'd vote for stealing the "enum" API, lock stock and barrel --- might as well be compatible. regards, tom lane
On Tue, 3 Jul 2001, David Bennett wrote: > The idea is to simplify the process of storing and accessing the data. > Joins required a deeper knowledge of the relational structure. This > also complicates application programming, two tables must be > maintained instead of just one. Sometimes, to maintain correctness, its necessary to have complex designs. "All problems have simple, easy-to-understand, incorrect solutions". > Again the idea is to simplify. Reduce the number of tables required to > represent a business model. Why? You should normalize your data, which _increases_ number of tables. > >> - Eliminates use of joins and extended knowledge of data relationships > >> for adhoc users. > > > For adhoc users, you can create a view so they won't be aware of joins. > > Now we have a master table, a lookup table AND a view? > even more complication.... Well, that's called software development. If you don't want complications, you can use MS-Access *:) > > >> It is often an advantage to actually store an entire word representing a > >> business meaning as the value of a column (as opposed to a reference > >> number or mnemonic abbreviation ). This helps to make the system > >> 'self documenting' and adds value to users who are performing adhoc > >> queries on the database. > > > No, that is against good database design and any database normalization. > > I would like to hear your argument on this. I don't see how optimizing > the storage of reference value breaks a normalization rule. What if tomorrow you will need to change text name for "OPEN" status to "OPEN_PENDING_SOMETHING"? With your design, you will need to update all rows in the table changing it. With normalized design, you just update the lookup table. Etc, etc. -alex
On Fri, 6 Jul 2001, David Bennett wrote: <rest snipped> > In either model you would: > > update master_table set status='OPEN_PENDING_SOMETHING' where status='OPEN' > > This would not change, in fact, even in a normalized design you > wouldn't change the lookup table (parent) key. Perhaps you are > misunderstanding my initial concept. The MySQL 'enum' is close. > However, it is static and requires you to embed business data (your > key list) in the DDL. The idea I have here is to dynamically extend > this list as needed. I am not saying that the value can't relate to a > parent (lookup) table. It's just not necessary if the value is all > that is needed. You are making absolutely no sense. Let me break it down: a) To do an update of a key to a different value, you would need to do following: 1) look up the new value in entire table, find if its already exists 2) If it exists, good. 3) if it doesn't, pick a next number. (out of some sequence, I suppose) to represent the key. 4) do the actual update. Step 1 without an index is a killer. Then, you need to have a certain 'table' to map the existing key values to their numerical representations. How would this 'table' get populated? On startup? On select? Its one thing to take 'enum' datatype, which I wouldn't disagree too much with. Its another thing to suggest this kind of a scheme, which should be really done with views and rules. I.E. instead of (as you would have) table a(..., x recurringchar), you must have two things: table a_real (..., x int4) table lookup (x int4, varchar value) Then, have a view: create view a as select ..., value from a_real, lookup where a_real.x=lookup.x Then create a rule on insert: (syntax may be rusty) create rule foo on insert on table a do instead ...whatever magic you need to do the actual inserton, lookup, etc. > --Dave (Hoping some other SQL developers are monitoring this thread :)
On Sat, 7 Jul 2001, David Bennett wrote: > ----- > In a nutshell you are recommending: > ----- > > create table contact_type ( > code int2, > type char(16), > PRIMARY KEY ( code ) > ); > > create table contact ( > number serial, > name char(32), > type int2, > PRIMARY KEY ( number ), > FOREIGN KEY ( type ) REFERENCES contact_type ( code ) > ); > > create view contact_with_readble_type as ( > select c.number as number, > c.name as name, > t.type as type > from > contact c, > contact_type t > ); > > * To build a type lookup table: > > 1) Select type and code from contact_type > 2) Build UI object which displays type and returns code Just 'select distinct' on a view should do just fine. > * In order to insert a new record with this model: > > 1) Look up to see if type exists > 2) Insert new type > 3) Get type ID > 4) Insert contact record This can be encapsulated with "ON INSERT" rule on a view. > * The adhoc query user is now faced with > the task of understanding 3 data tables. No, only one view. All the logic is encapsulated there. > > ----- > With recurringchar you could do this easily as: > ----- > > create table contact ( > number serial, > name char(32), > type recurringchar1, > PRIMARY KEY ( number ), > ); > > * To build a type lookup table: > > 1) Select distinct type from contact (optimized access to recurringchar > dictionary) > 2) Build UI object which displays and returns type. > > * In order to insert a new record with this model: > > 1) Insert contact record > > * The adhoc query user has one data table. > > ----- > > Granted, changing the value of contact_type.type would require edits > to the contact records. It may be possible to add simple syntax to > allow editing of a 'recurringchar dictionary' to get around isolated > problem which would only exist in certain applications. > > Actually, maybe 'dictionary' or 'dictref' would be a better name for > the datatype. These things belong in application or middleware (AKA views/triggers), not in database server itself. There are multiple problems with your implementation, for example, transaction handling, assume this situation: Tran A inserts a new contact with new type "foo", but does not commit. Dictionary assigns value of N to 'foo'. Tran B inserts a new contact with type foo. What value should be entered in the dictionary? N? A new value? If a type disappears from database, does its dictionary ID get reused? All these questions are not simple questions, and its not up to database to decide it. Your preferred solution belongs in your triggers/views, not in core database.
This would be a potential feature of being able to insert into views in general. Reversing the CREATE VIEW statement to accept inserts, deletes and updates. If true, focus on that. Theres lots of views that cannot be reversed properly -- unions come to mind -- but perhaps this type of simple join could be a first step in the package. I believe this is on the TODO list already. Different attack, but accomplishes the same thing within SQL standards as I seem to recall views are supposed to do this where reasonable. Failing that, implement this type of action the same way as foreign keys. Via the described method with automagically created views, tables, etc. Though I suggest leaving it in contrib for sometime. Enum functionality isn't particularly useful to the majority whose applications tend to pull out the numbers for states when the application is opened (with the assumption they're generally static). -- Rod Taylor Your eyes are weary from staring at the CRT. You feel sleepy. Notice how restful it is to watch the cursor blink. Close your eyes. The opinions stated above are yours. You cannot imagine why you ever felt otherwise. ----- Original Message ----- From: "Alex Pilosov" <alex@pilosoft.com> To: "David Bennett" <dbennett@bensoft.com> Cc: <pgsql-hackers@postgresql.org> Sent: Saturday, July 07, 2001 9:24 PM Subject: RE: [HACKERS] New SQL Datatype RECURRINGCHAR > On Sat, 7 Jul 2001, David Bennett wrote: > > > ----- > > In a nutshell you are recommending: > > ----- > > > > create table contact_type ( > > code int2, > > type char(16), > > PRIMARY KEY ( code ) > > ); > > > > create table contact ( > > number serial, > > name char(32), > > type int2, > > PRIMARY KEY ( number ), > > FOREIGN KEY ( type ) REFERENCES contact_type ( code ) > > ); > > > > create view contact_with_readble_type as ( > > select c.number as number, > > c.name as name, > > t.type as type > > from > > contact c, > > contact_type t > > ); > > > > * To build a type lookup table: > > > > 1) Select type and code from contact_type > > 2) Build UI object which displays type and returns code > Just 'select distinct' on a view should do just fine. > > > * In order to insert a new record with this model: > > > > 1) Look up to see if type exists > > 2) Insert new type > > 3) Get type ID > > 4) Insert contact record > This can be encapsulated with "ON INSERT" rule on a view. > > > * The adhoc query user is now faced with > > the task of understanding 3 data tables. > No, only one view. All the logic is encapsulated there. > > > > > ----- > > With recurringchar you could do this easily as: > > ----- > > > > create table contact ( > > number serial, > > name char(32), > > type recurringchar1, > > PRIMARY KEY ( number ), > > ); > > > > * To build a type lookup table: > > > > 1) Select distinct type from contact (optimized access to recurringchar > > dictionary) > > 2) Build UI object which displays and returns type. > > > > * In order to insert a new record with this model: > > > > 1) Insert contact record > > > > * The adhoc query user has one data table. > > > > ----- > > > > Granted, changing the value of contact_type.type would require edits > > to the contact records. It may be possible to add simple syntax to > > allow editing of a 'recurringchar dictionary' to get around isolated > > problem which would only exist in certain applications. > > > > Actually, maybe 'dictionary' or 'dictref' would be a better name for > > the datatype. > These things belong in application or middleware (AKA views/triggers), not > in database server itself. > > There are multiple problems with your implementation, for example, > transaction handling, assume this situation: > > Tran A inserts a new contact with new type "foo", but does not commit. > Dictionary assigns value of N to 'foo'. > > Tran B inserts a new contact with type foo. What value should be entered > in the dictionary? N? A new value? > > If a type disappears from database, does its dictionary ID get reused? > > All these questions are not simple questions, and its not up to database > to decide it. Your preferred solution belongs in your triggers/views, not > in core database. > > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly >
On Sat, 7 Jul 2001, Rod Taylor wrote: > This would be a potential feature of being able to insert into views > in general. Reversing the CREATE VIEW statement to accept inserts, > deletes and updates. Definitely not a 'potential' feature, but a existing and documented one. Read up on rules, especially 'ON INSERT DO INSTEAD' stuff. Its not automatic, though. > If true, focus on that. Theres lots of views that cannot be reversed > properly -- unions come to mind -- but perhaps this type of simple > join could be a first step in the package. I believe this is on the > TODO list already. On TODO list are updatable views in SQL sense of word, [i.e. automatic updateability of a view which matches certain criteria]. > Different attack, but accomplishes the same thing within SQL standards > as I seem to recall views are supposed to do this where reasonable. > > > Failing that, implement this type of action the same way as foreign > keys. Via the described method with automagically created views, > tables, etc. Though I suggest leaving it in contrib for sometime. > Enum functionality isn't particularly useful to the majority whose > applications tend to pull out the numbers for states when the > application is opened (with the assumption they're generally static). Original suggestion was not for an enum type, it was for _dynamically extensible_ data dictionary type. ENUM is statically defined, and it wouldn't be too hard to implement, with one exception: one more type-specific field needs to be added to pg_attribute table, where would be stored argument for the type (such as, length for a char/varchar types, length/precision for numeric type, and possible values for a enum type). This just needs a pronouncement that this addition is a good idea, and then its a trivial thing to implement enum. -alex
> > This would be a potential feature of being able to insert into views > > in general. Reversing the CREATE VIEW statement to accept inserts, > > deletes and updates. > Definitely not a 'potential' feature, but a existing and documented one. > Read up on rules, especially 'ON INSERT DO INSTEAD' stuff. Its not > automatic, though. Trust me, I know how to go about doing those kinds of things manually. I was referring to the automated reveral -- which creates this features in a very simple manner without any additions or changes to system tables -- aside from reverse rules themselves which is a more generic feature. > > If true, focus on that. Theres lots of views that cannot be reversed > > properly -- unions come to mind -- but perhaps this type of simple > > join could be a first step in the package. I believe this is on the > > TODO list already. > On TODO list are updatable views in SQL sense of word, [i.e. automatic > updateability of a view which matches certain criteria]. > > > Different attack, but accomplishes the same thing within SQL standards > > as I seem to recall views are supposed to do this where reasonable. > > > > > > Failing that, implement this type of action the same way as foreign > > keys. Via the described method with automagically created views, > > tables, etc. Though I suggest leaving it in contrib for sometime. > > Enum functionality isn't particularly useful to the majority whose > > applications tend to pull out the numbers for states when the > > application is opened (with the assumption they're generally static). > > Original suggestion was not for an enum type, it was for _dynamically > extensible_ data dictionary type. ENUMs from my memory are easily redefined. And since the database they're implemented in requires table locks for everything, they can appear dynamic (nothing is transaction safe in that thing anyway).
> > > This would be a potential feature of being able to insert into > views > > > in general. Reversing the CREATE VIEW statement to accept > inserts, > > > deletes and updates. > > Definitely not a 'potential' feature, but a existing and documented > one. > > Read up on rules, especially 'ON INSERT DO INSTEAD' stuff. Its not > > automatic, though. > > Trust me, I know how to go about doing those kinds of things manually. > I was referring to the automated reveral -- which creates this > features in a very simple manner without any additions or changes to > system tables -- aside from reverse rules themselves which is a more > generic feature. Hmm. My above statement lost all credibility in poor grammer and speeling. Time for bed I suppose. Anyway, the point is that some of the simple views should be straight forward to reversing automatically if someone has the will and the time it can be done. A while back a list of 'views which cannot be reversed' was created and included things such as Unions, Intersections, exclusions, aggregates, CASE statements, and a few more items.
"Rod Taylor" <rbt@barchord.com> writes: > Anyway, the point is that some of the simple views should be straight > forward to reversing automatically if someone has the will and the > time it can be done. A while back a list of 'views which cannot be > reversed' was created and included things such as Unions, > Intersections, exclusions, aggregates, CASE statements, and a few more > items. SQL92 has a notion that certain simple views are "updatable", while the rest are not. In our terms this means that we should automatically create ON INSERT/UPDATE/DELETE rules if the view is updatable according to the spec. I have not bothered to chase down all the exact details of the spec's "updatableness" restrictions, but they're along the same lines you mention --- only one referenced table, no aggregation, no set operations, all view outputs are simple column references, etc. My feeling is that the restrictions are stringent enough to eliminate most of the interesting uses of views, and hence an automatic rule creation feature is not nearly as useful/important as it appears at first glance. In real-world applications you'll have to expend some thought on manual rule creation anyway. regards, tom lane
>> It's apparent that there is a lot of duplicate space used in the storage >> of this information. The idea is if order.status was stored as a >> RECURRINGCHAR >> then the only data stored for the row would be a reference to the value of >> the column. The actual values would be stored in a separate lookup table. >You should instead have another table with two columns, order_status_id >and order_status_desc, and join with it to get your data. The idea is to simplify the process of storing and accessing the data. Joins required a deeper knowledge of the relational structure. This also complicates application programming, two tables must be maintained instead of just one. >> select distinct {RECURRINGCHAR} from {table} >> >> can be radically optimized > select distinct order_status_desc from order_status_lookup Again the idea is to simplify. Reduce the number of tables required to represent a business model. >> - Eliminates use of joins and extended knowledge of data relationships >> for adhoc users. > For adhoc users, you can create a view so they won't be aware of joins. Now we have a master table, a lookup table AND a view? even more complication.... >> It is often an advantage to actually store an entire word representing a >> business meaning as the value of a column (as opposed to a reference >> number or mnemonic abbreviation ). This helps to make the system >> 'self documenting' and adds value to users who are performing adhoc >> queries on the database. > No, that is against good database design and any database normalization. I would like to hear your argument on this. I don't see how optimizing the storage of reference value breaks a normalization rule. --Dave
> various disagreements and "quotes"... I agree that you disagree.... :) RECURRINGCHAR does not break normal form. It simply optimizes the storage of reference values (recurring keys). This allows for the use of 'long words' as reference values with a great deal of system storage savings and a boost in performance in certain circumstances. This is more a form of 'compression' then anything else, as a matter of fact, this is very similar to the LZ78 family of substitutional compressors. http://www.faqs.org/faqs/compression-faq/part2/section-1.html The advantage here is that we are targeting a normalized value in it's atomic state, The recurrence rate of this these values is extremely high which allows us to store this data in a very small space and optimize the access to this data by using the 'dictionary' that we create. >What if tomorrow you will need to change text name for "OPEN" status to >"OPEN_PENDING_SOMETHING"? With your design, you will need to update all >rows in the table changing it. With normalized design, you just update the >lookup table. Etc, etc. In either model you would: update master_table set status='OPEN_PENDING_SOMETHING' where status='OPEN' This would not change, in fact, even in a normalized design you wouldn't change the lookup table (parent) key. Perhaps you are misunderstanding my initial concept. The MySQL 'enum' is close. However, it is static and requires you to embed business data (your key list) in the DDL. The idea I have here is to dynamically extend this list as needed. I am not saying that the value can't relate to a parent (lookup) table. It's just not necessary if the value is all that is needed. --Dave (Hoping some other SQL developers are monitoring this thread :)
Alex, I think I fully understand your position. Let me put wrap up our conversation so far. Given the application requirements: 1) contacts have a type. 2) new types must be added on the fly as needed. 3) types names rarely change. 4) the number of contacts should scale to support millions of records. 5) the number of types will be limited to under 64k 6) Users must be able to easily query contacts with readable types. ----- In a nutshell you are recommending: ----- create table contact_type ( code int2, type char(16), PRIMARY KEY ( code ) ); create table contact ( number serial, name char(32), type int2, PRIMARY KEY ( number ), FOREIGNKEY ( type ) REFERENCES contact_type ( code ) ); create view contact_with_readble_type as ( select c.number as number, c.name as name, t.type as type from contact c, contact_type t ); * To build a type lookup table: 1) Select type and code from contact_type 2) Build UI object which displays type and returns code * In order to insert a new record with this model: 1) Look up to see if type exists 2) Insert new type 3) Get type ID 4) Insert contact record * The adhoc query user is now faced with the task of understanding 3 data tables. ----- With recurringchar you could do this easily as: ----- create table contact ( number serial, name char(32), type recurringchar1, PRIMARY KEY ( number), ); * To build a type lookup table: 1) Select distinct type from contact (optimized access to recurringchar dictionary) 2) Build UI object which displays and returns type. * In order to insert a new record with this model: 1) Insert contact record * The adhoc query user has one data table. ----- Granted, changing the value of contact_type.type would require edits to the contact records. It may be possible to add simple syntax to allow editing of a 'recurringchar dictionary' to get around isolated problem which would only exist in certain applications. Actually, maybe 'dictionary' or 'dictref' would be a better name for the datatype.
The only problem with 'enum' is that all of the possible values must be specified at CREATE time. A logical extension to this would be to allow for 'dynamic extensions' to the list. -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Tuesday, July 03, 2001 4:49 PM To: Rod Taylor Cc: dbennett@jade.bensoft.com; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] New SQL Datatype RECURRINGCHAR "Rod Taylor" <rbt@barchord.com> writes: > This is rather like MySQL's enum. Yes. If we were going to do anything like this, I'd vote for stealing the "enum" API, lock stock and barrel --- might as well be compatible. regards, tom lane
David Bennett wrote: > Alex, > > I think I fully understand your position. Let me put wrap up our > conversation so far. > [lots of arguments for recurringchar] All I've seen up to now is that you continue to mix up simplification on the user side with data and contentcontrol on the DB designer side. Do the users create all the tables and would have to create the views, or isthat more the job of someone who's educated enough? And about the multiple lookups and storage of new types, we have procedural languages and database triggers. This is no personal offense, but I just don't see why we should adopt non-standard MySQLism for functionalitythat is available through standard mechanisms with alot more flexibility and control. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com # _________________________________________________________ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com