Thread: Dynamic table
Hi. I have a little problem (and a suggestion for a solution) that I wondered if anyone would care to comment on.

I have a standard table filled with customers (with a unique customer id, names, etc.) and for each customer I need to store some integer values. The problem is that the number of integer values that should be stored WILL change over time (it can both increase and decrease). It will most likely grow from zero to 10-18 and then perhaps add or remove one value per year, but basically stay at about 10-18 or so.

I must be able to add a new integer value to all customers, and remove an integer value from all customers. Altering the table by adding and deleting columns would theoretically solve it, but since columns are not really dropped, just hidden away, and there is a 1600-column limit on tables as I understand it, this would crash the application at some time in the future, or at least waste a lot of disk space.

Other things I must be able to do: alter values for a specific integer value for one customer, or for all customers simultaneously. This can be like "change all values 4 to 8 for the third integer value". And I must be able to quickly select all the integers for a specific customer for further processing. There will also be access to single integer values for a specific customer. It's hard to say how frequent that will be compared to "get all integer values" for a specific customer. All customers will be equally accessed.

As I see it I have these options.

Method A) An "add-only" dynamic table that re-uses columns that are no longer used. First create a metatable

    CREATE TABLE Columnspec (
        id integer,
        columnname varchar(10),
        name varchar(30),
        defaultvalue integer,
        active boolean default true
    );

that will keep track of the dynamically added columns, and a table for all the integer values

    CREATE TABLE customerValues (customer_id integer references customer);

Initially it will be empty, and adding a new integer value called "shoe size" with a default value of 4 would mean running these two queries:

    insert into Columnspec (id, columnname, name, defaultvalue)
        values (1, 'col1', 'shoe size', 4);
    alter table customerValues add column col1 integer default 4;

Removing a column would mean an operation like

    update Columnspec set active=false where id=5;

Then adding a new column would mean reusing an inactive column, with operations like the following:

    update Columnspec set active=true, name='hair length', defaultvalue=10 where id=5;
    update customerValues set col5=10; -- give all customers the default value

One could perhaps also alter the default value of the column called 'col5' in the example above, and alter the names of the columns, but that is just cosmetic. The only drawback I can see with this right now is that if the number of columns decreases, it will waste space.

Method B) The EAV way. With a meta table

    CREATE TABLE metadata (
        id integer primary key,
        name varchar(30),
        defaultvalue integer
    );

and then the values

    CREATE TABLE eav (
        customer_id integer references customer,
        key integer references metadata on delete cascade,
        value integer,
        unique(customer_id, key)
    );

With c customers and k integer fields, method A requires at least c*(k+1) integers of storage; B requires c*k*3. A factor of about 3 does not sound like that much wasted space, but of course it can make a difference.

Method C) Have a metadata table as usual, and then store the values in an array. I must admit that I have not looked enough at arrays yet. The drawback I can see right now will be the access to a specific value in the array, but I guess PostgreSQL is fast...

Method D) As method B), but add a view that aggregates the values into a complete set for each customer, if that will speed anything up...

Any comments on the pros/cons of these strategies?
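[For concreteness, the day-to-day operations under Method B might look roughly like the sketch below. It assumes the metadata and eav tables above, that the customer table's primary key is called "id", and that a new attribute's default is materialized for every customer at creation time — all of which are illustrative assumptions, not a definitive design.]

    -- Add a new integer value (key 3) to all customers with default 4:
    insert into metadata (id, name, defaultvalue) values (3, 'some name', 4);
    insert into eav (customer_id, key, value)
        select id, 3, 4 from customer;

    -- Remove an integer value from all customers; the ON DELETE CASCADE
    -- on eav.key cleans up the value rows automatically:
    delete from metadata where id = 3;

    -- "Change all value 4 to 8 for the third integer value":
    update eav set value = 8 where key = 3 and value = 4;

    -- Quickly select all the integers for one customer:
    select key, value from eav where customer_id = 42 order by key;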
On Tue, Jun 16, 2009 at 09:11:20AM +0200, A B wrote:
> I have a standard table filled with customers (with a unique customer
> id, names etc.) and for each customer I need to store some integer
> values. The problem is that the number of integer values that should
> be stored WILL change over time (it can both increase and decrease).
> It will most likely grow from zero to 10-18 and then perhaps add 1 or
> remove one value per year but basically be about 10-18 or so.
>
> I must be able to add a new integer value to all customers, and remove
> an integer value from all customers. Altering the table by adding and
> deleting columns would theoretically solve it, but since columns are
> not really dropped, just hidden away, and there is a 1600 column
> limit on tables as I understand it, this would crash the application
> at some time in the future, or at least it will waste a lot of
> disk space.
>
> Method B)
> The EAV way.
> With a meta table
> CREATE TABLE metadata (id integer primary key, name varchar(30),
> defaultvalue integer);
>
> and then the values
> CREATE TABLE eav (customer_id references customer, key integer
> references metadata on delete cascade, value integer,
> unique(customer_id, key) );

The way you described the problem, the EAV solution sounds like the best match. I'm not sure I'd use your synthetic keys though; they will save a bit of space on disk, but queries will be much more complicated to write.

EAV-style solutions are rarely good or easy to maintain when the problem changes, so maybe you can take a step back from the problem and solve it some other way. The examples you gave (i.e. shoe size, hair length) would fit normal table columns much better. Maybe just use a set of tables, one for each set of related attributes (i.e. physical attributes, personal preferences; not sure what would fit here). You'd also be able to use appropriate data types and constraints; for example, shoe size could have two columns, one for the measurement system (e.g. European, US, UK...) and the other for the actual measurement.

I just had a quick flick through your previous posts, and I'd probably stick with the multiple-tables approach. It's the most natural fit for relational databases, and until you know more about the problem (i.e. you've experienced the data you're going to be getting and the ways it's going to change) you can't do much better.

--
Sam  http://samason.me.uk/
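[A minimal sketch of what Sam is suggesting — grouping related, properly typed attributes into a small table instead of one generic store. The table and column names here are invented for illustration:]

    CREATE TABLE physical_attributes (
        customer_id integer PRIMARY KEY REFERENCES customer,
        hair_length_cm integer,
        -- measurement system kept separate from the measurement itself:
        shoe_size_system varchar(2)
            CHECK (shoe_size_system IN ('EU', 'US', 'UK')),
        shoe_size numeric(4,1)
    );

[The payoff is that each column gets a real type and real constraints, which an EAV table of bare integers cannot express.]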
> The way you described the problem the EAV solution sounds like the best
> match--not sure if I'd use your synthetic keys though, they will save a
> bit of space on disk but queries will be much more complicated to write.

I guess I'll have to build procedures for all the complicated queries whenever I add or remove an integer value.

> EAV style solutions are rarely good/easy to maintain when the problem
> changes so maybe you can take a step back from the problem and solve it
> some other way.

That's what I keep reading about EAV :-(

> The examples you gave (i.e. shoe size, hair length) would fit normal
> table columns much better.

Sorry, shoe size was not a good example; think of it as <random string> instead of shoe size. The data/name is nothing you can relate to in any way, or build special columns for, or treat in other ways.

> Just had a quick flick through your previous posts; and I'd probably
> stick with the multiple tables approach. It's the most natural fit to
> relational databases and until you know more about the problem (i.e.
> you've experienced the data you're going to be getting and the ways it's
> going to change) you can't do much better.

One table per integer is one way that I have not thought about. Thanks!
On Tue, Jun 16, 2009 at 12:21 PM, A B <gentosaker@gmail.com> wrote:
>> Just had a quick flick through your previous posts; and I'd probably
>> stick with the multiple tables approach. It's the most natural fit to
>> relational databases and until you know more about the problem (i.e.
>> you've experienced the data you're going to be getting and the ways it's
>> going to change) you can't do much better.
>
> One table per integer is one way that I have not thought about. Thanks!

You said you wanted to quickly select all the values for a given user. That does not jibe well with having to use a separate table for each integer.

I don't think it's fair to call this EAV, actually. It sounds like the integers are a collection of things which represent the same thing — i.e. they're all bank balances, or all distances driven, just for different time periods. Storing all objects representing the same thing in the same column is just a regular normalized table, not EAV.

You might want to consider partitioning the table of integers by type to facilitate dropping the old ones. But that's an administrative trick, not a fundamental schema design decision.

--
greg
http://mit.edu/~gsstark/resume.pdf
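[Greg doesn't spell out a mechanism, but on the PostgreSQL versions of that era the usual way to partition was table inheritance plus CHECK constraints. A rough sketch, with invented table names and key ranges:]

    CREATE TABLE int_values (
        customer_id integer NOT NULL,
        key integer NOT NULL,
        value integer
    );

    -- One child table per batch of keys; retiring an old batch is then
    -- a cheap DROP TABLE instead of a slow bulk DELETE.
    CREATE TABLE int_values_batch1 (
        CHECK (key BETWEEN 1 AND 10)
    ) INHERITS (int_values);

    CREATE TABLE int_values_batch2 (
        CHECK (key BETWEEN 11 AND 20)
    ) INHERITS (int_values);

    -- With constraint exclusion enabled, queries filtering on key only
    -- scan the child tables whose CHECK constraints can match.
    SET constraint_exclusion = on;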
2009/6/16 Greg Stark <gsstark@mit.edu>:
> I don't think it's fair to call this EAV actually. It sounds
> like the integers are a collection of things which represent the same
> thing. Ie, they're all bank balances or all distances driven, just for
> different time periods. Storing all objects representing the same
> thing in the same column is just a regular normalized table, not EAV.
>
> You might want to consider partitioning the table of integers by type
> to facilitate dropping the old ones. But that's an administrative
> trick, not a fundamental schema design decision.

Unfortunately there is no "type" here. One can make no distinction between the integers, and one cannot tell when they will be added or dropped. Or did I misunderstand you, Greg?
On Tue, Jun 16, 2009 at 01:21:42PM +0200, A B wrote:
> > The examples you gave (i.e. shoe size, hair length) would fit normal
> > table columns much better.
> Sorry, shoe size was not a good example, think of it as <random
> string> instead of shoe size. The data/name is nothing you can relate
> to in any way or build special columns for or treat in other ways.

It's this decision that makes me think you've backed yourself into a corner where EAV is your only solution. EAV is "nice" because it lets you express arbitrary pairings between keys and values (i.e. it's a "map" of arbitrary size, like a hashtable in other languages), but it has its downsides: all values are constrained to be of some specific type, and it doesn't let you exploit the power and expressiveness of databases very easily.

> > Just had a quick flick through your previous posts; and I'd probably
> > stick with the multiple tables approach. It's the most natural fit to
> > relational databases and until you know more about the problem (i.e.
> > you've experienced the data you're going to be getting and the ways it's
> > going to change) you can't do much better.
>
> One table per integer is one way that I have not thought about. Thanks!

That's not what I meant at all; I was meaning to group several related integer values into a single table. This would be a more natural fit for the example names you gave.

Your problem is currently sounding very much like an exam question; you seem to be making decisions arbitrarily without showing any real data. When you deal with real problems in the real world you're normally making compromises when you model things, and hence the decisions wouldn't be as forced as you're making them. Design is about picking and choosing between compromises, and without knowing what the choices are you can't design anything; this is one of the problems with tests.

--
Sam  http://samason.me.uk/
On Jun 16, 2009, at 12:11 AM, A B wrote:
> Hi.
> I have a little problem (and a suggestion for a solution) that I
> wondered if anyone would care to comment on.
>
> I have a standard table filled with customers (with a unique customer
> id, names etc.) and for each customer I need to store some integer
> values. The problem is that the number of integer values that should
> be stored WILL change over time (it can both increase and decrease).
>
> [...]
>
> Method C)
> Have a metadata table as usual, and then store the values in an array.
> I must admit that I have not looked enough at arrays yet.
> The drawback I can see right now will be the access to a specific
> value in the array, but I guess PostgreSQL is fast...

You may want to try this.
pagila=# create table test (id serial primary key, a int[]);
NOTICE:  CREATE TABLE will create implicit sequence "test_id_seq" for serial column "test.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for table "test"
CREATE TABLE
Time: 12.844 ms

-- let's populate it with some starter data
pagila=# insert into test (a) values ('{1}'), ('{2}'), ('{3}');
INSERT 0 3
Time: 2.127 ms

pagila=# select * from test;
 id |  a
----+-----
  1 | {1}
  2 | {2}
  3 | {3}
(3 rows)
Time: 2.823 ms

-- so, let's say you want to add a second integer value of 5 to all rows
-- (remember, postgres arrays are 1-based):
pagila=# update test set a[2] = 5;
UPDATE 3
Time: 1.157 ms

pagila=# select * from test;
 id |   a
----+-------
  1 | {1,5}
  2 | {2,5}
  3 | {3,5}
(3 rows)
Time: 0.445 ms

-- delete the first integer value for just id=1; the key thing here is
-- that you use update to delete an individual value in an array
pagila=# update test set a[1] = null where id = 1;
UPDATE 1
Time: 1.688 ms

pagila=# select * from test;
 id |    a
----+----------
  2 | {2,5}
  3 | {3,5}
  1 | {NULL,5}
(3 rows)
Time: 0.527 ms

-- get integer value 1 for all rows
pagila=# select a[1] from test;
 a
----
 2
 3
 \N
(3 rows)
Time: 0.489 ms

-- you can even skip positions
pagila=# update test set a[5] = 10;
UPDATE 3
Time: 1.180 ms

pagila=# select * from test;
 id |           a
----+-----------------------
  2 | {2,5,NULL,NULL,10}
  3 | {3,5,NULL,NULL,10}
  1 | {NULL,5,NULL,NULL,10}
(3 rows)
Time: 0.431 ms

Erik Jones, Database Administrator
Engine Yard
Support, Scalability, Reliability
866.518.9273 x 260
Location: US/Pacific
IRC: mage2k
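[One caveat with the session above: setting an element to NULL only blanks the slot, it doesn't shrink the array. If you actually want to drop a position for every row — say position 3 of the 5-element arrays above — you can rebuild each array from slices. A sketch, assuming all rows have the same length:]

    -- remove position 3 by concatenating the slices around it
    update test set a = a[1:2] || a[4:5];

[After this the positions of the remaining values shift down by one, so any metadata table mapping names to array positions would need updating to match.]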
> Your problem is currently sounding very much like an exam question; you
> seem to be arbitrarily making decisions without showing any real data.
> When you deal with real problems in the real world you're normally
> making compromises when you model things and hence the decisions
> wouldn't be as forced as you're making it. Design is about picking and
> choosing between compromises and without knowing what the choices are
> you can't design anything, this is one of the problems with tests.

I wish it was just an exam question....

Show real data? Well, here it is:

 customer | value1 | value2 | value3
----------+--------+--------+--------
        1 |      5 |      3 |      4
        2 |      8 |      2 |     10

I hope you can believe me when I say that the names value1, value2, value3 really are impossible to relate. I will not decide upon those. Someone else is going to do that in the future. I just have to make a structure that can handle all cases from "shoe size" to "number of atoms in a person's body" (well, perhaps I can say that the values will be in the 0...100 range, but that's about it). There is really nothing else I know about how the system will be used by other people. :-( I just know that I have to make them enter data like this, since it is a third party that needs this kind of data.

<confession>Yes, I took some random values for the "real data" since I don't know anything else about the data.</confession>
On 2009-06-17, A B <gentosaker@gmail.com> wrote:
>> Your problem is currently sounding very much like an exam question; you
>> seem to be arbitrarily making decisions without showing any real data.
>> When you deal with real problems in the real world you're normally
>> making compromises when you model things and hence the decisions
>> wouldn't be as forced as you're making it. Design is about picking and
>> choosing between compromises and without knowing what the choices are
>> you can't design anything, this is one of the problems with tests.
>
> I wish it was just an exam question....
>
> Show real data?
> Well here it is:
>
> customer | value1 | value2 | value3
>        1        5        3        4
>        2        8        2       10
>
> I hope you can believe me when I say that the names
> value1,value2,value3 really are impossible to relate. I will not
> decide upon those. Someone else is going to do that in the future. I
> just have to make a structure that can handle all cases from "shoe
> size" to "number of atoms in persons body" (well, perhaps I can say
> that the values will be in the 0...100 range, but that's about it.
> There is really nothing else I know about how the system will be used
> by other people. :-( I just know that I have to make them enter data
> like this since it is a third party that need this kind of data.
>
> <confession>yes I took some random values for the "real data" since I
> don't know anything else about the data. </confession>

I like the table-per-column approach, and also the array approach; arrays of numbers of the sizes you discuss are much faster than hard disks.

What sorts of queries will be most common?
Isn't a many-to-one relationship the classic example of a child table? Have one parent table that assigns a primary key: "PARENT" with "PARENT_ID". Have a child table that has a name and value column: "CHILD" with "PARENT_ID", "COLUMN_NAME", and "COLUMN_VALUE". Perform joins as you see fit to build up the structure at times. Use arrays if you'd like, or perform multiple joins, whichever.

Matthew Hartman
Programmer/Analyst, Information Management
Kingston General Hospital, ICP
(613) 544-2631 x4294

________________________________
From: pgsql-general-owner@postgresql.org on behalf of Jasen Betts
Sent: Sat 20-Jun-09 12:14 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Dynamic table

On 2009-06-17, A B <gentosaker@gmail.com> wrote:
> I wish it was just an exam question....
>
> Show real data?
> Well here it is:
>
> customer | value1 | value2 | value3
>        1        5        3        4
>        2        8        2       10
>
> I hope you can believe me when I say that the names
> value1,value2,value3 really are impossible to relate. I will not
> decide upon those. Someone else is going to do that in the future. I
> just have to make a structure that can handle all cases from "shoe
> size" to "number of atoms in persons body" (well, perhaps I can say
> that the values will be in the 0...100 range, but that's about it.
> There is really nothing else I know about how the system will be used
> by other people. :-( I just know that I have to make them enter data
> like this since it is a third party that need this kind of data.

I like the table-per-column approach
And also the array approach, arrays of numbers of the sizes you discuss
are much faster than hard disks
what sorts of querys will be most common
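[A minimal sketch of the parent/child layout Matthew describes, using the names from his message (PARENT_ID, COLUMN_NAME, COLUMN_VALUE); the serial key and unique constraint are illustrative additions:]

    CREATE TABLE parent (
        parent_id serial PRIMARY KEY
    );

    CREATE TABLE child (
        parent_id integer REFERENCES parent,
        column_name varchar(30),
        column_value integer,
        UNIQUE (parent_id, column_name)
    );

    -- build up the full set of values for one parent with a join:
    SELECT p.parent_id, c.column_name, c.column_value
    FROM parent p
    JOIN child c ON c.parent_id = p.parent_id
    WHERE p.parent_id = 1;

[Structurally this is the same shape as the eav table from the original post, with a name string in place of the synthetic key.]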