Thread: [OT] "advanced" database design (long)
Hello,

I think I understand basic relational theory, but then I had an idea.
What I would like to know is whether this is sometimes done, or whether
I am possibly mad... Also, I do not know the terminology for this kind
of thing, so I do not know where and what to look for.

Basically, instead of adding a field to a table every time there is a
need for it, have the table split in two: one part holds the identity
(id) and one holds the attributes (linked to this id). For example,
modelling a subject (person):

subject:
    id
    entered_on (date when entered into the database)

sadg (subject attribute group):
    id
    value (string - name for a group of attributes, e.g. postal address)

sad (subject attribute definition):
    id
    sadg (references sadg(id))
    name (the name of the attribute - e.g. e-mail address, gender)
    app_type (string - what data type to use in the application for sa.value)
    db_type (string - what cast to use in the database for sa.value)

sa (subject attribute):
    id
    subject (references subject(id))
    sad (references sad(id))
    value (string, holds the attribute value)

sads (subject attribute definition set) - what values the sa can have:
    id
    sad (references sad(id))
    value (string)

Basically, if in the future the user decides that the subject should
have a new attribute, he can simply add an "attribute definition" and
an attribute definition set (if any), and the application would handle
it without re-designing tables, queries, and screens (i.e. the user
interface, when displaying the sadg "postal address", would always pull
all attributes in that group, and the layout would automatically handle
the new attributes).

The advantage I see is that to add new fields (here "attributes") no db
development would be needed, and the user could do this.

The disadvantages I see are that the model is hard to work with (i.e.
how do I see a subject (with attributes) as a table - could a cross
join be used for this?). Also, hand-writing the queries for this would
be hard (possibly needed if the user would like to write custom
reports).

Do people do this kind of thing (or have I taken it too far)? If yes, I
would be grateful for pointers to examples or any other info on this...

Vlad
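For reference, here is a minimal sketch of the schema described above as
PostgreSQL DDL; the post gives only field lists, so the column types and
constraints below are assumptions:

create table subject (
    id         serial primary key,
    entered_on date not null default current_date
);

create table sadg (  -- subject attribute group
    id    serial primary key,
    value text not null               -- e.g. 'postal address'
);

create table sad (   -- subject attribute definition
    id       serial primary key,
    sadg     int  not null references sadg(id),
    name     text not null,           -- e.g. 'e-mail address', 'gender'
    app_type text not null,           -- data type to use in the application
    db_type  text not null            -- cast to use in the database
);

create table sa (    -- subject attribute
    id      serial primary key,
    subject int not null references subject(id),
    sad     int not null references sad(id),
    value   text                      -- every value stored as a string
);

create table sads (  -- subject attribute definition set
    id    serial primary key,
    sad   int  not null references sad(id),
    value text not null               -- one allowed value for that attribute
);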
Hi,

vladimir konrad wrote:
> I think I understand basic relational theory, but then I had an idea.
> What I would like to know is whether this is sometimes done, or
> whether I am possibly mad... Also, I do not know the terminology for
> this kind of thing, so I do not know where and what to look for.
>
> Basically, instead of adding a field to a table every time there is a
> need for it, have the table split in two: one part holds the identity
> (id) and one holds the attributes (linked to this id). For example,
> modelling a subject (person):

[example stripped]

> The advantage I see is that to add new fields (here "attributes") no
> db development would be needed, and the user could do this.
>
> The disadvantages I see are that the model is hard to work with (i.e.
> how do I see a subject (with attributes) as a table - could a cross
> join be used for this?). Also, hand-writing the queries for this
> would be hard (possibly needed if the user would like to write custom
> reports).
>
> Do people do this kind of thing (or have I taken it too far)? If yes,
> I would be grateful for pointers to examples or any other info on
> this...

Yes, this is known as the Entity-Attribute-Value model (cf. Wikipedia).

IMO, most of the time its disadvantages (it can be very hard to write
performant queries compared to the traditional row-based model) weigh
higher than what you gain (in flexibility) in relational databases. But
it sure has its use cases.

Ciao,
Thomas
> Yes, this is known as the Entity-Attribute-Value model (cf.
> Wikipedia).

Thank you for the pointer and the term. This will get me started.

> IMO, most of the time its disadvantages (it can be very hard to write
> performant queries compared to the traditional row-based model) weigh
> higher than what you gain (in flexibility) in relational databases.
> But it sure has its use cases.

Obviously, I will have to think it through more...

Vlad
--- vladimir konrad <vk@dsl.pipex.com> wrote:
> I think I understand basic relational theory, but then I had an
> idea.
> Basically, instead of adding a field to a table every time there is a
> need for it, have the table split in two: one part holds the identity
> (id) and one holds the attributes (linked to this id).
> Basically, if in the future the user decides that the subject should
> have a new attribute, he can simply add an "attribute definition" and
> an attribute definition set (if any), and the application would handle

Basically, you would be creating your own data dictionary (i.e. system
catalog) on top of the db data dictionary. The database already comes
with a way to easily add columns: DDL. I have seen newbie database
designers reinvent this method a hundred times. The performance hits
and the complexity of querying the data would far outweigh any
perceived maintenance gain.

My .02.

LewisC

Lewis R Cunningham
An Expert's Guide to Oracle Technology
http://blogs.ittoolbox.com/oracle/guide/
LewisC's Random Thoughts
http://lewiscsrandomthoughts.blogspot.com/
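For comparison, the DDL route Lewis mentions is a one-liner per new
attribute (the column names here are hypothetical, picked from the
attributes mentioned in the original post):

alter table subject add column email text;
alter table subject add column gender char(1);

-- a CHECK constraint can enforce the allowed value set,
-- which is the job the "sads" table was meant to do:
alter table subject add constraint subject_gender_check
    check (gender in ('m', 'f'));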
vladimir konrad <vk@dsl.pipex.com> wrote:
>
> Hello,
>
> I think I understand basic relational theory, but then I had an idea.
> What I would like to know is whether this is sometimes done, or
> whether I am possibly mad... Also, I do not know the terminology for
> this kind of thing, so I do not know where and what to look for.
>
> Basically, instead of adding a field to a table every time there is a
> need for it, have the table split in two: one part holds the identity
> (id) and one holds the attributes (linked to this id). For example,
> modelling a subject (person):
>
> subject:
>     id
>     entered_on (date when entered into the database)
>
> sadg (subject attribute group):
>     id
>     value (string - name for a group of attributes, e.g. postal
>     address)
>
> sad (subject attribute definition):
>     id
>     sadg (references sadg(id))
>     name (the name of the attribute - e.g. e-mail address, gender)
>     app_type (string - what data type to use in the application for
>     sa.value)
>     db_type (string - what cast to use in the database for sa.value)
>
> sa (subject attribute):
>     id
>     subject (references subject(id))
>     sad (references sad(id))
>     value (string, holds the attribute value)
>
> sads (subject attribute definition set) - what values the sa can have:
>     id
>     sad (references sad(id))
>     value (string)
>
> Basically, if in the future the user decides that the subject should
> have a new attribute, he can simply add an "attribute definition" and
> an attribute definition set (if any), and the application would
> handle it without re-designing tables, queries, and screens (i.e. the
> user interface, when displaying the sadg "postal address", would
> always pull all attributes in that group, and the layout would
> automatically handle the new attributes).

Sounds like LDAP.

--
Bill Moran
http://www.potentialtech.com
> Basically, you would be creating your own data dictionary (i.e.
> system catalog) on top of the db data dictionary. The database
> already comes with a way to easily add columns: DDL. I have seen
> newbie database designers reinvent this method a hundred times. The
> performance hits and the complexity of querying the data would far
> outweigh any perceived maintenance gain.

And I thought that this would make me a famous database designer ;-).
Well, given the deadline for this plus other concerns, it looks like
implementing my own type system is overkill (and it would be impossible
for "normal" users to write report queries against it).

Thank you all for the input; I will do it the "normal" way.

Vlad
On Sat, Feb 02, 2008 at 01:38:19PM +0100, Thomas Pundt wrote:
> Hi,
>
> vladimir konrad wrote:
>> I think I understand basic relational theory but

Clearly, you'll have to revisit that thought.

> [example stripped]
>
> Yes, this is known as the Entity-Attribute-Value model (cf.
> Wikipedia).
>
> IMO, most of the time its disadvantages (it can be very hard to write
> performant queries compared to the traditional row-based model)

Make that, "impossible." The "flexibility" stems from fear of making a
design decision.

The second and smaller price is having the system bog down entirely and
having to scrap it, whether that's 3 months down the line or 3 years.
The math beneath this is that query complexity goes up like O(E!A!V!)
for Entity, Attribute and Value.

The first price, though, and by far the biggest, is that it's
impossible to maintain any kind of data integrity in such a system, as
such constraints, by their nature, become application-dependent. Two
applications means you're violating the SPOT (Single Point of Truth)
rule, and that in turn means your data quickly turns into
incomprehensible gibberish.

> weigh higher than what you gain (in flexibility) in relational
> databases. But it sure has its use cases.

Why, yes. I encourage all my competitors to use it. ;)

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
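To make the query-complexity point concrete, here is a sketch of what it
takes to read a single logical "row" back out of the EAV tables sketched
earlier (the attribute ids 1 and 2 are hypothetical); note that every
additional attribute costs another self-join on sa:

select s.id,
       email.value  as email,
       gender.value as gender
from subject s
left join sa email  on email.subject = s.id
                   and email.sad = 1     -- 1 = 'e-mail address'
left join sa gender on gender.subject = s.id
                   and gender.sad = 2;   -- 2 = 'gender'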
Lewis Cunningham wrote:
> --- vladimir konrad <vk@dsl.pipex.com> wrote:
>
>> I think I understand basic relational theory, but then I had an
>> idea.
>> Basically, instead of adding a field to a table every time there is
>> a need for it, have the table split in two: one part holds the
>> identity (id) and one holds the attributes (linked to this id).
>> Basically, if in the future the user decides that the subject should
>> have a new attribute, he can simply add an "attribute definition"
>> and an attribute definition set (if any), and the application would
>> handle
>
> Basically, you would be creating your own data dictionary (i.e.
> system catalog) on top of the db data dictionary. The database
> already comes with a way to easily add columns: DDL.

If you have some part of your app that needs to "select" the list of
columns in a table, you should look at
http://www.postgresql.org/docs/8.2/interactive/catalogs.html
particularly pg_class and pg_attribute.

--
Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz
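A sketch of such a catalog query (the table name 'subject' is just an
example), listing a table's columns in definition order:

select a.attname,
       format_type(a.atttypid, a.atttypmod) as data_type
from pg_attribute a
join pg_class c on c.oid = a.attrelid
where c.relname = 'subject'
  and a.attnum > 0            -- skip system columns
  and not a.attisdropped      -- skip dropped columns
order by a.attnum;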
> If you have some part of your app that needs to "select" the list of
> columns in a table, you should look at
> http://www.postgresql.org/docs/8.2/interactive/catalogs.html
> particularly pg_class and pg_attribute.

Thanks, this could come in handy.

Vlad
David Fetter wrote:
> The math beneath this is that query complexity goes up like O(E!A!V!)
> for Entity, Attribute and Value.

Makes sense.

> The first price, though, and by far the biggest, is that it's
> impossible to maintain any kind of data integrity in such a system,
> as such constraints, by their nature, become application-dependent.
> Two applications means you're violating the SPOT (Single Point of
> Truth) rule, and that in turn means your data quickly turns into
> incomprehensible gibberish.

Starts making sense, too.

Could you give advice (or pointers on what I should look at) on how one
would (roughly) "properly" schemafy the following requirements:

- an unknown number of differing paper forms to print data on
- the user fills in on-screen masks to aggregate data for printing

Intent: use the database to store a) definitions for on-screen masks,
b) definitions for printout (how to place data), and c) the
denormalized data eventually put into form instances (the normalized
source data is already in the database).

There seem to be three basic approaches:

- one table per form def, plus one per form type holding content
- one table holding form defs as, say, XML to be parsed client-side,
  plus another table holding form data as XML, too
- EAV: tables holding form defs, field defs, form instances pointing
  to form defs, and field data pointing to field defs and form
  instances, each with all the relevant foreign keys

The first requires DDL whenever a form is added by a user. The second
requires client-side logic, making form reuse across clients a lot
harder (a SPOT violation?). The third sounds OK -- but seems to be of
the apparently dreaded EAV type.

What am I missing? Where should I get a clue?

Thanks,

Karsten Hilbert, MD
wiki.gnumed.de
Hello,

>> vladimir konrad wrote:
>>> I think I understand basic relational theory but
> Clearly, you'll have to revisit that thought.

Usually I have one table per "entity" modelled (and the table holds the
fields describing that entity). E.g. subject would have name fields and
a date-of-birth field (fields related directly to the subject),
postal_address would be a separate table, and subject_postal_address
would be the linking table between postal_address and subject:

subject <-- subject_postal_address --> postal_address

This way, the postal_address can be made unique (with constraints) and
linked to other entities wherever a postal_address is needed.

The system I am developing has to handle "tests" (for rowing athletes):

1. how many meters an athlete did in 10 minutes (the result is a float)
2. how long it took the athlete to do 5 kilometers (the result is a
   time)

So it looks like I need a table for each of 1 and 2 (because of the
different data types describing the test).

> The math beneath this is that query complexity goes up like O(E!A!V!)
> for Entity, Attribute and Value.

Well, when I thought about it, the gut feeling was that I was opening a
can of worms - it would push the complexity into code (a bad place to
have it in).

> The first price, though, and by far the biggest, is that it's
> impossible to maintain any kind of data integrity in such a system,
> as such constraints, by their nature, become application-dependent.

It could be implemented inside the database server (stored procedures,
views and such), but it would still be complex, hard, long and, as you
said, badly performing (your point about complexity made that clear).

Vlad
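A minimal sketch of the linking-table pattern Vlad describes (the
column details are assumptions, and it reuses the subject table
sketched earlier):

create table postal_address (
    id       serial primary key,
    street   text not null,
    city     text not null,
    postcode text not null,
    unique (street, city, postcode)   -- each address stored only once
);

create table subject_postal_address (
    subject        int not null references subject(id),
    postal_address int not null references postal_address(id),
    primary key (subject, postal_address)
);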
vladimir konrad wrote:
> The system I am developing has to handle "tests" (for rowing
> athletes):
> 1. how many meters an athlete did in 10 minutes (the result is a
>    float)
> 2. how long it took the athlete to do 5 kilometers (the result is a
>    time)
>
> So it looks like I need a table for each of 1 and 2 (because of the
> different data types describing the test).

Are the tests so different that you need to segregate the data?

I see them both as being the time taken to travel a distance. The only
difference is whether the time or the distance is used to end the
measurement.

Personally, I would think that one table that has the athlete's id and
a date (or a timestamp, to allow more than one test a day) for the
event, as well as a time interval and a distance, would suffice.

For 1. the time interval would always be 10 minutes; for 2. the
distance would always be 5 km. To get individual test stats you can use
WHERE time = '10 minutes' or WHERE distance = 5.0. You could even
create test views with the WHERE clause pre-determined, as in the
sketch below. Worst case would be another column flagging the test
type.

From there you can also add 5, 15, 20, 30, 40 minute or even 2, 2.5,
7.5, 10 km tests without changing your structure.

I also see multiple samples for a single run: time at 1 km, 2 km, 3 km,
4 km, 5 km (or at 2 min, 4 min, 6 min...) - you could see whether they
can maintain the speed over the distance, or at what distance/time they
wear out and slow down. (Maybe they give too much in the first 2 km, so
that they just crawl in the last 2.) Maybe sub-times can be a second
table.

--
Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz
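A sketch of that single-table layout plus the pre-baked views Shane
mentions (all names are illustrative):

create table ergo_test (
    athlete_id int       not null references subject(id),
    taken_at   timestamp not null,   -- allows several tests per day
    duration   interval  not null,
    distance_m real      not null,
    primary key (athlete_id, taken_at)
);

create view ten_minute_tests as
    select * from ergo_test where duration = interval '10 minutes';

create view five_km_tests as
    select * from ergo_test where distance_m = 5000.0;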
Hello, and thanks.

> Are the tests so different that you need to segregate the data?
> I see them both as being the time taken to travel a distance. The
> only difference is whether the time or the distance is used to end
> the measurement.

Good point (I realised this after posting, when I dug deeper into the
design). What I have now looks like this (linking fields omitted):

    [subject]
        1
        |
        n
    [ergo: distance(float), time(interval), taken_on(date)]
        n            n
        |            |
        1            1
    [ergo_rate]   [ergo_set]

The same test can be taken at different rates (e.g. 20 strokes per
minute), hence [ergo_rate] (there are only a few rates they always
use). [ergo_set] determines what value was "set" before the test (a bad
name, but maybe I will think of something better). So, if the distance
is set, the row points to (1, 'distance') in [ergo_set]; if the time,
then to (2, 'time'). The user chooses what is "set" before recording
the test. This also makes it possible to ask "give me all 2000 m ergo
test results done at this rate".

> Worst case would be another column flagging the test type.

Why do you think this is the worst case?

> I also see multiple samples for a single run: time at 1 km, 2 km,
> 3 km, 4 km, 5 km (or at 2 min, 4 min, 6 min...) - you could see
> whether they can maintain the speed over the distance, or at what
> distance/time they wear out and slow down. (Maybe they give too much
> in the first 2 km, so that they just crawl in the last 2.)
> Maybe sub-times can be a second table.

They call these splits (I have to check whether they are recorded for
ergos, but it is definitely done for racing on the water). In the ergo
case, I would have an extra table [ergo_split]:

[ergo] <-- [ergo_split: distance(float), clock(time)]

(they record the time a watch shows them, hence time and not interval)

Vlad
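A sketch of that layout as DDL; the types are assumptions taken from
the annotations in the diagram above, and the linking fields that Vlad
omitted are filled in here as plain foreign keys:

create table ergo_rate (
    id      serial primary key,
    strokes int not null            -- strokes per minute, e.g. 20
);

create table ergo_set (
    id    serial primary key,
    value text not null             -- 'distance' or 'time'
);

create table ergo (
    id        serial   primary key,
    subject   int      not null references subject(id),
    ergo_rate int      not null references ergo_rate(id),
    ergo_set  int      not null references ergo_set(id),
    distance  real     not null,
    "time"    interval not null,
    taken_on  date     not null
);

create table ergo_split (
    ergo     int  not null references ergo(id),
    distance real not null,
    clock    time not null          -- the time the watch shows
);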
vladimir konrad wrote:
>> Worst case would be another column flagging the test type.
>
> Why do you think this is the worst case?

Bad choice of words - I was just referring to using the WHERE clause to
extract one particular test. If that is insufficient, you can use a
test column to track which test a row refers to.

--
Shane Ambler
pgSQL (at) Sheeky (dot) Biz

Get Sheeky @ http://Sheeky.Biz
I'm not a database expert, but wouldn't

create table attribute (
    attribute_id int,
    attribute    text
);

create table value (
    value_id int,
    value    text
);

create table attribute_value (
    entity_id    int,
    attribute_id int,
    value_id     int
);

give you a lot fewer pages to load than building a table with, say, 90
columns in it that are all null, which would result in better rather
than worse performance?

Alex

On Feb 2, 2008 9:15 AM, Lewis Cunningham <lewisc@rocketmail.com> wrote:
> --- vladimir konrad <vk@dsl.pipex.com> wrote:
> > I think I understand basic relational theory, but then I had an
> > idea.
> > Basically, instead of adding a field to a table every time there is
> > a need for it, have the table split in two: one part holds the
> > identity (id) and one holds the attributes (linked to this id).
> > Basically, if in the future the user decides that the subject
> > should have a new attribute, he can simply add an "attribute
> > definition" and an attribute definition set (if any), and the
> > application would handle
>
> Basically, you would be creating your own data dictionary (i.e.
> system catalog) on top of the db data dictionary. The database
> already comes with a way to easily add columns: DDL. I have seen
> newbie database designers reinvent this method a hundred times. The
> performance hits and the complexity of querying the data would far
> outweigh any perceived maintenance gain.
>
> My .02.
>
> LewisC
>
> Lewis R Cunningham
> An Expert's Guide to Oracle Technology
> http://blogs.ittoolbox.com/oracle/guide/
> LewisC's Random Thoughts
> http://lewiscsrandomthoughts.blogspot.com/
Hello,

________________________________
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Alex Turner
Sent: Monday, February 4, 2008 05:14
To: Lewis Cunningham
Cc: vladimir konrad; pgsql-general@postgresql.org
Subject: Re: [GENERAL] [OT] "advanced" database design (long)

I'm not a database expert, but wouldn't

create table attribute (
    attribute_id int,
    attribute    text
);

create table value (
    value_id int,
    value    text
);

create table attribute_value (
    entity_id    int,
    attribute_id int,
    value_id     int
);

give you a lot fewer pages to load than building a table with, say, 90
columns in it that are all null, which would result in better rather
than worse performance?

Alex

On Feb 2, 2008 9:15 AM, Lewis Cunningham <lewisc@rocketmail.com> wrote:
> --- vladimir konrad <vk@dsl.pipex.com> wrote:
> > I think I understand basic relational theory, but then I had an
> > idea.
> > Basically, instead of adding a field to a table every time there is
> > a need for it, have the table split in two: one part holds the
> > identity (id) and one holds the attributes (linked to this id).
> > Basically, if in the future the user decides that the subject
> > should have a new attribute, he can simply add an "attribute
> > definition" and an attribute definition set (if any), and the
> > application would handle
>
> Basically, you would be creating your own data dictionary (i.e.
> system catalog) on top of the db data dictionary. The database
> already comes with a way to easily add columns: DDL. I have seen
> newbie database designers reinvent this method a hundred times. The
> performance hits and the complexity of querying the data would far
> outweigh any perceived maintenance gain.

This model is known as Entity-Attribute-Value and is not well
appreciated by relational designers. I think it is not relational, but
I use it as storage for data (in some cases I do not know the
relational database structure where the data will eventually be
stored). It's like a truck container used for an office removal :).

Imho, don't use it with a complex database structure; as a minimum, use
lookup tables for Value (to avoid uncontrolled new parameters) and add
a column to store the attribute type.

Jacques Massé
On Feb 4, 2008 5:14 AM, Alex Turner <armtuk@gmail.com> wrote:
> I'm not a database expert, but wouldn't
> [...]
> give you a lot fewer pages to load than building a table with, say,
> 90 columns in it that are all null, which would result in better
> rather than worse performance?

Well, PostgreSQL's NULLs occupy almost no space; or rather, a bit of
space - one bit, exactly. ;-) I am pretty sure that, storage-wise,
NULLs are the more efficient option.

Regards,
Dawid
On Monday 04 February 2008 07:03:47, Dawid Kuroczko wrote:
> On Feb 4, 2008 5:14 AM, Alex Turner <armtuk@gmail.com> wrote:
> > I'm not a database expert, but wouldn't
> > [...]
> > give you a lot fewer pages to load than building a table with, say,
> > 90 columns in it that are all null, which would result in better
> > rather than worse performance?
>
> Well, PostgreSQL's NULLs occupy almost no space; or rather, a bit of
> space - one bit, exactly. ;-) I am pretty sure that, storage-wise,
> NULLs are the more efficient option.

I'd say 1 byte for every 8 NULLs instead of 1 bit. If you only have 1
NULL, it will cost you 1 byte (not 1 bit). If you have 9, it will cost
you 2 bytes (not 9 bits).

--
Jorge Godoy <jgodoy@gmail.com>
On Feb 3, 2008 10:14 PM, Alex Turner <armtuk@gmail.com> wrote:
> I'm not a database expert, but wouldn't
>
> create table attribute (
>     attribute_id int,
>     attribute    text
> );
>
> create table value (
>     value_id int,
>     value    text
> );
>
> create table attribute_value (
>     entity_id    int,
>     attribute_id int,
>     value_id     int
> );
>
> give you a lot fewer pages to load than building a table with, say,
> 90 columns in it that are all null, which would result in better
> rather than worse performance?

But you're giving us a choice between two bad methodologies.

Properly normalized, you'd not have a table with 90 nullable columns,
but a set of related tables where you'd only need to store things in
the subordinate tables for the relevant data points.

The worst thing about EAV is that it makes it very hard to figure out
what the heck is going on just by looking at the database schema. It's
easy to develop and hard to maintain. We had a person do something
like that at the last place I worked, and it took weeks for a new
developer to figure it out and replace it with a properly relational
model, because there were little corner cases all through the code that
kept popping up.
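A minimal sketch of the normalized shape Scott describes (table and
column names are made up for illustration): optional attribute groups
move into subordinate tables, so there are no wide runs of NULLs and no
generic value slots:

create table person (
    person_id serial primary key,
    name      text not null
);

-- only people who actually have the optional data point get a row:
create table person_fax (
    person_id int  primary key references person(person_id),
    fax       text not null
);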
On Sat, 2 Feb 2008 09:45:57 -0800
David Fetter <david@fetter.org> wrote:

> On Sat, Feb 02, 2008 at 01:38:19PM +0100, Thomas Pundt wrote:
> > Hi,
> >
> > vladimir konrad wrote:
> >> I think I understand basic relational theory but
>
> Clearly, you'll have to revisit that thought.
>
> > [example stripped]
> >
> > Yes, this is known as the Entity-Attribute-Value model (cf.
> > Wikipedia).
> >
> > IMO, most of the time its disadvantages (it can be very hard to
> > write performant queries compared to the traditional row-based
> > model)
>
> Make that, "impossible." The "flexibility" stems from fear of making
> a design decision.
> The second and smaller price is having the system bog down entirely
> and having to scrap it, whether that's 3 months down the line or 3
> years.
>
> The math beneath this is that query complexity goes up like
> O(E!A!V!) for Entity, Attribute and Value.
> The first price, though, and by far the biggest, is that it's
> impossible to maintain any kind of data integrity in such a system,
> as such constraints, by their nature, become application-dependent.
> Two applications means you're violating the SPOT (Single Point of
> Truth) rule, and that in turn means your data quickly turns into
> incomprehensible gibberish.
>
> > weigh higher than what you gain (in flexibility) in relational
> > databases. But it sure has its use cases.
>
> Why, yes. I encourage all my competitors to use it. ;)

There should be some standard refactoring technique *and* tool to move
from the EAV model to something saner in the long run.

Say you still don't know what the users will need, and you want to give
them a chance to experiment with which properties they are most
comfortable with (to build up a better model), and you don't want to
give users access to DDL; or suppose this part is not a critical part
of the application and you want to let the users customise it without
access to the DDL...

I haven't taken the time to really think about how you could stay far
from major pain when you start from an EAV model and later want to
normalise the stuff... I'd go for a naive/brute-force approach. As
usual, I'd try to encapsulate this kind of stuff, centralise the
"access points", and then, once I had to refactor it, just grep for
some kind of signature and fix it case by case.

But well... if I knew in advance that there was a standard technique
(tool?) to build up EAV models in a way that they can be "easily"
refactored later, I'd be happier ;)

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
"Scott Marlowe" <scott.marlowe@gmail.com> writes: > On Feb 3, 2008 10:14 PM, Alex Turner <armtuk@gmail.com> wrote: > >> give you a lot less pages to load than building a table with say 90 columns >> in it that are all null, which would result in better rather than worse >> performance? Fwiw Postgres stores NULLs quite efficiently. Those 90 columns, all of which are null would take 12 bytes. > But you're giving us a choice between two bad methodologies. But I probably agree with Scott. It depends though. There are cases which are inherently awkward for which you will end up with either EAV or 90 mostly NULL columns. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
On Mon, Feb 04, 2008 at 10:49:51AM +0100, Masse Jacques wrote:
> Hello

[much garbage including HTML trimmed.]

Please set your mail client to text-only. HTML does not make your
point better, and it annoys the heck out of people whose mail readers
use text.

> This model is known as Entity-Attribute-Value and is not well
> appreciated by relational designers. I think it is not relational,
> but I use it as storage for data (in some cases I do not know the
> relational database structure where the data will eventually be
> stored). It's like a truck container used for an office removal :).

It's more like sitting on a land mine and hoping it won't go off.

> Imho, don't use it with a complex database structure; as a minimum,
> use lookup tables for Value (to avoid uncontrolled new parameters)
> and add a column to store the attribute type.

Putting lipstick on the EAV pig does not help.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Jorge Godoy <jgodoy@gmail.com> writes:
> On Monday 04 February 2008 07:03:47, Dawid Kuroczko wrote:
>> Well, PostgreSQL's NULLs occupy almost no space; or rather, a bit of
>> space - one bit, exactly. ;-) I am pretty sure that, storage-wise,
>> NULLs are the more efficient option.

> I'd say 1 byte for every 8 NULLs instead of 1 bit. If you only have 1
> NULL, it will cost you 1 byte (not 1 bit). If you have 9, it will
> cost you 2 bytes (not 9 bits).

This is not quite right --- the amount of space used doesn't change if
you have more or fewer nulls in a row. A null bitmap is present in a
row if there are any nulls at all in the row, and its size (in bits)
will be equal to the defined number of columns in the table. As you
say, there's padding overhead too ...

			regards, tom lane
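Worked out for the 90-column example upthread, this connects Tom's rule
to Gregory's 12-byte figure: one bit per defined column gives

    90 columns -> 90 bits -> ceil(90 / 8) = 12 bytes of null bitmap

per row that contains at least one null, whether that row has 1 null or
all 90, before any alignment padding.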
That is a very awesome system. I am constantly impressed by the
awesomeness of PostgreSQL.

Alex

On Feb 4, 2008 1:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jorge Godoy <jgodoy@gmail.com> writes:
> > On Monday 04 February 2008 07:03:47, Dawid Kuroczko wrote:
> >> Well, PostgreSQL's NULLs occupy almost no space; or rather, a bit
> >> of space - one bit, exactly. ;-) I am pretty sure that,
> >> storage-wise, NULLs are the more efficient option.
>
> > I'd say 1 byte for every 8 NULLs instead of 1 bit. If you only have
> > 1 NULL, it will cost you 1 byte (not 1 bit). If you have 9, it will
> > cost you 2 bytes (not 9 bits).
>
> This is not quite right --- the amount of space used doesn't change
> if you have more or fewer nulls in a row. A null bitmap is present in
> a row if there are any nulls at all in the row, and its size (in
> bits) will be equal to the defined number of columns in the table. As
> you say, there's padding overhead too ...
>
> 			regards, tom lane
I just thought of another problem: the system can have multiple values
for a single attribute. How do you normalise that without basically
adding a link table that's just the same thing as given below? (I know
there are array types in PostgreSQL, but there aren't in other DBs, and
I'm a fan of keeping products as DB-neutral as possible.)

Alex

On Feb 4, 2008 7:09 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Feb 3, 2008 10:14 PM, Alex Turner <armtuk@gmail.com> wrote:
> > I'm not a database expert, but wouldn't
> >
> > create table attribute (
> >     attribute_id int,
> >     attribute    text
> > );
> >
> > create table value (
> >     value_id int,
> >     value    text
> > );
> >
> > create table attribute_value (
> >     entity_id    int,
> >     attribute_id int,
> >     value_id     int
> > );
> >
> > give you a lot fewer pages to load than building a table with, say,
> > 90 columns in it that are all null, which would result in better
> > rather than worse performance?
>
> But you're giving us a choice between two bad methodologies.
>
> Properly normalized, you'd not have a table with 90 nullable columns,
> but a set of related tables where you'd only need to store things in
> the subordinate tables for the relevant data points.
>
> The worst thing about EAV is that it makes it very hard to figure out
> what the heck is going on just by looking at the database schema.
> It's easy to develop and hard to maintain. We had a person do
> something like that at the last place I worked, and it took weeks for
> a new developer to figure it out and replace it with a properly
> relational model, because there were little corner cases all through
> the code that kept popping up.
On Feb 5, 2008, at 10:17 AM, Alex Turner wrote:
> On Feb 4, 2008 7:09 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> On Feb 3, 2008 10:14 PM, Alex Turner <armtuk@gmail.com> wrote:
>>> I'm not a database expert, but wouldn't
>>>
>>> create table attribute (
>>>     attribute_id int,
>>>     attribute    text
>>> );
>>>
>>> create table value (
>>>     value_id int,
>>>     value    text
>>> );
>>>
>>> create table attribute_value (
>>>     entity_id    int,
>>>     attribute_id int,
>>>     value_id     int
>>> );
>>>
>>> give you a lot fewer pages to load than building a table with, say,
>>> 90 columns in it that are all null, which would result in better
>>> rather than worse performance?
>>
>> But you're giving us a choice between two bad methodologies.
>>
>> Properly normalized, you'd not have a table with 90 nullable
>> columns, but a set of related tables where you'd only need to store
>> things in the subordinate tables for the relevant data points.
>>
>> The worst thing about EAV is that it makes it very hard to figure
>> out what the heck is going on just by looking at the database
>> schema. It's easy to develop and hard to maintain. We had a person
>> do something like that at the last place I worked, and it took weeks
>> for a new developer to figure it out and replace it with a properly
>> relational model, because there were little corner cases all through
>> the code that kept popping up.
>
> I just thought of another problem: the system can have multiple
> values for a single attribute. How do you normalise that without
> basically adding a link table that's just the same thing as given
> below? (I know there are array types in PostgreSQL, but there aren't
> in other DBs, and I'm a fan of keeping products as DB-neutral as
> possible.)

No, it wouldn't be the same thing. You're confusing mechanics with
semantics. In that case the data would have meaning and context
inherent in the schema relationship between the two tables that would
have the one-to-many relationship you describe. In addition, you'd be
able to take advantage of foreign keys and other kinds of constraints
to enforce data integrity. Those are the intended goals (at a high
level) of relational data theory in the first place.

Seriously, though, go pick up a book on relational database design;
most cover this very topic at some point.

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
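A sketch of the one-to-many shape Erik describes (the names are
hypothetical): the table may look mechanically like a link table, but
the attribute keeps its name, type, and constraints, unlike a generic
EAV slot:

create table athlete (
    athlete_id serial primary key,
    name       text not null
);

-- many e-mail addresses per athlete, each row typed and constrained:
create table athlete_email (
    athlete_id int  not null references athlete(athlete_id),
    email      text not null,
    primary key (athlete_id, email)   -- no duplicate addresses
);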
On Feb 2, 15:15, lew...@rocketmail.com (Lewis Cunningham) wrote:
> --- vladimir konrad <v...@dsl.pipex.com> wrote:
>
> > I think I understand basic relational theory, but then I had an
> > idea.
> > Basically, instead of adding a field to a table every time there is
> > a need for it, have the table split in two: one part holds the
> > identity (id) and one holds the attributes (linked to this id).
> > Basically, if in the future the user decides that the subject
> > should have a new attribute, he can simply add an "attribute
> > definition" and an attribute definition set (if any), and the
> > application would handle
>
> Basically, you would be creating your own data dictionary (i.e.
> system catalog) on top of the db data dictionary. The database
> already comes with a way to easily add columns: DDL. I have seen
> newbie database designers reinvent this method a hundred times. The
> performance hits and the complexity of querying the data would far
> outweigh any perceived maintenance gain.
>
> My .02.
>
> LewisC
>
> Lewis R Cunningham
> An Expert's Guide to Oracle Technology
> http://blogs.ittoolbox.com/oracle/guide/
> LewisC's Random Thoughts
> http://lewiscsrandomthoughts.blogspot.com/

I always thought that having nullable columns in a table is a Bad
Thing (http://technet.microsoft.com/en-us/library/ms191178.aspx) and
shows that you are trying to put different types of entities into the
same table - having 90 of them in a table ... brrrrr.

I think it's much better to avoid it whenever you have the info, but
when you don't, you just have to use the EAV model. E.g. if I knew what
info I wanted to store on a person, I could create columns for that;
but since in our application users create the questionnaires that are
used to store info on persons, I see little choice - I must have a
subjectID, questionID, value table.

SWK
In article <33c6269f0802032014i3878ec3co4488b4835ef1e3d8@mail.gmail.com>,
Alex Turner <armtuk@gmail.com> wrote:

% I'm not a database expert, but wouldn't
%
% create table attribute (
%     attribute_id int,
%     attribute    text
% );
%
% create table value (
%     value_id int,
%     value    text
% );
%
% create table attribute_value (
%     entity_id    int,
%     attribute_id int,
%     value_id     int
% );
%
% give you a lot fewer pages to load than building a table with, say,
% 90 columns in it that are all null, which would result in better
% rather than worse performance?

Suppose you want one row of data. Say it's one of the ones where the
columns aren't all nulls. You look up 90 rows in attribute_value, then
90 rows in attribute, then 90 rows in value. You're probably looking at
3-6 pages of index data, and then somewhere between 3 and 270 pages of
data from the database, for one logical row of data.
--
Patrick TJ McPhee
North York  Canada
ptjm@interlog.com
On Feb 3, 2008 11:14 PM, Alex Turner <armtuk@gmail.com> wrote:
> I'm not a database expert, but wouldn't
>
> create table attribute (
>     attribute_id int,
>     attribute    text
> );
>
> create table value (
>     value_id int,
>     value    text
> );
>
> create table attribute_value (
>     entity_id    int,
>     attribute_id int,
>     value_id     int
> );
>
> give you a lot fewer pages to load than building a table with, say,
> 90 columns in it that are all null, which would result in better
> rather than worse performance?

Definitely not. 90 null values will require about 12 bytes of memory to
represent their absence in the "all in one" table. That's not very much
space.

In contrast, if you need to join out to 80 tables, possibly folded into
some smaller number, you'll *at least* have an index scan, reading a
few pages of data from the secondary table, and then need to read the
pages containing those values that *are* joined in. That quickly grows
to way more than 12 bytes :-)
--
http://linuxfinances.info/info/linuxdistributions.html
"The definition of insanity is doing the same thing over and over and
expecting different results."
-- assortedly attributed to Albert Einstein, Benjamin Franklin, Rita
Mae Brown, and Rudyard Kipling
> On Feb 2, 15:15, lew...@rocketmail.com (Lewis Cunningham) wrote:
> > --- vladimir konrad <v...@dsl.pipex.com> wrote:
> > > I think I understand basic relational theory, but then I had an
> > > idea.
> > > Basically, instead of adding a field to a table every time there
> > > is a need for it, have the table split in two: one part holds the
> > > identity (id) and one holds the attributes (linked to this id).
> > > Basically, if in the future the user decides that the subject
> > > should have a new attribute, he can simply add an "attribute
> > > definition" and an attribute definition set (if any), and the
> > > application would handle
> >
> > Basically, you would be creating your own data dictionary (i.e.
> > system catalog) on top of the db data dictionary. The database
> > already comes with a way to easily add columns: DDL. I have seen
> > newbie database designers reinvent this method a hundred times. The
> > performance hits and the complexity of querying the data would far
> > outweigh any perceived maintenance gain.
> >
> > My .02.
> >
> > LewisC
> >
> > Lewis R Cunningham
> > An Expert's Guide to Oracle Technology
> > http://blogs.ittoolbox.com/oracle/guide/
> > LewisC's Random Thoughts
> > http://lewiscsrandomthoughts.blogspot.com/
>
> I always thought that having nullable columns in a table is a Bad
> Thing (http://technet.microsoft.com/en-us/library/ms191178.aspx) and
> shows that you are trying to put different types of entities into the
> same table - having 90 of them in a table ... brrrrr.
> I think it's much better to avoid it whenever you have the info, but
> when you don't, you just have to use the EAV model. E.g. if I knew
> what info I wanted to store on a person, I could create columns for
> that; but since in our application users create the questionnaires
> that are used to store info on persons, I see little choice - I must
> have a subjectID, questionID, value table.
>
> SWK

I use it in the same manner: at the time of recording, I just know one
kind of entity (words) with a value as varchar. It might be better done
with XML, but I have long used SQL to do this job.
SunWuKung wrote:
> I always thought that having nullable columns in a table is a Bad
> Thing (http://technet.microsoft.com/en-us/library/ms191178.aspx) and

Ridiculous. The argument provided in that article is specious and
likely SQL Server-specific. NULLable columns should occur wherever your
data model calls for them, typically when you want a marker for
"unknown" data. The advice in that article, to move NULLable columns
off to a separate table, will actually cause worse, manual "special
handling that increases the complexity of data operations" than the
built-in and optimized handling the engine provides for NULLs. You
should ignore this terrible advice.

> shows that you are trying to put different types of entities into the
> same table - having 90 of them in a table ... brrrrr.

Is that a technical evaluation? As another respondent stated upthread,
90 NULLable columns is possibly a sign of a bad data model.

> I think it's much better to avoid it whenever you have the info, but
> when you don't, you just have to use the EAV model.

Also ridiculous. You should never "have to use" the EAV so-called
"model".

> E.g. if I knew what info I wanted to store on a person, I could
> create columns for that; but since in our application users create
> the questionnaires that are used to store info on persons, I see
> little choice - I must have a subjectID, questionID, value table.

That's not EAV. When you're modeling a questionnaire, "subject",
"question" and "answer" (as I interpret your meaning for "value" here)
is natural. EAV would have "question" as a value in a column, not as
the name of a column as you suggest.

It's very hard to actually think in EAV. The mind naturally thinks of
things like "question" being a column, but in EAV it wouldn't be;
"question" would be a value of a generic column in some row that
represents a fragment of the question being described. The difficulty
of conceptualizing data structures as EAV is one of the big strikes
against it. The quoted citation evidences that difficulty quite well -
even trying to come up with an example of an EAV structure wound up
producing a non-EAV description.

--
Lew
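A sketch of the distinction Lew draws (the names are illustrative, and
it assumes a subject table like the one upthread): here "question" is a
named relation with real foreign keys, not a generic attribute slot, so
the usual integrity machinery still applies even though users define
questions at runtime:

create table question (
    question_id serial primary key,
    wording     text not null
);

create table answer (
    subject_id  int  not null references subject(id),
    question_id int  not null references question(question_id),
    value       text not null,
    primary key (subject_id, question_id)  -- one answer per question
);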
On Sun, 2008-02-10 at 03:08 -0500, Lew wrote:
> SunWuKung wrote:
> > I always thought that having nullable columns in a table is a Bad
> > Thing (http://technet.microsoft.com/en-us/library/ms191178.aspx)
> > and
>
> Ridiculous. The argument provided in that article is specious and
> likely SQL Server-specific. NULLable columns should occur wherever
> your data model calls for them, typically when you want a marker for
> "unknown" data. The advice in that article, to move NULLable columns
> off to a separate table, will actually cause worse, manual "special
> handling that increases the complexity of data operations" than the
> built-in and optimized handling the engine provides for NULLs. You
> should ignore this terrible advice.

I disagree that it's ridiculous. There are good arguments for avoiding
NULLs, not the least of which is that they can lead to very unintuitive
results from queries due to 3VL (three-valued logic). I think the
passage in question (from the above URL) is reasonably good advice.
They recommend vertical partitioning to avoid NULLs, and I think that
is a very good design strategy in many cases.

> > I think it's much better to avoid it whenever you have the info,
> > but when you don't, you just have to use the EAV model.
>
> Also ridiculous. You should never "have to use" the EAV so-called
> "model".

Agreed.

> question being described. The difficulty of conceptualizing data
> structures as EAV is one of the big strikes against it. The quoted
> citation evidences

Agreed. It has many other strikes as well: for instance, it's difficult
to form any kind of meaningful predicate for a relation in an EAV
design.

Regards,
	Jeff Davis