Re: My honours project - databases using dynamically attached entity-properties - Mailing list pgsql-hackers

From David Fetter
Subject Re: My honours project - databases using dynamically attached entity-properties
Date
Msg-id 20070314142443.GA8454@fetter.org
Whole thread Raw
In response to Re: My honours project - databases using dynamically attached entity-properties  ("Andrew Hammond" <andrew.george.hammond@gmail.com>)
Responses Re: My honours project - databases using dynamically attached entity-properties  (Gregory Stark <stark@enterprisedb.com>)
Re: My honours project - databases using dynamically attached entity-properties  (Eddie Stanley <eddiewould@paradise.net.nz>)
List pgsql-hackers
On Wed, Mar 14, 2007 at 12:07:45PM +1300, Edward Stanley wrote:
> On Wed, 14 Mar 2007, David Fetter wrote:
> > On Tue, Mar 13, 2007 at 05:54:34PM +0000, Richard Huxton wrote:
> > > David Fetter wrote:
> > > >On Tue, Mar 13, 2007 at 02:21:37PM +0000, Richard Huxton wrote:
> > > >>David Fetter wrote:
> > > >>>On Tue, Mar 13, 2007 at 09:31:45AM +0000, Richard Huxton wrote:
> > > >>>>* Another good example is the "questionnaire".
> > > >>>
> > > >>>With all due respect, this is a solved problem *without EAV or
> > > >>>run-time DDL*.  The URL below has one excellent approach to this.
> > > >>>
> > > >>><http://www.varlena.com/GeneralBits/110.php>
> > > >>
> > > >>Which broadly speaking was the solution I used for my
> > > >>questionnaire, except I had a restricted set of types so basically
> > > >>just coerced them to text and side-stepped the inheritance issue.
> > > >>To the extent that it's dynamic, it's still just EAV though.
> > > >
> > > >That's precisely the difference between the above solution and
> > > >yours, and it's the difference between a good design and one that
> > > >will come up and bit you on the as^Hnkle.
> > >
> > > It's still basically EAV (either approach).  The key fault with EAV
> > > is that the tables have no semantic meaning - answer_int contains
> > > number of oranges, days since birth and the price of a tube ticket
> > > in pennies.
> >
> > Stuffing all of those into an answer_int is *precisely* what the end
> > user must not do.  That's pilot error.
> >
> > > Now, with a questionnaire that might not matter because everything
> > > is an "answer" and you're not necessarily going to do much more than
> > > count/aggregate it.
> >
> > See above.
> >
> > > >>It doesn't remove the need for run-time DDL if you allow users to
> > > >>add their own questions.
> > > >
> > > >Sure it does.   When a user, who should be talking with you, wants
> > > >to ask a new kind of question, that's the start of a discussion
> > > >about what new kind(s) of questions would be generally applicable
> > > >in the questionnaire schema.  Then, when you come to an agreement,
> > > >you roll it into the new schema, and the whole system gets an
> > > >improvement.
> > >
> > > Fine, but if you're not letting the user extend the system, then
> > > it's not really addressing Edward's original posting, is it?
> >
> > It's my contention that Edward's original idea is ill-posed.  SQL is
> > just fine for doing this kind of thing, and it's *not that hard*.
> >
> > > If the user's talking to me, I might as well just write the DDL
> > > myself - it's the talk that'll take the time, not writing a dozen
> > > lines of SQL.
> >
> > It's the talk that's the important part.  Machines are really bad at
> > seeing the broader picture.  In the attempt to "save" a few minutes'
> > discussion, he's trying to borrow that time from a system asked to do
> > things that computers are inherently bad at doing, and every end user
> > will pay that time back at a very high rate of interest.  This is
> > precisely the kind of false economy that so plagues software
> > development and maintenance these days.
> >
> > > The interesting part of the problem (from a Comp-Sci point of view)
> > > is precisely in automating part of that discussion.  It's providing
> > > an abstraction so that you don't end up with a mass of attributes
> > > while still providing freedom to the user.
> >
> > This freedom and efficiency you're talking about is better supplied,
> > IMHO, by putting a standard DDL for questionnaires up on a pgfoundry
> > or an SF.net.  That way, improvements to the DDL get spread all over
> > the world, and a very large amount of wheel reinvention gets avoided.
> > Reusable components are a big chunk of both freedom and efficiency. :)
> >
> > Cheers,
> > D
> 
> Maybe I should rethink the problem a bit - from the very brief
> initial research I've done, it seems EAV schemas have two common
> uses: 

> 1) When new attributes have to be created on-the-fly 
> 2) When the number of possible properties for an entity greatly (orders of 
> magnitude) exceeds the number of properties any one entity is likely to have. 

Um, no.  The first use case is bad coding practice, and the second is
a classic case for a join table, which is the standard way to handle
M:N relationships.

> I'm not sure about solving the first problem - there seems to be a lot of 
> debate around this. I can see reasons for and against allowing this. However 
> I think the second is a very real problem. One such example is a patient 
> record system.
> 
> For each patient we have a table of common data (dob, sex, height, weight etc) 
> but as well as this a patient can present with many symptoms. This might be a 
> table of 40,000 possible symptoms. 

Here's how I'd do that:

CREATE TABLE patient (   patient_id SERIAL PRIMARY KEY, /* for simplicity.  Some
combinationof columns in the                                     table would also have a UNIQUE
           NOT NULL constraint on it.                                   */   ...
 
);

CREATE TABLE symptom (   symptom_id SERIAL PRIMARY KEY, /* See above. */   ...
);

CREATE TABLE patient_presents_with (   patient_id INTEGER NOT NULL REFERENCES patient(patient_id),   symptom_id INTEGER
NOTNULL REFERENCES symptom(symptom_id),   UNIQUE(patient_id, symptom_id)
 
);

> Lets say we want to run a query on these symptoms (using a boolean expression) 

I'd use something like the following:

SELECT   p.patient_id,   p.f_name,   p.l_name,   s.symptom_name,   s.symptom_desc
FROM   patient p
JOIN   patient_presents_with ppw   USING (patient_id)
JOIN   symptom s   USING (symptom_id)
WHERE   s.symptom_name = ALL('foo','bar','baz')
AND   s.symptom_name = ANY('quux','fleeg');

> to return the patient records which match the query string on the symptoms.
> 
> (This turns out to be a very similar problem to the 'tags' example I first 
> presented) - assume a similar schema. With more than a couple of symptoms and 
> a complex tree, the resulting SQL can span pages. 

Not really.  See above :)

> When I first started thinking about this project I believed the two problems 
> essentially to be the same class of problem, but this may not be the case.

EAV will bite you.  It's not *that* much work to keep its from biting
you. :)

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666                             Skype: davidfetter

Remember to vote!
Consider donating to PostgreSQL: http://www.postgresql.org/about/donate


pgsql-hackers by date:

Previous
From: Teodor Sigaev
Date:
Subject: Re: need help in understanding gist function
Next
From: Gregory Stark
Date:
Subject: Re: My honours project - databases using dynamically attached entity-properties