Re: My honours project - databases using dynamically attached entity-properties - Mailing list pgsql-hackers

From Eddie Stanley
Subject Re: My honours project - databases using dynamically attached entity-properties
Date
Msg-id 45F84C0B.9030306@paradise.net.nz
Whole thread Raw
In response to Re: My honours project - databases using dynamically attached entity-properties  (David Fetter <david@fetter.org>)
List pgsql-hackers
David Fetter wrote:
> On Wed, Mar 14, 2007 at 12:07:45PM +1300, Edward Stanley wrote:
>   
>> On Wed, 14 Mar 2007, David Fetter wrote:
>>     
>>> On Tue, Mar 13, 2007 at 05:54:34PM +0000, Richard Huxton wrote:
>>>       
>>>> David Fetter wrote:
>>>>         
>>>>> On Tue, Mar 13, 2007 at 02:21:37PM +0000, Richard Huxton wrote:
>>>>>           
>>>>>> David Fetter wrote:
>>>>>>             
>>>>>>> On Tue, Mar 13, 2007 at 09:31:45AM +0000, Richard Huxton wrote:
>>>>>>>               
>>>>>>>> * Another good example is the "questionnaire".
>>>>>>>>                 
>>>>>>> With all due respect, this is a solved problem *without EAV or
>>>>>>> run-time DDL*.  The URL below has one excellent approach to this.
>>>>>>>
>>>>>>> <http://www.varlena.com/GeneralBits/110.php>
>>>>>>>               
>>>>>> Which broadly speaking was the solution I used for my
>>>>>> questionnaire, except I had a restricted set of types so basically
>>>>>> just coerced them to text and side-stepped the inheritance issue.
>>>>>> To the extent that it's dynamic, it's still just EAV though.
>>>>>>             
>>>>> That's precisely the difference between the above solution and
>>>>> yours, and it's the difference between a good design and one that
>>>>> will come up and bit you on the as^Hnkle.
>>>>>           
>>>> It's still basically EAV (either approach).  The key fault with EAV
>>>> is that the tables have no semantic meaning - answer_int contains
>>>> number of oranges, days since birth and the price of a tube ticket
>>>> in pennies.
>>>>         
>>> Stuffing all of those into an answer_int is *precisely* what the end
>>> user must not do.  That's pilot error.
>>>
>>>       
>>>> Now, with a questionnaire that might not matter because everything
>>>> is an "answer" and you're not necessarily going to do much more than
>>>> count/aggregate it.
>>>>         
>>> See above.
>>>
>>>       
>>>>>> It doesn't remove the need for run-time DDL if you allow users to
>>>>>> add their own questions.
>>>>>>             
>>>>> Sure it does.   When a user, who should be talking with you, wants
>>>>> to ask a new kind of question, that's the start of a discussion
>>>>> about what new kind(s) of questions would be generally applicable
>>>>> in the questionnaire schema.  Then, when you come to an agreement,
>>>>> you roll it into the new schema, and the whole system gets an
>>>>> improvement.
>>>>>           
>>>> Fine, but if you're not letting the user extend the system, then
>>>> it's not really addressing Edward's original posting, is it?
>>>>         
>>> It's my contention that Edward's original idea is ill-posed.  SQL is
>>> just fine for doing this kind of thing, and it's *not that hard*.
>>>
>>>       
>>>> If the user's talking to me, I might as well just write the DDL
>>>> myself - it's the talk that'll take the time, not writing a dozen
>>>> lines of SQL.
>>>>         
>>> It's the talk that's the important part.  Machines are really bad at
>>> seeing the broader picture.  In the attempt to "save" a few minutes'
>>> discussion, he's trying to borrow that time from a system asked to do
>>> things that computers are inherently bad at doing, and every end user
>>> will pay that time back at a very high rate of interest.  This is
>>> precisely the kind of false economy that so plagues software
>>> development and maintenance these days.
>>>
>>>       
>>>> The interesting part of the problem (from a Comp-Sci point of view)
>>>> is precisely in automating part of that discussion.  It's providing
>>>> an abstraction so that you don't end up with a mass of attributes
>>>> while still providing freedom to the user.
>>>>         
>>> This freedom and efficiency you're talking about is better supplied,
>>> IMHO, by putting a standard DDL for questionnaires up on a pgfoundry
>>> or an SF.net.  That way, improvements to the DDL get spread all over
>>> the world, and a very large amount of wheel reinvention gets avoided.
>>> Reusable components are a big chunk of both freedom and efficiency. :)
>>>
>>> Cheers,
>>> D
>>>       
>> Maybe I should rethink the problem a bit - from the very brief
>> initial research I've done, it seems EAV schemas have two common
>> uses: 
>>     
>
>   
>> 1) When new attributes have to be created on-the-fly 
>> 2) When the number of possible properties for an entity greatly (orders of 
>> magnitude) exceeds the number of properties any one entity is likely to have. 
>>     
>
> Um, no.  The first use case is bad coding practice, and the second is
> a classic case for a join table, which is the standard way to handle
> M:N relationships.
>
>   
>> I'm not sure about solving the first problem - there seems to be a lot of 
>> debate around this. I can see reasons for and against allowing this. However 
>> I think the second is a very real problem. One such example is a patient 
>> record system.
>>
>> For each patient we have a table of common data (dob, sex, height, weight etc) 
>> but as well as this a patient can present with many symptoms. This might be a 
>> table of 40,000 possible symptoms. 
>>     
>
> Here's how I'd do that:
>
> CREATE TABLE patient (
>     patient_id SERIAL PRIMARY KEY, /* for simplicity.  Some
>                                       combination of columns in the
>                                       table would also have a UNIQUE
>                                       NOT NULL constraint on it.
>                                     */
>     ...
> );
>
> CREATE TABLE symptom (
>     symptom_id SERIAL PRIMARY KEY, /* See above. */
>     ...
> );
>
> CREATE TABLE patient_presents_with (
>     patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
>     symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
>     UNIQUE(patient_id, symptom_id)
> );
>
>   
>> Lets say we want to run a query on these symptoms (using a boolean expression) 
>>     
>
> I'd use something like the following:
>
> SELECT
>     p.patient_id,
>     p.f_name,
>     p.l_name,
>     s.symptom_name,
>     s.symptom_desc
> FROM
>     patient p
> JOIN
>     patient_presents_with ppw
>     USING (patient_id)
> JOIN
>     symptom s
>     USING (symptom_id)
> WHERE
>     s.symptom_name = ALL('foo','bar','baz')
> AND
>     s.symptom_name = ANY('quux','fleeg');
>
>   
Are the ALL and ANY functions new to Postgresql 8? I haven't met them 
before.
Anyway this will work for some queries but not others.

What about the following expression? ('foo' && 'bar') | ('baz' && ! 
'quxx') | 'fleeg'
Maybe I have misunderstood how these functions work, but I don't think 
they will handle anything but trivial examples of this problem.

>> to return the patient records which match the query string on the symptoms.
>>
>> (This turns out to be a very similar problem to the 'tags' example I first 
>> presented) - assume a similar schema. With more than a couple of symptoms and 
>> a complex tree, the resulting SQL can span pages. 
>>     
>
> Not really.  See above :)
>
>   
>> When I first started thinking about this project I believed the two problems 
>> essentially to be the same class of problem, but this may not be the case.
>>     
>
> EAV will bite you.  It's not *that* much work to keep its from biting
> you. :)
>
> Cheers,
> D
>   



pgsql-hackers by date:

Previous
From: Trent Shipley
Date:
Subject: Re: My honours project - databases using dynamically attached entity-properties
Next
From: Alvaro Herrera
Date:
Subject: Re: how to add seconds to a TimestampTz