Thread: Database structure
Hello,

I would like advice on the following problem.

I have a table of patients. Each patient can undergo different biological assessments, and each assessment is always decomposed into different laboratory tests. A laboratory test consists of a test number and two values coming from analysers.

The schema is:

Patients(#patient_nr, name, etc...)
Assessment(#assessment_nr, #patient_nr, assessment_type, other useful values)
Assessment_types(assessment_type, labtest_nr)

An assessment is composed of different tests; say assessment type 1 is composed of lab tests 1, 2, 3 and 5, and assessment type 2 of lab tests 10 to 70.

I have an assessment with 60 different lab tests (always the same ones). There are two ways of storing the values:

1 - a table with 120 columns for the two values:
results(#assessment_nr, p10, d10, p11, d11, ..., p70, d70)
where 10 to 70 represent the lab test numbers.

2 - a table with 60 rows per assessment:
results(#assessment_nr, labtest_nr, p, d)
where p and d are my two results.

Here comes my question: which of the two would you choose?

The first solution has the advantage of returning a single row for one complete assessment, which makes statistics easy. But if I have to modify the composition of an assessment (which happens very rarely), I shall have to use an ALTER TABLE statement. And as I have 4 different assessment types, I would have to create four different tables, one per assessment type.

The second solution is normalized and more elegant, but I am concerned about the size of the table. For one assessment, I'll store 60 rows with only two useful values in each, and you must add the size of the index on top. With 25,000 assessments a year, that makes 1,500,000 rows of only 4 columns, 2 of them for the results and 2 for identification. I would like to keep 10 years online, so 15,000,000 rows. What about the size of the index?

Any advice? Thank you in advance.

Alain Reymond

(I hope that this is clear enough despite my bad English.)
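For concreteness, the two layouts could be declared roughly like this (the column types and the name results_wide are assumptions; the post does not give them):

-- Option 1: one wide row per assessment, two columns per lab test
CREATE TABLE results_wide (
    assessment_nr integer PRIMARY KEY REFERENCES assessment,
    p10 real, d10 real,
    p11 real, d11 real,
    -- ... and so on up to ...
    p70 real, d70 real
);

-- Option 2: one narrow row per (assessment, lab test) pair
CREATE TABLE results (
    assessment_nr integer NOT NULL REFERENCES assessment,
    labtest_nr    integer NOT NULL,
    p             real,
    d             real,
    PRIMARY KEY (assessment_nr, labtest_nr)
);

In option 2 the composite primary key doubles as the index needed to fetch one complete assessment.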
I would go for the second one. I don't think the size of the table is a problem; you just have to create the right indexes so joins stay easy.

One observation on "For one assessment, I'll store 60 rows with only two useful integers in it": why? Better to make a lab_test table that lists the lab tests, and to write into results(#assessment_nr, labtest_nr, p, d) only the data you actually have. For example, if you have assessment no. 3000 and only have results for lab tests 10 to 40, why also write rows for lab tests 41 to 70 (which you don't have) into the database? (If I misunderstood this, sorry for the observation.)

The second option is also better if you ever change the lab test list, which you have to consider if the database is meant to last at least 10 years. With the first solution you would have to keep adding new columns, which is not the best option; with the second you just add a new ID to the lab_test list and you are done. No problems. If you go for the first option and later have to change something in the results table, it won't be easy.

ALTER TABLE is not as tragic as it seems. Use constraints, and never delete from the database.

So, my final answer: the second option.

Best regards,
Andy
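Concretely, Andy's lab_test suggestion might look like this (the description column and the constraint name are assumptions):

-- Lookup table listing the known lab tests
CREATE TABLE lab_test (
    labtest_nr  integer PRIMARY KEY,
    description text
);

-- Each result row must then name an existing lab test
ALTER TABLE results
    ADD CONSTRAINT results_labtest_fk
    FOREIGN KEY (labtest_nr) REFERENCES lab_test;

-- Changing the composition of an assessment becomes an INSERT, not an ALTER TABLE:
INSERT INTO lab_test (labtest_nr, description) VALUES (71, 'new analyser value');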
Thank you for your answer. The more I think about it, the more I find the second option better.

Just one clarification: all tests are always done, so all columns are always filled with a result. My only worry was size and performance. I store only a few useful bytes per row with a lot of overhead (#assessment_nr, labtest_nr) for just one integer and one real. And I can have up to 1,500,000 rows per year with at least 10 years online... which means big indexes.

Regards,

Alain Reymond
CEIA
Bd Saint-Michel 119
1040 Bruxelles
Tel: +32 2 736 04 58
Fax: +32 2 736 58 02
alain.reymond@ceia.com
PGP key at http://pgpkeys.mit.edu:11371
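The single row per assessment that made option 1 attractive for statistics can still be produced from the narrow table; a sketch using the usual conditional-aggregation idiom (extend with one p/d pair per lab test):

SELECT assessment_nr,
       max(CASE WHEN labtest_nr = 10 THEN p END) AS p10,
       max(CASE WHEN labtest_nr = 10 THEN d END) AS d10,
       max(CASE WHEN labtest_nr = 11 THEN p END) AS p11,
       max(CASE WHEN labtest_nr = 11 THEN d END) AS d11
       -- ... one pair per lab test, up to p70/d70 ...
FROM results
GROUP BY assessment_nr;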
I would definitely say solution two.

As you point out yourself, there are only four int4s (probably even int2s) per row; at 4 bytes each plus the per-row overhead, that puts a row somewhere in the 40-50 byte range without indexes. For 15 million rows, that is not much more than 750 megabytes before indexes, which I believe take up roughly the same amount of space again. That comes to around 1.5 GB of data, which I still consider a reasonably sized database. What I work on daily is already in the 1 GB range and grows by 25-30 megabytes a day, and we use large (3,000 bytes or longer), complex queries (more than 10 subselects, using inner, outer and cross joins) that return around 3,000 rows each. These run in a matter of 2 minutes on a single-CPU 2 GHz system with an IDE disk (no RAID) and only half a gigabyte of memory.

A good starting point when asking for help is the size of each table (in my opinion the number of digits is usually close enough), the complexity of the queries (how many tables, subselects, types of joins, uses of aggregates and so on), and finally what counts as "fast enough": for a website, 2 seconds may be fast enough; for a cron job run once a month, 3 days might be.

In the long run, being correct is usually better than being fast (at the point of implementation), as new hardware easily solves bottlenecks for problems that do not scale exponentially.

Svenne
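Svenne's arithmetic can be checked against the catalog once data is loaded; a sketch (results_pkey is the name PostgreSQL would generate for the primary-key index; relpages counts 8 kB pages and is refreshed by VACUUM/ANALYZE):

-- 15,000,000 rows * ~50 bytes/row is roughly 750 MB of heap;
-- the primary-key btree should be of the same order of magnitude.
SELECT relname, relpages, relpages * 8 / 1024 AS approx_mb
FROM pg_class
WHERE relname IN ('results', 'results_pkey');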
On Tue, May 04, 2004 at 18:10:13 +0200, Svenne Krap <svenne@krap.dk> wrote:
> In the long run, being correct is usually better than being fast (at the
> point of implementation), as new hardware easily solves bottlenecks for
> problems that do not scale exponentially.

And it isn't even clear that denormalizing the schema would result in an increase in speed. If at some point the tests in various assessments can overlap, you may not want a separate table for each assessment type.

I also noticed that the schema doesn't enforce consistency between the tests actually done and the assessment type being recorded. This may not really be a business rule so much as something the application should flag for attention: I can see cases where in reality the wrong test is done, and recording its results may be better than throwing the data away.
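If one did want the database itself to enforce that rule (keeping in mind Bruno's point that an application-level warning may serve better), composite foreign keys are one way; a sketch, where the redundant assessment_type column in results and the UNIQUE constraints are assumptions, not part of the original schema:

-- The referenced column pairs must be unique before they can be FK targets
ALTER TABLE assessment
    ADD CONSTRAINT assessment_nr_type_uniq UNIQUE (assessment_nr, assessment_type);
ALTER TABLE assessment_types
    ADD CONSTRAINT assessment_type_test_uniq UNIQUE (assessment_type, labtest_nr);

-- Carry assessment_type redundantly in results so both pairs can be checked
ALTER TABLE results ADD COLUMN assessment_type integer;

ALTER TABLE results
    ADD CONSTRAINT results_assessment_fk
    FOREIGN KEY (assessment_nr, assessment_type)
    REFERENCES assessment (assessment_nr, assessment_type);

ALTER TABLE results
    ADD CONSTRAINT results_valid_test_fk
    FOREIGN KEY (assessment_type, labtest_nr)
    REFERENCES assessment_types (assessment_type, labtest_nr);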