Designing a DB for storing biological data

From: Damir Dezeljin
Hello.

This is more of a theoretical, or better said, conceptual question; still, I hope to get some feedback from you folks. Additionally, this is going to be a very long post :) Off-topic: I asked a similar question on the MySQL forum, as I'm still undecided between PostgreSQL and MySQL << I'm tempted by PostGIS.

I am designing a database for storing various biological and ecological data. Although there is no clear dividing line, it is possible to divide the data into two groups, namely measured (physical) data and quantitative (mostly biological) data; I uploaded both a data sample and an initial draft of a DB model to this link <https://www.dropbox.com/sh/9gm2ezwrwhkz6xv/AAB3koD6Xzi48-2BhIEdwmlZa>.

From the mentioned sample, the following differences between the two become evident:
Biological / quantitative data
  • The data are actually counts of occurrences of specific items, namely animal and plant species. The counting is done by following a predefined method, e.g. number of specimens per 100 m^2.
  • A sampling consists of counting multiple species on a single day, at a predefined location, following a predefined method. Please note the counting may be repeated multiple times for a single species using the same or a different method.
  • A typical number of different species counted per sampling is something between 15 and 100.
  • Data are mostly quantitative, meaning they consist mostly of integer values; however, this does not apply to all cases.

Measured / physical data
  • These data comprise a set of measured physical quantities such as temperature, salinity, DI, etc. (usually up to 15 or 20 quantities). The measurements are performed on water samples taken from different depths at a predefined location on a predefined date and time. Although the water samples from different depths at a single location are taken a couple of minutes apart from one another, it would help to track them as a single profile, which basically consists of the data of the analyzed samples from a single location at a specific time.
  • Most data are decimal numbers of a certain precision - e.g. if the instrument provides information accurate to the first decimal place, it has to be stored with precision up to the first decimal place. In contrast, the salinity in the sample available at the link above is measured accurately to the third decimal place, so it makes sense to store it and make it retrievable accurate to the third decimal place.
    I was also considering storing depth as a NUMERIC to avoid the inexactness of REAL or DOUBLE -> from MySQL I have the concern that two FLOATs (REAL in PostgreSQL) both holding 3.4 can't be compared in a query like value1 = value2 -> e.g. "... WHERE depth = 3.4;" (see the sketch after this list).
    Am I missing something, or is there a better way to address such cases?
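
To make the concern concrete, here is a minimal psql sketch (depth_test is just a throwaway table name I'm using for illustration):

-- REAL stores a binary approximation of 3.4, so equality against
-- the literal 3.4 silently matches nothing; NUMERIC stores it exactly.
CREATE TABLE depth_test (depth_real real, depth_num numeric(6,1));
INSERT INTO depth_test VALUES (3.4, 3.4);

SELECT * FROM depth_test WHERE depth_real = 3.4;  -- returns 0 rows
SELECT * FROM depth_test WHERE depth_num  = 3.4;  -- returns the row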

General notes
  • Physical quantities may be outside the detection range of the measuring instrument; in such a case, this needs to be recorded. I still do not have a clear idea how to do it. NULLs do not seem to be a good choice to mark such data.
  • Different quantities are measured with different precision - e.g. counted quantities don’t have decimal places; some instruments report data with 1 decimal digit precision, others with 2, etc.
  • The only quantity that is always present in all recorded data is the depth at which the sample was taken.
  • I use a RESTful interface as a middle layer between the DB and the GUI.


Finally, here is my dilemma
I am somewhat undecided about the best way to implement the database and, consequently, what kind of queries to use. The database model I am currently working on can be found at the link above. Looking at the diagram, it becomes evident I am deciding whether to store every measurement / determinant / depth triple as a separate record. My biggest dilemma is that a query for a simple sample of pressure, temperature, salinity and oxygen would imply multiple joins. As far as I know, this will badly affect performance; it will also make coding the RESTful interface harder.

The other option I considered and have not yet discarded is adapting tables to specific needs. In that case, storing data from a CTD (Conductivity / Temperature / Depth) probe would result in a table row containing: depth, conductivity, salinity, temperature. Such an approach rather makes sense; however, I'll end up with tons of tables that at some point in the future may need to be extended with additional columns.
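
For illustration, such a probe-specific table might look something like this (a sketch only; the names ctd_measurement and profile_id are invented):

-- One row per depth per cast; adding a new measured quantity
-- later means altering the table or creating yet another one.
CREATE TABLE ctd_measurement (
    profile_id   integer      NOT NULL,
    depth        numeric(6,1) NOT NULL,
    conductivity numeric(8,3),
    salinity     numeric(8,3),
    temperature  numeric(5,2),
    PRIMARY KEY (profile_id, depth)
);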

I would appreciate any advice and hints.

Thanks and best regards,
 Damir

Re: Designing a DB for storing biological data

From: Adrian Klaver
On 06/14/2014 08:52 AM, Damir Dezeljin wrote:
> Hello.
>
> This is more of a theoretical, or better said, conceptual question;
> still, I hope to get some feedback from you folks. Additionally, this is
> going to be a very long post :) Off-topic: I asked a similar question on
> the MySQL forum, as I'm still undecided between PostgreSQL and MySQL <<
> I'm tempted by PostGIS.
>
> I am designing a database for storing various biological and ecological
> data. Although there is no clear dividing line, it is possible to divide
> the data into two groups, namely measured (physical) data and
> quantitative (mostly biological) data; I uploaded both a data sample and
> an initial draft of a DB model to this link
> <https://www.dropbox.com/sh/9gm2ezwrwhkz6xv/AAB3koD6Xzi48-2BhIEdwmlZa>.
>
>  From the mentioned sample, the following differences between the two
> become evident:
> *Biological / quantitative data*
>
>   * The data are actually counts of occurrences of specific items,
>     namely animal and plant species. The counting is done by following
>     a predefined method, e.g. number of specimens per 100 m^2.
>   * A sampling consists of counting multiple species on a single day,
>     at a predefined location, following a predefined method. Please note
>     the counting may be repeated multiple times for a single species
>     using the same or a different method.
>   * A typical number of different species counted per sampling is
>     something between 15 and 100.
>   * Data are mostly quantitative, meaning they consist mostly of
>     integer values; however, this does not apply to all cases.
>
>
> *Measured / physical data*
>
>   * These data comprise a set of measured physical quantities such as
>     temperature, salinity, DI, etc. (usually up to 15 or 20 quantities).
>     The measurements are performed on water samples taken from different
>     depths at a predefined location on a predefined date and time.
>     Although the water samples from different depths at a single
>     location are taken a couple of minutes apart from one another, it
>     would help to track them as a single profile, which basically
>     consists of the data of the analyzed samples from a single location
>     at a specific time.

I would agree with consolidating as a single sample run.
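
Say, a header row per run that the individual readings reference (a sketch; the names sampling, location_id and taken_at are invented):

-- One row per visit to a location; each depth/quantity reading
-- references it, so a whole profile can be pulled up by one id.
CREATE TABLE sampling (
    id          serial PRIMARY KEY,
    location_id integer   NOT NULL,
    taken_at    timestamp NOT NULL
);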

>   * Most data are decimal numbers of a certain precision - e.g. if the
>     instrument provides information accurate to the first decimal place,
>     it has to be stored with precision up to the first decimal place.
>     In contrast, the salinity in the sample available at the link above
>     is measured accurately to the third decimal place, so it makes sense
>     to store it and make it retrievable accurate to the third decimal
>     place.
>     I was also considering storing depth as a NUMERIC to avoid the
>     inexactness of REAL or DOUBLE -> from MySQL I have the concern that
>     two FLOATs (REAL in PostgreSQL) both holding 3.4 can't be compared
>     in a query like value1 = value2 -> e.g. "... WHERE depth = 3.4;".
>     Am I missing something, or is there a better way to address such
>     cases?

If you care about precision, use NUMERIC, period. As to the scale (number
of decimal places), that is a little more complicated. The easy solution
would be to use the maximum scale you would need across all the data
values. The issue then becomes the following (using your NUMERIC values):

test=> create table numeric_test (num_fld numeric(9,4));
CREATE TABLE
test=> INSERT INTO numeric_test VALUES (15.6);
INSERT 0 1
test=> SELECT * from numeric_test ;
  num_fld
---------
  15.6000

This is ambiguous unless you know what the capabilities of the sampling
method are. So either you need to constrain the scale when you set up
the fields for each sample type, or, as you show (I think), provide extra
information to make that determination later.
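
One way to carry that extra information, sketched with invented names (method, measurement): record the scale per sampling method and apply it on retrieval:

-- Each sampling method records how many decimal places it resolves.
CREATE TABLE method (
    id    serial  PRIMARY KEY,
    name  text    NOT NULL,
    scale integer NOT NULL   -- decimal places the instrument reports
);

CREATE TABLE measurement (
    method_id integer      NOT NULL REFERENCES method(id),
    raw_value numeric(9,4) NOT NULL  -- stored at the maximum scale
);

-- Round back to the method's true resolution on the way out:
SELECT round(ms.raw_value, m.scale) AS value
FROM measurement ms
JOIN method m ON m.id = ms.method_id;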

>
>
> *General notes*
>
>   * Physical quantities may be outside the detection range of the
>     measuring instrument; in such a case, this needs to be recorded. I
>     still do not have a clear idea how to do it. NULLs do not seem to
>     be a good choice to mark such data.

As an aside, is doing a dilution series not possible?

This is sort of tricky. On the one hand you really don't know what the
actual value is; on the other, you know it is at or above (leaving out
approaching 0 for now) some number, so it is useful information. You
could do what you show and include an is_out_of_range field. Or you could
include the detection range information in the same table that records
the scale of the sampling methods.
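
Sticking with the invented method/measurement tables from above, that could look like:

-- Detection limits live with the method, not with each reading:
ALTER TABLE method
    ADD COLUMN detect_min numeric,
    ADD COLUMN detect_max numeric;

-- Readings outside [detect_min, detect_max] get flagged instead of
-- being stored as NULL, so no information is lost:
ALTER TABLE measurement
    ADD COLUMN is_out_of_range boolean NOT NULL DEFAULT false;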

>   * Different quantities are measured with different precision - e.g.
>     counted quantities don’t have decimal places; some instruments
>     report data with 1 decimal digit precision, others with 2, etc.


See above.

>   * The only quantity that is always present in all recorded data is
>     the depth at which the sample was taken.
>   * I use a RESTful interface as a middle layer between the DB and the GUI.
>
>
>
> *Finally, here is my dilemma*
> I am somewhat undecided about the best way to implement the database
> and, consequently, what kind of queries to use. The database model I am
> currently working on can be found at the link above. Looking at the
> diagram, it becomes evident I am deciding whether to store every
> measurement / determinant / depth triple as a separate record. My
> biggest dilemma is that a query for a simple sample of pressure,
> temperature, salinity and oxygen would imply multiple joins. As far as
> I know, this will badly affect performance; it will also make coding
> the RESTful interface harder.


There is a lot going on here and you will end up with joins whichever
way you do this. The usual way of dealing with this is to use VIEWs,
where the data lies in individual tables and you use a VIEW to
consolidate the data for reporting/query purposes. To an external
interface it looks like a single table. From the quick look I have done
so far I would tend to keep each sample as an individual record along
the lines of:

id sampling_id  sample_depth sample_type sample_value sample_timestamp
1  1            10           DO           8.6          2014-06-14 10:46


Then it is a matter of slicing and dicing as you need:

SELECT * FROM sample_data WHERE sampling_id = 1 ORDER BY sample_depth,
sample_type


SELECT * FROM sample_data WHERE sampling_id = 1 ORDER BY sample_type,
sample_depth

SELECT * FROM sample_data WHERE sample_type = 'DO' ORDER BY
sample_depth, sample_timestamp

etc
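
And to the VIEW point above, a sketch of presenting the same rows as one wide record per depth for a REST layer to consume (profile_wide and the type labels other than 'DO' are invented):

CREATE VIEW profile_wide AS
SELECT sampling_id,
       sample_depth,
       -- pivot the per-row values into columns; max() just picks the
       -- single value present for each type at this depth
       max(CASE WHEN sample_type = 'pressure'    THEN sample_value END) AS pressure,
       max(CASE WHEN sample_type = 'temperature' THEN sample_value END) AS temperature,
       max(CASE WHEN sample_type = 'salinity'    THEN sample_value END) AS salinity,
       max(CASE WHEN sample_type = 'DO'          THEN sample_value END) AS dissolved_oxygen
FROM sample_data
GROUP BY sampling_id, sample_depth;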

>
> The other option I considered and have not yet discarded is adapting
> tables to specific needs. In that case, storing data from a CTD
> (Conductivity / Temperature / Depth) probe would result in a table row
> containing: depth, conductivity, salinity, temperature. Such an
> approach rather makes sense; however, I'll end up with tons of tables
> that at some point in the future may need to be extended with
> additional columns.
>
> I would appreciate any advice and hints.
>
> Thanks and best regards,
>   Damir
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Designing a DB for storing biological data

From: Rich Shepard
On Sat, 14 Jun 2014, Damir Dezeljin wrote:

> This is more of a theoretical, or better said, conceptual question;
> still, I hope to get some feedback from you folks. Additionally, this
> is going to be a very long post :) Off-topic: I asked a similar question
> on the MySQL forum, as I'm still undecided between PostgreSQL and MySQL
> << I'm tempted by PostGIS.

   Postgres. You can link it to spatial attributes with PostGIS if you want.
But how to structure your database and tables is independent of the
analytical methods you use.

> I am designing a database for storing various biological and ecological
> data. Although there is no clear dividing line, it is possible to divide
> the data into two groups, namely measured (physical) data and
> quantitative (mostly biological) data; I uploaded both a data sample
> and an initial draft of a DB model to this link
> <https://www.dropbox.com/sh/9gm2ezwrwhkz6xv/AAB3koD6Xzi48-2BhIEdwmlZa>.

   Of course there's a clear division between biological, chemical, physical,
geomorphological, and other data.

   I may be wrong, but this looks like a question from a student for a
project or master's degree.

> *Biological / quantitative data*

   You have many choices: EPA's STORET or EMAP projects, ITIS (the Integrated
Taxonomic Information System), or your own. Each type of biological data
should have its own table; e.g., vegetation, mammals, birds, fish, benthic
macroinvertebrates, microbes.

   Each table should have a compound key: site_id, sample_date, species. This
makes each row unique, when possible. Otherwise, use a sequential key.

   The table I use for fish data has this schema:

                Table "public.fish"
    Column    |         Type          | Modifiers
-------------+-----------------------+-----------
  site        | character varying(12) | not null
  sampdate    | date                  | not null
  tclass      | character(13)         | not null
  torder      | character varying(16) | not null
  tfamily     | character varying(12) | not null
  tgenus      | character varying(12) | not null
  tspecies    | character varying(12) | not null
  tsubspecies | character varying(12) |
  common_name | character varying(32) |
  quant       | integer               | not null
  comments    | text                  |
  basin       | character varying(10) |
  stream      | character varying(20) |
Indexes:
     "fish_pkey" PRIMARY KEY, btree (site, sampdate, tspecies)

   The table I use for benthos has this schema:

                                       Table "public.benthos"
     Column     |         Type          |                         Modifiers

---------------+-----------------------+-----------------------------------
------------------------
  sampid        | integer               | not null default nextval('macroinv
_sampid_seq'::regclass)
  site          | character varying(12) | not null
  sampdate      | date                  | not null
  tclass        | character varying(20) | not null
  torder        | character varying(32) | not null
  tfamily       | character varying(50) |
  tgenus        | character varying(32) |
  tspecies      | character varying(20) |
  func_feed_grp | character varying(32) |
  quant         | integer               | not null
  comments      | text                  |
  stream        | character varying(20) |
  basin         | character varying(10) |
Indexes:
     "macroinv_pkey" PRIMARY KEY, btree (sampid)

> *Measured / physical data*

>   - Physical quantities may be outside the detection range of the
>   measuring instrument; in such a case, this needs to be recorded. I
>   still do not have a clear idea how to do it. NULLs do not seem to be
>   a good choice to mark such data.

   Here, too, you have choices. You can incorrectly record and analyze
chemical data below detection limits like EPA, states, and most regulatory
permit holders do, or you can do it correctly. This is a schema I use for
water quality data:

            Table "public.surfchem"
   Column  |         Type          | Modifiers
----------+-----------------------+-----------
  site     | character varying(20) | not null
  sampdate | date                  | not null
  param    | character varying(32) | not null
  quant    | real                  |
  ceneq1   | integer               |
  floor    | real                  |
  ceiling  | real                  |
  stream   | character varying(20) |
  basin    | character varying(10) |
Indexes:
     "surfchem_pkey" PRIMARY KEY, btree (site, sampdate, param)

   The key is the set (site, sampdate, param) so you can have multiple
samples, each of a different chemical constituent, at the same location and
date. (This works for air, soil, and other media chemistry, too). The
concentrations must be in the same units (e.g., mg/L). The column named
'ceneq1' is a flag: when the quant value is below the analytical laboratory's
reporting limit, set the flag to '1'; otherwise 0. Now you have a consistent
way to identify rows with less-than/non-detect/censored values. Since
reporting levels change as instruments become more sensitive and analysts
become better trained, you can have multiple reporting limits in your data.
That's where the 'floor' and 'ceiling' columns come in.
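
For example (values invented, and how quant is filled for a censored row is my reading of the convention), a non-detect might be recorded and retrieved like this:

-- Lab reported "< 0.005 mg/L": flag the row as censored (ceneq1 = 1)
-- and record the reporting limit in effect at the time.
INSERT INTO surfchem (site, sampdate, param, quant, ceneq1, floor)
VALUES ('STAT-01', '2014-06-14', 'Cu-diss', 0.005, 1, 0.005);

-- Pull all censored rows for a constituent:
SELECT site, sampdate, quant
FROM surfchem
WHERE param = 'Cu-diss' AND ceneq1 = 1;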

   This should get you started.

Rich

--
Richard B. Shepard, Ph.D.
Applied Ecosystem Services, Inc. | Troutdale, OR 97060 USA
www.appl-ecosys.com      Voice: 503-667-4517         Fax: 503-667-8863


Re: Designing a DB for storing biological data

From: Rich Shepard
On Sat, 14 Jun 2014, Damir Dezeljin wrote:

> I am designing a database for storing various biological and ecological
> data. Although there is no clear dividing line, it is possible to divide
> the data into two groups, namely measured (physical) data and
> quantitative (mostly biological) data;

   I left a couple of comments out of my first response.

   First, measurements are quantitative, so it does not matter what you're
measuring; all your data are quantitative. Your physical, chemical, and
geomorphic data (study area size, water velocity, air temperature) are ratio
data of continuous measurements. Your biological data are counts and/or size
or weight measurements.

   Second, your database queries should be those appropriate to extract the
data you wish to statistically analyze, which is what I assume is the point
of the whole exercise. The database is for storage. You can do your spatial
analyses with GRASS and your statistical analyses with R.

Rich

--
Richard B. Shepard, Ph.D.
Applied Ecosystem Services, Inc. | Troutdale, OR 97060 USA
www.appl-ecosys.com      Voice: 503-667-4517         Fax: 503-667-8863


Re: Designing a DB for storing biological data

From: Damir Dezeljin
Thank you both for the valuable input.

I was not aware of EPA's STORET or EMAP projects, nor of ITIS. At the beginning of next week I'll look into the details, and I'll probably get back with some questions.

Regarding the diploma thesis / master's degree question:

> I may be wrong, but this looks like a question from a student for a
> project or master's degree.

Actually, I'm working on a project.
From my decade and a half of experience with SW development, architecture and design, I learned to avoid doing things twice or more. So, I still hope I'll be able to use some of this work for the master's I'm planning to start in autumn.


Rich, can you share some details about the institute you're working at? -> Feel free to send me a personal e-mail if you prefer.

Thanks,
 Damir

Re: Designing a DB for storing biological data

From: Andy Colson
On 06/14/2014 10:52 AM, Damir Dezeljin wrote:
> Hello.
>
<SNIP>
>
> *Finally, here is my dilemma* I am somewhat undecided about the best
> way to implement the database and, consequently, what kind of queries
> to use. The database model I am currently working on can be found at
> the link above. Looking at the diagram, it becomes evident I am
> deciding whether to store every measurement / determinant / depth
> triple as a separate record. My biggest dilemma is that a query for a
> simple sample of pressure, temperature, salinity and oxygen would imply
> multiple joins. As far as I know, this will badly affect performance;
> it will also make coding the RESTful interface harder.

>
> Thanks and best regards, Damir
>

Normalizing the data into multiple tables ensures correctness. Worry about being correct before you worry about
performance. Don't worry about multiple joins being slow; databases are designed to be fast with small tables with lots
of rows, and at joining them together.

Correctness first. Don't worry about join performance, it's fast.

-Andy