Re: Designing a DB for storing biological data - Mailing list pgsql-general

From Rich Shepard
Subject Re: Designing a DB for storing biological data
Date
Msg-id alpine.LNX.2.11.1406141107100.17243@localhost
Whole thread Raw
In response to Designing a DB for storing biological data  (Damir Dezeljin <damir.dezeljin@dezo.org>)
Responses Re: Designing a DB for storing biological data
List pgsql-general
On Sat, 14 Jun 2014, Damir Dezeljin wrote:

> This is more a theoretical or better to say, conceptual question; still, I
> hope to get some feed backs from you folks. Additionally this is going to
> be a very long post :) off-topic: I asked a similar question on the MySQL
> forum as I'm still undecided if going with PostgreSQL or MySQL << I'm
> tempted at PostGIS.

   Postgres. You can link it to spatial attributes with postgis if you want.
But, how to structure your database and tables is independent of the
analytical methods you use.

> I am designing a database for storing various biological and ecological
> data. Although there is no clear dividing line, it is possible to group
> the data into two groups, namely the measured (physical) and quantitative
> (mostly biological) data; I uploaded both a data sample and an initial
> draft of a DB model to this link
> <https://www.dropbox.com/sh/9gm2ezwrwhkz6xv/AAB3koD6Xzi48-2BhIEdwmlZa>.

   Of course there's a clear division between biological, chemical, physical,
geomorphical, and other data.

   I may be wrong, but this looks like a question from a student for a
project or master's degree.

> *Biological / quantitative data*

   You have many choices: EPA's STORET or EMAP projects, ITIS (International
Taxonomic Identification System), or your own. Each type of biological data
should have its own table; e.g., vegegation, mammals, birds, fish, benthic
macroinvertebrates, microbes.

   Each table should have a compound key: site_id, sample_date, species. This
makes each row unique, when possible. Otherwise. use a sequential key.

   The table I use for fish data has this schema:

                Table "public.fish"
    Column    |         Type          | Modifiers
-------------+-----------------------+-----------
  site        | character varying(12) | not null
  sampdate    | date                  | not null
  tclass      | character(13)         | not null
  torder      | character varying(16) | not null
  tfamily     | character varying(12) | not null
  tgenus      | character varying(12) | not null
  tspecies    | character varying(12) | not null
  tsubspecies | character varying(12) |
  common_name | character varying(32) |
  quant       | integer               | not null
  comments    | text                  |
  basin       | character varying(10) |
  stream      | character varying(20) |
Indexes:
     "fish_pkey" PRIMARY KEY, btree (site, sampdate, tspecies)

   The table I use for benthos has this schema:

                                       Table "public.benthos"
     Column     |         Type          |                         Modifiers

---------------+-----------------------+-----------------------------------
------------------------
  sampid        | integer               | not null default nextval('macroinv
_sampid_seq'::regclass)
  site          | character varying(12) | not null
  sampdate      | date                  | not null
  tclass        | character varying(20) | not null
  torder        | character varying(32) | not null
  tfamily       | character varying(50) |
  tgenus        | character varying(32) |
  tspecies      | character varying(20) |
  func_feed_grp | character varying(32) |
  quant         | integer               | not null
  comments      | text                  |
  stream        | character varying(20) |
  basin         | character varying(10) |
Indexes:
     "macroinv_pkey" PRIMARY KEY, btree (sampid)

> *Measured / physical data*

>   - Physical quantities may be outside the detection range of the measured
>   instrument; in such a case, this needs to be recorded. I still do not have
>   a clear idea how to do it. NULL’s do not seem to be a good choice to mark
>   such data.

   Here, too, you have choices. You can incorrectly record and analyze
chemical data below detection limits like EPA, states, and most regulatory
permit holders do, or you can do it correctly. This is a schema I use for
water quality data:

            Table "public.surfchem"
   Column  |         Type          | Modifiers
----------+-----------------------+-----------
  site     | character varying(20) | not null
  sampdate | date                  | not null
  param    | character varying(32) | not null
  quant    | real                  |
  ceneq1   | integer               |
  floor    | real                  |
  ceiling  | real                  |
  stream   | character varying(20) |
  basin    | character varying(10) |
Indexes:
     "surfchem_pkey" PRIMARY KEY, btree (site, sampdate, param)

   The key is the set (site, sampdate, param) so you can have multiple
samples, each of a different chemical constituent, at the same location and
date. (This works for air, soil, and other media chemistry, too). The
concentrations must be in the same units (e.g., mg/L). The column named
'ceneq1' is a flag: when the quant value is below the analytical laboratory's
reporting limit, set the flag to '1'; otherwise 0. Now you have a consistent
way to identify rows with less-than/non-detect/censored values. Since
reporting levels change as instruments become more sensitive and analysts
become better trained, you can have multiple reporting limits in your data.
That's where the 'floor' and 'ceiling' columns come in.

   This should get you started.

Rich

--
Richard B. Shepard, Ph.D.
Applied Ecosystem Services, Inc. | Troutdale, OR 97060 USA
www.appl-ecosys.com      Voice: 503-667-4517         Fax: 503-667-8863


pgsql-general by date:

Previous
From: Kevin Grittner
Date:
Subject: Re: Using pg_start_backup() and pg_stop_backup() - using 9.1.2.2
Next
From: Rich Shepard
Date:
Subject: Re: Designing a DB for storing biological data