Thread: Table design

Table design

From
Sean Davis
Date:
This might be a bit off-topic, but I'm curious what folks would do with this
situation:

I have about 6 different tab-delimited file types, all of which store
similar information (microarray gene expression).  However, the files come
from different manufacturers, so each has slightly different fields with
different meanings.  However, there are a few columns that are shared.  I
may need to add table formats in the future (as we get more manufacturers).
I can think of at least three ways to go about storing these data:

1) Create a single table that has as many columns as needed for ALL formats
and make manufacturer-specific views, naming columns in the view as
appropriate.  Then put rules on the view for inserts, updates, etc.  This is
my first choice, I think, but adding a new manufacturer's format means
creating a new view and possibly adding columns; some columns may NULL for
large portions of the table.

2) Use postgres inheritance, but even shared columns in our data may have
different names depending on the manufacturer, so there may be views
involved anyway.

3) Use a fully-normalized strategy that stacks each column into one very
long table--this would be my last choice.

Thanks for any insight.

(For replies, please try to reply to me directly as well as the list as I
just get digests right now).

Thanks,
Sean


Re: Table design

From
"Gevik"
Date:
Perhaps this is not a postgresql solution, but you could;

1. first design a common data structure in postgresql.
2. then convert each type of the tab-delimited file to a basic xml
structures.
3. map the structures to the common data structure using xslt.

I hope this helps,


> This might be a bit off-topic, but I'm curious what folks would do with
> this
> situation:
>
> I have about 6 different tab-delimited file types, all of which store
> similar information (microarray gene expression).  However, the files come
> from different manufacturers, so each has slightly different fields with
> different meanings.  However, there are a few columns that are shared.  I
> may need to add table formats in the future (as we get more
> manufacturers).
> I can think of at least three ways to go about storing these data:
>
> 1) Create a single table that has as many columns as needed for ALL
> formats
> and make manufacturer-specific views, naming columns in the view as
> appropriate.  Then put rules on the view for inserts, updates, etc.  This
> is
> my first choice, I think, but adding a new manufacturer's format means
> creating a new view and possibly adding columns; some columns may NULL for
> large portions of the table.
>
> 2) Use postgres inheritance, but even shared columns in our data may have
> different names depending on the manufacturer, so there may be views
> involved anyway.
>
> 3) Use a fully-normalized strategy that stacks each column into one very
> long table--this would be my last choice.
>
> Thanks for any insight.
>
> (For replies, please try to reply to me directly as well as the list as I
> just get digests right now).
>
> Thanks,
> Sean
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>



Re: Table design

From
Adam Witney
Date:
Hi Sean,

We use something similar to approach 1) to store our microarray data.

We have a data table that has a few specific columns (signal median, bkg
median etc) as these exist in all the file formats... Plus also some generic
columns for the rest of the data fields.

Then we have a definitions table that maps the column header from the file
format to the column name in the database.

It seems to work well for us. I can send you the table definitions if they
are any use to you?

Cheers

Adam

> This might be a bit off-topic, but I'm curious what folks would do with this
> situation:
>
> I have about 6 different tab-delimited file types, all of which store
> similar information (microarray gene expression).  However, the files come
> from different manufacturers, so each has slightly different fields with
> different meanings.  However, there are a few columns that are shared.  I
> may need to add table formats in the future (as we get more manufacturers).
> I can think of at least three ways to go about storing these data:
>
> 1) Create a single table that has as many columns as needed for ALL formats
> and make manufacturer-specific views, naming columns in the view as
> appropriate.  Then put rules on the view for inserts, updates, etc.  This is
> my first choice, I think, but adding a new manufacturer's format means
> creating a new view and possibly adding columns; some columns may NULL for
> large portions of the table.
>
> 2) Use postgres inheritance, but even shared columns in our data may have
> different names depending on the manufacturer, so there may be views
> involved anyway.
>
> 3) Use a fully-normalized strategy that stacks each column into one very
> long table--this would be my last choice.
>
> Thanks for any insight.
>
> (For replies, please try to reply to me directly as well as the list as I
> just get digests right now).
>
> Thanks,
> Sean
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>              http://archives.postgresql.org


--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.


Re: Table design

From
"Guy Rouillier"
Date:
Sean Davis wrote:
> This might be a bit off-topic, but I'm curious what folks would do
> with this
> situation:
>
> I have about 6 different tab-delimited file types, all of which store
> similar information (microarray gene expression).  However, the files
> come from different manufacturers, so each has slightly different
> fields with different meanings.  However, there are a few columns
> that are shared.  I may need to add table formats in the future (as
> we get more manufacturers). I can think of at least three ways to go
> about storing these data:
>
> 1) Create a single table that has as many columns as needed for ALL
> formats and make manufacturer-specific views, naming columns in the
> view as appropriate.  Then put rules on the view for inserts,
> updates, etc.  This is my first choice, I think, but adding a new
> manufacturer's format means creating a new view and possibly adding
> columns; some columns may NULL for large portions of the table.
>
> 2) Use postgres inheritance, but even shared columns in our data may
> have different names depending on the manufacturer, so there may be
> views involved anyway.

I'm unclear if this is just a naming issue, or if the fields in the
files have different meaning.  If it's just a case that supplier A names
a field "foo" while supplier B names a field with the same meaning
"bar", I would think you'd want to coalesce all these incoming files
into a single table containing columns that have meaning to your
organization.  The effort then just becomes one of mapping incoming
fields into the proper columns, but the end result would be something
much more consistent and meaningful to your organization.

If on the other hand all these incoming fields have different meaning
and you need to keep them stored separately, I would look into option
(4): just keep a separate table for each supplier, since you said that
even shared fields may have different meaning; then use a view over all
the tables to answer any queries across suppliers.

But I definitely wouldn't let the way your suppliers name their fields
in the files they send you drive how you design your database.  That's
just a data mapping issue which is easily solved during data import.

>
> 3) Use a fully-normalized strategy that stacks each column into one
> very long table--this would be my last choice.
>
> Thanks for any insight.
>
> (For replies, please try to reply to me directly as well as the list
> as I just get digests right now).
>
> Thanks,
> Sean
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org



--
Guy Rouillier


Re: Table design

From
Sean Davis
Date:
On 12/2/05 10:21 AM, "Guy Rouillier" <guyr@masergy.com> wrote:

> Sean Davis wrote:
>> This might be a bit off-topic, but I'm curious what folks would do
>> with this
>> situation:
>>
>> I have about 6 different tab-delimited file types, all of which store
>> similar information (microarray gene expression).  However, the files
>> come from different manufacturers, so each has slightly different
>> fields with different meanings.  However, there are a few columns
>> that are shared.  I may need to add table formats in the future (as
>> we get more manufacturers). I can think of at least three ways to go
>> about storing these data:
>>
>> 1) Create a single table that has as many columns as needed for ALL
>> formats and make manufacturer-specific views, naming columns in the
>> view as appropriate.  Then put rules on the view for inserts,
>> updates, etc.  This is my first choice, I think, but adding a new
>> manufacturer's format means creating a new view and possibly adding
>> columns; some columns may NULL for large portions of the table.
>>
>> 2) Use postgres inheritance, but even shared columns in our data may
>> have different names depending on the manufacturer, so there may be
>> views involved anyway.
>
> I'm unclear if this is just a naming issue, or if the fields in the
> files have different meaning.  If it's just a case that supplier A names
> a field "foo" while supplier B names a field with the same meaning
> "bar", I would think you'd want to coalesce all these incoming files
> into a single table containing columns that have meaning to your
> organization.  The effort then just becomes one of mapping incoming
> fields into the proper columns, but the end result would be something
> much more consistent and meaningful to your organization.
>
> If on the other hand all these incoming fields have different meaning
> and you need to keep them stored separately, I would look into option
> (4): just keep a separate table for each supplier, since you said that
> even shared fields may have different meaning; then use a view over all
> the tables to answer any queries across suppliers.

The columns are a mixture of both cases, hence the thought about
inheritance.  However, the column names do have meaning for users of the
manufacturer's products, so they need to be present, also.  I'm not the
first to tackle this problem, and most use a common table structure for all
the data, column naming aside.

Sean