Thread: Big wide datasets

Big wide datasets

From
Michael Lush
Date: Thu, 8 Dec 2011 13:05:19 +0000
I have a dataset with ~10000 columns and ~200000 rows (GWAS data (1)) in the form

sample1, A T, A A, G C, ....
sample2, A C, C T, A A, ....

I'd like to take subsets of both columns and rows for analysis.

Two approaches spring to mind: either unpack it into something like an
RDF triple, i.e.
CREATE TABLE long_table (
    sample_id      varchar(20),
    column_number  int,
    snp_data       varchar(3));

for a table with ~2 billion rows (10000 x 200000),

or use the array datatype:

CREATE TABLE wide_table (
    sample_id  varchar(20),
    snp_data   varchar(3)[]);
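
For concreteness, the kinds of subset queries I have in mind would look
something like this (sample names and column ranges are placeholders):

SELECT sample_id, column_number, snp_data
FROM long_table
WHERE sample_id IN ('sample1', 'sample2')
  AND column_number BETWEEN 1 AND 100;

SELECT sample_id, snp_data[1:100]
FROM wide_table
WHERE sample_id IN ('sample1', 'sample2');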

Does anyone have any experience of this sort of thing?

(1) http://en.wikipedia.org/wiki/Genome-wide_association_study

--
Michael Lush

Re: Big wide datasets

From
"Jean-Yves F. Barbier"
Date:
On Thu, 8 Dec 2011 13:05:19 +0000
Michael Lush <mjlush@gmail.com> wrote:

> I have a dataset with ~10000 columns and ~200000 rows (GWAS data (1))
> in the form
> sample1, A T, A A, G C, ....
> sample2, A C, C T, A A, ....
>
> I'd like to take subsets of both columns and rows for analysis

Why do you want to write the subsets to a table instead of reading them
one by one, analysing, and saving the result?  That way you avoid the
huge temp table, which doesn't seem necessary.

BTW, just in case: in Python there are large libraries dedicated to
genetics.

--
BOFH excuse #99:
SIMM crosstalk.

Re: Big wide datasets

From
"Robert D. Schnabel"
Date:
On 12/8/2011 7:05 AM, Michael Lush wrote:
> I have a dataset with ~10000 columns and ~200000 rows (GWAS data (1))
> in the form
>
> sample1, A T, A A, G C, ....
> sample2, A C, C T, A A, ....
>
> I'd like to take subsets of both columns and rows for analysis.
>
> Two approaches spring to mind: either unpack it into something like an
> RDF triple, i.e.
>
> CREATE TABLE long_table (
>     sample_id      varchar(20),
>     column_number  int,
>     snp_data       varchar(3));
>
> for a table with ~2 billion rows (10000 x 200000),
>
> or use the array datatype:
>
> CREATE TABLE wide_table (
>     sample_id  varchar(20),
>     snp_data   varchar(3)[]);
>
> Does anyone have any experience of this sort of thing?
>
> (1) http://en.wikipedia.org/wiki/Genome-wide_association_study
>
> --
> Michael Lush


I store all my genotype data in a schema similar to this:

CREATE TABLE gen1000
(
  -- Inherited from table genotypes:  snp_number integer NOT NULL,
  -- Inherited from table genotypes:  sample_id integer NOT NULL,
  -- Inherited from table genotypes:  genotype smallint NOT NULL,
  CONSTRAINT check1000 CHECK (snp_number < 58337
    AND sample_id > 100000000 AND sample_id < 101000000)
)
INHERITS (genotypes)
WITH (
  OIDS=FALSE
);

ALTER TABLE gen1000
  OWNER TO postgres;
COMMENT ON TABLE gen1000
  IS '100 ANG';

-- Index: xgen1000_sample_id
CREATE INDEX xgen1000_sample_id
  ON gen1000
  USING btree
  (sample_id)
  WITH (FILLFACTOR=100)
  TABLESPACE index_tablespace;
ALTER TABLE gen1000 CLUSTER ON xgen1000_sample_id;

-- Index: xgen1000_snp_number
CREATE INDEX xgen1000_snp_number
  ON gen1000
  USING btree
  (snp_number)
  WITH (FILLFACTOR=100)
  TABLESPACE index_tablespace;
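
Note that with plain inheritance rows aren't routed to a partition
automatically; you either load directly into the right child table or
put a trigger on the parent.  A minimal sketch of a direct load
(made-up values):

-- The CHECK constraint rejects rows outside this partition's range.
INSERT INTO gen1000 (snp_number, sample_id, genotype)
VALUES (12345, 100000123, 1);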

My implementation is basically a data warehouse where I am the only
person with access to the db.

There are benefits to storing your genotypes this way, and benefits to
storing them in more of a matrix format.  If you store them as your
"long" table you really need to think about whether you can partition
your data.  In my case I'm able to partition by sample_id because all
of my animals from a given breed (population, race, location, etc.)
have IDs in a certain range.  Furthermore, I'm able to create
partitions based on which assay (Illumina SNP50, Illumina HD, AFFX
BOS-1) the genotypes came from, because my snp_numbers (rather than
marker names) are integers specifically structured so that ranges of
integers correspond to an assay.  A given genotype table is therefore
only as long as the number of animals X the number of markers on the
assay.  I have one partition that would be very large (20000 animals X
60000 markers = 1.2B rows), but I further split it up into
sub-partitions of ~220M.
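
A hypothetical sub-partition in the same style (table name and ID range
are made up for illustration):

CREATE TABLE gen2000_part1
(
  CONSTRAINT check2000_part1 CHECK (snp_number < 58337
    AND sample_id > 101000000 AND sample_id < 101220000)
)
INHERITS (genotypes);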

I have separate tables for marker information and sample information,
keyed on snp_number and sample_id.  Given the structure, appropriate
indexes, and check constraints, I'm able to grab specific "chunks" of
data fairly easily.

As a side note, I take every opportunity I get to advise people not to
store GWAS data in base format (AGCT).  If you are using Illumina data,
at the very least store the data and work with it in A/B format.
Ditto for Affymetrix.  I actually convert the A/B format to my own
coding and store each genotype as a single small integer (genotype
above) where AA=1, BB=2, AB=3, missing=10.  (At some point I'm going to
change this to missing=0.)  Here's why this is important.  With this
coding you can store a genotype using 2 bytes.  You can also store
phase information if you have it: B/A=4, A/null=5, B/null=6, null/A=7,
null/B=8, null/null=9.  One other side benefit: if you want to check
parent/child inheritance, all you need to do is find all the loci where
parent_genotype + child_genotype = 3 and count them.  Those opposing
homozygotes (AA in one, BB in the other) are the only pairs that sum to
3, and they are the only informative genotypes; if you're working in
base space it's more complicated.  Furthermore, you could take this
data in columnar form, dump it, and rewrite it as a string (wide) where
each value is an individual genotype, which is about as compact as you
can get.  You just need to write your analysis software to decode it.
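
As a sketch, that parent/child check is a self-join on the genotypes
table above (the two sample IDs are placeholders):

SELECT count(*) AS informative_loci
FROM genotypes p
JOIN genotypes c ON c.snp_number = p.snp_number
WHERE p.sample_id = 100000001  -- parent (placeholder)
  AND c.sample_id = 100000002  -- child (placeholder)
  AND p.genotype + c.genotype = 3;  -- only AA (1) + BB (2) sums to 3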

Bob

Re: Big wide datasets

From
Steve Crawford
Date:
On 12/08/2011 05:05 AM, Michael Lush wrote:
> I have a dataset with ~10000 columns and ~200000 rows (GWAS data (1))
> in the form
>
> sample1, A T, A A, G C, ....
> sample2, A C, C T, A A, ....
I don't know if they would be relevant to your situation, but there are
some bioinformatics extensions to PostgreSQL that you may find useful.
PostBio is one (http://postbio.projects.postgresql.org/).

Cheers,
Steve