Thread: Big wide datasets
I have a dataset with ~10,000 columns and ~200,000 rows (GWAS data (1)) in the form
sample1, A T, A A, G C, ....
sample2, A C, C T, A A, ....
I'd like to take subsets of both columns and rows for analysis
Two approaches spring to mind: either unpack it into something like an RDF
triple, i.e.
CREATE TABLE long_table (
    sample_id varchar(20),
    column_number int,
    snp_data varchar(3));
for a table with 2 billion rows (~200,000 samples x ~10,000 SNPs),
or use the array datatype
CREATE TABLE wide_table (
    sample_id varchar(20),
    snp_data varchar(3)[]);
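For concreteness, subsetting would look roughly like this under each layout
(a sketch; the sample IDs and column ranges are placeholders):

-- Long layout: subset rows and columns with ordinary WHERE clauses.
SELECT sample_id, column_number, snp_data
FROM   long_table
WHERE  sample_id IN ('sample1', 'sample2')
  AND  column_number BETWEEN 1 AND 100;

-- Array layout: subset rows with WHERE, columns with an array slice.
SELECT sample_id, snp_data[1:100] AS snp_subset
FROM   wide_table
WHERE  sample_id IN ('sample1', 'sample2');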
Does anyone have any experience of this sort of thing?
(1) http://en.wikipedia.org/wiki/Genome-wide_association_study
--
Michael Lush
On Thu, 8 Dec 2011 13:05:19 +0000 Michael Lush <mjlush@gmail.com> wrote:
> I have a dataset with ~10,000 columns and ~200,000 rows (GWAS data (1))
> in the form
> sample1, A T, A A, G C, ....
> sample2, A C, C T, A A, ....
>
> I'd like to take subsets of both columns and rows for analysis

Why do you want to write the subsets to a table instead of reading them
one by one, analysing each, and saving the result? That way you avoid the
huge temporary table, which doesn't seem necessary.

BTW, just in case: in Python there are large libraries dedicated to
genetics.

--
BOFH excuse #99: SIMM crosstalk.
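A minimal sketch of that idea against Michael's proposed long_table (the
sample IDs, column range, and the genotype-count aggregate are
illustrative, not from the thread):

-- Aggregate directly over the subset and persist only the small result,
-- instead of materialising a multi-billion-row intermediate table.
CREATE TABLE subset_counts AS
SELECT column_number,
       snp_data,
       count(*) AS n_samples
FROM   long_table
WHERE  sample_id IN ('sample1', 'sample2')   -- row (sample) subset
  AND  column_number BETWEEN 1 AND 1000      -- column (SNP) subset
GROUP  BY column_number, snp_data;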
On 12/8/2011 7:05 AM, Michael Lush wrote:
> I have a dataset with ~10,000 columns and ~200,000 rows (GWAS data (1))
> in the form
> sample1, A T, A A, G C, ....
> sample2, A C, C T, A A, ....
> [...]
> Does anyone have any experience of this sort of thing?

I store all my genotype data similar to this:

CREATE TABLE gen1000 (
    snp_number integer NOT NULL,   -- Inherited from table genotypes
    sample_id integer NOT NULL,    -- Inherited from table genotypes
    genotype smallint NOT NULL,    -- Inherited from table genotypes
    CONSTRAINT check1000 CHECK (snp_number < 58337
        AND sample_id > 100000000 AND sample_id < 101000000)
)
INHERITS (genotypes)
WITH (OIDS=FALSE);

ALTER TABLE gen1000 OWNER TO postgres;
COMMENT ON TABLE gen1000 IS '100 ANG';

-- Index: xgen1000_sample_id
CREATE INDEX xgen1000_sample_id
    ON gen1000 USING btree (sample_id)
    WITH (FILLFACTOR=100)
    TABLESPACE index_tablespace;
ALTER TABLE gen1000 CLUSTER ON xgen1000_sample_id;

-- Index: xgen1000_snp_number
CREATE INDEX xgen1000_snp_number
    ON gen1000 USING btree (snp_number)
    WITH (FILLFACTOR=100)
    TABLESPACE index_tablespace;

My implementation is basically a data warehouse where I am the only person
with access to the db. There are benefits to storing your genotypes this
way, and there are benefits to storing them in more of a matrix format.

If you store them in your "long" layout, you really need to think about
whether or not you can partition your data. In my case I'm able to
partition by sample_id, because all of my animals from a given breed
(population, race, location, etc.) have IDs in a certain range.
Furthermore, I'm able to create partitions based on which assay (Illumina
SNP50, Illumina HD, AFFX BOS-1) the genotypes came from, because my
snp_numbers (rather than marker names) are integers structured so that
ranges of integers correspond to an assay. A given genotype table is
therefore only as long as the number of animals times the number of
markers on the assay. I have one partition that would be very large
(20,000 animals x 60,000 markers = 1.2B rows), so I split it further into
sub-partitions of ~220M rows each. I have separate tables for marker
information and sample information, keyed on snp_number and sample_id.
Given this structure, appropriate indexes, and the check constraints, I'm
able to grab specific "chunks" of data fairly easily.

As a side note, I take every opportunity I get to advise people not to
store GWAS data in base format (AGCT). If you are using Illumina data, at
the very least store it and work with it in A/B format; ditto for
Affymetrix. I actually convert the A/B format to my own coding and store
each genotype as a single smallint (the genotype column above), where
AA=1, BB=2, AB=3, missing=10. At some point I'm going to change this so
that missing=0.
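Returning to the partitioning scheme, retrieving a "chunk" might look like
this (a sketch; the genotypes parent table and the ID ranges come from the
DDL above, but the specific values are made up):

-- Constraint exclusion lets the planner skip every child partition whose
-- CHECK constraint rules it out, so this scans only gen1000.
SET constraint_exclusion = on;

SELECT sample_id, snp_number, genotype
FROM   genotypes                                   -- the parent table
WHERE  sample_id BETWEEN 100000001 AND 100000500   -- inside gen1000's range
  AND  snp_number < 58337;                         -- inside one assay's range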
Here's why this single-smallint coding is important. With it you can store
an individual's genotype at a locus in 2 bytes. You can also store phase
information if you have it: B/A=4, A/null=5, B/null=6, null/A=7, null/B=8,
null/null=9.

One other side benefit: if you want to check parent/child inheritance, all
you need to do is find the loci where parent_genotype + child_genotype = 3
and count them. Those are the only informative genotypes. If you're
working in base space it's more complicated.

Furthermore, you could take this data in columnar form, dump it, and
rewrite it as a string (wide) where each value is an individual genotype,
which is about as compact as you can get it. You just need to write your
analysis software to decode it.

Bob
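That inheritance check might look like this (a sketch against the
genotypes table above; the parent and child sample IDs are hypothetical):

-- Under the AA=1, BB=2 coding, parent + child = 3 only at opposing
-- homozygotes, the loci that are informative for an inheritance check.
SELECT count(*) AS informative_loci
FROM   genotypes p
JOIN   genotypes c ON c.snp_number = p.snp_number
WHERE  p.sample_id = 100000001   -- hypothetical parent ID
  AND  c.sample_id = 100000002   -- hypothetical child ID
  AND  p.genotype + c.genotype = 3;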
On 12/08/2011 05:05 AM, Michael Lush wrote:
> I have a dataset with ~10,000 columns and ~200,000 rows (GWAS data (1))
> in the form
>
> sample1, A T, A A, G C, ....
> sample2, A C, C T, A A, ....

I don't know if they would be relevant to your situation, but there are
some bioinformatics extensions to PostgreSQL that you may find useful.
PostBio is one (http://postbio.projects.postgresql.org/).

Cheers,
Steve