I have a dataset with ~10000 columns and ~200000 rows (GWAS data (1)) in the form
sample1, A T, A A, G C, ....
sample2, A C, C T, A A, ....
I'd like to take subsets of both columns and rows for analysis.
Two approaches spring to mind: either unpack it into something like an RDF triple,
i.e.
CREATE TABLE long_table (
sample_id varchar(20),
column_number int,
snp_data varchar(3));
for a table with roughly 2 billion rows (~10000 columns x ~200000 rows),
or use the array datatype
CREATE TABLE wide_table (
sample_id varchar(20),
snp_data varchar(3)[]);
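To make the question concrete, the subset queries I have in mind would look roughly like the following. This is only a sketch: it assumes PostgreSQL, it assumes snp_data in wide_table is declared as varchar(3)[], and the sample names and column range are made up for illustration.

```sql
-- Long layout: pick rows by sample and columns by SNP position.
SELECT sample_id, column_number, snp_data
FROM long_table
WHERE sample_id IN ('sample1', 'sample2')   -- hypothetical sample names
  AND column_number BETWEEN 2 AND 3;        -- hypothetical column range

-- Wide layout: pick rows by sample and take a slice of the array.
-- PostgreSQL arrays are 1-based by default, so [2:3] is the 2nd and 3rd SNP.
SELECT sample_id, snp_data[2:3] AS snp_subset
FROM wide_table
WHERE sample_id IN ('sample1', 'sample2');
```

The slice only covers a contiguous run of columns; a scattered set would need individual element references such as snp_data[2], snp_data[7] in the select list.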
Does anyone have any experience of this sort of thing?
(1) http://en.wikipedia.org/wiki/Genome-wide_association_study
--
Michael Lush