Re: EXTERNAL storage and substring on long strings - Mailing list pgsql-performance
From | Scott Cain |
---|---|
Subject | Re: EXTERNAL storage and substring on long strings |
Date | |
Msg-id | 1060010735.1433.26.camel@localhost.localdomain Whole thread Raw |
In response to | Re: EXTERNAL storage and substring on long strings (Ron Johnson <ron.l.johnson@cox.net>) |
Responses |
Re: [SQL] EXTERNAL storage and substring on long strings
Re: [SQL] EXTERNAL storage and substring on long strings |
List | pgsql-performance |
Hello, Note: there is a SQL question way at the bottom of this narrative :-) Last week I asked about doing substring operations on very long strings (>10 million characters). I was given a suggestion to use EXTERNAL storage on the column via the ALTER TABLE ... SET STORAGE command. In one test case, the performance of substring actually got worse using EXTERNAL storage. In an effort to find the best way to do this operation, I decided to look at what is my "worst case" scenario: the DNA sequence for human chromosome 1, which is about 250 million characters long (previous strings where about 20 million characters long). I wrote a perl script to do several substring operations over this very long string, with substring lengths varying between 1000 and 40,000 characters spread out over various locations along the string. While EXTENDED storage won in this case, it was a hollow victory: 38 seconds per operation versus 40 seconds, both of which are way too long to for an interactive application. Time for a new method. A suggestion from my boss was to "shred" the DNA into smallish chunks and a column giving offsets from the beginning of the string, so that it can be reassembled when needed. Here is the test table: string=> \d dna Table "public.dna" Column | Type | Modifiers ---------+---------+----------- foffset | integer | pdna | text | Indexes: foffset_idx btree (foffset) In practice, there would also be a foreign key column to give the identifier of the dna. Then I wrote the following function (here's the SQL part promised above): CREATE OR REPLACE FUNCTION dna_string (integer, integer) RETURNS TEXT AS ' DECLARE smin ALIAS FOR $1; smax ALIAS FOR $2; longdna TEXT := ''''; dna_row dna%ROWTYPE; dnastring TEXT; firstchunk INTEGER; lastchunk INTEGER; in_longdnastart INTEGER; in_longdnalen INTEGER; chunksize INTEGER; BEGIN SELECT INTO chunksize min(foffset) FROM dna WHERE foffset>0; firstchunk := chunksize*(smin/chunksize); lastchunk := chunksize*(smax/chunksize); in_longdnastart := smin % chunksize; in_longdnalen := smax - smin + 1; FOR dna_row IN SELECT * FROM dna WHERE foffset >= firstchunk AND foffset <= lastchunk ORDER BY foffset LOOP longdna := longdna || dna_row.pdna; END LOOP; dnastring := substring(longdna FROM in_longdnastart FOR in_longdnalen); RETURN dnastring; END; ' LANGUAGE 'plpgsql'; So here's the question: I've never written a plpgsql function before, so I don't have much experience with it; is there anything obviously wrong with this function, or are there things that could be done better? At least this appears to work and is much faster, completing substring operations like above in about 0.27 secs (that's about two orders of magnitude improvement!) Thanks, Scott -- ------------------------------------------------------------------------ Scott Cain, Ph. D. cain@cshl.org GMOD Coordinator (http://www.gmod.org/) 216-392-3087 Cold Spring Harbor Laboratory
pgsql-performance by date: