Re: [PERFORM] EXTERNAL storage and substring on long strings - Mailing list pgsql-sql

From Richard Huxton
Subject Re: [PERFORM] EXTERNAL storage and substring on long strings
Date
Msg-id 200308041655.48151.dev@archonet.com
Whole thread Raw
In response to Re: [PERFORM] EXTERNAL storage and substring on long strings  (Scott Cain <cain@cshl.org>)
Responses Re: [PERFORM] EXTERNAL storage and substring on long strings  (Scott Cain <cain@cshl.org>)
List pgsql-sql
On Monday 04 August 2003 16:25, Scott Cain wrote:
[snip]
> In an effort to find the best way to do this operation, I decided to
> look at what is my "worst case" scenario: the DNA sequence for human
> chromosome 1, which is about 250 million characters long (previous
> strings where about 20 million characters long).  I wrote a perl script
> to do several substring operations over this very long string, with
> substring lengths varying between 1000 and 40,000 characters spread out
> over various locations along the string.  While EXTENDED storage won in
> this case, it was a hollow victory: 38 seconds per operation versus 40
> seconds, both of which are way too long to for an interactive
> application.
>
> Time for a new method.  A suggestion from my boss was to "shred" the DNA
> into smallish chunks and a column giving offsets from the beginning of
> the string, so that it can be reassembled when needed. Here is the test
> table:
>
> string=> \d dna
>       Table "public.dna"
>  Column  |  Type   | Modifiers
> ---------+---------+-----------
>  foffset | integer |
>  pdna    | text    |
> Indexes: foffset_idx btree (foffset)

[snipped plpgsql function which stitches chunks together and then substrings]

> So here's the question: I've never written a plpgsql function before, so
> I don't have much experience with it; is there anything obviously wrong
> with this function, or are there things that could be done better?  At
> least this appears to work and is much faster, completing substring
> operations like above in about 0.27 secs (that's about two orders of
> magnitude improvement!)

You might want some checks to make sure that smin < smax, otherwise looks like
it does the job in a good clean fashion.

Glad to hear it's going to solve your problems. Two things you might want to
bear in mind:
1. There's probably a "sweet spot" where the chunk size interacts well with
your data, usage patterns and PGs backend to give you peak performance.
You'll have to test.
2. If you want to search for a sequence you'll need to deal with the case
where it starts in one chunk and ends in another.

--
  Richard Huxton
  Archonet Ltd

pgsql-sql by date:

Previous
From: Tom Lane
Date:
Subject: Re: [PERFORM] EXTERNAL storage and substring on long strings
Next
From: Scott Cain
Date:
Subject: Re: [PERFORM] EXTERNAL storage and substring on long strings