Thread: Re: [PERFORM] EXTERNAL storage and substring on long strings

Re: [PERFORM] EXTERNAL storage and substring on long strings

From
Scott Cain
Date:
Hello,

Note: there is a SQL question way at the bottom of this narrative :-)

Last week I asked about doing substring operations on very long strings
(>10 million characters).  I was given a suggestion to use EXTERNAL
storage on the column via the ALTER TABLE ... SET STORAGE command.  In
one test case, the performance of substring actually got worse using
EXTERNAL storage.

In an effort to find the best way to do this operation, I decided to
look at what is my "worst case" scenario: the DNA sequence for human
chromosome 1, which is about 250 million characters long (previous
strings where about 20 million characters long).  I wrote a perl script
to do several substring operations over this very long string, with
substring lengths varying between 1000 and 40,000 characters spread out
over various locations along the string.  While EXTENDED storage won in
this case, it was a hollow victory: 38 seconds per operation versus 40
seconds, both of which are way too long to for an interactive
application.

Time for a new method.  A suggestion from my boss was to "shred" the DNA
into smallish chunks and a column giving offsets from the beginning of
the string, so that it can be reassembled when needed. Here is the test
table:

string=> \d dna
      Table "public.dna"
 Column  |  Type   | Modifiers
---------+---------+-----------
 foffset | integer |
 pdna    | text    |
Indexes: foffset_idx btree (foffset)

In practice, there would also be a foreign key column to give the
identifier of the dna.  Then I wrote the following function (here's the
SQL part promised above):

CREATE OR REPLACE FUNCTION dna_string (integer, integer) RETURNS TEXT AS '
DECLARE
    smin ALIAS FOR $1;
    smax ALIAS FOR $2;
    longdna         TEXT := '''';
    dna_row         dna%ROWTYPE;
    dnastring       TEXT;
    firstchunk      INTEGER;
    lastchunk       INTEGER;
    in_longdnastart INTEGER;
    in_longdnalen   INTEGER;
    chunksize       INTEGER;
BEGIN
    SELECT INTO chunksize min(foffset) FROM dna WHERE foffset>0;
    firstchunk :=  chunksize*(smin/chunksize);
    lastchunk  :=  chunksize*(smax/chunksize);

    in_longdnastart := smin % chunksize;
    in_longdnalen   := smax - smin + 1;

    FOR dna_row IN
        SELECT * FROM dna
        WHERE foffset >= firstchunk AND foffset <= lastchunk
        ORDER BY foffset
        LOOP

        longdna := longdna || dna_row.pdna;
    END LOOP;

    dnastring := substring(longdna FROM in_longdnastart FOR in_longdnalen);

    RETURN dnastring;
END;
' LANGUAGE 'plpgsql';

So here's the question: I've never written a plpgsql function before, so
I don't have much experience with it; is there anything obviously wrong
with this function, or are there things that could be done better?  At
least this appears to work and is much faster, completing substring
operations like above in about 0.27 secs (that's about two orders of
magnitude improvement!)

Thanks,
Scott


--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory


Re: [PERFORM] EXTERNAL storage and substring on long strings

From
Tom Lane
Date:
Scott Cain <cain@cshl.org> writes:
> At least this appears to work and is much faster, completing substring
> operations like above in about 0.27 secs (that's about two orders of
> magnitude improvement!)

I find it really, really hard to believe that a crude reimplementation
in plpgsql of the TOAST concept could beat the built-in implementation
at all, let alone beat it by two orders of magnitude.

Either there's something unrealistic about your testing of the
dna_string function, or your original tests are not causing TOAST to be
invoked in the expected way, or there's a bug we need to fix.  I'd
really like to see some profiling of the poor-performing
external-storage case, so we can figure out what's going on.

            regards, tom lane

Re: [PERFORM] EXTERNAL storage and substring on long strings

From
Richard Huxton
Date:
On Monday 04 August 2003 16:25, Scott Cain wrote:
[snip]
> In an effort to find the best way to do this operation, I decided to
> look at what is my "worst case" scenario: the DNA sequence for human
> chromosome 1, which is about 250 million characters long (previous
> strings where about 20 million characters long).  I wrote a perl script
> to do several substring operations over this very long string, with
> substring lengths varying between 1000 and 40,000 characters spread out
> over various locations along the string.  While EXTENDED storage won in
> this case, it was a hollow victory: 38 seconds per operation versus 40
> seconds, both of which are way too long to for an interactive
> application.
>
> Time for a new method.  A suggestion from my boss was to "shred" the DNA
> into smallish chunks and a column giving offsets from the beginning of
> the string, so that it can be reassembled when needed. Here is the test
> table:
>
> string=> \d dna
>       Table "public.dna"
>  Column  |  Type   | Modifiers
> ---------+---------+-----------
>  foffset | integer |
>  pdna    | text    |
> Indexes: foffset_idx btree (foffset)

[snipped plpgsql function which stitches chunks together and then substrings]

> So here's the question: I've never written a plpgsql function before, so
> I don't have much experience with it; is there anything obviously wrong
> with this function, or are there things that could be done better?  At
> least this appears to work and is much faster, completing substring
> operations like above in about 0.27 secs (that's about two orders of
> magnitude improvement!)

You might want some checks to make sure that smin < smax, otherwise looks like
it does the job in a good clean fashion.

Glad to hear it's going to solve your problems. Two things you might want to
bear in mind:
1. There's probably a "sweet spot" where the chunk size interacts well with
your data, usage patterns and PGs backend to give you peak performance.
You'll have to test.
2. If you want to search for a sequence you'll need to deal with the case
where it starts in one chunk and ends in another.

--
  Richard Huxton
  Archonet Ltd

Re: [PERFORM] EXTERNAL storage and substring on long strings

From
Scott Cain
Date:
On Mon, 2003-08-04 at 11:55, Richard Huxton wrote:
> On Monday 04 August 2003 16:25, Scott Cain wrote:
> [snip]
> > [snip]
>
> You might want some checks to make sure that smin < smax, otherwise looks like
> it does the job in a good clean fashion.

Good point--smin < smax generally by virtue of the application using the
database, but I shouldn't assume that will always be the case.
>
> Glad to hear it's going to solve your problems. Two things you might want to
> bear in mind:
> 1. There's probably a "sweet spot" where the chunk size interacts well with
> your data, usage patterns and PGs backend to give you peak performance.
> You'll have to test.

Yes, I had a feeling that was probably the case-- since this is an open
source project, I will need to write directions for installers on
picking a reasonable chunk size.

> 2. If you want to search for a sequence you'll need to deal with the case
> where it starts in one chunk and ends in another.

I forgot about searching--I suspect that application is why I faced
opposition for shredding in my schema development group.  Maybe I should
push that off to the file system and use grep (or BLAST).  Otherwise, I
could write a function that would search the chunks first, then after
failing to find the substring in those, I could start sewing the chunks
together to look for the query string.  That could get ugly (and
slow--but if the user knows that and expects it to be slow, I'm ok with
that).

Thanks,
Scott

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory


Re: [PERFORM] EXTERNAL storage and substring on long strings

From
Scott Cain
Date:
On Mon, 2003-08-04 at 11:53, Tom Lane wrote:
> Scott Cain <cain@cshl.org> writes:
> > At least this appears to work and is much faster, completing substring
> > operations like above in about 0.27 secs (that's about two orders of
> > magnitude improvement!)
>
> I find it really, really hard to believe that a crude reimplementation
> in plpgsql of the TOAST concept could beat the built-in implementation
> at all, let alone beat it by two orders of magnitude.
>
> Either there's something unrealistic about your testing of the
> dna_string function, or your original tests are not causing TOAST to be
> invoked in the expected way, or there's a bug we need to fix.  I'd
> really like to see some profiling of the poor-performing
> external-storage case, so we can figure out what's going on.
>
I was really hoping for a "Good job and glad to hear it" from you :-)

I don't think there is anything unrealistic about my function or its
testing, as it is very much along the lines of the types of things we do
now.  I will really try to do some profiling this week to help figure
out what is going on.

Scott

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory


Re: [PERFORM] EXTERNAL storage and substring on long strings

From
Joe Conway
Date:
Scott Cain wrote:
> On Mon, 2003-08-04 at 11:53, Tom Lane wrote:
>>I find it really, really hard to believe that a crude reimplementation
>>in plpgsql of the TOAST concept could beat the built-in implementation
>>at all, let alone beat it by two orders of magnitude.
>>
>>Either there's something unrealistic about your testing of the
>>dna_string function, or your original tests are not causing TOAST to be
>>invoked in the expected way, or there's a bug we need to fix.  I'd
>>really like to see some profiling of the poor-performing
>>external-storage case, so we can figure out what's going on.
>
> I was really hoping for a "Good job and glad to hear it" from you :-)
>
> I don't think there is anything unrealistic about my function or its
> testing, as it is very much along the lines of the types of things we do
> now.  I will really try to do some profiling this week to help figure
> out what is going on.

Is there a sample table schema and dataset available (external-storage
case) that we can play with?

Joe


Re: [PERFORM] EXTERNAL storage and substring on long strings

From
"Matt Clark"
Date:
> > 2. If you want to search for a sequence you'll need to deal with the case
> > where it starts in one chunk and ends in another.
>
> I forgot about searching--I suspect that application is why I faced
> opposition for shredding in my schema development group.  Maybe I should
> push that off to the file system and use grep (or BLAST).  Otherwise, I
> could write a function that would search the chunks first, then after
> failing to find the substring in those, I could start sewing the chunks
> together to look for the query string.  That could get ugly (and
> slow--but if the user knows that and expects it to be slow, I'm ok with
> that).

If you know the max length of the sequences being searched for, and this is much less than the chunk size, then you
couldsimply 
have the chunks overlap by that much, thus guaranteeing every substring will be found in its entirety in at least one
chunk.



Re: [PERFORM] EXTERNAL storage and substring on long strings

From
"Shridhar Daithankar"
Date:
On 4 Aug 2003 at 12:14, Scott Cain wrote:
> I forgot about searching--I suspect that application is why I faced
> opposition for shredding in my schema development group.  Maybe I should
> push that off to the file system and use grep (or BLAST).  Otherwise, I
> could write a function that would search the chunks first, then after
> failing to find the substring in those, I could start sewing the chunks
> together to look for the query string.  That could get ugly (and
> slow--but if the user knows that and expects it to be slow, I'm ok with
> that).

I assume your DNA sequence is compacted. Your best bet would be to fetch them
from database and run blast on them in client memory. No point duplicating
blast functionality. Last I tried it beat every technique of text searching
when heuristics are involved.

Bye
 Shridhar

--
There are two types of Linux developers - those who can spell, andthose who
can't. There is a constant pitched battle between the two.(From one of the post-
1.1.54 kernel update messages posted to c.o.l.a)


Re: [PERFORM] EXTERNAL storage and substring on long strings

From
Scott Cain
Date:
Joe,

Good idea, since I may not get around to profiling it this week.  I
created a dump of the data set I was working with.  It is available at
http://www.gmod.org/string_dump.bz2

Thanks,
Scott


On Mon, 2003-08-04 at 16:29, Joe Conway wrote:
> Is there a sample table schema and dataset available (external-storage
> case) that we can play with?
>
> Joe
--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory


Re: [PERFORM] EXTERNAL storage and substring on long strings

From
Scott Cain
Date:
Oh, and I forgot to mention: it's highly compressed (bzip2 -9) and is
109M.

Scott

On Tue, 2003-08-05 at 11:01, Scott Cain wrote:
> Joe,
>
> Good idea, since I may not get around to profiling it this week.  I
> created a dump of the data set I was working with.  It is available at
> http://www.gmod.org/string_dump.bz2
>
> Thanks,
> Scott
>
>
> On Mon, 2003-08-04 at 16:29, Joe Conway wrote:
> > Is there a sample table schema and dataset available (external-storage
> > case) that we can play with?
> >
> > Joe
--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory


Re: [PERFORM] EXTERNAL storage and substring on long strings

From
Joe Conway
Date:
Scott Cain wrote:
> Oh, and I forgot to mention: it's highly compressed (bzip2 -9) and is
> 109M.

Thanks. I'll grab a copy from home later today and see if I can find
some time to poke at it.

Joe




Re: [PERFORM] EXTERNAL storage and substring on long strings

From
Jan Wieck
Date:
Tom Lane wrote:
> Scott Cain <cain@cshl.org> writes:
>> At least this appears to work and is much faster, completing substring
>> operations like above in about 0.27 secs (that's about two orders of
>> magnitude improvement!)
>
> I find it really, really hard to believe that a crude reimplementation
> in plpgsql of the TOAST concept could beat the built-in implementation
> at all, let alone beat it by two orders of magnitude.
>
> Either there's something unrealistic about your testing of the
> dna_string function, or your original tests are not causing TOAST to be
> invoked in the expected way, or there's a bug we need to fix.  I'd
> really like to see some profiling of the poor-performing
> external-storage case, so we can figure out what's going on.

Doesn't look that unrealistic to me. A plain text based substring
function will reassemble the whole beast first before cutting out the
wanted part. His manually chunked version will read only those chunks
needed. Considering that he's talking about retrieving a few thousand
chars from a hundreds of MB size string ...


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


Re: [PERFORM] EXTERNAL storage and substring on long strings

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
> Doesn't look that unrealistic to me. A plain text based substring
> function will reassemble the whole beast first before cutting out the
> wanted part. His manually chunked version will read only those chunks
> needed.

So does substr(), if applied to EXTERNAL (non-compressed) toasted text.
See John Gray's improvements a release or two back.

            regards, tom lane

Re: [PERFORM] EXTERNAL storage and substring on long strings

From
Scott Cain
Date:
Agreed.  When I actually Did It Right (tm), EXTERNAL storage gave
similar (probably better) performance as my shredding method, without
all the hoops to breakup and reassemble the string.

Scott

On Thu, 2003-08-14 at 14:00, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
> > Doesn't look that unrealistic to me. A plain text based substring
> > function will reassemble the whole beast first before cutting out the
> > wanted part. His manually chunked version will read only those chunks
> > needed.
>
> So does substr(), if applied to EXTERNAL (non-compressed) toasted text.
> See John Gray's improvements a release or two back.
>
>             regards, tom lane
--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory


Re: [PERFORM] EXTERNAL storage and substring on long strings

From
Jan Wieck
Date:
Tom Lane wrote:

> Jan Wieck <JanWieck@Yahoo.com> writes:
>> Doesn't look that unrealistic to me. A plain text based substring
>> function will reassemble the whole beast first before cutting out the
>> wanted part. His manually chunked version will read only those chunks
>> needed.
>
> So does substr(), if applied to EXTERNAL (non-compressed) toasted text.
> See John Gray's improvements a release or two back.

Duh ... of course, EXTERNAL is uncompressed ... where do I have my head?


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #