Thread: substring implementation (long string)

substring implementation (long string)

From
Scott Cain
Date:
Hello,

I am wondering about the implementation of substring for very large
strings.  I've got strings that are several million characters long and
frequently need to extract relatively small substrings (5000-40000
characters) (that's right, it's DNA).  Before I cared much about
performance, I retrieved the whole string and and substr'ed it in perl.
I realized recently it is better to do the substring in postgres
(performance increase by an order of magnitude).  So here is what I am
wondering: does postgres read the whole string into memory before it
does the substring, or does it have some sort of smart way of reading
just the substring from disk?

I am wondering because I can think of ways of potentially improving
performance, but at significant cost to the API, and I don't want to
implement it unless I will get a big boost in performance.

Thanks,
Scott

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory


Re: substring implementation (long string)

From
Joe Conway
Date:
Scott Cain wrote:
> I am wondering about the implementation of substring for very large
> strings.  I've got strings that are several million characters long and
> frequently need to extract relatively small substrings (5000-40000
> characters) (that's right, it's DNA).  Before I cared much about
> performance, I retrieved the whole string and and substr'ed it in perl.
> I realized recently it is better to do the substring in postgres
> (performance increase by an order of magnitude).  So here is what I am
> wondering: does postgres read the whole string into memory before it
> does the substring, or does it have some sort of smart way of reading
> just the substring from disk?

I believe that if you store the text uncompressed (which is not the
default) then the substring function can just grab the section of
interest. Normally text columns this large are compressed though, which
requires them to be read completely from disk before they are sliced.

See:
http://www.postgresql.org/docs/view.php?version=7.3&idoc=0&file=sql-altertable.html

In particular:
ALTER TABLE [ ONLY ] table [ * ] ALTER [ COLUMN ] column
       SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }

SET STORAGE

     This form sets the storage mode for a column. This controls whether
this column is held inline or in a supplementary table, and whether the
data should be compressed or not. PLAIN must be used for fixed-length
values such as INTEGER and is inline, uncompressed. MAIN is for inline,
compressible data. EXTERNAL is for external, uncompressed data and
EXTENDED is for external, compressed data. EXTENDED is the default for
all data types that support it. The use of EXTERNAL will make substring
operations on a TEXT column faster, at the penalty of increased storage
space.


I think you'll want
ALTER TABLE mytable ALTER COLUMN bigtextcol SET STORAGE EXTERNAL;

As far as I know, there is no way to create a table with STORAGE
EXTERNAL; you have to create the table and then alter it with the above
statement.


HTH,

Joe



Re: substring implementation (long string)

From
Scott Cain
Date:
Joe,

Thanks--that looks like exactly what I need.  Here's a related question:
when I do the alter table, is there a way I can make it "retroactive".
That is, I ran the alter table, and nothing happened.  I expected the
storage space for the database to jump, and for it to take a while to do
it, but it did not.  So I am guessing that the alter table only applies
to new data put in.  Is there a way to make it apply to all the data
already in the column, short of dropping the table and reloading it.
Dropping and reloading is unattractive because of several foreign key
constraints on this table.

Thanks,
Scott


On Wed, 2003-07-30 at 00:46, Joe Conway wrote:
> Scott Cain wrote:
> > I am wondering about the implementation of substring for very large
> > strings.  I've got strings that are several million characters long and
> > frequently need to extract relatively small substrings (5000-40000
> > characters) (that's right, it's DNA).  Before I cared much about
> > performance, I retrieved the whole string and and substr'ed it in perl.
> > I realized recently it is better to do the substring in postgres
> > (performance increase by an order of magnitude).  So here is what I am
> > wondering: does postgres read the whole string into memory before it
> > does the substring, or does it have some sort of smart way of reading
> > just the substring from disk?
>
> I believe that if you store the text uncompressed (which is not the
> default) then the substring function can just grab the section of
> interest. Normally text columns this large are compressed though, which
> requires them to be read completely from disk before they are sliced.
>
> See:
> http://www.postgresql.org/docs/view.php?version=7.3&idoc=0&file=sql-altertable.html
>
> In particular:
> ALTER TABLE [ ONLY ] table [ * ] ALTER [ COLUMN ] column
>        SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }
>
> SET STORAGE
>
>      This form sets the storage mode for a column. This controls whether
> this column is held inline or in a supplementary table, and whether the
> data should be compressed or not. PLAIN must be used for fixed-length
> values such as INTEGER and is inline, uncompressed. MAIN is for inline,
> compressible data. EXTERNAL is for external, uncompressed data and
> EXTENDED is for external, compressed data. EXTENDED is the default for
> all data types that support it. The use of EXTERNAL will make substring
> operations on a TEXT column faster, at the penalty of increased storage
> space.
>
>
> I think you'll want
> ALTER TABLE mytable ALTER COLUMN bigtextcol SET STORAGE EXTERNAL;
>
> As far as I know, there is no way to create a table with STORAGE
> EXTERNAL; you have to create the table and then alter it with the above
> statement.
>
>
> HTH,
>
> Joe
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         cain@cshl.org
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory


Re: substring implementation (long string)

From
Joe Conway
Date:
Scott Cain wrote:
> Thanks--that looks like exactly what I need.  Here's a related question:
> when I do the alter table, is there a way I can make it "retroactive".
> That is, I ran the alter table, and nothing happened.  I expected the
> storage space for the database to jump, and for it to take a while to do
> it, but it did not.  So I am guessing that the alter table only applies
> to new data put in.  Is there a way to make it apply to all the data
> already in the column, short of dropping the table and reloading it.
> Dropping and reloading is unattractive because of several foreign key
> constraints on this table.
>

You probably could do something like:

UPDATE mytable SET somefield = somefield;

Joe