Re: EXTERNAL storage and substring on long strings - Mailing list pgsql-performance

From Tom Lane
Subject Re: EXTERNAL storage and substring on long strings
Date
Msg-id 7896.1060202191@sss.pgh.pa.us
Whole thread Raw
In response to EXTERNAL storage and substring on long strings  (Scott Cain <cain@cshl.org>)
Responses Re: EXTERNAL storage and substring on long strings
List pgsql-performance
Scott Cain <cain@cshl.org> writes:
> A few days ago, I asked for advice on speeding up substring queries on
> the GENERAL mailing list.  Joe Conway helpfully pointed out the ALTER
> TABLE STORAGE EXTERNAL documentation.  After doing the alter,
> the queries got slower!  Here is the background:

Ah-hah, I've sussed it ... you didn't actually change the storage
representation.  You wrote:

> Now, I'll change the storage:
>    alter table feature alter column residues set storage external;
> To make sure that really happens, I run an update on feature:
>    update feature set residues = residues where feature_id<8;
> and then VACUUM ANALYZE again.

This sounds good --- in fact, I think we all just accepted it when we
read it --- but in fact *that update didn't decompress the toasted data*.
The tuple toaster sees that the same toasted value is being stored back
into the row, and so it just re-uses the existing toasted data; it does
not stop to notice that the column storage preference has changed.

To actually get the storage to change, you need to feed the value
through some function or operator that will decompress it.  Then it
won't get recompressed when it's stored.  One easy way (since this
is a text column) is

    update feature set residues = residues || '' where feature_id<8;

To verify that something really happened, try doing VACUUM VERBOSE on
the table before and after.  The quoted number of tuples in the toast
table should rise substantially.

I did the following comparisons on the test data you made available,
using two tables in which one has default storage and one has "external"
(not compressed) storage:

scott=# \timing
Timing is on.
scott=# select length (dna) from edna;
  length
-----------
 245203899
(1 row)

Time: 1.05 ms
scott=# select length (dna) from ddna;
  length
-----------
 245203899
(1 row)

Time: 1.11 ms
scott=# select length(substring(dna from 1000000 for 20000)) from edna;
 length
--------
  20000
(1 row)

Time: 30.43 ms
scott=# select length(substring(dna from 1000000 for 20000)) from ddna;
 length
--------
  20000
(1 row)

Time: 37383.02 ms
scott=#

So it looks like the external-storage optimization for substring() does
work as expected, once you get the data into the right format ...

            regards, tom lane

pgsql-performance by date:

Previous
From: "scott.marlowe"
Date:
Subject: Re: partitioning for postgresql
Next
From: "Wilson A. Galafassi Jr."
Date:
Subject: ext3 block size