Re: SUBSTRING performance for large BYTEA - Mailing list pgsql-general

From Gregory Stark
Subject Re: SUBSTRING performance for large BYTEA
Date
Msg-id 87mywoshwe.fsf@oxford.xeocode.com
In response to Re: SUBSTRING performance for large BYTEA  ("Joshua D. Drake" <jd@commandprompt.com>)
Responses Re: SUBSTRING performance for large BYTEA  (Karsten Hilbert <Karsten.Hilbert@gmx.net>)
List pgsql-general
"Joshua D. Drake" <jd@commandprompt.com> writes:

> Well this is a guess, but:
>
> Set existing column to storage external
> update existing column with existing data:
>
> UPDATE foo SET bar = bar;

Well, not quite. That would actually reuse the toast pointer without
decompressing it. We try to be clever about not decompressing and duplicating
toast pointers unnecessarily on updates -- in this case too clever.
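You can see this for yourself by comparing pg_column_size(), which reports the
on-disk (possibly compressed) size, with octet_length(), which reports the full
uncompressed length. Just a sanity check, using the foo/bar names from above:

-- a still-compressed value has a stored size smaller than its uncompressed length
SELECT octet_length(bar) AS uncompressed_bytes,
       pg_column_size(bar) AS stored_bytes
  FROM foo;

UPDATE foo SET bar = bar;

-- re-running the first query should show the same stored sizes:
-- the update just reused the existing toast pointers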

You could do this:

postgres=# ALTER TABLE foo ALTER bar TYPE bytea, ALTER bar SET STORAGE external;
ALTER TABLE

(Note that you have to include the 'ALTER bar SET STORAGE external' in the
same command or the storage will get reset to the default 'extended' for bytea
even if it was previously set to 'external'.)
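If you want to confirm the setting took effect, psql's \d+ output shows a
Storage column, or you can look at attstorage directly -- 'x' is the default
extended, 'e' is external:

SELECT attname, attstorage
  FROM pg_attribute
 WHERE attrelid = 'foo'::regclass AND attname = 'bar';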

When I tested this, though, I noticed it did *not* decompress compressed data
that was small enough to be stored internally. This may actually be desirable in
your case, since anything small enough to be stored internally is probably not
worth the bother of decompressing so it can be streamed out. It will still not be
compressed the next time you update it, though, so it's not really helpful in the
long term.
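
To get an idea of how much compressed data is left over after the rewrite,
something like this should do (only still-compressed values have a stored size
smaller than their uncompressed length):

SELECT count(*) AS still_compressed
  FROM foo
 WHERE pg_column_size(bar) < octet_length(bar);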

If you want to decompress everything you have to do something like:

postgres=# ALTER TABLE foo ALTER bar TYPE bytea USING bar||'', ALTER bar SET STORAGE external;
ALTER TABLE

However, note that this will require extra memory for both the decompressed
original value and the new value produced by "appending" the empty string.

Another option would be to update only the records which need to be
decompressed with something like

UPDATE foo SET bar = bar || '' WHERE length(bar) > pg_column_size(bar);

This at least gives you the option of doing them in small groups or even one
by one. I would suggest vacuuming between each update.
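
As a rough sketch -- assuming foo has an integer primary key named id, which you
would have to adapt to your schema and a sensible batch size -- batching could
look like:

-- repeat for each key range, vacuuming in between to reclaim the dead tuples
UPDATE foo SET bar = bar || ''
 WHERE length(bar) > pg_column_size(bar)
   AND id BETWEEN 1 AND 1000;
VACUUM foo;

UPDATE foo SET bar = bar || ''
 WHERE length(bar) > pg_column_size(bar)
   AND id BETWEEN 1001 AND 2000;
VACUUM foo;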



I do have to wonder how you're getting the data *in* though. If it's large
enough to have to stream out like this then how do you initially load the
data?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
