Thread: Efficient slicing/substring of TOAST values (for comment)

Efficient slicing/substring of TOAST values (for comment)

From
John Gray
Date:
Hi all,

I attach a patch which adds access routines for efficient extraction of
parts of TOAST values.

The principal additions are two routines in tuptoaster.c,
heap_tuple_untoast_attr_slice and toast_fetch_datum_slice. The latter
uses extra index scankeys to retrieve only the TOAST chunks which
contain the requested substring. This will provide a performance benefit
if you repeatedly extract small portions (e.g. file headers) from
TOASTed values, as only one or two chunks will need to be fetched. This
function is only invoked for external, uncompressed storage.

The public access routine (heap_tuple_untoast_attr_slice) does take care
of slicing values that are stored compressed or inline, but doesn't
provide any performance benefit in those cases.

The access macros are in the same vein to existing ones:
PG_GETARG_TEXT_P_SLICE(n,start,length)
for example.

What I haven't done:

1. Documentation. If this patch is appropriate or acceptable, I'll add
documentation.

2. Changed e.g. textsubstr and byteasubstr to use this method.
textsubstr is complicated by the multibyte support -the fast method is
only applicable in a non-multibyte environment. Also, the SQL negative
offset rule is not embodied in what I've added, and the subscripts are
zero-based. This was on the assumption that if the data was binary (e.g.
JPEG/JFIF data) and the user's intent was to extract the header, it
would be clearer to use zero-based offsets.

3. Added any facility to force a column to have attstorage 'e'. At
present it appears to be defaulted from typstorage, but I couldn't see
any problem with changing it after table creation. Would a keyword to
CREATE TABLE to override the default attstorage be useful? -especially
if the user knew that the data for a column would not be very
compressible (there would be a performance gain in not trying to
compress it, and just storing it externally uncompressed).

Of course, this may just all be useless feature bloat or not up to
scratch coding-wise (and please say so if it is) but please let me know
if it's worth me documenting this or adding any more to it.

(diffs against versions current in CVS as of twenty minutes ago or so)

Regards

John

Attachment

Re: Efficient slicing/substring of TOAST values (for comment)

From
Tom Lane
Date:
John Gray <jgray@azuli.co.uk> writes:
> 2. Changed e.g. textsubstr and byteasubstr to use this method.
> textsubstr is complicated by the multibyte support -the fast method is
> only applicable in a non-multibyte environment.

More accurately, it's only applicable in single-byte character sets,
which is a useful fraction of the MULTIBYTE world too.  Keep in mind
that MULTIBYTE is likely to default to ON in the near future, so there's
little point in such an exercise if you are going to ifdef it out.
I believe there's a routine somewhere that will tell you whether the
current database encoding is single-byte or multi-byte.

> 3. Added any facility to force a column to have attstorage 'e'. At
> present it appears to be defaulted from typstorage, but I couldn't see
> any problem with changing it after table creation. Would a keyword to
> CREATE TABLE to override the default attstorage be useful?

I've been envisioning adding an ALTER TABLE variant to do this, rather
than mucking up CREATE TABLE --- compare the handling of attstattarget.

> Of course, this may just all be useless feature bloat

No, this has been a to-do item for awhile.  I think it's a tad too late
to think in terms of squeezing it into 7.2, but keep working on it for
7.3.

Have you got any ideas about the update side of the problem?

            regards, tom lane

Re: Efficient slicing/substring of TOAST values (for

From
John Gray
Date:
On Mon, 2001-10-08 at 16:06, Tom Lane wrote:
> John Gray <jgray@azuli.co.uk> writes:
> > 2. Changed e.g. textsubstr and byteasubstr to use this method.
> > textsubstr is complicated by the multibyte support -the fast method is
> > only applicable in a non-multibyte environment.
>
> More accurately, it's only applicable in single-byte character sets,
> which is a useful fraction of the MULTIBYTE world too.  Keep in mind
> that MULTIBYTE is likely to default to ON in the near future, so there's
> little point in such an exercise if you are going to ifdef it out.
> I believe there's a routine somewhere that will tell you whether the
> current database encoding is single-byte or multi-byte.
>
I've found pg_database_encoding_max_length() which seems to be the right
thing.

There may be some prospect of performance improvement even for multibyte
cases. For a substr(attr,start,len) call, I could fetch the segment
starting from the start of the attribute through to
pg_database_encoding_max_length()*(start+len). If this is less than the
length of attr then we can save time and memory. The existing multibyte
substring operation is then applied to this returned string.

> > 3. Added any facility to force a column to have attstorage 'e'. At
> > present it appears to be defaulted from typstorage, but I couldn't see
> > any problem with changing it after table creation. Would a keyword to
> > CREATE TABLE to override the default attstorage be useful?
>
> I've been envisioning adding an ALTER TABLE variant to do this, rather
> than mucking up CREATE TABLE --- compare the handling of attstattarget.
>
I've had a look, and it seems that AlterTableAlterColumnStatistics
contains a lot of template code (and only a small amount actually
dealing with stats). If we had different Node types for the different
Column flag-changing operations (especially those which don't touch
tuple data at all) then the current AlterTableAlterColumnStatistics
might become AlterTableAlterColumnFlags (?) and take several kinds of
nodes. Thus all the inheritance-supporting code wouldn't need to be
duplicated. Alternatively, cloning the routine wouldn't be too much
work.

> No, this has been a to-do item for awhile.  I think it's a tad too late
> to think in terms of squeezing it into 7.2, but keep working on it for
> 7.3.
>
Shall do. I'll develop something for substr operations in line with
above.

> Have you got any ideas about the update side of the problem?
>

Well, if a user is only updating a small part of a toasted value, then
only the corresponding chunks need be updated -assuming that the update
is Procrustean i.e. that the substitute string is coerced to the same
length as the segment it replaces, by truncation or blank padding. What
I'm not sure about is the syntax for specifying such an update. Would we
just overload the syntax for an array?

e.g. UPDATE objects SET obj_data[32:47]='0123456789abcdef';

I can see the other problem being that an updated toast value currently
gets a new valueid. Maybe I'll need to work through this more, but I
can't see why that need be the case -i.e. the toast update could just
perform a normal update on each chunk rather than creating new ones and
deleting old ones?

Regards

John




Re: Efficient slicing/substring of TOAST values (for comment)

From
Tom Lane
Date:
John Gray <jgray@azuli.co.uk> writes:
> I've had a look, and it seems that AlterTableAlterColumnStatistics
> contains a lot of template code (and only a small amount actually
> dealing with stats). If we had different Node types for the different
> Column flag-changing operations (especially those which don't touch
> tuple data at all) then the current AlterTableAlterColumnStatistics
> might become AlterTableAlterColumnFlags (?) and take several kinds of
> nodes. Thus all the inheritance-supporting code wouldn't need to be
> duplicated. Alternatively, cloning the routine wouldn't be too much
> work.

Sounds like a great idea --- the ALTER TABLE code has kind of grown
without supervision over the past few releases, since cut-and-paste
was the easiest starting point for implementing any new variant.
Some creative refactoring would help it out a lot.  Go for it, if
you feel like doing that.

[ back to original subject ]

>> Have you got any ideas about the update side of the problem?

> What I'm not sure about is the syntax for specifying such an update.

Yup, that's the key part of the problem.

> Would we just overload the syntax for an array?

> e.g. UPDATE objects SET obj_data[32:47]='0123456789abcdef';

This makes lots of sense for bytea, which ought to be considered an
array type anyway, but I'm less certain about doing it that way for
multibyte text objects.  Seems like it could be pretty inefficient
if the subscriptable entities are of varying length.  Ideas anyone?

> I can see the other problem being that an updated toast value currently
> gets a new valueid.

Not sure if this is necessary or not.  Jan?

            regards, tom lane

Re: Efficient slicing/substring of TOAST values (for comment)

From
"Joe Conway"
Date:
> > What I'm not sure about is the syntax for specifying such an update.
>
> Yup, that's the key part of the problem.
>
> > Would we just overload the syntax for an array?
>
> > e.g. UPDATE objects SET obj_data[32:47]='0123456789abcdef';
>
> This makes lots of sense for bytea, which ought to be considered an
> array type anyway, but I'm less certain about doing it that way for
> multibyte text objects.  Seems like it could be pretty inefficient
> if the subscriptable entities are of varying length.  Ideas anyone?
>

FWIW, here's the SQL SVR 7 spec:
-----------------------------------------
UPDATETEXT {table_name.dest_column_name dest_text_ptr}
 { NULL  | insert_offset }
 { NULL  | delete_length}
  [WITH LOG]
  [ inserted_data  | {table_name.src_column_name src_text_ptr}]

Oracle 8 define DBMS_LOB.ERASE(ptr, length, offset) and DBMS_LOB.WRITE(ptr,
length, offset, data) functions, but I don't see an UPDATE function for
LOBs.

SQL99 says:
4.2.2.1
<character overlay function> is a function, OVERLAY, that modifies a string
argument by replacing a given substring of the string, which is specified by
a
given numeric starting position and a given numeric length, with another
string
(called the replacement string). When the length of the substring is zero,
nothing is removed from the original string and the string returned by the
function is the result of inserting the replacement string into the original
string at the starting position.

4.3.2.1
<blob overlay function> is a function identical in syntax and semantics to
<character overlay func-tion> except that the first argument, second
argument, and returned value are all binary strings.


So based on all of that, maybe:
UPDATE <table> SET <text_or_bytea_field> = OVERLAY(<text_or_bytea_field>,
'replacement_string', offset, length)

-- Joe


Re: Efficient slicing/substring of TOAST values (for

From
John Gray
Date:
On Thu, 2001-10-11 at 04:54, Tom Lane wrote:

> Sounds like a great idea --- the ALTER TABLE code has kind of grown
> without supervision over the past few releases, since cut-and-paste
> was the easiest starting point for implementing any new variant.
> Some creative refactoring would help it out a lot.  Go for it, if
> you feel like doing that.
>
I've done a little tidying, repackaging of standard code etc. as a first
step. It hasn't made the file much shorter, but has made some of the
routines clearer where they share basically the same code for
inheritance and permissions checking.

AS regards the ALTER TABLE ALTER COLUMN syntax for attstorage, should I
use keywords for the values or just char values 'e', 'x' etc ?

ALTER TABLE example ALTER COLUMN test SET STORAGE EXTERNALPLAIN;
or
ALTER TABLE example ALTER COLUMN test SET STORAGE 'e';

(if anyone has any views on a better keyword than STORAGE...)

Obviously I'll have to get to grips with the parser next (which will be
a novelty as I don't have any familiarity with lex/yacc)

Regards

John



Re: Efficient slicing/substring of TOAST values (for comment)

From
Tom Lane
Date:
John Gray <jgray@azuli.co.uk> writes:
>> Some creative refactoring would help it out a lot.  Go for it, if
>> you feel like doing that.
>>
> I've done a little tidying, repackaging of standard code etc. as a first
> step. It hasn't made the file much shorter, but has made some of the
> routines clearer where they share basically the same code for
> inheritance and permissions checking.

Great!  I'm afraid I just committed some fixes that will probably cause
a merge conflict, but I trust it won't be too painful to resolve.

> AS regards the ALTER TABLE ALTER COLUMN syntax for attstorage, should I
> use keywords for the values or just char values 'e', 'x' etc ?

I'd suggest keywords, namely the same ones used in CREATE TYPE for
these storage options.

            regards, tom lane

Re: Efficient slicing/substring of TOAST values (for comment)

From
Bruce Momjian
Date:
> John Gray <jgray@azuli.co.uk> writes:
> >> Some creative refactoring would help it out a lot.  Go for it, if
> >> you feel like doing that.
> >>
> > I've done a little tidying, repackaging of standard code etc. as a first
> > step. It hasn't made the file much shorter, but has made some of the
> > routines clearer where they share basically the same code for
> > inheritance and permissions checking.
>
> Great!  I'm afraid I just committed some fixes that will probably cause
> a merge conflict, but I trust it won't be too painful to resolve.
>
> > AS regards the ALTER TABLE ALTER COLUMN syntax for attstorage, should I
> > use keywords for the values or just char values 'e', 'x' etc ?
>
> I'd suggest keywords, namely the same ones used in CREATE TYPE for
> these storage options.

I assume this will be for 7.3?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Efficient slicing/substring of TOAST values (for comment)

From
Bruce Momjian
Date:
John, do you have an updated version of this patch that is ready to be
applied to 7.3?

---------------------------------------------------------------------------

John Gray wrote:
> Hi all,
>
> I attach a patch which adds access routines for efficient extraction of
> parts of TOAST values.
>
> The principal additions are two routines in tuptoaster.c,
> heap_tuple_untoast_attr_slice and toast_fetch_datum_slice. The latter
> uses extra index scankeys to retrieve only the TOAST chunks which
> contain the requested substring. This will provide a performance benefit
> if you repeatedly extract small portions (e.g. file headers) from
> TOASTed values, as only one or two chunks will need to be fetched. This
> function is only invoked for external, uncompressed storage.
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026