Re: [HACKERS] LZTEXT for rule plan stings - Mailing list pgsql-hackers

From Don Baccus
Subject Re: [HACKERS] LZTEXT for rule plan stings
Date
Msg-id 3.0.1.32.20000225211513.016e43b0@mail.pacifier.com
Whole thread Raw
In response to Re: [HACKERS] LZTEXT for rule plan stings  (wieck@debis.com (Jan Wieck))
List pgsql-hackers
At 04:06 AM 2/26/00 +0100, Jan Wieck wrote:

>    False.

>    With  TOAST, the maximum tuple length is limited by available
>    disk space (minus some overhead) and/or the number of bits we
>    use  to  represent  the  values original size and/or the size
>    addressable by the TOAST'ers table at  all.  Available  space
>    allways  limits  the  amount of data in a DB, and you allways
>    have to take some overhead into  account,  but  calling  this
>    _undefined_  isn't  correct  IMHO  -  better  call it hard to
>    figure out.

Same is true for non-TOAST lzText.

Or...non lzText text, for that matter.

Of course, the size of text IS UNDEFINED TODAY.

create table foo (   t1 text,   t2 text);

Pray tell, what is the maximum size of t1? 

Is it independent of t2?   Or...correct me if I'm mistaken...if t2
contains 8,000+ characters won't "insert" bomb me if I try to 
insert 8,000+ characters into t1?  Or even a few characters?  Exactly
where is this vaunted and well defined limit?

(oops - you can't answer that question because it depends on the
size of BLCKSZ, which of course one can change at will)

The maximum size of "text" is already undefined, as it depends on:

BLCKSZ (which the user may not've set herself, and maybe is unaware of
if she's a user at the mercy of some sysadmin)

and 

the actual bytes occupied by other variable-length columns.

"bytea" for instance.  "text" for instance.  "varchar(n)" for instance,
which actually is a variable-length string which has a maximum value.

PG lets me do this:

create table foo (  t1  varchar(8000),  t2  varchar(8000),  tn  varchar(8000) -- n fairly large
);

Can I insert 8K chars into t1?

Into t2?

Into t3?

Trick PG question - into all three at once?

Sorry, but this is a crap argument.  There is no way to know how
many characters you can insert into a "text" column unless you have
detailed knowledge of the table, not only the types in the table
but the data stored in the pertinent row of the table.

I should know, I've been fighting this when porting code over from
Oracle, where the blocksize truly limits the size of ONE COLUMN,
not a row (tuple) at large.

If I can really have a tuple with 1000 varchar(BLCKSZ-overhead) columns,
fully filled with data, could you please tell me how to do this?  My
life will be much simpler.

>> If text were implemented as lzText for a quick 7.1, which apparently
>> was Jan's spin on the idea, then for 7.1 we'd say:
>
>    On  the first look, it was a tempting solution. But there are
>    ton's of places in the backend, that assume  text  is  binary
>    compatible  to  something  or the bytes after the VARSIZE are
>    plain value bytes, not some compressed garbage to  be  passed
>    through  a function first. Replacing TEXT by LZTEXT therefore
>    wouldn't be such an easy job, but would be  working  for  the
>    wastebasked  from  the  very  beginning anyway, because TOAST
>    needs to revert it all again.

>    I don't like that kind of work.

Nor do I, which is why I didn't suggest it when lzText first came up
and drifted into a TOAST discussion.  Clearly, TOAST is a better
solution.

In particular, it solves Hannu's objection regarding the fact that
a compressed text type would have no fixed upper limit.

Better yet, it would solve Hannu's misunderstanding that today's
text type has such a limit.

Because (I love flogging dead horses):

create table foo ( i: integer, t: text);

and 

create table bar ( t: text);

create two columns T with different maximum limits.  Because the
limit is based on tuple-size.

A compressed text type is only a bad idea because it's a dead end.
Not because it turns a "defined" max text limit into an undefined
max text limit.  The maximum number of chars you can stuff into
a text var is always undefined unless you dissect exactly how
other columns eat storage.

>    Maybe I found some kind of compromise:
>
>    -  We make LZTEXT a released type, without warning and anyone
>       can use it as needed.
>
>    -  When  featuring  TOAST,  we  remove  it  and create a type
>       alias. This way, the  "backend"  will  convert  the  table
>       schemas   (WRT   lztext->text)   at  reload  time  of  the
>       conversion.

I have no strong feelings here.  Personally, I can live with just
compiling PG with a 16KB blocksize, for the work I'm doing today.

But I don't think the upgrade problem's a big deal.  If the type's
not popularized, only those of us "inside" will know of it, and as
far as I'm concerned, hand-editing a pg_dump would be fine with me if
I choose to use it.

But I'm only speaking for myself.

TOAST is clearly the way to go.

On the other hand, I don't see people flinging bricks at Interbase
for compressing their text type.  After all, they have outer joins...

>    Actually I have some problems with the type  coercion  stuff.
>    There  are  functions  lztext(text)  and  vice versa, but the
>    system is unable to find an "=" operator for lztext and  text
>    when issuing
>
>      SELECT * FROM t1, t2, WHERE t1.lztext_att = t2.text_att;
>
>    This  worked  in the past releases (IIRC), so I wonder if the
>    failure above is a wanted "feature".  I'll commit the stuff I
>    have  tomorrow  and  hope  someone  can  help  me  to get the
>    coercion working. All we have to do then is to  tell  in  the
>    release notes and docs "Never use LZTEXT type name explicitly
>    in an application query (like for type casting)  -  use  TEXT
>    instead".

Despite the above, I have no really strong feelings.  I only raised
the compressed text issue because my (belated) reading of the Interbase
docs made it clear that they do this, and Tom resurrected lztext in
regard to views (and my problems there probably made it a red herring
in this case, too!)  It's an interesting idea, and if TOAST is indeed
implemented probably a moot one.  Though...where is the crossover between
an in-place compression and moving an item to the TOASTed table.  And...
all of the problems with the backend making assumptions about text
etc will have to be addressed by the TOASTER, too.

For instance...varchar(4000) might still benefit from being compressed,
even if it is not TOASTed, due to PG's love of dragging full tuples
around.  Saves disk space.  Bigger slices of tables can be sorted in
memory vs. disk for any given backend sort/hash buffer size parameter.
Today's x86 CPUs, at least, favor shrinking the memory footprint of
data due to the fact that CPUs tend to be data-starved when working
on large amounts of data in RAM.   Etc etc etc.  So such a compressed
implementation may actually be a win even if Hannu's made happy by
affixing fixed varchar(n) limits on the column length.



- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Serviceand other goodies at http://donb.photo.net.
 


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: ALTER TABLE DROP COLUMN
Next
From: Tom Lane
Date:
Subject: Re: [HACKERS] PC Week Labs benchmark results