RFD: ALTER COLUMN .. SET STORAGE COMPRESSED; - Mailing list pgsql-hackers

From Dawid Kuroczko
Subject RFD: ALTER COLUMN .. SET STORAGE COMPRESSED;
Date
Msg-id 758d5e7f0806100445j13e4768fy4342554890c14a6c@mail.gmail.com
Whole thread Raw
Responses Re: RFD: ALTER COLUMN .. SET STORAGE COMPRESSED;  (Zdenek Kotala <Zdenek.Kotala@Sun.COM>)
Re: RFD: ALTER COLUMN .. SET STORAGE COMPRESSED;  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: RFD: ALTER COLUMN .. SET STORAGE COMPRESSED;  (Alvaro Herrera <alvherre@commandprompt.com>)
List pgsql-hackers
Hello!

Currently the TOASTing code does its magic when whole tuple is
larger than TOAST_TUPLE_TARGET which happens to be around 2KB.

There are times though when one is willing to trade using (fast) CPU to
reduce amount of (slow) I/O.  A data warehousing types of workload
most notably.  Rarely used large columns which are likely to compress
well but are not large enough to trigger inline compression.

As we already have four types of ALTER COLUMN .. SET STORAGE
{ PLAIN | EXTERNAL | EXTENDED | MAIN } I would like to add
"COMPRESSED" which would force column compression (if column is
smaller than some minimun, I guess somwehwere between 16 and 32 bytes).

First of all, would such a feature be desirable? [1]


...as for implementation idea, so far I see it more or less like this:
* src/backend/access/common/heaptuple.c:  for tuples with COMPRESSED attributes, we set the infomask bit
HEAP_HASEXTERNAL,so that tuple will trigger TOAST regardless  of size.
 
* src/backend/access/heap/tuptoaster.c: - add a bool "need_compress = false;" around line 425. - while scanning the
attributes(lines 472-575), mark the ones which should be COMPRESSED - if (need_compress), compress every marked column.
-perhaps refactor inline compression code (639-659) as a static  funcion shared with need_compress part above.
 

Does this sound reasonable?

PS: as a side note: I wonder if perhaps we could try compression erarlier,
at 1KB or event at 0.5KB, but leave TOASTing at 2KB limit)?

[1]: Actually some time ago I did write a system which stores tons of
real[0:59] (an hour's worth of every minute readings) data.  Such column
takes approximately 246 bytes.  For fun and experiment I did transform
the data into real[0:23][0:59] storing whole day's data.  To my surprise
such column stores between 64 (!) and 5968.  Also 66% of values were
taking less than 254 bytes (and 55% < 128 bytes)...  And as the data
is much larger than RAM and read randomly, having it shrunk by more
than 25% is tempting.  Hence the idea of SET STORAGE COMPRESSED.
I know such schema is flawed by design, but I guess there are other types
of data which would also see benefit from such an option.
-- 
Solving [site load issues] with [more database replication] is a lot
like solving your own personal problems with heroin - at first it
sorta works, but after a while things just get out of hand.


pgsql-hackers by date:

Previous
From: Zeugswetter Andreas OSB sIT
Date:
Subject: Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
Next
From: Teodor Sigaev
Date:
Subject: Re: Proposal: GiST constraints