Thread: Repair cosmetic damage (done by pg_indent?)

Repair cosmetic damage (done by pg_indent?)

From
Gregory Stark
Date:
I think these are some more comments which were misaligned by pg_indent a
couple of releases ago. I recall Bruce fixed a bug in pg_indent which was
causing it, and the damage looked something like this.

I also reduced the indentation so the comments didn't take so many lines.

Fwiw, do we really not want to compress anything smaller than 256 bytes?
(Everyone in Postgres uses the default strategy, not the always strategy.)

ISTM that with things like CHAR(n) around we might very well have some
databases where compression for smaller sized datums would be beneficial. I
would suggest 32 for the minimum.


Index: pg_lzcompress.c
===================================================================
RCS file: /home/stark/src/REPOSITORY/pgsql/src/backend/utils/adt/pg_lzcompress.c,v
retrieving revision 1.26
diff -c -r1.26 pg_lzcompress.c
*** pg_lzcompress.c    6 Apr 2007 04:21:43 -0000    1.26
--- pg_lzcompress.c    27 Jul 2007 14:43:10 -0000
***************
*** 211,239 ****
   * ----------
   */
  static const PGLZ_Strategy strategy_default_data = {
!     256,                        /* Data chunks smaller 256 bytes are not
!                                  * compressed             */
!     6144,                        /* Data chunks greater equal 6K force
!                                  * compression                 */
!     /* except compressed result is greater uncompressed data        */
!     20,                            /* Compression rates below 20% mean fallback
!                                  * to uncompressed      */
!     /* storage except compression is forced by previous parameter    */
!     128,                        /* Stop history lookup if a match of 128 bytes
!                                  * is found            */
!     10                            /* Lower good match size by 10% at every
!                                  * lookup loop iteration. */
  };
  const PGLZ_Strategy * const PGLZ_strategy_default = &strategy_default_data;


  static const PGLZ_Strategy strategy_always_data = {
!     0,                            /* Chunks of any size are compressed                            */
!     0,                            /* */
!     0,                            /* We want to save at least one single byte                        */
!     128,                        /* Stop history lookup if a match of 128 bytes
!                                  * is found            */
!     6                            /* Look harder for a good match.                                */
  };
  const PGLZ_Strategy * const PGLZ_strategy_always = &strategy_always_data;

--- 211,233 ----
   * ----------
   */
  static const PGLZ_Strategy strategy_default_data = {
!     256,    /* Data chunks smaller 256 bytes are not compressed */
!     6144,    /* Data chunks greater equal 6K force compression unless compressed
!              * result is greater uncompressed data  */
!     20,        /* Compression rates below 20% mean fallback to uncompressed
!              * storage unless compression is forced by previous parameter */
!     128,    /* Stop history lookup if a match of 128 bytes is found */
!     10        /* Lower good match size by 10% at every lookup loop iteration. */
  };
  const PGLZ_Strategy * const PGLZ_strategy_default = &strategy_default_data;


  static const PGLZ_Strategy strategy_always_data = {
!     0,        /* Chunks of any size are compressed */
!     0,        /* */
!     0,        /* We want to save at least one single byte */
!     128,    /* Stop history lookup if a match of 128 bytes is found */
!     6        /* Look harder for a good match. */
  };
  const PGLZ_Strategy * const PGLZ_strategy_always = &strategy_always_data;




--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Repair cosmetic damage (done by pg_indent?)

From
Decibel!
Date:
On Fri, Jul 27, 2007 at 04:07:01PM +0100, Gregory Stark wrote:
> Fwiw, do we really not want to compress anything smaller than 256 bytes
> (everyone in Postgres uses the default strategy, not the always strategy).

Is there actually a way to specify always compressing? I'm not seeing it
on http://www.postgresql.org/docs/8.2/interactive/storage-toast.html

> ISTM that with things like CHAR(n) around we might very well have some
> databases where compression for smaller sized datums would be beneficial. I
> would suggest 32 for the minimum.

CPU is generally cheaper than I/O nowadays, so I agree with something
less than 256. Not sure what would be best though.

I do have a database that has both user-entered information as well as
things like email addresses, so I could do some testing on that if
people want.
--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Repair cosmetic damage (done by pg_indent?)

From
Gregory Stark
Date:
"Decibel!" <decibel@decibel.org> writes:

> On Fri, Jul 27, 2007 at 04:07:01PM +0100, Gregory Stark wrote:
>> Fwiw, do we really not want to compress anything smaller than 256 bytes
>> (everyone in Postgres uses the default strategy, not the always strategy).
>
> Is there actually a way to specify always compressing? I'm not seeing it
> on http://www.postgresql.org/docs/8.2/interactive/storage-toast.html

In the code there's an "always" strategy, but nothing in Postgres uses it so
there's no way to set it using ALTER TABLE ... SET STORAGE.

That might be an interesting approach though. We could add another SET STORAGE
value "COMPRESSIBLE" which says to use the always strategy. The neat thing
about this is we could set bpchar to use this storage type by default.
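
Just as a rough sketch of what I mean (the 'c' attstorage code and the helper
name below are made up, nothing like them exists today): the toaster would pick
the pglz strategy from the column's storage setting, something like

    #include "utils/pg_lzcompress.h"

    /*
     * Hypothetical helper: map a column's attstorage code to a pglz
     * strategy.  'c' (COMPRESSIBLE) is invented; everything else keeps
     * today's behaviour.
     */
    static const PGLZ_Strategy *
    toast_choose_strategy(char attstorage)
    {
        if (attstorage == 'c')              /* hypothetical COMPRESSIBLE */
            return PGLZ_strategy_always;    /* compress even tiny datums */
        return PGLZ_strategy_default;       /* current 256-byte minimum */
    }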

It's sort of sad that we're storing the extra padding bytes at all. But I
don't see a way around it. I would really prefer to strip them on input and
add them on output, but then we're back to the issue of possibly having lost
the typmod at some point along the way.

>> ISTM that with things like CHAR(n) around we might very well have some
>> databases where compression for smaller sized datums would be beneficial. I
>> would suggest 32 for the minimum.
>
> CPU is generally cheaper than IO now-a-days, so I agree with something
> less than 256. Not sure what would be best though.

Well, it depends a lot on how large your database is. If your whole database
fits in RAM and you use datatypes like CHAR(n) only for storing data which is
exactly n characters long, then there's really no benefit to trying to compress
smaller data.

If on the other hand your database is heavily I/O-bound and you're using
CHAR(n) or storing other highly repetitive short strings, then compressing data
will save I/O bandwidth at the expense of CPU cycles.

> I do have a database that has both user-entered information as well as
> things like email addresses, so I could do some testing on that if
> people want.

You would have to recompile with the value at line 214 of
src/backend/utils/adt/pg_lzcompress.c set to a lower value.
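
Concretely that just means editing the first member of strategy_default_data
(the struct shown in the patch upthread); 32 below is only the value I
suggested earlier, not something I've benchmarked:

    static const PGLZ_Strategy strategy_default_data = {
        32,        /* was 256: datums smaller than this are not compressed */
        6144,      /* datums of 6K or more force compression unless the result grows */
        20,        /* below 20% savings, fall back to uncompressed storage */
        128,       /* stop the history lookup once a 128-byte match is found */
        10         /* lower the good match size by 10% per lookup iteration */
    };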

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Repair cosmetic damage (done by pg_indent?)

From
Decibel!
Date:
On Sun, Jul 29, 2007 at 12:06:50PM +0100, Gregory Stark wrote:
> "Decibel!" <decibel@decibel.org> writes:
>
> > On Fri, Jul 27, 2007 at 04:07:01PM +0100, Gregory Stark wrote:
> >> Fwiw, do we really not want to compress anything smaller than 256 bytes
> >> (everyone in Postgres uses the default strategy, not the always strategy).
> >
> > Is there actually a way to specify always compressing? I'm not seeing it
> > on http://www.postgresql.org/docs/8.2/interactive/storage-toast.html
>
> In the code there's an "always" strategy, but nothing in Postgres uses it so
> there's no way to set it using ALTER TABLE ... SET STORAGE.
>
> That might be an interesting approach though. We could add another SET STORAGE
> value "COMPRESSIBLE" which says to use the always strategy. The neat thing
> about this is we could set bpchar to use this storage type by default.

Yeah, we should have that. I'll add it to my TODO...

> >> ISTM that with things like CHAR(n) around we might very well have some
> >> databases where compression for smaller sized datums would be beneficial. I
> >> would suggest 32 for the minimum.
> >
> > CPU is generally cheaper than IO now-a-days, so I agree with something
> > less than 256. Not sure what would be best though.
>
> Well it depends a lot on how large your database is. If your whole database
> fits in RAM and you use datatypes like CHAR(n) only for storing data which is
> exactly b characters long then there's really no benefit to trying to compress
> smaller data.

Well, something else to consider is that this could make a big
difference between a database fitting in memory and not...

> If on the other hand your database is heavily I/O-bound and you're using
> CHAR(n) or storing other highly repetitive short strings then compressing data
> will save I/O bandwidth at the expense of cpu cycles.
>
> > I do have a database that has both user-entered information as well as
> > things like email addresses, so I could do some testing on that if
> > people want.
>
> You would have to recompile with the value at line 214 of
> src/backend/utils/adt/pg_lzcompress.c set to a lower value.

Ok, I'll work up some numbers. The only tests that come to mind are how long it
takes to load a dump (with indexes) and the resulting table size. Other ideas?
--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Repair cosmetic damage (done by pg_indent?)

From
Decibel!
Date:
On Fri, Aug 03, 2007 at 06:12:09PM -0500, Decibel! wrote:
> On Sun, Jul 29, 2007 at 12:06:50PM +0100, Gregory Stark wrote:
> > "Decibel!" <decibel@decibel.org> writes:
> >
> > > On Fri, Jul 27, 2007 at 04:07:01PM +0100, Gregory Stark wrote:
> > >> Fwiw, do we really not want to compress anything smaller than 256 bytes
> > >> (everyone in Postgres uses the default strategy, not the always strategy).
> > >
> > > Is there actually a way to specify always compressing? I'm not seeing it
> > > on http://www.postgresql.org/docs/8.2/interactive/storage-toast.html
> >
> > In the code there's an "always" strategy, but nothing in Postgres uses it so
> > there's no way to set it using ALTER TABLE ... SET STORAGE.
> >
> > That might be an interesting approach though. We could add another SET STORAGE
> > value "COMPRESSIBLE" which says to use the always strategy. The neat thing
> > about this is we could set bpchar to use this storage type by default.
>
> Yeah, we should have that. I'll add it to my TODO...

On second thought... how much work would it be to expose the first 3 (or
even all) of the elements in PGLZ_Strategy to SET STORAGE? It's
certainly possible that there are workloads out there that will never be
optimal with one of our pre-defined strategies...
--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Repair cosmetic damage (done by pg_indent?)

From
Decibel!
Date:
On Sun, Jul 29, 2007 at 12:06:50PM +0100, Gregory Stark wrote:
> You would have to recompile with the value at line 214 of
> src/backend/utils/adt/pg_lzcompress.c set to a lower value.

Doesn't seem to be working for me, even in the case of a table with a
bunch of rows containing nothing but 213 'x's. I can't do anything that
changes what pg_class.relpages shows.
--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Repair cosmetic damage (done by pg_indent?)

From
Gregory Stark
Date:
"Decibel!" <decibel@decibel.org> writes:

> On Sun, Jul 29, 2007 at 12:06:50PM +0100, Gregory Stark wrote:
>> You would have to recompile with the value at line 214 of
>> src/backend/utils/adt/pg_lzcompress.c set to a lower value.
>
> Doesn't seem to be working for me, even in the case of a table with a
> bunch of rows containing nothing but 213 'x's. I can't do anything that
> changes what pg_class.relpages shows.

The scenario I was describing was having, for example, 20 fields, each of which
is char(100) and stores 'x' (padded with 99 spaces). So the row is 2k and the
fields are highly compressible, but each is shorter than the 256-byte minimum.

Actually the toaster would stop compressing fields once the total was under
2k. Perhaps for a situation like that we might want some form of STORAGE
parameter which a) set a flag on the tuple descriptor which forced the toaster
to be invoked every time, and b) forced every field with that STORAGE
parameter to be compressed even if the tuple fits in the 2k target size.

But all this depends on what we want to do with the toast target size. If we
can set a per-table toast target size then that situation could be dealt with
by adjusting that parameter.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Repair cosmetic damage (done by pg_indent?)

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> The scenario I was describing was having, for example, 20 fields each
> of which are char(100) and store 'x' (which are padded with 99
> spaces). So the row is 2k but the fields are highly compressible, but
> shorter than the 256 byte minimum.

To be blunt, the solution to problems like that is sending the DBA to a
re-education camp.  I don't think we should invest huge amounts of
effort on something that's trivially fixed by using the correct datatype
instead of the wrong datatype.

            regards, tom lane

Re: Repair cosmetic damage (done by pg_indent?)

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> The scenario I was describing was having, for example, 20 fields each
>> of which are char(100) and store 'x' (which are padded with 99
>> spaces). So the row is 2k but the fields are highly compressible, but
>> shorter than the 256 byte minimum.
>
> To be blunt, the solution to problems like that is sending the DBA to a
> re-education camp.  I don't think we should invest huge amounts of
> effort on something that's trivially fixed by using the correct datatype
> instead of the wrong datatype.

Sorry, there was a bit of a mixup here. The scenario I described above is what
it would take to get Postgres to actually try to compress a small string given
the way the toaster works.

In the real world, interesting cases wouldn't be so extreme. A single CHAR(n)
or text field containing some other very compressible string could easily not
be compressed currently simply because it's under 256 bytes.

I think the richer target here is doing some kind of cross-record compression.
For example, XML text columns often contain the same tags over and over again
in successive records, but any single datum wouldn't be compressible on its own.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Repair cosmetic damage (done by pg_indent?)

From
daveg
Date:
On Sat, Aug 04, 2007 at 09:04:33PM +0100, Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
> > Gregory Stark <stark@enterprisedb.com> writes:
> >> The scenario I was describing was having, for example, 20 fields each
> >> of which are char(100) and store 'x' (which are padded with 99
> >> spaces). So the row is 2k but the fields are highly compressible, but
> >> shorter than the 256 byte minimum.
> >
> > To be blunt, the solution to problems like that is sending the DBA to a
> > re-education camp.  I don't think we should invest huge amounts of
> > effort on something that's trivially fixed by using the correct datatype
> > instead of the wrong datatype.
>
> Sorry, there was a bit of a mixup here. The scenario I described above is what
> it would take to get Postgres to actually try to compress a small string given
> the way the toaster works.
>
> In the real world interesting cases wouldn't be so extreme. Having a single
> CHAR(n) or a text field which contains any other very compressible string
> could easily not be compressed currently due to being under 256 bytes.
>
> I think the richer target here is doing some kind of cross-record compression.
> For example, xml text columns often contain the same tags over and over again
> in successive records but any single datum wouldn't be compressible.

I have a table of (id serial primary key, url text unique) with a few
hundred million urls that average about 120 bytes each. The url index is
only used when a possibly new url is to be inserted, but between the data
and the index this table occupies a large part of the page cache. Any form
of compression here would be really helpful.

-dg


--
David Gould                                      daveg@sonic.net
If simplicity worked, the world would be overrun with insects.

Re: Repair cosmetic damage (done by pg_indent?)

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> I think these are some more comments which were misaligned by pg_indent a
> couple of releases ago. I recall Bruce fixed a bug in pg_indent which was
> causing it and the damage looked something like this.

> I also reduced the indentation so the comments didn't take so many lines.

pg_indent would just push the indentation back over, so the latter part
isn't going to help.  I cleaned it up a bit, but not exactly like that.

> Fwiw, do we really not want to compress anything smaller than 256 bytes
> (everyone in Postgres uses the default strategy, not the always strategy).

This is one of many places where no one got around to doing performance
testing/tweaking after the original commit.  Given the way that
tuptoaster actually uses the compressor (i.e., compressing only the
largest fields to start with), I tend to agree that having the
compressor reject input a priori might be a bad idea.  Needs testing.

            regards, tom lane

Re: Repair cosmetic damage (done by pg_indent?)

From
Gregory Stark
Date:
"daveg" <daveg@sonic.net> writes:

> I have a table of (id serial primary key, url text unique) with a few
> hundred million urls that average about 120 bytes each. The url index is
> only used when a possibly new url is to be inserted, but between the data
> and the index this table occupies a large part of the page cache. Any form
> of compression here would be really helpful.

So the problem here is that it's really hard to compress 120 bytes.

Question 1 is how effective our existing compression mechanism would be. Would
it be able to compress them at all?

Testing that for you would probably require a fair amount of work though. I
think you would need a patch which added per-table settable toast targets, and
a separate change which let you lower the minimum size at which we try to
compress.

Question 2 is how we can do better. I've thought quite a bit about what it
would take to have some sort of precomputed dictionary which would be used as
a primer for the back references. The problem I foresee is that you'll need a
reference to this dictionary and we already have 8 bytes of overhead. If we're
aiming to compress short strings then starting off with 12+ bytes of overhead
is an awfully big handicap.
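
As a rough illustration of the idea (using zlib's preset-dictionary support
purely as a stand-in -- there's nothing like this in pg_lzcompress, and the
dictionary string below is made up):

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int
    main(void)
    {
        /* made-up dictionary of substrings common to many URLs */
        const char *dict = "http://www.https://.com/.org/.net/index.html?id=";
        const char *url  = "http://www.example.com/index.html?id=12345";
        Bytef       out[256];
        z_stream    strm;

        memset(&strm, 0, sizeof(strm));
        deflateInit(&strm, Z_BEST_COMPRESSION);

        /* prime the compressor's history with the shared dictionary so a
         * short input can back-reference it instead of only itself */
        deflateSetDictionary(&strm, (const Bytef *) dict, (uInt) strlen(dict));

        strm.next_in   = (Bytef *) url;
        strm.avail_in  = (uInt) strlen(url);
        strm.next_out  = out;
        strm.avail_out = sizeof(out);
        deflate(&strm, Z_FINISH);

        printf("%lu bytes in, %lu bytes out\n",
               (unsigned long) strlen(url),
               (unsigned long) (sizeof(out) - strm.avail_out));

        deflateEnd(&strm);
        return 0;
    }

The decompression side would of course need the same dictionary
(inflateSetDictionary), which is exactly the "reference to the dictionary"
overhead problem above.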

I'm also wondering if our lz compression just isn't the right tool for such
short strings. It assumes you'll have fairly long common substrings; if you
don't have any, then a backreference of 2-4 bytes isn't going to save you much.
If you're just using a small alphabet of 36 different characters, but fairly
randomly as in URLs, you aren't going to see much compression. Something like
Huffman or arithmetic encoding, which can assign meaning to codes smaller than
a byte, might be more effective. They require a static dictionary, but then
that's precisely what I'm thinking of.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com