Re: Optimizing pglz compressor - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Optimizing pglz compressor
Date
Msg-id 5135F3FB.7040706@vmware.com
Whole thread Raw
In response to Re: Optimizing pglz compressor  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Optimizing pglz compressor  (Joachim Wieland <joe@mcknight.de>)
Re: Optimizing pglz compressor  (Amit Kapila <amit.kapila@huawei.com>)
List pgsql-hackers
I spent some more time on this, and came up with the attached patch. It
includes the changes I posted earlier, to use indexes instead of
pointers in the hash table. In addition, it makes the hash table size
variable, depending on the length of the input. This further reduces the
startup cost on small inputs. I changed the hash method slightly,
because the old method would not use any bits from the 3rd byte with a
small hash table size, but fortunately that didn't seem to negative
impact with larger hash table sizes either.

I wrote a little C extension to test this. It contains a function, which
runs pglz_compress() on a bytea input, N times. I ran that with
different kinds of inputs, and got the following results:

unpatched:

      testname      |   auto
-------------------+-----------
  5k text           |  1425.750
  512b text         |  1287.413
  2k random         | -1053.160
  100k random       | -1056.379
  512b random       | -1018.474
  64b of text       | -2547.106
  64b random        | -3731.700
  100k of same byte |  1093.885
  100k text         |   849.430
(9 rows)

pglz-replace-pointers-with-indexes.patch (the patch I posted earlier):

     testname      |   auto
-------------------+-----------
  5k text           |  1251.576
  512b text         |   946.348
  2k random         |  -815.095
  100k random       |  -808.356
  512b random       |  -614.418
  64b of text       |  -744.382
  64b random        | -1060.758
  100k of same byte |  1192.397
  100k text         |   796.530
(9 rows)

pglz-variable-size-hash-table.patch:

      testname      |  auto
-------------------+----------
  5k text           | 1276.905
  512b text         |  823.796
  2k random         | -844.484
  100k random       | -848.080
  512b random       | -615.039
  64b of text       | -393.448
  64b random        | -568.685
  100k of same byte | 1186.759
  100k text         |  799.978
(9 rows)

These values are from a single run of the test, but I did repeat them
several times to make sure there isn't too much variability in them to
render the results meaningless. The negative values mean that
pglz_compress gave up on the compression in the test, ie. it shows how
long it took for pglz_compress to realize that it can't compress the
input. Compare the abs() values.

With the variable-size hash table, I'm not sure how significant the
earlier patch to use indexes instead of pointers is. But I don't think
it makes things any worse, so it's included in this.

On 01.03.2013 17:42, Heikki Linnakangas wrote:
> On 01.03.2013 17:37, Alvaro Herrera wrote:
>> My take on this is that if this patch is necessary to get Amit's patch
>> to a commitable state, it's fair game.
>
> I don't think it's necessary for that, but let's see..

With these tweaks, I was able to make pglz-based delta encoding perform
roughly as well as Amit's patch. So I think that's the approach we
should take, as it's a bit simpler and more versatile. I'll follow up on
that in the other thread.

- Heikki

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: sql_drop Event Trigger
Next
From: Michael Paquier
Date:
Subject: Re: Support for REINDEX CONCURRENTLY