Re: Optimizing pglz compressor - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Optimizing pglz compressor
Msg-id 008501ce6cdc$51981790$f4c846b0$@kapila@huawei.com
In response to Re: Optimizing pglz compressor  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On Tuesday, March 05, 2013 7:03 PM Heikki Linnakangas wrote:

> I spent some more time on this, and came up with the attached patch. It
> includes the changes I posted earlier, to use indexes instead of
> pointers in the hash table. In addition, it makes the hash table size
> variable, depending on the length of the input. This further reduces
> the startup cost on small inputs. I changed the hash method slightly,
> because the old method would not use any bits from the 3rd byte with a
> small hash table size, but fortunately that didn't seem to have a negative
> impact with larger hash table sizes either.
> 
> I wrote a little C extension to test this. It contains a function,
> which runs pglz_compress() on a bytea input, N times. I ran that with
> different kinds of inputs, and got the following results:
> 

The purpose of this patch is to improve LZ compression speed by reducing the
startup cost of initializing the hist_start array.
To achieve this, it uses a variable-size hash table and reduces the size of
each history entry by replacing pointers with int16 indexes.
It achieves its purpose for small data, but for large data performance is
degraded in some cases; refer to the second set of performance data below.
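A rough sketch of the idea as I understand it (this is my own illustration,
not code from the patch; the size thresholds are guesses):

#include <string.h>
#include <stdint.h>

/*
 * pglz_compress() must clear the hist_start[] lookup table at every call,
 * so the table size dominates the startup cost for short inputs.  Two
 * changes reduce that cost: store int16 slot numbers instead of pointers
 * (half the bytes to memset), and size the table to the input length.
 */
#define PGLZ_MAX_HISTORY_LISTS  8192    /* the fixed size used today */

static int16_t hist_start[PGLZ_MAX_HISTORY_LISTS];  /* slot -> first entry, 0 = empty */

/* Pick a hash table size for this input; the cutoffs are illustrative only. */
static int
choose_hash_size(int32_t slen)
{
    if (slen < 128)
        return 512;
    if (slen < 256)
        return 1024;
    if (slen < 512)
        return 2048;
    if (slen < 1024)
        return 4096;
    return PGLZ_MAX_HISTORY_LISTS;
}

/* Per-call startup work: clear only the slots that will actually be used. */
static int
init_history(int32_t slen)
{
    int     hashsz = choose_hash_size(slen);

    memset(hist_start, 0, hashsz * sizeof(int16_t));
    return hashsz;              /* caller masks hash values with (hashsz - 1) */
}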

1. Patch compiles cleanly and all regression tests passed.
2. The change in the pglz_hist_idx macro is not very clear to me, nor is it
explained in the comments.
3. Why is the first entry kept as INVALID_ENTRY? It appears to me that this
is for cleaner checks in the code (see the sketch after this list).
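To illustrate point 3, this is how I read the INVALID_ENTRY convention
(again my own sketch, not code from the patch): with int16 slot numbers,
reserving entry 0 means a zeroed hist_start[] leaves every hash chain empty,
so chain handling needs no NULL-pointer special case.

#include <stdint.h>
#include <stddef.h>

#define INVALID_ENTRY   0       /* slot 0 of hist_entries[] is never used */

/* History entries linked by slot number instead of by pointer. */
typedef struct HistEntrySketch
{
    int16_t     next;           /* next entry in the hash chain, 0 = end */
    const unsigned char *pos;   /* match candidate position in the input */
} HistEntrySketch;

/*
 * Unused hist_start[] slots and the 'next' field of the last chain entry
 * are simply 0, so one comparison against INVALID_ENTRY covers both the
 * "empty chain" and "end of chain" cases.
 */
static const unsigned char *
first_candidate(const int16_t *hist_start, const HistEntrySketch *hist_entries,
                int hindex)
{
    int16_t     slot = hist_start[hindex];

    if (slot == INVALID_ENTRY)
        return NULL;            /* chain is empty */
    return hist_entries[slot].pos;
}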

Performance Data 
------------------
I have used pglz-variable-size-hash-table.patch to collect all performance
data:


Results of compress-tests.sql -- inserting large data into tmp table
---------------------------------------------------------------------
     testname      | unpatched |  patched
-------------------+-----------+-----------
 5k text           |    4.8932 |    4.9014
 512b text         |   22.6209 |   18.6849
 256b text         |   13.9784 |    8.9342
 1K text           |   20.4969 |   20.5988
 2k random         |   10.5826 |   10.0758
 100k random       |    3.9056 |    3.8200
 500k random       |   22.4078 |   22.1971
 512b random       |   15.7788 |   12.9575
 256b random       |   18.9213 |   12.5209
 1K random         |   11.3933 |    9.8853
 100k of same byte |    5.5877 |    5.5960
 500k of same byte |    2.6853 |    2.6500


Observation
-------------
1. This clearly shows that the patch improves performance for small data
without any noticeable impact on large data.


Performance data for directly calling the pglz_compress function (tests.sql)
---------------------------------------------------------------------------
select testname,
       (compresstest(data, nrows, 8192)::numeric / 1000)::numeric(10,3) as auto
from tests;
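(Since the test extension itself is not quoted above, here is a minimal
sketch of what the timing loop of such a wrapper might look like.  The
function name, the argument handling and the pre-9.5 pglz API used here are
my assumptions; the real compresstest() takes a third argument and evidently
returns a derived metric, since the values below can be negative.)

#include "postgres.h"
#include "fmgr.h"
#include "utils/pg_lzcompress.h"        /* pre-9.5 location of the pglz API */
#include "portability/instr_time.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(compresstest_sketch);

/*
 * compresstest_sketch(data bytea, loops int4) returns int8
 *
 * Compresses 'data' with pglz_compress() 'loops' times and returns the
 * elapsed wall-clock time in microseconds.  Illustration only.
 */
Datum
compresstest_sketch(PG_FUNCTION_ARGS)
{
    bytea      *raw = PG_GETARG_BYTEA_P(0);
    int32       loops = PG_GETARG_INT32(1);
    const char *src = VARDATA(raw);
    int32       srclen = VARSIZE(raw) - VARHDRSZ;
    PGLZ_Header *dest;
    instr_time  start,
                duration;
    int         i;

    /* Worst-case output buffer, reused across iterations. */
    dest = (PGLZ_Header *) palloc(PGLZ_MAX_OUTPUT(srclen));

    INSTR_TIME_SET_CURRENT(start);
    for (i = 0; i < loops; i++)
        (void) pglz_compress(src, srclen, dest, PGLZ_strategy_always);
    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);

    PG_RETURN_INT64((int64) INSTR_TIME_GET_MICROSEC(duration));
}

A matching CREATE FUNCTION ... RETURNS int8 AS 'MODULE_PATHNAME' LANGUAGE C
STRICT declaration would expose it to SQL.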



Head
     testname      |   auto
-------------------+-----------
 5k text           |  3511.879
 512b text         |  1430.990
 256b text         |  1768.796
 1K text           |  1390.134
 3K text           |  4099.304
 2k random         |  -402.916
 100k random       |   -10.311
 500k random       |    -2.019
 512b random       |  -753.317
 256b random       | -1096.999
 1K random         |  -559.931
 10k of same byte  |  3548.651
 100k of same byte | 36037.280
 500k of same byte | 25565.195
(14 rows)

Patch (pglz-variable-size-hash-table.patch)
     testname      |   auto
-------------------+-----------
 5k text           |  3840.207
 512b text         |  1088.897
 256b text         |   982.172
 1K text           |  1402.488
 3K text           |  4334.802
 2k random         |  -333.100
 100k random       |    -8.390
 500k random       |    -1.672
 512b random       |  -499.251
 256b random       |  -524.889
 1K random         |  -453.177
 10k of same byte  |  4754.367
 100k of same byte | 50427.735
 500k of same byte | 36163.265
(14 rows)

Observations
--------------
1. For small data, performance is always better with the patch.
2. For random data, small or large, performance is good.
3. For medium and large text and same-byte data (3K and 5K text; 10K, 100K,
500K same byte), performance is degraded.

I have used the attached compress-tests-init.sql to generate the data.
I am really not sure why the data you reported and what I have taken differ
in a few cases. I have tried multiple times, but the result is the same.
Kindly let me know if you think I am doing something wrong.

Note - To generate the data in randomhex, I used COPY from a file. I used the
same command you provided to generate the file.

With Regards,
Amit Kapila.
