Re: Optimizing pglz compressor - Mailing list pgsql-hackers
From: Amit Kapila
Subject: Re: Optimizing pglz compressor
Msg-id: 008501ce6cdc$51981790$f4c846b0$@kapila@huawei.com
In response to: Re: Optimizing pglz compressor (Heikki Linnakangas <hlinnakangas@vmware.com>)
List: pgsql-hackers
On Tuesday, March 05, 2013 7:03 PM Heikki Linnakangas wrote:
> I spent some more time on this, and came up with the attached patch. It
> includes the changes I posted earlier, to use indexes instead of
> pointers in the hash table. In addition, it makes the hash table size
> variable, depending on the length of the input. This further reduces
> the startup cost on small inputs. I changed the hash method slightly,
> because the old method would not use any bits from the 3rd byte with a
> small hash table size, but fortunately that didn't seem to have a
> negative impact with larger hash table sizes either.
>
> I wrote a little C extension to test this. It contains a function,
> which runs pglz_compress() on a bytea input, N times. I ran that with
> different kinds of inputs, and got the following results:

The purpose of this patch is to improve LZ compression speed by reducing the startup cost of initializing the hist_start array. To achieve this, it uses a variable-size hash table and reduces the size of each history entry by replacing pointers with int16 indexes. It achieves its purpose for small data, but for large data performance is degraded in some cases; refer to the second set of performance data.

1. The patch compiles cleanly and all regression tests pass.
2. The change in the pglz_hist_idx macro is not very clear to me, nor is it mentioned in the comments.
3. Why is the first entry kept as INVALID_ENTRY? It appears to me that it is for cleaner checks in the code.
Performance Data
------------------
I have used pglz-variable-size-hash-table.patch to collect all performance data.

Results of compress-tests.sql -- inserting large data into tmp table
--------------------------------------------------------------------
     testname      | unpatched | patched
-------------------+-----------+---------
 5k text           |    4.8932 |  4.9014
 512b text         |   22.6209 | 18.6849
 256b text         |   13.9784 |  8.9342
 1K text           |   20.4969 | 20.5988
 2k random         |   10.5826 | 10.0758
 100k random       |    3.9056 |  3.8200
 500k random       |   22.4078 | 22.1971
 512b random       |   15.7788 | 12.9575
 256b random       |   18.9213 | 12.5209
 1K random         |   11.3933 |  9.8853
 100k of same byte |    5.5877 |  5.5960
 500k of same byte |    2.6853 |  2.6500

Observation
-------------
1. This clearly shows that the patch improves performance for small data without any impact for large data.

Performance data for directly calling the pglz_compress function (tests.sql)
-----------------------------------------------------------------------------
select testname, (compresstest(data, nrows, 8192)::numeric / 1000)::numeric(10,3) as auto from tests;

Head
     testname      |   auto
-------------------+-----------
 5k text           |  3511.879
 512b text         |  1430.990
 256b text         |  1768.796
 1K text           |  1390.134
 3K text           |  4099.304
 2k random         |  -402.916
 100k random       |   -10.311
 500k random       |    -2.019
 512b random       |  -753.317
 256b random       | -1096.999
 1K random         |  -559.931
 10k of same byte  |  3548.651
 100k of same byte | 36037.280
 500k of same byte | 25565.195
(14 rows)

Patch (pglz-variable-size-hash-table.patch)
     testname      |   auto
-------------------+-----------
 5k text           |  3840.207
 512b text         |  1088.897
 256b text         |   982.172
 1K text           |  1402.488
 3K text           |  4334.802
 2k random         |  -333.100
 100k random       |    -8.390
 500k random       |    -1.672
 512b random       |  -499.251
 256b random       |  -524.889
 1K random         |  -453.177
 10k of same byte  |  4754.367
 100k of same byte | 50427.735
 500k of same byte | 36163.265
(14 rows)

Observations
--------------
1. For small data, performance is always better with the patch.
2. For random small/large data, performance is good.
3. For medium and large text and same-byte data (3K, 5K text; 10K, 100K, 500K same byte), performance is degraded.

I have used the attached compress-tests-init.sql to generate the data. I am really not sure why the data you reported and what I have taken differ in a few cases. I have tried multiple times, but the result is the same. Kindly let me know if you think I am doing something wrong.

Note - To generate data in randomhex, I used COPY from file. I used the same command you provided to generate the file.

With Regards,
Amit Kapila.