On Thu, 8 Nov 2012 17:33:54 +0000 Amit Kapila wrote:
On Mon, 29 Oct 2012 20:02:11 +0530 Amit Kapila wrote:
On Sunday, October 28, 2012 12:28 AM Heikki Linnakangas wrote:
>>> One idea is to use the LZ format in the WAL record, but use your
>>> memcmp() code to construct it. I believe the slow part in LZ compression
>>> is in trying to locate matches in the "history", so if you just replace
>>> that with your code that's aware of the column boundaries and uses
>>> simple memcmp() to detect what parts changed, you could create LZ
>>> compressed output just as quickly as the custom encoded format. It would
>>> leave the door open for making the encoding smarter or to do actual
>>> compression in the future, without changing the format and the code to
>>> decode it.
>>This is good idea. I shall try it.
>>In the existing algorithm for storing the new data which is not present in
>>the history, it needs 1 control byte for
>>every 8 bytes of new data which can increase the size of the compressed
>>output as compare to our delta encoding approach.
>>Approach-2
>---------------
>>Use only one bit for control data [0 - Length and new data, 1 - pick from
>>history based on OFFSET-LENGTH]
>>The modified bit value (0) is to handle the new field data as a continuous
>>stream of data, instead of treating every byte as a new data.
> Attached are the patches
> 1. wal_update_changes_lz_v4 - to use LZ Approach with memcmp to construct WAL record
> 2. wal_update_changes_modified_lz_v5 - to use modified LZ Approach as mentioned above as Approach-2
> The main Changes as compare to previous patch are as follows:
> 1. In heap_delta_encode, use LZ encoding instead of Custom encoding.
> 2. Instead of get_tup_info(), introduced heap_getattr_with_len() macro based on suggestion from Noah.
> 3. LZ macro's moved from .c to .h, as they need to be used for encoding.
> 4. Changed the format for function arguments for heap_delta_encode()/heap_delta_decode() based on suggestion from Noah.
Please find the updated patches attached with this mail.
Modification in these Patches apart from above:
1. Traverse the tuple only once (previously it needs to traverse 3 times) to check if particular offset matches and get the offset to generate encoded tuple.
To achieve this I have modified function heap_tuple_attr_equals() to heap_attr_get_length_and_check_equals(), so that it can get the length of tuple attribute
which can be used to calculate offset. A separate function can also be written to achieve the same.
2. Improve the comments in code.
Performance Data:
1. Please refer testcase in attached file pgbench_250.c
Refer Function used to create random string at end of mail.
2. The detail data and configuration settings can be reffered in attached files (pgbench_encode_withlz_ff100 & pgbench_encode_withlz_ff80).
Benchmark results with -F 100:
-Patch- -tps@-c1- -tps@-c2- -tps@-c4- -tps@-c8- -WAL@-c8-
xlogscale 802 1453 2253 2643 13.99 GB
xlogscale+org lz 807 1602 3168 5140 9.50 GB
xlogscale+mod lz 796 1620 3216 5270 9.16 GB
Benchmark results with -F 80:
-Patch- -tps@-c1- -tps@-c2- -tps@-c4- -tps@-c8- -WAL@-c8-
xlogscale 811 1455 2148 2704 13.6 GB
xlogscale+org lz 829 1684 3223 5325 9.13 GB
xlogscale+mod lz 801 1657 3263 5488 8.86 GB
> I shall write the wal_update_changes_custom_delta_v6, and then we can compare all the three patches performance data and decide which one to go based on results.
The results with this are not better than above 2 Approaches, so I am not attaching it.
Function used to create randome string
--------------------------------------------------------
CREATE OR REPLACE FUNCTION random_text_md5_v2(INTEGER)
RETURNS TEXT
LANGUAGE SQL
AS $$
select upper(
substring(
(
SELECT string_agg(md5(random()::TEXT), '')
FROM generate_series(1, CEIL($1 / 32.)::integer)
),
$1)
);
$$;
Suggestions/Comments?
With Regards,
Amit Kapila.