On Thu, 8 Nov 2012 17:33:54 +0000 Amit Kapila wrote:
 On Mon, 29 Oct 2012 20:02:11 +0530 Amit Kapila wrote:
 On Sunday, October 28, 2012 12:28 AM Heikki Linnakangas wrote:
 >>> One idea is to use the LZ format in the WAL record, but use your
 >>> memcmp() code to construct it. I believe the slow part in LZ compression
 >>> is in trying to locate matches in the "history", so if you just replace
 >>> that with your code that's aware of the column boundaries and uses
 >>> simple memcmp() to detect what parts changed, you could create LZ
 >>> compressed output just as quickly as the custom encoded format. It would
 >>> leave the door open for making the encoding smarter or to do actual
 >>> compression in the future, without changing the format and the code to
 >>> decode it.
 >>This is good idea. I shall try it.
 >>In the existing algorithm for storing the new data which is not present in
 >>the history, it needs 1 control byte for
 >>every 8 bytes of new data which can increase the size of the compressed
 >>output as compare to our delta encoding approach.
 >>Approach-2
 >---------------
 >>Use only one bit for control data [0 - Length and new data, 1 - pick from
 >>history based on OFFSET-LENGTH]
 >>The modified bit value (0) is to handle the new field data as a continuous
 >>stream of data, instead of treating every byte as a new data.
 > Attached are the patches
 > 1. wal_update_changes_lz_v4 - to use LZ Approach with memcmp to construct WAL record
 > 2. wal_update_changes_modified_lz_v5 - to use modified LZ Approach as mentioned above as Approach-2
 > The main Changes as compare to previous patch are as follows:
 > 1. In heap_delta_encode, use LZ encoding instead of Custom encoding.
 > 2. Instead of get_tup_info(), introduced heap_getattr_with_len() macro based on suggestion from Noah.
 > 3. LZ macro's moved from .c to .h, as they need to be used for encoding.
 > 4. Changed the format for function arguments for heap_delta_encode()/heap_delta_decode() based on suggestion from Noah.
 
Please find the updated patches attached with this mail.
 
Modification in these Patches apart from above:
1. Traverse the tuple only once (previously it needs to traverse 3 times) to check if particular offset matches and get the offset to generate encoded tuple.
    To achieve this I have modified function heap_tuple_attr_equals() to heap_attr_get_length_and_check_equals(), so that it can get the length of tuple attribute
    which can be used to calculate offset. A separate function can also be written to achieve the same.
2. Improve the comments in code.
 
Performance Data:
1. Please refer testcase in attached file pgbench_250.c
   Refer Function used to create random string at end of mail.
2. The detail data and configuration settings can be reffered in attached files (pgbench_encode_withlz_ff100 & pgbench_encode_withlz_ff80).
 
Benchmark results with -F 100:
 -Patch-             -tps@-c1-   -tps@-c2-   -tps@-c4-   -tps@-c8-   -WAL@-c8-
 xlogscale           802         1453        2253        2643        13.99 GB
 xlogscale+org lz    807         1602        3168        5140        9.50 GB
 xlogscale+mod lz    796         1620        3216        5270        9.16 GB 
Benchmark results with -F 80:
 -Patch-             -tps@-c1-   -tps@-c2-   -tps@-c4-   -tps@-c8-   -WAL@-c8-
 xlogscale           811         1455        2148        2704        13.6 GB
 xlogscale+org lz    829         1684        3223        5325        9.13 GB
 xlogscale+mod lz    801         1657        3263        5488        8.86 GB
 
> I shall write the wal_update_changes_custom_delta_v6, and then we can compare all the three patches performance data and decide which one to go based on results.
   The results with this are not better than above 2 Approaches, so I am not attaching it.
 
Function used to create randome string
--------------------------------------------------------
CREATE OR REPLACE FUNCTION random_text_md5_v2(INTEGER)
RETURNS TEXT
LANGUAGE SQL
AS $$
    select upper(
        substring(
            (
                SELECT string_agg(md5(random()::TEXT), '')
                FROM generate_series(1, CEIL($1 / 32.)::integer)
                ),
        $1)
    );
$$;
Suggestions/Comments?
 
With Regards,
Amit Kapila.