Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: [HACKERS] Custom compression methods
Msg-id: CA+TgmoZ0XOg0rUCidoie8cMxqeHU4T56fmBtE+5AhADncBGPcw@mail.gmail.com
In response to: Re: [HACKERS] Custom compression methods (Dilip Kumar <dilipbalaut@gmail.com>)
List: pgsql-hackers
Please remember to trim unnecessary quoted material.

On Sun, Feb 7, 2021 at 6:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> [ a whole lot of quoted stuff ]
>
> I have tested the performance of pglz vs. lz4:
>
> Test1: Small, simple strings; pglz doesn't attempt compression, but
> lz4 does, since it has no minimum size limit.
> Table: 100 varchar columns
> Test: Insert 1000 tuples, each column a 25-byte string (32 bytes is
> the minimum limit for pglz)
> Result:
> pglz: 1030 ms (doesn't attempt compression, so externalizes)
> lz4: 212 ms
>
> Test2: Small, incompressible strings; pglz doesn't attempt
> compression, lz4 attempts it but cannot compress.
> Table: 100 varchar columns
> Test: Insert 1000 tuples, each column a 25-byte string (32 bytes is
> the minimum limit for pglz)
> Result:
> pglz: 1030 ms (doesn't attempt compression, so externalizes)
> lz4: 1090 ms (attempts to compress, but externalizes anyway)
>
> Test3: A few columns with large random data.
> Table: 3 varchar columns
> Test: Insert 1000 tuples, 3 columns of 3500 bytes, 4200 bytes, and
> 4900 bytes respectively
> pglz: 150 ms (compression ratio: 3.02%)
> lz4: 30 ms (compression ratio: 2.3%)
>
> Test4: Like Test3, but with large, slightly compressible random data
> that needs to be compressed + externalized.
> Table: 3 varchar columns
> Insert: Insert 1000 tuples, 3 columns of 8192 bytes each
> CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
> 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
> Test: insert into t1 select large_val(), large_val(), large_val() from
> generate_series(1,1000);
> pglz: 2000 ms
> lz4: 1500 ms
>
> Conclusion:
> 1. In most cases lz4 is faster and compresses better as well.
> 2. In Test2, where the small data is incompressible, lz4 tries to
> compress whereas pglz doesn't, so there is some performance loss.
> But if we want, we can fix that by setting a minimum size limit for
> lz4 as well, maybe the same as pglz's?
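
For anyone who wants to reproduce this, here is a sketch of the Test4
setup as I read it, assuming the per-column COMPRESSION syntax from the
patch; the table and column names are my guesses, not the exact script:

    CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
    'select array_agg(md5(g::text))::text from generate_series(1, 256) g';

    CREATE TABLE t_pglz (a text COMPRESSION pglz, b text COMPRESSION pglz,
                         c text COMPRESSION pglz);
    CREATE TABLE t_lz4  (a text COMPRESSION lz4,  b text COMPRESSION lz4,
                         c text COMPRESSION lz4);

    \timing on
    INSERT INTO t_pglz SELECT large_val(), large_val(), large_val()
        FROM generate_series(1, 1000);
    INSERT INTO t_lz4  SELECT large_val(), large_val(), large_val()
        FROM generate_series(1, 1000);

Each large_val() result is a text value of roughly 8 kB built from 256
md5 hashes, which is why it is only slightly compressible.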

So my conclusion here is that perhaps there's no real problem. It
looks like externalizing is so expensive compared to compression that
it's worth trying to compress even though it may not always pay off.
If, by trying to compress, we avoid externalizing, it's a huge win
(~5x: 1030 ms vs. 212 ms in Test1). If we try to compress and don't
manage to avoid externalizing, it's a small loss (~6%: 1090 ms vs.
1030 ms in Test2). It's probably reasonable to expect that
compressible data is more common than incompressible data, so not only
is the win a lot bigger than the loss, but we should also expect it to
happen a lot more often. It's not impossible that somebody could get
bitten, but it doesn't feel like a huge risk to me.

One thing that does occur to me is that it might be a good idea to
skip compression if it doesn't change the number of chunks that will
be stored into the TOAST table. If we compress the value but still
need to externalize it, and the compression didn't save enough to
reduce the number of chunks, I suppose we ideally would externalize
the uncompressed version. That would save decompression time later,
without really costing anything. However, I suppose that would be a
separate improvement from this patch. Maybe the possibility of
compressing smaller values makes it slightly more important, but I'm
not sure that it's worth getting excited about.
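
To make the arithmetic concrete, here's a back-of-envelope sketch of the
chunk-count point; the 1996-byte chunk size is an assumption (the usual
TOAST_MAX_CHUNK_SIZE for 8 kB pages in a default build):

    -- Shrinking a value from 8192 to 8000 bytes saves no chunks, so
    -- storing it uncompressed would cost nothing extra on disk.
    SELECT ceil(8192 / 1996.0) AS uncompressed_chunks,  -- 5
           ceil(8000 / 1996.0) AS compressed_chunks;    -- 5

Only when the compressed size crosses one of those chunk boundaries does
compression actually buy anything on disk.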

If anyone feels otherwise on either point, it'd be good to hear about it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


