Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers
| From | Robert Haas |
|---|---|
| Subject | Re: [HACKERS] Custom compression methods |
| Date | |
| Msg-id | CA+TgmoZ0XOg0rUCidoie8cMxqeHU4T56fmBtE+5AhADncBGPcw@mail.gmail.com |
| In response to | Re: [HACKERS] Custom compression methods (Dilip Kumar <dilipbalaut@gmail.com>) |
| Responses | Re: [HACKERS] Custom compression methods |
| List | pgsql-hackers |
Please remember to trim unnecessary quoted material.

On Sun, Feb 7, 2021 at 6:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> [ a whole lot of quoted stuff ]
>
> I have tested the performance, pglz vs lz4
>
> Test1: With a small simple string, pglz doesn't select compression but
> lz4 does, since it has no minimum size limit
> Table: 100 varchar columns
> Test: Insert 1000 tuples, each column a 25-byte string (32 bytes is the
> minimum limit for pglz)
> Result:
> pglz: 1030 ms (doesn't attempt compression, so externalizes)
> lz4: 212 ms
>
> Test2: With a small incompressible string, pglz doesn't select
> compression; lz4 selects it but cannot compress
> Table: 100 varchar columns
> Test: Insert 1000 tuples, each column a 25-byte string (32 bytes is the
> minimum limit for pglz)
> Result:
> pglz: 1030 ms (doesn't attempt compression, so externalizes)
> lz4: 1090 ms (attempts to compress but externalizes)
>
> Test3: Test a few columns with large random data
> Table: 3 varchar columns
> Test: Insert 1000 tuples, 3 columns of size (3500 bytes, 4200 bytes,
> 4900 bytes)
> pglz: 150 ms (compression ratio: 3.02%)
> lz4: 30 ms (compression ratio: 2.3%)
>
> Test4: Test3 with different large random, slightly compressible data
> that needs compress + externalize:
> Table: 3 varchar columns
> Insert: Insert 1000 tuples, 3 columns of size (8192, 8192, 8192)
> CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
> 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
> Test: insert into t1 select large_val(), large_val(), large_val() from
> generate_series(1,1000);
> pglz: 2000 ms
> lz4: 1500 ms
>
> Conclusion:
> 1. In most cases lz4 is faster and compresses better as well.
> 2. In Test2, when small data is incompressible, lz4 tries to compress
> whereas pglz doesn't try, so there is some performance loss. But if we
> want, we can fix it by setting some minimum size limit for lz4 as well,
> maybe the same as for pglz?

So my conclusion here is that perhaps there's no real problem. It looks like externalizing is so expensive compared to compression that it's worth trying to compress even though it may not always pay off. If, by trying to compress, we avoid externalizing, it's a huge win (~5x). If we try to compress and don't manage to avoid externalizing, it's a small loss (~6%). It's probably reasonable to expect that compressible data is more common than incompressible data, so not only is the win a lot bigger than the loss, but we should also expect it to happen a lot more often. It's not impossible that somebody could get bitten, but it doesn't feel like a huge risk to me.

One thing that does occur to me is that it might be a good idea to skip compression if it doesn't change the number of chunks that will be stored into the TOAST table. If we compress the value but still need to externalize it, and the compression didn't save enough to reduce the number of chunks, I suppose we would ideally externalize the uncompressed version. That would save decompression time later, without really costing anything (a sketch of this heuristic follows below). However, I suppose that would be a separate improvement from this patch. Maybe the possibility of compressing smaller values makes it slightly more important, but I'm not sure that it's worth getting excited about. If anyone feels otherwise on either point, it'd be good to hear about it.

--
Robert Haas
EDB: http://www.enterprisedb.com
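For illustration, here is a minimal standalone sketch of the chunk-count heuristic described above. It is not the actual PostgreSQL TOAST code: `TOAST_CHUNK_SIZE`, `chunks_needed`, and `keep_compressed` are hypothetical names chosen for this example, and the real check would have to live in the TOAST insertion path, where the actual chunk size is `TOAST_MAX_CHUNK_SIZE` and depends on the block size.

```c
/*
 * chunk_heuristic.c - a standalone sketch (not PostgreSQL source) of the
 * idea: when a value must be externalized anyway, keep the compressed
 * copy only if it reduces the number of TOAST chunks; otherwise store it
 * uncompressed and avoid the decompression cost on every later read.
 */
#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-in for PostgreSQL's TOAST_MAX_CHUNK_SIZE. */
#define TOAST_CHUNK_SIZE 2000

/* Number of chunk rows needed to store nbytes (integer ceiling division). */
static size_t
chunks_needed(size_t nbytes)
{
    return (nbytes + TOAST_CHUNK_SIZE - 1) / TOAST_CHUNK_SIZE;
}

/*
 * Decide whether the compressed copy of an externalized value is worth
 * keeping: true only if it actually reduces the chunk count.
 */
static bool
keep_compressed(size_t raw_size, size_t compressed_size)
{
    return chunks_needed(compressed_size) < chunks_needed(raw_size);
}

int
main(void)
{
    /* 8000 -> 7900 bytes: 4 chunks either way, so discard the compression. */
    printf("%d\n", keep_compressed(8000, 7900));   /* prints 0 */

    /* 8000 -> 5000 bytes: 4 chunks down to 3, so keep the compressed copy. */
    printf("%d\n", keep_compressed(8000, 5000));   /* prints 1 */
    return 0;
}
```

Under this rule, marginal compression wins like the first case would be thrown away at toasting time, matching the observation that they cost decompression later without saving any TOAST I/O.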