Re: Compress ReorderBuffer spill files using LZ4 - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Compress ReorderBuffer spill files using LZ4
Msg-id bd1c42f5-7167-41eb-a473-b49b77ad9363@vondra.me
In response to Re: Compress ReorderBuffer spill files using LZ4  (Julien Tachoires <julmon@gmail.com>)
List pgsql-hackers

On 9/23/24 21:58, Julien Tachoires wrote:
> Hi Tomas,
> 
> Le lun. 23 sept. 2024 à 18:13, Tomas Vondra <tomas@vondra.me> a écrit :
>>
>> Hi,
>>
>> I've spent a bit more time on this, mostly running tests to get a better
>> idea of the practical benefits.
> 
> Thank you for your code review and testing!
> 
>> Firstly, I think there's a bug in ReorderBufferCompress() - it's legal
>> for pglz_compress() to return -1. This can happen if the data is not
>> compressible, and would not fit into the output buffer. The code can't
>> just do elog(ERROR) in this case, it needs to handle that by storing the
>> raw data. The attached fixup patch makes this work for me - I'm not
>> claiming this is the best way to handle this, but it works.
>>
>> FWIW I find it strange the tests included in the patch did not trigger
>> this. That probably means the tests are not quite sufficient.
>>
>>
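
Just to illustrate the pattern I mean (a rough Python sketch, with zlib
standing in for pglz, since the actual fix is C in ReorderBufferCompress):
when the compressor can't shrink the input, store the raw bytes instead of
erroring out.

```python
import os
import zlib

def compress_or_store_raw(data: bytes) -> tuple[bool, bytes]:
    """Try to compress; fall back to the raw bytes when compression
    does not help (the analogue of pglz_compress() returning -1)."""
    compressed = zlib.compress(data)
    if len(compressed) >= len(data):
        # Incompressible input: keep the original bytes and record
        # that no decompression is needed when reading it back.
        return (False, data)
    return (True, compressed)

# Repetitive input compresses fine...
ok, out = compress_or_store_raw(b"abc" * 1000)
assert ok and len(out) < 3000

# ...while random input is stored raw instead of raising an error.
ok, out = compress_or_store_raw(os.urandom(1024))
assert not ok and len(out) == 1024
```
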
>> Now, to the testing. Attached are two scripts, testing different cases:
>>
>> test-columns.sh - Table with a variable number of 'float8' columns.
>>
>> test-toast.sh - Table with a single text column.
>>
>> The script always sets up a publication/subscription on two instances,
>> generates certain amount of data (~1GB for columns, ~3.2GB for TOAST),
>> waits for it to be replicated to the replica, and measures how much data
>> was spilled to disk with the different compression methods (off, pglz
>> and lz4). There's a couple more metrics, but that's irrelevant here.
> 
> It would be interesting to run the same tests with zstd: in my early
> testing I found that zstd was able to provide a better compression
> ratio than lz4, but seemed to use more CPU resources/is slower.
> 

Oh, I completely forgot about zstd. I don't think it'd substantially
change the conclusions, though. It might compress better/worse for some
cases, but the overall behavior would remain the same.

I can't test this right now, the test machine is busy with some other
stuff. But it should not be difficult to update the test scripts I
attached and get results yourself. There are a couple of hard-coded paths
that need to be updated, of course.

>> For the "column" test, it looks like this (this is in MB):
>>
>>     rows    columns    distribution    off     pglz    lz4
>>   ========================================================
>>   100000       1000    compressible    778      20       9
>>                              random    778     778      16
>>   --------------------------------------------------------
>>   1000000       100    compressible    916     116      62
>>                              random    916     916      67
>>
>> It's very clear that for the "compressible" data (which just copies the
>> same value into all columns), both pglz and lz4 can significantly reduce
>> the amount of data. For 1000 columns it's 780MB -> 20MB/9MB, for 100
>> columns it's a bit less efficient, but still good.
>>
>> For the "random" data (where every column gets a random value, but rows
>> are copied), it's a very different story - pglz does not help at all,
>> while lz4 still massively reduces the amount of spilled data.
>>
>> I think the explanation is very simple - for pglz, we compress each row
>> on its own, there's no concept of streaming/context. If a row is
>> compressible, it works fine, but when the row gets random, pglz can't
>> compress it at all. For lz4, this does not matter, because with the
>> streaming mode it still sees that rows are just repeated, and so can
>> compress them efficiently.
> 
> That's correct.
> 
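
The effect is easy to demonstrate with any streaming compressor. A small
sketch (Python's zlib as a stand-in, since neither pglz nor lz4 ships with
the stdlib; the sizes are illustrative, not the actual spill numbers):

```python
import os
import zlib

row = os.urandom(100)   # a single "random" row: not compressible by itself
rows = [row] * 1000     # ...but the same row repeats many times

# pglz-like: each row compressed independently, no shared context.
per_row_total = sum(len(zlib.compress(r)) for r in rows)

# lz4-streaming-like: one compression context shared across all rows.
ctx = zlib.compressobj()
streamed_total = sum(len(ctx.compress(r)) for r in rows) + len(ctx.flush())

# Per-row compression cannot beat the raw size; the stream
# recognizes the repeated rows and shrinks them drastically.
assert per_row_total >= 100 * 1000
assert streamed_total < per_row_total // 10
```
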
>> For TOAST test, the results look like this:
>>
>>   distribution     repeats        toast       off    pglz     lz4
>>   ===============================================================
>>   compressible       10000        lz4          14       2       1
>>                                   pglz         40       4       3
>>                       1000        lz4          32      16       9
>>                                   pglz         54      17      10
>>   ---------------------------------------------------------------
>>         random       10000        lz4        3305    3305    3157
>>                                   pglz       3305    3305    3157
>>                       1000        lz4        3166    3162    1580
>>                                   pglz       3334    3326    1745
>>   ---------------------------------------------------------------
>>        random2       10000        lz4        3305    3305    3157
>>                                   pglz       3305    3305    3158
>>                       1000        lz4        3160    3156    3010
>>                                   pglz       3334    3326    3172
>>
>> The "repeats" value means how long the string is - it's the number of
>> "md5" hashes added to the string. The number of rows is calculated to
>> keep the total amount of data the same. The "toast" column tracks what
>> compression was used for TOAST, I was wondering if it matters.
>>
>> This time there are three data distributions - compressible means that
>> each TOAST value is nicely compressible, "random" means each value is
>> random (not compressible), but the rows are just copy of the same value
>> (so on the whole there's a lot of redundancy). And "random2" means each
>> row is random and unique (so not compressible at all).
>>
>> The table shows that with compressible TOAST values, compressing the
>> spill file is rather useless. The reason is that ReorderBufferCompress
>> is handling raw TOAST data, which is already compressed. Yes, it may
>> further reduce the amount of data, but it's negligible when compared to
>> the original amount of data.
>>
>> For the random cases, the spill compression is rather pointless. Yes,
>> lz4 can reduce it to 1/2 for the shorter strings, but other than that
>> it's not very useful.
> 
> It's still interesting to confirm that data already compressed or
> random data cannot be significantly compressed.
> 
>> For a while I was thinking this approach is flawed, because it only sees
>> and compresses changes one by one, and that seeing a batch of changes
>> would improve this (e.g. we'd see the copied rows). But I realized lz4
>> already does that (in the streaming mode at least), and yet it does not
>> help very much. Presumably that depends on how large the context is. If
>> the random string is long enough, it won't help.
>>
>> So maybe this approach is fine, and doing the compression at a lower
>> layer (for the whole file) would not really improve this. Even then
>> we'd only see a limited amount of data.
>>
>> Maybe the right answer to this is that compression does not help cases
>> where most of the replicated data is TOAST, and that it can help cases
>> with wide (and redundant) rows, or repeated rows. And that lz4 is a
>> clearly superior choice. (This also raises the question if we want to
>> support REORDER_BUFFER_STRAT_LZ4_REGULAR. I haven't looked into this,
>> but doesn't that behave more like pglz, i.e. no context?)
> 
> I'm working on a new version of this patch set that will include the
> changes you suggested in your review. About using LZ4 regular API, the
> goal was to use it when we cannot use the streaming API due to raw
> data larger than LZ4 ring buffer. But this is something I'm going to
> delete in the new version because I'm planning to use a similar
> approach as we do in astreamer_lz4.c: using frames, not blocks. LZ4
> frame API looks very similar to ZSTD's streaming API.
> 
>> FWIW when doing these tests, it made me realize how useful it would be
>> to track both the "raw" and "spilled" amounts. That is before/after
>> compression. It'd make calculating compression ratio much easier.
> 
> Yes, that's why I tried to "fix" the spill_bytes counter.
> 

But I think the 'fixed' counter only tracks the data after the new
compression, right? I'm suggesting to have two counters - one for the
"raw" data (before compression) and one for the "compressed" data
(after compression).
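
Roughly this (a Python sketch with made-up counter names, just to show
the arithmetic):

```python
class SpillStats:
    """Track spill volume both before and after compression, so the
    compression ratio falls out directly (counter names are invented)."""
    def __init__(self) -> None:
        self.spill_raw_bytes = 0   # raw data, before compression
        self.spill_bytes = 0       # compressed data, as written to disk

    def record(self, raw_len: int, stored_len: int) -> None:
        self.spill_raw_bytes += raw_len
        self.spill_bytes += stored_len

    def ratio(self) -> float:
        return self.spill_raw_bytes / self.spill_bytes

stats = SpillStats()
stats.record(1024, 256)
stats.record(1024, 512)
assert stats.ratio() == 2048 / 768
```
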


regards

-- 
Tomas Vondra


