On 2/6/23 03:00, Andrey Borodin wrote:
> On Sun, Feb 5, 2023 at 5:51 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 2/5/23 19:36, Andrey Borodin wrote:
>>> On Fri, Jan 6, 2023 at 10:02 PM Andrey Borodin <amborodin86@gmail.com> wrote:
>>>>
>>>> Hello! Please find attached v8.
>>>
>>> I got some interesting feedback from some patch users.
>>> There was an oversight that frequently yielded results that are 1,2 or
>>> 3 bytes longer than expected.
>>> Looking closer I found that the correctness of the last 3-byte tail is
>>> checked in two places. PFA fix for this. Previously compressed data
>>> was correct, however in some cases few bytes longer than the result of
>>> current pglz implementation.
>>>
>>
>> Thanks. What were the consequences of the issue? Lower compression
>> ratio, or did we then fail to decompress the data (or would current pglz
>> implementation fail to decompress it)?
>>
> The data was decompressed fine. But extension tests (Citus's columnar
> engine) hard-coded a lot of compression ratio stuff.
OK. Not sure I'd blame the patch for these failures, as long as long as
the result is still correct and can be decompressed. I'm not aware of a
specification of what the compression must (not) produce.
> And there is still 1 more test where optimized version produces 1 byte
> longer output. I'm trying to find it, but with no success yet.
>
> There are known and documented cases when optimized pglz version would
> do so. good_match without 10-division and memcmp by 4 bytes. But even
> disabling this, still observing 1-byte longer compression results
> persists... The problem is the length is changed after deleting some
> data, so compression of that particular sequence seems to be somewhere
> far away.
> It was funny at the beginning - to hunt for 1 byte. But the weekend is
> ending, and it seems that byte slipped from me again...
>
I wonder what that means for the patch. I haven't investigated this at
all, but it seems as if the optimization means we fail to find a match,
producing a tad larger output. That may be still be a good tradeoff, as
long as the output is correct (assuming it doesn't break some promise
regarding expected output).
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company