Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: [HACKERS] Custom compression methods
Date
Msg-id E112AF88-6148-4540-B094-7D157B67B122@enterprisedb.com
Whole thread Raw
In response to Re: Re: [HACKERS] Custom compression methods  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: [HACKERS] Custom compression methods  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers

> On Aug 13, 2020, at 4:48 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> v1-0001: As suggested by Robert, it provides the syntax support for
> setting the compression method for a column while creating a table and
> adding columns.  However, we don't support changing the compression
> method for the existing column.  As part of this patch, there is only
> one built-in compression method that can be set (pglz).  In this, we
> have one in-build am (pglz) and the compressed attributes will directly
> store the oid of the AM.  In this patch, I have removed the
> pg_attr_compresion as we don't support changing the compression
> for the existing column so we don't need to preserve the old
> compressions.

I do not like the way pglz compression is handled in this patch.  After upgrading PostgreSQL to the first version with
thispatch included, pre-existing on-disk compressed data will not include any custom compression Oid in the header, and
toast_decompress_datumwill notice that and decompress the data directly using pglz_decompress.   If the same data were
thenwritten back out, perhaps to another table, into a column with no custom compression method defined, it will get
compressedby toast_compress_datum using DefaultCompressionOid, which is defined as PGLZ_COMPRESSION_AM_OID.  That isn't
aproper round-trip for the data, as when it gets re-compressed, the PGLZ_COMPRESSION_AM_OID gets written into the
header,which makes the data a bit longer, but also means that it is not byte-for-byte the same as it was, which is
counter-intuitive. Given that any given pglz compressed datum now has two totally different formats that might occur on
disk,code may have to consider both of them, which increases code complexity, and regression tests will need to be
writtenwith coverage for both of them, which increases test complexity.  It's also not easy to write the extra tests,
asthere isn't any way (that I see) to intentionally write out the traditional shorter form from a newer database
server;you'd have to do something like a pg_upgrade test where you install an older server to write the older format,
upgrade,and then check that the new server can handle it. 

The cleanest solution to this would seem to be removal of the compression am's Oid from the header for all compression
ams,so that pre-patch written data and post-patch written data look exactly the same.  The other solution is to give
pglzpride-of-place as the original compression mechanism, and just say that when pglz is the compression method, no Oid
getswritten to the header, and only when other compression methods are used does the Oid get written.  This second
optionseems closer to the implementation that you already have, because you already handle the decompression of data
wherethe Oid is lacking, so all you have to do is intentionally not write the Oid when compressing using pglz. 

Or did I misunderstand your implementation?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Problems with the FSM, heap fillfactor, and temporal locality
Next
From: Alvaro Herrera
Date:
Subject: Re: logtape.c stats don't account for unused "prefetched" block numbers