Thread: Fixes for compression options of pg_receivewal and refactoring of backup_compression.{c,h}

Hi all,
(Added Robert and Georgios in CC:)

Since babbbb5 and the introduction of LZ4, I have reworked the way
compression is controlled for pg_receivewal, with two options:
- --compress-method, settable to "gzip", "none" or "lz4".
- --compress, to pass down a compression level, where the allowed
range is 1-9.  If passing down 0, we'd get an error rather than
implying no compression, contrary to what we did in ~14.

I initially thought that this was fine as-is, but then Robert and
others have worked on client/server compression for pg_basebackup,
introducing a much better design with centralized APIs where one can
use METHOD:DETAIL for as compression value, where DETAIL is a
comma-separated list of keyword=value (keyword = "level" or
"workers"), with centralized checks and an extensible design.

This is something I think we had better fix before beta1, because now
we have binaries that use an inconsistent set of options.  So,
attached is a patch set aimed at rework this option set from the
ground, taking advantage of the recent work done by Robert and others
for pg_basebackup:
- 0001 is a simple rename of backup_compression.{c,h} to
compression.{c,h}, removing anything related to base backups from
that.  One extra reason behind this renaming is that I would like to
use this infra for pg_dump, but that's material for 16~.
- 0002 removes WalCompressionMethod, replacing it by
pg_compress_algorithm as these are the same enums.  Robert complained
about the confusion that WalCompressionMethod could lead to as this
could be used for the compression of data, and not only WAL.  I have
renamed some variables to be more consistent, while on it.
- 0003 is the actual option rework for pg_receivewal.  This takes
advantage of 0001, leading to the result of removing --compress-method
and replacing it with --compress, taking care of the backward
compatibility problems for an integer value, aka 0 implies no
compression and val > 0 implies gzip.  One bonus reason to switch to
that is that this would make the addition of zstd for pg_receivewal
easier in the future.

I am going to add an open item for this stuff.  Comments or thoughts?

Thanks,
--
Michael

Attachment
On Monday, April 11th, 2022 at 8:52 AM, Michael Paquier <michael@paquier.xyz>
wrote:

> This is something I think we had better fix before beta1, because now
> we have binaries that use an inconsistent set of options. So,
> attached is a patch set aimed at rework this option set from the
> ground, taking advantage of the recent work done by Robert and others
> for pg_basebackup:

Agreed. It is rather inconsistent now.

> - 0001 is a simple rename of backup_compression.{c,h} to
> compression.{c,h}, removing anything related to base backups from
> that. One extra reason behind this renaming is that I would like to
> use this infra for pg_dump, but that's material for 16~.

I agree with the design. If you permit me a couple of nitpicks regarding naming.

+typedef enum pg_compress_algorithm
+{
+   PG_COMPRESSION_NONE,
+   PG_COMPRESSION_GZIP,
+   PG_COMPRESSION_LZ4,
+   PG_COMPRESSION_ZSTD
+} pg_compress_algorithm;

Elsewhere in the codebase, (e.g. create_table.sgml, alter_table.sgml,
brin_tuple.c, detoast.c, toast_compression.c, tupdesc.c, gist.c to mention a
few) variations of of the nomenclature "compression method" are used, like
'VARDATA_COMPRESSED_GET_COMPRESS_METHOD' or 'InvalidCompressionMethod' etc. I
feel that it would be nicer if we followed one naming rule for this and I
recommend to substitute algorithm for method throughout.

On a similar note, it would help readability to be able to distinguish at a
glance the type from the variable. Maybe uppercase or camelcase the type?

Last, even though it is not needed now, it will be helpful to have a
PG_COMPRESSION_INVALID in some scenarios. Though we can add it when we come to
it.

> - 0002 removes WalCompressionMethod, replacing it by
> pg_compress_algorithm as these are the same enums. Robert complained
> about the confusion that WalCompressionMethod could lead to as this
> could be used for the compression of data, and not only WAL. I have
> renamed some variables to be more consistent, while on it.

It looks good. If you choose to discard the comment regarding the use of
'method' over 'algorithm' from above, can you please use the full word in the
variable, e.g. 'wal_compress_algorithm' instead of 'wal_compress_algo'. I can
not really explain it, the later reads a bit rude. Then again that may be just
me.

> - 0003 is the actual option rework for pg_receivewal. This takes
> advantage of 0001, leading to the result of removing --compress-method
> and replacing it with --compress, taking care of the backward
> compatibility problems for an integer value, aka 0 implies no
> compression and val > 0 implies gzip. One bonus reason to switch to
> that is that this would make the addition of zstd for pg_receivewal
> easier in the future.

Looks good.

> I am going to add an open item for this stuff. Comments or thoughts?

I agree that it is better to not release pg_receivewal with the distinct set of
options.

Cheers,
//Georgios

> Thanks,
> --
> Michael



On Mon, Apr 11, 2022 at 2:52 AM Michael Paquier <michael@paquier.xyz> wrote:
> Since babbbb5 and the introduction of LZ4, I have reworked the way
> compression is controlled for pg_receivewal, with two options:
> - --compress-method, settable to "gzip", "none" or "lz4".
> - --compress, to pass down a compression level, where the allowed
> range is 1-9.  If passing down 0, we'd get an error rather than
> implying no compression, contrary to what we did in ~14.
>
> I initially thought that this was fine as-is, but then Robert and
> others have worked on client/server compression for pg_basebackup,
> introducing a much better design with centralized APIs where one can
> use METHOD:DETAIL for as compression value, where DETAIL is a
> comma-separated list of keyword=value (keyword = "level" or
> "workers"), with centralized checks and an extensible design.
>
> This is something I think we had better fix before beta1, because now
> we have binaries that use an inconsistent set of options.  So,
> attached is a patch set aimed at rework this option set from the
> ground, taking advantage of the recent work done by Robert and others
> for pg_basebackup:

+1 for this in general, but I think that naming like
"compression_algo" stinks. If you think "compression_algorithm" is too
long, I think you should use "algorithm" or "compression" or
"compression_method" or something.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Mon, Apr 11, 2022 at 11:15:46AM -0400, Robert Haas wrote:
> +1 for this in general, but I think that naming like
> "compression_algo" stinks. If you think "compression_algorithm" is too
> long, I think you should use "algorithm" or "compression" or
> "compression_method" or something.

Yes, I found "compression_algorithm" to be too long initially.  For
walmethods.c and pg_receivewal.c, it may be better to just stick to
"algorithm"  then, at least that's consistent with pg_basebackup.c.
--
Michael

Attachment
On Mon, Apr 11, 2022 at 12:46:02PM +0000, gkokolatos@pm.me wrote:
> On Monday, April 11th, 2022 at 8:52 AM, Michael Paquier <michael@paquier.xyz>
> wrote:
>> - 0001 is a simple rename of backup_compression.{c,h} to
>> compression.{c,h}, removing anything related to base backups from
>> that. One extra reason behind this renaming is that I would like to
>> use this infra for pg_dump, but that's material for 16~.
>
> I agree with the design. If you permit me a couple of nitpicks regarding naming.
>
> +typedef enum pg_compress_algorithm
> +{
> +   PG_COMPRESSION_NONE,
> +   PG_COMPRESSION_GZIP,
> +   PG_COMPRESSION_LZ4,
> +   PG_COMPRESSION_ZSTD
> +} pg_compress_algorithm;
>
> Elsewhere in the codebase, (e.g. create_table.sgml, alter_table.sgml,
> brin_tuple.c, detoast.c, toast_compression.c, tupdesc.c, gist.c to mention a
> few) variations of of the nomenclature "compression method" are used, like
> 'VARDATA_COMPRESSED_GET_COMPRESS_METHOD' or 'InvalidCompressionMethod' etc. I
> feel that it would be nicer if we followed one naming rule for this and I
> recommend to substitute algorithm for method throughout.

Technically and as far as I know, both are correct and hold more or
less the same meaning.  pg_basebackup's code exposes algorithm in a
more extended way, so I have just stuck to it for the internal
variables and such.  Perhaps we could rename the whole, but I see no
strong reason to do that.

> Last, even though it is not needed now, it will be helpful to have a
> PG_COMPRESSION_INVALID in some scenarios. Though we can add it when we come to
> it.

Perhaps.  There is no need for it yet, though.  pg_dump would not need
that, as well.

>> - 0002 removes WalCompressionMethod, replacing it by
>> pg_compress_algorithm as these are the same enums. Robert complained
>> about the confusion that WalCompressionMethod could lead to as this
>> could be used for the compression of data, and not only WAL. I have
>> renamed some variables to be more consistent, while on it.
>
> It looks good. If you choose to discard the comment regarding the use of
> 'method' over 'algorithm' from above, can you please use the full word in the
> variable, e.g. 'wal_compress_algorithm' instead of 'wal_compress_algo'. I can
> not really explain it, the later reads a bit rude. Then again that may be just
> me.

Thanks.  I have been able to do an extra pass on 0001 and 0002, fixing
those naming inconsistencies with "algo" vs "algorithm" that you and
Robert have reported, and applied them.  For 0003, I'll look at it
later.  Attached is a rebase with improvements about the variable
names.
--
Michael

Attachment
On Tue, Apr 12, 2022 at 06:22:48PM +0900, Michael Paquier wrote:
> On Mon, Apr 11, 2022 at 12:46:02PM +0000, gkokolatos@pm.me wrote:
>> It looks good. If you choose to discard the comment regarding the use of
>> 'method' over 'algorithm' from above, can you please use the full word in the
>> variable, e.g. 'wal_compress_algorithm' instead of 'wal_compress_algo'. I can
>> not really explain it, the later reads a bit rude. Then again that may be just
>> me.
>
> Thanks.  I have been able to do an extra pass on 0001 and 0002, fixing
> those naming inconsistencies with "algo" vs "algorithm" that you and
> Robert have reported, and applied them.  For 0003, I'll look at it
> later.  Attached is a rebase with improvements about the variable
> names.

This has been done with the proper renames.  With that in place, I see
no reason now to not be able to set the compression level as it is
possible to pass it down with the options available.  This requires
only a couple of lines, as of the attached.  LZ4 has a dummy structure
called LZ4F_INIT_PREFERENCES to initialize LZ4F_preferences_t, that
holds the compression level before passing it down to
LZ4F_compressBegin(), but that's available only in v1.8.3.  Using it
risks lowering down the minimal version of LZ4 we are able to use now,
but replacing that with a set of memset()s is also a way to set up
things as per its documentation.

Thoughts?
--
Michael

Attachment

------- Original Message -------
On Wednesday, April 13th, 2022 at 7:25 AM, Michael Paquier <michael@paquier.xyz> wrote:


>
>
> On Tue, Apr 12, 2022 at 06:22:48PM +0900, Michael Paquier wrote:
>
> > On Mon, Apr 11, 2022 at 12:46:02PM +0000, gkokolatos@pm.me wrote:
> >
> > > It looks good. If you choose to discard the comment regarding the use of
> > > 'method' over 'algorithm' from above, can you please use the full word in the
> > > variable, e.g. 'wal_compress_algorithm' instead of 'wal_compress_algo'. I can
> > > not really explain it, the later reads a bit rude. Then again that may be just
> > > me.
> >
> > Thanks. I have been able to do an extra pass on 0001 and 0002, fixing
> > those naming inconsistencies with "algo" vs "algorithm" that you and
> > Robert have reported, and applied them. For 0003, I'll look at it
> > later. Attached is a rebase with improvements about the variable
> > names.
>
> This has been done with the proper renames. With that in place, I see
> no reason now to not be able to set the compression level as it is
> possible to pass it down with the options available. This requires
> only a couple of lines, as of the attached. LZ4 has a dummy structure
> called LZ4F_INIT_PREFERENCES to initialize LZ4F_preferences_t, that
> holds the compression level before passing it down to
> LZ4F_compressBegin(), but that's available only in v1.8.3. Using it
> risks lowering down the minimal version of LZ4 we are able to use now,
> but replacing that with a set of memset()s is also a way to set up
> things as per its documentation.
>
> Thoughts?

It's really not hard to add compression level. However we had briefly
discussed it in the original thread [1] and decided against. That is why
I did not write that code. If the community thinks differently now, let
me know if you would like me to offer a patch for it.

Cheers,
//Georgios


[1]
https://www.postgresql.org/message-id/flat/CABUevEwuq7XXyd4fA0W3jY9MsJu9B2WRbHumAA%2B3WzHrGAQjsg%40mail.gmail.com#b6456fa2adc1cdb049a57bf3587666b9



On Wed, Apr 13, 2022 at 02:58:28PM +0000, gkokolatos@pm.me wrote:
> It's really not hard to add compression level. However we had briefly
> discussed it in the original thread [1] and decided against. That is why
> I did not write that code. If the community thinks differently now, let
> me know if you would like me to offer a patch for it.

The issue back then was how to design the option set to handle all
that (right?  My memories may be short on that), and pg_basebackup
takes care of that with its option design.

This is roughly what has been done here, except that this was for the
contentSize:

https://www.postgresql.org/message-id/rYyZ3Fj9qayyY9-egNsV_kkLbL_BSWcOEdi3Mb6M9eQRTkcA2jrqFEHglLUEYnzWR_wttCqn7VI94MZ2p7mwNje51lHTvWYnJ1jHdOceen4=@pm.me

Do you think that the extra test coverage is going to be too much of a
burden?  I was thinking about just adding a level to the main lz4
command, with an extra negative test in the SKIP block with a level
out of range
--
Michael

Attachment
On Thu, Apr 14, 2022 at 06:18:29AM +0900, Michael Paquier wrote:
> The issue back then was how to design the option set to handle all
> that (right?  My memories may be short on that), and pg_basebackup
> takes care of that with its option design.

I have looked at that again this morning, and the change is
straight-forward so I saw no reason to not do it.  And, applied.
--
Michael

Attachment