Thread: Add LZ4 compression in pg_dump

Add LZ4 compression in pg_dump

From
Georgios
Date:
Hi,

Please find attached a patchset which adds LZ4 compression in pg_dump.

The first commit does the heavy lifting required for additional compression methods.
It expands testing coverage for the already supported gzip compression. Commit
bf9aa490db introduced cfp in compress_io.{c,h} with the intent of unifying
compression-related code and allowing for the introduction of additional archive
formats. However, pg_backup_archiver.c was not using that API. This commit
teaches pg_backup_archiver.c about cfp and uses it throughout.

Furthermore, compression was previously chosen based on the value of the level
passed as an argument when invoking pg_dump, or on some hardcoded defaults. This
does not scale to more than one compression method. Now the method used for
compression can be explicitly requested during command invocation, or fall back
to hardcoded defaults. It is then stored in the relevant structs and passed to the
relevant functions, alongside the compression level, which has lost its special
meaning. The compression method is not yet stored in the actual archive;
that is done in the next commit, which introduces a new method.

The enum previously named CompressionAlgorithm is renamed to CompressionMethod
so that it better matches similar variables found throughout the code base.

In a fashion similar to pg_basebackup, the compression method is passed using
the already existing -Z/--compress parameter of pg_dump. The legacy format and
behaviour are maintained. Additionally, the user can explicitly pass a requested
method and, optionally, the level to be used after a colon,
e.g. --compress=gzip:6
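
For illustration only, here is a minimal sketch of how such an argument can be
split into a method and an optional level (function and variable names below
are made up for the example and are not the ones used in the patch):

#include <stdlib.h>
#include <string.h>

/* Sketch only: split e.g. "gzip:6" into a method name and a level. */
static void
parse_compress_spec(const char *arg, char **method, int *level)
{
    const char *colon = strchr(arg, ':');

    *level = -1;                /* -1 means "use the method's default" */
    if (colon == NULL)
        *method = strdup(arg);  /* e.g. "gzip", "lz4" or "none" */
    else
    {
        *method = strndup(arg, colon - arg);
        *level = atoi(colon + 1);       /* e.g. 6 in "gzip:6" */
    }
}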

The second commit adds LZ4 compression in pg_dump and pg_restore.

Within compress_io.{c,h} two distinct APIs are exposed: a streaming API and a
file API. The first is aimed at inlined use cases, so simple lz4.h calls can be
used directly. The second generates output, or parses input, that can be read
or generated via the lz4 utility.

In the latter case, the API uses an opaque wrapper around a file stream, which
is acquired via fopen() or gzopen() respectively. It then provides wrappers
around fread(), fwrite(), fgets(), fgetc(), feof(), and fclose(), or their gz
equivalents. However, the LZ4F API does not provide this functionality, so it
has been implemented locally.

In order to maintain API compatibility, a new structure, LZ4File, is
introduced. It is responsible for keeping state and any as yet unused generated
content. The latter is required when the generated decompressed output exceeds
the caller's buffer capacity.
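
To give a rough idea of the shape, a sketch of such a wrapper follows (the
field names are illustrative and not necessarily the ones in the patch):

#include <stdbool.h>
#include <stdio.h>
#include <lz4frame.h>

/* Sketch only: opaque wrapper keeping LZ4F state and leftover output. */
typedef struct LZ4File
{
    FILE       *fp;             /* underlying stream from fopen() */

    LZ4F_compressionContext_t ctx;      /* used when writing */
    LZ4F_decompressionContext_t dtx;    /* used when reading */
    bool        inited;         /* contexts created, frame header handled */

    /* scratch buffer for the raw (compressed) data read from/written to fp */
    char       *buffer;
    size_t      buflen;

    /*
     * Decompressed bytes already produced by LZ4F_decompress() but not yet
     * handed out, for when the output exceeds the caller's buffer capacity.
     */
    char       *overflow;
    size_t      overflowlen;
} LZ4File;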

Custom compressed archives now need to store the compression method in their
header, which requires a bump in the archive version number. The compression
level is still stored in the dump, though admittedly it is of no apparent use.


The series is authored by me. Rachel Heaton helped out with expanding the
test coverage, testing on different platforms and providing debug information
from those, as well as with native-speaker wording.

Cheers,
//Georgios
Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Fri, Feb 25, 2022 at 12:05:31PM +0000, Georgios wrote:
> The first commit does the heavy lifting required for additional compression methods.
> It expands testing coverage for the already supported gzip compression. Commit
> bf9aa490db introduced cfp in compress_io.{c,h} with the intent of unifying
> compression-related code and allowing for the introduction of additional archive
> formats. However, pg_backup_archiver.c was not using that API. This commit
> teaches pg_backup_archiver.c about cfp and uses it throughout.

Thanks for the patch.  I have a few high-level comments.

+   # Do not use --no-sync to give test coverage for data sync.
+   compression_gzip_directory_format => {
+       test_key => 'compression',

The tests for GZIP had better be split into their own commit, as
that's a coverage improvement for the existing code.

I was assuming that this was going to be much larger :)

+/* Routines that support LZ4 compressed data I/O */
+#ifdef HAVE_LIBLZ4
+static void InitCompressorLZ4(CompressorState *cs);
+static void ReadDataFromArchiveLZ4(ArchiveHandle *AH, ReadFunc
readF);
+static void WriteDataToArchiveLZ4(ArchiveHandle *AH, CompressorState *cs,
+                                 const char *data, size_t dLen);
+static void EndCompressorLZ4(ArchiveHandle *AH, CompressorState *cs);
+#endif

Hmm.  This is the same set of APIs as ZLIB and NONE to init, read,
write and end, but for the LZ4 compressor (NONE has no init/end).
Wouldn't it be better to refactor the existing pg_dump code to have a
central structure holding all the function definitions in a common
structure so that all those function signatures are set in stone in the
shape of a catalog of callbacks, making the addition of more
compression formats easier?  I would imagine that we'd split the code
of each compression method into their own file with their own context
data.  This would lead to a removal of compress_io.c, with its entry
points ReadDataFromArchive(), WriteDataToArchive() & co replaced by
pointers to each per-compression callback.
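
For instance, the catalog could look roughly like the sketch below; it assumes
the existing CompressorState, ArchiveHandle and ReadFunc types from
compress_io.h/pg_backup_archiver.h, and the names are illustrative rather than
a concrete proposal:

/* Sketch: one catalog of callbacks per compression method. */
typedef struct CompressorCallbacks
{
    void        (*init) (CompressorState *cs);
    void        (*read_data) (ArchiveHandle *AH, ReadFunc readF);
    void        (*write_data) (ArchiveHandle *AH, CompressorState *cs,
                               const char *data, size_t dLen);
    void        (*end) (ArchiveHandle *AH, CompressorState *cs);
} CompressorCallbacks;

/*
 * Each method (none, gzip, lz4, later zstd) would live in its own file and
 * expose one such catalog; AllocateCompressor() would pick the right one and
 * the generic code would only ever call through these pointers.
 */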

> Furthermore, compression was previously chosen based on the value of the level
> passed as an argument when invoking pg_dump, or on some hardcoded defaults. This
> does not scale to more than one compression method. Now the method used for
> compression can be explicitly requested during command invocation, or fall back
> to hardcoded defaults. It is then stored in the relevant structs and passed to the
> relevant functions, alongside the compression level, which has lost its special
> meaning. The compression method is not yet stored in the actual archive;
> that is done in the next commit, which introduces a new method.

That's one thing Robert was arguing about with pg_basebackup, so that
would be consistent, and the option set is backward-compatible as far
as I get it by reading the code.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
The patch is failing on cfbot/freebsd.
http://cfbot.cputube.org/georgios-kokolatos.html

Also, I wondered if you'd looked at the "high compression" interfaces in
lz4hc.h ?  Should pg_dump use that ?

On Fri, Feb 25, 2022 at 08:03:40AM -0600, Justin Pryzby wrote:
> Thanks for working on this.  Your 0001 looks similar to what I did for zstd 1-2
> years ago.
> https://commitfest.postgresql.org/32/2888/
> 
> I rebased and attached the latest patches I had in case they're useful to you.
> I'd like to see zstd included in pg_dump eventually, but it was too much work
> to shepherd the patches.  Now that seems reasonable for pg16.
> 
> With the other compression patches I've worked on, we've used an extra patch
> that changes the default to the new compression algorithm, to force cfbot to
> exercise the new code.
> 
> Do you know the process with commitfests and cfbot ?
> There's also this, which allows running the tests on cirrus before mailing the
> patch to the hackers list.
> ./src/tools/ci/README



Re: Add LZ4 compression in pg_dump

From
Greg Stark
Date:
It seems development on this has stalled. If there's no further work
happening I guess I'll mark the patch returned with feedback. Feel
free to resubmit it to the next CF when there's progress.



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Fri, Mar 25, 2022 at 01:20:47AM -0400, Greg Stark wrote:
> It seems development on this has stalled. If there's no further work
> happening I guess I'll mark the patch returned with feedback. Feel
> free to resubmit it to the next CF when there's progress.

Since it's a reasonably large patch (and one that I had myself started before)
and it's only been 20some days since (minor) review comments, and since the
focus right now is on committing features, and not reviewing new patches, and
this patch is only one month old, and its 0002 is not intended for pg15, therefore
I'm moving it to the next CF, where I hope to work with its authors to progress
it.

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Rachel Heaton
Date:
On Fri, Mar 25, 2022 at 6:22 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Fri, Mar 25, 2022 at 01:20:47AM -0400, Greg Stark wrote:
> > It seems development on this has stalled. If there's no further work
> > happening I guess I'll mark the patch returned with feedback. Feel
> > free to resubmit it to the next CF when there's progress.
>
> Since it's a reasonably large patch (and one that I had myself started before)
> and it's only been 20some days since (minor) review comments, and since the
> focus right now is on committing features, and not reviewing new patches, and
> this patch is only one month old, and its 0002 is not intended for pg15, therefore
> I'm moving it to the next CF, where I hope to work with its authors to progress
> it.
>
Hi Folks,

Here is an updated patchset from Georgios, with minor assistance from myself.
The comments above should be addressed, but please let us know if
there are other things to go over. A functional change in this
patchset is that when `--compress=none` is passed to pg_dump, it will not
compress the directory format output (previously, it would use gzip if
present). The previous default behavior is retained.

- Rachel

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:

------- Original Message -------

On Saturday, March 26th, 2022 at 12:13 AM, Rachel Heaton <rachelmheaton@gmail.com> wrote:

> On Fri, Mar 25, 2022 at 6:22 AM Justin Pryzby pryzby@telsasoft.com wrote:
>
> > On Fri, Mar 25, 2022 at 01:20:47AM -0400, Greg Stark wrote:
> >
> > > It seems development on this has stalled. If there's no further work
> > > happening I guess I'll mark the patch returned with feedback. Feel
> > > free to resubmit it to the next CF when there's progress.

We had made some progress, but we didn't want to distract the list with too many
emails. Of course it seemed stalled to an outside observer; I simply
wanted to set the record straight and say that we are actively working on it.

> >
> > Since it's a reasonably large patch (and one that I had myself started before)
> > and it's only been 20some days since (minor) review comments, and since the
> > focus right now is on committing features, and not reviewing new patches, and
> > this patch is only one month old, and its 0002 is not intended for pg15, therefore
> > I'm moving it to the next CF, where I hope to work with its authors to progress
> > it.

Thank you. It is much appreciated. We will send updates when the next commitfest
starts in July, so as not to distract from the 15 work. Then we can take it from there.

>
> Hi Folks,
>
> Here is an updated patchset from Georgios, with minor assistance from myself.
> The comments above should be addressed, but please let us know if

A small amendment to the above statement. This patchset does not include the
refactoring of compress_io suggested by Mr Paquier in the same thread, as it is
missing documentation. An updated version will be sent to include those changes
on the next commitfest.

> there are other things to go over. A functional change in this
> patchset is that when `--compress=none` is passed to pg_dump, it will not
> compress the directory format output (previously, it would use gzip if
> present). The previous default behavior is retained.
>
> - Rachel



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Fri, Mar 25, 2022 at 11:43:17PM +0000, gkokolatos@pm.me wrote:
> On Saturday, March 26th, 2022 at 12:13 AM, Rachel Heaton <rachelmheaton@gmail.com> wrote:
>> Here is an updated patchset from Georgios, with minor assistance from myself.
>> The comments above should be addressed, but please let us know if
>
> A small amendment to the above statement. This patchset does not include the
> refactoring of compress_io suggested by Mr Paquier in the same thread, as it is
> missing documentation. An updated version will be sent to include those changes
> on the next commitfest.

The refactoring using callbacks would make the code much cleaner IMO
in the long term, with zstd waiting in the queue.  Now, I see some
pieces of the patch set that could be merged now without waiting for
the development cycle of 16 to begin, as of 0001 to add more tests and
0002.

I have a question about 0002, actually.  What has led you to the
conclusion that this code is dead and could be removed?
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Sat, Mar 26, 2022 at 02:57:50PM +0900, Michael Paquier wrote:
> I have a question about 0002, actually.  What has led you to the
> conclusion that this code is dead and could be removed?

See 0001 and the manpage.

+               'pg_dump: compression is not supported by tar archive format');

When I submitted a patch to support zstd, I spent a while trying to make
compression work with tar, but it's a significant effort and better done
separately.



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
LZ4F_HEADER_SIZE_MAX isn't defined in old LZ4.

I ran into that on an Ubuntu LTS, so I don't think it's so old that it
shouldn't be handled more gracefully.  The LZ4 code should either have an explicit
version check, or else shouldn't depend on that feature (or should define a
safe fallback value if the library header doesn't define it).

https://packages.ubuntu.com/liblz4-1
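
One graceful option might be a guarded fallback along these lines (the value
32 is an assumption picked as a comfortable upper bound, not something taken
from the lz4 headers):

#include <lz4frame.h>

/*
 * Sketch: older lz4 releases do not define LZ4F_HEADER_SIZE_MAX, so fall
 * back to a conservative upper bound for the frame header size.
 */
#ifndef LZ4F_HEADER_SIZE_MAX
#define LZ4F_HEADER_SIZE_MAX    32
#endif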

0003: typo: of legacy => or legacy

There are a large number of ifdefs being added here - it'd be nice to minimize
that.  basebackup was organized to use separate files, which is one way.

$ git grep -c 'ifdef .*LZ4' src/bin/pg_dump/compress_io.c
src/bin/pg_dump/compress_io.c:19

In last year's CF entry, I had made a union within CompressorState.  LZ4
doesn't need z_streamp (and zstd will need ZSTD_outBuffer, ZSTD_inBuffer,
ZSTD_CStream).
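
Roughly like the sketch below; it assumes the CompressionMethod enum from the
patch, the member names are illustrative, and in real code each member would
sit behind the corresponding #ifdef:

#include <zlib.h>
#include <lz4frame.h>

/* Sketch: method-specific state kept in a union inside the compressor. */
typedef struct CompressorStateSketch
{
    CompressionMethod comprMethod;
    int         comprLevel;

    union
    {
        z_streamp   zp;                 /* zlib stream */
        LZ4F_compressionContext_t lz4ctx;       /* LZ4 frame context */
        /* zstd would add ZSTD_CStream *, ZSTD_outBuffer, ZSTD_inBuffer */
    }           u;
} CompressorStateSketch;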

0002: I wonder if you're able to re-use any of the basebackup parsing stuff
from commit ffd53659c.  You're passing both the compression method *and* level.
I think there should be a structure which includes both.  In the future, that
can also handle additional options.  I hope to re-use these same things for
wal_compression=method:level.
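
That is, something shaped roughly like this (a sketch with illustrative names):

/* Sketch: a single specification carrying everything compression-related. */
typedef struct CompressionSpec
{
    CompressionMethod method;   /* none, gzip, lz4, ... */
    int         level;          /* method-specific level, or -1 for default */
    /* room for future per-method options (e.g. "long", workers, ...) */
} CompressionSpec;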

You renamed this:

|-       COMPR_ALG_LIBZ
|-} CompressionAlgorithm;
|+       COMPRESSION_GZIP,
|+} CompressionMethod;

..But I don't think that's an improvement.  If you were to change it, it should
say something like PGDUMP_COMPRESS_ZLIB, since there are other compression
structs and typedefs.  zlib is not identical to gzip, which uses a different
header, so in WriteDataToArchive(), LIBZ is correct, and GZIP is incorrect.
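
For reference, with zlib the container format is chosen at stream
initialization time through windowBits; a small illustrative sketch of the
distinction:

#include <stdbool.h>
#include <zlib.h>

/*
 * Sketch: the same library produces either a zlib-wrapped or a gzip-wrapped
 * deflate stream depending on windowBits.  pg_dump's archive data uses the
 * zlib wrapper, while gzopen() and .gz files use the gzip wrapper.
 */
static int
init_deflate(z_stream *zs, int level, bool gzip_header)
{
    int         wbits = gzip_header ? MAX_WBITS + 16 : MAX_WBITS;

    return deflateInit2(zs, level, Z_DEFLATED, wbits, 8, Z_DEFAULT_STRATEGY);
}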

The cf* changes in pg_backup_archiver could be split out into a separate
commit.  It's strictly a code simplification - not just preparation for more
compression algorithms.  The commit message should say "See also:
bf9aa490db24b2334b3595ee33653bf2fe39208c".

The changes in 0002 for cfopen_write seem insufficient:
|+       if (compressionMethod == COMPRESSION_NONE)
|+               fp = cfopen(path, mode, compressionMethod, 0);
|        else
|        {
| #ifdef HAVE_LIBZ
|                char       *fname;
| 
|                fname = psprintf("%s.gz", path);
|-               fp = cfopen(fname, mode, compression);
|+               fp = cfopen(fname, mode, compressionMethod, compressionLevel);
|                free_keep_errno(fname);
| #else

The only difference between the LIBZ and uncompressed case is the file
extension, and it'll be the only difference with LZ4 too.  So I suggest to
first handle the file extension, and the rest of the code path is not
conditional on the compression method.  I don't think cfopen_write even needs
HAVE_LIBZ - can't you handle that in cfopen_internal() ?
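
Roughly along these lines, as a sketch (it assumes the cfopen() signature from
this patch and the existing psprintf()/free_keep_errno() helpers):

/* Sketch: choose the suffix up front; a single code path follows. */
static cfp *
cfopen_write_sketch(const char *path, const char *mode,
                    CompressionMethod method, int level)
{
    const char *sfx = "";
    char       *fname;
    cfp        *fp;

    if (method == COMPRESSION_GZIP)
        sfx = ".gz";
    else if (method == COMPRESSION_LZ4)
        sfx = ".lz4";

    fname = psprintf("%s%s", path, sfx);
    /* whether the build supports the method would be checked further down */
    fp = cfopen(fname, mode, method, level);
    free_keep_errno(fname);
    return fp;
}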

This patch rejects -Z0, which ought to be accepted:
./src/bin/pg_dump/pg_dump -h /tmp regression -Fc -Z0 |wc
pg_dump: error: can only specify -Z/--compress [LEVEL] when method is set

Your 0003 patch shouldn't reference LZ4:
+#ifndef HAVE_LIBLZ4
+       if (*compressionMethod == COMPRESSION_LZ4)
+               supports_compression = false;
+#endif

The 0004 patch renames zlibOutSize to outsize - I think the patch series should
be constructed such as to minimize the size of the method-specific patches.  I
say this anticipating also adding support for zstd.  The preliminary patches
should have all the boring stuff.  It would help for reviewing to keep the 
patches split up, or to enumerate all the boring things that are being renamed
(like change OutputContext to cfp, rename zlibOutSize, ...).

0004: The include should use <lz4.h> and not "lz4.h"

freebsd/cfbot is failing.

I suggested off-list to add an 0099 patch to change LZ4 to the default, to
exercise it more on CI.

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Sat, Mar 26, 2022 at 11:21:56AM -0500, Justin Pryzby wrote:
> You're passing both the compression method *and* level.  I think there should
> be a structure which includes both.  In the future, that can also handle
> additional options.

I'm not sure if there's anything worth saving, but I did that last year with
0003-Support-multiple-compression-algs-levels-opts.patch
I sent a rebased copy off-list.
https://www.postgresql.org/message-id/flat/20210104025321.GA9712@telsasoft.com#ca1b9f9d3552c87fa874731cad9d8391

|    fatal("not built with LZ4 support");
|    fatal("not built with lz4 support");

Please use consistent capitalization of "lz4" - then the compiler can optimize
away duplicate strings.

> 0004: The include should use <lz4.h> and not "lz4.h"

Also, use USE_LZ4 rather than HAVE_LIBLZ4, per 75eae0908.



Re: Add LZ4 compression in pg_dump

From
Daniel Gustafsson
Date:
> On 26 Mar 2022, at 17:21, Justin Pryzby <pryzby@telsasoft.com> wrote:

> I suggested off-list to add an 0099 patch to change LZ4 to the default, to
> exercise it more on CI.

No need to change the defaults in autoconf for that.  The CFBot uses the cirrus
file in the tree so changing what the job includes can be easily done (assuming
the CFBot hasn't changed this recently which I think it hasn't).  I used that
trick in the NSS patchset to add a completely new job for --with-ssl=nss beside
the --with-ssl=openssl job.

--
Daniel Gustafsson        https://vmware.com/




Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Sun, Mar 27, 2022 at 12:37:27AM +0100, Daniel Gustafsson wrote:
> > On 26 Mar 2022, at 17:21, Justin Pryzby <pryzby@telsasoft.com> wrote:
> 
> > I suggested off-list to add an 0099 patch to change LZ4 to the default, to
> > exercise it more on CI.
> 
> No need to change the defaults in autoconf for that.  The CFBot uses the cirrus
> file in the tree so changing what the job includes can be easily done (assuming
> the CFBot hasn't changed this recently which I think it hasn't).  I used that
> trick in the NSS patchset to add a completely new job for --with-ssl=nss beside
> the --with-ssl=openssl job.

I think you misunderstood - I'm suggesting not only to use with-lz4 (which was
always true since 93d973494), but to change pg_dump -Fc and -Fd to use LZ4 by
default (the same as I suggested for toast_compression, wal_compression, and
again in last year's patch to add zstd compression to pg_dump, for which
postgres was not ready).

@@ -781,6 +807,11 @@ main(int argc, char **argv)
                        compress.alg = COMPR_ALG_LIBZ;
                        compress.level = Z_DEFAULT_COMPRESSION;
 #endif
+
+#ifdef USE_ZSTD
+                       compress.alg = COMPR_ALG_ZSTD; // Set default for testing purposes
+                       compress.level = ZSTD_CLEVEL_DEFAULT;
+#endif




Re: Add LZ4 compression in pg_dump

From
Robert Haas
Date:
On Sat, Mar 26, 2022 at 12:22 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> 0002: I wonder if you're able to re-use any of the basebackup parsing stuff
> from commit ffd53659c.  You're passing both the compression method *and* level.
> I think there should be a structure which includes both.  In the future, that
> can also handle additional options.  I hope to re-use these same things for
> wal_compression=method:level.

Yeah, we should really try to use that infrastructure instead of
inventing a bunch of different ways to do it. It might require some
renaming here and there, and I'm not sure whether we really want to
try to rush all this into the current release, but I think we should
find a way to get it done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Sun, Mar 27, 2022 at 10:13:00AM -0400, Robert Haas wrote:
> On Sat, Mar 26, 2022 at 12:22 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > 0002: I wonder if you're able to re-use any of the basebackup parsing stuff
> > from commit ffd53659c.  You're passing both the compression method *and* level.
> > I think there should be a structure which includes both.  In the future, that
> > can also handle additional options.  I hope to re-use these same things for
> > wal_compression=method:level.
> 
> Yeah, we should really try to use that infrastructure instead of
> inventing a bunch of different ways to do it. It might require some
> renaming here and there, and I'm not sure whether we really want to
> try to rush all this into the current release, but I think we should
> find a way to get it done.

It seems like something a whole lot like parse_compress_options() should be in
common/.  Nobody wants to write it again, and I couldn't convince myself to
copy it when I looked at using it for wal_compression.

Maybe it should take an argument which specifies the default algorithm to use
for input of a numeric "level".  And reject such input if not specified, since
wal_compression has never taken a "level", so it's not useful or desirable to
have that default to some new algorithm.
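
A sketch of how that signature could look; parse_method_and_level() below is a
hypothetical helper standing in for the "method[:level]" parsing, not an
existing function:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parses "method" or "method:level". */
extern bool parse_method_and_level(const char *value, char **method, int *level);

/*
 * Sketch: a bare integer is only accepted when the caller supplies a
 * default method; wal_compression would pass NULL and so reject "3".
 */
static bool
parse_compress_options_sketch(const char *value, const char *default_method,
                              char **method, int *level)
{
    if (value[0] != '\0' && strspn(value, "0123456789") == strlen(value))
    {
        if (default_method == NULL)
            return false;       /* bare level not allowed for this consumer */
        *method = strdup(default_method);
        *level = atoi(value);
        return true;
    }
    return parse_method_and_level(value, method, level);
}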

I could write this down if you want, although I'm not sure how/if you intend
other people to use bc_algorithm and bc_algorithm.  I don't think it's
important to do for v15, but it seems like it could be done after feature
freeze.  pg_dump+lz4 is targeting v16, although there's a cleanup patch that
could also go in before branching.

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Daniel Gustafsson
Date:
> On 27 Mar 2022, at 00:51, Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Sun, Mar 27, 2022 at 12:37:27AM +0100, Daniel Gustafsson wrote:
>>> On 26 Mar 2022, at 17:21, Justin Pryzby <pryzby@telsasoft.com> wrote:
>>
>>> I suggested off-list to add an 0099 patch to change LZ4 to the default, to
>>> exercise it more on CI.
>>
>> No need to change the defaults in autoconf for that.  The CFBot uses the cirrus
>> file in the tree so changing what the job includes can be easily done (assuming
>> the CFBot hasn't changed this recently which I think it hasn't).  I used that
>> trick in the NSS patchset to add a completely new job for --with-ssl=nss beside
>> the --with-ssl=openssl job.
>
> I think you misunderstood - I'm suggesting not only to use with-lz4 (which was
> always true since 93d973494), but to change pg_dump -Fc and -Fd to use LZ4 by
> default (the same as I suggested for toast_compression, wal_compression, and
> again in last year's patch to add zstd compression to pg_dump, for which
> postgres was not ready).

Right, I clearly misunderstood, thanks for the clarification.

--
Daniel Gustafsson        https://vmware.com/




Re: Add LZ4 compression in pg_dump

From
Robert Haas
Date:
On Sun, Mar 27, 2022 at 12:06 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Maybe it should take an argument which specifies the default algorithm to use
> for input of a numeric "level".  And reject such input if not specified, since
> wal_compression has never taken a "level", so it's not useful or desirable to
> have that default to some new algorithm.

That sounds odd to me. Wouldn't it be rather confusing if a bare
integer meant gzip for one case and lz4 for another?

> I could write this down if you want, although I'm not sure how/if you intend
> other people to use bc_algorithm and bc_algorithm.  I don't think it's
> important to do for v15, but it seems like it could be done after featue
> freeze.  pg_dump+lz4 is targetting v16, although there's a cleanup patch that
> could also go in before branching.

Well, I think the first thing we should do is get rid of enum
WalCompressionMethod and use enum WalCompression instead. They've got
the same elements and very similar names, but the WalCompressionMethod
ones just have names like COMPRESSION_NONE, which is too generic,
whereas WalCompression uses WAL_COMPRESSION_NONE, which is
better. Then I think we should also rename the COMPR_ALG_* constants
in pg_dump.h to names like DUMP_COMPRESSION_*. Once we do that we've
got rid of all the unprefixed things that purport to be a list of
compression algorithms.

Then, if people are willing to adopt the syntax that the
backup_compression.c/h stuff supports as a project standard (+1 from
me) we can go the other way and rename that stuff to be more generic,
taking backup out of the name.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Mon, Mar 28, 2022 at 08:36:15AM -0400, Robert Haas wrote:
> Well, I think the first thing we should do is get rid of enum
> WalCompressionMethod and use enum WalCompression instead. They've got
> the same elements and very similar names, but the WalCompressionMethod
> ones just have names like COMPRESSION_NONE, which is too generic,
> whereas WalCompression uses WAL_COMPRESSION_NONE, which is
> better. Then I think we should also rename the COMPR_ALG_* constants
> in pg_dump.h to names like DUMP_COMPRESSION_*. Once we do that we've
> got rid of all the unprefixed things that purport to be a list of
> compression algorithms.

Yes, having a centralized enum for the compression method would make
sense, along with the routines to parse and get the compression method
names.  At least that would be one step towards more unity in
src/common/.

> Then, if people are willing to adopt the syntax that the
> backup_compression.c/h stuff supports as a project standard (+1 from
> me) we can go the other way and rename that stuff to be more generic,
> taking backup out of the name.

I am not sure about the specification part which is only used by base
backups that has no client-server requirements, so option values would
still require their own grammar.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Sat, Mar 26, 2022 at 01:14:41AM -0500, Justin Pryzby wrote:
> See 0001 and the manpage.
>
> +               'pg_dump: compression is not supported by tar archive format');
>
> When I submitted a patch to support zstd, I spent awhile trying to make
> compression work with tar, but it's a significant effort and better done
> separately.

Wow.  This stuff is old enough to vote (c3e18804), dead since its
introduction.  There is indeed an argument for removing that: it is
not good to keep around code that has never been stressed and/or
used.  Upon review, the cleanup done looks correct, as we have never
been able to generate .dat.gz files for a dump in the tar format.

+   command_fails_like(
+       [ 'pg_dump', '--compress', '1', '--format', 'tar' ],
This addition depending on HAVE_LIBZ is a good thing as a reminder of
any work that could be done in 0002.  Now that's waiting for 20 years
so I would not hold my breath on this support.  I think that this
could be just applied first, with 0002 on top of it, as a first
improvement.

+       compress_cmd => [
+           $ENV{'GZIP_PROGRAM'},
Patch 0001 is missing an update of pg_dump's Makefile to pass down
this environment variable to the test scripts, no?

+       compress_cmd => [
+           $ENV{'GZIP_PROGRAM'},
+           '-f',
[...]
+           $ENV{'GZIP_PROGRAM'},
+           '-k', '-d',
-f and -d are available everywhere I looked at, but is -k/--keep a
portable choice with a gzip command?  I don't see this option in
OpenBSD, for one.  So this test is going to cause problems on those
buildfarm machines, at least.  Couldn't this part be replaced by a
simple --test to check that what has been compressed is in correct
shape?  We know that this works, based on our recent experiences with
the other tests.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:
------- Original Message -------

On Tuesday, March 29th, 2022 at 9:27 AM, Michael Paquier <michael@paquier.xyz> wrote:

> On Sat, Mar 26, 2022 at 01:14:41AM -0500, Justin Pryzby wrote:
> > See 0001 and the manpage.
> > + 'pg_dump: compression is not supported by tar archive format');
> > When I submitted a patch to support zstd, I spent awhile trying to make
> > compression work with tar, but it's a significant effort and better done
> > separately.
>
> Wow. This stuff is old enough to vote (c3e18804), dead since its
> introduction. There is indeed an argument for removing that: it is
> not good to keep around code that has never been stressed and/or
> used. Upon review, the cleanup done looks correct, as we have never
> been able to generate .dat.gz files for a dump in the tar format.

Correct. My driving force behind it was to ease up the cleanup/refactoring
work that follows, by eliminating the callers of the GZ*() macros.

> + command_fails_like(
>
> + [ 'pg_dump', '--compress', '1', '--format', 'tar' ],
> This addition depending on HAVE_LIBZ is a good thing as a reminder of
> any work that could be done in 0002. Now that's waiting for 20 years
> so I would not hold my breath on this support. I think that this
> could be just applied first, with 0002 on top of it, as a first
> improvement.

Excellent, thank you.

> + compress_cmd => [
> + $ENV{'GZIP_PROGRAM'},
> Patch 0001 is missing an update of pg_dump's Makefile to pass down
> this environment variable to the test scripts, no?

Agreed. It was not properly moved forward. Fixed.

> + compress_cmd => [
> + $ENV{'GZIP_PROGRAM'},
> + '-f',
> [...]
> + $ENV{'GZIP_PROGRAM'},
> + '-k', '-d',
> -f and -d are available everywhere I looked at, but is -k/--keep a
> portable choice with a gzip command? I don't see this option in
> OpenBSD, for one. So this test is going to cause problems on those
> buildfarm machines, at least. Couldn't this part be replaced by a
> simple --test to check that what has been compressed is in correct
> shape? We know that this works, based on our recent experiences with
> the other tests.

I would argue that the simple '--test' will not do in this case, as the
TAP tests do need a file named <test>.sql to compare the contents with.
This file is generated either directly by pg_dump itself, or by running
pg_restore on pg_dump's output. In the case of compression, pg_dump will
generate a <test>.sql.<compression program suffix> which cannot be
used in the comparison tests. So the intention of this block is not
simply to test for validity, but also to decompress pg_dump's output so
that it can be used.

I updated the patch to simply remove the '-k' flag.

Please find v3 attached. (only 0001 and 0002 are relevant, 0003 and
0004 are only for reference and are currently under active modification).

Cheers,
//Georgios

Attachment

Re: Add LZ4 compression in pg_dump

From
Robert Haas
Date:
On Tue, Mar 29, 2022 at 1:03 AM Michael Paquier <michael@paquier.xyz> wrote:
> > Then, if people are willing to adopt the syntax that the
> > backup_compression.c/h stuff supports as a project standard (+1 from
> > me) we can go the other way and rename that stuff to be more generic,
> > taking backup out of the name.
>
> I am not sure about the specification part which is only used by base
> backups that has no client-server requirements, so option values would
> still require their own grammar.

I don't know what you mean by this. I think the specification stuff
could be reused in a lot of places. If you can ask for a base backup
with zstd:level=3,long=1,fancystuff=yes or whatever we end up with,
why not enable exactly the same for every other place that uses
compression? I don't know what "client-server requirements" is or what
that has to do with this.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Tue, Mar 29, 2022 at 09:46:27AM +0000, gkokolatos@pm.me wrote:
> On Tuesday, March 29th, 2022 at 9:27 AM, Michael Paquier <michael@paquier.xyz> wrote:
>> On Sat, Mar 26, 2022 at 01:14:41AM -0500, Justin Pryzby wrote:
>> Wow. This stuff is old enough to vote (c3e18804), dead since its
>> introduction. There is indeed an argument for removing that: it is
>> not good to keep around code that has never been stressed and/or
>> used. Upon review, the cleanup done looks correct, as we have never
>> been able to generate .dat.gz files for a dump in the tar format.
>
> Correct. My driving force behind it was to ease up the cleanup/refactoring
> work that follows, by eliminating the callers of the GZ*() macros.

Makes sense to me.

>> + command_fails_like(
>>
>> + [ 'pg_dump', '--compress', '1', '--format', 'tar' ],
>> This addition depending on HAVE_LIBZ is a good thing as a reminder of
>> any work that could be done in 0002. Now that's waiting for 20 years
>> so I would not hold my breath on this support. I think that this
>> could be just applied first, with 0002 on top of it, as a first
>> improvement.
>
> Excellent, thank you.

I have applied the test for --compress and --format=tar, separating it
from the rest.

While moving on with 0002, I have noticed the following in
_StartBlob():
    if (AH->compression != 0)
        sfx = ".gz";
    else
        sfx = "";

Shouldn't this bit also be simplified, adding a fatal() like the other
code paths, for safety?
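
That is, roughly the following sketch; it assumes the CompressionMethod enum
from 0002 and pg_dump's fatal(), and the names are only illustrative:

/* Sketch: pick the blob file suffix, erroring out on unknown methods. */
static const char *
blob_suffix(CompressionMethod method)
{
    switch (method)
    {
        case COMPRESSION_NONE:
            return "";
        case COMPRESSION_GZIP:
            return ".gz";
        default:
            fatal("invalid compression method");
            return NULL;        /* keep the compiler quiet */
    }
}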

>> + compress_cmd => [
>> + $ENV{'GZIP_PROGRAM'},
>> + '-f',
>> [...]
>> + $ENV{'GZIP_PROGRAM'},
>> + '-k', '-d',
>> -f and -d are available everywhere I looked at, but is -k/--keep a
>> portable choice with a gzip command? I don't see this option in
>> OpenBSD, for one. So this test is going to cause problems on those
>> buildfarm machines, at least. Couldn't this part be replaced by a
>> simple --test to check that what has been compressed is in correct
>> shape? We know that this works, based on our recent experiences with
>> the other tests.
>
> I would argue that the simple '--test' will not do in this case, as the
> TAP tests do need a file named <test>.sql to compare the contents with.
> This file is generated either directly by pg_dump itself, or by running
> pg_restore on pg_dump's output. In the case of compression, pg_dump will
> generate a <test>.sql.<compression program suffix> which cannot be
> used in the comparison tests. So the intention of this block is not
> simply to test for validity, but also to decompress pg_dump's output so
> that it can be used.

Ahh, I see, thanks.  I would add a comment about that in the area of
compression_gzip_plain_format.

+   my $supports_compression = check_pg_config("#define HAVE_LIBZ 1");

This part could be moved within the if block a couple of lines down.

+   my $compress_program = $ENV{GZIP_PROGRAM};

It seems to me that it is enough to rely on {compress_cmd}, hence
there should be no need for $compress_program, no?

It seems to me that we should have a description for compress_cmd at
the top of 002_pg_dump.pl (close to "Definition of the pg_dump runs to
make").  There is an order dependency with restore_cmd.

> I updated the patch to simply remove the '-k' flag.

Okay.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Tue, Mar 29, 2022 at 09:14:03AM -0400, Robert Haas wrote:
> I don't know what you mean by this. I think the specification stuff
> could be reused in a lot of places. If you can ask for a base backup
> with zstd:level=3,long=1,fancystuff=yes or whatever we end up with,
> why not enable exactly the same for every other place that uses
> compression? I don't know what "client-server requirements" is or what
> that has to do with this.

Oh. I think that I got confused here.  I saw the backup component in
the file name and this has been associated with the client/server
choice that can be done in the options of pg_basebackup.  But
parse_bc_specification() does not include any knowledge about that:
pg_basebackup does this job in parse_compress_options().  I agree that
it looks possible to reuse that stuff in more places than just base
backups.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:
------- Original Message -------

On Wednesday, March 30th, 2022 at 7:54 AM, Michael Paquier <michael@paquier.xyz> wrote:

> On Tue, Mar 29, 2022 at 09:46:27AM +0000, gkokolatos@pm.me wrote:
> > On Tuesday, March 29th, 2022 at 9:27 AM, Michael Paquier michael@paquier.xyz wrote:
> > > On Sat, Mar 26, 2022 at 01:14:41AM -0500, Justin Pryzby wrote:
> > > + command_fails_like(
> > > + [ 'pg_dump', '--compress', '1', '--format', 'tar' ],
> > > This addition depending on HAVE_LIBZ is a good thing as a reminder of
> > > any work that could be done in 0002. Now that's waiting for 20 years
> > > so I would not hold my breath on this support. I think that this
> > > could be just applied first, with 0002 on top of it, as a first
> > > improvement.
> >
> > Excellent, thank you.
>
> I have applied the test for --compress and --format=tar, separating it
> from the rest.

Thank you.

> While moving on with 0002, I have noticed the following in
>
> _StartBlob():
>     if (AH->compression != 0)
>         sfx = ".gz";
>     else
>         sfx = "";
>
> Shouldn't this bit also be simplified, adding a fatal() like the other
> code paths, for safety?

Agreed. Fixed.

> > > + compress_cmd => [
> > > + $ENV{'GZIP_PROGRAM'},
> > > + '-f',
> > > [...]
> > > + $ENV{'GZIP_PROGRAM'},
> > > + '-k', '-d',
> > > -f and -d are available everywhere I looked at, but is -k/--keep a
> > > portable choice with a gzip command? I don't see this option in
> > > OpenBSD, for one. So this test is going to cause problems on those
> > > buildfarm machines, at least. Couldn't this part be replaced by a
> > > simple --test to check that what has been compressed is in correct
> > > shape? We know that this works, based on our recent experiences with
> > > the other tests.
> >
> > I would argue that the simple '--test' will not do in this case, as the
> > TAP tests do need a file named <test>.sql to compare the contents with.
> > This file is generated either directly by pg_dump itself, or by running
> > pg_restore on pg_dump's output. In the case of compression, pg_dump will
> > generate a <test>.sql.<compression program suffix> which cannot be
> > used in the comparison tests. So the intention of this block is not
> > simply to test for validity, but also to decompress pg_dump's output so
> > that it can be used.
>
> Ahh, I see, thanks. I would add a comment about that in the area of
> compression_gzip_plain_format.

Agreed. Comment added.

> + my $supports_compression = check_pg_config("#define HAVE_LIBZ 1");
>
> This part could be moved within the if block a couple of lines down.

I moved it instead out of the for loop above to not have to call it on
each iteration.

> + my $compress_program = $ENV{GZIP_PROGRAM};
>
> It seems to me that it is enough to rely on {compress_cmd}, hence
> there should be no need for $compress_program, no?

Maybe not. We don't want the tests to fail if the utility is not
installed. That becomes even more evident as more methods are added.
However, I realized that the presence of the environment variable does
not guarantee that the utility is actually installed. In the attached,
the existence of the utility is based on the return value of system_log().

> It seems to me that we should have a description for compress_cmd at
> the top of 002_pg_dump.pl (close to "Definition of the pg_dump runs to
> make"). There is an order dependency with restore_cmd.

Agreed. Comment added.

Cheers,
//Georgios
Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Wed, Mar 30, 2022 at 03:32:55PM +0000, gkokolatos@pm.me wrote:
> On Wednesday, March 30th, 2022 at 7:54 AM, Michael Paquier <michael@paquier.xyz> wrote:
>> While moving on with 0002, I have noticed the following in
>>
>> _StartBlob():
>>     if (AH->compression != 0)
>>         sfx = ".gz";
>>     else
>>         sfx = "";
>>
>> Shouldn't this bit also be simplified, adding a fatal() like the other
>> code paths, for safety?
>
> Agreed. Fixed.

Okay.  0002 looks fine as-is, and I don't mind the extra fatal()
calls.  These could be asserts but that's not a big deal one way or
the other.  And the cleanup is now applied.

>> + my $compress_program = $ENV{GZIP_PROGRAM};
>>
>> It seems to me that it is enough to rely on {compress_cmd}, hence
>> there should be no need for $compress_program, no?
>
> Maybe not. We don't want the tests to fail if the utility is not
> installed. That becomes even more evident as more methods are added.
> However, I realized that the presence of the environment variable does
> not guarantee that the utility is actually installed. In the attached,
> the existence of the utility is based on the return value of system_log().

Hmm.  [.. thinks ..]  The thing that's itching me here is that you
align the concept of compression with gzip, but that's not going to be
true once more compression options are added to pg_dump, and that
would make $supports_compression and $compress_program_exists
incorrect.  Perhaps the right answer would be to rename all that with
a suffix like "_gzip" to make a difference?  Or would there be enough
control with a value of "compression_gzip" instead of "compression" in
test_key?

+my $compress_program_exists = (system_log("$ENV{GZIP_PROGRAM}", '-h',
+                                         '>', '/dev/null') == 0);
Do we need this command execution at all?  In all the other tests, we
rely on a simple "if (!defined $gzip || $gzip eq '');", so we could do
the same here.

A last thing is that we should perhaps make a clear difference between
the check that looks at if the code has been built with zlib and the
check for the presence of GZIP_PROGRAM, as it can be useful in some
environments to be able to run pg_dump built with zlib, even if the
GZIP_PROGRAM command does not exist (I don't think this can be the
case, but other tests are flexible).  As of now, the patch relies on
pg_dump enforcing uncompression if building under --without-zlib even
if --compress/-Z is used, but that also means that those compression
tests overlap with the existing tests in this case.  Wouldn't it be
more consistent to check after $supports_compression when executing
the dump command for test_key = "compression[_gzip]"?  This would mean
keeping GZIP_PROGRAM as sole check when executing the compression
command.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:
On Thursday, March 31st, 2022 at 4:34 AM, Michael Paquier <michael@paquier.xyz> wrote:
> On Wed, Mar 30, 2022 at 03:32:55PM +0000, gkokolatos@pm.me wrote:
> > On Wednesday, March 30th, 2022 at 7:54 AM, Michael Paquier michael@paquier.xyz wrote:
>
> Okay. 0002 looks fine as-is, and I don't mind the extra fatal()
> calls. These could be asserts but that's not a big deal one way or
> the other. And the cleanup is now applied.

Thank you very much.

> > > + my $compress_program = $ENV{GZIP_PROGRAM};
> > > It seems to me that it is enough to rely on {compress_cmd}, hence
> > > there should be no need for $compress_program, no?
> >
> > Maybe not. We don't want the tests to fail if the utility is not
> > installed. That becomes even more evident as more methods are added.
> > However, I realized that the presence of the environment variable does
> > not guarantee that the utility is actually installed. In the attached,
> > the existence of the utility is based on the return value of system_log().
>
> Hmm. [.. thinks ..] The thing that's itching me here is that you
> align the concept of compression with gzip, but that's not going to be
> true once more compression options are added to pg_dump, and that
> would make $supports_compression and $compress_program_exists
> incorrect. Perhaps the right answer would be to rename all that with
> a suffix like "_gzip" to make a difference? Or would there be enough
> control with a value of "compression_gzip" instead of "compression" in
> test_key?

I understand the itch. Indeed when LZ4 is added as compression method, this
block changes slightly. I went with the minimum amount changed. Please find
in 0001 of the attached this variable renamed as $gzip_program_exist. I thought
that as prefix it will match better the already used $ENV{GZIP_PROGRAM}.

> +my $compress_program_exists = (system_log("$ENV{GZIP_PROGRAM}", '-h',
> + '>', '/dev/null') == 0);
>
> Do we need this command execution at all? In all the other tests, we
> rely on a simple "if (!defined $gzip || $gzip eq '');", so we could do
> the same here.

You are very correct; the simple version is what was included in the previous
versions of the current patch. However, I did notice that the variable is
hard-coded in Makefile.global.in and does not go through configure. By now,
gzip is considered an essential package in most installations, and this
hard-coding makes sense. Still, I did remove the utility from my system
(apt remove gzip) and tried the test with the simple
"if (!defined $gzip || $gzip eq '');", which predictably failed. For this
reason, I went with the system call; it is not too expensive and is rather
reliable.

It is true that the rest of the TAP tests that use this, e.g. in pg_basebackup,
also failed. There is an argument to go simple and I will be happy to revert
to the previous version.

> A last thing is that we should perhaps make a clear difference between
> the check that looks at if the code has been built with zlib and the
> check for the presence of GZIP_PROGRAM, as it can be useful in some
> environments to be able to run pg_dump built with zlib, even if the
> GZIP_PROGRAM command does not exist (I don't think this can be the
> case, but other tests are flexible).

You are very correct. We do that already in the current patch. Note that we skip
the test only when we specifically have to execute a compression command. Not
all compression tests define such a command, exactly so that we can test those
cases as well. The point of using an external utility program is to extend
the coverage to previously untested yet supported scenarios, e.g. manual
compression of the *.toc files.

Also, in the case where the compression command is skipped because the gzip
program is not present, the pg_dump command is still executed first.

> As of now, the patch relies on
> pg_dump enforcing uncompression if building under --without-zlib even
> if --compress/-Z is used, but that also means that those compression
> tests overlap with the existing tests in this case. Wouldn't it be
> more consistent to check after $supports_compression when executing
> the dump command for test_key = "compression[_gzip]"? This would mean
> keeping GZIP_PROGRAM as sole check when executing the compression
> command.

I can see the overlap case. Yet, I understand the test_key as serving different
purpose, as it is a key of %tests and %full_runs. I do not expect the database
content of the generated dump to change based on which compression method is used.

In the next round, I can see one explicitly requesting --compress=none to override
defaults. There is a benefit to group the tests for this scenario under the same
test_key, i.e. compression.

Also, there will be cases where the program exists, yet the codebase is compiled
without support for the method; then the compress_cmd or the restore_cmd that
follows will fail. For example, with the plain output, if we try to uncompress the
generated file, the test will fail with 'gzip: <filename> not in gzip format'. In
the directory format the compress_cmd will compress the *.toc files, but the
restore_cmd will fail because the build has no support for them.

In the attached version, I propose that the compression_cmd is converted into
a hash. It contains two keys, the program and the arguments. Maybe it is easier
to read than before or than simply grabbing the first element of the array.

Cheers,
//Georgios
Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Fri, Apr 01, 2022 at 03:06:40PM +0000, gkokolatos@pm.me wrote:
> I understand the itch. Indeed when LZ4 is added as compression method, this
> block changes slightly. I went with the minimum amount changed. Please find
> in 0001 of the attached this variable renamed as $gzip_program_exist. I thought
> that as prefix it will match better the already used $ENV{GZIP_PROGRAM}.

Hmm.  I have spent some time on that, and upon review I really think
that we should skip the tests marked as dedicated to the gzip
compression entirely if the build is not compiled with this option,
rather than letting the code run a dump for nothing in some cases,
relying on the default to uncompress the contents in others.  In the
latter case, it happens that we have already some checks like
defaults_custom_format, but you already mentioned that.

We should also skip the later parts of the tests if the compression
program does not exist as we rely on it, but only if the command does
not exist.  This will count for LZ4.

> I can see the overlap case. Yet, I understand the test_key as serving different
> purpose, as it is a key of %tests and %full_runs. I do not expect the database
> content of the generated dump to change based on which compression method is used.

Contrary to the current LZ4 tests in pg_dump, what we have here is a
check for a command-level run and not a data-level check.  So what's
introduced is a new concept, and we need a new way to control if the
tests should be entirely skipped or not, particularly if we finish by
not using test_key to make the difference.  Perhaps the best way to
address that is to have a new keyword in the $runs structure.  The
attached defines a new compile_option, that can be completed later for
new compression methods introduced in the tests.  So the idea is to
mark all the tests related to compression with the same test_key, and
the tests can be skipped depending on what compile_option requires.

> In the attached version, I propose that the compression_cmd is converted into
> a hash. It contains two keys, the program and the arguments. Maybe it is easier
> to read than before or than simply grabbing the first element of the array.

Splitting the program and its arguments makes sense.

At the end I am finishing with the attached.  I also saw an overlap
with the addition of --jobs for the directory format vs not using the
option, so I have removed the case where --jobs was not used in the
directory format.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:

------- Original Message -------

On Tuesday, April 5th, 2022 at 3:34 AM, Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Apr 01, 2022 at 03:06:40PM +0000, gkokolatos@pm.me wrote:

> Splitting the program and its arguments makes sense.

Great.

> At the end I am finishing with the attached. I also saw an overlap
> with the addition of --jobs for the directory format vs not using the
> option, so I have removed the case where --jobs was not used in the
> directory format.

Thank you. I agree with the attached and I will carry it forward to the
rest of the patchset.

Cheers,
//Georgios

> --
> Michael



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Tue, Apr 05, 2022 at 07:13:35AM +0000, gkokolatos@pm.me wrote:
> Thank you. I agree with the attached and I will carry it forward to the
> rest of the patchset.

No need to carry it forward anymore, I think ;)
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:

------- Original Message -------

On Tuesday, April 5th, 2022 at 12:55 PM, Michael Paquier <michael@paquier.xyz> wrote:

> On Tue, Apr 05, 2022 at 07:13:35AM +0000, gkokolatos@pm.me wrote:
> No need to carry it forward anymore, I think ;)

Thank you for committing!

Cheers,
//Georgios

> --
> Michael



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
Hi,

Will you be able to send a rebased patch for the next CF ?

If you update for the review comments I sent in March, I'll plan to do another
round of review.

On Sat, Mar 26, 2022 at 11:21:56AM -0500, Justin Pryzby wrote:
> LZ4F_HEADER_SIZE_MAX isn't defined in old LZ4.
> 
> I ran into that on an ubuntu LTS, so I don't think it's so old that it
> shouldn't be handled more gracefully.  LZ4 should either have an explicit
> version check, or else shouldn't depend on that feature (or should define a
> safe fallback version if the library header doesn't define it).
> 
> https://packages.ubuntu.com/liblz4-1
> 
> 0003: typo: of legacy => or legacy
> 
> There are a large number of ifdefs being added here - it'd be nice to minimize
> that.  basebackup was organized to use separate files, which is one way.
> 
> $ git grep -c 'ifdef .*LZ4' src/bin/pg_dump/compress_io.c
> src/bin/pg_dump/compress_io.c:19
> 
> In last year's CF entry, I had made a union within CompressorState.  LZ4
> doesn't need z_streamp (and zstd will need ZSTD_outBuffer, ZSTD_inBuffer,
> ZSTD_CStream).
> 
> 0002: I wonder if you're able to re-use any of the basebackup parsing stuff
> from commit ffd53659c.  You're passing both the compression method *and* level.
> I think there should be a structure which includes both.  In the future, that
> can also handle additional options.  I hope to re-use these same things for
> wal_compression=method:level.
> 
> You renamed this:
> 
> |-       COMPR_ALG_LIBZ
> |-} CompressionAlgorithm;
> |+       COMPRESSION_GZIP,
> |+} CompressionMethod;
> 
> ..But I don't think that's an improvement.  If you were to change it, it should
> say something like PGDUMP_COMPRESS_ZLIB, since there are other compression
> structs and typedefs.  zlib is not identical to gzip, which uses a different
> header, so in WriteDataToArchive(), LIBZ is correct, and GZIP is incorrect.
> 
> The cf* changes in pg_backup_archiver could be split out into a separate
> commit.  It's strictly a code simplification - not just preparation for more
> compression algorithms.  The commit message should "See also:
> bf9aa490db24b2334b3595ee33653bf2fe39208c".
> 
> The changes in 0002 for cfopen_write seem insufficient:
> |+       if (compressionMethod == COMPRESSION_NONE)
> |+               fp = cfopen(path, mode, compressionMethod, 0);
> |        else
> |        {
> | #ifdef HAVE_LIBZ
> |                char       *fname;
> | 
> |                fname = psprintf("%s.gz", path);
> |-               fp = cfopen(fname, mode, compression);
> |+               fp = cfopen(fname, mode, compressionMethod, compressionLevel);
> |                free_keep_errno(fname);
> | #else
> 
> The only difference between the LIBZ and uncompressed case is the file
> extension, and it'll be the only difference with LZ4 too.  So I suggest to
> first handle the file extension, and the rest of the code path is not
> conditional on the compression method.  I don't think cfopen_write even needs
> HAVE_LIBZ - can't you handle that in cfopen_internal() ?
> 
> This patch rejects -Z0, which ought to be accepted:
> ./src/bin/pg_dump/pg_dump -h /tmp regression -Fc -Z0 |wc
> pg_dump: error: can only specify -Z/--compress [LEVEL] when method is set
> 
> Your 0003 patch shouldn't reference LZ4:
> +#ifndef HAVE_LIBLZ4
> +       if (*compressionMethod == COMPRESSION_LZ4)
> +               supports_compression = false;
> +#endif
> 
> The 0004 patch renames zlibOutSize to outsize - I think the patch series should
> be constructed such as to minimize the size of the method-specific patches.  I
> say this anticipating also adding support for zstd.  The preliminary patches
> should have all the boring stuff.  It would help for reviewing to keep the 
> patches split up, or to enumerate all the boring things that are being renamed
> (like change OutputContext to cfp, rename zlibOutSize, ...).
> 
> 0004: The include should use <lz4.h> and not "lz4.h"
> 
> freebsd/cfbot is failing.
> 
> I suggested off-list to add an 0099 patch to change LZ4 to the default, to
> exercise it more on CI.

On Sat, Mar 26, 2022 at 01:33:36PM -0500, Justin Pryzby wrote:
> On Sat, Mar 26, 2022 at 11:21:56AM -0500, Justin Pryzby wrote:
> > You're passing both the compression method *and* level.  I think there should
> > be a structure which includes both.  In the future, that can also handle
> > additional options.
> 
> I'm not sure if there's anything worth saving, but I did that last year with
> 0003-Support-multiple-compression-algs-levels-opts.patch
> I sent a rebased copy off-list.
> https://www.postgresql.org/message-id/flat/20210104025321.GA9712@telsasoft.com#ca1b9f9d3552c87fa874731cad9d8391
> 
> |    fatal("not built with LZ4 support");
> |    fatal("not built with lz4 support");
> 
> Please use consistent capitalization of "lz4" - then the compiler can optimize
> away duplicate strings.
> 
> > 0004: The include should use <lz4.h> and not "lz4.h"
> 
> Also, use USE_LZ4 rather than HAVE_LIBLZ4, per 75eae0908.



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Sunday, June 26th, 2022 at 5:55 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:


>
>
> Hi,
>
> Will you be able to send a rebased patch for the next CF ?

Thank you for taking an interest in the PR. The plan is indeed to send
a new version.

> If you update for the review comments I sent in March, I'll plan to do another
> round of review.

Thank you.




Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Sunday, June 26th, 2022 at 5:55 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:


>
> Hi,
>
> Will you be able to send a rebased patch for the next CF ?

Please find a rebased and heavily refactored patchset. Since parts of this
patchset were already committed, I restarted numbering. I am not certain if
this is the preferred way; it makes alignment with previous comments a bit
harder.

> If you update for the review comments I sent in March, I'll plan to do another
> round of review.

I have updated for "some" of the comments. This is not an unwillingness to
incorporate those specific comments; the patchset had simply started to diverge
heavily already, based on comments from Mr. Paquier, who had requested that the
APIs be refactored to use function pointers. This is happening in 0002 of
the patchset. 0001 of the patchset is using the new compression.h under common.

This patchset should be considered a late draft, as commentary, documentation,
and some finer details are not yet finalized, because I am expecting the proposed
refactor to receive a wealth of comments. It would be helpful to understand whether
the proposed direction is worth working on before moving to the
finer details.

For what it's worth, I am the sole author of the current patchset.

Cheers,
//Georgios


Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
This is a review of 0001.

On Tue, Jul 05, 2022 at 01:22:47PM +0000, gkokolatos@pm.me wrote:
> The patchset had simply started to diverge
> heavily already, based on comments from Mr. Paquier, who had requested that the
> APIs be refactored to use function pointers. This is happening in 0002 of
> the patchset.

I said something about reducing ifdefs, but I'm having trouble finding what
Michael said about this ?

> > On Sat, Mar 26, 2022 at 11:21:56AM -0500, Justin Pryzby wrote:
> >
> > > LZ4F_HEADER_SIZE_MAX isn't defined in old LZ4.
> > >
> > > I ran into that on an ubuntu LTS, so I don't think it's so old that it
> > > shouldn't be handled more gracefully. LZ4 should either have an explicit
> > > version check, or else shouldn't depend on that feature (or should define a
> > > safe fallback version if the library header doesn't define it).
> > > https://packages.ubuntu.com/liblz4-1

The constant still seems to be used without defining a fallback or a minimum version.
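One graceful way to handle that, as a sketch (the fallback value below is a conservative guess rather than something taken from the patch; lz4frame.h's own maximum is smaller than this):

#include <lz4frame.h>

/*
 * LZ4F_HEADER_SIZE_MAX only appeared in newer liblz4 releases, so provide
 * a safe upper bound when building against an older library.
 */
#ifndef LZ4F_HEADER_SIZE_MAX
#define LZ4F_HEADER_SIZE_MAX	32
#endif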

> > > 0003: typo: of legacy => or legacy

This is still there

> > > You renamed this:
> > >
> > > |- COMPR_ALG_LIBZ
> > > |-} CompressionAlgorithm;
> > > |+ COMPRESSION_GZIP,
> > > |+} CompressionMethod;
> > >
> > > ..But I don't think that's an improvement. If you were to change it, it should
> > > say something like PGDUMP_COMPRESS_ZLIB, since there are other compression
> > > structs and typedefs. zlib is not identical to gzip, which uses a different
> > > header, so in WriteDataToArchive(), LIBZ is correct, and GZIP is incorrect.

This comment still applies - zlib's gz* functions are "gzip" but the others are
"zlib".  https://zlib.net/manual.html

That affects both the 0001 and 0002 patches.

Actually, I think that "gzip" should not be the name of the user-facing option,
since (except for "plain" format) it isn't using gzip.

+Robert, since this suggests amending parse_compress_algorithm().  Maybe "zlib"
should be parsed the same way as "gzip" - I don't think we ever expose both to
a user, but in some cases (basebackup and pg_dump -Fp -Z1) the output is "gzip"
and in other cases it's zlib (pg_dump -Fc -Z1).

> > > The cf* changes in pg_backup_archiver could be split out into a separate
> > > commit. It's strictly a code simplification - not just preparation for more
> > > compression algorithms. The commit message should "See also:
> > > bf9aa490db24b2334b3595ee33653bf2fe39208c".

I still think this could be an early, 0000 patch.

> > > freebsd/cfbot is failing.

This is still failing on BSD and Windows, and there are compiler warnings.
Windows also has compiler warnings.
http://cfbot.cputube.org/georgios-kokolatos.html

Please see src/tools/ci/README, which you can use to run check-world on 4 OSes
by pushing a branch to github.

> > > I suggested off-list to add an 0099 patch to change LZ4 to the default, to
> > > exercise it more on CI.

What about this ?  I think the patch needs to pass CI on all 4 OSes with
default=zlib and default=lz4.

> > On Sat, Mar 26, 2022 at 01:33:36PM -0500, Justin Pryzby wrote:

> @@ -254,7 +251,12 @@ CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
>  Archive *
>  OpenArchive(const char *FileSpec, const ArchiveFormat fmt)
>  {
> -    ArchiveHandle *AH = _allocAH(FileSpec, fmt, 0, true, archModeRead, setupRestoreWorker);
> +    ArchiveHandle *AH;
> +    pg_compress_specification compress_spec;

Should this be initialized to {0} ?

> @@ -969,6 +969,8 @@ NewRestoreOptions(void)
>      opts->format = archUnknown;
>      opts->cparams.promptPassword = TRI_DEFAULT;
>      opts->dumpSections = DUMP_UNSECTIONED;
> +    opts->compress_spec.algorithm = PG_COMPRESSION_NONE;
> +    opts->compress_spec.level = INT_MIN;

Why INT_MIN ?

> @@ -1115,23 +1117,28 @@ PrintTOCSummary(Archive *AHX)
>      ArchiveHandle *AH = (ArchiveHandle *) AHX;
>      RestoreOptions *ropt = AH->public.ropt;
>      TocEntry   *te;
> +    pg_compress_specification out_compress_spec;

Should have {0} ?
I suggest to write it like my 2020 patch for this, which says:
no_compression = {0};

>      /* Open stdout with no compression for AH output handle */
> -    AH->gzOut = 0;
> -    AH->OF = stdout;
> +    out_compress_spec.algorithm = PG_COMPRESSION_NONE;
> +    AH->OF = cfdopen(dup(fileno(stdout)), PG_BINARY_A, out_compress_spec);

Ideally this should check the success of dup().

> @@ -3776,21 +3746,25 @@ ReadHead(ArchiveHandle *AH)
> +    if (AH->compress_spec.level != INT_MIN)

Why is it testing the level and not the algorithm ?

> --- a/src/bin/pg_dump/pg_backup_custom.c
> +++ b/src/bin/pg_dump/pg_backup_custom.c
> @@ -298,7 +298,7 @@ _StartData(ArchiveHandle *AH, TocEntry *te)
>      _WriteByte(AH, BLK_DATA);    /* Block type */
>      WriteInt(AH, te->dumpId);    /* For sanity check */
>  
> -    ctx->cs = AllocateCompressor(AH->compression, _CustomWriteFunc);
> +    ctx->cs = AllocateCompressor(AH->compress_spec, _CustomWriteFunc);

Is it necessary to rename the data structure ?
If not, this file can remain unchanged.

> --- a/src/bin/pg_dump/pg_backup_directory.c
> +++ b/src/bin/pg_dump/pg_backup_directory.c
> @@ -573,6 +574,7 @@ _CloseArchive(ArchiveHandle *AH)
>      if (AH->mode == archModeWrite)
>      {
>          cfp           *tocFH;
> +        pg_compress_specification compress_spec;

Should use {0} ?

> @@ -639,12 +642,14 @@ static void
>  _StartBlobs(ArchiveHandle *AH, TocEntry *te)
>  {
>      lclContext *ctx = (lclContext *) AH->formatData;
> +    pg_compress_specification compress_spec;

Same

> +    /*
> +     * Custom and directory formats are compressed by default (zlib), others
> +     * not
> +     */
> +    if (user_compression_defined == false)

Should be: !user_compression_defined

Your 0001+0002 patches (without 0003) fail to compile:

pg_backup_directory.c: In function ‘_ReadByte’:
pg_backup_directory.c:519:12: error: ‘CompressFileHandle’ {aka ‘struct CompressFileHandle’} has no member named
‘_IO_getc’
  519 |  return CFH->getc(CFH);
      |            ^~
pg_backup_directory.c:520:1: warning: control reaches end of non-void function [-Wreturn-type]
  520 | }

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Tue, Jul 05, 2022 at 01:22:47PM +0000, gkokolatos@pm.me wrote:
> I have updated for "some" of the comments. This is not an unwillingness to
> incorporate those specific comments; the patchset had simply started to diverge
> heavily already, based on comments from Mr. Paquier, who had requested that the
> APIs be refactored to use function pointers. This is happening in 0002 of
> the patchset. 0001 of the patchset is using the new compression.h under common.
>
> This patchset should be considered a late draft, as commentary, documentation,
> and some finer details are not yet finalized, because I am expecting the proposed
> refactor to receive a wealth of comments. It would be helpful to understand whether
> the proposed direction is worth working on before moving to the
> finer details.

I have read through the patch set, and I like a lot the separation you
are doing here with CompressFileHandle where a compression method has
to specify a full set of callbacks depending on the actions that need
to be taken.  One advantage, as your patch shows, is that you reduce
the dependency of each code path on the compression method,
with #ifdefs and such located mostly in their own files, so
that adding a new compression method becomes much easier.  These
callbacks are going to require much more documentation to describe
what anybody using them should expect from them, and perhaps they
could be renamed in a more generic way as the current names come from
POSIX (say read_char(), read_string()?), even if this patch has just
inherited the names coming from pg_dump itself, but this can be tuned
over and over.
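A minimal sketch of such a callback set could look as follows; the member names and signatures are illustrative rather than taken verbatim from the posted patch:

#include <stddef.h>
#include "common/compression.h"	/* pg_compress_specification */

typedef struct CompressFileHandle CompressFileHandle;

struct CompressFileHandle
{
	/* fopen()/gzopen()-style open of "path", or take over "fd" if fd >= 0 */
	int			(*open_func) (const char *path, int fd, const char *mode,
							  CompressFileHandle *CFH);
	size_t		(*read_func) (void *ptr, size_t size, CompressFileHandle *CFH);
	size_t		(*write_func) (const void *ptr, size_t size,
							   CompressFileHandle *CFH);
	char	   *(*gets_func) (char *s, int size, CompressFileHandle *CFH);
	int			(*getc_func) (CompressFileHandle *CFH);
	int			(*eof_func) (CompressFileHandle *CFH);
	int			(*close_func) (CompressFileHandle *CFH);

	pg_compress_specification compression_spec;	/* method and level */
	void	   *private_data;	/* per-method state, e.g. a gzFile or LZ4File */
};

Each supported method then fills in its own set of functions, and the callers never need an #ifdef of their own.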

The split into three parts as of 0001 to plug into pg_dump the new
compression option set, 0002 to introduce the callbacks and 0003 to
add LZ4, building on the first two parts, makes sense to me.  0001 and
0002 could be done in reverse order as they are mostly independent, but
this order is fine as-is.

In short, I am fine with the proposed approach.

+#define K_VERS_1_15 MAKE_ARCHIVE_VERSION(1, 15, 0) /* add compressionMethod
+                                                    * in header */
Indeed, the dump format needs a version bump for this information.

+static bool
+parse_compression_option(const char *opt,
+                        pg_compress_specification *compress_spec)
This parsing logic in pg_dump.c looks a lot like what pg_receivewal.c
does with its parse_compress_options() where, for compatibility:
- If only a number is given:
-- Assume no compression if level is 0.
-- Assume gzip with given compression if level > 0.
- If a string is found, assume a full spec, with optionally a level.
So some consolidation could be done between both.
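In rough C, the compatibility rule being described looks like the sketch below. It is not the actual pg_receivewal code; the helper relies on parse_compress_specification() from src/common (which, as an assumption here, accepts a bare integer detail as the level):

#include <stdlib.h>
#include <string.h>

#include "common/compression.h"

static void
parse_legacy_compress_option(const char *value, pg_compress_specification *spec)
{
	if (strspn(value, "0123456789") == strlen(value))
	{
		/* A bare integer: 0 means no compression, anything else means gzip. */
		pg_compress_algorithm alg =
			(atoi(value) > 0) ? PG_COMPRESSION_GZIP : PG_COMPRESSION_NONE;

		/* The bare integer doubles as the compression level. */
		parse_compress_specification(alg, (char *) value, spec);
	}
	else
	{
		/*
		 * Otherwise the value is a full specification such as "gzip" or
		 * "gzip:9"; splitting it into algorithm and detail before calling
		 * parse_compress_specification() is the part that both programs
		 * duplicate today and that could be consolidated.
		 */
	}
}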

By the way, I can see that GZCLOSE(), etc. are still defined in
compress_io.h but they are not used.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Jacob Champion
Date:
This entry has been waiting on author input for a while (our current
threshold is roughly two weeks), so I've marked it Returned with
Feedback.

Once you think the patchset is ready for review again, you (or any
interested party) can resurrect the patch entry by visiting

    https://commitfest.postgresql.org/38/3571/

and changing the status to "Needs Review", and then changing the
status again to "Move to next CF". (Don't forget the second step;
hopefully we will have streamlined this in the near future!)

Thanks,
--Jacob



Re: Add LZ4 compression in pg_dump

From
Georgios Kokolatos
Date:
Thank you for your work during commitfest.

The patch is still in development. Given vacation status, expect the next patches to be ready for the November
commitfest.
For now it has moved to the September one. Further action will be taken then as needed.

Enjoy the rest of the summer!

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
Checking if you'll be able to submit new patches soon ?



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:



On Wed, Nov 2, 2022 at 14:28, Justin Pryzby <pryzby@telsasoft.com> wrote:
Checking if you'll be able to submit new patches soon ?

Thank you for checking up. Expect new versions within this commitfest cycle. 

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Fri, Aug 05, 2022 at 02:23:45PM +0000, Georgios Kokolatos wrote:
> Thank you for your work during commitfest.
> 
> The patch is still in development. Given vacation status, expect the next patches to be ready for the November
commitfest.
> For now it has moved to the September one. Further action will be taken then as needed.

On Sun, Nov 06, 2022 at 02:53:12PM +0000, gkokolatos@pm.me wrote:
> On Wed, Nov 2, 2022 at 14:28, Justin Pryzby <pryzby@telsasoft.com> wrote:
> > Checking if you'll be able to submit new patches soon ?
> 
> Thank you for checking up. Expect new versions within this commitfest cycle.

Hi,

I think this patch record should be closed for now.  You can re-open the
existing patch record once a patch is ready to be reviewed.

The commitfest is a time for committing/reviewing patches that were
previously submitted, but there's no new patch since July.  Making a
patch available for review at the start of the commitfest seems like a
requirement for current patch records (same as for new patch records).

I wrote essentially the same patch as your early patches 2 years ago
(before postgres was ready to consider new compression algorithms), so
I'm happy to review a new patch when it's available, regardless of its
status in the cfapp.

BTW, some of my own review comments from March weren't addressed.
Please check.  Also, in February, I asked if you knew how to use
cirrusci to run checks, but the patches still had
compilation errors and warnings on various OSes.

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/39/3571

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Sun, Nov 20, 2022 at 11:26:11AM -0600, Justin Pryzby wrote:
> I think this patch record should be closed for now.  You can re-open the
> existing patch record once a patch is ready to be reviewed.

Indeed.  As things are, this is just a dead entry in the CF, which
would be confusing.  I have marked it as RwF.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Monday, November 21st, 2022 at 12:13 AM, Michael Paquier <michael@paquier.xyz> wrote:


>
>
> On Sun, Nov 20, 2022 at 11:26:11AM -0600, Justin Pryzby wrote:
>
> > I think this patch record should be closed for now. You can re-open the
> > existing patch record once a patch is ready to be reviewed.
>
>
> Indeed. As things are, this is just a dead entry in the CF, which
> would be confusing. I have marked it as RwF.

Thank you for closing it.

For the record, I am currently working on it; I am simply unsure whether I should
submit WIP patches and add noise to the list, or wait until it is in a state where
I feel that the comments have been addressed.

A new version that I feel that is in a decent enough state for review should
be ready within this week. I am happy to drop the patch if you think I should
not work on it though.

Cheers,
//Georgios

> --
> Michael



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Tue, Nov 22, 2022 at 10:00:47AM +0000, gkokolatos@pm.me wrote:
> A new version that I feel that is in a decent enough state for review should
> be ready within this week. I am happy to drop the patch if you think I should
> not work on it though.

If you can post a new version of the patch, that's fine, of course.
I'll be happy to look over it more.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Tue, Nov 22, 2022 at 10:00:47AM +0000, gkokolatos@pm.me wrote:
> For the record I am currently working on it simply unsure if I should submit
> WIP patches and add noise to the list or wait until it is in a state that I
> feel that the comments have been addressed.
> 
> A new version that I feel that is in a decent enough state for review should
> be ready within this week. I am happy to drop the patch if you think I should
> not work on it though.

I hope you'll want to continue work on it.  The patch record is like a
request for review, so it's closed if there's nothing ready to review.

I think you should re-send patches (and update the CF app) as often as
they're ready for more review.  Your 001 commit (which is almost the
same as what I wrote 2 years ago) still needs to account for some review
comments, and the whole patch set ought to pass cirrusci tests.  At that
point, you'll be ready for another round of review, even if there are
known TODO/FIXME items in later patches.

BTW I saw that you updated your branch on github.  You'll need to make
the corresponding changes to ./meson.build that you made to ./Makefile.
https://wiki.postgresql.org/wiki/Meson_for_patch_authors
https://wiki.postgresql.org/wiki/Meson

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Tuesday, November 22nd, 2022 at 11:49 AM, Michael Paquier <michael@paquier.xyz> wrote:


>
>
> On Tue, Nov 22, 2022 at 10:00:47AM +0000, gkokolatos@pm.me wrote:
>
> > A new version that I feel that is in a decent enough state for review should
> > be ready within this week. I am happy to drop the patch if you think I should
> > not work on it though.
>
>
> If you can post a new version of the patch, that's fine, of course.
> I'll be happy to look over it more.

Thank you Michael (and Justin). Allow me to present v8.

The focus of this version of this series is 0001 and 0002.

Admittedly, 0001 could be presented in a separate thread, though given its size and
proximity to the topic, I present it here.

In an earlier review you spotted the similarity between pg_dump's and pg_receivewal's
parsing of compression options. However there exists a substantial difference in the
behaviour of the two programs; one treats the lack of support for the requested
algorithm as a fatal error, whereas the other does not. The existing functions in
common/compression.c do not account for the latter. 0002 proposes an implementation
for this. Its usefulness is shown in 0003.

Please consider 0003-0005 as work in progress. They are the differences from v7, yet they
may contain unaddressed comments for now.

Feedback on splitting and/or reordering 0003-0005 would be welcome. I think
that they now split into coherent units and are presented in a logical order. Let me
know if you disagree and where the breakpoints should be.

Cheers,
//Georgios

> --
> Michael
Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Mon, Nov 28, 2022 at 04:32:43PM +0000, gkokolatos@pm.me wrote:
> The focus of this version of this series is 0001 and 0002.
>
> Admittedly 0001 could be presented in a separate thread though given its size and
> proximity to the topic, I present it here.

I don't mind.  This was a hole in meson.build, so nice catch!  I have
noticed a second defect with pg_verifybackup for all the commands, and
applied both at the same time.

> In an earlier review you spotted the similarity between pg_dump's and pg_receivewal's
> parsing of compression options. However there exists a substantial difference in the
> behaviour of the two programs; one treats the lack of support for the requested
> algorithm as a fatal error, whereas the other does not. The existing functions in
> common/compression.c do not account for the latter. 0002 proposes an implementation
> for this. Its usefulness is shown in 0003.

In what does it matter?  The logic in compression.c provides an error
when looking at a spec or validating it, but the caller is free to
consume it as it wants because this is shared between the frontend and
the backend, and that includes consuming it as a warning rather than a
hard failure.  If we don't want to issue an error and force
non-compression if attempting to use a compression method not
supported in pg_dump, that's fine by me as a historical behavior, but
I don't see why these routines have any need to be split more as
proposed in 0002.

Saying that, I do agree that it would be nice to remove the
duplication between the option parsing of pg_basebackup and
pg_receivewal.  Your patch is very close to that, actually, and it
occurred to me that if we move the check on "server-" and "client-" in
pg_basebackup to be just before the integer-only check then we can
consolidate the whole thing.

Attached is an alternative that does not sacrifice the pluggability of
the existing routines while allowing 0003~ to still use them (I don't
really want to move around the checks on the supported build options
now in parse_compress_specification(), that was hard enough to settle
on this location).  On top of that, pg_basebackup is able to cope with
the case of --compress=0 already, enforcing "none" (BaseBackup could
be simplified a bit more before StartLogStreamer).  This refactoring
shaves a little bit of code.

> Please consider 0003-0005 as work in progress. They are differences from v7 yet they
> may contain unaddressed comments for now.

Okay.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Tue, Nov 29, 2022 at 03:19:17PM +0900, Michael Paquier wrote:
> Attached is an alternative that does not sacrifice the pluggability of
> the existing routines while allowing 0003~ to still use them (I don't
> really want to move around the checks on the supported build options
> now in parse_compress_specification(), that was hard enough to settle
> on this location).  On top of that, pg_basebackup is able to cope with
> the case of --compress=0 already, enforcing "none" (BaseBackup could
> be simplified a bit more before StartLogStreamer).  This refactoring
> shaves a little bit of code.

One thing that I forgot to mention is that this refactoring would
treat things like server-N, client-N as valid grammars (in this case N
takes precedence over an optional detail string), implying that N = 0
is "none" and N > 0 is gzip, so that makes for an extra grammar flavor
without impacting the existing ones.  I am not sure that it is worth
documenting, still worth mentioning.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Tuesday, November 29th, 2022 at 7:19 AM, Michael Paquier <michael@paquier.xyz> wrote:


>
>
> On Mon, Nov 28, 2022 at 04:32:43PM +0000, gkokolatos@pm.me wrote:
>
> > The focus of this version of this series is 0001 and 0002.
> >
> > Admittedly 0001 could be presented in a separate thread though given its size and
> > proximity to the topic, I present it here.
>
>
> I don't mind. This was a hole in meson.build, so nice catch! I have
> noticed a second defect with pg_verifybackup for all the commands, and
> applied both at the same time.

Thank you.

>
> > In an earlier review you spotted the similarity between pg_dump's and pg_receivewal's
> > parsing of compression options. However there exists a substantial difference in the
> > behaviour of the two programs; one treats the lack of support for the requested
> > algorithm as a fatal error, whereas the other does not. The existing functions in
> > common/compression.c do not account for the latter. 0002 proposes an implementation
> > for this. Its usefulness is shown in 0003.
>
>
> In what does it matter? The logic in compression.c provides an error
> when looking at a spec or validating it, but the caller is free to
> consume it as it wants because this is shared between the frontend and
> the backend, and that includes consuming it as a warning rather than a
> ahrd failure. If we don't want to issue an error and force
> non-compression if attempting to use a compression method not
> supported in pg_dump, that's fine by me as a historical behavior, but
> I don't see why these routines have any need to be split more as
> proposed in 0002.

I understand. The reason for the change in the routines was that it was
impossible to distinguish a genuine parse error from a missing library in
parse_compress_specification(). If the zlib library is missing, then both
'--compress=gzip:garbage' and '--compress=gzip:7' would populate the
parse_error member of the struct and subsequent calls to
validate_compress_specification() would error out, although only one of
the two options is truly an error. Historically the code would fail on
invalid input regardless of whether the library was present or not.

> Saying that, I do agree that it would be nice to remove the
> duplication between the option parsing of pg_basebackup and
> pg_receivewal. Your patch is very close to that, actually, and it
> occured to me that if we move the check on "server-" and "client-" in
> pg_basebackup to be just before the integer-only check then we can
> consolidate the whole thing.

Great. I did notice the possible benefit but chose not to stray too far
from what was necessary in my patch.

> Attached is an alternative that does not sacrifice the pluggability of
> the existing routines while allowing 0003~ to still use them (I don't
> really want to move around the checks on the supported build options
> now in parse_compress_specification(), that was hard enough to settle
> on this location).

Yeah, I thought that it would be a hard sell, hence an "earlier"
version.

The attached version 10 contains your proposed v9 verbatim as 0001.
Then 0002 switches the parsing order a bit in pg_dump so that it will not
fail as described above on missing libraries: it will first parse
the algorithm, discard it when unsupported, and only parse the rest of
the option if the algorithm is supported. Granted, it is a bit 'uglier'
with the preprocessing blocks, yet it maintains most of the historic
behaviour without altering the common compression interfaces. As
shown in 001_basic.pl, an invalid detail will now fail only if the algorithm
is supported.
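Roughly, the order of operations described above could be sketched like this (the splitter helper is hypothetical, the rest uses the common/compression.h routines already mentioned in the thread; this shows the shape of the control flow, not the patch itself):

	pg_compress_algorithm alg;
	pg_compress_specification compression_spec = {0};
	char	   *algorithm_str;
	char	   *detail_str;
	char	   *error;

	/* hypothetical splitter: "gzip:7" becomes "gzip" plus "7" */
	split_compress_option(optarg, &algorithm_str, &detail_str);

	if (!parse_compress_algorithm(algorithm_str, &alg))
		pg_fatal("unrecognized compression algorithm \"%s\"", algorithm_str);

#ifndef HAVE_LIBZ
	if (alg == PG_COMPRESSION_GZIP)
	{
		/* discard the unsupported request, keeping the historic behaviour */
		pg_log_warning("requested compression not available in this installation -- archive will be uncompressed");
		alg = PG_COMPRESSION_NONE;
		detail_str = NULL;
	}
#endif

	/* the detail is only examined when the algorithm is supported */
	parse_compress_specification(alg, detail_str, &compression_spec);
	error = validate_compress_specification(&compression_spec);
	if (error)
		pg_fatal("invalid compression specification: %s", error);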

> On top of that, pg_basebackup is able to cope with
> the case of --compress=0 already, enforcing "none" (BaseBackup could
> be simplified a bit more before StartLogStreamer). This refactoring
> shaves a little bit of code.
>
> > Please consider 0003-0005 as work in progress. They are differences from v7 yet they
> > may contain unaddressed comments for now.
>
>
> Okay.

Thank you. Please advise if it is preferable to split 0002 into two parts.
I think not, but I will happily do so if you think otherwise.

Cheers,
//Georgios

> --
> Michael
Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Tue, Nov 29, 2022 at 12:10:46PM +0000, gkokolatos@pm.me wrote:
> Thank you. Please advise if it is preferable to split 0002 into two parts.
> I think not, but I will happily do so if you think otherwise.

This one makes me curious.  What kind of split are you talking about?
If it makes the code review and the git history cleaner and easier, I
am usually a lot in favor of such incremental changes.  As far as I
can see, there is the switch from the compression integer to
compression specification as one thing.  The second thing is the
refactoring of cfclose() and these routines, paving the way for 0003.
Hmm, it may be cleaner to move the switch to the compression spec in
one patch, and move the logic around cfclose() to its own, paving the
way to 0003.

By the way, I think that this 0002 should drop all the default clauses
in the switches for the compression method so as we'd catch any
missing code paths with compiler warnings if a new compression method
is added in the future.
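For example, a sketch of what such a switch looks like without a default clause (the case bodies are placeholders, and the members assume the current pg_compress_algorithm enum):

	switch (compression_spec.algorithm)
	{
		case PG_COMPRESSION_NONE:
			/* plain, uncompressed path */
			break;
		case PG_COMPRESSION_GZIP:
			/* zlib path */
			break;
		case PG_COMPRESSION_LZ4:
		case PG_COMPRESSION_ZSTD:
			pg_fatal("invalid compression method");
			break;
	}

With no default clause, -Wswitch makes the compiler flag any place that forgets to handle a newly added pg_compress_algorithm value.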

Anyway, I have applied 0001, adding you as a primary author because
you did most of it with only tweaks from me for pg_basebackup.  The
docs of pg_basebackup have been amended to mention the slight change
in grammar, affecting the case where we do not have a detail string.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Wednesday, November 30th, 2022 at 1:50 AM, Michael Paquier <michael@paquier.xyz> wrote:


>
>
> On Tue, Nov 29, 2022 at 12:10:46PM +0000, gkokolatos@pm.me wrote:
>
> > Thank you. Please advise if it is preferable to split 0002 into two parts.
> > I think not, but I will happily do so if you think otherwise.
>
>
> This one makes me curious. What kind of split are you talking about?
> If it makes the code review and the git history cleaner and easier, I
> am usually a lot in favor of such incremental changes. As far as I
> can see, there is the switch from the compression integer to
> compression specification as one thing. The second thing is the
> refactoring of cfclose() and these routines, paving the way for 0003.
> Hmm, it may be cleaner to move the switch to the compression spec in
> one patch, and move the logic around cfclose() to its own, paving the
> way to 0003.

Fair enough. The attached v11 does that. 0001 introduces compression
specification and is using it throughout. 0002 paves the way to the
new interface by homogenizing the use of cfp. 0003 introduces the new
API and stores the compression algorithm in the custom format header
instead of the compression level integer. Finally 0004 adds support for
LZ4.

Besides the version bump in 0003, which could possibly be split out
as an independent and earlier step, I think that the patchset consists
of coherent units.

> By the way, I think that this 0002 should drop all the default clauses
> in the switches for the compression method so as we'd catch any
> missing code paths with compiler warnings if a new compression method
> is added in the future.

Sure.

> Anyway, I have applied 0001, adding you as a primary author because
> you did most of it with only tweaks from me for pg_basebackup. The
> docs of pg_basebackup have been amended to mention the slight change
> in grammar, affecting the case where we do not have a detail string.

Very kind of you, thank you.

Cheers,
//Georgios

> --
> Michael
Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Wed, Nov 30, 2022 at 05:11:44PM +0000, gkokolatos@pm.me wrote:
> Fair enough. The attached v11 does that. 0001 introduces compression
> specification and is using it throughout. 0002 paves the way to the
> new interface by homogenizing the use of cfp. 0003 introduces the new
> API and stores the compression algorithm in the custom format header
> instead of the compression level integer. Finally 0004 adds support for
> LZ4.

I have been looking at 0001, and..  Hmm.  I am really wondering
whether it would not be better to just nuke this warning into orbit.
This stuff enforces non-compression even if -Z has been used to a
non-default value.  This has been moved to its current location by
cae2bb1 as of this thread:
https://www.postgresql.org/message-id/20160526.185551.242041780.horiguchi.kyotaro%40lab.ntt.co.jp

However, this is only active if -Z is used when not building with
zlib.  At the end, it comes down to whether we want to prioritize the
portability of pg_dump commands specifying a -Z/--compress across
environments knowing that these may or may not be built with zlib,
vs the amount of simplification/uniformity we would get across the
binaries in the tree once we switch everything to use the compression
specifications.  Now that pg_basebackup and pg_receivewal are managed
by compression specifications, and that we'd want more compression
options for pg_dump, I would tend to do the latter and from now on
complain if attempting to do a pg_dump -Z under --without-zlib with a
compression level > 0.  zlib is also widely available, and we don't
document the fact that non-compression is enforced in this case,
either.  (Two TAP tests with the custom format had to be tweaked.)

As per the patch, it is true that we do not need to bump the format of
the dump archives, as we can still store only the compression level
and guess the method from it.  I have added some notes about that in
ReadHead and WriteHead to not forget.
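A sketch of what that guess amounts to when reading the header (illustrative only, not the committed code; ReadInt() is the existing archiver helper):

	/* The archive only stores a level; infer the method from it. */
	compression = ReadInt(AH);
	AH->compression_spec.level = compression;
	AH->compression_spec.algorithm =
		(compression != 0) ? PG_COMPRESSION_GZIP : PG_COMPRESSION_NONE;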

Most of the changes are really straightforward, and it has resisted
my tests, so I think that this is in rather committable shape as-is.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Thursday, December 1st, 2022 at 3:05 AM, Michael Paquier <michael@paquier.xyz> wrote:


>
>
> On Wed, Nov 30, 2022 at 05:11:44PM +0000, gkokolatos@pm.me wrote:
>
> > Fair enough. The attached v11 does that. 0001 introduces compression
> > specification and is using it throughout. 0002 paves the way to the
> > new interface by homogenizing the use of cfp. 0003 introduces the new
> > API and stores the compression algorithm in the custom format header
> > instead of the compression level integer. Finally 0004 adds support for
> > LZ4.
>
>
> I have been looking at 0001, and.. Hmm. I am really wondering
> whether it would not be better to just nuke this warning into orbit.
> This stuff enforces non-compression even if -Z has been used to a
> non-default value. This has been moved to its current location by
> cae2bb1 as of this thread:
> https://www.postgresql.org/message-id/20160526.185551.242041780.horiguchi.kyotaro%40lab.ntt.co.jp
>
> However, this is only active if -Z is used when not building with
> zlib. At the end, it comes down to whether we want to prioritize the
> portability of pg_dump commands specifying a -Z/--compress across
> environments knowing that these may or may not be built with zlib,
> vs the amount of simplification/uniformity we would get across the
> binaries in the tree once we switch everything to use the compression
> specifications. Now that pg_basebackup and pg_receivewal are managed
> by compression specifications, and that we'd want more compression
> options for pg_dump, I would tend to do the latter and from now on
> complain if attempting to do a pg_dump -Z under --without-zlib with a
> compression level > 0. zlib is also widely available, and we don't
> document the fact that non-compression is enforced in this case,
> either. (Two TAP tests with the custom format had to be tweaked.)

Fair enough. Thank you for looking. However I have a small comment
on your new patch.

-       /* Custom and directory formats are compressed by default, others not */
-       if (compressLevel == -1)
-       {
-#ifdef HAVE_LIBZ
-               if (archiveFormat == archCustom || archiveFormat == archDirectory)
-                       compressLevel = Z_DEFAULT_COMPRESSION;
-               else
-#endif
-                       compressLevel = 0;
-       }


Nuking the warning from orbit and changing the behaviour around disabling
the requested compression when the libraries are not present should not
mean that we need to change the behaviour of default values for different
formats. Please find v13 attached, which reinstates it.

This in itself got me looking and wondering why the tests succeeded.
The only existing test covering that path is `defaults_dir_format` in
`002_pg_dump.pl`. However, as the test is currently written, it does not
check whether the output was compressed. The restore command would succeed
in either case. A simple `gzip -t -r` against the directory will not
suffice to test it, because there exist files which are never compressed
in this format (.toc). A slightly more involved test case would need
to be written, yet before I embark on this journey, I would like to know
whether you would agree to reinstate the defaults for those formats.

>
> As per the patch, it is true that we do not need to bump the format of
> the dump archives, as we can still store only the compression level
> and guess the method from it. I have added some notes about that in
> ReadHead and WriteHead to not forget.

Agreed. A minor suggestion if you may.

 #ifndef HAVE_LIBZ
-       if (AH->compression != 0)
+       if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
                pg_log_warning("archive is compressed, but this installation does not support compression -- no data
willbe available"); 
 #endif

It would seem more consistent to error out in this case. We do error
in all other cases where the compression is not available.

>
> Most of the changes are really-straight forward, and it has resisted
> my tests, so I think that this is in a rather-commitable shape as-is.

Thank you.

Cheers,
//Georgios

> --
> Michael
Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Thu, Dec 01, 2022 at 02:58:35PM +0000, gkokolatos@pm.me wrote:
> Nuking the warning from orbit and changing the behaviour around disabling
> the requested compression when the libraries are not present, should not
> mean that we need to change the behaviour of default values for different
> formats. Please find v13 attached which reinstates it.

Gah, thanks!  And this default behavior is documented as dependent on
the compilation as well.

> This in itself got me looking and wondering why the tests succeeded.
> The only existing test covering that path is `defaults_dir_format` in
> `002_pg_dump.pl`. However as the test is currently written it does not
> check whether the output was compressed. The restore command would succeed
> in either case. A simple `gzip -t -r` against the directory will not
> suffice to test it, because there exist files which are never compressed
> in this format (.toc). A little bit more involved test case would need
> to be written, yet before I embark to this journey, I would like to know
> if you would agree to reinstate the defaults for those formats.

Off the top of my head, I briefly recall that -r is not that portable.  And
the toc format makes the generated files non-deterministic, as these
use OIDs.

[.. thinks ..]

We are going to need a new thing here, as compress_cmd cannot be
directly used.  What if we used only an array of glob()-able elements?
Let's say "expected_contents" that could include a "dir_path/*.gz"
conditional on $supports_gzip?  glob() can only be calculated when the
test is run as the file names cannot be known beforehand :/

>> As per the patch, it is true that we do not need to bump the format of
>> the dump archives, as we can still store only the compression level
>> and guess the method from it. I have added some notes about that in
>> ReadHead and WriteHead to not forget.
>
> Agreed. A minor suggestion if you may.
>
>  #ifndef HAVE_LIBZ
> -       if (AH->compression != 0)
> +       if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
>                 pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available");
>  #endif
>
> It would seem more consistent to error out in this case. We do error
> in all other cases where the compression is not available.

Makes sense.

I have gone through the patch again, and applied it.  Thanks!
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:
------- Original Message -------
On Friday, December 2nd, 2022 at 2:56 AM, Michael Paquier <michael@paquier.xyz> wrote:

> Off the top of my head, I briefly recall that -r is not that portable. And
> the toc format makes the generated files non-deterministic, as these
> use OIDs.
>
> [.. thinks ..]
>
> We are going to need a new thing here, as compress_cmd cannot be
> directly used. What if we used only an array of glob()-able elements?
> Let's say "expected_contents" that could include a "dir_path/*.gz"
> conditional on $supports_gzip? glob() can only be calculated when the
> test is run as the file names cannot be known beforehand :/

You are very correct. However one can glob after the fact. Please find
0001 of the attached v14 which attempts to implement it.

> I have gone through the patch again, and applied it. Thanks!

Thank you. Please find the rest of the patchset series rebased on top
of it. I dare say that 0002 is in a state worthy of your consideration.

Cheers,
//Georgios

> --
> Michael
Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Fri, Dec 02, 2022 at 04:15:10PM +0000, gkokolatos@pm.me wrote:
> You are very correct. However one can glob after the fact. Please find
> 0001 of the attached v14 which attempts to implement it.

+       if ($pgdump_runs{$run}->{glob_pattern})
+       {
+               my $glob_pattern = $pgdump_runs{$run}->{glob_pattern};
+               my @glob_output = glob($glob_pattern);
+               is(scalar(@glob_output) > 0, 1, "glob pattern matched")
+       }

While this is correct in checking that the contents are compressed
under --with-zlib, this also removes the coverage where we make sure
that this command is able to complete under --without-zlib without
compressing any of the table data files.  Hence my point from
upthread: this test had better not use compile_option, but change
glob_pattern depending on if the build uses zlib or not.

In order to check this behavior with defaults_custom_format, perhaps
we could just remove the -Z6 from it or add an extra command for its
default behavior?
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Sat, Dec 03, 2022 at 11:45:30AM +0900, Michael Paquier wrote:
> While this is correct in checking that the contents are compressed
> under --with-zlib, this also removes the coverage where we make sure
> that this command is able to complete under --without-zlib without
> compressing any of the table data files.  Hence my point from
> upthread: this test had better not use compile_option, but change
> glob_pattern depending on if the build uses zlib or not.

In short, I mean something like the attached.  I have named the flag
content_patterns, and switched it to an array so that we can check both that
toc.dat is always uncompressed and whether the other data files are
compressed.

> In order to check this behavior with defaults_custom_format, perhaps
> we could just remove the -Z6 from it or add an extra command for its
> default behavior?

This is slightly more complicated as there is just one file generated
for the compression and non-compression cases, so I have left that as
it is for now.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Monday, December 5th, 2022 at 8:05 AM, Michael Paquier <michael@paquier.xyz> wrote:


>
>
> On Sat, Dec 03, 2022 at 11:45:30AM +0900, Michael Paquier wrote:
>
> > While this is correct in checking that the contents are compressed
> > under --with-zlib, this also removes the coverage where we make sure
> > that this command is able to complete under --without-zlib without
> > compressing any of the table data files. Hence my point from
> > upthread: this test had better not use compile_option, but change
> > glob_pattern depending on if the build uses zlib or not.
>
> In short, I mean something like the attached. I have named the flag
> content_patterns, and switched it to an array so that we can check both that
> toc.dat is always uncompressed and whether the other data files are
> compressed.

I see. This approach is much better than my proposal, thanks. If you
allow me, I find 'content_patterns' to be slightly ambiguous. While it is
true that it refers to the contents of a directory, it is not the
contents of the dump that it is examining. I took the liberty of proposing
an alternative name in the attached v16.

I also took the liberty of applying the test pattern when the dump
is explicitly compressed.

> > In order to check this behavior with defaults_custom_format, perhaps
> > we could just remove the -Z6 from it or add an extra command for its
> > default behavior?
>
> This is slightly more complicated as there is just one file generated
> for the compression and non-compression cases, so I have left that as
> it is for now.

I was thinking a bit more about this. I think that we can use the list
TOC option of pg_restore. This option will first print out the header
info, which contains the compression. The Perl utils already support
parsing the generated output of a command. Please find an attempt to do
so in the attached. The benefits of having some testing for this case
become a bit more obvious in 0004 of the patchset, when lz4 is
introduced.

Cheers,
//Georgios

> --
> Michael
Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Mon, Dec 05, 2022 at 12:48:28PM +0000, gkokolatos@pm.me wrote:
> I also took the liberty of applying the test pattern when the dump
> is explicitly compressed.

Sticking with glob_patterns is fine by me.

> I was thinking a bit more about this. I think that we can use the list
> TOC option of pg_restore. This option will first print out the header
> info which contains the compression. Perl utils already support to
> parse the generated output of a command. Please find an attempt to do
> so in the attached. The benefits of having some testing for this case
> become a bit more obvious in 0004 of the patchset, when lz4 is
> introduced.

This is where the fun is.  What you are doing here is more complete,
and we would make sure that the custom and data directory would always
see their contents compressed by default.  And it would have caught
the bug you mentioned upthread for the custom format.

I have kept things as you proposed at the end, added a few comments,
documented the new command_like and an extra command_like for
defaults_dir_format.  Glad to see this addressed, thanks!
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Tuesday, December 6th, 2022 at 1:22 AM, Michael Paquier <michael@paquier.xyz> wrote:


>
>
> On Mon, Dec 05, 2022 at 12:48:28PM +0000, gkokolatos@pm.me wrote:
>
> This is where the fun is. What you are doing here is more complete,
> and we would make sure that the custom and data directory would always
> see their contents compressed by default. And it would have caught
> the bug you mentioned upthread for the custom format.

Thank you very much Michael.

> I have kept things as you proposed at the end, added a few comments,
> documented the new command_like and an extra command_like for
> defaults_dir_format. Glad to see this addressed, thanks!

Please find attached v17, which builds on top of what is already
committed. I dare to think 0001 as ready to be reviewed. 0002 is
also complete albeit with some documentation gaps.

Cheers,
//Georgios

> --
> Michael
Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
001: still refers to "gzip", which is correct for -Fp and -Fd but not
for -Fc, for which it's more correct to say "zlib".  That affects the
name of the function, structures, comments, etc.  I'm not sure if it's
an issue to re-use the basebackup compression routines here.  Maybe we
should accept "-Zn" for zlib output (-Fc), but reject "gzip:9", which
I'm sure some will find confusing, as it does not actually output gzip-format data.  Maybe 001
should be split into a patch to re-use the existing "cfp" interface
(which is a clear win), and 002 to re-use the basebackup interfaces for
user input and constants, etc.

001 still doesn't compile on freebsd, and 002 doesn't compile on
windows.  Have you checked test results from cirrusci on your private
github account ?

002 says:
+       save_errno = errno;

+       errno = save_errno;

I suppose that's intended to wrap the preceding library call.

002 breaks "pg_dump -Fc -Z2" because (I think) AllocateCompressor()
doesn't store the passed-in compression_spec.

003 still uses <lz4.h> and not "lz4.h".

Earlier this year I also suggested including a 999 patch to change to
use LZ4 as the default compression, to exercise the new code under CI.
I suggest to re-open the cf patch entry after that passes tests on all
platforms and when it's ready for more review.

BTW, some of these review comments are the same as what I sent earlier
this year.

https://www.postgresql.org/message-id/20220326162156.GI28503%40telsasoft.com
https://www.postgresql.org/message-id/20220705151328.GQ13040%40telsasoft.com

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Sat, Dec 17, 2022 at 05:26:15PM -0600, Justin Pryzby wrote:
> 001: still refers to "gzip", which is correct for -Fp and -Fd but not
> for -Fc, for which it's more correct to say "zlib".

Or should we begin by changing all these existing "not built with zlib
support" error strings to the more generic "this build does not
support compression with %s" to reduce the number of messages to
translate?  That would bring consistency with the other tools dealing
with compression.

> That affects the
> name of the function, structures, comments, etc.  I'm not sure if it's
> an issue to re-use the basebackup compression routines here.  Maybe we
> should accept "-Zn" for zlib output (-Fc), but reject "gzip:9", which
> I'm sure some will find confusing, as it does not output.  Maybe 001
> should be split into a patch to re-use the existing "cfp" interface
> (which is a clear win), and 002 to re-use the basebackup interfaces for
> user input and constants, etc.
>
> 001 still doesn't compile on freebsd, and 002 doesn't compile on
> windows.  Have you checked test results from cirrusci on your private
> github account ?

FYI, I have re-added an entry to the CF app to get some automated
coverage:
https://commitfest.postgresql.org/41/3571/

On MinGW, a complain about the open() callback, which I guess ought to
be avoided with a rename:
[00:16:37.254] compress_gzip.c:356:38: error: macro "open" passed 4 arguments, but takes just 3
[00:16:37.254]   356 |  ret = CFH->open(fname, -1, mode, CFH);
[00:16:37.254]       |                                      ^
[00:16:37.254] In file included from ../../../src/include/c.h:1309,
[00:16:37.254]                  from ../../../src/include/postgres_fe.h:25,
[00:16:37.254]                  from compress_gzip.c:15:

On MSVC, some declaration conflicts, for a similar issue:
[00:12:31.966] ../src/bin/pg_dump/compress_io.c(193): error C2371: '_read': redefinition; different basic types
[00:12:31.966] C:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\ucrt\corecrt_io.h(252): note: see
declarationof '_read' 
[00:12:31.966] ../src/bin/pg_dump/compress_io.c(210): error C2371: '_write': redefinition; different basic types
[00:12:31.966] C:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\ucrt\corecrt_io.h(294): note: see
declarationof '_write' 
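
Both complaints boil down to names (open, _read, _write) colliding with
macros and declarations pulled in through c.h and the system headers on
the Windows builds.  For illustration, a sketch of the kind of rename
that avoids it for the open callback (the struct name and layout here
are only stand-ins, not the patch's actual definition):

typedef struct CompressFileHandle CompressFileHandle;

struct CompressFileHandle
{
    /*
     * Formerly "open"; a macro named open is in scope via c.h on the
     * Windows builds, so a function-pointer member of that name gets
     * expanded as a macro call.
     */
    int         (*open_func) (const char *path, int fd, const char *mode,
                              CompressFileHandle *CFH);
    int         (*close_func) (CompressFileHandle *CFH);
    void       *private_data;
};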

> 002 breaks "pg_dump -Fc -Z2" because (I think) AllocateCompressor()
> doesn't store the passed-in compression_spec.

Hmm.  This looks like a gap in the existing tests that we'd better fix
first.  This CI is green on Linux.

> 003 still uses <lz4.h> and not "lz4.h".

This should be <lz4.h>, not "lz4.h".
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Monday, December 19th, 2022 at 5:06 AM, Michael Paquier <michael@paquier.xyz> wrote:


>
>
> On Sat, Dec 17, 2022 at 05:26:15PM -0600, Justin Pryzby wrote:
>

Thank you for the comments, please find v18 attached.

> > 001: still refers to "gzip", which is correct for -Fp and -Fd but not
> > for -Fc, for which it's more correct to say "zlib".
>
>
> Or should we begin by changing all these existing "not built with zlib
> support" error strings to the more generic "this build does not
> support compression with %s" to reduce the number of messages to
> translate? That would bring consistency with the other tools dealing
> with compression.

This has been the approach from 0002 onwards. In the attached it is also
applied to the remaining location in 0001.

>
> > That affects the
> > name of the function, structures, comments, etc. I'm not sure if it's
> > an issue to re-use the basebackup compression routines here. Maybe we
> > should accept "-Zn" for zlib output (-Fc), but reject "gzip:9", which
> > I'm sure some will find confusing, as it does not output. Maybe 001
> > should be split into a patch to re-use the existing "cfp" interface
> > (which is a clear win), and 002 to re-use the basebackup interfaces for
> > user input and constants, etc.
> >
> > 001 still doesn't compile on freebsd, and 002 doesn't compile on
> > windows. Have you checked test results from cirrusci on your private
> > github account ?

There are still known gaps in 0002 and 0003, for example documentation,
and I have not been focusing too much on those. You are right, it is helpful
and kind to try to reduce the noise. The attached should have hopefully
tackled the ci errors.

>
> FYI, I have re-added an entry to the CF app to get some automated
> coverage:
> https://commitfest.postgresql.org/41/3571/

Much obliged. Should I change the state to "ready for review" when I post a
new version, or should I leave that to the senior personnel?

>
> On MinGW, a complain about the open() callback, which I guess ought to
> be avoided with a rename:
> [00:16:37.254] compress_gzip.c:356:38: error: macro "open" passed 4 arguments, but takes just 3
> [00:16:37.254] 356 | ret = CFH->open(fname, -1, mode, CFH);
>
> [00:16:37.254] | ^
> [00:16:37.254] In file included from ../../../src/include/c.h:1309,
> [00:16:37.254] from ../../../src/include/postgres_fe.h:25,
> [00:16:37.254] from compress_gzip.c:15:
>
> On MSVC, some declaration conflicts, for a similar issue:
> [00:12:31.966] ../src/bin/pg_dump/compress_io.c(193): error C2371: '_read': redefinition; different basic types
> [00:12:31.966] C:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\ucrt\corecrt_io.h(252): note: see
declarationof '_read' 
> [00:12:31.966] ../src/bin/pg_dump/compress_io.c(210): error C2371: '_write': redefinition; different basic types
> [00:12:31.966] C:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\ucrt\corecrt_io.h(294): note: see
declarationof '_write' 
>

A rename was enough.

> > 002 breaks "pg_dump -Fc -Z2" because (I think) AllocateCompressor()
> > doesn't store the passed-in compression_spec.
>

I am afraid I have not been able to reproduce this error. I tried both
debian and freebsd after I addressed the compilation warnings. Which
error did you get? Is it still present in the attached?

> Hmm. This looks like a gap in the existing tests that we'd better fix
> first. This CI is green on Linux.

As the code stands, the compression level is not stored in the custom
format's header as it is no longer relevant information. We can decide
to make it relevant for the tests only at the expense of increasing the
dump size by four bytes. In either case this is not applicable to
current HEAD and can wait for 0002's turn.

Cheers,
//Georgios

>
> > 003 still uses <lz4.h> and not "lz4.h".
>
>
> This should be <lz4.h>, not "lz4.h".
>
> --
> Michael
Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Mon, Dec 19, 2022 at 05:03:21PM +0000, gkokolatos@pm.me wrote:
> > > 001 still doesn't compile on freebsd, and 002 doesn't compile on
> > > windows. Have you checked test results from cirrusci on your private
> > > github account ?
> 
> There are still known gaps in 0002 and 0003, for example documentation,
> and I have not been focusing too much on those. You are right, it is helpful
> and kind to try to reduce the noise. The attached should have hopefully
> tackled the ci errors.

Yep.  Are you using cirrusci under your github account ?

> > FYI, I have re-added an entry to the CF app to get some automated
> > coverage:
> > https://commitfest.postgresql.org/41/3571/
> 
> Much obliged. Should I change the state to "ready for review" when post a
> new version or should I leave that to the senior personnel?   

It's better to update it to reflect what you think its current status
is.  If you think it's ready for review.

> > > 002 breaks "pg_dump -Fc -Z2" because (I think) AllocateCompressor()
> > > doesn't store the passed-in compression_spec.
> 
> I am afraid I have not been able to reproduce this error. I tried both
> debian and freebsd after I addressed the compilation warnings. Which
> error did you get? Is it still present in the attached?

It's not that there's an error - it's that compression isn't working.

$ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z1 -Fp regression |wc -c
659956
$ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z2 -Fp regression |wc -c
637192

$ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z1 -Fc regression |wc -c
1954890
$ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z2 -Fc regression |wc -c
1954890

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Mon, Dec 19, 2022 at 01:06:00PM +0900, Michael Paquier wrote:
> On Sat, Dec 17, 2022 at 05:26:15PM -0600, Justin Pryzby wrote:
> > 001: still refers to "gzip", which is correct for -Fp and -Fd but not
> > for -Fc, for which it's more correct to say "zlib".
> 
> Or should we begin by changing all these existing "not built with zlib 
> support" error strings to the more generic "this build does not
> support compression with %s" to reduce the number of messages to
> translate?  That would bring consistency with the other tools dealing
> with compression.

That's fine, but it doesn't touch on the issue I'm talking about, which
is that zlib != gzip.

BTW I noticed that that also affects the pg_dump file itself; 002
changes the file format to say "gzip", but that's wrong for -Fc, which
does not use gzip headers, which could be surprising to someone who
specified "gzip".

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Monday, December 19th, 2022 at 6:27 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:


>
>
> On Mon, Dec 19, 2022 at 05:03:21PM +0000, gkokolatos@pm.me wrote:
>
> > > > 001 still doesn't compile on freebsd, and 002 doesn't compile on
> > > > windows. Have you checked test results from cirrusci on your private
> > > > github account ?
> >
> > There are still known gaps in 0002 and 0003, for example documentation,
> > and I have not been focusing too much on those. You are right, it is helpful
> > and kind to try to reduce the noise. The attached should have hopefully
> > tackled the ci errors.
>
>
> Yep. Are you using cirrusci under your github account ?

Thank you. To be very honest, I am not using github exclusively to post patches.
Sometimes I do, sometimes I do not. Is github a requirement?

To answer your question, some of my github accounts are integrated with cirrusci,
others are not.

The current cfbot build is green, for what it's worth.
https://cirrus-ci.com/build/5934319840002048

>
> > > FYI, I have re-added an entry to the CF app to get some automated
> > > coverage:
> > > https://commitfest.postgresql.org/41/3571/
> >
> > Much obliged. Should I change the state to "ready for review" when post a
> > new version or should I leave that to the senior personnel?
>
>
> It's better to update it to reflect what you think its current status
> is. If you think it's ready for review.

Thank you.

>
> > > > 002 breaks "pg_dump -Fc -Z2" because (I think) AllocateCompressor()
> > > > doesn't store the passed-in compression_spec.
> >
> > I am afraid I have not been able to reproduce this error. I tried both
> > debian and freebsd after I addressed the compilation warnings. Which
> > error did you get? Is it still present in the attached?
>
>
> It's not that there's an error - it's that compression isn't working.
>
> $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z1 -Fp regression |wc -c
> 659956
> $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z2 -Fp regression |wc -c
> 637192
>
> $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z1 -Fc regression |wc -c
> 1954890
> $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z2 -Fc regression |wc -c
> 1954890
>

Thank you. Now I understand what you mean. Trying the same on top of v18-0003
on Ubuntu 22.04 yields:

$ for compression in none gzip:1 gzip:6 gzip:9; do \
      pg_dump --format=custom --compress="$compression" -f regression."$compression".dump -d regression; \
      wc -c regression."$compression".dump; \
  done;
14963753 regression.none.dump
3600183 regression.gzip:1.dump
3223755 regression.gzip:6.dump
3196903 regression.gzip:9.dump

and on FreeBSD 13.1

$ for compression in none gzip:1 gzip:6 gzip:9; do \
      pg_dump --format=custom --compress="$compression" -f regression."$compression".dump -d regression; \
      wc -c regression."$compression".dump; \
  done;
14828822 regression.none.dump
3584304 regression.gzip:1.dump
3208548 regression.gzip:6.dump
3182044 regression.gzip:9.dump

Although there are some variations between the installations, within the same
installation the size of the dump file is shrinking as expected.

Investigating the issue a bit further, you are correct in identifying an
issue in v17. Up until v16, the compressor function looked like:

+InitCompressorGzip(CompressorState *cs, int compressionLevel)
+{
+       GzipCompressorState *gzipcs;
+
+       cs->readData = ReadDataFromArchiveGzip;
+       cs->writeData = WriteDataToArchiveGzip;
+       cs->end = EndCompressorGzip;
+
+       gzipcs = (GzipCompressorState *) pg_malloc0(sizeof(GzipCompressorState));
+       gzipcs->compressionLevel = compressionLevel;

V17 considered that more options could become available in the future
and changed the signature of the relevant Init functions to:

+InitCompressorGzip(CompressorState *cs, const pg_compress_specification compression_spec)
+{
+       GzipCompressorState *gzipcs;
+
+       cs->readData = ReadDataFromArchiveGzip;
+       cs->writeData = WriteDataToArchiveGzip;
+       cs->end = EndCompressorGzip;
+
+       gzipcs = (GzipCompressorState *) pg_malloc0(sizeof(GzipCompressorState));
+

V18 reinstated the assignment in a similar fashion to InitCompressorNone and
InitCompressorLz4:

+void
+InitCompressorGzip(CompressorState *cs, const pg_compress_specification compression_spec)
+{
+       GzipCompressorState *gzipcs;
+
+       cs->readData = ReadDataFromArchiveGzip;
+       cs->writeData = WriteDataToArchiveGzip;
+       cs->end = EndCompressorGzip;
+
+       cs->compression_spec = compression_spec;
+
+       gzipcs = (GzipCompressorState *) pg_malloc0(sizeof(GzipCompressorState));

A test case can be added which performs a check similar to the loop above.
Create a custom dump with the least and most compression for each method.
Then verify that the output sizes differ as expected. This addition could
become 0001 in the current series.

Thoughts?

Cheers,
//Georgios

> --
> Justin



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Tue, Dec 20, 2022 at 11:19:15AM +0000, gkokolatos@pm.me wrote:
> ------- Original Message -------
> On Monday, December 19th, 2022 at 6:27 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Mon, Dec 19, 2022 at 05:03:21PM +0000, gkokolatos@pm.me wrote:
> > 
> > > > > 001 still doesn't compile on freebsd, and 002 doesn't compile on
> > > > > windows. Have you checked test results from cirrusci on your private
> > > > > github account ?
> > > 
> > > There are still known gaps in 0002 and 0003, for example documentation,
> > > and I have not been focusing too much on those. You are right, it is helpful
> > > and kind to try to reduce the noise. The attached should have hopefully
> > > tackled the ci errors.
> > 
> > 
> > Yep. Are you using cirrusci under your github account ?
> 
> Thank you. To be very honest, I am not using github exclusively to post patches.
> Sometimes I do, sometimes I do not. Is github a requirement?

Github isn't a requirement for postgres (but cirrusci only supports
github).  I wasn't trying to say that it's required, only trying to
make sure that you (and others) know that it's available, since our
cirrus.yml is relatively new.

> > > > > 002 breaks "pg_dump -Fc -Z2" because (I think) AllocateCompressor()
> > > > > doesn't store the passed-in compression_spec.
> > > 
> > > I am afraid I have not been able to reproduce this error. I tried both
> > > debian and freebsd after I addressed the compilation warnings. Which
> > > error did you get? Is it still present in the attached?
> > 
> > 
> > It's not that there's an error - it's that compression isn't working.
> > 
> > $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z1 -Fp regression |wc -c
> > 659956
> > $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z2 -Fp regression |wc -c
> > 637192
> > 
> > $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z1 -Fc regression |wc -c
> > 1954890
> > $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z2 -Fc regression |wc -c
> > 1954890
> > 
> 
> Thank you. Now I understand what you mean. Trying the same on top of v18-0003
> on Ubuntu 22.04 yields:

You're right; this seems to be fixed in v18.  Thanks.

It looks like I'd forgotten to run "meson test tmp_install", so had
retested v17...

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Tuesday, December 20th, 2022 at 4:26 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:


>
>
> On Tue, Dec 20, 2022 at 11:19:15AM +0000, gkokolatos@pm.me wrote:
>
> > ------- Original Message -------
> > On Monday, December 19th, 2022 at 6:27 PM, Justin Pryzby pryzby@telsasoft.com wrote:
> >
> > > On Mon, Dec 19, 2022 at 05:03:21PM +0000, gkokolatos@pm.me wrote:
> > >
> > > > > > 001 still doesn't compile on freebsd, and 002 doesn't compile on
> > > > > > windows. Have you checked test results from cirrusci on your private
> > > > > > github account ?
> > > >
> > > > There are still known gaps in 0002 and 0003, for example documentation,
> > > > and I have not been focusing too much on those. You are right, it is helpful
> > > > and kind to try to reduce the noise. The attached should have hopefully
> > > > tackled the ci errors.
> > >
> > > Yep. Are you using cirrusci under your github account ?
> >
> > Thank you. To be very honest, I am not using github exclusively to post patches.
> > Sometimes I do, sometimes I do not. Is github a requirement?
>
>
> Github isn't a requirement for postgres (but cirrusci only supports
> github). I wasn't not trying to say that it's required, only trying to
> make sure that you (and others) know that it's available, since our
> cirrus.yml is relatively new.

Got it. Thank you very much for spreading the word. It is a useful feature which
should be known.

>
> > > > > > 002 breaks "pg_dump -Fc -Z2" because (I think) AllocateCompressor()
> > > > > > doesn't store the passed-in compression_spec.
> > > >
> > > > I am afraid I have not been able to reproduce this error. I tried both
> > > > debian and freebsd after I addressed the compilation warnings. Which
> > > > error did you get? Is it still present in the attached?
> > >
> > > It's not that there's an error - it's that compression isn't working.
> > >
> > > $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z1 -Fp regression |wc -c
> > > 659956
> > > $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z2 -Fp regression |wc -c
> > > 637192
> > >
> > > $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z1 -Fc regression |wc -c
> > > 1954890
> > > $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z2 -Fc regression |wc -c
> > > 1954890
> >
> > Thank you. Now I understand what you mean. Trying the same on top of v18-0003
> > on Ubuntu 22.04 yields:
>
>
> You're right; this seems to be fixed in v18. Thanks.

Great. Still there was a bug in v17 which you discovered. Thank you for the review
effort.

Please find in the attached v19 an extra check right before calling deflateInit().
This check will verify that only compressed output will be generated for this
method.

Also v19 is rebased on top f450695e889 and applies cleanly.

Cheers.
//Georgios

> --
> Justin
Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
There's a couple of lz4 bits which shouldn't be present in 002: file
extension and comments.



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Thu, Dec 22, 2022 at 11:08:59AM -0600, Justin Pryzby wrote:
> There's a couple of lz4 bits which shouldn't be present in 002: file
> extension and comments.

There were "LZ4" comments and file extension stuff in the preparatory
commit.  But now it seems like you *removed* them in the LZ4 commit
(where it actually belongs) rather than *moving* it from the
prior/parent commit *to* the lz4 commit.  I recommend to run something
like "git diff @{1}" whenever doing this kind of patch surgery.

+   if (AH->compression_spec.algorithm != PG_COMPRESSION_NONE &&
+       AH->compression_spec.algorithm == PG_COMPRESSION_GZIP &&

This looks wrong/redundant.  The gzip part should be removed, right?

Maybe other places that check if (compression==PG_COMPRESSION_GZIP)
should change to check compression!=NONE instead?

_PrepParallelRestore() references ".gz", so I think it needs to be
retrofitted to handle .lz4.  Ideally, that's built into a struct or list
of file extensions to try.  Maybe compression.h should have a function
to return the file extension of a given algorithm.  I'm planning to send
a patch for zstd, and hoping its changes will be minimized by these
preparatory commits.
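
For what it's worth, a self-contained sketch of what such a helper could
look like (the function name is made up, and the enum is a local stand-in
for the one in common/compression.h):

#include <stdio.h>

/* local stand-in for pg_compress_algorithm, for illustration only */
typedef enum pg_compress_algorithm
{
    PG_COMPRESSION_NONE,
    PG_COMPRESSION_GZIP,
    PG_COMPRESSION_LZ4
} pg_compress_algorithm;

/* hypothetical helper: map an algorithm to the file suffix to try */
static const char *
compress_algorithm_extension(pg_compress_algorithm algorithm)
{
    switch (algorithm)
    {
        case PG_COMPRESSION_GZIP:
            return ".gz";
        case PG_COMPRESSION_LZ4:
            return ".lz4";
        default:
            return "";
    }
}

int
main(void)
{
    printf("lz4 suffix: %s\n",
           compress_algorithm_extension(PG_COMPRESSION_LZ4));
    return 0;
}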

+ errno = errno ? : ENOSPC;

"?:" is a GNU extension (not the ternary operator, but the ternary
operator with only 2 args).  It's not in use anywhere else in postgres.
You could instead write it with 3 "errno"s or as "if (errno == 0)
errno = ENOSPC".

You wrote "eol_flag == false" and "eol_flag == 0" and true.  But it's
cleaner to test it as a boolean: if (eol_flag) / if (!eol_flag).

Both LZ4File_init() and its callers check "inited".  Better to do it in
one place than 3.  It's a static function, so I think there's no
performance concern.
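
To make that concrete, a minimal sketch of doing the check once, inside
the init function itself (the LZ4File fields below are illustrative, not
the patch's actual layout):

#include <stdbool.h>
#include <stdlib.h>

/* illustrative stand-in for the LZ4File state kept by the patch */
typedef struct LZ4File
{
    bool        inited;
    size_t      buflen;
    char       *buffer;
} LZ4File;

/* callers may call this unconditionally; it is a no-op when already set up */
static int
LZ4File_init(LZ4File *fs, size_t size)
{
    if (fs->inited)
        return 0;

    fs->buflen = size;
    fs->buffer = malloc(size);
    if (fs->buffer == NULL)
        return 1;

    fs->inited = true;
    return 0;
}

int
main(void)
{
    LZ4File     fs = {0};

    return LZ4File_init(&fs, 4096);
}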

Gzip_close() still has a useless save_errno (or rebase issue?).

I think it's confusing to have two functions, one named
InitCompressLZ4() and InitCompressorLZ4().

pg_compress_specification is being passed by value, but I think it
should be passed as a pointer, as is done everywhere else.

pg_compress_algorithm is being written directly into the pg_dump header.
Currently, I think that's not an externally-visible value (it could be
renumbered, theoretically even in a minor release).  Maybe there should
be a "private" enum for encoding the pg_dump header, similar to
WAL_COMPRESSION_LZ4 vs BKPIMAGE_COMPRESS_LZ4 ?  Or else a comment there
should warn that the values are encoded in pg_dump, and must never be
changed.
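
As a sketch of the "private enum" variant (the names and values below are
made up; the point is only that the on-disk byte would be pinned
independently of pg_compress_algorithm's ordering):

/*
 * Hypothetical on-disk encoding for the pg_dump header.  These values
 * become part of the archive format, so they must never be renumbered;
 * new methods may only be appended.
 */
typedef enum DumpCompressionMethod
{
    DUMPCOMPRESS_NONE = 0,
    DUMPCOMPRESS_GZIP = 1,
    DUMPCOMPRESS_LZ4 = 2
} DumpCompressionMethod;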

+ Verify that data files where compressed
typo: s/where/were/

Also:
s/occurance/occurrence/
s/begining/beginning/
s/Verfiy/Verify/
s/nessary/necessary/

BTW I noticed that cfdopen() was accidentally committed to compress_io.h
in master without being defined anywhere.

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
vignesh C
Date:
On Wed, 21 Dec 2022 at 15:40, <gkokolatos@pm.me> wrote:
>
>
>
>
>
>
> ------- Original Message -------
> On Tuesday, December 20th, 2022 at 4:26 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
>
>
> >
> >
> > On Tue, Dec 20, 2022 at 11:19:15AM +0000, gkokolatos@pm.me wrote:
> >
> > > ------- Original Message -------
> > > On Monday, December 19th, 2022 at 6:27 PM, Justin Pryzby pryzby@telsasoft.com wrote:
> > >
> > > > On Mon, Dec 19, 2022 at 05:03:21PM +0000, gkokolatos@pm.me wrote:
> > > >
> > > > > > > 001 still doesn't compile on freebsd, and 002 doesn't compile on
> > > > > > > windows. Have you checked test results from cirrusci on your private
> > > > > > > github account ?
> > > > >
> > > > > There are still known gaps in 0002 and 0003, for example documentation,
> > > > > and I have not been focusing too much on those. You are right, it is helpful
> > > > > and kind to try to reduce the noise. The attached should have hopefully
> > > > > tackled the ci errors.
> > > >
> > > > Yep. Are you using cirrusci under your github account ?
> > >
> > > Thank you. To be very honest, I am not using github exclusively to post patches.
> > > Sometimes I do, sometimes I do not. Is github a requirement?
> >
> >
> > Github isn't a requirement for postgres (but cirrusci only supports
> > github). I wasn't not trying to say that it's required, only trying to
> > make sure that you (and others) know that it's available, since our
> > cirrus.yml is relatively new.
>
> Got it. Thank you very much for spreading the word. It is a useful feature which
> should be known.
>
> >
> > > > > > > 002 breaks "pg_dump -Fc -Z2" because (I think) AllocateCompressor()
> > > > > > > doesn't store the passed-in compression_spec.
> > > > >
> > > > > I am afraid I have not been able to reproduce this error. I tried both
> > > > > debian and freebsd after I addressed the compilation warnings. Which
> > > > > error did you get? Is it still present in the attached?
> > > >
> > > > It's not that there's an error - it's that compression isn't working.
> > > >
> > > > $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z1 -Fp regression |wc -c
> > > > 659956
> > > > $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z2 -Fp regression |wc -c
> > > > 637192
> > > >
> > > > $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z1 -Fc regression |wc -c
> > > > 1954890
> > > > $ ./tmp_install/usr/local/pgsql/bin/pg_dump -h /tmp -Z2 -Fc regression |wc -c
> > > > 1954890
> > >
> > > Thank you. Now I understand what you mean. Trying the same on top of v18-0003
> > > on Ubuntu 22.04 yields:
> >
> >
> > You're right; this seems to be fixed in v18. Thanks.
>
> Great. Still there was a bug in v17 which you discovered. Thank you for the review
> effort.
>
> Please find in the attached v19 an extra check right before calling deflateInit().
> This check will verify that only compressed output will be generated for this
> method.
>
> Also v19 is rebased on top f450695e889 and applies cleanly.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
ff23b592ad6621563d3128b26860bcb41daf9542 ===
=== applying patch ./v19-0002-Introduce-Compressor-API-in-pg_dump.patch
patching file src/bin/pg_dump/compress_io.h
Hunk #1 FAILED at 37.
1 out of 1 hunk FAILED -- saving rejects to file
src/bin/pg_dump/compress_io.h.rej

[1] - http://cfbot.cputube.org/patch_41_3571.log

Regards,
Vignesh



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Sun, Jan 08, 2023 at 01:45:25PM -0600, Justin Pryzby wrote:
> On Thu, Dec 22, 2022 at 11:08:59AM -0600, Justin Pryzby wrote:
> > There's a couple of lz4 bits which shouldn't be present in 002: file
> > extension and comments.

> BTW I noticed that cfdopen() was accidentally committed to compress_io.h
> in master without being defined anywhere.

This was resolved in 69fb29d1a (so now needs to be re-added for this
patch series).

> pg_compress_specification is being passed by value, but I think it
> should be passed as a pointer, as is done everywhere else.

ISTM that was an issue with 5e73a6048, affecting a few public and
private functions.  I wrote a pre-preparatory patch which changes to
pass by reference.

And addressed a handful of other issues I reported as separate fixup
commits.  And changed to use LZ4 by default for CI.

I also rebased my 2 year old patch to support zstd in pg_dump.  I hope
it can finally be added for v16.  I'll send it for the next CF if these
patches progress.

One more thing: some comments still refer to the cfopen API, which this
patch removes.

> There were "LZ4" comments and file extension stuff in the preparatory
> commit.  But now it seems like you *removed* them in the LZ4 commit
> (where it actually belongs) rather than *moving* it from the
> prior/parent commit *to* the lz4 commit.  I recommend to run something
> like "git diff @{1}" whenever doing this kind of patch surgery.

TODO

> Maybe other places that check if (compression==PG_COMPRESSION_GZIP)
> should maybe change to say compression!=NONE?
> 
> _PrepParallelRestore() references ".gz", so I think it needs to be
> retrofitted to handle .lz4.  Ideally, that's built into a struct or list
> of file extensions to try.  Maybe compression.h should have a function
> to return the file extension of a given algorithm.  I'm planning to send
> a patch for zstd, and hoping its changes will be minimized by these
> preparatory commits.

TODO

> I think it's confusing to have two functions, one named
> InitCompressLZ4() and InitCompressorLZ4().

TODO

> pg_compress_algorithm is being writen directly into the pg_dump header.
> Currently, I think that's not an externally-visible value (it could be
> renumbered, theoretically even in a minor release).  Maybe there should
> be a "private" enum for encoding the pg_dump header, similar to
> WAL_COMPRESSION_LZ4 vs BKPIMAGE_COMPRESS_LZ4 ?  Or else a comment there
> should warn that the values are encoded in pg_dump, and must never be
> changed.

Michael, WDYT ?

-- 
Justin

Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Sat, Jan 14, 2023 at 03:43:08PM -0600, Justin Pryzby wrote:
> On Sun, Jan 08, 2023 at 01:45:25PM -0600, Justin Pryzby wrote:
> > pg_compress_specification is being passed by value, but I think it
> > should be passed as a pointer, as is done everywhere else.
> 
> ISTM that was an issue with 5e73a6048, affecting a few public and
> private functions.  I wrote a pre-preparatory patch which changes to
> pass by reference.

I updated 001 to change SetOutput() to pass by reference, too (before,
that ended up in the 002 patch).

I can't see any issue in 002 other than the == GZIP change (the fix for
which I'd previously included in a later patch).

> One more thing: some comments still refer to the cfopen API, which this
> patch removes.
> 
> > There were "LZ4" comments and file extension stuff in the preparatory
> > commit.  But now it seems like you *removed* them in the LZ4 commit
> > (where it actually belongs) rather than *moving* it from the
> > prior/parent commit *to* the lz4 commit.  I recommend to run something
> > like "git diff @{1}" whenever doing this kind of patch surgery.
> 
> TODO

I addressed that in the fixup commits 005 and 007.

-- 
Justin

Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Sat, Jan 14, 2023 at 03:43:09PM -0600, Justin Pryzby wrote:
> On Sun, Jan 08, 2023 at 01:45:25PM -0600, Justin Pryzby wrote:
>> pg_compress_specification is being passed by value, but I think it
>> should be passed as a pointer, as is done everywhere else.
>
> ISTM that was an issue with 5e73a6048, affecting a few public and
> private functions.  I wrote a pre-preparatory patch which changes to
> pass by reference.

The functions changed by 0001 are cfopen[_write](),
AllocateCompressor() and ReadDataFromArchive().  Why is it a good idea
to change these interfaces which basically exist to handle inputs?  Is
there some benefit in changing compression_spec within the internals
of these routines before going back one layer down to their callers?
Changing the compression_spec on-the-fly in these internal paths could
be risky, actually, no?

> And addressed a handful of other issues I reported as separate fixup
> commits.  And changed to use LZ4 by default for CI.

Are your slight changes shaped as of 0003-f.patch, 0005-f.patch and
0007-f.patch on top of the original patches sent by Georgios?

> I also rebased my 2 year old patch to support zstd in pg_dump.  I hope
> it can finally added for v16.  I'll send it for the next CF if these
> patches progress.

Good idea to see if what you have done for zstd fits with what's
presented here.

>> pg_compress_algorithm is being writen directly into the pg_dump header.

Do you mean that this is what happens once the patch series 0001~0008
sent upthread is applied on HEAD?

>> Currently, I think that's not an externally-visible value (it could be
>> renumbered, theoretically even in a minor release).  Maybe there should
>> be a "private" enum for encoding the pg_dump header, similar to
>> WAL_COMPRESSION_LZ4 vs BKPIMAGE_COMPRESS_LZ4 ?  Or else a comment there
>> should warn that the values are encoded in pg_dump, and must never be
>> changed.
>
> Michael, WDYT ?

Changing the order of the members in an enum would cause an ABI
breakage, so that would not happen, and we tend to be very careful
about that.  Appending new members would be fine, though.  FWIW, I'd
rather avoid adding more enums that would just be exact maps to
pg_compress_algorithm.

-   /*
-    * For now the compression type is implied by the level.  This will need
-    * to change once support for more compression algorithms is added,
-    * requiring a format bump.
-    */
-   WriteInt(AH, AH->compression_spec.level);
+   AH->WriteBytePtr(AH, AH->compression_spec.algorithm);

I may be missing something here, but it seems to me that you ought to
store the level in the dump header as well, or it would not be
possible to report in the dump's description what was used?  Hence,
K_VERS_1_15 should imply that we have both the compression method and
the compression level.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Mon, Jan 16, 2023 at 10:28:50AM +0900, Michael Paquier wrote:
> On Sat, Jan 14, 2023 at 03:43:09PM -0600, Justin Pryzby wrote:
> > On Sun, Jan 08, 2023 at 01:45:25PM -0600, Justin Pryzby wrote:
> >> pg_compress_specification is being passed by value, but I think it
> >> should be passed as a pointer, as is done everywhere else.
> > 
> > ISTM that was an issue with 5e73a6048, affecting a few public and
> > private functions.  I wrote a pre-preparatory patch which changes to
> > pass by reference.
> 
> The functions changed by 0001 are cfopen[_write](),
> AllocateCompressor() and ReadDataFromArchive().  Why is it a good idea
> to change these interfaces which basically exist to handle inputs?

I changed to pass pg_compress_specification as a pointer, since that's
the usual convention for structs, as followed by the existing uses of
pg_compress_specification.

> Is there some benefit in changing compression_spec within the
> internals of these routines before going back one layer down to their
> callers?  Changing the compression_spec on-the-fly in these internal
> paths could be risky, actually, no?

I think what you're saying is that if the spec is passed as a pointer,
then the called functions shouldn't set spec->algorithm=something.

I agree that if they need to do that, they should use a local variable.
Which looks to be true for the functions that were changed in 001.
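
One way to make the "treat it as read-only" intent explicit would be a
const-qualified pointer; a quick sketch with stand-in types (neither the
structs nor the signature below match the actual patches):

typedef struct pg_compress_specification
{
    int         algorithm;      /* stand-in; really pg_compress_algorithm */
    int         level;
} pg_compress_specification;

typedef struct CompressorState
{
    pg_compress_specification compression_spec;
} CompressorState;

static void
InitCompressorGzip(CompressorState *cs,
                   const pg_compress_specification *compression_spec)
{
    /* copy the spec; the const qualifier keeps us from mutating the caller's */
    cs->compression_spec = *compression_spec;
}

int
main(void)
{
    CompressorState cs;
    const pg_compress_specification spec = {.algorithm = 1, .level = 6};

    InitCompressorGzip(&cs, &spec);
    return cs.compression_spec.level == 6 ? 0 : 1;
}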

> > And addressed a handful of other issues I reported as separate fixup
> > commits.  And changed to use LZ4 by default for CI.
> 
> Are your slight changes shaped as of 0003-f.patch, 0005-f.patch and
> 0007-f.patch on top of the original patches sent by Georgios?

Yes, the original patches, rebased as needed on top of HEAD and 001...

> >> pg_compress_algorithm is being writen directly into the pg_dump header.
> 
> Do you mean that this is what happens once the patch series 0001~0008
> sent upthread is applied on HEAD?

Yes

> -   /*
> -    * For now the compression type is implied by the level.  This will need
> -    * to change once support for more compression algorithms is added,
> -    * requiring a format bump.
> -    */
> -   WriteInt(AH, AH->compression_spec.level);
> +   AH->WriteBytePtr(AH, AH->compression_spec.algorithm);
> 
> I may be missing something here, but it seems to me that you ought to
> store as well the level in the dump header, or it would not be
> possible to report in the dump's description what was used?  Hence,
> K_VERS_1_15 should imply that we have both the method compression and
> the compression level.

Maybe.  But the "level" isn't needed for decompression for any case I'm
aware of.

Also, dumps with the default compression level currently say:
"Compression: -1", which does't seem valuable.

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:
Oh, I didn’t realize you took over, Justin? Why? After almost a year of work?

This is rather disheartening. 


On Mon, Jan 16, 2023 at 02:56, Justin Pryzby <pryzby@telsasoft.com> wrote:
On Mon, Jan 16, 2023 at 10:28:50AM +0900, Michael Paquier wrote:
> On Sat, Jan 14, 2023 at 03:43:09PM -0600, Justin Pryzby wrote:
> > On Sun, Jan 08, 2023 at 01:45:25PM -0600, Justin Pryzby wrote:
> >> pg_compress_specification is being passed by value, but I think it
> >> should be passed as a pointer, as is done everywhere else.
> >
> > ISTM that was an issue with 5e73a6048, affecting a few public and
> > private functions. I wrote a pre-preparatory patch which changes to
> > pass by reference.
>
> The functions changed by 0001 are cfopen[_write](),
> AllocateCompressor() and ReadDataFromArchive(). Why is it a good idea
> to change these interfaces which basically exist to handle inputs?

I changed to pass pg_compress_specification as a pointer, since that's
the usual convention for structs, as followed by the existing uses of
pg_compress_specification.

> Is there some benefit in changing compression_spec within the
> internals of these routines before going back one layer down to their
> callers? Changing the compression_spec on-the-fly in these internal
> paths could be risky, actually, no?

I think what you're saying is that if the spec is passed as a pointer,
then the called functions shouldn't set spec->algorithm=something.

I agree that if they need to do that, they should use a local variable.
Which looks to be true for the functions that were changed in 001.

> > And addressed a handful of other issues I reported as separate fixup
> > commits. And changed to use LZ4 by default for CI.
>
> Are your slight changes shaped as of 0003-f.patch, 0005-f.patch and
> 0007-f.patch on top of the original patches sent by Georgios?

Yes, the original patches, rebased as needed on top of HEAD and 001...

> >> pg_compress_algorithm is being writen directly into the pg_dump header.
>
> Do you mean that this is what happens once the patch series 0001~0008
> sent upthread is applied on HEAD?

Yes

> - /*
> - * For now the compression type is implied by the level. This will need
> - * to change once support for more compression algorithms is added,
> - * requiring a format bump.
> - */
> - WriteInt(AH, AH->compression_spec.level);
> + AH->WriteBytePtr(AH, AH->compression_spec.algorithm);
>
> I may be missing something here, but it seems to me that you ought to
> store as well the level in the dump header, or it would not be
> possible to report in the dump's description what was used? Hence,
> K_VERS_1_15 should imply that we have both the method compression and
> the compression level.

Maybe. But the "level" isn't needed for decompression for any case I'm
aware of.

Also, dumps with the default compression level currently say:
"Compression: -1", which does't seem valuable.

--
Justin

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Mon, Jan 16, 2023 at 02:27:56AM +0000, gkokolatos@pm.me wrote:
> Oh, I didn’t realize you took over Justin? Why? After almost a year of work?
> 
> This is rather disheartening.

I believe you've misunderstood my intent here.  I sent rebased versions
of your patches with fixup commits implementing fixes that I'd
previously sent.  I don't think that's unusual.  I hope your patches
will be included in v16, and I hope to facilitate that.  I don't mean
any offense.  Actually, the fixups are provided as separate patches so
you can adopt the changes easily into your branch.  

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Sun, Jan 15, 2023 at 07:56:25PM -0600, Justin Pryzby wrote:
> On Mon, Jan 16, 2023 at 10:28:50AM +0900, Michael Paquier wrote:
>> The functions changed by 0001 are cfopen[_write](),
>> AllocateCompressor() and ReadDataFromArchive().  Why is it a good idea
>> to change these interfaces which basically exist to handle inputs?
>
> I changed to pass pg_compress_specification as a pointer, since that's
> the usual convention for structs, as followed by the existing uses of
> pg_compress_specification.

Okay, but what do we gain here?  It seems to me that this introduces
the risk that a careless change in one of the internal routines could
slightly change compress_spec, hence impacting any of their
callers?  Or is that fixing an actual bug (except if I am missing your
point, that does not seem to be the case)?

>> Is there some benefit in changing compression_spec within the
>> internals of these routines before going back one layer down to their
>> callers?  Changing the compression_spec on-the-fly in these internal
>> paths could be risky, actually, no?
>
> I think what you're saying is that if the spec is passed as a pointer,
> then the called functions shouldn't set spec->algorithm=something.

Yes.  HEAD makes sure of that, 0001 would not prevent that.  So I am a
bit confused as to how this is a benefit.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:
Hi,

I admit I am completely at a loss as to what is expected from me anymore.

I had posted v19-0001 for a committer's consideration and v19-000{2,3} for completeness.
Please find a rebased v20 attached.

Also please let me know if I should silently step away from it and let other people lead
it. I would be glad to comply either way.

Cheers,
//Georgios


------- Original Message -------
On Monday, January 16th, 2023 at 3:54 AM, Michael Paquier <michael@paquier.xyz> wrote:


>
>
> On Sun, Jan 15, 2023 at 07:56:25PM -0600, Justin Pryzby wrote:
>
> > On Mon, Jan 16, 2023 at 10:28:50AM +0900, Michael Paquier wrote:
> >
> > > The functions changed by 0001 are cfopen_write,
> > > AllocateCompressor() and ReadDataFromArchive(). Why is it a good idea
> > > to change these interfaces which basically exist to handle inputs?
> >
> > I changed to pass pg_compress_specification as a pointer, since that's
> > the usual convention for structs, as followed by the existing uses of
> > pg_compress_specification.
>
>
> Okay, but what do we gain here? It seems to me that this introduces
> the risk that a careless change in one of the internal routines if
> they change slight;ly compress_spec, hence impacting any of their
> callers? Or is that fixing an actual bug (except if I am missing your
> point, that does not seem to be the case)?
>
> > > Is there some benefit in changing compression_spec within the
> > > internals of these routines before going back one layer down to their
> > > callers? Changing the compression_spec on-the-fly in these internal
> > > paths could be risky, actually, no?
> >
> > I think what you're saying is that if the spec is passed as a pointer,
> > then the called functions shouldn't set spec->algorithm=something.
>
>
> Yes. HEAD makes sure of that, 0001 would not prevent that. So I am a
> bit confused in seeing how this is a benefit.
> --
> Michael
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
Hi,

On 1/16/23 16:14, gkokolatos@pm.me wrote:
> Hi,
> 
> I admit I am completely at lost as to what is expected from me anymore.
> 

:-(

I understand it's frustrating not to know why a patch is not moving
forward. Particularly when it seems fairly straightforward ...

Let me briefly explain my personal (and admittedly very subjective) view
on picking what patches to review/commit. I'm sure other committers have
other criteria, but maybe this will help.

There are always more patches than I can review/commit, so I have to
prioritize, and pick which patches to look at. For me, it's mostly about
cost/benefit of the patch. The cost is e.g. the amount of time I need to
spend to review/commit the stuff, maybe read the thread, etc. The benefit
is mainly the new features/improvements.

It's oversimplified - we could talk about various bits that contribute to
the costs and benefits, but this is what it boils down to.

There's always the aspect of time - patches A and B have roughly the
same benefits, but with A we get it "immediately" while B requires
additional parts that we don't have ready yet (and if they don't make it
we get no benefit), so I'll probably pick A.

Unfortunately, this plays against this patch - I'm certainly in favor of
adding lz4 (and other compression algos) into pg_dump, but if I commit
0001 we get little benefit, and the other parts actually adding lz4/zstd
are treated as "WIP / for completeness" so it's unclear when we'd get to
commit them.

So if I could recommend one thing, it'd be to get at least one of those
WIP patches into a shape that's likely committable right after 0001.

> I had posted v19-0001 for a committer's consideration and v19-000{2,3} for completeness.
> Please find a rebased v20 attached.
> 

I took a quick look at 0001, so a couple comments (sorry if some of this
was already discussed in the thread):

1) I don't think a "refactoring" patch should reference particular
compression algorithms (lz4/zstd), and in particular I don't think we
should have "not yet implemented" messages. We only have a couple other
places doing that, when we didn't have a better choice. But here we can
simply reject the algorithm when parsing the options, we don't need to
do that in a dozen other places.

2) I wouldn't reorder the cases in WriteDataToArchive, i.e. I'd keep
"none" at the end. It might make backpatches harder.

3) While building, I get a bunch of warnings about missing cfdopen()
prototype and pg_backup_archiver.c not knowing about cfdopen() and
adding an implicit prototype (so I doubt it actually works).

4) "cfp" struct no longer wraps gzFile, but the comment was not updated.
FWIW I'm not sure switching to "void *" is an improvement, maybe it'd be
better to have a "union" of correct types?

5) cfopen/cfdopen are missing comments. cfopen_internal has an updated
comment, but that's a static function while cfopen/cfdopen are the
actual API.
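
To make 1) and 4) a bit more concrete, here are two self-contained
sketches (all type and function names below are stand-ins for
illustration, not the actual code in the patches).

For 1), rejecting an unusable algorithm once, at option-parsing time:

#include <stdio.h>
#include <string.h>

/* stand-in check: is this algorithm usable in this build? */
static int
compression_supported(const char *algorithm)
{
    if (strcmp(algorithm, "none") == 0)
        return 1;
#ifdef HAVE_LIBZ
    if (strcmp(algorithm, "gzip") == 0)
        return 1;
#endif
    return 0;
}

int
main(int argc, char **argv)
{
    const char *algorithm = (argc > 1) ? argv[1] : "none";

    /* reject unusable methods once, here, instead of in every code path */
    if (!compression_supported(algorithm))
    {
        fprintf(stderr, "compression with \"%s\" is not supported by this build\n",
                algorithm);
        return 1;
    }
    printf("compressing with: %s\n", algorithm);
    return 0;
}

For 4), a union of the concrete stream types instead of "void *" (zlib's
gzFile is stood in by a typedef so the sketch stays self-contained):

#include <stdio.h>

typedef void *gzFile;           /* stand-in for <zlib.h>'s gzFile */

typedef struct cfp
{
    int         compression_algorithm;     /* which union member is valid */
    union
    {
        FILE       *fp;         /* uncompressed stream */
        gzFile      gzfp;       /* gzip-compressed stream */
    }           handle;
} cfp;

int
main(void)
{
    cfp         f = {0};

    f.handle.fp = stdout;
    fprintf(f.handle.fp, "using the plain FILE * member\n");
    return 0;
}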

> Also please let me know if I should silently step away from it and let other people lead
> it. I would be glad to comply either way.
> 

Please don't. I promise to take a look at this patch again.

Thanks for doing all the work.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Wednesday, January 18th, 2023 at 3:00 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
> Hi,
>
> On 1/16/23 16:14, gkokolatos@pm.me wrote:
>
> > Hi,
> >
> > I admit I am completely at lost as to what is expected from me anymore.
>
<snip>
>
> Unfortunately, this plays against this patch - I'm certainly in favor of
> adding lz4 (and other compression algos) into pg_dump, but if I commit
> 0001 we get little benefit, and the other parts actually adding lz4/zstd
> are treated as "WIP / for completeness" so it's unclear when we'd get to
> commit them.

Thank you for your kindness and for taking the time to explain.

> So if I could recommend one thing, it'd be to get at least one of those
> WIP patches into a shape that's likely committable right after 0001.

This was clearly my fault. I misunderstood a suggestion upthread to focus
on the first patch of the series and ignore documentation and comments on
the rest.

Please find v21 to contain 0002 and 0003 in a state which I no longer consider
WIP but worthy of proper consideration. Some guidance on where it is best to add
documentation in 0002 for the function pointers in CompressFileHandle will
be welcomed.

>
> > I had posted v19-0001 for a committer's consideration and v19-000{2,3} for completeness.
> > Please find a rebased v20 attached.
>
>
> I took a quick look at 0001, so a couple comments (sorry if some of this
> was already discussed in the thread):

Much appreciated!

>
> 1) I don't think a "refactoring" patch should reference particular
> compression algorithms (lz4/zstd), and in particular I don't think we
> should have "not yet implemented" messages. We only have a couple other
> places doing that, when we didn't have a better choice. But here we can
> simply reject the algorithm when parsing the options, we don't need to
> do that in a dozen other places.

I have now removed lz4/zstd from where they were present with the exception
of pg_dump.c which is responsible for parsing.

> 2) I wouldn't reorder the cases in WriteDataToArchive, i.e. I'd keep
> "none" at the end. It might make backpatches harder.

Agreed. However a 'default' is needed in order to avoid compilation warnings.
Also note that 0002 completely does away with cases within WriteDataToArchive.

> 3) While building, I get bunch of warnings about missing cfdopen()
> prototype and pg_backup_archiver.c not knowing about cfdopen() and
> adding an implicit prototype (so I doubt it actually works).

Fixed. cfdopen() got prematurely introduced in 5e73a6048 and then got removed
in 69fb29d1af. v20 failed to properly take 69fb29d1af into consideration. Note
that cfdopen is removed in 0002 which explains why cfbot didn't complain.

> 4) "cfp" struct no longer wraps gzFile, but the comment was not updated.
> FWIW I'm not sure switching to "void *" is an improvement, maybe it'd be
> better to have a "union" of correct types?

Please find an updated comment and a union in place of the void *. Also
note that 0002 completely does away with cfp in favour of a new struct
CompressFileHandle. I maintained the void * there because it is used by
private methods of the compressors. 0003 contains such an example with
LZ4CompressorState.

> 5) cfopen/cfdopen are missing comments. cfopen_internal has an updated
> comment, but that's a static function while cfopen/cfdopen are the
> actual API.

Added comments to cfopen/cfdopen.

>
> > Also please let me know if I should silently step away from it and let other people lead
> > it. I would be glad to comply either way.
>
>
> Please don't. I promise to take a look at this patch again.

Thank you very much.

> Thanks for doing all the work.

Thank you.

Cheers,
//Georgios

> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
On 1/18/23 20:05, gkokolatos@pm.me wrote:
> 
> 
> 
> 
> 
> ------- Original Message -------
> On Wednesday, January 18th, 2023 at 3:00 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> 
> 
>>
>>
>> Hi,
>>
>> On 1/16/23 16:14, gkokolatos@pm.me wrote:
>>
>>> Hi,
>>>
>>> I admit I am completely at lost as to what is expected from me anymore.
>>
> <snip>
>>
>> Unfortunately, this plays against this patch - I'm certainly in favor of
>> adding lz4 (and other compression algos) into pg_dump, but if I commit
>> 0001 we get little benefit, and the other parts actually adding lz4/zstd
>> are treated as "WIP / for completeness" so it's unclear when we'd get to
>> commit them.
> 
> Thank you for your kindness and for taking the time to explain.
>  
>> So if I could recommend one thing, it'd be to get at least one of those
>> WIP patches into a shape that's likely committable right after 0001.
> 
> This was clearly my fault. I misunderstood a suggestion upthread to focus
> on the first patch of the series and ignore documentation and comments on
> the rest.
> 
> Please find v21 to contain 0002 and 0003 in a state which I no longer consider
> as WIP but worthy of proper consideration. Some guidance on where is best to add
> documentation in 0002 for the function pointers in CompressFileHandle will
> be welcomed.
> 

This is internal-only API, not meant for use by regular users and/or
extension authors, so I don't think we need sgml docs. I'd just add
regular code-level documentation to compress_io.h.

For inspiration see docs for "struct ReorderBuffer" in reorderbuffer.h,
or "struct _archiveHandle" in pg_backup_archiver.h.

Or what other kind of documentation you had in mind?

>>
>>> I had posted v19-0001 for a committer's consideration and v19-000{2,3} for completeness.
>>> Please find a rebased v20 attached.
>>
>>
>> I took a quick look at 0001, so a couple comments (sorry if some of this
>> was already discussed in the thread):
> 
> Much appreciated!
> 
>>
>> 1) I don't think a "refactoring" patch should reference particular
>> compression algorithms (lz4/zstd), and in particular I don't think we
>> should have "not yet implemented" messages. We only have a couple other
>> places doing that, when we didn't have a better choice. But here we can
>> simply reject the algorithm when parsing the options, we don't need to
>> do that in a dozen other places.
> 
> I have now removed lz4/zstd from where they were present with the exception
> of pg_dump.c which is responsible for parsing.
> 

I'm not sure I understand why we'd leave the lz4/zstd references in this place?

>> 2) I wouldn't reorder the cases in WriteDataToArchive, i.e. I'd keep
>> "none" at the end. It might make backpatches harder.
> 
> Agreed. However a 'default' is needed in order to avoid compilation warnings.
> Also note that 0002 completely does away with cases within WriteDataToArchive.
> 

OK, although that's also a consequence of using a "switch" instead of
plan "if" branches.

Furthermore, I'm not sure we really need the pg_fatal() about invalid
compression method in these default blocks. I mean, how could we even
get to these places when the build does not support the algorithm? All
of this (ReadDataFromArchive, WriteDataToArchive, EndCompressor, ...)
happens looong after the compressor was initialized and the method
checked, no? So maybe either this should simply do Assert(false) or use
a different error message.
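
To illustrate, a minimal sketch of the Assert(false) variant (with
<assert.h> standing in for our Assert macro, and a trimmed-down enum):

#include <assert.h>
#include <stdbool.h>

typedef enum pg_compress_algorithm
{
    PG_COMPRESSION_NONE,
    PG_COMPRESSION_GZIP
} pg_compress_algorithm;

static void
write_data(pg_compress_algorithm algorithm)
{
    switch (algorithm)
    {
        case PG_COMPRESSION_GZIP:
            /* compressed path */
            break;
        case PG_COMPRESSION_NONE:
            /* uncompressed path */
            break;
        default:
            /* the method was validated when the compressor was set up */
            assert(false);
    }
}

int
main(void)
{
    write_data(PG_COMPRESSION_NONE);
    return 0;
}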

>> 3) While building, I get bunch of warnings about missing cfdopen()
>> prototype and pg_backup_archiver.c not knowing about cfdopen() and
>> adding an implicit prototype (so I doubt it actually works).
> 
> Fixed. cfdopen() got prematurely introduced in 5e73a6048 and then got removed
> in 69fb29d1af. v20 failed to properly take 69fb29d1af in consideration. Note
> that cfdopen is removed in 0002 which explains why cfbot didn't complain.
>  

OK.

>> 4) "cfp" struct no longer wraps gzFile, but the comment was not updated.
>> FWIW I'm not sure switching to "void *" is an improvement, maybe it'd be
>> better to have a "union" of correct types?
> 
> Please find and updated comment and a union in place of the void *. Also
> note that 0002 completely does away with cfp in favour of a new struct
> CompressFileHandle. I maintained the void * there because it is used by
> private methods of the compressors. 0003 contains such an example with
> LZ4CompressorState.
> 

I wonder if this (and also the previous item) means it makes more sense to
keep 0001 and 0002 separate, or to combine them. The "intermediate" state is
a bit annoying.

>> 5) cfopen/cfdopen are missing comments. cfopen_internal has an updated
>> comment, but that's a static function while cfopen/cfdopen are the
>> actual API.
> 
> Added comments to cfopen/cfdopen.
> 

OK.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Thursday, January 19th, 2023 at 4:45 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
> On 1/18/23 20:05, gkokolatos@pm.me wrote:
>
> > ------- Original Message -------
> > On Wednesday, January 18th, 2023 at 3:00 PM, Tomas Vondra tomas.vondra@enterprisedb.com wrote:
> >
> > > Hi,
> > >
> > > On 1/16/23 16:14, gkokolatos@pm.me wrote:
> > >
> > > > Hi,
> > > >
> > > > I admit I am completely at lost as to what is expected from me anymore.
> >
> > <snip>
> >
> > > Unfortunately, this plays against this patch - I'm certainly in favor of
> > > adding lz4 (and other compression algos) into pg_dump, but if I commit
> > > 0001 we get little benefit, and the other parts actually adding lz4/zstd
> > > are treated as "WIP / for completeness" so it's unclear when we'd get to
> > > commit them.
> >
> > Thank you for your kindness and for taking the time to explain.
> >
> > > So if I could recommend one thing, it'd be to get at least one of those
> > > WIP patches into a shape that's likely committable right after 0001.
> >
> > This was clearly my fault. I misunderstood a suggestion upthread to focus
> > on the first patch of the series and ignore documentation and comments on
> > the rest.
> >
> > Please find v21 to contain 0002 and 0003 in a state which I no longer consider
> > as WIP but worthy of proper consideration. Some guidance on where is best to add
> > documentation in 0002 for the function pointers in CompressFileHandle will
> > be welcomed.
>
>
> This is internal-only API, not meant for use by regular users and/or
> extension authors, so I don't think we need sgml docs. I'd just add
> regular code-level documentation to compress_io.h.
>
> For inspiration see docs for "struct ReorderBuffer" in reorderbuffer.h,
> or "struct _archiveHandle" in pg_backup_archiver.h.
>
> Or what other kind of documentation you had in mind?

This is exactly what I was after. I was torn between compress_io.c and compress_io.h.
Thank you.

> > > > I had posted v19-0001 for a committer's consideration and v19-000{2,3} for completeness.
> > > > Please find a rebased v20 attached.
> > >
> > > I took a quick look at 0001, so a couple comments (sorry if some of this
> > > was already discussed in the thread):
> >
> > Much appreciated!
> >
> > > 1) I don't think a "refactoring" patch should reference particular
> > > compression algorithms (lz4/zstd), and in particular I don't think we
> > > should have "not yet implemented" messages. We only have a couple other
> > > places doing that, when we didn't have a better choice. But here we can
> > > simply reject the algorithm when parsing the options, we don't need to
> > > do that in a dozen other places.
> >
> > I have now removed lz4/zstd from where they were present with the exception
> > of pg_dump.c which is responsible for parsing.
>
>
> I'm not sure I understand why leave the lz4/zstd in this place?

You are right, it is not obvious. Those were added in 5e73a60488, which is
already committed in master, and I didn't want to backtrack. Of course, I am
not opposed to doing so if you wish.

>
> > > 2) I wouldn't reorder the cases in WriteDataToArchive, i.e. I'd keep
> > > "none" at the end. It might make backpatches harder.
> >
> > Agreed. However a 'default' is needed in order to avoid compilation warnings.
> > Also note that 0002 completely does away with cases within WriteDataToArchive.
>
>
> OK, although that's also a consequence of using a "switch" instead of
> plan "if" branches.
>
> Furthermore, I'm not sure we really need the pg_fatal() about invalid
> compression method in these default blocks. I mean, how could we even
> get to these places when the build does not support the algorithm? All
> of this (ReadDataFromArchive, WriteDataToArchive, EndCompressor, ...)
> happens looong after the compressor was initialized and the method
> checked, no? So maybe either this should simply do Assert(false) or use
> a different error message.

I like Assert(false).

> > > 3) While building, I get bunch of warnings about missing cfdopen()
> > > prototype and pg_backup_archiver.c not knowing about cfdopen() and
> > > adding an implicit prototype (so I doubt it actually works).
> >
> > Fixed. cfdopen() got prematurely introduced in 5e73a6048 and then got removed
> > in 69fb29d1af. v20 failed to properly take 69fb29d1af in consideration. Note
> > that cfdopen is removed in 0002 which explains why cfbot didn't complain.
>
>
> OK.
>
> > > 4) "cfp" struct no longer wraps gzFile, but the comment was not updated.
> > > FWIW I'm not sure switching to "void *" is an improvement, maybe it'd be
> > > better to have a "union" of correct types?
> >
> > Please find and updated comment and a union in place of the void *. Also
> > note that 0002 completely does away with cfp in favour of a new struct
> > CompressFileHandle. I maintained the void * there because it is used by
> > private methods of the compressors. 0003 contains such an example with
> > LZ4CompressorState.
>
>
> I wonder if this (and also the previous item) makes sense to keep 0001
> and 0002 or to combine them. The "intermediate" state is a bit annoying.

Agreed. It was initially submitted as one patch. Then it was requested to be
split into two parts, one to expand the use of the existing API and one to
replace it with the new interface. Unfortunately, expanding the use of the
existing API requires some tweaking, but that is not a very good excuse for
the current patch set. I should have done a better job there.

Please find v22 attached, which combines 0001 and 0002 back into one. It is
missing the documentation that was discussed above, as I wanted to give quick
feedback. Let me know if you think that the combined version is the one to move
forward with.

Cheers,
//Georgios

>
> > > 5) cfopen/cfdopen are missing comments. cfopen_internal has an updated
> > > comment, but that's a static function while cfopen/cfdopen are the
> > > actual API.
> >
> > Added comments to cfopen/cfdopen.
>
>
> OK.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
Hi,

On 1/19/23 17:42, gkokolatos@pm.me wrote:
> 
> ------- Original Message -------
> On Thursday, January 19th, 2023 at 4:45 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 1/18/23 20:05, gkokolatos@pm.me wrote:
>>
>>> ------- Original Message -------
>>> On Wednesday, January 18th, 2023 at 3:00 PM, Tomas Vondra tomas.vondra@enterprisedb.com wrote:
>>
>> I'm not sure I understand why leave the lz4/zstd in this place?
> 
> You are right, it is not obvious. Those were added in 5e73a60488 which is
> already committed in master and I didn't want to backtrack. Of course, I am
> not opposing in doing so if you wish.
> 

Ah, I didn't realize it was already added by an earlier commit. In that
case let's not worry about it.

>>
>>>> 2) I wouldn't reorder the cases in WriteDataToArchive, i.e. I'd keep
>>>> "none" at the end. It might make backpatches harder.
>>>
>>> Agreed. However a 'default' is needed in order to avoid compilation warnings.
>>> Also note that 0002 completely does away with cases within WriteDataToArchive.
>>
>>
>> OK, although that's also a consequence of using a "switch" instead of
>> plan "if" branches.
>>
>> Furthermore, I'm not sure we really need the pg_fatal() about invalid
>> compression method in these default blocks. I mean, how could we even
>> get to these places when the build does not support the algorithm? All
>> of this (ReadDataFromArchive, WriteDataToArchive, EndCompressor, ...)
>> happens looong after the compressor was initialized and the method
>> checked, no? So maybe either this should simply do Assert(false) or use
>> a different error message.
> 
> I like Assert(false).
> 

OK, good. Do you agree we should never actually get there, if the
earlier checks work correctly?

>>
>>>> 4) "cfp" struct no longer wraps gzFile, but the comment was not updated.
>>>> FWIW I'm not sure switching to "void *" is an improvement, maybe it'd be
>>>> better to have a "union" of correct types?
>>>
>>> Please find and updated comment and a union in place of the void *. Also
>>> note that 0002 completely does away with cfp in favour of a new struct
>>> CompressFileHandle. I maintained the void * there because it is used by
>>> private methods of the compressors. 0003 contains such an example with
>>> LZ4CompressorState.
>>
>>
>> I wonder if this (and also the previous item) makes sense to keep 0001
>> and 0002 or to combine them. The "intermediate" state is a bit annoying.
> 
> Agreed. It was initially submitted as one patch. Then it was requested to be
> split up in two parts, one to expand the use of the existing API and one to
> replace with the new interface. Unfortunately the expansion of usage of the
> existing API requires some tweaking, but that is not a very good reason for
> the current patch set. I should have done a better job there.
> 
> Please find v22 attach which combines back 0001 and 0002. It is missing the
> documentation that was discussed above as I wanted to give a quick feedback.
> Let me know if you think that the combined version is the one to move forward
> with.
> 

Thanks, I'll take a look.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
On 1/19/23 18:55, Tomas Vondra wrote:
> Hi,
> 
> On 1/19/23 17:42, gkokolatos@pm.me wrote:
>>
>> ...
>>
>> Agreed. It was initially submitted as one patch. Then it was requested to be
>> split up in two parts, one to expand the use of the existing API and one to
>> replace with the new interface. Unfortunately the expansion of usage of the
>> existing API requires some tweaking, but that is not a very good reason for
>> the current patch set. I should have done a better job there.
>>
>> Please find v22 attach which combines back 0001 and 0002. It is missing the
>> documentation that was discussed above as I wanted to give a quick feedback.
>> Let me know if you think that the combined version is the one to move forward
>> with.
>>
> 
> Thanks, I'll take a look.
> 

After taking a look and thinking about it a bit more, I think we should
keep the two parts separate. I think Michael (or whoever proposed the
split) was right; it makes the patches easier to grok.

Sorry for the noise, hopefully we can just revert to the last version.

While reading the thread, I also noticed this:

> By the way, I think that this 0002 should drop all the default clauses
> in the switches for the compression method so as we'd catch any
> missing code paths with compiler warnings if a new compression method
> is added in the future.

Now I realize why there were "not yet implemented" errors for lz4/zstd
in all the switches, and why after removing them you had to add a
default branch.

We DON'T want a default branch, because the idea is that after adding a
new compression algorithm, we get warnings about switches not handling
it correctly.
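
A self-contained illustration of that point, with a throwaway enum rather
than anything from the patch:

#include <stdio.h>

typedef enum
{
    DEMO_COMPRESSION_NONE,
    DEMO_COMPRESSION_GZIP,
    DEMO_COMPRESSION_LZ4
    /* adding DEMO_COMPRESSION_ZSTD here ... */
} DemoCompression;

static void
write_data(DemoCompression method)
{
    switch (method)
    {
        case DEMO_COMPRESSION_NONE:
            printf("write the buffer as-is\n");
            break;
        case DEMO_COMPRESSION_GZIP:
            printf("deflate, then write\n");
            break;
        case DEMO_COMPRESSION_LZ4:
            printf("LZ4-compress, then write\n");
            break;
            /*
             * No default: -Wswitch (part of -Wall) then flags every
             * switch that fails to handle the new enum value.
             */
    }
}

int
main(void)
{
    write_data(DEMO_COMPRESSION_GZIP);
    return 0;
}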

So I guess we should walk back this change too :-( It's probably easier
to go back to v20 from January 16, and redo the couple remaining things
I commented on.


FWIW I think this is a hint that adding LZ4/ZSTD options, in 5e73a6048,
but without implementation, was not a great idea. It mostly defeats the
idea of getting the compiler warnings - all the places already handle
PG_COMPRESSION_LZ4/PG_COMPRESSION_ZSTD by throwing a pg_fatal. So you'd
have to grep for the options, inspect all the places or something like
that anyway. The warnings would only work for entirely new methods.

However, I now also realize the compressor API in 0002 replaces all of
this with calls to a generic API callback, so trying to improve this was
pretty silly of me.


Please, fix the couple remaining details in v20, add the docs for the
callbacks, and I'll try to polish it and get it committed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Friday, January 20th, 2023 at 12:34 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
> On 1/19/23 18:55, Tomas Vondra wrote:
>
> > Hi,
> >
> > On 1/19/23 17:42, gkokolatos@pm.me wrote:
> >
> > > ...
> > >
> > > Agreed. It was initially submitted as one patch. Then it was requested to be
> > > split up in two parts, one to expand the use of the existing API and one to
> > > replace with the new interface. Unfortunately the expansion of usage of the
> > > existing API requires some tweaking, but that is not a very good reason for
> > > the current patch set. I should have done a better job there.
> > >
> > > Please find v22 attach which combines back 0001 and 0002. It is missing the
> > > documentation that was discussed above as I wanted to give a quick feedback.
> > > Let me know if you think that the combined version is the one to move forward
> > > with.
> >
> > Thanks, I'll take a look.
>
>
> After taking a look and thinking about it a bit more, I think we should
> keep the two parts separate. I think Michael (or whoever proposed) the
> split was right, it makes the patches easier to grok.
>

Excellent. I will attempt a better split this time round.

>
> While reading the thread, I also noticed this:
>
> > By the way, I think that this 0002 should drop all the default clauses
> > in the switches for the compression method so as we'd catch any
> > missing code paths with compiler warnings if a new compression method
> > is added in the future.
>
>
> Now I realize why there were "not yet implemented" errors for lz4/zstd
> in all the switches, and why after removing them you had to add a
> default branch.
>
> We DON'T want a default branch, because the idea is that after adding a
> new compression algorithm, we get warnings about switches not handling
> it correctly.
>
> So I guess we should walk back this change too :-( It's probably easier
> to go back to v20 from January 16, and redo the couple remaining things
> I commented on.
>

Sure.

>
> FWIW I think this is a hint that adding LZ4/ZSTD options, in 5e73a6048,
> but without implementation, was not a great idea. It mostly defeats the
> idea of getting the compiler warnings - all the places already handle
> PG_COMPRESSION_LZ4/PG_COMPRESSION_ZSTD by throwing a pg_fatal. So you'd
> have to grep for the options, inspect all the places or something like
> that anyway. The warnings would only work for entirely new methods.
>
> However, I now also realize the compressor API in 0002 replaces all of
> this with calls to a generic API callback, so trying to improve this was
> pretty silly from me.

I can try to do a better job at splitting things up.

>
> Please, fix the couple remaining details in v20, add the docs for the
> callbacks, and I'll try to polish it and get it committed.

Excellent. Allow me an attempt to polish it, and expect a new version soon.

Cheers,
//Georgios

>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Friday, January 20th, 2023 at 12:34 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
> On 1/19/23 18:55, Tomas Vondra wrote:
>
> > Hi,
> >
> > On 1/19/23 17:42, gkokolatos@pm.me wrote:
> >
> > > ...
> > >
> > > Agreed. It was initially submitted as one patch. Then it was requested to be
> > > split up in two parts, one to expand the use of the existing API and one to
> > > replace with the new interface. Unfortunately the expansion of usage of the
> > > existing API requires some tweaking, but that is not a very good reason for
> > > the current patch set. I should have done a better job there.
> > >
> > > Please find v22 attach which combines back 0001 and 0002. It is missing the
> > > documentation that was discussed above as I wanted to give a quick feedback.
> > > Let me know if you think that the combined version is the one to move forward
> > > with.
> >
> > Thanks, I'll take a look.
>
>
> After taking a look and thinking about it a bit more, I think we should
> keep the two parts separate. I think Michael (or whoever proposed) the
> split was right, it makes the patches easier to grok.

Please find attached v23 which reintroduces the split.

0001 is reworked to have a smaller footprint than before. Also, in an attempt
to improve readability, 0002 splits the APIs and the uncompressed
implementation into separate files.

>
> While reading the thread, I also noticed this:
>
> > By the way, I think that this 0002 should drop all the default clauses
> > in the switches for the compression method so as we'd catch any
> > missing code paths with compiler warnings if a new compression method
> > is added in the future.
>
>
> Now I realize why there were "not yet implemented" errors for lz4/zstd
> in all the switches, and why after removing them you had to add a
> default branch.
>
> We DON'T want a default branch, because the idea is that after adding a
> new compression algorithm, we get warnings about switches not handling
> it correctly.
>
> So I guess we should walk back this change too :-( It's probably easier
> to go back to v20 from January 16, and redo the couple remaining things
> I commented on.

No problem.

> FWIW I think this is a hint that adding LZ4/ZSTD options, in 5e73a6048,
> but without implementation, was not a great idea. It mostly defeats the
> idea of getting the compiler warnings - all the places already handle
> PG_COMPRESSION_LZ4/PG_COMPRESSION_ZSTD by throwing a pg_fatal. So you'd
> have to grep for the options, inspect all the places or something like
> that anyway. The warnings would only work for entirely new methods.
>
> However, I now also realize the compressor API in 0002 replaces all of
> this with calls to a generic API callback, so trying to improve this was
> pretty silly from me.
>
>
> Please, fix the couple remaining details in v20, add the docs for the
> callbacks, and I'll try to polish it and get it committed.

Thank you very much. Please find an attempt to comply with the requested
changes in the attached.

>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Mon, Jan 23, 2023 at 05:31:55PM +0000, gkokolatos@pm.me wrote:
> Please find attached v23 which reintroduces the split.
> 
> 0001 is reworked to have a reduced footprint than before. Also in an attempt
> to facilitate the readability, 0002 splits the API's and the uncompressed
> implementation in separate files.

Thanks for updating the patch.  Could you address the review comments I
sent here ?
https://www.postgresql.org/message-id/20230108194524.GA27637%40telsasoft.com

Thanks,
-- 
Justin



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Monday, January 23rd, 2023 at 7:00 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:


>
>
> On Mon, Jan 23, 2023 at 05:31:55PM +0000, gkokolatos@pm.me wrote:
>
> > Please find attached v23 which reintroduces the split.
> >
> > 0001 is reworked to have a reduced footprint than before. Also in an attempt
> > to facilitate the readability, 0002 splits the API's and the uncompressed
> > implementation in separate files.
>
>
> Thanks for updating the patch. Could you address the review comments I
> sent here ?
> https://www.postgresql.org/message-id/20230108194524.GA27637%40telsasoft.com

Please find v24 attached.

Cheers,
//Georgios

>
> Thanks,
> --
> Justin
Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Tue, Jan 24, 2023 at 03:56:20PM +0000, gkokolatos@pm.me wrote:
> On Monday, January 23rd, 2023 at 7:00 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Mon, Jan 23, 2023 at 05:31:55PM +0000, gkokolatos@pm.me wrote:
> > 
> > > Please find attached v23 which reintroduces the split.
> > > 
> > > 0001 is reworked to have a reduced footprint than before. Also in an attempt
> > > to facilitate the readability, 0002 splits the API's and the uncompressed
> > > implementation in separate files.
> > 
> > Thanks for updating the patch. Could you address the review comments I
> > sent here ?
> > https://www.postgresql.org/message-id/20230108194524.GA27637%40telsasoft.com
> 
> Please find v24 attached.

Thanks for updating the patch.

In 001, RestoreArchive() does:

> -#ifndef HAVE_LIBZ
> -       if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP &&
> -               AH->PrintTocDataPtr != NULL)
> +       supports_compression = false;
> +       if (AH->compression_spec.algorithm == PG_COMPRESSION_NONE ||
> +               AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> +               supports_compression = true;
> +
> +       if (AH->PrintTocDataPtr != NULL)
>         {
>                 for (te = AH->toc->next; te != AH->toc; te = te->next)
>                 {
>                         if (te->hadDumper && (te->reqs & REQ_DATA) != 0)
> -                               pg_fatal("cannot restore from compressed archive (compression not supported in this installation)");
> +                       {
> +#ifndef HAVE_LIBZ
> +                               if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> +                                       supports_compression = false;
> +#endif
> +                               if (supports_compression == false)
> +                               if (supports_compression == false)
> +                                       pg_fatal("cannot restore from compressed archive (compression not supported in this installation)");
> +                       }
>                 }
>         }
> -#endif

This first checks if the algorithm is implemented, and then checks if
the algorithm is supported by the current build - that confused me for a
bit.  It seems unnecessary to check for unimplemented algorithms before
looping.  That also requires referencing both GZIP and LZ4 in two
places.

I think it could be written so that it doesn't need to change when
compression algorithms are added:

+                       if (te->hadDumper && (te->reqs & REQ_DATA) != 0)
+                       {
+                               /* Check if the compression algorithm is supported */
+                               pg_compress_specification spec;
+                               parse_compress_specification(AH->compression_spec.algorithm, NULL, &spec);
+                               if (spec->parse_error != NULL)
+                                       pg_fatal(spec->parse_error);
+                       }

Or maybe add a new function to compression.c to indicate whether a given
algorithm is supported.

That would also indicate *which* compression library isn't supported.

Other than that, I think 001 is ready.

002/003 use these names, which I think are too similar - initially I
didn't even realize there were two separate functions (each with a
second stub function to handle the case of unsupported compression):

+extern void InitCompressorGzip(CompressorState *cs, const pg_compress_specification compression_spec);
+extern void InitCompressGzip(CompressFileHandle *CFH, const pg_compress_specification compression_spec);

+extern void InitCompressorLZ4(CompressorState *cs, const pg_compress_specification compression_spec);
+extern void InitCompressLZ4(CompressFileHandle *CFH, const pg_compress_specification compression_spec);

typo:
s/not build with/not built with/

Should AllocateCompressor() set cs->compression_spec, rather than doing
it in each compressor ?

Thanks for considering.

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Wednesday, January 25th, 2023 at 2:42 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:


>
>
> On Tue, Jan 24, 2023 at 03:56:20PM +0000, gkokolatos@pm.me wrote:
>
> > On Monday, January 23rd, 2023 at 7:00 PM, Justin Pryzby pryzby@telsasoft.com wrote:
> >
> > > On Mon, Jan 23, 2023 at 05:31:55PM +0000, gkokolatos@pm.me wrote:
> > >
> > > > Please find attached v23 which reintroduces the split.
> > > >
> > > > 0001 is reworked to have a reduced footprint than before. Also in an attempt
> > > > to facilitate the readability, 0002 splits the API's and the uncompressed
> > > > implementation in separate files.
> > >
> > > Thanks for updating the patch. Could you address the review comments I
> > > sent here ?
> > > https://www.postgresql.org/message-id/20230108194524.GA27637%40telsasoft.com
> >
> > Please find v24 attached.
>
>
> Thanks for updating the patch.
>
> In 001, RestoreArchive() does:
>
> > -#ifndef HAVE_LIBZ
> > - if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP &&
> > - AH->PrintTocDataPtr != NULL)
> > + supports_compression = false;
> > + if (AH->compression_spec.algorithm == PG_COMPRESSION_NONE ||
> > + AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> > + supports_compression = true;
> > +
> > + if (AH->PrintTocDataPtr != NULL)
> > {
> > for (te = AH->toc->next; te != AH->toc; te = te->next)
> > {
> > if (te->hadDumper && (te->reqs & REQ_DATA) != 0)
> > - pg_fatal("cannot restore from compressed archive (compression not supported in this installation)");
> > + {
> > +#ifndef HAVE_LIBZ
> > + if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> > + supports_compression = false;
> > +#endif
> > + if (supports_compression == false)
> > + pg_fatal("cannot restore from compressed archive (compression not supported in this installation)");
> > + }
> > }
> > }
> > -#endif
>
>
> This first checks if the algorithm is implemented, and then checks if
> the algorithm is supported by the current build - that confused me for a
> bit. It seems unnecessary to check for unimplemented algorithms before
> looping. That also requires referencing both GZIP and LZ4 in two
> places.

I am not certain that it is unnecessary, at least not in the way that is
described. The idea is that new compression methods can be added without
changing the archive's version number. It is entirely possible that a
restore is requested for an archive compressed with a method not implemented
in the current binary. The first check takes care of that and sets
supports_compression only for the implemented methods. It is possible to
enter the loop with supports_compression already set to false, for example
because the archive was compressed with ZSTD, triggering the fatal error.

Of course, one can throw the error before entering the loop, yet I think
that it does not help the readability of the code. IMHO it is easier to
follow if the error is thrown once during that check.

>
> I think it could be written to avoid the need to change for added
> compression algorithms:
>
> + if (te->hadDumper && (te->reqs & REQ_DATA) != 0)
>
> + {
> + /* Check if the compression algorithm is supported */
> + pg_compress_specification spec;
> + parse_compress_specification(AH->compression_spec.algorithm, NULL, &spec);
>
> + if (spec->parse_error != NULL)
>
> + pg_fatal(spec->parse_error);
>
> + }

I am not certain how that would work in the example with ZSTD above.
If I am not wrong, parse_compress_specification() will not throw an error
if the codebase supports ZSTD, yet this specific pg_dump binary will not
support it because ZSTD is not implemented. parse_compress_specification()
is not aware of that and should not be aware of it, should it?

>
> Or maybe add a new function to compression.c to indicate whether a given
> algorithm is supported.

I am not certain how this would help, as compression.c is supposed to be
used by multiple binaries while this is a pg_dump specific detail.

> That would also indicate which compression library isn't supported.

If anything, I would suggest throwing an error much earlier, i.e. in ReadHead(),
and removing this check altogether. On the other hand, I like the belts
and suspenders approach because there are no more checks after this point.

> Other than that, I think 001 is ready.

Thank you.

> 002/003 use these names, which I think are too similar - initially I
> didn't even realize there were two separate functions (each with a
> second stub function to handle the case of unsupported compression):
>
> +extern void InitCompressorGzip(CompressorState *cs, const pg_compress_specification compression_spec);
> +extern void InitCompressGzip(CompressFileHandle *CFH, const pg_compress_specification compression_spec);
>
> +extern void InitCompressorLZ4(CompressorState *cs, const pg_compress_specification compression_spec);
> +extern void InitCompressLZ4(CompressFileHandle *CFH, const pg_compress_specification compression_spec);

Fair enough. Names are now updated.

>
> typo:
> s/not build with/not built with/

Thank you.

>
> Should AllocateCompressor() set cs->compression_spec, rather than doing
> it in each compressor ?

I think that compression_spec should be owned by each compressor. With that
in mind, it makes more sense to set it within each compressor. This is not
a hill I am willing to die on though.

Please find v25 attached.

>
> Thanks for considering.
>
> --
> Justin
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
On 1/25/23 16:37, gkokolatos@pm.me wrote:
> 
> 
> 
> 
> 
> ------- Original Message -------
> On Wednesday, January 25th, 2023 at 2:42 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> 
> 
>>
>>
>> On Tue, Jan 24, 2023 at 03:56:20PM +0000, gkokolatos@pm.me wrote:
>>
>>> On Monday, January 23rd, 2023 at 7:00 PM, Justin Pryzby pryzby@telsasoft.com wrote:
>>>
>>>> On Mon, Jan 23, 2023 at 05:31:55PM +0000, gkokolatos@pm.me wrote:
>>>>
>>>>> Please find attached v23 which reintroduces the split.
>>>>>
>>>>> 0001 is reworked to have a reduced footprint than before. Also in an attempt
>>>>> to facilitate the readability, 0002 splits the API's and the uncompressed
>>>>> implementation in separate files.
>>>>
>>>> Thanks for updating the patch. Could you address the review comments I
>>>> sent here ?
>>>> https://www.postgresql.org/message-id/20230108194524.GA27637%40telsasoft.com
>>>
>>> Please find v24 attached.
>>
>>
>> Thanks for updating the patch.
>>
>> In 001, RestoreArchive() does:
>>
>>> -#ifndef HAVE_LIBZ
>>> - if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP &&
>>> - AH->PrintTocDataPtr != NULL)
>>> + supports_compression = false;
>>> + if (AH->compression_spec.algorithm == PG_COMPRESSION_NONE ||
>>> + AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
>>> + supports_compression = true;
>>> +
>>> + if (AH->PrintTocDataPtr != NULL)
>>> {
>>> for (te = AH->toc->next; te != AH->toc; te = te->next)
>>> {
>>> if (te->hadDumper && (te->reqs & REQ_DATA) != 0)
>>> - pg_fatal("cannot restore from compressed archive (compression not supported in this installation)");
>>> + {
>>> +#ifndef HAVE_LIBZ
>>> + if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
>>> + supports_compression = false;
>>> +#endif
>>> + if (supports_compression == false)
>>> + pg_fatal("cannot restore from compressed archive (compression not supported in this installation)");
>>> + }
>>> }
>>> }
>>> -#endif
>>
>>
>> This first checks if the algorithm is implemented, and then checks if
>> the algorithm is supported by the current build - that confused me for a
>> bit. It seems unnecessary to check for unimplemented algorithms before
>> looping. That also requires referencing both GZIP and LZ4 in two
>> places.
> 
> I am not certain that it is unnecessary, at least not in the way that is
> described. The idea is that new compression methods can be added, without
> changing the archive's version number. It is very possible that it is
> requested to restore an archive compressed with a method not implemented
> in the current binary. The first check takes care of that and sets
> supports_compression only for the supported versions. It is possible to
> enter the loop with supports_compression already set to false, for example
> because the archive was compressed with ZSTD, triggering the fatal error.
> 
> Of course, one can throw the error before entering the loop, yet I think
> that it does not help the readability of the code. IMHO it is easier to
> follow if the error is thrown once during that check.
> 

Actually, I don't understand why 0001 moves the check into the loop. I
mean, why not check HAVE_LIBZ before the loop?

>>
>> I think it could be written to avoid the need to change for added
>> compression algorithms:
>>
>> + if (te->hadDumper && (te->reqs & REQ_DATA) != 0)
>>
>> + {
>> + /* Check if the compression algorithm is supported */
>> + pg_compress_specification spec;
>> + parse_compress_specification(AH->compression_spec.algorithm, NULL, &spec);
>>
>> + if (spec->parse_error != NULL)
>>
>> + pg_fatal(spec->parse_error);
>>
>> + }
> 
> I am not certain how that would work in the example with ZSTD above.
> If I am not wrong, parse_compress_specification() will not throw an error
> if the codebase supports ZSTD, yet this specific pg_dump binary will not
> support it because ZSTD is not implemented. parse_compress_specification()
> is not aware of that and should not be aware of it, should it?
> 

Not sure. What happens in a similar situation now? That is, when trying
to deal with an archive gzip-compressed in a build without libz?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Wed, Jan 25, 2023 at 03:37:12PM +0000, gkokolatos@pm.me wrote:
> Of course, one can throw the error before entering the loop, yet I think
> that it does not help the readability of the code. IMHO it is easier to
> follow if the error is thrown once during that check.

> If anything, I can suggest to throw an error much earlier, i.e. in ReadHead(),
> and remove altogether this check. On the other hand, I like the belts
> and suspenders approach because there are no more checks after this point.

While looking at this, I realized that commit 5e73a6048 introduced a
regression:

@@ -3740,19 +3762,24 @@ ReadHead(ArchiveHandle *AH)

-       if (AH->compression != 0)
-               pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available");

+       if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
+               pg_fatal("archive is compressed, but this installation does not support compression");

Before, it was possible to restore non-data chunks of a dump file, even
if the current build didn't support its compression.  But that's now
impossible - and it makes the code we're discussing in RestoreArchive()
unreachable.

I don't think we can currently test for that, since it requires creating a dump
using a build --with compression and then trying to restore using a build
--without compression.  The coverage report disagrees with me, though...
https://coverage.postgresql.org/src/bin/pg_dump/pg_backup_archiver.c.gcov.html#3901

> > I think it could be written to avoid the need to change for added
> > compression algorithms:
...
> 
> I am not certain how that would work in the example with ZSTD above.
> If I am not wrong, parse_compress_specification() will not throw an error
> if the codebase supports ZSTD, yet this specific pg_dump binary will not
> support it because ZSTD is not implemented. parse_compress_specification()
> is not aware of that and should not be aware of it, should it?

You're right.

I think the 001 patch should try to remove hardcoded references to
LIBZ/GZIP, such that the later patches don't need to update those same
places for LZ4.  For example in ReadHead() and RestoreArchive(), and
maybe other places dealing with file extensions.  Maybe that could be
done by adding a function specific to pg_dump indicating whether or not
an algorithm is implemented and supported.
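
Roughly something like the sketch below, perhaps (the name and exact shape
are of course up for debate):

/*
 * Does this pg_dump/pg_restore build both implement and support the given
 * compression method?  (Sketch only; the function name is hypothetical.)
 */
static bool
archive_compression_supported(pg_compress_algorithm algorithm)
{
    switch (algorithm)
    {
        case PG_COMPRESSION_NONE:
            return true;
        case PG_COMPRESSION_GZIP:
#ifdef HAVE_LIBZ
            return true;
#else
            return false;
#endif
        case PG_COMPRESSION_LZ4:
#ifdef USE_LZ4
            return true;
#else
            return false;
#endif
        case PG_COMPRESSION_ZSTD:
            return false;       /* not implemented in pg_dump yet */
    }
    return false;               /* keep compilers quiet */
}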

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Wednesday, January 25th, 2023 at 6:28 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
>
> On 1/25/23 16:37, gkokolatos@pm.me wrote:
>
> > ------- Original Message -------
> > On Wednesday, January 25th, 2023 at 2:42 AM, Justin Pryzby pryzby@telsasoft.com wrote:
> >
> > > On Tue, Jan 24, 2023 at 03:56:20PM +0000, gkokolatos@pm.me wrote:
> > >
> > > > On Monday, January 23rd, 2023 at 7:00 PM, Justin Pryzby pryzby@telsasoft.com wrote:
> > > >
> > > > > On Mon, Jan 23, 2023 at 05:31:55PM +0000, gkokolatos@pm.me wrote:
> > > > >
> > > > > > Please find attached v23 which reintroduces the split.
> > > > > >
> > > > > > 0001 is reworked to have a reduced footprint than before. Also in an attempt
> > > > > > to facilitate the readability, 0002 splits the API's and the uncompressed
> > > > > > implementation in separate files.
> > > > >
> > > > > Thanks for updating the patch. Could you address the review comments I
> > > > > sent here ?
> > > > > https://www.postgresql.org/message-id/20230108194524.GA27637%40telsasoft.com
> > > >
> > > > Please find v24 attached.
> > >
> > > Thanks for updating the patch.
> > >
> > > In 001, RestoreArchive() does:
> > >
> > > > -#ifndef HAVE_LIBZ
> > > > - if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP &&
> > > > - AH->PrintTocDataPtr != NULL)
> > > > + supports_compression = false;
> > > > + if (AH->compression_spec.algorithm == PG_COMPRESSION_NONE ||
> > > > + AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> > > > + supports_compression = true;
> > > > +
> > > > + if (AH->PrintTocDataPtr != NULL)
> > > > {
> > > > for (te = AH->toc->next; te != AH->toc; te = te->next)
> > > > {
> > > > if (te->hadDumper && (te->reqs & REQ_DATA) != 0)
> > > > - pg_fatal("cannot restore from compressed archive (compression not supported in this installation)");
> > > > + {
> > > > +#ifndef HAVE_LIBZ
> > > > + if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> > > > + supports_compression = false;
> > > > +#endif
> > > > + if (supports_compression == false)
> > > > + pg_fatal("cannot restore from compressed archive (compression not supported in this installation)");
> > > > + }
> > > > }
> > > > }
> > > > -#endif
> > >
> > > This first checks if the algorithm is implemented, and then checks if
> > > the algorithm is supported by the current build - that confused me for a
> > > bit. It seems unnecessary to check for unimplemented algorithms before
> > > looping. That also requires referencing both GZIP and LZ4 in two
> > > places.
> >
> > I am not certain that it is unnecessary, at least not in the way that is
> > described. The idea is that new compression methods can be added, without
> > changing the archive's version number. It is very possible that it is
> > requested to restore an archive compressed with a method not implemented
> > in the current binary. The first check takes care of that and sets
> > supports_compression only for the supported versions. It is possible to
> > enter the loop with supports_compression already set to false, for example
> > because the archive was compressed with ZSTD, triggering the fatal error.
> >
> > Of course, one can throw the error before entering the loop, yet I think
> > that it does not help the readability of the code. IMHO it is easier to
> > follow if the error is thrown once during that check.
>
>
> Actually, I don't understand why 0001 moves the check into the loop. I
> mean, why not check HAVE_LIBZ before the loop?

The intention is to be able to restore archives that don't contain
data. In that case compression becomes irrelevant as only the data in
an archive is compressed.

>
> > > I think it could be written to avoid the need to change for added
> > > compression algorithms:
> > >
> > > + if (te->hadDumper && (te->reqs & REQ_DATA) != 0)
> > >
> > > + {
> > > + /* Check if the compression algorithm is supported */
> > > + pg_compress_specification spec;
> > > + parse_compress_specification(AH->compression_spec.algorithm, NULL, &spec);
> > >
> > > + if (spec->parse_error != NULL)
> > >
> > > + pg_fatal(spec->parse_error);
> > >
> > > + }
> >
> > I am not certain how that would work in the example with ZSTD above.
> > If I am not wrong, parse_compress_specification() will not throw an error
> > if the codebase supports ZSTD, yet this specific pg_dump binary will not
> > support it because ZSTD is not implemented. parse_compress_specification()
> > is not aware of that and should not be aware of it, should it?
>
>
> Not sure. What happens in a similar situation now? That is, when trying
> to deal with an archive gzip-compressed in a build without libz?


If there are no data chunks, the archive will be restored.

Cheers,
//Georgios


>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Wednesday, January 25th, 2023 at 7:00 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:


>
>
> On Wed, Jan 25, 2023 at 03:37:12PM +0000, gkokolatos@pm.me wrote:
>

> While looking at this, I realized that commit 5e73a6048 introduced a
> regression:
>
> @@ -3740,19 +3762,24 @@ ReadHead(ArchiveHandle *AH)
>
> - if (AH->compression != 0)
>
> - pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available");
> + if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
>
> + pg_fatal("archive is compressed, but this installation does not support compression");
>
> Before, it was possible to restore non-data chunks of a dump file, even
> if the current build didn't support its compression. But that's now
> impossible - and it makes the code we're discussing in RestoreArchive()
> unreachable.

Nice catch!

Cheers,
//Georgios

> --
> Justin



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Wed, Jan 25, 2023 at 07:57:18PM +0000, gkokolatos@pm.me wrote:
> Nice catch!

Let me see..
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Wed, Jan 25, 2023 at 12:00:20PM -0600, Justin Pryzby wrote:
> While looking at this, I realized that commit 5e73a6048 introduced a
> regression:
>
> @@ -3740,19 +3762,24 @@ ReadHead(ArchiveHandle *AH)
>
> -       if (AH->compression != 0)
> -               pg_log_warning("archive is compressed, but this installation does not support compression -- no data
willbe available"); 
> +       if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> +               pg_fatal("archive is compressed, but this installation does not support compression");
>
> Before, it was possible to restore non-data chunks of a dump file, even
> if the current build didn't support its compression.  But that's now
> impossible - and it makes the code we're discussing in RestoreArchive()
> unreachable.

Right.  This impacts the possibility of looking at the header data,
which is useful with pg_restore -l for example.  On a dump that's been
compressed, pg_restore <= 15 would always print the TOC entries, with
or without compression support.  On HEAD, this code prevents the
header lookup.  All *nix or BSD platforms should have support for
zlib, I hope.  Still, that could be an issue on Windows, and it
would prevent folks from checking the contents of their dumps after
saving them on a WIN32 host, so let's undo that.

So, I have been testing the attached with four sets of binaries from
15/HEAD and with[out] zlib support, and this brings HEAD back to the
pre-15 state (header information able to show up, still failure when
attempting to restore the dump's data without zlib).

> I don't think we can currently test for that, since it requires creating a dump
> using a build --with compression and then trying to restore using a build
> --without compression.

Right, the location of the data is in the header, and I don't see how
you would be able to do that without two sets of binaries at hand, but
our tests run under the assumption that you have only one.  Well,
that's not entirely true either, as you could create a TAP test like
pg_upgrade's that relies on an environment variable pointing to a second
set of binaries.  That's not worth the complication involved, IMO.

> The coverage report disagrees with me, though...
> https://coverage.postgresql.org/src/bin/pg_dump/pg_backup_archiver.c.gcov.html#3901

Isn't that one of the tests like compression_gzip_plain?

Thoughts?
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Thu, Jan 26, 2023 at 02:49:27PM +0900, Michael Paquier wrote:
> On Wed, Jan 25, 2023 at 12:00:20PM -0600, Justin Pryzby wrote:
> > While looking at this, I realized that commit 5e73a6048 introduced a
> > regression:
> > 
> > @@ -3740,19 +3762,24 @@ ReadHead(ArchiveHandle *AH)
> > 
> > -       if (AH->compression != 0)
> > -               pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available");
> > +       if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> > +               pg_fatal("archive is compressed, but this installation does not support compression");
> > 
> > Before, it was possible to restore non-data chunks of a dump file, even
> > if the current build didn't support its compression.  But that's now
> > impossible - and it makes the code we're discussing in RestoreArchive()
> > unreachable.
> 
> Right.  The impacts the possibility of looking at the header data,
> which is useful with pg_restore -l for example.

It's not just header data - it's schema and (I think) everything other
than table data.

> > The coverage report disagrees with me, though...
> > https://coverage.postgresql.org/src/bin/pg_dump/pg_backup_archiver.c.gcov.html#3901
> 
> Isn't that one of the tests like compression_gzip_plain?

I'm not sure what you mean.  A plain dump is restored with psql and not
with pg_restore.

My line number was wrong:
https://coverage.postgresql.org/src/bin/pg_dump/pg_backup_archiver.c.gcov.html#390

What test would hit that code without rebuilding ?

394             : #ifndef HAVE_LIBZ
395             :     if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP &&

> Thoughts?
>  #ifndef HAVE_LIBZ
>      if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> -        pg_fatal("archive is compressed, but this installation does not support compression");
> +        pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available");

Your patch is fine for now, but these errors should eventually specify
*which* compression algorithm is unavailable.  I think that should be a
part of the 001 patch, ideally in a way that minimizes the number of
places which need to be updated when adding an algorithm.

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Thursday, January 26th, 2023 at 7:28 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:


>
>
> On Thu, Jan 26, 2023 at 02:49:27PM +0900, Michael Paquier wrote:
>
> > On Wed, Jan 25, 2023 at 12:00:20PM -0600, Justin Pryzby wrote:
> >
> > > While looking at this, I realized that commit 5e73a6048 introduced a
> > > regression:
> > >
> > > @@ -3740,19 +3762,24 @@ ReadHead(ArchiveHandle *AH)
> > >
> > > - if (AH->compression != 0)
> > > - pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available");
> > > + if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> > > + pg_fatal("archive is compressed, but this installation does not support compression");
> > >
> > > Before, it was possible to restore non-data chunks of a dump file, even
> > > if the current build didn't support its compression. But that's now
> > > impossible - and it makes the code we're discussing in RestoreArchive()
> > > unreachable.
> >
> > Right. The impacts the possibility of looking at the header data,
> > which is useful with pg_restore -l for example.
>
>
> It's not just header data - it's schema and (I think) everything other
> than table data.
>
> > > The coverage report disagrees with me, though...
> > > https://coverage.postgresql.org/src/bin/pg_dump/pg_backup_archiver.c.gcov.html#3901
> >
> > Isn't that one of the tests like compression_gzip_plain?
>
>
> I'm not sure what you mean. Plain dump is restored with psql and not
> with pg_restore.
>
> My line number was wrong:
> https://coverage.postgresql.org/src/bin/pg_dump/pg_backup_archiver.c.gcov.html#390
>
> What test would hit that code without rebuilding ?
>
> 394 : #ifndef HAVE_LIBZ
> 395 : if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP &&
>
> > Thoughts?
> > #ifndef HAVE_LIBZ
> > if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> > - pg_fatal("archive is compressed, but this installation does not support compression");
> > + pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available");
>
>
> Your patch is fine for now, but these errors should eventually specify
> which compression algorithm is unavailable. I think that should be a
> part of the 001 patch, ideally in a way that minimizes the number of
> places which need to be updated when adding an algorithm.

I gave this a little bit of thought. I think that ReadHead should not
emit a warning, or at least not this warning as it is slightly misleading.
It implies that it will automatically turn off data restoration, which is
false. Further ahead, the code will fail with a conflicting error message
stating that the compression is not available.

Instead, it would be cleaner, both for the user and for the maintainer,
to move the check into RestoreArchive and make it solely responsible for
this logic.
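
In other words, roughly the shape below, based on the hunk quoted earlier in
the thread (the helper name and its signature are only illustrative):

    /* In RestoreArchive(): the single place rejecting unsupported compression */
    if (AH->PrintTocDataPtr != NULL)
    {
        TocEntry   *te;

        for (te = AH->toc->next; te != AH->toc; te = te->next)
        {
            /* only data entries are compressed, so only they can fail */
            if (te->hadDumper && (te->reqs & REQ_DATA) != 0 &&
                !compression_is_supported(AH->compression_spec.algorithm))
                pg_fatal("cannot restore data from compressed archive (compression not supported in this installation)");
        }
    }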

Please find v26 attached. 0001 does the above and 0002 addresses Justin's
complaints regarding the code footprint.

//Cheers,
Georgios


>
> --
> Justin
Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Thu, Jan 26, 2023 at 11:24:47AM +0000, gkokolatos@pm.me wrote:
> I gave this a little bit of thought. I think that ReadHead should not
> emit a warning, or at least not this warning as it is slightly misleading.
> It implies that it will automatically turn off data restoration, which is
> false. Further ahead, the code will fail with a conflicting error message
> stating that the compression is not available.
>
> Instead, it would be cleaner both for the user and the maintainer to
> move the check in RestoreArchive and make it the sole responsible for
> this logic.

-    pg_fatal("cannot restore from compressed archive (compression not supported in this installation)");
+    pg_fatal("cannot restore data from compressed archive (compression not supported in this installation)");
Hmm.  I don't mind changing this part as you suggest.

-#ifndef HAVE_LIBZ
-       if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
-               pg_fatal("archive is compressed, but this installation does not support compression");
-#endif
However, I think that we'd better keep the warning, as it can offer a
hint when using a pg_restore -l that was not built with compression support
to look at a dump that has been compressed.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Thursday, January 26th, 2023 at 12:53 PM, Michael Paquier <michael@paquier.xyz> wrote:


>
>
> On Thu, Jan 26, 2023 at 11:24:47AM +0000, gkokolatos@pm.me wrote:
>
> > I gave this a little bit of thought. I think that ReadHead should not
> > emit a warning, or at least not this warning as it is slightly misleading.
> > It implies that it will automatically turn off data restoration, which is
> > false. Further ahead, the code will fail with a conflicting error message
> > stating that the compression is not available.
> >
> > Instead, it would be cleaner both for the user and the maintainer to
> > move the check in RestoreArchive and make it the sole responsible for
> > this logic.
>
>
> - pg_fatal("cannot restore from compressed archive (compression not supported in this installation)");
> + pg_fatal("cannot restore data from compressed archive (compression not supported in this installation)");
> Hmm. I don't mind changing this part as you suggest.
>
> -#ifndef HAVE_LIBZ
> - if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
>
> - pg_fatal("archive is compressed, but this installation does not support compression");
> -#endif
> However I think that we'd better keep the warning, as it can offer a
> hint when using pg_restore -l not built with compression support if
> looking at a dump that has been compressed.

Fair enough. Please find v27 attached.

Cheers,
//Georgios


> --
> Michael
Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Wed, Jan 25, 2023 at 07:57:18PM +0000, gkokolatos@pm.me wrote:
> On Wednesday, January 25th, 2023 at 7:00 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> > While looking at this, I realized that commit 5e73a6048 introduced a
> > regression:
> > 
> > @@ -3740,19 +3762,24 @@ ReadHead(ArchiveHandle *AH)
> > 
> > - if (AH->compression != 0)
> > 
> > - pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available");
> > + if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> > 
> > + pg_fatal("archive is compressed, but this installation does not support compression");
> > 
> > Before, it was possible to restore non-data chunks of a dump file, even
> > if the current build didn't support its compression. But that's now
> > impossible - and it makes the code we're discussing in RestoreArchive()
> > unreachable.

On Thu, Jan 26, 2023 at 08:53:28PM +0900, Michael Paquier wrote:
> On Thu, Jan 26, 2023 at 11:24:47AM +0000, gkokolatos@pm.me wrote:
> > I gave this a little bit of thought. I think that ReadHead should not
> > emit a warning, or at least not this warning as it is slightly misleading.
> > It implies that it will automatically turn off data restoration, which is
> > false. Further ahead, the code will fail with a conflicting error message
> > stating that the compression is not available.
> > 
> > Instead, it would be cleaner both for the user and the maintainer to
> > move the check in RestoreArchive and make it the sole responsible for
> > this logic.
> 
> -    pg_fatal("cannot restore from compressed archive (compression not supported in this installation)");
> +    pg_fatal("cannot restore data from compressed archive (compression not supported in this installation)");
> Hmm.  I don't mind changing this part as you suggest.
> 
> -#ifndef HAVE_LIBZ
> -       if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> -               pg_fatal("archive is compressed, but this installation does not support compression");
> -#endif
> However I think that we'd better keep the warning, as it can offer a
> hint when using pg_restore -l not built with compression support if
> looking at a dump that has been compressed.

Yeah.  But the original log_warning text was better, and should be
restored:

-       if (AH->compression != 0)
-               pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available");

That commit also added this to pg_dump.c:

+               case PG_COMPRESSION_ZSTD:
+                       pg_fatal("compression with %s is not yet supported", "ZSTD");
+                       break;
+               case PG_COMPRESSION_LZ4:
+                       pg_fatal("compression with %s is not yet supported", "LZ4");
+                       break;

In 002, that could be simplified by re-using the supports_compression()
function.  (And maybe the same in WriteDataToArchive()?)

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Thu, Jan 26, 2023 at 12:22:45PM -0600, Justin Pryzby wrote:
> Yeah.  But the original log_warning text was better, and should be
> restored:
>
> -       if (AH->compression != 0)
> -               pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available");

Yeah, this one's on me.  So I have gone with the simplest solution and
applied a fix to restore the original behavior, with the same warning
showing up.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Mon, Jan 16, 2023 at 11:54:46AM +0900, Michael Paquier wrote:
> On Sun, Jan 15, 2023 at 07:56:25PM -0600, Justin Pryzby wrote:
> > On Mon, Jan 16, 2023 at 10:28:50AM +0900, Michael Paquier wrote:
> >> The functions changed by 0001 are cfopen[_write](),
> >> AllocateCompressor() and ReadDataFromArchive().  Why is it a good idea
> >> to change these interfaces which basically exist to handle inputs?
> > 
> > I changed to pass pg_compress_specification as a pointer, since that's
> > the usual convention for structs, as followed by the existing uses of
> > pg_compress_specification.
> 
> Okay, but what do we gain here?  It seems to me that this introduces
> the risk that a careless change in one of the internal routines could
> slightly alter compress_spec, hence impacting any of their
> callers?  Or is that fixing an actual bug (except if I am missing your
> point, that does not seem to be the case)?  

To circle back to this: I was not saying there's any bug.  The proposed
change was only to follow normal and existing normal conventions for
passing structs.  It could also be a pointer to const.  It's fine with
me if you say that it's intentional how it's written already.

> >> Is there some benefit in changing compression_spec within the
> >> internals of these routines before going back one layer down to their
> >> callers?  Changing the compression_spec on-the-fly in these internal
> >> paths could be risky, actually, no?
> > 
> > I think what you're saying is that if the spec is passed as a pointer,
> > then the called functions shouldn't set spec->algorithm=something.
> 
> Yes.  HEAD makes sure of that, 0001 would not prevent that.  So I am a
> bit confused in seeing how this is a benefit.



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Thu, Jan 26, 2023 at 12:22:45PM -0600, Justin Pryzby wrote:
> That commit also added this to pg-dump.c:
> 
> +               case PG_COMPRESSION_ZSTD:
> +                       pg_fatal("compression with %s is not yet supported", "ZSTD");
> +                       break;
> +               case PG_COMPRESSION_LZ4:
> +                       pg_fatal("compression with %s is not yet supported", "LZ4");
> +                       break;
> 
> In 002, that could be simplified by re-using the supports_compression()
> function.  (And maybe the same in WriteDataToArchive()?)

The first patch aims to minimize references to ".gz" and "GZIP" and
ZLIB.  pg_backup_directory.c comments still refers to ".gz".  I think
the patch should ideally change to refer to "the compressed file
extension" (similar to compress_io.c), avoiding the need to update it
later.

I think the file extension stuff could be generalized, so it doesn't
need to be updated in multiple places (pg_backup_directory.c and
compress_io.c).  Maybe it's useful to add a function to return the
extension of a given compression method.  It could go in compression.c,
and be useful in basebackup.
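
A minimal sketch of such a helper (hypothetical name; the enum values are the
existing pg_compress_algorithm ones):

    /* sketch only: map a compression algorithm to its file extension */
    const char *
    compress_algorithm_extension(pg_compress_algorithm algorithm)
    {
        switch (algorithm)
        {
            case PG_COMPRESSION_GZIP:
                return ".gz";
            case PG_COMPRESSION_LZ4:
                return ".lz4";
            case PG_COMPRESSION_ZSTD:
                return ".zst";
            case PG_COMPRESSION_NONE:
                return "";
        }
        return "";                    /* keep the compiler quiet */
    }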

For the 2nd patch:

I might be in the minority, but I still think some references to "gzip"
should say "zlib":

+} GzipCompressorState;
+
+/* Private routines that support gzip compressed data I/O */
+static void
+DeflateCompressorGzip(ArchiveHandle *AH, CompressorState *cs, bool flush)

In my mind, three things here are misleading, because it doesn't use
gzip headers:

| GzipCompressorState, DeflateCompressorGzip, "gzip compressed".

This comment is about exactly that:

  * underlying stream. The second API is a wrapper around fopen/gzopen and
  * friends, providing an interface similar to those, but abstracts away
  * the possible compression. Both APIs use libz for the compression, but
  * the second API uses gzip headers, so the resulting files can be easily
  * manipulated with the gzip utility.

AIUI, Michael says that it's fine that the user-facing command-line
options use "-Z gzip" (even though the "custom" format doesn't use gzip
headers).  I'm okay with that, as long as that's discussed/understood.

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Friday, January 27th, 2023 at 6:23 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:


>
>
> On Thu, Jan 26, 2023 at 12:22:45PM -0600, Justin Pryzby wrote:
>
> > That commit also added this to pg-dump.c:
> >
> > + case PG_COMPRESSION_ZSTD:
> > + pg_fatal("compression with %s is not yet supported", "ZSTD");
> > + break;
> > + case PG_COMPRESSION_LZ4:
> > + pg_fatal("compression with %s is not yet supported", "LZ4");
> > + break;
> >
> > In 002, that could be simplified by re-using the supports_compression()
> > function. (And maybe the same in WriteDataToArchive()?)
>
>
> The first patch aims to minimize references to ".gz" and "GZIP" and
> ZLIB. pg_backup_directory.c comments still refers to ".gz". I think
> the patch should ideally change to refer to "the compressed file
> extension" (similar to compress_io.c), avoiding the need to update it
> later.
>
> I think the file extension stuff could be generalized, so it doesn't
> need to be updated in multiple places (pg_backup_directory.c and
> compress_io.c). Maybe it's useful to add a function to return the
> extension of a given compression method. It could go in compression.c,
> and be useful in basebackup.
>
> For the 2nd patch:
>
> I might be in the minority, but I still think some references to "gzip"
> should say "zlib":
>
> +} GzipCompressorState;
> +
> +/* Private routines that support gzip compressed data I/O */
> +static void
> +DeflateCompressorGzip(ArchiveHandle *AH, CompressorState *cs, bool flush)
>
> In my mind, three things here are misleading, because it doesn't use
> gzip headers:
>
> | GzipCompressorState, DeflateCompressorGzip, "gzip compressed".
>
> This comment is about exactly that:
>
> * underlying stream. The second API is a wrapper around fopen/gzopen and
> * friends, providing an interface similar to those, but abstracts away
> * the possible compression. Both APIs use libz for the compression, but
> * the second API uses gzip headers, so the resulting files can be easily
> * manipulated with the gzip utility.
>
> AIUI, Michael says that it's fine that the user-facing command-line
> options use "-Z gzip" (even though the "custom" format doesn't use gzip
> headers). I'm okay with that, as long as that's discussed/understood.
>

Thank you for the input, Justin. I am currently waiting for input from a
third person to reach some conclusion. I thought that this should be stated
before my inactivity is mistaken for indifference, which it is not.

Cheers,
//Georgios

> --
> Justin



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Tue, Jan 31, 2023 at 09:00:56AM +0000, gkokolatos@pm.me wrote:
> > In my mind, three things here are misleading, because it doesn't use
> > gzip headers:
> > 
> > | GzipCompressorState, DeflateCompressorGzip, "gzip compressed".
> > 
> > This comment is about exactly that:
> > 
> > * underlying stream. The second API is a wrapper around fopen/gzopen and
> > * friends, providing an interface similar to those, but abstracts away
> > * the possible compression. Both APIs use libz for the compression, but
> > * the second API uses gzip headers, so the resulting files can be easily
> > * manipulated with the gzip utility.
> > 
> > AIUI, Michael says that it's fine that the user-facing command-line
> > options use "-Z gzip" (even though the "custom" format doesn't use gzip
> > headers). I'm okay with that, as long as that's discussed/understood.
> 
> Thank you for the input, Justin. I am currently waiting for input from a
> third person to reach some conclusion. I thought that this should be stated
> before my inactivity is mistaken for indifference, which it is not.

I'm not sure what there is to lose by making the names more accurate -
especially since they're private/internal-only.

Tomas marked himself as a committer, so maybe could comment.

It'd be nice to also come to some conclusion about whether -Fc -Z gzip
is confusing (due to not actually using gzip).

BTW, do you intend to merge this for v16 ?  I verified in earlier patch
versions that tests all pass with lz4 as the default compression method.
And checked that gzip output is compatible with before, and that old
dumps restore correctly, and there's no memory leaks or other errors.

-- 
Justin



RE: Add LZ4 compression in pg_dump

From
"shiy.fnst@fujitsu.com"
Date:
On Fri, Jan 27, 2023 2:04 AM gkokolatos@pm.me <gkokolatos@pm.me> wrote:
> 
> ------- Original Message -------
> On Thursday, January 26th, 2023 at 12:53 PM, Michael Paquier
> <michael@paquier.xyz> wrote:
> 
> 
> >
> >
> > On Thu, Jan 26, 2023 at 11:24:47AM +0000, gkokolatos@pm.me wrote:
> >
> > > I gave this a little bit of thought. I think that ReadHead should not
> > > emit a warning, or at least not this warning as it is slightly misleading.
> > > It implies that it will automatically turn off data restoration, which is
> > > false. Further ahead, the code will fail with a conflicting error message
> > > stating that the compression is not available.
> > >
> > > Instead, it would be cleaner both for the user and the maintainer to
> > > move the check in RestoreArchive and make it the sole responsible for
> > > this logic.
> >
> >
> > - pg_fatal("cannot restore from compressed archive (compression not
> supported in this installation)");
> > + pg_fatal("cannot restore data from compressed archive (compression not
> supported in this installation)");
> > Hmm. I don't mind changing this part as you suggest.
> >
> > -#ifndef HAVE_LIBZ
> > - if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP)
> >
> > - pg_fatal("archive is compressed, but this installation does not support
> compression");
> > -#endif
> > However I think that we'd better keep the warning, as it can offer a
> > hint when using pg_restore -l not built with compression support if
> > looking at a dump that has been compressed.
> 
> Fair enough. Please find v27 attached.
> 

Hi,

I am interested in this feature and tried the patch. While reading the comments,
I noticed some minor things that could possibly be improved (in v27-0003 patch).

1.
+    /*
+     * Open a file for writing.
+     *
+     * 'mode' can be one of ''w', 'wb', 'a', and 'ab'. Requrires an already
+     * initialized CompressFileHandle.
+     */
+    int            (*open_write_func) (const char *path, const char *mode,
+                                    CompressFileHandle *CFH);

There is a redundant single quote in front of 'w'.

2.
/*
 * Callback function for WriteDataToArchive. Writes one block of (compressed)
 * data to the archive.
 */
/*
 * Callback function for ReadDataFromArchive. To keep things simple, we
 * always read one compressed block at a time.
 */

Should the function names in the comments be updated?

WriteDataToArchive
->
writeData

ReadDataFromArchive
->
readData

3.
+    Assert(strcmp(mode, "r") == 0 || strcmp(mode, "rb") == 0);

Could we use PG_BINARY_R instead of "r" and "rb" here?

Regards,
Shi Yu


RE: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Wednesday, February 15th, 2023 at 2:51 PM, shiy.fnst@fujitsu.com <shiy.fnst@fujitsu.com> wrote:


>
> Hi,
>
> I am interested in this feature and tried the patch. While reading the comments,
> I noticed some minor things that could possibly be improved (in v27-0003 patch).

Thank you very much for the interest. Please find a rebased v28 attached. Due to
the rebase, 0001 of v27 is no longer relevant and has been removed. Your comments
are applied to v28-0002.

>
> 1.
> + /*
> + * Open a file for writing.
> + *
> + * 'mode' can be one of ''w', 'wb', 'a', and 'ab'. Requrires an already
> + * initialized CompressFileHandle.
> + */
> + int (*open_write_func) (const char *path, const char *mode,
> + CompressFileHandle *CFH);
>
> There is a redundant single quote in front of 'w'.

Fixed.

>
> 2.
> /*
> * Callback function for WriteDataToArchive. Writes one block of (compressed)
> * data to the archive.
> */
> /*
> * Callback function for ReadDataFromArchive. To keep things simple, we
> * always read one compressed block at a time.
> */
>
> Should the function names in the comments be updated?

Agreed. Fixed.

>
> 3.
> + Assert(strcmp(mode, "r") == 0 || strcmp(mode, "rb") == 0);
>
> Could we use PG_BINARY_R instead of "r" and "rb" here?

We could and we should. Using PG_BINARY_R has the added benefit
of needing only one strcmp() call. Fixed.
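
For reference, the simplified check reads roughly like this (PG_BINARY_R
expands to "r" or "rb" depending on the platform):

    Assert(strcmp(mode, PG_BINARY_R) == 0);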

Cheers,
//Georgios

>
> Regards,
> Shi Yu
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
Hi Georgios,

I spent some time looking at the patch again, and IMO it's RFC. But I
need some help with the commit messages - I updated 0001 and 0002 but I
wasn't quite sure what some of the stuff meant to say and/or it seemed
maybe coming from an earlier patch version and obsolete.

Could you go over them and check if I got it right? Also feel free to
update the list of reviewers (I compiled that from substantial reviews
on the thread).

The 0003 commit message seems somewhat confusing - I added some XXX
lines asking about unclear stuff.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
Some little updates since I last checked:

+ * This file also includes the implementation when compression is none for
+ * both API's.

=> this comment is obsolete.

s/deffer/infer/ ?
or determine ?
This typo occurs multiple times.

currently this includes only ".gz"
=> remove this phase from the 002 patch (or at least update it in 003).

deferred by iteratively
=> inferred?

s/Requrires/Requires/
twice.

s/occured/occurred/

s/disc/disk/ ?
Probably unimportant, but "disc" isn't used anywhere else.

"compress file handle"
=> maybe these should say "compressed"

supports_compression():
Since this is an exported function, it should probably be called
pgdump_supports_compression.



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Sunday, February 19th, 2023 at 6:10 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
> Hi Georgios,
>
> I spent some time looking at the patch again, and IMO it's RFC. But I
> need some help with the commit messages - I updated 0001 and 0002 but I
> wasn't quite sure what some of the stuff meant to say and/or it seemed
> maybe coming from an earlier patch version and obsolete.

Thank you very much, Tomas! Indeed I have not been paying any attention
to the commit messages.

> Could you go over them and check if I got it right? Also feel free to
> update the list of reviewers (I compiled that from substantial reviews
> on the thread).

Done. Rachel has been correctly identified as author in the relevant parts
up to commit 98fe74218d. After that, she offered review comments and I
have taken the liberty of adding her as a reviewer throughout.

Also I think that Shi Yu should be credited as a reviewer of 0003.

>
> The 0003 commit message seems somewhat confusing - I added some XXX
> lines asking about unclear stuff.

Please find in the attached v30 an updated message, as well as an amended
reviewer list. Also v30 addresses the final comments raised by Justin.

Cheers,
//Georgios

> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
Thanks for v30 with the updated commit messages. I've pushed 0001 after
fixing a comment typo and removing (I think) an unnecessary change in an
error message.

I'll give the buildfarm a bit of time before pushing 0002 and 0003.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
On 2/23/23 16:26, Tomas Vondra wrote:
> Thanks for v30 with the updated commit messages. I've pushed 0001 after
> fixing a comment typo and removing (I think) an unnecessary change in an
> error message.
> 
> I'll give the buildfarm a bit of time before pushing 0002 and 0003.
> 

I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
and marked the CF entry as committed. Thanks for the patch!

I wonder how difficult would it be to add the zstd compression, so that
we don't have the annoying "unsupported" cases.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Thu, Feb 23, 2023 at 09:24:46PM +0100, Tomas Vondra wrote:
> I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
> and marked the CF entry as committed. Thanks for the patch!

A big thanks from me to everyone involved.

> I wonder how difficult would it be to add the zstd compression, so that
> we don't have the annoying "unsupported" cases.

I'll send a patch soon.  I first submitted patches for that 2 years ago
(before PGDG was ready to add zstd).
https://commitfest.postgresql.org/31/2888/

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Thu, Feb 23, 2023 at 07:51:16PM -0600, Justin Pryzby wrote:
> On Thu, Feb 23, 2023 at 09:24:46PM +0100, Tomas Vondra wrote:
>> I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
>> and marked the CF entry as committed. Thanks for the patch!
>
> A big thanks from me to everyone involved.

Wow, nice!  The APIs are clear to follow.

> I'll send a patch soon.  I first submitted patches for that 2 years ago
> (before PGDG was ready to add zstd).
> https://commitfest.postgresql.org/31/2888/

Thanks.  It should be straightforward to see that in 16, I guess.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Friday, February 24th, 2023 at 5:35 AM, Michael Paquier <michael@paquier.xyz> wrote:


>
>
> On Thu, Feb 23, 2023 at 07:51:16PM -0600, Justin Pryzby wrote:
>
> > On Thu, Feb 23, 2023 at 09:24:46PM +0100, Tomas Vondra wrote:
> >
> > > I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
> > > and marked the CF entry as committed. Thanks for the patch!
> >
> > A big thanks from me to everyone involved.
>
>
> Wow, nice! The APIs are clear to follow.

I am out of words, thank you all so very much. I learned a lot.

>
> > I'll send a patch soon. I first submitted patches for that 2 years ago
> > (before PGDG was ready to add zstd).
> > https://commitfest.postgresql.org/31/2888/
>
>
> Thanks. It should be straight-forward to see that in 16, I guess.
> --
> Michael



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
I have some fixes (attached) and questions while polishing the patch for
zstd compression.  The fixes are small and could be integrated with the
patch for zstd, but could be applied independently.

- I'm unclear about get_error_func().  That's called in three places
  from pg_backup_directory.c, after failures from write_func(), to
  supply a compression-specific error message to pg_fatal().  But it's
  not being used outside of directory format, nor for errors for other
  function pointers, or even for all errors in write_func().  Is there
  some reason why each compression method's write_func() shouldn't call
  pg_fatal() directly, with its compression-specific message ?

- I still think supports_compression() should be renamed, or made into a
  static function in the necessary file.  The main reason is that it's
  more clear what it indicates - whether compression is "implemented by
  pgdump" and not whether compression is "supported by this postgres
  build".  It also seems possible that we'd want to add a function
  called something like supports_compression(), indicating whether the
  algorithm is supported by the current build.  It'd be better if pgdump
  didn't subjugate that name.

- Finally, the "Nothing to do in the default case" comment comes from
  Michael's commit 5e73a6048:

+       /*
+        * Custom and directory formats are compressed by default with gzip when
+        * available, not the others.
+        */
+       if ((archiveFormat == archCustom || archiveFormat == archDirectory) &&
+               !user_compression_defined)
        {
 #ifdef HAVE_LIBZ
-               if (archiveFormat == archCustom || archiveFormat == archDirectory)
-                       compressLevel = Z_DEFAULT_COMPRESSION;
-               else
+               parse_compress_specification(PG_COMPRESSION_GZIP, NULL,
+                                                                        &compression_spec);
+#else
+               /* Nothing to do in the default case */
 #endif
-                       compressLevel = 0;
        }


As the comment says: for -Fc and -Fd, the compression is set to zlib, if
enabled, and when not otherwise specified by the user.

Before 5e73a6048, this set compressLevel=0 for -Fp and -Ft, *and* when
zlib was unavailable.

But I'm not sure why there's now an empty "#else".  I also don't know
what "the default case" refers to.

Maybe the best thing here is to move the preprocessor #if, since it's no
longer in the middle of a runtime conditional:

 #ifdef HAVE_LIBZ
+       if ((archiveFormat == archCustom || archiveFormat == archDirectory) &&
+               !user_compression_defined)
+               parse_compress_specification(PG_COMPRESSION_GZIP, NULL,
+                                            &compression_spec);
 #endif

...but that elicits a warning about "variable set but not used"...

-- 
Justin

Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
> I have some fixes (attached) and questions while polishing the patch for
> zstd compression.  The fixes are small and could be integrated with the
> patch for zstd, but could be applied independently.

One more - WriteDataToArchiveGzip() says:

+       if (cs->compression_spec.level == 0)
+           pg_fatal("requested to compress the archive yet no level was specified");

That was added at e9960732a.  

But if you specify gzip:0, the compression level is already enforced by
validate_compress_specification(), before hitting gzip.c:

| pg_dump: error: invalid compression specification: compression algorithm "gzip" expects a compression level between 1 and 9 (default at -1)
 

5e73a6048 intended that to work as before, and you *can* specify -Z0:

    The change is backward-compatible, hence specifying only an integer
    leads to no compression for a level of 0 and gzip compression when the
    level is greater than 0.

    $ time ./src/bin/pg_dump/pg_dump -h /tmp regression -t int8_tbl -Fp --compress 0 |file -
    /dev/stdin: ASCII text

Right now, I think that pg_fatal in gzip.c is dead code - that was first
added in the patch version sent on 21 Dec 2022.

-- 
Justin

Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 2/25/23 06:02, Justin Pryzby wrote:
> I have some fixes (attached) and questions while polishing the patch for
> zstd compression.  The fixes are small and could be integrated with the
> patch for zstd, but could be applied independently.
> 
> - I'm unclear about get_error_func().  That's called in three places
>   from pg_backup_directory.c, after failures from write_func(), to
>   supply a compression-specific error message to pg_fatal().  But it's
>   not being used outside of directory format, nor for errors for other
>   function pointers, or even for all errors in write_func().  Is there
>   some reason why each compression method's write_func() shouldn't call
>   pg_fatal() directly, with its compression-specific message ?
> 

I think there are a couple more places that might/should call
get_error_func(). For example ahwrite() in pg_backup_archiver.c now
simply does

    if (bytes_written != size * nmemb)
        WRITE_ERROR_EXIT;

but perhaps it should call get_error_func() too. There are probably
other places that call write_func() and should use get_error_func().
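
Something along these lines, perhaps (a sketch only; it assumes the active
CompressFileHandle, here called CFH, is reachable at that point, which the
current code may not actually provide):

    /* sketch: surface the compressor's own error message */
    if (bytes_written != size * nmemb)
        pg_fatal("could not write to output file: %s",
                 CFH->get_error_func(CFH));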

> - I still think supports_compression() should be renamed, or made into a
>   static function in the necessary file.  The main reason is that it's
>   more clear what it indicates - whether compression is "implemented by
>   pgdump" and not whether compression is "supported by this postgres
>   build".  It also seems possible that we'd want to add a function
>   called something like supports_compression(), indicating whether the
>   algorithm is supported by the current build.  It'd be better if pgdump
>   didn't subjugate that name.
> 

If we choose to rename this to have a pgdump_ prefix, fine with me. But I
don't think there's a realistic chance of conflict, as it's restricted
to pgdump header etc. And it's not part of an API, so I guess we could
rename that in the future if needed.

> - Finally, the "Nothing to do in the default case" comment comes from
>   Michael's commit 5e73a6048:
> 
> +       /*
> +        * Custom and directory formats are compressed by default with gzip when
> +        * available, not the others.
> +        */
> +       if ((archiveFormat == archCustom || archiveFormat == archDirectory) &&
> +               !user_compression_defined)
>         {
>  #ifdef HAVE_LIBZ
> -               if (archiveFormat == archCustom || archiveFormat == archDirectory)
> -                       compressLevel = Z_DEFAULT_COMPRESSION;
> -               else
> +               parse_compress_specification(PG_COMPRESSION_GZIP, NULL,
> +                                                                        &compression_spec);
> +#else
> +               /* Nothing to do in the default case */
>  #endif
> -                       compressLevel = 0;
>         }
> 
> 
> As the comment says: for -Fc and -Fd, the compression is set to zlib, if
> enabled, and when not otherwise specified by the user.
> 
> Before 5e73a6048, this set compressLevel=0 for -Fp and -Ft, *and* when
> zlib was unavailable.
> 
> But I'm not sure why there's now an empty "#else".  I also don't know
> what "the default case" refers to.
> 
> Maybe the best thing here is to move the preprocessor #if, since it's no
> longer in the middle of a runtime conditional:
> 
>  #ifdef HAVE_LIBZ
> +       if ((archiveFormat == archCustom || archiveFormat == archDirectory) &&
> +               !user_compression_defined)
> +               parse_compress_specification(PG_COMPRESSION_GZIP, NULL,
> +                                            &compression_spec);
>  #endif
> 
> ...but that elicits a warning about "variable set but not used"...
> 

Not sure, I need to think about this a bit.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote:
> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
> > I have some fixes (attached) and questions while polishing the patch for
> > zstd compression.  The fixes are small and could be integrated with the
> > patch for zstd, but could be applied independently.
> 
> One more - WriteDataToArchiveGzip() says:

One more again.

The LZ4 path is using non-streaming mode, which compresses each block
without persistent state, giving poor compression for -Fc compared with
-Fp.  If the data is highly compressible, the difference can be orders
of magnitude.

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c
12351763
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
21890708

That's not true for gzip:

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c
2118869
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c
2115832

The function ought to at least use streaming mode, so each block/row
isn't compressed in isolation.  003 is a simple patch to use
streaming mode, which improves the -Fc case:

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
15178283
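
("Streaming mode" here means the LZ4 block API with persistent state across
calls, roughly as in this sketch; the wrapper name is illustrative:)

    #include <lz4.h>

    /* sketch: compress one block while reusing state from earlier blocks */
    int
    compress_block_streaming(LZ4_stream_t *stream, const char *src, int srclen,
                             char *dst, int dstcap)
    {
        /*
         * Unlike LZ4_compress_default(), this call can reference data from
         * previously compressed blocks as a dictionary.  The caller creates
         * the state with LZ4_createStream(), frees it with LZ4_freeStream(),
         * and must keep recently passed src buffers valid (e.g. a ring buffer).
         */
        return LZ4_compress_fast_continue(stream, src, dst, srclen, dstcap, 1);
    }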

However, that still flushes the compression buffer, writing a block
header, for every row.  With a single-column table, pg_dump -Fc -Z lz4
still outputs ~10% *more* data than with no compression at all.  And
that's for compressible data.

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c
12890296
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c
11890296

I think this should use the LZ4F API with frames, which are buffered to
avoid outputting a header for every single row.  The LZ4F format isn't
compatible with the LZ4 format, so (unlike changing to the streaming
API) that's not something we can change in a bugfix release.  I consider
this an open item.
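
For illustration, the LZ4F frame API is used roughly as below (a minimal
sketch: error handling is omitted, each row is assumed to fit in CHUNK bytes
so that one LZ4F_compressBound() sizing suffices, and write_out() stands in
for the archive's write callback):

    #include <stdlib.h>
    #include <lz4frame.h>

    #define CHUNK 65536                       /* assumed per-row upper bound */

    extern void write_out(const void *buf, size_t len);  /* assumed callback */

    void
    lz4f_compress_rows(const void **rows, const size_t *lens, int nrows)
    {
        LZ4F_cctx  *ctx;
        size_t      cap = LZ4F_compressBound(CHUNK, NULL);
        char       *out = malloc(cap);
        size_t      n;

        LZ4F_createCompressionContext(&ctx, LZ4F_VERSION);

        /* one frame header for the whole stream, not one per row */
        n = LZ4F_compressBegin(ctx, out, cap, NULL);
        write_out(out, n);

        for (int i = 0; i < nrows; i++)
        {
            /* rows accumulate inside the frame; no per-row block header */
            n = LZ4F_compressUpdate(ctx, out, cap, rows[i], lens[i], NULL);
            write_out(out, n);
        }

        /* flush the buffered data and emit the frame footer */
        n = LZ4F_compressEnd(ctx, out, cap, NULL);
        write_out(out, n);

        LZ4F_freeCompressionContext(ctx);
        free(out);
    }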

With the LZ4F API in 004, -Fp and -Fc are essentially the same size
(like gzip).  (Oh, and the output is three times smaller, too.)

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c
4155448
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c
4156548

-- 
Justin

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Sunday, February 26th, 2023 at 3:59 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
>
>
> On 2/25/23 06:02, Justin Pryzby wrote:
>
> > I have some fixes (attached) and questions while polishing the patch for
> > zstd compression. The fixes are small and could be integrated with the
> > patch for zstd, but could be applied independently.
> >
> > - I'm unclear about get_error_func(). That's called in three places
> > from pg_backup_directory.c, after failures from write_func(), to
> > supply a compression-specific error message to pg_fatal(). But it's
> > not being used outside of directory format, nor for errors for other
> > function pointers, or even for all errors in write_func(). Is there
> > some reason why each compression method's write_func() shouldn't call
> > pg_fatal() directly, with its compression-specific message ?
>
>
> I think there are a couple more places that might/should call
> get_error_func(). For example ahwrite() in pg_backup_archiver.c now
> simply does
>
> if (bytes_written != size * nmemb)
> WRITE_ERROR_EXIT;
>
> but perhaps it should call get_error_func() too. There are probably
> other places that call write_func() and should use get_error_func().

Agreed, calling get_error_func() would be preferable to a fatal error. It
should be the caller of the API who decides how to proceed.

>
> > - I still think supports_compression() should be renamed, or made into a
> > static function in the necessary file. The main reason is that it's
> > more clear what it indicates - whether compression is "implemented by
> > pgdump" and not whether compression is "supported by this postgres
> > build". It also seems possible that we'd want to add a function
> > called something like supports_compression(), indicating whether the
> > algorithm is supported by the current build. It'd be better if pgdump
> > didn't subjugate that name.
>
>
> If we choose to rename this to have pgdump_ prefix, fine with me. But I
> don't think there's a realistic chance of conflict, as it's restricted
> to pgdump header etc. And it's not part of an API, so I guess we could
> rename that in the future if needed.

Agreed, it is very unrealistic that one will include that header file anywhere
but within pg_dump. Also, I think that adding a prefix, "pgdump", "pg_dump",
or similar does not add value and subtracts from readability.

>
> > - Finally, the "Nothing to do in the default case" comment comes from
> > Michael's commit 5e73a6048:
> >
> > + /*
> > + * Custom and directory formats are compressed by default with gzip when
> > + * available, not the others.
> > + */
> > + if ((archiveFormat == archCustom || archiveFormat == archDirectory) &&
> > + !user_compression_defined)
> > {
> > #ifdef HAVE_LIBZ
> > - if (archiveFormat == archCustom || archiveFormat == archDirectory)
> > - compressLevel = Z_DEFAULT_COMPRESSION;
> > - else
> > + parse_compress_specification(PG_COMPRESSION_GZIP, NULL,
> > + &compression_spec);
> > +#else
> > + /* Nothing to do in the default case */
> > #endif
> > - compressLevel = 0;
> > }
> >
> > As the comment says: for -Fc and -Fd, the compression is set to zlib, if
> > enabled, and when not otherwise specified by the user.
> >
> > Before 5e73a6048, this set compressLevel=0 for -Fp and -Ft, and when
> > zlib was unavailable.
> >
> > But I'm not sure why there's now an empty "#else". I also don't know
> > what "the default case" refers to.
> >
> > Maybe the best thing here is to move the preprocessor #if, since it's no
> > longer in the middle of a runtime conditional:
> >
> > #ifdef HAVE_LIBZ
> > + if ((archiveFormat == archCustom || archiveFormat == archDirectory) &&
> > + !user_compression_defined)
> > + parse_compress_specification(PG_COMPRESSION_GZIP, NULL,
> > + &compression_spec);
> > #endif
> >
> > ...but that elicits a warning about "variable set but not used"...
>
>
> Not sure, I need to think about this a bit.

Not having warnings is preferable, isn't it? I can understand the confusion
on the message though. Maybe a phrasing like:
/* Nothing to do for the default case when LIBZ is not available */
would be easier to understand.

Cheers,
//Georgios

>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Saturday, February 25th, 2023 at 3:05 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:


>
>
> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
>
> > I have some fixes (attached) and questions while polishing the patch for
> > zstd compression. The fixes are small and could be integrated with the
> > patch for zstd, but could be applied independently.


Please find some comments on the rest of the fixes patch that Tomas has not
commented on.

            can be compressed with the <application>gzip</application> or
-           <application>lz4</application>tool.
+           <application>lz4</application> tools.

+1

         The compression method can be set to <literal>gzip</literal> or
-        <literal>lz4</literal> or <literal>none</literal> for no compression.
+        <literal>lz4</literal>, or <literal>none</literal> for no compression.

I am not a native English speaker. Yet I think that if one adds commas
in one of the options, then one should add commas to all the options.
Namely, the above is missing a comma between gzip and lz4. However, I
think that not having any commas still works grammatically and
syntactically.

-               /*
-                * A level of zero simply copies the input one block at the time. This
-                * is probably not what the user wanted when calling this interface.
-                */
-               if (cs->compression_spec.level == 0)
-                       pg_fatal("requested to compress the archive yet no level was specified");


I disagree with this change. WriteDataToArchiveGzip() is far away from
whatever the code in pg_dump.c is doing. Any invalid value for
level will emit an error when the proper gzip/zlib code is
called. A zero value, however, will not emit such an error. Having the
extra check there is a future-proof guarantee at a very low cost.
Furthermore, it quickly informs the reader of the code about that
specific value, helping with readability and comprehension.

If any change is required, something which I strongly vote
against, I would at least recommend replacing it with an
assertion.
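
If it were turned into an assertion, it would be as simple as (sketch):

    Assert(cs->compression_spec.level != 0);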

- * Initialize a compress file stream. Deffer the compression algorithm
+ * Initialize a compress file stream. Infer the compression algorithm

:+1:

-       # Skip command-level tests for gzip if there is no support for it.
+       # Skip command-level tests for gzip/lz4 if they're not supported.

We will be back at that again soon. Maybe change to:

Skip command-level test for unsupported compression methods

To include everything.


-               ($pgdump_runs{$run}->{compile_option} eq 'gzip' && !$supports_gzip) ||
-               ($pgdump_runs{$run}->{compile_option} eq 'lz4' && !$supports_lz4))
+               (($pgdump_runs{$run}->{compile_option} eq 'gzip' && !$supports_gzip) ||
+               ($pgdump_runs{$run}->{compile_option} eq 'lz4' && !$supports_lz4)))

Good catch, :+1:

Cheers,
//Georgios

> --
> Justin



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Thu, Feb 23, 2023 at 09:24:46PM +0100, Tomas Vondra wrote:
> On 2/23/23 16:26, Tomas Vondra wrote:
> > Thanks for v30 with the updated commit messages. I've pushed 0001 after
> > fixing a comment typo and removing (I think) an unnecessary change in an
> > error message.
> > 
> > I'll give the buildfarm a bit of time before pushing 0002 and 0003.
> > 
> 
> I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
> and marked the CF entry as committed. Thanks for the patch!

I found that e9960732a broke writing of empty gzip-compressed data,
specifically LOs.  pg_dump succeeds, but then the restore fails:

postgres=# SELECT lo_create(1234);
lo_create | 1234

$ time ./src/bin/pg_dump/pg_dump -h /tmp -d postgres -Fc |./src/bin/pg_dump/pg_restore -f /dev/null -v 
pg_restore: implied data-only restore
pg_restore: executing BLOB 1234
pg_restore: processing BLOBS
pg_restore: restoring large object with OID 1234
pg_restore: error: could not uncompress data: (null)

The inline patch below fixes it, but you won't be able to apply it
directly, as it's on top of other patches which rename the functions
back to "Zlib" and rearranges the functions to their original order, to
allow running:

git diff --diff-algorithm=minimal -w e9960732a~:./src/bin/pg_dump/compress_io.c ./src/bin/pg_dump/compress_gzip.c

The current function order avoids 3 lines of declarations, but it's
obviously pretty useful to be able to run that diff command.  I already
argued for not calling the functions "Gzip" on the grounds that the name
was inaccurate.

I'd want to create an empty large object in src/test/sql/largeobject.sql
to get this tested during pg_upgrade.  But unfortunately that
doesn't use -Fc, so this isn't hit.  Empty input is an important enough
test case to justify a TAP test, if there's no better way.

diff --git a/src/bin/pg_dump/compress_gzip.c b/src/bin/pg_dump/compress_gzip.c
index f3f5e87c9a8..68f3111b2fe 100644
--- a/src/bin/pg_dump/compress_gzip.c
+++ b/src/bin/pg_dump/compress_gzip.c
@@ -55,6 +55,32 @@ InitCompressorZlib(CompressorState *cs,
     gzipcs = (ZlibCompressorState *) pg_malloc0(sizeof(ZlibCompressorState));
 
     cs->private_data = gzipcs;
+
+    if (cs->writeF)
+    {
+        z_streamp    zp;
+        zp = gzipcs->zp = (z_streamp) pg_malloc0(sizeof(z_stream));
+        zp->zalloc = Z_NULL;
+        zp->zfree = Z_NULL;
+        zp->opaque = Z_NULL;
+
+        /*
+         * outsize is the buffer size we tell zlib it can output to.  We
+         * actually allocate one extra byte because some routines want to append a
+         * trailing zero byte to the zlib output.
+         */
+
+        gzipcs->outbuf = pg_malloc(ZLIB_OUT_SIZE + 1);
+        gzipcs->outsize = ZLIB_OUT_SIZE;
+
+        if (deflateInit(gzipcs->zp, cs->compression_spec.level) != Z_OK)
+            pg_fatal("could not initialize compression library: %s",
+                    zp->msg);
+
+        /* Just be paranoid - maybe End is called after Start, with no Write */
+        zp->next_out = gzipcs->outbuf;
+        zp->avail_out = gzipcs->outsize;
+    }
 }
 
 static void
@@ -63,7 +89,7 @@ EndCompressorZlib(ArchiveHandle *AH, CompressorState *cs)
     ZlibCompressorState *gzipcs = (ZlibCompressorState *) cs->private_data;
     z_streamp    zp;
 
-    if (gzipcs->zp)
+    if (cs->writeF != NULL)
     {
         zp = gzipcs->zp;
         zp->next_in = NULL;
@@ -131,29 +157,6 @@ WriteDataToArchiveZlib(ArchiveHandle *AH, CompressorState *cs,
                        const void *data, size_t dLen)
 {
     ZlibCompressorState *gzipcs = (ZlibCompressorState *) cs->private_data;
-    z_streamp    zp;
-
-    if (!gzipcs->zp)
-    {
-        zp = gzipcs->zp = (z_streamp) pg_malloc(sizeof(z_stream));
-        zp->zalloc = Z_NULL;
-        zp->zfree = Z_NULL;
-        zp->opaque = Z_NULL;
-
-        /*
-         * outsize is the buffer size we tell zlib it can output to.  We
-         * actually allocate one extra byte because some routines want to
-         * append a trailing zero byte to the zlib output.
-         */
-        gzipcs->outbuf = pg_malloc(ZLIB_OUT_SIZE + 1);
-        gzipcs->outsize = ZLIB_OUT_SIZE;
-
-        if (deflateInit(zp, cs->compression_spec.level) != Z_OK)
-            pg_fatal("could not initialize compression library: %s", zp->msg);
-
-        zp->next_out = gzipcs->outbuf;
-        zp->avail_out = gzipcs->outsize;
-    }
 
     gzipcs->zp->next_in = (void *) unconstify(void *, data);
     gzipcs->zp->avail_in = dLen;

Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Tue, Feb 28, 2023 at 05:58:34PM -0600, Justin Pryzby wrote:
> I found that e9960732a broke writing of empty gzip-compressed data,
> specifically LOs.  pg_dump succeeds, but then the restore fails:

The number of issues you have been reporting here begins to worry
me.  How many of them have you found?  Is it right to assume that all
of them have basically 03d02f5 as their oldest origin point?
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Wednesday, March 1st, 2023 at 12:58 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:



> I found that e9960732a broke writing of empty gzip-compressed data,
> specifically LOs. pg_dump succeeds, but then the restore fails:
>
> postgres=# SELECT lo_create(1234);
> lo_create | 1234
>
> $ time ./src/bin/pg_dump/pg_dump -h /tmp -d postgres -Fc |./src/bin/pg_dump/pg_restore -f /dev/null -v
> pg_restore: implied data-only restore
> pg_restore: executing BLOB 1234
> pg_restore: processing BLOBS
> pg_restore: restoring large object with OID 1234
> pg_restore: error: could not uncompress data: (null)
>

Thank you for looking. This was an untested case.

> The inline patch below fixes it, but you won't be able to apply it
> directly, as it's on top of other patches which rename the functions
> back to "Zlib" and rearranges the functions to their original order, to
> allow running:
>
> git diff --diff-algorithm=minimal -w e9960732a~:./src/bin/pg_dump/compress_io.c ./src/bin/pg_dump/compress_gzip.c
>

Please find a patch attached that can be applied directly.

> The current function order avoids 3 lines of declarations, but it's
> obviously pretty useful to be able to run that diff command. I already
> argued for not calling the functions "Gzip" on the grounds that the name
> was inaccurate.

I have no idea why we are back on the naming issue. I stand by the name
because in my humble opinion it helps the code reader. There is a certain
uniformity when the compression_spec.algorithm and the compressor
functions match as the following code sample shows.

    if (compression_spec.algorithm == PG_COMPRESSION_NONE)
        InitCompressorNone(cs, compression_spec);
    else if (compression_spec.algorithm == PG_COMPRESSION_GZIP)
        InitCompressorGzip(cs, compression_spec);
    else if (compression_spec.algorithm == PG_COMPRESSION_LZ4)
        InitCompressorLZ4(cs, compression_spec);

When the reader wants to see what happens when the PG_COMPRESSION_XXX
is set, they simply have to search for the XXX part. I think that this is
justification enough for the use of the names.

>
> I'd want to create an empty large object in src/test/sql/largeobject.sql
>> to get this tested during pg_upgrade. But unfortunately that
> doesn't use -Fc, so this isn't hit. Empty input is an important enough
> test case to justify a tap test, if there's no better way.

Please find in the attached a test case that exercises this codepath.

Cheers,
//Georgios
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
On 3/1/23 08:24, Michael Paquier wrote:
> On Tue, Feb 28, 2023 at 05:58:34PM -0600, Justin Pryzby wrote:
>> I found that e9960732a broke writing of empty gzip-compressed data,
>> specifically LOs.  pg_dump succeeds, but then the restore fails:
> 
> The number of issues you have been reporting here begins to worry
> me.  How many of them have you found?  Is it right to assume that all
> of them have basically 03d02f5 as their oldest origin point?

AFAICS a lot of the issues are more a discussion about wording in a
couple places, whether it's nicer to do A or B, name the functions
differently or what.

I'm aware of three genuine issues that I intend to fix shortly:

1) incorrect "if" condition in a TAP test

2) failure when compressing empty LO (which we had no test for)

3) change in handling "compression level = 0" (which I believe should be
made to behave like before)


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 3/1/23 14:39, gkokolatos@pm.me wrote:
> 
> 
> 
> 
> 
> ------- Original Message -------
> On Wednesday, March 1st, 2023 at 12:58 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> 
> 
> 
>> I found that e9960732a broke writing of empty gzip-compressed data,
>> specifically LOs. pg_dump succeeds, but then the restore fails:
>>
>> postgres=# SELECT lo_create(1234);
>> lo_create | 1234
>>
>> $ time ./src/bin/pg_dump/pg_dump -h /tmp -d postgres -Fc |./src/bin/pg_dump/pg_restore -f /dev/null -v
>> pg_restore: implied data-only restore
>> pg_restore: executing BLOB 1234
>> pg_restore: processing BLOBS
>> pg_restore: restoring large object with OID 1234
>> pg_restore: error: could not uncompress data: (null)
>>
> 
> Thank you for looking. This was an untested case.
> 

Yeah :-(

>> The inline patch below fixes it, but you won't be able to apply it
>> directly, as it's on top of other patches which rename the functions
>> back to "Zlib" and rearranges the functions to their original order, to
>> allow running:
>>
>> git diff --diff-algorithm=minimal -w e9960732a~:./src/bin/pg_dump/compress_io.c ./src/bin/pg_dump/compress_gzip.c
>>
> 
> Please find a patch attached that can be applied directly.
> 
>> The current function order avoids 3 lines of declarations, but it's
>> obviously pretty useful to be able to run that diff command. I already
>> argued for not calling the functions "Gzip" on the grounds that the name
>> was inaccurate.
> 
> I have no idea why we are back on the naming issue. I stand by the name
> because in my humble opinion it helps the code reader. There is a certain
> uniformity when the compression_spec.algorithm and the compressor
> functions match as the following code sample shows.
> 
>     if (compression_spec.algorithm == PG_COMPRESSION_NONE)         
>         InitCompressorNone(cs, compression_spec);
>     else if (compression_spec.algorithm == PG_COMPRESSION_GZIP)
>         InitCompressorGzip(cs, compression_spec);
>     else if (compression_spec.algorithm == PG_COMPRESSION_LZ4)
>         InitCompressorLZ4(cs, compression_spec);        
>                                                  
> When the reader wants to see what happens when the PG_COMPRESSION_XXX
> is set, they simply have to search for the XXX part. I think that this is
> justification enough for the use of the names.
> 

I don't recall the previous discussion about the naming, but I'm not
sure why it would be inaccurate. We call it 'gzip' pretty much
everywhere, and I agree with Georgios that it helps to make this
consistent with the PG_COMPRESSION_ stuff.

The one thing that concerned me while reviewing it earlier was that it
might make backpatching harder. But that's mostly irrelevant due to
all the other changes I think.

>>
>> I'd want to create an empty large object in src/test/sql/largeobject.sql
>> to get this tested during pg_upgrade. But unfortunately that
>> doesn't use -Fc, so this isn't hit. Empty input is an important enough
>> test case to justify a tap test, if there's no better way.
> 
> Please find in the attached a test case that exercises this codepath.
> 

Thanks. That seems correct to me, but I find it somewhat confusing,
because we now have

 DeflateCompressorInit vs. InitCompressorGzip

 DeflateCompressorEnd vs. EndCompressorGzip

 DeflateCompressorData - The name doesn't really say what it does (would
                         be better to have a verb in there, I think).

I wonder if we can make this somehow clearer?

Also, InitCompressorGzip says this:

   /*
    * If the caller has defined a write function, prepare the necessary
    * state. Avoid initializing during the first write call, because End
    * may be called without ever writing any data.
    */
    if (cs->writeF)
        DeflateCompressorInit(cs);

Does it actually make sense to not have writeF defined in some cases?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 2/27/23 15:56, gkokolatos@pm.me wrote:
> 
> 
> 
> 
> 
> ------- Original Message -------
> On Saturday, February 25th, 2023 at 3:05 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> 
> 
>>
>>
>> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
>>
>>> I have some fixes (attached) and questions while polishing the patch for
>>> zstd compression. The fixes are small and could be integrated with the
>>> patch for zstd, but could be applied independently.
> 
> 
> Please find some comments on the rest of the fixes patch that Tomas has not
> commented on.
> 
>             can be compressed with the <application>gzip</application> or
> -           <application>lz4</application>tool.
> +           <application>lz4</application> tools.
> 
> +1
> 
>          The compression method can be set to <literal>gzip</literal> or
> -        <literal>lz4</literal> or <literal>none</literal> for no compression.
> +        <literal>lz4</literal>, or <literal>none</literal> for no compression.
> 
> I am not a native English speaker. Yet I think that if one adds commas
> in one of the options, then one should add commas to all the options.
> Namely, the above is missing a comma between gzip and lz4. However, I
> think that not having any commas still works grammatically and
> syntactically.
> 

I pushed a fix with most of these wording changes. As for this comma, I
believe the correct style is

   a, b, or c

At least that's what the other places in the pg_dump.sgml file do.

> -               ($pgdump_runs{$run}->{compile_option} eq 'gzip' && !$supports_gzip) ||
> -               ($pgdump_runs{$run}->{compile_option} eq 'lz4' && !$supports_lz4))
> +               (($pgdump_runs{$run}->{compile_option} eq 'gzip' && !$supports_gzip) ||
> +               ($pgdump_runs{$run}->{compile_option} eq 'lz4' && !$supports_lz4)))
> 

Pushed a fix for this too.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 2/25/23 15:05, Justin Pryzby wrote:
> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
>> I have some fixes (attached) and questions while polishing the patch for
>> zstd compression.  The fixes are small and could be integrated with the
>> patch for zstd, but could be applied independently.
> 
> One more - WriteDataToArchiveGzip() says:
> 
> +       if (cs->compression_spec.level == 0)
> +           pg_fatal("requested to compress the archive yet no level was specified");
> 
> That was added at e9960732a.  
> 
> But if you specify gzip:0, the compression level is already enforced by
> validate_compress_specification(), before hitting gzip.c:
> 
> | pg_dump: error: invalid compression specification: compression algorithm "gzip" expects a compression level between 1 and 9 (default at -1)
 
> 
> 5e73a6048 intended that to work as before, and you *can* specify -Z0:
> 
>     The change is backward-compatible, hence specifying only an integer
>     leads to no compression for a level of 0 and gzip compression when the
>     level is greater than 0.
> 
>     $ time ./src/bin/pg_dump/pg_dump -h /tmp regression -t int8_tbl -Fp --compress 0 |file -
>     /dev/stdin: ASCII text
> 

FWIW I agree we should make this backwards-compatible - accept "0" and
treat it as no compression.

Georgios, can you prepare a patch doing that?


regards
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 2/27/23 05:49, Justin Pryzby wrote:
> On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote:
>> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
>>> I have some fixes (attached) and questions while polishing the patch for
>>> zstd compression.  The fixes are small and could be integrated with the
>>> patch for zstd, but could be applied independently.
>>
>> One more - WriteDataToArchiveGzip() says:
> 
> One more again.
> 
> The LZ4 path is using non-streaming mode, which compresses each block
> without persistent state, giving poor compression for -Fc compared with
> -Fp.  If the data is highly compressible, the difference can be orders
> of magnitude.
> 
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c
> 12351763
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> 21890708
> 
> That's not true for gzip:
> 
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c
> 2118869
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c
> 2115832
> 
> The function ought to at least use streaming mode, so each block/row
> isn't compressed in isolation.  003 is a simple patch to use
> streaming mode, which improves the -Fc case:
> 
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> 15178283
> 
> However, that still flushes the compression buffer, writing a block
> header, for every row.  With a single-column table, pg_dump -Fc -Z lz4
> still outputs ~10% *more* data than with no compression at all.  And
> that's for compressible data.
> 
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c
> 12890296
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c
> 11890296
> 
> I think this should use the LZ4F API with frames, which are buffered to
> avoid outputting a header for every single row.  The LZ4F format isn't
> compatible with the LZ4 format, so (unlike changing to the streaming
> API) that's not something we can change in a bugfix release.  I consider
> this an open item.
> 
> With the LZ4F API in 004, -Fp and -Fc are essentially the same size
> (like gzip).  (Oh, and the output is three times smaller, too.)
> 
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c
> 4155448
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c
> 4156548
> 

Thanks. Those are definitely interesting improvements/optimizations!

I suggest we track them as a separate patch series - please add them to
the CF app (I guess you'll have to add them to 2023-07 at this point,
but we can get them in, I think).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Wed, Mar 01, 2023 at 01:39:14PM +0000, gkokolatos@pm.me wrote:
> On Wednesday, March 1st, 2023 at 12:58 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> 
> > The current function order avoids 3 lines of declarations, but it's
> > obviously pretty useful to be able to run that diff command. I already
> > argued for not calling the functions "Gzip" on the grounds that the name
> > was inaccurate.
> 
> I have no idea why we are back on the naming issue. I stand by the name
> because in my humble opinion it helps the code reader. There is a certain
> uniformity when the compression_spec.algorithm and the compressor
> functions match as the following code sample shows.

I mentioned that it's because this allows usefully running "diff"
against the previous commits.

>     if (compression_spec.algorithm == PG_COMPRESSION_NONE)         
>         InitCompressorNone(cs, compression_spec);
>     else if (compression_spec.algorithm == PG_COMPRESSION_GZIP)
>         InitCompressorGzip(cs, compression_spec);
>     else if (compression_spec.algorithm == PG_COMPRESSION_LZ4)
>         InitCompressorLZ4(cs, compression_spec);        
>                                                  
> When the reader wants to see what happens when the PG_COMPRESSION_XXX
> is set, has to simply search for the XXX part. I think that this is
> justification enough for the use of the names.

You're right about that.

But (with the exception of InitCompressorGzip), I'm referring to the
naming of internal functions, static to gzip.c, so renaming can't be
said to cause a loss of clarity.

> > I'd want to create an empty large object in src/test/sql/largeobject.sql
> > to exercise this tested during pgupgrade. But unfortunately that
> > doesn't use -Fc, so this isn't hit. Empty input is an important enough
> > test case to justify a tap test, if there's no better way.
> 
> Please find in the attached a test case that exercises this codepath.

Thanks for writing it.

This patch could be an opportunity to improve the "diff" output, without
renaming anything.

The old order of functions was:
-InitCompressorZlib
-EndCompressorZlib
-DeflateCompressorZlib
-WriteDataToArchiveZlib
-ReadDataFromArchiveZlib

If you put DeflateCompressorEnd immediately after DeflateCompressorInit,
diff works nicely.  You'll have to add at least one declaration, which
seems very worth it.

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Wednesday, March 1st, 2023 at 5:20 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
>
>
> On 2/25/23 15:05, Justin Pryzby wrote:
>
> > On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
> >
> > > I have some fixes (attached) and questions while polishing the patch for
> > > zstd compression. The fixes are small and could be integrated with the
> > > patch for zstd, but could be applied independently.
> >
> > One more - WriteDataToArchiveGzip() says:
> >
> > + if (cs->compression_spec.level == 0)
> > + pg_fatal("requested to compress the archive yet no level was specified");
> >
> > That was added at e9960732a.
> >
> > But if you specify gzip:0, the compression level is already enforced by
> > validate_compress_specification(), before hitting gzip.c:
> >
> > | pg_dump: error: invalid compression specification: compression algorithm "gzip" expects a compression level between 1 and 9 (default at -1)
> >
> > 5e73a6048 intended that to work as before, and you can specify -Z0:
> >
> > The change is backward-compatible, hence specifying only an integer
> > leads to no compression for a level of 0 and gzip compression when the
> > level is greater than 0.
> >
> > $ time ./src/bin/pg_dump/pg_dump -h /tmp regression -t int8_tbl -Fp --compress 0 |file -
> > /dev/stdin: ASCII text
>
>
> FWIW I agree we should make this backwards-compatible - accept "0" and
> treat it as no compression.
>
> Georgios, can you prepare a patch doing that?

Please find a patch attached. However, I am a bit at a loss: the
backwards-compatible behaviour has not changed. Passing -Z0/--compress=0
does produce non-compressed output, so I am not really certain what broke
and needs fixing.

What commit 5e73a6048 failed to do is test the backwards-compatible
behaviour. The attached amends that.

Cheers,
//Georgios

>
>
> regards
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Wed, Mar 01, 2023 at 05:20:05PM +0100, Tomas Vondra wrote:
> On 2/25/23 15:05, Justin Pryzby wrote:
> > On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
> >> I have some fixes (attached) and questions while polishing the patch for
> >> zstd compression.  The fixes are small and could be integrated with the
> >> patch for zstd, but could be applied independently.
> > 
> > One more - WriteDataToArchiveGzip() says:
> > 
> > +       if (cs->compression_spec.level == 0)
> > +           pg_fatal("requested to compress the archive yet no level was specified");
> > 
> > That was added at e9960732a.  
> > 
> > But if you specify gzip:0, the compression level is already enforced by
> > validate_compress_specification(), before hitting gzip.c:
> > 
> > | pg_dump: error: invalid compression specification: compression algorithm "gzip" expects a compression level between 1 and 9 (default at -1)
> > 
> > 5e73a6048 intended that to work as before, and you *can* specify -Z0:
> > 
> >     The change is backward-compatible, hence specifying only an integer
> >     leads to no compression for a level of 0 and gzip compression when the
> >     level is greater than 0.
> > 
> >     $ time ./src/bin/pg_dump/pg_dump -h /tmp regression -t int8_tbl -Fp --compress 0 |file -
> >     /dev/stdin: ASCII text
> 
> FWIW I agree we should make this backwards-compatible - accept "0" and
> treat it as no compression.
> 
> Georgios, can you prepare a patch doing that?

I think maybe Tomas misunderstood.  What I was trying to say is that -Z
0 *is* accepted to mean no compression.  This part wasn't quoted, but I
said:

> Right now, I think that pg_fatal in gzip.c is dead code - that was first
> added in the patch version sent on 21 Dec 2022.

If you run the diff command that I've been talking about, you'll see
that InitCompressorZlib was almost unchanged - e9960732 is essentially a
refactoring.  I don't think it's desirable to add a pg_fatal() in a
function that's otherwise nearly-unchanged.  The fact that it's
nearly-unchanged is a good thing: it simplifies reading of what changed.
If someone wants to add a pg_fatal() in that code path, it'd be better
done in its own commit, with a separate message explaining the change.

If you insist on changing anything here, you might add an assertion (as
you said earlier) along with a comment like
/* -Z 0 uses the "None" compressor rather than zlib with no compression */
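
In code form, the sketch I have in mind is roughly this (signature from
memory, so treat the details loosely; the point is just the assert plus
the comment):

    static void
    WriteDataToArchiveGzip(ArchiveHandle *AH, CompressorState *cs,
                           const void *data, size_t dLen)
    {
        /* -Z 0 uses the "None" compressor rather than zlib with no compression */
        Assert(cs->compression_spec.level != 0);

        /* ... deflate and write the data, unchanged from today ... */
    }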

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 3/2/23 18:18, Justin Pryzby wrote:
> On Wed, Mar 01, 2023 at 05:20:05PM +0100, Tomas Vondra wrote:
>> On 2/25/23 15:05, Justin Pryzby wrote:
>>> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
>>>> I have some fixes (attached) and questions while polishing the patch for
>>>> zstd compression.  The fixes are small and could be integrated with the
>>>> patch for zstd, but could be applied independently.
>>>
>>> One more - WriteDataToArchiveGzip() says:
>>>
>>> +       if (cs->compression_spec.level == 0)
>>> +           pg_fatal("requested to compress the archive yet no level was specified");
>>>
>>> That was added at e9960732a.  
>>>
>>> But if you specify gzip:0, the compression level is already enforced by
>>> validate_compress_specification(), before hitting gzip.c:
>>>
>>> | pg_dump: error: invalid compression specification: compression algorithm "gzip" expects a compression level between 1 and 9 (default at -1)
>>>
>>> 5e73a6048 intended that to work as before, and you *can* specify -Z0:
>>>
>>>     The change is backward-compatible, hence specifying only an integer
>>>     leads to no compression for a level of 0 and gzip compression when the
>>>     level is greater than 0.
>>>
>>>     $ time ./src/bin/pg_dump/pg_dump -h /tmp regression -t int8_tbl -Fp --compress 0 |file -
>>>     /dev/stdin: ASCII text
>>
>> FWIW I agree we should make this backwards-compatible - accept "0" and
>> treat it as no compression.
>>
>> Georgios, can you prepare a patch doing that?
> 
> I think maybe Tomas misunderstood.  What I was trying to say is that -Z
> 0 *is* accepted to mean no compression.  This part wasn't quoted, but I
> said:
> 

Ah, I see. Well, I also tried, but with "-Z gzip:0" (and not -Z 0), and
that does fail:

  error: invalid compression specification: compression algorithm "gzip"
  expects a compression level between 1 and 9 (default at -1)

It's a bit weird these two cases behave differently, when both translate
to the same default compression method (gzip).

>> Right now, I think that pg_fatal in gzip.c is dead code - that was first
>> added in the patch version sent on 21 Dec 2022.
> 
> If you run the diff command that I've been talking about, you'll see
> that InitCompressorZlib was almost unchanged - e9960732 is essentially a
> refactoring.  I don't think it's desirable to add a pg_fatal() in a
> function that's otherwise nearly-unchanged.  The fact that it's
> nearly-unchanged is a good thing: it simplifies reading of what changed.
> If someone wants to add a pg_fatal() in that code path, it'd be better
> done in its own commit, with a separate message explaining the change.
> 
> If you insist on changing anything here, you might add an assertion (as
> you said earlier) along with a comment like
> /* -Z 0 uses the "None" compressor rather than zlib with no compression */
> 

Yeah, a comment would be helpful.

Also, after thinking about it a bit more, maybe having the unreachable
pg_fatal() is not a good thing, as it will just confuse people (I'd
certainly assume having such a check means there's a way in which it
might trigger). Maybe an assert would be better?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Wed, Mar 01, 2023 at 04:52:49PM +0100, Tomas Vondra wrote:
> Thanks. That seems correct to me, but I find it somewhat confusing,
> because we now have
> 
>  DeflateCompressorInit vs. InitCompressorGzip
> 
>  DeflateCompressorEnd vs. EndCompressorGzip
> 
>  DeflateCompressorData - The name doesn't really say what it does (would
>                          be better to have a verb in there, I think).
> 
> I wonder if we can make this somehow clearer?

To move things along, I updated Georgios' patch:

Rename DeflateCompressorData() to DeflateCompressorCommon();
Rearrange functions to their original order allowing a cleaner diff to the prior code;
Change pg_fatal() to an assertion+comment;
Update the commit message and fix a few typos;

> Also, InitCompressorGzip says this:
> 
>    /*
>     * If the caller has defined a write function, prepare the necessary
>     * state. Avoid initializing during the first write call, because End
>     * may be called without ever writing any data.
>     */
>     if (cs->writeF)
>         DeflateCompressorInit(cs);
>
> Does it actually make sense to not have writeF defined in some cases?

InitCompressor is being called for either reading or writing, either of
which could be null:

src/bin/pg_dump/pg_backup_custom.c:     ctx->cs = AllocateCompressor(AH->compression_spec,
src/bin/pg_dump/pg_backup_custom.c-                                                              NULL,
src/bin/pg_dump/pg_backup_custom.c-                                                              _CustomWriteFunc);
--
src/bin/pg_dump/pg_backup_custom.c:     cs = AllocateCompressor(AH->compression_spec,
src/bin/pg_dump/pg_backup_custom.c-                                                     _CustomReadFunc, NULL);

It's confusing that the comment says "Avoid initializing...".  What it
really means is "Initialize eagerly...".  But that makes more sense in
the context of the commit message for this bugfix than in a comment.  So
I changed that too.

+       /* If deflation was initialized, finalize it */
+       if (cs->private_data)
+               DeflateCompressorEnd(AH, cs);

Maybe it'd be more clear if this used "if (cs->writeF)", like in the
init function ?

-- 
Justin

Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Wed, Mar 01, 2023 at 05:39:54PM +0100, Tomas Vondra wrote:
> On 2/27/23 05:49, Justin Pryzby wrote:
> > On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote:
> >> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
> >>> I have some fixes (attached) and questions while polishing the patch for
> >>> zstd compression.  The fixes are small and could be integrated with the
> >>> patch for zstd, but could be applied independently.
> >>
> >> One more - WriteDataToArchiveGzip() says:
> > 
> > One more again.
> > 
> > The LZ4 path is using non-streaming mode, which compresses each block
> > without persistent state, giving poor compression for -Fc compared with
> > -Fp.  If the data is highly compressible, the difference can be orders
> > of magnitude.
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c
> > 12351763
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> > 21890708
> > 
> > That's not true for gzip:
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c
> > 2118869
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c
> > 2115832
> > 
> > The function ought to at least use streaming mode, so each block/row
> > isn't compressed in isolation.  003 is a simple patch to use
> > streaming mode, which improves the -Fc case:
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> > 15178283
> > 
> > However, that still flushes the compression buffer, writing a block
> > header, for every row.  With a single-column table, pg_dump -Fc -Z lz4
> > still outputs ~10% *more* data than with no compression at all.  And
> > that's for compressible data.
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c
> > 12890296
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c
> > 11890296
> > 
> > I think this should use the LZ4F API with frames, which are buffered to
> > avoid outputting a header for every single row.  The LZ4F format isn't
> > compatible with the LZ4 format, so (unlike changing to the streaming
> > API) that's not something we can change in a bugfix release.  I consider
> > this an Open Item.
> > 
> > With the LZ4F API in 004, -Fp and -Fc are essentially the same size
> > (like gzip).  (Oh, and the output is three times smaller, too.)
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c
> > 4155448
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c
> > 4156548
> 
> Thanks. Those are definitely interesting improvements/optimizations!
> 
> I suggest we track them as a separate patch series - please add them to
> the CF app (I guess you'll have to add them to 2023-07 at this point,
> but we can get them in, I think).

Thanks for looking.  I'm not sure if I'm the best person to write/submit
the patch to implement that for LZ4.  Georgios, would you want to take
on this change ?

I think that needs to be changed for v16, since 1) LZ4F works so much
better like this, and 2) we can't change it later without breaking
compatibility of the dumpfiles by changing the header with another name
other than "lz4".  Also, I imagine we'd want to continue supporting the
ability to *restore* a dumpfile using the old(current) format, which
would be untestable code unless we also preserved the ability to write
it somehow (like -Z lz4-old).

One issue is that LZ4F_createCompressionContext() and
LZ4F_compressBegin() ought to be called in InitCompressorLZ4().  It
seems like it might *need* to be called there to avoid exactly the kind
of issue that I reported with empty LOs with gzip.  But
InitCompressorLZ4() isn't currently passed the ArchiveHandle, so can't
write the header.  And LZ4CompressorState has a simple char *buf, and
not a more elaborate data structure like zlib.  You could work around
that by storing the "len" of the existing buffer and flushing
it in EndCompressorLZ4(), but that adds needless complexity to the Write
and End functions.  Maybe the Init function should be passed the AH.
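
For illustration only, a rough sketch of that idea (the struct, the
chunk-size constant and the omitted error checks are my assumptions here,
not code from the attached patches):

    #define CHUNK_SIZE      (128 * 1024)    /* made-up chunk size for the sketch */

    typedef struct LZ4FCompressorState      /* hypothetical */
    {
        LZ4F_compressionContext_t ctx;
        char       *outbuf;
        size_t      outsize;
    } LZ4FCompressorState;

    static void
    InitCompressorLZ4F(ArchiveHandle *AH, CompressorState *cs)
    {
        LZ4FCompressorState *state = pg_malloc0(sizeof(LZ4FCompressorState));
        size_t      headerlen;

        LZ4F_createCompressionContext(&state->ctx, LZ4F_VERSION);

        /* large enough for any single chunk later fed to LZ4F_compressUpdate() */
        state->outsize = LZ4F_compressBound(CHUNK_SIZE, NULL);
        state->outbuf = pg_malloc(state->outsize);

        /*
         * Write the frame header eagerly, so that a data member with no rows
         * still produces a valid (empty) frame - the same class of problem
         * as the empty large object case with gzip.
         */
        headerlen = LZ4F_compressBegin(state->ctx, state->outbuf,
                                       state->outsize, NULL);
        cs->writeF(AH, state->outbuf, headerlen);

        cs->private_data = state;
    }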

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 3/9/23 17:15, Justin Pryzby wrote:
> On Wed, Mar 01, 2023 at 05:39:54PM +0100, Tomas Vondra wrote:
>> On 2/27/23 05:49, Justin Pryzby wrote:
>>> On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote:
>>>> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
>>>>> I have some fixes (attached) and questions while polishing the patch for
>>>>> zstd compression.  The fixes are small and could be integrated with the
>>>>> patch for zstd, but could be applied independently.
>>>>
>>>> One more - WriteDataToArchiveGzip() says:
>>>
>>> One more again.
>>>
>>> The LZ4 path is using non-streaming mode, which compresses each block
>>> without persistent state, giving poor compression for -Fc compared with
>>> -Fp.  If the data is highly compressible, the difference can be orders
>>> of magnitude.
>>>
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c
>>> 12351763
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
>>> 21890708
>>>
>>> That's not true for gzip:
>>>
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c
>>> 2118869
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c
>>> 2115832
>>>
>>> The function ought to at least use streaming mode, so each block/row
>>> isn't compressed in isolation.  003 is a simple patch to use
>>> streaming mode, which improves the -Fc case:
>>>
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
>>> 15178283
>>>
>>> However, that still flushes the compression buffer, writing a block
>>> header, for every row.  With a single-column table, pg_dump -Fc -Z lz4
>>> still outputs ~10% *more* data than with no compression at all.  And
>>> that's for compressible data.
>>>
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c
>>> 12890296
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c
>>> 11890296
>>>
>>> I think this should use the LZ4F API with frames, which are buffered to
>>> avoid outputting a header for every single row.  The LZ4F format isn't
>>> compatible with the LZ4 format, so (unlike changing to the streaming
>>> API) that's not something we can change in a bugfix release.  I consider
>>> this an Open Item.
>>>
>>> With the LZ4F API in 004, -Fp and -Fc are essentially the same size
>>> (like gzip).  (Oh, and the output is three times smaller, too.)
>>>
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c
>>> 4155448
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c
>>> 4156548
>>
>> Thanks. Those are definitely interesting improvements/optimizations!
>>
>> I suggest we track them as a separate patch series - please add them to
>> the CF app (I guess you'll have to add them to 2023-07 at this point,
>> but we can get them in, I think).
> 
> Thanks for looking.  I'm not sure if I'm the best person to write/submit
> the patch to implement that for LZ4.  Georgios, would you want to take
> on this change ?
> 
> I think that needs to be changed for v16, since 1) LZ4F works so much
> better like this, and 2) we can't change it later without breaking
> compatibility of the dumpfiles by changing the header with another name
> other than "lz4".  Also, I imagine we'd want to continue supporting the
> ability to *restore* a dumpfile using the old (current) format, which
> would be untestable code unless we also preserved the ability to write
> it somehow (like -Z lz4-old).
> 

I'm a bit confused about the lz4 vs. lz4f stuff, TBH. If we switch to
lz4f, doesn't that mean it (e.g. restore) won't work on systems that
only have an older lz4 version? What would/should happen if we take a
backup compressed with lz4f, and then try restoring it on an older
system where lz4 does not support lz4f?

Maybe if lz4f format is incompatible with regular lz4, we should treat
it as a separate compression method 'lz4f'?

I'm mostly afk until the end of the week, but I tried searching for lz4f
info - the results are not particularly enlightening, unfortunately.

AFAICS this only applies to lz4f stuff. Or would the streaming mode be a
breaking change too?

> One issue is that LZ4F_createCompressionContext() and
> LZ4F_compressBegin() ought to be called in InitCompressorLZ4().  It
> seems like it might *need* to be called there to avoid exactly the kind
> of issue that I reported with empty LOs with gzip.  But
> InitCompressorLZ4() isn't currently passed the ArchiveHandle, so can't
> write the header.  And LZ4CompressorState has a simple char *buf, and
> not a more elaborate data structure like zlib.  You could work around
> that by storing the "len" of the existing buffer and flushing
> it in EndCompressorLZ4(), but that adds needless complexity to the Write
> and End functions.  Maybe the Init function should be passed the AH.
> 

Not sure, but looking at GzipCompressorState I see the only extra thing
it has (compared to LZ4CompressorState) is "z_streamp". I can't
experiment with this until the end of this week, so perhaps that's not
workable, but wouldn't it be better to add a similar field into
LZ4CompressorState? Passing AH to the init function seems like a
violation of abstraction.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Thu, Mar 09, 2023 at 06:58:20PM +0100, Tomas Vondra wrote:
> I'm a bit confused about the lz4 vs. lz4f stuff, TBH. If we switch to
> lz4f, doesn't that mean it (e.g. restore) won't work on systems that
> only have an older lz4 version? What would/should happen if we take a backup
> compressed with lz4f, and then try restoring it on an older system where
> lz4 does not support lz4f?

You seem to be thinking about LZ4F as a weird, new innovation I'm
experimenting with, but compress_lz4.c already uses LZ4F for its "file"
API.  LZ4F is also what's written by the lz4 CLI tool, and I found that
LZ4F has been included in the library for ~8 years:

https://github.com/lz4/lz4/releases?page=2
r126 Dec 24, 2014
New : lz4frame API is now integrated into liblz4

> Maybe if lz4f format is incompatible with regular lz4, we should treat
> it as a separate compression method 'lz4f'?
> 
> I'm mostly afk until the end of the week, but I tried searching for lz4f
> info - the results are not particularly enlightening, unfortunately.
> 
> AFAICS this only applies to lz4f stuff. Or would the streaming mode be a
> breaking change too?

Streaming mode outputs the same format as the existing code, but gives
better compression.  We could (theoretically) change it in a bugfix
release, and old output would still be restorable (I think new output
would even be restorable with the old versions of pg_restore).

But that's not true for LZ4F.  The benefit there is that it avoids
> outputting a separate block for each row.  That's essential for narrow
tables, for which the block header currently being written has an
overhead several times larger than the data.
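
To sketch how the write path looks with frames (again only an outline,
reusing the hypothetical LZ4FCompressorState from my earlier sketch; names
and error handling are assumptions, not the attached 004 patch):

    static void
    WriteDataToArchiveLZ4F(ArchiveHandle *AH, CompressorState *cs,
                           const void *data, size_t dLen)
    {
        LZ4FCompressorState *state = (LZ4FCompressorState *) cs->private_data;
        size_t      produced;

        /*
         * LZ4F buffers input internally and only emits a block once enough
         * data has accumulated, so a narrow row does not pay for a block
         * header.  Assumes dLen is no larger than the chunk size the output
         * buffer was sized for with LZ4F_compressBound().
         */
        produced = LZ4F_compressUpdate(state->ctx, state->outbuf, state->outsize,
                                       data, dLen, NULL);
        if (LZ4F_isError(produced))
            pg_fatal("could not compress data: %s", LZ4F_getErrorName(produced));

        if (produced > 0)
            cs->writeF(AH, state->outbuf, produced);
    }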

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Fri, Mar 10, 2023 at 07:05:49AM -0600, Justin Pryzby wrote:
> On Thu, Mar 09, 2023 at 06:58:20PM +0100, Tomas Vondra wrote:
>> I'm a bit confused about the lz4 vs. lz4f stuff, TBH. If we switch to
>> lz4f, doesn't that mean it (e.g. restore) won't work on systems that
>> only have an older lz4 version? What would/should happen if we take a backup
>> compressed with lz4f, and then try restoring it on an older system where
>> lz4 does not support lz4f?
>
> You seem to be thinking about LZ4F as a weird, new innovation I'm
> experimenting with, but compress_lz4.c already uses LZ4F for its "file"
> API.

Note: we already use lz4 frames in pg_receivewal (for WAL) and
pg_basebackup (bbstreamer).
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Alexander Lakhin
Date:
Hello,
23.02.2023 23:24, Tomas Vondra wrote:
> On 2/23/23 16:26, Tomas Vondra wrote:
>> Thanks for v30 with the updated commit messages. I've pushed 0001 after
>> fixing a comment typo and removing (I think) an unnecessary change in an
>> error message.
>>
>> I'll give the buildfarm a bit of time before pushing 0002 and 0003.
>>
> I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
> and marked the CF entry as committed. Thanks for the patch!
>
> I wonder how difficult would it be to add the zstd compression, so that
> we don't have the annoying "unsupported" cases.

With the patch 0003 committed, a single warning -Wtype-limits appeared in the
master branch:
$ CPPFLAGS="-Og -Wtype-limits" ./configure --with-lz4 -q && make -s -j8
compress_lz4.c: In function ‘LZ4File_gets’:
compress_lz4.c:492:19: warning: comparison of unsigned expression in ‘< 0’ is 
always false [-Wtype-limits]
   492 |         if (dsize < 0)
       |
(I wonder, is it accidental that there are no other places that trigger
the warning, or did some buildfarm animals have this check enabled before?)

It is not a false positive as can be proved by the 002_pg_dump.pl modified as
follows:
-                       program => $ENV{'LZ4'},
+                       program => 'mv',
                         args    => [
-                               '-z', '-f', '--rm',
"$tempdir/compression_lz4_dir/blobs.toc",
"$tempdir/compression_lz4_dir/blobs.toc.lz4",
                         ],
                 },
A diagnostic logging added shows:
LZ4File_gets() after LZ4File_read_internal; dsize: 18446744073709551615

and pg_restore fails with:
error: invalid line in large object TOC file 
".../src/bin/pg_dump/tmp_check/tmp_test_22ri/compression_lz4_dir/blobs.toc": "????"

Best regards,
Alexander



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:
------- Original Message -------
On Saturday, March 11th, 2023 at 7:00 AM, Alexander Lakhin <exclusion@gmail.com> wrote:

> Hello,
> 23.02.2023 23:24, Tomas Vondra wrote:
>
> > On 2/23/23 16:26, Tomas Vondra wrote:
> >
> > > Thanks for v30 with the updated commit messages. I've pushed 0001 after
> > > fixing a comment typo and removing (I think) an unnecessary change in an
> > > error message.
> > >
> > > I'll give the buildfarm a bit of time before pushing 0002 and 0003.
> >
> > I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
> > and marked the CF entry as committed. Thanks for the patch!
> >
> > I wonder how difficult would it be to add the zstd compression, so that
> > we don't have the annoying "unsupported" cases.
>
>
> With the patch 0003 committed, a single warning -Wtype-limits appeared in the
> master branch:
> $ CPPFLAGS="-Og -Wtype-limits" ./configure --with-lz4 -q && make -s -j8
> compress_lz4.c: In function ‘LZ4File_gets’:
> compress_lz4.c:492:19: warning: comparison of unsigned expression in ‘< 0’ is
> always false [-Wtype-limits]
> 492 | if (dsize < 0)
> |

Thank you Alexander. Please find attached an attempt to address it.

> (I wonder, is it accidental that there no other places that triggers
> the warning, or some buildfarm animals had this check enabled before?)

I cannot answer about the buildfarms. Do you think that adding an explicit
check for this warning in meson would help? I am a bit uncertain, as I
think that -Wtype-limits is included in -Wextra.

@@ -1748,6 +1748,7 @@ common_warning_flags = [
   '-Wshadow=compatible-local',
   # This was included in -Wall/-Wformat in older GCC versions
   '-Wformat-security',
+  '-Wtype-limits',
 ]

>
> It is not a false positive as can be proved by the 002_pg_dump.pl modified as
> follows:
> - program => $ENV{'LZ4'},
>
> + program => 'mv',
>
> args => [
>
> - '-z', '-f', '--rm',
> "$tempdir/compression_lz4_dir/blobs.toc",
> "$tempdir/compression_lz4_dir/blobs.toc.lz4",
> ],
> },

Correct, it is not a false positive. The existing testing framework provides
limited support for exercising error branches, especially when those
depend on generated output.

> A diagnostic logging added shows:
> LZ4File_gets() after LZ4File_read_internal; dsize: 18446744073709551615
>
> and pg_restore fails with:
> error: invalid line in large object TOC file
> ".../src/bin/pg_dump/tmp_check/tmp_test_22ri/compression_lz4_dir/blobs.toc": "????"

It is a good thing that the restore fails with bad input. Yet it should
have failed earlier. The attached makes certain it does fail earlier.
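
For the archives, the shape of the fix is roughly the following (a sketch
from memory; the attached patch is authoritative and may differ in detail):

    static char *
    LZ4File_gets(char *ptr, int size, CompressFileHandle *CFH)
    {
        LZ4File    *fs = (LZ4File *) CFH->private_data;
        int         ret;        /* int, not size_t, so that -1 survives */

        ret = LZ4File_read_internal(fs, ptr, size, true);
        if (ret < 0)
            pg_fatal("could not read from input file");

        /* Done reading */
        if (ret == 0)
            return NULL;

        return ptr;
    }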

Cheers,
//Georgios

>
> Best regards,
> Alexander
Attachment

Re: Add LZ4 compression in pg_dump

From
Alexander Lakhin
Date:
Hi Georgios,

11.03.2023 13:50, gkokolatos@pm.me wrote:
> I can not answer about the buildfarms. Do you think that adding an explicit
> check for this warning in meson would help? I am a bit uncertain as I think
> that type-limits are included in extra.
>
> @@ -1748,6 +1748,7 @@ common_warning_flags = [
>     '-Wshadow=compatible-local',
>     # This was included in -Wall/-Wformat in older GCC versions
>     '-Wformat-security',
> +  '-Wtype-limits',
>   ]
I'm not sure that I can promote additional checks (or determine where
to put them), but if some patch introduces a warning of a type that wasn't
present before, I think it's worth eliminating the warning (if it is
sensible) to keep the source-code check baseline at the same level,
or even lift it gradually.
I've also found that the same commit introduced a single instance of
the analyzer-possible-null-argument warning:
CPPFLAGS="-Og -fanalyzer -Wno-analyzer-malloc-leak -Wno-analyzer-file-leak 
-Wno-analyzer-null-dereference -Wno-analyzer-shift-count-overflow 
-Wno-analyzer-free-of-non-heap -Wno-analyzer-null-argument 
-Wno-analyzer-double-free -Wanalyzer-possible-null-argument" ./configure 
--with-lz4 -q && make -s -j8
compress_io.c: In function ‘hasSuffix’:
compress_io.c:158:47: warning: use of possibly-NULL ‘filename’ where non-null 
expected [CWE-690] [-Wanalyzer-possible-null-argument]
   158 |         int                     filenamelen = strlen(filename);
       | ^~~~~~~~~~~~~~~~
   ‘InitDiscoverCompressFileHandle’: events 1-3
...

(I use gcc-11.3.)
As far as I can see, many existing uses of strdup() are followed by a check
for a null result, so maybe it's common practice and a similar check should
be added in InitDiscoverCompressFileHandle().
(There are also a couple of other warnings introduced with the lz4 compression
patches, but those are not unique, so maybe they aren't worth fixing.)
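
Regarding the strdup() point, the usual frontend idiom would be one of these
(a sketch; pg_strdup() from fe_memutils.h exits on allocation failure):

    char       *fname;

    /* in InitDiscoverCompressFileHandle(), instead of a bare strdup() */
    fname = strdup(filename);
    if (fname == NULL)
        pg_fatal("out of memory");

    /* ... or, more simply ... */
    fname = pg_strdup(filename);        /* never returns NULL */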

>> It is a good thing that the restore fails with bad input. Yet it should
>> have failed earlier. The attached makes certain it does fail earlier.
>>
Thanks! Your patch definitely fixes the issue.

Best regards,
Alexander



Re: Add LZ4 compression in pg_dump

From
Peter Eisentraut
Date:
On 11.03.23 07:00, Alexander Lakhin wrote:
> Hello,
> 23.02.2023 23:24, Tomas Vondra wrote:
>> On 2/23/23 16:26, Tomas Vondra wrote:
>>> Thanks for v30 with the updated commit messages. I've pushed 0001 after
>>> fixing a comment typo and removing (I think) an unnecessary change in an
>>> error message.
>>>
>>> I'll give the buildfarm a bit of time before pushing 0002 and 0003.
>>>
>> I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
>> and marked the CF entry as committed. Thanks for the patch!
>>
>> I wonder how difficult would it be to add the zstd compression, so that
>> we don't have the annoying "unsupported" cases.
> 
> With the patch 0003 committed, a single warning -Wtype-limits appeared 
> in the
> master branch:
> $ CPPFLAGS="-Og -Wtype-limits" ./configure --with-lz4 -q && make -s -j8
> compress_lz4.c: In function ‘LZ4File_gets’:
> compress_lz4.c:492:19: warning: comparison of unsigned expression in ‘< 
> 0’ is always false [-Wtype-limits]
>    492 |         if (dsize < 0)
>        |
> (I wonder, is it accidental that there no other places that triggers
> the warning, or some buildfarm animals had this check enabled before?)

I think there is an underlying problem in this code that it dances back 
and forth between size_t and int in an unprincipled way.

In the code that triggers the warning, dsize is size_t.  dsize is the 
return from LZ4File_read_internal(), which is declared to return int. 
The variable that LZ4File_read_internal() returns in the success case is 
size_t, but in case of an error it returns -1.  (So the code that is 
warning is meaning to catch this error case, but it won't ever work.) 
Further below LZ4File_read_internal() calls LZ4File_read_overflow(), 
which is declared to return int, but in some cases it returns 
fs->overflowlen, which is size_t.

This should be cleaned up.

AFAICT, the upstream API in lz4.h uses int for size values, but 
lz4frame.h uses size_t, so I don't know what the correct approach is.
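
For reference, the contrast in the upstream headers is roughly this
(prototypes paraphrased from lz4.h and lz4frame.h):

    /* lz4.h: block API, sizes are plain ints */
    int     LZ4_compress_default(const char *src, char *dst,
                                 int srcSize, int dstCapacity);
    int     LZ4_decompress_safe(const char *src, char *dst,
                                int compressedSize, int dstCapacity);

    /* lz4frame.h: frame API, sizes are size_t, and errors are encoded in
     * the returned size_t (to be checked with LZ4F_isError()) */
    size_t  LZ4F_compressUpdate(LZ4F_cctx *cctx,
                                void *dstBuffer, size_t dstCapacity,
                                const void *srcBuffer, size_t srcSize,
                                const LZ4F_compressOptions_t *cOptPtr);
    unsigned LZ4F_isError(LZ4F_errorCode_t code);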



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 3/12/23 11:07, Peter Eisentraut wrote:
> On 11.03.23 07:00, Alexander Lakhin wrote:
>> Hello,
>> 23.02.2023 23:24, Tomas Vondra wrote:
>>> On 2/23/23 16:26, Tomas Vondra wrote:
>>>> Thanks for v30 with the updated commit messages. I've pushed 0001 after
>>>> fixing a comment typo and removing (I think) an unnecessary change
>>>> in an
>>>> error message.
>>>>
>>>> I'll give the buildfarm a bit of time before pushing 0002 and 0003.
>>>>
>>> I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
>>> and marked the CF entry as committed. Thanks for the patch!
>>>
>>> I wonder how difficult would it be to add the zstd compression, so that
>>> we don't have the annoying "unsupported" cases.
>>
>> With the patch 0003 committed, a single warning -Wtype-limits appeared
>> in the
>> master branch:
>> $ CPPFLAGS="-Og -Wtype-limits" ./configure --with-lz4 -q && make -s -j8
>> compress_lz4.c: In function ‘LZ4File_gets’:
>> compress_lz4.c:492:19: warning: comparison of unsigned expression in
>> ‘< 0’ is always false [-Wtype-limits]
>>    492 |         if (dsize < 0)
>>        |
>> (I wonder, is it accidental that there no other places that triggers
>> the warning, or some buildfarm animals had this check enabled before?)
> 
> I think there is an underlying problem in this code that it dances back
> and forth between size_t and int in an unprincipled way.
> 
> In the code that triggers the warning, dsize is size_t.  dsize is the
> return from LZ4File_read_internal(), which is declared to return int.
> The variable that LZ4File_read_internal() returns in the success case is
> size_t, but in case of an error it returns -1.  (So the code that is
> warning is meaning to catch this error case, but it won't ever work.)
> Further below LZ4File_read_internal() calls LZ4File_read_overflow(),
> which is declared to return int, but in some cases it returns
> fs->overflowlen, which is size_t.
> 

I agree. I just got home so I looked at this only very briefly, but I
think it's clearly wrong to assign the LZ4File_read_internal() result to
a size_t variable (and it seems to me LZ4File_gets makes the same mistake
with LZ4File_read_internal() result).

I'll get this fixed early next week, I'm too tired to do that now
without likely causing further issues.

> This should be cleaned up.
> 
> AFAICT, the upstream API in lz4.h uses int for size values, but
> lz4frame.h uses size_t, so I don't know what the correct approach is.

Yeah, that's a good point. I think Justin is right we should be using
the LZ4F stuff, so ultimately we'll probably switch to size_t. But IMO
it's definitely better to correct the current code first, and only then
switch to LZ4F (from one correct state to another).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 3/11/23 11:50, gkokolatos@pm.me wrote:
> ------- Original Message -------
> On Saturday, March 11th, 2023 at 7:00 AM, Alexander Lakhin <exclusion@gmail.com> wrote:
> 
>> Hello,
>> 23.02.2023 23:24, Tomas Vondra wrote:
>>
>>> On 2/23/23 16:26, Tomas Vondra wrote:
>>>
>>>> Thanks for v30 with the updated commit messages. I've pushed 0001 after
>>>> fixing a comment typo and removing (I think) an unnecessary change in an
>>>> error message.
>>>>
>>>> I'll give the buildfarm a bit of time before pushing 0002 and 0003.
>>>
>>> I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
>>> and marked the CF entry as committed. Thanks for the patch!
>>>
>>> I wonder how difficult would it be to add the zstd compression, so that
>>> we don't have the annoying "unsupported" cases.
>>
>>
>> With the patch 0003 committed, a single warning -Wtype-limits appeared in the
>> master branch:
>> $ CPPFLAGS="-Og -Wtype-limits" ./configure --with-lz4 -q && make -s -j8
>> compress_lz4.c: In function ‘LZ4File_gets’:
>> compress_lz4.c:492:19: warning: comparison of unsigned expression in ‘< 0’ is
>> always false [-Wtype-limits]
>> 492 | if (dsize < 0)
>> |
> 
> Thank you Alexander. Please find attached an attempt to address it.
> 
>> (I wonder, is it accidental that there no other places that triggers
>> the warning, or some buildfarm animals had this check enabled before?)
> 
> I can not answer about the buildfarms. Do you think that adding an explicit
> check for this warning in meson would help? I am a bit uncertain as I think
> that type-limits are included in extra.
> 
> @@ -1748,6 +1748,7 @@ common_warning_flags = [
>    '-Wshadow=compatible-local',
>    # This was included in -Wall/-Wformat in older GCC versions
>    '-Wformat-security',
> +  '-Wtype-limits',
>  ]
> 
>>
>> It is not a false positive as can be proved by the 002_pg_dump.pl modified as
>> follows:
>> - program => $ENV{'LZ4'},
>>
>> + program => 'mv',
>>
>> args => [
>>
>> - '-z', '-f', '--rm',
>> "$tempdir/compression_lz4_dir/blobs.toc",
>> "$tempdir/compression_lz4_dir/blobs.toc.lz4",
>> ],
>> },
> 
> Correct, it is not a false positive. The existing testing framework provides
> limited support for exercising error branches. Especially so when those are
> dependent on generated output. 
> 
>> A diagnostic logging added shows:
>> LZ4File_gets() after LZ4File_read_internal; dsize: 18446744073709551615
>>
>> and pg_restore fails with:
>> error: invalid line in large object TOC file
>> ".../src/bin/pg_dump/tmp_check/tmp_test_22ri/compression_lz4_dir/blobs.toc": "????"
> 
> It is a good thing that the restore fails with bad input. Yet it should
> have failed earlier. The attached makes certain it does fail earlier. 
> 

Thanks for the patch.

I did look to see whether there are other places that might have the same issue, and
I think there are - see attached 0002. For example LZ4File_write is
declared to return size_t, but then it also does

        if (LZ4F_isError(status))
        {
            fs->errcode = status;
            return -1;
        }

That won't work :-(

And these issues may not be restricted to lz4 code - Gzip_write is
declared to return size_t, but it does

    return gzwrite(gzfp, ptr, size);

and gzwrite returns int. Although, maybe that's correct, because
gzwrite() is "0 on error" so maybe this is fine ...

However, Gzip_read assigns gzread() to size_t, and that does not seem
great. It probably will still trigger the following pg_fatal() because
it'd be very lucky to match the expected 'size', but it's confusing.


I wonder whether CompressorState should use int or size_t for the
read_func/write_func callbacks. I guess no option is perfect, i.e. no
data type will work for all compression libraries we might use (lz4 uses
int while lz4f uses size_t, so there's that).

It's a bit weird the "open" functions return int and the read/write
size_t. Maybe we should stick to int, which is what the old functions
(cfwrite etc.) did.


But I think the actual problem here is that the API does not clearly
define how errors are communicated. I mean, it's nice to return the
value returned by the library function without "mangling" it by
conversion to size_t, but what if the libraries communicate errors in
different way? Some may return "0" while others may return "-1".

I think the right approach is to handle all library errors and not just
let them through. So Gzip_write() needs to check the return value, and
either call pg_fatal() or translate it to an error defined by the API.

For example we might say "returns 0 on error" and then translate all
library-specific errors to that.
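
To make that concrete, a sketch for the gzip side, treating "returns 0 on
error" as the API contract (the exact signature here is from memory):

    static size_t
    Gzip_write(const void *ptr, size_t size, CompressFileHandle *CFH)
    {
        gzFile      gzfp = (gzFile) CFH->private_data;
        int         ret;

        ret = gzwrite(gzfp, ptr, (unsigned) size);

        /* translate the library error ("<= 0") into the API error ("0") */
        if (ret <= 0)
            return 0;

        return (size_t) ret;
    }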


While looking at the code I realized a couple function comments don't
say what's returned in case of error, etc. So 0004 adds those.

0003 is a couple minor assorted comments/questions:

- Should we move ZLIB_OUT_SIZE/ZLIB_IN_SIZE to compress_gzip.c?

- Why are LZ4 buffer sizes different (ZLIB has both 4kB)?

- I wonder if we actually need LZ4F_HEADER_SIZE_MAX? Is it even possible
for LZ4F_compressBound to return value this small (especially for 16kB
input buffer)?



regards


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
Hi Justin,

Thanks for the patch.

On 3/8/23 02:45, Justin Pryzby wrote:
> On Wed, Mar 01, 2023 at 04:52:49PM +0100, Tomas Vondra wrote:
>> Thanks. That seems correct to me, but I find it somewhat confusing,
>> because we now have
>>
>>  DeflateCompressorInit vs. InitCompressorGzip
>>
>>  DeflateCompressorEnd vs. EndCompressorGzip
>>
>>  DeflateCompressorData - The name doesn't really say what it does (would
>>                          be better to have a verb in there, I think).
>>
>> I wonder if we can make this somehow clearer?
> 
> To move things along, I updated Georgios' patch:
> 
> Rename DeflateCompressorData() to DeflateCompressorCommon();

Hmmm, I don't find "common" any clearer than "data" :-( There needs to
at least be a comment explaining what "common" does.

> Rearrange functions to their original order allowing a cleaner diff to the prior code;

OK. I wasn't very enthusiastic about this initially, but after thinking
about it a bit I think it's meaningful to make diffs clearer. But I
don't see much difference with/without the patch. The

git diff --diff-algorithm=minimal -w
e9960732a~:src/bin/pg_dump/compress_io.c src/bin/pg_dump/compress_gzip.c

Produces ~25k diff with/without the patch. What am I doing wrong?

> Change pg_fatal() to an assertion+comment;

Yeah, that's reasonable. I'd even ditch the assert/comment, TBH. We
could add such protections against "impossible" stuff to a zillion other
places and the confusion likely outweighs the benefits.

> Update the commit message and fix a few typos;
> 

Thanks. I don't want to annoy you too much, but could you split the
patch into the "empty-data" fix and all the other changes (rearranging
functions etc.)? I'd rather not mix those in the same commit.


>> Also, InitCompressorGzip says this:
>>
>>    /*
>>     * If the caller has defined a write function, prepare the necessary
>>     * state. Avoid initializing during the first write call, because End
>>     * may be called without ever writing any data.
>>     */
>>     if (cs->writeF)
>>         DeflateCompressorInit(cs);
>>
>> Does it actually make sense to not have writeF defined in some cases?
> 
> InitCompressor is being called for either reading or writing, either of
> which could be null:
> 
> src/bin/pg_dump/pg_backup_custom.c:     ctx->cs = AllocateCompressor(AH->compression_spec,
> src/bin/pg_dump/pg_backup_custom.c-                                                              NULL,
> src/bin/pg_dump/pg_backup_custom.c-                                                              _CustomWriteFunc);
> --
> src/bin/pg_dump/pg_backup_custom.c:     cs = AllocateCompressor(AH->compression_spec,
> src/bin/pg_dump/pg_backup_custom.c-                                                     _CustomReadFunc, NULL);
> 
> It's confusing that the comment says "Avoid initializing...".  What it
> really means is "Initialize eagerly...".  But that makes more sense in
> the context of the commit message for this bugfix than in a comment.  So
> I changed that too.
> 
> +       /* If deflation was initialized, finalize it */
> +       if (cs->private_data)
> +               DeflateCompressorEnd(AH, cs);
> 
> Maybe it'd be more clear if this used "if (cs->writeF)", like in the
> init function ?
> 

Yeah, if the two checks are equivalent, it'd be better to stick to the
same check everywhere.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:

------- Original Message -------
On Monday, March 13th, 2023 at 10:47 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:



>
> > Change pg_fatal() to an assertion+comment;
>
>
> Yeah, that's reasonable. I'd even ditch the assert/comment, TBH. We
> could add such protections against "impossible" stuff to a zillion other
> places and the confusion likely outweighs the benefits.
>

A minor note to add is to not ignore the lessons learned from a7885c9bb.

For example, as the testing framework stands, one cannot test that the
contents of the custom format are indeed compressed. One can infer it by
examining the header of the produced dump and searching for the
compression flag. The code responsible for writing the header and the
code responsible for actually dealing with data, is not the same. Also,
the compression library itself will happily read and write uncompressed
data.

A pg_fatal, assertion, or similar, is the only guard rail against this
kind of error. Without it, the tests will continue passing even after
e.g. a wrong initialization of the API. It was such a case that led to
a7885c9bb in the first place. I do think that we wish it to be an
"impossible" case. Also it will be an untested case with some history
without such a guard rail.

Of course I will not object to removing it, if you think that is more
confusing than useful.

Cheers,
//Georgios

>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Monday, March 13th, 2023 at 9:21 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
>
>
> On 3/11/23 11:50, gkokolatos@pm.me wrote:
>
> > ------- Original Message -------
> > On Saturday, March 11th, 2023 at 7:00 AM, Alexander Lakhin exclusion@gmail.com wrote:
> >
> > > Hello,
> > > 23.02.2023 23:24, Tomas Vondra wrote:

>
>
> Thanks for the patch.
>
> I did look if there are other places that might have the same issue, and
> I think there are - see attached 0002. For example LZ4File_write is
> declared to return size_t, but then it also does
>
> if (LZ4F_isError(status))
> {
> fs->errcode = status;
>
> return -1;
> }
>
> That won't work :-(

You are right. It is confusing.

>
> And these issues may not be restricted to lz4 code - Gzip_write is
> declared to return size_t, but it does
>
> return gzwrite(gzfp, ptr, size);
>
> and gzwrite returns int. Although, maybe that's correct, because
> gzwrite() is "0 on error" so maybe this is fine ...
>
> However, Gzip_read assigns gzread() to size_t, and that does not seem
> great. It probably will still trigger the following pg_fatal() because
> it'd be very lucky to match the expected 'size', but it's confusing.

Agreed.

>
>
> I wonder whether CompressorState should use int or size_t for the
> read_func/write_func callbacks. I guess no option is perfect, i.e. no
> data type will work for all compression libraries we might use (lz4 uses
> int while lz4f uses size_t, so there's that).
>
> It's a bit weird the "open" functions return int and the read/write
> size_t. Maybe we should stick to int, which is what the old functions
> (cfwrite etc.) did.
>
You are right. These functions are modeled on open/fread/fwrite etc.,
and they have kept those functions' return types. Their callers do check
the return value of read_func and write_func against the requested
number of bytes to be transferred.

>
> But I think the actual problem here is that the API does not clearly
> define how errors are communicated. I mean, it's nice to return the
> value returned by the library function without "mangling" it by
> conversion to size_t, but what if the libraries communicate errors in
> different way? Some may return "0" while others may return "-1".

Agreed.

>
> I think the right approach is to handle all library errors and not just
> let them through. So Gzip_write() needs to check the return value, and
> either call pg_fatal() or translate it to an error defined by the API.

It makes sense. It will change some of the behaviour of the callers,
mostly on what constitutes an error, and what error message is emitted.
This is a reasonable change though.

>
> For example we might say "returns 0 on error" and then translate all
> library-specific errors to that.

Ok.

> While looking at the code I realized a couple function comments don't
> say what's returned in case of error, etc. So 0004 adds those.
>
> 0003 is a couple minor assorted comments/questions:
>
> - Should we move ZLIB_OUT_SIZE/ZLIB_IN_SIZE to compress_gzip.c?

It would make things clearer.

> - Why are LZ4 buffer sizes different (ZLIB has both 4kB)?

Clearly some comments are needed, if the difference makes sense.

> - I wonder if we actually need LZ4F_HEADER_SIZE_MAX? Is it even possible
> for LZ4F_compressBound to return value this small (especially for 16kB
> input buffer)?
>

I would recommend keeping it. Earlier versions of the library do not
provide LZ4F_HEADER_SIZE_MAX; later versions do advise using it.
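
A fallback along these lines (a sketch; the version cutoff is from the lz4
release notes as I remember them) would keep older installations building
while still using the symbol:

    /* LZ4F_HEADER_SIZE_MAX first appeared in v1.7.5 of the library */
    #ifndef LZ4F_HEADER_SIZE_MAX
    #define LZ4F_HEADER_SIZE_MAX    32
    #endif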

Would you mind me trying to come up with a patch to address your points?

Cheers,
//Georgios

>
>
> regards
>
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 3/14/23 16:18, gkokolatos@pm.me wrote:
> ...> Would you mind me trying to come with a patch to address your points?
> 

That'd be great, thanks. Please keep it split into smaller patches - two
might work, with one patch for "cosmetic" changes and the other tweaking
the API error-handling stuff.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
On 3/14/23 12:07, gkokolatos@pm.me wrote:
> 
> 
> ------- Original Message -------
> On Monday, March 13th, 2023 at 10:47 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> 
> 
> 
>>
>>> Change pg_fatal() to an assertion+comment;
>>
>>
>> Yeah, that's reasonable. I'd even ditch the assert/comment, TBH. We
>> could add such protections against "impossible" stuff to a zillion other
>> places and the confusion likely outweighs the benefits.
>>
> 
> A minor note to add is to not ignore the lessons learned from a7885c9bb.
> 
> For example, as the testing framework stands, one can not test that the
> contents of the custom format are indeed compressed. One can infer it by
> examining the header of the produced dump and searching for the
> compression flag. The code responsible for writing the header and the
> code responsible for actually dealing with data, is not the same. Also,
> the compression library itself will happily read and write uncompressed
> data.
> 
> A pg_fatal, assertion, or similar, is the only guard rail against this
> kind of error. Without it, the tests will continue passing even after
> e.g. a wrong initialization of the API. It was such a case that led to
> a7885c9bb in the first place. I do think that we wish it to be an
> "impossible" case. Also it will be an untested case with some history
> without such a guard rail.
> 

So is the pg_fatal() dead code or not? My understanding was it's not
really reachable, and the main purpose is to remind people this is not
possible. Or am I mistaken/confused?

If it's reachable, can we test it? AFAICS we don't, per the coverage
reports.

If it's just a protection against incorrect API initialization, then an
assert is the right solution, I think. With proper comment. But can't we
actually verify that *during* the initialization?

Also, how come WriteDataToArchiveLZ4() doesn't need this protection too?
Or is that due to gzip being the default compression method?

> Of course I will not object to removing it, if you think that is more
> confusing than useful.
> 

Not sure, I have a feeling I don't quite understand in what situation
this actually helps.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Mon, Mar 13, 2023 at 10:47:12PM +0100, Tomas Vondra wrote:
> > Rearrange functions to their original order allowing a cleaner diff to the prior code;
> 
> OK. I wasn't very enthusiastic about this initially, but after thinking
> about it a bit I think it's meaningful to make diffs clearer. But I
> don't see much difference with/without the patch. The
> 
> git diff --diff-algorithm=minimal -w e9960732a~:src/bin/pg_dump/compress_io.c src/bin/pg_dump/compress_gzip.c
> 
> Produces ~25k diff with/without the patch. What am I doing wrong?

Do you mean 25 kB of diff ?  I agree that the statistics of the diff
output don't change a lot:

  1 file changed, 201 insertions(+), 570 deletions(-)
  1 file changed, 198 insertions(+), 548 deletions(-)

But try reading the diff while looking for the cause of a bug.  It's the
difference between reading 50 two-line changes, and reading a hunk that
replaces 100 lines with a different 100 lines, with empty/unrelated
lines randomly thrown in as context.

When the diff is readable, the pg_fatal() also stands out.

> > Change pg_fatal() to an assertion+comment;
> 
> Yeah, that's reasonable. I'd even ditch the assert/comment, TBH. We
> could add such protections against "impossible" stuff to a zillion other
> places and the confusion likely outweighs the benefits.
> 
> > Update the commit message and fix a few typos;
> 
> Thanks. I don't want to annoy you too much, but could you split the
> patch into the "empty-data" fix and all the other changes (rearranging
> functions etc.)? I'd rather not mix those in the same commit.

I don't know if that makes sense?  The "empty-data" fix creates a new
function called DeflateCompressorInit().  My proposal was to add the new
function in the same place in the file as it used to be.

The patch also moves the pg_fatal() that's being removed.  I don't think
it's going to look any cleaner to read a history involving the
pg_fatal() first being added, then moved, then removed.  Anyway, I'll
wait while the community continues discussion about the pg_fatal().

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:

------- Original Message -------
On Tuesday, March 14th, 2023 at 4:32 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
>
>
> On 3/14/23 16:18, gkokolatos@pm.me wrote:
>
> > ...> Would you mind me trying to come with a patch to address your points?
>
>
> That'd be great, thanks. Please keep it split into smaller patches - two
> might work, with one patch for "cosmetic" changes and the other tweaking
> the API error-handling stuff.

Please find attached a set for it. I will admit that the split of the
series might not be ideal, nor exactly what you requested. It is split into
what seemed like logical units. Please advise what a better split could
look like.

0001 is unifying types and return values on the API
0002 is addressing the constant definitions
0003 is your previous 0004 adding comments

As far as the error handling is concerned, you had said upthread:

> I think the right approach is to handle all library errors and not just
> let them through. So Gzip_write() needs to check the return value, and
> either call pg_fatal() or translate it to an error defined by the API.

While working on it, I thought it would be clearer and more consistent
for the pg_fatal() to be called by the caller of the individual functions.
Each individual function can keep track of the specifics of the error
internally. Then the caller upon detecting that there was an error by
checking the return value, can call pg_fatal() with a uniform error
message and then add the specifics by calling the get_error_func().

Thoughts?

Cheers,
//Georgios

>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 3/16/23 18:04, gkokolatos@pm.me wrote:
> 
> ------- Original Message -------
> On Tuesday, March 14th, 2023 at 4:32 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 3/14/23 16:18, gkokolatos@pm.me wrote:
>>
>>> ...> Would you mind me trying to come with a patch to address your points?
>>
>>
>> That'd be great, thanks. Please keep it split into smaller patches - two
>> might work, with one patch for "cosmetic" changes and the other tweaking
>> the API error-handling stuff.
> 
> Please find attached a set for it. I will admit that the split of the
> series might not be ideal, nor exactly what you requested. It is split into
> what seemed like logical units. Please advise what a better split could look like.
> 
> 0001 is unifying types and return values on the API
> 0002 is addressing the constant definitions
> 0003 is your previous 0004 adding comments
> 

Thanks. I think the split seems reasonable - the goal was to not mix
different changes, and from that POV it works.

I'm not sure I understand the Gzip_read/Gzip_write changes in 0001. I
mean, gzread/gzwrite returns int, so how does renaming the size_t
variable solve the issue of negative values for errors? I mean, this

-    size_t    ret;
+    size_t    gzret;

-    ret = gzread(gzfp, ptr, size);
+    gzret = gzread(gzfp, ptr, size);

means we still lost the information gzread() returned a negative value,
no? We'll still probably trigger an error, but it's a bit weird.


ISTM all this kinda assumes we're processing chunks of memory small
enough that we'll never actually overflow int - I did check what the
code in 15 does, and it seems to use int and size_t quite arbitrarily.

For example cfread() seems quite sane:

    int
    cfread(void *ptr, int size, cfp *fp)
    {
        int ret;
        ...
        ret = gzread(fp->compressedfp, ptr, size);
        ...
        return ret;
    }

but then _PrintFileData() happily stashes it into a size_t, ignoring the
signedness. Consider:

    static void
    _PrintFileData(ArchiveHandle *AH, char *filename)
    {
        size_t        cnt;
        ...
        while ((cnt = cfread(buf, buflen, cfp)))
        {
            ahwrite(buf, 1, cnt, AH);
        }
        ...
    }

Unless I'm missing something, if gzread() ever returns -1 or some other
negative error value, we'll cast it to size_t, the while condition will
evaluate to "true", and we'll happily chew on some random chunk of data.

So the confusion is (at least partially) a preexisting issue ...

For gzwrite() it seems to be fine, because that only returns 0 on error.
OTOH it's defined to take 'int size' but then we happily pass size_t
values to it.

As I wrote earlier, this apparently assumes we never need to deal with
buffers larger than int, and I don't think we have the ambition to relax
that (I'm not sure it's even needed / possible).
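
Just to spell out the hazard with a trivial standalone example (nothing
pg_dump-specific, purely an illustration):

    #include <stdio.h>

    int
    main(void)
    {
        int     gzret = -1;     /* pretend gzread() reported an error */
        size_t  cnt = gzret;    /* the implicit conversion the callers do */

        if (cnt)                /* still "true": the error is silently lost */
            printf("would happily process %zu bytes\n", cnt);
        return 0;
    }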


I see the read/write functions are now defined as int, but we only ever
return 0/1 from them, and then interpret that as bool. Why not define
it like that? I don't think we need to adhere to the custom that
everything returns "int". This is an internal API. Or if we want to
stick to int, I'd define meaningful "nice" constants for 0/1.



0002 seems fine to me. I see you've ditched the idea of having two
separate buffers, and replaced them with DEFAULT_IO_BUFFER_SIZE. Fine
with me, although I wonder if this might have negative impact on
performance or something (but I doubt that).

0003 seems fine too.


> As far as the error handling is concerned, you had said upthread:
> 
>> I think the right approach is to handle all library errors and not just
>> let them through. So Gzip_write() needs to check the return value, and
>> either call pg_fatal() or translate it to an error defined by the API.
> 
> While working on it, I thought it would be clearer and more consistent
> for the pg_fatal() to be called by the caller of the individual functions.
> Each individual function can keep track of the specifics of the error
> internally. Then the caller, upon detecting that there was an error by
> checking the return value, can call pg_fatal() with a uniform error
> message and then add the specifics by calling the get_error_func().
> 

I agree it's cleaner the way you did it.

I was thinking that with each compression function handling error
internally, the callers would not need to do that. But I haven't
realized there's logic to detect ENOSPC and so on, and we'd need to
duplicate that in every compression func.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 3/16/23 01:20, Justin Pryzby wrote:
> On Mon, Mar 13, 2023 at 10:47:12PM +0100, Tomas Vondra wrote:
>>> Rearrange functions to their original order allowing a cleaner diff to the prior code;
>>
>> OK. I wasn't very enthusiastic about this initially, but after thinking
>> about it a bit I think it's meaningful to make diffs clearer. But I
>> don't see much difference with/without the patch. The
>>
>> git diff --diff-algorithm=minimal -w e9960732a~:src/bin/pg_dump/compress_io.c src/bin/pg_dump/compress_gzip.c
>>
>> Produces ~25k diff with/without the patch. What am I doing wrong?
> 
> Do you mean 25 kB of diff ?

Yes, if you redirect the git-diff to a file, it's a 25kB file.

> I agree that the statistics of the diff output don't change a lot:
> 
>   1 file changed, 201 insertions(+), 570 deletions(-)
>   1 file changed, 198 insertions(+), 548 deletions(-)
> 
> But try reading the diff while looking for the cause of a bug.  It's the
> difference between reading 50, two-line changes, and reading a hunk that
> replaces 100 lines with a different 100 lines, with empty/unrelated
> lines randomly thrown in as context.
> 
> When the diff is readable, the pg_fatal() also stands out.
> 

I don't know, maybe I'm doing something wrong or maybe I just am bad at
looking at diffs, but if I apply the patch you submitted on 8/3 and do
the git-diff above (output attached), it seems pretty incomprehensible
to me :-( I don't see 50 two-line changes (I certainly wouldn't be able
to identify the root cause of the bug based on that).

>>> Change pg_fatal() to an assertion+comment;
>>
>> Yeah, that's reasonable. I'd even ditch the assert/comment, TBH. We
>> could add such protections against "impossible" stuff to a zillion other
>> places and the confusion likely outweighs the benefits.
>>
>>> Update the commit message and fix a few typos;
>>
>> Thanks. I don't want to annoy you too much, but could you split the
>> patch into the "empty-data" fix and all the other changes (rearranging
>> functions etc.)? I'd rather not mix those in the same commit.
> 
> I don't know if that makes sense?  The "empty-data" fix creates a new
> function called DeflateCompressorInit().  My proposal was to add the new
> function in the same place in the file as it used to be.
> 

Got it. In that case I agree it's fine to do that in a single commit.

> The patch also moves the pg_fatal() that's being removed.  I don't think
> it's going to look any cleaner to read a history involving the
> pg_fatal() first being added, then moved, then removed.  Anyway, I'll
> wait while the community continues discussion about the pg_fatal().
> 

I think the agreement was to replace the pg_fatal with an assert, and I
see your patch already does that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Thu, Mar 16, 2023 at 11:30:50PM +0100, Tomas Vondra wrote:
> On 3/16/23 01:20, Justin Pryzby wrote:
> > But try reading the diff while looking for the cause of a bug.  It's the
> > difference between reading 50, two-line changes, and reading a hunk that
> > replaces 100 lines with a different 100 lines, with empty/unrelated
> > lines randomly thrown in as context.
>
> I don't know, maybe I'm doing something wrong or maybe I just am bad at
> looking at diffs, but if I apply the patch you submitted on 8/3 and do
> the git-diff above (output attached), it seems pretty incomprehensible
> to me :-( I don't see 50 two-line changes (I certainly wouldn't be able
> to identify the root cause of the bug based on that).

It's true that most of the diff is still incomprehensible...

But look at the part relevant to the "empty-data" bug:

[... incomprehensible changes elided ...]
>  static void
> -InitCompressorZlib(CompressorState *cs, int level)
> +DeflateCompressorInit(CompressorState *cs)
>  {
> +    GzipCompressorState *gzipcs;
>      z_streamp    zp;
>  
> -    zp = cs->zp = (z_streamp) pg_malloc(sizeof(z_stream));
> +    gzipcs = (GzipCompressorState *) pg_malloc0(sizeof(GzipCompressorState));
> +    zp = gzipcs->zp = (z_streamp) pg_malloc(sizeof(z_stream));
>      zp->zalloc = Z_NULL;
>      zp->zfree = Z_NULL;
>      zp->opaque = Z_NULL;
>  
>      /*
> -     * zlibOutSize is the buffer size we tell zlib it can output to.  We
> -     * actually allocate one extra byte because some routines want to append a
> -     * trailing zero byte to the zlib output.
> +     * outsize is the buffer size we tell zlib it can output to.  We actually
> +     * allocate one extra byte because some routines want to append a trailing
> +     * zero byte to the zlib output.
>       */
> -    cs->zlibOut = (char *) pg_malloc(ZLIB_OUT_SIZE + 1);
> -    cs->zlibOutSize = ZLIB_OUT_SIZE;
> +    gzipcs->outbuf = pg_malloc(ZLIB_OUT_SIZE + 1);
> +    gzipcs->outsize = ZLIB_OUT_SIZE;
>  
> -    if (deflateInit(zp, level) != Z_OK)
> -        pg_fatal("could not initialize compression library: %s",
> -                 zp->msg);
> +    /* -Z 0 uses the "None" compressor -- not zlib with no compression */
> +    Assert(cs->compression_spec.level != 0);
> +
> +    if (deflateInit(zp, cs->compression_spec.level) != Z_OK)
> +        pg_fatal("could not initialize compression library: %s", zp->msg);
>  
>      /* Just be paranoid - maybe End is called after Start, with no Write */
> -    zp->next_out = (void *) cs->zlibOut;
> -    zp->avail_out = cs->zlibOutSize;
> +    zp->next_out = gzipcs->outbuf;
> +    zp->avail_out = gzipcs->outsize;
> +
> +    /* Keep track of gzipcs */
> +    cs->private_data = gzipcs;
>  }
>  
>  static void
> -EndCompressorZlib(ArchiveHandle *AH, CompressorState *cs)
> +DeflateCompressorEnd(ArchiveHandle *AH, CompressorState *cs)
>  {
> -    z_streamp    zp = cs->zp;
> +    GzipCompressorState *gzipcs = (GzipCompressorState *) cs->private_data;
> +    z_streamp    zp;
>  
> +    zp = gzipcs->zp;
>      zp->next_in = NULL;
>      zp->avail_in = 0;
>  
>      /* Flush any remaining data from zlib buffer */
> -    DeflateCompressorZlib(AH, cs, true);
> +    DeflateCompressorCommon(AH, cs, true);
>  
>      if (deflateEnd(zp) != Z_OK)
>          pg_fatal("could not close compression stream: %s", zp->msg);
>  
> -    free(cs->zlibOut);
> -    free(cs->zp);
> +    pg_free(gzipcs->outbuf);
> +    pg_free(gzipcs->zp);
> +    pg_free(gzipcs);
> +    cs->private_data = NULL;
>  }
>  
>  static void
> -DeflateCompressorZlib(ArchiveHandle *AH, CompressorState *cs, bool flush)
> +DeflateCompressorCommon(ArchiveHandle *AH, CompressorState *cs, bool flush)
>  {
> -    z_streamp    zp = cs->zp;
> -    char       *out = cs->zlibOut;
> +    GzipCompressorState *gzipcs = (GzipCompressorState *) cs->private_data;
> +    z_streamp    zp = gzipcs->zp;
> +    void       *out = gzipcs->outbuf;
>      int            res = Z_OK;
>  
> -    while (cs->zp->avail_in != 0 || flush)
> +    while (gzipcs->zp->avail_in != 0 || flush)
>      {
>          res = deflate(zp, flush ? Z_FINISH : Z_NO_FLUSH);
>          if (res == Z_STREAM_ERROR)
>              pg_fatal("could not compress data: %s", zp->msg);
> -        if ((flush && (zp->avail_out < cs->zlibOutSize))
> +        if ((flush && (zp->avail_out < gzipcs->outsize))
>              || (zp->avail_out == 0)
>              || (zp->avail_in != 0)
>              )
> @@ -289,18 +122,18 @@ DeflateCompressorZlib(ArchiveHandle *AH, CompressorState *cs, bool flush)
>               * chunk is the EOF marker in the custom format. This should never
>               * happen but ...
>               */
> -            if (zp->avail_out < cs->zlibOutSize)
> +            if (zp->avail_out < gzipcs->outsize)
>              {
>                  /*
>                   * Any write function should do its own error checking but to
>                   * make sure we do a check here as well ...
>                   */
> -                size_t        len = cs->zlibOutSize - zp->avail_out;
> +                size_t        len = gzipcs->outsize - zp->avail_out;
>  
> -                cs->writeF(AH, out, len);
> +                cs->writeF(AH, (char *) out, len);
>              }
> -            zp->next_out = (void *) out;
> -            zp->avail_out = cs->zlibOutSize;
> +            zp->next_out = out;
> +            zp->avail_out = gzipcs->outsize;
>          }
>  
>          if (res == Z_STREAM_END)
> @@ -309,16 +142,26 @@ DeflateCompressorZlib(ArchiveHandle *AH, CompressorState *cs, bool flush)
>  }
>  
>  static void
> -WriteDataToArchiveZlib(ArchiveHandle *AH, CompressorState *cs,
> -                       const char *data, size_t dLen)
> +EndCompressorGzip(ArchiveHandle *AH, CompressorState *cs)
>  {
> -    cs->zp->next_in = (void *) unconstify(char *, data);
> -    cs->zp->avail_in = dLen;
> -    DeflateCompressorZlib(AH, cs, false);
> +    /* If deflation was initialized, finalize it */
> +    if (cs->private_data)
> +        DeflateCompressorEnd(AH, cs);
>  }
>  
>  static void
> -ReadDataFromArchiveZlib(ArchiveHandle *AH, ReadFunc readF)
> +WriteDataToArchiveGzip(ArchiveHandle *AH, CompressorState *cs,
> +                       const void *data, size_t dLen)
> +{
> +    GzipCompressorState *gzipcs = (GzipCompressorState *) cs->private_data;
> +
> +    gzipcs->zp->next_in = (void *) unconstify(void *, data);
> +    gzipcs->zp->avail_in = dLen;
> +    DeflateCompressorCommon(AH, cs, false);
> +}
> +
> +static void
> +ReadDataFromArchiveGzip(ArchiveHandle *AH, CompressorState *cs)
>  {
>      z_streamp    zp;
>      char       *out;
> @@ -342,7 +185,7 @@ ReadDataFromArchiveZlib(ArchiveHandle *AH, ReadFunc readF)
>                   zp->msg);
>  
>      /* no minimal chunk size for zlib */
> -    while ((cnt = readF(AH, &buf, &buflen)))
> +    while ((cnt = cs->readF(AH, &buf, &buflen)))
>      {
>          zp->next_in = (void *) buf;
>          zp->avail_in = cnt;
> @@ -382,389 +225,196 @@ ReadDataFromArchiveZlib(ArchiveHandle *AH, ReadFunc readF)
>      free(out);
>      free(zp);
>  }
[... more incomprehensible changes elided ...]



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 3/16/23 23:58, Justin Pryzby wrote:
> On Thu, Mar 16, 2023 at 11:30:50PM +0100, Tomas Vondra wrote:
>> On 3/16/23 01:20, Justin Pryzby wrote:
>>> But try reading the diff while looking for the cause of a bug.  It's the
>>> difference between reading 50, two-line changes, and reading a hunk that
>>> replaces 100 lines with a different 100 lines, with empty/unrelated
>>> lines randomly thrown in as context.
>>
>> I don't know, maybe I'm doing something wrong or maybe I just am bad at
>> looking at diffs, but if I apply the patch you submitted on 8/3 and do
>> the git-diff above (output attached), it seems pretty incomprehensible
>> to me :-( I don't see 50 two-line changes (I certainly wouldn't be able
>> to identify the root cause of the bug based on that).
> 
> It's true that most of the diff is still incomprehensible...
> 
> But look at the part relevant to the "empty-data" bug:
> 

Well, yeah. If you know where to look, and if you squint just the right
way, then you can see any bug. I don't think I'd be able to spot the bug
in the diff unless I knew in advance what the bug is.

That being said, I don't object to moving the function etc. Unless there
are alternative ideas how to fix the empty-data issue, I'll get this
committed after playing with it a bit more.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Thursday, March 16th, 2023 at 10:20 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
>
>
> On 3/16/23 18:04, gkokolatos@pm.me wrote:
>
> > ------- Original Message -------
> > On Tuesday, March 14th, 2023 at 4:32 PM, Tomas Vondra tomas.vondra@enterprisedb.com wrote:
> >
> > > On 3/14/23 16:18, gkokolatos@pm.me wrote:
> > >
> > > > ...> Would you mind me trying to come with a patch to address your points?
> > >
> > > That'd be great, thanks. Please keep it split into smaller patches - two
> > > might work, with one patch for "cosmetic" changes and the other tweaking
> > > the API error-handling stuff.
> >
> > Please find attached a set for it. I will admit that the splitting of the
> > series might not be ideal, nor exactly what you requested. It is split into
> > what seemed like logical units. Please advise on how a better split could look.
> >
> > 0001 is unifying types and return values on the API
> > 0002 is addressing the constant definitions
> > 0003 is your previous 0004 adding comments
>
>
> Thanks. I think the split seems reasonable - the goal was to not mix
> different changes, and from that POV it works.
>
> I'm not sure I understand the Gzip_read/Gzip_write changes in 0001. I
> mean, gzread/gzwrite returns int, so how does renaming the size_t
> variable solve the issue of negative values for errors? I mean, this
>
> - size_t ret;
> + size_t gzret;
>
> - ret = gzread(gzfp, ptr, size);
> + gzret = gzread(gzfp, ptr, size);
>
> means we still lost the information gzread() returned a negative value,
> no? We'll still probably trigger an error, but it's a bit weird.

You are obviously correct. My bad, I misread the return type of gzread().

Please find an amended version attached.

> Unless I'm missing something, if gzread() ever returns -1 or some other
> negative error value, we'll cast it to size_t, the while condition will
> evaluate to "true", and we'll happily chew on some random chunk of data.
>
> So the confusion is (at least partially) a preexisting issue ...
>
> For gzwrite() it seems to be fine, because that only returns 0 on error.
> OTOH it's defined to take 'int size' but then we happily pass size_t
> values to it.
>
> As I wrote earlier, this apparently assumes we never need to deal with
> buffers larger than int, and I don't think we have the ambition to relax
> that (I'm not sure it's even needed / possible).

Agreed.


> I see the read/write functions are now defined as int, but we only ever
> return 0/1 from them, and then interpret that as bool. Why not define
> it like that? I don't think we need to adhere to the custom that
> everything returns "int". This is an internal API. Or if we want to
> stick to int, I'd define meaningful "nice" constants for 0/1.

The return types are now booleans and the callers have been made aware.


> 0002 seems fine to me. I see you've ditched the idea of having two
> separate buffers, and replaced them with DEFAULT_IO_BUFFER_SIZE. Fine
> with me, although I wonder if this might have negative impact on
> performance or something (but I doubt that).
>

I doubt that too. Thank you.

> 0003 seems fine too.

Thank you.


> > As far as the error handling is concerned, you had said upthread:
> >
> > > I think the right approach is to handle all library errors and not just
> > > let them through. So Gzip_write() needs to check the return value, and
> > > either call pg_fatal() or translate it to an error defined by the API.
> >
> > While working on it, I thought it would be clearer and more consistent
> > for the pg_fatal() to be called by the caller of the individual functions.
> > Each individual function can keep track of the specifics of the error
> > internally. Then the caller, upon detecting that there was an error by
> > checking the return value, can call pg_fatal() with a uniform error
> > message and then add the specifics by calling the get_error_func().
>
>
> I agree it's cleaner the way you did it.
>
> I was thinking that with each compression function handling error
> internally, the callers would not need to do that. But I haven't
> realized there's logic to detect ENOSPC and so on, and we'd need to
> duplicate that in every compression func.
>

If you agree, I can prepare a patch to improve on the error handling
aspect of the API as a separate thread, since here we are trying to
focus on correctness.

Cheers,
//Georgios

>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
On 3/17/23 16:43, gkokolatos@pm.me wrote:
>>
>> ...
>>
>> I agree it's cleaner the way you did it.
>>
>> I was thinking that with each compression function handling error
>> internally, the callers would not need to do that. But I haven't
>> realized there's logic to detect ENOSPC and so on, and we'd need to
>> duplicate that in every compression func.
>>
> 
> If you agree, I can prepare a patch to improve on the error handling
> aspect of the API as a separate thread, since here we are trying to
> focus on correctness.
> 

Yes, that makes sense. There are far too many patches in this thread
already ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
Hi,

I was preparing to get the 3 cleanup patches pushed, so I
updated/reworded the commit messages a bit (attached, please check).

But I noticed the commit message for 0001 said:

  In passing save the appropriate errno in LZ4File_open_write in
  case that the caller is not using the API's get_error_func.

I think that's far too low-level for a commit message, it'd be much more
appropriate for a comment at the function.

However, do we even need this behavior? I was looking for code calling
this function without using get_error_func(), but no luck. And if there
is such a caller, shouldn't we fix it to use get_error_func()?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Fri, Mar 17, 2023 at 03:43:58PM +0000, gkokolatos@pm.me wrote:
> From a174cdff4ec8aad59f5bcc7e8d52218a14fe56fc Mon Sep 17 00:00:00 2001
> From: Georgios Kokolatos <gkokolatos@pm.me>
> Date: Fri, 17 Mar 2023 14:45:58 +0000
> Subject: [PATCH v3 1/3] Improve type handling in pg_dump's compress file API

> -int
> +bool
>  EndCompressFileHandle(CompressFileHandle *CFH)
>  {
> -    int            ret = 0;
> +    bool        ret = 0;

Should say "= false" ?

>      /*
>       * Write 'size' bytes of data into the file from 'ptr'.
> +     *
> +     * Returns true on success and false on error.
> +     */
> +       bool            (*write_func) (const void *ptr, size_t size,

> -        * Get a pointer to a string that describes an error that occurred during a
> -        * compress file handle operation.
> +        * Get a pointer to a string that describes an error that occurred during
> +        * a compress file handle operation.
>          */
>         const char *(*get_error_func) (CompressFileHandle *CFH);

This should mention that the error accessible in error_func() applies (only) to
write_func() ?

As long as this touches pg_backup_directory.c you could update the
header comment to refer to "compressed extensions", not just .gz.

I noticed that EndCompressorLZ4() tests "if (LZ4cs)", but that should
always be true.

I was able to convert the zstd patch to this new API with no issue.

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
Hi,

I looked at this again, and I realized I had slightly misunderstood the
bit about errno in LZ4File_open_write. I now see it simply brings the
function in line with Gzip_open_write(), so that the callers can just do
pg_fatal("%m"). I still think the special "errno" handling in this one
place feels a bit random, and handling it via get_error_func() would be
nicer, but we can leave that for a separate patch - no need to block
these changes because of that.
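
For the archives, the pattern in question is roughly the following - a
simplified, self-contained sketch, not the actual LZ4File_open_write code:

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* sketch only: the real function also sets up the LZ4 state, etc. */
    static bool
    open_write_sketch(const char *path, const char *mode, FILE **fp_out)
    {
        FILE   *fp = fopen(path, mode);

        if (fp == NULL)
        {
            int     save_errno = errno;

            /* ... cleanup that might clobber errno would happen here ... */
            errno = save_errno;     /* keep "%m" meaningful for the caller */
            return false;
        }

        *fp_out = fp;
        return true;
    }

    int
    main(void)
    {
        FILE   *fp;

        if (!open_write_sketch("/nonexistent/dir/out.lz4", "wb", &fp))
            perror("open_write_sketch");    /* reports the preserved errno */
        return 0;
    }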

So pushed all three parts, after updating the commit messages a bit.

This leaves the empty-data issue (which we have a fix for) and the
switch to LZ4F. And then the zstd part.


On 3/20/23 23:40, Justin Pryzby wrote:
> On Fri, Mar 17, 2023 at 03:43:58PM +0000, gkokolatos@pm.me wrote:
>> From a174cdff4ec8aad59f5bcc7e8d52218a14fe56fc Mon Sep 17 00:00:00 2001
>> From: Georgios Kokolatos <gkokolatos@pm.me>
>> Date: Fri, 17 Mar 2023 14:45:58 +0000
>> Subject: [PATCH v3 1/3] Improve type handling in pg_dump's compress file API
> 
>> -int
>> +bool
>>  EndCompressFileHandle(CompressFileHandle *CFH)
>>  {
>> -    int            ret = 0;
>> +    bool        ret = 0;
> 
> Should say "= false" ?
> 

Right, fixed.

>>      /*
>>       * Write 'size' bytes of data into the file from 'ptr'.
>> +     *
>> +     * Returns true on success and false on error.
>> +     */
>> +       bool            (*write_func) (const void *ptr, size_t size,
> 
>> -        * Get a pointer to a string that describes an error that occurred during a
>> -        * compress file handle operation.
>> +        * Get a pointer to a string that describes an error that occurred during
>> +        * a compress file handle operation.
>>          */
>>         const char *(*get_error_func) (CompressFileHandle *CFH);
> 
> This should mention that the error accessible in error_func() applies (only) to
> write_func() ?
> 
> As long as this touches pg_backup_directory.c you could update the
> header comment to refer to "compressed extensions", not just .gz.
> 
> I noticed that EndCompressorLZ4() tests "if (LZ4cs)", but that should
> always be true.
> 

I haven't done these two things. We can/should do that, but it didn't
fit into the three patches.

> I was able to convert the zstd patch to this new API with no issue.
> 

Good to hear.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:



>
> So pushed all three parts, after updating the commit messages a bit.

Thank you very much.

>
> This leaves the empty-data issue (which we have a fix for) and the
> switch to LZ4F. And then the zstd part.

Please expect promptly a patch for the switch to frames.

Cheers,
//Georgios




Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Thursday, March 16th, 2023 at 11:30 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
>
>
> On 3/16/23 01:20, Justin Pryzby wrote:
>
> > On Mon, Mar 13, 2023 at 10:47:12PM +0100, Tomas Vondra wrote:
> >
> > >
> > > Thanks. I don't want to annoy you too much, but could you split the
> > > patch into the "empty-data" fix and all the other changes (rearranging
> > > functions etc.)? I'd rather not mix those in the same commit.
> >
> > I don't know if that makes sense? The "empty-data" fix creates a new
> > function called DeflateCompressorInit(). My proposal was to add the new
> > function in the same place in the file as it used to be.
>
>
> Got it. In that case I agree it's fine to do that in a single commit.

For what it's worth, I think that this patch should get a +1 and go in. It
solves the empty-writes problem and includes a test for a previously untested
case.

Cheers,
//Georgios

>
> > The patch also moves the pg_fatal() that's being removed. I don't think
> > it's going to look any cleaner to read a history involving the
> > pg_fatal() first being added, then moved, then removed. Anyway, I'll
> > wait while the community continues discussion about the pg_fatal().
>
>
> I think the agreement was to replace the pg_fatal with an assert, and I
> see your patch already does that.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Friday, March 24th, 2023 at 10:30 AM, gkokolatos@pm.me <gkokolatos@pm.me> wrote:

>
> ------- Original Message -------
> On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra tomas.vondra@enterprisedb.com wrote:
>
> > This leaves the empty-data issue (which we have a fix for) and the
> > switch to LZ4F. And then the zstd part.
>
> Please expect promptly a patch for the switch to frames.

Please find the expected patch attached. Note that the bulk of the
patch is code unification, variable renaming to something more
appropriate, and comment addition. These are changes that are not
strictly necessary to switch to LZ4F. I do believe they are
essential for code hygiene after the switch, and they do belong
in the same commit.

Cheers,
//Georgios

>
> Cheers,
> //Georgios
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 3/28/23 18:07, gkokolatos@pm.me wrote:
> 
> 
> 
> 
> 
> ------- Original Message -------
> On Friday, March 24th, 2023 at 10:30 AM, gkokolatos@pm.me <gkokolatos@pm.me> wrote:
> 
>>
>> ------- Original Message -------
>> On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra tomas.vondra@enterprisedb.com wrote:
>>
>>> This leaves the empty-data issue (which we have a fix for) and the
>>> switch to LZ4F. And then the zstd part.
>>
>> Please expect promptly a patch for the switch to frames.
> 
> Please find the expected patch attached. Note that the bulk of the
> patch is code unification, variable renaming to something more
> appropriate, and comment addition. These are changes that are not
> strictly necessary to switch to LZ4F. I do believe they are
> essential for code hygiene after the switch, and they do belong
> in the same commit.
> 

Thanks!

I agree the renames & cleanup are appropriate - it'd be silly to stick
to misleading naming etc. Would it make sense to split the patch into
two, to separate the renames and the switch to lz4f?

That'd make the changes necessary for the lz4f switch clearer.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Tue, Mar 28, 2023 at 06:40:03PM +0200, Tomas Vondra wrote:
> On 3/28/23 18:07, gkokolatos@pm.me wrote:
> > ------- Original Message -------
> > On Friday, March 24th, 2023 at 10:30 AM, gkokolatos@pm.me <gkokolatos@pm.me> wrote:
> > 
> >> ------- Original Message -------
> >> On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra tomas.vondra@enterprisedb.com wrote:
> >>
> >>> This leaves the empty-data issue (which we have a fix for) and the
> >>> switch to LZ4F. And then the zstd part.
> >>
> >> Please expect promptly a patch for the switch to frames.
> > 
> > Please find the expected patch attached. Note that the bulk of the
> > patch is code unification, variable renaming to something more
> > appropriate, and comment addition. These are changes that are not
> > strictly necessary to switch to LZ4F. I do believe they are
> > essential for code hygiene after the switch, and they do belong
> > in the same commit.
> 
> Thanks!
> 
> I agree the renames & cleanup are appropriate - it'd be silly to stick
> to misleading naming etc. Would it make sense to split the patch into
> two, to separate the renames and the switch to lz4f?
> That'd make the changes necessary for the lz4f switch clearer.

I don't think so.  Did you mean separate commits only for review ?

The patch is pretty readable - the File API has just some renames, and
the compressor API is what's being replaced, which isn't going to be any
more clear.

@Georgios: did you consider using a C union in LZ4State, to separate the
parts used by the different APIs ?
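
Something along these lines is what I mean - only a sketch, with invented
member names rather than the real LZ4State fields:

    #include <stdbool.h>
    #include <stdio.h>
    #include <lz4frame.h>

    /*
     * Sketch only.  A given state is used either through the compressor API
     * or through the stream API, never both, so the API-specific parts could
     * share storage.
     */
    typedef struct LZ4StateSketch
    {
        /* shared by both APIs */
        LZ4F_preferences_t prefs;
        bool        compressing;

        union
        {
            struct                  /* compressor API */
            {
                LZ4F_compressionContext_t ctx;
                char       *outbuf;
                size_t      outsize;
            }           compressor;
            struct                  /* stream ("file") API */
            {
                FILE       *fp;
                LZ4F_decompressionContext_t dtx;
                char       *buffer;
                size_t      buflen;
            }           stream;
        }           u;
    } LZ4StateSketch;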

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
On 3/28/23 18:07, gkokolatos@pm.me wrote:
> 
> ------- Original Message -------
> On Friday, March 24th, 2023 at 10:30 AM, gkokolatos@pm.me <gkokolatos@pm.me> wrote:
> 
>>
>> ------- Original Message -------
>> On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra tomas.vondra@enterprisedb.com wrote:
>>
>>> This leaves the empty-data issue (which we have a fix for) and the
>>> switch to LZ4F. And then the zstd part.
>>
>> Please expect promptly a patch for the switch to frames.
> 
> Please find the expected patch attached. Note that the bulk of the
> patch is code unification, variable renaming to something more
> appropriate, and comment addition. These are changes that are not
> strictly necessary to switch to LZ4F. I do believe they are
> essential for code hygiene after the switch, and they do belong
> in the same commit.
> 


I think the patch is fine, but I'm wondering if the renames shouldn't go
a bit further. It removes references to LZ4File struct, but there's a
bunch of functions with the LZ4File_ prefix. Why not simply use the LZ4_
prefix? We don't have GzipFile either.

Sure, it might be a bit confusing because lz4.h uses LZ4_ prefix, but
then we probably should not define LZ4_compressor_init ...

Also, maybe the comments shouldn't use "File API" when compress_io.c
calls that "Compressed stream API".


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
On 3/28/23 00:34, gkokolatos@pm.me wrote:
> 
> ...
>
>> Got it. In that case I agree it's fine to do that in a single commit.
> 
> For what it's worth, I think that this patch should get a +1 and go in. It
> solves the empty-writes problem and includes a test for a previously untested
> case.
> 

Pushed, after updating / rewording the commit message a little bit.

Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Wednesday, March 29th, 2023 at 12:02 AM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
> On 3/28/23 18:07, gkokolatos@pm.me wrote:
>
> > ------- Original Message -------
> > On Friday, March 24th, 2023 at 10:30 AM, gkokolatos@pm.me gkokolatos@pm.me wrote:
> >
> > > ------- Original Message -------
> > > On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra tomas.vondra@enterprisedb.com wrote:
> > >
> > > > This leaves the empty-data issue (which we have a fix for) and the
> > > > switch to LZ4F. And then the zstd part.
> > >
> > > Please expect promptly a patch for the switch to frames.
> >
> > Please find the expected patch attached. Note that the bulk of the
> > patch is code unification, variable renaming to something more
> > appropriate, and comment addition. These are changes that are not
> > strictly necessary to switch to LZ4F. I do believe they are
> > essential for code hygiene after the switch, and they do belong
> > in the same commit.
>
>
> I think the patch is fine, but I'm wondering if the renames shouldn't go
> a bit further. It removes references to LZ4File struct, but there's a
> bunch of functions with the LZ4File_ prefix. Why not simply use the LZ4_
> prefix? We don't have GzipFile either.
>
> Sure, it might be a bit confusing because lz4.h uses LZ4_ prefix, but
> then we probably should not define LZ4_compressor_init ...

This is a good point. The initial thought was that since lz4.h is now
removed, such ambiguity will not be present. In v2 of the patch the
function is renamed to `LZ4State_compression_init` since this name
describes better its purpose. It initializes the LZ4State for
compression.

As for the LZ4File_ prefix, I have no objections. Please find the
prefix changed to LZ4Stream_. For the record, the word 'File' is not
unique to the lz4 implementation. The common data structure used by
the API in compress_io.h:

   typedef struct CompressFileHandle CompressFileHandle;

The public functions for this API are named:

  InitCompressFileHandle
  InitDiscoverCompressFileHandle
  EndCompressFileHandle

And within InitCompressFileHandle the pattern is:

    if (compression_spec.algorithm == PG_COMPRESSION_NONE)
        InitCompressFileHandleNone(CFH, compression_spec);
    else if (compression_spec.algorithm == PG_COMPRESSION_GZIP)
        InitCompressFileHandleGzip(CFH, compression_spec);
    else if (compression_spec.algorithm == PG_COMPRESSION_LZ4)
        InitCompressFileHandleLZ4(CFH, compression_spec);

It was felt that a prefix was required due to the inclusion of the 'lz4.h'
header, where naming functions with an 'LZ4_' prefix would be wrong. The
'LZ4File_' prefix seemed to be in line with the naming of the rest of
the relevant functions and structures. The other compression methods, gzip
and none, did not face the same issue.

To conclude, I think that having a prefix is slightly preferred
over not having one. Since the prefix `LZ4File_` is not desired,
I propose `LZ4Stream_` in v2.

I will not object to dismissing the argument and drop `File` from
the prefix, if so requested.

>
> Also, maybe the comments shouldn't use "File API" when compress_io.c
> calls that "Compressed stream API".

Done.

Cheers,
//Georgios

>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
On 3/31/23 11:19, gkokolatos@pm.me wrote:
> 
>> ...
>>
>>
>> I think the patch is fine, but I'm wondering if the renames shouldn't go
>> a bit further. It removes references to LZ4File struct, but there's a
>> bunch of functions with the LZ4File_ prefix. Why not simply use the LZ4_
>> prefix? We don't have GzipFile either.
>>
>> Sure, it might be a bit confusing because lz4.h uses LZ4_ prefix, but
>> then we probably should not define LZ4_compressor_init ...
> 
> This is a good point. The initial thought was that since lz4.h is now
> removed, such ambiguity will not be present. In v2 of the patch the
> function is renamed to `LZ4State_compression_init` since this name
> describes better its purpose. It initializes the LZ4State for
> compression.
> 
> As for the LZ4File_ prefix, I have no objections. Please find the
> prefix changed to LZ4Stream_. For the record, the word 'File' is not
> unique to the lz4 implementation. The common data structure used by
> the API in compress_io.h:
> 
>    typedef struct CompressFileHandle CompressFileHandle; 
> 
> The public functions for this API are named:
> 
>   InitCompressFileHandle
>   InitDiscoverCompressFileHandle
>   EndCompressFileHandle
> 
> And within InitCompressFileHandle the pattern is:
> 
>     if (compression_spec.algorithm == PG_COMPRESSION_NONE)
>         InitCompressFileHandleNone(CFH, compression_spec);
>     else if (compression_spec.algorithm == PG_COMPRESSION_GZIP)
>         InitCompressFileHandleGzip(CFH, compression_spec);
>     else if (compression_spec.algorithm == PG_COMPRESSION_LZ4)
>         InitCompressFileHandleLZ4(CFH, compression_spec);
> 
> It was felt that a prefix was required due to the inclusion of the 'lz4.h'
> header, where naming functions with an 'LZ4_' prefix would be wrong. The
> 'LZ4File_' prefix seemed to be in line with the naming of the rest of
> the relevant functions and structures. The other compression methods, gzip
> and none, did not face the same issue.
> 
> To conclude, I think that having a prefix is slightly preferred
> over not having one. Since the prefix `LZ4File_` is not desired,
> I propose `LZ4Stream_` in v2.
> 
> I will not object to dismissing the argument and drop `File` from
> the prefix, if so requested.
> 

Thanks.

I think the LZ4Stream prefix is reasonable, so let's roll with that. I
cleaned up the patch a little bit (mostly comment tweaks, etc.), updated
the commit message and pushed it.

The main tweak I did is renaming all the LZ4State variables from "fs" to
"state". The old name referred to the now-abandoned "file state", but
after the rename to LZ4State that seems confusing. Some of the places
already used "state", and it's easier to know "state" is always LZ4State,
so let's keep it consistent.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Mon, Feb 27, 2023 at 02:33:04PM +0000, gkokolatos@pm.me wrote:
> > > - Finally, the "Nothing to do in the default case" comment comes from
> > > Michael's commit 5e73a6048:
> > > 
> > > + /*
> > > + * Custom and directory formats are compressed by default with gzip when
> > > + * available, not the others.
> > > + */
> > > + if ((archiveFormat == archCustom || archiveFormat == archDirectory) &&
> > > + !user_compression_defined)
> > > {
> > > #ifdef HAVE_LIBZ
> > > - if (archiveFormat == archCustom || archiveFormat == archDirectory)
> > > - compressLevel = Z_DEFAULT_COMPRESSION;
> > > - else
> > > + parse_compress_specification(PG_COMPRESSION_GZIP, NULL,
> > > + &compression_spec);
> > > +#else
> > > + /* Nothing to do in the default case */
> > > #endif
> > > - compressLevel = 0;
> > > }
> > > 
> > > As the comment says: for -Fc and -Fd, the compression is set to zlib, if
> > > enabled, and when not otherwise specified by the user.
> > > 
> > > Before 5e73a6048, this set compressLevel=0 for -Fp and -Ft, and when
> > > zlib was unavailable.
> > > 
> > > But I'm not sure why there's now an empty "#else". I also don't know
> > > what "the default case" refers to.
> > > 
> > > Maybe the best thing here is to move the preprocessor #if, since it's no
> > > longer in the middle of a runtime conditional:
> > > 
> > > #ifdef HAVE_LIBZ
> > > + if ((archiveFormat == archCustom || archiveFormat == archDirectory) &&
> > > + !user_compression_defined)
> > > + parse_compress_specification(PG_COMPRESSION_GZIP, NULL,
> > > + &compression_spec);
> > > #endif
> > > 
> > > ...but that elicits a warning about "variable set but not used"...
> > 
> > 
> > Not sure, I need to think about this a bit.

> /* Nothing to do for the default case when LIBZ is not available */
> is easier to understand.

Maybe I would write it as: "if zlib is unavailable, default to no
compression".  But I think that's best done in the leading comment, and
not inside an empty preprocessor #else.

I was hoping Michael would comment on this.
The placement and phrasing of the comment makes no sense to me.

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Tue, Apr 11, 2023 at 07:41:11PM -0500, Justin Pryzby wrote:
> Maybe I would write it as: "if zlib is unavailable, default to no
> compression".  But I think that's best done in the leading comment, and
> not inside an empty preprocessor #else.
>
> I was hoping Michael would comment on this.

(Sorry for the late reply, somewhat missed that.)

> The placement and phrasing of the comment makes no sense to me.

Yes, this comment gives no value as it stands.  I would be tempted to
follow the suggestion to group the whole code block in a single ifdef,
including the check, and remove this comment.  Like the attached
perhaps?
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Wed, Apr 12, 2023 at 10:07:08AM +0900, Michael Paquier wrote:
> On Tue, Apr 11, 2023 at 07:41:11PM -0500, Justin Pryzby wrote:
> > Maybe I would write it as: "if zlib is unavailable, default to no
> > compression".  But I think that's best done in the leading comment, and
> > not inside an empty preprocessor #else.
> > 
> > I was hoping Michael would comment on this.
> 
> (Sorry for the late reply, somewhat missed that.)
> 
> > The placement and phrasing of the comment makes no sense to me.
> 
> Yes, this comment gives no value as it stands.  I would be tempted to
> follow the suggestion to group the whole code block in a single ifdef,
> including the check, and remove this comment.  Like the attached
> perhaps?

+1



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Tue, Apr 11, 2023 at 08:19:59PM -0500, Justin Pryzby wrote:
> On Wed, Apr 12, 2023 at 10:07:08AM +0900, Michael Paquier wrote:
>> Yes, this comment gives no value as it stands.  I would be tempted to
>> follow the suggestion to group the whole code block in a single ifdef,
>> including the check, and remove this comment.  Like the attached
>> perhaps?
>
> +1

Let me try this one again, as the previous patch would cause a warning
under --without-zlib as user_compression_defined would be unused.  We
could do something like the attached instead.  It means doing twice
parse_compress_specification() for the non-zlib path, still we are
already doing so for the zlib path.

If there are other ideas, feel free.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Thu, Apr 13, 2023 at 07:23:48AM +0900, Michael Paquier wrote:
> On Tue, Apr 11, 2023 at 08:19:59PM -0500, Justin Pryzby wrote:
> > On Wed, Apr 12, 2023 at 10:07:08AM +0900, Michael Paquier wrote:
> >> Yes, this comment gives no value as it stands.  I would be tempted to
> >> follow the suggestion to group the whole code block in a single ifdef,
> >> including the check, and remove this comment.  Like the attached
> >> perhaps?
> > 
> > +1
> 
> Let me try this one again, as the previous patch would cause a warning
> under --without-zlib as user_compression_defined would be unused.  We
> could do something like the attached instead.  It means doing twice
> parse_compress_specification() for the non-zlib path, still we are
> already doing so for the zlib path.
> 
> If there are other ideas, feel free.

I don't think you need to call parse_compress_specification(NONE).
As you wrote it, if zlib is unavailable, there's no parse(NONE) call,
even for directory and custom formats.  And there's no parse(NONE) call
for plain format when zlib is available.

The old way had preprocessor #if around both the "if" and "else" - is
that what you meant ?

If you don't insist on calling parse(NONE), the only change is to remove
the empty #else, which was my original patch.

"if no compression specification has been specified" is redundant with
"by default", and causes "not the others" to dangle.

If I were to rewrite the comment, it'd say:

+        * When gzip is available, custom and directory formats are compressed by
+        * default



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Wed, Apr 12, 2023 at 05:52:40PM -0500, Justin Pryzby wrote:
> I don't think you need to call parse_compress_specification(NONE).
> As you wrote it, if zlib is unavailable, there's no parse(NONE) call,
> even for directory and custom formats.  And there's no parse(NONE) call
> for plain format when zlib is available.

Yeah, that's not necessary, but I was wondering if it made the code a
bit cleaner, or else the non-zlib path would rely on the default
compression method string.

> The old way had preprocessor #if around both the "if" and "else" - is
> that what you meant?
>
> If you don't insist on calling parse(NONE), the only change is to remove
> the empty #else, which was my original patch.

Removing the empty else has the problem of creating an empty if block,
which could itself be a cause of warnings?

> If I were to rewrite the comment, it'd say:
>
> +        * When gzip is available, custom and directory formats are compressed by
> +        * default

Okay.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Justin Pryzby
Date:
On Thu, Apr 13, 2023 at 09:37:06AM +0900, Michael Paquier wrote:
> > If you don't insist on calling parse(NONE), the only change is to remove
> > the empty #else, which was my original patch.
> 
> Removing the empty else has the problem of creating an empty if block,
> which could itself be a cause of warnings?

I doubt it - in the !HAVE_LIBZ case, it's currently an "if" statement
with nothing but a comment, which isn't a problem.

I think the only issue with an empty "if" is when you have no braces,
like:

    if (...)
#if ...
        something;
#endif

    // problem here //

-- 
Justin



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Wed, Apr 12, 2023 at 07:53:53PM -0500, Justin Pryzby wrote:
> I doubt it - in the !HAVE_LIBZ case, it's currently an "if" statement
> with nothing but a comment, which isn't a problem.
>
> I think the only issue with an empty "if" is when you have no braces,
> like:
>
>     if (...)
> #if ...
>         something;
> #endif
>
>     // problem here //

(My apologies for the late reply.)

Still it could be easily messed up, and that's not a style that
really exists in the tree, either, because there are always #else
blocks set up in such cases.  Another part that makes me a bit
uncomfortable is that we would still call twice
parse_compress_specification(), something that should not happen but
we are doing so on HEAD because the default compression_algorithm_str
is "none" and we want to enforce "gzip" for custom and directory
formats when building with zlib.

What about just moving this block a bit up, just before the
compression spec parsing, then?  If we set compression_algorithm_str,
the specification is compiled with the expected default, once instead
of twice.
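
In toy form, the intent is something like the following (a standalone
mock-up with stand-in names, not the actual pg_dump.c code):

    #include <stdbool.h>
    #include <stdio.h>

    static int  parse_calls = 0;

    /* stand-in for parse_compress_specification() */
    static void
    parse_spec_stub(const char *algorithm)
    {
        parse_calls++;
        printf("parsed specification for \"%s\"\n", algorithm);
    }

    int
    main(void)
    {
        bool        have_libz = true;           /* pretend --with-zlib */
        bool        compressible_format = true; /* pretend -Fc or -Fd */
        bool        user_compression_defined = false;
        const char *compression_algorithm_str = "none";

        /* choose the default *before* the specification is parsed ... */
        if (have_libz && compressible_format && !user_compression_defined)
            compression_algorithm_str = "gzip";

        /* ... so the parsing runs exactly once, whatever the defaults were */
        parse_spec_stub(compression_algorithm_str);

        printf("parse called %d time(s)\n", parse_calls);
        return 0;
    }
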
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Tuesday, April 25th, 2023 at 8:02 AM, Michael Paquier <michael@paquier.xyz> wrote:


>
>
> On Wed, Apr 12, 2023 at 07:53:53PM -0500, Justin Pryzby wrote:
>
> > I doubt it - in the !HAVE_LIBZ case, it's currently an "if" statement
> > with nothing but a comment, which isn't a problem.
> >
> > I think the only issue with an empty "if" is when you have no braces,
> > like:
> >
> > if (...)
> > #if ...
> > something;
> > #endif
> >
> > // problem here //
>
>
> (My apologies for the late reply.)
>
> Still it could be easily messed up, and that's not a style that
> really exists in the tree, either, because there are always #else
> blocks set up in such cases. Another part that makes me a bit
> uncomfortable is that we would still call twice
> parse_compress_specification(), something that should not happen but
> we are doing so on HEAD because the default compression_algorithm_str
> is "none" and we want to enforce "gzip" for custom and directory
> formats when building with zlib.
>
> What about just moving this block a bit up, just before the
> compression spec parsing, then? If we set compression_algorithm_str,
> the specification is compiled with the expected default, once instead
> of twice.

For what it's worth, I think this would be the best approach. +1

Cheers,
//Georgios

> --
> Michael



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Wed, Apr 26, 2023 at 08:50:46AM +0000, gkokolatos@pm.me wrote:
> For what is worth, I think this would be the best approach. +1

Thanks.  I have gone with that, then!
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Alexander Lakhin
Date:
23.03.2023 20:10, Tomas Vondra wrote:
> So pushed all three parts, after updating the commit messages a bit.
>
> This leaves the empty-data issue (which we have a fix for) and the
> switch to LZ4F. And then the zstd part.
>

I'm sorry that I haven't noticed/checked that before, but when trying to
perform check-world with Valgrind I've discovered another issue presumably
related to LZ4File_gets().
When running under Valgrind:
PROVE_TESTS=t/002_pg_dump.pl make check -C src/bin/pg_dump/
I get:
...
[07:07:11.683](0.000s) ok 1939 - compression_lz4_dir: glob check for 
.../src/bin/pg_dump/tmp_check/tmp_test_HB6A/compression_lz4_dir/*.dat.lz4
# Running: pg_restore --jobs=2 --file=.../src/bin/pg_dump/tmp_check/tmp_test_HB6A/compression_lz4_dir.sql 
.../src/bin/pg_dump/tmp_check/tmp_test_HB6A/compression_lz4_dir

==00:00:00:00.579 2811926== Conditional jump or move depends on uninitialised value(s)
==00:00:00:00.579 2811926==    at 0x4853376: rawmemchr (vg_replace_strmem.c:1548)
==00:00:00:00.579 2811926==    by 0x4C96A67: _IO_str_init_static_internal (strops.c:41)
==00:00:00:00.579 2811926==    by 0x4C693A2: _IO_strfile_read (strfile.h:95)
==00:00:00:00.579 2811926==    by 0x4C693A2: __isoc99_sscanf (isoc99_sscanf.c:28)
==00:00:00:00.579 2811926==    by 0x11DB6F: _LoadLOs (pg_backup_directory.c:458)
==00:00:00:00.579 2811926==    by 0x11DD1E: _PrintTocData (pg_backup_directory.c:422)
==00:00:00:00.579 2811926==    by 0x118484: restore_toc_entry (pg_backup_archiver.c:882)
==00:00:00:00.579 2811926==    by 0x1190CC: RestoreArchive (pg_backup_archiver.c:699)
==00:00:00:00.579 2811926==    by 0x10F25D: main (pg_restore.c:414)
==00:00:00:00.579 2811926==
...

It looks like the line variable returned by gets_func() here is not
null-terminated:
     while ((CFH->gets_func(line, MAXPGPATH, CFH)) != NULL)
     {
...
         if (sscanf(line, "%u %" CppAsString2(MAXPGPATH) "s\n", &oid, lofname) != 2)
...
And Valgrind doesn't like it.

Best regards,
Alexander



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Friday, May 5th, 2023 at 8:00 AM, Alexander Lakhin <exclusion@gmail.com> wrote:


>
>
> 23.03.2023 20:10, Tomas Vondra wrote:
>
> > So pushed all three parts, after updating the commit messages a bit.
> >
> > This leaves the empty-data issue (which we have a fix for) and the
> > switch to LZ4F. And then the zstd part.
>
>
> I'm sorry that I haven't noticed/checked that before, but when trying to
> perform check-world with Valgrind I've discovered another issue presumably
> related to LZ4File_gets().
> When running under Valgrind:
> PROVE_TESTS=t/002_pg_dump.pl make check -C src/bin/pg_dump/
> I get:
> ...
> 07:07:11.683 ok 1939 - compression_lz4_dir: glob check for
> .../src/bin/pg_dump/tmp_check/tmp_test_HB6A/compression_lz4_dir/*.dat.lz4
> # Running: pg_restore --jobs=2 --file=.../src/bin/pg_dump/tmp_check/tmp_test_HB6A/compression_lz4_dir.sql
> .../src/bin/pg_dump/tmp_check/tmp_test_HB6A/compression_lz4_dir
>
> ==00:00:00:00.579 2811926== Conditional jump or move depends on uninitialised value(s)
> ==00:00:00:00.579 2811926== at 0x4853376: rawmemchr (vg_replace_strmem.c:1548)
> ==00:00:00:00.579 2811926== by 0x4C96A67: _IO_str_init_static_internal (strops.c:41)
> ==00:00:00:00.579 2811926== by 0x4C693A2: _IO_strfile_read (strfile.h:95)
> ==00:00:00:00.579 2811926== by 0x4C693A2: __isoc99_sscanf (isoc99_sscanf.c:28)
> ==00:00:00:00.579 2811926== by 0x11DB6F: _LoadLOs (pg_backup_directory.c:458)
> ==00:00:00:00.579 2811926== by 0x11DD1E: _PrintTocData (pg_backup_directory.c:422)
> ==00:00:00:00.579 2811926== by 0x118484: restore_toc_entry (pg_backup_archiver.c:882)
> ==00:00:00:00.579 2811926== by 0x1190CC: RestoreArchive (pg_backup_archiver.c:699)
> ==00:00:00:00.579 2811926== by 0x10F25D: main (pg_restore.c:414)
> ==00:00:00:00.579 2811926==
> ...
>
> It looks like the line variable returned by gets_func() here is not
> null-terminated:
> while ((CFH->gets_func(line, MAXPGPATH, CFH)) != NULL)
>
> {
> ...
> if (sscanf(line, "%u %" CppAsString2(MAXPGPATH) "s\n", &oid, lofname) != 2)
> ...
> And Valgrind doesn't like it.
>

Valgrind is correct to not like it. LZ4Stream_gets() got modeled after
gets() when it should have been modeled after fgets().

Please find a patch attached to address it.

Cheers,
//Georgios

> Best regards,
> Alexander
Attachment

Re: Add LZ4 compression in pg_dump

From
Andrew Dunstan
Date:


On 2023-05-05 Fr 06:02, gkokolatos@pm.me wrote:




------- Original Message -------
On Friday, May 5th, 2023 at 8:00 AM, Alexander Lakhin <exclusion@gmail.com> wrote:



23.03.2023 20:10, Tomas Vondra wrote:

So pushed all three parts, after updating the commit messages a bit.

This leaves the empty-data issue (which we have a fix for) and the
switch to LZ4F. And then the zstd part.

I'm sorry that I haven't noticed/checked that before, but when trying to
perform check-world with Valgrind I've discovered another issue presumably
related to LZ4File_gets().
When running under Valgrind:
PROVE_TESTS=t/002_pg_dump.pl make check -C src/bin/pg_dump/
I get:
...
07:07:11.683 ok 1939 - compression_lz4_dir: glob check for
.../src/bin/pg_dump/tmp_check/tmp_test_HB6A/compression_lz4_dir/*.dat.lz4
# Running: pg_restore --jobs=2 --file=.../src/bin/pg_dump/tmp_check/tmp_test_HB6A/compression_lz4_dir.sql
.../src/bin/pg_dump/tmp_check/tmp_test_HB6A/compression_lz4_dir

==00:00:00:00.579 2811926== Conditional jump or move depends on uninitialised value(s)
==00:00:00:00.579 2811926== at 0x4853376: rawmemchr (vg_replace_strmem.c:1548)
==00:00:00:00.579 2811926== by 0x4C96A67: _IO_str_init_static_internal (strops.c:41)
==00:00:00:00.579 2811926== by 0x4C693A2: _IO_strfile_read (strfile.h:95)
==00:00:00:00.579 2811926== by 0x4C693A2: __isoc99_sscanf (isoc99_sscanf.c:28)
==00:00:00:00.579 2811926== by 0x11DB6F: _LoadLOs (pg_backup_directory.c:458)
==00:00:00:00.579 2811926== by 0x11DD1E: _PrintTocData (pg_backup_directory.c:422)
==00:00:00:00.579 2811926== by 0x118484: restore_toc_entry (pg_backup_archiver.c:882)
==00:00:00:00.579 2811926== by 0x1190CC: RestoreArchive (pg_backup_archiver.c:699)
==00:00:00:00.579 2811926== by 0x10F25D: main (pg_restore.c:414)
==00:00:00:00.579 2811926==
...

It looks like the line variable returned by gets_func() here is not
null-terminated:
while ((CFH->gets_func(line, MAXPGPATH, CFH)) != NULL)

{
...
if (sscanf(line, "%u %" CppAsString2(MAXPGPATH) "s\n", &oid, lofname) != 2)
...
And Valgrind doesn't like it.

Valgrind is correct to not like it. LZ4Stream_gets() got modeled after
gets() when it should have been modeled after fgets().

Please find a patch attached to address it.



Isn't using memset here a bit wasteful? Why not just put a null at the end after calling LZ4Stream_read_internal(), which tells you how many bytes it has written?
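
Roughly, the two options being compared (paraphrased from the discussion; the
size - 1 argument and the surrounding declarations are assumptions, not the
actual patch code):

    /* v1, as described: clear the whole output buffer before reading */
    memset(ptr, '\0', size);
    ret = LZ4Stream_read_internal(state, ptr, size - 1, true);

    /* suggested alternative: terminate only the bytes actually read */
    ret = LZ4Stream_read_internal(state, ptr, size - 1, true);
    if (ret > 0)
        ptr[ret] = '\0';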


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:


------- Original Message -------
On Friday, May 5th, 2023 at 3:23 PM, Andrew Dunstan <andrew@dunslane.net> wrote:


On 2023-05-05 Fr 06:02, gkokolatos@pm.me wrote:
------- Original Message -------
On Friday, May 5th, 2023 at 8:00 AM, Alexander Lakhin <exclusion@gmail.com> wrote:


23.03.2023 20:10, Tomas Vondra wrote:

So pushed all three parts, after updating the commit messages a bit.

This leaves the empty-data issue (which we have a fix for) and the
switch to LZ4F. And then the zstd part.
I'm sorry that I haven't noticed/checked that before, but when trying to
perform check-world with Valgrind I've discovered another issue presumably
related to LZ4File_gets().
When running under Valgrind:
PROVE_TESTS=t/002_pg_dump.pl make check -C src/bin/pg_dump/
I get:
...
07:07:11.683 ok 1939 - compression_lz4_dir: glob check for
.../src/bin/pg_dump/tmp_check/tmp_test_HB6A/compression_lz4_dir/*.dat.lz4
# Running: pg_restore --jobs=2 --file=.../src/bin/pg_dump/tmp_check/tmp_test_HB6A/compression_lz4_dir.sql
.../src/bin/pg_dump/tmp_check/tmp_test_HB6A/compression_lz4_dir

==00:00:00:00.579 2811926== Conditional jump or move depends on uninitialised value(s)
==00:00:00:00.579 2811926== at 0x4853376: rawmemchr (vg_replace_strmem.c:1548)
==00:00:00:00.579 2811926== by 0x4C96A67: _IO_str_init_static_internal (strops.c:41)
==00:00:00:00.579 2811926== by 0x4C693A2: _IO_strfile_read (strfile.h:95)
==00:00:00:00.579 2811926== by 0x4C693A2: __isoc99_sscanf (isoc99_sscanf.c:28)
==00:00:00:00.579 2811926== by 0x11DB6F: _LoadLOs (pg_backup_directory.c:458)
==00:00:00:00.579 2811926== by 0x11DD1E: _PrintTocData (pg_backup_directory.c:422)
==00:00:00:00.579 2811926== by 0x118484: restore_toc_entry (pg_backup_archiver.c:882)
==00:00:00:00.579 2811926== by 0x1190CC: RestoreArchive (pg_backup_archiver.c:699)
==00:00:00:00.579 2811926== by 0x10F25D: main (pg_restore.c:414)
==00:00:00:00.579 2811926==
...

It looks like the line variable returned by gets_func() here is not
null-terminated:
while ((CFH->gets_func(line, MAXPGPATH, CFH)) != NULL)

{
...
if (sscanf(line, "%u %" CppAsString2(MAXPGPATH) "s\n", &oid, lofname) != 2)
...
And Valgrind doesn't like it.

Valgrind is correct to not like it. LZ4Stream_gets() got modeled after
gets() when it should have been modeled after fgets().

Please find a patch attached to address it.



Isn't using memset here a bit wasteful? Why not just put a null at the end after calling LZ4Stream_read_internal(), which tells you how many bytes it has written?

Good point. I thought about it before submitting the patch. I concluded that given the complexity and operations involved in LZ4Stream_read_internal() and the rest of the pg_dump/pg_restore code, the memset() call will be negligible. However from the readability point of view, the function is a bit cleaner with the memset().

I will not object to any suggestion though, as this is a very trivial point. Please find attached a v2 of the patch following the suggested approach.

Cheers,

//Georgios



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Fri, May 05, 2023 at 02:13:28PM +0000, gkokolatos@pm.me wrote:
> Good point. I thought about it before submitting the patch. I
> concluded that given the complexity and operations involved in
> LZ4Stream_read_internal() and the rest of the pg_dump/pg_restore
> code, the memset() call will be negligible. However from the
> readability point of view, the function is a bit cleaner with the
> memset().
>
> I will not object to any suggestion though, as this is a very
> trivial point. Please find attached a v2 of the patch following the
> suggested approach.

Please note that an open item has been added for this stuff.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:



On Sat, May 6, 2023 at 04:51, Michael Paquier <michael@paquier.xyz> wrote:
On Fri, May 05, 2023 at 02:13:28PM +0000, gkokolatos@pm.me wrote:
> Good point. I thought about it before submitting the patch. I
> concluded that given the complexity and operations involved in
> LZ4Stream_read_internal() and the rest of the pg_dump/pg_restore
> code, the memset() call will be negligible. However from the
> readability point of view, the function is a bit cleaner with the
> memset().
>
> I will not object to any suggestion though, as this is a very
> trivial point. Please find attached a v2 of the patch following the
> suggested approach.

Please note that an open item has been added for this stuff.
Thank you but I am not certain I know what that means. Can you please explain?

Cheers,
//Georgios
--
Michael

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Sun, May 07, 2023 at 03:01:52PM +0000, gkokolatos@pm.me wrote:
> Thank you but I am not certain I know what that means. Can you please explain?

It means that this thread has been added to the following list:
https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items#Open_Issues

pg_dump/compress_lz4.c is new as of PostgreSQL 16, and this patch is
fixing a deficiency.  That's just a way outside of the commit fest to
track any problems and make sure these are fixed before the release
happens.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Fri, May 05, 2023 at 02:13:28PM +0000, gkokolatos@pm.me wrote:
> Good point. I thought about it before submitting the patch. I
> concluded that given the complexity and operations involved in
> LZ4Stream_read_internal() and the rest of the pg_dump/pg_restore
> code, the memset() call will be negligible. However from the
> readability point of view, the function is a bit cleaner with the
> memset().
>
> I will not object to any suggestion though, as this is a very
> trivial point. Please find attached a v2 of the patch following the
> suggested approach.

Hmm.  I was looking at this patch, and what you are trying to do
sounds rather right to keep a parallel with the gzip and zstd code
paths.

Looking at the code of gzread.c, gzgets() enforces a null-termination
on the string read.  Still, isn't that something we'd better enforce
in read_none() as well?  compress_io.h lists this as a requirement of
the callback, and Zstd_gets() does so already.  read_none() does not
enforce that, unfortunately.

+   /* No work needs to be done for a zero-sized output buffer */
+   if (size <= 0)
+       return 0;

Indeed.  This should be OK.

-   ret = LZ4Stream_read_internal(state, ptr, size, true);
+   Assert(size > 1);

The addition of this assertion is a bit surprising, and this is
inconsistent with Zstd_gets where a length of 1 is authorized.  We
should be more consistent across all the callbacks, IMO, not less, so
as we apply the same API contract across all the compression methods.

While testing this patch, I have triggered an error pointing out that
the decompression path of LZ4 is broken for table data.  I can
reproduce that with a dump of the regression database, as of:
make installcheck
pg_dump --format=d --file=dump_lz4 --compress=lz4 regression
createdb regress_lz4
pg_restore --format=d -d regress_lz4 dump_lz4
pg_restore: error: COPY failed for table "clstr_tst": ERROR:  extra data after last expected column
CONTEXT:  COPY clstr_tst, line 15: "32    6    seis
xyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzy..."
pg_restore: warning: errors ignored on restore: 1

This does not show up with gzip or zstd, and the patch does not
influence the result.  In short it shows up with and without the
patch, on HEAD.  That does not look really stable :/
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> While testing this patch, I have triggered an error pointing out that
> the decompression path of LZ4 is broken for table data.  I can
> reproduce that with a dump of the regression database, as of:
> make installcheck
> pg_dump --format=d --file=dump_lz4 --compress=lz4 regression
> createdb regress_lz4
> pg_restore --format=d -d regress_lz4 dump_lz4
> pg_restore: error: COPY failed for table "clstr_tst": ERROR:  extra data after last expected column
> CONTEXT:  COPY clstr_tst, line 15: "32    6    seis
xyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzyxyzzy..."
> pg_restore: warning: errors ignored on restore: 1

Ugh.  Reproduced here ... so we need an open item for this.

            regards, tom lane



Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Sun, May 07, 2023 at 09:09:25PM -0400, Tom Lane wrote:
> Ugh.  Reproduced here ... so we need an open item for this.

Yep.  Already added.
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Tom Lane
Date:
I wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> While testing this patch, I have triggered an error pointing out that
>> the decompression path of LZ4 is broken for table data.  I can
>> reproduce that with a dump of the regression database, as of:
>> make installcheck
>> pg_dump --format=d --file=dump_lz4 --compress=lz4 regression

> Ugh.  Reproduced here ... so we need an open item for this.

BTW, it seems to work with --format=c.
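
For reference, the equivalent check with the custom format (the file and
database names here are just placeholders, mirroring the directory-format
commands quoted above):

    pg_dump --format=c --file=dump_lz4.dump --compress=lz4 regression
    createdb regress_lz4_c
    pg_restore -d regress_lz4_c dump_lz4.dump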

            regards, tom lane



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 5/7/23 17:01, gkokolatos@pm.me wrote:
> 
> 
> 
> On Sat, May 6, 2023 at 04:51, Michael Paquier <michael@paquier.xyz> wrote:
>> On Fri, May 05, 2023 at 02:13:28PM +0000, gkokolatos@pm.me wrote:
>> > Good point. I thought about it before submitting the patch. I
>> > concluded that given the complexity and operations involved in
>> > LZ4Stream_read_internal() and the rest of the pg_dump/pg_restore
>> > code, the memset() call will be negligible. However from the
>> > readability point of view, the function is a bit cleaner with the
>> > memset().
>> >
>> > I will not object to any suggestion though, as this is a very
>> > trivial point. Please find attached a v2 of the patch following the
>> > suggested approach.
>>
>> Please note that an open item has been added for this stuff.
> Thank you but I am not certain I know what that means. Can you please
> explain?
> 

It means it was added to the list of items we need to fix before PG16
gets out:

https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Monday, May 8th, 2023 at 3:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:


>
>
> I wrote:
>
> > Michael Paquier michael@paquier.xyz writes:
> >
> > > While testing this patch, I have triggered an error pointing out that
> > > the decompression path of LZ4 is broken for table data. I can
> > > reproduce that with a dump of the regression database, as of:
> > > make installcheck
> > > pg_dump --format=d --file=dump_lz4 --compress=lz4 regression
>
> > Ugh. Reproduced here ... so we need an open item for this.
>
>
> BTW, it seems to work with --format=c.
>

Thank you for the extra tests. It seems that a gap exists in the test
coverage. Please find attached a patch that addresses the issue
and attempts to provide tests for it.

Cheers,
//Georgios

> regards, tom lane
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
On 5/8/23 03:16, Tom Lane wrote:
> I wrote:
>> Michael Paquier <michael@paquier.xyz> writes:
>>> While testing this patch, I have triggered an error pointing out that
>>> the decompression path of LZ4 is broken for table data.  I can
>>> reproduce that with a dump of the regression database, as of:
>>> make installcheck
>>> pg_dump --format=d --file=dump_lz4 --compress=lz4 regression
> 
>> Ugh.  Reproduced here ... so we need an open item for this.
> 
> BTW, it seems to work with --format=c.
> 

The LZ4Stream_write() forgot to move the pointer to the next chunk, so
it was happily compressing the initial chunk over and over. A bit of an
embarrassing oversight :-(

The custom format calls WriteDataToArchiveLZ4(), which was correct.

The attached patch fixes this for me.
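
For readers following along, the shape of the bug (not the actual
LZ4Stream_write() code) can be sketched like this, with plain fwrite()
standing in for the LZ4 compression step and CHUNK_SIZE standing in for
DEFAULT_IO_BUFFER_SIZE:

    #include <stdio.h>

    #define CHUNK_SIZE 4096

    /*
     * Without the "in += chunk" line, each iteration writes out the first
     * chunk of the input again instead of advancing through it.
     */
    static void
    write_in_chunks(FILE *out, const void *data, size_t size)
    {
        const char *in = (const char *) data;

        while (size > 0)
        {
            size_t  chunk = size < CHUNK_SIZE ? size : CHUNK_SIZE;

            fwrite(in, 1, chunk, out);

            in += chunk;            /* the advance that was missing */
            size -= chunk;
        }
    }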


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 5/8/23 18:19, gkokolatos@pm.me wrote:
> 
> 
> 
> 
> 
> ------- Original Message -------
> On Monday, May 8th, 2023 at 3:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
> 
>>
>>
>> I wrote:
>>
>>> Michael Paquier michael@paquier.xyz writes:
>>>
>>>> While testing this patch, I have triggered an error pointing out that
>>>> the decompression path of LZ4 is broken for table data. I can
>>>> reproduce that with a dump of the regression database, as of:
>>>> make installcheck
>>>> pg_dump --format=d --file=dump_lz4 --compress=lz4 regression
>>
>>> Ugh. Reproduced here ... so we need an open item for this.
>>
>>
>> BTW, it seems to work with --format=c.
>>
> 
> Thank you for the extra tests. It seems that a gap exists in the test
> coverage. Please find attached a patch that addresses the issue
> and attempts to provide tests for it.
> 

Seems I'm getting messages with a delay - this is mostly the same fix I
ended up with, not realizing you already posted a fix.

I don't think we need the local "in" variable - the pointer parameter is
local in the function, so we can modify it directly (with a cast).
WriteDataToArchiveLZ4 does it that way too.
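
That is, roughly (a fragment only, reusing the hypothetical chunked-write
sketch from earlier in the thread rather than the real LZ4Stream_write()):

    while (size > 0)
    {
        size_t  chunk = size < CHUNK_SIZE ? size : CHUNK_SIZE;

        /* ... compress and write the chunk ... */

        data = (const char *) data + chunk;     /* advance the parameter itself */
        size -= chunk;
    }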

The tests are definitely a good idea. I wonder if we should add a
comment to DEFAULT_IO_BUFFER_SIZE mentioning that if we choose to
increase the value in the future, we need to tweak the tests too to use
more data in order to exercise the buffering etc. Maybe it's obvious?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Monday, May 8th, 2023 at 8:20 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
>
>
> On 5/8/23 18:19, gkokolatos@pm.me wrote:
>
> > ------- Original Message -------
> > On Monday, May 8th, 2023 at 3:16 AM, Tom Lane tgl@sss.pgh.pa.us wrote:
> >
> > > I wrote:
> > >
> > > > Michael Paquier michael@paquier.xyz writes:
> > > >
> > > > > While testing this patch, I have triggered an error pointing out that
> > > > > the decompression path of LZ4 is broken for table data. I can
> > > > > reproduce that with a dump of the regression database, as of:
> > > > > make installcheck
> > > > > pg_dump --format=d --file=dump_lz4 --compress=lz4 regression
> > >
> > > > Ugh. Reproduced here ... so we need an open item for this.
> > >
> > > BTW, it seems to work with --format=c.
> >
> > Thank you for the extra tests. It seems that a gap exists in the test
> > coverage. Please find attached a patch that addresses the issue
> > and attempts to provide tests for it.
>
>
> Seems I'm getting messages with a delay - this is mostly the same fix I
> ended up with, not realizing you already posted a fix.

Thank you very much for looking.

> I don't think we need the local "in" variable - the pointer parameter is
> local in the function, so we can modify it directly (with a cast).
> WriteDataToArchiveLZ4 does it that way too.

Sure, patch updated.

> The tests are definitely a good idea.

Thank you.

> I wonder if we should add a
> comment to DEFAULT_IO_BUFFER_SIZE mentioning that if we choose to
> increase the value in the future, we need to tweak the tests too to use
> more data in order to exercise the buffering etc. Maybe it's obvious?
>

You are right. Added a comment both in the header and in the test.

I hope v2 gets closer to closing the open item for this.

Cheers,
//Georgios


>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Mon, May 08, 2023 at 08:00:39PM +0200, Tomas Vondra wrote:
> The LZ4Stream_write() forgot to move the pointer to the next chunk, so
> it was happily decompressing the initial chunk over and over. A bit
> embarrassing oversight :-(
>
> The custom format calls WriteDataToArchiveLZ4(), which was correct.
>
> The attached patch fixes this for me.

Ouch.  So this was corrupting the dumps and the compression when
trying to write more than two chunks at once, not the decompression
steps.  That addresses the issue here as well, thanks!
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
On 5/9/23 00:10, Michael Paquier wrote:
> On Mon, May 08, 2023 at 08:00:39PM +0200, Tomas Vondra wrote:
>> The LZ4Stream_write() forgot to move the pointer to the next chunk, so
>> it was happily compressing the initial chunk over and over. A bit of an
>> embarrassing oversight :-(
>>
>> The custom format calls WriteDataToArchiveLZ4(), which was correct.
>>
>> The attached patch fixes this for me.
> 
> Ouch.  So this was corrupting the dumps and the compression when
> trying to write more than two chunks at once, not the decompression
> steps.  That addresses the issue here as well, thanks!

Yeah. Thanks for the report, should have been found during review.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
gkokolatos@pm.me
Date:




------- Original Message -------
On Tuesday, May 9th, 2023 at 2:54 PM, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


>
>
> On 5/9/23 00:10, Michael Paquier wrote:
>
> > On Mon, May 08, 2023 at 08:00:39PM +0200, Tomas Vondra wrote:
> >
> > > The LZ4Stream_write() forgot to move the pointer to the next chunk, so
> > > it was happily compressing the initial chunk over and over. A bit of an
> > > embarrassing oversight :-(
> > >
> > > The custom format calls WriteDataToArchiveLZ4(), which was correct.
> > >
> > > The attached patch fixes this for me.
> >
> > Ouch. So this was corrupting the dumps and the compression when
> > trying to write more than two chunks at once, not the decompression
> > steps. That addresses the issue here as well, thanks!
>
>
> Yeah. Thanks for the report, should have been found during review.

Thank you both for looking. A small consolation is that now there are
tests for this case.

Moving on to the other open item for this, please find attached v2
of the patch as requested.

Cheers,
//Georgios

>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Tue, May 09, 2023 at 02:12:44PM +0000, gkokolatos@pm.me wrote:
> Thank you both for looking. A small consolation is that now there are
> tests for this case.

+1, noticing that was pure luck ;)

Worth noting that the patch posted in [1] has these tests, not the
version posted in [2].

+    create_sql   => 'INSERT INTO dump_test.test_compression_method (col1) '
+      . 'SELECT string_agg(a::text, \'\') FROM generate_series(1,4096) a;',

Yep, good and cheap idea to check for longer chunks.  That should be
enough to loop twice.

[1]:
https://www.postgresql.org/message-id/SYTRcNgtAbzyn3y3IInh1x-UfNTKMNpnFvI3mr6SyqyVf3PkaDsMy_cpKKgsl3_HdLy2MFAH4zwjxDmFfiLO8rWtSiJWBtqT06OMjeNo4GA=@pm.me
[2]: https://www.postgresql.org/message-id/f735df01-0bb4-2fbc-1297-73a520cfc534@enterprisedb.com

> Moving on to the other open item for this, please find attached v2
> of the patch as requested.

Did you notice the comments in [3] about the second patch, the one that
adds null termination to the line returned by the LZ4 fgets()-style callback?

[3]: https://www.postgresql.org/message-id/ZFhCyn4Gm2eu60rB@paquier.xyz
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Michael Paquier
Date:
On Tue, May 09, 2023 at 02:54:31PM +0200, Tomas Vondra wrote:
> Yeah. Thanks for the report, should have been found during review.

Tomas, are you planning to do something by the end of this week for
beta1?  Or do you need some help of any kind?
--
Michael

Attachment

Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:
On 5/17/23 08:18, Michael Paquier wrote:
> On Tue, May 09, 2023 at 02:54:31PM +0200, Tomas Vondra wrote:
>> Yeah. Thanks for the report, should have been found during review.
> 
> Tomas, are you planning to do something by the end of this week for
> beta1?  Or do you need some help of any kind?

I'll take care of it.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add LZ4 compression in pg_dump

From
Tomas Vondra
Date:

On 5/17/23 10:59, Tomas Vondra wrote:
> On 5/17/23 08:18, Michael Paquier wrote:
>> On Tue, May 09, 2023 at 02:54:31PM +0200, Tomas Vondra wrote:
>>> Yeah. Thanks for the report, should have been found during review.
>>
>> Tomas, are you planning to do something by the end of this week for
>> beta1?  Or do you need some help of any kind?
> 
> I'll take care of it.
> 

FWIW I've pushed fixes for both open issues associated with the pg_dump
compression. I'll keep an eye on the buildfarm, but hopefully that'll do
it for beta1.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company