Re: zstd compression for pg_dump - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: zstd compression for pg_dump
Date
Msg-id ZBKgBz+4blcKru6x@telsasoft.com
Whole thread Raw
In response to Re: zstd compression for pg_dump  (Jacob Champion <jchampion@timescale.com>)
Responses Re: zstd compression for pg_dump  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Re: zstd compression for pg_dump  (Jacob Champion <jchampion@timescale.com>)
List pgsql-hackers
On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> > I did some smoke testing against zstd's GitHub release on Windows. To
> > build against it, I had to construct an import library, and put that
> > and the DLL into the `lib` folder expected by the MSVC scripts...
> > which makes me wonder if I've chosen a harder way than necessary?
> 
> It looks like pg_dump's meson.build is missing dependencies on zstd
> (meson couldn't find the headers in the subproject without them).

I saw that this was added for LZ4, but I hadn't added it for zstd since
I didn't run into an issue without it.  Could you check that what I've
added works for your case ?

> > Parallel zstd dumps seem to work as expected, in that the resulting
> > pg_restore output is identical to uncompressed dumps and nothing
> > explodes. I haven't inspected the threading implementation for safety
> > yet, as you mentioned.
> 
> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> handling safety for this, by isolating each thread's state. I don't feel
> comfortable pronouncing this new addition safe or not, because I'm not
> sure I understand what the comments in the format-specific _Clone()
> callbacks are saying yet.

My line of reasoning for unix is that pg_dump forks before any calls to
zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
doesn't apply to pg_dump under windows.  This is an opened question.  If
there's no solid answer, I could disable/ignore the option (maybe only
under windows).

> On to code (not a complete review):
> 
> >     if (hasSuffix(fname, ".gz"))
> >         compression_spec.algorithm = PG_COMPRESSION_GZIP;
> >     else
> >     {
> >         bool        exists;
> > 
> >         exists = (stat(path, &st) == 0);
> >         /* avoid unused warning if it is not built with compression */
> >         if (exists)
> >             compression_spec.algorithm = PG_COMPRESSION_NONE;
> > -#ifdef HAVE_LIBZ
> > -       if (!exists)
> > -       {
> > -           free_keep_errno(fname);
> > -           fname = psprintf("%s.gz", path);
> > -           exists = (stat(fname, &st) == 0);
> > -
> > -           if (exists)
> > -               compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > -       }
> > -#endif
> > -#ifdef USE_LZ4
> > -       if (!exists)
> > -       {
> > -           free_keep_errno(fname);
> > -           fname = psprintf("%s.lz4", path);
> > -           exists = (stat(fname, &st) == 0);
> > -
> > -           if (exists)
> > -               compression_spec.algorithm = PG_COMPRESSION_LZ4;
> > -       }
> > -#endif
> > +       else if (check_compressed_file(path, &fname, "gz"))
> > +           compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > +       else if (check_compressed_file(path, &fname, "lz4"))
> > +           compression_spec.algorithm = PG_COMPRESSION_LZ4;
> > +       else if (check_compressed_file(path, &fname, "zst"))
> > +           compression_spec.algorithm = PG_COMPRESSION_ZSTD;
> >     }
> 
> This function lost some coherence, I think. Should there be a hasSuffix
> check at the top for ".zstd" (and, for that matter, ".lz4")?

The function is first checking if it was passed a filename which already
has a suffix.  And if not, it searches through a list of suffixes,
testing for an existing file with each suffix.  The search with stat()
doesn't happen if it has a suffix.  I'm having trouble seeing how the
hasSuffix() branch isn't dead code.  Another opened question.

> I'm a little suspicious of the replacement of supports_compression()
> with parse_compress_specification(). For example:
> 
> > -   errmsg = supports_compression(AH->compression_spec);
> > -   if (errmsg)
> > +   parse_compress_specification(AH->compression_spec.algorithm,
> > +           NULL, &compress_spec);
> > +   if (compress_spec.parse_error != NULL)
> >     {
> >         pg_log_warning("archive is compressed, but this installation does not support compression (%s
> > -                       errmsg);
> > -       pg_free(errmsg);
> > +                       compress_spec.parse_error);
> > +       pg_free(compress_spec.parse_error);
> >     }
> 
> The top-level error here is "does not support compression", but wouldn't
> a bad specification option with a supported compression method trip this
> path too?

No - since the 2nd argument is passed as NULL, it just checks whether
the compression is supported.  Maybe there ought to be a more
direct/clean way to do it.  But up to now evidently nobody needed to do
that.

> > +static void
> > +ZSTD_CCtx_setParam_or_die(ZSTD_CStream *cstream,
> > +                         ZSTD_cParameter param, int value, char *paramname)
> 
> IMO we should avoid stepping on the ZSTD_ namespace with our own
> internal function names.

done

> > +   if (cs->readF != NULL)
> > +
> > +   if (cs->writeF != NULL)
> 
> This seems to suggest that both cs->readF and cs->writeF could be set,
> but in that case, the output buffer gets reallocated.

I put back an assertion that exactly one of them was set, since that's
true of how it currently works.

> I was curious about the extra byte allocated in the decompression case.
> I see that ReadDataFromArchiveZstd() is null-terminating the buffer
> before handing it to ahwrite(), but why does it need to do that?

I was trying to figure that out, too.  I think the unterminated case
might be for ExecuteSqlCommandBuf(), and that may only (have) been
needed to allow pg_restore to handle ancient/development versions of
pg_dump...  It's not currently hit.
https://coverage.postgresql.org/src/bin/pg_dump/pg_backup_db.c.gcov.html#470

I found that the terminator was added for the uncompressed case was
added at e8f69be05 and removed in bf9aa490d.

> > +Zstd_get_error(CompressFileHandle *CFH)
> 
> Seems like this should be using the zstderror stored in the handle?

Yes - I'd already addressed that locally.

> In ReadDataFromArchiveZstd():
> 
> > +        * Read compressed data.  Note that readF can resize the buffer; the
> > +        * new size is tracked and used for future loops.
> This is pretty complex for what it's doing. I'm a little worried that we
> let the reallocated buffer escape to the caller while losing track of
> how big it is. I think that works today, since there's only ever one
> call per handle, but any future refactoring that allowed cs->readData()
> to be called more than once would subtly break this code.

Note that nothing bad happens if we lose track of how big it is (well,
assuming that readF doesn't *shrink* the buffer).

The previous patch version didn't keep track of its new size, and the only
consequence is that readF() might re-resize it again on a future iteration,
even if it was already sufficiently large.

When I originally wrote it (and up until that patch version), I left
this as an XXX comment about reusing the resized buffer.  But it seemed
easy enough to fix so I did.

> In ZstdWriteCommon():
> 
> Elsewhere, output->pos is set to zero before compressing, but here we do
> it after, which I think leads to subtle differences in the function
> preconditions. If that's an intentional difference, can the reason be
> called out in a comment?

It's not deliberate.  I think it had no effect, but changed - thanks.

-- 
Justin

Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Simplify some codes in pgoutput
Next
From: Bharath Rupireddy
Date:
Subject: Re: Add macros for ReorderBufferTXN toptxn