Re: zstd compression for pg_dump - Mailing list pgsql-hackers

From: Jacob Champion
Subject: Re: zstd compression for pg_dump
Msg-id: 3d04afca-7d9d-c90f-5fef-5cf4fdb2173f@timescale.com
In response to: Re: zstd compression for pg_dump (Jacob Champion <jchampion@timescale.com>)
Responses: Re: zstd compression for pg_dump
List: pgsql-hackers
Hi,

This'll need another rebase over the meson ICU changes.

On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> I did some smoke testing against zstd's GitHub release on Windows. To
> build against it, I had to construct an import library, and put that
> and the DLL into the `lib` folder expected by the MSVC scripts...
> which makes me wonder if I've chosen a harder way than necessary?

A meson wrap made this much easier! It looks like pg_dump's meson.build
is missing dependencies on zstd (meson couldn't find the headers in the
subproject without them).

> Parallel zstd dumps seem to work as expected, in that the resulting
> pg_restore output is identical to uncompressed dumps and nothing
> explodes. I haven't inspected the threading implementation for safety
> yet, as you mentioned.

Hm. Best I can tell, the CloneArchive() machinery is supposed to be
handling safety for this, by isolating each thread's state. I don't feel
comfortable pronouncing this new addition safe or not, because I'm not
sure I understand what the comments in the format-specific _Clone()
callbacks are saying yet.

> And I still wasn't able to test :workers, since
> it looks like the official libzstd for Windows isn't built for
> multithreading. That'll be another day's project.

The wrapped installation enabled threading too, so I was able to try
with :workers=8. Everything seems to work, but I didn't have a dataset
that showed speed improvements at the time. It did seem to affect the
overall compressibility negatively -- which makes sense, I think,
assuming each thread is looking at a separate and/or smaller window.
On to code (not a complete review):

>     if (hasSuffix(fname, ".gz"))
>         compression_spec.algorithm = PG_COMPRESSION_GZIP;
>     else
>     {
>         bool        exists;
>
>         exists = (stat(path, &st) == 0);
>         /* avoid unused warning if it is not built with compression */
>         if (exists)
>             compression_spec.algorithm = PG_COMPRESSION_NONE;
> -#ifdef HAVE_LIBZ
> -        if (!exists)
> -        {
> -            free_keep_errno(fname);
> -            fname = psprintf("%s.gz", path);
> -            exists = (stat(fname, &st) == 0);
> -
> -            if (exists)
> -                compression_spec.algorithm = PG_COMPRESSION_GZIP;
> -        }
> -#endif
> -#ifdef USE_LZ4
> -        if (!exists)
> -        {
> -            free_keep_errno(fname);
> -            fname = psprintf("%s.lz4", path);
> -            exists = (stat(fname, &st) == 0);
> -
> -            if (exists)
> -                compression_spec.algorithm = PG_COMPRESSION_LZ4;
> -        }
> -#endif
> +        else if (check_compressed_file(path, &fname, "gz"))
> +            compression_spec.algorithm = PG_COMPRESSION_GZIP;
> +        else if (check_compressed_file(path, &fname, "lz4"))
> +            compression_spec.algorithm = PG_COMPRESSION_LZ4;
> +        else if (check_compressed_file(path, &fname, "zst"))
> +            compression_spec.algorithm = PG_COMPRESSION_ZSTD;
>     }

This function lost some coherence, I think. Should there be a hasSuffix
check at the top for ".zstd" (and, for that matter, ".lz4")? And the
comment references an unused warning, which is only possible with the
#ifdef blocks that were removed.

I'm a little suspicious of the replacement of supports_compression()
with parse_compress_specification().
For example:

> -        errmsg = supports_compression(AH->compression_spec);
> -        if (errmsg)
> +        parse_compress_specification(AH->compression_spec.algorithm,
> +                                     NULL, &compress_spec);
> +        if (compress_spec.parse_error != NULL)
>         {
>             pg_log_warning("archive is compressed, but this installation does not support compression (%s)",
> -                          errmsg);
> -            pg_free(errmsg);
> +                          compress_spec.parse_error);
> +            pg_free(compress_spec.parse_error);
>         }

The top-level error here is "does not support compression", but wouldn't
a bad specification option with a supported compression method trip this
path too?

> +static void
> +ZSTD_CCtx_setParam_or_die(ZSTD_CStream *cstream,
> +                          ZSTD_cParameter param, int value, char *paramname)

IMO we should avoid stepping on the ZSTD_ namespace with our own
internal function names.

> +    if (cs->readF != NULL)
> +    {
> +        zstdcs->dstream = ZSTD_createDStream();
> +        if (zstdcs->dstream == NULL)
> +            pg_fatal("could not initialize compression library");
> +
> +        zstdcs->input.size = ZSTD_DStreamInSize();
> +        zstdcs->input.src = pg_malloc(zstdcs->input.size);
> +
> +        zstdcs->output.size = ZSTD_DStreamOutSize();
> +        zstdcs->output.dst = pg_malloc(zstdcs->output.size + 1);
> +    }
> +
> +    if (cs->writeF != NULL)
> +    {
> +        zstdcs->cstream = ZstdCStreamParams(cs->compression_spec);
> +
> +        zstdcs->output.size = ZSTD_CStreamOutSize();
> +        zstdcs->output.dst = pg_malloc(zstdcs->output.size);
> +        zstdcs->output.pos = 0;
> +    }

This seems to suggest that both cs->readF and cs->writeF could be set,
but in that case, the output buffer gets reallocated.

I was curious about the extra byte allocated in the decompression case.
I see that ReadDataFromArchiveZstd() is null-terminating the buffer
before handing it to ahwrite(), but why does it need to do that?

> +static const char *
> +Zstd_get_error(CompressFileHandle *CFH)
> +{
> +    return strerror(errno);
> +}

Seems like this should be using the zstderror stored in the handle?
In ReadDataFromArchiveZstd():

> +    size_t      input_allocated_size = ZSTD_DStreamInSize();
> +    size_t      res;
> +
> +    for (;;)
> +    {
> +        size_t      cnt;
> +
> +        /*
> +         * Read compressed data. Note that readF can resize the buffer; the
> +         * new size is tracked and used for future loops.
> +         */
> +        input->size = input_allocated_size;
> +        cnt = cs->readF(AH, (char **) unconstify(void **, &input->src), &input->size);
> +        input_allocated_size = input->size;
> +        input->size = cnt;

This is pretty complex for what it's doing. I'm a little worried that we
let the reallocated buffer escape to the caller while losing track of
how big it is. I think that works today, since there's only ever one
call per handle, but any future refactoring that allowed cs->readData()
to be called more than once would subtly break this code.

In ZstdWriteCommon():

> +    /*
> +     * Extra paranoia: avoid zero-length chunks, since a zero length chunk
> +     * is the EOF marker in the custom format. This should never happen
> +     * but...
> +     */
> +    if (output->pos > 0)
> +        cs->writeF(AH, output->dst, output->pos);
> +
> +    output->pos = 0;

Elsewhere, output->pos is set to zero before compressing, but here we do
it after, which I think leads to subtle differences in the function
preconditions. If that's an intentional difference, can the reason be
called out in a comment?

--Jacob