Re: zstd compression for pg_dump - Mailing list pgsql-hackers

From Jacob Champion
Subject Re: zstd compression for pg_dump
Date
Msg-id CAAWbhmgEpYPn2sjxr0Kar_XKS4dzyEnussd0emwe2YHxb_tk6g@mail.gmail.com
In response to Re: zstd compression for pg_dump  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: zstd compression for pg_dump  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
On Sat, Feb 25, 2023 at 5:22 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> This resolves cfbot warnings: windows and cppcheck.
> And refactors zstd routines.
> And updates docs.
> And includes some fixes for earlier patches that these patches conflicts
> with/depends on.

This'll need a rebase (cfbot took a while to catch up). The patchset
includes basebackup modifications, which are part of a different CF
entry; was that intended?

I tried this on a local, 3.5GB, mostly-text table (from the UK Price
Paid dataset [1]) and the comparison against the other methods was
impressive. (I'm no good at constructing compression benchmarks, so
this is a super naive setup. Client's on the same laptop as the
server.)

    $ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z zstd > /tmp/zstd.dump
    real    1m17.632s
    user    0m35.521s
    sys    0m2.683s

    $ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z lz4 > /tmp/lz4.dump
    real    1m13.125s
    user    0m19.795s
    sys    0m3.370s

    $ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z gzip > /tmp/gzip.dump
    real    2m24.523s
    user    2m22.114s
    sys    0m1.848s

    $ ls -l /tmp/*.dump
    -rw-rw-r-- 1 jacob jacob 1331493925 Mar  3 09:45 /tmp/gzip.dump
    -rw-rw-r-- 1 jacob jacob 2125998939 Mar  3 09:42 /tmp/lz4.dump
    -rw-rw-r-- 1 jacob jacob 1215834718 Mar  3 09:40 /tmp/zstd.dump

Default gzip was the only method that bottlenecked on pg_dump rather
than the server, and default zstd outcompressed it at a fraction of
the CPU time. So, naively, this looks really good.

With this particular dataset, I don't see much improvement with
zstd:long. (At nearly double the CPU time, I get a <1% reduction in
compressed size.) I assume it's heavily data-dependent, but from the
notes on --long [2] it seems like they expect you to play around with
the window size to further tailor it to your data. Does it make sense
to provide the long option without the windowLog parameter?
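As a toy illustration of why the window size matters (not zstd itself — this uses zlib from the Python standard library, whose wbits parameter plays a role analogous to zstd's windowLog, just with a far smaller maximum; the block sizes and level here are made up): a repeated block can only be back-referenced if the match window reaches all the way to its first occurrence.

```python
import os
import zlib

# A 16 KiB pseudo-random block, repeated immediately: the second copy
# starts 16 KiB after the first, so the history window must span at
# least 16 KiB for the compressor to exploit the repetition.
block = os.urandom(16 * 1024)
data = block + block

def deflate_size(payload: bytes, wbits: int) -> int:
    """Compressed size of `payload` with a 2**wbits-byte history window."""
    c = zlib.compressobj(level=6, wbits=wbits)
    return len(c.compress(payload) + c.flush())

small_window = deflate_size(data, 12)  # 4 KiB window: repeat out of reach
large_window = deflate_size(data, 15)  # 32 KiB window: repeat is matched

# The larger window compresses the same input far better; the smaller
# one stores both copies nearly verbatim.
print(small_window, large_window)
```

The same logic suggests that a fixed long-mode window will help some datasets and do nothing for others, which is presumably why the zstd docs encourage tuning windowLog per dataset.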

Thanks,
--Jacob

[1] https://landregistry.data.gov.uk/
[2] https://github.com/facebook/zstd/releases/tag/v1.3.2


