Re: zstd compression for pg_dump - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: zstd compression for pg_dump
Msg-id 20230304165747.GH12850@telsasoft.com
In response to Re: zstd compression for pg_dump  (Jacob Champion <jchampion@timescale.com>)
Responses Re: zstd compression for pg_dump  (Jacob Champion <jchampion@timescale.com>)
List pgsql-hackers
On Fri, Mar 03, 2023 at 01:38:05PM -0800, Jacob Champion wrote:
> > > With this particular dataset, I don't see much improvement with
> > > zstd:long.
> >
> > Yeah.  I think this could be because either 1) you already got very good
> > compression without looking at more data; and/or 2) the neighboring data
> > is already very similar, maybe equally or more similar, than the further
> > data, from which there's nothing to gain.
> 
> What kinds of improvements do you see with your setup? I'm wondering
> when we would suggest that people use it.

On customer data, I see small improvements - below 10%.

But on my first two tries, I made synthetic data sets where the improvement is large:

$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long |wc -c
286107
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long=0 |wc -c
1709695

That's just 6 identical tables like:
pryzbyj=# CREATE TABLE t1 AS SELECT generate_series(1,999999);

In this case, "custom" format doesn't see that benefit, because the
greatest similarity is across tables, which don't share compressor
state.  But I think the note that I wrote in the docs about that should
be removed - custom format could see a big benefit, as long as the table
is big enough, and there's more similarity/repetition at longer
distances.
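
The shared-state point can be sketched outside pg_dump.  This is only an
illustration under stated assumptions, not pg_dump's code: Python's stdlib
lzma stands in for zstd, and make_table is a hypothetical helper that fakes
one table's data.  It compresses six identical "tables" once as a single
stream (roughly the plain-format case) and once as six independent streams
(roughly the custom-format case).

```python
import lzma
import random


def make_table(seed: int, n_rows: int = 20000) -> bytes:
    """Hypothetical stand-in for one table's data: one pseudo-random integer per line."""
    rng = random.Random(seed)
    return "".join(f"{rng.getrandbits(32)}\n" for _ in range(n_rows)).encode()


# Six identical "tables" (same seed, so same contents).
tables = [make_table(seed=0) for _ in range(6)]

# One compressor stream over all tables: later tables can match earlier ones.
shared = len(lzma.compress(b"".join(tables)))

# One independent stream per table: no state is carried between tables.
separate = sum(len(lzma.compress(t)) for t in tables)

print("shared stream:   ", shared)
print("separate streams:", separate)
```

With separate streams, each copy of the table has to be paid for in full,
so the total should come out several times larger than the shared stream.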

Here's one where custom format *does* benefit, due to long-distance
repetition within a single table.  The data is contrived, but the schema
of ID => data is not.  What's notable isn't how compressible the data
is, but how much *more* compressible it is with long-distance matching.

pryzbyj=# CREATE TABLE t1 AS SELECT i,array_agg(j) FROM generate_series(1,444)i,generate_series(1,99999)j GROUP BY 1;
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=1 |wc -c
82023
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=0 |wc -c
1048267
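
The same effect can be sketched without zstd.  The following is an analogy
only, under stated assumptions: it uses Python's stdlib lzma and varies the
LZMA2 dictionary size as a stand-in for the match window, rather than using
zstd's long mode, and compress_with_window is a hypothetical helper.  The
data mimics the ID => data shape above: every row carries the same long
array, and one row is longer than the small window.

```python
import lzma


def compress_with_window(data: bytes, dict_size: int) -> int:
    """Return the compressed size using LZMA2 with the given dictionary size."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


# Each row is "<id>\t{1,2,...,20000}\n"; the array text is roughly 109 kB,
# so the previous row's copy lies beyond a 64 KiB window.
payload = ",".join(str(j) for j in range(1, 20001))
rows = "".join(f"{i}\t{{{payload}}}\n" for i in range(1, 21)).encode()

short_window = compress_with_window(rows, 64 * 1024)         # cannot reach the prior row
long_window = compress_with_window(rows, 16 * 1024 * 1024)   # spans the whole input

print("short window:", short_window)
print("long window: ", long_window)
```

The data compresses reasonably well even with the short window, because
consecutive numbers share digits, but the long window can also match each
row against the previous one, which is where the larger gain comes from.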

-- 
Justin


