
From Andres Freund
Subject Re: design for parallel backup
Date
Msg-id 20200422180641.s6w65lfacprfsvcf@alap3.anarazel.de
In response to Re: design for parallel backup  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: design for parallel backup  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers

Hi,

On 2020-04-22 12:12:32 -0400, Robert Haas wrote:
> On Wed, Apr 22, 2020 at 11:24 AM Andres Freund <andres@anarazel.de> wrote:
> > *My* gut feeling is that you're going to have a harder time using CPU
> > time efficiently when doing parallel compression via multiple processes
> > and independent connections. You're e.g. going to have a lot more
> > context switches, I think. And there will be network overhead from doing
> > more connections (including worse congestion control).
> 
> OK, noted. I'm still doubtful that the optimal number of connections
> is 1, but it might be that the optimal number of CPU cores to apply to
> compression is much higher than the optimal number of connections.

Yea, that's basically what I think too.


> For instance, suppose there are two equally sized tablespaces on
> separate drives, but zstd with 10-way parallelism is our chosen
> compression strategy. It seems to me that two connections has an
> excellent chance of being faster than one, because with only one
> connection I don't see how you can benefit from the opportunity to do
> I/O in parallel.

Yea. That's exactly the case for "connection level" parallelism I had
upthread as well. It'd require being somewhat careful about different
tablespaces in the selection for each connection, but that's not that
hard.

I can also see a case for using N backends and one connection, but I
think that'll be too complicated / too much bound by locking around the
socket etc.


> 
> > this results in a 16GB base backup. I think this is probably a good bit
> > less compressible than most PG databases.
> >
> > method  level   parallelism     wall-time       cpu-user-time   cpu-kernel-time size            rate    format
> > gzip    1       1               305.37          299.72          5.52            7067232465      2.28
> > lz4     1       1               33.26           27.26           5.99            8961063439      1.80     .lz4
> > lz4     3       1               188.50          182.91          5.58            8204501460      1.97     .lz4
> > zstd    1       1               66.41           58.38           6.04            6925634128      2.33     .zstd
> > zstd    1       10              9.64            67.04           4.82            6980075316      2.31     .zstd
> > zstd    3       1               122.04          115.79          6.24            6440274143      2.50     .zstd
> > zstd    3       10              13.65           106.11          5.64            6438439095      2.51     .zstd
> > zstd    9       10              100.06          955.63          6.79            5963827497      2.71     .zstd
> > zstd    15      10              259.84          2491.39         8.88            5912617243      2.73     .zstd
> > pixz    1       10              162.59          1626.61         15.52           5350138420      3.02     .xz
> > plzip   1       20              135.54          2705.28         9.25            5270033640      3.06     .lz
> 
> So, picking a better compressor in this case looks a lot less
> exciting.

Oh? I find it *extremely* exciting here. This is pretty close to the
worst case compressibility-wise, and zstd takes only ~22% of the time
gzip does, while still delivering better compression.  A nearly 5x
improvement in compression time seems pretty exciting to me.

Or do you mean for zstd over lz4, rather than anything over gzip?  1.8x
-> 2.3x is a pretty decent improvement still, no? And being able to do
it in 1/3 of the wall time seems pretty helpful.

> Parallel zstd still compresses somewhat better than single-core lz4,
> but the difference in compression ratio is far less, and the amount of
> CPU you have to burn in order to get that extra compression is pretty
> large.

It's "just" a ~2x difference for "level 1" compression, right? For
having 1.9GiB less to write / permanently store of a 16GiB base
backup that doesn't seem that bad to me.
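
Just for illustration: the parallelism in the zstd rows above is zstd's
own worker-thread support, which with libzstd is little more than a
context parameter. A rough sketch - error checks elided, and obviously
not actual pg_basebackup code:

/*
 * Stream stdin to stdout with multi-threaded zstd. Link with -lzstd;
 * needs a libzstd built with multithreading support.
 */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int
main(void)
{
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();
    size_t      in_cap = ZSTD_CStreamInSize();
    size_t      out_cap = ZSTD_CStreamOutSize();
    void       *in_buf = malloc(in_cap);
    void       *out_buf = malloc(out_cap);

    /* level 1 with 10 worker threads, as in the "zstd 1 10" row */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 1);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 10);

    for (;;)
    {
        size_t      nread = fread(in_buf, 1, in_cap, stdin);
        ZSTD_EndDirective mode = (nread < in_cap) ? ZSTD_e_end : ZSTD_e_continue;
        ZSTD_inBuffer input = {in_buf, nread, 0};
        int         finished;

        do
        {
            ZSTD_outBuffer output = {out_buf, out_cap, 0};
            size_t      remaining = ZSTD_compressStream2(cctx, &output,
                                                         &input, mode);

            fwrite(out_buf, 1, output.pos, stdout);
            finished = (mode == ZSTD_e_end) ?
                (remaining == 0) : (input.pos == input.size);
        } while (!finished);

        if (mode == ZSTD_e_end)
            break;
    }

    ZSTD_freeCCtx(cctx);
    free(in_buf);
    free(out_buf);
    return 0;
}

The nice property is that the output is still one ordinary zstd frame,
so any zstd decoder can read it regardless of how many workers produced
it.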


> > I don't really see a problem with emitting .zip files. It's an extremely
> > widely used container format for all sorts of file formats these days.
> > Except for needing a bit more complicated (and I don't think it's *that*
> > big of a difference) code during generation / unpacking, it seems
> > clearly advantageous over .tar.gz etc.
> 
> Wouldn't that imply buying into DEFLATE as our preferred compression algorithm?

zip doesn't have to imply DEFLATE, although that is the most common
option. There's a compression method associated with each file.
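
For reference, this is roughly what each member's local file header
looks like (per PKWARE's APPNOTE; just a sketch of the on-disk layout,
little-endian, no padding):

#include <stdint.h>

#pragma pack(push, 1)
typedef struct ZipLocalFileHeader
{
    uint32_t    signature;          /* 0x04034b50, i.e. "PK\3\4" */
    uint16_t    version_needed;
    uint16_t    flags;
    uint16_t    compression_method; /* per member: 0 = stored, 8 = DEFLATE;
                                     * further IDs are assigned for other
                                     * algorithms */
    uint16_t    mod_time;
    uint16_t    mod_date;
    uint32_t    crc32;
    uint32_t    compressed_size;
    uint32_t    uncompressed_size;
    uint16_t    filename_len;       /* filename follows the fixed header */
    uint16_t    extra_len;          /* then the extra field, then the data */
} ZipLocalFileHeader;
#pragma pack(pop)

The same method field is repeated in the central directory entry, so a
reader can pick the per-file decompression method without scanning the
whole archive.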


> Either way, I don't really like the idea of having PostgreSQL have its
> own code to generate and interpret various archive formats. That seems
> like a maintenance nightmare and a recipe for bugs. How can anyone
> even verify that our existing 'tar' code works with all 'tar'
> implementations out there, or that it's correct in all cases? Do we
> really want to maintain similar code for other formats, or even for
> this one? I'd say "no". We should pick archive formats that have good,
> well-maintained libraries with permissive licenses and then use those.
> I don't know whether "zip" falls into that category or not.

I agree we should pick one. I think tar is not a great choice. .zip
seems like it'd be a significant improvement - but not necessarily
optimal.


> > > Other options include, perhaps, (1) emitting a tarfile of compressed
> > > files instead of a compressed tarfile
> >
> > Yea, that'd help some. Although I am not sure how good the tooling to
> > seek through tarfiles in an O(files) rather than O(bytes) manner is.
> 
> Well, considering that at present we're using hand-rolled code...

Good point.

Also, it looks like at least GNU tar supports seeking (when not reading
from a pipe etc.).
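
Skipping through a tar in O(files) is really just: read each 512-byte
header, parse the octal size field, and lseek() over the data rounded up
to the next block. Quick sketch, ignoring GNU longname/sparse extensions
and with minimal error handling:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define TAR_BLOCK 512

int
main(int argc, char **argv)
{
    char        hdr[TAR_BLOCK];
    int         fd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        return 1;

    while (read(fd, hdr, TAR_BLOCK) == TAR_BLOCK)
    {
        unsigned long size;
        off_t       datalen;

        if (hdr[0] == '\0')     /* two zero blocks terminate the archive */
            break;

        /* size: NUL/space-terminated octal string at offset 124 */
        size = strtoul(hdr + 124, NULL, 8);
        datalen = ((off_t) size + TAR_BLOCK - 1) & ~((off_t) (TAR_BLOCK - 1));

        printf("%.100s  (%lu bytes)\n", hdr, size); /* name is at offset 0 */

        /* skip the member's data without reading it */
        lseek(fd, datalen, SEEK_CUR);
    }

    close(fd);
    return 0;
}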


> > I think there some cases where using separate compression state for each
> > file would hurt us. Some of the archive formats have support for reusing
> > compression state, but I don't know which.
> 
> Yeah, I had the same thought. People with mostly 1GB relation segments
> might not notice much difference, but people with lots of little
> relations might see a more significant difference.

Yea. I suspect it's close to immeasurable for large relations.  Reusing
the dictionary might help, although it would likely imply some overhead.
OTOH, the overhead for small relations will usually be in the number of
files, rather than their actual size.


FWIW, not that it's really relevant to this discussion, but I played
around with using trained compression dictionaries for postgres
contents. That can improve e.g. lz4's compression ratio a fair bit, in
particular when compressing small amounts of data, e.g. for per-block
compression or such.
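
In case somebody wants to play with that: with liblz4 it's just a matter
of loading the dictionary into the streaming state before compressing
each block. The dictionary itself would be trained offline over sample
blocks (e.g. with zstd's ZDICT_trainFromBuffer()). A rough sketch, with
made-up function and parameter names:

#include <lz4.h>

/*
 * Compress one block using a pre-built shared dictionary. 'dict' has to
 * stay in memory while compressing; decompression would use
 * LZ4_decompress_safe_usingDict() with the same dictionary.
 */
int
compress_block_with_dict(const char *block, int block_size,
                         const char *dict, int dict_size,
                         char *dst, int dst_capacity)
{
    LZ4_stream_t stream;
    int         compressed;

    LZ4_initStream(&stream, sizeof(stream));

    /* prime the match finder with the dictionary contents */
    LZ4_loadDict(&stream, dict, dict_size);

    compressed = LZ4_compress_fast_continue(&stream, block, dst,
                                            block_size, dst_capacity, 1);

    return compressed;      /* 0 on error, compressed size otherwise */
}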


> FWIW, I don't see it as being entirely necessary to create a seekable
> compressed archive format, let alone to make all of our compressed
> archive formats seekable. I think supporting multiple compression
> algorithms in a flexible way that's not too tied to the capabilities
> of particular algorithms is more important. If you want fast restores
> of incremental and differential backups, consider using -Fp rather
> than -Ft.

Given how compressible many real-world databases are (maybe not quite
the 50x as in the pgbench -i case, but still extremely so), I don't
quite find -Fp a convincing alternative.


> Or we can have a new option that's like -Fp but every file
> is compressed individually in place, or files larger than N bytes are
> compressed in place using a configurable algorithm. It might be
> somewhat less efficient but it's also way less complicated to
> implement, and I think that should count for something.

Yea, I think that'd be a decent workaround.
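
I.e. roughly: walk the data directory and, for anything over some
threshold, write file.zst instead of file. Sketch of just the per-file
part - the threshold, suffix and (lack of) error handling here are
placeholders, not a proposal for the option's actual behaviour:

#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

/* caller walks the directory, calling this for files >= some threshold */
int
compress_file_in_place(const char *path, long size, int level)
{
    char        dstpath[1024];
    void       *src = malloc(size);
    size_t      dstcap = ZSTD_compressBound(size);
    void       *dst = malloc(dstcap);
    FILE       *in = fopen(path, "rb");
    FILE       *out;
    size_t      clen;

    if (in == NULL || fread(src, 1, size, in) != (size_t) size)
        return -1;
    fclose(in);

    clen = ZSTD_compress(dst, dstcap, src, size, level);
    if (ZSTD_isError(clen))
        return -1;

    snprintf(dstpath, sizeof(dstpath), "%s.zst", path);
    out = fopen(dstpath, "wb");
    fwrite(dst, 1, clen, out);
    fclose(out);

    remove(path);           /* replace the original with the .zst file */
    free(src);
    free(dst);
    return 0;
}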


> I don't want to get so caught up in advanced features here that we
> don't make any useful progress at all. If we can add better features
> without a large complexity increment, and without drawing objections
> from others on this list, great. If not, I'm prepared to summarily
> jettison it as nice-to-have but not essential.

Just to be clear: I am not at all advocating tying a change of the
archive format to compression method / parallelism changes or anything.


> > I don't really see any of the concerns there to apply for the base
> > backup case.
> 
> I felt like there was some reason that threads were bad, but it may
> have just been the case you mentioned and not relevant here.

I mean, they do have some serious issues when postgres infrastructure
is needed - not being threadsafe and all. One needs to be careful not to
let "threads escape", not to fork(), etc. That doesn't seem like a
problem here though.

Greetings,

Andres Freund


