Re: design for parallel backup - Mailing list pgsql-hackers

From Andres Freund
Subject Re: design for parallel backup
Date
Msg-id 20200422152402.ck7ziyzgfopgz7bd@alap3.anarazel.de
In response to Re: design for parallel backup  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: design for parallel backup  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

On 2020-04-22 09:52:53 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 6:57 PM Andres Freund <andres@anarazel.de> wrote:
> > I agree that trying to make backups very fast is a good goal (or well, I
> > think not very slow would be a good descriptor for the current
> > situation). I am just trying to make sure we tackle the right problems
> > for that. My gut feeling is that we have to tackle compression first,
> > because without addressing that "all hope is lost" ;)
> 
> OK. I have no objection to the idea of starting with (1) server side
> compression and (2) a better compression algorithm. However, I'm not
> very sold on the idea of relying on parallelism that is specific to
> compression. I think that parallelism across the whole operation -
> multiple connections, multiple processes, etc. - may be a more
> promising approach than trying to parallelize specific stages of the
> process. I am not sure about that; it could be wrong, and I'm open to
> the possibility that it is, in fact, wrong.

*My* gut feeling is that you're going to have a harder time using CPU
time efficiently when doing parallel compression via multiple processes
and independent connections. You're e.g. going to have a lot more
context switches, I think. And there will be network overhead from doing
more connections (including worse congestion control).
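
To make the contrast concrete, below is a minimal sketch of what in-library parallel compression looks like, using libzstd's nbWorkers parameter and modelled on zstd's own streaming example (assumes libzstd >= 1.4 built with multithreading; obviously not patch code). The worker threads are created and joined inside the library, so the caller stays a single process writing a single stream over a single connection:

/*
 * Minimal sketch: in-library parallel compression via libzstd's nbWorkers,
 * compressing stdin to stdout.  Assumes libzstd >= 1.4 built with
 * multithreading; illustration only, not patch code.
 */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int
main(void)
{
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();
    size_t      inCap = ZSTD_CStreamInSize();
    size_t      outCap = ZSTD_CStreamOutSize();
    char       *inBuf = malloc(inCap);
    char       *outBuf = malloc(outCap);

    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 1);
    /* the worker threads live entirely inside libzstd */
    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 10)))
        fprintf(stderr, "libzstd built without multithreading, using 1 thread\n");

    for (;;)
    {
        size_t      nread = fread(inBuf, 1, inCap, stdin);
        ZSTD_EndDirective mode = (nread < inCap) ? ZSTD_e_end : ZSTD_e_continue;
        ZSTD_inBuffer input = {inBuf, nread, 0};
        int         finished;

        do
        {
            ZSTD_outBuffer output = {outBuf, outCap, 0};
            size_t      rc = ZSTD_compressStream2(cctx, &output, &input, mode);

            if (ZSTD_isError(rc))
            {
                fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(rc));
                return 1;
            }
            fwrite(outBuf, 1, output.pos, stdout);
            /* done when all input is consumed, or the frame is fully flushed */
            finished = (mode == ZSTD_e_end) ? (rc == 0) : (input.pos == input.size);
        } while (!finished);

        if (mode == ZSTD_e_end)
            break;
    }

    ZSTD_freeCCtx(cctx);
    free(inBuf);
    free(outBuf);
    return 0;
}

I.e. the compressed data still comes out as one stream on one connection, so nothing about the protocol side has to change to get the CPU parallelism.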


> Leaving out all the three and four digit wall times from your table:
> 
> > method  level   parallelism     wall-time       cpu-user-time   cpu-kernel-time size            rate    format
> > pigz    1       10              34.35           364.14          23.55           3892401867      16.6    .gz
> > zstd    1       1               82.95           67.97           11.82           2853193736      22.6    .zstd
> > zstd    1       10              25.05           151.84          13.35           2847414913      22.7    .zstd
> > zstd    6       10              43.47           374.30          12.37           2745211100      23.5    .zstd
> > zstd    6       20              32.50           468.18          13.44           2745211100      23.5    .zstd
> > zstd    9       20              57.99           949.91          14.13           2606535138      24.8    .zstd
> > lz4     1       1               49.94           36.60           13.33           7318668265      8.8     .lz4
> > pixz    1       10              92.54           925.52          37.00           1199499772      53.8    .xz
> 
> It's notable that almost all of the fast wall times here are with
> zstd; the surviving entries with pigz and pixz are with ten-way
> parallelism, and both pigz and lz4 have worse compression ratios than
> zstd. My impression, though, is that LZ4 might be getting a bit of a
> raw deal here because of the repetitive nature of the data. I theorize
> based on some reading I did yesterday, and general hand-waving, that
> maybe the compression ratios would be closer together on a more
> realistic data set.

I agree that most datasets won't get even close to what we've seen
here. And that disadvantages e.g. lz4.

To come up with a much less compressible case, I generated data the
following way:

CREATE TABLE random_data(id serial NOT NULL, r1 float not null, r2 float not null, r3 float not null);
ALTER TABLE random_data SET (FILLFACTOR = 100);
ALTER SEQUENCE random_data_id_seq CACHE 1024;
-- with pgbench, I ran this in parallel for 100s
INSERT INTO random_data(r1,r2,r3) SELECT random(), random(), random() FROM generate_series(1, 100000);
-- then created indexes, using a high fillfactor to ensure few zeroed out parts
ALTER TABLE random_data ADD CONSTRAINT random_data_id_pkey PRIMARY KEY(id) WITH (FILLFACTOR = 100);
CREATE INDEX random_data_r1 ON random_data(r1) WITH (fillfactor = 100);

This results in a 16GB base backup. I think this is probably a good bit
less compressible than most PG databases.


method  level   parallelism     wall-time       cpu-user-time   cpu-kernel-time size            rate    format
gzip    1       1               305.37          299.72          5.52            7067232465      2.28     .gz
lz4     1       1               33.26           27.26           5.99            8961063439      1.80     .lz4
lz4     3       1               188.50          182.91          5.58            8204501460      1.97     .lz4
zstd    1       1               66.41           58.38           6.04            6925634128      2.33     .zstd
zstd    1       10              9.64            67.04           4.82            6980075316      2.31     .zstd
zstd    3       1               122.04          115.79          6.24            6440274143      2.50     .zstd
zstd    3       10              13.65           106.11          5.64            6438439095      2.51     .zstd
zstd    9       10              100.06          955.63          6.79            5963827497      2.71     .zstd
zstd    15      10              259.84          2491.39         8.88            5912617243      2.73     .zstd
pixz    1       10              162.59          1626.61         15.52           5350138420      3.02     .xz
plzip   1       20              135.54          2705.28         9.25            5270033640      3.06     .lz


> It's also notable that lz4 -1 is BY FAR the winner in terms of
> absolute CPU consumption. So I kinda wonder whether supporting both
> LZ4 and ZSTD might be the way to go, especially since once we have the
> LZ4 code we might be able to use it for other things, too.

Yea. I think the case for lz4 is far stronger in other
places. E.g. having lz4 -1 for toast can make a lot of sense: repeated
detoasting suddenly is much less of an issue, while still achieving
higher compression than pglz.

.oO(Now I really want to see how pglz compares to the above)
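
For reference, the liblz4 one-shot API is about as small as it gets; a rough sketch of a round trip (illustration only, nothing to do with actual toast/varlena code):

/*
 * Rough sketch of liblz4's one-shot API, roughly the shape a toast
 * compression method needs; the point being that decompression is cheap
 * enough that repeated detoasting stops hurting.  Illustration only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lz4.h>

int
main(void)
{
    const char *src = "some moderately repetitive datum payload payload payload";
    int         srcLen = (int) strlen(src);
    int         bound = LZ4_compressBound(srcLen);
    char       *compressed = malloc(bound);
    char       *roundtrip = malloc(srcLen + 1);

    /* the default fast path, i.e. the lz4 -1 equivalent */
    int         cLen = LZ4_compress_default(src, compressed, srcLen, bound);

    /* decompress back into a buffer of the known original size */
    int         dLen = LZ4_decompress_safe(compressed, roundtrip, cLen, srcLen);

    printf("%d bytes -> %d compressed, %d decompressed\n", srcLen, cLen, dLen);
    free(compressed);
    free(roundtrip);
    return 0;
}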


> > One thing this reminded me of is whether using a format (tar) that
> > doesn't allow efficient addressing of individual files is a good idea
> > for base backups. The compression rates very likely will be better when
> > not compressing tiny files individually, but at the same time it'd be
> > very useful to be able to access individual files more efficiently than
> > O(N). I can imagine that being important for some cases of incremental
> > backup assembly.
> 
> Yeah, being able to operate directly on the compressed version of the
> file would be very useful, but I'm not sure that we have great options
> available there. I think the only widely-used format that supports
> that is ".zip", and I'm not too sure about emitting zip files.

I don't really see a problem with emitting .zip files. It's an extremely
widely used container format for all sorts of file formats these days.
Except for needing somewhat more complicated code during generation /
unpacking (and I don't think it's *that* big of a difference), it seems
clearly advantageous over .tar.gz etc.
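
To spell out the random access advantage: a .zip ends with an end-of-central-directory record pointing at a central directory that lists every member and its offset. A rough sketch of locating it, per the PKWARE APPNOTE (assumes no trailing archive comment and a little-endian host; illustration only):

/*
 * Rough sketch of how a reader finds .zip members in O(1)-ish: the
 * end-of-central-directory record at the end of the archive points at a
 * central directory listing every member and its offset.  Assumes no
 * trailing archive comment (a real reader scans backwards for the
 * signature) and a little-endian host.
 */
#include <stdint.h>
#include <stdio.h>

#pragma pack(push, 1)
typedef struct ZipEndOfCentralDir
{
    uint32_t    signature;          /* 0x06054b50 */
    uint16_t    disk_no;
    uint16_t    cd_start_disk;
    uint16_t    entries_this_disk;
    uint16_t    entries_total;      /* number of members in the archive */
    uint32_t    cd_size;            /* central directory size in bytes */
    uint32_t    cd_offset;          /* where the central directory starts */
    uint16_t    comment_len;
} ZipEndOfCentralDir;
#pragma pack(pop)

int
main(int argc, char **argv)
{
    ZipEndOfCentralDir eocd;
    FILE       *f;

    if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL)
        return 1;
    if (fseek(f, -(long) sizeof(eocd), SEEK_END) != 0 ||
        fread(&eocd, sizeof(eocd), 1, f) != 1 ||
        eocd.signature != 0x06054b50)
    {
        fprintf(stderr, "not a comment-less zip archive\n");
        return 1;
    }

    /*
     * One seek to cd_offset and a linear scan over just the directory
     * yields every member's name, sizes and local-header offset.
     */
    printf("%u members, central directory of %u bytes at offset %u\n",
           (unsigned) eocd.entries_total, (unsigned) eocd.cd_size,
           (unsigned) eocd.cd_offset);
    fclose(f);
    return 0;
}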


> Apparently, pixz also supports random access to archive members, and
> it did have one entry that survived my arbitrary cut in the table
> above, but the last release was in 2015, and it seems to be only a
> command-line tool, not a library. It also depends on libarchive and
> liblzma, which is not awful, but I'm not sure we want to suck in that
> many dependencies. But that's really a secondary thing: I can't
> imagine us depending on something that hasn't had a release in 5
> years, and has less than 300 total commits.

Oh, yea. I just looked at the various tools I could find that did
parallel compression.


> Other options include, perhaps, (1) emitting a tarfile of compressed
> files instead of a compressed tarfile

Yea, that'd help some. Although I am not sure how good the tooling is for
seeking through tarfiles in an O(files) rather than O(bytes) manner.
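
For the uncompressed (or per-file compressed) outer tar, skipping header-by-header is at least simple enough; a rough sketch for plain ustar, ignoring pax/GNU extensions (illustration only):

/*
 * Rough sketch of walking an uncompressed tar in O(number of files):
 * read each 512-byte header, parse the octal size field, and seek past
 * the padded member data.  Plain ustar only, no pax/GNU extensions or
 * base-256 sizes.
 */
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
    unsigned char header[512];
    FILE       *f;

    if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL)
        return 1;

    /* an all-zero header block marks the end of the archive (simplified) */
    while (fread(header, sizeof(header), 1, f) == 1 && header[0] != '\0')
    {
        /* the size field is 12 bytes of octal ASCII at offset 124 */
        unsigned long size = strtoul((char *) header + 124, NULL, 8);
        unsigned long padded = (size + 511) & ~511UL;

        printf("%.100s\t%lu bytes\n", (char *) header, size);

        /* skip the member's data without reading it */
        if (fseek(f, (long) padded, SEEK_CUR) != 0)
            break;
    }
    fclose(f);
    return 0;
}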

I think there are some cases where using separate compression state for
each file would hurt us. Some of the archive formats have support for
reusing compression state, but I don't know which.


> , and (2) writing our own index files. We don't know when we begin
> emitting the tarfile what files we're going to find or how big they
> will be, so we can't really emit a directory at the beginning of the
> file. Even if we thought we knew, files can disappear or be truncated
> before we get around to archiving them. However, when we reach the end
> of the file, we do know what we included and how big it was, so
> possibly we could generate an index for each tar file, or include
> something in the backup manifest.

Hm. There's some appeal to just storing offsets in the manifest, and
making sure each is a seekable offset in the compression stream. OTOH, it
makes it pretty hard for other tools to generate a compatible archive.
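
One way the offset approach could look: compress each file as its own zstd frame and record where that frame starts, so a reader can seek straight to a member and decompress only it. A rough sketch, with a made-up manifest format (one-shot API for brevity; not a concrete proposal):

/*
 * Rough sketch of the offsets-in-the-manifest idea: compress each file as
 * its own zstd frame and record where that frame starts in the archive.
 * One-shot API for brevity, made-up tab-separated manifest format, error
 * handling mostly elided; illustration only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zstd.h>

static void
append_member(FILE *archive, FILE *manifest,
              const char *name, const char *data, size_t len)
{
    long        frame_offset = ftell(archive); /* seekable start of the frame */
    size_t      bound = ZSTD_compressBound(len);
    void       *buf = malloc(bound);
    size_t      clen = ZSTD_compress(buf, bound, data, len, 1);

    if (ZSTD_isError(clen))
    {
        fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(clen));
        exit(1);
    }
    fwrite(buf, 1, clen, archive);
    /* hypothetical manifest line: name, frame offset, compressed length */
    fprintf(manifest, "%s\t%ld\t%zu\n", name, frame_offset, clen);
    free(buf);
}

int
main(void)
{
    FILE       *archive = fopen("backup.zst", "wb");
    FILE       *manifest = fopen("backup.manifest", "w");
    const char *data1 = "fake relation data";
    const char *data2 = "more fake data";

    if (!archive || !manifest)
        return 1;

    append_member(archive, manifest, "base/1/1259", data1, strlen(data1));
    append_member(archive, manifest, "base/1/1249", data2, strlen(data2));

    /*
     * Each ZSTD_compress() call produced a self-contained frame, so the
     * concatenation is still a valid stream for plain zstd -d.
     */
    fclose(archive);
    fclose(manifest);
    return 0;
}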


> > The other big benefit is that zstd's library has multi-threaded
> > compression built in, whereas that's not the case for other libraries
> > that I am aware of.
> 
> Wouldn't it be a problem to let the backend become multi-threaded, at
> least on Windows?

We already have threads on Windows, e.g. the signal handler emulation
stuff runs in one. Are you thinking of this bit in postmaster.c:

#ifdef HAVE_PTHREAD_IS_THREADED_NP

    /*
     * On macOS, libintl replaces setlocale() with a version that calls
     * CFLocaleCopyCurrent() when its second argument is "" and every relevant
     * environment variable is unset or empty.  CFLocaleCopyCurrent() makes
     * the process multithreaded.  The postmaster calls sigprocmask() and
     * calls fork() without an immediate exec(), both of which have undefined
     * behavior in a multithreaded program.  A multithreaded postmaster is the
     * normal case on Windows, which offers neither fork() nor sigprocmask().
     */
    if (pthread_is_threaded_np() != 0)
        ereport(FATAL,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("postmaster became multithreaded during startup"),
                 errhint("Set the LC_ALL environment variable to a valid locale.")));
#endif

?

I don't really see any of the concerns there applying to the base
backup case.

Greetings,

Andres Freund


