Re: design for parallel backup

From: Robert Haas
Subject: Re: design for parallel backup
Msg-id: CA+Tgmob9xC_OZxWpB1EKjby0g9RqC7rMkU6yQOkRXvaTAjuXig@mail.gmail.com
In response to: Re: design for parallel backup (Andres Freund <andres@anarazel.de>)
Responses: Re: design for parallel backup (Andres Freund <andres@anarazel.de>)
           Re: design for parallel backup (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List: pgsql-hackers
On Mon, Apr 20, 2020 at 4:19 PM Andres Freund <andres@anarazel.de> wrote:
> Why do we want parallelism here? Or, to be more precise: what do we hope
> to accelerate by making what part of creating a base backup
> parallel? There are several potential bottlenecks, and I think it's
> important to know the design priorities to evaluate a potential design.
>
> Bottlenecks (not ordered by importance):
> - compression performance (likely best solved by multiple compression
>   threads and a better compression algorithm)
> - unencrypted network performance (I'd like to see benchmarks showing in
>   which cases multiple TCP streams help / at which bandwidth it starts
>   to help)
> - encrypted network performance, i.e. SSL overhead (not sure this is an
>   important problem on modern hardware, given hardware accelerated AES)
> - checksumming overhead (a serious problem for cryptographic checksums,
>   but presumably not for others)
> - file IO (presumably multiple facets here, number of concurrent
>   in-flight IOs, kernel page cache overhead when reading TBs of data)
>
> I'm not really convinced that a design addressing the more crucial
> bottlenecks really needs multiple fe/be connections. But that seems to
> have been the focus of the discussion so far.
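
As a reference point on the compression bottleneck: multiple
compression threads don't require multiple connections. libzstd, for
example, exposes a worker pool behind its ordinary streaming API, so a
single stream can be compressed on several cores. A minimal sketch of
that approach (assuming a libzstd built with multithread support;
error handling omitted):

#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

static void
compress_stream(FILE *in, FILE *out, int nworkers)
{
	ZSTD_CCtx  *cctx = ZSTD_createCCtx();
	size_t		insz = ZSTD_CStreamInSize();
	size_t		outsz = ZSTD_CStreamOutSize();
	void	   *inbuf = malloc(insz);
	void	   *outbuf = malloc(outsz);
	size_t		nread;

	/* Compression runs on nworkers background threads; this thread
	 * only shuffles buffers, so it can keep I/O in flight. */
	ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 3);
	ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, nworkers);

	while ((nread = fread(inbuf, 1, insz, in)) > 0)
	{
		ZSTD_inBuffer input = {inbuf, nread, 0};

		while (input.pos < input.size)
		{
			ZSTD_outBuffer output = {outbuf, outsz, 0};

			ZSTD_compressStream2(cctx, &output, &input, ZSTD_e_continue);
			fwrite(outbuf, 1, output.pos, out);
		}
	}

	for (;;)					/* flush and finish the frame */
	{
		ZSTD_outBuffer output = {outbuf, outsz, 0};
		ZSTD_inBuffer input = {NULL, 0, 0};
		size_t		remaining;

		remaining = ZSTD_compressStream2(cctx, &output, &input, ZSTD_e_end);
		fwrite(outbuf, 1, output.pos, out);
		if (remaining == 0)
			break;
	}

	ZSTD_freeCCtx(cctx);
	free(inbuf);
	free(outbuf);
}

With nbWorkers set, ZSTD_compressStream2() hands input off to
background workers and returns, so the calling thread mostly just
moves buffers and can keep I/O going over a single connection.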

I haven't evaluated these bottlenecks myself. Both BART and pgBackRest
offer parallel backup options, and I'm pretty sure both were
performance-tested and found to be very significantly faster, but I
didn't write the code for either, nor have I evaluated either to
figure out exactly why it was faster.

My suspicion is that it has mostly to do with adequately utilizing the
hardware resources on the server side. If you are network-constrained,
adding more connections won't help, unless there's something shaping
the traffic which can be gamed by having multiple connections.
However, as things stand today, at any given point in time the base
backup code on the server will EITHER be attempting a single
filesystem I/O or a single network I/O, and likewise for the client.
If a backup client - either current or hypothetical - is compressing
and encrypting, then it has neither a filesystem I/O nor a network I/O
in progress while it's doing so. You not only take the hit of the time
required for compression and/or encryption, but also use that much
less of the available network and/or I/O capacity.
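
Concretely, today's flow has roughly the following shape (hypothetical
helper names, not the actual basebackup.c code; the structure is the
point):

#include <unistd.h>

/* Hypothetical helpers standing in for whatever compression/encryption
 * and retrying-send routines are in use. */
extern size_t compress_chunk(const char *in, size_t inlen, char *out);
extern void send_all(int sock, const char *buf, size_t len);

static void
backup_send_loop(int datafd, int sockfd)
{
	char		rawbuf[128 * 1024];
	char		zbuf[256 * 1024];
	ssize_t		nread;
	size_t		zlen;

	for (;;)
	{
		/* While this read blocks, the CPU and the network sit idle. */
		nread = read(datafd, rawbuf, sizeof(rawbuf));
		if (nread <= 0)
			break;				/* EOF or error */

		/* While compressing, neither a disk read nor a network write
		 * is in flight: the whole pipeline stalls on the CPU. */
		zlen = compress_chunk(rawbuf, (size_t) nread, zbuf);

		/* And while sending, the disk sits idle. */
		send_all(sockfd, zbuf, zlen);
	}
}

Each of the three calls blocks, so at any instant two of the three
resources - disk, CPU, network - are doing nothing.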

While I agree that some of these problems could likely be addressed in
other ways, parallelism seems to offer an approach that could solve
multiple issues at the same time. If you want to address them without
parallelism, you need asynchronous filesystem I/O and asynchronous
network I/O, both of those on both the client and server side, plus
multithreaded compression and multithreaded encryption, and maybe some
other things. That sounds pretty hairy and hard to get right.
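
To give a feel for that, here is a toy version of just one piece of
it: overlapping the disk read with compress-and-send using two threads
and a one-slot mailbox (same hypothetical helpers as above). Even this
much needs a mutex, a condition variable, and careful EOF handling,
and a real version would want deeper queues, more stages, and error
and cancellation paths:

#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (128 * 1024)

extern size_t compress_chunk(const char *in, size_t inlen, char *out);
extern void send_all(int sock, const char *buf, size_t len);

typedef struct
{
	pthread_mutex_t lock;
	pthread_cond_t cond;
	ssize_t		len;			/* -1 = empty, 0 = EOF, >0 = data */
	char		buf[CHUNK];
} Mailbox;

static Mailbox mbox = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, -1, ""
};

/* Stage 1: keep a disk read in flight while stage 2 works. */
static void *
reader_thread(void *arg)
{
	int			datafd = *(int *) arg;
	char		local[CHUNK];
	ssize_t		nread;

	do
	{
		nread = read(datafd, local, CHUNK);
		pthread_mutex_lock(&mbox.lock);
		while (mbox.len != -1)	/* wait until the slot is free */
			pthread_cond_wait(&mbox.cond, &mbox.lock);
		if (nread > 0)
			memcpy(mbox.buf, local, nread);
		mbox.len = (nread > 0) ? nread : 0; /* 0 doubles as EOF/error */
		pthread_cond_signal(&mbox.cond);
		pthread_mutex_unlock(&mbox.lock);
	} while (nread > 0);
	return NULL;
}

/* Stage 2: compress and send whatever stage 1 hands over. */
static void
backup_pipelined(int datafd, int sockfd)
{
	pthread_t	tid;
	char		raw[CHUNK];
	char		zbuf[2 * CHUNK];
	ssize_t		len;

	pthread_create(&tid, NULL, reader_thread, &datafd);
	for (;;)
	{
		pthread_mutex_lock(&mbox.lock);
		while (mbox.len == -1)	/* wait for data or EOF */
			pthread_cond_wait(&mbox.cond, &mbox.lock);
		len = mbox.len;
		if (len > 0)
			memcpy(raw, mbox.buf, len);
		mbox.len = -1;			/* free the slot */
		pthread_cond_signal(&mbox.cond);
		pthread_mutex_unlock(&mbox.lock);

		if (len == 0)
			break;
		send_all(sockfd, zbuf, compress_chunk(raw, (size_t) len, zbuf));
	}
	pthread_join(tid, NULL);
}

Multiply that by separate stages for checksumming, encryption, and
network writes, on both ends of the connection, and running N copies
of the existing loop over N connections starts to look like the
simpler engineering bet.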

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


