Re: design for parallel backup - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: design for parallel backup
Date
Msg-id CAA4eK1LQuN6LCaiHduaF0L47DUQ2e5cL_QNMytPRVUNidOA0dw@mail.gmail.com
Whole thread Raw
In response to Re: design for parallel backup  (Andres Freund <andres@anarazel.de>)
Responses Re: design for parallel backup  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Tue, Apr 21, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2020-04-20 16:36:16 -0400, Robert Haas wrote:
>
> > If a backup client - either current or hypothetical - is compressing
> > and encrypting, then it doesn't have either a filesystem I/O or a
> > network I/O in progress while it's doing so. You take not only the hit
> > of the time required for compression and/or encryption, but also use
> > that much less of the available network and/or I/O capacity.
>
> I don't think it's really the time for network/file I/O that's the
> issue. Sure memcpy()'ing from the kernel takes time, but compared to
> encryption/compression it's not that much.  Especially for compression,
> it's not really lack of cycles for networking that prevent a higher
> throughput, it's that after buffering a few MB there's just no point
> buffering more, given compression will plod along with 20-100MB/s.
>

It is quite likely that compression can benefit more from parallelism
as compared to the network I/O as that is mostly a CPU intensive
operation but I am not sure if we can just ignore the benefit of
utilizing the network bandwidth.  In our case, after copying from the
network we do write that data to disk, so during filesystem I/O the
network can be used if there is some other parallel worker processing
other parts of data.

Also, there may be some users who don't want their data to be
compressed due to some reason like the overhead of decompression is so
high that restore takes more time and they are not comfortable with
that as for them faster restore is much more critical then compressed
or fast back up.  So, for such things, the parallelism during backup
as being discussed in this thread will still be helpful.

OTOH, I think without some measurements it is difficult to say that we
have significant benefit by paralysing the backup without compression.
I have scanned the other thread [1] where the patch for parallel
backup was discussed and didn't find any performance numbers, so
probably having some performance data with that patch might give us a
better understanding of introducing parallelism in the backup.

[1] - https://www.postgresql.org/message-id/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Bogus documentation for bogus geometric operators
Next
From: Michael Paquier
Date:
Subject: Re: Do we need to handle orphaned prepared transactions in theserver?