Re: design for parallel backup - Mailing list pgsql-hackers

From Andres Freund
Subject Re: design for parallel backup
Msg-id 20200421053149.cjqiwohw5ge6bwa4@alap3.anarazel.de
In response to Re: design for parallel backup  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: design for parallel backup  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

On 2020-04-21 10:20:01 +0530, Amit Kapila wrote:
> It is quite likely that compression can benefit more from parallelism
> as compared to the network I/O as that is mostly a CPU intensive
> operation but I am not sure if we can just ignore the benefit of
> utilizing the network bandwidth.  In our case, after copying from the
> network we do write that data to disk, so during filesystem I/O the
> network can be used if there is some other parallel worker processing
> other parts of data.

Well, as I said, network and FS IO as done by server / pg_basebackup are
both fully buffered by the OS. Unless the OS throttles the userland
process, a large chunk of the work will be done by the kernel, in
separate kernel threads.

My workstation and my laptop can, in a single thread each, get close to
20GBit/s of network IO (bidirectional 10GBit, I don't have faster - it's
a thunderbolt 10gbe card) and iperf3 is at 55% CPU while doing so. Just
connecting locally it's 45GBit/s. Or over 8GByte/s of buffered
filesystem IO. And it doesn't even have that high a per-core clock speed.

I just don't see this being the bottleneck for now.


> Also, there may be some users who don't want their data to be
> compressed due to some reason like the overhead of decompression is so
> high that restore takes more time and they are not comfortable with
> that as for them faster restore is much more critical than compressed
> or fast backup.  So, for such things, the parallelism during backup
> as being discussed in this thread will still be helpful.

I am not even convinced it'll be helpful in a large fraction of
cases. The added overhead of more connections / processes isn't free.

I believe there are some cases where it'd help. E.g. if there are
multiple tablespaces on independent storage, parallelism as described
here could lead to significantly better utilization of the different
tablespaces. But that'd require sorting work between processes
appropriately.
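
To illustrate what "sorting work between processes" could mean, here is
a rough sketch, not taken from any patch in this thread. It assumes a
hypothetical file list of (tablespace, path, size) tuples and a
hypothetical helper name, and simply keeps each tablespace on a single
worker so that concurrent workers read from different storage:

```python
from collections import defaultdict


def assign_by_tablespace(files, n_workers):
    """Assign whole tablespaces to workers (illustrative only).

    files: iterable of (tablespace, path, size_bytes) tuples; this
    layout is an assumption made for the example.
    Returns a list of n_workers lists of paths.
    """
    # Group files by the tablespace they live in.
    by_ts = defaultdict(list)
    for tablespace, path, size in files:
        by_ts[tablespace].append((size, path))

    # Hand out the largest tablespaces first.
    order = sorted(by_ts, key=lambda ts: -sum(s for s, _ in by_ts[ts]))

    workers = [[] for _ in range(n_workers)]
    loads = [0] * n_workers
    for ts in order:
        # Give the whole tablespace to the currently least-loaded worker.
        w = loads.index(min(loads))
        for size, path in sorted(by_ts[ts], reverse=True):
            workers[w].append(path)
            loads[w] += size
    return workers


files = [
    ("ts_a", "a/1", 100),
    ("ts_a", "a/2", 50),
    ("ts_b", "b/1", 120),
    ("ts_c", "c/1", 10),
]
plan = assign_by_tablespace(files, 2)
```

With only one tablespace this degenerates to a single busy worker, which
is the point: the split only helps when the storage is actually
independent.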


> OTOH, I think without some measurements it is difficult to say that we
> have significant benefit by parallelizing the backup without compression.
> I have scanned the other thread [1] where the patch for parallel
> backup was discussed and didn't find any performance numbers, so
> probably having some performance data with that patch might give us a
> better understanding of introducing parallelism in the backup.

Agreed, we need some numbers.

Greetings,

Andres Freund


