Re: design for parallel backup - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: design for parallel backup
Msg-id: 20200430195222.jvhtccuguyv5gfei@alap3.anarazel.de
In response to: Re: design for parallel backup (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 2020-04-30 14:50:34 -0400, Robert Haas wrote:
> On Mon, Apr 20, 2020 at 4:19 PM Andres Freund <andres@anarazel.de> wrote:
> > One question I have not really seen answered well:
> >
> > Why do we want parallelism here. Or to be more precise: What do we hope
> > to accelerate by making what part of creating a base backup
> > parallel. There's several potential bottlenecks, and I think it's
> > important to know the design priorities to evaluate a potential design.
>
> I spent some time today trying to understand just one part of this,
> which is how long it will take to write the base backup out to disk
> and whether having multiple independent processes helps. I settled on
> writing and fsyncing 64GB of data, written in 8kB chunks

Why 8kB? That's smaller than what we currently do in pg_basebackup,
afaict, and you're actually going to be bottlenecked by syscall
overhead at that point (unless you disable / don't have the whole Intel
security mitigation stuff).


> , divided into 1, 2, 4, 8, or 16 equal size files, with each file
> written by a separate process, and an fsync() at the end before
> process exit. So in this test, there is no question of whether the
> master can read the data fast enough, nor is there any issue of
> network bandwidth. It's purely a test of whether it's faster to have
> one process write a big file or whether it's faster to have multiple
> processes each write a smaller file.

That's not necessarily the only question though, right? There's also the
approach of one process writing out multiple files (via buffered, not
async, IO) - e.g. one basebackup connecting to multiple backends, or just
shuffling multiple files through one copy stream.


> I tested this on EDB's cthulhu. It's an older server, but it happens
> to have 4 mount points available for testing, one with XFS + magnetic
> disks, one with ext4 + magnetic disks, one with XFS + SSD, and one
> with ext4 + SSD.

IIRC cthulhu's SSDs are not that fast compared to NVMe storage (by
nearly an order of magnitude). So this might be disadvantaging the
parallel case more than it should. Also, perhaps the ext4 disadvantage is
smaller on more modern kernel versions?

If you can provide me with the test program, I'd happily run it on some
decent, but not upper end, NVMe SSDs.
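
For reference, here's my guess at the shape of the test from your
description (fork per file, equal shares, 8kB write loop, fsync before
exit); the file names and argument handling are of course made up, and
if the actual program differs I'd rather run that.

/*
 * Guess at the test: nworkers processes, each writing total/nworkers
 * bytes in 8kB chunks to its own file, fsync at the end, then exit.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int     nworkers = (argc > 1) ? atoi(argv[1]) : 4;
    size_t  total = 64UL * 1024 * 1024 * 1024;      /* 64GB overall */
    size_t  per_worker = total / nworkers;

    for (int i = 0; i < nworkers; i++)
    {
        if (fork() == 0)
        {
            char    name[64];
            char    buf[8192];
            int     fd;

            snprintf(name, sizeof(name), "testfile.%d", i);     /* made-up names */
            fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            memset(buf, 'x', sizeof(buf));

            for (size_t done = 0; done < per_worker; done += sizeof(buf))
            {
                if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
                    _exit(1);
            }
            fsync(fd);
            close(fd);
            _exit(0);
        }
    }

    while (wait(NULL) > 0)
        ;
    return 0;
}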


> The fastest write performance of any test was the 16-way XFS-SSD test,
> which wrote at about 2.56 gigabytes per second. The fastest
> single-file test was on ext4-magnetic, though ext4-ssd and
> xfs-magnetic were similar, around 0.66 gigabytes per second.

I think you might also be seeing some interaction with write caching on
the RAID controller here. The file sizes are small enough to fit in its
cache to a significant degree for the single-file tests.


> Your system must be a LOT faster, because you were seeing
> pg_basebackup running at, IIUC, ~3 gigabytes per second, and that
> would have been a second process both writing and doing other
> things.

Right. On my workstation I have an NVMe SSD that can do ~2.5 GiB/s
sustained; in my laptop, one that peaks at ~3.2GiB/s but then quickly
drops to ~2GiB/s.

FWIW, I just ran a "benchmark" using dd, on my laptop, on battery (so
take this with a huge grain of salt). With one dd writing out 150GiB in
8kB blocks I get 1.8GiB/s; with two writing 75GiB each, ~840MiB/s; with
three writing 50GiB each, ~550MiB/s.


> Now, I don't know how much this matters. To get limited by this stuff,
> you'd need an incredibly fast network - 10 or maybe 40 or 100 Gigabit
> Ethernet or something like that - or to be doing a local backup. But I
> thought that it was interesting and that I should share it, so here
> you go! I do wonder if the apparent concurrency problems with ext4
> might matter on systems with high connection counts just in normal
> operation, backups aside.

I have seen such problems, though some of them have gotten better. On
most (all?) Linux filesystems we can easily run into filesystem
concurrency issues from within postgres: there's basically a file-level
exclusive lock for buffered writes (only for the copy into the page
cache, though), due to POSIX requirements that the effects of a write
be atomic.
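
To illustrate that effect (just a sketch, not postgres code; file name
and sizes are made up): two processes pwrite()ing into disjoint halves
of the same file contend on that per-file lock while copying into the
page cache, in a way that two processes writing separate files do not.

/*
 * Sketch: two children pwrite()ing into disjoint halves of one file,
 * so their copies into the page cache serialize on the per-file lock.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    size_t  half = 8UL * 1024 * 1024 * 1024;    /* 8GB per child */
    int     fd = open("sharedfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);    /* made-up name */

    for (int child = 0; child < 2; child++)
    {
        if (fork() == 0)
        {
            char    buf[128 * 1024];
            off_t   off = (off_t) child * half;

            memset(buf, 'x', sizeof(buf));
            for (size_t done = 0; done < half; done += sizeof(buf))
                pwrite(fd, buf, sizeof(buf), off + done);   /* contends on the inode lock */
            fsync(fd);
            _exit(0);
        }
    }

    while (wait(NULL) > 0)
        ;
    close(fd);
    return 0;
}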

Greetings,

Andres Freund


