Re: WIP/PoC for parallel backup - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: WIP/PoC for parallel backup
Date:
Msg-id: CA+TgmoaecyUtZ7zmFKnrBH6+r9y-D4O=LdOTPXC+CjyrD0MDFQ@mail.gmail.com
In response to: WIP/PoC for parallel backup (Asif Rehman <asifr.rehman@gmail.com>)
Responses: Re: WIP/PoC for parallel backup
List: pgsql-hackers
On Wed, Aug 21, 2019 at 9:53 AM Asif Rehman <asifr.rehman@gmail.com> wrote:
> - BASE_BACKUP [PARALLEL] - returns a list of files in PGDATA
> If the parallel option is there, then it will only do pg_start_backup,
> scan PGDATA and send a list of file names.

So IIUC, this would mean that BASE_BACKUP without PARALLEL returns
tarfiles, and BASE_BACKUP with PARALLEL returns a result set with a
list of file names. I don't think that's a good approach. It's too
confusing to have one replication command that returns totally
different things depending on whether some option is given.

> - SEND_FILES_CONTENTS (file1, file2, ...) - returns the files in the
> given list.
> pg_basebackup will then send back a list of filenames in this command.
> This command will be sent by each worker, and that worker will be
> getting the said files.

Seems reasonable, but I think you should just pass one file name and
use the command multiple times, once per file.

> - STOP_BACKUP
> when all workers finish then, pg_basebackup will send STOP_BACKUP command.

This also seems reasonable, but surely the matching command should
then be called START_BACKUP, not BASEBACKUP PARALLEL.
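
Putting those pieces together, the flow on the worker connections could
end up looking roughly like this (the command spellings are just the
ones being discussed in this thread, the file names are made up, and the
one-file-per-command form is my suggestion rather than what the patch
currently does):

    START_BACKUP                               -- once; returns the list of files to copy
    SEND_FILES_CONTENTS ('base/16384/16396')   -- each worker, one file per command
    SEND_FILES_CONTENTS ('base/16384/16399')
    ...
    STOP_BACKUP                                -- once, after every worker has finished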

> I have done a basic proof of concept (POC), which is also attached. I
> would appreciate some input on this. So far, I am simply dividing the
> list equally and assigning them to worker processes. I intend to
> fine-tune this by taking into consideration file sizes. Further, to add
> tar format support, I am considering that each worker process processes
> all files belonging to a tablespace in its list (i.e. creates and copies
> tar file), before it processes the next tablespace. As a result, this
> will create tar files that are disjointed with respect to tablespace
> data. For example:

Instead of doing this, I suggest that you should just maintain a list
of all the files that need to be fetched and have each worker pull a
file from the head of the list and fetch it when it finishes receiving
the previous file.  That way, if some connections go faster or slower
than others, the distribution of work ends up fairly even.  If you
instead pre-distribute the work, you're guessing what's going to
happen in the future instead of just waiting to see what actually does
happen. Guessing isn't intrinsically bad, but guessing when you could
be sure of doing the right thing *is* bad.
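
On the client side, the shared list could be as simple as a
mutex-protected cursor into the file list. Here's a sketch, with
invented names, assuming one thread per connection:

    #include <pthread.h>

    typedef struct FileQueue
    {
        pthread_mutex_t lock;
        char          **files;      /* every file reported at backup start */
        int             nfiles;
        int             next;       /* index of the next unclaimed file */
    } FileQueue;

    /* Each worker calls this as soon as it finishes receiving its previous file. */
    static char *
    queue_next_file(FileQueue *q)
    {
        char *file = NULL;

        pthread_mutex_lock(&q->lock);
        if (q->next < q->nfiles)
            file = q->files[q->next++];
        pthread_mutex_unlock(&q->lock);

        return file;                /* NULL means there is no work left */
    }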

If you want to be really fancy, you could start by sorting the files
in descending order of size, so that big files are fetched before
small ones.  Since the largest possible file is 1GB and any database
where this feature is important is probably hundreds or thousands of
GB, this may not be very important. I suggest not worrying about it
for v1.
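
If we ever do want it, it's just a matter of sorting the list before the
workers start pulling from it, along these lines (again a sketch, with
invented names):

    #include <stdlib.h>

    typedef struct BackupFile
    {
        char   *path;
        size_t  size;               /* at most 1GB, the segment size */
    } BackupFile;

    /* qsort comparator: largest files first, so they are fetched first. */
    static int
    compare_size_desc(const void *a, const void *b)
    {
        const BackupFile *fa = (const BackupFile *) a;
        const BackupFile *fb = (const BackupFile *) b;

        if (fa->size < fb->size)
            return 1;
        if (fa->size > fb->size)
            return -1;
        return 0;
    }

    /* ... qsort(files, nfiles, sizeof(BackupFile), compare_size_desc); ... */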

> Say, tablespace t1 has 20 files and we have 5 worker processes and
> tablespace t2 has 10. Ignoring all other factors for the sake of this
> example, each worker process will get a group of 4 files of t1 and 2
> files of t2. Each process will create 2 tar files, one for t1 containing
> 4 files and another for t2 containing 2 files.

This is one of several possible approaches. If we're doing a
plain-format backup in parallel, we can just write each file where it
needs to go and call it good. But, with a tar-format backup, what
should we do? I can see three options:

1. Error! Tar format parallel backups are not supported.

2. Write multiple tar files. The user might reasonably expect that
they're going to end up with the same files at the end of the backup
regardless of whether they do it in parallel. A user with this
expectation will be disappointed.

3. Write one tar file. In this design, the workers have to take turns
writing to the tar file, so you need some synchronization around that.
Perhaps you'd have N threads that read and buffer a file, and N+1
buffers.  Then you have one additional thread that reads the complete
files from the buffers and writes them to the tar file. There's
obviously some possibility that the writer won't be able to keep up
and writing the backup will therefore be slower than it would be with
approach (2).
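
For what it's worth, the writer side of (3) doesn't have to be
complicated. Roughly something like this, with invented names, and with
tar header and padding handling omitted:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define N_WORKERS 4

    typedef struct FileBuffer
    {
        char   *data;               /* one complete file, fully received */
        size_t  len;
        bool    ready;              /* filled by a reader, not yet written */
    } FileBuffer;

    typedef struct TarWriter
    {
        pthread_mutex_t lock;
        pthread_cond_t  cv;         /* signalled when a buffer changes state or done is set */
        FileBuffer      bufs[N_WORKERS + 1];
        bool            done;       /* all readers have finished */
        FILE           *tarfile;
    } TarWriter;

    static void *
    tar_writer_main(void *arg)
    {
        TarWriter *tw = (TarWriter *) arg;

        for (;;)
        {
            FileBuffer *buf = NULL;
            int         i;

            /* Wait until some reader has buffered a complete file. */
            pthread_mutex_lock(&tw->lock);
            while (buf == NULL)
            {
                for (i = 0; i < N_WORKERS + 1; i++)
                {
                    if (tw->bufs[i].ready)
                    {
                        buf = &tw->bufs[i];
                        break;
                    }
                }
                if (buf != NULL || tw->done)
                    break;
                pthread_cond_wait(&tw->cv, &tw->lock);
            }
            pthread_mutex_unlock(&tw->lock);

            if (buf == NULL)
                break;              /* readers are done and nothing is left to write */

            /* Append the buffered file to the archive. */
            fwrite(buf->data, 1, buf->len, tw->tarfile);

            /* Give the buffer back so one of the readers can refill it. */
            pthread_mutex_lock(&tw->lock);
            buf->ready = false;
            pthread_cond_broadcast(&tw->cv);
            pthread_mutex_unlock(&tw->lock);
        }

        return NULL;
    }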

There's probably also a possibility that approach (2) would thrash the
disk head back and forth between multiple files that are all being
written at the same time, and approach (3) will therefore win by not
thrashing the disk head. But, since spinning media are becoming less
and less popular and are likely to have multiple disk heads under the
hood when they are used, this is probably not too likely.

I think your choice to go with approach (2) is probably reasonable,
but I'm not sure whether everyone will agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


