Re: WIP/PoC for parallel backup - Mailing list pgsql-hackers

From Asif Rehman
Subject Re: WIP/PoC for parallel backup
Date
Msg-id CADM=JehpCq0sdXbk0NuesBE8mxnpyWSW07D=yGoUPOcVAgb=iA@mail.gmail.com
Whole thread Raw
In response to Re: WIP/PoC for parallel backup  (Rushabh Lathia <rushabh.lathia@gmail.com>)
Responses Re: WIP/PoC for parallel backup  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers


On Mon, Oct 7, 2019 at 1:52 PM Rushabh Lathia <rushabh.lathia@gmail.com> wrote:
Thanks Asif for the patch.  I am opting this for a review.  Patch is
bit big, so here are very initial comments to make the review process
easier.

Thanks Rushabh for reviewing the patch.


1) Patch seems doing lot of code shuffling, I think it would be easy
to review if you can break the clean up patch separately.

Example:
a: setup_throttle
b: include_wal_files

2) As I can see this patch basically have three major phase.

a) Introducing new commands like START_BACKUP, SEND_FILES_CONTENT and
STOP_BACKUP.
b) Implementation of actual parallel backup.
c) Testcase

I would suggest, if you can break out in three as a separate patch that
would be nice.  It will benefit in reviewing the patch.

Sure, why not. I will break them into multiple patches.
 

3) In your patch you are preparing the backup manifest (file which
giving the information about the data files). Robert Haas, submitted
the backup manifests patch on another thread [1], and I think we
should use that patch to get the backup manifests for parallel backup.

Sure. Though the backup manifest patch calculates and includes the checksum of backup files and is done
while the file is being transferred to the frontend-end. The manifest file itself is copied at the
very end of the backup. In parallel backup, I need the list of filenames before file contents are transferred, in
order to divide them into multiple workers. For that, the manifest file has to be available when START_BACKUP
 is called. 

That means, backup manifest should support its creation while excluding the checksum during START_BACKUP().
I also need the directory information as well for two reasons:

- In plain format, base path has to exist before we can write the file. we can extract the base path from the file
but doing that for all files does not seem a good idea.
- base backup does not include the content of some directories but those directories although empty, are still
expected in PGDATA.

I can make these changes part of parallel backup (which would be on top of backup manifest patch) or
these changes can be done as part of manifest patch and then parallel can use them. 

Robert what do you suggest?


-- 
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca

pgsql-hackers by date:

Previous
From: Alexey Kondratov
Date:
Subject: Re: Two pg_rewind patches (auto generate recovery conf and ensureclean shutdown)
Next
From: Robert Haas
Date:
Subject: Re: stress test for parallel workers