Re: WIP/PoC for parallel backup - Mailing list pgsql-hackers

From Asif Rehman
Subject Re: WIP/PoC for parallel backup
Date
Msg-id CADM=Jejh2gFzeOpYCVJPWmL9NqOPhOcqYLLWcF0CLFjq1JcvqA@mail.gmail.com
In response to Re: WIP/PoC for parallel backup  (Asif Rehman <asifr.rehman@gmail.com>)
Responses Re: WIP/PoC for parallel backup
List pgsql-hackers


On Sat, Jan 4, 2020 at 11:53 AM Asif Rehman <asifr.rehman@gmail.com> wrote:


On Thu, Dec 19, 2019 at 10:47 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 12, 2019 at 10:20 AM Asif Rehman <asifr.rehman@gmail.com> wrote:
> I have updated the patches (v7 attached) and have taken care of all issues pointed by Jeevan, additionally
> ran the pgindent on each patch. Furthermore, Command names have been renamed as suggested and I
> have simplified the SendFiles function. Client can only request the regular files, any other kind such as
> directories or symlinks will be skipped, the client will be responsible for taking care of such.

Hi,

Patch 0001 of this series conflicts with my recent commit
303640199d0436c5e7acdf50b837a027b5726594; that commit was actually
inspired by some previous study of 0001. That being said, I think 0001
has the wrong idea. There's no reason that I can see why it should be
correct to remove the PG_ENSURE_ERROR_CLEANUP calls from
perform_base_backup(). It's true that if we register a long-lived
before_shmem_exit hook, then the backup will get cleaned up even
without the PG_ENSURE_ERROR_CLEANUP block, but there's also the
question of the warning message. I think that our goal should be to
emit the warning message about a backup being stopped too early if the
user uses either pg_start_backup() or the new START_BACKUP command and
does not end the backup with either pg_stop_backup() or the new
STOP_BACKUP command -- but not if a single command that both starts
and ends a backup, like BASE_BACKUP, is interrupted. To accomplish
that goal in the wake of 303640199d0436c5e7acdf50b837a027b5726594, we
need to temporarily register do_pg_abort_backup() as a
before_shmem_exit() handler using PG_ENSURE_ERROR_CLEANUP() during
commands like BASE_BACKUP() -- and for things like pg_start_backup()
or the new START_BACKUP command, we just need to add a single call to
register_persistent_abort_backup_handler().

So I think you can drop 0001, and then in the patch that actually
introduces START_BACKUP, add the call to
register_persistent_abort_backup_handler() before calling
do_pg_start_backup(). Also in that patch, also adjust the warning text
that do_pg_abort_backup() emits to be more generic e.g. "aborting
backup due to backend exiting while a non-exclusive backup is in
progress".

Sure, will do.
 
0003 creates three new functions, moving code from
do_pg_start_backup() to a new function collectTablespaces() and from
perform_base_backup() to new functions setup_throttle() and
include_wal_files(). I'm skeptical about all of these changes. One
general nitpick is that the way these function names are capitalized
and punctuated does not seem to have been chosen very consistently;
how about name_like_this() throughout? A bit more substantively:

- collectTablespaces() is factored out of do_pg_start_backup() so that
it can also be used by SendFileList(), but that means that a client is
going to invoke START_BACKUP, indirectly calling collectTablespaces(),
and then immediately afterward the client is probably going to call
SEND_FILE_LIST, which will again call collectTablespaces(). That does
not appear to be super-great. For one thing, it's duplicate work,
although because SendFileList() is going to pass infotbssize as false,
it's not a lot of duplicated work.

I'll remove this duplication by eliminating this call from the START_BACKUP and
SEND_FILE_LIST functions. More on this is explained later in this email.
 
Also, what happens if the two calls
to collectTablespaces() return different answers due to concurrent
CREATE/DROP TABLESPACE commands? Maybe it would all work out fine, but
it seems like there is at least the possibility of bugs if different
parts of the backup have different notions of what tablespaces exist.

Concurrent CREATE/DROP TABLESPACE commands can happen, and the resulting
differences will be resolved by the WAL files collected for the backup. I don't
think we can do anything about objects that are created or dropped between start
and stop backup. BASE_BACKUP also relies on the WAL files to handle such a
scenario and does not error out when some relation files go away.
 

- setup_throttle() is factored out of perform_base_backup() so that it
can be called in StartBackup() and StopBackup() and SendFiles(). This
seems extremely odd. Why does it make any sense to give the user an
option to activate throttling when *ending* a backup? Why does it make
sense to give the user a chance to enable throttling *both* at the
startup of a backup *and also* for each individual file. If we're
going to support throttling here, it seems like it should be either a
backup-level property or a file-level property, not both.

It's a file-level property only. The throttle functionality relies on global
variables. StartBackup() and StopBackup() call setup_throttle() to disable
throttling.

I should have been more explicit here by passing -1 to setup_throttle(),
making it clear that throttling is disabled, instead of using 'opt->maxrate'
(although it defaults to -1 for these functions).

I'll remove the setup_throttle() call from both functions.
 

- include_wal_files() is factored out of perform_base_backup() so that
it can be called by StopBackup(). This seems like a poor design
decision. The idea behind the BASE_BACKUP command is that you run that
one command, and the server sends you everything. The idea in this new
way of doing business is that the client requests the individual files
it wants -- except for the WAL files, which are for some reason not
requested individually but sent all together as part of the
STOP_BACKUP response. It seems like it would be more consistent if the
client were to decide which WAL files it needs and request them one by
one, just as we do with other files.

As I understand it, you are suggesting adding another command to fetch the
list of WAL files, which the client would call after executing stop backup.
Once the client gets that list, it requests the WAL files one by one.

So I will add a LIST_WAL_FILES command that takes start_lsn and end_lsn
as arguments and returns the list of WAL files between these LSNs.

Something like this:
LIST_WAL_FILES 'start_lsn' 'end_lsn';
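For illustration only, the segment names such a command would return can be
derived directly from the LSN range. This is a minimal client-side sketch,
assuming the default 16MB WAL segment size and the standard 24-hex-digit
segment naming; wal_file_name() and list_wal_range() are hypothetical helpers,
not part of the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WAL_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)          /* default 16MB */
#define SEGS_PER_XLOGID (UINT64_C(0x100000000) / WAL_SEG_SIZE)

/* Write the WAL segment file name for a given timeline and segment number,
 * following the usual 24-hex-digit naming convention. */
static void wal_file_name(char *buf, uint32_t tli, uint64_t segno)
{
    snprintf(buf, 25, "%08X%08X%08X", tli,
             (unsigned int) (segno / SEGS_PER_XLOGID),
             (unsigned int) (segno % SEGS_PER_XLOGID));
}

/* Fill 'names' with the segments covering [start_lsn, end_lsn]; returns the
 * number of names written. */
static int list_wal_range(uint64_t start_lsn, uint64_t end_lsn, uint32_t tli,
                          char names[][25], int max)
{
    int n = 0;
    uint64_t seg;

    for (seg = start_lsn / WAL_SEG_SIZE;
         seg <= end_lsn / WAL_SEG_SIZE && n < max; seg++)
        wal_file_name(names[n++], tli, seg);
    return n;
}
```

With start_lsn 0/1000000 and end_lsn 0/3000000 on timeline 1, this yields the
three segments 000000010000000000000001 through 000000010000000000000003.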
 

I think there's a common theme to all of these complaints, which is
that you haven't done enough to move things that are the
responsibility of the backend in the BASE_BACKUP model to the frontend
in this model. I started wondering, for example, whether it might not
be better to have the client rather than the server construct the
tablespace_map file. After all, the client needs to get the list of
files anyway (hence SEND_FILE_LIST) and if it's got that then it knows
almost enough to construct the tablespace map. The only additional
thing it needs is the full pathname to which the link points. But, it
seems that we could fairly easily extend SEND_FILE_LIST to send, for
files that are symbolic links, the target of the link, using a new
column. Or alternatively, using a separate command, so that instead of
just sending a single SEND_FILE_LIST command, the client might first
ask for a tablespace list and then might ask for a list of files
within each tablespace (e.g. LIST_TABLESPACES, then LIST_FILES <oid>
for each tablespace, with 0 for the main tablespace, perhaps). I'm not
sure which way is better.

do_pg_start_backup() collects the tablespace information anyway to build the
tablespace_map for BASE_BACKUP, so returning the same information seemed
better than adding a new command for it; hence the multiple calls to
collectTablespaces() [to be renamed to collect_tablespaces()].

The tablespace_map can be constructed by the client, but BASE_BACKUP
currently returns it as part of the full backup. If clients in parallel mode
are to construct it themselves, the two modes will take different approaches.
Perhaps the client should construct it for BASE_BACKUP as well?

I'll refactor the do_pg_start_backup function to remove the code for
tablespace information collection (moving it to collect_tablespaces) and
tablespace_map file creation, so that this function does not collect the
information unnecessarily. The perform_base_backup function can collect and
send the tablespace information to the client, and the client can then
construct the tablespace_map file.

I'll add a new command, LIST_TABLESPACES, to fetch the list of tablespaces
for parallel mode. I will also refactor the START_BACKUP and STOP_BACKUP
commands so that they only do the specific job of putting the system into
backup mode or out of it, nothing else. These commands should only return
the start and end LSN to the client.
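As a side note, the start and end LSNs these commands return use PostgreSQL's
textual "X/X" form (two hex halves of a 64-bit position). A minimal sketch of
how a client could parse and re-format them; parse_lsn() and format_lsn() are
hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parse an LSN of the form "X/X" (hex hi/lo halves) into a 64-bit position. */
static int parse_lsn(const char *str, uint64_t *lsn)
{
    unsigned int hi, lo;

    if (sscanf(str, "%X/%X", &hi, &lo) != 2)
        return -1;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 0;
}

/* Format a 64-bit position back into the textual "X/X" form. */
static void format_lsn(char *buf, size_t sz, uint64_t lsn)
{
    snprintf(buf, sz, "%X/%X",
             (unsigned int) (lsn >> 32), (unsigned int) lsn);
}
```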

 

Similarly, for throttling, I have a hard time understanding how what
you've got here is going to work reasonably. It looks like each client
is just going to request whatever MAX_RATE the user specifies, but the
result of that will be that the actual transfer rate is probably a
multiple of the specified rate, approximately equal to the specified
rate times the number of clients. That's probably not what the user
wants. You could take the specified rate and divide it by the number
of workers, but limiting each of 4 workers to a quarter of the rate
will probably lead to a combined rate of less than than the specified
rate, because if one worker doesn't use all of the bandwidth to which
it's entitled, or even exits earlier than the others, the other
workers don't get to go any faster as a result. Another problem is
that, in the current approach, throttling applies overall to the
entire backup, but in this approach, it is applied separately to each
SEND_FILE command. In the current approach, if one file finishes a
little faster or slower than anticipated, the next file in the tarball
will be sent a little slower or faster to compensate. But in this
approach, each SEND_FILES command is throttled separately, so this
property is lost. Furthermore, while BASE_BACKUP sends data
continuously, this approach naturally involves pauses between
commands. If files are large, that won't matter much, but if they're
small and numerous, it will tend to cause the actual transfer rate to
be less than the throttling rate.

One potential way to solve this problem is... move it to the client
side. Instead of making it the server's job not to send data too fast,
make it the client's job not to receive data too fast. Let the server
backends write as fast as they want, and on the pg_basebackup side,
have the threads coordinate with each other so that they don't read
data faster than the configured rate. That's not quite the same thing,
though, because the server can get ahead by the size of the client's
receive buffers plus whatever data is on the wire. I don't know
whether that's a big enough problem to be worth caring about. If it
is, then I think we need some server infrastructure to "group
throttle" a group of cooperating backends.

That was a mistake in my code; maxrate should have been divided equally
amongst all threads. I agree that we should move this to the client side.
When a thread exits, its share should also be divided equally amongst
the remaining threads (i.e. recalculate maxrate for each remaining thread).

Say we have 4 running threads, each allocated 25% of the bandwidth.
Thread 1 exits. We recalculate the bandwidth and assign the remaining 3 threads
33.33% each. This solves one of the problems you identified. However,
it doesn't solve the case where one (or more) threads are not fully consuming
their allocated share. I'm not really sure how we can solve that properly.
Suggestions are welcome.
 

A general comment about 0004 is that it seems like you've proceeded by
taking the code from perform_base_backup() and spreading it across
several different functions without, necessarily, as much thought as
is needed there. For instance, StartBackup() looks like just the
beginning of perform_base_backup(). But, why shouldn't it instead look
like pg_start_backup() -- in fact, a simplified version that only
handles the non-exclusive backup case? Is the extra stuff it's doing
really appropriate? I've already complained about the
tablespace-related stuff here and the throttling, but there's more.
Setting statrelpath here will probably break if somebody tries to use
SEND_FILES without first calling START_BACKUP. Sending the
backup_label file here is oddly asymmetric, because that's done by
pg_stop_backup(), not pg_start_backup(). And similarly, StopBackup()
looks like it's just the end of perform_base_backup(), but that's
also pretty strange-looking. Again, I've already complained about
include_wal_files() being part of this, but there's also:

+       /* ... and pg_control after everything else. */

...which (1) is an odd thing to say when this is the first thing this
particular function is to send and (2) is another example of a sloppy
division of labor between client and server; apparently, the client is
supposed to know not to request pg_control, because the server is
going to send it unsolicited. There's no particular reason to have
this a special case. The client could just request it last. And then
the server code wouldn't need a special case, and you wouldn't have
this odd logic split between the client and the server.

Overall, I think this needs a lot more work. The overall idea's not
wrong, but there seem to be a very large number of details which, at
least to me, do not seem to be correct.



Thank you Robert for the detailed review. I really appreciate your insights
and very precise feedback.

After the changes suggested above, the design on a high level will look something
like this:

=== SEQUENTIAL EXECUTION ===
START_BACKUP [LABEL | FAST]
- Starts backup on the server
- Returns the start LSN to the client

LIST_TABLESPACES
- Sends a list of all tablespaces to the client

Loop over the tablespaces returned by LIST_TABLESPACES
- LIST_FILES [tablespace]
- Sends the file list for the given tablespace
- Build a combined list of all files

=== PARALLEL EXECUTION ===
Threads loop until the list of files is exhausted
SEND_FILE <file(s)> [CHECKSUM | WAL_START_LOCATION]
- If checksum verification is enabled then WAL_START_LOCATION is required.
- The server can send one or more files per request, but we request one at a time
- Pick the next file from the list of files

- Threads sleep after the list is exhausted
- All threads are sleeping

=== SEQUENTIAL EXECUTION ===
STOP_BACKUP [NOWAIT]
- Stops backup mode
- Returns the end LSN

If --wal-method=fetch then
LIST_WAL_FILES 'start_lsn' 'end_lsn'
- Sends a list of WAL files between start LSN and end LSN

=== PARALLEL EXECUTION ===
Threads loop until the list of WAL files is exhausted
SEND_FILE <WAL file>
- The server can send one or more files per request, but we request one WAL file at a time
- Pick the next file from the list of WAL files

- Threads terminate and set their status as completed/terminated

=== SEQUENTIAL EXECUTION ===
Cleanup
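The parallel phases above boil down to worker threads popping file names from
a shared list until it is exhausted. A minimal client-side sketch of that loop,
with all names (worker, run_backup_workers, the file list) hypothetical; a real
worker would issue SEND_FILES over its own replication connection where this
sketch just records the file:

```c
#include <assert.h>
#include <pthread.h>

#define NFILES 6
#define NWORKERS 3

static const char *files[NFILES] = {
    "base/1/1255", "base/1/1259", "base/1/2604",
    "base/1/2606", "base/1/2610", "global/pg_control"
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_file = 0;
static int fetched[NFILES];          /* how many times each file was fetched */

static void *worker(void *arg)
{
    (void) arg;
    for (;;)
    {
        int i;

        /* Pop the next file index under the mutex. */
        pthread_mutex_lock(&list_lock);
        i = next_file < NFILES ? next_file++ : -1;
        pthread_mutex_unlock(&list_lock);

        if (i < 0)
            break;                   /* list exhausted: worker goes idle */

        /* A real worker would issue SEND_FILES for files[i] here. */
        fetched[i]++;
    }
    return NULL;
}

static void run_backup_workers(void)
{
    pthread_t tids[NWORKERS];
    int i;

    for (i = 0; i < NWORKERS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (i = 0; i < NWORKERS; i++)
        pthread_join(tids[i], NULL);
}
```

The same loop serves both parallel phases; only the underlying list (data
files, then WAL files) changes.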




Here are the updated patches, taking care of the issues pointed out
earlier. This patch set adds the following commands (with the specified options):

START_BACKUP [LABEL '<label>'] [FAST]
STOP_BACKUP [NOWAIT]
LIST_TABLESPACES [PROGRESS]
LIST_FILES [TABLESPACE]
LIST_WAL_FILES [START_WAL_LOCATION 'X/X'] [END_WAL_LOCATION 'X/X']
SEND_FILES '(' FILE, FILE... ')' [START_WAL_LOCATION 'X/X']
                            [NOVERIFY_CHECKSUMS]


Parallel backup does not make any use of the tablespace map, so I have
removed that option from the above commands. There is a patch pending
to remove exclusive backup; we can further refactor the do_pg_start_backup
function at that time, to remove the tablespace information collection and
move the creation of the tablespace_map file to the client.


I have disabled the maxrate option for parallel backup. I intend to send
out a separate patch for it. Robert previously suggested implementing
throttling on the client side. I found the original email thread [1]
where throttling was proposed and added to the server. In that thread,
it was originally implemented on the client side, but following many
suggestions, it was moved to the server side.

So, I have a few suggestions on how we can implement this:

1- Have another option for pg_basebackup (i.e. per-worker-maxrate) where
the user chooses the bandwidth allocation for each worker. This approach
can be implemented on the client side as well as on the server side.

2- Have the maxrate be divided equally among workers at first, and then
let the main thread keep adjusting it whenever one of the workers finishes.
I believe this would only be possible if we handle throttling on the client.
Also, as I understand it, implementing this will require an additional mutex
to protect the bandwidth-consumption data so that the rate can be adjusted
according to the data received by the threads.
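Suggestion 2 could be sketched as a small mutex-protected structure on the
client side; this is only an illustration of the bookkeeping, and RateCtl and
the helper names are hypothetical, not from the patch:

```c
#include <assert.h>
#include <pthread.h>

/* The total --max-rate budget is shared by all workers; each worker asks for
 * its current share, which is recomputed when a worker finishes. */
typedef struct
{
    pthread_mutex_t lock;
    double          max_rate;    /* total budget, bytes/sec */
    int             active;      /* workers still running */
} RateCtl;

static void ratectl_init(RateCtl *rc, double max_rate, int workers)
{
    pthread_mutex_init(&rc->lock, NULL);
    rc->max_rate = max_rate;
    rc->active = workers;
}

/* Current per-worker budget: 4 workers get 25% each, 3 get 33.33%, etc. */
static double ratectl_share(RateCtl *rc)
{
    double share;

    pthread_mutex_lock(&rc->lock);
    share = rc->active > 0 ? rc->max_rate / rc->active : 0.0;
    pthread_mutex_unlock(&rc->lock);
    return share;
}

/* Called by a worker when it finishes; the remaining workers' share grows
 * the next time they consult ratectl_share(). */
static void ratectl_worker_done(RateCtl *rc)
{
    pthread_mutex_lock(&rc->lock);
    rc->active--;
    pthread_mutex_unlock(&rc->lock);
}
```

This handles redistribution on worker exit; it still does not address workers
that under-consume their share, as noted above.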


--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca

Attachment
