Re: WIP/PoC for parallel backup - Mailing list pgsql-hackers

From Robert Haas
Subject Re: WIP/PoC for parallel backup
Date
Msg-id CA+TgmoZ81uKVCE3vFxP1EVwbLBUL9hSU_byg5F2-L1eVXajoYw@mail.gmail.com
Whole thread Raw
In response to Re: WIP/PoC for parallel backup  (Asif Rehman <asifr.rehman@gmail.com>)
Responses Re: WIP/PoC for parallel backup
Re: WIP/PoC for parallel backup
List pgsql-hackers
On Tue, Apr 14, 2020 at 10:37 AM Asif Rehman <asifr.rehman@gmail.com> wrote:
> I forgot to make a check for no-manifest. Fixed. Attached is the updated patch.

+typedef struct
+{
...
+} BackupFile;
+
+typedef struct
+{
...
+} BackupState;

These structures need comments.

+list_wal_files_opt_list:
+                       SCONST SCONST
                                {
-                                 $$ = makeDefElem("manifest_checksums",
-
(Node *)makeString($2), -1);
+                                       $$ = list_make2(
+                                       makeDefElem("start_wal_location",
+                                               (Node *)makeString($2), -1),
+                                       makeDefElem("end_wal_location",
+                                               (Node *)makeString($2), -1));
+
                                }

This seems like an unnecessarily complicated parse representation. The
DefElems seem to be completely unnecessary here.

@@ -998,7 +1110,37 @@ SendBaseBackup(BaseBackupCmd *cmd)
                set_ps_display(activitymsg);
        }

-       perform_base_backup(&opt);
+       switch (cmd->cmdtag)

So the design here is that SendBaseBackup() is now going to do a bunch
of things that are NOT sending a base backup? With no updates to the
comments of that function and no change to the process title it sets?

-       return (manifest->buffile != NULL);
+       return (manifest && manifest->buffile != NULL);

Heck no. It appears that you didn't even bother reading the function
header comment.

+ * Send a single resultset containing XLogRecPtr record (in text format)
+ * TimelineID and backup label.
  */
 static void
-SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli)
+SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli,
+                                        StringInfo label, char *backupid)

This just casually breaks wire protocol compatibility, which seems
completely unacceptable.

+       if (strlen(opt->tablespace) > 0)
+               sendTablespace(opt->tablespace, NULL, true, NULL, &files);
+       else
+               sendDir(".", 1, true, NIL, true, NULL, NULL, &files);
+
+       SendFilesHeader(files);

So I guess the idea here is that we buffer the entire list of files in
memory, regardless of size, and then we send it out afterwards. That
doesn't seem like a good idea. The list of files might be very large.
We probably need some code refactoring here rather than just piling
more and more different responsibilities onto sendTablespace() and
sendDir().

+       if (state->parallel_mode)
+               SpinLockAcquire(&state->lock);
+
+       state->throttling_counter += increment;
+
+       if (state->parallel_mode)
+               SpinLockRelease(&state->lock);

I don't like this much. It seems to me that we would do better to use
atomics here all the time, instead of conditional spinlocks.

+static void
+send_file(basebackup_options *opt, char *file, bool missing_ok)
...
+       if (file == NULL)
+               return;

That seems totally inappropriate.

+                       sendFile(file, file + basepathlen, &statbuf,
true, InvalidOid, NULL, NULL);

Maybe I'm misunderstanding, but this looks like it's going to write a
tar header, even though we're not writing a tarfile.

+               else
+                       ereport(WARNING,
+                                       (errmsg("skipping special file
or directory \"%s\"", file)));

So, if the user asks for a directory or symlink, what's going to
happen is that they're going to receive an empty file, and get a
warning. That sounds like terrible behavior.

+       /*
+        * Check for checksum failures. If there are failures across multiple
+        * processes it may not report total checksum count, but it will error
+        * out,terminating the backup.
+        */

In other words, the patch breaks the feature. Not that the feature in
question works particularly well as things stand, but this makes it
worse.

I think this patch (0003) is in really bad shape. I'm having second
thoughts about the design, but it's kind of hard to even have a
discussion about the design when the patch is riddled with minor
problems like inadequate comments, failure to update existing
comments, and breaking a bunch of things. I understand that sometimes
things get missed, but this is version 14 of a patch that's been
kicking around since last August.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: cleaning perl code
Next
From: Tom Lane
Date:
Subject: Re: ERROR: could not determine which collation to use for string comparison