Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs - Mailing list pgsql-hackers

From Chris Travers
Subject Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs
Date
Msg-id CAN-RpxBvGPC9UjPe-=_Vq3p6qK7yj5ajT2RgZHtGcFKQjvQzwA@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs
List pgsql-hackers
First, thanks for your thoughts on this, and I am interested in probing them more.

On Mon, Oct 30, 2017 at 9:04 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Sat, Oct 28, 2017 at 4:22 AM, Chris Travers <chris.travers@adjust.com> wrote:
> The Solution:
> The solution is a whitelist of directories specified which are the only ones
> which are synchronised.  The relevant part of this patch is:
>
> +/* List of directories to synchronize:
> + * base data dirs (and ablespaces)
> + * wal/transaction data
> + * and that is it.
> + *
> + * This array is null-terminated to make
> + * it easy to expand
> + */
>
> +const char *rewind_dirs[] = {
> +    "base",
> +    "global",
> +    "pg_commit_ts",
> +    "pg_logical",
> +    "pg_multixact",
> +    "pg_serial",
> +    "pg_subtrans",
> +    "pg_tblspc",
> +    "pg_twophase",
> +    "pg_wal",
> +    "pg_xact",
> +    NULL
> +};

The problem with a white list, which is one reason why I do not favour
it, is in the case where a feature adds in the data folder a new path
for its data, particularly in the case where this is critical for a
base backup. If this folder is not added in this list, then a rewind
may be silently corrupted as well. There are also a set of directories
in this list that we do not care about, pg_serial being one.
pg_subtrans is a second, as it gets zeroed on startup.

The problem with an exclude list is that we cannot safely exclude configuration files or logs (because these could be named anything), right?

Are there any cases right now where you have features added by extensions that write to directories which are required for a rewind?  I am asking because I would like to see if this is the minimum working change or if this change is fundamentally broken currently and must be extended to allow custom paths to be sync'd as well.

And if you think about it, pg_rewind is actually the *same* processing
as a base backup taken using the replication protocol. So instead of
this patch I would recommend using excludeFiles and excludeDirContents
by moving this list to a common header where frontend and backend can
refer to. Note that basebackup.h includes replnodes.h, so my thoughts
is that you should split the current header with something like
basebackup_internal.h which is backend-only, and have pg_rewind fetch
the list of directories to automatically ignore as those ones. You can
also simplify a bit the code of pg_rewind a bit so as things like
postmaster.opts. On top of that, there is visibly no need to tweak the
SQL query fetching the directory list (which is useful for debugging
as shaped now actually), but just change process_source_file so as its
content is not added in the file map.

I am not sure it *should* be the same, however.  In a backup we probably want to backup the postgresql.auto.conf, but on a failover, I don't think we want to clobber configuration.  We certainly do not want to sometimes clobber configuration but not other times (which is what happens right now in some cases).  And under no circumstances do we want to clobber logs on a failed server with logs on a working server.  That's asking for serious problems in my view. 

If you think about it, there's a huge difference in use case in backing up a database cluster (Including replication slots, configs in the dir etc) and re-syncing the data so that replication can resume, and I think there are some dangers that come up when assuming these should be closely tied together.

Then there is a second case where you do not want a set of folders to
be included, but those can be useful by default, an example here being
pg_wal where one might want to have an empty path to begin with. A
third case is a set of folders that you do not care about, but depends
on the deployment, being here "log" for example. For those you could
just add an --exclude-path option which simply piles up paths into a
linked list when called multiple times. This could happen with a
second patch.

Agreed.  And one could add an "--include-path" option to allow for unusual cases where you want extra directories, such as replication slots, or the like.

I think another patch could also specifically empty and perhaps create a replication slot allowing for one to bring tings back up via streaming replication more safely.

> From there we iterate over this array for directory-based approaches in
> copy_fetch.c and with one query per directory in libpq_fetch.c.  This also
> means shifting from the basic interface from PQexec to PQexecParams.  It
> would be possible to move to binary formats too, but this was not done
> currently in order to simplify code review (that could be a separate
> independent patch at a later time).

-    res = PQexec(conn, sql);
+    for (p = 0; rewind_dirs[p] != NULL; ++p)
+    {
+        const char *paths[1];
+        paths[0] = rewind_dirs[p];
+        res = PQexecParams(conn, sql,  1, NULL, paths, NULL, NULL, 0);
Calling multiple times the query to list all directories is messy IMO.
And this is N-costly processing if there are many files to look at,
say many relation files.

We wouldn't recurse into the relation files dir more than once since this is filtered out on the initial query in the recursive CTE before recursion.  In other words, the query filters out the top-level query before it recurses into any directory so there is a smalll cost (planning, the fact that the root of the pgdata is queried once per) but this seems pretty minor given that we need at least one query per file that we sync anyway.

I am open to changing it to an array, but one thing to think about is that if we want to add an ability to add extra directories, this makes it much easier to do that.
 
--
Michael



--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes inVACUUM
Next
From: Andres Freund
Date:
Subject: Re: [HACKERS] An unlikely() experiment